A2A idempotency policy

Covenant A2A is a durable, explicitly leased queue. It does not automatically redeliver leased work after restart. Operators can repair stale leases explicitly. A retry scan, disabled by default, requeues only stale tasks that are marked idempotent and carry a non-empty idempotency key.

Terms

  • Attempt. One lease and execution of a task.
  • Duplicate execution. A task is executed more than once (for example, the receiver crashes after performing work but before posting a result).
  • Retry. Requeueing a task for another attempt without changing the task id.

Policy goals

  1. Make duplicate-work risk explicit and machine-checkable.
  2. Prevent silent duplicate side effects when automation requeues.
  3. Keep retries visible via attempt counters and audit rows.

Task metadata

Tasks may carry explicit idempotency metadata in the A2ATask envelope. The daemon validates that a present key is non-empty, persists the metadata, and returns it through queue/status surfaces.

Idempotency class

  • idempotent: executing the task multiple times with the same task id is safe. Any side effects must be keyed or conditional such that duplicates do not create new external effects.
  • unsafe: duplicate execution may cause additional external effects. The system must not automatically requeue these tasks.

Manual repair still uses operator_accepted as an explicit human posture. The automated retry gate only accepts task metadata marked idempotent.

Idempotency key

The idempotency key is a stable, caller-chosen key for the logical work unit. For tasks that call external systems that support explicit keys, senders should provide the same key so receivers can forward it consistently.
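The metadata rules above can be sketched as a small validated type. This is a minimal illustration, not the daemon's implementation: the names IdempotencyClass, IdempotencyMetadata, and retry_eligible are hypothetical, as is the choice to treat a whitespace-only key as empty.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class IdempotencyClass(str, Enum):
    IDEMPOTENT = "idempotent"
    UNSAFE = "unsafe"


@dataclass(frozen=True)
class IdempotencyMetadata:
    # Hypothetical field names; the real A2ATask envelope may differ.
    idempotency_class: IdempotencyClass
    key: Optional[str] = None

    def __post_init__(self) -> None:
        # Documented rule: a key that is present must be non-empty.
        # Treating whitespace-only keys as empty is an assumption.
        if self.key is not None and not self.key.strip():
            raise ValueError("idempotency key must be non-empty when present")

    def retry_eligible(self) -> bool:
        # The automated gate needs both the idempotent class and a key.
        return (
            self.idempotency_class is IdempotencyClass.IDEMPOTENT
            and bool(self.key)
        )
```

A task marked unsafe, or an idempotent task without a key, validates and persists normally but is never eligible for automated retry.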

Receiver-side result cache

When an idempotent task posts a result, the mailbox stores a cached payload keyed by sender, recipient, current task kind, and idempotency key. A later task with the same cache key receives a replayed result immediately instead of being leased again.

JSONL-backed mailboxes persist cache entries in the event log. Task compaction removes resolved task history but keeps cache entries, so future duplicates can still short-circuit after restart.
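The cache lookup can be pictured with an in-memory stand-in. This sketch only illustrates the four-part key and the replay short-circuit; CacheKey, ResultCache, store, and replay are hypothetical names, persistence to the event log is omitted, and keeping the first posted payload on a repeated store is an assumption.

```python
from typing import Any, Dict, NamedTuple, Optional


class CacheKey(NamedTuple):
    sender: str
    recipient: str
    task_kind: str
    idempotency_key: str


class ResultCache:
    """In-memory stand-in for the mailbox result cache (hypothetical API)."""

    def __init__(self) -> None:
        self._entries: Dict[CacheKey, Any] = {}

    def store(self, key: CacheKey, payload: Any) -> None:
        # Assumption: the first posted result wins, so replays stay stable.
        self._entries.setdefault(key, payload)

    def replay(self, key: CacheKey) -> Optional[Any]:
        # A hit means the duplicate task can be answered without a new lease.
        return self._entries.get(key)
```

Changing any one of the four key components, for example the task kind, produces a cache miss, so the task is leased normally.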

Explicit retry gate

covenant a2a retry-stale is disabled by default: it only reports what it would do unless the operator passes --enable. The gate follows these rules:

  1. Never synthesize a new task id. Retries requeue the same task id and increment the attempt counter on the next lease.
  2. Retry only tasks marked idempotent.
  3. Skip tasks without a non-empty idempotency key.
  4. Make retry decisions observable via auto_requeue audit rows and skipped-task report entries.
  5. Bound retry behavior with explicit maximum attempts, maximum requeues, minimum lease age, and scan limits.
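The rules above can be sketched as a pure decision function plus a reporting scan. This is an illustration under stated assumptions, not the covenant implementation: StaleTask, RetryLimits, retry_decision, scan, and every reason string except auto_requeue are hypothetical, and the comparison directions for the bounds are guesses. The sketch only builds a report; it does not mutate a queue.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StaleTask:
    task_id: str
    idempotency_class: str
    idempotency_key: Optional[str]
    attempts: int
    requeues: int
    lease_age_ms: int


@dataclass
class RetryLimits:
    max_attempts: int = 3
    max_requeues: int = 1
    min_lease_age_ms: int = 300_000


def retry_decision(task: StaleTask, limits: RetryLimits) -> Tuple[bool, str]:
    """Return (eligible, reason) for one stale task."""
    if task.idempotency_class != "idempotent":
        return False, "not_idempotent"
    if not task.idempotency_key:
        return False, "missing_idempotency_key"
    if task.lease_age_ms < limits.min_lease_age_ms:
        return False, "lease_too_young"
    if task.attempts >= limits.max_attempts:
        return False, "max_attempts_reached"
    if task.requeues >= limits.max_requeues:
        return False, "max_requeues_reached"
    return True, "auto_requeue"


def scan(tasks: List[StaleTask], limits: RetryLimits,
         enable: bool = False, scan_limit: int = 100) -> List[Tuple[str, str, str]]:
    """Report per-task outcomes; the task id is never changed."""
    report = []
    for task in tasks[:scan_limit]:
        ok, reason = retry_decision(task, limits)
        if not ok:
            action = "skipped"
        else:
            action = "requeued" if enable else "would_requeue"
        report.append((task.task_id, action, reason))
    return report
```

Keeping the decision separate from the mutation makes the dry-run and enabled paths share one set of checks, which is the property the report-by-default behavior relies on.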

Periodic scheduler

The daemon can run the same retry gate on a timer through an explicit environment opt-in. It does not bypass the a2a.repair.requeue capability gate.

COVENANT_A2A_AUTO_RETRY_SCHEDULER=1
COVENANT_A2A_AUTO_RETRY_INTERVAL_MS=60000
COVENANT_A2A_AUTO_RETRY_MIN_LEASE_AGE_MS=300000
COVENANT_A2A_AUTO_RETRY_MAX_ATTEMPTS=3
COVENANT_A2A_AUTO_RETRY_MAX_REQUEUES=1
COVENANT_A2A_AUTO_RETRY_SCAN_LIMIT=100

Every scheduler pass records an a2a_auto_retry_scheduler_scan audit summary. Actual mutations still produce per-task auto_requeue repair rows.
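As an illustration of the opt-in, the variables above could be read as follows. This is a sketch, not the daemon's parser: SchedulerConfig and load_scheduler_config are hypothetical names, using the values listed above as fallbacks is an assumption, and so is treating only the exact value 1 as enabled.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "COVENANT_A2A_AUTO_RETRY_"


@dataclass(frozen=True)
class SchedulerConfig:
    interval_ms: int
    min_lease_age_ms: int
    max_attempts: int
    max_requeues: int
    scan_limit: int


def load_scheduler_config(env: Mapping[str, str]) -> Optional[SchedulerConfig]:
    # Explicit opt-in: without SCHEDULER=1 the scheduler stays off.
    if env.get(PREFIX + "SCHEDULER") != "1":
        return None

    def read(name: str, fallback: int) -> int:
        value = int(env.get(PREFIX + name, fallback))
        if value <= 0:
            raise ValueError(f"{PREFIX}{name} must be a positive integer")
        return value

    # Fallbacks mirror the example values in this document (assumption).
    return SchedulerConfig(
        interval_ms=read("INTERVAL_MS", 60000),
        min_lease_age_ms=read("MIN_LEASE_AGE_MS", 300000),
        max_attempts=read("MAX_ATTEMPTS", 3),
        max_requeues=read("MAX_REQUEUES", 1),
        scan_limit=read("SCAN_LIMIT", 100),
    )
```

Calling load_scheduler_config(os.environ) after importing os would read the process environment; passing a plain dict keeps the function easy to test.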

Receiver obligations

Receivers may claim idempotent only when:

  • persistent writes are conditional on the task id (or explicit idempotency key) so replays do not create new records;
  • external calls that support idempotency keys receive the key consistently across retries;
  • results are safe to post multiple times (posting the same result twice must not corrupt mailbox state).

If any step cannot be made idempotent, classify the task as unsafe. Requeueing such a task then requires manual repair under the operator_accepted posture.
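The first obligation, conditional persistent writes, can be sketched with SQLite from the Python standard library. The table name effects, its columns, and the helpers open_store and apply_once are hypothetical; a real receiver would use its own store and schema.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    # In-memory database for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE effects ("
        "idempotency_key TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    return conn


def apply_once(conn: sqlite3.Connection, idempotency_key: str, payload: str) -> bool:
    """Return True if this call created the record, False on a replay."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO effects (idempotency_key, payload) "
            "VALUES (?, ?)",
            (idempotency_key, payload),
        )
    # rowcount is 0 when the primary key already existed and the row was ignored.
    return cur.rowcount == 1
```

Because the primary key enforces uniqueness, a replayed attempt with the same key leaves the stored record unchanged instead of creating a second one.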

Relationship to manual repair

Manual lease repair already requires an explicit duplicate-risk posture (idempotent vs operator_accepted). The retry gate is effectively a daemon-initiated requeue, so it must use task metadata and must never bypass this classification.

Follow-up work

  • Add an explicit typed task-kind field for cache scoping.
  • Add periodic retry scheduling that reuses the existing retry gate.

Related