Point Transaction System
This document is written with the assistance of Claude Code (claude-sonnet-4-6). Design decisions and content are driven by the author; Claude Code is used as a writing assistant throughout the drafting process.
Overview
Purpose
An asynchronous, horizontally scalable transaction ledger for a user-facing points/rewards system. Users deposit, spend, and refund points; every change passes through a durable pipeline that survives crashes without double-spending or losing transactions.
Problem to Solve
Points back real money and real actions (AI jobs, unlocks, service calls). A naive synchronous ledger would either block users on long-running work, lose transactions when a worker restarts mid-job, or allow double-spend from concurrent requests. This system provides a durable, crash-safe, async-first ledger with strong consistency on balance and unambiguous outcome reporting.
Key Requirements
- Correctness under concurrency — no double-spend, no negative available balance, no lost deposits
- Crash-safe — no transaction silently dropped on API, worker, or monitor restart
- Async by default — client never blocks on long-running actions; outcomes pushed via SSE when ready
- Horizontally scalable — credit and debit workers scale independently based on queue lag
- Composable actions — multi-step actions (saga) roll back cleanly on partial failure, always refunding reserved points after rollback
Non-Goals
- Multi-currency or fractional points (single integer ledger)
- Cross-account transfers (user-to-user point transfer not in scope)
- Real-time (sub-second) latency guarantees — the design trades latency for durability and correctness
System Architecture
The system is split into two views: the Application Architecture shows the transaction processing flow, and the Infrastructure Architecture shows how services are orchestrated, scaled, and monitored.
Application Architecture
flowchart TD
%% Application Architecture — transaction processing flow
Client
APIGateway["API Gateway"]
APIService["API Service\n(Node.js, scalable)"]
Redis[("Redis Cluster\n(Distributed Lock)")]
Kafka["Kafka\n(partitioned by user_id)"]
Workers["Transaction Workers\n(Node.js, scalable)"]
PostgreSQL[("PostgreSQL Cluster\n(RDBMS)")]
NotificationService["Notification Service\n(Node.js, scalable)"]
DLT["Dead Letter Topic\n(Kafka DLT)"]
TM["Transaction Monitor\n(Outbox + Watchdog)"]
Client --> APIGateway
APIGateway --> APIService
APIService <-->|"distributed lock"| Redis
APIService -->|"1. reserve + outbox (atomic)"| PostgreSQL
Kafka -->|"2. consume job event"| Workers
Workers -->|"3. confirm / refund"| PostgreSQL
Workers -->|"4. publish result event"| Kafka
Workers -->|"failed jobs"| DLT
Kafka -->|"5. consume result event\n(partition ownership)"| NotificationService
NotificationService -->|"6. SSE push"| Client
TM -->|"poll & publish"| Kafka
TM <-->|"scan outbox &\nstuck reservations"| PostgreSQL
Infrastructure Architecture
flowchart TD
%% Infrastructure Architecture — orchestration, scaling, observability
K8s["Kubernetes\n(orchestration)"]
KEDA["KEDA\n(autoscaler)"]
Kafka["Kafka\n(consumer lag)"]
Prometheus["Prometheus\n(metrics store)"]
Grafana["Grafana\n(dashboards & alerts)"]
AppServices["Application Services\n(API · Workers · Notification)"]
K8s -->|"orchestrates pods"| AppServices
KEDA -->|"monitors consumer lag"| Kafka
KEDA -->|"monitors metrics"| Prometheus
KEDA -->|"scales worker replicas"| AppServices
AppServices -->|"exposes metrics"| Prometheus
Kafka -->|"exposes metrics"| Prometheus
Prometheus -->|"data source"| Grafana
Technology Stack
| Technology | Role | Why |
|---|---|---|
| Node.js | API Service, Workers, Notification Service | Non-blocking I/O fits async-heavy workload; single language across services |
| Kafka | Event broker | Durable, partitioned by user_id for ordered per-user processing; scales horizontally |
| PostgreSQL | Point ledger | ACID guarantees and row-level locking ensure balance integrity |
| Redis | Distributed lock | Sub-millisecond lock acquisition with TTL-based auto-expiry prevents reservation leaks |
| Kubernetes | Container orchestration | Rolling deployments, liveness probes, and pod scaling for all services |
| KEDA | Autoscaler | Scales workers based on Kafka consumer lag and Prometheus metrics without manual intervention |
| Prometheus | Metrics collection | Standard scrape-based metrics across all services and Kafka |
| Grafana | Dashboards & alerting | Visualises Prometheus data; alert rules for DLT depth and reservation leaks |
| Transaction Monitor | Outbox polling + Watchdog | Single Node.js service handling event relay and stuck reservation auto-refund |
Transaction Flow
All transactions flow through Kafka. Workers are a single codebase deployed with different WORKER_TYPE environment variables in Kubernetes — credit workers consume deposit and refund events, debit workers consume use-point events. KEDA scales each worker type independently based on their respective topic lag.
Balance Mutation Rule All balance updates use atomic increment/decrement on the DB columns, never a read-then-write. Any operation that can cause
available_balanceto go negative requires a distributed lock. Operations that only increaseavailable_balance(deposit, refund) do not require a lock.
Point Operation Strategy
| Operation | balance |
reserved_balance |
available_balance (derived) |
Requires Lock |
|---|---|---|---|---|
| Deposit confirm | balance + deposit_amount |
unchanged | increases | No |
| Reserve (use starts) | unchanged | reserved_balance + 100 |
decreases | Yes |
| Confirm (use succeeds) | balance - 100 |
reserved_balance - 100 |
unchanged | No |
| Refund (use fails) | unchanged | reserved_balance - 100 |
increases | No |
available_balance is always derived as balance - reserved_balance — never stored independently.
Kafka Topics
| Topic | Consumer | Purpose |
|---|---|---|
transactions.credit |
Credit Workers | Deposit and refund events |
transactions.debit |
Debit Workers | Use-point events |
transactions.result |
Notification Service | Push result to client via SSE |
transactions.dlq |
Ops / alerting | Failed jobs after retries exhausted |
Deposit Flow
sequenceDiagram
participant C as Client
participant A as API Service
participant D as PostgreSQL
participant K as Kafka
participant W as Credit Worker
participant T as Third-Party Service
participant N as Notification Service
C->>A: POST /transactions/deposit
Note over A,D: Single DB transaction (atomic)
A->>D: INSERT transaction (type=deposit, status=pending)
A->>D: INSERT outbox (event_type=deposit, published=false)
A-->>C: 202 Accepted {transaction_id}
Note over D,K: Transaction Monitor polls outbox → publishes to Kafka
K->>W: consume deposit event
W->>T: invoke third-party payment service
alt Payment success
T-->>W: success
W->>D: UPDATE transaction (status=confirmed)
W->>D: UPDATE account SET balance = balance + deposit_amount
W->>K: publish → transactions.result
K->>N: consume result event
N-->>C: SSE push — deposit confirmed
else Payment failed
T-->>W: failure
W->>D: UPDATE transaction (status=failed)
W->>K: publish → transactions.result (failure)
W->>K: publish → transactions.dlq
K->>N: consume result event
N-->>C: SSE push — deposit failed
end
Use Point Flow
sequenceDiagram
participant C as Client
participant A as API Service
participant R as Redis
participant D as PostgreSQL
participant K as Kafka
participant W as Debit Worker
participant N as Notification Service
C->>A: POST /transactions/use
A->>R: acquire lock (user_id)
R-->>A: lock acquired
A->>D: SELECT available_balance FROM account (primary DB)
alt Insufficient balance
A-->>C: 402 Payment Required
A->>R: release lock
else Sufficient balance
Note over A,D: Single DB transaction (atomic)
A->>D: INSERT transaction (type=use, status=reserved)
A->>D: UPDATE account SET reserved_balance = reserved_balance + 100
A->>D: INSERT outbox (event_type=use, published=false)
A->>R: release lock
A-->>C: 202 Accepted {transaction_id}
Note over D,K: Transaction Monitor polls outbox → publishes to Kafka
K->>W: consume use-point event
alt Action success
W->>D: UPDATE transaction (status=confirmed)
W->>D: UPDATE account SET balance = balance - 100, reserved_balance = reserved_balance - 100
W->>K: publish → transactions.result
K->>N: consume result event
N-->>C: SSE push — use confirmed
else Action failed
W->>D: UPDATE transaction (status=failed)
W->>K: publish → transactions.credit (type=refund)
end
end
Refund Flow (via Credit Worker)
When a debit worker action fails, it publishes a refund event to transactions.credit and stops — it does not notify the client. The credit worker completes the refund and sends the single final result notification: "action failed, points refunded". The client always hears the outcome once, only after the full loop is closed.
sequenceDiagram
participant K as Kafka
participant W as Credit Worker
participant D as PostgreSQL
participant N as Notification Service
participant C as Client
K->>W: consume refund event (linked to original use transaction)
W->>D: INSERT transaction (type=refund, ref=original_transaction_id)
W->>D: UPDATE account SET reserved_balance = reserved_balance - 100
W->>K: publish → transactions.result (action failed, points refunded)
K->>N: consume result event
N-->>C: SSE push — action failed, points refunded
Consistency & Error Handling
Idempotency
Every transaction write carries an idempotency_key constructed as:
{transaction_id}:{operation_type}
Examples:
txn-abc123:deposittxn-abc123:usetxn-abc123:refund
A unique index on transactions(idempotency_key) means a duplicate insert is rejected silently — the existing row is returned. Worker retries and duplicate Kafka message deliveries are safe by default.
Outbox Pattern
To prevent reservations from being stuck when the API crashes between the DB write and the Kafka publish, all Kafka events are written to an outbox table in the same DB transaction as the reservation:
BEGIN;
INSERT INTO transactions (type, status, idempotency_key, ...);
UPDATE accounts SET reserved_balance = reserved_balance + 100 WHERE account_id = ?;
INSERT INTO outbox (event_type, payload, published, created_at) VALUES (..., false, NOW());
COMMIT;
The API never publishes to Kafka directly. A Transaction Monitor Service polls the outbox table and publishes unpublished events to Kafka:
Transaction Monitor → SELECT * FROM outbox WHERE published = false
→ publish to Kafka
→ UPDATE outbox SET published = true
This guarantees atomicity: either both the reservation and its Kafka event exist, or neither does.
Scope note: The outbox pattern applies to the API layer only. Workers publish to Kafka directly — their idempotency keys and Kafka's at-least-once delivery guarantee safe redelivery on retry.
Transaction Monitor Service
A single service responsible for both outbox polling and watchdog duties:
Outbox Poller
- Polls
outboxtable every few seconds for unpublished events - Publishes to Kafka and marks rows as
published = true - Handles the crash-between-write-and-publish failure mode
Watchdog
- Scans
transactionstable for reservations withstatus = reservedandcreated_at < NOW() - 10 minutes - If no corresponding job exists in the queue → triggers auto-refund via
transactions.credit - Alerts on DLQ depth > 0
flowchart TD
TM["Transaction Monitor Service"]
Outbox[("outbox table\n(PostgreSQL)")]
Transactions[("transactions table\n(PostgreSQL)")]
Kafka["Kafka"]
DLT["Dead Letter Topic\n(DLT)"]
TM -->|"poll unpublished events"| Outbox
Outbox -->|"unpublished rows"| TM
TM -->|"publish event"| Kafka
TM -->|"scan stuck reservations\n(> 10 min)"| Transactions
Transactions -->|"stuck reservations"| TM
TM -->|"auto-refund event"| Kafka
TM -->|"monitor DLQ depth"| DLT
Graceful Shutdown
When a Kubernetes pod receives SIGTERM, each service follows a drain sequence to prevent reservation leaks:
API Service
- Stop accepting new requests (readiness probe removed)
- Complete all in-flight DB writes and outbox inserts
- Exit cleanly
Workers (Credit & Debit)
- Stop consuming new messages from Kafka (
consumer.pause()) - Finish the currently executing job — confirm or refund before exit
- Exit cleanly — Kafka offset is committed only after job completes
Transaction Monitor Service
- Finish the current outbox poll cycle
- Complete any in-flight Kafka publishes
- Exit cleanly
If a worker is force-killed before finishing a job, the Kafka offset is not committed — the message is redelivered on next startup. Combined with idempotency keys, redelivery is safe. The watchdog 10-minute threshold acts as the final safety net for any reservation that slips through.
Data Models
accounts
| Column | Type | Notes |
|---|---|---|
id |
uuid |
Primary key |
user_id |
uuid |
FK to users table — one account per user |
balance |
bigint |
Confirmed points available to the user |
reserved_balance |
bigint |
Points locked by active reservations |
created_at |
timestamptz |
Account creation timestamp |
updated_at |
timestamptz |
Last modified timestamp |
available_balance is always derived as balance - reserved_balance — never stored independently.
transactions
| Column | Type | Notes |
|---|---|---|
id |
bigserial |
Auto-increment internal ID for ordering |
uuid |
uuid |
Globally unique ID exposed to clients — used for idempotency and duplicate order prevention |
account_id |
uuid |
FK to accounts |
type |
enum |
deposit, use, refund |
amount |
bigint |
Points involved in this transaction — flexible, always positive |
status |
enum |
pending, reserved, confirmed, failed |
idempotency_key |
text |
Unique index — format: {uuid}:{type} |
ref_transaction_id |
uuid |
FK to original use transaction — populated for refund type only |
created_at |
timestamptz |
Immutable insert timestamp |
updated_at |
timestamptz |
Updated on every status transition |
outbox
| Column | Type | Notes |
|---|---|---|
id |
bigserial |
Auto-increment primary key |
event_type |
text |
Matches transaction type — deposit, use, refund |
payload |
jsonb |
Full event payload published to Kafka |
published |
boolean |
false until Transaction Monitor publishes to Kafka |
created_at |
timestamptz |
Insert timestamp — used to detect stuck outbox rows |
Transaction State Machine
pending ──reserve──► reserved ──confirm──► confirmed
│
└──(action fails)──► failed
refund: pending ──confirm──► confirmed
deposit—pending→confirmedorfaileduse—pending→reserved→confirmed(action succeeded) orfailed(action failed)refund—pending→confirmed(refund always succeeds; retry until confirmed)
Indexes
| Table | Index | Purpose |
|---|---|---|
transactions |
UNIQUE (idempotency_key) |
Duplicate write protection |
transactions |
INDEX (account_id, status, created_at) |
Watchdog scan for stuck reservations |
transactions |
INDEX (uuid) |
Client-facing lookup by transaction UUID |
outbox |
INDEX (published, created_at) |
Outbox poller query performance |
Nested Transaction Design
Some actions are composed of multiple sequential steps — each calling an external resource. The system uses the Saga pattern (orchestration variant) to handle partial failures without leaving points permanently locked or external systems in an inconsistent state.
Step Interface
Every external call implements two methods:
interface Step {
execute(): Promise<Result>
rollback(): Promise<void>
}
- Pure functions (no side effects, safe to retry) implement
rollback()as a no-op. - Non-pure functions (write to external systems, send emails, charge payments) implement
rollback()as a compensation that undoes the side effect.
The orchestrator always calls rollback() on failure — it never needs to know whether a step is pure or not.
Step Log
Each step's state is persisted to a task_steps table so the saga can survive a worker crash and be resumed by the watchdog.
| Column | Type | Notes |
|---|---|---|
id |
bigserial |
Primary key |
transaction_id |
uuid |
FK to transactions.uuid of the parent use transaction |
step_index |
int |
Order of execution (0-based) |
status |
enum |
pending, executed, rolled_back, failed |
created_at |
timestamptz |
Step insert timestamp |
updated_at |
timestamptz |
Updated on every status transition |
Saga Orchestration Flow
Debit Worker (Orchestrator)
├── Step 0: execute() → persist status=executed
├── Step 1: execute() → persist status=executed
└── Step 2: execute() → FAILS → persist status=failed
↓
Compensation (reverse order)
├── Step 1: rollback() → persist status=rolled_back
├── Step 0: rollback() → persist status=rolled_back
└── publish refund event → transactions.credit
The point refund is always issued after all rollbacks complete — never before.
Recovery
If a worker crashes mid-saga, the Kafka offset is not committed and the job is redelivered. The worker reads the task_steps log to determine which steps already executed and resumes from the correct position — re-executing pending steps or re-running rollbacks.
Idempotency on re-execution: Before calling
execute()on a step, the worker checkstask_stepsfor an existingstatus=executedrow. If found, the step is skipped — preventing double side-effects on non-pure functions on redelivery.
Note: A dedicated Kafka topic (
transactions.saga-resume) for watchdog-triggered saga recovery is intentionally out of scope here — it extends the topic model described in the Transaction Flow section and warrants its own design when multi-step action complexity grows.
Scalability
Dynamic Scaling with KEDA
Workers scale independently per Kafka topic based on consumer lag. Credit workers and debit workers are separate Kubernetes Deployments — KEDA watches each topic's lag and scales the corresponding worker pool.
transactions.debit— scales debit workers based on debit topic lagtransactions.credit— scales credit workers based on credit topic lag- Prometheus metrics (queue depth, processing rate) feed additional KEDA triggers for finer-grained scaling decisions
Each worker type can scale to zero during idle periods and burst to handle peak load without affecting the other.
Priority Scheduling
Priority is implemented via topic-level separation. Higher-priority jobs are published to a dedicated high-priority topic; KEDA keeps a higher minimum replica count and lower scale threshold for that topic — ensuring high-priority jobs are processed with lower latency.
transactions.debit.high → KEDA: higher min replicas, aggressive scale threshold
transactions.debit.normal → KEDA: lower min replicas, conservative scale threshold
The API routes each job to the correct topic at publish time based on a priority signal.
Open question: The criteria for determining job priority (e.g. user tier, action type, account balance level) is not yet defined and requires further product decision before implementation.
Cost Optimization
Cost decisions are metrics-driven — optimize where Prometheus data shows actual pressure, not speculatively.
| Metric to watch | Threshold signal | Optimization action |
|---|---|---|
| DB read query latency | Consistently high on balance reads | Add PostgreSQL read replica, route SELECT queries there |
| Redis cache hit rate | Low hit rate on balance lookups | Tune Redis TTL, review cache invalidation strategy |
| Worker idle time | Workers consistently idle | Reduce KEDA minimum replica count |
| Queue depth (sustained high) | Lag growing faster than workers clear it | Increase KEDA max replica count or review worker throughput |
| DLQ depth | Non-zero | Investigate failed jobs — recurring failures may indicate a systemic issue worth fixing rather than retrying |
Flow Extensibility via Kafka
Because each processing step is driven by a Kafka event, the flow is loosely coupled by design. Any service can subscribe to a topic and react independently — without the publisher knowing or caring who is listening.
- Adding a step: Deploy a new consumer that subscribes to an existing topic (e.g.
transactions.result). No changes to existing services required. - Removing a step: Stop the consumer. The topic continues flowing; other consumers are unaffected.
- Reordering steps: Introduce an intermediate topic between two stages and wire the new consumer to it. Upstream and downstream services remain unchanged.
New business requirements map directly to new consumers, not to changes in existing code.
Observability
Prometheus scrapes metrics from all services; Grafana dashboards visualize system health.
| Signal | Source | Alert threshold |
|---|---|---|
| Kafka consumer lag | KEDA / Kafka exporter | Lag growing for > 2 min |
| DLQ depth | Kafka exporter | > 0 |
| Transaction Monitor last poll age | App metric | > 2× poll interval |
| API error rate | Node.js metrics | > 1% of requests |
| Reservation age (stuck) | App metric | > watchdog threshold (10 min) |
Problem Analysis
Long-Duration Tasks, Uncertainty, and Task Loss on Restart
Problem: AI task processing is inherently long-running and non-deterministic. A naive synchronous design would block the client for the full duration, and any service restart mid-task would silently lose the work.
How the architecture solves it:
Long-duration tasks — The API returns 202 Accepted immediately after reserving points and writing the outbox row. The worker processes the job asynchronously; the client receives the result via SSE when the worker completes. The client never blocks.
Uncertainty — The transaction state machine provides a clear, unambiguous status at every point in the lifecycle (pending → reserved → confirmed / failed / refunded). There is no state where the outcome is unknown — every terminal state is explicit.
Task loss on restart — Two layers of protection ensure no job is silently dropped:
- Outbox recovery: If the worker never received the job (crash before publish), the Transaction Monitor re-publishes the outbox row on its next poll cycle.
- Kafka redelivery: If the worker crashed mid-processing, the Kafka offset was not committed — the broker redelivers the message on worker restart.
- Idempotency: The idempotency key (
{transaction_id}:{operation_type}) prevents double-processing if the same job is redelivered after a crash.
Common Node.js Pitfalls in Worker Services
Missing await — Silent Failure
A forgotten await on an async call returns a pending Promise that is never observed. The function appears to succeed, no error is thrown, but the operation never completed. In a worker context this means a transaction step silently did nothing — the job is acknowledged, the Kafka offset is committed, and the work is lost.
Detection: TypeScript with @typescript-eslint/no-floating-promises catches unhandled Promises at compile time. Runtime: unexpected status=reserved rows that never progress despite no errors in logs.
Retry Without Persisted State — Infinite Loop Risk
A retry loop that tracks attempt count only in memory will reset on every restart. If the worker crashes and restarts, the counter resets to zero — the job retries indefinitely, never reaching the DLQ.
Mitigation: Retry attempt count must be persisted. Either use Kafka's built-in retry topic mechanism (which tracks delivery count at the broker level) or write the attempt count to the transactions row before each retry. Never rely on in-memory state for retry logic.
Blocked Event Loop — Worker Stalls Silently
Node.js runs on a single thread. A CPU-intensive operation (large JSON parse, synchronous crypto, tight loop) on the main thread blocks the event loop — the worker stops processing Kafka messages without crashing, Kafka consumer lag grows, and KEDA scales up more replicas rather than fixing the root cause.
Detection: Prometheus worker throughput drops to zero while the pod remains healthy. Use --cpu-prof or clinic.js flame graphs to identify the blocking call. Move heavy computation to a worker thread (worker_threads) or an off-process job.
Redis Client — Silent Errors and Missing Logs
Redis client libraries (e.g. ioredis) suppress many errors by default or emit them only on the error event. If no error handler is attached, the process may crash silently or swallow connection failures without any log output. This is especially common with Redis Streams and distributed lock operations.
Common failure patterns:
- Lock acquisition silently fails —
SET NX PXreturnsnull(lock held), but if the caller does not check the return value, the operation proceeds without the lock. Balance mutations run unprotected. - Stream consumer group not found —
XREADGROUPthrows if the consumer group does not exist yet. The error may not surface if the client's error event is unhandled, leaving the worker silently idle. - Connection lost mid-operation — Redis reconnects automatically, but in-flight commands may be dropped without rejection if the client is configured with
enableOfflineQueue: false. The caller sees no error and assumes success.
Mitigation:
- Always attach an
errorhandler to the Redis client instance. - Wrap lock acquisition in an explicit null-check and treat
nullas a lock contention error — log and retry with backoff. - Run
XGROUP CREATE ... MKSTREAMat worker startup to ensure consumer groups exist before consuming. - Enable structured logging on all Redis operation failures with the command name, key, and error message.
Redis Streams — Not Suitable as a Primary Event Bus
Redis Streams can deliver messages and support consumer groups, but they are not a reliable event bus for critical transaction flows.
Key limitations:
- Message loss on eviction — Redis is an in-memory store. If
maxmemoryis reached and the eviction policy is notnoeviction, stream entries can be silently dropped. There is no durability guarantee comparable to Kafka's log-based storage. - No redelivery guarantee on crash — A consumer that crashes after reading but before acknowledging (
XACK) will have its pending messages recovered only ifXAUTOCLAIMis explicitly polled. This must be implemented manually; there is no automatic redelivery. - Operational complexity — Managing consumer group lag, pending entry lists, and dead-letter handling in Redis Streams requires significant custom code with no ecosystem tooling.
Recommendation: Use a purpose-built job queue library such as BullMQ (backed by Redis) for task-level triggering within a single service boundary. BullMQ handles retries, backoff, job state persistence, and failure queues out of the box, and its jobs survive worker restarts reliably.
For cross-service event delivery — which is the primary use case in this system — Kafka remains the correct choice: durable, replayable, partitioned, and designed for exactly this workload.
Rule of thumb: Redis Streams or BullMQ for in-process / single-service job queues. Kafka for anything that crosses a service boundary or requires durability guarantees.
AI Usage Documentation
This document was produced in a live design session using Claude Code (claude-sonnet-4-6) as a design assistant. The human author drove all architectural decisions; Claude served as a sounding board, presented tradeoffs, and acted as the writer.
What AI produced
- All Markdown formatting, Mermaid diagram syntax, and table structure
- First-draft prose for each section based on decisions reached in conversation
- Conflict detection between sections (field name mismatches, logic inconsistencies, diagram/prose divergence)
- Identification of missing acceptance criteria coverage after each section
What the author verified and refined
- Every architectural decision (reservation model, outbox pattern, saga orchestration, SSE fan-out via Kafka partition ownership, KEDA scaling strategy, priority topic split)
- Correctness of all balance mutation rules (
balance +=,reserved_balance +=/-=) and when distributed locks are required - Data model field names, types, and relationships
- Accuracy of the Problem Understanding section against actual failure modes
Revision history
The Gist at https://gist.github.com/DavidHsaiou/718528da3a6acfce0f99dddd32c5d523 contains the full version history of this document, with each revision corresponding to a completed section or significant revision cycle.