Point Transaction System

This document is written with the assistance of Claude Code (claude-sonnet-4-6). Design decisions and content are driven by the author; Claude Code is used as a writing assistant throughout the drafting process.

Overview

Purpose

An asynchronous, horizontally scalable transaction ledger for a user-facing points/rewards system. Users deposit, spend, and refund points; every change passes through a durable pipeline that survives crashes without double-spending or losing transactions.

Problem to Solve

Points back real money and real actions (AI jobs, unlocks, service calls). A naive synchronous ledger would either block users on long-running work, lose transactions when a worker restarts mid-job, or allow double-spend from concurrent requests. This system provides a durable, crash-safe, async-first ledger with strong consistency on balance and unambiguous outcome reporting.

Key Requirements

  • Correctness under concurrency — no double-spend, no negative available balance, no lost deposits
  • Crash-safe — no transaction silently dropped on API, worker, or monitor restart
  • Async by default — client never blocks on long-running actions; outcomes pushed via SSE when ready
  • Horizontally scalable — credit and debit workers scale independently based on queue lag
  • Composable actions — multi-step actions (saga) roll back cleanly on partial failure, always refunding reserved points after rollback

Non-Goals

  • Multi-currency or fractional points (single integer ledger)
  • Cross-account transfers (user-to-user point transfer not in scope)
  • Real-time (sub-second) latency guarantees — the design trades latency for durability and correctness

System Architecture

The system is split into two views: the Application Architecture shows the transaction processing flow, and the Infrastructure Architecture shows how services are orchestrated, scaled, and monitored.

Application Architecture

flowchart TD
    %% Application Architecture — transaction processing flow

    Client
    APIGateway["API Gateway"]
    APIService["API Service\n(Node.js, scalable)"]
    Redis[("Redis Cluster\n(Distributed Lock)")]
    Kafka["Kafka\n(partitioned by user_id)"]
    Workers["Transaction Workers\n(Node.js, scalable)"]
    PostgreSQL[("PostgreSQL Cluster\n(RDBMS)")]
    NotificationService["Notification Service\n(Node.js, scalable)"]
    DLT["Dead Letter Topic\n(Kafka DLT)"]
    TM["Transaction Monitor\n(Outbox + Watchdog)"]

    Client --> APIGateway
    APIGateway --> APIService
    APIService <-->|"distributed lock"| Redis
    APIService -->|"1. reserve + outbox (atomic)"| PostgreSQL
    Kafka -->|"2. consume job event"| Workers
    Workers -->|"3. confirm / refund"| PostgreSQL
    Workers -->|"4. publish result event"| Kafka
    Workers -->|"failed jobs"| DLT
    Kafka -->|"5. consume result event\n(partition ownership)"| NotificationService
    NotificationService -->|"6. SSE push"| Client
    TM -->|"poll & publish"| Kafka
    TM <-->|"scan outbox &\nstuck reservations"| PostgreSQL

Infrastructure Architecture

flowchart TD
    %% Infrastructure Architecture — orchestration, scaling, observability

    K8s["Kubernetes\n(orchestration)"]
    KEDA["KEDA\n(autoscaler)"]
    Kafka["Kafka\n(consumer lag)"]
    Prometheus["Prometheus\n(metrics store)"]
    Grafana["Grafana\n(dashboards & alerts)"]
    AppServices["Application Services\n(API · Workers · Notification)"]

    K8s -->|"orchestrates pods"| AppServices
    KEDA -->|"monitors consumer lag"| Kafka
    KEDA -->|"monitors metrics"| Prometheus
    KEDA -->|"scales worker replicas"| AppServices
    AppServices -->|"exposes metrics"| Prometheus
    Kafka -->|"exposes metrics"| Prometheus
    Prometheus -->|"data source"| Grafana

Technology Stack

Technology Role Why
Node.js API Service, Workers, Notification Service Non-blocking I/O fits async-heavy workload; single language across services
Kafka Event broker Durable, partitioned by user_id for ordered per-user processing; scales horizontally
PostgreSQL Point ledger ACID guarantees and row-level locking ensure balance integrity
Redis Distributed lock Sub-millisecond lock acquisition with TTL-based auto-expiry prevents reservation leaks
Kubernetes Container orchestration Rolling deployments, liveness probes, and pod scaling for all services
KEDA Autoscaler Scales workers based on Kafka consumer lag and Prometheus metrics without manual intervention
Prometheus Metrics collection Standard scrape-based metrics across all services and Kafka
Grafana Dashboards & alerting Visualises Prometheus data; alert rules for DLT depth and reservation leaks
Transaction Monitor Outbox polling + Watchdog Single Node.js service handling event relay and stuck reservation auto-refund

Transaction Flow

All transactions flow through Kafka. Workers are a single codebase deployed with different WORKER_TYPE environment variables in Kubernetes — credit workers consume deposit and refund events, debit workers consume use-point events. KEDA scales each worker type independently based on their respective topic lag.

Balance Mutation Rule All balance updates use atomic increment/decrement on the DB columns, never a read-then-write. Any operation that can cause available_balance to go negative requires a distributed lock. Operations that only increase available_balance (deposit, refund) do not require a lock.

Point Operation Strategy

Operation balance reserved_balance available_balance (derived) Requires Lock
Deposit confirm balance + deposit_amount unchanged increases No
Reserve (use starts) unchanged reserved_balance + 100 decreases Yes
Confirm (use succeeds) balance - 100 reserved_balance - 100 unchanged No
Refund (use fails) unchanged reserved_balance - 100 increases No

available_balance is always derived as balance - reserved_balance — never stored independently.

Kafka Topics

Topic Consumer Purpose
transactions.credit Credit Workers Deposit and refund events
transactions.debit Debit Workers Use-point events
transactions.result Notification Service Push result to client via SSE
transactions.dlq Ops / alerting Failed jobs after retries exhausted

Deposit Flow

sequenceDiagram
    participant C as Client
    participant A as API Service
    participant D as PostgreSQL
    participant K as Kafka
    participant W as Credit Worker
    participant T as Third-Party Service
    participant N as Notification Service

    C->>A: POST /transactions/deposit
    Note over A,D: Single DB transaction (atomic)
    A->>D: INSERT transaction (type=deposit, status=pending)
    A->>D: INSERT outbox (event_type=deposit, published=false)
    A-->>C: 202 Accepted {transaction_id}

    Note over D,K: Transaction Monitor polls outbox → publishes to Kafka
    K->>W: consume deposit event

    W->>T: invoke third-party payment service
    alt Payment success
        T-->>W: success
        W->>D: UPDATE transaction (status=confirmed)
        W->>D: UPDATE account SET balance = balance + deposit_amount
        W->>K: publish → transactions.result
        K->>N: consume result event
        N-->>C: SSE push — deposit confirmed
    else Payment failed
        T-->>W: failure
        W->>D: UPDATE transaction (status=failed)
        W->>K: publish → transactions.result (failure)
        W->>K: publish → transactions.dlq
        K->>N: consume result event
        N-->>C: SSE push — deposit failed
    end

Use Point Flow

sequenceDiagram
    participant C as Client
    participant A as API Service
    participant R as Redis
    participant D as PostgreSQL
    participant K as Kafka
    participant W as Debit Worker
    participant N as Notification Service

    C->>A: POST /transactions/use
    A->>R: acquire lock (user_id)
    R-->>A: lock acquired

    A->>D: SELECT available_balance FROM account (primary DB)
    alt Insufficient balance
        A-->>C: 402 Payment Required
        A->>R: release lock
    else Sufficient balance
        Note over A,D: Single DB transaction (atomic)
        A->>D: INSERT transaction (type=use, status=reserved)
        A->>D: UPDATE account SET reserved_balance = reserved_balance + 100
        A->>D: INSERT outbox (event_type=use, published=false)
        A->>R: release lock
        A-->>C: 202 Accepted {transaction_id}

        Note over D,K: Transaction Monitor polls outbox → publishes to Kafka

        K->>W: consume use-point event
        alt Action success
            W->>D: UPDATE transaction (status=confirmed)
            W->>D: UPDATE account SET balance = balance - 100, reserved_balance = reserved_balance - 100
            W->>K: publish → transactions.result
            K->>N: consume result event
            N-->>C: SSE push — use confirmed
        else Action failed
            W->>D: UPDATE transaction (status=failed)
            W->>K: publish → transactions.credit (type=refund)
        end
    end

Refund Flow (via Credit Worker)

When a debit worker action fails, it publishes a refund event to transactions.credit and stops — it does not notify the client. The credit worker completes the refund and sends the single final result notification: "action failed, points refunded". The client always hears the outcome once, only after the full loop is closed.

sequenceDiagram
    participant K as Kafka
    participant W as Credit Worker
    participant D as PostgreSQL
    participant N as Notification Service
    participant C as Client

    K->>W: consume refund event (linked to original use transaction)
    W->>D: INSERT transaction (type=refund, ref=original_transaction_id)
    W->>D: UPDATE account SET reserved_balance = reserved_balance - 100
    W->>K: publish → transactions.result (action failed, points refunded)
    K->>N: consume result event
    N-->>C: SSE push — action failed, points refunded

Consistency & Error Handling

Idempotency

Every transaction write carries an idempotency_key constructed as:

{transaction_id}:{operation_type}

Examples:

  • txn-abc123:deposit
  • txn-abc123:use
  • txn-abc123:refund

A unique index on transactions(idempotency_key) means a duplicate insert is rejected silently — the existing row is returned. Worker retries and duplicate Kafka message deliveries are safe by default.

Outbox Pattern

To prevent reservations from being stuck when the API crashes between the DB write and the Kafka publish, all Kafka events are written to an outbox table in the same DB transaction as the reservation:

BEGIN;
  INSERT INTO transactions (type, status, idempotency_key, ...);
  UPDATE accounts SET reserved_balance = reserved_balance + 100 WHERE account_id = ?;
  INSERT INTO outbox (event_type, payload, published, created_at) VALUES (..., false, NOW());
COMMIT;

The API never publishes to Kafka directly. A Transaction Monitor Service polls the outbox table and publishes unpublished events to Kafka:

Transaction Monitor → SELECT * FROM outbox WHERE published = false
                    → publish to Kafka
                    → UPDATE outbox SET published = true

This guarantees atomicity: either both the reservation and its Kafka event exist, or neither does.

Scope note: The outbox pattern applies to the API layer only. Workers publish to Kafka directly — their idempotency keys and Kafka's at-least-once delivery guarantee safe redelivery on retry.

Transaction Monitor Service

A single service responsible for both outbox polling and watchdog duties:

Outbox Poller

  • Polls outbox table every few seconds for unpublished events
  • Publishes to Kafka and marks rows as published = true
  • Handles the crash-between-write-and-publish failure mode

Watchdog

  • Scans transactions table for reservations with status = reserved and created_at < NOW() - 10 minutes
  • If no corresponding job exists in the queue → triggers auto-refund via transactions.credit
  • Alerts on DLQ depth > 0
flowchart TD
    TM["Transaction Monitor Service"]
    Outbox[("outbox table\n(PostgreSQL)")]
    Transactions[("transactions table\n(PostgreSQL)")]
    Kafka["Kafka"]
    DLT["Dead Letter Topic\n(DLT)"]

    TM -->|"poll unpublished events"| Outbox
    Outbox -->|"unpublished rows"| TM
    TM -->|"publish event"| Kafka
    TM -->|"scan stuck reservations\n(> 10 min)"| Transactions
    Transactions -->|"stuck reservations"| TM
    TM -->|"auto-refund event"| Kafka
    TM -->|"monitor DLQ depth"| DLT

Graceful Shutdown

When a Kubernetes pod receives SIGTERM, each service follows a drain sequence to prevent reservation leaks:

API Service

  1. Stop accepting new requests (readiness probe removed)
  2. Complete all in-flight DB writes and outbox inserts
  3. Exit cleanly

Workers (Credit & Debit)

  1. Stop consuming new messages from Kafka (consumer.pause())
  2. Finish the currently executing job — confirm or refund before exit
  3. Exit cleanly — Kafka offset is committed only after job completes

Transaction Monitor Service

  1. Finish the current outbox poll cycle
  2. Complete any in-flight Kafka publishes
  3. Exit cleanly

If a worker is force-killed before finishing a job, the Kafka offset is not committed — the message is redelivered on next startup. Combined with idempotency keys, redelivery is safe. The watchdog 10-minute threshold acts as the final safety net for any reservation that slips through.

Data Models

accounts

Column Type Notes
id uuid Primary key
user_id uuid FK to users table — one account per user
balance bigint Confirmed points available to the user
reserved_balance bigint Points locked by active reservations
created_at timestamptz Account creation timestamp
updated_at timestamptz Last modified timestamp

available_balance is always derived as balance - reserved_balance — never stored independently.

transactions

Column Type Notes
id bigserial Auto-increment internal ID for ordering
uuid uuid Globally unique ID exposed to clients — used for idempotency and duplicate order prevention
account_id uuid FK to accounts
type enum deposit, use, refund
amount bigint Points involved in this transaction — flexible, always positive
status enum pending, reserved, confirmed, failed
idempotency_key text Unique index — format: {uuid}:{type}
ref_transaction_id uuid FK to original use transaction — populated for refund type only
created_at timestamptz Immutable insert timestamp
updated_at timestamptz Updated on every status transition

outbox

Column Type Notes
id bigserial Auto-increment primary key
event_type text Matches transaction type — deposit, use, refund
payload jsonb Full event payload published to Kafka
published boolean false until Transaction Monitor publishes to Kafka
created_at timestamptz Insert timestamp — used to detect stuck outbox rows

Transaction State Machine

pending ──reserve──► reserved ──confirm──► confirmed
                         │
                         └──(action fails)──► failed

refund:  pending ──confirm──► confirmed
  • depositpendingconfirmed or failed
  • usependingreservedconfirmed (action succeeded) or failed (action failed)
  • refundpendingconfirmed (refund always succeeds; retry until confirmed)

Indexes

Table Index Purpose
transactions UNIQUE (idempotency_key) Duplicate write protection
transactions INDEX (account_id, status, created_at) Watchdog scan for stuck reservations
transactions INDEX (uuid) Client-facing lookup by transaction UUID
outbox INDEX (published, created_at) Outbox poller query performance

Nested Transaction Design

Some actions are composed of multiple sequential steps — each calling an external resource. The system uses the Saga pattern (orchestration variant) to handle partial failures without leaving points permanently locked or external systems in an inconsistent state.

Step Interface

Every external call implements two methods:

interface Step {
  execute(): Promise<Result>
  rollback(): Promise<void>
}
  • Pure functions (no side effects, safe to retry) implement rollback() as a no-op.
  • Non-pure functions (write to external systems, send emails, charge payments) implement rollback() as a compensation that undoes the side effect.

The orchestrator always calls rollback() on failure — it never needs to know whether a step is pure or not.

Step Log

Each step's state is persisted to a task_steps table so the saga can survive a worker crash and be resumed by the watchdog.

Column Type Notes
id bigserial Primary key
transaction_id uuid FK to transactions.uuid of the parent use transaction
step_index int Order of execution (0-based)
status enum pending, executed, rolled_back, failed
created_at timestamptz Step insert timestamp
updated_at timestamptz Updated on every status transition

Saga Orchestration Flow

Debit Worker (Orchestrator)
  ├── Step 0: execute() → persist status=executed
  ├── Step 1: execute() → persist status=executed
  └── Step 2: execute() → FAILS → persist status=failed
                               ↓
                    Compensation (reverse order)
                    ├── Step 1: rollback() → persist status=rolled_back
                    ├── Step 0: rollback() → persist status=rolled_back
                    └── publish refund event → transactions.credit

The point refund is always issued after all rollbacks complete — never before.

Recovery

If a worker crashes mid-saga, the Kafka offset is not committed and the job is redelivered. The worker reads the task_steps log to determine which steps already executed and resumes from the correct position — re-executing pending steps or re-running rollbacks.

Idempotency on re-execution: Before calling execute() on a step, the worker checks task_steps for an existing status=executed row. If found, the step is skipped — preventing double side-effects on non-pure functions on redelivery.

Note: A dedicated Kafka topic (transactions.saga-resume) for watchdog-triggered saga recovery is intentionally out of scope here — it extends the topic model described in the Transaction Flow section and warrants its own design when multi-step action complexity grows.

Scalability

Dynamic Scaling with KEDA

Workers scale independently per Kafka topic based on consumer lag. Credit workers and debit workers are separate Kubernetes Deployments — KEDA watches each topic's lag and scales the corresponding worker pool.

  • transactions.debit — scales debit workers based on debit topic lag
  • transactions.credit — scales credit workers based on credit topic lag
  • Prometheus metrics (queue depth, processing rate) feed additional KEDA triggers for finer-grained scaling decisions

Each worker type can scale to zero during idle periods and burst to handle peak load without affecting the other.

Priority Scheduling

Priority is implemented via topic-level separation. Higher-priority jobs are published to a dedicated high-priority topic; KEDA keeps a higher minimum replica count and lower scale threshold for that topic — ensuring high-priority jobs are processed with lower latency.

transactions.debit.high   → KEDA: higher min replicas, aggressive scale threshold
transactions.debit.normal → KEDA: lower min replicas, conservative scale threshold

The API routes each job to the correct topic at publish time based on a priority signal.

Open question: The criteria for determining job priority (e.g. user tier, action type, account balance level) is not yet defined and requires further product decision before implementation.

Cost Optimization

Cost decisions are metrics-driven — optimize where Prometheus data shows actual pressure, not speculatively.

Metric to watch Threshold signal Optimization action
DB read query latency Consistently high on balance reads Add PostgreSQL read replica, route SELECT queries there
Redis cache hit rate Low hit rate on balance lookups Tune Redis TTL, review cache invalidation strategy
Worker idle time Workers consistently idle Reduce KEDA minimum replica count
Queue depth (sustained high) Lag growing faster than workers clear it Increase KEDA max replica count or review worker throughput
DLQ depth Non-zero Investigate failed jobs — recurring failures may indicate a systemic issue worth fixing rather than retrying

Flow Extensibility via Kafka

Because each processing step is driven by a Kafka event, the flow is loosely coupled by design. Any service can subscribe to a topic and react independently — without the publisher knowing or caring who is listening.

  • Adding a step: Deploy a new consumer that subscribes to an existing topic (e.g. transactions.result). No changes to existing services required.
  • Removing a step: Stop the consumer. The topic continues flowing; other consumers are unaffected.
  • Reordering steps: Introduce an intermediate topic between two stages and wire the new consumer to it. Upstream and downstream services remain unchanged.

New business requirements map directly to new consumers, not to changes in existing code.

Observability

Prometheus scrapes metrics from all services; Grafana dashboards visualize system health.

Signal Source Alert threshold
Kafka consumer lag KEDA / Kafka exporter Lag growing for > 2 min
DLQ depth Kafka exporter > 0
Transaction Monitor last poll age App metric > 2× poll interval
API error rate Node.js metrics > 1% of requests
Reservation age (stuck) App metric > watchdog threshold (10 min)

Problem Analysis

Long-Duration Tasks, Uncertainty, and Task Loss on Restart

Problem: AI task processing is inherently long-running and non-deterministic. A naive synchronous design would block the client for the full duration, and any service restart mid-task would silently lose the work.

How the architecture solves it:

Long-duration tasks — The API returns 202 Accepted immediately after reserving points and writing the outbox row. The worker processes the job asynchronously; the client receives the result via SSE when the worker completes. The client never blocks.

Uncertainty — The transaction state machine provides a clear, unambiguous status at every point in the lifecycle (pending → reserved → confirmed / failed / refunded). There is no state where the outcome is unknown — every terminal state is explicit.

Task loss on restart — Two layers of protection ensure no job is silently dropped:

  • Outbox recovery: If the worker never received the job (crash before publish), the Transaction Monitor re-publishes the outbox row on its next poll cycle.
  • Kafka redelivery: If the worker crashed mid-processing, the Kafka offset was not committed — the broker redelivers the message on worker restart.
  • Idempotency: The idempotency key ({transaction_id}:{operation_type}) prevents double-processing if the same job is redelivered after a crash.

Common Node.js Pitfalls in Worker Services

Missing await — Silent Failure

A forgotten await on an async call returns a pending Promise that is never observed. The function appears to succeed, no error is thrown, but the operation never completed. In a worker context this means a transaction step silently did nothing — the job is acknowledged, the Kafka offset is committed, and the work is lost.

Detection: TypeScript with @typescript-eslint/no-floating-promises catches unhandled Promises at compile time. Runtime: unexpected status=reserved rows that never progress despite no errors in logs.


Retry Without Persisted State — Infinite Loop Risk

A retry loop that tracks attempt count only in memory will reset on every restart. If the worker crashes and restarts, the counter resets to zero — the job retries indefinitely, never reaching the DLQ.

Mitigation: Retry attempt count must be persisted. Either use Kafka's built-in retry topic mechanism (which tracks delivery count at the broker level) or write the attempt count to the transactions row before each retry. Never rely on in-memory state for retry logic.


Blocked Event Loop — Worker Stalls Silently

Node.js runs on a single thread. A CPU-intensive operation (large JSON parse, synchronous crypto, tight loop) on the main thread blocks the event loop — the worker stops processing Kafka messages without crashing, Kafka consumer lag grows, and KEDA scales up more replicas rather than fixing the root cause.

Detection: Prometheus worker throughput drops to zero while the pod remains healthy. Use --cpu-prof or clinic.js flame graphs to identify the blocking call. Move heavy computation to a worker thread (worker_threads) or an off-process job.

Redis Client — Silent Errors and Missing Logs

Redis client libraries (e.g. ioredis) suppress many errors by default or emit them only on the error event. If no error handler is attached, the process may crash silently or swallow connection failures without any log output. This is especially common with Redis Streams and distributed lock operations.

Common failure patterns:

  • Lock acquisition silently failsSET NX PX returns null (lock held), but if the caller does not check the return value, the operation proceeds without the lock. Balance mutations run unprotected.
  • Stream consumer group not foundXREADGROUP throws if the consumer group does not exist yet. The error may not surface if the client's error event is unhandled, leaving the worker silently idle.
  • Connection lost mid-operation — Redis reconnects automatically, but in-flight commands may be dropped without rejection if the client is configured with enableOfflineQueue: false. The caller sees no error and assumes success.

Mitigation:

  • Always attach an error handler to the Redis client instance.
  • Wrap lock acquisition in an explicit null-check and treat null as a lock contention error — log and retry with backoff.
  • Run XGROUP CREATE ... MKSTREAM at worker startup to ensure consumer groups exist before consuming.
  • Enable structured logging on all Redis operation failures with the command name, key, and error message.

Redis Streams — Not Suitable as a Primary Event Bus

Redis Streams can deliver messages and support consumer groups, but they are not a reliable event bus for critical transaction flows.

Key limitations:

  • Message loss on eviction — Redis is an in-memory store. If maxmemory is reached and the eviction policy is not noeviction, stream entries can be silently dropped. There is no durability guarantee comparable to Kafka's log-based storage.
  • No redelivery guarantee on crash — A consumer that crashes after reading but before acknowledging (XACK) will have its pending messages recovered only if XAUTOCLAIM is explicitly polled. This must be implemented manually; there is no automatic redelivery.
  • Operational complexity — Managing consumer group lag, pending entry lists, and dead-letter handling in Redis Streams requires significant custom code with no ecosystem tooling.

Recommendation: Use a purpose-built job queue library such as BullMQ (backed by Redis) for task-level triggering within a single service boundary. BullMQ handles retries, backoff, job state persistence, and failure queues out of the box, and its jobs survive worker restarts reliably.

For cross-service event delivery — which is the primary use case in this system — Kafka remains the correct choice: durable, replayable, partitioned, and designed for exactly this workload.

Rule of thumb: Redis Streams or BullMQ for in-process / single-service job queues. Kafka for anything that crosses a service boundary or requires durability guarantees.

AI Usage Documentation

This document was produced in a live design session using Claude Code (claude-sonnet-4-6) as a design assistant. The human author drove all architectural decisions; Claude served as a sounding board, presented tradeoffs, and acted as the writer.

What AI produced

  • All Markdown formatting, Mermaid diagram syntax, and table structure
  • First-draft prose for each section based on decisions reached in conversation
  • Conflict detection between sections (field name mismatches, logic inconsistencies, diagram/prose divergence)
  • Identification of missing acceptance criteria coverage after each section

What the author verified and refined

  • Every architectural decision (reservation model, outbox pattern, saga orchestration, SSE fan-out via Kafka partition ownership, KEDA scaling strategy, priority topic split)
  • Correctness of all balance mutation rules (balance +=, reserved_balance +=/-=) and when distributed locks are required
  • Data model field names, types, and relationships
  • Accuracy of the Problem Understanding section against actual failure modes

Revision history

The Gist at https://gist.github.com/DavidHsaiou/718528da3a6acfce0f99dddd32c5d523 contains the full version history of this document, with each revision corresponding to a completed section or significant revision cycle.