production-grade · learnable in one repo

Distributed logging,
made legible.

TraceHub is a Kafka-shaped logging platform packed into a single monolith you can read end-to-end. Sequence-based ingest, ACK/replay protocol, Redis-backed queue, and a real-time dashboard — wired so the failure modes are visible, not abstract.

Node 20 · Express · Socket.io · PostgreSQL 16 · Redis 7 · Next.js 14 · Docker Compose
stack
monorepo · 3 apps
setup
compose up
license
MIT · open
system architecture read end-to-end
auth-svc (seq 1042), payment-svc (seq 8821), notif-svc (seq 3309) → Backend API (POST /ingest, ack: partial, 3 missing) → Redis (queue:logs, queue:retry, producer:*, ack:*) → LogWorker (batch of 50) → PostgreSQL (logs, ack_state) → Dashboard (Next.js, WS live)
core protocol

Sequence in. ACK out.
Replay when reality lies.

Every log carries a monotonically increasing seq per service. The backend computes the highest contiguous seq it has, sends the producer an ACK with the exact missing set, and queues those gaps as replay requests. No magic — just three contracts you can read in an afternoon.
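One plausible reading of the ACK computation, matching the example response shown later in this section (ackTill is the highest seq received, missing lists every gap below it). A sketch only: the names and shapes here are assumptions, not TraceHub's actual source.

```typescript
type Ack = { ackTill: number; missing: number[]; status: "ok" | "partial" };

// Given the seqs received for a service and the previous ack high-water mark,
// compute the new ackTill and the exact set of missing seqs below it.
function computeAck(received: number[], prevAckTill: number): Ack {
  const seen = new Set(received);
  const ackTill = Math.max(prevAckTill, ...received);
  const missing: number[] = [];
  for (let s = prevAckTill + 1; s <= ackTill; s++) {
    if (!seen.has(s)) missing.push(s); // a gap the producer must replay
  }
  return { ackTill, missing, status: missing.length ? "partial" : "ok" };
}
```

Feeding it a batch with two dropped seqs reproduces the partial ACK shape from the example below.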

01 · INGEST

Producer batches by seq

Each service buffers logs in a Redis sorted set keyed by sequence, then flushes batches via POST /ingest.

producer:payment batch · 100 logs
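A sketch of the producer side. TraceHub keeps this buffer in a Redis sorted set scored by seq; a Map stands in here, and the class and method names are illustrative assumptions, not the repo's code.

```typescript
type LogEntry = { seq: number; level: string; message: string };

class ProducerBuffer {
  private pending = new Map<number, LogEntry>();

  add(entry: LogEntry): void {
    this.pending.set(entry.seq, entry);
  }

  // Drain the lowest-seq entries into a POST /ingest body. Entries stay in
  // the buffer until an ACK confirms them, so unacked logs survive for replay.
  nextBatch(service: string, batchSize: number) {
    const batch = [...this.pending.values()]
      .sort((a, b) => a.seq - b.seq)
      .slice(0, batchSize);
    return { service, batch };
  }
}
```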
02 · ACK

Backend replies with truth

Returns the highest seq it has seen and lists every gap below it. The producer trims its buffer up to ackTill, keeping only the missing seqs for replay.

{ "ackTill": 1099, "missing": [1088, 1092], "status": "partial" }
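One way to reconcile "trim up to ackTill" with the replay step that follows: drop everything at or below ackTill except the seqs the backend reported missing, which must stay buffered for the replay worker. An illustrative assumption, not the repo's actual code.

```typescript
// Trim a producer buffer (seq -> log entry) after an ACK arrives.
function trimOnAck(
  buffer: Map<number, unknown>,
  ackTill: number,
  missing: number[],
): void {
  const keep = new Set(missing);
  for (const seq of buffer.keys()) {
    // Acked and not missing: the backend has it, so the copy can go.
    if (seq <= ackTill && !keep.has(seq)) buffer.delete(seq);
  }
}
```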
03 · REPLAY

Worker fills the gaps

Missing seqs land in replay_requests. The worker pulls them from the producer buffer, dedupes, and inserts.

replay_requests
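The replay path can be sketched as a pure resolution step: look each missing seq up in the producer buffer, skip seqs already stored (the UNIQUE (service, seq) constraint makes inserts idempotent in any case), and return the rows to insert. Names here are illustrative, not the repo's code.

```typescript
type Row = { service: string; seq: number; message: string };

function resolveReplays(
  missing: number[],
  producerBuffer: Map<number, Row>,
  alreadyStored: Set<number>,
): Row[] {
  const rows: Row[] = [];
  for (const seq of missing) {
    const row = producerBuffer.get(seq);
    if (row && !alreadyStored.has(seq)) rows.push(row); // dedupe before insert
  }
  return rows;
}
```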
POST /ingest · request
{
  "service": "payment-service",
  "batch": [
    { "seq": 1097, "level": "info",
      "message": "charge.created" },
    { "seq": 1099, "level": "error",
      "message": "gateway.timeout" }
  ]
}
200 OK · ack response
{
  "batchId":  "ack_1705312200",
  "ackTill":  1099,
  "missing":  [1088, 1092],
  "status":   "partial",
  "replay":   { "queued": 2 }
}
// producer trims up to seq=1099
// gaps queued in replay_requests
dashboard · 5 pages

Five views, one source of truth.

The Next.js dashboard subscribes to a single Socket.io channel. Each page renders one slice of the same broadcast: queue, replays, services, raw stream, system overview.

Overview · Live Logs · Replay · Queue · Services

socket connected
EPS 28.4 ↑ streaming
Queue 142 (queue:logs)
ACK Rate 98.1% ↑ healthy
Replays 3 pending
Live log stream
10:24:01  INFO   JWT issued
10:24:01  ERROR  card declined
10:24:02  WARN   bounce rate ↑
10:24:03  INFO   refund processed
Fault injection

Crash Worker · 5s recovery
Drop Network · 10s failure
Delay ACK · 3s latency
Reset All · clear state

/ · OVERVIEW

System metrics

EPS chart, service status, queue depth, ACK p99 — the screen you keep open while breaking things.

/live-logs

Real-time stream

Every log:new event renders as a row. Filter by service, level, or search text.

/replay

Replay center

Pending + completed replays, manual trigger by service · seq range, live replay:completed events.

/queue

Queue monitor

Depth chart for queue:logs and queue:retry, worker liveness, sim events.

/services

Per-service detail

EPS, error rate, ACK state per producer. Drill in when one service starts gapping.

FAULT INJECTION

Break it on purpose

POST /metrics/control exposes five controls: crash worker, drop network, delay ACK, flush queue, reset.

step-by-step · from clone to ack

Six steps. No hidden magic.

Every command is one you can paste. Every directory is one you can read. Follow top-to-bottom and you'll have producers writing to Postgres through the queue in under two minutes.

01
prerequisites

Install Docker & Node 20

Everything else — Postgres, Redis — is provisioned by compose. Verify once:

~ shell
docker --version    # >= 24
node --version      # v20.x
pnpm --version      # >= 8
02
clone

Pull the monolith

One repo, three apps under /apps: backend, dashboard, producer-simulator.

~ shell
git clone https://github.com/CodeWithZezo/tracehub
cd tracehub
03
start

One command to rule them all

Docker Compose boots postgres, redis, backend, dashboard, and the three producer-simulators simultaneously.

~ shell
docker compose up --build

# or use the helper
chmod +x start.sh && ./start.sh up
04
verify

Check the health endpoint

The backend health probe pings both Postgres and Redis. If both are "up", you're live.

~ shell
curl http://localhost:3001/health

# =>
{ "status": "ok", "postgres": "up", "redis": "up" }
05
send a batch

POST your first ingest

You don't need the simulator. Send your own batch and read the ACK shape directly.

~ shell
curl -X POST http://localhost:3001/ingest \
  -H "content-type: application/json" \
  -d '{
    "service": "payment-service",
    "batch": [
      { "seq": 1, "level": "info",  "message": "hello" },
      { "seq": 2, "level": "error", "message": "oops" }
    ]
  }'
06
observe

Open the dashboard

Logs land within seconds of the queue becoming healthy. The producer-simulator starts automatically.

~ browser
open http://localhost:3000

# pages:
#   /          ▸ system overview
#   /live-logs ▸ real-time stream
#   /queue     ▸ depth + worker
#   /replay    ▸ manual replay
optional · local dev without docker

Run apps natively with pnpm

Keep Postgres + Redis in containers, run the three apps natively for hot reload.

~ shell
# 1. just the dependencies
docker compose up postgres redis -d

# 2. backend
cd apps/backend     && pnpm install && pnpm dev

# 3. producer (new terminal)
cd apps/producer-simulator && pnpm install && pnpm dev

# 4. dashboard (new terminal)
cd apps/dashboard   && pnpm install && pnpm dev
api surface

Eight routes. One websocket.

No SDK, no codegen. Curl-able from day one.

verb   path              purpose
POST   /ingest           Receive a log batch, return ACK with ackTill + missing seqs.
GET    /logs             Query logs by service, level, search text, time range.
GET    /logs/replay      List pending and completed replay requests.
POST   /logs/replay      Manually trigger a replay for a service / seq range.
GET    /metrics          Snapshot of EPS, queue depth, error rate, ACK p99.
GET    /metrics/queue    Queue depth + worker liveness + retry stats.
POST   /metrics/control  Fault-injection: crash, delay, drop, flush, reset.
GET    /health           Liveness probe — pings Postgres + Redis.
websocket events → client
socket.io
metrics:snapshot   // SystemMetrics, every 2s
log:new            // individual LogEntry
ack:sent           // { service, ackTill, missing[] }
replay:completed   // { service, count, latencyMs }
worker:crashed     // auto-recovers in 5s
worker:recovered
sim:control        // fault-injection broadcast
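The payload shapes above, written as TypeScript types a dashboard page could subscribe against. Only the fields shown in the event list are from the source; the dispatcher and any names beyond them are assumptions.

```typescript
interface AckSent { service: string; ackTill: number; missing: number[] }
interface ReplayCompleted { service: string; count: number; latencyMs: number }

// Event name -> payload type, so each handler only sees its own payload.
type ServerEvents = {
  "ack:sent": AckSent;
  "replay:completed": ReplayCompleted;
};

function dispatch<K extends keyof ServerEvents>(
  handlers: { [E in keyof ServerEvents]?: (p: ServerEvents[E]) => void },
  event: K,
  payload: ServerEvents[K],
): void {
  handlers[event]?.(payload);
}
```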
postgres schema
init.sql
CREATE TABLE logs (
  id          BIGSERIAL PRIMARY KEY,
  service     TEXT NOT NULL,
  seq         BIGINT NOT NULL,
  level       TEXT,
  message     TEXT,
  request_id  TEXT,
  ts          TIMESTAMPTZ,
  UNIQUE (service, seq)
);
CREATE TABLE ack_state    ( ... );
CREATE TABLE replay_requests ( ... );
open source · MIT

The whole stack fits in one tab.
Read it. Break it. Extract it.

Each route is structured to peel off into its own service when you're ready — ingest, worker, query, metrics, replay all share the same Postgres schema and Redis keyspace.

tracehub/
├── apps/
│   ├── backend/ // express + socket.io
│   ├── dashboard/ // next.js 14
│   └── producer-simulator/ // fake EC2 svcs
├── postgres/init.sql
├── redis/redis.conf
├── shared/src/index.ts
└── docker-compose.yml