In a previous post I covered why I chose Kafka KRaft and how to configure it. This post is about what happens to data once it's in Kafka: how scan events travel from a mobile device through the pipeline to ClickHouse, fraud detection, and real-time alerts.

This is the QuackNet DePIN pipeline. The problem it solves: thousands of mobile devices simultaneously submitting WiFi/cellular/BLE/GPS scan data. The architecture has to handle bursty write patterns, fan-out to multiple consumers, and schema evolution — without a dedicated ops team.

The Full Flow

Flutter App WiFi / BLE GPS / Cell HTTPS Go API validate enrich produce Kafka KRaft scan-events (6p) scan-enriched (3p) scan-alerts (1p) scan-dlq (1p) retention: 7d ClickHouse batch insert group: ch-writer Fraud Detector location rules group: fraud-svc Push Alerts FCM / APNs group: alert-svc scan-dlq: deserialization failures → manual review flags → scan-enriched

QuackNet event pipeline: scan data fans out to three independent consumer groups.

Three Topics, Three Responsibilities

The pipeline runs four Kafka topics, each with a distinct role:

scan-events (6 partitions) is the entry point. The Go API produces here immediately after validation. Partitioned by device_id so all events from one device land in order on the same partition.

scan-enriched (3 partitions) is the fraud detector's output — scan events annotated with fraud flags, location cluster metadata, and a quality score. Downstream consumers read this topic, not the raw one.

scan-alerts (1 partition) is low-volume: only events that trigger a push notification. The alert service reads this.

scan-dlq (1 partition) is the dead-letter queue. Any consumer that fails to deserialize an event sends it here with the original bytes and the error. Manual review + replay when needed.

Partition Strategy

Partitioning by device_id was the key architectural decision. The alternative — partition by geography — would require re-partitioning when devices move, which Kafka doesn't support without a manual migration.

Six partitions on scan-events means up to 6 parallel consumer instances in the ClickHouse writer group. At QuackNet's current load that's overkill, but it's trivially cheap at this scale and saves a reconfiguration later.

P-0
device_id % 6 = 0
P-1
device_id % 6 = 1
P-2
device_id % 6 = 2
P-3
device_id % 6 = 3
P-4
device_id % 6 = 4
P-5
device_id % 6 = 5

Three Consumer Groups

Each consumer group reads the full topic independently. Adding a new consumer group doesn't affect existing ones — Kafka doesn't care.

ClickHouse Writer ch-writer

Reads scan-enriched. Batches 5,000 events or 2 seconds, whichever comes first, then bulk-inserts into ClickHouse using INSERT INTO scan_events FORMAT JSONEachRow. Offset committed after successful insert. On failure: exponential backoff, no offset commit, retry same batch.

Fraud Detector fraud-svc

Reads scan-events. Runs location anomaly rules (too many locations in 1 min, GPS coordinates not matching cell tower, duplicate scans). Publishes annotated event to scan-enriched. This is the only producer that writes to scan-enriched.

Alert Service alert-svc

Reads scan-alerts. Sends FCM/APNs push for events above a reward threshold. Idempotent: the push payload includes the scan_id so duplicate notifications are detected client-side.

What the Go Producer Looks Like

func (p *ScanProducer) Publish(ctx context.Context, scan *ScanEvent) error { payload, err := json.Marshal(scan) if err != nil { return fmt.Errorf("marshal scan: %w", err) } msg := &sarama.ProducerMessage{ Topic: "scan-events", Key: sarama.StringEncoder(scan.DeviceID), // partition by device Value: sarama.ByteEncoder(payload), } select { case p.producer.Input() <- msg: return nil case err := <-p.producer.Errors(): return fmt.Errorf("kafka produce: %w", err.Err) case <-ctx.Done(): return ctx.Err() } }

The async producer is fire-and-forget from the API's perspective. The API returns 202 Accepted as soon as the message is in the producer's buffer. The error channel is drained in a separate goroutine that logs failures and increments a Prometheus counter.

The Numbers

~2ms API p99 produce latency
5k ClickHouse batch size
7d topic retention
6 scan-events partitions

One Gotcha: Consumer Group Rebalancing

During development I'd restart the fraud detector while a load test was running. Every restart triggered a consumer group rebalance, which paused scan-enriched production for 3–10 seconds. The ClickHouse writer would catch up, but the gap in the dashboard made it look like an outage.

The fix: set session.timeout.ms=45000 and heartbeat.interval.ms=15000 for the fraud detector group. Restarts during tests no longer trigger rebalances until the session actually times out. In production I'd add a proper graceful shutdown that commits offsets before closing — but for dev, this works.

Schema Evolution

I'm serializing with plain JSON (no Avro, no Protobuf). For a solo project where all consumers are mine, this is fine. The trade-off: no schema registry, no backwards-compatibility enforcement.

When I add a new field (e.g., signal_strength to BLE scans), I add it as nullable in all consumers before the producer starts sending it. Deploy consumers first, then the producer. Rollback is safe because old consumers just ignore unknown fields.

If I ever need Avro or Protobuf for inter-org data sharing, the wire format can change without touching the partition scheme or consumer group logic.

What I'd Do Differently

The fraud detector producing to scan-enriched creates a dependency chain: ClickHouse writer only gets enriched data. If the fraud detector is down, ClickHouse falls behind. I considered having the ClickHouse writer consume raw events directly and join fraud annotations at query time — but the ClickHouse schema would have to handle nulls everywhere. For now the chain is acceptable.

The dead-letter queue is the other thing I'd revisit. Right now it's a Kafka topic I check manually. A better design would auto-replay DLQ events after a configurable delay, with a maximum retry count tracked in the message headers.

But both of these are nice to have. The pipeline has been running stable under real load. KRaft single-node Kafka, three consumer groups, no ZooKeeper, no ops team. That's the point.

← Setup: Kafka KRaft for Solo Projects