In a previous post I covered why I chose Kafka KRaft and how to configure it. This post is about what happens to data once it's in Kafka: how scan events travel from a mobile device through the pipeline to ClickHouse, fraud detection, and real-time alerts.
This is the QuackNet DePIN pipeline. The problem it solves: thousands of mobile devices simultaneously submitting WiFi/cellular/BLE/GPS scan data. The architecture has to handle bursty write patterns, fan-out to multiple consumers, and schema evolution — without a dedicated ops team.
The Full Flow
QuackNet event pipeline: scan data fans out to three independent consumer groups.
Three Topics, Three Responsibilities
The pipeline runs four Kafka topics, each with a distinct role:
scan-events (6 partitions) is the entry point. The Go API produces here immediately after validation. Partitioned by device_id so all events from one device land in order on the same partition.
scan-enriched (3 partitions) is the fraud detector's output — scan events annotated with fraud flags, location cluster metadata, and a quality score. Downstream consumers read this topic, not the raw one.
scan-alerts (1 partition) is low-volume: only events that trigger a push notification. The alert service reads this.
scan-dlq (1 partition) is the dead-letter queue. Any consumer that fails to deserialize an event sends it here with the original bytes and the error. Manual review + replay when needed.
Partition Strategy
Partitioning by device_id was the key architectural decision. The alternative — partition by geography — would require re-partitioning when devices move, which Kafka doesn't support without a manual migration.
Six partitions on scan-events means up to 6 parallel consumer instances in the ClickHouse writer group. At QuackNet's current load that's overkill, but it's trivially cheap at this scale and saves a reconfiguration later.
Three Consumer Groups
Each consumer group reads the full topic independently. Adding a new consumer group doesn't affect existing ones — Kafka doesn't care.
ClickHouse Writer ch-writer
Reads scan-enriched. Batches 5,000 events or 2 seconds, whichever comes first, then bulk-inserts into ClickHouse using INSERT INTO scan_events FORMAT JSONEachRow. Offset committed after successful insert. On failure: exponential backoff, no offset commit, retry same batch.
Fraud Detector fraud-svc
Reads scan-events. Runs location anomaly rules (too many locations in 1 min, GPS coordinates not matching cell tower, duplicate scans). Publishes annotated event to scan-enriched. This is the only producer that writes to scan-enriched.
Alert Service alert-svc
Reads scan-alerts. Sends FCM/APNs push for events above a reward threshold. Idempotent: the push payload includes the scan_id so duplicate notifications are detected client-side.
What the Go Producer Looks Like
func (p *ScanProducer) Publish(ctx context.Context, scan *ScanEvent) error {
payload, err := json.Marshal(scan)
if err != nil {
return fmt.Errorf("marshal scan: %w", err)
}
msg := &sarama.ProducerMessage{
Topic: "scan-events",
Key: sarama.StringEncoder(scan.DeviceID), // partition by device
Value: sarama.ByteEncoder(payload),
}
select {
case p.producer.Input() <- msg:
return nil
case err := <-p.producer.Errors():
return fmt.Errorf("kafka produce: %w", err.Err)
case <-ctx.Done():
return ctx.Err()
}
}
The async producer is fire-and-forget from the API's perspective. The API returns 202 Accepted as soon as the message is in the producer's buffer. The error channel is drained in a separate goroutine that logs failures and increments a Prometheus counter.
The Numbers
One Gotcha: Consumer Group Rebalancing
During development I'd restart the fraud detector while a load test was running. Every restart triggered a consumer group rebalance, which paused scan-enriched production for 3–10 seconds. The ClickHouse writer would catch up, but the gap in the dashboard made it look like an outage.
The fix: set session.timeout.ms=45000 and heartbeat.interval.ms=15000 for the fraud detector group. Restarts during tests no longer trigger rebalances until the session actually times out. In production I'd add a proper graceful shutdown that commits offsets before closing — but for dev, this works.
Schema Evolution
I'm serializing with plain JSON (no Avro, no Protobuf). For a solo project where all consumers are mine, this is fine. The trade-off: no schema registry, no backwards-compatibility enforcement.
When I add a new field (e.g., signal_strength to BLE scans), I add it as nullable in all consumers before the producer starts sending it. Deploy consumers first, then the producer. Rollback is safe because old consumers just ignore unknown fields.
If I ever need Avro or Protobuf for inter-org data sharing, the wire format can change without touching the partition scheme or consumer group logic.
What I'd Do Differently
The fraud detector producing to scan-enriched creates a dependency chain: ClickHouse writer only gets enriched data. If the fraud detector is down, ClickHouse falls behind. I considered having the ClickHouse writer consume raw events directly and join fraud annotations at query time — but the ClickHouse schema would have to handle nulls everywhere. For now the chain is acceptable.
The dead-letter queue is the other thing I'd revisit. Right now it's a Kafka topic I check manually. A better design would auto-replay DLQ events after a configurable delay, with a maximum retry count tracked in the message headers.
But both of these are nice to have. The pipeline has been running stable under real load. KRaft single-node Kafka, three consumer groups, no ZooKeeper, no ops team. That's the point.