Kafka Data Flow in a DePIN Pipeline

In a previous post I covered why I chose Kafka KRaft and how to configure it. This post is about what happens to data once it's in Kafka: how scan events travel from a mobile device through the pipeline to ClickHouse, fraud detection, and real-time alerts.

This is the QuackNet DePIN pipeline. The problem it solves: thousands of mobile devices simultaneously submitting WiFi/cellular/BLE/GPS scan data. The architecture has to handle bursty write patterns, fan-out to multiple consumers, and schema evolution — without a dedicated ops team.

The Full Flow

Figure: QuackNet event pipeline. Scan data fans out to three independent consumer groups.

Four Topics, Four Responsibilities

The pipeline runs four Kafka topics, each with a distinct role:

scan-events (6 partitions) is the entry point. The Go API produces here immediately after validation. Partitioned by device_id so all events from one device land in order on the same partition.

scan-enriched (3 partitions) is the fraud detector's output — scan events annotated with fraud flags, location cluster metadata, and a quality score. Downstream consumers read this topic, not the raw one.

scan-alerts (1 partition) is low-volume: only events that trigger a push notification. The alert service reads this.

scan-dlq (1 partition) is the dead-letter queue. Any consumer that fails to deserialize an event sends it here with the original bytes and the error. Manual review + replay when needed.
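A minimal sketch of what a dead-letter record could look like, assuming a JSON envelope. The post only states that the original bytes and the error are kept; the DLQRecord type, the decodeOrDeadLetter helper, and the topic/partition/offset fields are hypothetical additions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DLQRecord is a hypothetical envelope for scan-dlq messages. Only Original
// and Error come from the post; the source coordinates are an assumption
// that would make manual replay easier.
type DLQRecord struct {
	Original  []byte `json:"original"` // raw message value; encoding/json emits it as base64
	Error     string `json:"error"`
	Topic     string `json:"topic"`
	Partition int32  `json:"partition"`
	Offset    int64  `json:"offset"`
}

// decodeOrDeadLetter tries to decode a message value as JSON. On success it
// returns the decoded event and a nil record; on failure it returns a
// DLQRecord that carries the untouched bytes and the decode error.
func decodeOrDeadLetter(raw []byte, topic string, partition int32, offset int64) (map[string]interface{}, *DLQRecord) {
	var event map[string]interface{}
	if err := json.Unmarshal(raw, &event); err != nil {
		return nil, &DLQRecord{
			Original:  raw,
			Error:     err.Error(),
			Topic:     topic,
			Partition: partition,
			Offset:    offset,
		}
	}
	return event, nil
}

func main() {
	// A truncated payload, as a stand-in for a message that fails to deserialize.
	_, rec := decodeOrDeadLetter([]byte(`{"device_id":`), "scan-events", 3, 1024)
	if rec != nil {
		out, _ := json.Marshal(rec)
		fmt.Println(string(out))
	}
}
```

Keeping the raw bytes rather than a partially decoded event means a replay can feed exactly what the producer sent back into the pipeline once the consumer bug is fixed.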

Partition Strategy

Partitioning by device_id was the key architectural decision. The alternative — partition by geography — would require re-partitioning when devices move, which Kafka doesn't support without a manual migration.

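Six partitions on scan-events means up to 6 parallel consumer instances in any group that reads it, which today is the fraud detector group. (The ClickHouse writer reads scan-enriched, so its 3 partitions cap that group at 3 instances.) At QuackNet's current load that's overkill, but it's trivially cheap at this scale and saves a reconfiguration later.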

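The producer hashes the message key, so the mapping is hash(device_id) % 6 rather than a plain modulo on the ID:

P-0: hash(device_id) % 6 = 0
P-1: hash(device_id) % 6 = 1
P-2: hash(device_id) % 6 = 2
P-3: hash(device_id) % 6 = 3
P-4: hash(device_id) % 6 = 4
P-5: hash(device_id) % 6 = 5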
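The key-to-partition mapping can be sketched with the standard library alone. This assumes 32-bit FNV-1a, which to my knowledge is what Sarama's default hash partitioner uses; the Java client uses murmur2, so the same key can land on a different partition number there. partitionFor and the device IDs are hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor hashes the key with 32-bit FNV-1a and folds the result into
// the range [0, numPartitions). The same key always yields the same
// partition, which is what keeps one device's events in order.
func partitionFor(key string, numPartitions int32) int32 {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	p := int32(h.Sum32()) % numPartitions
	if p < 0 {
		p = -p // the int32 conversion can go negative; fold it back
	}
	return p
}

func main() {
	for _, id := range []string{"device-001", "device-002", "device-003"} {
		fmt.Printf("%s -> P-%d\n", id, partitionFor(id, 6))
	}
}
```

The mapping depends on the partition count, so adding partitions to scan-events later would move existing devices to different partitions and break per-device ordering across the change.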

Three Consumer Groups

Each consumer group reads its topic in full, independently of the others. Adding a new consumer group doesn't affect existing ones, because Kafka tracks offsets per group.

ClickHouse Writer (ch-writer)

Reads scan-enriched. Batches 5,000 events or 2 seconds, whichever comes first, then bulk-inserts into ClickHouse using INSERT INTO scan_events FORMAT JSONEachRow. Offset committed after successful insert. On failure: exponential backoff, no offset commit, retry same batch.
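The size-or-time batching rule above can be sketched as a small loop. This is an illustration rather than the writer's actual code: Event, runBatcher and the flush callback are hypothetical, and it assumes the time window starts when the first event of a batch arrives. In the real writer, flush would run the ClickHouse insert and commit offsets afterwards.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a placeholder for an enriched scan event.
type Event struct{ ScanID string }

// runBatcher buffers events from in and calls flush when maxSize events are
// buffered or maxWait has passed since the first buffered event, whichever
// comes first. It flushes the remainder when in is closed and returns the
// first flush error, leaving retry and backoff to the caller.
func runBatcher(in <-chan Event, maxSize int, maxWait time.Duration, flush func([]Event) error) error {
	batch := make([]Event, 0, maxSize)
	var deadline <-chan time.Time // nil while the batch is empty, so that case never fires

	emit := func() error {
		if len(batch) == 0 {
			return nil
		}
		err := flush(batch)
		batch = make([]Event, 0, maxSize) // fresh slice, in case flush kept the old one
		deadline = nil
		return err
	}

	for {
		select {
		case ev, ok := <-in:
			if !ok {
				return emit() // input closed: flush what is left
			}
			if len(batch) == 0 {
				deadline = time.After(maxWait) // the clock starts at the first event
			}
			batch = append(batch, ev)
			if len(batch) >= maxSize {
				if err := emit(); err != nil {
					return err
				}
			}
		case <-deadline:
			if err := emit(); err != nil {
				return err
			}
		}
	}
}

func main() {
	in := make(chan Event)
	go func() {
		for i := 0; i < 12; i++ {
			in <- Event{ScanID: fmt.Sprintf("scan-%d", i)}
		}
		close(in)
	}()
	err := runBatcher(in, 5, 2*time.Second, func(b []Event) error {
		fmt.Printf("flushing %d events\n", len(b))
		return nil
	})
	fmt.Println("batcher stopped, err:", err)
}
```

With 12 events and a batch size of 5, this flushes batches of 5, 5 and 2. The pipeline itself uses 5,000 events and 2 seconds.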

Fraud Detector (fraud-svc)

Reads scan-events. Runs location anomaly rules (too many locations in 1 min, GPS coordinates not matching cell tower, duplicate scans). Publishes annotated event to scan-enriched. This is the only producer that writes to scan-enriched.
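As an illustration of the first rule (too many locations in 1 minute), here is a sketch that counts distinct location buckets per device inside a time window. The Scan type, the Cell field (imagine a geohash prefix), and the threshold are assumptions; the post does not say how the real rule is implemented.

```go
package main

import (
	"fmt"
	"time"
)

// Scan is a trimmed-down, hypothetical view of a scan event.
type Scan struct {
	DeviceID string
	At       time.Time
	Cell     string // coarse location bucket, e.g. a geohash prefix (assumed)
}

// tooManyLocations reports whether the scans that fall inside the window
// ending at now touch more than maxCells distinct location buckets.
// The caller is expected to pass scans for a single device.
func tooManyLocations(scans []Scan, now time.Time, window time.Duration, maxCells int) bool {
	cutoff := now.Add(-window)
	seen := make(map[string]struct{})
	for _, s := range scans {
		if s.At.Before(cutoff) || s.At.After(now) {
			continue // outside the window
		}
		seen[s.Cell] = struct{}{}
	}
	return len(seen) > maxCells
}

func main() {
	now := time.Now()
	scans := []Scan{
		{DeviceID: "device-001", At: now.Add(-10 * time.Second), Cell: "u33db"},
		{DeviceID: "device-001", At: now.Add(-20 * time.Second), Cell: "u33dc"},
		{DeviceID: "device-001", At: now.Add(-30 * time.Second), Cell: "u281z"},
		{DeviceID: "device-001", At: now.Add(-40 * time.Second), Cell: "gcpvj"},
	}
	fmt.Println("flag as anomaly:", tooManyLocations(scans, now, time.Minute, 3))
}
```

Because scan-events is keyed by device_id, one fraud detector instance sees every event for a given device, so per-device state like this can live in memory without cross-instance coordination.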

Alert Service (alert-svc)

Reads scan-alerts. Sends FCM/APNs push for events above a reward threshold. Idempotent: the push payload includes the scan_id so duplicate notifications are detected client-side.

What the Go Producer Looks Like

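import (
    "context"
    "encoding/json"
    "fmt"

    "github.com/IBM/sarama" // github.com/Shopify/sarama in older releases
)

// Publish serializes the scan and hands it to the async producer's input
// queue. Delivery errors are deliberately not read here: they arrive on
// p.producer.Errors() and are drained by a separate goroutine. Reading them
// in this select could return an error that belongs to a different message.
func (p *ScanProducer) Publish(ctx context.Context, scan *ScanEvent) error {
    payload, err := json.Marshal(scan)
    if err != nil {
        return fmt.Errorf("marshal scan: %w", err)
    }

    msg := &sarama.ProducerMessage{
        Topic: "scan-events",
        Key:   sarama.StringEncoder(scan.DeviceID), // partition by device
        Value: sarama.ByteEncoder(payload),
    }

    select {
    case p.producer.Input() <- msg: // queued; delivery happens in the background
        return nil
    case <-ctx.Done(): // request cancelled or timed out before the message was queued
        return ctx.Err()
    }
}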

The async producer is fire-and-forget from the API's perspective. The API returns 202 Accepted as soon as the message is in the producer's buffer. The error channel is drained in a separate goroutine that logs failures and increments a Prometheus counter.
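The drain goroutine could look roughly like this. To keep the sketch runnable with only the standard library, ProduceError stands in for Sarama's ProducerError (where the topic lives on the wrapped message) and a plain atomic counter stands in for the Prometheus counter; both names are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync/atomic"
)

// ProduceError is a stand-in for the error type delivered on the async
// producer's error channel.
type ProduceError struct {
	Topic string
	Err   error
}

// drainErrors logs and counts every delivery failure until errs is closed.
// It must keep running for the life of the producer: if nobody reads the
// error channel, an async producer can eventually block.
func drainErrors(errs <-chan *ProduceError, failures *int64) {
	for e := range errs {
		atomic.AddInt64(failures, 1) // stand-in for a Prometheus counter increment
		log.Printf("kafka produce failed: topic=%s err=%v", e.Topic, e.Err)
	}
}

func main() {
	errs := make(chan *ProduceError, 1)
	var failures int64

	done := make(chan struct{})
	go func() {
		drainErrors(errs, &failures)
		close(done)
	}()

	errs <- &ProduceError{Topic: "scan-events", Err: errors.New("broker not available")}
	close(errs) // in the real service this happens when the producer shuts down
	<-done

	fmt.Println("failures:", atomic.LoadInt64(&failures))
}
```

The trade-off of the 202 Accepted design is visible here: a failure shows up in logs and metrics, not in the HTTP response the device already received.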

The Numbers

API p99 produce latency: ~2 ms

ClickHouse batch size: 5,000 events

Topic retention: 7 days

scan-events partitions: 6

One Gotcha: Consumer Group Rebalancing

During development I'd restart the fraud detector while a load test was running. Every restart triggered a consumer group rebalance, which paused scan-enriched production for 3–10 seconds. The ClickHouse writer would catch up, but the gap in the dashboard made it look like an outage.

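The fix: set session.timeout.ms=45000 and heartbeat.interval.ms=15000 for the fraud detector group. Restarts during tests no longer trigger rebalances until the session actually times out. Note that this only holds when the restarted process rejoins as the same group member, which requires static membership (group.instance.id); a consumer that rejoins with a new member ID triggers a rebalance on join regardless of the session timeout. In production I'd add a proper graceful shutdown that commits offsets before closing, but for dev this works.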
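The two values are related: the Kafka consumer documentation recommends keeping heartbeat.interval.ms at no more than one third of session.timeout.ms, and 15000 is exactly a third of 45000. A small check like the following (checkGroupTimeouts is a hypothetical helper, not part of any client library) can catch a mismatched pair at startup.

```go
package main

import (
	"fmt"
	"time"
)

// checkGroupTimeouts applies the usual rule of thumb for consumer groups:
// the heartbeat interval should be at most one third of the session timeout,
// so that a few heartbeats can be missed before the member is evicted.
func checkGroupTimeouts(session, heartbeat time.Duration) error {
	if session <= 0 || heartbeat <= 0 {
		return fmt.Errorf("timeouts must be positive, got session=%v heartbeat=%v", session, heartbeat)
	}
	if heartbeat*3 > session {
		return fmt.Errorf("heartbeat %v is more than a third of session timeout %v", heartbeat, session)
	}
	return nil
}

func main() {
	// The values from the post: 45 s session timeout, 15 s heartbeat interval.
	if err := checkGroupTimeouts(45*time.Second, 15*time.Second); err != nil {
		fmt.Println("bad consumer group config:", err)
		return
	}
	fmt.Println("consumer group timeouts look consistent")
}
```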

Schema Evolution

I'm serializing with plain JSON (no Avro, no Protobuf). For a solo project where all consumers are mine, this is fine. The trade-off: no schema registry, no backwards-compatibility enforcement.

When I add a new field (e.g., signal_strength to BLE scans), I add it as nullable in all consumers before the producer starts sending it. Deploy consumers first, then the producer. Rollback is safe because old consumers just ignore unknown fields.
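Here is a sketch of why this ordering is safe with Go's encoding/json. The struct shapes are hypothetical; only the scan_id and signal_strength field names come from the pipeline. A pointer field keeps "absent" distinguishable from a real zero, and the decoder skips fields the struct doesn't declare.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BLEScanV1 is a consumer's struct from before the new field existed.
type BLEScanV1 struct {
	ScanID string `json:"scan_id"`
}

// BLEScanV2 adds signal_strength as a nullable field. A nil pointer means
// the producer did not send it.
type BLEScanV2 struct {
	ScanID         string `json:"scan_id"`
	SignalStrength *int   `json:"signal_strength,omitempty"`
}

// decodeV1 decodes with the old struct; unknown fields are silently ignored.
func decodeV1(raw []byte) (BLEScanV1, error) {
	var s BLEScanV1
	err := json.Unmarshal(raw, &s)
	return s, err
}

// decodeV2 decodes with the new struct; a missing field stays nil.
func decodeV2(raw []byte) (BLEScanV2, error) {
	var s BLEScanV2
	err := json.Unmarshal(raw, &s)
	return s, err
}

func main() {
	oldMsg := []byte(`{"scan_id":"s1"}`)
	newMsg := []byte(`{"scan_id":"s2","signal_strength":-67}`)

	// New consumer, old producer: the field is simply absent.
	a, _ := decodeV2(oldMsg)
	fmt.Println(a.ScanID, a.SignalStrength == nil)

	// New consumer, new producer.
	b, _ := decodeV2(newMsg)
	fmt.Println(b.ScanID, *b.SignalStrength)

	// Old consumer after a rollback, new producer: the extra field is ignored.
	c, err := decodeV1(newMsg)
	fmt.Println(c.ScanID, err)
}
```

This tolerance is a default of encoding/json. A decoder configured with DisallowUnknownFields would reject the new payload, so the rollback guarantee depends on not enabling that option.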

If I ever need Avro or Protobuf for inter-org data sharing, the wire format can change without touching the partition scheme or consumer group logic.

What I'd Do Differently

The fraud detector producing to scan-enriched creates a dependency chain: ClickHouse writer only gets enriched data. If the fraud detector is down, ClickHouse falls behind. I considered having the ClickHouse writer consume raw events directly and join fraud annotations at query time — but the ClickHouse schema would have to handle nulls everywhere. For now the chain is acceptable.

The dead-letter queue is the other thing I'd revisit. Right now it's a Kafka topic I check manually. A better design would auto-replay DLQ events after a configurable delay, with a maximum retry count tracked in the message headers.
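The retry-count bookkeeping for such a design might look like this. It is a sketch of an idea, not something the pipeline does today: Header stands in for a Kafka record header, and the x-retry-count header name, retryCount and nextAttempt are all hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// Header stands in for a Kafka record header (a key/value pair of bytes).
type Header struct {
	Key   []byte
	Value []byte
}

// retryHeader is a hypothetical header name for the retry counter.
const retryHeader = "x-retry-count"

// retryCount reads the counter from the headers. A missing or malformed
// value counts as zero retries so far.
func retryCount(headers []Header) int {
	for _, h := range headers {
		if string(h.Key) != retryHeader {
			continue
		}
		if n, err := strconv.Atoi(string(h.Value)); err == nil && n >= 0 {
			return n
		}
	}
	return 0
}

// nextAttempt returns a copy of the headers with the counter incremented.
// It returns ok=false once maxRetries has been reached, meaning the message
// should stay in the DLQ for manual review.
func nextAttempt(headers []Header, maxRetries int) (out []Header, ok bool) {
	n := retryCount(headers)
	if n >= maxRetries {
		return headers, false
	}
	out = make([]Header, 0, len(headers)+1)
	for _, h := range headers {
		if string(h.Key) != retryHeader {
			out = append(out, h) // keep unrelated headers untouched
		}
	}
	out = append(out, Header{Key: []byte(retryHeader), Value: []byte(strconv.Itoa(n + 1))})
	return out, true
}

func main() {
	var headers []Header
	for {
		next, ok := nextAttempt(headers, 3)
		if !ok {
			fmt.Println("giving up after", retryCount(headers), "retries")
			return
		}
		headers = next
		fmt.Println("replaying, attempt", retryCount(headers))
	}
}
```

Carrying the counter in headers keeps the message value byte-for-byte identical to what the producer sent, which matters when the failure was a deserialization problem in the first place.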

But both of these are nice-to-haves. The pipeline has been running stably under real load. KRaft single-node Kafka, three consumer groups, no ZooKeeper, no ops team. That's the point.
