Streaming Real-Time TikTok Data Into Kafka Pipelines

By TikLiveAPI Team · Published on May 29, 2026

Why pipe TikTok data through Kafka

If you only have one downstream consumer for TikTok data, you can hit the TikLiveAPI REST endpoints directly from your worker process and call it a day. The moment you have two consumers - say, a Postgres OLTP store for your product UI and a ClickHouse warehouse for analytics - that direct-call model collapses. You either duplicate the polling cost (paying twice in API credits for the same data), or you build a synchronous fan-out service that becomes a single point of failure.

Kafka solves three specific problems for TikTok ingestion:

Decoupled producers and consumers. Your polling workers do not care who reads the data downstream. Add a new sink tomorrow, replay the topic from offset zero, and you are done.
Replayable history. Retain tiktok.posts.events for 30 days and you can rebuild a derived table from scratch without re-paying for any API calls.
Fan-out to multiple sinks. The same post event can land in Postgres for low-latency lookups, ClickHouse for time-series aggregation, Elasticsearch for full-text search, and S3 for cold archival - all in parallel, each with its own consumer group and lag tolerance.

This post walks through a production-shaped pipeline: TikLiveAPI polling producers, Kafka topics keyed for partition-stable processing, exactly-once semantics, stream processors for engagement and trend detection, and the operational scaffolding (dead-letter topics, back-pressure, monitoring) you will want before any of it sees real traffic. There is also a lighter Redis Streams variant at the end for teams operating below the Kafka complexity floor.

Architecture overview

The pipeline has five layers. From left to right:

TikLiveAPI polling producers. Python workers that hit endpoints like /user-posts/, /post-comments/, and /userinfo-by-id/ on a schedule, parse the response, and write to Kafka. Auth header is X-Api-Key, base URL is https://api.tikliveapi.com. Pricing is one credit per request - see the current credit packages.
Kafka topics per data type. Snapshots, post events, comment events, follower deltas. Each topic has its own key, partition count, retention, and compaction policy.
Schema registry. Either Confluent Schema Registry with Avro or a lighter JSON Schema registry. Producers register schemas at startup; consumers fetch and cache them. Without this, every consumer reimplements the same parsing logic, and a field rename silently breaks four downstream jobs.
Stream processors. Flink, Kafka Streams, or ksqlDB. This layer computes windowed aggregations (engagement rate per hour), joins across topics (comments enriched with author info from the user snapshots topic), and detects anomalies (trending posts).
Sinks. Postgres via Kafka Connect JDBC sink, ClickHouse via the official Kafka engine table, Elasticsearch via the ES sink connector, S3 via the S3 sink connector with hourly Parquet partitioning. Each sink has its own consumer group, so a slow Postgres write does not block the S3 archive.

The producers do not block on the API. They pull from a polling schedule, hit the API, parse, and produce. Everything else - aggregation, enrichment, sinking - happens off the hot path.

Topic design

Bad keys are the single most common Kafka mistake on data ingestion pipelines. Pick the key based on what the downstream stream processors will partition on, and stick with it.

tiktok.users.snapshots

One record per user-snapshot poll. Key is the numeric userid (string). Source endpoint: /userinfo-by-id/ which returns nested user and stats objects in camelCase - followerCount, followingCount, heartCount, videoCount, diggCount. See documentation/users. Topic config: 12 partitions, retention 14 days, no compaction (you want the history of snapshots, not just the latest).

tiktok.posts.events

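One record per post discovery or update. Key is aweme_id (the TikTok post id, snake_case). Source endpoints: /user-posts/ for discovery, /post-detail/ for full enrichment. The /post-detail/ response is a flat snake_case object with aweme_id, play, wmplay, hdplay, play_count, digg_count, comment_count, share_count. 24 partitions, retention 30 days, log compaction enabled so the latest version per aweme_id survives. For both time-based retention and compaction to apply, set cleanup.policy=compact,delete; keep in mind that compaction eventually drops older versions of a key, so a late replay sees fewer intermediate counter values than a live consumer did. See documentation/posts.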

tiktok.comments.events

One record per comment. Key is the comment id field (TikLiveAPI uses id on comments, not cid). Source: /post-comments/ and /post-comment-replies/. Value carries id, video_id, text, create_time, digg_count, reply_total, and the nested user object. 24 partitions, retention 7 days.

tiktok.followers.deltas

Computed, not directly polled. Key is userid. A stream processor reads tiktok.users.snapshots, compares followerCount to the previous snapshot for the same user, and emits a delta record ({userid, t_prev, t_now, delta, prev_count, curr_count}). 12 partitions, retention 90 days. Useful for downstream alerting ("creator X gained 50k followers in 2 hours").
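As a rough illustration of the delta computation described above, here is a minimal Python sketch. The Snapshot class and its polled_at field are assumptions made for the example; only followerCount and the delta record's field names come from the text, and a real stream processor would hold the previous snapshot in keyed state rather than pass it in by hand.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    """Hypothetical in-memory view of one tiktok.users.snapshots record."""
    userid: str
    follower_count: int  # followerCount from the snapshot's stats object
    polled_at: int       # poll time as a unix timestamp (assumed field)


def follower_delta(prev: Optional[Snapshot], curr: Snapshot) -> Optional[dict]:
    """Build a delta record, or return None for a user's first snapshot."""
    if prev is None:
        return None
    if prev.userid != curr.userid:
        raise ValueError("snapshots belong to different users")
    return {
        "userid": curr.userid,
        "t_prev": prev.polled_at,
        "t_now": curr.polled_at,
        "delta": curr.follower_count - prev.follower_count,
        "prev_count": prev.follower_count,
        "curr_count": curr.follower_count,
    }
```

Returning None for the first snapshot avoids emitting a misleading delta equal to the creator's entire follower count.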

Producer implementation

The producer below uses confluent-kafka, polls /user-posts/ for a configured list of user ids, dedupes by aweme_id using a Bloom filter so re-polls do not flood downstream with duplicates, and writes idempotently to tiktok.posts.events.

import os
import time
import json
import logging
import requests
from confluent_kafka import Producer
from pybloom_live import ScalableBloomFilter

API_BASE = "https://api.tikliveapi.com"
API_KEY = os.environ["TIKLIVEAPI_KEY"]
TOPIC = "tiktok.posts.events"

producer = Producer({
    "bootstrap.servers": os.environ["KAFKA_BOOTSTRAP"],
    "enable.idempotence": True,
    "acks": "all",
    "max.in.flight.requests.per.connection": 5,
    "compression.type": "zstd",
    "linger.ms": 50,
    "batch.size": 65536,
    "transactional.id": "tiktok-posts-producer-1",
})
producer.init_transactions()

seen = ScalableBloomFilter(initial_capacity=1_000_000, error_rate=0.001)
log = logging.getLogger("producer")

def fetch_user_posts(userid, cursor=0):
    r = requests.get(
        f"{API_BASE}/user-posts/",
        headers={"X-Api-Key": API_KEY},
        params={"userid": userid, "count": 35, "cursor": cursor},
        timeout=15,
    )
    r.raise_for_status()
    return r.json()

def delivery_cb(err, msg):
    if err:
        log.error("delivery failed: %s", err)

def poll_and_produce(userid):
    cursor = 0
    while True:
        payload = fetch_user_posts(userid, cursor)
        videos = payload.get("videos", [])
        if not videos:
            return
        # Ids produced in this transaction. They are marked as seen only after
        # the commit succeeds, so an aborted page is retried on the next poll.
        pending = []
        producer.begin_transaction()
        try:
            for v in videos:
                aweme_id = v["aweme_id"]
                if aweme_id in seen or aweme_id in pending:
                    continue
                pending.append(aweme_id)
                producer.produce(
                    TOPIC,
                    key=str(aweme_id).encode(),
                    value=json.dumps(v).encode(),
                    on_delivery=delivery_cb,
                )
            producer.commit_transaction()
        except Exception:
            producer.abort_transaction()
            raise
        for aweme_id in pending:
            seen.add(aweme_id)
        if not payload.get("hasMore"):
            return
        cursor = payload.get("cursor", 0)
        time.sleep(0.2)  # small pause between pages

Four things to note. First, enable.idempotence=True plus a transactional.id gives you exactly-once writes within the producer session - duplicates from retries are deduped by the broker, and each page of results is committed atomically. Second, the Bloom filter is a cheap second line of defense for application-level dedup; size it for your expected post volume per day, and remember that a Bloom filter can return false positives, so at error_rate=0.001 roughly one new post in a thousand is skipped as already seen. Third, linger.ms and batch.size are tuned for throughput, not latency - you are polling, not serving requests, so trading 50ms for better compression is fine. Fourth, the polling loop respects the API's hasMore envelope and walks the cursor exactly as the docs describe.

For comments, swap the endpoint to /post-comments/ and use the comment id field as the Kafka key. For followers, hit /user-followers/ - pagination is by the time parameter (a timestamp), not cursor, and the top-level list is followers. The mirror endpoint /user-following/ uses followings as its top-level key (plural with trailing s), which catches people off guard - test against the live response.

Schema example for posts.events

A trimmed Avro schema for the post event topic, derived from /post-detail/:

{
  "type": "record",
  "name": "TikTokPostEvent",
  "namespace": "com.tikliveapi.events",
  "fields": [
    {"name": "aweme_id",       "type": "string"},
    {"name": "region",         "type": ["null", "string"], "default": null},
    {"name": "title",          "type": "string"},
    {"name": "duration",       "type": "int"},
    {"name": "play",           "type": "string"},
    {"name": "wmplay",         "type": "string"},
    {"name": "hdplay",         "type": ["null", "string"], "default": null},
    {"name": "play_count",     "type": "long"},
    {"name": "digg_count",     "type": "long"},
    {"name": "comment_count",  "type": "long"},
    {"name": "share_count",    "type": "long"},
    {"name": "download_count", "type": "long"},
    {"name": "collect_count",  "type": "long"},
    {"name": "create_time",    "type": "long"},
    {"name": "author_id",      "type": "string"},
    {"name": "ingested_at",    "type": "long"}
  ]
}

Note that hdplay is nullable - not every post has an HD version available. create_time is the TikTok-emitted unix timestamp; ingested_at is the wall clock when your producer wrote the event. Keeping both lets downstream jobs distinguish event time from processing time, which Flink needs for watermarks.
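To make the field mapping concrete, the sketch below shapes a /post-detail/ payload into a dict matching the schema above. The function name to_post_event is hypothetical, author_id is passed in explicitly because the source does not say where it sits in the response, and storing ingested_at in milliseconds is an assumption.

```python
import time
from typing import Any, Dict, Optional

# Non-nullable fields of the trimmed schema that are copied from the payload.
STRING_FIELDS = ("title", "play", "wmplay")
LONG_FIELDS = ("play_count", "digg_count", "comment_count", "share_count",
               "download_count", "collect_count", "create_time")


def to_post_event(detail: Dict[str, Any], author_id: Any,
                  ingested_at: Optional[int] = None) -> Dict[str, Any]:
    """Shape a post-detail dict into a TikTokPostEvent-compatible record."""
    required = ("aweme_id", "duration") + STRING_FIELDS + LONG_FIELDS
    missing = [name for name in required if name not in detail]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    event: Dict[str, Any] = {
        "aweme_id": str(detail["aweme_id"]),
        "region": detail.get("region"),   # nullable in the schema
        "hdplay": detail.get("hdplay"),   # nullable: no HD version
        "duration": int(detail["duration"]),
        "author_id": str(author_id),
        # Processing time; milliseconds is an assumption of this sketch.
        "ingested_at": int(time.time() * 1000) if ingested_at is None else ingested_at,
    }
    for name in STRING_FIELDS:
        event[name] = str(detail[name])
    for name in LONG_FIELDS:
        event[name] = int(detail[name])
    return event
```

Raising on a missing required field, rather than defaulting it, pairs naturally with a dead-letter topic: the raw payload can be parked for inspection instead of being written with invented values.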

Stream processing examples

Engagement rate calculator

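Engagement rate per post per hour, computed in Flink SQL over tiktok.posts.events. The query assumes the posts_events table exposes ingested_at as a TIMESTAMP(3) time attribute with a watermark; the Avro field is a long, so convert it in the table definition: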

SELECT
  aweme_id,
  TUMBLE_END(ingested_at, INTERVAL '1' HOUR) AS window_end,
  MAX(digg_count + comment_count + share_count) AS engagement_total,
  MAX(play_count) AS plays,
  CAST(MAX(digg_count + comment_count + share_count) AS DOUBLE)
    / NULLIF(MAX(play_count), 0) AS engagement_rate
FROM posts_events
GROUP BY aweme_id, TUMBLE(ingested_at, INTERVAL '1' HOUR);

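Because the producer emits a new event each time it re-polls a post, taking MAX inside the window gives you the highest counter value seen within that hour, which equals the latest value as long as the counters only grow. Note that the Bloom filter in the producer above drops repeat events for an aweme_id it has already seen, so these counter updates have to come from a re-poll path that bypasses that filter (for example, a worker that refreshes known posts via /post-detail/). The result topic is keyed by aweme_id and sunk to ClickHouse for the dashboard.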
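For readers who want to check the windowing arithmetic outside Flink, here is a small pure-Python sketch of the same hourly MAX aggregation. It assumes ingested_at is in epoch milliseconds and that events arrive as plain dicts; it illustrates the logic only and is not a replacement for the Flink job.

```python
from collections import defaultdict

HOUR_MS = 3_600_000


def hourly_engagement(events):
    """Aggregate events into (aweme_id, window_end_ms) -> engagement stats."""
    windows = defaultdict(lambda: {"engagement_total": 0, "plays": 0})
    for e in events:
        # End of the one-hour tumbling window that contains this event.
        window_end = (e["ingested_at"] // HOUR_MS + 1) * HOUR_MS
        w = windows[(e["aweme_id"], window_end)]
        total = e["digg_count"] + e["comment_count"] + e["share_count"]
        w["engagement_total"] = max(w["engagement_total"], total)
        w["plays"] = max(w["plays"], e["play_count"])
    result = {}
    for key, w in windows.items():
        # Mirror NULLIF(plays, 0): no rate when there are no plays.
        rate = w["engagement_total"] / w["plays"] if w["plays"] else None
        result[key] = {**w, "engagement_rate": rate}
    return result
```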

Trending detector

For trending posts, run a windowed delta across two consecutive windows. Pseudo-Flink:

WITH per_hour AS (
  SELECT aweme_id,
         TUMBLE_END(ingested_at, INTERVAL '1' HOUR) AS h,
         MAX(play_count) AS plays
  FROM posts_events
  GROUP BY aweme_id, TUMBLE(ingested_at, INTERVAL '1' HOUR)
)
SELECT a.aweme_id, a.h, a.plays - b.plays AS delta_plays
FROM per_hour a
JOIN per_hour b ON a.aweme_id = b.aweme_id
              AND a.h = b.h + INTERVAL '1' HOUR
WHERE a.plays - b.plays > 100000;

Threshold tuning belongs in config, not SQL - you will want different thresholds for big and small creators.
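One way to keep thresholds out of the SQL is a small tiered lookup keyed on creator size, sketched below. The tier boundaries and all values other than the 100,000 figure from the query are hypothetical placeholders, and in practice the table would be loaded from configuration rather than hard-coded.

```python
# (minimum follower count, hourly play delta that counts as trending),
# ordered from the largest tier down. Values are illustrative only.
TREND_THRESHOLDS = [
    (1_000_000, 500_000),
    (100_000, 100_000),
    (0, 10_000),
]


def trending_threshold(follower_count, thresholds=TREND_THRESHOLDS):
    """Return the play-delta threshold for a creator of the given size."""
    for min_followers, delta in thresholds:
        if follower_count >= min_followers:
            return delta
    return thresholds[-1][1]


def is_trending(prev_plays, curr_plays, follower_count):
    """True when the hour-over-hour play delta exceeds the tier threshold."""
    return (curr_plays - prev_plays) > trending_threshold(follower_count)
```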

Comment sentiment streaming join

Kafka Streams pattern: a topology that consumes tiktok.comments.events, calls a local sentiment model (or a sidecar gRPC service), and joins against a KTable built from tiktok.users.snapshots to enrich each comment with the commenter's follower count. The join is keyed on the comment's user.id, so the comment stream, which is keyed by comment id, has to be re-keyed (and therefore repartitioned) before the join. The KTable keeps only the most recent snapshot per user in its state store, so the join sees the latest state per user even though the source topic itself is not compacted. The model side of that sidecar is exactly what the comment sentiment analysis pipeline guide walks through.

Consumer patterns

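Two patterns matter on the consumer side. First, exactly-once: set isolation.level=read_committed on consumers that subscribe to topics written by a transactional producer. The Java consumer defaults to read_uncommitted, under which you will see records from aborted transactions and your downstream counters will overshoot. librdkafka-based clients such as confluent-kafka default to read_committed, but setting it explicitly keeps the intent visible in config.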
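A consumer configuration along these lines might look like the sketch below. The helper name and the group id are hypothetical; the property names are standard Kafka consumer settings, and disabling auto-commit is a design choice of this sketch rather than something the text requires.

```python
def read_committed_consumer_config(bootstrap_servers, group_id):
    """Build a config dict suitable for confluent_kafka.Consumer(...)."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # Skip records that belong to aborted transactions.
        "isolation.level": "read_committed",
        # Commit offsets manually, after the sink write has succeeded.
        "enable.auto.commit": False,
        # A brand-new group starts from the oldest retained record.
        "auto.offset.reset": "earliest",
    }


# Usage (requires the confluent-kafka package and a reachable broker):
#   from confluent_kafka import Consumer
#   consumer = Consumer(read_committed_consumer_config("localhost:9092", "postgres-sink"))
#   consumer.subscribe(["tiktok.posts.events"])
```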

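Second, partition strategy. Kafka's default partitioner hashes the key and takes the result modulo the partition count. Because tiktok.posts.events is keyed by aweme_id, all events for the same post land on the same partition, so a single consumer instance owns the entire history of any given post - critical for stateful processing. For tiktok.users.snapshots keyed by userid, partition assignment is effectively hash(userid) mod N where N is the partition count. One caveat: librdkafka-based clients such as confluent-kafka default to a CRC32-based partitioner while the Java client uses murmur2, so if Python producers and JVM stream processors must co-partition topics for joins, set partitioner=murmur2_random on the Python side. Pick N large enough that you can scale your consumer group to that many instances without re-keying - 24 is a reasonable default for medium-scale workloads.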
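The key-to-partition mapping can be illustrated in a few lines of Python. This sketch uses zlib.crc32 purely as a stand-in hash; it is not guaranteed to reproduce the partition numbers any real Kafka client would choose, and both function names are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition as hash(key) mod N (illustrative hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys that land on a different partition after resizing."""
    keys = list(keys)
    moved = sum(1 for k in keys if partition_for(k, old_n) != partition_for(k, new_n))
    return moved / len(keys)
```

Calling moved_fraction on a sample of your own keys for a change from 12 to 24 partitions shows why partition counts are hard to change later: a large share of keys would start landing on a different partition, which breaks per-key ordering and invalidates partition-local state.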

Operational concerns

Three failure modes will hit you before anything else.

Back-pressure when the API rate-limits

TikLiveAPI charges one credit per request and the credit balance is the de facto rate limit. When your producer pool drains the balance, requests start failing. Build a circuit breaker on the producer side: track HTTP status codes in a rolling window, and when the failure rate crosses a threshold, pause polling and emit a metric. Resume on a backoff. Do not retry tight-loop against a low-balance account - you will burn the residual credits on retries and your monitoring will be the last thing to know. The circuit-breaker and retry-budget patterns are covered in more depth in our guide to building resilient pipelines around TikTok API rate limits.
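A minimal circuit breaker over a rolling window of status codes could look like the sketch below. The class, its defaults, and the fixed cooldown are assumptions made for illustration; a production version would likely use exponential backoff and emit a metric when the breaker opens.

```python
import time
from collections import deque


class CircuitBreaker:
    """Pause polling when too many recent API calls have failed."""

    def __init__(self, window=50, failure_threshold=0.5, cooldown_s=60.0,
                 clock=time.monotonic):
        self.results = deque(maxlen=window)  # True means the call failed
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.opened_at = None

    def record(self, status_code: int) -> None:
        """Record one response; open the breaker if the window is unhealthy."""
        self.results.append(status_code >= 400)
        window_full = len(self.results) == self.results.maxlen
        if window_full and sum(self.results) / len(self.results) >= self.failure_threshold:
            self.opened_at = self.clock()
            self.results.clear()  # start with a fresh window after cooldown

    def allow_request(self) -> bool:
        """False while the breaker is open and the cooldown has not elapsed."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            self.opened_at = None
            return True
        return False
```

The polling loop would call allow_request() before each API call and record() after each response, sleeping while the breaker is open instead of spending the remaining credits on retries.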

Dead-letter topics

Schema validation will fail occasionally - TikTok ships new fields, the upstream API surfaces them, and your Avro schema does not yet know. Catch the validation error in the producer, write the raw payload to tiktok.posts.events.dlq with the error message in a header, and continue. A daily job sweeps the DLQ, alerts on volume, and gives you a small backlog of payloads to evolve the schema against.
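The routing step can be sketched as a small function that takes the validator and the produce call as parameters. Both validate and the error header name are hypothetical; produce is assumed to accept the same topic, key, value, and headers arguments as confluent_kafka's Producer.produce.

```python
import json


def route_event(payload, validate, produce, topic="tiktok.posts.events"):
    """Produce a valid payload to the main topic, an invalid one to the DLQ.

    Returns True when the payload went to the main topic.
    """
    raw_key = payload.get("aweme_id")
    key = None if raw_key is None else str(raw_key).encode()
    value = json.dumps(payload).encode()
    try:
        validate(payload)
    except Exception as exc:
        # Keep the raw payload and attach the failure reason as a header.
        produce(topic + ".dlq", key=key, value=value,
                headers=[("error", str(exc).encode())])
        return False
    produce(topic, key=key, value=value, headers=[])
    return True
```

Injecting produce keeps the function testable without a broker; in the producer above it would be called as route_event(v, validate, producer.produce).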

Monitoring

The metrics worth alerting on: consumer lag per group per topic, producer batch send errors, schema registry compatibility violations, API HTTP error rate from the producer side, credit balance threshold (poll the dashboard's /profile/ page or scrape your own usage records). Lag is the canonical "something is wrong downstream" indicator - alert at 5 minutes for hot topics and 30 minutes for archival sinks. Track these in Prometheus via the JMX exporter and the Kafka Connect REST API.
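The lag thresholds above translate into a simple per-tier check. The sketch below assumes lag has already been converted to seconds per consumer group and that each group is tagged as hot or archival; the group names in the usage note are hypothetical.

```python
# Alert limits from the text: 5 minutes for hot topics, 30 for archival sinks.
LAG_LIMITS_SECONDS = {"hot": 5 * 60, "archival": 30 * 60}


def lag_alerts(lag_seconds, tiers, limits=LAG_LIMITS_SECONDS):
    """Return the sorted names of consumer groups whose lag exceeds the limit."""
    alerts = []
    for group, lag in lag_seconds.items():
        tier = tiers.get(group, "hot")  # untagged groups get the strict limit
        if lag > limits[tier]:
            alerts.append(group)
    return sorted(alerts)
```

For example, with lags of 400 seconds on a postgres-sink group and 900 seconds on an s3-archive group tagged archival, only postgres-sink is reported.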

Alternative lighter setup with Redis Streams

If you are processing under a few hundred thousand events per day, Kafka is overkill. Redis Streams gives you append-only logs, consumer groups, and at-least-once delivery in a single binary you probably already run. Mapping:

Topics become Redis streams (XADD tiktok.posts.events * aweme_id 7123... payload {...}).
Consumer groups become XGROUP CREATE plus XREADGROUP, which spreads stream entries across the consumers in a group.
Replay becomes XRANGE from the start of the stream (XRANGE tiktok.posts.events - +).
Schema enforcement happens in application code - there is no Schema Registry equivalent.

What you lose: true partition-based parallelism (a stream is one logical unit, not 24 partitions), Kafka Connect's library of sink connectors, and Flink/ksqlDB compatibility. What you gain: one process to operate, sub-millisecond write latency, and zero ZooKeeper or KRaft to babysit. For a single-developer scraping setup that feeds one Postgres database, Redis Streams behind a polling worker is a perfectly reasonable choice. Graduate to Kafka when you need a second sink or your data engineer starts asking about exactly-once semantics.

Test the API surface against the interactive API playground before wiring up a producer - you can validate the exact JSON shape you will be parsing without writing a single line of streaming code. The full endpoint catalog lives in the documentation, including deep links for users, posts, and search. If you hit edge cases on schema evolution or credit-balance pacing, the contact form reaches the engineering team directly. Your dashboard lives on your profile page, and more architecture posts ship on the blog.

FAQ

Does TikLiveAPI push events to Kafka directly, or do I have to poll?

You poll. TikLiveAPI is a synchronous REST API that does not offer webhooks or push delivery, so a producer worker on your side is required. The polling cadence is yours to choose - tighter for hot users, looser for cold backlog.

How do I avoid duplicate events when re-polling the same user?

Two layers. At the application level, keep a Bloom filter or Redis set of aweme_id values you have already produced in the last 24 hours. At the Kafka level, enable producer idempotence and use aweme_id as the message key with log compaction on the topic - the latest version per key survives, older revisions are eventually removed.
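For the application-level layer, a time-bounded "seen" set is enough at small scale. The sketch below keeps ids in a dict with a 24-hour TTL; the class name is hypothetical, and it stands in for the Bloom filter or Redis set mentioned above rather than reproducing either.

```python
import time


class RecentlySeen:
    """Remember keys for ttl_s seconds; older entries count as new again."""

    def __init__(self, ttl_s=24 * 3600, clock=time.time):
        self.ttl_s = ttl_s
        self.clock = clock
        self._seen = {}  # key -> time it was last recorded

    def check_and_add(self, key) -> bool:
        """Return True if key is new (or expired) and record it."""
        now = self.clock()
        last = self._seen.get(key)
        if last is not None and now - last < self.ttl_s:
            return False
        self._seen[key] = now
        return True

    def prune(self) -> None:
        """Drop expired entries so memory does not grow without bound."""
        cutoff = self.clock() - self.ttl_s
        self._seen = {k: t for k, t in self._seen.items() if t > cutoff}
```

Unlike a Bloom filter, this never reports a false positive, at the cost of memory proportional to the number of ids seen in the window.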

What should I set partition counts to?

Start at 24 for hot topics (tiktok.posts.events, tiktok.comments.events) and 12 for warmer ones (tiktok.users.snapshots, tiktok.followers.deltas). Partition count is hard to change later: adding partitions changes which partition a given key hashes to, which breaks per-key ordering and any partition-local state unless you re-key, so err on the higher side. The cost is metadata overhead on the broker, not throughput.

Avro or JSON Schema?

Avro if you have Confluent Schema Registry already running and your stream processors are JVM-based - the binary format is compact and the schema evolution rules are well understood. JSON Schema if your consumers are mixed-language and you value being able to read messages with kafka-console-consumer. Both work; the choice is operational, not technical.

How do I handle TikTok counter rollbacks where `play_count` appears to decrease between polls?

Treat counters as eventually consistent. Take MAX in windowed aggregations instead of LAST, and emit a warning metric when a delta is negative beyond a threshold. Some of these are real (deleted comments, removed likes); others are CDN-cache artifacts on the upstream side. Do not propagate them as alerts.
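A per-counter tracker that applies both rules, keeping the running maximum and counting large negative deltas, might look like this. The class name and the default threshold are hypothetical; the warnings attribute stands in for whatever metrics client you use.

```python
class MonotonicCounter:
    """Track one counter, ignoring rollbacks but counting the large ones."""

    def __init__(self, warn_threshold=1000):
        self.warn_threshold = warn_threshold
        self.value = None   # highest reading seen so far
        self.warnings = 0   # stand-in for a warning metric

    def observe(self, reading: int) -> int:
        """Record a polled reading and return the value to use downstream."""
        if self.value is None:
            self.value = reading
            return self.value
        delta = reading - self.value
        if delta < -self.warn_threshold:
            self.warnings += 1  # large drop: surface as a metric, not an alert
        self.value = max(self.value, reading)
        return self.value
```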
