If you only have one downstream consumer for TikTok data, you can hit the TikLiveAPI REST endpoints directly from your worker process and call it a day. The moment you have two consumers - say, a Postgres OLTP store for your product UI and a ClickHouse warehouse for analytics - that direct-call model collapses. You either duplicate the polling cost (paying twice in API credits for the same data), or you build a synchronous fan-out service that becomes a single point of failure.
Kafka solves three specific problems for TikTok ingestion:
tiktok.posts.events for 30 days and you can rebuild a derived table from scratch without re-paying for any API calls.This post walks through a production-shaped pipeline: TikLiveAPI polling producers, Kafka topics keyed for partition-stable processing, exactly-once semantics, stream processors for engagement and trend detection, and the operational scaffolding (dead-letter topics, back-pressure, monitoring) you will want before any of it sees real traffic. There is also a lighter Redis Streams variant at the end for teams operating below the Kafka complexity floor.
The pipeline has five layers. From left to right:
/user-posts/, /post-comments/, and /userinfo-by-id/ on a schedule, parse the response, and write to Kafka. Auth header is X-Api-Key, base URL is https://api.tikliveapi.com. Pricing is one credit per request - see pricing.The producers do not block on the API. They pull from a polling schedule, hit the API, parse, and produce. Everything else - aggregation, enrichment, sinking - happens off the hot path.
Bad keys are the single most common Kafka mistake on data ingestion pipelines. Pick the key based on what the downstream stream processors will partition on, and stick with it.
One record per user-snapshot poll. Key is the numeric userid (string). Source endpoint: /userinfo-by-id/ which returns nested user and stats objects in camelCase - followerCount, followingCount, heartCount, videoCount, diggCount. See documentation/users. Topic config: 12 partitions, retention 14 days, no compaction (you want the history of snapshots, not just the latest).
One record per post discovery or update. Key is aweme_id (the TikTok post id, snake_case). Source endpoints: /user-posts/ for discovery, /post-detail/ for full enrichment. The /post-detail/ response is a flat snake_case object with aweme_id, play, wmplay, hdplay, play_count, digg_count, comment_count, share_count. 24 partitions, retention 30 days, log compaction enabled so the latest version per aweme_id survives. See documentation/posts.
One record per comment. Key is the comment id field (TikLiveAPI uses id on comments, not cid). Source: /post-comments/ and /post-comment-replies/. Value carries id, video_id, text, create_time, digg_count, reply_total, and the nested user object. 24 partitions, retention 7 days.
Computed, not directly polled. Key is userid. A stream processor reads tiktok.users.snapshots, compares followerCount to the previous snapshot for the same user, and emits a delta record ({userid, t_prev, t_now, delta, prev_count, curr_count}). 12 partitions, retention 90 days. Useful for downstream alerting ("creator X gained 50k followers in 2 hours").
The producer below uses confluent-kafka, polls /user-posts/ for a configured list of user ids, dedupes by aweme_id using a Bloom filter so re-polls do not flood downstream with duplicates, and writes idempotently to tiktok.posts.events.
import os
import time
import json
import logging
import requests
from confluent_kafka import Producer
from pybloom_live import ScalableBloomFilter
API_BASE = "https://api.tikliveapi.com"
API_KEY = os.environ["TIKLIVEAPI_KEY"]
TOPIC = "tiktok.posts.events"
producer = Producer({
"bootstrap.servers": os.environ["KAFKA_BOOTSTRAP"],
"enable.idempotence": True,
"acks": "all",
"max.in.flight.requests.per.connection": 5,
"compression.type": "zstd",
"linger.ms": 50,
"batch.size": 65536,
"transactional.id": "tiktok-posts-producer-1",
})
producer.init_transactions()
seen = ScalableBloomFilter(initial_capacity=1_000_000, error_rate=0.001)
log = logging.getLogger("producer")
def fetch_user_posts(userid, cursor=0):
r = requests.get(
f"{API_BASE}/user-posts/",
headers={"X-Api-Key": API_KEY},
params={"userid": userid, "count": 35, "cursor": cursor},
timeout=15,
)
r.raise_for_status()
return r.json()
def delivery_cb(err, msg):
if err:
log.error("delivery failed: %s", err)
def poll_and_produce(userid):
cursor = 0
while True:
payload = fetch_user_posts(userid, cursor)
videos = payload.get("videos", [])
if not videos:
return
producer.begin_transaction()
try:
for v in videos:
aweme_id = v["aweme_id"]
if aweme_id in seen:
continue
seen.add(aweme_id)
producer.produce(
TOPIC,
key=str(aweme_id).encode(),
value=json.dumps(v).encode(),
on_delivery=delivery_cb,
)
producer.commit_transaction()
except Exception:
producer.abort_transaction()
raise
if not payload.get("hasMore"):
return
cursor = payload.get("cursor", 0)
time.sleep(0.2)
Four things to note. First, enable.idempotence=True plus a transactional.id gives you exactly-once writes within the producer session - duplicates from retries are deduped by the broker. Second, the Bloom filter is a cheap second line of defense for application-level dedup; size it for your expected post volume per day. Third, linger.ms and batch.size are tuned for throughput, not latency - you are polling, not serving requests, so trading 50ms for better compression is fine. Fourth, the polling loop respects the API's hasMore envelope and walks the cursor exactly as the docs describe.
For comments, swap the endpoint to /post-comments/ and use the comment id field as the Kafka key. For followers, hit /user-followers/ - pagination is by the time parameter (a timestamp), not cursor, and the top-level list is followers. The mirror endpoint /user-following/ uses followings as its top-level key (plural with trailing s), which catches people off guard - test against the live response.
A trimmed Avro schema for the post event topic, derived from /post-detail/:
{
"type": "record",
"name": "TikTokPostEvent",
"namespace": "com.tikliveapi.events",
"fields": [
{"name": "aweme_id", "type": "string"},
{"name": "region", "type": ["null", "string"], "default": null},
{"name": "title", "type": "string"},
{"name": "duration", "type": "int"},
{"name": "play", "type": "string"},
{"name": "wmplay", "type": "string"},
{"name": "hdplay", "type": ["null", "string"], "default": null},
{"name": "play_count", "type": "long"},
{"name": "digg_count", "type": "long"},
{"name": "comment_count", "type": "long"},
{"name": "share_count", "type": "long"},
{"name": "download_count", "type": "long"},
{"name": "collect_count", "type": "long"},
{"name": "create_time", "type": "long"},
{"name": "author_id", "type": "string"},
{"name": "ingested_at", "type": "long"}
]
}
Note that hdplay is nullable - not every post has an HD version available. create_time is the TikTok-emitted unix timestamp; ingested_at is the wall clock when your producer wrote the event. Keeping both lets downstream jobs distinguish event time from processing time, which Flink needs for watermarks.
Engagement rate per post per hour, computed in Flink SQL over tiktok.posts.events:
SELECT
aweme_id,
TUMBLE_END(ingested_at, INTERVAL '1' HOUR) AS window_end,
MAX(digg_count + comment_count + share_count) AS engagement_total,
MAX(play_count) AS plays,
CAST(MAX(digg_count + comment_count + share_count) AS DOUBLE)
/ NULLIF(MAX(play_count), 0) AS engagement_rate
FROM posts_events
GROUP BY aweme_id, TUMBLE(ingested_at, INTERVAL '1' HOUR);
Because the producer emits a new event each time it re-polls a post, taking MAX inside the window gives you the latest counter value within that hour. The result topic is keyed by aweme_id and sunk to ClickHouse for the dashboard.
For trending posts, run a windowed delta across two consecutive windows. Pseudo-Flink:
WITH per_hour AS (
SELECT aweme_id,
TUMBLE_END(ingested_at, INTERVAL '1' HOUR) AS h,
MAX(play_count) AS plays
FROM posts_events
GROUP BY aweme_id, TUMBLE(ingested_at, INTERVAL '1' HOUR)
)
SELECT a.aweme_id, a.h, a.plays - b.plays AS delta_plays
FROM per_hour a
JOIN per_hour b ON a.aweme_id = b.aweme_id
AND a.h = b.h + INTERVAL '1' HOUR
WHERE a.plays - b.plays > 100000;
Threshold tuning belongs in config, not SQL - you will want different thresholds for big and small creators.
Kafka Streams pattern: a topology that consumes tiktok.comments.events, calls a local sentiment model (or a sidecar gRPC service), and joins against the latest tiktok.users.snapshots KTable to enrich each comment with the commenter's follower count. The join is keyed on the comment's user.id. Snapshots become a compacted KTable so the join sees the most recent state per user.
Two patterns matter on the consumer side. First, exactly-once: enable isolation.level=read_committed on consumers that subscribe to topics written by a transactional producer. Without it, you will see aborted transaction records and your downstream counters will overshoot.
Second, partition strategy. Kafka partitions by hashing the key. Because tiktok.posts.events is keyed by aweme_id, all events for the same post land on the same partition, so a single consumer instance owns the entire history of any given post - critical for stateful processing. For tiktok.users.snapshots keyed by userid, partition assignment is effectively userid mod N where N is the partition count. Pick N large enough that you can scale your consumer group to that many instances without re-keying - 24 is a reasonable default for medium-scale workloads.
Three failure modes will hit you before anything else.
TikLiveAPI charges one credit per request and the credit balance is the de facto rate limit. When your producer pool drains the balance, requests start failing. Build a circuit breaker on the producer side: track HTTP status codes in a rolling window, and when the failure rate crosses a threshold, pause polling and emit a metric. Resume on a backoff. Do not retry tight-loop against a low-balance account - you will burn the residual credits on retries and your monitoring will be the last thing to know.
Schema validation will fail occasionally - TikTok ships new fields, the upstream API surfaces them, and your Avro schema does not yet know. Catch the validation error in the producer, write the raw payload to tiktok.posts.events.dlq with the error message in a header, and continue. A daily job sweeps the DLQ, alerts on volume, and gives you a small backlog of payloads to evolve the schema against.
The metrics worth alerting on: consumer lag per group per topic, producer batch send errors, schema registry compatibility violations, API HTTP error rate from the producer side, credit balance threshold (poll the dashboard's /profile/ page or scrape your own usage records). Lag is the canonical "something is wrong downstream" indicator - alert at 5 minutes for hot topics and 30 minutes for archival sinks. Track these in Prometheus via the JMX exporter and the Kafka Connect REST API.
If you are processing under a few hundred thousand events per day, Kafka is overkill. Redis Streams gives you append-only logs, consumer groups, and at-least-once delivery in a single binary you probably already run. Mapping:
XADD tiktok.posts.events * aweme_id 7123... payload {...}).XGROUP with XREADGROUP for partitioned reading.XRANGE from offset zero.What you lose: true partition-based parallelism (a stream is one logical unit, not 24 partitions), Kafka Connect's library of sink connectors, and Flink/ksqlDB compatibility. What you gain: one process to operate, sub-millisecond write latency, and zero ZooKeeper or KRaft to babysit. For a single-developer scraping setup that feeds one Postgres database, Redis Streams behind a polling worker is a perfectly reasonable choice. Graduate to Kafka when you need a second sink or your data engineer starts asking about exactly-once semantics.
Test the API surface against the playground before wiring up a producer - you can validate the exact JSON shape you will be parsing without writing a single line of streaming code. The full endpoint catalog lives in the documentation, including deep links for users, posts, and search. If you hit edge cases on schema evolution or credit-balance pacing, the contact form reaches the engineering team directly. Your dashboard lives at profile, and more architecture posts ship on the blog.
You poll. TikLiveAPI is a synchronous REST API that does not offer webhooks or push delivery, so a producer worker on your side is required. The polling cadence is yours to choose - tighter for hot users, looser for cold backlog.
Two layers. At the application level, keep a Bloom filter or Redis set of aweme_id values you have already produced in the last 24 hours. At the Kafka level, enable producer idempotence and use aweme_id as the message key with log compaction on the topic - the latest version per key survives, older revisions are eventually removed.
Start at 24 for hot topics (tiktok.posts.events, tiktok.comments.events) and 12 for warmer ones (tiktok.users.snapshots, tiktok.followers.deltas). Partition count is hard to change later without re-keying, so err on the higher side. The cost is metadata overhead on the broker, not throughput.
Avro if you have Confluent Schema Registry already running and your stream processors are JVM-based - the binary format is compact and the schema evolution rules are well understood. JSON Schema if your consumers are mixed-language and you value being able to read messages with kafka-console-consumer. Both work; the choice is operational, not technical.
play_count appears to decrease between polls?Treat counters as eventually consistent. Take MAX in windowed aggregations instead of LAST, and emit a warning metric when a delta is negative beyond a threshold. Some of these are real (deleted comments, removed likes); others are CDN-cache artifacts on the upstream side. Do not propagate them as alerts.
Ready to put what you read into code? Try our endpoints live or grab the full reference.