TikTok is a firehose. A creator dashboard, a brand monitoring tool, or an AI content app needs to turn raw video clips into searchable, structured text the moment a URL is dropped in. Doing that well means stitching together three layers: a metadata source, an automatic speech recognition (ASR) layer, and a large language model (LLM) that distills the content into something a product can act on.
This guide shows how to build that pipeline on top of TikLiveAPI. We pull the post metadata and a no-watermark download from /post-detail/, transcribe the audio with Whisper, and feed both signals to an LLM that returns a TL;DR, key takeaways, sentiment, and a suggested call-to-action. The full code comes to around 150 lines of Python and is cheap enough to run on every video in a creator backlog.
The flow is intentionally linear. Each stage produces a typed artifact the next stage consumes, which keeps retries and caching trivial.
First, call /post-detail/ with the X-Api-Key header. Read play or hdplay for the no-watermark MP4, plus music_info, author, and the engagement counters.

The dashboard side of TikLiveAPI does not deduct credits or proxy media. Credits are spent on the API server when you hit endpoints like /post-detail/. The actual video bytes are served by TikTok's CDN, so your download cost is bandwidth only. You can experiment with the request shape in the playground before wiring anything into code.
The detail endpoint returns a flat snake_case object. The top-level keys you care about for summarization are aweme_id, id, play, wmplay, hdplay, music, music_info, play_count, digg_count, comment_count, and share_count. The author object is nested and carries the creator's username and follower stats.
import os, requests

API_KEY = os.environ["TIKLIVE_API_KEY"]
BASE = "https://api.tikliveapi.com"

def fetch_post(url: str) -> dict:
    r = requests.get(
        f"{BASE}/post-detail/",
        params={"url": url},
        headers={"X-Api-Key": API_KEY},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()
Prefer play over wmplay because the watermark obscures on-screen text and confuses downstream OCR. Use hdplay when you specifically need high resolution for thumbnail extraction or human review. Full field semantics are in the detail endpoint docs.
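The preference order above can be written down as a small helper. This is a minimal sketch, not part of TikLiveAPI: the function name select_video_url is hypothetical, and falling back to wmplay as a last resort is an assumption you may prefer to drop if watermarked video is useless for your OCR step.

```python
def select_video_url(post: dict, want_hd: bool = False) -> str:
    """Pick a download URL from a /post-detail/ payload.

    Order follows the text: `play` by default, `hdplay` when high
    resolution is requested. `wmplay` is tried last (an assumption).
    """
    # A payload without aweme_id is treated as an error response, not a post.
    if not post.get("aweme_id"):
        raise ValueError("payload has no aweme_id; not a post")

    if want_hd:
        order = ["hdplay", "play", "wmplay"]
    else:
        order = ["play", "hdplay", "wmplay"]

    for key in order:
        url = post.get(key)
        if url:  # skip missing keys and empty strings
            return url
    raise ValueError("no playable URL in payload")
```

Raising on a missing aweme_id keeps error payloads from travelling further down the pipeline as if they were posts.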
The URL returned in play is a presigned TikTok CDN link with a short TTL. Download it immediately. If you queue the job, refetch /post-detail/ first so you do not hit a stale URL.
def download_mp4(play_url: str, dest: str) -> str:
    with requests.get(play_url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in r.iter_content(chunk_size=1024 * 256):
                f.write(chunk)
    return dest
For details on the download endpoint and the difference between play, wmplay, and hdplay, see /documentation/download/video/.
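If downloads are queued rather than run immediately, the refetch-before-download rule can be enforced with a timestamp on the job. The sketch below is illustrative only: DownloadJob, fresh_play_url, and the refetch callable are hypothetical names, and the 10-minute TTL is a placeholder because the real lifetime of the presigned link is not stated here.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional

# Assumption: placeholder TTL. Measure the real expiry before relying on it.
ASSUMED_TTL_SECONDS = 600


@dataclass
class DownloadJob:
    source_url: str   # the TikTok post URL that was submitted
    play_url: str     # presigned CDN link returned by /post-detail/
    fetched_at: float = field(default_factory=time.time)

    def is_stale(self, now: Optional[float] = None,
                 ttl: float = ASSUMED_TTL_SECONDS) -> bool:
        now = time.time() if now is None else now
        return now - self.fetched_at >= ttl


def fresh_play_url(job: DownloadJob,
                   refetch: Callable[[str], dict],
                   now: Optional[float] = None) -> str:
    """Return a usable play URL, calling `refetch` only when the job is stale.

    `refetch` stands in for a function like fetch_post: it takes the post
    URL and returns the /post-detail/ payload.
    """
    if job.is_stale(now):
        post = refetch(job.source_url)
        job.play_url = post["play"]
        job.fetched_at = time.time() if now is None else now
    return job.play_url
```

A worker would call fresh_play_url(job, fetch_post) right before download_mp4, so a job that sat in the queue too long costs one extra API call instead of a failed download.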
Whisper expects 16 kHz mono PCM. ffmpeg does this in one shot:
import subprocess

def extract_audio(mp4_path: str, wav_path: str) -> str:
    subprocess.run(
        ["ffmpeg", "-y", "-i", mp4_path, "-vn",
         "-ac", "1", "-ar", "16000", wav_path],
        check=True, capture_output=True,
    )
    return wav_path
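A cheap sanity check after extraction can catch a bad ffmpeg run before it reaches the GPU. The helper below is a sketch using only the standard library; the name is_whisper_ready is hypothetical, and the 16-bit sample width is an assumption about what ffmpeg writes for a .wav target by default.

```python
import wave


def is_whisper_ready(wav_path: str) -> bool:
    """Return True if the file is a 16 kHz, mono, 16-bit PCM WAV."""
    try:
        with wave.open(wav_path, "rb") as w:
            return (
                w.getframerate() == 16000   # 16 kHz sample rate
                and w.getnchannels() == 1   # mono
                and w.getsampwidth() == 2   # 2 bytes per sample (assumed)
            )
    except (wave.Error, EOFError):
        # Not a PCM WAV file, or the file is truncated or empty.
        return False
```

Calling it right after extract_audio and failing the job early is usually cheaper than discovering the problem as an empty transcript.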
For ASR you have two reasonable choices in 2026:
A pragmatic default is to route by detected language. Run a 5-second probe with a tiny model, then branch to distil for English and large-v3 for everything else. The probe adds about 200 ms but cuts average transcription cost by roughly 60 percent on a mixed feed.
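The routing decision itself is a few lines once the probe has produced a language code and a confidence score. This is a sketch under assumptions: the function name, the 0.5 confidence floor, and the model name "distil-large-v3" for the distilled English model are all illustrative choices, not something the pipeline prescribes.

```python
def choose_model_name(language: str, probability: float,
                      min_probability: float = 0.5) -> str:
    """Map the probe's detected language to a Whisper model name.

    `language` and `probability` are assumed to come from a short probe
    pass with a tiny model.
    """
    # Only trust an English detection when the probe is reasonably sure;
    # otherwise fall back to the multilingual model.
    if language == "en" and probability >= min_probability:
        return "distil-large-v3"  # assumed name for the distilled model
    return "large-v3"
```

Sending low-confidence detections to large-v3 trades a little cost for fewer garbled transcripts on code-switched or noisy clips.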
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def transcribe(wav_path: str) -> dict:
    segments, info = model.transcribe(wav_path, vad_filter=True)
    text = " ".join(s.text.strip() for s in segments)
    return {"text": text, "language": info.language,
            "duration": info.duration}
A non-trivial slice of TikTok content has no speech: dance clips, slideshows, lip-sync over commercial music, ASMR. If Whisper returns an empty transcript or only the music track's lyrics, you need a fallback so the summary is not garbage.
The cleanest fallback is OCR on the cover frame. /post-detail/ returns a cover URL alongside the play URL. Run Tesseract or PaddleOCR on that image, then feed the recognized on-screen text plus the music_info.title and music_info.author to the LLM. For slideshows with multiple frames, sample one frame per second from the MP4 and OCR each.
def needs_ocr_fallback(transcript: dict) -> bool:
    text = transcript["text"].strip()
    if len(text) < 20:
        return True
    word_count = len(text.split())
    if transcript["duration"] > 5 and word_count / transcript["duration"] < 0.5:
        return True
    return False
The threshold of 0.5 words per second is a rough heuristic. Tune it on your own data. Music-only videos almost always come in well below it.
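For the slideshow case, where one frame per second is sampled from the MP4 for OCR, ffmpeg's fps filter can do the sampling. The sketch below only builds the argument list; the function name frame_sampling_cmd and the frame_%04d.png naming pattern are hypothetical choices.

```python
import os


def frame_sampling_cmd(mp4_path: str, out_dir: str, fps: int = 1) -> list:
    """Build an ffmpeg command that writes `fps` frames per second as PNGs."""
    # ffmpeg expands %04d to a zero-padded frame counter.
    pattern = os.path.join(out_dir, "frame_%04d.png")
    return ["ffmpeg", "-y", "-i", mp4_path, "-vf", f"fps={fps}", pattern]
```

Pass the result to subprocess.run(..., check=True) in the same way extract_audio does, then OCR each PNG in out_dir.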
The LLM step is where most teams over-engineer. Resist the urge to chain three calls. One prompt, strict JSON output, four fields:
SYSTEM = """You summarize TikTok videos for a content analytics product.
Return strict JSON with keys: tldr, takeaways, sentiment, cta.
- tldr: one sentence, 25 words max.
- takeaways: array of 3 to 5 bullets, each under 20 words.
- sentiment: one of positive, negative, neutral, mixed.
- cta: suggested call-to-action for a brand replying to this video, 15 words max.
Do not invent facts. If the transcript is empty, base the summary on metadata only and say so in tldr."""
USER_TEMPLATE = """Creator: @{username} ({followers} followers)
Engagement: {plays} plays, {likes} likes, {comments} comments
Music: {music_title} by {music_author}
Transcript: {transcript}
On-screen text (OCR): {ocr}"""
Plug this into your LLM client of choice. A small or mid model is plenty for this task; reserve frontier models for cases where you need multi-step reasoning over comments or trend detection.
import json
from anthropic import Anthropic

client = Anthropic()

def summarize(post: dict, transcript: str, ocr: str) -> dict:
    author = post.get("author", {})
    music = post.get("music_info", {})
    user_msg = USER_TEMPLATE.format(
        username=author.get("unique_id", "unknown"),
        followers=author.get("follower_count", 0),
        plays=post.get("play_count", 0),
        likes=post.get("digg_count", 0),
        comments=post.get("comment_count", 0),
        music_title=music.get("title", ""),
        music_author=music.get("author", ""),
        transcript=transcript or "(empty)",
        ocr=ocr or "(none)",
    )
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=600,
        system=SYSTEM,
        messages=[{"role": "user", "content": user_msg}],
    )
    return json.loads(resp.content[0].text)
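A bare json.loads on model output fails whenever the model wraps the JSON in a sentence or a code fence, so it can be worth validating the reply before storing it. The sketch below is one way to do that; parse_summary is a hypothetical helper, and the key and sentiment sets simply mirror the system prompt.

```python
import json

REQUIRED_KEYS = {"tldr", "takeaways", "sentiment", "cta"}
SENTIMENTS = {"positive", "negative", "neutral", "mixed"}


def parse_summary(raw: str) -> dict:
    """Extract and validate the summary object from raw model output."""
    # Keep only the outermost {...} span, which drops any stray prose
    # or code-fence markers around the JSON.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in model output")

    data = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError

    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["sentiment"] not in SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {data['sentiment']!r}")
    if not isinstance(data["takeaways"], list):
        raise ValueError("takeaways must be a list")
    return data
```

Because every failure surfaces as ValueError, the caller can catch one exception type and retry the LLM call or fall back to a metadata-only record.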
Glue it together with a single entry point that takes a URL and returns the final JSON. This is the shape you would expose from a worker or a webhook handler.
import tempfile, os

def summarize_tiktok(url: str) -> dict:
    post = fetch_post(url)
    play_url = post.get("play") or post.get("hdplay")
    if not play_url:
        raise ValueError("no playable URL in /post-detail/ response")
    with tempfile.TemporaryDirectory() as tmp:
        mp4 = os.path.join(tmp, "v.mp4")
        wav = os.path.join(tmp, "a.wav")
        download_mp4(play_url, mp4)
        extract_audio(mp4, wav)
        transcript = transcribe(wav)
        ocr_text = ""
        if needs_ocr_fallback(transcript):
            ocr_text = ocr_cover(post.get("cover", ""))
        summary = summarize(post, transcript["text"], ocr_text)
    return {
        "aweme_id": post.get("aweme_id"),
        "author": post.get("author", {}).get("unique_id"),
        "language": transcript["language"],
        "duration": transcript["duration"],
        "transcript": transcript["text"],
        "summary": summary,
        "engagement": {
            "plays": post.get("play_count"),
            "likes": post.get("digg_count"),
            "comments": post.get("comment_count"),
            "shares": post.get("share_count"),
        },
    }
The ocr_cover implementation is a single PaddleOCR call against the cover URL. The exact code is left as an exercise because the right OCR library is workload-dependent.
Per-video unit economics break down into three buckets. Numbers below assume an average 30-second clip with about 75 words of speech.
The API bucket is a single /post-detail/ request per video. See pricing for the current credit cost per call.

End to end, a fully hosted pipeline runs about 0.4 to 0.7 cents per video, dominated by ASR if you use a hosted Whisper endpoint. At a million videos a month that is $4k to $7k plus the TikLiveAPI bill. Self-hosting Whisper on a single H100 drops the per-video cost by roughly an order of magnitude once you hit a few hundred thousand calls per month.
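The monthly figure is plain arithmetic, and a tiny helper makes it easy to rerun with your own volume. This sketch covers only the per-video range quoted above and excludes the TikLiveAPI credits; the function name monthly_cost_usd is hypothetical.

```python
def monthly_cost_usd(videos_per_month: int, cents_per_video: float) -> float:
    """Convert a per-video cost in cents into a monthly bill in dollars."""
    return round(videos_per_month * cents_per_video / 100, 2)


# The range quoted in the text: 0.4 to 0.7 cents per video, 1M videos a month.
low = monthly_cost_usd(1_000_000, 0.4)
high = monthly_cost_usd(1_000_000, 0.7)
print(f"${low:,.0f} to ${high:,.0f} per month")
```

Swapping in a self-hosted per-video estimate shows where the break-even against the GPU's monthly cost falls for your volume.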
For backfills or daily catch-up jobs, the bottleneck is ASR, not the API. Run a worker pool where each worker owns one GPU model handle and feeds it serial jobs, while a separate I/O pool fans out the /post-detail/ calls and downloads.
import asyncio, aiohttp

async def fetch_post_async(session, url):
    async with session.get(
        f"{BASE}/post-detail/",
        params={"url": url},
        headers={"X-Api-Key": API_KEY},
    ) as r:
        return await r.json()

async def process_batch(urls, concurrency=20):
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        async def one(u):
            async with sem:
                return await fetch_post_async(session, u)
        return await asyncio.gather(*(one(u) for u in urls))
Twenty concurrent /post-detail/ requests is a reasonable starting point. Watch your rate-limit headers and back off on 429 responses. ASR concurrency should match your GPU count, not your CPU count; over-subscribing the GPU just causes thrashing.
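Backing off on 429 can be handled by a small wrapper around each request. The sketch below is a generic exponential backoff with jitter, not documented TikLiveAPI behavior: call_with_backoff and backoff_delay are hypothetical names, `call` is assumed to be a zero-argument coroutine function returning a (status, payload) pair, and the delays are placeholders to tune against your actual rate limits.

```python
import asyncio
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Upper bound on the wait for a zero-based attempt: base * 2^attempt, capped."""
    return min(cap, base * (2 ** attempt))


async def call_with_backoff(call, max_attempts: int = 5, sleep=asyncio.sleep):
    """Await `call()` until it returns a non-429 status or attempts run out."""
    status, payload = None, None
    for attempt in range(max_attempts):
        status, payload = await call()
        if status != 429:
            return status, payload
        if attempt < max_attempts - 1:
            # Full jitter: a random wait below the bound keeps workers
            # from retrying in lockstep.
            await sleep(random.uniform(0, backoff_delay(attempt)))
    # Still rate limited after the final attempt; let the caller decide.
    return status, payload
```

The injectable sleep argument exists so the retry logic can be tested without real waiting.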
When a lookup fails, /post-detail/ returns an error payload rather than the usual flat object. Check for the presence of aweme_id before treating the response as a post.

For region-restricted clips, the play URL will 403 from outside the allowed region. Route downloads through a residential proxy in the matching country, or skip the clip and summarize from metadata alone.

To avoid expired download links, refetch /post-detail/ at download time, not at enqueue time.

Yes, the metadata is worth including in the prompt. Engagement counters tell the LLM what the audience reacted to. Music info often signals genre and emotional tone. Author context lets the model adjust register. Drop any of these and the CTA field gets noticeably worse.
Multimodal video models are improving fast but still cost five to twenty times more than the Whisper-plus-LLM split for the same output quality. For high-volume summarization the split pipeline wins on both cost and latency.
Cache aggressively. aweme_id is a stable key. Cache the transcript forever and the summary keyed by transcript hash plus prompt version. Refresh engagement counters separately when you need fresh numbers.
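The two cache keys described above could look like this. It is a sketch with assumed names: PROMPT_VERSION, transcript_key, summary_key, and get_or_compute are hypothetical, and the plain dict stands in for whatever store you actually use.

```python
import hashlib

# Hypothetical version tag; bump it whenever the prompt text changes.
PROMPT_VERSION = "v1"


def transcript_key(aweme_id: str) -> str:
    """Transcripts never change for a given video, so aweme_id is enough."""
    return f"transcript:{aweme_id}"


def summary_key(transcript: str, prompt_version: str = PROMPT_VERSION) -> str:
    """Summaries depend on both the transcript text and the prompt version."""
    digest = hashlib.sha256(transcript.encode("utf-8")).hexdigest()
    return f"summary:{prompt_version}:{digest}"


def get_or_compute(cache: dict, key: str, compute):
    """Return the cached value for `key`, computing and storing it on a miss."""
    if key not in cache:
        cache[key] = compute()
    return cache[key]
```

Keying the summary on the prompt version means a prompt change invalidates summaries without touching the more expensive transcripts.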
The API returns only publicly available content. For commercial summarization products, follow your jurisdiction's rules on derivative works and credit creators in your UI.
Pull an API key from your profile, try one URL in the playground, then drop the Python script above into a worker. If you hit edge cases not covered here, the full documentation covers the response shape for every endpoint, and the team on contact can help with volume pricing. Trend ideas and case studies show up on the blog.