Archiving TikTok at scale is not the same problem as archiving YouTube, podcasts, or news articles. The videos are short, the URLs are signed and expire quickly, the catalog is constantly being deleted by creators and moderators, and the platform aggressively region-locks content. For journalists tracking misinformation, researchers building academic datasets, brand teams preserving campaign assets, or content tool builders feeding downstream pipelines, the gap between "I can download one video" and "I can reliably capture 50,000 videos a day" is enormous.
The naive approach (one worker, sequential downloads, no metadata layer) breaks within hours. CDN URLs go stale. Duplicate downloads pile up. A single deleted account silently drops thousands of items from your dataset with no error trail. This guide walks through the production patterns we have seen work for high-volume archiving against the TikLiveAPI endpoints, including the specific JSON shapes, the storage layout, the concurrency model, and the legal posture.
Every archive pipeline ultimately resolves to one endpoint: /post-detail/. Given a TikTok URL, it returns a flat snake_case object containing everything you need to persist a video plus its metadata. The relevant fields:
aweme_id - the canonical TikTok video id, your primary key
play - no-watermark MP4 URL (standard quality)
wmplay - watermarked MP4 URL
hdplay - HD no-watermark MP4 URL when available
cover, origin_cover, ai_dynamic_cover - thumbnails
music and music_info - audio track URL and metadata
author - object with id, unique_id, nickname, avatar
play_count, digg_count, comment_count, share_count, download_count, collect_count
create_time, duration, region, title
That is one credit per video. If you only need the file (no extra metadata), /download-video/ returns a smaller payload with just video and video_hd URLs. For most archive pipelines, /post-detail/ is the right choice because you want to persist the counters and author alongside the file.
Before you can resolve videos you need a list of TikTok URLs (or aweme_id values). TikLiveAPI gives you four practical sources, each with the same paginated videos[] / cursor / hasMore envelope:
cursor (string ms timestamp), continue while hasMore is true.
publish_time (0/1/7/30/90/180 day windows) and sort_by (0 relevance, 1 likes, 2 date).
Pull the maximum count per page, persist the cursor after every page, and resume from the last cursor on restart. Treat the listing endpoints as your "discovery" tier and persist the raw aweme_id values to a queue table before you spend credits on detail resolution.
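The cursor loop described above can be sketched as a small generator. This is a minimal sketch, not the API's official client: fetch_page is a hypothetical callable that you supply (it wraps whichever listing endpoint you use), the starting cursor "0" is an assumption, and the sketch assumes each item in videos[] carries an aweme_id.

```python
from pathlib import Path
from typing import Callable, Iterator


def iter_listing(fetch_page: Callable[[str], dict], cursor_file: Path) -> Iterator[str]:
    """Yield aweme_id values page by page, checkpointing the cursor to disk."""
    # Resume from the last persisted cursor; "0" as the first cursor is an assumption.
    cursor = cursor_file.read_text().strip() if cursor_file.exists() else "0"
    while True:
        page = fetch_page(cursor)
        for video in page.get("videos", []):
            yield video["aweme_id"]
        # Persist only after the whole page was consumed, so a crash mid-page
        # replays that page on restart (duplicates are absorbed by the queue).
        cursor = str(page.get("cursor", cursor))
        cursor_file.write_text(cursor)
        if not page.get("hasMore"):
            break
```

Because the cursor is written after the page is consumed, delivery is at-least-once: a restart may re-emit ids from the last page, which is harmless as long as the queue deduplicates on aweme_id.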
The single most expensive mistake in bulk archiving is downloading the same video twice. Deduplication has to happen before the credit is spent, not after the file is written. The aweme_id field is globally unique and stable across endpoints, so build an index on it in your queue table:
CREATE TABLE archive_queue (
    aweme_id    TEXT PRIMARY KEY,
    source      TEXT NOT NULL,
    state       TEXT NOT NULL DEFAULT 'pending',
    attempts    INT NOT NULL DEFAULT 0,
    added_at    TIMESTAMPTZ DEFAULT now(),
    resolved_at TIMESTAMPTZ
);
Every listing worker does an upsert with ON CONFLICT (aweme_id) DO NOTHING. The detail worker pulls state = 'pending' rows. This gives you a single deduplicated funnel regardless of which discovery endpoint produced the id.
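To make the dedup funnel concrete, here is a minimal sketch of the enqueue step. It uses the standard-library sqlite3 module as a stand-in for Postgres so it runs anywhere (SQLite 3.24+ accepts the same ON CONFLICT ... DO NOTHING clause); the helper names open_queue and enqueue are hypothetical, and the table is trimmed to the columns the example needs.

```python
import sqlite3


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Create a trimmed-down archive_queue table (SQLite stand-in for Postgres)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS archive_queue (
               aweme_id TEXT PRIMARY KEY,
               source   TEXT NOT NULL,
               state    TEXT NOT NULL DEFAULT 'pending',
               attempts INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def enqueue(conn: sqlite3.Connection, aweme_ids, source: str) -> int:
    """Insert ids, silently skipping known ones; return how many were new."""
    count_sql = "SELECT COUNT(*) FROM archive_queue"
    before = conn.execute(count_sql).fetchone()[0]
    conn.executemany(
        "INSERT INTO archive_queue (aweme_id, source) VALUES (?, ?) "
        "ON CONFLICT (aweme_id) DO NOTHING",
        [(aweme_id, source) for aweme_id in aweme_ids],
    )
    conn.commit()
    return conn.execute(count_sql).fetchone()[0] - before
```

The first discovery endpoint to see an id wins: a later listing worker that finds the same video leaves the existing row (and its source) untouched, so no second credit is spent on it.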
Once you have a queue of aweme_ids, the detail worker calls /post-detail/ for each one. Prefer hdplay when present, fall back to play. Both are no-watermark. The wmplay URL is only useful if you specifically need the TikTok watermark for attribution screenshots.
import requests

API = "https://api.tikliveapi.com"
HEADERS = {"X-Api-Key": "YOUR_API_KEY"}

def resolve_video(tiktok_url: str) -> dict:
    """Call /post-detail/ and keep only the fields the archive persists."""
    r = requests.get(
        f"{API}/post-detail/",
        headers=HEADERS,
        params={"url": tiktok_url},
        timeout=20,
    )
    r.raise_for_status()
    d = r.json()
    return {
        "aweme_id": d["aweme_id"],
        # Prefer the HD no-watermark URL, fall back to standard quality.
        "download_url": d.get("hdplay") or d["play"],
        "author_id": d["author"]["id"],
        "author_handle": d["author"]["unique_id"],
        "create_time": d["create_time"],
        "duration": d["duration"],
        "region": d.get("region"),
        "play_count": d.get("play_count", 0),
        "digg_count": d.get("digg_count", 0),
        "comment_count": d.get("comment_count", 0),
        "share_count": d.get("share_count", 0),
        "title": d.get("title", ""),
        # music_info may be missing or null, so guard before reading its id.
        "music_id": (d.get("music_info") or {}).get("id"),
        "cover": d.get("cover"),
    }
TikTok videos are short, but at scale you cannot afford to buffer them in memory or write partial files that survive a crash. Use chunked HTTP with a temp-file-then-rename pattern so the final filename only exists if the download completed:
import os
import tempfile
from pathlib import Path

import requests

def download_to_disk(url: str, dest: Path, chunk_size: int = 1 << 16) -> int:
    """Stream url to dest via a temp file; return the number of bytes written."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to dest so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(
        prefix=dest.name + ".", suffix=".part", dir=dest.parent
    )
    bytes_written = 0
    try:
        with os.fdopen(fd, "wb") as out, requests.get(
            url, stream=True, timeout=60
        ) as r:
            r.raise_for_status()
            for chunk in r.iter_content(chunk_size=chunk_size):
                if chunk:
                    out.write(chunk)
                    bytes_written += len(chunk)
        # Rename only after the file is closed and the download completed.
        os.replace(tmp_path, dest)
        return bytes_written
    except Exception:
        # Remove the partial file, then let the caller see the original error.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
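The os.replace call is atomic on POSIX as long as the temp file and the destination live on the same filesystem, which is why the temp file is created in dest.parent. On Windows it also overwrites the destination in a single call, although the atomicity guarantee is weaker there. Either way, a half-written MP4 never appears under its final name, so downstream consumers can safely list the directory as long as they ignore .part files.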
Local disk is fine for the first 100 GB. Past that, push every completed file straight to object storage (S3, Cloudflare R2, Backblaze B2). The key layout matters because it determines listing performance and lifecycle costs.
The pattern that has held up best:
s3://my-archive/
  videos/
    dt=2026-05-29/
      author=alice123/
        7387261234567890123.mp4
        7387261234567890123.json
      author=bob/
        7387261198765432109.mp4
        7387261198765432109.json
  covers/
    dt=2026-05-29/
      7387261234567890123.jpg
Key choices: partition by ingest date (not create_time) so reprocessing is bounded; key on aweme_id so every object is idempotent and re-uploads are no-ops with an If-None-Match: * precondition; co-locate the JSON metadata next to the MP4 so a single LIST call returns everything needed to rehydrate a row.
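A small helper keeps every worker building keys the same way. This is a sketch of the layout shown above; the function name object_keys and the returned dictionary shape are illustrative, not part of any SDK.

```python
from datetime import date


def object_keys(aweme_id: str, author_handle: str, ingest_date: date) -> dict:
    """Build the object-storage keys for one video, partitioned by ingest date."""
    dt = ingest_date.isoformat()  # e.g. "2026-05-29"
    base = f"videos/dt={dt}/author={author_handle}/{aweme_id}"
    return {
        "video": f"{base}.mp4",
        "metadata": f"{base}.json",  # co-located with the MP4
        "cover": f"covers/dt={dt}/{aweme_id}.jpg",
    }
```

Passing the ingest date explicitly (rather than calling date.today() inside the helper) keeps reprocessing deterministic: a rerun for a given day writes to exactly the same keys.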
Object storage holds the bytes; a database holds the searchable index. Postgres is the default choice. DuckDB works well if your workload is read-mostly analytics over Parquet exports. The minimal schema mirrors the /post-detail/ response:
CREATE TABLE posts (
    aweme_id      TEXT PRIMARY KEY,
    author_id     TEXT NOT NULL,
    author_handle TEXT NOT NULL,
    title         TEXT,
    region        TEXT,
    duration      INT,
    create_time   BIGINT,
    play_count    BIGINT,
    digg_count    BIGINT,
    comment_count BIGINT,
    share_count   BIGINT,
    music_id      TEXT,
    storage_key   TEXT NOT NULL,
    bytes         BIGINT NOT NULL,
    archived_at   TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX posts_author_idx ON posts(author_id);
CREATE INDEX posts_music_idx ON posts(music_id);
CREATE INDEX posts_create_idx ON posts(create_time);
Keep counter snapshots in a separate post_metrics_history table if you care about virality curves. Counters drift over time, so a row in posts reflects only the values at archive time.
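One possible shape for that history table, sketched with sqlite3 as a local stand-in for Postgres. Only the table name post_metrics_history comes from the text above; the column set, the captured_at unit (unix seconds), and the helper names are assumptions made for illustration.

```python
import sqlite3


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Create an append-only table with one row per (video, capture time)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS post_metrics_history (
               aweme_id      TEXT NOT NULL,
               captured_at   INTEGER NOT NULL,  -- unix seconds (assumed unit)
               play_count    INTEGER,
               digg_count    INTEGER,
               comment_count INTEGER,
               share_count   INTEGER,
               PRIMARY KEY (aweme_id, captured_at)
           )"""
    )
    return conn


def record_snapshot(conn: sqlite3.Connection, detail: dict, captured_at: int) -> None:
    """Append the counters from one /post-detail/ response."""
    conn.execute(
        "INSERT OR IGNORE INTO post_metrics_history VALUES (?, ?, ?, ?, ?, ?)",
        (
            detail["aweme_id"],
            captured_at,
            detail.get("play_count", 0),
            detail.get("digg_count", 0),
            detail.get("comment_count", 0),
            detail.get("share_count", 0),
        ),
    )
    conn.commit()


def play_count_curve(conn: sqlite3.Connection, aweme_id: str) -> list:
    """Return (captured_at, play_count) pairs in time order."""
    return conn.execute(
        "SELECT captured_at, play_count FROM post_metrics_history "
        "WHERE aweme_id = ? ORDER BY captured_at",
        (aweme_id,),
    ).fetchall()
```

The composite primary key makes a replayed snapshot a no-op, so the history stays consistent even if a re-resolve job runs twice for the same capture time.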
The two bottlenecks are different. API calls are limited by your credit and rate budget (200 requests per minute on standard plans), while CDN downloads are limited by bandwidth and per-host concurrency. Use a single asyncio event loop with two semaphores so each stage is independently tunable.
import asyncio
from pathlib import Path

import aiohttp

API = "https://api.tikliveapi.com"
HEADERS = {"X-Api-Key": "YOUR_API_KEY"}

api_sem = asyncio.Semaphore(8)   # cap in-flight API calls; tune to your plan
cdn_sem = asyncio.Semaphore(20)  # ~20 parallel video downloads

async def fetch_json(session, path, params):
    async with api_sem:
        async with session.get(
            f"{API}{path}", headers=HEADERS, params=params,
            timeout=aiohttp.ClientTimeout(total=20),
        ) as r:
            r.raise_for_status()
            return await r.json()

async def stream_to_file(session, url, dest: Path):
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_suffix(dest.suffix + ".part")
    async with cdn_sem:
        async with session.get(
            url, timeout=aiohttp.ClientTimeout(total=120)
        ) as r:
            r.raise_for_status()
            with open(tmp, "wb") as f:
                async for chunk in r.content.iter_chunked(1 << 16):
                    f.write(chunk)
    # Rename only after the download finished and the file is closed.
    tmp.replace(dest)

async def archive_one(session, tiktok_url: str, root: Path):
    detail = await fetch_json(session, "/post-detail/", {"url": tiktok_url})
    aid = detail["aweme_id"]
    dl = detail.get("hdplay") or detail["play"]
    dest = root / detail["author"]["unique_id"] / f"{aid}.mp4"
    if dest.exists():
        return aid, "skip"
    await stream_to_file(session, dl, dest)
    return aid, "ok"

async def main(urls, root):
    async with aiohttp.ClientSession() as session:
        tasks = [archive_one(session, u, root) for u in urls]
        # Failures come back as exception objects instead of cancelling the batch.
        return await asyncio.gather(*tasks, return_exceptions=True)
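The semaphores cap the number of in-flight requests per stage: the API side stays conservative, while the CDN side can scale higher because downloads are I/O bound, not credit bound. Keep in mind that a semaphore bounds concurrency, not requests per minute, so size api_sem with your plan's rate limit and typical response latency in mind.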
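If you want a hard requests-per-minute ceiling on top of the concurrency cap, one option is a small pacing helper awaited before each API call. The class below is a minimal sketch of a minimum-interval limiter, not something the API provides; the name MinIntervalLimiter is hypothetical, and the 200-per-minute figure in the usage note is taken from the standard-plan limit quoted earlier.

```python
import asyncio
import time


class MinIntervalLimiter:
    """Space calls evenly so at most `rate` of them start per `per` seconds."""

    def __init__(self, rate: int, per: float = 60.0):
        self._interval = per / rate
        self._lock = asyncio.Lock()
        self._next_slot = 0.0  # monotonic time at which the next call may start

    async def wait(self) -> None:
        # The lock serializes callers, so each one claims its own slot.
        async with self._lock:
            now = time.monotonic()
            delay = self._next_slot - now
            if delay > 0:
                await asyncio.sleep(delay)
                now = time.monotonic()
            self._next_slot = max(now, self._next_slot) + self._interval
```

With limiter = MinIntervalLimiter(rate=200, per=60.0) created inside the running event loop, calling await limiter.wait() at the top of fetch_json spaces API requests roughly 0.3 seconds apart regardless of how many tasks are queued.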
Three error classes need explicit handling, not generic retries:
aweme_id twice; you only burn one extra credit.
/post-detail/): mark the row state = 'dead' and write the reason to a dead-letter table. Do not retry, do not waste credits.
.part and atomically rename, a crashed worker leaves orphan .part files. A janitor task that deletes any .part older than 1 hour is sufficient. Do not try to resume a partial download with HTTP Range; TikTok CDN URLs expire faster than your retry window.
Combine with an idempotency rule on the metadata write: INSERT ... ON CONFLICT (aweme_id) DO UPDATE SET play_count = EXCLUDED.play_count, archived_at = now(). Re-running the pipeline never duplicates rows.
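The janitor mentioned above can be as small as the sketch below. It assumes orphan files are identified purely by the .part suffix and by modification time; the function name sweep_orphan_parts is hypothetical, and the one-hour default mirrors the threshold suggested in the text.

```python
import time
from pathlib import Path


def sweep_orphan_parts(root: Path, max_age_seconds: float = 3600.0) -> list:
    """Delete *.part files under root older than max_age_seconds; return them."""
    cutoff = time.time() - max_age_seconds
    removed = []
    for part in root.rglob("*.part"):
        try:
            if part.is_file() and part.stat().st_mtime < cutoff:
                part.unlink()
                removed.append(part)
        except FileNotFoundError:
            # A worker finished or cleaned up this file while we were scanning.
            continue
    return removed
```

The age threshold matters: a .part file that is younger than the cutoff may belong to a download still in flight, so the sweep leaves it alone.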
Three line items dominate the budget. For a 10,000-video-per-day archive:
/post-detail/ plus listing overhead. Discovery is roughly 1 credit per 30 listed items via /user-posts/ at maximum page size. 10,000 detail calls plus ~400 listing calls is ~10,400 credits per day. Pay-as-you-go with no monthly lock-in, see pricing.
The dominant cost is almost always storage past the 6-month mark, not credits. Plan your retention policy before you start.
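The credit arithmetic above is easy to reproduce for your own volume. The sketch below assumes one credit per /post-detail/ call and one credit per listing page of up to 30 ids, as stated in the text; the function names are illustrative. Note that the strict minimum for 10,000 ids is 334 listing calls, so the ~400 figure presumably leaves headroom for pages that return already-known ids (that reading is an assumption).

```python
import math


def min_listing_calls(ids_needed: int, page_size: int = 30) -> int:
    """Fewest listing calls that can surface ids_needed ids at full page size."""
    return math.ceil(ids_needed / page_size)


def daily_credits(videos_per_day: int, listing_calls: int) -> int:
    """One credit per detail call plus one credit per listing call."""
    return videos_per_day + listing_calls


floor = min_listing_calls(10_000)      # 334 calls if every page is full and new
budget = daily_credits(10_000, 400)    # 10,400 credits with ~400 listing calls
print(floor, budget)
```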
Region-locked videos. A video may be visible in one country and 404 in another. The CDN URL works from anywhere once you have it, but discovery via /search-video/ or /challenge-posts/ can be region-biased. Always pass the region parameter explicitly using a code from /region-list/ if you want reproducible results.
Deleted videos. By the time your detail worker runs, the video may already be gone. The window between discovery and resolution is the riskiest. Keep it short. A common pattern: detail-resolve within 60 seconds of the listing call, not 24 hours later.
Private accounts. If an account flips private between discovery and resolution, /user-posts/ stops returning data and /post-detail/ returns an error. Mark as dead and move on.
Expiring CDN URLs. This is the single biggest gotcha. The play and hdplay URLs returned by /post-detail/ are signed and expire fast, sometimes within minutes. Download immediately. Do not store the URL and download it tomorrow. If you have to queue, queue the aweme_id and re-resolve at download time.
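The "queue the id, re-resolve at download time" rule can be sketched as a small wrapper. Both callables are placeholders you supply: resolve is assumed to call /post-detail/ and return a fresh signed URL, and download is assumed to fetch that URL and return the byte count. Catching every exception from download is a simplification for the sketch; a real worker would separate permanent failures from expired URLs.

```python
from typing import Callable


def download_with_reresolve(
    aweme_id: str,
    resolve: Callable[[str], str],
    download: Callable[[str], int],
    max_attempts: int = 3,
) -> int:
    """Fetch a fresh signed URL for every attempt instead of reusing a stale one."""
    last_error = None
    for _ in range(max_attempts):
        url = resolve(aweme_id)  # each call spends one credit
        try:
            return download(url)
        except Exception as exc:  # simplification: treat any failure as retryable
            last_error = exc
    raise RuntimeError(
        f"giving up on {aweme_id} after {max_attempts} attempts"
    ) from last_error
```

Errors raised by resolve itself are not retried and propagate to the caller, which matches the advice to mark unresolvable videos as dead rather than spend more credits on them.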
Counter staleness. The play_count in your database reflects the moment of capture. If you need accurate engagement curves, periodically re-resolve videos you care about and append to a history table.
None of the patterns above grant you a license. They give you a way to capture public content reliably. The legal posture for archival generally splits along three lines:
author.unique_id and the original TikTok URL alongside every file. If you ever surface the content publicly, link back.
TikLiveAPI itself returns publicly available data and does not require any TikTok credentials. The compliance burden of how you store, share, or analyze that data is on the operator of the archive.
A production archive pipeline ends up as five processes: a listing worker that fans out across /user-posts/, /challenge-posts/, /search-video/, and /music-posts/ and writes aweme_ids to a queue; a detail worker that pulls from the queue and calls /post-detail/ under an API semaphore; a download worker that streams the chunked HTTP response to disk and uploads to object storage; a metadata writer that upserts into Postgres; and a janitor that sweeps orphan .part files and ages out the dead-letter table.
Try the endpoint shapes in the playground, browse the full endpoint reference in the documentation, and start with the 100 free credits you get on email verification. If you have questions about rate limit increases for bulk workloads, the contact page goes directly to support.
The play and hdplay URLs returned by /post-detail/ are TikTok-signed CDN URLs that expire within minutes. Treat them as one-shot tokens: download immediately or re-resolve. Never store the URL itself for later use. Store the aweme_id instead and re-resolve when you need the bytes.
The play field is the standard-quality MP4 with no watermark. The wmplay field is the same video with the TikTok watermark burned in. The hdplay field is the HD no-watermark version when available. For archive use cases prefer hdplay with a fallback to play.
Deduplicate at the queue level on aweme_id before the detail worker fires. Use INSERT ... ON CONFLICT (aweme_id) DO NOTHING when ingesting from listing endpoints. Detail resolution is the credit cost; listing is much cheaper because pages return up to 30 ids per credit.
Yes. Each response includes cursor (string ms timestamp) and hasMore (bool). Loop while hasMore is true, passing the previous response's cursor. Persist the cursor after every page so a crashed worker resumes exactly where it stopped.
No. The API fetches metadata and returns no-watermark URLs on demand; it does not warehouse video files and does not log your archive contents. The storage layer (S3, R2, B2, or local disk) is entirely on your side, and that is the right place for it because retention policy is a downstream decision tied to your legal posture.