Bulk TikTok Video Archive: Production Patterns for Scale

By TikLiveAPI Team · Published on May 29, 2026

Why Bulk TikTok Archiving Has Unique Failure Modes

Archiving TikTok at scale is not the same problem as archiving YouTube, podcasts, or news articles. The videos are short, the URLs are signed and expire quickly, the catalog is constantly being deleted by creators and moderators, and the platform aggressively region-locks content. For journalists tracking misinformation, researchers building academic datasets, brand teams preserving campaign assets, or content tool builders feeding downstream pipelines, the gap between "I can download one video" and "I can reliably capture 50,000 videos a day" is enormous.

The naive approach (one worker, sequential downloads, no metadata layer) breaks within hours. CDN URLs go stale. Duplicate downloads pile up. A single deleted account silently drops thousands of items from your dataset with no error trail. This guide walks through the production patterns we have seen work for high-volume archiving against the TikLiveAPI endpoints, including the specific JSON shapes, the storage layout, the concurrency model, and the legal posture.

What Data You Actually Get Per Video

Every archive pipeline ultimately resolves to one endpoint: /post-detail/. Given a TikTok URL, it returns a snake_case JSON object, flat apart from the nested author and music_info objects, containing everything you need to persist a video plus its metadata. The relevant fields:

aweme_id - the canonical TikTok video id, your primary key
play - no-watermark MP4 URL (standard quality)
wmplay - watermarked MP4 URL
hdplay - HD no-watermark MP4 URL when available
cover, origin_cover, ai_dynamic_cover - thumbnails
music and music_info - audio track URL and metadata
author - object with id, unique_id, nickname, avatar
play_count, digg_count, comment_count, share_count, download_count, collect_count - engagement counters
create_time, duration, region, title - basic post metadata

Each /post-detail/ call costs one credit per video. If you only need the file (no extra metadata), /download-video/ returns a smaller payload with just video and video_hd URLs. For most archive pipelines, /post-detail/ is the right choice because you want to persist the counters and author alongside the file.

Step 1: Collecting URLs

Before you can resolve videos you need a list of TikTok URLs (or aweme_id values). TikLiveAPI gives you four practical sources, each with the same paginated videos[] / cursor / hasMore envelope:

/user-posts/ - all public posts by a user id. Pagination via cursor (string ms timestamp), continue while hasMore is true.
/challenge-posts/ - posts for a hashtag/challenge id, optionally filtered by region.
/search-video/ - keyword search with publish_time (0/1/7/30/90/180 day windows) and sort_by (0 relevance, 1 likes, 2 date).
/music-posts/ - all videos using a given music_id, useful for tracking sound trends.

Pull the maximum count per page, persist the cursor after every page, and resume from the last cursor on restart. Treat the listing endpoints as your "discovery" tier and persist the raw aweme_id values to a queue table before you spend credits on detail resolution.
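
The resume-from-cursor loop can be sketched as a generator. This is a minimal sketch, not the official client: fetch_page, load_cursor, and save_cursor are hypothetical callables you would supply (an HTTP call to a listing endpoint and two small storage helpers), and the page shape is assumed from the videos[] / cursor / hasMore envelope described above.

```python
from typing import Callable, Iterator, Optional

# Assumed page shape, based on the envelope described above:
# {"videos": [{"aweme_id": "..."}], "cursor": "1716900000000", "hasMore": True}


def iter_aweme_ids(
    fetch_page: Callable[[Optional[str]], dict],
    load_cursor: Callable[[], Optional[str]],
    save_cursor: Callable[[str], None],
) -> Iterator[str]:
    """Yield aweme_ids page by page, checkpointing the cursor after each page."""
    cursor = load_cursor()  # None on a fresh run, the last checkpoint on restart
    while True:
        page = fetch_page(cursor)
        for video in page.get("videos", []):
            yield video["aweme_id"]
        cursor = page.get("cursor")
        if cursor is not None:
            # Reached only after the consumer has pulled every id on this page.
            save_cursor(cursor)
        if not page.get("hasMore") or cursor is None:
            break
```

Because the checkpoint is written only after a page has been fully consumed, a crash mid-page replays that page on restart; the queue-level deduplication absorbs the repeats.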

Step 2: Deduplicate Before You Download

The single most expensive mistake in bulk archiving is downloading the same video twice. Deduplication has to happen before the credit is spent, not after the file is written. The aweme_id field is globally unique and stable across endpoints, so make it the primary key of your queue table:

CREATE TABLE archive_queue (
  aweme_id   TEXT PRIMARY KEY,
  source     TEXT NOT NULL,
  state      TEXT NOT NULL DEFAULT 'pending',
  attempts   INT  NOT NULL DEFAULT 0,
  added_at   TIMESTAMPTZ DEFAULT now(),
  resolved_at TIMESTAMPTZ
);

Every listing worker inserts with ON CONFLICT (aweme_id) DO NOTHING, so an id that is already queued is silently skipped. The detail worker pulls state = 'pending' rows. This gives you a single deduplicated funnel regardless of which discovery endpoint produced the id.
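
A minimal sketch of that funnel, using Python's built-in sqlite3 as a stand-in for Postgres so it runs without a server (SQLite 3.24 or newer accepts the same ON CONFLICT clause). The helper names make_queue and enqueue are illustrative, not part of any API.

```python
import sqlite3


def make_queue() -> sqlite3.Connection:
    """Create an in-memory queue table mirroring the schema above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE archive_queue (
             aweme_id TEXT PRIMARY KEY,
             source   TEXT NOT NULL,
             state    TEXT NOT NULL DEFAULT 'pending',
             attempts INT  NOT NULL DEFAULT 0
           )"""
    )
    return conn


def enqueue(conn: sqlite3.Connection, aweme_ids, source: str) -> int:
    """Insert unseen ids as 'pending' and return how many were new."""
    count = "SELECT COUNT(*) FROM archive_queue"
    before = conn.execute(count).fetchone()[0]
    conn.executemany(
        "INSERT INTO archive_queue (aweme_id, source) VALUES (?, ?) "
        "ON CONFLICT (aweme_id) DO NOTHING",
        [(aweme_id, source) for aweme_id in aweme_ids],
    )
    conn.commit()
    return conn.execute(count).fetchone()[0] - before
```

An id discovered first through /user-posts/ and later through /search-video/ is queued once and keeps its original source value.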

Step 3: Resolve the No-Watermark URL

Once you have a queue of aweme_ids, the detail worker calls /post-detail/ for each one. Prefer hdplay when present, fall back to play. Both are no-watermark. The wmplay URL is only useful if you specifically need the TikTok watermark for attribution screenshots. If you only need the file-fetch flow without the archive scaffolding, the guide on downloading TikTok videos without watermark via the API covers it in isolation.

import requests

API = "https://api.tikliveapi.com"
HEADERS = {"X-Api-Key": "YOUR_API_KEY"}

def resolve_video(tiktok_url: str) -> dict:
    r = requests.get(
        f"{API}/post-detail/",
        headers=HEADERS,
        params={"url": tiktok_url},
        timeout=20,
    )
    r.raise_for_status()
    d = r.json()
    return {
        "aweme_id": d["aweme_id"],
        "download_url": d.get("hdplay") or d["play"],
        "author_id": d["author"]["id"],
        "author_handle": d["author"]["unique_id"],
        "create_time": d["create_time"],
        "duration": d["duration"],
        "region": d.get("region"),
        "play_count": d.get("play_count", 0),
        "digg_count": d.get("digg_count", 0),
        "comment_count": d.get("comment_count", 0),
        "share_count": d.get("share_count", 0),
        "title": d.get("title", ""),
        "music_id": (d.get("music_info") or {}).get("id"),
        "cover": d.get("cover"),
    }

Step 4: Streaming Download

TikTok videos are short, but at scale you cannot afford to buffer them in memory or write partial files that survive a crash. Use chunked HTTP with a temp-file-then-rename pattern so the final filename only exists if the download completed:

import os, tempfile, requests
from pathlib import Path

def download_to_disk(url: str, dest: Path, chunk_size: int = 1 << 16) -> int:
    dest.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(
        prefix=dest.name + ".", suffix=".part", dir=dest.parent
    )
    bytes_written = 0
    try:
        with os.fdopen(fd, "wb") as out, requests.get(
            url, stream=True, timeout=60
        ) as r:
            r.raise_for_status()
            for chunk in r.iter_content(chunk_size=chunk_size):
                if chunk:
                    out.write(chunk)
                    bytes_written += len(chunk)
        os.replace(tmp_path, dest)
        return bytes_written
    except Exception:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise

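The os.replace call is atomic on POSIX, and in practice on NTFS, as long as the temporary file and the destination are on the same filesystem, which is why the temp file is created in dest.parent. Your archive directory therefore never contains a half-written MP4 under its final name, and downstream consumers can list the directory without race conditions.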

Step 5: Object Storage Layout

Local disk is fine for the first 100 GB. Past that, push every completed file straight to object storage (S3, Cloudflare R2, Backblaze B2). The key layout matters because it determines listing performance and lifecycle costs.

The pattern that has held up best:

s3://my-archive/
  videos/
    dt=2026-05-29/
      author=alice123/
        7387261234567890123.mp4
        7387261234567890123.json
      author=bob/
        7387261198765432109.mp4
        7387261198765432109.json
  covers/
    dt=2026-05-29/
      7387261234567890123.jpg

Key choices: partition by ingest date (not create_time) so reprocessing is bounded; key on aweme_id so every object is idempotent and re-uploads are no-ops with an If-None-Match: * precondition; co-locate the JSON metadata next to the MP4 so a single LIST call returns everything needed to rehydrate a row.
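
A small helper can keep every worker on the same key scheme. This is a sketch of the layout shown above; the function name storage_keys and the returned dictionary shape are illustrative.

```python
from datetime import date


def storage_keys(aweme_id: str, author_handle: str, ingest_date: date) -> dict:
    """Build object keys partitioned by ingest date and keyed on aweme_id."""
    day = ingest_date.isoformat()  # e.g. 2026-05-29
    base = f"videos/dt={day}/author={author_handle}/{aweme_id}"
    return {
        "video": f"{base}.mp4",
        "metadata": f"{base}.json",  # co-located with the MP4
        "cover": f"covers/dt={day}/{aweme_id}.jpg",
    }
```

Calling it with the ingest date rather than create_time keeps a day's partition closed once that day's run finishes.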

Step 6: Metadata Persistence

Object storage holds the bytes; a database holds the searchable index. Postgres is the default choice. DuckDB works well if your workload is read-mostly analytics over Parquet exports. The minimal schema mirrors the /post-detail/ response:

CREATE TABLE posts (
  aweme_id        TEXT PRIMARY KEY,
  author_id       TEXT NOT NULL,
  author_handle   TEXT NOT NULL,
  title           TEXT,
  region          TEXT,
  duration        INT,
  create_time     BIGINT,
  play_count      BIGINT,
  digg_count      BIGINT,
  comment_count   BIGINT,
  share_count     BIGINT,
  music_id        TEXT,
  storage_key     TEXT NOT NULL,
  bytes           BIGINT NOT NULL,
  archived_at     TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX posts_author_idx ON posts(author_id);
CREATE INDEX posts_music_idx  ON posts(music_id);
CREATE INDEX posts_create_idx ON posts(create_time);

Keep counter snapshots as a separate post_metrics_history table if you care about virality curves. Counters drift over time, so a single row in posts only captures the moment of capture.
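
One possible shape for that history table, sketched with sqlite3 so it runs standalone. The column names and the record_snapshot and play_count_curve helpers are assumptions for illustration; in Postgres you would likely use TIMESTAMPTZ and BIGINT instead.

```python
import sqlite3
import time

HISTORY_SCHEMA = """
CREATE TABLE IF NOT EXISTS post_metrics_history (
  aweme_id      TEXT    NOT NULL,
  captured_at   INTEGER NOT NULL,   -- Unix seconds at snapshot time
  play_count    INTEGER,
  digg_count    INTEGER,
  comment_count INTEGER,
  share_count   INTEGER,
  PRIMARY KEY (aweme_id, captured_at)
)
"""


def record_snapshot(conn, detail: dict, captured_at=None) -> None:
    """Append one counter snapshot taken from a /post-detail/ style dict."""
    ts = int(time.time()) if captured_at is None else captured_at
    conn.execute(
        "INSERT INTO post_metrics_history VALUES (?, ?, ?, ?, ?, ?)",
        (
            detail["aweme_id"],
            ts,
            detail.get("play_count", 0),
            detail.get("digg_count", 0),
            detail.get("comment_count", 0),
            detail.get("share_count", 0),
        ),
    )
    conn.commit()


def play_count_curve(conn, aweme_id: str) -> list:
    """Return (captured_at, play_count) pairs in time order."""
    return conn.execute(
        "SELECT captured_at, play_count FROM post_metrics_history "
        "WHERE aweme_id = ? ORDER BY captured_at",
        (aweme_id,),
    ).fetchall()
```

The posts row stays a single current record per video, while each re-resolve appends a row here.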

Step 7: Async Parallelism

The two bottlenecks are different: the API is limited by your credit and rate budget (200 requests per minute on standard plans), while CDN downloads are limited by bandwidth and by how many concurrent connections the target host tolerates. Use a single asyncio event loop with two semaphores so each stage is independently tunable.

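import asyncio, aiohttp
from pathlib import Path

API = "https://api.tikliveapi.com"
HEADERS = {"X-Api-Key": "YOUR_API_KEY"}

# Semaphores bound in-flight requests, not requests per minute.
# Module-level creation is fine on Python 3.10+; on older versions
# create them inside main() so they bind to the running loop.
api_sem = asyncio.Semaphore(8)     # at most 8 concurrent API calls
cdn_sem = asyncio.Semaphore(20)    # ~20 parallel video downloads

# aiohttp expects a ClientTimeout object rather than a bare number.
API_TIMEOUT = aiohttp.ClientTimeout(total=20)
CDN_TIMEOUT = aiohttp.ClientTimeout(total=120)

async def fetch_json(session, path, params):
    async with api_sem:
        async with session.get(
            f"{API}{path}", headers=HEADERS, params=params, timeout=API_TIMEOUT
        ) as r:
            r.raise_for_status()
            return await r.json()

async def stream_to_file(session, url, dest: Path):
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_suffix(dest.suffix + ".part")
    async with cdn_sem:
        async with session.get(url, timeout=CDN_TIMEOUT) as r:
            r.raise_for_status()
            # A failure here leaves the .part file for the janitor to sweep.
            with open(tmp, "wb") as f:
                async for chunk in r.content.iter_chunked(1 << 16):
                    f.write(chunk)
    tmp.replace(dest)  # the final name appears only after a complete download

async def archive_one(session, tiktok_url: str, root: Path):
    detail = await fetch_json(session, "/post-detail/", {"url": tiktok_url})
    aid = detail["aweme_id"]
    dl = detail.get("hdplay") or detail["play"]
    dest = root / detail["author"]["unique_id"] / f"{aid}.mp4"
    # Safety net only: the credit is already spent at this point, so
    # queue-level dedup should have filtered this id earlier.
    if dest.exists():
        return aid, "skip"
    await stream_to_file(session, dl, dest)
    return aid, "ok"

async def main(urls, root):
    async with aiohttp.ClientSession() as session:
        tasks = [archive_one(session, u, root) for u in urls]
        # return_exceptions=True keeps one failed video from cancelling the batch
        return await asyncio.gather(*tasks, return_exceptions=True)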

The semaphores cap how many API calls are in flight at once while letting the CDN side scale higher, because downloads are I/O bound, not credit bound. For backoff strategies and budget-aware schedulers beyond simple semaphores, see the post on building resilient pipelines around API rate limits.
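
Because a semaphore limits how many requests are in flight rather than how many start per minute, fast responses can still push a small pool past a per-minute budget. A pacing helper placed in front of the API call is one way to close that gap. The RateLimiter class and the paced wrapper below are hypothetical names for a minimal sketch, and the 200-per-minute figure is the one quoted above; check your own plan.

```python
import asyncio
import time


class RateLimiter:
    """Space calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute: float, clock=time.monotonic, sleep=asyncio.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_slot = 0.0  # earliest time the next call may start

    async def wait(self) -> float:
        # No await between reading and updating _next_slot, so tasks on
        # one event loop each reserve a distinct slot.
        now = self._clock()
        slot = max(now, self._next_slot)
        self._next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            await self._sleep(delay)
        return delay


api_limiter = RateLimiter(per_minute=200)


async def paced(coro_fn, *args):
    """Wait for a rate-limit slot, then run the wrapped coroutine function."""
    await api_limiter.wait()
    return await coro_fn(*args)
```

With this in place, a detail call would read `await paced(fetch_json, session, "/post-detail/", {"url": tiktok_url})`, with the semaphore still bounding concurrency inside fetch_json.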

Step 8: Error Recovery

Three error classes need explicit handling, not generic retries:

Transient HTTP errors (timeouts, 5xx from the CDN): retry with exponential backoff up to 3 attempts. The detail call is idempotent: repeating it returns the same aweme_id, so a retry costs one extra credit and nothing else.
Permanent video errors (404 from the CDN, "video unavailable" from /post-detail/): mark the row state = 'dead' and write the reason to a dead-letter table. Do not retry, do not waste credits.
Partial downloads: because we write to .part and atomically rename, a crashed worker leaves orphan .part files. A janitor task that deletes any .part older than 1 hour is sufficient. Do not try to resume a partial download with HTTP Range; TikTok CDN URLs expire faster than your retry window.
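
The three classes above could be handled along these lines. This is a sketch under stated assumptions: TransientError and PermanentError are hypothetical exception types that your own HTTP layer would raise after inspecting status codes, and with_retries and sweep_orphans are illustrative helper names.

```python
import time
from pathlib import Path


class TransientError(Exception):
    """Timeouts and 5xx responses: worth retrying."""


class PermanentError(Exception):
    """404 or 'video unavailable': mark the row dead, never retry."""


def with_retries(fn, attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(), retrying TransientError with exponential backoff.

    PermanentError is not caught, so it reaches the caller on the first try.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller dead-letter it
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


def sweep_orphans(root: Path, max_age_seconds: float = 3600.0, now=None) -> int:
    """Delete .part files older than max_age_seconds; return how many were removed."""
    now = time.time() if now is None else now
    removed = 0
    for part in root.rglob("*.part"):
        try:
            if now - part.stat().st_mtime > max_age_seconds:
                part.unlink()
                removed += 1
        except FileNotFoundError:
            pass  # another janitor or a finishing worker got there first
    return removed
```

The one-hour threshold matches the janitor rule above and is comfortably longer than any single download should take, so an in-progress .part is not swept.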

Combine with an idempotency rule on the metadata write: INSERT ... ON CONFLICT (aweme_id) DO UPDATE SET play_count = EXCLUDED.play_count, archived_at = now(). Re-running the pipeline never duplicates rows.
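
A minimal sketch of that idempotent write, again using sqlite3 as a stand-in for Postgres and a trimmed version of the posts table. The helper names make_posts_db and upsert_post are illustrative.

```python
import sqlite3

UPSERT_SQL = """
INSERT INTO posts (aweme_id, author_id, author_handle, play_count,
                   storage_key, bytes, archived_at)
VALUES (:aweme_id, :author_id, :author_handle, :play_count,
        :storage_key, :bytes, :archived_at)
ON CONFLICT (aweme_id) DO UPDATE SET
  play_count  = excluded.play_count,
  archived_at = excluded.archived_at
"""


def make_posts_db() -> sqlite3.Connection:
    """Create an in-memory, trimmed-down posts table for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE posts (
             aweme_id      TEXT PRIMARY KEY,
             author_id     TEXT NOT NULL,
             author_handle TEXT NOT NULL,
             play_count    INTEGER,
             storage_key   TEXT NOT NULL,
             bytes         INTEGER NOT NULL,
             archived_at   INTEGER
           )"""
    )
    return conn


def upsert_post(conn: sqlite3.Connection, row: dict) -> None:
    """Insert the row, or refresh its counter and timestamp if it already exists."""
    conn.execute(UPSERT_SQL, row)
    conn.commit()
```

Running the same batch twice leaves one row per aweme_id, with only the counter and timestamp refreshed.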

Step 9: Cost Projection

Three line items dominate the budget. For a 10,000-video-per-day archive:

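API credits: 1 credit per /post-detail/ call plus listing overhead. Discovery costs roughly 1 credit per 30 listed items via /user-posts/ at maximum page size, so 10,000 ids need at least 334 listing calls; budgeting ~400 leaves headroom for partial pages and already-seen ids. 10,000 detail calls plus ~400 listing calls is ~10,400 credits per day. Credits are pay-as-you-go with no monthly lock-in; see pricing.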
Bytes: TikTok videos average ~2-4 MB; HD around 6-10 MB. Plan on ~5 MB per video for HD blend, so ~50 GB ingested per day, ~1.5 TB per month.
Storage: ~$0.015/GB/month on R2/B2 means ~$22/month for the first month, growing linearly. Glacier-class tiering for items older than 90 days drops that by ~70%.

The dominant cost is almost always storage past the 6-month mark, not credits. Plan your retention policy before you start.
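
The arithmetic behind those line items is easy to rerun for your own volumes. The sketch below uses the article's figures as defaults (30 ids per listing credit, ~5 MB per video, $0.015 per GB-month); the function name and parameters are illustrative, and it computes the floor on listing calls, before any headroom for partial pages.

```python
import math


def project_costs(
    videos_per_day: int = 10_000,
    ids_per_listing_call: int = 30,
    avg_mb_per_video: float = 5.0,
    storage_usd_per_gb_month: float = 0.015,
    days: int = 30,
) -> dict:
    """Project daily credits and the storage bill after `days` of ingest."""
    listing_calls = math.ceil(videos_per_day / ids_per_listing_call)
    gb_per_day = videos_per_day * avg_mb_per_video / 1000  # decimal GB
    gb_stored = gb_per_day * days
    return {
        "listing_calls_per_day": listing_calls,
        "credits_per_day": videos_per_day + listing_calls,
        "gb_per_day": gb_per_day,
        "gb_stored": gb_stored,
        "storage_usd_per_month": gb_stored * storage_usd_per_gb_month,
    }
```

With the defaults this gives 334 listing calls, 50 GB per day, 1,500 GB after 30 days, and a storage bill of about $22.50 per month at that point. Passing days=180 shows how the bill grows as the archive accumulates.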

Common Failure Modes

Region-locked videos. A video may be visible in one country and 404 in another. The CDN URL works from anywhere once you have it, but discovery via /search-video/ or /challenge-posts/ can be region-biased. Always pass the region parameter explicitly using a code from /region-list/ if you want reproducible results.

Deleted videos. By the time your detail worker runs, the video may already be gone. The window between discovery and resolution is the riskiest. Keep it short. A common pattern: detail-resolve within 60 seconds of the listing call, not 24 hours later.

Private accounts. If an account flips private between discovery and resolution, /user-posts/ stops returning data and /post-detail/ returns an error. Mark as dead and move on.

Expiring CDN URLs. This is the single biggest gotcha. The play and hdplay URLs returned by /post-detail/ are signed and expire fast, sometimes within minutes. Download immediately. Do not store the URL and download it tomorrow. If you have to queue, queue the aweme_id and re-resolve at download time.

Counter staleness. The play_count in your database reflects the moment of capture. If you need accurate engagement curves, periodically re-resolve videos you care about and append to a history table.
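
The expiring-URL rule can be captured in a small wrapper that always resolves immediately before downloading and re-resolves once if the CDN rejects the link. This is a sketch: UrlExpired is a hypothetical exception your download function would raise (for example on a 403 from the CDN), and resolve and download are callables you supply. Each call to resolve costs one credit.

```python
class UrlExpired(Exception):
    """Raised by the download callable when the CDN rejects a stale signed URL."""


def fetch_fresh(aweme_id: str, resolve, download, max_resolves: int = 2):
    """Resolve a fresh URL right before each download attempt.

    resolve(aweme_id) returns a signed URL; download(url) returns the number
    of bytes written or raises UrlExpired. The URL is never stored.
    """
    if max_resolves < 1:
        raise ValueError("max_resolves must be at least 1")
    last_error = None
    for _ in range(max_resolves):
        url = resolve(aweme_id)  # fresh signed URL, one credit per call
        try:
            return download(url)
        except UrlExpired as exc:
            last_error = exc  # stale link: resolve again, never reuse it
    raise last_error
```

The queue holds only aweme_id values, so a job that waits for hours still downloads through a link that is seconds old.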

Legal Posture

None of the patterns above grant you a license. They give you a way to capture public content reliably. The legal posture for archival generally splits along three lines:

Archival of public content is broadly defensible for journalism, academic research, and internal investigative use. Document your purpose, your retention policy, and the public visibility of every captured item.
Redistribution or republication is a different question entirely. The author retains copyright on the video. Reposting to another platform without permission is infringement regardless of how you got the file. Fair use is a defense, not a permission.
Attribution is cheap insurance. Persist author.unique_id and the original TikTok URL alongside every file. If you ever surface the content publicly, link back.

TikLiveAPI itself returns publicly available data and does not require any TikTok credentials. The compliance burden of how you store, share, or analyze that data is on the operator of the archive. For the case law behind that posture, read our developer guide to the legality of scraping TikTok in 2026.

Putting It Together

A production archive pipeline ends up as five processes: a listing worker that fans out across /user-posts/, /challenge-posts/, /search-video/, and /music-posts/ and writes aweme_ids to a queue; a detail worker that pulls from the queue and calls /post-detail/ under an API semaphore; a download worker that streams the chunked HTTP response to disk and uploads to object storage; a metadata writer that upserts into Postgres; and a janitor that sweeps orphan .part files and ages out the dead-letter table.

Try the endpoint shapes in the playground, browse the full endpoint reference in the documentation, and start with the 100 free credits you get on email verification. If you have questions about rate limit increases for bulk workloads, the contact page goes directly to support.

FAQ

Do TikLiveAPI download URLs work forever, or do they expire?

The play and hdplay URLs returned by /post-detail/ are TikTok-signed CDN URLs that expire within minutes. Treat them as one-shot tokens: download immediately or re-resolve. Never store the URL itself for later use; store the aweme_id and re-resolve when you need the bytes.

What is the difference between play, wmplay, and hdplay?

The play field is the standard-quality MP4 with no watermark. The wmplay field is the same video with the TikTok watermark burned in. The hdplay field is the HD no-watermark version when available. For archive use cases prefer hdplay with a fallback to play.

How do I avoid spending credits on videos I already have?

Deduplicate at the queue level on aweme_id before the detail worker fires. Use INSERT ... ON CONFLICT (aweme_id) DO NOTHING when ingesting from listing endpoints. Detail resolution is the credit cost; listing is much cheaper because pages return up to 30 ids per credit.

Can I paginate /user-posts/ until I have every video a user has ever posted?

Yes. Each response includes cursor (string ms timestamp) and hasMore (bool). Loop while hasMore is true, passing the previous response's cursor. Persist the cursor after every page so a crashed worker resumes exactly where it stopped.

Will TikLiveAPI store the videos I archive on its servers?

No. The API fetches metadata and returns no-watermark URLs on demand; it does not warehouse video files and does not log your archive contents. The storage layer (S3, R2, B2, or local disk) is entirely on your side, and that is the right place for it because retention policy is a downstream decision tied to your legal posture.
