Your analytics team already lives in Looker, Tableau, Hex, or Mode. Your marketing dashboards already join Shopify, HubSpot, and Google Ads on a clean date spine. Adding TikTok creator and content data to that warehouse means social performance suddenly speaks the same SQL dialect as every other revenue source, with no separate logins, no copy-pasted CSV exports, and no third tool to govern.
TikLiveAPI returns clean JSON across 37 REST endpoints, which makes warehouse ingestion straightforward. This guide is for analytics engineers and data team leads who want to move from "scripts that hit an API" to a modeled, tested, and queryable TikTok dataset alongside the rest of the business.
We will cover ELT vs ETL trade-offs, raw landing schemas for BigQuery and Snowflake, dbt staging and marts layers, incremental ingestion with cursor pagination, schema and freshness tests, common BI metrics, cost optimization, and privacy considerations.
You have two realistic paths for getting TikTok responses into your warehouse.
The collector script flattens nested objects, renames camelCase to snake_case, and inserts typed columns directly. The schema is narrow and clean. The downside: every new field on the upstream JSON requires a code change in the collector, and once a field is dropped during transform, it's gone. If a stakeholder asks next week for secUid or avatarLarger, you have to backfill.
The collector inserts the raw API response into a single semi-structured column. Schema changes upstream are silently absorbed - the new keys simply appear inside the JSON blob. dbt models read from raw and produce typed staging and marts.
For analytics use cases, ELT wins. Storage is cheap, raw history is invaluable for debugging, and dbt gives you version-controlled transformations that any analyst can read. The rest of this post assumes ELT.
Land one row per API response. Store the parsed payload in the semi-structured column for the warehouse, plus a few "envelope" columns you want to index and partition on.
CREATE TABLE raw.tiktok_api_responses (
ingested_at TIMESTAMP NOT NULL,
endpoint STRING NOT NULL,
request_param STRING,
http_status INT64,
payload JSON,
request_id STRING
)
PARTITION BY DATE(ingested_at)
CLUSTER BY endpoint, request_param;
BigQuery's native JSON type lets you index into payloads with dot notation and supports schema evolution without DDL changes.
CREATE TABLE raw.tiktok_api_responses (
ingested_at TIMESTAMP_NTZ NOT NULL,
endpoint STRING NOT NULL,
request_param STRING,
http_status NUMBER,
payload VARIANT,
request_id STRING
)
CLUSTER BY (TO_DATE(ingested_at), endpoint);
Snowflake's VARIANT handles arbitrary JSON, with FLATTEN for array unnesting and the : accessor for nested paths.
The collector script calls TikLiveAPI with X-Api-Key auth and inserts one row per response. You can find your key on the profile page after registering, and see all 37 endpoints in the documentation.
# pseudo-Python ingestion loop
import requests, json, datetime
headers = {"X-Api-Key": os.environ["TIKLIVE_KEY"]}
resp = requests.get(
"https://api.tikliveapi.com/userinfo-by-username/",
params={"username": handle},
headers=headers,
timeout=30,
)
warehouse.insert("raw.tiktok_api_responses", {
"ingested_at": datetime.datetime.utcnow(),
"endpoint": "/userinfo-by-username/",
"request_param": handle,
"http_status": resp.status_code,
"payload": resp.text, # raw JSON string
"request_id": str(uuid.uuid4()),
})
A clean three-layer dbt project keeps responsibilities separated.
stg_*) - one model per endpoint, parses raw JSON into typed columns, renames to snake_case, drops nothing.int_*) - joins, deduplication, windowing.dim_*, fact_*) - the tables BI tools actually query.For TikTok, the marts that earn their keep are:
stg_tiktok_users - parsed creator profile snapshotsstg_tiktok_posts - parsed video rows from /user-posts/ and friendsstg_tiktok_comments - parsed comments from /post-comments/dim_tiktok_creator - SCD Type 2 on creator profile with follower historyfact_tiktok_post_engagement - one row per (post, snapshot date) for tracking growthfact_tiktok_follower_snapshot - daily follower counts per creatorThe /userinfo-by-username/ endpoint returns two top-level objects: user (camelCase profile fields) and stats (camelCase counters like followerCount, heartCount, videoCount). Staging parses both into a flat row.
-- models/staging/stg_tiktok_users.sql
{{ config(materialized='view') }}
select
ingested_at as snapshot_at,
json_value(payload, '$.user.id') as user_id,
json_value(payload, '$.user.uniqueId') as unique_id,
json_value(payload, '$.user.nickname') as nickname,
json_value(payload, '$.user.secUid') as sec_uid,
json_value(payload, '$.user.signature') as signature,
cast(json_value(payload, '$.user.verified') as bool) as is_verified,
cast(json_value(payload, '$.user.privateAccount') as bool) as is_private,
json_value(payload, '$.user.avatarLarger') as avatar_url,
json_value(payload, '$.user.bioLink') as bio_link,
cast(json_value(payload, '$.stats.followerCount') as int64) as follower_count,
cast(json_value(payload, '$.stats.followingCount') as int64) as following_count,
cast(json_value(payload, '$.stats.heartCount') as int64) as heart_count,
cast(json_value(payload, '$.stats.videoCount') as int64) as video_count,
cast(json_value(payload, '$.stats.diggCount') as int64) as digg_count
from {{ source('raw', 'tiktok_api_responses') }}
where endpoint = '/userinfo-by-username/'
and http_status = 200
-- models/staging/stg_tiktok_users.sql
{{ config(materialized='view') }}
select
ingested_at as snapshot_at,
payload:user:id::string as user_id,
payload:user:uniqueId::string as unique_id,
payload:user:nickname::string as nickname,
payload:user:secUid::string as sec_uid,
payload:user:signature::string as signature,
payload:user:verified::boolean as is_verified,
payload:user:privateAccount::boolean as is_private,
payload:user:avatarLarger::string as avatar_url,
payload:user:bioLink::string as bio_link,
payload:stats:followerCount::number as follower_count,
payload:stats:followingCount::number as following_count,
payload:stats:heartCount::number as heart_count,
payload:stats:videoCount::number as video_count,
payload:stats:diggCount::number as digg_count
from {{ source('raw', 'tiktok_api_responses') }}
where endpoint = '/userinfo-by-username/'
and http_status = 200
For posts staging, the videos[] array inside each /user-posts/ response needs unnesting. BigQuery uses UNNEST(JSON_QUERY_ARRAY(...)); Snowflake uses LATERAL FLATTEN. Field names inside videos[] are snake_case (aweme_id, play_count, digg_count, comment_count, share_count, create_time), unlike the camelCase user objects - this asymmetry catches people out.
Most TikLiveAPI paginated endpoints return a cursor + hasMore envelope. A few exceptions matter for ingestion logic:
time timestamp parameter, not cursor./user-following/'s top-level list key is followings (plural with trailing s), not following./post-comments/ items use id as the comment identifier, not cid.Your collector should persist the latest cursor (or time) per (endpoint, request_param) tuple, so subsequent runs resume rather than restart. On the dbt side, staging models should be views (cheap), and marts should be incremental.
-- models/marts/fact_tiktok_post_engagement.sql
{{ config(
materialized='incremental',
unique_key=['aweme_id', 'snapshot_date'],
partition_by={'field': 'snapshot_date', 'data_type': 'date'},
cluster_by=['author_user_id']
) }}
select
date(snapshot_at) as snapshot_date,
aweme_id,
author_user_id,
play_count,
digg_count,
comment_count,
share_count,
download_count,
collect_count,
created_at
from {{ ref('stg_tiktok_posts') }}
{% if is_incremental() %}
where snapshot_at > (select max(snapshot_at) from {{ this }})
{% endif %}
For SCD Type 2 on dim_tiktok_creator, use dbt's built-in snapshots against stg_tiktok_users with strategy='check' on follower_count and video_count. You'll get valid_from / valid_to columns and a full history of how each creator's audience moved.
Wire up four tiers of testing. They take 30 minutes to add and catch hours of debugging later.
# models/staging/schema.yml
version: 2
models:
- name: stg_tiktok_users
columns:
- name: user_id
tests: [not_null]
- name: unique_id
tests: [not_null]
- name: follower_count
tests:
- dbt_utils.accepted_range:
min_value: 0
- name: stg_tiktok_posts
columns:
- name: aweme_id
tests: [not_null]
Add unique tests on composite keys like (aweme_id, snapshot_date) for the engagement fact.
# models/staging/sources.yml
sources:
- name: raw
tables:
- name: tiktok_api_responses
loaded_at_field: ingested_at
freshness:
warn_after: {count: 6, period: hour}
error_after: {count: 24, period: hour}
Write a singular test that fails if any post in the last 7 days has a play_count that decreased between snapshots (TikTok counters are monotonic - a decrease usually means a data quality problem).
with weekly as (
select
author_user_id,
date_trunc(snapshot_date, week) as week_start,
max(follower_count) as followers_eow
from {{ ref('fact_tiktok_follower_snapshot') }}
group by 1, 2
)
select
author_user_id,
week_start,
followers_eow,
lag(followers_eow) over (partition by author_user_id order by week_start) as prev_week,
safe_divide(
followers_eow - lag(followers_eow) over (partition by author_user_id order by week_start),
lag(followers_eow) over (partition by author_user_id order by week_start)
) as growth_rate
from weekly
select
p.aweme_id,
p.author_user_id,
p.snapshot_date,
p.digg_count + p.comment_count + p.share_count as interactions,
safe_divide(
p.digg_count + p.comment_count + p.share_count,
f.follower_count
) as engagement_rate
from {{ ref('fact_tiktok_post_engagement') }} p
join {{ ref('fact_tiktok_follower_snapshot') }} f
on f.author_user_id = p.author_user_id
and f.snapshot_date = p.snapshot_date
Parse the title field from stg_tiktok_posts with a regex on #\w+, unnest, group by creator and tag, count occurrences. If you need authoritative hashtag stats, join the result to /challenge-info-name/ output (field is cha_name, not name) for user_count and view_count.
Both warehouses charge for scanned bytes. Three practical levers:
where snapshot_date >= .... Snowflake's automatic clustering does similar with CLUSTER BY TO_DATE(ingested_at).For the raw landing table, set a retention policy (BigQuery partition_expiration_days, Snowflake DATA_RETENTION_TIME_IN_DAYS). Six months of raw JSON is usually enough; older periods can be exported to cheap object storage if you need it.
TikTok profile and post data is public, but warehoused public data still deserves thought before it ships to a shared dashboard.
unique_id if dashboards are shared outside the data team.region codes are coarse (ISO country), but combining them with creator handles can effectively identify individuals - aggregate before display.Document this in your data catalog and tag sensitive columns with row-level access policies (BigQuery Policy Tags, Snowflake Tag-Based Masking Policies).
The end state: a small ingestion service that polls TikLiveAPI on a schedule, lands raw JSON in a partitioned semi-structured table, hands off to a dbt project that parses, tests, and shapes the data into facts and dims, and surfaces metrics through your existing BI tool. No bespoke dashboards, no manual exports, no SQL fragments living in someone's local notebook.
Spin up an account on the pricing page, grab your X-Api-Key, sanity-check responses in the playground, then point your collector at https://api.tikliveapi.com. The 1 request = 1 credit model maps cleanly to "1 row in raw," which makes budgeting trivial: estimate daily ingestion volume, multiply by credit price, done. Questions on volume pricing or warehouse-friendly batch patterns? Talk to us.
ELT. Land raw JSON in a semi-structured column, transform with dbt. Schema changes upstream are absorbed silently, you keep full history of every response, and analysts can reshape data without redeploying the collector. ETL only makes sense if storage cost is a hard constraint.
Most paginated endpoints (/user-posts/, /post-comments/, /search-video/) use cursor + hasMore. The two follower endpoints (/user-followers/, /user-following/) use a time timestamp parameter instead. Store the last cursor or last time per (endpoint, request_param) in a small state table; your collector reads it on each run.
No, TikLiveAPI is request/response only. For change detection, you poll on a schedule (hourly, daily, whatever fits your use case) and diff successive responses inside the warehouse. For most analytics workflows daily snapshots are more than sufficient.
Storage for raw JSON is the dominant cost line. A typical analytics setup tracking 500 creators with daily user-info polls and weekly post pulls runs a few GB per month. Query cost is small if you partition by date and cluster by author. Plan for warehouse cost roughly equal to your TikLiveAPI credit spend at moderate scale.
Yes, via your BI tool's row-level security. Build a dim_tiktok_creator mart with masking applied to sec_uid, raw comment text, and any handle that hasn't been opted-in. Expose fact_tiktok_post_engagement and fact_tiktok_follower_snapshot through curated Looker explores or Hex apps - the dbt tests guarantee shape, the BI tool guarantees access.
Ready to ship the pipeline? Create an account, browse all 37 endpoints in the docs, and read more architecture posts on the blog.
Ready to put what you read into code? Try our endpoints live or grab the full reference.