TikTok Data in Your Warehouse: BigQuery, Snowflake, dbt

Q: How do I handle the cursor vs time pagination split?

Most paginated endpoints (/user-posts/, /post-comments/, /search-video/) use cursor + hasMore. The two follower endpoints (/user-followers/, /user-following/) use a time timestamp parameter instead. Store the last cursor or last time per (endpoint, request_param) in a small state table; your collector reads it on each run.

Q: Can I expose this data to non-analyst stakeholders directly?

Yes, via your BI tool's row-level security. Build a dim_tiktok_creator mart with masking applied to sec_uid, raw comment text, and any handle that hasn't been opted-in. Expose fact_tiktok_post_engagement and fact_tiktok_follower_snapshot through curated Looker explores or Hex apps - the dbt tests guarantee shape, the BI tool guarantees access. Ready to ship the pipeline? Create an account, browse all 37 endpoints in the docs, and read more architecture posts on the blog.

By TikLiveAPI Team · Published on May 29, 2026

TikTok Data in Your Warehouse: BigQuery, Snowflake, dbt

Why Land TikTok Data in Your Warehouse

Your analytics team already lives in Looker, Tableau, Hex, or Mode. Your marketing dashboards already join Shopify, HubSpot, and Google Ads on a clean date spine. Adding TikTok creator and content data to that warehouse means social performance suddenly speaks the same SQL dialect as every other revenue source, with no separate logins, no copy-pasted CSV exports, and no third tool to govern.

TikLiveAPI returns clean JSON across 37 REST endpoints, which makes warehouse ingestion straightforward. This guide is for analytics engineers and data team leads who want to move from "scripts that hit an API" to a modeled, tested, and queryable TikTok dataset alongside the rest of the business.

We will cover ELT vs ETL trade-offs, raw landing schemas for BigQuery and Snowflake, dbt staging and marts layers, incremental ingestion with cursor pagination, schema and freshness tests, common BI metrics, cost optimization, and privacy considerations.

Architecture: ELT or ETL

You have two realistic paths for getting TikTok responses into your warehouse.

ETL: transform before load

The collector script flattens nested objects, renames camelCase to snake_case, and inserts typed columns directly. The schema is narrow and clean. The downside: every new field on the upstream JSON requires a code change in the collector, and once a field is dropped during transform, it's gone. If a stakeholder asks next week for secUid or avatarLarger, you have to backfill.

ELT: dump raw JSON, transform inside the warehouse

The collector inserts the raw API response into a single semi-structured column. Schema changes upstream are silently absorbed - the new keys simply appear inside the JSON blob. dbt models read from raw and produce typed staging and marts.

For analytics use cases, ELT wins. Storage is cheap, raw history is invaluable for debugging, and dbt gives you version-controlled transformations that any analyst can read. The rest of this post assumes ELT.

Raw Landing Schemas

Land one row per API response. Store the parsed payload in the semi-structured column for the warehouse, plus a few "envelope" columns you want to index and partition on.

BigQuery

CREATE TABLE raw.tiktok_api_responses (
  ingested_at   TIMESTAMP NOT NULL,
  endpoint      STRING    NOT NULL,
  request_param STRING,
  http_status   INT64,
  payload       JSON,
  request_id    STRING
)
PARTITION BY DATE(ingested_at)
CLUSTER BY endpoint, request_param;

BigQuery's native JSON type lets you index into payloads with dot notation and supports schema evolution without DDL changes.

Snowflake

CREATE TABLE raw.tiktok_api_responses (
  ingested_at   TIMESTAMP_NTZ NOT NULL,
  endpoint      STRING        NOT NULL,
  request_param STRING,
  http_status   NUMBER,
  payload       VARIANT,
  request_id    STRING
)
CLUSTER BY (TO_DATE(ingested_at), endpoint);

Snowflake's VARIANT handles arbitrary JSON, with FLATTEN for array unnesting and the : accessor for nested paths.

The collector script calls TikLiveAPI with X-Api-Key auth and inserts one row per response. You can find your key on the profile page after registering, and see all 37 endpoints in the documentation.

# pseudo-Python ingestion loop
import requests, json, datetime

headers = {"X-Api-Key": os.environ["TIKLIVE_KEY"]}
resp = requests.get(
    "https://api.tikliveapi.com/userinfo-by-username/",
    params={"username": handle},
    headers=headers,
    timeout=30,
)
warehouse.insert("raw.tiktok_api_responses", {
    "ingested_at":   datetime.datetime.utcnow(),
    "endpoint":      "/userinfo-by-username/",
    "request_param": handle,
    "http_status":   resp.status_code,
    "payload":       resp.text,   # raw JSON string
    "request_id":    str(uuid.uuid4()),
})

Modeling Layers with dbt

A clean three-layer dbt project keeps responsibilities separated.

Staging (stg_*) - one model per endpoint, parses raw JSON into typed columns, renames to snake_case, drops nothing.
Intermediate (int_*) - joins, deduplication, windowing.
Marts (dim_*, fact_*) - the tables BI tools actually query.

For TikTok, the marts that earn their keep are:

stg_tiktok_users - parsed creator profile snapshots
stg_tiktok_posts - parsed video rows from /user-posts/ and friends
stg_tiktok_comments - parsed comments from /post-comments/
dim_tiktok_creator - SCD Type 2 on creator profile with follower history
fact_tiktok_post_engagement - one row per (post, snapshot date) for tracking growth
fact_tiktok_follower_snapshot - daily follower counts per creator

dbt Example: stg_tiktok_users.sql

The /userinfo-by-username/ endpoint returns two top-level objects: user (camelCase profile fields) and stats (camelCase counters like followerCount, heartCount, videoCount). Staging parses both into a flat row.

BigQuery variant

-- models/staging/stg_tiktok_users.sql
{{ config(materialized='view') }}

select
  ingested_at                                           as snapshot_at,
  json_value(payload, '$.user.id')                      as user_id,
  json_value(payload, '$.user.uniqueId')                as unique_id,
  json_value(payload, '$.user.nickname')                as nickname,
  json_value(payload, '$.user.secUid')                  as sec_uid,
  json_value(payload, '$.user.signature')               as signature,
  cast(json_value(payload, '$.user.verified') as bool)  as is_verified,
  cast(json_value(payload, '$.user.privateAccount') as bool) as is_private,
  json_value(payload, '$.user.avatarLarger')            as avatar_url,
  json_value(payload, '$.user.bioLink')                 as bio_link,
  cast(json_value(payload, '$.stats.followerCount')  as int64) as follower_count,
  cast(json_value(payload, '$.stats.followingCount') as int64) as following_count,
  cast(json_value(payload, '$.stats.heartCount')     as int64) as heart_count,
  cast(json_value(payload, '$.stats.videoCount')     as int64) as video_count,
  cast(json_value(payload, '$.stats.diggCount')      as int64) as digg_count
from {{ source('raw', 'tiktok_api_responses') }}
where endpoint = '/userinfo-by-username/'
  and http_status = 200

Snowflake variant

-- models/staging/stg_tiktok_users.sql
{{ config(materialized='view') }}

select
  ingested_at                              as snapshot_at,
  payload:user:id::string                  as user_id,
  payload:user:uniqueId::string            as unique_id,
  payload:user:nickname::string            as nickname,
  payload:user:secUid::string              as sec_uid,
  payload:user:signature::string           as signature,
  payload:user:verified::boolean           as is_verified,
  payload:user:privateAccount::boolean     as is_private,
  payload:user:avatarLarger::string        as avatar_url,
  payload:user:bioLink::string             as bio_link,
  payload:stats:followerCount::number      as follower_count,
  payload:stats:followingCount::number     as following_count,
  payload:stats:heartCount::number         as heart_count,
  payload:stats:videoCount::number         as video_count,
  payload:stats:diggCount::number          as digg_count
from {{ source('raw', 'tiktok_api_responses') }}
where endpoint = '/userinfo-by-username/'
  and http_status = 200

For posts staging, the videos[] array inside each /user-posts/ response needs unnesting. BigQuery uses UNNEST(JSON_QUERY_ARRAY(...)); Snowflake uses LATERAL FLATTEN. Field names inside videos[] are snake_case (aweme_id, play_count, digg_count, comment_count, share_count, create_time), unlike the camelCase user objects - this asymmetry catches people out.

Incremental Models and Cursor-Based Ingestion

Most TikLiveAPI paginated endpoints return a cursor + hasMore envelope. A few exceptions matter for ingestion logic:

/user-followers/ and /user-following/ use a time timestamp parameter, not cursor.
/user-following/'s top-level list key is followings (plural with trailing s), not following.
/post-comments/ items use id as the comment identifier, not cid.

Your collector should persist the latest cursor (or time) per (endpoint, request_param) tuple, so subsequent runs resume rather than restart. On the dbt side, staging models should be views (cheap), and marts should be incremental. For a deeper look at cursor handling and retry behavior on the comments endpoint, see paginating TikTok comments at scale.

-- models/marts/fact_tiktok_post_engagement.sql
{{ config(
    materialized='incremental',
    unique_key=['aweme_id', 'snapshot_date'],
    partition_by={'field': 'snapshot_date', 'data_type': 'date'},
    cluster_by=['author_user_id']
) }}

select
  date(snapshot_at)              as snapshot_date,
  aweme_id,
  author_user_id,
  play_count,
  digg_count,
  comment_count,
  share_count,
  download_count,
  collect_count,
  created_at
from {{ ref('stg_tiktok_posts') }}
{% if is_incremental() %}
  where snapshot_at > (select max(snapshot_at) from {{ this }})
{% endif %}

For SCD Type 2 on dim_tiktok_creator, use dbt's built-in snapshots against stg_tiktok_users with strategy='check' on follower_count and video_count. You'll get valid_from / valid_to columns and a full history of how each creator's audience moved.

Tests in dbt

Wire up four tiers of testing. They take 30 minutes to add and catch hours of debugging later.

Schema tests

# models/staging/schema.yml
version: 2
models:
  - name: stg_tiktok_users
    columns:
      - name: user_id
        tests: [not_null]
      - name: unique_id
        tests: [not_null]
      - name: follower_count
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
  - name: stg_tiktok_posts
    columns:
      - name: aweme_id
        tests: [not_null]

Uniqueness tests on natural keys

Add unique tests on composite keys like (aweme_id, snapshot_date) for the engagement fact.

Freshness tests

# models/staging/sources.yml
sources:
  - name: raw
    tables:
      - name: tiktok_api_responses
        loaded_at_field: ingested_at
        freshness:
          warn_after: {count: 6,  period: hour}
          error_after: {count: 24, period: hour}

Custom assertions

Write a singular test that fails if any post in the last 7 days has a play_count that decreased between snapshots (TikTok counters are monotonic - a decrease usually means a data quality problem).

Common BI Metrics

Follower growth rate, week over week

with weekly as (
  select
    author_user_id,
    date_trunc(snapshot_date, week) as week_start,
    max(follower_count)             as followers_eow
  from {{ ref('fact_tiktok_follower_snapshot') }}
  group by 1, 2
)
select
  author_user_id,
  week_start,
  followers_eow,
  lag(followers_eow) over (partition by author_user_id order by week_start) as prev_week,
  safe_divide(
    followers_eow - lag(followers_eow) over (partition by author_user_id order by week_start),
    lag(followers_eow) over (partition by author_user_id order by week_start)
  ) as growth_rate
from weekly

Engagement rate per post

select
  p.aweme_id,
  p.author_user_id,
  p.snapshot_date,
  p.digg_count + p.comment_count + p.share_count                     as interactions,
  safe_divide(
    p.digg_count + p.comment_count + p.share_count,
    f.follower_count
  )                                                                  as engagement_rate
from {{ ref('fact_tiktok_post_engagement') }} p
join {{ ref('fact_tiktok_follower_snapshot') }} f
  on f.author_user_id = p.author_user_id
 and f.snapshot_date  = p.snapshot_date

Top hashtags per creator

Parse the title field from stg_tiktok_posts with a regex on #\w+, unnest, group by creator and tag, count occurrences. If you need authoritative hashtag stats, join the result to /challenge-info-name/ output (field is cha_name, not name) for user_count and view_count.

Cost Optimization

Both warehouses charge for scanned bytes. Three practical levers:

Partition by snapshot_date. BigQuery skips entire partitions when a query has where snapshot_date >= .... Snowflake's automatic clustering does similar with CLUSTER BY TO_DATE(ingested_at).
Cluster by author_user_id. Most analytical queries filter or join on creator. Clustering pushes co-located rows into the same micro-partitions, slashing scan cost.
Materialize as views in staging, tables in marts. Staging is cheap re-computation; marts are queried by humans and BI tools all day - pay the storage to save the compute.

For the raw landing table, set a retention policy (BigQuery partition_expiration_days, Snowflake DATA_RETENTION_TIME_IN_DAYS). Six months of raw JSON is usually enough; older periods can be exported to cheap object storage if you need it.

Privacy Considerations

TikTok profile and post data is public, but warehoused public data still deserves thought before it ships to a shared dashboard.

Comment text. User-generated, frequently personal. Mask in shared dashboards; expose only aggregate counts.
Commenter handles and avatars. Hash unique_id if dashboards are shared outside the data team.
secUid. Treat as internal-only. It's used as the TikTok internal identifier and shouldn't leak into customer-facing reports.
Geographic data. Post region codes are coarse (ISO country), but combining them with creator handles can effectively identify individuals - aggregate before display.
Right to erasure. If a creator deletes their account, plan a quarterly job to detect 404 responses and either tombstone or hard-delete affected rows.

Document this in your data catalog and tag sensitive columns with row-level access policies (BigQuery Policy Tags, Snowflake Tag-Based Masking Policies).

Putting It All Together

The end state: a small ingestion service that polls TikLiveAPI on a schedule, lands raw JSON in a partitioned semi-structured table, hands off to a dbt project that parses, tests, and shapes the data into facts and dims, and surfaces metrics through your existing BI tool. No bespoke dashboards, no manual exports, no SQL fragments living in someone's local notebook. If you ever need lower latency than batch snapshots, the same collector pattern extends to streaming TikTok data into Kafka.

Spin up an account on the pricing page, grab your X-Api-Key, sanity-check responses in the playground, then point your collector at https://api.tikliveapi.com. The 1 request = 1 credit model maps cleanly to "1 row in raw," which makes budgeting trivial: estimate daily ingestion volume, multiply by credit price, done. Questions on volume pricing or warehouse-friendly batch patterns? Talk to us.

FAQ

Should I use ELT or ETL for TikTok data?

ELT. Land raw JSON in a semi-structured column, transform with dbt. Schema changes upstream are absorbed silently, you keep full history of every response, and analysts can reshape data without redeploying the collector. ETL only makes sense if storage cost is a hard constraint.

How do I handle the cursor vs time pagination split?

Most paginated endpoints (/user-posts/, /post-comments/, /search-video/) use cursor + hasMore. The two follower endpoints (/user-followers/, /user-following/) use a time timestamp parameter instead. Store the last cursor or last time per (endpoint, request_param) in a small state table; your collector reads it on each run.

Does TikLiveAPI offer webhooks for change events?

No, TikLiveAPI is request/response only. For change detection, you poll on a schedule (hourly, daily, whatever fits your use case) and diff successive responses inside the warehouse. For most analytics workflows daily snapshots are more than sufficient.

How much does this cost in BigQuery or Snowflake?

Storage for raw JSON is the dominant cost line. A typical analytics setup tracking 500 creators with daily user-info polls and weekly post pulls runs a few GB per month. Query cost is small if you partition by date and cluster by author. Plan for warehouse cost roughly equal to your TikLiveAPI credit spend at moderate scale.

Can I expose this data to non-analyst stakeholders directly?

Yes, via your BI tool's row-level security. Build a dim_tiktok_creator mart with masking applied to sec_uid, raw comment text, and any handle that hasn't been opted-in. Expose fact_tiktok_post_engagement and fact_tiktok_follower_snapshot through curated Looker explores or Hex apps - the dbt tests guarantee shape, the BI tool guarantees access.

Ready to ship the pipeline? Create an account, browse all 37 endpoints in the docs, and read more architecture posts on the blog.

Build with the TikTok API

Ready to put what you read into code? Try our endpoints live or grab the full reference.

Open Playground Read Documentation

TikTok Data in Your Warehouse: BigQuery, Snowflake, dbt

Why Land TikTok Data in Your Warehouse

Architecture: ELT or ETL

ETL: transform before load

ELT: dump raw JSON, transform inside the warehouse

Raw Landing Schemas

BigQuery

Snowflake

Modeling Layers with dbt

dbt Example: stg_tiktok_users.sql

BigQuery variant

Snowflake variant

Incremental Models and Cursor-Based Ingestion

Tests in dbt

Schema tests

Uniqueness tests on natural keys

Freshness tests

Custom assertions

Common BI Metrics

Follower growth rate, week over week

Engagement rate per post

Top hashtags per creator

Cost Optimization

Privacy Considerations

Putting It All Together

FAQ

Should I use ELT or ETL for TikTok data?

How do I handle the cursor vs time pagination split?

Does TikLiveAPI offer webhooks for change events?

How much does this cost in BigQuery or Snowflake?

Can I expose this data to non-analyst stakeholders directly?

Build with the TikTok API

Quick Links

Legal

Contact

TikTok API Solutions

TikTok Data in Your Warehouse: BigQuery, Snowflake, dbt

Why Land TikTok Data in Your Warehouse

Architecture: ELT or ETL

ETL: transform before load

ELT: dump raw JSON, transform inside the warehouse

Raw Landing Schemas

BigQuery

Snowflake

Modeling Layers with dbt

dbt Example: stg_tiktok_users.sql

BigQuery variant

Snowflake variant

Incremental Models and Cursor-Based Ingestion

Tests in dbt

Schema tests

Uniqueness tests on natural keys

Freshness tests

Custom assertions

Common BI Metrics

Follower growth rate, week over week

Engagement rate per post

Top hashtags per creator

Cost Optimization

Privacy Considerations

Putting It All Together

FAQ

Should I use ELT or ETL for TikTok data?

How do I handle the cursor vs time pagination split?

Does TikLiveAPI offer webhooks for change events?

How much does this cost in BigQuery or Snowflake?

Can I expose this data to non-analyst stakeholders directly?

Related Articles

How to Position Your TikTok-Data Product to Win

TikTok Shares vs Saves: What They Actually Predict

Data Governance for TikTok-Data Apps: A Practical Guide

Build with the TikTok API

Quick Links

Legal

Contact

TikTok API Solutions