TikTok Recommendations from Embeddings: A Build Guide

By TikLiveAPI Team · Published on May 29, 2026

Recommendation systems built on collaborative filtering struggle when you do not own the user-item interaction matrix. If you are building a TikTok-adjacent discovery product, an analytics dashboard, or a brand-safety tool, you do not have hundreds of millions of watch events to factor. What you do have is rich public content: creator bios, video titles, music metadata, hashtags, and the comments users have left on videos. That is enough to build a content-based recommendation engine that performs surprisingly well, scales horizontally, and gives you a working cold-start story from day one.

This guide walks through the architecture our customers most often build on top of the TikLiveAPI dataset: embedding creators, videos, and user interests into a shared vector space, then serving recommendations through an ANN recall plus cross-encoder rerank pipeline. Code is illustrative Python, the data shapes match the live API, and every number in the cost section is verifiable.

Why embeddings beat keyword matching

The naive approach is to tokenize hashtags and titles and match on TF-IDF or BM25. It is fast and explainable, but it falls apart on the long tail of TikTok content. A cooking creator who never writes "#cooking" but posts videos titled "knife skills you wish you knew" gets buried. A user who comments on skateboarding videos in Portuguese is never shown English-language skate content. Embeddings close the synonym gap because sentence transformer models map semantically similar phrases to nearby vectors regardless of surface form, and a multilingual model closes the language gap the same way.

The pattern we recommend has three vector types living in the same 384-dimensional space:

Creator embedding: mean-pooled vector of a creator's last 30 video titles plus their bio
Video embedding: encoded from title, music name, and hashtag list per video
Interest embedding: mean-pooled vector of the last 100 comments a user wrote

Because all three live in the same space, you can score any combination: creator-to-creator similarity, user interest to video, video to video for "more like this" rails.
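
To make the shared-space idea concrete, here is a minimal sketch of scoring across vector types with a plain dot product. The 4-dimensional toy vectors and the helper names normalize and top_k are made up for illustration; real vectors would be the 384-dimensional embeddings described above.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    v = np.asarray(v, dtype=np.float32)
    norm = np.linalg.norm(v)
    return v if norm == 0 else v / norm

def top_k(query, matrix, k=3):
    """Return (row_index, score) pairs for the k rows most similar to query."""
    scores = matrix @ query            # rows are unit vectors, so this is cosine
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy stand-ins for video embeddings (values are invented).
videos = np.stack([normalize(v) for v in [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]])
user_interest = normalize([1.0, 0.2, 0.0, 0.0])
creator = normalize([0.0, 0.1, 1.0, 0.0])

print(top_k(user_interest, videos, k=2))  # user interest -> video
print(top_k(creator, videos, k=1))        # creator -> video
```

The same top_k call serves every pairing because nothing in it depends on which kind of entity produced the query vector.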

Pulling the source data

Every embedding starts with a call to the TikLiveAPI dataset. Authentication is a single header on every request. You can test these payloads in the live playground before wiring up your ingestion job.

import requests

BASE = "https://api.tikliveapi.com"
HEADERS = {"X-Api-Key": "YOUR_KEY"}

def get_user_info(username):
    r = requests.get(f"{BASE}/userinfo-by-username/",
                     params={"username": username},
                     headers=HEADERS, timeout=30)
    r.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error body
    return r.json()

def get_user_posts(userid, count=30):
    r = requests.get(f"{BASE}/user-posts/",
                     params={"userid": userid, "count": count},
                     headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json().get("videos", [])

def get_post_comments(video_url, count=50):
    r = requests.get(f"{BASE}/post-comments/",
                     params={"url": video_url, "count": count},
                     headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json().get("comments", [])

Note the response shapes: /userinfo-by-username/ returns nested user{} and stats{} objects with camelCase keys, while /user-posts/ returns flat snake_case video objects with a nested music_info{}. Mixing these up is the most common ingestion bug. The users documentation and posts documentation spell out every key.
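
One way to avoid mixing the two shapes is to flatten each payload into your own record type at the ingestion boundary. The sketch below is an assumption about how you might do that; it only touches keys already used in this guide (user.id, user.signature, title, music_info.title), and the sample payloads are invented, not real API output.

```python
def creator_record(info):
    """Flatten a /userinfo-by-username/ payload (nested user{} object)."""
    user = info.get("user") or {}
    return {
        "userid": user.get("id"),
        "bio": user.get("signature") or "",
    }

def video_record(video):
    """Flatten one /user-posts/ video object (flat keys, nested music_info{})."""
    music = video.get("music_info") or {}   # tolerate a missing or null object
    return {
        "title": video.get("title") or "",
        "music_title": music.get("title") or "",
    }

# Invented sample payloads, shaped like the descriptions above.
sample_info = {"user": {"id": "123", "signature": "home cook"}, "stats": {}}
sample_video = {"title": "knife skills #cooking", "music_info": {"title": "lofi beat"}}

print(creator_record(sample_info))
print(video_record(sample_video))
```

Downstream code then reads only the flattened records, so a key-shape mistake surfaces in one place.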

Building the creator embedding

For each creator you want to index, pull their bio from /userinfo-by-username/ and their last 30 video titles from /user-posts/. Concatenate, encode, and mean-pool. The all-MiniLM-L6-v2 model from sentence-transformers is the right default: 384 dimensions, ~80MB on disk, ~14k sentences per second on a single GPU, ~400 per second on CPU.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

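def creator_embedding(username):
    info = get_user_info(username)
    userid = info["user"]["id"]
    bio = info["user"].get("signature", "")
    posts = get_user_posts(userid, count=30)
    titles = [p.get("title", "") for p in posts if p.get("title")]
    # Drop empty strings so an empty bio cannot mask a creator with no text
    documents = [d for d in [bio] + titles if d.strip()]
    if not documents:
        return None
    vectors = model.encode(documents, normalize_embeddings=True)
    pooled = np.mean(vectors, axis=0)
    # The mean of unit vectors is shorter than unit length, so renormalize
    return pooled / np.linalg.norm(pooled)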

Always normalize embeddings. Cosine similarity on normalized vectors reduces to a dot product, which every vector database optimizes for. The mean of unit vectors is generally shorter than unit length, so renormalize the pooled vector before indexing it. Mean pooling is naive but works well; if you want to weight recent posts higher, use exponential decay by post timestamp.
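
A minimal sketch of the exponential-decay variant, assuming you already have one vector per post and a Unix timestamp for each. The function name decayed_mean and the 30-day half-life are illustrative choices, not values from the API or from a tuned system.

```python
import numpy as np

def decayed_mean(vectors, timestamps, half_life_days=30.0, now=None):
    """Weighted mean of post vectors where a post's weight halves every half_life_days."""
    vectors = np.asarray(vectors, dtype=np.float32)
    ts = np.asarray(timestamps, dtype=np.float64)   # Unix seconds, one per vector
    if len(vectors) == 0:
        return None
    now = ts.max() if now is None else now          # default: age relative to newest post
    age_days = (now - ts) / 86400.0
    weights = 0.5 ** (age_days / half_life_days)
    pooled = (weights[:, None] * vectors).sum(axis=0) / weights.sum()
    norm = np.linalg.norm(pooled)
    return pooled if norm == 0 else pooled / norm   # renormalize for dot-product search
```

With a 30-day half-life, a post from a month ago counts half as much as one from today. Shorter half-lives track trend-chasing creators more closely at the cost of a noisier vector.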

Building the video embedding

Video embeddings are the densest layer of your index. For each video, encode a single document built from title, music name, and hashtags. Hashtags often live inside the title string with hash prefixes, but the API also exposes them separately on detail calls.

def video_document(video):
    title = video.get("title") or ""
    # music_info can be missing or null, so fall back to an empty dict
    music = (video.get("music_info") or {}).get("title", "")
    # Pull hashtags out of the title
    tags = " ".join(t for t in title.split() if t.startswith("#"))
    return f"{title} | music: {music} | tags: {tags}"

def encode_videos(videos):
    docs = [video_document(v) for v in videos]
    return model.encode(docs, normalize_embeddings=True, batch_size=64)

Batch encoding is critical at scale. A single GPU encoder running batch size 64 will process roughly 14,000 short documents per second, which means a million-video index encodes in under two minutes of GPU time.
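
The whitespace split in video_document misses hashtags that are glued together ("#cooking#chef") and keeps trailing punctuation. A regex-based extractor is one possible refinement; the helper name extract_hashtags is hypothetical, and lowercasing plus de-duplication are assumptions about what you want in the document string.

```python
import re

# \w is Unicode-aware for str patterns in Python 3, so non-Latin tags match too.
HASHTAG_RE = re.compile(r"#(\w+)")

def extract_hashtags(title):
    """Return lowercased, de-duplicated hashtags in order of first appearance."""
    seen, tags = set(), []
    for tag in HASHTAG_RE.findall(title or ""):
        key = tag.lower()
        if key not in seen:
            seen.add(key)
            tags.append(key)
    return tags

print(extract_hashtags("knife skills #Cooking#chef, #cooking!"))
```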

Building the user interest embedding

This is the layer that gives content-based recommenders a personalized signal without watch history. The /post-comments/ endpoint is keyed by video, so fetch comments for the videos a user has engaged with, keep the ones whose nested user{} object matches that user, mean-pool the resulting vectors, and treat the result as their interest vector. Each comment object carries id, video_id, text, digg_count, reply_total, and the nested user{} object.

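def interest_embedding(comment_texts):
    if not comment_texts:
        return None
    vectors = model.encode(comment_texts, normalize_embeddings=True)
    pooled = np.mean(vectors, axis=0)
    # Renormalize: the mean of unit vectors is not unit length
    return pooled / np.linalg.norm(pooled)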

One hundred comments is the sweet spot. Below 20 the signal is noisy; above 200 the mean starts averaging out genuine interests. The same comment corpus can also feed a transformer-based sentiment model if you want tone signals alongside topical interests.
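
The thresholds above can be enforced at pooling time. This sketch works on already-encoded comment vectors, ordered newest first. Treating 20 as a hard floor is an assumption layered on the guidance above, and pooled_interest is a hypothetical helper name.

```python
import numpy as np

MIN_COMMENTS = 20    # below this the signal is too noisy to trust
MAX_COMMENTS = 100   # pool only the most recent comments

def pooled_interest(comment_vectors):
    """Mean-pool the newest MAX_COMMENTS vectors; return None when there are too few."""
    vectors = np.asarray(comment_vectors, dtype=np.float32)
    if vectors.ndim != 2 or len(vectors) < MIN_COMMENTS:
        return None                      # caller falls back to a cold-start vector
    recent = vectors[:MAX_COMMENTS]      # input is ordered newest first
    pooled = recent.mean(axis=0)
    norm = np.linalg.norm(pooled)
    return None if norm == 0 else pooled / norm
```

Returning None rather than a weak vector keeps low-signal users on the cold-start path until they have enough history.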

Choosing a vector database

Three options dominate this space and each has a clean answer for when to pick it.

Qdrant is the right default for greenfield builds. It is a Rust ANN engine with strong filter pushdown (you can constrain by language, follower count, posted_at) and supports both HNSW and disk-backed indexes. A million 384-dim float32 vectors are about 1.5 GB of raw data, so budget 2 to 3 GB of RAM with HNSW. Self-host on a single 4-vCPU box for under fifty dollars a month.

Weaviate wins when you want the vector DB to also do the embedding for you with a built-in module pipeline. It is heavier operationally but the GraphQL query surface and module ecosystem (text2vec, ref2vec for object-to-object similarity) save engineering time.

pgvector is correct when you already run Postgres and your index is below a few million vectors. The IVFFlat index is fine up to a million rows; HNSW (added in pgvector 0.5.0) handles ten million. The operational win is no new data store. The cost is slower query latency under high concurrency.

For TikTok-scale catalogs (10M+ videos refreshed daily), Qdrant or a hosted Pinecone is the realistic answer.
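
When sizing any of these options, a back-of-envelope memory estimate helps. The sketch below multiplies raw float32 storage by an overhead factor for the HNSW graph. The 1.5x default is a rough rule of thumb and should be treated as an assumption; measure on your own data before provisioning.

```python
def hnsw_memory_gb(num_vectors, dim=384, bytes_per_float=4, overhead=1.5):
    """Rough RAM estimate: raw vector bytes times an index overhead factor."""
    raw_bytes = num_vectors * dim * bytes_per_float
    return raw_bytes * overhead / 1e9

for n in (1_000_000, 5_000_000, 10_000_000):
    print(f"{n:>10,} vectors: ~{hnsw_memory_gb(n):.1f} GB")
```

Payload fields, replicas, and quantization settings all shift the real number, so use this only to pick a starting instance size.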

The recall plus rerank pipeline

Production recommendation systems are almost never a single nearest-neighbor lookup. They are a funnel.

Query embedding (user interest vector)
         |
         v
ANN recall: top 200 from Qdrant (sub-10ms)
         |
         v
Feature filtering: language match, follower floor, freshness
         |
         v
Cross-encoder rerank: top 200 -> top 20 (50-100ms)
         |
         v
Business rules: diversity quota, blocklists, sponsored slots
         |
         v
Final 10 results to user

The cross-encoder is the secret. A bi-encoder (what we used for indexing) encodes query and document independently and compares vectors. A cross-encoder takes the query and one candidate together and outputs a relevance score directly. Because it scores text pairs, the query passed to it is text (for example, a user's recent comments joined into one string), not the interest vector. The model cross-encoder/ms-marco-MiniLM-L-6-v2 is the right pick: ~10ms per query-document pair on CPU and roughly 1ms per pair on a T4 with batching, so reranking 200 candidates takes about two seconds on CPU or 200ms on a T4.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query_text, candidates):
    pairs = [[query_text, video_document(c)] for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, _ in ranked[:20]]
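
To show how the stages of the funnel fit together, here is a library-free sketch. A brute-force dot product stands in for the ANN index, and the filter and rerank scorer are passed in as callables. The function name recommend, the item dict shape, and the one-video-per-creator diversity rule are all illustrative assumptions.

```python
import numpy as np

def recommend(query_vec, item_vecs, items, keep, rerank_score,
              recall_k=200, rerank_k=20, final_k=10):
    """Recall -> filter -> rerank -> business rules, mirroring the funnel above."""
    # Stage 1: recall. Brute-force scoring stands in for the ANN lookup.
    scores = item_vecs @ query_vec
    recalled = np.argsort(-scores)[:recall_k]
    # Stage 2: feature filtering (language, follower floor, freshness, ...).
    candidates = [items[i] for i in recalled if keep(items[i])]
    # Stage 3: rerank with a more expensive scorer, e.g. a cross-encoder.
    ranked = sorted(candidates, key=rerank_score, reverse=True)[:rerank_k]
    # Stage 4: business rules. Here: at most one video per creator.
    seen, final = set(), []
    for item in ranked:
        if item["creator"] in seen:
            continue
        seen.add(item["creator"])
        final.append(item)
        if len(final) == final_k:
            break
    return final
```

In production, rerank_score would wrap the cross-encoder and score all candidates in one batched predict call rather than one pair at a time.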

Evaluation: offline AUC and online A/B

Offline evaluation requires a labeled set. Build it by treating "user commented on video X" as a positive signal and a random video the user did not engage with as a negative. Compute ROC AUC: the probability that your model ranks a positive higher than a negative.
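
The pairwise definition translates directly into code. This sketch counts, over every positive/negative pair, how often the positive scores higher, with ties counted as half. It is quadratic in the number of examples, so it suits small labeled sets; for large ones a library routine such as scikit-learn's roc_auc_score is the usual choice.

```python
def roc_auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Invented scores: three commented-on videos vs. two random non-engaged ones.
print(roc_auc([0.9, 0.8, 0.4], [0.7, 0.3]))
```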

Realistic targets for a content-only embedding model:

Bi-encoder recall only: AUC 0.68 to 0.74
Bi-encoder plus cross-encoder rerank: AUC 0.78 to 0.84
Hybrid with collaborative signals (if you have them): AUC 0.87+

Online, run a 90/10 A/B with click-through rate and dwell time as primary metrics. Two-week minimum; TikTok-like content has strong weekly seasonality and a one-week test will mislead you.
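
For the split itself, deterministic hashing keeps each user in the same arm across sessions without storing assignments. This is a sketch under assumptions: the experiment name is made up, and it assigns the 10% share to the new recommender and 90% to control.

```python
import hashlib

def assign_arm(user_id, experiment="embed-recs-v1", treatment_share=0.10):
    """Map a user to 'treatment' or 'control' stably, using a salted hash."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

print(assign_arm("user-42"))
```

Salting with the experiment name means a later experiment reshuffles users instead of reusing the same 10%.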

Cold-start strategy

New users with no comments and no engagement history are the hardest case. The demographic prior approach works well: bucket users by signup country, signup app version, and any onboarding interest selections, then serve the centroid embedding of that bucket's top engagers. Within 48 hours, most users have left enough signal that you can switch them to their personal interest embedding.
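
A minimal sketch of the demographic prior: group the interest vectors of known top engagers by bucket, average them, and hand a new user the centroid for their bucket. The bucket key format (a tuple such as country plus onboarding interest) and the function names are illustrative assumptions.

```python
from collections import defaultdict
import numpy as np

def bucket_centroids(engager_vectors, engager_buckets):
    """Return {bucket: unit-length centroid} from per-user interest vectors."""
    grouped = defaultdict(list)
    for vec, bucket in zip(engager_vectors, engager_buckets):
        grouped[bucket].append(np.asarray(vec, dtype=np.float32))
    centroids = {}
    for bucket, vecs in grouped.items():
        mean = np.mean(vecs, axis=0)
        norm = np.linalg.norm(mean)
        if norm > 0:
            centroids[bucket] = mean / norm
    return centroids

def cold_start_vector(bucket, centroids, fallback=None):
    """Look up a new user's prior; fall back (e.g. to a global centroid) if unseen."""
    return centroids.get(bucket, fallback)
```

Once a user accumulates enough comments, swap the centroid for their own interest embedding.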

For new creators, the bio plus first three videos is enough to seed an embedding. The /user-posts/ endpoint returns videos in reverse chronological order, so your first call already gives you the most recent uploads.

Production architecture

+------------------+      +------------------+
|  Ingestion       |      |  Encoding        |
|  workers         +----->+  service (GPU)   |
|  (poll API)      |      |  MiniLM bi-enc   |
+------------------+      +--------+---------+
                                   |
                                   v
                          +------------------+
                          |  Qdrant cluster  |
                          |  HNSW, 384-dim   |
                          +--------+---------+
                                   |
+------------------+               |
|  Recommendation  +<--------------+
|  API (FastAPI)   |
|  - recall        |
|  - rerank (GPU)  |
|  - rules engine  |
+--------+---------+
         |
         v
+------------------+      +------------------+
|  Edge CDN cache  +----->+  Mobile / web    |
|  (5-min TTL)     |      |  clients         |
+------------------+      +------------------+

Ingestion workers poll the TikLiveAPI dataset on a schedule, push raw payloads to a queue, and the encoding service batches them onto a GPU before upserting into Qdrant. The recommendation API is stateless and horizontally scalable. A 5-minute edge cache absorbs duplicate queries from the same user opening the app multiple times.

Cost analysis at one million daily active users

Assumptions: 1M DAU, 10 recommendation requests per user per day, 5M videos in the catalog refreshed daily, 50k new videos per day.

Data ingestion: Refreshing 5M videos at 30 videos per request takes ~167k API calls per day. At our standard credit pricing, this is a modest line item.
Encoding: One T4 GPU at ~$0.35/hour can encode 14k docs/sec. 50k new videos per day is 4 seconds of GPU time. Round to one hour per day of dedicated time including overhead: ~$10/month.
Vector DB: 5M vectors at 384 dim is ~7.5 GB on disk, ~15 GB in HNSW memory. A 32 GB RAM Qdrant instance runs comfortably: ~$150/month self-hosted.
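Reranking GPU: 10M requests/day at 200 candidates each averages ~116 requests per second. At ~200ms of T4 time per 200-candidate rerank, each replica handles about 5 uncached reranks per second, so the two-replica figure (~$500/month) holds only if the edge cache and pre-rerank filtering absorb most of the load. Measure your cache hit rate and add replicas accordingly.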
API service: 4 vCPU FastAPI instances behind a load balancer, 3 replicas: ~$200/month.

Total infrastructure (excluding the TikLiveAPI data subscription): roughly $900 per month at one million DAU. The cost per recommendation served is well under one hundredth of a cent.
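
The arithmetic behind these line items can be reproduced in a few lines. Every input below is a figure stated in this section; the variable names are just for illustration.

```python
# Data ingestion: catalog size divided by videos returned per request.
videos = 5_000_000
videos_per_call = 30
api_calls = videos / videos_per_call              # about 166,667 per day

# Encoding: new videos per day at the stated GPU throughput.
new_videos = 50_000
docs_per_sec = 14_000
gpu_seconds = new_videos / docs_per_sec           # about 3.6 seconds per day
encoding_monthly = 0.35 * 1 * 30                  # one T4 hour per day, 30 days

# Vector storage: float32 vectors at 384 dimensions.
raw_gb = videos * 384 * 4 / 1e9                   # about 7.7 GB

# Monthly line items in dollars, as listed above.
line_items = {"encoding": 10, "vector_db": 150, "rerank_gpu": 500, "api": 200}
total = sum(line_items.values())

# Cost per recommendation request, in cents.
requests_per_month = 1_000_000 * 10 * 30
cents_per_request = total / requests_per_month * 100

print(f"API calls/day: {api_calls:,.0f}")
print(f"Encoding GPU seconds/day: {gpu_seconds:.1f}")
print(f"Raw vector storage: {raw_gb:.2f} GB")
print(f"Total: ${total}/month, {cents_per_request:.5f} cents per request")
```

The line items sum to $860, which the section rounds to roughly $900.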

FAQ

Why MiniLM and not a larger model? Larger sentence transformers (mpnet-base, multilingual-e5-large) push AUC up by two to four points but triple your encoding cost and double inference latency. Start with MiniLM; upgrade only when you have measured the gain on your specific labeled set.

Should I fine-tune the encoder on my data? Yes, but only after you have a labeled set of at least 50k positive pairs. Use a triplet loss with hard negatives mined from your own ANN index. Expect a 3 to 5 point AUC lift. For the data-handling and consent questions that training on public content raises, see our guide to fine-tuning LLMs on TikTok data.

How often should I re-embed videos? Once at ingestion is usually enough. Video text does not change after upload. Re-embed only if you change the embedding model.

What about video understanding from the actual pixels? Visual embeddings (CLIP, ViT) add a meaningful lift but cost roughly 50x more compute. Start with text embeddings, prove the funnel works, then add a visual modality as a second tower with a learned fusion layer.

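How do I handle multilingual content? Swap MiniLM for paraphrase-multilingual-MiniLM-L12-v2. It also outputs 384-dimensional vectors and covers 50+ languages, so your collection schema stays the same. The vectors are not comparable with all-MiniLM-L6-v2 output, though, so re-embed the full index when you switch.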

Do I need a graph database for the social signal? Not for v1. Followers and following are valuable, but you can store them as payload fields next to each vector in Qdrant and filter or boost on them. Graph DBs (Neo4j, NebulaGraph) become worth it when you want multi-hop traversal like "videos liked by people I follow."
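
On the fine-tuning answer above: mining hard negatives means picking, for each anchor, the items the current model scores highest that are not known positives. Here is a NumPy sketch of that step plus a cosine-distance triplet loss. The function names and the 0.5 margin are illustrative assumptions, and brute-force scoring stands in for the ANN index.

```python
import numpy as np

def mine_hard_negatives(anchor_vecs, item_vecs, positive_ids, per_anchor=1):
    """Return (anchor, positive, negative) index triplets using top-scoring non-positives."""
    scores = anchor_vecs @ item_vecs.T            # stand-in for an ANN query per anchor
    triplets = []
    for a, row in enumerate(scores):
        positives = positive_ids[a]               # set of item indices for this anchor
        ranked = [int(i) for i in np.argsort(-row)]
        negatives = [i for i in ranked if i not in positives][:per_anchor]
        for p in sorted(positives):
            for n in negatives:
                triplets.append((a, p, n))
    return triplets

def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(0, d(a, p) - d(a, n) + margin) with cosine distance on unit vectors."""
    d_pos = 1.0 - float(anchor @ positive)
    d_neg = 1.0 - float(anchor @ negative)
    return max(0.0, d_pos - d_neg + margin)
```

For actual training you would feed the mined triplets to a framework loss instead of this scalar version; the sketch only shows what the loss measures.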

Where to go from here

The full pipeline above gives you a discovery engine that handles cold start, scales to tens of millions of items, and serves recommendations in under 250ms end to end. The dataset side is the easy part: every endpoint mentioned here is documented and live in the playground. Read the posts reference for the exact comment and detail payload shapes, check the users reference for pagination patterns, browse the engineering blog for related deep dives, or open a thread on the contact page if you want to discuss your specific architecture. Existing customers can manage credentials and monitor usage from their profile dashboard.