TikTok API SLA Monitoring and Alerting Patterns

By TikLiveAPI Team · Published on May 29, 2026

Why Your TikTok Dependency Needs Real Observability

If your product depends on a third-party data source, the data source is part of your stack whether you like it or not. When TikLiveAPI is upstream of your feature, every TikTok endpoint you call becomes a node in your service graph that you cannot ssh into, cannot restart, and cannot patch. The only lever you have left is observability: knowing exactly when latency drifts, when error rates climb, and when your budget is on fire so you can shed load, fail over, or page a human before the customer notices.

This guide walks SRE and platform ops teams through a production-grade monitoring stack for TikLiveAPI. We will apply the four golden signals to a real workload, design SLOs per endpoint class, instrument both synthetic and real-user probes, calculate error budgets, configure burn-rate alerting through Prometheus and PagerDuty, and write runbooks that survive a 3 a.m. page. Code samples are Python and YAML, mirroring what most platform teams already run.

The Four Golden Signals Applied to TikLiveAPI

Latency, errors, traffic, and saturation. The signals are universal but the thresholds are not. TikLiveAPI advertises a 750 ms average response time across 7+ servers with 99.9% uptime, so your alerting baselines must respect that envelope rather than copy-pasting generic web defaults.

Latency. Track p50, p95, p99 per endpoint. A spike on /search-video/ is expected (search is heavier); a spike on /userid/ is a real incident.
Errors. Separate HTTP 4xx (your bug - bad params, wrong header, expired credits) from 5xx and timeouts (upstream issue). Page on 5xx burn rate, ticket on 4xx anomalies.
Traffic. Requests per second per endpoint. A sudden drop is as alarming as a spike, since it usually means your client broke or auth is misconfigured.
Saturation. Your credit balance, your local connection pool, and your retry queue depth. Credits are a saturation signal most teams forget until they get a 402.

Per-Endpoint SLO Design

Not every endpoint deserves the same SLO. Group the 37 endpoints into three tiers based on how user-facing they are in your product.

Tier 1 - interactive, blocking the user. These run in the request path of a page load and need tight SLOs. Examples: /userid/, /userinfo-by-username/, /post-detail/, /download-video/. Target: p95 below 1.2 s, availability 99.9% over 30 days.

Tier 2 - interactive but tolerant. Search and list endpoints where users expect a small spinner. /search-video/, /search-user/, /user-posts/, /post-comments/. Target: p95 below 2.5 s, availability 99.5%.

Tier 3 - batch / background. Pagination crawlers, /user-followers/ walking the time cursor, /user-following/ reading the followings array, /challenge-posts/ bulk ingest. Target: p95 below 6 s, availability 99.0%, retries permitted.

Encode the tiers as data so every system reads from one source:

slo_tiers:
  tier1_interactive:
    p95_ms: 1200
    availability: 0.999
    burn_rate_fast: 14.4
    burn_rate_slow: 6.0
    endpoints:
      - /userid/
      - /userinfo-by-username/
      - /userinfo-by-id/
      - /post-detail/
      - /download-video/
      - /download-music/
  tier2_search:
    p95_ms: 2500
    availability: 0.995
    burn_rate_fast: 10.0
    burn_rate_slow: 3.0
    endpoints:
      - /search-video/
      - /search-user/
      - /search-challenge/
      - /user-posts/
      - /post-comments/
      - /post-comment-replies/
  tier3_batch:
    p95_ms: 6000
    availability: 0.990
    burn_rate_fast: 6.0
    burn_rate_slow: 2.0
    endpoints:
      - /user-followers/
      - /user-following/
      - /challenge-posts/
      - /music-posts/
      - /playlist-posts/
      - /collection-posts/
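
One way to consume that single source is to invert it into an endpoint-to-tier lookup at startup. The sketch below assumes the YAML has already been parsed into a plain dict by whatever YAML library you use; the dict is a trimmed stand-in for the full file, and build_tier_lookup is a hypothetical helper name.

```python
# Trimmed stand-in for the parsed slo_tiers document; a real service would
# load the full YAML file instead of hard-coding this dict.
SLO_TIERS = {
    "tier1_interactive": {"p95_ms": 1200, "availability": 0.999,
                          "endpoints": ["/userid/", "/post-detail/"]},
    "tier2_search": {"p95_ms": 2500, "availability": 0.995,
                     "endpoints": ["/search-video/", "/user-posts/"]},
    "tier3_batch": {"p95_ms": 6000, "availability": 0.990,
                    "endpoints": ["/user-followers/", "/challenge-posts/"]},
}


def build_tier_lookup(slo_tiers):
    """Map each endpoint path to its (tier_name, tier_config) pair."""
    lookup = {}
    for tier_name, config in slo_tiers.items():
        for endpoint in config["endpoints"]:
            if endpoint in lookup:
                # An endpoint listed twice would make SLO reporting ambiguous.
                raise ValueError(f"{endpoint} appears in more than one tier")
            lookup[endpoint] = (tier_name, config)
    return lookup


TIER_LOOKUP = build_tier_lookup(SLO_TIERS)
tier_name, tier_config = TIER_LOOKUP["/search-video/"]
print(tier_name, tier_config["p95_ms"])
```

Failing loudly on a duplicated endpoint keeps the probe, the RUM middleware, and the alert rules from silently disagreeing about which SLO applies.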

Synthetic Monitoring with Periodic Probes

Synthetics catch problems before users do. Run a small probe every 60 seconds against the cheapest representative endpoint in each tier. /userid/ is perfect for Tier 1: it takes a known-good username, returns a flat {"id": "107955"}, and lets you assert both latency and payload shape.

import os
import time
import httpx
from prometheus_client import Counter, Histogram, start_http_server

BASE = "https://api.tikliveapi.com"
KEY = os.environ["TIKLIVEAPI_KEY"]

probe_latency = Histogram(
    "tikliveapi_probe_duration_seconds",
    "Synthetic probe latency",
    ["endpoint", "tier"],
    buckets=(0.1, 0.25, 0.5, 0.75, 1.0, 1.5, 2.5, 5.0, 10.0),
)
probe_result = Counter(
    "tikliveapi_probe_total",
    "Synthetic probe outcomes",
    ["endpoint", "tier", "outcome"],
)

PROBES = [
    ("/userid/", {"username": "tiktok"}, "tier1", lambda j: "id" in j),
    ("/userinfo-by-username/", {"username": "tiktok"}, "tier1",
     lambda j: "user" in j and "stats" in j),
    ("/search-video/", {"keyword": "cats", "count": 5}, "tier2",
     lambda j: "videos" in j and "hasMore" in j),
]

def probe_once(client):
    for path, params, tier, validate in PROBES:
        t0 = time.perf_counter()
        try:
            r = client.get(path, params=params, timeout=10.0)
            if r.status_code != 200:
                outcome = f"http_{r.status_code}"
            elif not validate(r.json()):
                outcome = "shape_mismatch"
            else:
                outcome = "ok"
        except Exception as exc:
            # Timeouts, connection errors, and invalid JSON all land here.
            outcome = type(exc).__name__
        # Record latency for every attempt so timeouts stay visible in the histogram.
        probe_latency.labels(path, tier).observe(time.perf_counter() - t0)
        probe_result.labels(path, tier, outcome).inc()

def main():
    start_http_server(9101)
    headers = {"X-Api-Key": KEY}
    with httpx.Client(base_url=BASE, headers=headers) as client:
        while True:
            probe_once(client)
            time.sleep(60)

if __name__ == "__main__":
    main()

Two notes. First, schema validation matters. Upstream APIs do occasionally rename fields; asserting that /user-following/ returns a key called followings (not following) catches contract drift early. Second, run probes from at least two geographies. Single-region synthetics lie when there is a regional network event.
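
A slightly richer validator than a top-level key check can report exactly which fields drifted. The following is a minimal sketch, not the probe's actual validator: missing_paths is a hypothetical helper, and the placement of uniqueId under user and the counters under stats is an assumption about the response layout.

```python
def missing_paths(payload, required_paths):
    """Return the dotted key paths that are absent from a JSON-like dict."""
    missing = []
    for path in required_paths:
        node = payload
        for key in path.split("."):
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                missing.append(path)
                break
    return missing


# Hypothetical sample; the field placement is assumed, not documented here.
sample = {
    "user": {"uniqueId": "tiktok"},
    "stats": {"followerCount": 1, "heartCount": 2},
}
required = ["user.uniqueId", "stats.followerCount", "stats.heartCount"]
print(missing_paths(sample, required))
print(missing_paths({"user": {}}, required))
```

Returning the list of missing paths, rather than a bare boolean, gives the on-call engineer something concrete to paste into a contract ticket.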

Real User Monitoring

Synthetics tell you the API is up. RUM tells you what your users actually experienced. Wrap your TikLiveAPI client in a middleware that records every real call with the same labels your synthetics use.

import time
from contextlib import contextmanager

import httpx
from prometheus_client import Counter, Histogram

rum_latency = Histogram(
    "tikliveapi_request_duration_seconds",
    "Real user request latency",
    ["endpoint", "tier", "status_class"],
    buckets=(0.1, 0.25, 0.5, 0.75, 1.0, 1.5, 2.5, 5.0, 10.0, 30.0),
)
rum_total = Counter(
    "tikliveapi_request_total",
    "Real user request outcomes",
    ["endpoint", "tier", "status_class"],
)

TIER_OF = {
    "/userid/": "tier1", "/userinfo-by-username/": "tier1",
    "/post-detail/": "tier1", "/download-video/": "tier1",
    "/search-video/": "tier2", "/search-user/": "tier2",
    "/user-followers/": "tier3", "/user-following/": "tier3",
}

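@contextmanager
def track(endpoint):
    # Endpoints missing from TIER_OF fall back to the loosest tier.
    tier = TIER_OF.get(endpoint, "tier3")
    t0 = time.perf_counter()
    # Default to 5xx so timeouts and transport errors count as upstream failures.
    status = "5xx"
    try:
        yield
        status = "2xx"
    except httpx.HTTPStatusError as e:
        # Raised only when the wrapped call uses response.raise_for_status().
        code = e.response.status_code
        status = f"{code // 100}xx"
        raise
    finally:
        elapsed = time.perf_counter() - t0
        rum_latency.labels(endpoint, tier, status).observe(elapsed)
        rum_total.labels(endpoint, tier, status).inc()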

From these two metrics you can derive p50, p95, p99, request rate, and error rate per endpoint with PromQL.

histogram_quantile(0.95,
  sum by (endpoint, le)
    (rate(tikliveapi_request_duration_seconds_bucket[5m])))

Error Budget Math

Pick a Tier 1 endpoint with 99.9% availability. Across a 30-day window your error budget is 0.1% of all requests. If your dashboard pushes 250,000 calls/month to /userinfo-by-username/, your budget is 250 failed calls. A sustained 14.4x burn rate exhausts the entire 30-day budget in 50 hours (720 h / 14.4) and spends 2% of it every hour; a sustained 6x burn exhausts it in 5 days and spends 5% of it every 6 hours. Tier 2 budgets are larger (1,250 errors per 250k), Tier 3 even more (2,500). Budget arithmetic is what turns "the API is flaky today" into "we have 41 hours of budget left this month, hold the change freeze."
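
The arithmetic is easy to get wrong by hand, so it can help to keep it in a small helper. This is a sketch under the same assumptions as the paragraph above (a 30-day window and a constant burn rate); the function names are illustrative.

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window


def error_budget(requests_per_window, availability):
    """Number of failed requests the SLO tolerates over the window."""
    return round(requests_per_window * (1 - availability))


def hours_to_exhaustion(burn_rate):
    """Hours until the whole budget is gone at a constant burn rate."""
    return WINDOW_HOURS / burn_rate


def budget_fraction_spent(burn_rate, hours):
    """Share of the window's budget consumed after burning for `hours`."""
    return burn_rate * hours / WINDOW_HOURS


print(error_budget(250_000, 0.999))
print(hours_to_exhaustion(14.4))
print(budget_fraction_spent(6, 6))
```

For the Tier 1 example this gives a budget of 250 failures, roughly 50 hours to exhaustion at 14.4x, and 5% of the budget spent after 6 hours at 6x.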

Alerting Strategy: Burn Rate, Not Raw Errors

The single biggest upgrade most teams can make is to stop paging on raw 5xx counts and start paging on multi-window burn-rate. A short window catches fast disasters; a long window catches slow corrosion. Page when both windows agree.

groups:
- name: tikliveapi-burn-rate
  rules:
  # Only 5xx (which includes timeouts) burns the budget; 4xx anomalies are
  # your own bugs and are handled as tickets, not pages.
  - alert: TikLiveAPITier1FastBurn
    expr: |
      (
        sum(rate(tikliveapi_request_total{tier="tier1",status_class="5xx"}[5m]))
        / sum(rate(tikliveapi_request_total{tier="tier1"}[5m]))
      ) > (14.4 * 0.001)
      and
      (
        sum(rate(tikliveapi_request_total{tier="tier1",status_class="5xx"}[1h]))
        / sum(rate(tikliveapi_request_total{tier="tier1"}[1h]))
      ) > (14.4 * 0.001)
    for: 2m
    labels:
      severity: page
      service: tikliveapi
      tier: tier1
    annotations:
      summary: "Tier 1 burning 14.4x error budget"
      runbook: "https://runbooks.example.com/tikliveapi-tier1"
  - alert: TikLiveAPITier1SlowBurn
    expr: |
      (
        sum(rate(tikliveapi_request_total{tier="tier1",status_class="5xx"}[30m]))
        / sum(rate(tikliveapi_request_total{tier="tier1"}[30m]))
      ) > (6 * 0.001)
      and
      (
        sum(rate(tikliveapi_request_total{tier="tier1",status_class="5xx"}[6h]))
        / sum(rate(tikliveapi_request_total{tier="tier1"}[6h]))
      ) > (6 * 0.001)
    for: 15m
    labels:
      severity: ticket
      service: tikliveapi
      tier: tier1
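
The "both windows agree" condition is simple enough to mirror in Python, which is handy for unit-testing thresholds before they reach Prometheus. A minimal sketch; burn_rate and should_fire are hypothetical names and the error ratios would come from your own queries.

```python
def burn_rate(error_ratio, availability):
    """How many times faster than allowed the budget is being spent."""
    return error_ratio / (1 - availability)


def should_fire(short_ratio, long_ratio, availability, factor):
    """Fire only when the short and the long window both exceed the factor."""
    return (burn_rate(short_ratio, availability) > factor
            and burn_rate(long_ratio, availability) > factor)


# Tier 1 (99.9%): a 2% error ratio is a 20x burn, a 1% ratio is a 10x burn.
print(should_fire(0.02, 0.02, 0.999, 14.4))  # both windows hot
print(should_fire(0.02, 0.01, 0.999, 14.4))  # only the short window is hot
```

The second call stays quiet because the long window has not confirmed the spike, which is the behavior that keeps a brief blip from paging anyone.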

Prometheus, Grafana, and PagerDuty Wiring

Run the probe exporter as a sidecar (port 9101), expose RUM metrics from the application (/metrics), and scrape both from Prometheus:

scrape_configs:
  - job_name: tikliveapi-synthetic
    scrape_interval: 30s
    static_configs:
      - targets: ["probe-eu:9101", "probe-us:9101"]
  - job_name: tikliveapi-rum
    scrape_interval: 15s
    static_configs:
      - targets: ["app-1:8000", "app-2:8000", "app-3:8000"]

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

In Alertmanager route severity: page to PagerDuty (use the Events API v2 integration) and severity: ticket to a Slack channel or Jira webhook.

route:
  receiver: slack-tickets
  group_by: [service, tier]
  routes:
    - match:
        severity: page
      receiver: pagerduty-oncall
      group_wait: 10s
      repeat_interval: 1h
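receivers:
  # Alertmanager does not expand environment variables in its config file.
  # Treat the ${...} values as placeholders filled in by your templating step,
  # or use routing_key_file / api_url_file to read secrets from disk.
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "${PD_ROUTING_KEY}"
        severity: critical
  - name: slack-tickets
    slack_configs:
      - api_url: "${SLACK_WEBHOOK}"
        channel: "#sre-tikliveapi"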

Runbook Templates

Runbook: /search-video/ degraded. Search is Tier 2 and naturally slower because of publish_time and sort_by filtering. Steps: confirm via synthetic dashboard that p95 is above 4 s for more than 10 minutes. Check whether the cursor param is being passed (deep pagination is heavier). Reduce count from 30 to 10 in the client and roll out. If still degraded, fail open by serving stale cached results for popular keywords. Open a ticket through the support contact page with timestamps in UTC and a sample request id.

Runbook: /userinfo-by-username/ returning shape_mismatch. The endpoint returns nested user and stats with camelCase fields (uniqueId, followerCount, heartCount). If the validator fails, do not assume an outage; check whether your contract test still matches the documented response. Pin a smoke test username and diff the JSON tree against the snapshot. If a field truly disappeared, file a contract issue and switch the affected feature flag off rather than retrying.

Runbook: /post-detail/ download URLs missing. Confirm that play, wmplay, and hdplay are all present in the flat snake_case response. If only hdplay is missing it usually means the source video lacks an HD master; gracefully degrade to play. Do not page unless missing across more than 5% of recent calls.
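
The degrade-then-page logic from this runbook can be sketched as two small helpers. Both names are hypothetical, the sample URLs are placeholders, and falling back only as far as play (not wmplay) is an assumption about what your product is willing to serve.

```python
def pick_video_url(post, preference=("hdplay", "play")):
    """Return (field, url) for the first non-empty URL field, else (None, None)."""
    for field in preference:
        url = post.get(field)
        if url:
            return field, url
    return None, None


def should_page(missing_count, total_calls, threshold=0.05):
    """True when more than `threshold` of recent calls lacked a usable URL."""
    if total_calls == 0:
        return False
    return missing_count / total_calls > threshold


# Placeholder payload: hdplay is absent, so the helper degrades to play.
post = {"play": "https://example.com/play.mp4",
        "wmplay": "https://example.com/wm.mp4"}
print(pick_video_url(post))
print(should_page(missing_count=3, total_calls=100))
```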

Runbook: /user-followers/ paging stuck. Pagination uses the time timestamp param, not a numeric cursor. Verify your client is passing back the top-level time value from the previous response, not the integer cursor used elsewhere. Watch for hasMore flipping to false prematurely.
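
A pagination loop that guards against a stuck cursor might look like the sketch below. fetch_page is a hypothetical callable that performs the HTTP request and returns the decoded JSON; the followers list key and the starting cursor value of 0 are assumptions, while time and hasMore come from the response as described above.

```python
def walk_followers(fetch_page, max_pages=100):
    """Yield follower records, passing each response's `time` back as the cursor."""
    cursor = 0  # assumed starting value for the first page
    for _ in range(max_pages):
        page = fetch_page(cursor)
        yield from page.get("followers", [])  # list key is an assumption
        if not page.get("hasMore"):
            return
        next_cursor = page.get("time")
        if next_cursor is None or next_cursor == cursor:
            # The cursor did not advance: stop rather than loop forever.
            raise RuntimeError(f"pagination stuck at time={cursor}")
        cursor = next_cursor


# Fake two-page upstream keyed by the cursor the client sends.
FAKE_PAGES = {
    0: {"followers": ["a", "b"], "hasMore": True, "time": 1700000200},
    1700000200: {"followers": ["c"], "hasMore": False, "time": 1700000100},
}
print(list(walk_followers(FAKE_PAGES.__getitem__)))
```

Raising on a cursor that does not advance turns a silent infinite loop into an error your RUM metrics and logs can surface.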

Runbooks cover the first fifteen minutes of a page; for severity classification, status-page comms, and postmortem structure, pair them with an incident response playbook for TikTok data outages.

Dashboard Recommendations

Tier overview row. Three columns (Tier 1/2/3) showing current p95, error rate, and budget remaining.
Per-endpoint heatmap. Latency histogram by endpoint, last 6 hours.
Synthetic vs RUM diff. If synthetics are green but RUM is red, the issue is your client; if both are red, the issue is upstream.
Credit saturation gauge. Pull your current credit balance from your account profile page on a 5-minute interval and alarm at 7-day projected exhaustion.
Top offenders table. Endpoints with the worst budget burn this week, sorted descending.
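
For the credit saturation gauge, the 7-day projection can be a simple linear extrapolation over recent balance samples. This is a rough sketch with hypothetical function names; it assumes you already collect (timestamp, balance) pairs and that consumption is steady enough for a straight line to be useful.

```python
SECONDS_PER_DAY = 86400


def days_until_exhaustion(samples):
    """Project days of credit left from (unix_seconds, balance) samples.

    Uses the first and last sample only; returns None when the balance is
    flat, rising (a top-up), or there is not enough data to project.
    """
    if len(samples) < 2:
        return None
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    elapsed_days = (t1 - t0) / SECONDS_PER_DAY
    if elapsed_days <= 0:
        return None
    burn_per_day = (b0 - b1) / elapsed_days
    if burn_per_day <= 0:
        return None
    return b1 / burn_per_day


def should_alarm(samples, horizon_days=7):
    """True when the projected exhaustion falls inside the horizon."""
    days = days_until_exhaustion(samples)
    return days is not None and days <= horizon_days


print(days_until_exhaustion([(0, 10_000), (SECONDS_PER_DAY, 9_000)]))
print(should_alarm([(0, 10_000), (SECONDS_PER_DAY, 8_000)]))
```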

SLO Review Meetings

Hold a 30-minute review every two weeks with engineering, product, and one customer support voice. Walk through budget consumption per tier, the top three incidents, and one experimental change. If a tier ate its budget, the next sprint freezes risky changes against that tier until a postmortem ships. If a tier consistently uses less than 10% of its budget for two cycles, tighten the SLO: you are delivering more reliability than the target gives you credit for. If you resell this data downstream, the same budget math should feed the support SLAs you promise your own API customers.

FAQ

Do I really need both synthetic and RUM? Yes. Synthetics give you a known baseline at a known cadence; RUM tells you what your customers got. Together they let you separate upstream incidents from your own client bugs in under a minute.

How do I count timeouts in the error budget? Timeouts are errors. A 15-second hang is worse than a 503 because it blocks the caller's worker. Wrap every call with a hard timeout (10 s for Tier 1, 20 s for Tier 2, 60 s for Tier 3) and emit a 5xx status class on expiry.
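
In code, that policy reduces to a timeout lookup and a rule that "no response in time" maps to the 5xx class. A minimal sketch with hypothetical helper names; falling back to the Tier 3 timeout for unknown tiers is an assumption.

```python
# Hard per-tier timeouts in seconds, taken from the guidance above.
TIMEOUT_SECONDS = {"tier1": 10.0, "tier2": 20.0, "tier3": 60.0}


def timeout_for(tier):
    """Return the hard timeout for a tier; unknown tiers get the batch value."""
    return TIMEOUT_SECONDS.get(tier, TIMEOUT_SECONDS["tier3"])


def status_class_of(status_code=None, timed_out=False):
    """Collapse a call outcome into the status_class label value."""
    if timed_out or status_code is None:
        # No response arrived in time: count it as an upstream failure.
        return "5xx"
    return f"{status_code // 100}xx"


print(timeout_for("tier1"), status_class_of(timed_out=True))
print(status_class_of(status_code=402))
```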

Should retries count against the budget? Count the user-visible outcome, not every wire attempt. If a retry succeeds and the user got a 200, log it as success and increment a separate retry counter for capacity planning.
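
A retry wrapper that follows this rule records exactly one outcome per logical call and tracks retries separately. The sketch below uses plain in-memory counters in place of Prometheus metrics and omits backoff for brevity; call_with_retries and the flaky stub are illustrative names.

```python
from collections import Counter

outcomes = Counter()  # one increment per user-visible result
retries = Counter()   # one increment per extra wire attempt


def call_with_retries(endpoint, attempt_fn, max_attempts=3):
    """Run attempt_fn up to max_attempts times, counting one final outcome."""
    last_exc = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = attempt_fn()
        except Exception as exc:
            last_exc = exc
            if attempt < max_attempts:
                retries[endpoint] += 1  # another attempt will follow
            continue
        outcomes[(endpoint, "success")] += 1
        return result
    outcomes[(endpoint, "failure")] += 1
    raise last_exc


# Stub upstream that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return {"id": "107955"}


print(call_with_retries("/userid/", flaky), dict(retries))
```

Only the outcomes counter should feed the error budget; the retries counter is the capacity-planning signal.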

How fast should I page? A 14.4x fast-burn window of 5 minutes with a 1-hour confirmation is the standard SRE recipe and works well here. Anything faster pages on noise; anything slower lets your budget melt before anyone wakes up.

What about cardinality? Label by endpoint, tier, and status_class. Do not label by username, userid, or keyword - that path leads to a million-series Prometheus disaster.

Where do I learn the response shapes? Start at the full endpoint documentation, then drill into the endpoint you care about. Try every endpoint live without writing code in the interactive API playground. Compare credit tiers on the pricing page and read field notes on the engineering blog. Authentication is always the same header: X-Api-Key: your-key.

Build the dashboards, write the runbooks, and let the burn-rate math do the paging. Your on-call will thank you.