If your product depends on a third-party data source, the data source is part of your stack whether you like it or not. When TikLiveAPI is upstream of your feature, every TikTok endpoint you call becomes a node in your service graph that you cannot ssh into, cannot restart, and cannot patch. The only lever you have left is observability: knowing exactly when latency drifts, when error rates climb, and when your budget is on fire so you can shed load, fail over, or page a human before the customer notices.
This guide walks SRE and platform ops teams through a production-grade monitoring stack for TikLiveAPI. We will apply the four golden signals to a real workload, design SLOs per endpoint class, instrument both synthetic and real-user probes, calculate error budgets, configure burn-rate alerting through Prometheus and PagerDuty, and write runbooks that survive a 3 a.m. page. Code samples are Python and YAML, mirroring what most platform teams already run.
The four golden signals are latency, errors, traffic, and saturation. The signals are universal, but the thresholds are not. TikLiveAPI advertises a 750 ms average response time across 7+ servers with 99.9% uptime, so your alerting baselines must respect that envelope rather than copy-pasting generic web defaults.
A latency spike on /search-video/ is expected (search is heavier); a spike on /userid/ is a real incident.
Not every endpoint deserves the same SLO. Group the 37 endpoints into three tiers based on how user-facing they are in your product.
Tier 1 - interactive, blocking the user. These run in the request path of a page load and need tight SLOs. Examples: /userid/, /userinfo-by-username/, /post-detail/, /download-video/. Target: p95 below 1.2 s, availability 99.9% over 30 days.
Tier 2 - interactive but tolerant. Search and list endpoints where users expect a small spinner. /search-video/, /search-user/, /user-posts/, /post-comments/. Target: p95 below 2.5 s, availability 99.5%.
Tier 3 - batch / background. Pagination crawlers, /user-followers/ walking the time cursor, /user-following/ reading the followings array, /challenge-posts/ bulk ingest. Target: p95 below 6 s, availability 99.0%, retries permitted.
Encode the tiers as data so every system reads from one source:
slo_tiers:
  tier1_interactive:
    p95_ms: 1200
    availability: 0.999
    burn_rate_fast: 14.4
    burn_rate_slow: 6.0
    endpoints:
      - /userid/
      - /userinfo-by-username/
      - /userinfo-by-id/
      - /post-detail/
      - /download-video/
      - /download-music/
  tier2_search:
    p95_ms: 2500
    availability: 0.995
    burn_rate_fast: 10.0
    burn_rate_slow: 3.0
    endpoints:
      - /search-video/
      - /search-user/
      - /search-challenge/
      - /user-posts/
      - /post-comments/
      - /post-comment-replies/
  tier3_batch:
    p95_ms: 6000
    availability: 0.990
    burn_rate_fast: 6.0
    burn_rate_slow: 2.0
    endpoints:
      - /user-followers/
      - /user-following/
      - /challenge-posts/
      - /music-posts/
      - /playlist-posts/
      - /collection-posts/
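One way to consume the tier file is to invert it into an endpoint-to-tier lookup, so that instrumentation code never keeps its own hand-written copy. The following is a minimal sketch, assuming the YAML has already been parsed into a dict (the literal below mirrors only a subset of it); `build_tier_lookup` is a hypothetical helper name, not part of any library.

```python
# Subset of the slo_tiers document, as it would look after YAML parsing.
SLO_TIERS = {
    "tier1_interactive": {
        "p95_ms": 1200,
        "availability": 0.999,
        "endpoints": ["/userid/", "/post-detail/"],
    },
    "tier2_search": {
        "p95_ms": 2500,
        "availability": 0.995,
        "endpoints": ["/search-video/", "/user-posts/"],
    },
    "tier3_batch": {
        "p95_ms": 6000,
        "availability": 0.990,
        "endpoints": ["/user-followers/", "/challenge-posts/"],
    },
}


def build_tier_lookup(tiers):
    """Invert tier -> endpoints into endpoint -> tier, rejecting duplicates."""
    lookup = {}
    for tier_name, spec in tiers.items():
        for endpoint in spec["endpoints"]:
            if endpoint in lookup:
                raise ValueError(
                    f"{endpoint} is listed in both {lookup[endpoint]} and {tier_name}"
                )
            lookup[endpoint] = tier_name
    return lookup


TIER_LOOKUP = build_tier_lookup(SLO_TIERS)
```

Failing loudly on a duplicate endpoint is a deliberate choice here: an endpoint that sits in two tiers would otherwise be judged against whichever SLO happened to be read last.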
Synthetics catch problems before users do. Run a small probe every 60 seconds against the cheapest representative endpoint in each tier. /userid/ is perfect for Tier 1: it takes a known-good username, returns a flat {"id": "107955"}, and lets you assert both latency and payload shape.
import os
import time

import httpx
from prometheus_client import Counter, Histogram, start_http_server

BASE = "https://api.tikliveapi.com"
KEY = os.environ["TIKLIVEAPI_KEY"]

probe_latency = Histogram(
    "tikliveapi_probe_duration_seconds",
    "Synthetic probe latency",
    ["endpoint", "tier"],
    buckets=(0.1, 0.25, 0.5, 0.75, 1.0, 1.5, 2.5, 5.0, 10.0),
)
probe_result = Counter(
    "tikliveapi_probe_total",
    "Synthetic probe outcomes",
    ["endpoint", "tier", "outcome"],
)

# Each probe: (path, query params, tier label, payload-shape validator).
PROBES = [
    ("/userid/", {"username": "tiktok"}, "tier1", lambda j: "id" in j),
    ("/userinfo-by-username/", {"username": "tiktok"}, "tier1",
     lambda j: "user" in j and "stats" in j),
    ("/search-video/", {"keyword": "cats", "count": 5}, "tier2",
     lambda j: "videos" in j and "hasMore" in j),
]


def probe_once(client):
    for path, params, tier, validate in PROBES:
        t0 = time.perf_counter()
        try:
            r = client.get(path, params=params, timeout=10.0)
            # Latency is recorded for every completed response, whatever its status.
            elapsed = time.perf_counter() - t0
            probe_latency.labels(path, tier).observe(elapsed)
            if r.status_code != 200:
                probe_result.labels(path, tier, f"http_{r.status_code}").inc()
                continue
            if not validate(r.json()):
                probe_result.labels(path, tier, "shape_mismatch").inc()
                continue
            probe_result.labels(path, tier, "ok").inc()
        except Exception as exc:
            # Timeouts, connection errors and invalid JSON are counted by exception type.
            probe_result.labels(path, tier, type(exc).__name__).inc()


def main():
    start_http_server(9101)  # expose /metrics for Prometheus to scrape
    headers = {"X-Api-Key": KEY}
    with httpx.Client(base_url=BASE, headers=headers) as client:
        while True:
            probe_once(client)
            time.sleep(60)


if __name__ == "__main__":
    main()
Two notes. First, schema validation matters. Upstream APIs do occasionally rename fields; asserting that /user-following/ returns a key called followings (not following) catches contract drift early. Second, run probes from at least two geographies. Single-region synthetics lie when there is a regional network event.
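The contract check described above can be kept as data rather than as scattered lambdas. This is a rough sketch: the `followings` key for /user-following/ comes from the note above and the other entries from the probe list, while `REQUIRED_KEYS` and `missing_keys` are hypothetical names introduced for illustration.

```python
# Required top-level keys per endpoint (only the ones named in this guide).
REQUIRED_KEYS = {
    "/userid/": {"id"},
    "/userinfo-by-username/": {"user", "stats"},
    "/search-video/": {"videos", "hasMore"},
    "/user-following/": {"followings"},
}


def missing_keys(endpoint, payload):
    """Return the sorted required top-level keys that are absent from payload."""
    required = REQUIRED_KEYS.get(endpoint, set())
    if not isinstance(payload, dict):
        # A non-object payload (list, string, null) satisfies nothing.
        return sorted(required)
    return sorted(required.difference(payload.keys()))


# A renamed field shows up as a named gap instead of a bare True/False.
drift = missing_keys("/user-following/", {"following": []})
```

Returning the names of the missing keys, rather than a boolean, gives the on-call engineer something concrete to paste into a contract issue.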
Synthetics tell you the API is up. RUM tells you what your users actually experienced. Wrap your TikLiveAPI client in a middleware that records every real call with the same labels your synthetics use.
import time
from contextlib import contextmanager

import httpx
from prometheus_client import Counter, Histogram

rum_latency = Histogram(
    "tikliveapi_request_duration_seconds",
    "Real user request latency",
    ["endpoint", "tier", "status_class"],
    buckets=(0.1, 0.25, 0.5, 0.75, 1.0, 1.5, 2.5, 5.0, 10.0, 30.0),
)
rum_total = Counter(
    "tikliveapi_request_total",
    "Real user request outcomes",
    ["endpoint", "tier", "status_class"],
)

TIER_OF = {
    "/userid/": "tier1", "/userinfo-by-username/": "tier1",
    "/post-detail/": "tier1", "/download-video/": "tier1",
    "/search-video/": "tier2", "/search-user/": "tier2",
    "/user-followers/": "tier3", "/user-following/": "tier3",
}


@contextmanager
def track(endpoint):
    tier = TIER_OF.get(endpoint, "tier3")  # unmapped endpoints default to the loosest tier
    t0 = time.perf_counter()
    # Default to 5xx so timeouts and transport errors count as failures.
    status = "5xx"
    try:
        yield
        status = "2xx"
    except httpx.HTTPStatusError as e:
        # Raised by response.raise_for_status(); keep the real status class.
        code = e.response.status_code
        status = f"{code // 100}xx"
        raise
    finally:
        elapsed = time.perf_counter() - t0
        rum_latency.labels(endpoint, tier, status).observe(elapsed)
        rum_total.labels(endpoint, tier, status).inc()
From these two metrics you can derive p50, p95, p99, request rate, and error rate per endpoint with PromQL.
histogram_quantile(
  0.95,
  sum by (endpoint, le) (
    rate(tikliveapi_request_duration_seconds_bucket[5m])
  )
)
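It helps to know how a quantile is estimated from buckets, because the result is only as precise as the bucket boundaries. The sketch below reproduces the general idea (find the bucket containing the target rank, then interpolate linearly inside it). It is an illustration written for this guide, not Prometheus source code, and the sample counts are made up.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound
    and ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up cumulative counts for buckets 0.5 s, 1.0 s, 1.5 s and +Inf.
sample = [(0.5, 50), (1.0, 90), (1.5, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, sample)
```

With these sample counts the p95 lands between the 1.0 s and 1.5 s boundaries, at roughly 1.31 s. Since the Tier 1 target of 1.2 s also falls between those two boundaries, adding a bucket at 1.2 would make that comparison exact instead of interpolated.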
Pick a Tier 1 endpoint with 99.9% availability. Across a 30-day window your error budget is 0.1% of all requests. If your dashboard pushes 250,000 calls/month to /userinfo-by-username/, your budget is 250 failed calls. Burn rate is the observed error ratio divided by the ratio the SLO allows, so a sustained 14.4x burn spends 2% of the 30-day budget in one hour and would exhaust it in about 50 hours (720 h / 14.4); a 6x burn spends 5% in six hours and exhausts it in 5 days. Tier 2 budgets are larger (1,250 errors per 250k), Tier 3 larger still (2,500). Budget arithmetic is what turns "the API is flaky today" into "we have 41 hours of budget left this month, hold the change freeze."
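The budget figures are easy to check by hand, but a few helper functions make them reusable in dashboards and review notes. A minimal sketch, assuming a 30-day window; the function names are illustrative.

```python
WINDOW_DAYS = 30


def error_budget(requests_per_window, availability):
    """Number of failed requests the SLO tolerates over the window."""
    return round(requests_per_window * (1 - availability))


def hours_to_exhaustion(burn_rate, window_days=WINDOW_DAYS):
    """Hours until the whole budget is gone at a constant burn rate."""
    return window_days * 24 / burn_rate


def budget_consumed(burn_rate, hours, window_days=WINDOW_DAYS):
    """Fraction of the window's budget spent after `hours` at `burn_rate`."""
    return burn_rate * hours / (window_days * 24)


tier1_budget = error_budget(250_000, 0.999)
tier2_budget = error_budget(250_000, 0.995)
tier3_budget = error_budget(250_000, 0.990)
fast_burn_hours = hours_to_exhaustion(14.4)
slow_burn_hours = hours_to_exhaustion(6.0)
```

For 250,000 monthly calls this yields budgets of 250, 1,250, and 2,500 failed requests for the three tiers, and exhaustion times of about 50 hours at 14.4x and 120 hours at 6x.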
The single biggest upgrade most teams can make is to stop paging on raw 5xx counts and start paging on multi-window burn-rate. A short window catches fast disasters; a long window catches slow corrosion. Page when both windows agree.
groups:
  - name: tikliveapi-burn-rate
    rules:
      - alert: TikLiveAPITier1FastBurn
        expr: |
          (
            sum(rate(tikliveapi_request_total{tier="tier1",status_class!="2xx"}[5m]))
            / sum(rate(tikliveapi_request_total{tier="tier1"}[5m]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(tikliveapi_request_total{tier="tier1",status_class!="2xx"}[1h]))
            / sum(rate(tikliveapi_request_total{tier="tier1"}[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
          service: tikliveapi
          tier: tier1
        annotations:
          summary: "Tier 1 burning 14.4x error budget"
          runbook: "https://runbooks.example.com/tikliveapi-tier1"
      - alert: TikLiveAPITier1SlowBurn
        expr: |
          (
            sum(rate(tikliveapi_request_total{tier="tier1",status_class!="2xx"}[30m]))
            / sum(rate(tikliveapi_request_total{tier="tier1"}[30m]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(tikliveapi_request_total{tier="tier1",status_class!="2xx"}[6h]))
            / sum(rate(tikliveapi_request_total{tier="tier1"}[6h]))
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: ticket
          service: tikliveapi
          tier: tier1
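The same two-window condition can be expressed in a few lines of Python, which is handy for unit-testing thresholds before they reach Prometheus. This is a sketch of the logic only; `burn_rate` and `should_fire` are hypothetical names, and the counts are invented.

```python
def burn_rate(errors, total, slo):
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0  # no traffic means no budget was spent
    return (errors / total) / (1 - slo)


def should_fire(short_window, long_window, slo, threshold):
    """Fire only when both windows exceed the burn-rate threshold.

    Each window is an (errors, total) tuple of request counts.
    """
    return (
        burn_rate(*short_window, slo) > threshold
        and burn_rate(*long_window, slo) > threshold
    )


# Invented counts: 2% errors over 5 minutes and 1.5% over the last hour.
page_now = should_fire((20, 1_000), (150, 10_000), slo=0.999, threshold=14.4)

# Same short spike, but the hour as a whole is nearly healthy (0.5% errors).
just_a_blip = should_fire((20, 1_000), (50, 10_000), slo=0.999, threshold=14.4)
```

The first case pages because both windows agree; the second does not, because the one-hour window shows the spike has not been sustained.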
Run the probe exporter as a sidecar (port 9101), expose RUM metrics from the application (/metrics), and scrape both from Prometheus:
scrape_configs:
  - job_name: tikliveapi-synthetic
    scrape_interval: 30s
    static_configs:
      - targets: ["probe-eu:9101", "probe-us:9101"]
  - job_name: tikliveapi-rum
    scrape_interval: 15s
    static_configs:
      - targets: ["app-1:8000", "app-2:8000", "app-3:8000"]
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
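In Alertmanager, route severity: page to PagerDuty (use the Events API v2 integration) and severity: ticket to a Slack channel or Jira webhook. Alertmanager does not expand environment variables in its configuration file, so treat ${PD_ROUTING_KEY} and ${SLACK_WEBHOOK} below as placeholders that your deployment tooling substitutes before the file is loaded; recent versions can also read these secrets from files via routing_key_file and api_url_file.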
route:
  receiver: slack-tickets
  group_by: [service, tier]
  routes:
    - match:
        severity: page
      receiver: pagerduty-oncall
      group_wait: 10s
      repeat_interval: 1h
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "${PD_ROUTING_KEY}"
        severity: critical
  - name: slack-tickets
    slack_configs:
      - api_url: "${SLACK_WEBHOOK}"
        channel: "#sre-tikliveapi"
Runbook: /search-video/ degraded. Search is Tier 2 and naturally slower because of publish_time and sort_by filtering. Steps: confirm via synthetic dashboard that p95 is above 4 s for more than 10 minutes. Check whether the cursor param is being passed (deep pagination is heavier). Reduce count from 30 to 10 in the client and roll out. If still degraded, fail open by serving stale cached results for popular keywords. Open a ticket on /contact/ with timestamps in UTC and a sample request id.
Runbook: /userinfo-by-username/ returning shape_mismatch. The endpoint returns nested user and stats with camelCase fields (uniqueId, followerCount, heartCount). If the validator fails, do not assume an outage; check whether your contract test still matches the documented response. Pin a smoke test username and diff the JSON tree against the snapshot. If a field truly disappeared, file a contract issue and switch the affected feature flag off rather than retrying.
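The "diff the JSON tree against the snapshot" step can be done on key paths alone, which ignores values that legitimately change (follower counts, for instance). A minimal sketch: `key_paths` and `diff_shape` are hypothetical helpers, and the placement of uniqueId under user and the counters under stats is an assumption made for the example.

```python
def key_paths(node, prefix=""):
    """Collect dotted key paths from a JSON tree; lists are sampled by first item."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(value, path)
    elif isinstance(node, list) and node:
        paths |= key_paths(node[0], prefix + "[]")
    return paths


def diff_shape(snapshot, live):
    """Report key paths that vanished from, or appeared in, the live payload."""
    old, new = key_paths(snapshot), key_paths(live)
    return {"missing": sorted(old - new), "added": sorted(new - old)}


# Assumed layout for illustration; only the field names come from the runbook.
snapshot = {"user": {"uniqueId": "tiktok"},
            "stats": {"followerCount": 1, "heartCount": 2}}
live = {"user": {"uniqueId": "tiktok"},
        "stats": {"followerCount": 5, "heart": 9}}

report = diff_shape(snapshot, live)
```

Here the report lists stats.heartCount as missing and stats.heart as added, which points at a rename rather than an outage.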
Runbook: /post-detail/ download URLs missing. Confirm that play, wmplay, and hdplay are all present in the flat snake_case response. If only hdplay is missing it usually means the source video lacks an HD master; gracefully degrade to play. Do not page unless missing across more than 5% of recent calls.
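The graceful degradation described here amounts to an ordered preference list. A short sketch, assuming the three fields sit at the top level of the parsed response as the runbook states; the ordering after hdplay (play before wmplay) and the `pick_download_url` name are choices made for this example.

```python
# Preferred order: HD first, then the standard rendition, then wmplay last.
PREFERENCE = ("hdplay", "play", "wmplay")


def pick_download_url(post):
    """Return (field_name, url) for the best available rendition, or (None, None)."""
    for field in PREFERENCE:
        url = post.get(field)
        if url:  # skips missing keys, None and empty strings
            return field, url
    return None, None


# Hypothetical payload where the HD rendition is absent.
chosen = pick_download_url({"play": "https://example.com/v.mp4", "hdplay": ""})
```

Returning the field name alongside the URL lets the caller increment a per-rendition counter, which is what the "more than 5% of recent calls" paging rule needs as input.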
Runbook: /user-followers/ paging stuck. Pagination uses the time timestamp param, not a numeric cursor. Verify your client is passing back the top-level time value from the previous response, not the integer cursor used elsewhere. Watch for hasMore flipping to false prematurely.
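A pagination loop that guards against this failure mode might look like the sketch below. Several details are assumptions rather than documented behaviour: the `followers` key holding the items, 0 as the first-page value for time, and the `fetch_page` callable, which stands in for whatever client function performs the request.

```python
def walk_followers(fetch_page, user_id, max_pages=100):
    """Yield followers page by page, feeding the returned `time` value back in."""
    cursor = 0  # assumption: 0 requests the first page
    seen = {cursor}
    for _ in range(max_pages):
        page = fetch_page(user_id=user_id, time=cursor)
        yield from page.get("followers", [])
        if not page.get("hasMore"):
            return
        next_cursor = page.get("time")
        if next_cursor is None or next_cursor in seen:
            # The same cursor twice means the crawl would loop forever.
            raise RuntimeError(f"pagination stuck at time={next_cursor!r}")
        seen.add(next_cursor)
        cursor = next_cursor


# Fake two-page upstream used to exercise the loop without network access.
FAKE_PAGES = {
    0: {"followers": ["a", "b"], "hasMore": True, "time": 1700000000},
    1700000000: {"followers": ["c"], "hasMore": False, "time": 0},
}


def fake_fetch(user_id, time):
    return FAKE_PAGES[time]


followers = list(walk_followers(fake_fetch, "107955"))
```

The `max_pages` cap and the repeated-cursor check turn a silent infinite crawl into an explicit error that a Tier 3 alert can count.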
Hold a 30-minute review every two weeks with engineering, product, and one customer support voice. Walk through budget consumption per tier, the top three incidents, and one experimental change. If a tier ate its budget, the next sprint freezes risky changes against that tier until a postmortem ships. If a tier consistently runs under 10% of budget for two cycles, tighten the SLO; you are paying for reliability you do not need to advertise.
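The review policy above is simple enough to encode, which keeps the meeting from relitigating it each cycle. A sketch under the stated rules only; the function name and the "freeze" / "tighten" / "hold" labels are invented for illustration.

```python
def review_action(consumed_by_cycle):
    """Decide the follow-up for one tier.

    consumed_by_cycle: fraction of the error budget used in each review
    cycle, oldest first (1.0 means the whole budget was spent).
    """
    if not consumed_by_cycle:
        return "hold"
    if consumed_by_cycle[-1] >= 1.0:
        return "freeze"  # budget exhausted: freeze risky changes until the postmortem ships
    recent = consumed_by_cycle[-2:]
    if len(recent) == 2 and all(used < 0.10 for used in recent):
        return "tighten"  # two quiet cycles in a row: the SLO is looser than needed
    return "hold"


tier1_decision = review_action([0.40, 1.20])
tier3_decision = review_action([0.05, 0.08])
```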
Do I really need both synthetic and RUM? Yes. Synthetics give you a known baseline at a known cadence; RUM tells you what your customers got. Together they let you separate upstream incidents from your own client bugs in under a minute.
How do I count timeouts in the error budget? Timeouts are errors. A 15-second hang is worse than a 503 because it blocks the caller's worker. Wrap every call with a hard timeout (10 s for Tier 1, 20 s for Tier 2, 60 s for Tier 3) and emit a 5xx status class on expiry.
Should retries count against the budget? Count the user-visible outcome, not every wire attempt. If a retry succeeds and the user got a 200, log it as success and increment a separate retry counter for capacity planning.
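A retry wrapper that follows this accounting rule records exactly one outcome per logical call and tracks the extra wire attempts separately. The sketch below uses plain `collections.Counter` objects in place of Prometheus metrics and omits backoff for brevity; `call_with_retries` is an illustrative name.

```python
from collections import Counter

outcomes = Counter()  # one increment per user-visible result
retries = Counter()   # extra wire attempts, for capacity planning only


def call_with_retries(attempt, endpoint, max_attempts=3):
    """Call attempt() up to max_attempts times (must be >= 1)."""
    last_exc = None
    for n in range(max_attempts):
        try:
            result = attempt()
        except Exception as exc:
            last_exc = exc
            continue
        retries[endpoint] += n  # n failed attempts preceded this success
        outcomes[(endpoint, "success")] += 1
        return result
    retries[endpoint] += max_attempts - 1
    outcomes[(endpoint, "failure")] += 1
    raise last_exc


# A flaky upstream that fails once and then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("first attempt dropped")
    return {"id": "107955"}


result = call_with_retries(flaky, "/userid/")
```

After this run the user-visible outcome is a single success and the retry counter holds one extra attempt, so the error budget is untouched while capacity planning still sees the doubled wire traffic.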
How fast should I page? A 14.4x fast-burn window of 5 minutes with a 1-hour confirmation is the standard SRE recipe and works well here. Anything faster pages on noise; anything slower lets your budget melt before anyone wakes up.
What about cardinality? Label by endpoint, tier, and status_class. Do not label by username, userid, or keyword - that path leads to a million-series Prometheus disaster.
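A back-of-the-envelope estimate makes the cardinality warning concrete. The sketch assumes each label combination of a histogram produces one series per configured bucket plus the +Inf bucket, a sum, and a count (some client versions export an additional created-timestamp series); the figure of four status classes and 100,000 usernames are illustrative inputs.

```python
def histogram_series(endpoints, status_classes, buckets, extra_label_values=1):
    """Estimate time series produced by one labelled histogram.

    The tier label adds nothing here because each endpoint maps to one tier.
    """
    per_combination = buckets + 3  # configured buckets, +Inf, _sum, _count
    return endpoints * status_classes * extra_label_values * per_combination


# 37 endpoints, 4 status classes, the 10 RUM buckets defined earlier.
safe = histogram_series(endpoints=37, status_classes=4, buckets=10)

# The same histogram with a username label holding 100,000 distinct values.
disaster = histogram_series(37, 4, 10, extra_label_values=100_000)
```

The bounded label set stays under two thousand series in the worst case, while the username label multiplies that by the number of distinct users.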
Where do I learn the response shapes? Start at /documentation/, then drill into the endpoint you care about. Try every endpoint live without writing code at /playground/. Compare pricing tiers at /pricing/ and read field notes on /blog/. Authentication is always the same header: X-Api-Key: your-key.
Build the dashboards, write the runbooks, and let the burn-rate math do the paging. Your on-call will thank you.