Incident Response Playbook for TikTok-Data Outages

Published on May 29, 2026

When a TikTok-data product goes down at 3 AM, the difference between a fifteen-minute blip and a four-hour customer-facing outage is almost never the actual fix. It is the runbook, the rotation, and the rituals around the fix. This playbook collects what we have learned operating TikLiveAPI and watching other teams run TikTok-scraping stacks. It is opinionated, blameless, and meant to be copied.

Incident classification: P1 through P4

Before you can respond, you have to agree on what an incident even is. A flat ladder with four severities is enough for almost every TikTok-data product. Anything more granular gets argued about at 4 AM and slows the page-out.

P1: customer-facing down

The dashboard at the product URL returns 5xx, the API endpoint at https://api.tikliveapi.com returns 5xx or times out for more than 60 seconds, or paid customers cannot authenticate with their X-Api-Key at all. P1 means everyone is paged. Primary on-call, secondary on-call, engineering lead, and the customer success lead all receive a phone call. Status page goes red. Public Twitter post within 15 minutes.

P2: partial or degraded

One endpoint family is down while others work. For example, /post-detail/ is throwing 502s because the upstream TikTok web endpoint that powers no-watermark play URLs changed shape, but /userid/ and /userinfo-by-username/ still respond cleanly. Degradation without a hard failure also counts: p95 latency at 3x normal, or an error rate between 1% and 10%. Page primary only, status page goes yellow, customer email goes out within an hour.

P3: batch slowness

Async work is backed up. The /user-posts/ background fetcher is two hours behind, or the daily stats aggregation that writes stats/archive/{date}.json did not finish on time. No paying customer hits a 5xx in the moment, but if it persists into business hours someone will notice. Slack-only page, no public status page entry unless it is unresolved at 8 AM local for the largest customer cohort.

P4: cosmetic

A chart on the profile page renders an off-by-one date. A 200 response has a stale cached field. The documentation page shows the wrong example response for a deprecated path. File a ticket, do not page anyone, fix in normal sprint cadence.
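
To make the ladder harder to argue about at 4 AM, it can help to write it down as a function. The following is a minimal sketch, not how TikLiveAPI necessarily implements it: the Signals field names are hypothetical, and the ladder does not say what an error rate above 10% means, so this sketch leaves it at P2 for the IC to escalate.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Observed symptoms. Every field name is a hypothetical illustration."""
    api_down_seconds: int = 0           # API returning 5xx or timing out
    dashboard_5xx: bool = False         # product dashboard returns 5xx
    auth_broken: bool = False           # paid customers cannot authenticate at all
    endpoint_family_down: bool = False  # one family fails while others work
    error_rate: float = 0.0             # fraction of failing requests, 0.0 to 1.0
    p95_ratio: float = 1.0              # current p95 latency divided by normal p95
    async_backlog: bool = False         # background work is behind schedule


def classify(s: Signals) -> str:
    """Return the severity for a set of signals, checking the worst case first."""
    if s.dashboard_5xx or s.auth_broken or s.api_down_seconds > 60:
        return "P1"
    # Assumption: error rates above 10% stay at P2 here; the IC can escalate.
    if s.endpoint_family_down or s.p95_ratio >= 3.0 or s.error_rate >= 0.01:
        return "P2"
    if s.async_backlog:
        return "P3"
    return "P4"
```

Checking the most severe condition first means a P1 signal always wins when several symptoms show up at once.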

The three outage modes of TikTok-data products

Almost every real incident on a TikTok-scraping stack collapses into one of three modes. Memorizing them turns triage from a search problem into a lookup.

Mode 1: upstream TikTok API down or changed

TikTok periodically rotates its internal endpoints, signature schemes, and response shapes. You will see this as a sudden spike in 5xx on one or two endpoints, never all of them at once. The tell is that /userid/ (a simple username lookup that returns a single id) keeps working while /post-detail/ (which has to negotiate the watermarked vs no-watermark play URL) breaks. Fix is in the scraper layer, not the dashboard.

Mode 2: your database is the bottleneck

MySQL connection pool exhausted, slow query lock on the users or transactions table, or disk full on the stats partition. Symptom is that every endpoint slows down at the same time, including the marketing pages, because PHP cannot get a connection from baglan.php. Login page itself starts returning 5xx. This is the scariest mode because it kills your status page too if you self-host it.

Mode 3: queue or worker backed up

The most common mode. Background workers that hit upstream TikTok are stuck on a slow batch, or rate limited, and the queue depth balloons. Synchronous endpoints stay green, but anything async (post-comments paginated crawl, multi-user enrichment) misses SLA. The fix is almost always to drain the queue with extra workers and add a circuit breaker on the upstream call.
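
The lookup described in this section can be sketched as a small function. This is an illustration under assumptions: the parameter names and the idea of a single queue threshold are hypothetical, and a real triage would feed these inputs from your monitoring system.

```python
def triage_mode(failing_endpoints: set, all_endpoints: set,
                marketing_pages_slow: bool,
                queue_depth: int, queue_threshold: int) -> str:
    """Map observed symptoms to one of the three outage modes."""
    # Mode 2: everything degrades at once, including pages that never touch TikTok.
    everything_failing = bool(failing_endpoints) and failing_endpoints == all_endpoints
    if marketing_pages_slow or everything_failing:
        return "database"
    # Mode 1: one or two endpoints break while simple lookups keep working.
    if failing_endpoints:
        return "upstream"
    # Mode 3: synchronous endpoints are green but async work is piling up.
    if queue_depth > queue_threshold:
        return "queue"
    return "unknown"
```

An "unknown" result is itself useful: it tells the IC that the incident does not fit the memorized patterns and needs real investigation.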

The runbook structure: detect, declare, communicate, mitigate, RCA

Every runbook in your repo should follow this five-phase shape. If a runbook does not have all five sections, it is not done.

Detect

Alerting must fire before a customer tweets. Synthetic checks against three endpoints, one per shape: /userid/ (single key response), /userinfo-by-username/ (nested user and stats objects), and /post-detail/ (flat object with play, hdplay, wmplay). If any of those three fail twice in a row, the on-call gets paged. Synthetic checks should use a dedicated test API key with its own quota so they do not eat into paid credits.
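
A rough sketch of the shape checks and the twice-in-a-row rule follows. It makes several assumptions: the "user" and "stats" key names, the string type of the play fields, and the injected fetch callable are all hypothetical, so adjust them to the real responses. The fetch function is passed in so the logic can be tested without touching the network.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: path -> (HTTP status, parsed JSON body)
Fetch = Callable[[str], Tuple[int, dict]]


def valid_userid(body: dict) -> bool:
    # Single-key shape; an empty id counts as a failure.
    return isinstance(body.get("id"), str) and body["id"] != ""


def valid_userinfo(body: dict) -> bool:
    # Nested shape; the key names "user" and "stats" are assumed.
    return isinstance(body.get("user"), dict) and isinstance(body.get("stats"), dict)


def valid_post_detail(body: dict) -> bool:
    # Flat shape; all three play fields must be non-empty strings (type assumed).
    return all(isinstance(body.get(k), str) and body[k]
               for k in ("play", "hdplay", "wmplay"))


CHECKS = {
    "/userid/": valid_userid,
    "/userinfo-by-username/": valid_userinfo,
    "/post-detail/": valid_post_detail,
}


class ConsecutiveFailures:
    """Counts failures in a row per endpoint; any success resets the count."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.counts: Dict[str, int] = {}

    def record(self, endpoint: str, ok: bool) -> bool:
        """Record one result; return True when the on-call should be paged."""
        if ok:
            self.counts[endpoint] = 0
            return False
        self.counts[endpoint] = self.counts.get(endpoint, 0) + 1
        return self.counts[endpoint] >= self.threshold


def run_checks(fetch: Fetch, tracker: ConsecutiveFailures) -> List[str]:
    """Run every shape check once; return the endpoints that should page."""
    to_page = []
    for path, validator in CHECKS.items():
        try:
            status, body = fetch(path)
            ok = status == 200 and validator(body)
        except Exception:
            ok = False  # timeouts and malformed bodies count as failures
        if tracker.record(path, ok):
            to_page.append(path)
    return to_page
```

Validating the response shape, not just the status code, is what catches the "200 with an empty field" failures that upstream changes tend to produce.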

Declare

The on-call writes one Slack message in #incidents: severity, summary, IC (incident commander), start time. That message becomes the canonical incident channel link. Once declared, only the IC can change severity.

Communicate

Status page first, customer comms second, public comms third. Do not flip the order. See the templates below.

Mitigate

Mitigation is not the same as fixing. Mitigation is "customers stop hurting." Rolling back a deploy mitigates. Failing over to a secondary region mitigates. A code patch is a fix, not a mitigation, and it is usually the wrong first move during a P1.

RCA

Root cause analysis happens after the incident is closed. Blameless template below. The RCA owner is the IC, not the person who pushed the broken code.

War room rituals

The war room is a real-time voice channel plus a dedicated Slack thread. It opens the moment a P1 or P2 is declared. Rituals that have saved us:

  • Roll call every 15 minutes. IC asks each role (engineering, support, comms) for a one-sentence status. Silence is not acceptable. Anyone can say "no change."
  • One IC at a time. If the IC needs to debug, they hand off the IC role explicitly. "I am handing IC to X" goes in the thread. No silent handoffs.
  • Hypothesis before action. Before anyone runs a SQL query or pushes a config, they say what they expect to happen and why. This prevents two engineers from stomping on each other's changes.
  • Timeline in the thread. Every meaningful event (alert fired, hypothesis formed, config changed, customer email sent) gets a message with a UTC timestamp. The RCA is written from this thread, not from memory.
  • One status page update per ritual cycle. Even if nothing has changed, "still investigating, next update in 15 minutes" is a status page update.

Status page communication template

Your public status page (we host ours at /status/) needs three message templates. Customers do not want prose, they want shape: what is broken, what they should do, when you will update next.

Investigating template

[INVESTIGATING] Elevated error rates on /post-detail/

We are seeing elevated 5xx responses on the /post-detail/
endpoint affecting watermark-free video URL retrieval.
Other endpoints (/userid/, /userinfo-by-username/,
/user-posts/) are unaffected. We are investigating.

Next update: 15 minutes.

Identified template

[IDENTIFIED] Upstream shape change on /post-detail/

We have identified an upstream change affecting how the
play and hdplay fields are returned for /post-detail/.
A fix is being deployed. /userid/, /userinfo-by-username/,
/userinfo-by-id/, /user-posts/, and /post-comments/
remain fully operational.

Next update: 30 minutes or on resolution.

Resolved template

[RESOLVED] /post-detail/ fully restored

The /post-detail/ endpoint has returned to normal error
rates and latency as of [UTC timestamp]. A postmortem
will be published within 5 business days.

Total customer-impacting duration: [HH:MM].

Customer communications: Slack, email, in-app

Three channels, three audiences, three tones.

Shared Slack channels

Your biggest customers have a shared Slack channel with your team. During a P1 affecting their tier, post within 10 minutes. Tone is technical and direct. Link to the status page entry, do not duplicate its content.

Email

For P1 and P2 affecting paid customers, an email goes out within 60 minutes of declaration. Use the same shape as the status page but expand the customer-impact section. Include a link to /contact/ for follow-up questions. Do not include credit or refund commitments in the first email, that is a separate decision made after the incident closes.

In-app banner

The dashboard shows a yellow or red banner driven by the same data source as the status page. Logged-in users see the banner. The banner links to the public status page entry. Banner is removed within 5 minutes of resolution, not left to rot.

Blameless postmortem template

The postmortem is not about finding who pushed the bad code. It is about finding the system conditions that allowed a single bad push (or upstream change, or queue backup) to become a customer-facing incident.

## Incident: [one-line summary]

Severity: P[1-4]
Duration: [HH:MM] customer-impacting
Date: [YYYY-MM-DD UTC]
IC: [name]
Author: [name, usually IC]

## Impact
- Customers affected: [count or %]
- Endpoints affected: [list]
- Revenue impact: [estimate or "TBD"]
- Credit/refund decision: [link to ticket]

## Timeline (UTC)
HH:MM - [event]
HH:MM - [event]
HH:MM - [event]

## Root cause
[2-3 paragraphs. No names of individuals.
Focus on system conditions.]

## What went well
- [bullet]
- [bullet]

## What went poorly
- [bullet]
- [bullet]

## Where we got lucky
- [bullet]

## Action items
| # | Item | Owner | Due | Ticket |
|---|------|-------|-----|--------|
| 1 | ... | ... | ... | ... |

The "where we got lucky" section is the most important. It captures near-misses that did not become outages this time but will next time if you do not fix them.

Action items tracking

Every postmortem action item gets a ticket with a real due date. Action items without due dates do not exist. The engineering lead reviews open action items in a monthly meeting. Items more than 60 days overdue get escalated or explicitly killed; quiet death is not allowed because it teaches the team that postmortems are theater.
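
The 60-day rule is easy to automate ahead of the monthly review. A minimal sketch, assuming a hypothetical ActionItem record exported from your ticket tracker:

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ActionItem:
    """Hypothetical shape of a postmortem action item."""
    ticket: str
    owner: str
    due: date
    done: bool = False


def needs_escalation(items: List[ActionItem], today: date,
                     max_overdue_days: int = 60) -> List[ActionItem]:
    """Return open items that are more than max_overdue_days past their due date."""
    return [item for item in items
            if not item.done and (today - item.due).days > max_overdue_days]
```

Each item this returns should leave the review either escalated or explicitly killed, never just carried over.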

MTTR vs MTBF: pick one to optimize

MTTR (mean time to recovery) measures how fast you get out of an incident. MTBF (mean time between failures) measures how often you have incidents. For a young TikTok-data product, optimize MTTR first. You will have incidents because upstream TikTok will keep changing things you do not control. Spending engineering effort on MTBF for upstream-driven failures has diminishing returns; spending it on fast detection, fast rollback, and good runbooks pays back on every incident.

Once your MTTR is under 30 minutes for P1 and under 2 hours for P2, then turn the dial toward MTBF. Add circuit breakers, add request hedging, add multi-region for the dashboard so a single DB outage does not kill the status page.
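
Both numbers can be computed from the start and end timestamps already recorded in your postmortems. The sketch below uses one common simplification, which is an assumption on our part: MTBF is taken as the average gap between the end of one incident and the start of the next.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Incident = Tuple[datetime, datetime]  # (impact start, impact end), both UTC


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time to recovery: average customer-impacting duration."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def mtbf(incidents: List[Incident]) -> timedelta:
    """Mean time between failures: average gap from one recovery to the next failure."""
    if len(incidents) < 2:
        raise ValueError("need at least two incidents")
    ordered = sorted(incidents)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)
```

Computing these per severity (P1 separately from P2) keeps the 30-minute and 2-hour targets comparable against real data.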

On-call rotation design

Primary plus secondary

One primary on-call carries the pager. One secondary is reachable for escalation, handoff if primary is unreachable, and pair-debugging on P1s. Never run a single-person rotation. People get sick, lose phones, sleep through alerts.

Follow-the-sun

If you have engineers in two or more time zones, run two shifts that hand off at a fixed time daily. The handoff is a 10-minute call with the incoming on-call. Open incidents, active investigations, and "things I am watching" all transfer. Do not rely on Slack for handoffs.

Fair compensation

On-call is work. Pay for it. A flat weekly stipend for carrying the pager, plus an hourly multiplier for time spent in active incidents outside business hours. If you cannot afford to pay for on-call, you cannot afford to run a paid API. Customers will notice when burned-out on-calls miss pages.

Rotation length

One week, starting Monday morning, ending the following Monday morning. Two-week rotations cause burnout. Daily rotations cause amnesia (the on-call never builds context). One week is the right unit.
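
A Monday-to-Monday schedule with a primary and a secondary can be generated in a few lines. This is a sketch under one assumption that the text above does not prescribe: each week's secondary is the next week's primary, so people arrive at their primary week with some context.

```python
from datetime import date, timedelta
from typing import List, Tuple

Shift = Tuple[date, date, str, str]  # (start Monday, end Monday, primary, secondary)


def weekly_rotation(engineers: List[str], start: date, weeks: int) -> List[Shift]:
    """Build one-week shifts starting on the first Monday on or after start."""
    if len(engineers) < 2:
        raise ValueError("never run a single-person rotation")
    # date.weekday() returns 0 for Monday.
    first_monday = start + timedelta(days=(7 - start.weekday()) % 7)
    schedule = []
    for week in range(weeks):
        begin = first_monday + timedelta(weeks=week)
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((begin, begin + timedelta(weeks=1), primary, secondary))
    return schedule
```

The ValueError for a one-person list encodes the rule from the primary plus secondary section directly in the tooling.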

Runbook examples for the three modes

Runbook: upstream TikTok API down or shape-changed

Detect: synthetic check on the affected endpoint fails. Declare P1 if it is a core endpoint (/userid/, /userinfo-by-username/, /post-detail/), P2 if it is a secondary one (/post-comments/, /user-posts/). Mitigate by flipping the affected endpoint into a graceful-degradation mode that returns a documented error code rather than a 5xx, so customer code can branch. Fix by patching the scraper layer. RCA must include a check of all 37 endpoints (per /documentation/) to see if any others use the same upstream call.

curl -H "X-Api-Key: $TEST_KEY" \
  "https://api.tikliveapi.com/userid/?username=tiktok"

# Expected: 200 with {"id": "..."}
# Outage: 5xx, or 200 with empty id field

Runbook: database down or saturated

Detect: dashboard pages return 5xx, including marketing pages. Declare P1 immediately. Mitigate by failing over to a read replica if one is available, restarting the MySQL service, or killing long-running queries via SHOW PROCESSLIST and KILL. Communicate via the status page hosted outside the affected DB (this is why /status/ is on a separate host from the dashboard). The fix is operational; the RCA covers connection pool sizing and slow query review.
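
For the long-running-query step, one cautious approach is to generate the KILL statements for review instead of executing them blindly. The sketch below assumes the rows from SHOW PROCESSLIST have already been fetched as dictionaries keyed by the standard column names (Id, Command, Time); the 30-second threshold is a placeholder, not a recommendation.

```python
from typing import Dict, List


def kill_statements(processlist: List[Dict], max_seconds: int = 30) -> List[str]:
    """Build KILL statements for client queries running longer than max_seconds."""
    statements = []
    for row in processlist:
        # Only target executing statements; skip sleeping and replication threads.
        if row["Command"] == "Query" and row["Time"] > max_seconds:
            statements.append(f"KILL {row['Id']};")
    return statements
```

Printing the statements first fits the hypothesis-before-action ritual: the engineer can say which queries they are about to kill and why before running anything.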

Runbook: queue backed up

Detect: queue depth metric crosses threshold. Declare P3 by default, escalate to P2 if it persists 30 minutes. Mitigate by spinning up extra workers, draining the queue, and applying a circuit breaker on the upstream call if rate limiting is the cause. Fix is autoscaling rules and circuit breaker tuning.
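
A circuit breaker on the upstream call can be quite small. The following is a minimal sketch of the general pattern, not a description of any particular library: the class name, thresholds, and injectable clock are all illustrative choices.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the upstream call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after: float = 60.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock              # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None           # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("upstream calls are paused")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, opens the breaker.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Workers that catch CircuitOpenError can requeue the job with a delay instead of hammering a rate-limited upstream, which is what lets the queue drain once the limit lifts.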

Integration with PagerDuty and OpsGenie

The pager tool is not the source of truth. Your monitoring system (Prometheus, Datadog, or hand-rolled cron checks) is the source of truth. PagerDuty or OpsGenie is the delivery channel.

Map severity to escalation policy:

  • P1: Phone call primary, 5-minute escalation to secondary, 10-minute escalation to engineering lead.
  • P2: Push and SMS primary, 15-minute escalation to secondary.
  • P3: Slack-only, no escalation outside business hours.
  • P4: Ticket-only, never pages.
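
The mapping above can be kept as data next to your alert rules so it is reviewed like code. This sketch is tool-agnostic and does not call any PagerDuty or OpsGenie API; the channel and target names are hypothetical labels, and the channel used for escalated steps is an assumption since the list only names the first one.

```python
from typing import Dict, List, Tuple

# severity -> list of (minutes unacknowledged, channel, target)
ESCALATION: Dict[str, List[Tuple[int, str, str]]] = {
    "P1": [(0, "phone", "primary"),
           (5, "phone", "secondary"),
           (10, "phone", "engineering_lead")],
    "P2": [(0, "push+sms", "primary"),
           (15, "push+sms", "secondary")],
    "P3": [(0, "slack", "primary")],  # no further escalation
    "P4": [],                         # ticket only, never pages
}


def who_to_notify(severity: str, minutes_unacked: int) -> List[str]:
    """Return every target that should have been notified by this point."""
    return [target for delay, _channel, target in ESCALATION[severity]
            if minutes_unacked >= delay]
```

For example, who_to_notify("P1", 7) returns the primary and the secondary, because the 5-minute step has passed but the 10-minute step has not.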

Use the pager tool's API to auto-create incidents from monitoring alerts. Do not let humans click "create incident" - they will forget under stress.

FAQ

How long should a P1 take to resolve?

Aim for under 30 minutes of customer impact. Anything over 60 minutes for a P1 is a signal that your detection, your runbook, or your rollback path is broken. Investigate the process, not the incident.

Should we publish all postmortems?

Publish P1 postmortems publicly with timeline and root cause. Keep internal-only sections for revenue impact and specific customer details. Customers respect transparency more than they respect uptime claims.

What if the on-call is a junior engineer?

Pair them with a senior on the secondary slot for their first three rotations. The senior takes IC on P1s while the junior shadows. By rotation four, the junior takes IC on P2 and P3.

How do we handle credit refunds for outages?

Pre-decide your SLA before incidents happen. A common shape: a monthly service credit covering 10x the duration of P1 customer impact, capped at one month of the plan price. Refund decisions made during an incident are always wrong. See /pricing/ for current plan structure and link the SLA from there.
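
One way to turn that shape into a number is to prorate the plan price by time. This is only one possible reading and rests on assumptions: a 30-day month, and credit calculated as plan price times 10 times the impacted fraction of the month. Your actual SLA wording should define both.

```python
def sla_credit(impact_minutes: float, plan_price: float,
               multiplier: float = 10.0,
               minutes_in_month: int = 30 * 24 * 60) -> float:
    """Prorated credit for P1 impact, capped at one month of the plan price."""
    # Assumption: credit scales linearly with the impacted share of the month.
    credit = plan_price * multiplier * impact_minutes / minutes_in_month
    return round(min(credit, plan_price), 2)
```

Under these assumptions, a 60-minute P1 on a 100.00 plan yields a credit of 1.39, and the cap only applies once impact exceeds three days in a month.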

Do we need a separate status page provider?

If your dashboard and your status page share infrastructure, a DB outage takes both down. At minimum, host the status page on a different provider or region. If that is not possible, have a fallback Twitter account that customers know to check.

How often should we run incident drills?

Quarterly minimum. Pick a random Tuesday, page the on-call with a fake P1, run the full ritual including a fake status page post (clearly marked as a drill). The first three drills will expose more process gaps than your last six real incidents.

An incident response program is not a document, it is a habit. The teams that recover fastest are the teams that have practiced the rituals when nothing is on fire. Start with the classification ladder, get your three runbooks written, run a drill, and iterate. If you want to talk through how we run this on TikLiveAPI specifically, our /contact/ page is the fastest way to reach the team, and the live state of our own services is always at /status/.
