About Site Reliability Engineer interviews
SRE interviews sit at the intersection of software engineering and systems operations, and the loop reflects that dual identity. After a recruiter screen, you'll typically face a hiring manager conversation probing your production incident history, then a multi-stage technical loop: a coding round (usually data structures, log parsing, or automation scripts rather than LeetCode-hard puzzles), at least one system design or distributed-systems interview, and a dedicated troubleshooting or debugging round where you're handed a degrading service and asked to diagnose it live. Larger companies (Google- and Meta-style shops) often add a non-abstract large system design (NALSD) round focused on capacity, sharding, and failure domains. A final round usually covers on-call culture, blameless postmortems, and cross-team collaboration with product engineers.
What interviewers screen for is judgement under uncertainty: can you reason about SLOs, error budgets, and tradeoffs rather than just recite tools? Candidates stumble most often by talking exclusively about tooling (Kubernetes, Terraform, Prometheus) without demonstrating the underlying reasoning; interviewers want to know *why* you reach for a tool and what you'd do when it isn't there. Other common failures: jumping to a fix in the debugging round before forming a hypothesis, designing systems without considering failure modes, and treating reliability as purely reactive firefighting rather than as engineering work that reduces toil. Strong candidates frame everything through reliability outcomes, measurable risk, and sustainable operations.
Typical stages
- Recruiter screen
- Hiring manager interview
- Coding round
- System design / distributed systems
- Troubleshooting / debugging round
- Final / on-call culture & values
Common formats
- Behavioral STAR
- Live coding
- System design
- Live troubleshooting exercise
- Incident retrospective walkthrough
What hiring managers screen for
- Reasoning about SLOs, error budgets and measurable reliability tradeoffs
- Structured debugging under uncertainty with hypothesis-driven investigation
- Bias toward automation and reducing operational toil rather than heroics
- Blameless, collaborative approach to incidents and postmortems
- Software engineering depth, not just operational tool familiarity
Red flags to avoid
- Tool name-dropping with no reasoning about why or when to use them
- Jumping straight to a fix in debugging scenarios without forming a hypothesis
- Designing systems with no consideration of failure domains or graceful degradation
- Blaming individuals for incidents rather than systems and process
- Treating reliability as reactive firefighting with no investment in toil reduction
Primary questions (15)
Behavioural
Tell me about the most severe production incident you've been involved in. Walk me through your role from detection to resolution.
Why this comes up: Incident response is the core of SRE work and reveals how you behave under real pressure.
Prep pointers
- Pick an incident where you had a clear, owned role — not one you merely observed.
- STAR Situation/Task: quantify impact (users affected, revenue, SLA breach) and your specific responsibility (IC, comms lead, debugger).
- STAR Action should show structured triage: how you scoped blast radius, mitigated before root-causing, and communicated to stakeholders.
- STAR Result should include the follow-up — the postmortem and durable fixes, not just 'we restored service'.
- Avoid hero narratives; emphasise process and what made resolution repeatable.
Behavioural
Describe a time you pushed back on shipping a feature or change because of reliability concerns.
Why this comes up: SREs must hold the line on reliability against product delivery pressure, and interviewers test that spine.
Prep pointers
- Frame the tension explicitly: speed/feature value versus reliability risk.
- STAR Action should show you brought data (error budget burn, load test results) rather than just opinion.
- Show collaboration — how you offered a path forward (guardrails, staged rollout) not just a 'no'.
- Result should capture both the outcome and how you preserved the working relationship.
- Avoid sounding obstructionist; show you understood the business need you were balancing against.
Behavioural
Tell me about a piece of operational toil you eliminated through automation.
Why this comes up: Reducing toil is a defining SRE responsibility and signals an engineering rather than ops mindset.
Prep pointers
- Quantify the toil before: hours per week, error rate, on-call burden, frequency.
- STAR Action should explain how you measured the problem, chose to automate, and validated the automation.
- Result should show the durable saving and any second-order benefits (fewer pages, faster recovery).
- Mention how you guarded against the automation itself becoming a fragile dependency.
- Avoid presenting a one-off script with no measurable or lasting impact.
Behavioural
Describe a postmortem you led or contributed to where the root cause was uncomfortable to surface.
Why this comes up: Blameless postmortem culture is central to SRE, and this tests psychological safety and intellectual honesty.
Prep pointers
- Choose an example where systemic or process causes were involved, not a single person's mistake.
- STAR Action should demonstrate the blameless framing you used and how you kept it about systems.
- Highlight how you drove concrete, owned action items rather than vague 'be more careful' outcomes.
- Result should show a measurable reduction in recurrence or improved detection.
- Avoid any language that assigns individual blame, even subtly.
Technical
Design a globally distributed, highly available URL shortener that must serve billions of redirects with low latency.
Why this comes up: System design at scale tests your grasp of availability, caching, sharding and failure domains.
Prep pointers
- Clarify requirements first: read/write ratio, latency SLO, consistency needs, durability.
- Walk through data model, ID generation strategy, and how you'd shard and replicate.
- Explicitly address failure modes: region loss, cache stampede, hot keys, and graceful degradation.
- Tie design choices back to SLOs and capacity planning, not just 'add more nodes'.
- Avoid designing only the happy path — interviewers wait to see if you raise failures unprompted.
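One way to make the "ID generation strategy" pointer concrete is to encode a unique integer ID as a short base62 string. The sketch below assumes IDs come from some counter or ID service that is not shown; the function names and the alphabet ordering are illustrative choices, not a prescribed design.

```python
import string

# 62 symbols: 0-9, a-z, A-Z. The ordering is an arbitrary choice for this sketch.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def encode_base62(n: int) -> str:
    """Turn a non-negative integer ID into a short URL-safe code."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode_base62(code: str) -> int:
    """Inverse of encode_base62; raises ValueError on unknown characters."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

In an interview it is worth saying what this leaves open: where the integers come from (a single counter is itself a failure domain) and whether sequential codes being guessable matters for the product.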
Technical
A service's p99 latency has degraded 4x over the last hour while p50 is unchanged. How do you investigate?
Why this comes up: Live troubleshooting under partial information is the signature SRE skill the debugging round screens for.
Prep pointers
- State your hypotheses before touching anything — tail latency suggests contention, GC, slow dependency, or a noisy subset.
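- Describe how you'd apply the USE method (utilisation, saturation, errors) to resources and the RED method (rate, errors, duration) to requests, and which signals (saturation, queue depth, dependency latency) you'd check first.
- Reason aloud about why p50 is stable: most requests are unaffected, which narrows the search to a subset of requests or resources.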
- Show you'd mitigate (shed load, rollback, failover) in parallel with diagnosis if SLO is at risk.
- Avoid jumping to 'restart the service' or guessing a fix before forming a hypothesis.
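The reasoning about a stable p50 and a degraded p99 can be illustrated with a small sketch. It assumes request records are available as dictionaries with hypothetical `host` and `latency_ms` fields, and uses a simple nearest-rank percentile; real monitoring systems usually compute percentiles from histograms instead.

```python
import math
from collections import defaultdict


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p99_by(requests, key):
    """Group requests by a dimension (host, endpoint, ...) and report p99 per group."""
    groups = defaultdict(list)
    for r in requests:
        groups[r[key]].append(r["latency_ms"])
    return {k: percentile(v, 99) for k, v in groups.items()}


# Synthetic data: host "b" serves 20 slow requests, everything else is 20 ms.
requests = (
    [{"host": "a", "latency_ms": 20}] * 500
    + [{"host": "b", "latency_ms": 20}] * 480
    + [{"host": "b", "latency_ms": 80}] * 20
)
latencies = [r["latency_ms"] for r in requests]
overall_p50 = percentile(latencies, 50)
overall_p99 = percentile(latencies, 99)
per_host = p99_by(requests, "host")
```

With this synthetic data only 2% of requests are slow, so the median does not move while the tail does, and slicing by host points at where to look next. The same slicing works for endpoint, customer, or dependency.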
Technical
Explain how you would define and implement SLOs and error budgets for a payment-processing service.
Why this comes up: SLO/error-budget literacy distinguishes mature SREs from operators and underpins reliability decision-making.
Prep pointers
- Distinguish SLI, SLO and SLA clearly and choose meaningful SLIs (availability, latency, correctness) for payments.
- Discuss how you'd measure from the user's perspective and over what time window.
- Explain how the error budget gates release velocity and triggers policy (freeze, focus on reliability).
- Note the special correctness/consistency stakes of payments versus a content service.
- Avoid picking arbitrary numbers like '99.99%' without justifying them against user expectations and cost.
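The error-budget arithmetic behind these pointers is short enough to sketch. This assumes a request-based availability SLI (good requests divided by total requests over the window); the function name and the returned field names are illustrative.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarise error-budget consumption for a request-based availability SLO.

    slo_target is a fraction such as 0.999; counts cover one SLO window.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed = (1 - slo_target) * total_requests  # failures the SLO tolerates
    consumed = failed_requests / allowed if allowed else 0.0
    return {
        "allowed_failures": allowed,
        "remaining_failures": allowed - failed_requests,
        "budget_consumed": consumed,  # 1.0 means the budget is fully spent
        "exhausted": consumed >= 1.0,  # a policy might freeze releases here
    }
```

For example, a 99.9% target over 1,000,000 requests tolerates about 1,000 failures, so 250 failures consume roughly a quarter of the budget. For payments you would likely track a separate correctness SLI as well, since a request that succeeds with the wrong amount is worse than one that fails.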
Technical
Write a script that parses a large log file and reports the top N endpoints by error rate, given memory constraints.
Why this comes up: SRE coding rounds favour practical automation and data wrangling over abstract algorithms.
Prep pointers
- Clarify input format, scale, and whether the file fits in memory before coding.
- Discuss streaming/single-pass processing and use of a heap for top-N to respect memory limits.
- Talk through edge cases: malformed lines, zero-traffic endpoints, division-by-zero on error rate.
- Mention testability and how you'd validate correctness on sample data.
- Avoid loading the entire file into memory or ignoring the stated constraint.
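A minimal single-pass sketch of this exercise follows. The log format is an assumption made for illustration: four whitespace-separated fields, `<timestamp> <method> <endpoint> <status>`, with status 500 and above counted as an error. Memory grows with the number of distinct endpoints, not with the size of the file.

```python
import heapq
from collections import defaultdict


def top_error_endpoints(lines, n, min_requests=1):
    """Return up to n (endpoint, error_rate) pairs, highest error rate first.

    lines can be any iterable of strings, including an open file handle,
    so the input is streamed rather than loaded whole.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or not parts[3].isdigit():
            continue  # skip malformed lines instead of crashing
        endpoint, status = parts[2], int(parts[3])
        totals[endpoint] += 1
        if status >= 500:
            errors[endpoint] += 1
    # Only endpoints that were seen are in totals, so the division is safe.
    rates = (
        (errors.get(endpoint, 0) / total, endpoint)
        for endpoint, total in totals.items()
        if total >= min_requests
    )
    # nlargest keeps a heap of size n rather than sorting every endpoint.
    return [(endpoint, rate) for rate, endpoint in heapq.nlargest(n, rates)]
```

Usage would look like `with open(path) as f: top_error_endpoints(f, 10)`, where `path` is whatever file you are given. The `min_requests` parameter is one way to stop an endpoint with a single failed request from topping the list; whether that is wanted is a good clarifying question to ask.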
Situational
You're on call and get paged at 3am for an alert you don't understand, with no runbook. What do you do?
Why this comes up: Tests on-call judgement, escalation discipline, and how you handle ambiguity in real operations.
Prep pointers
- Show your triage order: assess user impact and SLO risk first, then let the severity you've established drive what you do next.
- Explain when and how you'd escalate rather than burning hours alone — escalation is not failure.
- Describe mitigation-first thinking and capturing notes for the postmortem and a future runbook.
- Mention the follow-up: fixing the alert quality and creating the missing runbook.
- Avoid implying you'd either silently struggle for hours or escalate instantly without any investigation.
Situational
A product team wants to deploy on Fridays, but your team's policy discourages it. They escalate. How do you handle it?
Why this comes up: Reliability governance versus delivery autonomy is a recurring source of friction SREs must navigate.
Prep pointers
- Frame your reasoning around risk and recovery capacity, not bureaucratic rule-following.
- Discuss data-driven middle ground: change confidence, automated rollback, on-call coverage as enabling conditions.
- Show empathy for the team's velocity needs and a collaborative resolution path.
- Note how you'd revisit policy if the data shows it's overly restrictive.
- Avoid coming across as a gatekeeper who blocks rather than enables safe delivery.
Situational
Your monitoring is generating so many alerts that the team is suffering alert fatigue. How do you fix it?
Why this comes up: Alert quality and on-call sustainability are everyday SRE problems that reveal your operational maturity.
Prep pointers
- Start by distinguishing symptom-based, user-impacting alerts from cause-based noise.
- Describe auditing alert actionability and tying alerts to SLO burn rather than raw thresholds.
- Discuss measuring on-call health (pages per shift, false-positive rate) to track improvement.
- Cover the cultural angle: getting team buy-in to delete or downgrade noisy alerts.
- Avoid simply raising thresholds without reasoning about what you might now miss.
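Tying alerts to SLO burn rather than raw thresholds can be sketched as a burn-rate check. Burn rate is the observed error ratio divided by the error ratio the SLO allows. The two-window structure and the example threshold of 14.4 (spending 2% of a 30-day budget within one hour, since 0.02 × 720 hours = 14.4) are illustrative values, not a recommendation for any particular service.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1 - slo_target)


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold=14.4):
    """Page only if both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert clears soon after recovery.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )
```

With a 99.9% SLO, a sustained 2% error ratio is a burn rate of about 20 and would page, while a brief spike that has already subsided in the short window would not.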
Competency
How do you decide whether a reliability problem should be solved with better code, better infrastructure, or better process?
Why this comes up: Tests the judgement and prioritisation that separates senior SREs from tactical fixers.
Prep pointers
- Show a framework: frequency, blast radius, cost of the failure mode, and recurrence likelihood.
- Give an example where the right answer was process or culture, not engineering.
- Discuss how you weigh long-term toil reduction against short-term mitigation.
- Reference how you'd use data (incident trends, postmortem themes) to decide.
- Avoid implying every problem is best solved by writing more automation.
Competency
How do you approach capacity planning for a service whose traffic is growing unpredictably?
Why this comes up: Capacity and cost management are core SRE responsibilities that test forecasting and tradeoff skills.
Prep pointers
- Discuss combining historical trends, headroom targets, and load testing to set capacity.
- Explain how autoscaling helps but doesn't remove the need for planning around hard limits and lead times.
- Address the cost-versus-reliability tradeoff and how you'd justify headroom.
- Mention handling step-change events (launches, marketing spikes) versus organic growth.
- Avoid claiming autoscaling alone solves capacity planning.
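The headroom reasoning can be turned into a back-of-the-envelope calculation. This sketch assumes you know a forecast peak and a per-instance throughput from load testing; the parameter names, the 30% default headroom, and the single spare instance are illustrative assumptions.

```python
def instances_needed(peak_rps: int, per_instance_rps: int,
                     headroom_pct: int = 30, spare_instances: int = 1) -> int:
    """Instances required to serve peak traffic with headroom plus spares.

    Integer arithmetic avoids float rounding pushing the result up by one.
    spare_instances covers losing that many instances while still at peak.
    """
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    if peak_rps < 0 or headroom_pct < 0 or spare_instances < 0:
        raise ValueError("inputs must be non-negative")
    demand = peak_rps * (100 + headroom_pct)
    supply_per_instance = per_instance_rps * 100
    base = -(-demand // supply_per_instance)  # ceiling division
    return base + spare_instances
```

For a forecast peak of 10,000 requests per second and 500 per instance, this gives 26 instances for load plus one spare. The useful interview discussion is about the inputs: how the peak was forecast, how the per-instance figure was measured, and what the headroom costs.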
Culture fit
How do you keep the relationship between SRE and product engineering teams healthy rather than adversarial?
Why this comes up: SRE only works as a partnership; interviewers screen for collaboration over a policing mentality.
Prep pointers
- Describe concrete practices: shared SLOs, embedded SREs, joint postmortems, production readiness reviews.
- Show you treat reliability as a shared goal you enable, not a standard you impose.
- Give an example of building trust with a team that initially resisted SRE involvement.
- Mention how you make reliability data visible and actionable for product teams.
- Avoid framing SRE as the team that says 'no' or owns reliability in isolation.
Culture fit
What does a healthy on-call rotation look like to you, and how would you improve an unhealthy one?
Why this comes up: On-call sustainability signals whether you'll burn out a team or build a durable practice.
Prep pointers
- Define healthy in measurable terms: pages per shift, sleep impact, fair rotation, compensation.
- Discuss reducing the page load itself as the real fix, not just rotating people faster.
- Show you value the human side — escalation safety, handoffs, and follow-the-sun where viable.
- Reference time allocated to toil reduction so on-call improves over time.
- Avoid treating heavy on-call as an unavoidable rite of passage.
More practice questions (14)
Technical
Explain the difference between liveness and readiness probes in Kubernetes and when each matters.
Why this comes up: Container orchestration fundamentals come up constantly for SREs operating on Kubernetes.
Technical
How does a cache stampede happen and what techniques prevent it?
Why this comes up: Caching failure modes are a common distributed-systems pitfall interviewers probe.
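One common prevention technique is request coalescing: when many callers miss on the same key at once, only one of them recomputes the value and the rest wait for it. The sketch below is a minimal in-process version using threads. The class name is hypothetical, and it deliberately omits expiry, jittered TTLs, and cleanup of per-key locks, which a real cache would need.

```python
import threading


class CoalescingCache:
    """In-memory cache where concurrent misses on one key trigger one load."""

    def __init__(self, loader):
        self._loader = loader            # function: key -> value (the slow call)
        self._values = {}
        self._locks = {}                 # one lock per key being loaded
        self._guard = threading.Lock()   # protects the two dicts above

    def get(self, key):
        with self._guard:
            if key in self._values:
                return self._values[key]
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have filled the key while we waited.
            with self._guard:
                if key in self._values:
                    return self._values[key]
            value = self._loader(key)    # only one thread per key reaches here
            with self._guard:
                self._values[key] = value
            return value
```

Other techniques worth naming in an answer include adding jitter to TTLs so keys do not expire together and refreshing hot keys before they expire.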
Technical
Walk me through what happens, end to end, when a user's request times out hitting your service.
Why this comes up: Tests depth of understanding across the full request path and timeout propagation.
Technical
How would you design a deployment pipeline that supports safe canary releases and automatic rollback?
Why this comes up: Progressive delivery and automated rollback are central to reducing release risk.
Technical
What's the difference between availability measured by uptime versus by successful requests, and which would you choose?
Why this comes up: SLI selection nuance reveals whether you measure reliability from the user's perspective.
Situational
A dependency you don't own is causing your SLO breaches. What's your plan?
Why this comes up: Tests cross-team influence and resilience patterns when you lack direct control.
Situational
You discover a single point of failure in a critical system the week before a major launch. What do you do?
Why this comes up: Tests risk triage and pragmatic decision-making under time pressure.
Behavioural
Tell me about a time your fix made an incident worse before it got better.
Why this comes up: Reveals intellectual honesty and how you recover from your own mistakes.
Behavioural
Describe a time you had to learn an unfamiliar system quickly to resolve an outage.
Why this comes up: On-call frequently demands rapid learning in unfamiliar territory.
Competency
How do you prioritise reliability work against feature requests when both compete for your time?
Why this comes up: Tests prioritisation and how you advocate for non-feature engineering investment.
Competency
How do you decide what to monitor for a brand-new service with no operational history?
Why this comes up: Observability design from first principles is a frequent SRE task.
Technical
Explain how you'd debug intermittent packet loss between two services in different availability zones.
Why this comes up: Network-layer troubleshooting tests depth beyond the application stack.
Culture fit
How do you spread reliability knowledge so the team doesn't depend on a single expert?
Why this comes up: Tests whether you reduce bus-factor risk and build resilient teams, not silos.
Situational
Leadership asks you to cut infrastructure costs by 30% without hurting reliability. How do you approach it?
Why this comes up: Cost-efficiency versus reliability tradeoffs are an increasingly common SRE mandate.