Question 1

You are onboarding a backend engineer who has never used Datadog before. Explain the difference between metrics, distributed traces, and logs, and give a concrete example of when you would reach for each one to debug a latency problem in a microservices API.

Accepted Answer

A strong answer explains that metrics are numeric aggregations good for alerting on trends, traces capture the causal chain across service boundaries to find which hop is slow, and logs carry the raw event detail for understanding why a specific request failed. The debugging example should show the three being used in sequence: alert from a p99 latency metric, trace to find the slow downstream call, log on that service to read the specific error.

Question 2

Most engineering orgs have a leveling framework. What is it supposed to achieve, how do you use it in practice for promotions and calibration, and where does it break down?

Accepted Answer

Strong candidates describe the dual purpose of leveling frameworks: clarity of expectations for engineers and calibration across managers. In practice, they structure promotion write-ups around the framework, citing concrete evidence for each dimension. Failure modes include treating levels as checklists, over-indexing on tenure, and company-specific level definitions that do not translate across organizations.

Question 3

Explain the difference between an SLA, an SLO, and an SLI. As an engineering manager, not an SRE, when and how do you engage with these concepts in your team's work?

Accepted Answer

Strong candidates accurately define each: SLI is the measured metric, SLO is the target, SLA is the contractual commitment. As an EM, they engage when setting reliability objectives for a service, defending the error budget concept to product when reliability investments are undervalued, and using SLO burn rate as an input to prioritization.
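To make the three terms concrete, here is a minimal sketch of how they relate numerically. The function name, the request counts, and the 99.9% and 99.5% thresholds are illustrative assumptions, not values taken from any particular contract or tool.

```python
def reliability_status(good: int, total: int, slo: float, sla: float) -> dict:
    """Compare a measured SLI against an internal SLO and a contractual SLA.

    good/total are request counts over the evaluation window (hypothetical inputs).
    slo and sla are availability targets expressed as fractions, e.g. 0.999.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    # SLI: the measured indicator, here the fraction of successful requests.
    sli = good / total if total else 1.0
    # Error budget: the fraction of requests the SLO allows to fail.
    budget = 1.0 - slo
    return {
        "sli": sli,
        "meets_slo": sli >= slo,   # internal target
        "meets_sla": sli >= sla,   # external commitment, usually looser than the SLO
        "budget_spent": (1.0 - sli) / budget,  # 1.0 means the budget is exhausted
    }


status = reliability_status(good=99_700, total=100_000, slo=0.999, sla=0.995)
print(status)
```

In this example the service misses its SLO while still meeting the SLA, which is the situation where an EM would typically argue for reliability work before a contractual breach occurs.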

Question 4

What are the four DORA metrics, what does each measure, and how do you use them in practice to assess and improve your team's delivery performance?

Accepted Answer

Strong candidates accurately name deployment frequency, lead time for changes, change failure rate, and mean time to restore. They describe how to baseline each metric, identify the constraint that is keeping the team in a lower performance bucket, and use the metrics as inputs to improvement investments rather than as targets to game. They also acknowledge that elite performance on all four simultaneously is rare.
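One way to baseline the four metrics is to compute them from a team's deployment records. The sketch below assumes a hypothetical `Deployment` record with commit, deploy, and restore timestamps; real data would come from the team's own CI/CD and incident tooling, and the exact definitions (for example, what counts as a failed change) are choices the team has to make.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # caused a production failure
    restored_at: Optional[datetime] = None  # when service was restored


def dora_summary(deploys: list, window_days: int) -> dict:
    """Compute the four DORA metrics over a window (illustrative definitions)."""
    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deploys
    ]
    failures = [d for d in deploys if d.failed]
    restore_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_hours) if lead_hours else None,
        "change_failure_rate": len(failures) / len(deploys) if deploys else None,
        "mean_time_to_restore_hours": (
            sum(restore_hours) / len(restore_hours) if restore_hours else None
        ),
    }


t0 = datetime(2024, 1, 1, 9, 0)
h = timedelta(hours=1)
sample = [
    Deployment(t0, t0 + 2 * h),
    Deployment(t0, t0 + 4 * h, failed=True, restored_at=t0 + 5 * h),
    Deployment(t0 + 24 * h, t0 + 30 * h),
    Deployment(t0 + 24 * h, t0 + 32 * h, failed=True, restored_at=t0 + 35 * h),
]
print(dora_summary(sample, window_days=2))
```

Median is used for lead time here because a few long-lived branches can skew a mean; that is a design choice, not part of the metric's definition.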

Question 5

Many engineering teams run sprint planning, standups, retrospectives, and reviews as ritual rather than with purpose. For each ceremony, what is its real purpose, and when is it appropriate to eliminate or significantly change it?

Accepted Answer

Strong candidates describe each ceremony's original intent accurately: planning to commit to a shared goal, standup to surface blockers not report status, retrospective to improve process, review to get feedback from stakeholders. They are comfortable recommending eliminating standups for a small senior team that communicates asynchronously, or replacing retrospectives with direct feedback channels when the ritual has become rote.

Question 6

You do not own the CI/CD pipeline directly but your team's delivery speed depends on it. What do you need to understand about continuous integration and continuous deployment to make good decisions as an engineering manager, and what metrics do you monitor?

Accepted Answer

Strong candidates describe the EM's role as consuming and influencing the pipeline rather than building it. They track build time, flaky test rate, deployment frequency, and rollback frequency. They invest in a trunk-based development culture and in feature flags for safer deploys, and they advocate to the platform team for pipeline improvements when build time exceeds a threshold at which it is costing developer productivity.
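Of the metrics listed, flaky test rate is the one most often left undefined. A minimal sketch, assuming a flaky test is one that both passed and failed on the same commit, and assuming CI results are available as hypothetical `(commit_sha, test_name, passed)` tuples:

```python
from collections import defaultdict


def flaky_test_rate(results) -> float:
    """Fraction of (commit, test) pairs that both passed and failed.

    results: iterable of (commit_sha, test_name, passed) tuples, one per run.
    A pair with mixed outcomes on identical code is counted as flaky.
    """
    outcomes = defaultdict(set)
    for sha, test, passed in results:
        outcomes[(sha, test)].add(bool(passed))
    if not outcomes:
        return 0.0
    flaky = sum(1 for seen in outcomes.values() if seen == {True, False})
    return flaky / len(outcomes)


runs = [
    ("a1", "test_login", False),
    ("a1", "test_login", True),   # passed on retry with no code change: flaky
    ("a1", "test_pay", True),
    ("b2", "test_login", True),
    ("b2", "test_pay", False),    # failed consistently: a real failure, not flaky
]
print(flaky_test_rate(runs))
```

Tracking this number over time gives the EM a concrete figure to bring to the platform team, alongside build time.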

Question 7

What is the purpose of a skip-level meeting, how often should you run them, and how do you make sure the information you get is accurate rather than filtered by people telling you what they think you want to hear?

Accepted Answer

Strong candidates describe skip-levels as a way to calibrate their understanding of team health and identify blind spots in their direct reports' management. They run them quarterly, make clear that the conversation is not about the direct manager, and ask open-ended questions. They follow up on systemic issues without attributing them to specific individuals.

Question 8

Amazon popularized disagree and commit as a leadership principle. What does it actually mean, where does it apply, and where does it become a problematic norm that silences legitimate dissent?

Accepted Answer

Strong candidates describe disagree and commit as appropriate for reversible decisions where debate has been had and a choice must be made. It breaks down when used to pressure people into committing to decisions made without their input, when applied to ethical concerns, or when it becomes a way for senior leaders to shut down debate prematurely. Candidates should distinguish it from compliance culture.

Question 9

How do you define the difference between managing and leading, and which is more important in your current role? Can you give an example of a time you were managing when you should have been leading, or vice versa?

Accepted Answer

Strong candidates distinguish management (process, delivery, people operations) from leadership (direction, culture, inspiration, change). Both are needed. They give a concrete example of over-indexing on one at the wrong moment: for example, focusing on sprint hygiene during a period when the team needed a clear long-term direction.

Question 10

Your organization has 200 microservices and engineers frequently do not know who owns a service when responding to an incident. Walk me through how you would use the Datadog Service Catalog to improve ownership clarity, what metadata you would require for every service, and how you would enforce that teams keep it up to date.

Accepted Answer

A strong answer describes defining a service in service.datadog.yaml with team, tier, contacts, and runbook links, enforcing schema validation in CI, linking the Service Catalog to PagerDuty schedules so on-call contact is always current, and using the catalog's SLO and error-budget views to surface reliability health per team. The candidate should mention using catalog completeness metrics to drive adoption rather than just mandating it.
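The CI enforcement and completeness metric could look roughly like the sketch below. It assumes each service definition has already been parsed into a dictionary, and the key names (`team`, `tier`, `contacts`, `links`, and a link `type` of `runbook`) are illustrative assumptions; the real field names should be checked against Datadog's published service definition schema for the schema version in use.

```python
# Hypothetical required keys; verify against the actual Datadog schema version.
REQUIRED_FIELDS = ("team", "tier", "contacts", "links")


def missing_metadata(definition: dict) -> list:
    """Return a list of problems for one parsed service definition."""
    problems = [field for field in REQUIRED_FIELDS if not definition.get(field)]
    links = definition.get("links") or []
    # Require at least one runbook link when links are present at all.
    if links and not any(link.get("type") == "runbook" for link in links):
        problems.append("links: no runbook entry")
    return problems


def catalog_completeness(definitions: list) -> float:
    """Fraction of services whose metadata passes every check."""
    if not definitions:
        return 1.0
    complete = sum(1 for d in definitions if not missing_metadata(d))
    return complete / len(definitions)


good = {
    "team": "payments",
    "tier": "1",
    "contacts": [{"type": "slack", "contact": "#payments-oncall"}],
    "links": [{"type": "runbook", "url": "https://example.com/runbook"}],
}
bad = {
    "team": "search",
    "contacts": [{"type": "email", "contact": "search@example.com"}],
    "links": [{"type": "dashboard", "url": "https://example.com/dash"}],
}
print(missing_metadata(bad))
print(catalog_completeness([good, bad]))
```

A CI job would fail the build when `missing_metadata` returns a non-empty list, while `catalog_completeness` can be reported per team to drive adoption without a hard mandate.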

Question 11

A team running a payments API wants to adopt SLOs. Walk me through how you would define a meaningful SLO for them, how to implement it in Datadog using monitors and SLO widgets, how to set up error budget burn rate alerts, and how to use the burn rate data to make decisions about engineering velocity versus reliability work.

Accepted Answer

A strong answer identifies a good SLI for payments (success rate of charge API calls, excluding retried duplicates), sets a reasonable SLO target based on historical performance (not aspirational), implements it as a metric-based SLO in Datadog tracking rolling 30-day compliance, configures a fast burn rate alert (5x budget consumption rate over 1 hour triggers page) and a slow burn rate alert (2x over 6 hours triggers ticket), and uses the remaining error budget to decide when to freeze non-critical feature work.
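The burn rate thresholds in this answer can be worked through with a little arithmetic. The sketch below is tool-agnostic and does not use any Datadog API: it assumes the observed error ratios for the 1-hour and 6-hour windows are already available, and it uses the 5x and 2x thresholds from the answer above. The function names and the 99.9% target are illustrative.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    A burn rate of 1.0 consumes the whole error budget over exactly one SLO window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until a full budget is gone if the current burn rate persists."""
    return window_days / rate


def alert_action(err_1h: float, err_6h: float, slo_target: float,
                 fast: float = 5.0, slow: float = 2.0) -> str:
    """Map the two windows onto the page / ticket policy described above."""
    if burn_rate(err_1h, slo_target) >= fast:
        return "page"
    if burn_rate(err_6h, slo_target) >= slow:
        return "ticket"
    return "none"


# With a 99.9% SLO, 0.6% of charge calls failing over the last hour is ~6x burn.
print(alert_action(err_1h=0.006, err_6h=0.002, slo_target=0.999))
print(days_to_exhaustion(5.0))
```

At a sustained 5x burn, a 30-day budget lasts about 6 days, which is why the fast alert pages a human while the slower 2x burn (roughly 15 days to exhaustion) only needs a ticket.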

Question 12

It is 2am. You are on call and you get a page: Notion page loads are failing for 30% of users, error rate is climbing, and your Slack is lighting up. Walk me through exactly what you do in the first 30 minutes.

Accepted Answer

A strong answer follows a clear incident response playbook: (1) acknowledge the alert and open an incident channel, (2) assess impact with dashboards (error rate, affected regions, error type distribution), (3) check recent deploys and roll back the most recent one if there is a strong correlation, (4) mitigate before fully diagnosing (if possible, route traffic around the failing component), (5) communicate status to users and stakeholders every 15 minutes. The candidate should discuss how to triage between a Postgres outage, a bad deploy, a network issue, and a third-party provider failure using the right signals (DB connection pool metrics, APM traces, Cloudflare status). A great answer also covers post-incident: a blameless postmortem within 48 hours and tracking action items to completion.

Question 13

Your Snowflake bill has grown from $50K to $400K per month in one year and your CFO wants a plan to reduce it. Walk me through how you would analyze where the credits are going and what levers you would pull to reduce cost without hurting performance.

Accepted Answer

A strong answer starts with ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY to find which warehouses consume the most credits, then uses QUERY_HISTORY to find the queries that account for the most execution time on those warehouses (a proxy for credit consumption, since QUERY_HISTORY does not report warehouse compute credits per query). Common levers: right-size warehouses (many teams default to LARGE or X-LARGE when SMALL would do), enable auto-suspend at 60 seconds or less, check for always-on warehouses with no traffic overnight, identify long-running queries that could be optimized, audit automatic clustering costs on tables that do not need it, and check storage costs from excessive Time Travel retention settings. Finally, implement resource monitors with hard credit caps per warehouse.
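The first step of that analysis might be sketched as follows. The SQL string shows one plausible query against the metering view; the Python function then ranks warehouses from rows shaped like that view, using its `WAREHOUSE_NAME` and `CREDITS_USED` columns. The sample rows and warehouse names are made up, and fetching the rows (connector, credentials, role) is left out as an assumption about the environment.

```python
from collections import defaultdict

# One plausible query for the last 30 days of hourly metering rows (a sketch).
METERING_SQL = """
SELECT warehouse_name, start_time, credits_used
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
"""


def credits_by_warehouse(rows) -> list:
    """Rank warehouses by total credits, highest first.

    rows: iterable of dicts with WAREHOUSE_NAME and CREDITS_USED keys.
    Returns (warehouse_name, total_credits, share_of_total) tuples.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[row["WAREHOUSE_NAME"]] += float(row["CREDITS_USED"])
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (name, credits, credits / grand_total if grand_total else 0.0)
        for name, credits in ranked
    ]


# Hypothetical sample rows standing in for query results.
sample_rows = [
    {"WAREHOUSE_NAME": "ETL_WH", "CREDITS_USED": 1.5},
    {"WAREHOUSE_NAME": "BI_WH", "CREDITS_USED": 3.0},
    {"WAREHOUSE_NAME": "ETL_WH", "CREDITS_USED": 2.5},
    {"WAREHOUSE_NAME": "ADHOC_WH", "CREDITS_USED": 1.0},
]
for name, credits, share in credits_by_warehouse(sample_rows):
    print(f"{name}: {credits:.1f} credits ({share:.0%})")
```

Ranking by share of total makes it easier to decide where right-sizing or auto-suspend changes are worth the effort first.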

Question 14

Walk me through the high-level software and control system architecture that enables a Falcon 9 first stage to execute a return-to-launch-site landing. What are the key computational constraints and failure modes in that loop?

Accepted Answer

A strong answer covers the guidance, navigation, and control (GNC) stack: inertial navigation plus GPS fusion, the powered descent guidance algorithm (PDG, derived from fuel-optimal convex optimization), aerosurface and engine throttle actuation, and the real-time constraints that make this a hard problem (millisecond control loops, fault tolerance, limited compute aboard). Candidates should mention that the landing burn must solve a constrained trajectory in near-real-time and that sensor failures must be detected and handled within the loop.

Question 15

What is Tuckman's model of team development, and how does it change how you manage a team that is in the storming phase versus one that is in the norming phase?

Accepted Answer

Strong candidates describe forming, storming, norming, and performing accurately and show they have used this model practically. In storming they establish clear decision-making processes and address conflicts early. In norming they codify the team's emerging norms formally. In performing they protect team autonomy and reduce manager-as-bottleneck dynamics.

Question 16

What are the four fundamental team types in the Team Topologies model, and when would you choose a platform team over an enabling team in your org design?

Accepted Answer

Strong candidates describe stream-aligned, platform, enabling, and complicated-subsystem teams accurately. A platform team is appropriate when multiple stream-aligned teams have the same infrastructure need and the demand justifies a product team approach. An enabling team is better when the goal is to teach and then step back, not maintain an ongoing service. Candidates connect the model to their actual experience.

Question 17

What is psychological safety as defined by Amy Edmondson's research, how does it differ from comfort or niceness, and what is the evidence for why it matters in high-performing engineering teams?

Accepted Answer

Strong candidates describe psychological safety as the belief that speaking up will not result in punishment or humiliation, and distinguish it from group harmony or absence of conflict. They reference Edmondson's mid-1990s study of hospital care teams, where higher-performing teams reported more errors because they felt safe surfacing them, not because they made more mistakes, and her 1999 paper on work teams that formalized the construct. They may also cite Google's Project Aristotle, which found psychological safety to be the strongest predictor of team effectiveness. They note that psychologically safe teams surface problems faster, iterate more, and have lower error rates in complex tasks.

Question 18

Salesforce has over 40 certifications available through Trailhead. If you were hiring a Salesforce developer, how much weight would you give their certifications versus hands-on experience? What certifications would you expect a senior developer to hold?

Accepted Answer

Strong answers are balanced: certifications show baseline knowledge of the platform and are a useful signal, especially for candidates without much prior Salesforce work history, but hands-on experience building real solutions matters more. A senior Salesforce developer should hold at least Platform Developer I (PDI), and ideally Platform Developer II (PDII), which requires a practical coding exam. Application Architect and System Architect are meaningful for senior roles. The candidate should also mention that Trailhead badges supplement but do not replace real project experience.

Question 19

Starship is designed for rapid full reusability: both the booster and the ship are caught, inspected, and reflown quickly. From a software perspective, what are the hardest problems that rapid reusability creates that do not exist for an expendable rocket?

Accepted Answer

A strong answer identifies: health monitoring and prognostics software that tracks component wear across flights and predicts maintenance needs, configuration management that tracks per-vehicle software and hardware state across reflights, the challenge of incorporating inspection data from the previous flight into go/no-go decisions, and the scheduling and logistics software for a high-tempo launch cadence. The candidate should note that expendable rockets have no concept of per-vehicle flight history in their software stack.

Question 20

During atmospheric re-entry, a vehicle experiences extreme thermal loads. From a software perspective, what does that mean for sensor data integrity, onboard computer operation, and the communication blackout period?

Accepted Answer

A strong answer covers the ionization blackout (plasma sheath around the vehicle absorbs RF, cutting communications for several minutes), which means the vehicle must operate fully autonomously during that window with no uplink possible. Onboard software must be designed to complete re-entry without ground intervention. Thermocouples and heat shield sensors may produce noisy or out-of-range readings that the GNC software must handle gracefully. The candidate should note that this is why pre-planned autonomy and thorough pre-entry state verification matter.

Questions

Explain the difference between metrics, traces, and logs | Domain knowledge | easy | Very common

How engineering leveling frameworks work and when they break down | Domain knowledge | medium | Common

SLAs, SLOs, and SLIs and why they matter to an engineering manager | Domain knowledge | easy | Common

DORA metrics as a delivery health indicator | Domain knowledge | medium | Common

The actual purpose of agile ceremonies and when to cut them | Domain knowledge | easy | Common

What an EM needs to know about CI/CD and deployment pipelines | Domain knowledge | medium | Common

Running effective skip-level meetings | Domain knowledge | easy | Common

Disagree and commit as an operating principle | Domain knowledge | medium | Common

The difference between managing and leading an engineering team | Domain knowledge | easy | Common

How would you use the Datadog Service Catalog? | Domain knowledge | easy | Common

Design and implement SLOs with error budgets | Domain knowledge | medium | Common

How would you handle a P0 incident where Notion pages are not loading? | Domain knowledge | medium | Common

Build a Snowflake cost optimization strategy from scratch | Domain knowledge | hard | Common

Describe the software control loop that enables Falcon 9 RTLS landing | Domain knowledge | hard | Common

Tuckman's model of team development and how to apply it | Domain knowledge | easy | Occasional

Team Topologies patterns for engineering org design | Domain knowledge | medium | Occasional

Amy Edmondson's psychological safety research applied to engineering teams | Domain knowledge | easy | Occasional

How do you evaluate the value of Salesforce certifications? | Domain knowledge | easy | Occasional

What software challenges are unique to full Starship reusability? | Domain knowledge | hard | Occasional

Explain re-entry heating and what it means for onboard software | Domain knowledge | medium | Occasional