Tell me about a time you failed or had a production crisis

๐ŸŽฏ Leadership Principles Demonstrated

  • Bias for Action - Same-day emergency hotfix
  • Customer Obsession - Protected users first
  • Ownership - Took responsibility, no blame
  • Dive Deep - Root cause analysis under pressure
  • Earn Trust - Transparent with stakeholders
  • Learn and Be Curious - Process improvements

๐Ÿ“– Full STAR Answer (2.5 minutes)

Situation (30 seconds)

Shortly after launching the Event Effect feature (birthday celebration animations) to 200 million LINE users, we started receiving crash reports from Android users. Crashes happened when users viewed their birthday animations - turning a delightful moment into a frustrating experience.

The pattern was alarming: even high-end devices crashed if users had many apps running. This was a production crisis. Every hour meant thousands more affected users. The Design team and PM had just publicly announced this feature. As code owner and developer, I owned this problem.


Task (25 seconds)

My responsibility was immediate:

  1. Stop the bleeding - prevent more crashes NOW
  2. Find root cause - understand what went wrong
  3. Implement proper fix - long-term solution
  4. Improve processes - prevent recurrence

Time was critical - we needed action now, not perfect solutions later.


Action (85 seconds)

Phase 1: Emergency Response (Same Day - 4-6 Hours)

I immediately:

  1. Confirmed pattern: Android only, memory-related, especially devices with many background apps
  2. Organized emergency meeting with PM, Design, QA, Engineering leadership
  3. Proposed tough decision: "Temporarily disable Event Effect for Android until we fix properly"

This was difficult because:

  • Design team worked hard on animations
  • PM just announced the feature
  • We'd just launched successfully

But protecting 200M users came first. Everyone agreed.

Deployed hotfix within hours via feature flag. Crashes stopped immediately.


Phase 2: Root Cause Analysis (Days 1-3)

With users protected, I investigated thoroughly using Android Profiler:

Discovery: APNG library consumed 500MB memory for full-screen animations!

Why wasn't this caught?

  • Test devices: Clean state, few background apps
  • Real users: 20-50 apps installed, background processes consuming memory
  • Even 8GB high-end phones crashed with 40 apps running
  • Full-screen animations (vs small stickers) left no safety margin

I evaluated 3 options:

  1. Fix in-house C++ library โ†’ 4+ weeks, unfamiliar code
  2. Find OSS library with better memory management โ†’ 1-2 weeks โœ“
  3. Build from scratch โ†’ weeks + maintenance burden

Phase 3: Proper Fix (Weeks 2-3)

Found OSS library using Just-In-Time rendering (decode frames on-demand):

  • Benchmarked: 5MB vs 500MB (100x improvement!)
  • Tested on emulator, high-spec, low-spec, top-used devices
  • Applied fix app-wide - benefited Stickers, profile icons, chat decorations

Process improvements implemented:

  • Real-world simulation testing - test with 20-30 apps running
  • Memory pressure testing - deliberately consume memory before testing
  • Code review standard - "Did you test under memory pressure?"
  • Case study documented for team learning

Result (30 seconds)

โœ… Crisis resolved same day - Hotfix within hours, crashes stopped

โœ… Proper fix in 2 weeks - Memory 500MB โ†’ 5MB (100x improvement)

โœ… Feature re-enabled - Zero crashes after fix

โœ… App-wide benefit - All APNG features improved (Stickers 80% less memory)

โœ… Process improvements - Now catch performance issues BEFORE launch

โœ… Team learning - Documented case study

Key Learning: "Test in real-world conditions, not lab ideals. Lab devices are clean; user devices have 50 apps running. Also, temporary mitigation + proper fix beats rushing under pressure."

This turned me from "developer who builds features" to "engineer who owns complete lifecycle and learns from production."


๐Ÿ’ก Key Phrases to Practice

  • "Shortly after launch, crashes turned delightful birthdays into frustrating experiences - a production crisis"
  • "I proposed a tough decision: temporarily disable the feature to protect 200M users. That came first."
  • "Deployed hotfix within 4-6 hours via feature flag - crashes stopped immediately"
  • "Root cause: 500MB memory. Test devices were clean; real users had 50 apps running"
  • "Found OSS library reducing memory from 500MB to 5MB - a 100x improvement"
  • "Implemented real-world simulation testing - now test with 30 apps running, memory pressure"
  • "This taught me: test in reality, not theory. Lab โ‰  production."

๐ŸŽค Delivery Tips

โฐ Time Management:

  • Situation: 30s | Task: 25s | Action Phase 1: 25s | Phase 2: 30s | Phase 3: 30s | Result: 30s

โœ… DO:

  • Emphasize "birthday crashes" - shows user empathy
  • Highlight tough decision under pressure (disable feature same day)
  • Show two-phase thinking: quick fix + proper fix
  • Mention 100x improvement - very memorable
  • End with learning and process improvements
  • Be honest: "We missed this in testing"

โŒ DON'T:

  • Sound defensive or blame QA
  • Downplay severity - it WAS a crisis
  • Skip process improvements
  • Make it sound easy
  • Forget to mention app-wide benefit

โ“ Key Follow-Up Questions & Answers

Q1: "Why wasn't this caught in testing before launch?"

Answer: "Great question - this reveals a fundamental gap in mobile testing: test environments can't fully replicate real-world memory conditions.

Test Environment:

  • Devices in clean, controlled state
  • Few apps installed
  • Fresh reset before each test
  • Memory mostly available

Real-World Production:

  • Users have 20-50 apps installed (sometimes 100+)
  • 10+ apps running in background
  • Device hasn't been restarted in days
  • Already at 70-80% memory usage

The critical difference:

A high-end device with 8GB RAM but 50 apps running might have less available memory than a clean low-end device in our test lab.

500MB worked fine in test (plenty of free memory), but crashed in production (limited free memory).

What I changed:

  1. Real-world simulation testing - test with 30+ apps installed and running
  2. Memory pressure testing - deliberately consume 70-80% memory before testing
  3. Stress testing - run animations repeatedly, not just once
  4. Production monitoring - alert when memory exceeds safe thresholds

The principle: Test in conditions that match your users' reality, not lab ideals."


Q2: "How did stakeholders react to disabling the feature?"

Answer: "Initially there was hesitation - PM and Design had just announced the feature publicly.

But I framed it as a clear choice:

  • 'Option 1: Keep feature running while millions see crashes
  • Option 2: Temporarily disable, fix properly, re-enable with confidence'

I emphasized three things:

  1. Temporary, not cancellation - 'Mitigation for 1-2 weeks, not killing the feature'
  2. User protection - 'Users see crashes now. With feature off, they see nothing - better than broken'
  3. Clear plan - 'Root cause within 2 days, proper fix within 2 weeks'

PM appreciated:

  • Transparency - no hiding the problem
  • Two-phase plan - not just panic
  • Taking ownership - I committed personally to fix it

Design team:

Initially disappointed, but understood when I showed crash data affecting thousands of users.

Building trust through transparency - I didn't make excuses. I presented data, proposed solutions, took ownership. This strengthened our relationship because they saw I prioritized users over feature pride.

The principle: Stakeholders appreciate honesty and systematic plans during crises. Transparency builds trust even in failure."


Q3: "What would you do differently if you could do this again?"

Answer: "Three key improvements:

1. Real-World Simulation Testing from Day 1

  • Should have tested with 20-30 apps running during development
  • Now I deliberately launch many apps before testing any memory-intensive feature
  • Lesson: Treat real-world simulation as essential, not optional

2. Explicit Risk Assessment for New Scale

  • When using existing library at new scale (small stickers โ†’ large full-screen animations), should have flagged this as risk
  • Now I ask: 'Are we using this in a fundamentally different way?'
  • If yes, automatically do stress testing

3. Memory Pressure Testing as Standard

  • Should have done memory stress testing - consume 80% memory, then test feature
  • Should have tested after extended device usage (30+ mins), not just fresh launches
  • Now part of my pre-launch checklist

What I did well:

  • Quick crisis response - protected users within hours
  • Transparent communication - no defensiveness
  • Two-phase thinking - mitigation + proper fix
  • Systemic improvements - fixed the process, not just the bug

The principle: Every production issue is either a learning opportunity or a waste. I chose to learn."


Q4: "How did this change your development approach going forward?"

Answer: "This fundamentally changed my mindset from code-centric to system-centric thinking.

Before:

  • "Does my code work on my test device?"
  • Test in ideal conditions
  • Ship when it looks good

After:

  • "Does my code work in users' real-world conditions?"
  • Test in realistic scenarios
  • Ship when data proves it's safe

Specific changes:

1. Standard Testing Checklist Now Includes:

  • Test with 30+ apps running in background
  • Test on device running for 3+ days without restart
  • Test in battery saver mode
  • Test with poor network
  • Profile memory under pressure

2. Proactive Performance Gates:

  • Memory and CPU profiling required before launch
  • Must test under low, medium, high memory pressure
  • Code reviewers explicitly check: 'Did you test under memory pressure?'

3. Platform Monitoring:

  • Follow Android Beta releases
  • Test on new Android versions 3 months before launch
  • Subscribe to platform behavior change notifications

Measurable Impact:

Before: ~5 platform-related bugs per quarter reached production

After: ~1 bug per quarter (80% reduction)

Before: Average 1-2 weeks to investigate

After: 2-3 days (70% faster)

The principle: Senior engineers don't just write code - they ensure it works in the entire system (platform + users + constraints). This shift is what separates good engineers from senior/staff engineers."


๐Ÿ“Š Story Strength: 10/10 โญโญโญโญโญ

Why This Is Perfect:

  • โœ… Real production crisis with user impact
  • โœ… Quick decision-making under pressure (same-day hotfix)
  • โœ… Honest about mistake ("We missed this in testing")
  • โœ… Two-phase thinking (quick + proper fix)
  • โœ… Quantified impact (100x improvement)
  • โœ… Process improvements prevent future issues
  • โœ… Shows growth from incident

๐ŸŽฏ Practice Checklist

  • [ ] Can describe crisis setup dramatically
  • [ ] Can articulate tough decision (disable feature)
  • [ ] Can explain "500MB to 5MB" (100x) memorably
  • [ ] Can explain why testing didn't catch it
  • [ ] Can describe 4 process improvements
  • [ ] Can emphasize learning and growth
  • [ ] Timed full answer at 2.5 minutes
  • [ ] Practiced 4 follow-up questions

๐Ÿ“š Full Detailed Version: See /mnt/project/12_Event_Effect_Production_Crisis_and_Recovery.md for comprehensive version with 10+ follow-up questions and extensive details.


Remember: Google values engineers who handle production issues well, learn from mistakes, and improve processes. This story proves you can do all three! ๐Ÿš€

results matching ""

    No results matching ""