Tell me about a time you failed or had a production crisis

🎯 Leadership Principles Demonstrated

Bias for Action - Same-day emergency hotfix
Customer Obsession - Protected users first
Ownership - Took responsibility, no blame
Dive Deep - Root cause analysis under pressure
Earn Trust - Transparent with stakeholders
Learn and Be Curious - Process improvements

📖 Full STAR Answer (2.5 minutes)

Situation (30 seconds)

Shortly after launching the Event Effect feature (birthday celebration animations) to 200 million LINE users, we started receiving crash reports from Android users. Crashes happened when users viewed their birthday animations - turning a delightful moment into a frustrating experience.

The pattern was alarming: even high-end devices crashed if users had many apps running. This was a production crisis. Every hour meant thousands more affected users. The Design team and PM had just publicly announced this feature. As code owner and developer, I owned this problem.

Task (25 seconds)

My responsibility was immediate:

Stop the bleeding - prevent more crashes NOW
Find root cause - understand what went wrong
Implement proper fix - long-term solution
Improve processes - prevent recurrence

Time was critical - we needed action now, not perfect solutions later.

Action (85 seconds)

Phase 1: Emergency Response (Same Day - 4-6 Hours)

I immediately:

Confirmed pattern: Android only, memory-related, especially devices with many background apps
Organized emergency meeting with PM, Design, QA, Engineering leadership
Proposed tough decision: "Temporarily disable Event Effect for Android until we fix properly"

This was difficult because:

Design team worked hard on animations
PM just announced the feature
We'd just launched successfully

But protecting 200M users came first. Everyone agreed.

Deployed hotfix within hours via feature flag. Crashes stopped immediately.

Phase 2: Root Cause Analysis (Days 1-3)

With users protected, I investigated thoroughly using Android Profiler:

Discovery: APNG library consumed 500MB memory for full-screen animations!

Why wasn't this caught?

Test devices: Clean state, few background apps
Real users: 20-50 apps installed, background processes consuming memory
Even 8GB high-end phones crashed with 40 apps running
Full-screen animations (vs small stickers) left no safety margin

I evaluated 3 options:

Fix in-house C++ library → 4+ weeks, unfamiliar code
Find OSS library with better memory management → 1-2 weeks ✓
Build from scratch → weeks + maintenance burden

Phase 3: Proper Fix (Weeks 2-3)

Found OSS library using Just-In-Time rendering (decode frames on-demand):

Benchmarked: 5MB vs 500MB (100x improvement!)
Tested on emulator, high-spec, low-spec, top-used devices
Applied fix app-wide - benefited Stickers, profile icons, chat decorations

Process improvements implemented:

Real-world simulation testing - test with 20-30 apps running
Memory pressure testing - deliberately consume memory before testing
Code review standard - "Did you test under memory pressure?"
Case study documented for team learning

Result (30 seconds)

✅ Crisis resolved same day - Hotfix within hours, crashes stopped

✅ Proper fix in 2 weeks - Memory 500MB → 5MB (100x improvement)

✅ Feature re-enabled - Zero crashes after fix

✅ App-wide benefit - All APNG features improved (Stickers 80% less memory)

✅ Process improvements - Now catch performance issues BEFORE launch

✅ Team learning - Documented case study

Key Learning: "Test in real-world conditions, not lab ideals. Lab devices are clean; user devices have 50 apps running. Also, temporary mitigation + proper fix beats rushing under pressure."

This turned me from "developer who builds features" to "engineer who owns complete lifecycle and learns from production."

💡 Key Phrases to Practice

"Shortly after launch, crashes turned delightful birthdays into frustrating experiences - a production crisis"
"I proposed a tough decision: temporarily disable the feature to protect 200M users. That came first."
"Deployed hotfix within 4-6 hours via feature flag - crashes stopped immediately"
"Root cause: 500MB memory. Test devices were clean; real users had 50 apps running"
"Found OSS library reducing memory from 500MB to 5MB - a 100x improvement"
"Implemented real-world simulation testing - now test with 30 apps running, memory pressure"
"This taught me: test in reality, not theory. Lab ≠ production."

🎤 Delivery Tips

⏰ Time Management:

Situation: 30s | Task: 25s | Action Phase 1: 25s | Phase 2: 30s | Phase 3: 30s | Result: 30s

✅ DO:

Emphasize "birthday crashes" - shows user empathy
Highlight tough decision under pressure (disable feature same day)
Show two-phase thinking: quick fix + proper fix
Mention 100x improvement - very memorable
End with learning and process improvements
Be honest: "We missed this in testing"

❌ DON'T:

Sound defensive or blame QA
Downplay severity - it WAS a crisis
Skip process improvements
Make it sound easy
Forget to mention app-wide benefit

❓ Key Follow-Up Questions & Answers

Q1: "Why wasn't this caught in testing before launch?"

Answer: "Great question - this reveals a fundamental gap in mobile testing: test environments can't fully replicate real-world memory conditions.

Test Environment:

Devices in clean, controlled state
Few apps installed
Fresh reset before each test
Memory mostly available

Real-World Production:

Users have 20-50 apps installed (sometimes 100+)
10+ apps running in background
Device hasn't been restarted in days
Already at 70-80% memory usage

The critical difference:

A high-end device with 8GB RAM but 50 apps running might have less available memory than a clean low-end device in our test lab.

500MB worked fine in test (plenty of free memory), but crashed in production (limited free memory).

What I changed:

Real-world simulation testing - test with 30+ apps installed and running
Memory pressure testing - deliberately consume 70-80% memory before testing
Stress testing - run animations repeatedly, not just once
Production monitoring - alert when memory exceeds safe thresholds

The principle: Test in conditions that match your users' reality, not lab ideals."

Q2: "How did stakeholders react to disabling the feature?"

Answer: "Initially there was hesitation - PM and Design had just announced the feature publicly.

But I framed it as a clear choice:

'Option 1: Keep feature running while millions see crashes
Option 2: Temporarily disable, fix properly, re-enable with confidence'

I emphasized three things:

Temporary, not cancellation - 'Mitigation for 1-2 weeks, not killing the feature'
User protection - 'Users see crashes now. With feature off, they see nothing - better than broken'
Clear plan - 'Root cause within 2 days, proper fix within 2 weeks'

PM appreciated:

Transparency - no hiding the problem
Two-phase plan - not just panic
Taking ownership - I committed personally to fix it

Design team:

Initially disappointed, but understood when I showed crash data affecting thousands of users.

Building trust through transparency - I didn't make excuses. I presented data, proposed solutions, took ownership. This strengthened our relationship because they saw I prioritized users over feature pride.

The principle: Stakeholders appreciate honesty and systematic plans during crises. Transparency builds trust even in failure."

Q3: "What would you do differently if you could do this again?"

Answer: "Three key improvements:

1. Real-World Simulation Testing from Day 1

Should have tested with 20-30 apps running during development
Now I deliberately launch many apps before testing any memory-intensive feature
Lesson: Treat real-world simulation as essential, not optional

2. Explicit Risk Assessment for New Scale

When using existing library at new scale (small stickers → large full-screen animations), should have flagged this as risk
Now I ask: 'Are we using this in a fundamentally different way?'
If yes, automatically do stress testing

3. Memory Pressure Testing as Standard

Should have done memory stress testing - consume 80% memory, then test feature
Should have tested after extended device usage (30+ mins), not just fresh launches
Now part of my pre-launch checklist

What I did well:

Quick crisis response - protected users within hours
Transparent communication - no defensiveness
Two-phase thinking - mitigation + proper fix
Systemic improvements - fixed the process, not just the bug

The principle: Every production issue is either a learning opportunity or a waste. I chose to learn."

Q4: "How did this change your development approach going forward?"

Answer: "This fundamentally changed my mindset from code-centric to system-centric thinking.

Before:

"Does my code work on my test device?"
Test in ideal conditions
Ship when it looks good

After:

"Does my code work in users' real-world conditions?"
Test in realistic scenarios
Ship when data proves it's safe

Specific changes:

1. Standard Testing Checklist Now Includes:

Test with 30+ apps running in background
Test on device running for 3+ days without restart
Test in battery saver mode
Test with poor network
Profile memory under pressure

2. Proactive Performance Gates:

Memory and CPU profiling required before launch
Must test under low, medium, high memory pressure
Code reviewers explicitly check: 'Did you test under memory pressure?'

3. Platform Monitoring:

Follow Android Beta releases
Test on new Android versions 3 months before launch
Subscribe to platform behavior change notifications

Measurable Impact:

Before: ~5 platform-related bugs per quarter reached production

After: ~1 bug per quarter (80% reduction)

Before: Average 1-2 weeks to investigate

After: 2-3 days (70% faster)

The principle: Senior engineers don't just write code - they ensure it works in the entire system (platform + users + constraints). This shift is what separates good engineers from senior/staff engineers."

📊 Story Strength: 10/10 ⭐⭐⭐⭐⭐

Why This Is Perfect:

✅ Real production crisis with user impact
✅ Quick decision-making under pressure (same-day hotfix)
✅ Honest about mistake ("We missed this in testing")
✅ Two-phase thinking (quick + proper fix)
✅ Quantified impact (100x improvement)
✅ Process improvements prevent future issues
✅ Shows growth from incident

🎯 Practice Checklist

[ ] Can describe crisis setup dramatically
[ ] Can articulate tough decision (disable feature)
[ ] Can explain "500MB to 5MB" (100x) memorably
[ ] Can explain why testing didn't catch it
[ ] Can describe 4 process improvements
[ ] Can emphasize learning and growth
[ ] Timed full answer at 2.5 minutes
[ ] Practiced 4 follow-up questions

📚 Full Detailed Version: See /mnt/project/12_Event_Effect_Production_Crisis_and_Recovery.md for comprehensive version with 10+ follow-up questions and extensive details.

Remember: Google values engineers who handle production issues well, learn from mistakes, and improve processes. This story proves you can do all three! 🚀

Q3: Production Crisis / Failure