Lessons From a Tricky Performance Bug

Solving performance issues is a tricky business. Teams are often not knowledgeable in performance testing. When a performance bug comes in, it could be the team's first time to witness such an issue. Furthermore, performance bugs are always critical and need to be solved yesterday. If the team has the necessary skills to solve the issue, this would be manageable. However, many teams rarely have enough experience to fix performance bugs quickly. They often end up learning on the job as they watch their house burn down. This happened to my team in October 2019. The performance team raised a ticket against a critical use case in their test suite. Let's step through the sequence of events and highlight the principles I learned as team lead in hindsight. As I write this, I'm listening to Brendan Gregg talk about performance debugging. I'll try to add his thoughts to this too.

Day 0 - Ticket created

Our team owned the failed component in question - let's call it SweetShop. SweetShop creates and manages Sweets. Historically, SweetShop has been a problem. Our most difficult tickets have been related to SweetShop in some way. Suffice to say, my team does not have a good relationship with SweetShop. In the ticket description we saw "SweetShop logs say OutOfMemory after we create 2300 Sweets".

Day 1 - First responders

We picked up the bug the next morning. After a quick review of the ticket, we logged into the system to see the SweetShop logs. We confirmed what the QA team reported in the ticket: OutOfMemoryError: Java heap space. We suggested to QA to increase heap memory for SweetShop. That was our first possible solution.

Possible Solutions

Increase heap memory.

Principles Learned

Respond to critical tickets quickly with the smallest possible solution.

Day 2 - Gather info

The next morning, the QA added more details to the ticket. The test was to verify that SweetShop can create and store 3000 Sweets. In hindsight, this ticket was poorly written from the outset. There were no details of what the QA team were trying to accomplish, only some errors they found in the logs. More importantly, there was no information on the test itself. A few simple data points always help the support team to get started quickly.

What type of test is it? eg. Load, stability.
What are you trying to accomplish in this test? eg. Create 3000 Sweets in 3 hours, Create as many Sweets as possible, Create infinite Sweets.
Has this test passed historically or is this a new test?
Who is observing the problem?

We logged onto the system and ran some of the common linux commands: top, iostat -x 1, netstat and so forth. Increasing the heap capacity did not remedy the failures. We even tried to install Zabbix to do some monitoring because on of our developers used it before. There was no thought give to what we were trying to achieve. Why would there be? We had no idea what to do and wanted the ticket closed ASAP. In Netflix they call this the Street Light Anti-Method. In other words, start with the problem first, then use the right tools to figure out what is wrong. Don't start with the tools you know simply because you know them.

Possible Solutions

Increase heap memory.

Principles Learned

Do not investigate any ticket until sufficient details are added.
Ask basic questions to understand the problem.
Once you have asked your questions, then find the tools.

Day 3 - Bring in the troops

On Day 3, our ticket appeared on the management's morning stand up. When this happens, the ticket goes straight to the top of our priority list and our mission becomes simple: Get the ticket solved. So we shifted gears. We requested to see the system logs for SweetShop and the scripts they used to create their Sweets. The QA team kindly responded with a link to a Jenkins job which supposedly ran the scripts. In other words, find the scripts yourself - lovely! To top it off, we were cc'd to a management thread with our ticket name as the title.

Possible Solutions

~~Increase heap memory.~~

Principles Learned

Do not accept vague information. Make sure you get exactly what you asked for.
If a manager wants something ASAP (as was this case when we were cc'd), don't expect anything less from their side. Escalate quickly if you don't get what you want.

Day 4 - It's Getting Hot

I should say that this was our first major performance issue. We were learning on the job and struggling to think of ideas. This meant we had no monitoring tools setup or any idea of what tools we could use. All we had was the failed system and the QA script that said Sweet123 creation successful or Sweet999 creation failed. Nevertheless, we weren't lying down. We rejected the Jenkins job link that we received the previous evening and demanded the scripts be provided. Once received, two of our teams began their investigation.

In parallel, we looked into SweetShop performance tuning. We found a troubleshooting page that said the min and max heap size should be set to the same value. We had the min heap = 2GB and max heap = 4GB. This indicated that the garbage collector could be insufficient at higher levels of load. Heap size was back on the table as a solution! We set the min heap and max heap to 4GB each. Even if it didn't fix the situation, it was recommended by the SweetShop support and so should go in.

Possible Solutions

~~Increase heap memory.~~
Tune heap memory.

Principles Learned

Look for shortcuts to a solution. Close off as many paths as possible in parallel.

Day 5 - A Blind Scramble

At this stage we had two devs looking into the test scripts and an idea to tune the heap. Being a pessimistic team, we weren't ready to relax. We needed more information on how the system was behaving at load. We had nothing setup to monitor our systems - big mistake! This was our first performance bug. We hadn't prepared for anything like this before. We checked with other teams - they had nothing either! Fortunately, linux comes with a few handy tools. We ran the performance tests on 3 different systems. On two of those systems, we increased the heap size to see if it would help. On the other system, we created a quick python script to create batches of 10 Sweets with a delay between batches. While one of the tests was running, someone spotted that the garbage collector was slowing down significantly. The beginning of the test looked like:

tail -f /path/to/gc.log
Total time for which application threads were stopped: 0.0001006 seconds

The 500th Sweet creation looked like:

tail -f /path/to/gc.log
Total time for which application threads were stopped: 3.0001342 seconds

A few hours later, the 1500th Sweet creation looked like:

tail -f /path/to/gc.log
Total time for which application threads were stopped: 24.1947332 seconds

Then, the system started to fail. Since the heap was increased and the garbage collector was still having trouble, we decided to increase the time between batches to see what was the issue.

Possible Solutions

~~Increase heap memory.~~
Tune heap memory.
Optimise the test scripts.
- Add delay of ~~2.5~~ 4 seconds between batches.

Principles Learned

Performance testing must be accompanied by the appropriate debug tools.

Day 6 & 7 Getting Somewhere

Some good news came the following morning. With the delay of 2.5 seconds, the performance tests managed to create 2600 Sweets. Ok - we were onto something. We looked through the scripts once more to see if anything else could cause unnecessary load on the system. Wait..what's this? CONCURRENCY=20 How did we miss that the first time? That puts a huge load on the SweetShop service! We weren't even sure if it could officially handle that level of load.

We updated the test scripts and ran them again with the following parameters:

Batch delay = 4 seconds.
Concurrency = 10.

Possible Solutions

~~Increase heap memory.~~
Tune heap memory.
Optimise the test scripts.
- Add delay of ~~2.5~~ 4 seconds between batches.
- Reduce concurrency to 10.

Principles Learned

Do not blindly trust tests written by other people against your service. They may be testing something that is not supported.

Day 8 One Step Forward, Two Steps Back

The next morning we got a new result - only 1750 Sweets were created. This was frustrating! Again, we saw OutOfMemoryError: Java heap space in the logs. So none of our changes had resolved the issue. The day was spent trawling through all the logs again to see was there anything we missed. In the meantime, the performance team kicked off another run for some extra feedback.

Possible Solution

~~Increase heap memory.~~
Tune heap memory.
Optimise the test scripts.
- Add delay of ~~2.5~~ 4 seconds between batches.
- Reduce concurrency to 10.

Day 9 Blame Someone Else

We weren't making much progress for such a high priority ticket. We decided to take the failed system with 1750 Sweets and push it to see how far it would go. We created our own python script to create a single Sweet one at a time in sequence. We removed the batches and the time delays and connected a java monitor to the system. The result: 62 Sweets were created in 16 hours. The java monitor showed the heap was maxed out constantly at 4GB. We suspected something was happening with the createSweet method so we opened a ticket on the development team who handled some of this code. I believe this violated another one of Netflix's anti-methods - Blame Someone Else:

Find a system you are not responsible for. In this case the java functionality - Done.
Hypothesize that the issue is with that component - Done.
Redirect the issue to the responsible team - Done.
When proven wrong, go to 1 - Todo.

Possible Solutions

~~Increase heap memory.~~
Tune heap memory.
Optimise the test scripts.
- Add delay of ~~2.5~~ 4 seconds between batches.
- Reduce concurrency to 10.
Get the development team to fix it.

The development team observed that the system significantly slowed down after the 1000th Sweet. The CPU also maxed out a 100% - embarrassingly, we didn't spot this. The development team decided it wasn't their code and pushed it off as an internal bug to SweetShop, completing our cycle of the Blame Someone Else Anti-Method:

Find a system you are not responsible for - Done.
Hypothesize that the issue is with that component - Done.
Redirect the issue to the responsible team - Done.
When proven wrong, go to 1 - Done.

Let's update our solution list.

Possible Solutions

~~Increase heap memory.~~
Tune heap memory.
Optimise the test scripts.
- Add delay of ~~2.5~~ 4 seconds between batches.
- Reduce concurrency to 10.
~~Get the development team to fix it.~~
Get the SweetShop support team to fix it.

Now we had two tickets. The first was ours to figure out how to successfully create 3000 sweets. The second was the one we raised on the development team to figure out why the createSweet is inefficient at high load. The second ticket in turn was connected to a SweetShop support ticket - what a mess! Remember the manager's standup? Our ticket was front and center each morning.

Principles Learned

Don't "Blame Someone Else". Instead pull in people to help without assigning blame to them.

Day 15 A Slow Death...Then

Days went by with no progress. Each morning, our team gave a status update like "We are trying x, y and z but no success yet". I suppose this violates the Drunk Man Anti-Method:

Try things at random until the problem goes away. The development team attempted some improvements recommended by SweetShop support, but nothing improved the situation.

Possible Solutions

~~Increase heap memory.~~
Tune heap memory.
Optimise the test scripts.
- Add delay of ~~2.5~~ 4 seconds between batches.
- Reduce concurrency to 10.
~~Get the development team to fix it.~~
Get the SweetShop support team to fix it.

A week later, we found a way forward. The SweetShop support finally admitted "We don't support this number of Sweets in a system. We would expect issues with the garbage collection. Please upgrade to the latest version to resolve this." In other words, there is a garbage collection bug in SweetShop that cannot be patched. A major upgrade would take months of work, so that was not an option since our quarterly release was about 6 weeks away. But this was good news for our team. At least we now knew there was a limitation where a permanent fix was not possible. Next task: figure out a way to clear the heap and free up some memory.

Possible Solutions

~~Increase heap memory.~~
Tune heap memory.
Optimise the test scripts.
- Add delay of ~~2.5~~ 4 seconds between batches.
- Reduce concurrency to 10.
~~Get the development team to fix it.~~
~~Get the SweetShop support team to fix it.~~
Upgrade SweetShop.

After some experimenting, we found that the SweetShop garbage collection could be completely cleared via a service restart. So we were left with a dilemna. Restarting services as part of a load test is not a real solution. In fact, it's a really...really bad solution. But we were in a tight situation and had no other alternative. To propose it to management, we needed a plan to completely resolve it. After pushing the SweetShop support team, they recommended upgrading our product from v2 to v6. SweetShop v6 had improved garbage collection capabilities. We added it to our solution plan:

Possible Solutions

~~Increase heap memory.~~
Tune heap memory.
Optimise the test scripts.
- Add delay of ~~2.5~~ 4 seconds between batches.
- Reduce concurrency to 10.
~~Get the development team to fix it.~~
~~Get the SweetShop support team to fix it.~~
Upgrade SweetShop to v6 in Q1.

As I said, a major upgrade would take months of work. Our quarterly release was only a few weeks away. So what could we do right now? What quick and dirty solutions are on the table? We know that a restart clears the garbage collector. So we restarted SweetShop after every 500th Sweet created. We took the QA performance scripts and added in a service restart after every 500 Sweets. On the first run, we were able to run the full script up to 3000 Sweets - fantastic!!!

Possible Solutions

~~Increase heap memory.~~
Tune heap memory.
Optimise the test scripts.
- ~~Add delay of 4 seconds between batches.~~
- Reduce concurrency to 10.
- Restart SweetShop after every 500 Sweets.
~~Get the development team to fix it.~~
~~Get the SweetShop support team to fix it.~~
Upgrade SweetShop to v6.

However, upon a review of the logs, we saw that 200 out of 3000 Sweets failed to create. Still, it was good progress. We reran the test again with smaller batch sizes - every 400 Sweets. We also added in some error handling to retry if a createSweet failed. This time, we successfully created all 3000 Sweets. We updated our proposal:

Possible Solutions

~~Increase heap memory.~~
Tune heap memory.
Optimise the test scripts.
- ~~Add delay of 4 seconds between batches.~~
- Reduce concurrency to 10.
- Restart SweetShop after every ~~500~~ 400 Sweets.
- If creation fails, retry.
~~Get the development team to fix it.~~
~~Get the SweetShop support team to fix it.~~
Upgrade SweetShop to v6.

The proposal was agreed along with a lengthy explanation to our release notes why this use case had a serious limitation.

Principles Learned

Don't discount quick and dirty solutions. Every solution is on the table in a critical scenario.

The ticket was closed...until...

Day 45 Dec 6

"We see this issue again in system 6!" Considering the quaterly release was just 2 weeks away, this was not a pleasant message to see in the afternoon. Ok, let's review the situation.

The proposal was actioned 30 days ago.
Multiple performance tests were run since the proposal.
No further issues were seen until now.
No changes had been made to SweetShop.

Something didn't add up. We asked the QA engineer if they ran the exact same test as last month. They confirmed "Of course". "Of course" - two words which I now know translate to "maybe". Fortunately, we learned our lesson from Day 7: Do not blindly trust tests written by other people. We re-checked the parameters in the scripts:

Batch size = 400 => OK
Restart SweetShop between batches => OK
Concurrency = 20 => not OK!!!

But "of course", the exact same test was run! I firmly responded to rerun the test with concurrency = 10. After this, the test passed and we never returned to it again.

Principles Learned

I repeat: Do not trust tests written by other people. They may be testing something that is not supported!

Conclusion

While these couple of days were quite stressful, it resulted in a rewarding experience. We were able to attack the problem head on and rule out wrong solutions quickly. In the following sprints, we fed a lot of performance and monitoring improvements into our backlog. We didn't want to get stuck like that again. The principles we picked up along the way have been instrumental in how we handled subsequent performance tickets. Now, when a ticket comes in from the QA Performance team on SweetShop, we have a set of tools ready to analyze the problem and judge quickly what needs to be done to solve it. Here is a list of all the principles we learned along the way.

Principles Learned

Respond to critical tickets quickly with the smallest possible solution.
Do not investigate any ticket until sufficient details are added.
Ask basic questions to understand the problem.
Once you have asked your questions, then find the tools.
Do not accept vague information. Make sure you get exactly what you asked for.
If a manager wants something ASAP (as was this case when we were cc'd), don't expect anything less from their side. Escalate quickly if you don't get what you want.
Look for shortcuts to a solution. Close off as many paths as possible in parallel.
Performance testing must be accompanied by the appropriate debug tools.
Do not blindly trust tests written by other people against your service. They may be testing something that is not supported.
Don't "Blame Someone Else". Instead pull in people to help without assigning blame to them.
Don't discount quick and dirty solutions. Every solution is on the table in a critical scenario.

0 kudos