In the pursuit of shipping your new website on time, a lot of pressure can be put on a technical team to cut corners, reduce scope, and cut QA time. We are all human and capable of making mistakes; I can definitely hold my hands up and admit that I've made a bunch of deployment mistakes over my career.
Today I thought I would mix things up a bit and, to help ensure that you don't make the same mistakes as me, share the three worst deployment horror stories from my career. As some of the projects I'm mentioning here are household names, and because I'm a professional and this could border on divulging NDA secrets, I will change the company names to protect the innocent 🔥🔥🔥
A Media Issue That Led To An Increase In Traffic
My first horror story was probably one of my hardest and most stressful production issues. The backstory is simple: we had built a new website and pushed it live. We tested the site and everything worked great.
A few days after we launched the new site, the client phoned to tell us the website was down. Looking at the server, it had shut itself down, so I turned it back on, tested everything, and monitored the site for a few hours; all was good. The next day the same thing happened: the site went down for no apparent reason and I had to kick the server again to fix it. The cycle repeated the day after that too. The issue took me about a week to track down.
The culprit was a web scraper that the client ran daily against the site. The scraper hit the site with a list of hardcoded URLs in order to archive it. The client hadn't told us about this tool, or, once we discovered it, why they needed to run it daily; however, these are the types of things that happen in the real world!
The scraper contained a list of URLs that pointed to pages that used to exist on the old site. Most of these URLs could not be reached from the website itself; however, one URL did still resolve, and when it was requested it triggered a stack overflow error. The real problem was that when the scraper hit a URL and got a 500, it would re-request that URL repeatedly for 10 minutes. This loop meant the server was hammered with stack overflow exceptions. At the time, if IIS saw more than 5 stack overflow errors within 5 minutes, it would shut the application down to prevent the physical box from falling over. The hard part about debugging this was that the scraper caused so much noise in the logs that it was hard to spot the one problem URL; it literally scanned thousands of pages a second, producing a mix of 200s, 404s, and the occasional error. The other issue was that the logs were stored as plain text on the server itself.
After figuring out which pages caused the error, finding the root cause was pretty easy. The CMS we were forced to use on this project fell into an infinite loop whenever someone accessed a page with no corresponding frontend template. Within the content tree we had a few areas that contained container pages. These containers were created to make it easier for content editors to manage news articles, because having hundreds of items under a single node made the CMS itself unresponsive.
In the frontend, news pages were accessed either from the news hub page or from search. A container page could only be reached by someone manipulating the URL directly in a browser. As this container page type had no corresponding template (it didn't need one), instead of returning a 404 the CMS threw a stack overflow. To compound the error, the client's web scraper was repeatedly requesting these container pages in a non-standard way; the two things combined caused mayhem!
So what are the takeaways from this mess that you can use to improve your deployment process? First, if you are getting random website crashes, check your server logs for 500-related errors first. Next, remember that spotting errors in a pre-launch site with low traffic is easy; in production, with thousands of requests per second, it is not. This is why you should avoid storing your logs as plain text on the server. At a minimum, ship your logs to a logging tool like Datadog or New Relic. An easy-to-use visualization tool will help you debug your errors quickly when the shit hits the fan. Also, nowadays you should write your logs in a structured format like JSON, which allows for much better searching!
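To make that concrete, here's a minimal sketch of structured JSON logging in C# using Serilog. The application name and file path are my own placeholders, not anything from the original project:

```csharp
// Minimal Serilog setup that writes structured JSON logs to a rolling file.
// Requires the Serilog and Serilog.Sinks.File NuGet packages.
using Serilog;
using Serilog.Formatting.Json;

class Program
{
    static void Main()
    {
        Log.Logger = new LoggerConfiguration()
            .Enrich.WithProperty("Application", "MyWebsite") // hypothetical app name
            .WriteTo.File(new JsonFormatter(), "logs/site-.json", rollingInterval: RollingInterval.Day)
            .CreateLogger();

        // Each entry becomes a searchable JSON object, e.g. filterable by StatusCode or Url.
        Log.Error("Request to {Url} failed with status {StatusCode}", "/news/archive", 500);

        Log.CloseAndFlush();
    }
}
```

Once the logs are structured, tools like Datadog or New Relic can index the individual properties, so finding "every 500 for this one URL" stops being a grep through gigabytes of plain text.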
The second tip is to always run a link-spidering tool against your site before launching it. I always use Xenu to make sure no 500 errors are being thrown! Xenu won't catch errors on hidden pages; however, this is an essential step for any website launch as it will spot errors before go-live. If I had followed these tips, the production issue would have been avoided!
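If you want something scriptable alongside a tool like Xenu, a rough equivalent of that pre-launch check is simply firing requests at a list of known URLs and flagging anything that returns a server error. This is only a sketch; the base address and paths are placeholders:

```csharp
// Rough, scriptable stand-in for a link checker: request a list of known URLs
// and report anything that returns a 5xx status code.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class LinkCheck
{
    static async Task Main()
    {
        var baseUrl = "https://staging.example.com"; // placeholder host
        var paths = new[] { "/", "/news", "/news/archive", "/contact" }; // placeholder URLs

        using var client = new HttpClient();
        foreach (var path in paths)
        {
            var response = await client.GetAsync(baseUrl + path);
            if ((int)response.StatusCode >= 500)
            {
                Console.WriteLine($"SERVER ERROR {(int)response.StatusCode}: {path}");
            }
        }
    }
}
```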
The Tweet That Crashed The Site
We had just launched some new design work for a big customer. The work was thoroughly tested and signed off; however, on the day of release the company in question was hit by a news scandal and appeared on the nightly TV news. At the time, no one at work thought anything of this PR issue.
After leaving work that day I went home as normal. That night I had a date with a lady, but at 8pm I got a call from work: the client's site had gone down and I needed to go in straight away to fix things. Awkwardly, I had to cancel the first date, and I even had to get her to drop me at the office so I could get the site back up, as the news had picked up on the outage.
Eventually I realised that someone on the client's side had deactivated their Twitter account so that no one could add negative comments overnight. They hadn't made us aware of this either.
As part of the new code release, we had redesigned the customer's Twitter feed. To render the client's latest tweets, I used a server-side .NET third-party package to fetch the tweets and convert them into C# objects. It turned out there was a bug in this plug-in: when a Twitter key was provided but the account was deactivated, it threw a 500 error rather than just returning an empty array.
Obviously, during QA we had never tested what would happen if the client deleted their Twitter account; I mean, how many large companies would do that? As the Twitter plugin was rendered on every page, when the client turned off their Twitter account the whole site died.
If you're reading this and thinking "why are you doing this in C#, dumbass?", keep in mind this happened over 15 years ago, before JavaScript had become what it is today. Back then we had IE issues, browser compatibility issues, and so on. Nowadays doing this client-side would be a no-brainer, but back then life was different. As the rest of the site was rendered server-side behind a 30-minute cache, it made sense to render the tweets server-side as well.
So what can you learn from this mistake? I think the big error here was relying so heavily on a hobbyist NuGet package. I see this happen a lot on projects: someone needs to solve a key problem, so to save time they pick a package written by a lone developer with low adoption. Doing this is a massive long-term risk to any project, regardless of the language. The situation I've seen play out 30 or 40 times now is that the package creator moves on and the package becomes obsolete. Next thing you know, .NET updates, a new security issue is found, you need to upgrade some packages, and the package you rely on stops working with no one around to update it.
When this happens to your project, your only option is to rip out everything that touches the package and re-create it. This not only wastes time, it's boring work, and the refactoring is highly likely to introduce bugs. It also means your company isn't focusing on new and interesting features; instead you're stuck in maintenance mode.
If only you had used a mainstream package at the start, or maybe built something yourself, you would have avoided this mess. When you rely on packages created by hobbyists, you don't know how well they have been tested or whether they contain bugs or hidden surprises, like throwing an error when someone deactivates their Twitter account. I admit I've been forced to completely re-create projects before because there was too much reliance on these types of packages. In many instances from my past, the effort to start from a clean slate has been far less than refactoring 80% of a codebase.
My takeaways here are simple. When you pull in third-party data (especially for components that live within the header or footer), make sure the call is wrapped in a try/catch. Before you launch your site, test what happens if the data source dies. If these components fail, your whole site could fail.
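Here's a minimal sketch of that idea in C#. The `ITwitterClient` interface and `GetLatestTweetsAsync` method are hypothetical stand-ins for whatever third-party package you happen to use; the point is the fallback to an empty list so a failed feed never takes the page down:

```csharp
// Defensive wrapper around a third-party tweet fetcher. ITwitterClient and
// GetLatestTweetsAsync are hypothetical stand-ins for the real package.
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public record Tweet(string Text, DateTime PostedAt);

public interface ITwitterClient
{
    Task<IReadOnlyList<Tweet>> GetLatestTweetsAsync(string accountHandle, int count);
}

public class SafeTweetFeed
{
    private readonly ITwitterClient _client;

    public SafeTweetFeed(ITwitterClient client) => _client = client;

    public async Task<IReadOnlyList<Tweet>> GetTweetsAsync(string accountHandle)
    {
        try
        {
            return await _client.GetLatestTweetsAsync(accountHandle, 5);
        }
        catch (Exception ex)
        {
            // Log and degrade gracefully: an empty feed is better than a 500 on every page.
            Console.Error.WriteLine($"Twitter feed unavailable: {ex.Message}");
            return Array.Empty<Tweet>();
        }
    }
}
```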
Next, only use widely adopted packages. Nowadays, unless there is no other possible alternative, I only use packages created by a business like Microsoft, or ones that have been starred thousands of times. Following these tips will save you countless hours of unneeded maintenance time throughout your career!
People Unable To Get Into A Gym
This next issue happened during COVID, and the fallout led to me working 80-hour weeks for about a month. The sad part of this story is that, because I had just joined a new company, I didn't have paternity leave, and it all happened in the first month after my baby was born. Mixing a crying newborn in a one-bedroom flat with 80-hour work weeks meant this wasn't the best month of my life.
The big-picture story here is that the business was forced to shut down due to COVID. When lockdown finished, the chain was allowed to reopen. The issue with reopening was that the largest number of people in the company's history tried to access the site and the gyms at the same time. On top of that, the underlying system hadn't been used in anger since the gyms closed!
I won't cover too many details here, but the end result was that on opening day the site went down, the servers failed, and we saw a bunch of errors.
What are the takeaways here? The first point is something that many teams overlook. Over time, if a company is successful, it will require more resources to keep its website and servers running. In normal times, this increase will likely not be noticeable. A slight increase every month that doesn't impact performance means you're getting better server value, right?
Let's say you load test your new project before launch, with great results. Over the next three years, the company grows and doubles its stores and members. As the site is never under very heavy load, you might think everything is great and you're simply getting better value from your servers; however, there is a glaring issue.
One day, out of nowhere, an issue happens. Because the customer base has doubled, traffic can spike to 10x normal levels, and when this happens the underlying infrastructure can't cope. If you're using cloud infrastructure you might think scaling is easy; however, be aware that even with cloud scaling there's a chance certain parts of the system can't scale. That's exactly what happened here. Your scaling is only as good as its weakest parts: if a single mission-critical path can't scale, your whole estate will fail.
The takeaways here are simple but easy to overlook. Obviously, load testing is an essential step on all projects. If you do not load test at a much higher capacity than your expected average traffic, you cannot honestly say what will happen during busy periods.
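As a rough illustration of what "much higher than average" means in practice, here's a minimal C# sketch that fires a burst of concurrent requests at an endpoint and counts failures. The URL and the concurrency figure are placeholders; for real projects you would reach for a dedicated load-testing tool such as k6 or JMeter:

```csharp
// Minimal concurrency burst test: fire N simultaneous requests and count failures.
// A crude stand-in for a proper load-testing tool; URL and numbers are placeholders.
using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

class BurstTest
{
    static async Task Main()
    {
        var url = "https://staging.example.com/"; // placeholder endpoint
        var concurrentRequests = 500;             // well above expected average traffic

        using var client = new HttpClient();
        var tasks = Enumerable.Range(0, concurrentRequests)
            .Select(async _ =>
            {
                try
                {
                    var response = await client.GetAsync(url);
                    return response.IsSuccessStatusCode;
                }
                catch (HttpRequestException)
                {
                    return false;
                }
            });

        var results = await Task.WhenAll(tasks);
        Console.WriteLine($"Succeeded: {results.Count(r => r)} / {concurrentRequests}");
    }
}
```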
Secondly, and more subtly, only load testing at the start of a project is very risky. An easy thing to overlook is scheduling a load test every year. As a company grows, you need to load test in line with its success. Do not expect traffic to grow linearly with your member count; expect it to grow exponentially. This is where the real risk from a lack of regular load testing lies: as membership grows, the impact of a disaster is also magnified.
Often companies will not want to pay for or schedule these annual capacity reviews (this was one cause of what happened here); however, companies have the most to lose financially during their busiest periods. Without regular capacity planning, you massively increase the chances of an outage like this happening.
Finally, a system can only scale as far as its weakest part. Even if you think your API can scale well, if a downstream system can't, the whole thing will break, so don't assume!
Happy Coding 🤘