The JavaScript Incident: How to Break Production (and Learn From It)

Feb 28, 2019

On Monday, February 4th, 2019, our production environment suffered significant downtime due to a JavaScript error. The root cause of this outage - since dubbed "The JavaScript Incident" - proved hard to find, and turned out to be intricate and interesting in nature. It involved mismatched JavaScript bundles, misleading SHAs, shaking trees, and more. In this post, we aim to give an accurate account of the incident timeline, the outcome of the root cause analysis, and what we learned.

The infamous error

Recap

The issue was first reported on Friday, February 1st. That morning, after a release, we quickly reverted a commit when a JavaScript error was reported. After a brief inspection, we released this commit again the same day - this time without error. During the weekend, the error did not reoccur. On Monday, February 4th, after a seemingly unrelated release, the error resurfaced. Production was reported down for a subset of our user base but seemed to be working fine for other users. We identified the error as the same one reported on February 1st. Suspecting that the root cause of the two incidents was the same, we once again reverted the commit that had been reintroduced on Friday afternoon. This by itself, unfortunately, did not completely resolve the issue. Some of our users reported the issue as resolved, whereas others still encountered an error - including people who had not been affected by the error in the first place. Purging our Cloudflare caches resolved the issue for all users.

A Preliminary Analysis

The next day we had a lot of questions. What happened? Why did this change cause the error even though it had already been in production since Friday? Why didn't it break tests performed on different environments? Why didn't the error affect everyone? How could we reproduce this error locally?

Upon first inspection, the JavaScript error itself didn't give us much to go on. The error message that we captured in Sentry was not very descriptive, because our JavaScript bundles are uglified and minified in our production environment. Initial analysis of the commit that triggered the incident also did not reveal any clear connection between the error and the code change. Given that only a subset of the HackerOne user base was affected by this error and reverting the code change did not completely solve the issue, our thinking was that the underlying issue might be more complex.

Since we had to purge the Cloudflare caches in order to resolve the issue, our first suspicion was that this might be a caching issue. Another hypothesis was that one or more application servers might be serving a corrupted or outdated JavaScript bundle. We confirmed, with help from our infrastructure team, that all application servers were running the same Yarn and Webpack versions and were serving the same JavaScript bundle.

At this point, we still didn’t have a satisfactory answer to our outstanding questions. Our limited understanding of the root cause combined with the lack of context provided by our existing tooling made it clear that we needed to reproduce the issue in an environment similar to the production environment.

Make It Happen (again)

Considering that we still had a limited understanding of the root cause at this point, we decided to redeploy the code to investigate the behavior further. Our aim was to test the behavior on our test server and analyze the JavaScript bundle files.

After deployment, out of all the developers involved in this part of the investigation, only one could reproduce the issue on their machine. Furthermore, the issue was no longer reproducible on that machine after a hard refresh in the browser, which removes any locally cached files. We made sure to download all the JS bundles in both instances in order to compare and cross-reference the files for inconsistencies. Our JavaScript bundles consist of a frontend.js and a vendor.js file and have a SHA hash included in their filenames, which we presumed to be calculated from the content of the file (e.g. vendor.mzmzmz.js). At first glance, the JavaScript bundles seemed to be the same - the SHAs in the filenames were identical. Upon closer inspection, however, we noticed that although the vendor.js files had the same SHA in their names, the file sizes were different. This was a clear indication that something was wrong. We continued our investigation by taking a closer look at the vendor.js files.

Shake it and Break it

We used diff to verify there was actually a difference in the code contained within the two vendor.js files:

diff ~/file1 ~/file2
11c11
< b1dda86dcadb6",1:"d464881fb1c00be22232",2:"1f4c014d24399a1cc107"}[e]+".js";var u=setTimeout(n,12e4);
---
> b1dda86dcadb6",1:"d13ba94b1e56b8416286",2:"1f4c014d24399a1cc107"}[e]+".js";var u=setTimeout(n,12e4);
781,2200c781,2202
< tive","aria-current":"page"},t.a=f},"11Uz":function(e,t){function n(e){var t=0,n=0;return function()
< {var a=i(),s=o-(a-n);if(n=a,s>0){if(++t>=r)return arguments[0]}else t=0;return e.apply(void 0,argume
< nts)}}var r=800,o=16,i=Date.now;e.exports=n},"12bv":function(e,t){function n(e){var t=[];if(null!=e)
....

Due to the uglification of our JavaScript code, this output is unreadable. In order to understand what the actual difference is, we rebuilt the JavaScript files with uglification disabled. Running diff again now gives us useful output (only the relevant part shown):

...
-/* unused harmony default export */ var _unused_webpack_default_export = (NavLink);
+/* harmony default export */ __webpack_exports__["a"] = (NavLink);
...

In the output, we can see that in the previous vendor.js file the NavLink export was marked as unused. This is done by a process called tree shaking, which eliminates unused code from the generated JavaScript bundle. In our old vendor.js file NavLink is marked as unused, while in the new vendor.js it is needed - this reflects an actual change in our code:

import { NavLink } from 'react-router-dom';

NavLink wasn’t used before in our codebase, so this change actually makes sense.
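To make the mechanism concrete: in webpack (option names below assume a webpack 4-era build; our actual configuration may have differed), tree shaking is the combination of marking exports that nothing imports and letting the minifier strip the marked code. A minimal sketch:

```javascript
// webpack.config.js (sketch, not our actual config)
module.exports = {
  mode: 'production',
  optimization: {
    // Mark exports that no module imports; these show up in the output
    // as "/* unused harmony export ... */" annotations.
    usedExports: true,
    // The minifier (UglifyJS/Terser) then drops the annotated code.
    minimize: true
  }
};
```

With `usedExports` on, adding a single new `import { NavLink }` is enough to flip NavLink from "unused harmony export" to a real export in the emitted bundle.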

But then why didn't the filename SHA change? After some investigation, we concluded that the SHA is not calculated from the file content - but rather from the list of source files combined into the bundle. Since we already used other exports from the react-router-dom module, that list doesn't change - tree shaking does not affect it. The resulting bundles therefore share the same SHA, and thus the same filename, even though they differ in content. This causes issues with caching: caches do not notice the change in file content and keep serving the old vendor.js file. The error occurs when the new frontend code tries to access NavLink while the browser is still using the old cached vendor.js, which doesn't export it.

Timeline revisited

Taking into account all the information we have now, including the "what happened", we revisited the timeline established at the start of this blog post in order to answer the following outstanding question: why did this error resurface on Monday even though the code had been in production since Friday?

When the commit that triggered the incident was reintroduced on Friday afternoon, it got released together with another unrelated commit that did, in fact, introduce a new module - causing a change in the vendor hash. So in that case, there was no conflict between different bundles using the same hash.

When that seemingly unrelated commit was reverted on Monday, the code change that had originally triggered the vendor hash change was removed, and the resulting hash matched that of the old cached vendor file that triggered the issue on Friday morning.

The result was that we were back to square one, with Cloudflare serving the old cached version that did not contain the NavLink export. This also explains why, after reverting the breaking commit, an additional purge of the caches was needed.

Stop Shaking the Tree! Fixing the Root (Cause)

Since the error could resurface with any newly introduced import, we chose to disable tree shaking for now as a mitigation.

One would expect the bundle size to increase significantly, as dead code previously removed by tree shaking would now be included in the bundle. However, when we compiled the vendor bundle locally with tree shaking disabled, we noticed that the bundle didn't significantly increase in size. It turned out that even though some modules were dereferenced by tree shaking, they were never actually removed. They were still present in the bundle but couldn't be reached by the rest of the code. Even though this behaviour was unintended, it allowed us to disable tree shaking without obvious downsides.
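As a sketch (again assuming webpack 4-style options; the exact change we shipped may have looked different), disabling tree shaking amounts to turning off the unused-export marking so the minifier no longer strips anything:

```javascript
// webpack.config.js (sketch): disable tree shaking as a mitigation
module.exports = {
  mode: 'production',
  optimization: {
    usedExports: false,  // stop marking exports as unused
    sideEffects: false   // don't prune whole modules flagged as side-effect-free
  }
};
```

With these off, every export survives into the bundle, so a newly added import can never reference code that tree shaking removed.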

The End?

When we released the commit that disabled tree shaking, the error resurfaced. Because tree shaking was disabled, previously tree-shaken code was now included in the vendor bundle. However, since the filename hash had not changed, the caches did not refresh, and we ended up with an incompatible vendor file being served. It hadn't occurred to us that the commit meant to resolve the issue could itself re-trigger it!

Given our better understanding of the issue, we were able to respond in a swift manner. We immediately involved our infrastructure team to purge all Cloudflare caches. Furthermore, we separately deployed a small change to the vendor file to force a refresh on the user’s browser cache.

The Actual End


We are Brian, Miray and Alexander and together we form the engineering trinity of the Kerk Squad at HackerOne. We’re the embodiment of Move Fast and Break Things. Aside from breaking production we are fluent in memeing.