A Preliminary Analysis
The next day we had a lot of questions. What happened? Why did this change cause the error even though it had already been in production since Friday? Why didn’t it break the tests performed on different environments? Why didn’t the error affect everyone? How could we reproduce this error locally?
At this point, we still didn’t have a satisfactory answer to our outstanding questions. Our limited understanding of the root cause combined with the lack of context provided by our existing tooling made it clear that we needed to reproduce the issue in an environment similar to the production environment.
Make It Happen (again)
frontend.js and a vendor.js file and have an SHA hash included in their filenames which we presumed to be calculated from the content of the file (e.g.
vendor.js files had the same SHA in their names, the file sizes were different. This was a clear indication that something was wrong. We continued our investigation by taking a closer look at the
Shake it and Break it
diff to verify there was actually a difference in the code contained within the two
diff now does give us useful output (only relevant part shown):
In the output, we can see that in the previous vendor.js file the NavLink is marked as unused, and in the new vendor.js it is needed - this actually reflects a change in our code:
NavLink wasn’t used before in our codebase, so this change actually makes sense.
But then why didn’t the filename SHA change? After some investigation, we concluded that the SHA is not calculated from the file content, but rather from the set of source files combined into the bundle. Since we already used other exports from the react-router-dom module, that set doesn’t change when NavLink is imported, and tree shaking does not affect it. So the resulting bundles share the same SHA and thus the same filename, even though they differ in content. This causes issues with caching: caches will not notice the change in file content and will keep serving the old vendor.js file. The error occurs when the new code tries to access NavLink while the old cached vendor.js file, which doesn’t contain it, is being served.
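The difference between the two hashing strategies can be sketched in a few lines of Node.js. This is only an illustration of the behaviour described above, not the bundler's actual implementation: the helper names (shortHash, hashFromModules, hashFromContent), the module list, and the placeholder bundle strings are all assumptions made for the example.

```javascript
const crypto = require("crypto");

// Hypothetical helper: first 8 hex characters of a SHA-256 digest.
function shortHash(input) {
  return crypto.createHash("sha256").update(input).digest("hex").slice(0, 8);
}

// Hash derived from which source modules go into the bundle.
// Importing one more export from an already-included module changes nothing here.
function hashFromModules(moduleIds) {
  return shortHash([...moduleIds].sort().join("\n"));
}

// Hash derived from the emitted bundle itself.
// Any change in the output, including tree-shaking differences, changes it.
function hashFromContent(bundleSource) {
  return shortHash(bundleSource);
}

// Both builds pull in the same modules (placeholder list).
const modules = ["react", "react-router-dom"];

// Placeholder stand-ins for the two emitted vendor bundles.
const oldBundle = "vendor bundle where NavLink is shaken out";
const newBundle = "vendor bundle where NavLink is kept";

// Same filename for both builds when the hash ignores the output...
const oldName = `vendor.${hashFromModules(modules)}.js`;
const newName = `vendor.${hashFromModules(modules)}.js`;
console.log(oldName === newName); // true

// ...but different filenames when the hash is taken over the content.
console.log(hashFromContent(oldBundle) === hashFromContent(newBundle)); // false
```

With a module-based hash, a cache keyed on the filename has no way to tell the two builds apart.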
Taking into account all the information we have now, including the “What happened”, we revisited the timeline established at the start of this blog post in order to answer the following outstanding question: Why did this error resurface on Monday even though the code had already been in production since Friday?
When the commit that triggered the incident was reintroduced on Friday afternoon, it got released together with another unrelated commit that did, in fact, introduce a new module - causing a change in the vendor hash. So in that case, there was no conflict between different bundles using the same hash.
When that seemingly unrelated commit got reverted on Monday, the code change that had originally changed the vendor hash was removed, and the resulting hash matched that of the old cached vendor file that triggered the issue on Friday morning.
The result is that we were back to square one, where Cloudflare served the old cached version again, which did not contain the NavLink export. This also explains why, after reverting the breaking commit, an additional purging of the caches was needed.
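The timeline can be replayed with a toy cache that, like a CDN caching static assets, is keyed on the filename only. This is a simplified sketch: the EdgeCache class, the deploy helper, the filenames vendor.H1.js and vendor.H2.js, and the content strings are placeholders invented for the illustration and do not reflect the real filenames or Cloudflare's behaviour in detail.

```javascript
// Toy edge cache: the first response for a filename is kept until purged.
class EdgeCache {
  constructor() {
    this.entries = new Map();
  }

  fetch(filename, origin) {
    if (!this.entries.has(filename)) {
      this.entries.set(filename, origin.get(filename));
    }
    return this.entries.get(filename);
  }

  purge() {
    this.entries.clear();
  }
}

// A deploy replaces what the origin serves under a given filename.
function deploy(origin, filename, content) {
  origin.set(filename, content);
}

const origin = new Map();
const cache = new EdgeCache();

// Before the incident: vendor bundle without NavLink gets cached.
deploy(origin, "vendor.H1.js", "without NavLink");
const before = cache.fetch("vendor.H1.js", origin);

// Friday afternoon: the NavLink change ships together with a new module,
// so the filename hash changes and the cache fetches a fresh file.
deploy(origin, "vendor.H2.js", "with NavLink and new module");
const friday = cache.fetch("vendor.H2.js", origin);

// Monday: the new module is reverted. The filename falls back to H1,
// the origin now has NavLink in it, but the cache still holds the old file.
deploy(origin, "vendor.H1.js", "with NavLink");
const monday = cache.fetch("vendor.H1.js", origin);

// Purging the cache is what finally brings the two back in sync.
cache.purge();
const afterPurge = cache.fetch("vendor.H1.js", origin);

console.log({ before, friday, monday, afterPurge });
```

The Monday request is the interesting one: the origin is correct, yet the cache answers with the stale bundle because nothing in the filename changed.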
Stop Shaking the Tree! Fixing the Root (Cause)
Since the error could resurface with any newly introduced import, we chose to disable tree shaking for now as a mitigation.
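The configuration change itself is not shown in this post, so the following is only a sketch under the assumption that the bundles are built with webpack 4 or later. In that setup, switching off the two optimizations usually associated with tree shaking might look roughly like this; the rest of the configuration is omitted and the config variable name is arbitrary.

```javascript
// Assumed webpack (4+) configuration fragment; not the actual project config.
const config = {
  mode: "production",
  optimization: {
    // Do not determine which exports of each module are used, so unused
    // exports are no longer flagged for removal.
    usedExports: false,
    // Do not skip whole modules based on the "sideEffects" flag in package.json.
    sideEffects: false,
  },
};

module.exports = config;
```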
One would expect the bundle size to increase significantly, since dead code previously removed by tree shaking would now be included in the bundle. However, when we compiled the vendor bundle locally with tree shaking disabled, we noticed that the bundle didn’t significantly increase in size. It turned out that even though some modules were dereferenced by the tree shaking, they were never actually removed. They were still present in the bundle but couldn’t be reached by the rest of the code. Even though this behaviour was unintended, it allowed us to disable tree shaking without obvious downsides.
When we released the commit that disabled the tree shaking, the error resurfaced. Because tree shaking was disabled, previously tree-shaken code was included in the vendor bundle. However, given that the file name hash had not changed, the caches did not refresh and we ended up with an incompatible vendor file being served. It hadn’t occurred to us that the commit to resolve the issue could itself re-trigger the issue!
Given our better understanding of the issue, we were able to respond in a swift manner. We immediately involved our infrastructure team to purge all Cloudflare caches. Furthermore, we separately deployed a small change to the vendor file to force a refresh on the user’s browser cache.
The Actual End
We are Brian, Miray and Alexander and together we form the engineering trinity of the Kerk Squad at HackerOne. We’re the embodiment of Move Fast and Break Things. Aside from breaking production we are fluent in memeing.