Debugging a "production only" bug

You might be familiar with the scenario. Months of hard work, sprint after sprint, features and bug fixes, and now your project is ready to be deployed. The countdown starts at the same time as the nail-biting, and then… the site is up and running.

Everything looked good: no 404s, no 500s. Just when I was ready to relax, I noticed something weird. The header menu was not showing at all!

Not believing my eyes, I quickly hit the reload button and it magically reappeared… Confused, I started navigating the site, and the menu kept disappearing unpredictably.

There were no console errors or warnings, which was no surprise since the JS bundles were built in production mode. It wasn’t a server issue either: the server-side rendered markup looked fine.

I’m using VueJS with SSR (Server Side Rendering) and I also have a couple of third-party script tags on my site. These third-party scripts use jQuery, so my first guess was a hydration issue. Hydration problems happen when something changes the DOM structure before Vue’s client-side hydration kicks in, so the markup Vue finds no longer matches what the server rendered. The weird thing was that it had never happened in any other environment.

The main difference I noticed in my local environment was that, no matter what, the external scripts would always run last. I decided I needed to simulate different loading times so I could “choose” the order in which the scripts loaded, so I went to the Express server and did something like this:

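// Hold the request for the bundle for 3 seconds, then pass it on
// to whatever handler normally serves the file.
app.get('/my-js-bundle.js', function (req, res, next) {
  setTimeout(() => {
    next();
  }, 3000);
});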

That allowed me to play with the order in which each script loaded. I tried every permutation I could think of but still couldn’t reproduce the error.
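To try several orderings without editing the route each time, the same idea can be driven by a small lookup table. This is only a sketch: the `delays` table, the second path and the `delayByPath` helper are hypothetical names, not part of the original setup.

```javascript
// Hypothetical delay table: request path -> milliseconds to hold the response.
const delays = {
  '/my-js-bundle.js': 3000,
  '/my-other-bundle.js': 0, // assumed second bundle, for illustration only
};

// Express-style middleware: waits the configured time, then hands the request on.
// `timer` is injectable so the helper can be exercised without real waiting.
function delayByPath(table, timer = setTimeout) {
  return function (req, res, next) {
    const ms = table[req.path] || 0;
    if (ms > 0) {
      timer(next, ms);
    } else {
      next();
    }
  };
}

// Usage sketch (register before the handler that serves the files):
// app.use(delayByPath(delays));
```

Changing the numbers in the table and reloading the page is then enough to flip the loading order.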

At this point, turning on development mode in production was tempting, but part of me knew there had to be another option. I set up an ngrok tunnel to try my local environment over HTTPS, and I even pointed my local project at the production scripts to see if I would get a different result. Nothing…

I got to a point where my setup was practically the same as the production one, except that mine was running at http://localhost:3000 instead of the production domain… and then it hit me.

I fired up my terminal, typed sudo vim /etc/hosts and wrote an entry like this:

127.0.0.1   my-production-site.com

This made the production domain resolve to my own machine, so any request to it landed on my local environment, and my PC, the browser and every script behaved as if they were on the real domain. This meant that window.location.hostname would be equal to my-production-site.com.

I opened my-production-site.com:3000 (as mentioned, 3000 was the port of my local server) and BAM! There was the error, clear as day in the browser console. One of the third-party scripts was definitely changing the DOM, but only when the domain matched the production URL.
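A script can gate its behavior on the domain with a check as small as the one below. This is a hypothetical sketch to illustrate the idea, not the actual third-party code; `isAllowedDomain` and the allow-list are made-up names.

```javascript
// Hypothetical allow-list check, similar in spirit to what a vendor script might do.
function isAllowedDomain(hostname, allowed) {
  return allowed.some(
    (domain) => hostname === domain || hostname.endsWith('.' + domain)
  );
}

// In the browser the check would run against window.location.hostname.
// (window.location.host also carries the port, e.g. "my-production-site.com:3000".)
if (
  typeof window !== 'undefined' &&
  isAllowedDomain(window.location.hostname, ['my-production-site.com'])
) {
  // ...only on a matching domain would the script go on to touch the DOM...
}
```

With a check like this, localhost and an ngrok URL both fail the test, which would explain why the bug only showed up once the hostname matched.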

After that, the fix was just a matter of initializing that specific external script after the hydration process had finished.
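One way to do that, sketched here under assumptions, is to drop the static script tag and inject it from the root component’s mounted hook, which in Vue SSR runs only on the client once the component has been mounted. The `loadExternalScript` helper and the URL are hypothetical placeholders.

```javascript
// Hypothetical helper: inject a third-party script tag on demand.
// `doc` defaults to the browser document; passing it in keeps the helper testable.
function loadExternalScript(src, doc = globalThis.document) {
  return new Promise((resolve, reject) => {
    const script = doc.createElement('script');
    script.src = src;
    script.async = true;
    script.onload = () => resolve(script);
    script.onerror = () => reject(new Error(`Failed to load ${src}`));
    doc.head.appendChild(script);
  });
}

// Usage sketch inside the root Vue component (placeholder URL):
// export default {
//   mounted() {
//     loadExternalScript('https://third-party.example/widget.js');
//   },
// };
```

Because the script is only requested after mounting, it cannot touch the server-rendered markup before Vue has taken it over.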

What I learned

There is (almost) always a way to reproduce production bugs.
You should try to replicate every relevant aspect that might differ from your other environments. It can be the domain, like in this case, but also think about a specific DB entry or even differences in the protocol, like HTTP vs. HTTPS.
Finally, try to have an environment that (when possible) closely mimics the production one, so you can catch any error before deploying or, in the worst-case scenario, debug it when it happens on your live site.