Have you watched the Bruce Willis film Armageddon? I haven't. And yet its poster is seared into my memory. At my first job I was an engineer working on a Windows app which did "Unified Communications" - a fancy way of saying some video, a bit of chat and the odd attachment. We were doing quite well and had an app that I thought was genuinely more pleasant to use than the competition.
Like all places we had bugs and occasionally really bad ones. The UC space was extremely competitive and moving providers was becoming easier with Zoom rooms, credit card billing and the rise of remote work. We took to naming bugs that threatened to lose us a large number of customers "Armageddon" bugs and created a chat group specially for them, with the aforementioned film's poster as the icon.
One fateful day I checked my email and found we had reports of upgrades failing. I had my own Armageddon bug.
How did we upgrade?
The company had started off focussed on the meeting room and so a lot of the tech could be traced back to embedded Linux style development. If you installed the app you would find you had two copies of the application on disk, and a file saying which one was active. A "bootloader" application would read this file and start the actual app:
UCApp/
    config.dat                      # contains 2
    bootloader.exe                  # this is what runs from your start menu
    1/                              # inactive installation
        application.exe
        supporting_library.dll
    2/                              # active installation
        application.exe             # the bootloader then runs this exe, as config.dat=2
        supporting_library.dll
        new_supporting_library.dll
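As a rough sketch of what a bootloader like this has to do, the snippet below reads the slot number and builds the path of the executable to launch. It is an illustration rather than the original code: it assumes config.dat is plain text holding just the digit, and the class and method names are made up for the example.

```csharp
using System;
using System.IO;

// Hypothetical names; the real bootloader's code is not shown in this post.
public static class BootloaderSketch
{
    // Reads the active slot number from config.dat in the install root.
    // Assumption: the file is plain text containing "1" or "2".
    public static int ReadActiveSlot(string installRoot)
    {
        string text = File.ReadAllText(Path.Combine(installRoot, "config.dat")).Trim();
        int slot = int.Parse(text);
        if (slot != 1 && slot != 2)
        {
            throw new InvalidDataException("Unexpected slot number: " + slot);
        }
        return slot;
    }

    // Builds the path of the executable the bootloader should start,
    // e.g. <installRoot>\2\application.exe when config.dat contains 2.
    public static string ResolveActiveExecutable(string installRoot)
    {
        int slot = ReadActiveSlot(installRoot);
        return Path.Combine(installRoot, slot.ToString(), "application.exe");
    }
}
```

The bootloader would then start the returned path as a child process.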
In the background when an upgrade was available we'd download the new version (i.e. application.exe and any supporting DLLs) into the inactive installation and change the number in the config file. Then the user would be prompted to restart the app which would start the new version.
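The slot switch itself can be sketched in a few lines. This is a minimal illustration under assumptions: the names are hypothetical, config.dat is treated as a plain-text digit, and writing to a temporary file before moving it into place is a choice made for this example, not something the post says the real updater did. `File.Move` with the `overwrite` parameter needs .NET Core 3.0 or later.

```csharp
using System;
using System.IO;

// Hypothetical names; a sketch of the two-slot switch, not the original updater.
public static class UpgradeSketch
{
    // With slots numbered 1 and 2, the inactive slot is simply the other one.
    public static int OtherSlot(int activeSlot)
    {
        if (activeSlot != 1 && activeSlot != 2)
        {
            throw new ArgumentOutOfRangeException(nameof(activeSlot));
        }
        return 3 - activeSlot;
    }

    // Points config.dat at the given slot. The new value is written to a
    // temporary file first and then moved over the old one, so a crash
    // mid-write does not leave a half-written config.dat behind.
    public static void ActivateSlot(string installRoot, int slot)
    {
        string configPath = Path.Combine(installRoot, "config.dat");
        string tempPath = configPath + ".tmp";
        File.WriteAllText(tempPath, slot.ToString());
        File.Move(tempPath, configPath, overwrite: true);
    }
}
```

The download step would fill the directory named by `OtherSlot(active)` before `ActivateSlot` is called, so the running version is never touched.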
We had recently had cause to upgrade the bootloader itself, which was not covered by this mechanism. This was a complicated process where the new bootloader would be included in the installation directory. On the first run, application.exe would kill the bootloader that started it and move the new bootloader into place. This new bootloader was modified to allow it to upgrade itself.
All in all, the bootloader upgrade had gone smoothly. But when, a release or two later, upgrades started failing we became suspicious.
Often with issues like this we would see one particular OS version affected. Maybe a Windows update changed behaviour slightly, or maybe it was a latent bug in Windows 7 that just happened to get hit at the same time. In this case however it was a range of versions affected, with machines that had nothing in common. Crappy notebooks and powerful desktops seemed just as likely to be hit.
With an upcoming change that would require us to bump the minimum supported app version - and therefore upgrade everyone - it was time to dust off the Armageddon group chat.
Debugging
As with any issue on a customer's machine we had to get them to send the logs. Unfortunately we could not do this automatically but a few customers were good enough to click the appropriate button in the UI.
The upgrade code was overzealous with its logging, to our advantage. We saw that we could not delete the inactive installation's folder as it was locked. Windows, unlike Linux and Mac, is fussy about deleting/renaming files or directories when they are in use by another program.
The bootloader upgrade sprang to mind - maybe on its previous run it had left a file handle open? That didn't make sense though. When a program exits the OS should clear up all its resources. An OS bug then? No, no. I'm letting myself off the hook. A bootloader must be hanging around for some reason.
Luckily we had just the tool to help us validate our theory - a problem reporter. This was an application that ran a bunch of useful tests that we had built up over the years - graphics driver checks, memory usage reporting, free disk space and, most importantly, a list of all processes with our company's name in the path.
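For illustration, a process listing of that kind might look roughly like the sketch below. The names are hypothetical and this is not the reporter's actual code. One thing the sketch makes visible: on Windows, reading another process's executable path can fail (for example with access denied), and any process skipped for that reason silently drops out of the list.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Diagnostics;
using System.Linq;

// Hypothetical names; a sketch of "list processes with our name in the path".
public static class ProcessListSketch
{
    // Keeps the (pid, path) pairs whose path contains the fragment, ignoring case.
    public static List<(int Pid, string ExePath)> FilterByPath(
        IEnumerable<(int Pid, string ExePath)> processes, string fragment)
    {
        return processes
            .Where(p => p.ExePath != null &&
                        p.ExePath.IndexOf(fragment, StringComparison.OrdinalIgnoreCase) >= 0)
            .ToList();
    }

    // Takes a snapshot of running processes and their executable paths.
    // Processes whose path cannot be read are left out of the result.
    public static List<(int Pid, string ExePath)> Snapshot()
    {
        var result = new List<(int Pid, string ExePath)>();
        foreach (Process process in Process.GetProcesses())
        {
            using (process)
            {
                try
                {
                    result.Add((process.Id, process.MainModule?.FileName));
                }
                catch (Win32Exception)
                {
                    // Typically access denied: skip this process.
                }
                catch (InvalidOperationException)
                {
                    // The process exited while we were looking at it: skip it.
                }
            }
        }
        return result;
    }
}
```

A call such as `ProcessListSketch.FilterByPath(ProcessListSketch.Snapshot(), "UCApp")` would then give the list to include in the report; "UCApp" here stands in for the company name.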
So we got a few of the customers to run it. There was only one bootloader process - the one that had launched the running version of the app and therefore shouldn't have a handle on the old installation directory.
Now I was beginning to worry. A few days had passed, the date we were supposed to bump the minimum supported version was fast approaching, and my assumptions had been wrong. Looking back now I can see I was getting overly het up about it, but at the time it felt like my first major cock-up.
Going back to the drawing board I wondered whether antivirus was scanning the old installation - we had strange issues with that before. Or perhaps our filtering of the process list wasn't quite right and had missed the bootloader.
Handles
I was clutching at straws and so far my assumptions had got me nowhere. I decided to gather more information and try to work out exactly what had the lock on the directory.
On Linux I knew how to tell what had a handle to a specific file - lsof. I had no idea on Windows. With some googling I found Handle, part of the Windows Sysinternals tools. You give it a path, or part of a path, and it shows any application with handles to it. This was perfect!
Unfortunately those affected by the issue seemed to be run of the mill employees - not very technical. To make it easier we were hoping to embed Handle in the problem reporter, collect its stdout and send it along with the reporter results. The licence prohibited this though. Instead we would instruct users to download the program and then we would modify the problem reporter to look for it in the user's download folder. If it was there we would run it and collect the output.
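The "look for it in the downloads folder and run it" step could be sketched as follows. This is an illustration with hypothetical names, not the reporter's real code. It assumes the downloads folder is `Downloads` under the user profile (users can relocate it), passes Sysinternals' `-accepteula` switch so the first run does not stop on the licence dialog, and uses `ProcessStartInfo.ArgumentList`, which needs .NET Core 2.1 or later.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical names; a sketch of running a user-downloaded handle.exe.
public static class HandleRunnerSketch
{
    // Assumption: the user saved handle.exe to <user profile>\Downloads.
    public static string DefaultHandlePath()
    {
        string profile = Environment.GetFolderPath(Environment.SpecialFolder.UserProfile);
        return Path.Combine(profile, "Downloads", "handle.exe");
    }

    // Describes a run of handle.exe against the install directory with
    // stdout redirected so the caller can capture it.
    public static ProcessStartInfo BuildStartInfo(string handleExe, string installRoot)
    {
        var info = new ProcessStartInfo(handleExe)
        {
            UseShellExecute = false,       // required for output redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true,
        };
        info.ArgumentList.Add("-accepteula");
        info.ArgumentList.Add(installRoot);
        return info;
    }

    // Returns Handle's output, or null if the user has not downloaded it.
    public static string CollectOutput(string installRoot)
    {
        string handleExe = DefaultHandlePath();
        if (!File.Exists(handleExe))
        {
            return null;
        }
        using (Process process = Process.Start(BuildStartInfo(handleExe, installRoot)))
        {
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return output;
        }
    }
}
```

Using `ArgumentList` rather than a hand-built argument string avoids quoting problems with paths that contain spaces or end in a backslash.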
This took another day or so to do, test and send out to customers. A couple of results came back within a few hours - thanks customers! On one machine Chrome had the directory open. On another, Windows' image viewer.
Erm, What?
When I first saw this I couldn't believe it. Why on earth would a user open anything in the installation directory using Chrome or an image viewer? The answer, of course, is that they didn't. We had.
> handle.exe C:\Users\User\UCApp\
Nthandle v4.22
Copyright (C) 1997-2021 Mark Russinovich
Sysinternals - www.sysinternals.com
chrome.exe pid: 624
98: File (R--) C:\Users\User\UCApp\1\
Remember how I said that we were a "Unified Communications" app that did the odd attachment? Well, like many an app, once the user clicked on an attachment we opened it for them. We used C#'s Process API:
var info = new ProcessStartInfo("explorer");
info.Arguments = path;
info.WorkingDirectory = ".";
Process.Start(info);
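See that . there? It means that explorer is started with its working directory set to the current one. For us it was the directory that the UC application was in - i.e. one of the two installation directories. Whatever program then opened the attachment, be it Chrome or the image viewer, evidently ended up with that same working directory, and on Windows a process keeps a handle open on its working directory for as long as it runs.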
Why was the locked directory the old installation and not the current one then? I surmised it must be because the attachment was opened when that directory was the active installation - a whole upgrade ago.
Conclusion
In the end fixing it was simple. For currently affected customers simply closing all other apps, or a reboot, sorted it out. For the future we changed the working directory when opening attachments and made the upgrade process delete files in the installation directory, not the installation directory itself.
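The two changes might look something like the sketch below. It is a minimal illustration with hypothetical names. The post does not say which working directory was chosen, so the temp folder here is an assumption, as is removing subdirectories as well as files when emptying the old installation.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical names; a sketch of the two fixes, not the shipped code.
public static class FixSketch
{
    // Opens an attachment via explorer, but with a working directory that is
    // not one of the installation directories. Assumption: the temp folder.
    public static ProcessStartInfo BuildOpenAttachmentInfo(string attachmentPath)
    {
        var info = new ProcessStartInfo("explorer");
        info.Arguments = "\"" + attachmentPath + "\"";
        info.WorkingDirectory = Path.GetTempPath();
        return info;
    }

    // Empties a directory without deleting the directory itself, so a handle
    // that another process holds on the directory does not block the upgrade.
    public static void ClearDirectoryContents(string directory)
    {
        foreach (string file in Directory.GetFiles(directory))
        {
            File.Delete(file);
        }
        foreach (string subdirectory in Directory.GetDirectories(directory))
        {
            Directory.Delete(subdirectory, recursive: true);
        }
    }
}
```

Opening the attachment then becomes `Process.Start(FixSketch.BuildOpenAttachmentInfo(path))`, and the upgrade calls `ClearDirectoryContents` on the inactive installation before downloading into it.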
I was extremely relieved. Customers got their upgrades back and we only delayed the minimum supported version bump by a couple of days. In hindsight this incident taught me a lot.
Firstly, it's no use dithering, flapping, or ruminating. Thinking carefully is fine; avoiding the decision is not.
Secondly, assumptions are the enemy of debugging. For most of this time I was convinced that the bootloader was at fault. In many ways this was natural - it was a difficult task to do atomically and missing an edge case could have happened. It was extremely well tested however - probably the second most well tested code I've written. Thinking it was responsible closed my mind to alternatives, and without that assumption maybe I would have gone straight to handle.exe.
Finally, I need to watch myself when under pressure. I remember being extremely grumpy with my friends during this time, and I remember snapping at the release manager that I was too busy to give an update on what features were going out. Sorry if you're reading this.
All in all I'm thankful it happened.
Thanks to Akhil for proof-reading this post.