Saturday, August 20, 2016

Lucee's Rebellion

We've all been there. Facing an elusive bug that just won't die. Over the last few weeks we have been waging a war with such a bug at work. It has easily become the most involved and confusing bug that I have worked on. That said, I wasn't even the first person working on the issue, and others at the company have spent countless more hours on it.

So here is a simple rundown of the history of the issue. A few months back we started sporadically seeing our sites go down with seemingly no explanation. A site would be working fine, and then you would get a 503 response back from the reverse proxy, and accessing the application directly would just leave the browser spinning. A restart of the server and all was well in the world, but still, a site going down randomly with seemingly no explanation was not a good thing. When the issue was originally seen it only happened on our high-traffic QA sites and a few other devs' machines. As time went on the issue became more and more common, to the point where all sites were going down fairly consistently. So we had a problem.


Once we hit the point where all sites were going down consistently, we knew we had to do something drastic. We got a team together, locked them in a room, and hoped for the best. At that point we didn't know many facts, but we knew it took a good amount of time to see the issue. Logging levels were upped, JConsole was run constantly, and heap dumps were analyzed. There actually wasn't much of a difference between a healthy site and a broken site.


Intermission time for a brief explanation of the architecture of this product:



[Architecture diagram: an Apache reverse proxy in front of a Tomcat app server hosting the Lucee servlet and the new Jersey servlet]

Of course this is an oversimplification of the architecture, but it is sufficient to describe the bug we were facing. Put simply: an Apache reverse proxy sits in front of our backend services, doing load balancing and other good stuff. One of the main services behind that load balancer is the one shown, a Tomcat app server that hosts a Lucee (ColdFusion Markup Language engine) servlet, where the majority of the code resides, plus a new servlet we recently introduced that holds pure Java endpoints. This Jersey servlet was new to this release as we look toward the future and read the writing on the wall that the ColdFusion community is dying. However, enough about architecture.
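For a rough idea of what those "pure Java endpoints" look like, here is a minimal sketch of a JAX-RS resource of the kind the Jersey servlet serves. The class name, path, and payload are hypothetical, not one of our actual endpoints.

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Hypothetical resource: handled entirely in Java by the Jersey servlet,
// never touching the Lucee/CFML side (or so we thought).
@Path("/status")
public class StatusResource {

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public String status() {
        return "{\"status\":\"ok\"}";
    }
}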

Going into this issue we wondered if Jersey might be at fault, but we weren't able to confirm that until we ripped out all Jersey endpoints (or simply didn't call them) and the site wouldn't go down. Once we started investigating further, we also found out that the server would break almost exactly three hours after starting. This led to a horrible feedback loop of having to wait three hours after every change. I won't bore you with all of the things that were tried and investigated, but suffice it to say that we narrowed it down to a threading issue of some sort. We were getting IllegalMonitorStateExceptions and ThreadDeath exceptions, and using JConsole to interrogate the task queue in Tomcat on a bad site always showed that the next thread in the queue was null for some reason. This led to writing our own versions of Java concurrent data structures, custom builds of Tomcat, and bugs logged against Tomcat, yet we were never able to find an actual fix. Taking a step back to see what we had learned, we decided to take another look at Lucee.
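As an aside, you don't have to keep JConsole open to watch those thread pools. A small JMX probe like the sketch below can poll Tomcat's connector thread-pool MBeans; the host, port, and attribute names are assumptions rather than our exact setup.

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ThreadPoolProbe {
    public static void main(String[] args) throws Exception {
        // Assumes Tomcat was started with remote JMX enabled on port 9010.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Tomcat registers one ThreadPool MBean per connector.
            Set<ObjectName> pools = mbs.queryNames(
                    new ObjectName("Catalina:type=ThreadPool,*"), null);
            for (ObjectName pool : pools) {
                System.out.printf("%s: busy=%s, total=%s%n",
                        pool.getKeyProperty("name"),
                        mbs.getAttribute(pool, "currentThreadsBusy"),
                        mbs.getAttribute(pool, "currentThreadCount"));
            }
        } finally {
            connector.close();
        }
    }
}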


We had not been looking too closely at Lucee because if you only exercised Lucee code you would not see the issue; only the code path through Apache and Jersey would trigger it. Our chief architect found a location in the code that would kill threads if they were still alive after a configured timeout, and we decided to investigate there. Sure enough, the default timeout was three hours! We already knew three hours was significant, so this felt like a win, and it was. We set the timeout to three minutes and, sure enough, three minutes after exercising a Jersey endpoint the site went down. This was really the first time we had any control over the issue, so it felt great.

There was still plenty of confusion about why this was happening. The Jersey call was definitely ending before that three-minute timeout, and why was Lucee having its hand in these requests anyway? We decided to log the stack trace of the thread right before it was killed, and it showed that the thread was in the parked state. This confused us even further, as a thread in a parked state definitely shouldn't be killed for a timeout; it is ready and willing to do more work. It did, however, explain how we got into our bad state: this perfectly good thread was being killed by Lucee, more or less corrupting the task queue. The IllegalMonitorStateExceptions and ThreadDeath exceptions were also explained by this. We added some temporary code that would check the stack trace of the thread before killing it and, if it was in the parked state, not kill it. Sure enough, this solved the problem. Fragile for sure, but it did the job and told us we were on the right track.
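Roughly, that temporary guard looked something like the sketch below. This is not the actual Lucee code; the class and method names are mine, and the "parked" check is just a stack-frame heuristic.

// Sketch of the temporary workaround: before the timeout handler forcibly
// stops a request thread, leave it alone if it is just parked (an idle pool
// worker waiting on the executor's task queue for more work).
public class ParkedThreadGuard {

    public static void stopUnlessParked(Thread t) {
        if (isParked(t)) {
            return; // idle connector thread; killing it corrupts the task queue
        }
        t.stop(); // deprecated and dangerous, but this is what the timeout path did
    }

    static boolean isParked(Thread t) {
        StackTraceElement[] frames = t.getStackTrace();
        // A parked worker sits in LockSupport.park()/Unsafe.park() at the
        // top of its stack.
        return frames.length > 0
                && "park".equals(frames[0].getMethodName())
                && (frames[0].getClassName().endsWith("Unsafe")
                        || frames[0].getClassName().endsWith("LockSupport"));
    }
}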


The next problem to solve was figuring out why these threads were being tracked by Lucee and why they weren't being cleared out of the tracked queue when the request was over. This part of the process actually didn't take too long. Our chief architect is one of the contributors to the Lucee codebase and therefore had a fairly solid understanding of the places to look. He pointed us in a few directions and we attacked.

As a way of understanding what was happening, and to determine whether the requests were actually making it into the Lucee servlet, I decided to log the stack trace of each request that came in. The majority of the requests were legitimate, of course, but we did see a few come in for a URL Lucee:80, and when one of these came in it was a "kiss of death" for the site. We weren't sure what this URL was for and were sure we definitely weren't calling it. From this stack trace we did get an explanation of why these threads were coming into Lucee even though they were requests to Java endpoints: first the request hit Tomcat, no surprise, then Jersey handled it, after which it went into Spring dependency injection, which (drumroll) instantiated Lucee.

A little context on why that stack trace is possible. In our move to Java we had some context that lived only in ColdFusion and we wanted to get it over into Java. To enable this, and since Lucee just runs on the JVM, we decided to use Spring to pass it between the two pieces. When Lucee boots up it loads the context object into Spring's application context, and when Jersey needs said context it can retrieve it from Spring. It is a solution we are pretty proud of.

Back to our previously scheduled rant. Once we saw this it hit us like a ton of bricks how this was happening. It hadn't occurred to anyone that getting this context object would create a request to the server. Each request comes into CFMLFactoryImpl and gets added to a concurrent map that is iterated over occasionally to see if the timeout has elapsed. So our request was being added to the map when it came in to the Lucee:80 URL, but because it is not a real request it never "ends" and therefore, as far as the system is concerned, sticks around forever. We noticed that these fake requests were conveniently of type HttpServletRequestDummy. With this knowledge in mind, we extended the functionality so that HttpServletRequestDummy requests are not tracked. And with this change of ~10 lines we had solved our elusive bug. We committed the code, filed a pull request with Lucee proper, and were done. As these things usually go, the majority of the time was spent finding the root of the issue and a much smaller amount actually fixing it.
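To make the shape of that ~10-line change concrete, here is an illustrative sketch. The real change lives in Lucee's CFMLFactoryImpl and its own request-tracking map, so the class below and its name-based dummy check are stand-ins of mine, not the actual patch.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import javax.servlet.http.HttpServletRequest;

// Illustrative only: track live requests so a watchdog can time them out,
// but never track Lucee's internal dummy requests, which have no real end
// and would otherwise sit in the map until their (healthy) thread was killed.
public class RequestTracker {

    private final ConcurrentMap<Long, HttpServletRequest> active =
            new ConcurrentHashMap<>();

    public void register(Thread worker, HttpServletRequest req) {
        // Stand-in for the real check against Lucee's HttpServletRequestDummy.
        if ("HttpServletRequestDummy".equals(req.getClass().getSimpleName())) {
            return;
        }
        active.put(worker.getId(), req);
    }

    public void unregister(Thread worker) {
        active.remove(worker.getId());
    }
}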


So what were my lessons learned from this bug? Honestly, some basic stuff. Logs are crucial: we actually had a log of these timeouts, but it was getting lost in the mix in a log file that is never read. Even if you think a certain piece of code couldn't be the issue, you could well be wrong. And integration of various technologies can cause some crazy bugs.


As you can see, there was nothing special about what we learned. Just applying simple principles helps solve the gnarliest of bugs.
