Saturday, August 20, 2016

Lucee's Rebellion

We've all been there. Facing an elusive bug that just won't die. Over the last few weeks we have been waging a war with such a bug at work. I has easily become the most involved and confusing bug that I have worked on. That being said I wasn't even the first person working on the issue and others at the company has spent countless more hours on the issue.

So here is a simple rundown of the history of the issue. A few months back, sporadically we were seeing our sites go down with seemingly no explanation. Sites would be working fine and then you would get a 503 response back from the reverse proxy and accessing the application directly would just end up with the browser spinning. A restart of the server and all was well in the world but still, a site going down randomly with what seems like no explanation was not a good thing. When the issue was originally seen it only happened on our high-traffic QA sites and a few other devs machines. As time went on the issue just became more and more common to the point where all sites were going down fairly consistently. So we had a problem.


Once we hit this point where all sites were going down consistently we knew we had to do something drastically. We got a team together, locked them in a room, and hoped for the best. At that point we didn't know many facts but we knew it took a good amount of time to see the issue. Logging levels were upped, Jconsole run constantly, and heap dumps analyzed. There actually wasn't a whole lot of a difference between a healthy site and a broken site.


Intermission time for a brief explanation of architecture of this product:



 

Of course this  is an over simplification of the architecture but it is sufficient to describe the bug we were facing.  To explain simply; there is a Apache reverse proxy that sits in front of our backend services which does load balancing and other good stuff. One of the main services behind that load balancer is the one shown which is a Tomcat app server that holds a Lucee (ColdFusion Markup Language Engine) servlet where the majority of the code resides and we also introduced a new servlet recently that holds pure Java endpoints. This jersey endpoint was new to this release as we look forward to the future and read the writing on the wall that the ColdFusion community is dying. However enough about architecture.

Going into this issue we wondered if Jersey might be at fault but weren't able to confirm this until we ripped out all Jersey endpoints (or simply didn't call them) and the site wouldn't go down. Also, once we started investigating further we found out that almost exactly 3 hours after starting the server was when it would break. This led to a horrible feedback loop of having to wait for 3 hours after every change. I won't bore you with all the of the things that were tried and investigated but suffice to say that we narrowed it down to a threading issue of some sort. We were getting IllegalMonitorStateExceptions, ThreadDeath Exceptions, and using jconsole to interrogate the taskqueue in Tomcat of a bad site always showed the next thread in the queue was null for some reason. This led to writing our own versions of Java concurrent data structures, custom versions of tomcat, logging bugs to tomcat, yet we were never able to find an actual fix. Taking a step back to see what we had learned we decided again to take another look at Lucee. 


We had not been looking too much into Lucee because if you only exercised Lucee code you would not see the issue, only in the code path through Apache and Jersey would the issue be seen. Our chief architect found a location in the code that would kill threads if they were alive after a configured timeout and we decided to investigate there. Sure enough when we investigated the default timeout it was 3 hours! Three hours we had known was significant so this felt like a win and it was. We set the timeout to 3 minutes and sure enough, three minutes after exercising a Jersey endpoint the site went down. This really was the first time we had any control over this issue so it felt great. There was still plenty of confusion about why this was happening. The jersey call was definitely ending before that three minute timeout. And why was Lucee having it's hand in these requests anyway. We decided to log the stacktrace of the thread right before it was killed and it showed that it was in the "Parked" state. This confused us even further as a thread in a parked state definitely shouldn't be killed for a timeout since it ready and willing to do more work. This however did explain why we were getting in our bad state as this perfectly good thread was being killed by Lucee and more or less corrupting the taskqueue. IllegalMonitorStateExceptions and ThreadDeath exceptions also were explained by this. We added some temporary code that would check the stacktrace of the thread before killing it and if it was in the parked state not kill it. Sure enough this solved the problem. Fragile for sure but it did the job and told us we were on the right track.


Next problem to solve was that we needed to figure out why these threads were being tracked by Lucee and why they weren't being cleared out from the tracked queue when the request was over. This part of the process actually didn't take to long. Our chief architect is one of the contributors to the Lucee codebase and therefore had a fairly solid understanding of the places to look. He pointed us in a few direction and we attacked. As a way of understanding what was happening and trying to determine if the requests were actually making it into the Lucee servlet I decided to log out the stacktrace of each request that came in. Of course the majority of the request that came in were legitimate but we did see a few come in for a url Lucee:80 and when one of these came in it was a "kiss of death" for the site. We weren't sure what this URL was for and were sure we definitely weren't calling that URL. From this stacktrace we did get an explanation of why these threads were coming into Lucee even though they were requests to Java endpoints. From the stacktrace we could see that first it hit tomcat, no surprise, then Jersey handled it, after which it goes into Spring dependency injection which, (drumroll) instantiated Lucee. A little context and why that stacktrace is possible. In our move to Java we had some context that was only in ColdFusion and wanted to get it over into Java. In order to enable this, and since Lucee just runs on the JVM, we decided to use Spring to pass it between the two pieces. When Lucee boots up it loads the context object into Spring's application context and when Jersey needs said context it can retrieve it from Spring, it is a solution we are pretty proud of. Back to our previously scheduled rant. Once we saw this it hit us like a ton of bricks how this was happening. It hadn't occurred to anyone that when getting this context object it would create a request to the server. Each request comes into CFMLFactoryImpl and gets added to a concurrent map that is iterated over occasionally to see if the timeout has elapsed. So our request is being added to the map when the request comes into to our Lucee:80 URL but because its not a real request it never "ends" and therefore is sticking around forever according to the system. We noticed that these fake requests were conveniently of type HttpServletRequestDummy. With this knowledge in mind we extended the functionality of the system to not track HttpServletRequestDummy requests. And with this change of ~10 lines we had solved our elusive bug. Committed the code filed a pull request with Lucee proper and we were done.  As these things usually go, the majority of the time was spent in finding the root of the issue and a much smaller time actually fixing the issue. 


So what were my lessons learned from this bug. Honestly some basic stuff. Logs are crucial, we actually had a log of these timeouts but they were getting lost in the mix in a log file that is never read. Even if you think that a certain piece of code couldn't be the issue you could likely be wrong. Integration of various technologies can cause some crazy bugs. 


As you can see that there was nothing special that we learned. Just apply simple principles, helps solve the gnarliest of bugs. 

Saturday, February 27, 2016

Coding for Others

I have recently had the chance to reflect on some of the things that I find difficult and challenging in my job vs. my coding side projects. One of the things that has jumped out to me in this reflection is how you have to change what you are thinking about when you are in your work. One of the major differences being that you are always writing code for others not just for the problem itself. On these side projects, other than simply sharpening your craft and doing the best that you can, you don't need to follow any standard or best practices to get a job done. As long as the program does what it needs to do no one will be the wiser. In these situations I would call this coding for the code. Whereas when you are in a team working on code that will likely live on for a decade or more you are still coding for the code (or for the compiler) but you also must code with the next developer in mind. With the expert code sorcerer and the newbie both in mind.

Somewhere where I believe these two coding styles can be seen so clearly is in functional programming. I don't have a lot of experience with functional programming. Some Scala and using functional programming principles in other languages (Javascript, Java, etc) however, as an only somewhat experienced functional programmer I can see this when I look at a block of functional code. There are sometimes you look at some one-liner that just perfectly massages the data in such an obvious way but also gets the job done. These are the places where you get that rush of excitement from just reading some code. Then there are other times where you hit a one-liner that might as well be written in greek (that is, unless you read greek). And then when you finally figure out what it is doing it is not obvious and you realize the developer before you probably just wrote it that way as a way to get it to one line and not writing for the next developer.

Taking this "coding for others" concept further sometimes that is the primary reason we are writing the code that we are. I'm thinking of libraries and APIs. Sure these both do have behavior but they don't do anything without someone using them. And if you don't make it easy or enjoyable to work with your API no one will use it.  I have heard a few talks on API design and it is hard to have a sure fire process because I do believe it is very problem/domain specific. However I think it all starts with having those "others" in your mind as you code.

I will end with one short example. Last week I had the opportunity to participate in a hackathon at my work. The feature my group was working on was a two-factor authentication mechanism to strap onto our current login system. After only a little research we stumbled upon a service called Authy. Given that we only had a limited number of hours to work on this the team decided this was the way to go. And, as are famous last words often said, I shared with the team that "This is going to be easy." And looking back it was but in the next hours that is not what I was thinking. Now, granted, I think Authy had a thing or two stacked against them. The main one being that they were recently bought by Twillio and it appears they had migrated over some of their APIs or at least that some of the APIs had changed. This created a confusing landscape to test out this service. Thinking that their client helper libraries would make it dead simple to integrate we started there but about 400 obscure errors later and we gave up on that for a while. Finally one of my teammates found their Github page which gave us some insight on how to properly use the client library. Still couldn't use it entirely but at least got farther without errors. Then we found some limited documentation on how to integrate with just REST calls and we tried that out and learned a little bit more how to manipulate their API which allowed us to finally use it in production mode but we still could never get a proper response in dev mode. Only after finding a random person's blog that detailed how to use their dev API that we finally got it all working. Granted, next to the Dev API key it does say, "Contact us to learn how to use this" but I shouldn't have to. It should be simple enough that I can figure it out. Your errors should be descriptive enough that I know what is wrong. They really should have had that final piece of information on their documentation not on this random persons blog. But alas we got it done and all was well. I don't mean to rip on Authy as I believe we all have made products like this and Authy definitely has made more money off their products than me so we can see who the real winner is.

All this being said I guess the punch line is how I started. When writing code remember who you are writing for and the world will be a better place.