Wednesday, February 19, 2014

Release Management: planning meetings

After avoiding the obvious for a while now, we have decided to institute recurring planning meetings that cover only Release Management tasks. Heretofore we allowed these conversations to occur naturally, when needed, and with whoever needed to be present. This usually involved someone in the development arena talking to someone who would push code changes into production. A few more were involved. And, to their detriment, others were not always included.

The good in this method of communicating a 'need to change' is that it is organic, unplanned, skips meetings (everyone hates meetings), and has the randomness that is sometimes requisite to staying agile and allowing change to occur. A process was still followed, but it was executed at random times. This is a plus for those who want to stay light on their feet.

The bad is that this method forces change into a system that should, and could, get a bit more planning prior to execution. Regardless of how agile your development teams want to be, changing production is fraught with potential disaster. Small change, or even well-choreographed large change, can impact a system minimally and leave no lasting scarring. Rushed, inconsiderate change, however, can wreak havoc, and that is exactly the kind of change an unplanned process invites.

So a process was instituted, at least at the inception of the change request, so that several necessary tasks could flow from knowledge of an impending change. If we stuck to the task list and performed the tasks satisfactorily, we were usually successful. Not always. But usually. Tweaks occur as time passes, and the process tightens.

Fast forward to today. We now realize that this method of sharing knowledge with only a select few was detrimental to others in the organization. Some poor business analyst on the north end of the building had no idea that one of his favorite tables had just suffered a drastic change, that the fields he commonly referred to in his reporting had just been altered beyond recognition, and that he had received no forewarning of said changes. No one thought to let him know, and his reports now suffer. Sad. But no one knew. Well, someone knew, eventually, but to the chagrin of those displaying the information to others, probably some executive in a plush conference room. Oops.

So we have now instituted a recurring meeting (uugghh, peeling the skin from my face would be a better use of my time...). To this meeting we invite many. Hopefully all. But for now, many folks. These folks were chosen for their potential interest in changes to our production system. They now have the option to attend a brief meeting and hear discussion of potential changes to the production system. Here is a forum in which they can ask why, when, and how. Conversations can begin here and continue to the satisfaction of all parties. Plans will be made as to when the change will occur. Bartering as to how this change can be introduced with the least impact will ensue. Parties will be informed, knowledge shared, and life will move forward.

The changes will still involve a select few. The process to perform the change, and even to prepare for it, will remain similar to before: tasks accomplished, questions asked and answered, plans created, testing, and so on. But with this little recurring planning meeting, folks are informed. Change is much less drastic and caustic. Acceptance can begin much earlier in the process, and anything that needs to be tweaked to accommodate the change can be implemented earlier as well. No more waiting for someone else to point out the flaw later in a meeting, hopefully not in front of the C-level folks you are trying to impress.

Start with small tweaks to your Release Management process. See how you can improve it. Add some oil here, change a gear there, and before you know it, the machine that drives change into your production topology will be so smooth you won't even hear it purring along gracefully.


Thursday, February 13, 2014

Be tenacious

A Release Management tale


Last night we performed a Release that affected a production database and a website. It was a fairly simple release, and we had done the same steps previously. So with a little effort, we prepared the Release Plan, which contains the steps to be performed, and executed those steps after hours. Within 15 minutes, all backups, snapshots, and the like were done. Within a few more minutes, all the new code had been successfully pushed out. Testing ensued and the Release was labeled a success.
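
(For the database side, the 'backups and snapshots' step looks roughly like the sketch below. The database name, logical file name, and paths are made up for illustration; the real Release Plan spells out its own.)

    -- Copy-only backup taken just before the release, so the normal backup chain is untouched.
    BACKUP DATABASE AppDb
    TO DISK = N'X:\Backups\AppDb_PreRelease.bak'
    WITH COPY_ONLY, CHECKSUM, STATS = 10;

    -- A database snapshot gives a quick point to compare against (or revert to) after the release.
    -- The logical NAME must match the source database's data file.
    CREATE DATABASE AppDb_PreRelease
    ON ( NAME = AppDb_Data, FILENAME = N'X:\Snapshots\AppDb_PreRelease.ss' )
    AS SNAPSHOT OF AppDb;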

We all finished up our tasks, our compares, our post-release snapshots, our documentation, and so on. Emails were sent, and we logged off. All was well.

Until the morning.

The users, darn them, started using the website and noticed some issues. They complained. Those complaints reached our ears early in the morning, before most of us had made it into the office. So from comfy home office chairs, we logged in and started looking around. Email was the initial form of communication, but waiting for responses became burdensome, and a chat room was opened up in our internal IM product so we could talk more freely.

Initially, there were members of the troubleshooting team who wanted action. Something was broken, and it's only natural to want to fix it as quickly as possible, especially when users are in the system and seeing the issues. It's different at night when no one is online. Less pressure then. But now, in the morning, people were anxious, and that anxiety transferred rather quickly to the rest of us.

I had to say no. We are not just going to roll back. Just be patient.

Once we all gathered and started troubleshooting, we could dig into the why. What was happening. What we thought was happening. Reading logs. Watching processes. Watching memory. And so on. At one point we even concluded that it was not the database, and it was suggested that I could go back to my normal tasks. But I stuck around. I didn't feel confident that we knew what was going on, and even though I could show that the database was not under any duress, I stuck it out. I kept working on it. I helped; we all helped. Others were brought into the mix and their ideas were considered.

Fast forward. We still do not know what is happening, except that the IIS server comes under a lot of memory pressure, the site ceases to function, and once it all blows up, things start over and the site seems to work again. We see this over and over. Users are in there. We are in there. All of us contributing, but there is still no smoking gun.

So I open Profiler and limit the trace to the particular database and web server having the issue. We capture everything happening on the db, which is a lot, and just cross our fingers. After a few more iterations of the memory consumption and release, I notice a repeating query in the Profiler output, just as all hell breaks loose. It's the last statement, seemingly, that was trying to execute. I grab it as is and attempt to run it raw. It gives a divide by zero error.
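
(The shape of it, reconstructed here with made-up object names since the real ones don't matter: the captured statement, pasted into a query window and run raw, died immediately.)

    -- Hypothetical stand-in for the captured statement; only the failure mode matches what we saw.
    SELECT OrderId, UnitPrice
    FROM   dbo.vOrderPricing
    WHERE  OrderId = 42;

    -- Msg 8134, Level 16, State 1
    -- Divide by zero error encountered.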

Divide by Zero!

What is this query doing? Does anyone recognize it? Does it have anything to do with what we pushed last night? Is the data goofed? These and other relevant questions were asked. After digging a bit, sure enough, deep in a view that was altered last night, a field was being used as a divisor, and on occasion it could be zero.

I hear a muffled 'Oops' escape the developer standing behind me. 'How did that get past testing?', he asks no one in particular. We discuss for a bit, come up with a solution, and make an alteration on the fly in production that fixes this little issue. After that, the captured query runs raw without error. And as soon as we make the change, we notice the memory consumption and explosion slow down.
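
(The on-the-fly alteration followed a common pattern; sketched with the same made-up names as above, wrapping the divisor in NULLIF turns a zero into NULL, so the division returns NULL instead of raising an error. Whether NULL is the right answer for whatever consumes the view is a separate conversation.)

    -- Hypothetical version of the altered view; the real view and columns were different.
    ALTER VIEW dbo.vOrderPricing
    AS
    SELECT  o.OrderId,
            o.TotalPrice / NULLIF(o.Quantity, 0) AS UnitPrice   -- NULL instead of an error when Quantity = 0
    FROM    dbo.Orders AS o;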

It didn't cease, but it did slow.

This gave us more time. More time to look deeper. We continued to watch the Profiler results. We continued to perform tests, and we continued to see the web server work for a bit, then struggle, then use all its memory, then flush everything and continue on as if it had a goldfish-sized memory. 'All's well now, let's go,' seemingly forgetting that mere seconds ago it had used and flushed all its memory.

Another query became the last query executed just prior to the spike in memory usage. When I captured and executed this one manually, it too gave us an error. Something about a field it couldn't find in the db. Some field that looked like a valid field, yet it didn't exist. When I pointed it out to the developer, he incredulously stammered something like 'where did that come from?'. Turns out the staging environment had an extra field. This field was built into the middleware code that had been generated, and that code was now trying to do its thing against production, where no such field existed.
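
(One quick way to confirm that kind of environment drift, assuming you know which table the generated code is hitting, is to pull the column list in each environment and compare; the table name here is hypothetical.)

    -- Run against staging and against production, then diff the results.
    SELECT  c.name AS column_name,
            t.name AS data_type
    FROM    sys.columns AS c
            JOIN sys.types AS t ON t.user_type_id = c.user_type_id
    WHERE   c.object_id = OBJECT_ID(N'dbo.Orders')
    ORDER BY c.column_id;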

And the web server simply crashed.

Instead of throwing a helpful error, or logging that it got a bad result or no result or some error, it simply kept attempting the query, letting its memory consumption expand to biblical proportions, and then came crashing down. Only to try again in a few minutes, as if it had no memory of the preceding few minutes.

So now we fix it.

Now we know what is causing it. And the quickest route to fixing it is to roll back. Roll back all the changes and the site should work like it did yesterday. Not like an Alzheimer's patient continually performing the same unsuccessful task. Roll back the code.
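
(If the database side ever did need to go back wholesale, the pre-release snapshot from the night before is one route; a sketch with the same made-up names as above, noting that reverting discards everything written since the snapshot and requires exclusive use of the database.)

    -- Revert the database to the pre-release snapshot.
    -- Any other snapshots of AppDb must be dropped first.
    USE master;
    RESTORE DATABASE AppDb
    FROM DATABASE_SNAPSHOT = 'AppDb_PreRelease';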

The point here is that, more than half this story ago, you will recall that was the suggestion. Roll it back. But that suggestion was made in the heat of the moment. Something was broken. We changed something. Roll it back. If we had done that, the two offending pieces of code, hidden well within, would never have been found or fixed. Dev would have refactored the release, we would have performed it again on another day, probably tested a lot better, and found the same results: something not working right, and no idea what.

So it took us a few hours. So it was frustrating. So the users were unable to use the site for a bit. With sufficient communication, we let the users know. They were placated. With some time, we dug and dug and discussed and tried, and ultimately found the culprits. Silly little things, bugs are. Scampering here and there, just out of the corner of your eye. But they wreak havoc until they are eradicated.

I am happy that the end result was more knowledge, time spent in the forge of troubleshooting, and an actual cause for the problem, instead of a quick-acting rollback that would have reverted us to a known state but ultimately hidden the problem.

It's the unknown that kills you. Or, if it doesn't kill you, it at least puts a damper on your day.

Be patient. Be thorough. Be smart. Be tenacious.