You know, most of the time, when there's a server outage and the site goes down, it's due to some mundane thing like planned software updates or the typical datacenter fire. But this week, we had some rather annoying issues that really give fuel to the old System Admin adage: "Well, I've never seen that before."
Monday started off with a classic server hang. Everything just went offline like the server had been unplugged. And, in a way, it was. And of course, it was my office day. Though I couldn't look into it until the evening, I was prepared for a scenario where something had gone Quite Wrong™. In reality, it hadn't. The server was simply not responding and needed a power cycle. It booted right back up without issue. Hours of downtime that were solved in minutes. C'est la vie. Things were fine. But not for long.
Barely made it 24 hours before the unthinkable happened. Tuesday evening, the site went down again, but this time, not completely down. I could still log in on the console even though the site was offline. Oh, could I ever log in. But it was not pretty.
The database, our precious storage location for every forum game, roleplay, meme, and inappropriate private conversation, was offline and could not start. Why? File corruption. Some pages were missing from our Big Book of RpNation ~~Nonsense~~ Carefully Worded and Thoughtful Roleplays. Our server's filesystem is so good that it can detect when data becomes corrupted, and it is more stubborn than your most hated in-law when it comes to refusing to read that bad data. Bad data is worse than no data. We store our data in two places at the same time: a pair of SSDs in the server, each with a full copy of the data. But today, the server regretfully informed me (the server is a tall, imposing, British butler wearing a tuxedo in today's story) that data on both drives had been mangled at the same time. Now, we do also sync the data to another location as an off-site backup. So the ship certainly isn't sunk, but rolling back to a backup means data loss, and that's not ideal. (Can you imagine having to re-create the last several harrowing minutes of the thread where we see how high we can count without a mod??)
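(For the technically curious, here's a toy Python sketch of that stubbornness. It's purely an illustration of the idea, not our actual filesystem: each block gets a checksum when it's written, a read is served from whichever mirror copy still matches that checksum, and if neither copy matches, the filesystem refuses to hand back garbage.)

```python
import hashlib

def read_block(stored_checksum: str, copy_a: bytes, copy_b: bytes) -> bytes:
    """Toy model of a checksumming mirror: return whichever copy still
    matches the checksum recorded at write time; never return bad data."""
    for copy in (copy_a, copy_b):
        if hashlib.sha256(copy).hexdigest() == stored_checksum:
            return copy  # at least one drive still holds a good copy
    # Both copies are mangled: fail loudly rather than serve garbage.
    raise IOError("checksum mismatch on both mirror copies")
```

Normally a single bad copy just gets repaired from the good one automatically. Tuesday's problem was that neither copy checked out.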
It's worth noting that Reginald McServerface over here only really cares about the data that's detected as bad. So I go looking through the database's logs. It turns out that the file with the bad data isn't actually part of the data storage, but it is nonetheless an essential part of the database's functionality (hence why it won't start). So I check the server logs while experimenting with the database. It seems like the corrupted part of the file is fairly small. So I have the server copy out everything it can read from the file while ignoring the bad parts. Only a few bytes were missing. Not that bad, really. Now the question is: can the database take this "reconstructed" file and fix it using its own recovery mechanisms? Only one way to find out. We replace the bad file with the reconstructed file and start the database.
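(Again for the technically curious, that salvage step was conceptually something like the Python sketch below. This is a sketch under assumptions, not the exact tooling used: read the file block by block, and wherever the filesystem refuses to return a block, fill the gap with zeros and keep going. The 4 KiB block size and the zero-filling are my illustration choices.)

```python
BLOCK = 4096  # assumed block size for this sketch

def salvage(src_path: str, dst_path: str) -> int:
    """Copy every readable block of src into dst, zero-filling any block
    the filesystem refuses to return. Returns the count of bad blocks."""
    bad = 0
    offset = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            src.seek(offset)
            try:
                chunk = src.read(BLOCK)
            except OSError:
                # Checksum failure on this block: zero-fill it and move on,
                # so later blocks keep their original offsets.
                dst.write(b"\x00" * BLOCK)
                bad += 1
                offset += BLOCK
                continue
            if not chunk:  # reached the end of the file
                break
            dst.write(chunk)
            offset += len(chunk)
    return bad
```

The gamble, of course, is whether the database's own recovery machinery can cope with the handful of bytes that got blanked out.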
Everything came right up.
So we've avoided data loss! At least, for the most part. It's possible a post or profile comment or something got lost right around when the server went down. No way to know for sure. But what happened? Why did our redundant data storage scheme fail?
Well, to that I have not much more than guesses. I contacted our hosting company about the issue, and they offered to take a look at the server hardware. Since the site was down anyway, that posed no problem. I shut the server down and they replaced both the power supply and the connectors for the SSD drives. Best guess? Something faulty in the communication path (RAM, PCIe bus errors, something else), triggered by some random event (cosmic rays, an electrical power surge, bad karma, a glitch in the Matrix), caused bad data to be written to both SSDs at the same time. In that case, the data can't be automatically recovered. With luck, the problem came from the hardware they replaced and we won't have to deal with this again.
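(One last toy sketch for the curious, just to make that failure mode concrete. This is pure illustration, not a claim about the real write path: if the buffer gets flipped after the checksum is computed but before the data reaches the drives, both mirror copies end up storing the same mangled bytes, so the corruption is detectable later but no longer repairable.)

```python
import hashlib

def write_mirrored(block: bytes) -> tuple[str, bytes, bytes]:
    """Toy write path: checksum the block, then hand the same buffer to both drives."""
    checksum = hashlib.sha256(block).hexdigest()
    # Imagine a bit flip (RAM, PCIe, bad karma) hitting the buffer *after*
    # the checksum is computed but *before* the data reaches the drives:
    flipped = bytearray(block)
    flipped[0] ^= 0x01
    corrupted = bytes(flipped)
    # Both mirror halves dutifully store the same mangled bytes.
    return checksum, corrupted, corrupted

checksum, copy_a, copy_b = write_mirrored(b"a perfectly good database page")
# On the next read, neither copy matches the checksum: the damage is
# detected, but there is no clean copy left to repair from.
assert hashlib.sha256(copy_a).hexdigest() != checksum
assert hashlib.sha256(copy_b).hexdigest() != checksum
```

Swapping out the power supply and the drive connectors is aimed squarely at that "something faulty in the communication" link in the chain.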
But we're back now (as of Wednesday evening), and everything has been caught up with regard to our data backups. If you have any questions or just want more details about the technical side, let me know and I'll see if I can elaborate.
Go RpNation!