lapure1 problems.


Thursday, October 2, 2014

We are rebooting this server to make it recognize a disk replacement. It should come back in 10 minutes.

Update 17:31: The RAID volume is not being recognized. We are working on it.

Update 17:43: The RAID volume is now recognized. However, the OS is not booting. We are working on it.



This is a bigger issue now. Here are the details:

Description of the issue:
 
A drive backplane issue happened on this server.
 
Impact:
 
Potential data loss.
 
Details:
 
On Tuesday, we were alerted to a drive failure on the server, so we went ahead and replaced the drive. However, the server didn't recognize the new drive and still showed the port as failed. We then tried two other drives, and nothing changed. Today we tried yet another drive that arrived at the DC a few hours ago. Nothing changed.
 
We had been seeing disk read/write errors in the logs, so we decided to do a cold reboot of the server.
 
After the reboot, the RAID controller marked 4 drives as failed. We shut the server down and reseated the drives.
 
We powered it back on, and after it recognized all the drives, the RAID controller gave us the option to reactivate the RAID volume.
 
We went ahead and reactivated the RAID volume. However, the host OS didn't boot.
 
We booted the server with a live OS and checked the volume status. The RAID utility shows the volume as healthy; however, the OS did not recognize a partition table.
 
We are now going to try recreating the partition table and the partitions to see if we can view the data on them.
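For reference, this kind of recovery usually starts with read-only checks before anything is rewritten. A rough sketch of the commands involved is below; the device name /dev/sda is an assumption, and the exact tools depend on the controller and partition layout:

fdisk -l /dev/sda       # confirm whether any partition table is currently visible
parted /dev/sda print   # second opinion on the partition table
testdisk /dev/sda       # scan the disk for lost partitions (interactive; nothing is written until confirmed)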
 
Restore Process:
 
We have a backup of the server from ~6 hours ago, and we have a spare node in the DC. We are currently restoring the data to this spare node to get your VPS online with the data from ~6 hours ago. We expect this process to take around 2-3 hours.
 
We are terribly sorry for the inconvenience this issue caused you.
 
Please open a ticket if you have any questions.

--

We have tried to rescue the data off the volume; however, even the filesystems on it are not being recognized.
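For those curious, rescue attempts on a volume in this state typically involve probing it for filesystem signatures and imaging it before any further recovery work. A simplified sketch, where the device and target paths are illustrative assumptions:

file -s /dev/sda                                             # report any recognizable filesystem signature on the raw device
blkid /dev/sda                                               # same check against blkid's list of known signatures
ddrescue /dev/sda /mnt/rescue/sda.img /mnt/rescue/sda.map    # image the volume so recovery attempts run on a copy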

Unfortunately, the data on the server is lost. As a reminder: we do have backups from ~6 hours ago.

The restore process to the spare node is taking much longer than expected, and it seems it will need around 12-16 more hours. This is due to the backup containing lots of small files.

To avoid that much waiting time, we have recreated the RAID array on the old server and made sure it's working as expected, to prepare it for a bare-metal restore from the backup server. This eliminates recreating each file one by one and instead does the restore at the block level.
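Roughly speaking, the difference is copying a very large number of individual files versus streaming the raw device image in one sequential pass. A simplified illustration of the two approaches; the host names, paths, and device below are illustrative, not our exact procedure:

# File-level restore: every file is created, written, and has its metadata set individually
rsync -aH --numeric-ids backup-server:/backups/lapure1/ /mnt/restore/

# Block-level (bare-metal) restore: the raw image is streamed straight onto the new array
ssh backup-server 'dd if=/backups/lapure1.img bs=64M' | dd of=/dev/sda bs=64M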

The bare-metal restore process is now running, and luckily it's much faster.

As compensation for this awful and unlucky event, we are going to apply 2 months' worth of free credit to your account.

We'll update you when we get more progress on the matter.

--

The restore process is still going. Based on our calculations, unfortunately, it will take around 16-18 more hours.

We realize that waiting for another 16-18 hours for your VPS to come back online may be unbearable.

If you have your own backups and would like a fresh VPS, please open a ticket and we'll provision one for you immediately.

We are extremely frustrated with how things have turned out and would love to make it up to you. Therefore, we are extending the account credit compensation to 3 months: we are going to apply 3 months' worth of free credit for your VPS to your account.

We are also going to work on a better disaster recovery strategy that would get things fixed much faster if a terrible event like this ever happens again.

We sincerely apologize.

--

The restore process is still running. However, it has only restored around half of the data so far.

If you could live with a fresh VPS and only need a few folders and databases from the backup, please open a ticket and we'll recreate your VPS on a different node with the same IP and restore your folders and databases manually to get you online quicker.

Please use this template when opening a ticket:

Subject: Recreating VPS & Partial Restore

Contents: I'd like my VPS to be recreated on a different node with the same IP address, and I would like the following folders restored:
/var/lib/mysql
/var/www
...

We'll put your files in the /root/backup/ directory for you to move back to their original locations.

To restore your MySQL databases, please follow these steps (a consolidated command sketch follows the list):

0. Install the mysql-server package if you have not already.
1. Stop the MySQL server.
2. Delete your current /var/lib/mysql folder.
3. Copy the mysql files from your backup into a new /var/lib/mysql folder. (The folder will be named mysql in your /root/backup folder. Simply issue the commands: mkdir -p /var/lib/mysql and cp -R /root/backup/mysql/* /var/lib/mysql)
4. Change the ownership of the database files. (chown -R mysql:mysql /var/lib/mysql)
5. Then start the MySQL server.
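For convenience, here are the same steps as shell commands. This assumes a Debian/Ubuntu-style system; package and service names may differ on your distribution:

apt-get install -y mysql-server          # step 0: install MySQL if it isn't already
service mysql stop                       # step 1: stop the MySQL server
rm -rf /var/lib/mysql                    # step 2: delete the current data directory
mkdir -p /var/lib/mysql                  # step 3: recreate it and copy in the backed-up files
cp -R /root/backup/mysql/* /var/lib/mysql
chown -R mysql:mysql /var/lib/mysql      # step 4: fix ownership of the database files
service mysql start                      # step 5: start the MySQL server again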

--

THE RESTORATION PROCESS HAS BEEN COMPLETED. ALL VPSs ARE ONLINE AND FULLY FUNCTIONAL WITH DATA FULLY RESTORED.

We will work on cleaning everything up and confirming everything is as it should be before sending out an announcement email and applying credit.

--

We are glad to let you know that the restore process ended around 8 hours ago, and we have since run filesystem consistency checks.

Your VPS should have been online for the last 7-8 hours. If you are having problems with your VPS, please open a ticket and we will either try to fix it or restore it from the backup manually.
