Mail and MySQL Urgent Server Maintenance

Posted In: Maintenance — Aug 22nd, 2016 at 11:51 am EDT by HE: Kevin N.
Shared services are affected

Incident Description:

Our system administrators have detected an issue with the RAID array on one of the main servers that hosts client email and MySQL databases. RAID refers to “redundant array of independent disks”, a technology that allows us to achieve high levels of storage reliability from our server drives. It does this by arranging the devices into an array. Simplified, this means they act like one large hard drive, but if one drive dies, there is enough data stored on the rest to recreate the lost data once the broken hard drive is replaced with a new one.

If a RAID fails, or becomes corrupted, it must be rebuilt. This means the architecture that allows for RAID redundancy must be repaired or completely rebuilt.
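The idea of recreating lost data from the remaining drives can be illustrated with a minimal sketch. This notice does not say which RAID level the affected array uses, so the example assumes a simple single-parity scheme (RAID 5-style XOR parity). The function names and sample data are hypothetical and chosen only for illustration. It is not a description of the actual storage system.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Recreate the failed drive's block from the surviving data and parity blocks."""
    return xor_blocks(surviving_blocks)


# Three hypothetical data drives, each holding one 4-byte block of a stripe.
data = [b"mail", b"mysq", b"pgsq"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# Simulate losing the second drive: only the other blocks and parity remain.
survivors = [data[0], data[2], parity]
recovered = rebuild_missing(survivors)

print(recovered == data[1])
```

With single parity, only one drive per stripe can be lost at a time. A rebuild has to read every surviving block to recompute the missing one, which is why services can appear slow while it runs.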

No downtime is expected, but mail services may appear slow, which can result in timeouts and give the appearance that the service is unavailable. We would like to assure you that no data loss is expected during this process.

UPDATE:  Please see most recent updates below.  RAID rebuild has been stopped and manual evacuations are in progress.

 

Update 2016/08/22 1:30 PM EDT:

The following mail servers are affected: mail1201, mail1202, mail1203, mail1204, mail1205, mail1206, mail1207, mail1208, mail1209, mail1210, mail1211, mail1212, mail1213, mail1214, mail1215, mail1216, mail1217, mail1218, mail1301, mail1302, mail1303, mail1304, mail1305, mail1306, mail1307, mail1308, mail1309, mail1310, mail1311, mail1312, mail1313, mail1314, mail1315, mail1316, mail1317, mail1318, mail1319, mail1320, mail1401, mail1402, mail1403, mail1404, mail1405, mail1406, mail1407, mail1408, mail1409, mail1410, mail1411, mail1412, mail1413, mail1414, mail1415, mail1416, mail1417, mail1418, mail1419, mail1420, mail1421, mail1422, mail1423, mail1424, mail1425

The following mysql servers are affected: mail310, mysql1201, mysql1202, mysql1203, mysql1204, mysql1205, mysql1206, mysql1207, mysql1208, mysql1209, mysql1210, mysql1211, mysql1212, mysql1213, mysql1214, mysql1215, mysql1216, mysql1217, mysql1218, mysql1219, mysql1220, mysql1301, mysql1302, mysql1303, mysql1304, mysql1305, mysql1306, mysql1307, mysql1308, mysql1309, mysql1310, mysql1401, mysql1402, mysql1403, mysql1404, mysql1405, mysql1406, mysql1407, mysql1408, mysql1409, mysql1410, mysql1411, mysql1412, mysql1413, mysql1414, mysql1415, mysql1416, mysql1417, mysql1418, mysql1419, mysql1420, mysql1421, mysql1422, mysql1423, mysql1424, mysql1425, mysql1426, mysql1427, mysql1428, mysql1429, mysql1430

The following pgsql servers are affected: pgsql1201, pgsql1202, pgsql1301, pgsql1302, pgsql1401

Which Customers are Impacted?

All customers on mentioned mail and database servers.

How are Customers Impacted?

Mail and database services may appear slow, which can result in timeouts.

How often will we be updated?

As information is available

Time to Resolution (ETA)

We expect the RAID rebuild to last 6-8 hours.

Incident Updates

  • 22/08/2016 15:00 EDT - At this time, the bad drives have been removed from the RAID and the RAID is rebuilding once more.  Services may still be slow while this process is ongoing.  We will provide more information as soon as it is available.
  • 22/08/2016 19:25 EDT - The RAID rebuild continues to progress and no impact should be felt at this time.
  • 22/08/2016 21:47 EDT - RAID rebuild is proceeding on schedule.
  • 23/08/2016 06:41 EDT - RAID rebuild has reached an impasse.  Proceeding may put data at risk.  Engineers are working with our vendor's senior engineering team to resolve the issue while we manually move data to alternate storage in parallel.  Customer systems remain online at this time.  If the production load proves too much for the system, we may have to temporarily close access, but we are working to avoid that step.
  • 24/08/2016 14:17 EDT - Mail310 is online
  • 24/08/2016 17:55 EDT - At this point, the migration to alternate storage is continuing.  We are going at a very slow rate so that we don't impact customers.  Any customer experiencing a problem should contact support.
  • 24/08/2016 17:52 EDT - MySQL1209 is online
  • 24/08/2016 21:29 EDT - MySQL1212 and MySQL1216 are online
  • 24/08/2016 21:30 EDT - MySQL1220 is online
  • 24/08/2016 22:00 EDT - MySQL1306 is online
  • 24/08/2016 23:12 EDT - MySQL1403 is online
  • 24/08/2016 23:41 EDT - MySQL1203 is online
  • 24/08/2016 23:43 EDT - MySQL1410 is online
  • 24/08/2016 23:48 EDT - MySQL1417 is online
  • 25/08/2016 00:33 EDT - MySQL1426 and PgSQL1401 are online
  • 25/08/2016 01:41 EDT - MySQL1204 is online
  • 25/08/2016 02:34 EDT - MySQL1210 is online
  • 25/08/2016 02:37 EDT - Mail1314 is online
  • 25/08/2016 02:44 EDT - Mail1316 is online
  • 25/08/2016 02:53 EDT - MySQL1420 is online
  • 25/08/2016 03:00 EDT - Mail1201 is online 
  • 25/08/2016 03:14 EDT - Mail1202 is online
  • 25/08/2016 03:14 EDT - Mail1203 is online
  • 25/08/2016 03:33 EDT - Mail1312 is online
  • 25/08/2016 03:46 EDT - Mail1304 is online
  • 25/08/2016 03:56 EDT - Mail1408 is online
  • 25/08/2016 04:07 EDT - Mail1410 is online
  • 25/08/2016 04:20 EDT - Mail1205 is online
  • 25/08/2016 04:31 EDT - Mail1303 is online
  • 25/08/2016 04:42 EDT - Mail1419 is online
  • 25/08/2016 04:48 EDT - Mail1320 is online
  • 25/08/2016 04:59 EDT - Mail1405 is online
  • 25/08/2016 05:09 EDT - Mail1413 is online
  • 25/08/2016 05:12 EDT - Mail1218 is online
  • 25/08/2016 05:18 EDT - Mail1306 is online
  • 25/08/2016 05:26 EDT - Mail1418 is online

Resolution Description

N/A

Web309 is Down – Resolved

Posted In: Maintenance — Aug 24th, 2016 at 9:30 pm EDT by HE: Greg Cook
Shared services are affected

Incident Description:

We are rebooting Web309 to add space to the hsphere partition.

Which Customers are Impacted?

All customers on Web309.

How are Customers Impacted?

These websites will be unavailable.

How often will we be updated?

As required

Time to Resolution (ETA)

Unknown

Incident Updates

n/a

Resolution Description

The server is back up.

Webmail2 – Down – Resolved

Posted In: Outage — Aug 19th, 2016 at 11:51 pm EDT by HE: Kris G.
Shared services are affected

Incident Description:

We are currently experiencing issues with Webmail2. Our administrators are investigating and will provide further information once it is available.

Which Customers are Impacted?

All customers using webmail2

How are Customers Impacted?

Webmail services will be unavailable

How often will we be updated?

1 hour

Time to Resolution (ETA)

N/A

Incident Updates

N/A

Resolution Description

Services for Webmail2 became stuck. They were restarted and have resumed.

Web320 – Maintenance – Complete

Posted In: Maintenance — Aug 19th, 2016 at 2:53 am EDT by HE: Kris G.
Shared services are affected

Incident Description:

Our administrators are planning to change Web320 to the new virtual environment starting at 3AM ET. The server will be unavailable during this time. No data loss is to be expected.

Which Customers are Impacted?

All customers on Web320

How are Customers Impacted?

Websites and services will be unavailable

How often will we be updated?

20 minutes

Time to Resolution (ETA)

20 minutes

Incident Updates

N/A

Resolution Description

Maintenance was completed and Web320 has been switched to the new environment and services resumed.

Webmail2 – Resolved

Posted In: Uncategorized — Aug 16th, 2016 at 3:27 am EDT by IX: Jason H.

Incident Description:

We’re currently experiencing difficulties with Webmail2. We will update this page once we have more information to share regarding this matter.

Which Customers are Impacted?

N/A

How are Customers Impacted?

N/A

How often will we be updated?

N/A

Time to Resolution (ETA)

1 hour

Incident Updates

N/A

Resolution Description

Frozen service was restarted.

Web915 – Is Down – RESOLVED

Posted In: Outage — Aug 15th, 2016 at 5:03 pm EDT by HE: Greg Cook
Shared services are affected

Incident Description:

Our system administrators have detected an issue with this server. During this time the web server will be unavailable, meaning that websites hosted on the server will not be working.

Which Customers are Impacted?

All clients using this web server.

How are Customers Impacted?

All websites/FTP/control panel will not be reachable.

How often will we be updated?

As required.

Time to Resolution (ETA)

15 minutes

Incident Updates

n/a

Resolution Description

Web915 back to normal.

Web320 – Urgent Maintenance Resolved

Posted In: Outage — Aug 14th, 2016 at 7:02 pm EDT by HE: Greg Cook
Shared services are affected

Incident Description:

The server was found stuck and was forced to reboot. System Administrators have determined that the cause was bad blocks on one of the server’s hard drives. The bad drive has been replaced, but in order to preserve data and increase performance, stability, and reliability on the server, admins decided to recreate the server as a virtual machine on new hardware. The old server and all hosted sites are up and running while the backup information is being copied to the new machine. Once the copy is completed, we will have to bring the server down to perform a final sync of recently changed data, and then the server will be back online in its new environment.

Which Customers are Impacted?

All clients using WEB320.

How are Customers Impacted?

Server is online and fully operational

How often will we be updated?

As required.

Time to Resolution (ETA)

24 hours

Incident Updates

  • Aug/14/2016 10pm - System Administrators started updating data in the backup vault to have an immediate copy of the server. No services are affected; the web server is up and running while data is being copied in the background.
  • Aug/15/2016 5am - Server went read-only and restarted. Mandatory filesystem check is in progress, ETA 1h.
  • Aug/15/2016 7am - Filesystem check has completed and the server is back online. Background data copy to the new environment has resumed.

Resolution Description

The filesystem check has completed, and the server is back online. Background data copy to the new environment has resumed.

WEB320 – Is Down – RESOLVED

Posted In: Uncategorized — Aug 14th, 2016 at 8:34 am EDT by HE: Admin
Shared services are affected

Incident Description:

Our system administrators have detected an issue with this server. The server has been restarted and is now doing a mandatory file system check (FSCK). During this time the web server will be unavailable, meaning that websites hosted on the server will not be working.

Which Customers are Impacted?

All clients using this web server.

How are Customers Impacted?

All websites/FTP/control panel will not be reachable.

How often will we be updated?

  • As needed

Time to Resolution (ETA)

ETA - 2.5 hours.

Incident Updates

  • 2016/08/14 09:37 am EST - Data copy is 74% complete.
  • 2016/08/14 09:48 am EST - Data copy is 87% complete.
  • 2016/08/14 11:11 am EST - Data copy is complete.

Resolution Description

WEB320 server is up. All services restored. Resolved.

Web521 – Down – Resolved

Posted In: Outage — Jul 29th, 2016 at 8:07 pm EDT by IX: Daren H.
Shared services are affected

Incident Description:

Web521 encountered an issue with the file system and automatically went into read-only mode to protect the data. System Administrators attempted to correct the issue, but during FSCK they detected an error with the Operating System files. FSCK stands for “file system check” and is a tool for checking the consistency of a file system.

Without these files, the server cannot be rebooted and data on the server is unrecoverable. We are recreating this server using data from the most recent backup performed July 26th.

Which Customers are Impacted?

Customers with websites on Web521. You can check to see if your sites are located on Web521 by logging in to your control panel. Click the manage button next to your hosting account, click the FTP icon and the server name will tell you which web server your account is on.

How are Customers Impacted?

These websites will be unavailable until your account is recreated on the new server. Once recreated, the data restored will be from the backup taken on July 26th.

How often will we be updated?

Every 30 minutes/as available

Time to Resolution (ETA)

5 hours

Incident Updates

  • 2016/07/29 22:55 ET - FSCK is still running on the server; however, we are also working to virtualize the server with the most recent backup we have available. We will have more information in 2 hours.
  • 2016/07/30 02:20 AM ET - At this time we are extending the ETA by 2 hours.
  • 2016/07/30 4:06 AM ET - FSCK has been restarted on the server. No ETA at this time.
  • 2016/07/30 08:30am EST - Data copy is 84% complete.
  • 2016/07/30 09:00am EST - Data copy is 90% complete.
  • 2016/07/30 09:30am EST - Data copy is 98% complete.
  • 2016/07/30 10:00am EST - Data copy is complete.  Server configuration is in process.

Resolution Description

Server recreation is complete and the server is operational again.

Web504 – Down – Resolved

Posted In: Uncategorized — Jul 29th, 2016 at 1:20 pm EDT by HE: Brian S.

Incident Description:

Web504 is currently offline.  Our team discovered that it is related to a problem with the Network Interface Card and our System Administration team is investigating.

Which Customers are Impacted?

All customers with web content on web504

How are Customers Impacted?

Websites are offline

How often will we be updated?

As soon as we have more info

Time to Resolution (ETA)

30 minutes

Incident Updates

  • 2016/07/29 13:50 EDT - We have swapped the server chassis and are now working on modifying the Network Interface Cards

Resolution Description

Web504 has been corrected and is back online.

 
© 2011 Host Excellence.