Ordering Wizard maintenance – Complete

Posted In: Maintenance — May 27th, 2016 at 3:43 pm EDT by IX: Daren H.

Incident Description:

On Monday, May 30th at 4:00AM EDT we will be performing brief maintenance on our PCI Environment. The maintenance is expected to last a total of 5 minutes and to have minimal customer impact.

Which Customers are Impacted?

We expect zero customer impact.

How are Customers Impacted?

New orders for any product may be briefly affected during the maintenance window.

How often will we be updated?

When Completed

Time to Resolution (ETA)

5 Minutes

Incident Updates

  • 2016/05/30 04:56 AM EDT - Maintenance has started. Apologies for the delay.

Resolution Description

Maintenance has been completed. Thank you for your cooperation.

Mail Server Queues – Resolved

Posted In: Other issues — May 26th, 2016 at 3:45 pm EDT by HE: Greg Cook
Cloud services are affected

Incident Description:

Some of our mail servers are experiencing higher-than-normal queues for outgoing email. We are investigating the issue. The servers currently affected are:

smh01.opentransfer.com
smh02.opentransfer.com
smh03.opentransfer.com

These servers handle mail sent from websites, not from mail clients or webmail.
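Conceptually, the queue buildup described above is a property of how mail transfer agents do store-and-forward delivery: messages that cannot be delivered immediately are held and retried later. The sketch below is purely illustrative (the class and method names are hypothetical, not the actual software running on the smh servers):

```python
# Illustrative sketch of a store-and-forward mail queue (hypothetical
# names; not the actual MTA on the smh servers): messages that cannot
# be delivered immediately stay queued and are retried on later flushes.
import collections

class MailQueue:
    def __init__(self):
        self.pending = collections.deque()

    def submit(self, message):
        self.pending.append(message)

    def flush(self, deliver):
        """Attempt delivery; messages that fail stay queued for retry."""
        still_pending = collections.deque()
        while self.pending:
            msg = self.pending.popleft()
            if not deliver(msg):
                still_pending.append(msg)  # deferred: retry on next flush
        self.pending = still_pending

q = MailQueue()
q.submit("order confirmation")
q.submit("password reset")

q.flush(lambda msg: False)   # remote side unavailable: queue keeps growing
assert len(q.pending) == 2

q.flush(lambda msg: True)    # service restored: queue drains
assert len(q.pending) == 0
```

When an incident like this one is resolved, "cleaning the queues" amounts to a successful flush: deferred messages drain out once delivery succeeds again.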

Which Customers are Impacted?

Customers whose websites send mail through the affected servers.

How are Customers Impacted?

Email sent from websites may be delayed in reaching its recipients.

How often will we be updated?

Every hour

Time to Resolution (ETA)

Unknown at this time

Incident Updates

  • 2016/05/26 3:52 PM EDT - We are cleaning the queues and investigating the source of the problem.
  • 2016/05/26 4:45 PM EDT - smh03 has been cleaned. We are still working on smh01 and smh02.
  • 2016/05/26 5:45 PM EDT - smh02 is almost cleaned.

Resolution Description

All mail servers have returned to normal operation.

Network Maintenance: 05/17/2016 – 05/19/2016 – Complete

Posted In: Maintenance — May 13th, 2016 at 10:54 am EDT by HE: Victoria Witten
Shared services are affected

Incident Description:

Our System Administrators will be performing network maintenance to connect our shared servers to new switches.  This maintenance will be performed on the following dates and times:

Tuesday May 17th: 11PM to 3AM EDT

Wednesday May 18th: 11PM to 3AM EDT

Thursday May 19th: 11PM to 3AM EDT (If needed)

This increase in bandwidth from the new switches will help ensure that heavy network activity does not cause service degradation to customers.

Which Customers are Impacted?

Customers on our shared hosting services.

How are Customers Impacted?

Customer impact is expected to be less than 10 seconds per server.  Servers will be inaccessible during this very short period of time.

How often will we be updated?

When completed.

Time to Resolution (ETA)

N/A

Incident Updates

  • 2016/05/17 11:32 PM EDT - Maintenance has started
  • 2016/05/17 5:48 AM EDT - Maintenance has completed
  • 2016/05/18 11:00 PM EDT - Maintenance has begun.
  • 2016/05/19 03:21 AM EDT - Maintenance has been completed for today, but will continue again tonight starting at 11 PM EDT.
  • 2016/05/19 11:25 PM EDT - Maintenance has started.

Resolution Description

Maintenance has now been completed successfully.

Network Maintenance – 05/15/2016 1AM-2AM – Complete

Posted In: Maintenance — May 12th, 2016 at 1:08 pm EDT by HE: Admin
Cloud services are affected
VPS services are affected
Shared services are affected

Incident Description:

Our Systems Administrators will be performing network maintenance on Sunday May 15th from 1AM to 2AM EDT.
During the maintenance, they will be upgrading components in both border routers to increase their redundancy.

Which Customers are Impacted?

We expect zero customer impact.

How are Customers Impacted?

We expect zero customer impact.

How often will we be updated?

When complete

Time to Resolution (ETA)

1 hour

Incident Updates

  • 2016/05/15 01:00 AM EDT - Maintenance has started. Please contact support should any issues be experienced, however we expect no problems during this time.
  • 2016/05/15 01:55 AM EDT - More time is required to complete all needed steps for maintenance. ETA extended 1 hour.
  • 2016/05/15 01:55 AM EDT - Maintenance is now complete.

Resolution Description

Maintenance is complete.

Windows VPS Maintenance – May 11th, 2016 – Complete

Posted In: Maintenance — May 10th, 2016 at 3:51 pm EDT by HE: John Richards
VPS services are affected

Incident Description:

On May 11th, 2016 at 11 PM EDT we will be performing maintenance on our Windows VPS Node "WVZ7", during which we will be replacing the CPU for this node. The maintenance is expected to last for one hour. During this time, all customer servers on this node will be offline.

 

Which Customers are Impacted?

All customers with VPS products on WVZ7.

How are Customers Impacted?

All services will be offline during the maintenance.

How often will we be updated?

Hourly

Time to Resolution (ETA)

1 hour

Incident Updates

  • Maintenance is now complete and all servers are online.

Resolution Description

Maintenance is complete.

Mail412 Urgent Maintenance – Resolved

Posted In: Maintenance — May 09th, 2016 at 3:07 pm EDT by HE: Greg Cook
Shared services are affected

Incident Description:

Our system administrators found an issue with mail412 and will be performing urgent maintenance. They have force-rebooted the server, which is currently undergoing a file system check (fsck).

Which Customers are Impacted?

All customers on mail412.

How are Customers Impacted?

Messages sent during the maintenance will be delivered after the fsck completes.

How often will we be updated?

As Required.

Time to Resolution (ETA)

~5 hours.

Incident Updates

n/a

Resolution Description

FSCK has been completed and the server is up.

Mail and MySQL Server Urgent Maintenance – Resolved

Posted In: Outage — Apr 28th, 2016 at 3:00 pm EDT by HE: Brian S.
Shared services are affected

Incident Description:

Our system administrators have identified an issue with one of our mail arrays related to a hardware failure of one storage member in the SAN.

A drive failure occurred on a member of the dmail02 storage group. The member attempted to initiate a RAID rebuild, which was unsuccessful, and the storage member removed itself from the storage group. RAID stands for "redundant array of independent disks", a technology that allows us to achieve high levels of storage reliability from our server drives by arranging the devices into an array. Simplified, this means they act like one large hard drive, but if one drive dies, there is enough data stored on the rest to recreate the lost data once the broken drive is replaced with a new one.
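The redundancy described above can be illustrated with a minimal sketch of parity-based reconstruction, the mechanism used by RAID 5-style arrays (later updates note the SAN is RAID 50, which stripes RAID 5 sets). This is a toy model, not the vendor's implementation: a parity block is the XOR of the data blocks, and XOR-ing the survivors rebuilds any single lost block.

```python
# Toy illustration of RAID 5-style parity (not the vendor's code):
# the parity block is the XOR of all data blocks, so any single
# missing block can be rebuilt by XOR-ing the surviving blocks.

def parity(blocks):
    """XOR equal-length blocks together to produce the parity block."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def rebuild(surviving_blocks):
    """Rebuild one missing block from the remaining data + parity blocks."""
    return parity(surviving_blocks)

# Three data disks plus one parity disk:
d1, d2, d3 = b"mail", b"data", b"here"
p = parity([d1, d2, d3])

# Lose d2, then reconstruct it from the survivors:
assert rebuild([d1, d3, p]) == d2
```

This also shows why the incident was so severe: XOR parity can only recover one missing block per stripe set, so multiple consecutive drive failures exceed what the redundancy can absorb.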

The server had to be taken offline, and solutions are currently being investigated. Email messages sent to addresses on these servers are being queued and will be delivered once services resume.

 

Which Customers are Impacted?

Customers with email service provided by this array, as well as customers who have database servers on this array. A full list has been posted.

How are Customers Impacted?

Email and database services are temporarily offline. Mail addressed to the affected mail servers will wait in a queue and be delivered once services return. Customers may also be unable to access their control panels during this outage.

How often will we be updated?

As required

Time to Resolution (ETA)

Unknown

Incident Updates

  • 2016/04/28 03:30PM EDT - Our system administrators are still investigating the cause of the problem. Our primary concern at this point is maintaining data integrity, so all services remain offline.
  • 2016/04/28 03:45PM EDT - Full list of affected mailservers has been added to the main post
  • 2016/04/28 03:50PM EDT - Full list of affected mailservers has been updated
  • 2016/04/28 03:55PM EDT - Full list of affected database servers has been added to the main post
  • 2016/04/28 04:30PM EDT - Services remain offline while we continue investigation.  Incoming mail during this outage will be queued and delivered once services are restored to normal
  • 2016/04/28 05:03PM EDT - Our engineers are working with the vendor engineers on restoring the storage array.
  • 2016/04/28 05:40PM EDT - Our storage vendor engineers are currently running a full diagnostic test on the array, in an attempt to try to bring the RAID back up.
  • 2016/04/28 06:20PM EDT - Our storage vendor engineers have escalated this issue further up through their development team and our System Engineers are also investigating alternate scenarios to resolve the issue.
  • 2016/04/28 07:36PM EDT - Our storage vendor engineers have identified a possible solution and they are preparing to attempt it.
  • 2016/04/28 07:59PM EDT - As we investigate deeper into this issue we have identified these additional mail servers affected: mail21, mail37, mail310, mail1213, mail1217, mail1218, mail1302, mail1411, mail1417, mail1421, mail1424
  • 2016/04/28 08:42PM EDT - SAN restoration attempts have not been successful, the engineering team is working with vendor engineers on the remaining options to restore the server without data loss.  We apologize that this process is taking some time, but it is very important that we are very careful and thorough with this sensitive problem.
  • 2016/04/28 11:40PM EDT - We are working a more detailed explanation for everyone that will contain more information on what failed and our next steps
  • Update 2016/04/29 01:20AM EDT - Although the DBMail02 cluster of virtual machines is organized in a redundant RAID 50 SAN, it had several consecutive failures today, resulting in the system-wide downtime you're experiencing. One disk failure is normally not a problem in an array of this kind; however, today multiple drives failed consecutively. This unlikely chain of events rendered the entire cluster unavailable. We are currently making copies of the failed disks. If these copies can be successfully created, the array can be brought back online by performing several sophisticated technical steps on the hard disks. If the array can't be brought back online, we would at least have a more recent version of the data, so that it can be restored from the last backup after all services have been brought back online. This backup restore process is running in parallel now, and most data will be gradually restored from backup as the services come back up. There will be another update in the morning with more technical details and information. This is a very long and frustrating outage for everybody. We wish wholeheartedly there was a way to speed this up, but our main concern is preserving data and minimizing any data loss. We will continue to work through the night on every avenue that will accomplish that, while simultaneously restoring services and data from backup.
  • Update 2016/04/29 07:50AM EDT - We are working on a detailed update that should be complete within the next hour.  Stay tuned.
  • Update 2016/04/29 08:26AM EDT - Our engineers have worked through the night, and we have been able to successfully copy the failed disk, which gives us more options toward our primary goal of restoring the database and mail data. Currently our engineers are back online with the highest level vendor engineers, and have managed to get the array back up in a delicate state, which gives us hope that we can evacuate the data safely and get it back online. We are very carefully attempting to do that now. While those operations proceed, our second engineering team has also been working through the night to recreate all 149 servers and to begin syncing backup data from the backups we do have of the Database, SiteStudio, and Control Panel servers. Copying that much data does take time, which is why we started it yesterday; however, we are still very hopeful that we will not have to use this solution. Our mail cluster continues to spool incoming mail, and will hold that mail until the mail servers are re-established, so no customers should lose emails sent to them during the outage. We do see and hear your calls for more frequent updates, and we very much want to provide them. Unfortunately many of the operations underway are done very carefully and slowly, and sometimes we are simply waiting for output from the systems for an hour or more. Again we are very sorry for how seriously this is affecting all of you, and commit that every level of HE is completely focused on resolving this issue as quickly as possible.
  • Update 2016/04/29 01:09PM EDT -  We are tentatively reporting that we have more progress.  We were able to stabilize the RAID array and connect another member.  We have started to evacuate the data.  We will all be steadily watching and hoping that the evacuation will complete successfully.   If the evacuation completes successfully, we hope to have everyone back on with little to no data loss.  We continue to see and hear the calls for more specific ETAs, but there is just no way to provide one until the evacuation is further along, it is currently at 5%.  Give us a couple of hours to calculate progression rates, and we may be able to give more concrete ETAs.  
  • Update 2016/04/29 04:15PM EDT - Evacuation of the SAN has been going smoothly so far, and we are becoming more encouraged that we will be able to restore the production servers without needing the backup systems, although the second engineering team continues to advance that option as a failsafe. The evacuation process moves the largest volumes first, so no servers have 'come out' of it yet: as of this update we are at 19%, and so far progression is averaging 5-7% per hour. However, as the evacuation progresses, entire server volumes will start to restore. Database servers will be brought online immediately. For mail servers, the queued mail will first be brought down, and then the server will be made available online. We will update this post with server names as we confirm they are up.

    Again we sincerely apologize for this lengthy issue, saving all customer data has been our priority throughout, and will continue to be our main priority.

  • Update 2016/04/29 07:00PM EDT - We are now past 30% and volumes are starting to emerge. Once all the partitions (volumes) of a server are out, we will start to bring it online as discussed in the previous update. We should have some servers start very soon.

  • Update 2016/04/29 09:00PM EDT - Evacuation progress is currently at 39%

  • Update 2016/04/29 09:15PM EDT - Our first server is back online.   MySQL1411 is now online, but it will still be inaccessible to customers.

  • Update 2016/04/29 10:38PM EDT - Six MySQL servers are online and accessible. You can view the list of online servers in the incident description above.

  • Update 2016/04/29 11:19PM EDT - Evacuation progress is currently at 48%
  • Update 2016/04/29 11:32PM EDT - Evacuation progress is currently at 50%
  • Update 2016/04/30 12:02AM EDT - Evacuation progress is currently at 52%
  • Update 2016/04/30 12:34AM EDT - Evacuation progress is currently at 55%
  • Update 2016/04/30 01:08AM EDT - Evacuation progress is currently at 57%
  • Update 2016/04/30 01:56AM EDT - Evacuation progress is currently at 59%
  • Update 2016/04/30 02:24AM EDT - Evacuation progress is currently at 61%
  • Update 2016/04/30 03:06AM EDT - Evacuation progress is currently at 64%
  • Update 2016/04/30 03:54AM EDT - Evacuation progress is currently at 68%
  • Update 2016/04/30 04:25AM EDT - Evacuation progress is currently at 70%
  • Update 2016/04/30 04:58AM EDT - Evacuation progress is currently at 73%
  • Update 2016/04/30 06:03AM EDT - Evacuation progress is currently at 77%
  • Update 2016/04/30 08:35AM EDT - Evacuation progress is currently at 88%
  • Update 2016/04/30 09:51AM EDT - Evacuation progress is currently at 91%
  • Update 2016/04/30 11:35AM EDT - Evacuation progress is currently at 95%
  • Update 2016/04/30 12:27PM EDT - Evacuation progress is currently at 98%
  • Update 2016/04/30 01:18PM EDT - Evacuation progress is currently at 100% Evac is complete.  The last sets of servers are preparing to be brought online.

Resolution Description

Data has been evacuated from the failed storage array and servers have been re-enabled.  Mail queues have been delivered and all services are restored.

DDoS (Distributed Denial of Service) attack – Resolved

Posted In: Other issues — Apr 28th, 2016 at 10:07 am EDT by HE: Toi Santamaria
Shared services are affected

Incident Description:

Our system administrators detected a Distributed Denial of Service attack (DDoS), launched against the nameservers for CP12.

A DDoS is an attempt to make a computer resource unavailable to its intended users. How the attack is carried out varies as much as who is attacked and why. One common method involves saturating the target (victim) machine with external communications requests. This creates so many false connections to the server that real attempts to connect cannot be completed. Because so many domains share an IP, it is not possible to determine which site the attack is directed at. In many cases a temporary block is sufficient until the DDoS attack passes; however, if the attack continues, the shared IP could remain blocked for an extended period of time.

In order to mitigate the attack and prevent larger service impact, system administrators have temporarily filtered all connections to those nameservers. Customers who do not have their DNS already cached will not be able to browse their sites.
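One common way to filter this kind of flood, sketched below, is per-source rate limiting with token buckets: each source IP may send a burst of queries, but sustained traffic above a refill rate is dropped. This is a hypothetical illustration of the general technique; the post does not describe the actual filter rules used:

```python
# Hypothetical sketch of per-source rate limiting, a common DNS-flood
# mitigation (the actual filter rules used here are not described):
# each source IP gets a token bucket; queries arriving faster than
# the refill rate are dropped.
import time

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate            # tokens added per second
        self.burst = burst          # maximum bucket size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                # over the limit: drop the query

buckets = {}

def accept_query(source_ip, rate=10, burst=20):
    bucket = buckets.setdefault(source_ip, TokenBucket(rate, burst))
    return bucket.allow()

# A rapid burst from one attacker IP is throttled after ~`burst` queries...
results = [accept_query("203.0.113.9") for _ in range(100)]
assert results.count(True) <= 25
# ...while a legitimate low-rate client is unaffected.
assert accept_query("198.51.100.7")
```

The trade-off matches what the updates describe: under heavy attack traffic some legitimate queries may still time out, but the bulk of real queries get through while the flood is absorbed.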

Which Customers are Impacted?

All customers with websites that use CP12 nameservers. You can determine whether your account uses CP12 by clicking the Manage button next to your hosting account; the address in your browser's address bar will tell you which CP your account is located on.

How are Customers Impacted?

Customers who do not have their DNS already cached will not be able to browse their sites.

How often will we be updated?

Hourly

Time to Resolution (ETA)

Systems Administrators are working to mitigate the effects of the DDoS. We will update with an ETA as soon as one is available.

Incident Updates

  • 2016/04/28 10:20AM EDT - System Administrators are still investigating the best way to mitigate the DDoS
  • 2016/04/28 11:15AM EDT - No new information to provide at this time
  • 2016/04/28 11:20AM EDT - Our system administrators have removed the filters on CP12 DNS queries.  We have implemented new rules to mitigate the attack.  CP12 nameservers are now successfully answering queries.
  • 2016/04/28 12:20PM EDT - The changes we have implemented are still having a positive impact. Due to the large amount of traffic still incoming, some queries may still time out, but we have noticed an increase in the number of legitimate queries being processed.
  • 2016/04/28 12:45PM EDT - The DDoS is still active, but we have successfully filtered it and all queries are being handled.  We are still actively monitoring the DDoS to see if there are any changes.

Resolution Description

The filter our System Administrators have implemented is working. All incoming traffic to this nameserver is isolated to one provider to protect the rest of our network from the attack. We are monitoring the situation so that we are aware of any changes.

Windows VPS Maintenance – April 29, 2016 – Postponed

Posted In: Maintenance — Apr 27th, 2016 at 2:32 pm EDT by HE: Brian S.

Incident Description:

On April 29th, 2016 at 11PM EDT we will be performing maintenance on our Windows VPS Node "WVZ7", during which we will be replacing the CPU for this node. The maintenance is expected to last for one hour. During this time, all customer servers on this node will be offline.

Which Customers are Impacted?

All customers with VPS products on WVZ7

How are Customers Impacted?

All services will be offline during the maintenance

How often will we be updated?

Hourly

Time to Resolution (ETA)

1 hour

Incident Updates

  • 2016/05/03 2:40PM EDT - Maintenance has been postponed.

Resolution Description

N/A

Semi-Annual Data Center Maintenance – Friday, April 29, 2016 – Resolved

Posted In: Maintenance — Apr 26th, 2016 at 10:48 am EDT by HE: Toi Santamaria
Cloud services are affected
VPS services are affected
Shared services are affected

Incident Description:

On Friday, April 29th, 2016, from 11:00 PM EST to 4:00 AM EST, we will be conducting routine maintenance on our data center's major electrical systems.

The purpose is to test and repair any internal components and batteries, as well as to inspect the Power Distribution Units throughout the data center.

During the maintenance, the commercial power grid will be offline and we will be running entirely on our generator systems. One at a time, we will take each UPS (we have two, UPS A and UPS B) offline via Maintenance Bypass.

The maintenance is scheduled to be completed within a 6-hour window.

Which Customers are Impacted?

All active customers will be affected.

How are Customers Impacted?

Backup power generators will be unavailable during the maintenance; in the unlikely event of a power outage, servers will run on UPS battery backup until generator power is restored.

How often will we be updated?

6 hours

Time to Resolution (ETA)

Friday, April 29th, 2016, 4:00 AM EST

Incident Updates

N/A

Resolution Description

N/A

 
© 2011 Host Excellence.