Amazon AWS

We have been hosting our dashboard in the Frontier Cyber Center in Rochester, NY for the last 10 years. It has, overall, been a very reliable place for our rack of servers. Especially for the last few years, there has been very little downtime for the dashboard, and response times are fast.

However, we are contemplating the best decision for the next decade of Semify growth. This question has come up before; so much so that we launched an exploratory project with Amazon Web Services (AWS) in March of 2018. We had always been intrigued by AWS, since you lease the hardware from Amazon. Because all of the instances are virtualized, there is tremendous flexibility for backups, mobility and scaling. In early 2018 we decided to invest in a Disaster Recovery (DR) project using AWS. We knew that many of our servers in the Cyber Center were aging and would reach end-of-life in the next year or so. For those server geeks out there, here is the high-availability (HA), fully redundant set-up we currently have in production. It has served us very well for 10 years.

In March of 2018 (a little over a year ago) we launched a replica of our primary production database server and a production web server at AWS. We implemented the full dashboard set-up, complete with load balancing and firewall protection. The idea was to use our DR site implementation as a "learning project" so that we could get to know the AWS infrastructure.

The 2018 AWS project had the added benefit of upgrading our disaster preparedness. Prior to 2018, our DR plan was to use the hardware in our Rochester office and host the dashboard over the fiber connection provided by Greenlight. Backups are pushed out of the Cyber Center to S3, which we could bring down to Rochester and restore. The total bring-up time for this DR strategy was about 12-16 hours, mostly due to the restore time for the MySQL database (which is quite large). Additionally, we would be limping along on older hardware that would likely not meet the demands of our clients; returning to full capacity would likely have taken about a week due to hardware procurement and deployment.
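For the curious, here is a rough sketch of the first step in that old DR flow: pulling the most recent MySQL dump down from S3 so it could be restored onto local hardware. The bucket and prefix names are placeholders, not our actual configuration.

```python
# Hypothetical sketch of the old DR flow: fetch the newest MySQL dump from S3.
# Bucket and prefix names are placeholders.
import boto3

s3 = boto3.client("s3")

BUCKET = "semify-dashboard-backups"   # placeholder bucket name
PREFIX = "mysql/"                     # placeholder key prefix

# Find the newest dump under the prefix.
objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)["Contents"]
latest = max(objects, key=lambda obj: obj["LastModified"])

# Download it locally; restoring it into MySQL is the part that took
# the bulk of the 12-16 hours.
s3.download_file(BUCKET, latest["Key"], "dashboard-dump.sql")
print(f"Downloaded {latest['Key']} ({latest['Size']} bytes)")
```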

Moving to AWS in 2018, we realized that we could improve our DR bring-up time by implementing real-time replication. This would cut out the 12-16 hours of data restore time from the SQL dumps we had been using. The only steps required to cut over after a disaster was declared would be a DNS change, followed by restarting the many jobs that run offline (mostly analytics, reports and emails).

In June 2018 we started replicating the dashboard data to AWS in real-time.
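To give a flavor of what keeping an eye on that replication looks like, here is a minimal sketch of a health check against the AWS replica. Host names, credentials and the use of the PyMySQL client library are assumptions for illustration, not our exact tooling.

```python
# Minimal sketch of a replication health check on the AWS replica.
# Host and credentials are placeholders.
import pymysql

conn = pymysql.connect(
    host="replica.example.internal",   # placeholder AWS replica host
    user="monitor",
    password="********",
    cursorclass=pymysql.cursors.DictCursor,
)

with conn.cursor() as cur:
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()

lag = status["Seconds_Behind_Master"]
running = (status["Slave_IO_Running"] == "Yes"
           and status["Slave_SQL_Running"] == "Yes")

print(f"replication running: {running}, lag: {lag} seconds")
```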

We also realized that AWS would give us the ability to immediately increase the hardware capabilities if needed. Rather than procuring new hardware and building it out in our Rochester office, we could order more computing power from AWS in a matter of minutes.
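As a sketch of what "ordering more computing power in minutes" means in practice, the snippet below resizes a virtual server via the AWS API: stop the instance, change its type, start it again. The instance ID and target instance type are placeholders.

```python
# Rough sketch of scaling up compute at AWS in minutes rather than weeks.
# Instance ID and target type are placeholders.
import boto3

ec2 = boto3.client("ec2")
INSTANCE_ID = "i-0123456789abcdef0"   # placeholder

# The instance must be stopped before its type can be changed.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

# Move to a larger instance type, then bring it back up.
ec2.modify_instance_attribute(
    InstanceId=INSTANCE_ID,
    InstanceType={"Value": "m5.2xlarge"},
)
ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
```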

Spring 2019 Decision

After running the AWS DR site for over a year, we are now deciding whether we should move our primary site to Amazon. We put monitors on the DR site and it has been stable. Additionally, we have run several load tests against AWS and compared them to our Rochester Cyber Center capabilities.

We have started a series of load tests to benchmark AWS against our Rochester production facilities. The results favor AWS: even with our current implementation, it is about 30% faster.
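Here is a simplified version of the kind of comparison behind that figure: time the same dashboard request against both environments and compare median response times. The URLs are placeholders, and our real load tests drive many concurrent users rather than a single loop.

```python
# Simplified latency comparison between the two environments.
# URLs are placeholders for illustration only.
import statistics
import time
import urllib.request

def median_latency_ms(url, samples=50):
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as resp:
            resp.read()
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)

rochester = median_latency_ms("https://dashboard-rochester.example.com/login")
aws = median_latency_ms("https://dashboard-aws.example.com/login")

print(f"Rochester: {rochester:.0f} ms, AWS: {aws:.0f} ms")
print(f"AWS is {100 * (rochester - aws) / rochester:.0f}% faster")
```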

While the costs are higher at AWS, we know the flexibility is greater. We also believe we will save some time on hardware monitoring and occasional replacement, since the hardware will now be managed by Amazon.

Additionally, by using multiple Availability Zones at AWS, we can achieve redundancy in case one of the Amazon facilities goes down. Here is the architecture we have set up at Amazon today:
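For those who like to verify such things from the API side, here is a minimal sketch that groups running instances by Availability Zone to confirm the spread. The filter is a placeholder for however the dashboard instances are tagged.

```python
# Minimal sketch: count running instances per Availability Zone.
import collections
import boto3

ec2 = boto3.client("ec2")
reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]

by_zone = collections.Counter()
for reservation in reservations:
    for instance in reservation["Instances"]:
        by_zone[instance["Placement"]["AvailabilityZone"]] += 1

for zone, count in sorted(by_zone.items()):
    print(f"{zone}: {count} instance(s)")
```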

What is holding us back from moving to Amazon?

There are a few reasons NOT to go to AWS. First, the cut-over itself will require downtime. We would schedule this for a weekend night / off-hours to minimize customer impact.

Second, there is the unknown. While we feel that our load tests have been solid, there are always the "unknowable unknowns." Having been in IT for 25 years, I have always resisted change, since doing so minimizes unknowns. Things ARE running extremely smoothly now. Introducing AWS may cause more downtime as we encounter new situations we have not drilled for.

As always, we would welcome your feedback on this decision.

******************************4/5/19 Update************************************

We made the decision to move our servers to AWS. Thanks for all the great input and consultation.

Preparations are underway for the migration. We have tentatively scheduled Saturday 4/20/19 for the move. We are expecting downtime between 5am and 7am (Eastern) on Saturday morning 4/20/19.

Because all of your dashboard URLs use CNAMEs pointing to a DNS entry we control, we believe that your dashboards will be accessible as soon as the new DNS entries propagate. In other words, we don't believe there is anything you need to do.
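If you want to double-check from your location after the move, here is a quick sketch that confirms a hostname resolves. The hostname is a placeholder; substitute your own dashboard URL.

```python
# Quick check that a dashboard hostname resolves after the DNS change.
# Hostname is a placeholder; use your own dashboard's hostname.
import socket

HOSTNAME = "dashboard.example.com"

try:
    address = socket.gethostbyname(HOSTNAME)
    print(f"{HOSTNAME} currently resolves to {address}")
except socket.gaierror as err:
    print(f"Could not resolve {HOSTNAME}: {err}")
```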

Please call us with any questions.

******************************4/19/19 Update************************************

All of our final preparation checklists are going well and we are a GO for this cut-over tomorrow morning (Saturday). We have done exhaustive checksum verifications of the data (which has been replicating from Rochester to AWS for the last 14 months), and the two copies tie out.
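For the technically curious, here is a sketch of the kind of checksum comparison involved, built on MySQL's CHECKSUM TABLE statement. The hosts, credentials, table names and the PyMySQL client are placeholders for illustration.

```python
# Sketch of comparing table checksums between the Rochester primary and the
# AWS replica. Hosts, credentials and table names are placeholders.
import pymysql

def table_checksums(host, tables):
    conn = pymysql.connect(host=host, user="verify", password="********",
                           database="dashboard")
    checksums = {}
    with conn.cursor() as cur:
        for table in tables:
            cur.execute(f"CHECKSUM TABLE {table}")
            _, checksum = cur.fetchone()
            checksums[table] = checksum
    conn.close()
    return checksums

TABLES = ["clients", "campaigns", "reports"]   # placeholder table names
primary = table_checksums("db-rochester.example.internal", TABLES)
replica = table_checksums("db-aws.example.internal", TABLES)

mismatches = [t for t in TABLES if primary[t] != replica[t]]
print("All tables tie out" if not mismatches else f"Mismatch: {mismatches}")
```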

We have done trial runs of our cut-over procedure (which has 67 steps) a number of times. The team sat down on Wednesday for a final review of the QA steps and found a few remaining holes, which were plugged.

We think we're ready. But we also "eat failure for breakfast." Thanks for your support.

******************************4/20/19 Update************************************

The cut-over is complete and you are reading this from AWS. Our cut-over went according to plan with 26 minutes of downtime (5:10am - 5:36am Eastern). All QA steps check out. Please let us know if you see any issues.