Tuesday, 14 May 2013

Statement: Network and Redundancy Systems Failure and Resolution

Official Statement from Excel Markets to Live Account Holders

Tuesday, 14 May 2013. An inexcusable outage, for which we take full responsibility, occurred today and lasted up to 2 hours, alongside outages at just over 60 other brokers using the same very popular and trusted professional service provider. If your trading on a live account was affected by this outage, please contact us at support@excelmarkets.com with an explanation of the issue you experienced. Besides correcting any issues and issuing full refunds where due, we have already deposited a free $200 (or the equivalent in your base currency) into the account of every trader who had an open trade during the outage window, regardless of whether that trade was affected, as a way of saying we are very sorry. This affects live accounts only.

It is with great embarrassment and disappointment that we apologize for what occurred today, and we vow to make it an absolute priority to mitigate any potential downtime of this magnitude in the future. The highest level of management in the organization will be paying close attention to this serious matter; if you have any feedback, questions, or anything else, please contact us. A deeper explanation of the situation can be found below.

Today, Tuesday 14 May 2013, we experienced a network failure followed by an inexcusable failure of redundancy systems that caused downtime of up to 2 hours. We have been working with one of the top companies in the field, hosting mission-critical proprietary financial software in the prestigious NY4 Equinix datacenter and at other redundancy locations. This company serves over 60 brokers, including some very large ones, so a large number of brokers experienced downtime across the board today.

The impeccable reputation of the company we were working with was not, on its own, enough to make us choose them, as we are very conscientious when it comes to client services. Instead of simply accepting that they could do precisely what we needed, we carried out significant due diligence and testing: besides holding many discussions with them to ascertain their level of expertise, we beta tested them for over a year prior to our initial site release. During that time we saw an impressive record of zero downtime. Still, we demanded full control over all of our machines, which they informed us was not typical of their setup, as they did not normally hand over machine control. Although we did receive the full control we demanded, when it came time to roll over to backups we discovered a lock, probably unintentional on their part, that required a second login and prevented us from accessing our machine to complete a fast rollover.

The downtime we saw today was intolerable, unprofessional, and downright embarrassing, and it is the complete opposite of how we strive to do business. We have not had even a website down for this long in the last few years, unless it was taken down on purpose for upgrading. We will be taking strong and decisive action and employing an entirely new set of redundancy measures and controls so that an outage like this simply cannot occur again. Some extremely large companies trust this provider to manage their infrastructure; nevertheless, we will take every measure to ensure this incident cannot be repeated.

Downtime is an unavoidable fact of running servers, routers, firewalls, and so on. However, in 7 years I have never seen our websites down for 2 hours, and far more work is justified in maintaining uptime on a trading server. The fact is that downtime should be extremely rare and should last no more than 15 minutes or so, as long as the proper redundancy systems are in place. It should be noted that the company we worked with had extensive redundancy systems in place, and we even ran weekend test drills of the automated redundancies.

Of the 60-plus brokerages they work with, we were told not one ran test drills like we did. Somehow those redundancies failed today, and while I am sure we will receive an explanation from our partner, the answer does not matter: the fact that it happened once is enough for us to decide that, from now on, we will maintain full control of every facet of the trading infrastructure. The company I mentioned has 60+ brokers to deal with, and of course when something goes wrong that can create a lot of confusion and delay.

We do have 5 alternate connection points. However, these servers balance load, cache data, and provide a streamlined route to the execution system that resides in NY4. Ultimately, all connection points depend on the primary trading system, which has fiber backbones into the bank aggregation systems, so a failure of the primary renders the alternate connection points ineffective; however, they can quickly be tethered to an alternate primary machine such as the one we switched to today. As stated, every critical component of the system, including the primary trading system, is fully redundant, and the true failure today was the excessive amount of time it took to fail over to those redundant components.
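
To make that dependency concrete, here is a minimal, purely illustrative sketch in Python. It is not our production code, and the class names, server names, and order string are made up for the example; it simply shows why edge connection points that forward to a single primary only recover once they are re-tethered to an alternate primary.

class TradingPrimary:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def execute(self, order):
        # Orders can only be filled while this primary is up.
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{order} executed via {self.name}"


class ConnectionPoint:
    """Load-balancing/caching edge node; it cannot fill orders without a healthy primary."""

    def __init__(self, name, primary):
        self.name = name
        self.primary = primary

    def retether(self, new_primary):
        # The "fast rollover" step that was delayed during today's outage.
        self.primary = new_primary

    def route(self, order):
        return self.primary.execute(order)


ny4 = TradingPrimary("NY4-primary")
backup = TradingPrimary("alternate-primary")
edges = [ConnectionPoint(f"cp{i}", ny4) for i in range(1, 6)]  # the 5 connection points

ny4.healthy = False                      # primary failure in NY4
try:
    edges[0].route("BUY 1 EURUSD")       # edge nodes cannot work around a dead primary
except ConnectionError as err:
    print("order rejected:", err)

for cp in edges:
    cp.retether(backup)                  # failover: point every edge at the alternate primary
print(edges[0].route("BUY 1 EURUSD"))    # routing resumes through the alternate primary

The point of the sketch is simply that the edges are healthy throughout; what matters is how quickly the re-tethering step happens, which is exactly where today's delay occurred.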

Please note that logging into the server mtr1.excelmarkets.com should work at this time. However, in the rare event that DNS changes have not propagated to your area, you can access the server directly by IP address at 216.93.241.26:443. We will also be installing an emergency status page so that, even if a future 5-minute bout of downtime, network outage, or DDOS attack forces a rollover, clients will know exactly which alternate redundancy points to log in to. Please also note that we will be rolling primary systems back to NY4 on Saturday 18 May 2013 to avoid unnecessary additional downtime.
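
For clients who want to verify connectivity themselves, the short Python sketch below is an illustration only: it checks name resolution and basic TCP reachability and falls back to the direct IP given above when DNS has not yet propagated. It is not part of the trading platform, and the function names are ours for the example.

import socket

SERVER_HOST = "mtr1.excelmarkets.com"   # primary server name from this statement
FALLBACK_IP = "216.93.241.26"           # direct IP given above for un-propagated DNS
PORT = 443                               # port given above


def resolve_or_fallback(host, fallback_ip):
    """Return the address DNS gives for host, or the fallback IP if lookup fails."""
    try:
        return socket.gethostbyname(host)
    except socket.gaierror:
        return fallback_ip


def reachable(ip, port, timeout=5):
    """Simple TCP connectivity check against the server endpoint."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    ip = resolve_or_fallback(SERVER_HOST, FALLBACK_IP)
    status = "reachable" if reachable(ip, PORT) else "unreachable"
    print(f"{SERVER_HOST} -> {ip}: {status} on port {PORT}")

If the check reports the fallback IP, your local DNS has simply not caught up yet; entering the IP address directly in your platform's server field will work in the meantime.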