Postmortem of the outage at CloudAfrica 2015-03-12
Postmortem of the outage of CloudAfrica systems at the VOX Waverley (Johannesburg, South Africa) Data Centre
12th March 2015
We would like to share the details of the outage that affected our infrastructure and Cloud services on the afternoon of 12th March 2015, as well as what we will be doing to prevent this from happening again.
On behalf of CloudAfrica, we are extremely sorry for this outage, and the severe inconvenience it has caused you – our customers and partners – and the inconvenience caused in turn to your customers.
We take outages of any kind extremely seriously, and we continue to aim for providing the best performing and most robust Cloud services infrastructure on the continent.
Background
All CloudAfrica systems and the associated networking and storage architecture are designed to function in a highly redundant manner, with no single points of failure (which is one of the reasons we do not utilise shared SAN storage). We therefore operate highly redundant local storage on all our compute nodes, as well as redundant network firewalls and switching infrastructure.
However, we currently provide services from two carrier-grade data centres operated by third parties, through whom we obtain our upstream external (i.e. Internet/MPLS) connectivity – and in the case of the VOX Waverley facility, that happens to be through VOX itself. So, whilst our racks are populated with our own compute nodes, storage infrastructure, and network and firewall elements, we ultimately connect our core network into that of our upstream providers to enable our systems to reach the outside world.
What Happened?
At approximately 14h40, it appears that the core network switch for the entire VOX Waverley facility faulted. Whilst we have received confirmation (on a number of occasions previously, and again today) that this core switching infrastructure is in fact redundant, for whatever reason the redundant core switching unit failed to take over, and the entire VOX Waverley core network failed. Another core network switch was then brought online, but because of the scale of the failure and the complexity of the configuration, it took until approximately 17h00 to get things completely operational again.
As of early evening on 12th March 2015, all systems are fully functional and operational and we will continue to monitor the situation very carefully.
Total downtime experienced at the data centre was 2 hours and 15 minutes.
Next Steps
We will be taking several steps to prevent this failure from happening again.
First: We will be securing an additional bandwidth supplier out of our Teraco Isando facility, so that connectivity to the outside world can be maintained in the event of such failures in the future. We will be looking to have multiple network uplinks, not only from VOX Waverley, with traffic automatically rerouted via Teraco Isando should the need arise (the failover principle is sketched after the second step below).
Second: Whilst the VOX Waverley infrastructure is connected to our secondary facility at Teraco Isando via a redundant 500Mbps fibre link, that connectivity is also carried through the core networking infrastructure at VOX Waverley, so connectivity between the facilities was likewise affected. We are exploring options to internetwork our platforms through an additional third-party network provider, bypassing the core VOX Waverley networking infrastructure and ensuring that we can maintain intersite connectivity should such an event recur.
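For illustration, the sketch below shows the general principle behind the automatic rerouting described in the first step: a host monitors its primary gateway and, after several consecutive failed probes, switches its default route to a secondary uplink, failing back once the primary recovers. The gateway addresses, thresholds, and the script itself are hypothetical examples and not our actual configuration; in practice this kind of failover is normally handled by dynamic routing protocols (such as BGP) rather than a standalone script.

    #!/usr/bin/env python3
    """Minimal uplink-failover sketch (illustrative only, not our production setup).

    Assumes a Linux host with two candidate default gateways: a primary uplink
    (VOX Waverley) and a secondary uplink (via Teraco Isando). All addresses
    and thresholds below are placeholders.
    """
    import subprocess
    import time

    PRIMARY_GW = "192.0.2.1"       # placeholder: primary (VOX Waverley) gateway
    SECONDARY_GW = "198.51.100.1"  # placeholder: secondary (Teraco Isando) gateway
    FAIL_THRESHOLD = 3             # consecutive failed probes before failing over
    PROBE_INTERVAL = 10            # seconds between probes


    def gateway_alive(gw: str) -> bool:
        """Return True if the gateway answers one ICMP echo within 2 seconds."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", gw],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0


    def set_default_route(gw: str) -> None:
        """Point the host's default route at the given gateway (requires root)."""
        subprocess.run(["ip", "route", "replace", "default", "via", gw], check=True)


    def monitor() -> None:
        failures = 0
        active = PRIMARY_GW
        while True:
            if gateway_alive(PRIMARY_GW):
                failures = 0
                if active != PRIMARY_GW:
                    set_default_route(PRIMARY_GW)    # primary is back: fail back
                    active = PRIMARY_GW
            else:
                failures += 1
                if failures >= FAIL_THRESHOLD and active != SECONDARY_GW:
                    set_default_route(SECONDARY_GW)  # reroute via secondary uplink
                    active = SECONDARY_GW
            time.sleep(PROBE_INTERVAL)


    if __name__ == "__main__":
        monitor()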
Closing
We want to reiterate our apology for the magnitude of this issue and the impact it has had on you – our customers – and on your customers. Rest assured that we have set the wheels in motion to move as diligently and as expediently as we can, both to prevent an issue like this from happening again and to maintain service availability should such events happen in the future.
We have also taken a decision that all customers will receive a 10% credit on all Cloud hosting fees for the next 3 months – this will be effective from next month’s billing run (i.e. the bill run to be executed in April 2015, for the Cloud hosting fees related to March 2015 utilisation). No further communication is required from any customer in this regard – these credits will be applied automatically to all affected customers from the April 2015 bill run.
We will communicate the details of the broad steps outlined above during the coming week.
Sincerely
Len and Angelo.