NE BUSINESS BUREAU
MUMBAI, MAR 25
The National Stock Exchange of India Ltd. said a failure in a storage system was the key reason behind the longest-ever outage experienced by the world’s biggest derivatives bourse on February 24.
The exchange also said in a statement Monday that the link between its main data center and primary disaster recovery facility experienced problems on the day of the outage because of digging and construction activity along the path.
The outage was caused by ‘failover logic’ — a method used to switch to a standby system upon abnormal termination of an application — for a storage system that was implemented by a vendor and didn’t conform to the exchange’s design requirements, NSE said in the statement citing its analysis. “This resulted in the risk-management system of NSE Clearing and other systems such as clearing and settlement, index and surveillance systems becoming unavailable.”
NSE’s primary data centre is in BKC, a Near Disaster Recovery (NDR) site is maintained in Kurla, and the disaster recovery (DR) site is in Chennai. There is synchronous data replication between our primary site in BKC and the NDR site to ensure no data loss in case of primary site failure, and asynchronous replication to our DR site in Chennai, which is designed to take over with zero data loss in case of disaster at the primary site.
Between our primary and NDR sites, NSE has multiple telecom links with two service providers to ensure redundancy. On February 24, 2021, we had instability in links from both service providers, primarily due to digging and construction activity along the path between the two sites. The replication to NDR is designed such that in the event of the links between primary and NDR getting cut, the primary continues operations without any direct effect. Post earlier link failures in February 2021, operations continued without any interruption.
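NSE's statement contains no technical detail beyond the description above. As a rough illustration only, the synchronous-to-NDR versus asynchronous-to-DR replication layout it describes can be sketched as follows; every class, method, and site name here is hypothetical and not part of NSE's actual systems:

```python
# Sketch of the replication layout described in the statement:
# a synchronous leg (a write is acknowledged only after the near
# replica has applied it) and an asynchronous leg (the write is
# acknowledged immediately and shipped to the far replica later).
# All names are illustrative.

class Site:
    def __init__(self, name):
        self.name = name
        self.data = []

    def apply(self, record):
        self.data.append(record)


class Primary:
    def __init__(self, sync_replica, async_replica):
        self.local = Site("primary")
        self.sync_replica = sync_replica    # NDR-style near site
        self.async_replica = async_replica  # remote DR-style site
        self.pending = []                   # not yet shipped to DR

    def write(self, record):
        # Synchronous leg: the write "commits" only once both the
        # primary and the near replica hold it, so a primary failure
        # loses no acknowledged data.
        self.local.apply(record)
        self.sync_replica.apply(record)
        # Asynchronous leg: queued now, shipped later, so the far
        # replica may briefly lag behind acknowledged writes.
        self.pending.append(record)
        return "ack"

    def ship_async(self):
        while self.pending:
            self.async_replica.apply(self.pending.pop(0))


ndr = Site("NDR")
dr = Site("DR")
primary = Primary(ndr, dr)
primary.write("trade-1")
primary.write("trade-2")
# The sync replica is always current; the async replica lags until shipped.
print(len(ndr.data), len(dr.data))  # 2 0
primary.ship_async()
print(len(dr.data))                 # 2
```

The trade-off the sketch makes visible: the synchronous leg buys zero data loss at the cost of write latency to the near site, while the asynchronous leg tolerates a slow or distant link but can lag.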
However, on February 24th, post link failure, we saw unexpected behaviour of the Storage Area Network (SAN) system, with the primary SAN becoming inaccessible to the host servers. This resulted in the risk management system of NSE Clearing and other systems such as clearing and settlement, index and surveillance systems becoming unavailable.
While there was no impact on the trading system, given that the risk management system was unavailable, allowing trading to continue on NSE posed an unacceptable risk, and hence trading had to be halted.
The SAN is a fault-tolerant system that was designed to function seamlessly even in the event of telecom link failures between the primary and NDR copies. One of the SAN features deployed in October 2020 was designed to provide not just zero data loss but also zero downtime. Before deployment, the system was tested against various scenarios, including link failures, and functioned properly. However, on February 24th, post link failure, the SAN system at the primary data centre stopped functioning, which was completely unexpected.
Subsequent incident analysis showed that the problem was caused by failover logic implemented by the vendor which did not conform to NSE’s stated design requirements, coupled with issues in the configuration done by the SAN vendor that triggered the failover logic. We note that the specific failover logic used by the vendor is not documented, was not communicated to NSE, and was not appropriate for NSE’s setup. The resultant SAN failure led to the incident on February 24th.
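The statement does not disclose the vendor's actual logic, so the following is purely a hypothetical sketch of the class of failure it describes: a storage failover routine that treats loss of the replication link as a primary failure, rather than letting a healthy primary carry on. Function names and return values are invented for illustration:

```python
# Hypothetical contrast between the design intent described in the
# statement (link loss alone is tolerated at the primary) and a
# mis-scoped failover routine that fences the primary on link loss,
# making its storage inaccessible to host servers. Illustrative only.

def failover_intended(primary_healthy, replication_link_up):
    """Design intent: only a genuine primary failure triggers failover."""
    if not primary_healthy:
        return "promote_standby"
    # A cut replication link alone is tolerated; the primary keeps serving.
    return "serve_from_primary"

def failover_as_deployed(primary_healthy, replication_link_up):
    """Mis-scoped logic: link loss is treated like a primary failure."""
    if not primary_healthy or not replication_link_up:
        return "fence_primary"  # primary storage goes inaccessible
    return "serve_from_primary"

# The February 24 scenario: primary healthy, replication link down.
print(failover_intended(True, False))     # serve_from_primary
print(failover_as_deployed(True, False))  # fence_primary
```

Under this reading, pre-deployment tests could pass if they exercised link failure and primary failure separately but never the exact configuration that routed link loss into the fencing path.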