For approximately one hour yesterday morning, Wikipedia, the popular online encyclopedia, was down. That’s right – many students who planned to do last-minute research online for their class reports might have found themselves in a library instead. Who knows? Maybe they even had to figure out the Dewey Decimal System to locate a book and finish their research.
According to the Wikipedia Tech Blog:
“Due to an overheating problem in our European data center many of our servers turned off to protect themselves. As this impacted all Wikipedia and other projects access from European users, we were forced to move all user traffic to our Florida cluster, for which we have a standard quick failover procedure in place, that changes our DNS entries.
However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally. This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.”
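The "caching effects" mentioned in the quote are worth a closer look. DNS resolvers typically keep an answer until its time-to-live (TTL) expires, so a bad answer picked up during the broken window can linger after the fix. Here is a minimal Python sketch of that behavior. It is a toy model, not Wikimedia's actual setup: the `CachingResolver` class, the hostname `wiki.example`, the documentation-range IP addresses, and the one-hour TTL are all assumptions chosen to mirror the "up to an hour" figure.

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    address: str
    expires_at: float  # seconds on a simulated clock


class CachingResolver:
    """Toy resolver (hypothetical) that reuses an answer until its TTL runs out."""

    def __init__(self, authoritative: dict, ttl_seconds: int):
        self.authoritative = authoritative  # shared, mutable "source of truth"
        self.ttl_seconds = ttl_seconds
        self.cache = {}

    def resolve(self, name: str, now: float) -> str:
        record = self.cache.get(name)
        if record is not None and now < record.expires_at:
            return record.address  # cached answer, possibly stale
        address = self.authoritative[name]  # cache miss: ask the source
        self.cache[name] = CachedRecord(address, now + self.ttl_seconds)
        return address


def simulate(ttl_seconds: int = 3600) -> list:
    """Return the answers one resolver sees before and after a fix."""
    records = {"wiki.example": "192.0.2.1"}  # assumed broken entry
    resolver = CachingResolver(records, ttl_seconds)

    answers = [resolver.resolve("wiki.example", now=0)]  # caches the bad entry
    records["wiki.example"] = "198.51.100.1"  # operators fix the record

    # The fix is live, but this resolver keeps its cached answer until expiry.
    for now in (120, ttl_seconds - 1, ttl_seconds):
        answers.append(resolver.resolve("wiki.example", now=now))
    return answers
```

Calling `simulate()` shows the resolver returning the old address for every lookup inside the TTL window and picking up the corrected one only once the cached record expires. In this simplified model, a shorter TTL shrinks that window at the cost of more lookups against the authoritative source.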
This outage is a reminder that an overheated server or other critical IT equipment can cause anything from an inconvenience to far more serious consequences. It’s important to plan ahead and be on the lookout for single points of failure (SPOFs): any one component, such as a lone cooling system or a failover mechanism that turns out to be broken, whose failure can take everything else down with it. Want to know more about SPOFs? Here’s a recent St. Louis Small Business Monthly article on SPOFs written by our very own David Brown.
Did you also know that overcooling servers or other critical IT equipment can cause damage or hide dangerous heat-related problems? It’s true. A data center can get too cool. This Processor article identifies concerns and considerations when it comes to overcooling.
Any one failure point could be disastrous, so make sure you know your SPOFs and how you’ll react if any failures do occur. Don’t let a SPOF close you down.