
L4D2 Server Status


cici


GC Board Member

The L4D2 server is inaccessible because the cooling system has failed in the Chicago data center. NFO Servers, which hosts our VDS, provided the following information:

Jan 14 2024 11:26:31 AM PT: Investigating packet loss and routing reconvergence, we found that a link to one of our upstreams in Chicago was flapping today. We have shut down that link while we investigate further, which should prevent further impact to customers.

Jan 14 2024 11:48:10 AM PT: The issues in Chicago are actually due to components in our routers overheating, due to a major datacenter outage with their cooling system. We are unable to stop the packet loss for now as a result. We are actively monitoring for updates from the facility on a fix.

Update @ 3:48pm CST: The facility reports that they are continuing to try to recover from multiple failed chillers, which went down due to the low temperatures. Currently, temperatures are still rising in the facility, wreaking havoc on our equipment, including our routers. Our last look at our main router showed an internal case temperature of around 160 degrees F.

We have tried workarounds, but both of our redundant routers are seeing serious problems, including failed power supplies, due to the heat issue, and all of our upstream connections are intermittently offline because the network cards and cables are so hot. Only when they have brought temperatures down again will we be able to assess what is permanently broken and adapt to it.

Update @ 6:31pm CST: The facility still has not brought down temperatures, though they are trying with fans. Our primary router seems to have entirely failed, likely because it overheated; we are transitioning to exclusive use of our secondary router, but do not know how long it will hold out. Its upstream links are also still frequently going down.

Update @ 6:53pm CST: As the facility continues to have temperature rises, we are seeing switches and machines fail. We will have to assess these when the cooling is restored. We hope that they are not entirely dead.


GC Board Member

I see the servers are accessible now. However, NFO states 'we're not out of the woods yet' in their last update. Also, NFO mentioned 'automatic CPU throttling may occur on machines at the physical system level, limiting performance'. Therefore, I'm thinking we may see performance issues on the server until the cooling system is fully restored.

Additional Updates:

Update @ 9pm CST: We are told that one of the broken chillers is back online now and that temperatures have stabilized at 120 degrees F. It is still not possible for us to effectively troubleshoot downed equipment, but we are monitoring very closely and will try to bring everything back online as soon as we are able to.

Update @ 9:41pm CST: The facility says that another chiller is back online and that temperatures are slowly decreasing now, but we have not seen a change yet in our equipment status. We are continuing to monitor and wait.

Update @ 2am CST: As the temperature slowly goes down, our router is going longer before its network adapter overheats and it kills the connection. We are observing about 7 minutes of connectivity before it goes offline for a minute.

Our primary router and one of our network switches are still offline. We have asked the facility to investigate these, but they have told us that they will not turn back on any equipment for customers until temps drop further. We will pursue them.
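For a rough sense of what the 2am update means in practice, the reported pattern of about 7 minutes online followed by about 1 minute offline can be turned into an approximate availability figure. This is only a sketch: the function name is made up for illustration, and the inputs are the approximate figures quoted in the update, not measurements.

```python
def availability(up_minutes: float, down_minutes: float) -> float:
    """Fraction of time connected for a repeating up/down cycle."""
    cycle = up_minutes + down_minutes
    if cycle <= 0:
        raise ValueError("cycle length must be positive")
    return up_minutes / cycle


# Approximate figures from the 2am update: ~7 minutes up, ~1 minute down.
ratio = availability(7, 1)
print(f"Approximate availability: {ratio:.1%}")  # prints "Approximate availability: 87.5%"
```

In other words, at that point the connection would have been usable for roughly seven minutes out of every eight, which is enough to reach the server but not enough to hold a game session reliably.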

Update @ 3:42am CST: The ambient temperature dropped a little further and our secondary router has not had a high-temperature disconnect error for a bit over 30 minutes now. This means that most customers have connectivity again.

We still have one switch offline, and our primary router offline, and we are lobbying the facility to investigate these ASAP. The router being offline is not causing customers problems because it is redundant, but the switch being offline is leading to some customers' machines or VDSes being inaccessible.

So far, we have seen a couple of machines that rebooted due to the heat, but we haven't noted any total hardware failures apart from the switch and primary router. We will be performing a complete audit of all equipment after temperatures are back in the normal range and the facility has restored the downed switch.

Update @ 3:57am CST: The switch that was offline is now online again; it seems to have left a temperature protection mode as the ambient temperature dropped. We are continuing to investigate the downed router and to look for any other equipment that might be having problems.

Please also note that because the facility's temperature is still high -- we are told that it is 88F now -- automatic CPU throttling may occur on machines at the physical system level, limiting performance. This should automatically resolve as the temperature drops further.
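For anyone who thinks in Celsius, here is a small sketch that converts the Fahrenheit readings quoted in the updates (around 160 F inside the router case, 120 F when temperatures stabilized, and 88 F at the 3:57am update). The function name and the labels are my own; only the three readings come from NFO's updates.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


# Readings quoted in the updates; the labels are descriptive only.
readings_f = {
    "router case (3:48pm)": 160.0,
    "facility, stabilized (9pm)": 120.0,
    "facility (3:57am)": 88.0,
}

for label, temp_f in readings_f.items():
    temp_c = fahrenheit_to_celsius(temp_f)
    print(f"{label}: {temp_f:.0f} F is about {temp_c:.1f} C")
```

That works out to roughly 71 C, 49 C, and 31 C respectively.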

Update @ 9:08am CST: Customer equipment stayed online through the night, but the facility itself has not yet fully recovered, so we're not out of the woods yet. Data center says that there was a slight increase in temperature during the night when two chillers failed and had to be restarted, and that they are in the process of installing additional portable coolers. They have not yet worked on our offline router.


GC Board Member

Latest Update:

Update @ 3:40am CST on 1/16: The facility reports that five out of six chillers are operational, and the datacenter is within a normal temperature range again. They manually rebooted our primary router, and it came back online; we've now shifted loads back onto it.

Our audits have not identified any equipment that is not functioning properly, so everything appears to be back to normal now at this facility. We will continue to monitor, however, and follow up with the facility as they work to repair the sixth chiller and improve their overall cooling systems.

