That blip both nights was my fault. But to quote 45, “it wasn’t me.”

2/5/2021 - From 10:45 to 11:07, our core router rebooted itself twice, resulting in two 3-minute interruptions of service. This is the most unplanned downtime I have had in years due to my equipment.

On 2/1/2021, our router manufacturer released a hotfix to address a DNS vulnerability that resurfaced after 10 years of hiding (DNSpooq, if you are inclined). The vulnerability was disclosed on January 21, and attackers promptly started abusing equipment around the world. The hotfix also addressed a memory leak when using the vendor’s network operations suite (which we do). I never install a fresh hotfix or patch right away. The more urgent they are, the more likely they have some unintended side effect that is worse than what was addressed. I planned to install this hotfix next weekend. Instead, I installed it last night at 3:39AM.

So yeah, sorry about that 6 minutes of downtime, but I was up from 2:00AM till 4:30AM planning/executing/testing a mitigation for you. Pass the coffee, please.

2/6/2021 - It happened again. However, we found a smoking gun. On 2/2/2021, Net253 upgraded our bandwidth commit to KPUD by a factor of 2.5. KPUD also implemented a new burst cap feature (a speed limit) on our transit circuit. This cap was double our new commit, so 5 times higher than our old utilization number. The limit was put in place because Net253 can consume the entire bandwidth for all KPUD internet transit (including the other ISPs). This was a good thing to do, or so we thought.

The cap went into effect on 2/5/2021. That night, my usual bandwidth spike (9-11PM) happened as normal. This time, however, the burst cap kicked in on the KPUD router in Bremerton. The Net253 router has a 16-core processor that normally sits at 3% utilization, since all routing is done in hardware offload. When the cap kicked in, hardware offload on my router failed and all the cores pegged for about 10 minutes until the router rebooted. This happened over and over until the load went down.

As a mitigation, KPUD has doubled the burst cap so there is no chance we will hit it Sunday evening.
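For anyone who wants to check the math on the cap, here is a minimal sketch. It works only in multiples of our old commit, since the actual circuit speeds are not listed here, and it assumes the old commit and the "old utilization number" are roughly the same thing. The function name is made up for illustration.

```python
def burst_cap_multiple(commit_factor: float, cap_over_commit: float) -> float:
    """Return the burst cap as a multiple of the OLD commit.

    commit_factor:   new commit divided by old commit (2.5 in the post)
    cap_over_commit: burst cap divided by new commit (2.0 in the post)
    """
    return commit_factor * cap_over_commit


# Figures from the post, all relative to the old commit (real Mbps not given).
initial_cap = burst_cap_multiple(2.5, 2.0)  # cap that took effect on 2/5/2021
mitigated_cap = initial_cap * 2             # KPUD doubled the cap afterwards

print(f"initial cap:   {initial_cap}x old commit")
print(f"mitigated cap: {mitigated_cap}x old commit")
```

In other words, the cap that bit us sat at about 5 times the old number, and the doubled cap sits at about 10 times.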

This is new territory for KPUD as well as Net253. Net253 moves more data than any other ISP on the network. Net253 has been working with KPUD for the last three months to get new multi-gigabit commit rates defined for the wholesale contract table. We are now committing to buy 50% more bandwidth every second than what was the top end of the KPUD wholesale contract table as of January 2021.

Edit: 2/7/2021 - During the last 10 minutes of the Super Bowl, we peaked several hundred megs higher than we did yesterday when the wheels fell off the apple cart. It looks like we now know what our issue was: we are bigger internet users than our monitoring was telling us.
