Why the "Digital Ocean killed my company" incident scares me

update 05-06-19: Digital Ocean has posted their postmortem on this situation as mentioned by their CTO Barry Cooks in the comments.

Last Friday May 31, Nicolas Beauvais, CTO and only tech person of small startup raisup.com, took to Twitter with a cry for help. Their cloud hosting company Digital Ocean had just locked their account and made it clear that this was permanent.

How @DigitalOcean just killed our company @raisupcom. A long thread for a very sad story. pic.twitter.com/uOFCDRoYJ6
— Nicolas Beauvais (@w3Nicolas) May 31, 2019

DO claimed that Raisup had broken the terms of usage. I do not have all the details, so caveat lector, but apparently it was due to spinning up a dozen droplets and running a batch job.

Anyway, the story got a ton of coverage on Hacker News and Twitter. People swore to abandon Digital Ocean ("never again!"); the CTO jumped in; mealy-mouthed excuses were made, etc. You get it.

The account was apparently restored reasonably quickly and the angry mob dispersed.

I am Nicolas

So, I am Nicolas. Sort of. Not with the same impact and gravity as the Charlie Hebdo "Je suis Charlie" association, but not dissimilar in spirit. I feel almost personally hit by this incident.

I run a small company, by myself. For now at least.
I use big cloud providers like AWS and Heroku.
I have customers (some quite big actually) that rely on my service.
I actually ran a dozen Digital Ocean droplets with fairly high resource usage before switching to AWS completely some months ago...

But most of all, I'm scared I could be hit by an out-of-control abuse algorithm and a broken customer service process. And I have zero Twitter clout or any other online notoriety.

Is Checkly prepared for outages and failover? Yes, it actually is. We do regular backups, and we can switch from EU-based to US-based hosting. We can deploy to new regions super quickly. Heck, 90% of our monitoring infrastructure is permanently distributed over the whole globe.

Would this survive when AWS and Heroku decide to unequivocally shut us down? Probably not.

Let's look at both sides of the argument

Why didn't you just...

So what did Nicolas do wrong here? Luckily we have the friendly crowds on Twitter and Hacker News to tell us. Here is some wisdom from the various threads:

Why were you hosting on JUST ONE cloud provider?
Why didn't you have backups outside of JUST ONE cloud provider?
Why didn't you have a monthly validated Disaster Recovery plan?
Why did you use Digital Ocean in the first place?
Why isn't everyone on bare metal?
Why didn't you use React in Docker? (this one I made up)

Mostly fair questions, right? But as someone in the Twitter thread said:

"its good advice bt im Not sure uve had a startup before." [sic]

And this is spot on.

Of course, you can't just shrug off basic service reliability and availability planning just because you're a (small) startup. In fact, that is the whole reason you are using a cloud service like Digital Ocean. Or any cloud service in general.

Your shitty, self hosted, never patched, not-monitored infrastructure is almost never better than a modern cloud provider. Money wise and time wise. Multiply that statement by ten if you are a bootstrapped, indie hacker, maker, solo-dev whatever startup.

This also goes for offsite backups and DR scenarios. Your cloud provider is 9 out of 10 times better equipped to offer you the tools to implement this on their redundant, geographically distributed data centers.

I'll even make the counter argument. Small startups do get some slack with regard to typical Enterprise™ services. They are innovative, quick to act, mostly cheap and eager to please early customers.

This is worth something and Fortune 500 companies doing business with small startups know this. The risk factor is higher, but probably more due to the company folding because of typical startup reasons (money, fights, egos, fatigue) than because of infrastructure lock-out.

Also "doing business with a Fortune 500 company" most of the time means a team inside that Fortune 500 is using your service or tool for the team's purpose. General Electric does not run on your bootstrapped stamp collecting SaaS, trust me.

So, did Nicolas screw up by having all his eggs in one basket? No. He's in a two person startup. This is fine.

You only have to be wrong once

Handling abuse is hard and very annoying. I do not claim to have any experience at the scale Digital Ocean is operating at, but I do have to cancel/ban accounts once in a while when some crypto farmer tries to abuse Checkly.

It does not surprise me at all that DO has automated large parts of their "abuse pipeline". It also does not surprise me at all that they have a "three strikes and you're out" policy and that they enforce this.

However, the problem is similar to what you often hear in discussions around terrorism and counter-terrorism: "the terrorists only have to get it right once, whereas counter-terrorism has to get it right 100% of the time".

Not calling anyone here a terrorist, but the example stands. Your abuse algorithm has to be REALLY good and it needs to be followed up by some REALLY good human investigation to double check the data.

As always, it's a combination of factors that causes the accident:

The algorithm failed, based on what I can gather, to recognize some scaling operation as normal, legitimate usage. Again, I assume it was legitimate. Something was set too tight. Or not tight enough and it triggers too many false positives making triage hard.
The human process failed, probably because of bureaucracy. Maybe because of stress, poor training, company culture, you name it. They failed to recognize Nicolas' business as just a normal customer. They probably didn't even check their website or customer history. This is all speculation of course.

It's the combination of these two categories of failure that scares me. Even more so when there is a "no recourse" policy and I'm a social media nobody.

Mitigations

So what am I going to do about this? What can fellow small startups do?

Be upfront and honest with your customers about your operation. Don't create false expectations. They are probably fine with it.
Don't go all DR / failover crazy. You're probably solving the wrong problem for your business.
Get comfortable with the support crew at your cloud provider. Maybe hit up engineers or support on LinkedIn or Twitter to establish some rapport.
Pay for that extra support tier, if possible. Note, this can get crazy expensive.
Establish some form of "clout", just in case you need a Twitter mob to wake up the C-level.
Do have some backups outside of your primary cloud provider. You'll sleep better.
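
For the last point in the list above, here is a minimal sketch of what "a backup outside your primary provider" could look like in practice. It is not how Checkly does it. It only copies a dump file to a second location and checks that the copy is byte-identical. The names `copy_backup_offsite`, `sha256_of` and `offsite_dir` are made up for illustration, and `offsite_dir` is assumed to be a folder that ends up at a different provider (a mounted bucket, a synced directory, whatever you have).

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_backup_offsite(dump: Path, offsite_dir: Path) -> Path:
    """Copy a backup file to a second location and verify the copy.

    offsite_dir is a hypothetical stand-in for storage that lives
    at a different provider than the one hosting the original dump.
    """
    offsite_dir.mkdir(parents=True, exist_ok=True)
    target = offsite_dir / dump.name
    shutil.copy2(dump, target)
    # A backup you never verified is only a hope, so compare checksums.
    if sha256_of(dump) != sha256_of(target):
        raise IOError(f"checksum mismatch for {target}")
    return target


if __name__ == "__main__":
    # Demo with temporary directories so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        dump = Path(tmp) / "primary" / "db-dump.sql"
        dump.parent.mkdir()
        dump.write_text("-- pretend this is a database dump\n")
        copied = copy_backup_offsite(dump, Path(tmp) / "offsite")
        print(f"verified copy at {copied}")
```

Run something like this from a cron job after your regular dump and you have the "sleep better" part covered. Restoring from it is a separate exercise that is worth trying at least once.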

banner image: Meiji Emperor at Horse Race in Ueno Park. Not identified, Japan. Source
