Chaos Unicorn Day - What We Learned by Breaking Status

On April 1st 2019, we broke Status. It wasn't a practical joke and it wasn't by accident.

Chaos Unicorn Day was announced well in advance - you can read about the hows and whys in the origin post, but here's the gist of it.

Status' story arc has always been to enable decentralized and censorship-resistant worldwide messaging. Whether that message is a financial transaction, a dapp interaction, or a chat message, not even members of the Status organization should be able to put a stop to the communication if they wanted to. Chaos Unicorn Day (CUD) was the first such experiment.

This post will document what we've learned from the experience, as well as what our plans for the future are.

Chaos Rising

Internally, we've often discussed how we should use a chaos engineering approach to make sure we are censorship-resistant and resilient. More recently, these discussions gained more and more traction, and it was only this January that we finally committed to it and convinced everyone that this crazy experiment was indeed worthwhile.

Knowing how much our application currently relies on third party services from Infura to Etherscan, as well as our own bootnodes and mailserver nodes hosted on various cloud services, we were immediately certain of two things:

  1. Disabling the services would take a lot of work.
  2. Disabling the services would break the app, no doubt about it.

But much like in September 2018 when we decided extreme dogfooding (1 2) was the only way to really improve Status and drove the point home by turning off Slack on stage at Devcon 4 a mere two months later, extreme self-sabotage seemed to us like the only way to make sure Status is robust and resilient against future attacks and outages. We were additionally encouraged by the fact that we're still in alpha with a relatively small user base, so the app breaking would be just disruptive enough to teach us useful things about our vulnerabilities, but not disruptive enough to ruin someone's work day.

As the date neared, the excitement became palpable. The core team was hacking away on custom builds that disable access to entire domains...

... and we got to work producing guides (1, 2) on how to immunize yourself if you're a frequent (and somewhat technical) user.

Chaos Apex

On the day, pretty much whatever we expected to happen happened: wallet functionality stopped working because it couldn't connect to Infura or Etherscan to read balances or send transactions, and chat functionality died because all the bootnodes in our cluster were down.

Within an hour, we were back up and running. A few nodes were launched using this guide, others connected to Jacques' node, and some contributors even ran their own Ethereum nodes for much-needed wallet functionality.

In no more than 2 hours, we were happily chatting away in Status and sharing enode addresses. The network had recovered.
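For illustration, here's a minimal Go sketch of what manually peering with a shared node can look like, using go-ethereum's RPC client to call admin_addPeer. The endpoint and enode address are placeholders, and it assumes the local node exposes the admin API over JSON-RPC; it isn't the exact procedure participants used on the day.

```go
package main

import (
	"fmt"
	"log"

	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	// Connect to the local node's JSON-RPC endpoint. Assumes the node
	// was started with the "admin" API enabled (for recent geth versions,
	// something like: geth --http --http.api admin,eth,net).
	client, err := rpc.Dial("http://localhost:8545")
	if err != nil {
		log.Fatalf("could not reach local node: %v", err)
	}
	defer client.Close()

	// Hypothetical enode address shared by another participant;
	// replace it with one you actually trust.
	enode := "enode://<node-id>@<host>:30303"

	var added bool
	if err := client.Call(&added, "admin_addPeer", enode); err != nil {
		log.Fatalf("admin_addPeer failed: %v", err)
	}
	fmt.Println("peer added:", added)
}
```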

Aftermath

As the dust settled and we revived our cluster, we met for a chat and discussed what we learned. We interviewed Status users who participated, shared our own experiences and suggested alternative approaches for implementations of features expected of the platform. You can read the full retrospective here, which includes a list of precise actions that we aim to take. A lot of these are fairly low-hanging fruit, such as increasing bootnode diversity, improving error handling when things go wrong, and fixing buggy configuration values. These are things we wouldn't have discovered if it weren't for this event.

Of course, some fruit is a bit harder to reach, but this only reiterated the importance of investing in long term projects such as: light nodes, incentivization, empowering people to run their own nodes, and creating a robust network resilient to individual node failure. Let's talk briefly about these and where we currently stand.

Light nodes

Running nodes on resource-restricted devices.

This applies to Whisper and IPFS/Swarm as well, but the primary issue surfaced during CUD concerns Ethereum nodes and our reliance on Infura. The fact that a single party can stop you from being able to transact through Status is unacceptable, and goes directly against our principles.
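One way to soften that single point of failure, while light clients mature, is to prefer a node you (or someone you trust) runs and only fall back to a hosted provider. Here's a minimal Go sketch of that idea using go-ethereum's ethclient; the endpoint URLs are placeholders and the fallback order is an assumption, not how Status itself is wired up.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/ethereum/go-ethereum/ethclient"
)

// dialFirstReachable tries a list of JSON-RPC endpoints in order and
// returns the first one that answers a basic request.
func dialFirstReachable(ctx context.Context, endpoints []string) (*ethclient.Client, error) {
	for _, url := range endpoints {
		client, err := ethclient.Dial(url)
		if err != nil {
			continue
		}
		// Fetch the latest header as a cheap liveness check.
		checkCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
		_, err = client.HeaderByNumber(checkCtx, nil)
		cancel()
		if err == nil {
			return client, nil
		}
		client.Close()
	}
	return nil, fmt.Errorf("no reachable endpoint")
}

func main() {
	endpoints := []string{
		"http://localhost:8545",                  // your own node
		"http://friend.example.com:8545",         // a peer's node (hypothetical)
		"https://mainnet.infura.io/v3/<api-key>", // hosted fallback
	}
	client, err := dialFirstReachable(context.Background(), endpoints)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	fmt.Println("connected")
}
```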

See issue on re-enabling LES in the app, discussion on various connectivity options, as well as the Constrained resource clients ring.

This is a hard problem, especially if you want both performance (not draining your battery on mobile) and strong security guarantees. That said, we aren't doing nearly enough on this front (aside from the amazing work Nimbus is doing).

Node incentivization

Making it economically profitable to run nodes.

Research is under way to make it profitable to run your own node, see this proposal and thread.

For ULC there are some ideas here.

Aside from economic incentivization, there are other forms of incentivization that are more related to culture. For more on this, see the next section.

Running your own node

Making the user experience of running and using your own node painless, and creating a culture of running your own node (similar to owning your own keys).

We now realize that we need to let the community know that it's actually preferable to run their own nodes rather than depend on ours, for both trust and reliability reasons. Part of the problem in getting people interested in running their own nodes - apart from the fact that you need to be reasonably technical - is that mailservers currently need high and reliable uptime. Mailservers should be able to join and drop at will with no ill effect on the network if there are enough of them, and running one should be cheap and painless.

Likewise, both individual users of the Ethereum network and other companies/dapps/organizations need to become aware of the fact that running a full node (not an archive node!) is incredibly easy and almost free these days with plug and play pre-configured hardware like that from Block and Mortar or Ava.do.
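As a small illustration of how approachable this can be from the application side, here's a hedged Go sketch that checks whether a self-hosted node is reachable and fully synced before you point a wallet or dapp at it. It assumes a node exposing JSON-RPC on the default local port; the address is a placeholder.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	// Placeholder endpoint for a locally running full node.
	client, err := ethclient.Dial("http://localhost:8545")
	if err != nil {
		log.Fatalf("node not reachable: %v", err)
	}
	defer client.Close()

	ctx := context.Background()

	// SyncProgress returns nil when the node is fully synced.
	progress, err := client.SyncProgress(ctx)
	if err != nil {
		log.Fatalf("sync check failed: %v", err)
	}
	if progress != nil {
		fmt.Printf("still syncing: block %d of %d\n",
			progress.CurrentBlock, progress.HighestBlock)
		return
	}

	head, err := client.HeaderByNumber(ctx, nil)
	if err != nil {
		log.Fatalf("could not fetch head: %v", err)
	}
	fmt.Println("node is synced, head block:", head.Number)
}
```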

On top of this, UX is incredibly important. Interfaces need to be intuitive and the user must be able to easily find out how to both run custom nodes and find them around the world. The user must also be able to either understand all the errors they are shown, or be given resources for where to learn about them.

Main Takeaways

  1. Like many things in life, this was initially painful and a bit scary for some people, but the lessons learned were well worth it.

  2. There's a lot of low-hanging fruit out there. By breaking things, we could identify issues and start fixing them.

  3. There are many fundamentally hard and interesting challenges (technical, economic, cultural) that we need to solve. These problems are shared across the larger community. They are also all within reach if we just apply ourselves to them.

Next steps

Based on the above takeaways, we decided that we'll keep doing this quarterly with specific dates to be announced well in advance.

Ultimately, it shouldn't be perceptible to the end user whether Status' cluster is live or not. All in all, CUD fostered community building and creative problem solving and we're looking forward to what we'll collectively come up with to defeat CUD2!

If you're running any kind of service that's supposed to survive the censorship apocalypse, we recommend trying this out on your own infrastructure - you'd be surprised how much you can learn from it. Get in touch, we'd be happy to be your chaos testers!