Ok, while we're at it, I think the most important part of the post was this:
> Edgestore data was already well-collocated, which meant that cross-shard transactions ended up being fairly rare in practice—only 5-10% of Edgestore transactions involve multiple shards.
What if performant cross-shard transactions are a red herring, and the thing we should be looking into more is reliable automatic data colocation, so that cross-shard transactions are avoided as much as possible? There's a decent amount of academic research around this, with projects like SWORD [1] and Schism [2] that study shard load balancing as a hypergraph partitioning problem. It seems like this might be worth incorporating into commercial distributed database projects.
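To make that concrete, here's a toy sketch of the objective those papers optimize (nothing Edgestore-specific, and real systems use proper hypergraph partitioners rather than this greedy pass): each transaction's key set is treated as a hyperedge, and keys are placed on the shard holding their most frequent co-accessed partners, subject to a per-shard capacity.

    from collections import defaultdict

    def place_keys(transactions, num_shards, capacity):
        # transactions: list of sets of keys touched by the same transaction.
        # capacity must be >= ceil(total distinct keys / num_shards).
        freq = defaultdict(int)
        for txn in transactions:
            for k in txn:
                freq[k] += 1

        assignment, load = {}, [0] * num_shards
        for key in sorted(freq, key=freq.get, reverse=True):  # hot keys first
            # Score each shard by how many already-placed partners it holds.
            affinity = defaultdict(int)
            for txn in transactions:
                if key in txn:
                    for other in txn:
                        if other in assignment:
                            affinity[assignment[other]] += 1
            candidates = [s for s in range(num_shards) if load[s] < capacity]
            best = max(candidates, key=lambda s: affinity[s])
            assignment[key] = best
            load[best] += 1
        return assignment  # keys co-accessed often tend to share a shard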
Edgestore's API is set up to shepherd users into good collocation patterns by default, and a lot of work over the past year or two went into improving collocation and educating users about best practices. The collocation efforts were actually orthogonal to implementing cross-shard transactions, but they were obviously very beneficial.
Thanks -- just curious, am I correct to interpret this to mean that, thus far, the performance of the system basically relies on users explicitly defining colos nicely within their application-level data model?
For some reason this reminds me of something like the entity group concept in Google's Megastore [1].
There's a point somewhere between where we are today and a completely uncollocated free-for-all at which the system would fall over. There's a separate axis -- the rate at which users request transactions (with locks) versus non-transactional writes using optimistic concurrency control -- that would come into play sooner. Our guidance is therefore that users reserve transactions for cases where there are correctness-critical reasons why two objects need to be updated together, and rely on asynchronous primitives to handle "eventually consistent" mutations.
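To illustrate the distinction (invented names, not Edgestore's actual API): an OCC write holds no locks and just retries on a version conflict, whereas a locking transaction is what you'd reserve for the correctness-critical multi-object case.

    def occ_update(store, key, mutate, max_retries=10):
        # Non-transactional write with optimistic concurrency control: no
        # locks are held; on a version conflict we re-read and retry.
        for _ in range(max_retries):
            value, version = store.get(key)          # value plus its version
            if store.put_if(key, mutate(value), expected_version=version):
                return True
        return False  # caller decides whether to retry, e.g. asynchronously

    # By contrast, a locking transaction (pseudo-API) would be reserved for
    # the rare case where two objects must change together for correctness:
    #
    #   with store.transaction() as txn:   # takes locks, may span shards
    #       txn.put("a", ...)
    #       txn.put("b", ...)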
I have witnessed first-hand companies that decide to do cross-shard transactions when in fact they don't really need them.
For example, imagine that you have a distributed key-value store. You want two users (on different shards) to either both be able to see a piece of content or neither. You can achieve this by allocating a key on either user's shard (or a third shard), then writing a reference to that key into each user's shard. If all of those reference writes succeed, you write the content to the newly allocated key; if any write fails, you bail and your data is still consistent. Wrap the above logic in a nice library and your service will scale horizontally.
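In rough Python (the `kv` client, its `get`/`put` methods, and the key names are all invented for illustration), the pattern looks something like this -- the single write of the content record is the commit point, so until it lands the references just dangle harmlessly:

    import uuid

    def share_content(kv, user_a_shard, user_b_shard, content):
        # Make `content` visible to both users or to neither, without 2PC.
        content_key = "content:" + str(uuid.uuid4())  # allocate a fresh key
        content_shard = user_a_shard                  # or a third shard

        # Step 1: write a reference to the (not yet existing) content record
        # on each user's shard. If either write fails, just bail; readers
        # that follow a dangling reference simply find nothing.
        kv.put(user_a_shard, "user_a:refs:" + content_key, content_shard)
        kv.put(user_b_shard, "user_b:refs:" + content_key, content_shard)

        # Step 2: the commit point -- one single-shard write. After it, both
        # users' references resolve; before it, neither does.
        kv.put(content_shard, content_key, content)
        return content_key

    def read_shared(kv, user_shard, user, content_key):
        ref = kv.get(user_shard, user + ":refs:" + content_key)
        if ref is None:
            return None
        return kv.get(ref, content_key)  # None until the commit write lands

If a reference write fails you're left with at most a dangling reference that nobody can resolve, which is the kind of garbage you eventually have to clean up.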
For us (Dropbox), the threshold ended up being multiple product teams having to implement ad-hoc two-phase commit over their datatypes and burning engineer hours not to implement it but to prove that they had gotten it right and to handle the cleanup after any unsuccessful writes.
You're right that most systems probably don't need 2PC, which is why Edgestore didn't include it until now. As mentioned in the post, we finally felt that we had reached the right balance of tradeoffs to justify the API-level primitive.
When you get to a big enough scale, you need cross-shard transactions. Google reached that point with Spanner.
If I look at Dropbox, I am sure features related to sharing folders between people/organizations cross shards. You can't avoid them if you want to offer a fully featured product.
After a quick skim of 2PC in Edgestore (sorry, no time), it is unclear whether there is a single transaction coordinator (TC) or not. I assume it is a single TC -- that you can scale it to 10m trans/sec is impressive. The really hard part is having multiple TCs and designing protocols to coordinate their recovery after failure. Here is a good example in the open-source NDB (MySQL Cluster) system - https://drive.google.com/file/d/1gAYQPrWCTEhgxP8dQ8XLwMrwZPc...
We have a routing tier of a few hundred machines, any of which can serve as the transaction coordinator. This is one of the places where having a transaction record is useful -- in general, our locking and non-transactional read scheme will wait nicely for a 2PC transaction to complete, but in the face of failures, any actor in the system can abort the 2PC transaction by marking the transaction record. There's also a really nice optimization for non-transactional reads that allows them to proceed even in the presence of a staged but not yet committed 2PC transaction (you can prove to yourself that a linearized read can occur on either side if the 2PC transaction is still pending when the read begins).
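Very roughly (invented names, nothing like our actual code), the "any actor can abort" behavior hinges on the transaction record being the single thing that gets flipped, so the coordinator's commit and a bystander's abort are just the same compare-and-swap racing in opposite directions:

    PENDING, COMMITTED, ABORTED = "pending", "committed", "aborted"

    def resolve_staged_write(store, staged_write):
        # Called by a reader or lock-acquirer that bumps into a staged 2PC
        # write whose coordinator may have died.
        record_key = staged_write.txn_record_key
        state = store.get(record_key)
        if state == PENDING:
            # Give up waiting and abort on the transaction's behalf.
            if store.compare_and_swap(record_key, expected=PENDING, new=ABORTED):
                state = ABORTED
            else:
                state = store.get(record_key)  # lost the race to a commit/abort
        # COMMITTED: roll the staged write forward; ABORTED: discard it.
        return state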
Then you are just forcing your application to deal with the dangling reference in the case of a failed "transaction". The whole point of 2PC is not to have to deal with that complexity.
Whether that complexity is actually difficult to deal with is a whole other question; I could easily believe that for some applications it's not, but apparently for Dropbox it's worth implementing 2PC to avoid dealing with it.
Yes and no. You end up with a weaker API, e.g. you can't perform atomic updates across shards. But you can create data atomically, perhaps build atomic counters, etc.
Giving up some flexibility in the types of atomic operations you can do for a simpler architecture is sometimes worth it.
The idea of creating an abstraction layer over thousands of MySQL nodes is a novel one and should be commended. I like the idea a lot, and going by Dropbox's scale and success, it must work well.
I would be interested in a granular point-by-point comparison between this layer and native MySQL replication. In other words, if someone were to rewrite MySQL replication such that it did everything that this abstraction layer did (in addition to replication), what would it need to do?
Now (and this is strictly academic/theoretical), I'm curious what would need to change in the abstraction layer if it were to support a whole bunch of disparate SQL databases underneath it, i.e., Postgres, SQLite, SQL Server, Oracle, etc. (No, that wouldn't be practical, but it would be an interesting exercise to really learn where the gotchas might be when working with different SQL dialects...)
Vitess [1] is a similar kind of abstraction layer that's compatible with the native MySQL wire protocol. It also supports atomic cross-shard transactions using 2PC.
Vitess is a bit less of an application abstraction layer in the sense of Edgestore, Tao, and others, and more of an operational abstraction that hides physical sharding and takes the pain out of repartitioning to accommodate new capacity.
I've had some opportunity to work with the Vitess (now PlanetScale) folks on some other databases-related work and it's a great community and product so I would definitely encourage people to check it out.
Notably, from that site: "Vitess has been serving all YouTube database traffic since 2011, and has now been adopted by many enterprises for their production needs."
It seems like you are looking for TiDB? It is deployed as a single MySQL compatible database that transparently partitions data so that writes scale horizontally. TiDB is actually the MySQL compute layer. TiKV is the underlying distributed Key-Value store that different protocols can be implemented on (one company made a TiKV-backed Redis).
Dropbox is very easy to shard though, the users are mostly independent. Take something like Facebook, and it becomes much much harder - anybody can friend anybody, anybody can message anybody, anybody can like or comment on public posts.
[1] http://www.cs.umd.edu/~amol/papers/edbt13.pdf
[2] http://www.vldb.org/pvldb/vldb2010/papers/R04.pdf