Scalability of a varying degree

Seen at the official site of Qumulo:

Scale

Platforms must be able to serve petabytes of data, billions of files, millions of operations, and thousands of users.

Thousands of users...? Isn't that a little low? Typical Swift clusters at telcos have tens of millions of users, of whom tens or hundreds of thousands are active simultaneously.

Google's Chubby paper has a little section on the scalability problems of talking to a cluster over TCP/IP. Basically, in the low tens of thousands of clients you start to have serious issues with kernel sockets and TIME_WAIT. So maybe that is where the number comes from.
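
For the curious, the number that blows up is easy to watch: count the sockets parked in TIME_WAIT. The sketch below is mine, not anything from the Chubby paper; it reads /proc/net/tcp, where the kernel's state code 06 means TIME_WAIT.

```python
#!/usr/bin/env python3
# Count TCP sockets in TIME_WAIT by scanning /proc/net/tcp and /proc/net/tcp6.
# Illustrative only: this is the sort of number that climbs into the tens of
# thousands when a large client population churns short-lived connections.

TIME_WAIT = '06'  # socket state code used by the kernel in /proc/net/tcp

def count_time_wait(paths=('/proc/net/tcp', '/proc/net/tcp6')):
    total = 0
    for path in paths:
        try:
            with open(path) as f:
                next(f)  # skip the header line
                for line in f:
                    fields = line.split()
                    if len(fields) > 3 and fields[3] == TIME_WAIT:
                        total += 1
        except FileNotFoundError:
            pass
    return total

if __name__ == '__main__':
    print('sockets in TIME_WAIT:', count_time_wait())
```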

MinIO liberates your storage from rebalancing

MinIO posted a blog entry a few days ago in which they bragged about adding capacity without the need to rebalance.

First, they went into full marketoid mode, whipping up fear:

Rebalancing a massive distributed storage system can be a nightmare. There’s nothing worse than adding a storage node and watching helplessly as user response time increases while the system taxes its own resources rebalancing to include the new node.

Seems like the MinIO folks assume that operators of distributed storage such as Swift and Ceph have no tools to regulate the resource consumption of rebalancing, so they have no choice but to "watch helplessly". Very funny.
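
For the record, both projects ship knobs for exactly this: Ceph has settings like osd_max_backfills and osd_recovery_sleep, and Swift's object-replicator has concurrency and interval settings (plus bandwidth limits on rsync). Conceptually they all boil down to something like the sketch below; this is my illustration, not code from either project.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# A conceptual sketch of a throttled rebalance loop, in the spirit of
# Ceph's osd_max_backfills/osd_recovery_sleep or Swift's replicator
# concurrency/interval settings. Not code from either project.

MAX_CONCURRENT_MOVES = 2     # cap on parallel transfers (cf. osd_max_backfills)
SLEEP_BETWEEN_MOVES = 0.1    # pause between moves (cf. osd_recovery_sleep)

def move_partition(partition, target_node):
    # Placeholder for the actual data transfer (rsync, ssync, backfill, ...).
    print(f'moving partition {partition} to {target_node}')

def rebalance(moves):
    """Drain a list of (partition, target) moves without saturating the cluster."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_MOVES) as pool:
        for partition, target in moves:
            pool.submit(move_partition, partition, target)
            time.sleep(SLEEP_BETWEEN_MOVES)  # rate-limit how fast work is queued

rebalance([(p, 'new-node') for p in range(10)])
```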

But it gets worse when obviously senseless statements are made:

Rebalancing doesn’t just affect performance - moving many objects between many nodes across a network can be risky. Devices and components fail and that often leads to data loss or corruption.

Often, man! Also, a commit protocol? Never heard of her!

Then they move on to some unrelated matters:

A group of drives is an erasure set and MinIO uses a Reed-Solomon algorithm to split objects into data and parity blocks based on the size of the erasure set and then uniformly distributes them across all of the drives in the erasure such that each drive in the set contains no more than one block per object.

Understood: your erasure set is what we call a "partition" in Swift, or a placement group in Ceph.
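
To make the "no more than one block per object per drive" property concrete, here is a toy placement over one erasure set. The fragment counts and the hash-based offset are mine for illustration, not MinIO's actual algorithm.

```python
import hashlib

# Toy placement over one erasure set: split an object into k data + m parity
# fragments and give each drive in the set at most one fragment per object.
# The numbers and the hash-based offset are illustrative, not MinIO's algorithm.

DRIVES = [f'drive-{i}' for i in range(16)]  # one erasure set of 16 drives
K, M = 12, 4                                # 12 data + 4 parity fragments

def place(object_name):
    assert K + M == len(DRIVES)
    start = int(hashlib.md5(object_name.encode()).hexdigest(), 16) % len(DRIVES)
    placement = {}
    for frag in range(K + M):
        drive = DRIVES[(start + frag) % len(DRIVES)]
        kind = 'data' if frag < K else 'parity'
        placement[drive] = (kind, frag)     # one fragment per drive, per object
    return placement

for drive, (kind, frag) in sorted(place('photos/cat.jpg').items()):
    print(drive, kind, frag)
```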

Finally, we get to the matter at hand:

To enable rapid growth, MinIO scales by adding server pools and erasure sets. If we had built MinIO to allow you to add a drive or even a single hardware node to an existing server pool, then you would have to suffer through rebalancing.

MinIO scales up quickly by adding server pools, each an independent set of compute, network and storage resources.

Add hardware, run MinIO server to create and name server processes, then update MinIO with the name of the new server pool. MinIO leaves existing data in their original server pools while exposing the new server pools to incoming data.

My hot take on social media was: "Placing new sets on new storage impacts utilization and risks hotspotting because of time affinity. There's no free lunch." Even on second thought, I think that is about right. But let us not ignore the cost of the data movement associated with rebalancing. What if an operator wants to implement in Swift what the MinIO blog post talks about?

It is possible to emulate MinIO, to an extent. Some operators add a new storage policy when they expand the cluster, configure all the new nodes and/or volumes in its ring, then make it the default, so newly-created containers (and the objects written into them) end up on the new hardware. This accomplishes the same goals that MinIO outlines above, but it's a kludge. Swift was not originally intended for this and it shows. In particular, storage policies were intended for a small number of storage classes, such as rotating media and SSD, or Silver/Gold/Platinum. Once you make a new policy for each forklift visit, you run the risk of hitting scalability issues. Granted, most clusters only expand a few times over their lifetime, but potentially it's a problem. Also, policies are customer-visible; they are intended to be.
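
To illustrate how the kludge behaves, here is a toy model of it. The key property is that a container is pinned to the policy it was created with, and each policy has its own ring (its own set of devices), so old data never moves while new containers land on the new gear. This is a conceptual sketch, not Swift code; in real life you build the new ring with swift-ring-builder and declare the policy in swift.conf.

```python
# Conceptual sketch of "expand by adding a default policy", not Swift code.
# Each policy maps to its own ring (its own set of devices); a container is
# pinned to the policy it was created with, so old data never moves.

RINGS = {
    'policy-0': ['old-node-%d' % i for i in range(1, 5)],   # original hardware
    'policy-1': ['new-node-%d' % i for i in range(1, 5)],   # the forklift upgrade
}
DEFAULT_POLICY = 'policy-1'   # flipped to the new policy after the expansion

containers = {}               # container name -> policy chosen at creation time

def create_container(name, policy=None):
    containers[name] = policy or DEFAULT_POLICY

def node_for(container, obj):
    """Objects always go to the devices of their container's policy."""
    ring = RINGS[containers[container]]
    return ring[hash((container, obj)) % len(ring)]

create_container('photos-2019', policy='policy-0')   # existed before the upgrade
create_container('photos-2021')                      # created after the flip
print(node_for('photos-2019', 'a.jpg'))   # stays on the old hardware
print(node_for('photos-2021', 'b.jpg'))   # lands on the new hardware
```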

In the end, I still think that a balanced cluster is the way to go. Just think about it rationally.

Interestingly, the reverse emulation appears not to be possible for MinIO: if you wanted to rebalance your storage, you would not be able to. Or at least the blog post above says: "If we had built MinIO to allow you to add a drive or ... a node to an existing server pool". I take it to mean that they don't allow it, and the blog post is very much a case of sour grapes, then.

Swift in 2021

A developer meet-up for OpenStack, known as PTG, occurred a week ago. I attended the Swift track, where somewhat to my surprise we had two new contributors show up.

I got into a habit of telling people that I did not want Swift to end up like AFS: great software that is nonetheless dead, with nobody using it. Today I looked it up, and what do you know: OpenAFS made a release in June 2020 (and apparently they also screwed up and had to post an emergency release in October).

So, I was chatting with Matt O. at PTG and he said, "oh yeah, we won some contracts when I was at SuSE, Swift was beating the competition." Not entirely a surprise, but it got me thinking: is it too early to declare Swift dead, or even AFS-level dead?

Since NVIDIA gobbled up SwiftStack, I was full of concerns about centralization. NVIDIA uses Swift the way a hyperscaler would, in support of their own clusters. They already started to divest themselves of SwiftStack's customer base. I envisioned a future where NVIDIA assembles all the core contributors, then fires them all and closes the project. But then I learned that Lustre went through a cycle like that: it was acquired, then eventually sold off to a smaller, more focused company (DDN).

To sum up, I see a possibility for Swift to remain relevant through a three-step strategy, if you will. First, Swift remains open, keeps up with the technology, and stays performant. Thanks to that, it wins new deployments (in HPC and telco in particular). And because of that field use, it finds corporate stewardship. So, basically: suck less for success.

P.S. Also at PTG I learned that S3 Inventory existed. Seemed like implementing it in Swift could be a satisfying accomplishment for someone new.

A small billion-object Swift cluster

In the latest Swift numbers: I talked to someone today who mentioned that they have 1,025,311,000 objects, or almost exactly a billion, spread over only 480 disks. That is, if my arithmetic is correct, about 2,000 times smaller than Amazon S3 was in 2013. But hey, not everyone is S3. And they aren't having any particular problems; things just work.
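
For the curious, the arithmetic, assuming the roughly two trillion objects that Amazon reported for S3 in 2013 (their figure, not mine):

```python
# Back-of-the-envelope check, assuming the ~2 trillion objects Amazon
# reported for S3 in 2013 (their figure, not mine).
s3_2013 = 2_000_000_000_000
cluster = 1_025_311_000
print(round(s3_2013 / cluster))   # ~1951, i.e. about 2,000 times smaller
```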

~avg on NoSQL

Just saving it from LinkedIn:

The real difference between SQL-based (and other relational databases) and NoSQL glorified KV stores is the presence of algebraic structure (i.e. Codd algebra). Algebra is basically all about transformations between equivalent expressions to arrive at a desirable form (i.e. simplified, or factorized, or whatever the goal is). These transformations have another name: optimizations.

Basically, when you have a real SQL database, you have the ability to optimize execution plans, which can easily yield orders of magnitude of improvement in performance.

(And, yes, modern relational databases (e.g. Snowflake) do internally convert semi-structured data into tabular form so that the optimizations are applicable to it as well.)
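
To make the "orders of magnitude" point concrete, here is a toy example of the kind of rewrite an optimizer does for free: pushing a selection below a join. The data and numbers are made up; both plans compute the same answer.

```python
import random

# Toy illustration of an algebraic rewrite (predicate pushdown): the two plans
# below are equivalent expressions, but one does vastly less work. A relational
# optimizer performs this rewrite for you; against a bare KV store you do not
# get it unless you hand-code it. Data and numbers are made up.

random.seed(1)
orders = [(oid, random.randrange(500)) for oid in range(20_000)]              # (order_id, customer_id)
customers = [(cid, 'DE' if cid % 100 == 0 else 'US') for cid in range(500)]   # (customer_id, country)

# Plan A: select_country(orders JOIN customers) -- join everything, filter at the end.
joined = [(oid, country) for oid, ocid in orders
                         for cid, country in customers if ocid == cid]        # 10,000,000 comparisons
plan_a = sorted(oid for oid, country in joined if country == 'DE')

# Plan B: orders JOIN select_country(customers) -- push the selection below the join.
de_customers = {cid for cid, country in customers if country == 'DE'}         # only 5 customers survive
plan_b = sorted(oid for oid, ocid in orders if ocid in de_customers)          # 20,000 lookups

assert plan_a == plan_b   # identical result, orders of magnitude less work
```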

If I had something to say about this, it would be that stable, dependable performance has a value of its own. That is why TokyoCabinet was such a revelation and prompted the NoSQL revolution, which, like any revolution, later ended with Mongo and a reaction. But this is not my field, so let's just save it for future reference.

Google outage

It's very funny to hear about people who were unable to turn on their lights because their houses were "smart". Not a good look for Google Nest! But I had a real problem:

The Google outage crashed my Thunderbird so badly that the only fix was to delete ~/.thunderbird and re-add all the accounts.

Yes, really.

Cries of the vanquished

The post at roguelazer's is so juicy from every side that I'd need to quote it whole to do it justice (h/t ~avg). But its ostensible meat is etcd.[1] In it, he builds a narrative of the package being elegant at first and bloating later.

This tool was originally written in 2013 for a ... project called CoreOS. ... etcd was greater than its original use-case. Etcd provided a convenient and simple set of primitives (set a key, get a key, set-only-if-unchanged, watch-for-changes) with a drop-dead simple HTTP API on top of them.

Kubernetes was quickly changed to use etcd as its state store. Thus began the rapid decline of etcd.

... a large number of Xooglers who decided to infect etcd with Google technologies .... Etcd's simple HTTP API was replaced by a "gRPC" version; the simple internal data model was replaced by a dense and non-orthogonal data model with different types for leases, locks, transactions, and plain-old-keys.
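
For readers who never saw it, the v2 API the quote is nostalgic about really was about that small. The sketch below is from my memory of it, so treat the exact parameters and the endpoint address as approximate; it assumes a local etcd with the v2 API enabled.

```python
import requests

# The old etcd v2 HTTP API, roughly as I remember it (treat details as
# approximate). The four primitives the quote lists map to four requests.
ETCD = 'http://127.0.0.1:2379'   # assumed local etcd with the v2 API enabled

# set a key
requests.put(f'{ETCD}/v2/keys/config/flag', data={'value': 'on'})

# get a key
print(requests.get(f'{ETCD}/v2/keys/config/flag').json())

# set-only-if-unchanged (compare-and-swap on the previous value)
requests.put(f'{ETCD}/v2/keys/config/flag',
             params={'prevValue': 'on'}, data={'value': 'off'})

# watch-for-changes (long-polls until the key changes)
print(requests.get(f'{ETCD}/v2/keys/config/flag', params={'wait': 'true'}).json())
```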

Completely omitted from this tale is that etcd was created as a clone of Google's Chubby, which did not use HTTP. The HTTP interface was implemented in etcd for expediency. So, the nostalgic image of early etcd that he's projecting was in fact a primitive early draft.

It's interesting that he only mentions leases and locks in passing, painting them as a late addition, whereas for Chubby the concept of coarse-grained locking was more important than the registry.

[1] Other matters are taken up in the footnotes, at length. You'd think that it would be a simple matter to create a separate post to decry the evils of HTTP/2, but not for this guy! I may write another entry later on the evils of bloat and how sympathetic I am to his cause.

Recruiter spam

Recruitment spam, like conference spam, is a boring part of life. However, it raises an eyebrow sometimes.

A few days ago, a Facebook recruiter, John-Paul "JP" Fenn, sent me a form e-mail at an address that I do not give to anyone. It is only visible as a contact for one of my domains, because the registrar does not believe in privacy. I pondered whether I should offer him a consideration in exchange for an explanation of just where he obtained the address. Purely out of curiosity.

Today, an Amazon recruiter, Jonte, sent a message to an appropriate address. But he did it with the addresses in the To: header, not just the envelope. He used a hosted Exchange of all things, and there were 294 addresses in total. That should give you an idea of just how hard these people work at spamming, and just how disposable I am in their eyes.

It really is pure spam. I think it's likely that JP bought a spam database. He didn't write a Python script that scraped whois information.

I remember a viral story from a few years ago about how one guy got a message from a Google recruiter that combined his LinkedIn interests in amusing ways. It went like this: "We seek people whose strength is Talking Like A Pirate. As for Telling Strangers On The Internet They Were Wrong, that's one of my favorite pastimes as well." You know you have made it when you receive that kind of attention. Maybe one day!

UPDATE 2020-03-02: JP let me go, but 3 more Facebook recruiters attacked me: Faith, Sara, Sandy. I really didn't want to validate their spam database, but I asked their system to unsubscribe me. They didn't make it simple, though. First, I had to follow a link in the recruiter's spam. It didn't carry a cookie, though; it pointed to a stock URL with a form to enter the e-mail address. Then, Facebook sent a message to the given address (thus fully validating the address they harvested). That e-mail contained a link with the desired cookie. When I followed that, I was asked to confirm the unsubscription. Only then did they promise to unsubscribe me.

Seagate and SMR in 2020

Back in 2015, I wrote about Seagate Kinetic and its relation to shingles in Seagate products. Unfortunately, even if Kinetic had been a success, it would only have supported a fraction of workloads, and the rest of Seagate's customers demanded density increases. So, to nobody's surprise, Seagate started including shingles in their general-purpose disk drives, perhaps only for a part of the surface, or coupled with a flash cache; the company was an enthusiastic early adopter of hybrid drives, as a vendor. Journalists are trying to make a story out of it, because caches are only caches, and once you start spilling, the drive slows down to shingle speed. But naturally, Seagate neglected to mention in their documentation just how exactly their drives work. Sacre bleu!
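
As a toy model of why the spill matters: assume a write cache of some size in front of the shingled region, a fast rate while a burst fits in the cache, and a slow rate once it spills. All the numbers here are invented for illustration, not Seagate's specifications.

```python
# Toy model of a drive-managed SMR disk with a CMR/flash write cache:
# bursts land at cache speed, but once the cache fills, the sustained rate
# collapses toward the shingled-media speed. Numbers are invented.

CACHE_GB = 20
CACHE_SPEED = 180     # MB/s while the burst fits in the cache
SHINGLE_SPEED = 30    # MB/s once writes spill to the shingled region

def effective_speed(burst_gb):
    """Average MB/s for a sustained write burst of burst_gb gigabytes."""
    if burst_gb <= CACHE_GB:
        return CACHE_SPEED
    cached_time = CACHE_GB * 1024 / CACHE_SPEED
    spilled_time = (burst_gb - CACHE_GB) * 1024 / SHINGLE_SPEED
    return burst_gb * 1024 / (cached_time + spilled_time)

for gb in (5, 20, 100, 500):
    print(f'{gb:4d} GB burst: {effective_speed(gb):6.1f} MB/s average')
```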