Pete Zaitcev (zaitcev) wrote,

Swift and balance

Swift is on the cusp of getting yet another intricate mechanism that regulates how partitions are placed: the so-called "overload". But new users of Swift still keep asking what weights even are, and now this? I am not entirely sure it's necessary, but here's a simple explanation of why we keep ending up with these attempts at complexity (thanks to John Dickinson on IRC).

Suppose you have a system that spreads your replicas (of partitions) across failure zones equally - say 65,000 partitions each. This works great as long as your zones are about the same size, like a rack. But then one day you buy a new rack with 8TB drives, and suddenly the new zone is several times larger than the others. If you do not adjust anything, it ends up only a quarter full at best.
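Here is that arithmetic spelled out in Python, borrowing the 2TB and 8TB drive sizes from the next paragraph (only the 1:4 capacity ratio matters):

    # Three old zones and one new zone four times larger, each handed the
    # same number of partitions. Illustrative numbers, not real Swift output.
    old_zone_tb, new_zone_tb, parts_per_zone = 2, 8, 65_000
    tb_per_part = old_zone_tb / parts_per_zone        # old zones fill up first
    new_zone_fill = parts_per_zone * tb_per_part / new_zone_tb
    print(f"new zone is {new_zone_fill:.0%} full")    # 25%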

So, fine, we add "weights". Now a zone with weight 100.0 gets twice as many replicas as a zone with weight 50.0. This lets you fill zones better, but it necessarily compromises your dispersion and thus your durability. Suppose you only have 4 racks: three with 2TB drives and one with 8TB drives. Not an unreasonable size for a small cloud. So, you set the weights to 25, 25, 25, 100. With a replication factor of 3, there is still a good probability that the bigger zone ends up with 2 replicas of some partitions [0]. Once that zone goes down, you lose redundancy completely for those partitions.
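To put a rough number on that probability: if each of the 3 replicas were placed independently in proportion to weight (a simplification; footnote [0] explains how the real Swift differs), the 100-weight zone would receive 2 or more replicas of about 61% of partitions. A quick back-of-the-envelope in Python:

    # Chance that the heavy zone gets at least 2 of a partition's 3 replicas,
    # assuming independent weight-proportional placement (not real Swift).
    p = 100 / (25 + 25 + 25 + 100)               # ~0.57 share of each replica
    at_least_two = 3 * p**2 * (1 - p) + p**3     # exactly two, plus all three
    print(f"{at_least_two:.0%}")                 # ~61%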

In the small-cloud example above, if you care about your customers' data, you have to eat the imbalance and underutilization until you retire the 2TB drives [1].

<clayg> torgomatic: well if you have 6 failure domains in a tier but they're sized 10000 10 10 10 10 10 - you're still sorta screwed

My suggestion would be to just ignore all the complexity we thoughtfully provided for the people with "screwed" clusters. Deploy and maintain your cluster in a way that makes placement and replication easy: have a good number of more or less uniform zones that are well aligned with natural failure domains. Everything else is a workaround -- even weights.

P.S. I am kinda wondering how Ceph deals with these issues. It is more automagic about deciding what to store where, but surely there must be good and bad ways to add OSDs.

[0] This is a white lie, because the real Swift does not assign each replica to a partition independently. In an independent assignment to uniform zones, and assuming the example parameters above (replication factor 3 and only 4 zones), only 37% of partitions would be fully durable, and 56% would have 2 replicas in the same zone. With the 25-25-25-100 weights, 23% are fully durable and 57% are 2-aliased. The 3-aliased partitions skyrocket from 6% to 19% (the sketch after footnote [1] reproduces these numbers). The real Swift attempts the "unique as possible" strategy instead. For each partition, it tracks which zones were used at least once and selects the next replica's zone from among the remainder. So the 8TB drives above may end up under-utilized even with non-uniform weights.

[1] Strictly speaking, other options exist. You can delegate to another tier by tying 2 small racks into a zone: yet another layer of Swift's complexity. Or, you could put the new 8TB drives on trays and stuff them into existing nodes. But considering those only muddies the waters.
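To reproduce the percentages in footnote [0], here is a small Python sketch of that independent-assignment model (my own illustration, not Swift code): it enumerates every way 3 replicas can land in 4 zones and sums the probabilities by the number of distinct zones used.

    from itertools import product

    def zone_spread(weights, replicas=3):
        """Fraction of partitions whose replicas land in 3, 2, or 1 distinct
        zones, assuming each replica is placed independently with probability
        proportional to zone weight (the simplified model, not real Swift)."""
        probs = [w / sum(weights) for w in weights]
        spread = {1: 0.0, 2: 0.0, 3: 0.0}
        for zones in product(range(len(weights)), repeat=replicas):
            p = 1.0
            for z in zones:
                p *= probs[z]
            spread[len(set(zones))] += p
        return spread

    for weights in ([1, 1, 1, 1], [25, 25, 25, 100]):
        s = zone_spread(weights)
        print(weights, ", ".join(f"{k}-zone: {s[k]:.1%}" for k in (3, 2, 1)))

Running it prints roughly 37.5% / 56.2% / 6.2% for the uniform zones and 22.7% / 57.7% / 19.5% for the 25-25-25-100 weights, matching the rounded figures above.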

UPDATE: See the changelog for better placement in Swift 2.2.2.
