Pete Zaitcev (zaitcev) wrote,

The "fast-post" is merged into Swift

I'm just back from a hackathon at the HPE site in Bristol, England, where we took a final look at the so-called "fast-post" patch and merged it. It was developed by Alistair Coles, and it basically makes POST work the way everyone expected it to work, at last.

In the original Swift, it was found that when you POST to an object in the presence of failures, it was possible to end up with some nodes having old data but new (posted) attributes. The bad part was that the replication mechanism could do nothing to reconcile the inconsistency, so your GET returned varying data forever, depending on which node you hit. It occurred when the new timestamp from a POST attached itself to old data (and in other equivalent scenarios).
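The failure is easier to see in a toy model. The sketch below is not Swift code: the `Replica` class, the `replicate` function, and the three-node layout are all made up for illustration, and the only assumption carried over from the description above is that each copy has a single timestamp and replication trusts it.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One node's copy of an object, stamped with a single timestamp."""
    timestamp: float
    data: str
    metadata: dict


def replicate(replicas):
    """Toy replication: overwrite any strictly older copy with the newest one."""
    newest = max(replicas.values(), key=lambda r: r.timestamp)
    for name, r in replicas.items():
        if r.timestamp < newest.timestamp:
            replicas[name] = Replica(newest.timestamp, newest.data,
                                     dict(newest.metadata))


# t=1: the first PUT reaches all three nodes.
replicas = {n: Replica(1.0, "old data", {}) for n in ("a", "b", "c")}

# t=2: a second PUT reaches only a and b; node c is down.
for n in ("a", "b"):
    replicas[n] = Replica(2.0, "new data", {})

# t=3: a "fast" POST reaches every node and restamps whatever data is there.
for r in replicas.values():
    r.metadata = {"color": "blue"}
    r.timestamp = 3.0

# Every copy now claims timestamp 3.0, so replication sees nothing to fix.
replicate(replicas)
print({n: r.data for n, r in replicas.items()})
# {'a': 'new data', 'b': 'new data', 'c': 'old data'}
```

Node c keeps serving old data with new attributes, and nothing in the timestamps says it is stale.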

This is something of a fundamental issue with using timestamp-based replication in Swift. Greg and Chuck knew about it all along, and their solution was known as "POST to PUT". They made the Swift proxy fetch the object, update its attributes for the POST, and then essentially do a PUT. This way timestamps, data, and attributes are always consistent, as they are in the initial PUT. If this POST-to-PUT occurs across a failure, replication uses timestamps to restore consistency correctly.

The problem is that POST-to-PUT is slow, as well as deceptive. Users think they are issuing a lightweight POST, but they actually prompt a massive data move inside the cluster if the object is big.
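A rough sketch of the POST-to-PUT idea, again as a toy model rather than the proxy's real code: `post_as_put`, `replicate`, and the dictionary layout are hypothetical names chosen for this example. It shows both halves of the story: the result stays consistent across a failure, and the whole object body gets rewritten for what the user thinks is a metadata-only change.

```python
import copy


def post_as_put(replicas, reachable, new_metadata, timestamp):
    """Hypothetical proxy-side POST implemented as fetch plus full rewrite."""
    # Fetch the newest copy of the object, data included.
    current = max(replicas.values(), key=lambda r: r["timestamp"])
    data = current["data"]
    # Rewrite data and attributes together under one fresh timestamp.
    for name in reachable:
        replicas[name] = {"timestamp": timestamp, "data": data,
                          "metadata": dict(new_metadata)}
    # Bytes rewritten for a metadata-only change.
    return len(data) * len(reachable)


def replicate(replicas):
    """Toy replication: the copy with the newest timestamp wins everywhere."""
    newest = max(replicas.values(), key=lambda r: r["timestamp"])
    for name in replicas:
        replicas[name] = copy.deepcopy(newest)


replicas = {
    "a": {"timestamp": 2.0, "data": "new data", "metadata": {}},
    "b": {"timestamp": 2.0, "data": "new data", "metadata": {}},
    "c": {"timestamp": 1.0, "data": "old data", "metadata": {}},  # missed a PUT
}

# The POST arrives while node b is unreachable.
moved = post_as_put(replicas, reachable=("a", "c"),
                    new_metadata={"color": "blue"}, timestamp=3.0)
replicate(replicas)
print(moved, replicas["b"])
```

Here the timestamp always describes data and attributes together, so newest-wins replication is safe. The price is `moved`, which grows with the object size.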

Alistair's insight was that the root of the problem was not that timestamps were no good as a basic mechanism, but that the "fast" POST broke them by assigning new timestamps to old data (or to old attributes, metadata, Content-Type). As long as each independently settable thing has its own timestamp, there is no problem. In Swift, we have 3 of those: object data, object metadata, and Content-Type (don't ask). So, store 3 timestamps with each object and presto!
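To illustrate why separate timestamps are enough, here is a minimal sketch of per-item newest-wins merging. It is my own toy model, not the patch: `Stamped`, `merge`, and the dictionary keys are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Stamped:
    """A value together with the timestamp of the request that set it."""
    timestamp: float
    value: object


def merge(local, remote):
    """For each independently settable item, keep whichever copy is newer."""
    return {key: max(local[key], remote[key], key=lambda s: s.timestamp)
            for key in local}


# Node a saw both PUTs and the POST.
node_a = {
    "data": Stamped(2.0, "new data"),
    "content_type": Stamped(2.0, "text/plain"),
    "metadata": Stamped(3.0, {"color": "blue"}),
}

# Node c missed the second PUT but received the POST. Its data keeps the
# old timestamp because the POST only stamps the metadata.
node_c = {
    "data": Stamped(1.0, "old data"),
    "content_type": Stamped(1.0, "text/plain"),
    "metadata": Stamped(3.0, {"color": "blue"}),
}

repaired_c = merge(node_c, node_a)
print(repaired_c["data"].value)      # new data
print(repaired_c["metadata"].value)  # {'color': 'blue'}
```

Because the POST no longer touches the data timestamp, node c's data is visibly older than node a's, and replication can repair it.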

The actual patch employs an additional trick by not changing the container DB schema. Instead, it encodes 3 timestamps into 1 field where the timestamp used to live. This way a smooth migration is possible in a cluster where old async pendings still float, for example. It looks a little kludgy at first, but I convinced myself that it made sense under the circumstances.
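One way such a packing could look is sketched below. The format here (a base timestamp followed by two signed hexadecimal deltas) and the names `pack_timestamps` and `unpack_timestamps` are assumptions for illustration, not necessarily what the patch does, and timestamps are simplified to non-negative integers.

```python
import re


def pack_timestamps(data_ts, ctype_ts=None, meta_ts=None):
    """Pack three integer timestamps into one string (hypothetical format)."""
    ctype_ts = data_ts if ctype_ts is None else ctype_ts
    meta_ts = ctype_ts if meta_ts is None else meta_ts
    if data_ts == ctype_ts == meta_ts:
        # Looks exactly like an old single-timestamp value.
        return str(data_ts)
    # Base timestamp, then signed hex deltas: data -> ctype, ctype -> meta.
    return "%d%+x%+x" % (data_ts, ctype_ts - data_ts, meta_ts - ctype_ts)


def unpack_timestamps(encoded):
    """Return (data_ts, ctype_ts, meta_ts) from a packed or plain value."""
    m = re.fullmatch(r"(\d+)(?:([+-][0-9a-f]+)([+-][0-9a-f]+))?", encoded)
    if m is None:
        raise ValueError("bad timestamp field: %r" % encoded)
    data_ts = int(m.group(1))
    if m.group(2) is None:
        # Old-style value: one timestamp covers all three items.
        return data_ts, data_ts, data_ts
    ctype_ts = data_ts + int(m.group(2), 16)
    meta_ts = ctype_ts + int(m.group(3), 16)
    return data_ts, ctype_ts, meta_ts


print(pack_timestamps(1000))               # 1000
print(pack_timestamps(1000, 1010, 1041))   # 1000+a+1f
print(unpack_timestamps("1000+a+1f"))      # (1000, 1010, 1041)
print(unpack_timestamps("1000"))           # (1000, 1000, 1000)
```

The point of a scheme like this is that a plain old-style value is still a valid packed value, so rows and pending updates written before the change keep decoding correctly.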

P.S. Fast-post is not the default, even now. Container Sync needs to be updated to be compatible first. I think Eran was going to look into that.
