You are viewing [info]dga's journal

Previous 10

Dec. 1st, 2011

A brief response to "COPS and PNUTS"

I stumbled across a review by Peter Bailis of one of our SOSP papers today, and I'm worried that we failed to be clear about a few things in the work, so I figured I'd grab the blog soapbox and try to clarify. First is the relationship of COPS [pdf] and prior work. Peter comments:

I'm not sure this advances much of the state of the art beyond, say, Dynamo, which provided a scalable web store with user-specified conflict handling.

The delta from Dynamo is very simple: Dynamo offers no “guaranteed” consistency semantics at all. Operations that touch different keys/tablets can arrive at replicas in arbitrary order. In COPS, that’s not the case - operations can only arrive in an order allowed by causal consistency. The example of updating two keys, one that controls the permissions of an object, and one that modifies the object, is a good way to illustrate this. In Dynamo, the object modification could be applied at a remote datacenter before the corresponding permissions update. In COPS, it could not. PNUTS is stronger than Dynamo, but weaker than COPS - it provides per-record timeline consistency, but provides no consistency guarantees across different keys. Consider three updates by a single thread:

     write A=1, write B=dog, write B=cow, write A=2

In Dynamo, the order at which these updates are applied is unconstrained. In a remote datacenter, read(A), read(A) might return A=2, A=1

In PNUTS, the value of “A” could never go backwards: A=1, A=1, … A=2, A=2, …

But in both Dynamo and PNUTS, the relative values of A and B are unconstrained, and each of these systems could read(A), read(B): {2, dog}. Even though the originating thread set B=cow before it set A=2.

In COPS, this latter read sequence could not happen. The allowable reads are:

   {1, }, {1,dog}, {1,cow}, {2, cow}

In all three systems (in PNUTS, using Read-any), concurrent updates at different datacenters could cause conflicts that invoke conflict handlers, and in this way the three systems are not different - nor do we claim they are.

Causal consistency in this way is not new, and it’s not the contribution of COPS. It’s the same consistency model provided by Bayou and the systems that followed it. The contribution is that COPS does so in a scalable way when your data store is partitioned among many machines in a datacenter. As you shard data onto more and more tablets, those tablets can replicate to their peers in a remote datacenter without all updates having to go through a single point of serialization. The tricky thing in doing so is ensuring that you don't build up an ever-growing pile of metadata in the process, and I think this is COPS's biggest contribution - Wyatt figured out how to do this replication, efficiently identify and garbage collect the dependencies, and so on.

I hope that we didn’t convey the impression that we thought “causal+” is a novel model - in fact, as we explicitly point out in section 3.4, it’s exactly the consistency model provided by Bayou and PRACTI. However, in order to prove that COPS’s design preserved the causal ordering, we had to create a formal definition of causal+. As Peter says, we could have called it “employing commutative merge functions to totally order causally divergent versions”, but that’s a mouthful, so we coined a shorthand. It’s not clear to me that giving names to enhance clarity justifies claiming that the “card-carrying club members” of the SOSP community were “misleading” and “posturing”. :-)

Finally, COPS isn't the solution to every problem under the sun. There are times when you want to sacrifice consistency for availability, and for those times, COPS provides strictly stronger consistency semantics than, e.g., Dynamo and PNUTS, with the same availability. But there are also times when you need something stronger, and for that, you have to use something like Megastore or Walter. There's a reason Amazon's S3 and PNUTS offer the choice of strong or eventual consistency, and nothing can change that - it's a fundamental tradeoff. But COPS makes the "available" choice somewhat easier to use.

Feb. 13th, 2011

Lovely running in the bay area

This week started a bit off with a sore throat (my two highest-ever mileage weeks combined with the SIGCOMM deadline, or coincidence? You decide), but I took it easy for a few days and then flew to the bay area for an ISAT meeting. The meeting was interesting from my side of things (I learned things), but arguably the best treat was getting in 13 mile runs to lovely, hilly trails in the mornings. Run details from Garmin Connect.

From Running

Needless to say, I was happy, though I look weird in this photo - might have been a bit dehydrated or something.

From Running
Sunrise was gorgeous:
From Running

And the view of the south bay from the hills was great:

From Running

The elevation was not the usual Pittsburgh run profile:

Tags:

Oct. 21st, 2010

Running is fun

This is really a preface to a longer post involving mildly surprising disclosures like me deciding to eat mostly vegan for the last 4 months (and counting...), but for now, I'll skip to some of the fun. Below is the "workout" part (not counting another 4 miles of warmup/cooldown) from a lactate threshold workout (eg, "comfortably hard", but decently harder than just normal daily runs -- I usually need a rest day after these) done today and a month ago:
     Date      Mile 1      Mile 2      Mile 3      Mile 4
             Pace  HR    Pace  HR    Pace  HR    Pace  HR
             ---------------------------------------------
 9/16/2010   6:48  148   6:55  153   7:02  155   6:59  156
10/21/2010   6:42  149   6:41  152   6:42  152   6:48  153
In other words, a month ago I ran 4 miles at a 6:56 average pace with an average heart rate of 153 beats per minute; today, I ran 4 miles at a 6:43 average pace with an average heart rate of 151.5. In fairness, it was cold out today, but that's still a lovely improvement for only a month. The relation to the eating habits is that I've never before been able to sustain the mileage I'm running now. Wish I'd figured out earlier that eating better is good for running. :)

Sep. 22nd, 2010

Newest FAWN cluster photos

Our newest cluster is up to about 20 nodes, all with the new pineview Atom chips and fast SSDs. Much fun! I particularly like the use of big elastic bands to hold the SSDs to the motherboards. :)

From CMU

From CMU

From CMU

Sep. 20th, 2010

(no subject)

This is [info]redbeard's fault. And I quote:

Take a picture of yourself, as you are right now, no changing clothes or makeup or especially photoshop, and post it.




Yeah, it's possible I still look like I'm 14. But the older and creakier I get---and the more suspiciously color-free hairs I notice, ahem---the less I mind that too much. :)

May. 4th, 2010

Notes from Google "Flash" talk at UCSD non-volatile memories workshop

I was at the UCSD workshop on Non-Volatile Memories about three weeks ago, and had a surprisingly great time. I say "surprisingly" because I showed up at the reception the first night, realized I didn't know a single person there, and thought "uh-oh." That "uh-oh" turned into "ooh!" the next day -- I learned a surprising amount about the lower levels of contemporary nonvolatile memory technology and met some very cool folks.

Many of the slides from the talks are online (though, as in all things, the hallway conversations were both unrecorded and perhaps as or more useful). But one of the stand-out talks isn't -- Al Borchers talk about Google's experiences with Flash memory. I've jotted down some highlights from the talk that jumped out at me. Caveat: These are filtered through my own interest, and a lot of what really jabbed in my head echos our own experiences with FAWN, and several reinforced things I said in my talk the day before so if something seems odd, it's probably my fault, and not Al's.

Al Borchers is in the platforms group developing system software for Ggogle's server platforms, and has been working on high performance storage devices. Ph.D. in theoretical CS from Minnesota (1996), has been hacking unix and linux device drivers and systems software in industry. Much of the talk he gave involved work with Naveen Upta, Tom Keane, Kyle Nesbit.

Looking at HW devices and how SW could be modified to take advantage of Flash, if necessary: "It has been a rocky experience with flash. We've had difficulties with performance and reliability of devices, and figuring out where we can apply flash in a cost-effective way. Many applications... some obvious some not. Without forcing apps to change too radically.

App trends:

  • High tput large data workloads are seek limited
  • Apps are latency sensitive - 99.9th %ile latency, disk queueing, might see 100ms latency waiting to read off of disk.

Hardware:

  • [Started out by mentioning the high IOPS-DRAM, low IOPS-disk duality]
  • Disk capacity grows faster than seeks, which are basically constant
  • Waste capacity to provide seeks, or share and prioritize. We just buy more and more disks.
  • Flash between RAM and disk in price and performance
  • Flash provides seeks

Application: BigTable [Work from Kyle Nesbit]

  • Key/value store API, large memory footprint for cache, GFS for persistent storage
  • GFS: File API, data stored on disk, replication for fault tolerance

Options: Could use flash as a cache, put it on chunkservers, or on bigtable servers. Looking at it most in the chunk servers.

  • Instrumented Bigtable tablet server
  • Collected traces of Bigtable accesses
  • Ran trace simulations offline about cache size and replacement policy
  • For several Google apps

1) Bigger caches almost always better. 16 .. 512 GB cache went from 450 cache misses/second to 130 cache misses/sec. Linear reduction with exponential increase in cache size.

  • Flash as cache can be very write-intensive
  • Write performance is hugely variable and almost chaotic based on workload.
  • Sequential writes tend to work well, but random writes...
  • Device also wears out, and wear depends on workload.
  • Lifetime affects cost of the solution - how long will it last?
  • LRU: Good hit rate, but bad lifetime/perf with lots of random writes
  • FIFO: Bad hit rate, good lifetime, perf
  • Research opportunity: Replacement policy with good hit rates and good lifetime.
  • Difference between the two evens out as the cache size grows, however... 512GB FIFO nearly as good as LRU.
  • LRU vs FIFO estimated lifetime:
    • FIFO gets 2x the lifetime as LRU based on their simulations. They monitored how many GCs the device was doing.

2) CPU overhead

  • Flash shifts from seek-bound to CPU-bound
  • High IOPS but proportional CPU overhead
  • Storage stack large fraction of application CPU usage
  • CPU overhead depends on h/w interface, driver, block layer, I/O scheduler, file system, NUMA
  • CPU overhead depends on application programming model--direct I/O, async/sync IO, single/multi threaded I/O (most experiments used direct I/O, peforms better. Async better than sync, single thread better than mutli... but apps may feel otherwise!)
  • As CPU overhead, with highest perf devices, we use one core to do I/O, and then we get NUMA effects as things communicate over the system bus.

Al then gave some numbers on the relative performance of, e.g., PCI vs SAS vs SATA drives they'd measured. I didn't write all of these down well enough to be confident reposting them. The gist was that access over PCI incurred less CPU overhead than SAS and SATA. NUMA access - when you had to go through a different core - hurt just as much.

CPU use for async/multithread: At high BW, sync multithreaded model uses 2-3x the CPU. They didn't really see that in SATA because it was limited to 31 outstanding requests by NCQ.

Overhead summary:

  • Block layer: Even raw block device with direct IO can add 2x to 3x CPU overhead of accessing RAM
  • IO scheduler adds 30% overhead, unnecessary disk optimizations and lock contention (one device could cut that out)
  • FS added 39% overhead; metadata writes and cold data hurts perf and lifetime on some filesystems
  • if you just have a few files, you can do a very simple userspace FS...
  • NUMA 20% to 40% more overhead

Interface:

  • IO size constrained by flash page/block sizes
  • Doesn't mean flash has to be a block device in the storage stack
  • New level in storage hierarchy needs a new interface
  • Next gen high speed NAND devices will have higher IOPS
  • Research opportunities:
    • Optimize storage stack for flash performance characteristics
    • Other interfaces for flash -- for example, user space IO

Problem 3: Error rates

  • Read error rate higher than predicted by bit error rate
  • Block, plane, and die failures seem to dominate
  • Errors seem concentrated in a few bad chips
  • (haven't watched the devices to see errors through wearout and aging)
  • Small impact to caching apps -- cache miss
  • Large impact to DB apps -- data loss
  • Looking at RAID, but w/traditional RAID drivers, perf is terrible. Adds another layer of CPU overhead. Looking at optimized RAIDs for flash, would like to see more...
  • research op: fault tolerance in flash storage devices, long term, large scale failure rates and mechanisms.

Q: Which drives did you use?
A: Doesn't matter. Can't say. But all of the devices suffer perf overhead.

Q: Can you comment SLC vs MLC on reliability?
A: Our initial reliability of SLC seemed a little bit better, but we haven't taken them to life and worn them out, but for both we saw a lot of early-life failures.

  • We feel like we're forced to MLC...
  • encouraged that mfgs are talking about enterprise MLC... discouraging looking at commodity flash chips that have shorter and shorter lifetimes

Q: Comment on pci-express as interface?
A: We like it better, it seems to perform better, lower overhead, ...

.. more about high overhead of going through block layer to get to SSDs at high IOPS.

All in all, it was an excellent talk, and shows that Google has been taking a very serious look at Flash in their datacenters. We're seeing a lot of indicators that Flash is poised -- but not completely ready yet -- to start making huge inroads into the DC.

Apr. 26th, 2010

Sauteed onion and candied cashew salad

The result of playing "what the heck do we have around the house to eat with this lettuce?" turned out surprisingly tasty last night. So I'm inflicting it on my blog as a way to write it down. Ingredients are a rough estimate - this was a purely seat-of-the-pants recipe:

1/2 head buttercrunch lettuce
less than 1/4 C crumbled blue cheese.
1/3rd medium onion, sliced
scant 1T olive oil
handful of cashews, broken into halves or quarters
1/2 T granulated sugar, 1/2 T brown sugar
2T water

Dressing:
3T olive oil (I'm totally guessing here)
2t good red wine vinegar
< 2t apple cidar vinegar
2 cloves garlic, minced
salt and pepper to taste. I used about two pinches of kosher salt and .. some pepper.

To make dressing: Combine ingredients. Shake. Let sit for a while.

Onion: Saute onion in a splash (I'm guessing under 1T) of olive oil. Past soft, but not quite to fully caramelized - maybe 15 minutes. Remove from heat and set aside.

Add a few T of water to small nonstick pan, warm, add sugars, bring to gentle simmer. Keep careful eye on and stir frequently. Midway to caramelization, add cashews. Continue to cook until you get a light brown caramel coating the cashews. Be careful: The syrup will caramelize fast, and you want to take it off the heat immediately afterwords.

Remove cashews from heat. Mix with onions. Break the cashews into smaller chunks -- they'll be glued together with the caramel -- but don't burn yourself. The caramel is hot and sticky.

Combine ingredients. Toss with dressing. Enjoy.

Mar. 10th, 2010

NSDI paper - SplitScreen virus scanning

(work-related post ahead)

The camera-ready version of our NSDI paper, SplitScreen: Enabling Efficient, Distributed Malware Detection is now online. I'm mentioning it because it has (to me) a fun story behind it: The project started out with absolutely nothing to do with virus or malware scanning. I like this one as an example of happenstance in research, as I'll explain. And it has a cute trick involving Bloom filters, so what more could you want?

Two years ago, I was chatting with Tom Mitchell about his Read the Web project. Part of RTW involves searching the web to count how many times a phrase occurs, for millions of phrases. In other words, fgrep -f BigPhraseFile web/*. The BigPhraseFile is big. The web is really big.

We wanted to run this search on our FAWN cluster of tiny computers, to see if we could do it better, faster, cheaper, lower-power. But then a problem arose: grep built a 4 gigabyte search trie (using a variant of Aho-Corasick). Our first-generation wimpy nodes had 256MB of memory. We promptly locked up the cluster and had to reboot everything.

Iulian Moraru spent the next while creating a memory-optimized algorithm to solve the problem, and succeeded. The technique, feed-forward bloom filters splits the processing into two steps. The first process imprecisely filters the two inputs (the phrases and the web) to produce two outputs: A smaller set of phrases, which includes all potentially matching phrases and some extras, and a smaller version of the web, which includes all potentially matching lines from the web corpus. We then run a conventional exact-match grep on those smaller files. The result is that it uses a twentieth of the memory. This, in turn, means that the problem can fit in cache better on normal processors. We then applied further techniques to cache-optimize the data structure.

The result works for Tom's read the web problem -- and then David Brumley suggested that we might try applying it to virus scanning. David's student Sang Kil Cha and Iulian integrated the FFBF processing into the open-source ClamAV virus scanner, and managed to (at least) double the speed that it can scan for viruses.

I find this one fun because it illustrates in one paper the very diverse set of inputs it took to reach the end: We had a bunch of memory-constrained nodes, so we were thinking about a particular set of challenges. Tom's RTW project got us hooked on the idea of making grep work on those nodes. David suggested the application to ClamAV that put us on the path towards the NSDI paper. There's something to be said for surrounding yourself with smart people. :)

p.s., we now have a very fast hammer - feed forward bloom filters - if you think you have an appropriate nail. :) Code available upon request. We're also still searching for the right venue in which to publish a paper about the FFBFs themselves - some place that likes algorithm implementation details, minimizing TLB misses, cache lines, and bloom filter math... thoughts?

Dec. 15th, 2009

Internet Thermodynamics

I just had to, with some embarrassment, leave the following explanation in closing out my service ticket with my (patient and wonderful) ISP, Speakeasy.

The context: My internet service has been bouncing up and down every hour and a half for the last day or so. Very frustrating.

I believe I've diagnosed it, and owe you a thanks for putting up with a customer-caused problem. When on the most recent phone call, I accidentally bumped into the power cable for my NAT box, rebooting the machine, which solved the mystery:

A curtain was brushing against the power cord. Over the last few months, the constant bumping had worked the plug loose to the point where it was just barely in. When the furnace would turn on, about every hour and a half, the forced air would cause the curtain to move -- which would glitch the power to the NAT box. After about 20 minutes, the house would be warm again, and the furnace would shut off... only to repeat an hour and a half later.

Thanks for your patience and help with this one. I'm going to go bonk the owner of said NAT box on the head for not looking at the uptime on his machine before crying wolf. I believe you can close the ticket."

Dec. 9th, 2009

A geek's approach to happy teeth

I was going to title this post "hacking teeth", but that has a particularly horrible image associated with it. So, to the point: This is what happens when your Ph.D.-holding geek friend gets worried about his oral health and catches a cold that gives him a few days to do research on the always-reliable Internet. If you want to skip the blather, the bottom line is: About five months ago, I started using a (brushing, rinsing, etc.) regime that its inventor calls, generically, clean white teeth. Before I jumped into it, I spent an embarrassing amount of time reading some of the background research behind it, and came to the conclusion that, in theory, it should work. My dentist visit today was zippy fast, things looked great, and I'm now convinced this regime works. If you want the rationale and links to the underlying research, read on.Full Details )

Previous 10

December 2011

S M T W T F S
    123
45678910
11121314151617
18192021222324
25262728293031

Links

Syndicate

RSS Atom
Powered by LiveJournal.com