Lovely running in the bay area

This week started a bit off with a sore throat (my two highest-ever mileage weeks combined with the SIGCOMM deadline, or coincidence? You decide), but I took it easy for a few days and then flew to the bay area for an ISAT meeting. The meeting was interesting from my side of things (I learned things), but arguably the best treat was getting in 13 mile runs to lovely, hilly trails in the mornings. Run details from Garmin Connect.

From Running

Needless to say, I was happy, though I look weird in this photo - might have been a bit dehydrated or something.

From Running
Sunrise was gorgeous:
From Running

And the view of the south bay from the hills was great:

From Running

The elevation was not the usual Pittsburgh run profile:

Running is fun

This is really a preface to a longer post involving mildly surprising disclosures like me deciding to eat mostly vegan for the last 4 months (and counting...), but for now, I'll skip to some of the fun. Below is the "workout" part (not counting another 4 miles of warmup/cooldown) from a lactate threshold workout (eg, "comfortably hard", but decently harder than just normal daily runs -- I usually need a rest day after these) done today and a month ago:
     Date      Mile 1      Mile 2      Mile 3      Mile 4
             Pace  HR    Pace  HR    Pace  HR    Pace  HR
 9/16/2010   6:48  148   6:55  153   7:02  155   6:59  156
10/21/2010   6:42  149   6:41  152   6:42  152   6:48  153
In other words, a month ago I ran 4 miles at a 6:56 average pace with an average heart rate of 153 beats per minute; today, I ran 4 miles at a 6:43 average pace with an average heart rate of 151.5. In fairness, it was cold out today, but that's still a lovely improvement for only a month. The relation to the eating habits is that I've never before been able to sustain the mileage I'm running now. Wish I'd figured out earlier that eating better is good for running. :)

(no subject)

This is redbeard's fault. And I quote:

Take a picture of yourself, as you are right now, no changing clothes or makeup or especially photoshop, and post it.

Yeah, it's possible I still look like I'm 14. But the older and creakier I get---and the more suspiciously color-free hairs I notice, ahem---the less I mind that too much. :)

Notes from Google "Flash" talk at UCSD non-volatile memories workshop

[This post has moved to my new blogspot blog]

I was at the UCSD workshop on Non-Volatile Memories about three weeks ago, and had a surprisingly great time. I say "surprisingly" because I showed up at the reception the first night, realized I didn't know a single person there, and thought "uh-oh." That "uh-oh" turned into "ooh!" the next day -- I learned a surprising amount about the lower levels of contemporary nonvolatile memory technology and met some very cool folks.

Many of the slides from the talks are online (though, as in all things, the hallway conversations were both unrecorded and perhaps as or more useful). But one of the stand-out talks isn't -- Al Borchers talk about Google's experiences with Flash memory. I've jotted down some highlights from the talk that jumped out at me. Caveat: These are filtered through my own interest, and a lot of what really jabbed in my head echos our own experiences with FAWN, and several reinforced things I said in my talk the day before so if something seems odd, it's probably my fault, and not Al's.

Al Borchers is in the platforms group developing system software for Ggogle's server platforms, and has been working on high performance storage devices. Ph.D. in theoretical CS from Minnesota (1996), has been hacking unix and linux device drivers and systems software in industry. Much of the talk he gave involved work with Naveen Upta, Tom Keane, Kyle Nesbit.

Looking at HW devices and how SW could be modified to take advantage of Flash, if necessary: "It has been a rocky experience with flash. We've had difficulties with performance and reliability of devices, and figuring out where we can apply flash in a cost-effective way. Many applications... some obvious some not. Without forcing apps to change too radically.

App trends:

  • High tput large data workloads are seek limited
  • Apps are latency sensitive - 99.9th %ile latency, disk queueing, might see 100ms latency waiting to read off of disk.


  • [Started out by mentioning the high IOPS-DRAM, low IOPS-disk duality]
  • Disk capacity grows faster than seeks, which are basically constant
  • Waste capacity to provide seeks, or share and prioritize. We just buy more and more disks.
  • Flash between RAM and disk in price and performance
  • Flash provides seeks

Application: BigTable [Work from Kyle Nesbit]

  • Key/value store API, large memory footprint for cache, GFS for persistent storage
  • GFS: File API, data stored on disk, replication for fault tolerance

Options: Could use flash as a cache, put it on chunkservers, or on bigtable servers. Looking at it most in the chunk servers.

  • Instrumented Bigtable tablet server
  • Collected traces of Bigtable accesses
  • Ran trace simulations offline about cache size and replacement policy
  • For several Google apps

1) Bigger caches almost always better. 16 .. 512 GB cache went from 450 cache misses/second to 130 cache misses/sec. Linear reduction with exponential increase in cache size.

  • Flash as cache can be very write-intensive
  • Write performance is hugely variable and almost chaotic based on workload.
  • Sequential writes tend to work well, but random writes...
  • Device also wears out, and wear depends on workload.
  • Lifetime affects cost of the solution - how long will it last?
  • LRU: Good hit rate, but bad lifetime/perf with lots of random writes
  • FIFO: Bad hit rate, good lifetime, perf
  • Research opportunity: Replacement policy with good hit rates and good lifetime.
  • Difference between the two evens out as the cache size grows, however... 512GB FIFO nearly as good as LRU.
  • LRU vs FIFO estimated lifetime:
    • FIFO gets 2x the lifetime as LRU based on their simulations. They monitored how many GCs the device was doing.

2) CPU overhead

  • Flash shifts from seek-bound to CPU-bound
  • High IOPS but proportional CPU overhead
  • Storage stack large fraction of application CPU usage
  • CPU overhead depends on h/w interface, driver, block layer, I/O scheduler, file system, NUMA
  • CPU overhead depends on application programming model--direct I/O, async/sync IO, single/multi threaded I/O (most experiments used direct I/O, peforms better. Async better than sync, single thread better than mutli... but apps may feel otherwise!)
  • As CPU overhead, with highest perf devices, we use one core to do I/O, and then we get NUMA effects as things communicate over the system bus.

Al then gave some numbers on the relative performance of, e.g., PCI vs SAS vs SATA drives they'd measured. I didn't write all of these down well enough to be confident reposting them. The gist was that access over PCI incurred less CPU overhead than SAS and SATA. NUMA access - when you had to go through a different core - hurt just as much.

CPU use for async/multithread: At high BW, sync multithreaded model uses 2-3x the CPU. They didn't really see that in SATA because it was limited to 31 outstanding requests by NCQ.

Overhead summary:

  • Block layer: Even raw block device with direct IO can add 2x to 3x CPU overhead of accessing RAM
  • IO scheduler adds 30% overhead, unnecessary disk optimizations and lock contention (one device could cut that out)
  • FS added 39% overhead; metadata writes and cold data hurts perf and lifetime on some filesystems
  • if you just have a few files, you can do a very simple userspace FS...
  • NUMA 20% to 40% more overhead


  • IO size constrained by flash page/block sizes
  • Doesn't mean flash has to be a block device in the storage stack
  • New level in storage hierarchy needs a new interface
  • Next gen high speed NAND devices will have higher IOPS
  • Research opportunities:
    • Optimize storage stack for flash performance characteristics
    • Other interfaces for flash -- for example, user space IO

Problem 3: Error rates

  • Read error rate higher than predicted by bit error rate
  • Block, plane, and die failures seem to dominate
  • Errors seem concentrated in a few bad chips
  • (haven't watched the devices to see errors through wearout and aging)
  • Small impact to caching apps -- cache miss
  • Large impact to DB apps -- data loss
  • Looking at RAID, but w/traditional RAID drivers, perf is terrible. Adds another layer of CPU overhead. Looking at optimized RAIDs for flash, would like to see more...
  • research op: fault tolerance in flash storage devices, long term, large scale failure rates and mechanisms.

Q: Which drives did you use?
A: Doesn't matter. Can't say. But all of the devices suffer perf overhead.

Q: Can you comment SLC vs MLC on reliability?
A: Our initial reliability of SLC seemed a little bit better, but we haven't taken them to life and worn them out, but for both we saw a lot of early-life failures.

  • We feel like we're forced to MLC...
  • encouraged that mfgs are talking about enterprise MLC... discouraging looking at commodity flash chips that have shorter and shorter lifetimes

Q: Comment on pci-express as interface?
A: We like it better, it seems to perform better, lower overhead, ...

.. more about high overhead of going through block layer to get to SSDs at high IOPS.

All in all, it was an excellent talk, and shows that Google has been taking a very serious look at Flash in their datacenters. We're seeing a lot of indicators that Flash is poised -- but not completely ready yet -- to start making huge inroads into the DC.

Sauteed onion and candied cashew salad

The result of playing "what the heck do we have around the house to eat with this lettuce?" turned out surprisingly tasty last night. So I'm inflicting it on my blog as a way to write it down. Ingredients are a rough estimate - this was a purely seat-of-the-pants recipe:

1/2 head buttercrunch lettuce
less than 1/4 C crumbled blue cheese.
1/3rd medium onion, sliced
scant 1T olive oil
handful of cashews, broken into halves or quarters
1/2 T granulated sugar, 1/2 T brown sugar
2T water

3T olive oil (I'm totally guessing here)
2t good red wine vinegar
< 2t apple cidar vinegar
2 cloves garlic, minced
salt and pepper to taste. I used about two pinches of kosher salt and .. some pepper.

To make dressing: Combine ingredients. Shake. Let sit for a while.

Onion: Saute onion in a splash (I'm guessing under 1T) of olive oil. Past soft, but not quite to fully caramelized - maybe 15 minutes. Remove from heat and set aside.

Add a few T of water to small nonstick pan, warm, add sugars, bring to gentle simmer. Keep careful eye on and stir frequently. Midway to caramelization, add cashews. Continue to cook until you get a light brown caramel coating the cashews. Be careful: The syrup will caramelize fast, and you want to take it off the heat immediately afterwords.

Remove cashews from heat. Mix with onions. Break the cashews into smaller chunks -- they'll be glued together with the caramel -- but don't burn yourself. The caramel is hot and sticky.

Combine ingredients. Toss with dressing. Enjoy.

Internet Thermodynamics

I just had to, with some embarrassment, leave the following explanation in closing out my service ticket with my (patient and wonderful) ISP, Speakeasy.

The context: My internet service has been bouncing up and down every hour and a half for the last day or so. Very frustrating.

I believe I've diagnosed it, and owe you a thanks for putting up with a customer-caused problem. When on the most recent phone call, I accidentally bumped into the power cable for my NAT box, rebooting the machine, which solved the mystery:

A curtain was brushing against the power cord. Over the last few months, the constant bumping had worked the plug loose to the point where it was just barely in. When the furnace would turn on, about every hour and a half, the forced air would cause the curtain to move -- which would glitch the power to the NAT box. After about 20 minutes, the house would be warm again, and the furnace would shut off... only to repeat an hour and a half later.

Thanks for your patience and help with this one. I'm going to go bonk the owner of said NAT box on the head for not looking at the uptime on his machine before crying wolf. I believe you can close the ticket."