Tuesday, August 28, 2007

Jim Gray et al. on disks and scan times

Here are a couple of highlights from Jim Gray and Prashant Shenoy's 1999 paper "Rules of Thumb in Data Engineering", with approximate updates for 2007.

Two key parameters for disk storage are
  • Price: 1994: $42K/TB. Predicted for 2004: $1K/TB. Seagate currently offers a 500GB drive which can be had for $180, or $0.36K/TB. This isn't the bleeding edge. Seagate is announcing a 1TB drive, and I haven't done anything like a thorough search across all manufacturers.
  • Scan time (time required to read every byte on a disk or other medium): Disks have been getting faster, but they've been getting bigger faster than they've been getting faster. In 1999 a typical 70GB drive with a transfer rate of 25MB/s would scan in about 45 minutes. The paper predicts 500GB, 75MB/s and 2 hours for 2004. The Seagate 500GB drive can sustain 72MB/s, which gives a scan time of just under 2 hours.
The price trend is essentially Moore's law applied to storage. The main lesson, as with most hardware, is don't buy any more than you have to. It'll be cheaper tomorrow.
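
The price and scan-time figures above are easy to check by hand, but here is a small sketch of the arithmetic anyway. The function names are mine, not the paper's, and I'm assuming decimal units (1TB = 1000GB, 1GB = 1000MB), which is how drive makers count.

```python
def price_per_tb_kusd(price_usd, capacity_gb):
    """Price in thousands of dollars per terabyte, assuming 1 TB = 1000 GB."""
    # (price / 1000) / (capacity_gb / 1000) simplifies to price / capacity_gb
    return price_usd / capacity_gb


def scan_time_minutes(capacity_gb, rate_mb_per_s):
    """Minutes needed to read every byte at a sustained transfer rate."""
    seconds = capacity_gb * 1000 / rate_mb_per_s  # assuming 1 GB = 1000 MB
    return seconds / 60


if __name__ == "__main__":
    print(price_per_tb_kusd(180, 500))   # the $180, 500GB drive
    print(scan_time_minutes(70, 25))     # 1999 drive
    print(scan_time_minutes(500, 75))    # the paper's 2004 prediction
    print(scan_time_minutes(500, 72))    # the 2007 drive
```

The 1999 drive comes out at a bit under 47 minutes, and both 500GB cases land a little under two hours.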

Increasing scan time has more subtle but crucial effects. We're used to thinking of disks as random-access devices (at least in comparison to, say, tapes). That's why we use them for virtual memory. But they're actually becoming more like tapes and less like RAM. Random access on a disk takes seek time and rotation time. Sequential access just takes transfer time. Seek time and rotation time are becoming more and more expensive relative to transfer.
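
To get a feel for the gap, here is a rough sketch comparing the time to read the same amount of data sequentially and as scattered small reads. Only the 72MB/s transfer rate comes from the figures above. The 8ms average seek, 7200 RPM spindle speed and 8KB read size are assumptions I picked as plausible for illustration, not numbers from the paper.

```python
def sequential_seconds(total_mb, rate_mb_per_s):
    """Sequential access only pays transfer time."""
    return total_mb / rate_mb_per_s


def random_seconds(n_reads, read_kb, seek_ms, rpm, rate_mb_per_s):
    """Each random read pays seek + average rotational delay + transfer."""
    rotation_ms = 0.5 * 60_000 / rpm          # on average, half a revolution
    transfer_ms = read_kb / rate_mb_per_s     # KB / (MB/s) comes out in ms
    return n_reads * (seek_ms + rotation_ms + transfer_ms) / 1000


if __name__ == "__main__":
    # Assumed (hypothetical) drive parameters, except the 72 MB/s rate.
    SEEK_MS, RPM, RATE, READ_KB = 8.0, 7200, 72.0, 8
    total_mb = 1000
    n_reads = total_mb * 1000 // READ_KB      # same data as 8KB random reads
    seq = sequential_seconds(total_mb, RATE)
    rnd = random_seconds(n_reads, READ_KB, SEEK_MS, RPM, RATE)
    print(f"sequential: {seq:.1f}s  random: {rnd:.1f}s  ratio: {rnd / seq:.0f}x")
```

Under these assumptions the random reads are roughly two orders of magnitude slower, and almost all of that time is seek and rotation rather than transfer.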

This has a whole host of implications. Some that Gray and Shenoy mention:
  • Mirroring makes more sense for RAID performance than parity. With mirrors you can spread read accesses out across multiple copies, clawing back some of the lost random access performance.
  • Mirroring also makes more sense for backup. Gray and Shenoy look at tape backup and conclude that tape storage will soon (i.e., now) be purely archival. It just takes too long to scan through all the data on tape. They don't look at CD/DVD, but 500GB of disk is about 60 dual-layer DVDs (neglecting compression). Better just to keep multiple copies online.
  • Log-structured file systems will make more and more sense for general use (and were already prevalent in high-performance database systems in 1999). This dovetails with the "change by adding" viewpoint of wikis, version control systems and such.
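
The "change by adding" idea behind the last point can be sketched in a few lines. This is a toy in-memory illustration of an append-only log, not how any real log-structured file system is implemented. The class and method names are made up for the example, and a Python list stands in for sequential writes to disk.

```python
class AppendOnlyLog:
    """Toy append-only store: updates add records, nothing is overwritten."""

    def __init__(self):
        self.records = []  # stands in for sequential writes at the end of a log
        self.index = {}    # key -> position of the most recent record

    def put(self, key, value):
        # An update is just another append, so the write pattern is sequential.
        self.records.append((key, value))
        self.index[key] = len(self.records) - 1

    def get(self, key):
        # The index points at the latest version of the key.
        return self.records[self.index[key]][1]

    def history(self, key):
        # Older versions are still in the log, as in a wiki or version control.
        return [value for k, value in self.records if k == key]


if __name__ == "__main__":
    log = AppendOnlyLog()
    log.put("page", "first draft")
    log.put("page", "second draft")
    print(log.get("page"))
    print(log.history("page"))
```

Writes only ever go to the end of the log, which is the access pattern disks are still good at, and the old versions come along for free.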
These effects are more visible behind the scenes than on the web at large. When we factor in CPU and network performance, the results are more directly visible. I'll get to that ...

1 comment:

David Hull said...

Note to self: did I get to that?