David Hardtke's Blog

Data rates around the web


A few weeks ago I was at the Twitter Developer Conference. On the hack day there were many impressive presentations about the tools that Twitter has developed to manage all of the data going in and out of Twitter. Twitter is moving its back-end data store over to Cassandra. They threw out some impressive numbers: 50 million tweets per day and 600 million searches per day. After the hack day, I had dinner with a friend from Twitter (@jeanpaul) and we discussed the raw data volumes that they have to deal with.

My benchmark for "big data" is the STAR Experiment at RHIC. I worked on STAR from 1997 to 2003, and at that point I believe it was the largest-volume data producer in existence. The raw data rates were enormous (a gigabyte or so per second), but it was fairly easy to compress that to 100 MB/s using electronics. At the end of the day, we had to put everything on tape, and with the technologies available at the time, about 20 MB/s was the maximum you could record to tape.

Today, of course, nobody uses tape for these sorts of problems. Tape is the same price as it was 10 years ago, but disk is about 1000 times cheaper. One would assume, then, that people are recording data at much higher rates than the physicists were 10 years ago. It turns out that for human-generated data, the rates are not as high as one might think. I compiled the following numbers from various places. They count only the data that needs to be archived: when Ashton Kutcher sends a 4 kB tweet it causes 20 GB of bandwidth to be used, but only the 4 kB tweet needs to be saved.
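As a rough check on the tweet example, the gap between bandwidth and storage is just a fan-out factor. The sketch below assumes decimal units (1 kB = 1,000 bytes, 1 GB = 10^9 bytes) and that every recipient receives one full copy of the tweet; the function name is hypothetical, and only the 4 kB and 20 GB figures come from the text.

```python
# Fan-out arithmetic for the tweet example.
# Assumptions (not from the post): decimal units, one full copy per recipient.
TWEET_BYTES = 4_000                # the 4 kB tweet
FANOUT_BYTES = 20_000_000_000      # the 20 GB of bandwidth


def implied_recipients(fanout_bytes: int, item_bytes: int) -> int:
    """Number of full copies that would account for the delivered bandwidth."""
    return fanout_bytes // item_bytes


recipients = implied_recipients(FANOUT_BYTES, TWEET_BYTES)
print(f"Implied recipients: {recipients:,}")
print(f"Bytes delivered per byte stored: {FANOUT_BYTES // TWEET_BYTES:,}")
```

Under those assumptions the two figures imply roughly five million deliveries of a single stored 4 kB item, which is why the archival rate is so much smaller than the delivery bandwidth.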

Source                     Rate        Data to Storage
Twitter                    700/s       2 MB/s
Facebook Status Updates    600/s       2 MB/s
Facebook Photos            400/s       40 MB/s
Google Search Queries      34,000/s    30 MB/s
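Dividing each storage rate by its item rate gives the average stored size per item that these numbers imply. This is only a back-of-the-envelope sketch: it assumes decimal megabytes (1 MB = 10^6 bytes), and the dictionary and function names are made up for illustration.

```python
# Implied average stored size per item, from the table's two columns.
# Assumption (not from the post): 1 MB = 1,000,000 bytes.
RATES = {
    # source: (items per second, MB per second to storage)
    "Twitter": (700, 2),
    "Facebook Status Updates": (600, 2),
    "Facebook Photos": (400, 40),
    "Google Search Queries": (34_000, 30),
}


def bytes_per_item(items_per_s: float, mb_per_s: float) -> float:
    """Average bytes archived per item at the given rates."""
    return mb_per_s * 1_000_000 / items_per_s


for source, (items_per_s, mb_per_s) in RATES.items():
    size = bytes_per_item(items_per_s, mb_per_s)
    print(f"{source:<25} {size:>10,.0f} bytes/item")
```

The text sources come out at a few kilobytes per item or less, while a photo is about 100 kB, which is why photos dominate the storage column despite having the lowest item rate.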

All of this content, except for the Facebook photos, is humans typing at a keyboard. We see something interesting: human-generated unique content, integrated over all humanity, is not a very difficult data problem. Everything we generate is of order 100 MB/s, or perhaps 1 GB/s if we include emails and SMS.
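To put the "of order 100 MB/s" estimate in concrete terms, the four storage rates from the table can be summed and converted to a daily volume. The sketch assumes decimal units and a constant rate over a full day; the variable names are illustrative, and the 20 MB/s comparison uses the STAR tape limit quoted earlier in the post.

```python
# Sum of the table's storage rates, converted to a daily volume.
# Assumptions (not from the post): decimal units, constant rate all day.
STORAGE_MB_PER_S = {
    "Twitter": 2,
    "Facebook Status Updates": 2,
    "Facebook Photos": 40,
    "Google Search Queries": 30,
}
STAR_TAPE_MB_PER_S = 20      # tape limit quoted for STAR
SECONDS_PER_DAY = 86_400

total_mb_per_s = sum(STORAGE_MB_PER_S.values())
tb_per_day = total_mb_per_s * SECONDS_PER_DAY / 1_000_000   # MB -> TB

print(f"Total: {total_mb_per_s} MB/s")
print(f"Daily volume: {tb_per_day:.1f} TB/day")
print(f"Ratio to STAR tape limit: {total_mb_per_s / STAR_TAPE_MB_PER_S:.1f}x")
```

The four sources together come to 74 MB/s, a few terabytes per day, and within a factor of four of what a single physics experiment was writing to tape a decade earlier.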

 
 
 
 
 

     
    © Stinky Teddy