David Hardtke's Blog

 

"How to Save the News" by James Fallows


Yesterday I read an excellent article by James Fallows, How to Save the News, in The Atlantic. The article discussed how Google is working to help news organizations monetize their content in the online age. It prompted me to submit the following letter to the editor:

The ability to quickly find quality content is the primary reason that people use Google, and it is good to see that Google is committed to helping professional journalism survive ("How to Save the News" by James Fallows). Journalists provide a large share of the material that draws people to the search engine, both Google News and the main search page. Consumers know that Google (and other search engines like Bing and Yahoo!) provides easy one-stop access to the information they seek on a large variety of topics. In exchange for this convenience, consumers allow Google to run sponsored links above the search results for the small fraction of search queries that are commercially valuable. A good analogy is commercial radio. Radio stations aggregate quality content that consumers want (songs from various artists) in exchange for the right to subject the consumer to the occasional advertisement. The radio versus search engine analogy breaks down, however, when we note that those who create the content that attracts consumers to the radio stations, the songwriters, are compensated for their creations in the form of performance royalties paid by the radio station. The fact that such a performance royalty agreement does not exist between search engines and professional journalists is an accident of history, and need not be the case in the future. Google and other search engines should compensate content providers directly for the right to use their creations.

In order to create such a system, journalists need to realize that their creations are largely interchangeable from the perspective of Google or the average web surfer. If an individual news provider starts charging in some way, consumers and Google will simply move their attention to free news sites. This is why micro-payments are not the answer. Instead, news organizations need to bargain collectively and require that Google and other search engines pay for the right to index their content. If a large fraction of news organizations were to simultaneously remove their content from Google, it would seriously impact the quality of the Google product (and give Bing a huge advantage if they were to pay for that body of content). In a performance royalty system, news organizations would be paid a small fixed fee each time Google linked to their articles, similar to the way songwriters are compensated via ASCAP when their songs are played on the radio. Google will argue that the legal concept of fair use allows them to aggregate short portions of content from copyrighted sources and therefore prevents such a system from being enforceable, but it is not clear that fair use applies (the creation of the search index requires that an entire copy of the document be stored on the Google servers). Additionally, journalists cannot be forced to participate, just as songwriters need not join ASCAP.

Clearly, such a system will not be invented by Google as it is bad for their bottom line. Nonetheless, a performance royalty system would fairly compensate journalists for the value they provide both Google and consumers.

 
 
 
 

Data rates around the web


A few weeks ago I was at the Twitter Developer Conference. On the hack day there were many impressive presentations about the tools that Twitter has developed to manage all of the data going in and out of Twitter. Twitter is moving their back-end data store over to Cassandra. They threw out some impressive numbers -- 50 million tweets per day and 600 million searches per day. After the hack day, I had dinner with a friend from Twitter (@jeanpaul) and we were discussing the raw data volumes that they have to deal with.

My benchmark for "big data" is the STAR Experiment at RHIC. I worked on STAR from 1997-2003, and at that point I believe it was the largest volume data producer in existence. The raw data rates were enormous (a gigabyte or so per second) but it was fairly easy to compress that to 100 MB/s using electronics. At the end of the day, we had to put everything on tape, and the limit at the time was about 20 MB/s to tape. Using the technologies available at the time, 20 MB/s was the maximum you could record.

Today, of course, nobody uses tape for these sorts of problems. Tape is the same price as it was 10 years ago but disk is about 1000 times cheaper. One would assume then that people are recording data at much higher rates than the physicists were 10 years ago. It turns out that for human-generated data, the data rates are not as high as one might think. I compiled the following numbers from various places. This is data that needs to be archived -- when Ashton Kutcher sends a 4 kB tweet it causes 20 GB of bandwidth to be used, but only the 4 kB tweet needs to be saved.

Source                    Rate        Data to Storage
Twitter                   700/s       2 MB/s
Facebook Status Updates   600/s       2 MB/s
Facebook Photos           400/s       40 MB/s
Google Search Queries     34,000/s    30 MB/s

All of this content is humans typing at a keyboard (except for the Facebook photos). We see something interesting -- human-generated unique content, integrated over all humanity, is not a very difficult data problem. Everything we generate is of order 100 MB/s, or perhaps 1 GB/s if we include emails and SMS.
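As a rough sanity check on the table, the arithmetic can be written out in a few lines of Python. This is only a sketch: the function names are mine, the figures are copied from the table above, and I assume 1 MB means 10**6 bytes.

```python
# Figures copied from the table above: (items per second, MB per second to storage).
SOURCES = {
    "Twitter": (700, 2),
    "Facebook Status Updates": (600, 2),
    "Facebook Photos": (400, 40),
    "Google Search Queries": (34_000, 30),
}


def bytes_per_item(items_per_s, mb_per_s):
    """Average stored size of one item in bytes (assumes 1 MB = 10**6 bytes)."""
    return mb_per_s * 1_000_000 / items_per_s


def total_mb_per_s(sources):
    """Sum of the storage rates over all sources, in MB/s."""
    return sum(mb for _, mb in sources.values())


def per_second(per_day):
    """Convert a daily count into an average per-second rate."""
    return per_day / 86_400


if __name__ == "__main__":
    for name, (rate, mb) in SOURCES.items():
        print(f"{name}: about {bytes_per_item(rate, mb):,.0f} bytes per item")
    print(f"Total to storage: {total_mb_per_s(SOURCES)} MB/s")
    print(f"50 million tweets/day is about {per_second(50_000_000):.0f} tweets/s")
```

The four sources add up to 74 MB/s, which is where the "of order 100 MB/s" estimate comes from, and the 50 million tweets per day quoted at the conference works out to a bit under 600 per second on average, in line with the 700/s in the table.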

 
 
 
 

Internet Explorer 8 Goodies


This week Microsoft approved two applications that integrate Stinky Teddy's Gossip Powered search directly into your browser and posted them in the Internet Explorer Add-ons Gallery. These tools were built to take advantage of some great features that Microsoft added to Internet Explorer 8. Browsers are becoming like smart phones where the actual phone is not as important as the apps that are available (in the browser world, "apps" are known as "add-ons"). Mozilla's Firefox is the king of the add-on business. Firefox was built as a lightweight shell that could be customized by the user. There are more than 10,000 add-ons in the Mozilla add-on gallery. Google's Chrome has recently enabled third-party add-ons and many Mozilla developers have ported their applications.

Add-ons and toolbars have long existed for Internet Explorer, but there has been a fundamental barrier to their widespread adoption -- the tools used to build add-ons and toolbars for Internet Explorer are also used by hackers to steal your information and infect your computer. Installing add-ons required that you install system software on your computer, and once you hit that button you were at the mercy of the software developer. Often they enticed you to hit the button by offering something useful like smiley face emoticons or access to games. Mozilla's Firefox built a sandboxing mechanism that keeps the add-ons separate from the operating system. Mozilla also has a good system of community policing that keeps the Mozilla community safe from malicious hackers.

Internet Explorer is the default browser for most users, so there has always been a desire to bring add-on features to Internet Explorer without requiring the user to install potentially malicious software on their computer. Enter Internet Explorer 8, with the concept of the Accelerator. Accelerators allow developers to interact with web pages that are rendered in your browser. The applications are completely sandboxed in the browser, and are only activated when you explicitly call for them. Hence, they are safe to install and use.

The Stinky Teddy Abracadabra Search Accelerator allows you to launch a search directly from a web page, either by highlighting terms on the page or by simply right-clicking and selecting our accelerator. A little search preview box will pop up, so in many cases you can navigate directly to the page you are looking for. What I've described is pretty standard, but we've added a special ingredient. The Stinky Teddy Abracadabra Search Accelerator uses the page you are currently visiting as context for your search. The concepts on the page are used as a frame of reference that guides us when we decide which search results to show you. The word "base" means different things if you are on a page about baseball or a page about furniture. Where you are helps us to know where you want to go. Although this idea is obvious, no other search engine uses this information. To be clear, we aren't tracking you -- all we use is your current screen to provide context. We don't save any information about you.
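To make the "base" example concrete, here is a toy sketch of context-aware ranking in Python. It is not the Stinky Teddy implementation: the word-overlap scoring, the function names, and the sample pages and results are all made up for illustration.

```python
import re

# Hypothetical search results for the query "base".
RESULTS = [
    "First base and the rules of the infield",
    "Solid oak table base with steel legs",
]

# Hypothetical text of the page the user is reading when they search.
BASEBALL_PAGE = "The pitcher threw to first and the runner was out in the infield"
FURNITURE_PAGE = "Shop oak dining tables, chairs and steel bed frames"


def tokenize(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def rerank(results, page_text):
    """Order results by how many words they share with the current page."""
    context = tokenize(page_text)
    return sorted(results, key=lambda r: len(tokenize(r) & context), reverse=True)


if __name__ == "__main__":
    print(rerank(RESULTS, BASEBALL_PAGE)[0])
    print(rerank(RESULTS, FURNITURE_PAGE)[0])
```

With the baseball page as context the infield result comes out on top, and with the furniture page the table result does. A real system would use something better than raw word overlap, but the point is the same: only the text of the current page is needed, and nothing about the user has to be stored.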

A second cool feature added to Internet Explorer 8 is Visual Search Suggestions. Firefox allows for search suggestions in a limited fashion (one line of text). After you install our Search Box Plugin, we show you a preview of the search page as you type in your search query. Most search providers show query suggestions -- we show the search page. The search page preview we show has most of our usual content types (web, video, real-time, Twitter, news), and the "buzzing" content is shown first. We wonder why other search engines don't show you search results as you type, and we suspect the answer is that this is a case where the business of search gets in the way of the user experience. The business of search is to show sponsored links above the search results. Search engines want you to go to their page, even when that step is unnecessary. Direct navigation from the search box makes more sense to us.

Check out our Internet Explorer 8 goodies and let us know what you think.

Visual Search Bar Plugin:


Accelerator:


 
 
 
 

The French Trail


I'm currently training for the upcoming Oakland Marathon. Last weekend I needed to go for a long run in spite of a steady rain. I decided to do the French Trail -- if there is any trail that will make a run good on a miserable day it is the French Trail. Here's the GPS. I was surprised to see several other runners out that day, all covered in mud, and all enjoying the rainy day in the forest.

For those of you not familiar with the French Trail, it's located in Redwood Regional Park. The park has 4 major north-south trails that are excellent for running (East Ridge Trail, Stream Trail, West Ridge Trail, and French Trail). The French Trail is by far the most difficult, but also the best. I think it is the best running trail in the East Bay. The French Trail hugs the eastern side of the mountain that forms the Oakland skyline. Because of this geography, it is able to support the few remaining redwood groves in the area (these are second-generation trees -- all of the old growth was cut down to rebuild San Francisco in 1906). The trail has several micro-climates with bits of chaparral interspersed with redwood rain forest.

The trail is quite difficult to reach. You need to hike about a mile to the trailhead (park at the Skyline Gate, and take the West Ridge Trail). Reaching the best part of the trail (between the Tres Sendas and Chown Trails) requires a several-mile hike or run from that parking lot. The best way to access this part of the trail is to park at the Redwood Bowl parking lot, and take the West Ridge Trail to either Tres Sendas or Chown.

 
 
 
 

A study of web traffic from blogs and social media


Based on anonymous usage data collected at Stinky Teddy we have written a paper entitled A Measurement of the Social Media Impulse Response Function. This paper is of interest to Internet startups, public relations professionals, and anyone interested in the new web-blog-social media information sharing ecosystem.

Like most alternative search engines, Stinky Teddy doesn't get much traffic. On an average day we get a few hundred searches on our site (Google handles about 1 billion searches per day worldwide). It doesn't help that our advertising, marketing, and public relations budget is $0. This is not strictly true - we once spent $40 on a Facebook advertising campaign, but that experience warrants a separate blog post.

We do, however, get an occasional surge of traffic. Somebody writes an article about us on their blog and we get a bunch of people checking out the site. Our first surge of traffic came in October, when Frederic Lardinois wrote a short piece on ReadWriteWeb entitled Stinky Teddy: A Cool Real-Time Search Engine with a Rather Odd Name. We didn't know the article was coming, and only noticed that it had been posted when our site crashed (we had a memory leak, since fixed). Before this ReadWriteWeb article we got no traffic whatsoever as we had not yet released the product.

Being a scientist, I couldn't help but use this ReadWriteWeb post as a chance to do an interesting study. The Internet Entrepreneur's dream scenario is the following:

  1. Build Great Product in Secrecy
  2. Using PR, generate massive news coverage on day of launch
  3. Go viral, with peer-to-peer messaging on social media leading to massive adoption.

This scenario never works for new search engines. Nonetheless, there are a bunch of people out there willing to try a new search engine, and positive news coverage is the way to get on their radar screen.

When it comes to planning for this glorious launch, however, there is one question that the Internet Entrepreneur wants answered that nobody will tell them: How much traffic will I get, and how long will it last? The study I performed using the ReadWriteWeb post addresses the "how long" question.

The basic premise behind the study was that this single ReadWriteWeb post was singularly responsible for all traffic on Stinky Teddy for the next month. Our traffic before was nil, and we did absolutely no marketing or PR during this period. Therefore, any visitor or user on our site during that period was directly or indirectly related to the ReadWriteWeb post. For the first time, we were able to measure the "Impulse Response Function" of the web-blog-social media ecosystem. The "Impulse Response" is how a system responds to a sharp input signal (for a detailed discussion, read the paper). In this study, we measured the hourly/daily traffic on our site. That is the data we need to determine the impulse response. Here's the traffic in the 100 hours after publication:

This shows an interesting two-peak structure to the traffic. The first peak is obviously direct traffic from the ReadWriteWeb blog. We suspect the second peak is due to social media (e.g. Twitter sharing) and news readers (Google Reader, Netvibes, etc.). The second peak corresponds to 9 AM on the East Coast of the United States, so these are people checking yesterday's news when they arrive at work the next morning. We also looked at traffic on Stinky Teddy for the next 25 days:

Here we see something very interesting. In the web-blog-social media ecosystem stories "ring" for a long time. Half the traffic attributable to the ReadWriteWeb article came more than 4 days after the article. Only 10% came during that initial 5-hour burst from the ReadWriteWeb page.
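For anyone who wants to pull the same two numbers out of their own logs, here is a minimal Python sketch. The function names are mine and the visit counts in the example are toy values, not our data; the sketch assumes one count per hour, starting at publication, with at least one visit in total.

```python
def hours_to_fraction(hourly_visits, fraction):
    """First hour (1-based) by which the cumulative share of visits reaches `fraction`."""
    total = sum(hourly_visits)
    running = 0
    for hour, visits in enumerate(hourly_visits, start=1):
        running += visits
        if running >= fraction * total:
            return hour
    return len(hourly_visits)


def share_in_first(hourly_visits, hours):
    """Fraction of all visits that arrived within the first `hours` hours."""
    return sum(hourly_visits[:hours]) / sum(hourly_visits)


if __name__ == "__main__":
    # Toy counts for illustration only, NOT the Stinky Teddy measurements.
    toy = [40, 30, 10, 5, 5, 5, 5]
    print(hours_to_fraction(toy, 0.5))
    print(share_in_first(toy, 1))
```

Applied to a month of hourly counts, `hours_to_fraction(counts, 0.5)` gives the "half the traffic" time and `share_in_first(counts, 5)` gives the share that arrived in the initial burst.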

This is a one-time-only experiment. We've had several other momentary spikes in traffic, but only for this period in October through November could we definitively attribute all of the traffic back to a single source. It would be interesting if others repeated this study to see if what we observe is universal. Our main findings are:

  1. Only 10% of traffic eventually generated by the blog post came via early direct clickthroughs from the ReadWriteWeb home page.
  2. There is a two-peak structure in the traffic during the first 24 hours, with the second peak likely associated with "first thing in the morning" readers of yesterday's news through social media sharing or readers.
  3. Half of the traffic (from both direct and indirect sources) came four or more days after the article was posted.

Please read the paper and leave your thoughts below.

 
 
 
 
 

« July 2010
    © Stinky Teddy