David Hardtke's Blog


The Real-Time Web is Dying (if it was ever alive)

When I started working on Stinky Teddy in 2009, there was much activity in the real-time web search space. Real-time search, in a nutshell, aims to surface digital content that humans are posting as soon as it is posted. Twitter Search is the leader in the field, but they only index results from Twitter updates. A few brave startups tried to index wider swaths of the real-time web, including RSS feeds, blog posts, blog comments, and other social networks (identi.ca, for instance). Notable amongst these startups were Scoopler, OneRiot, and Collecta. Stinky Teddy is a metasearch engine that uses real-time conversations to figure out what content (new or old, text, image or video) might be important to a web searcher at the moment of their search. The real-time search engines do the hard work for me by indexing these conversations and presenting the data to me for contextual analysis.

Sadly, today is the final day for the Collecta API, so I'm entirely dependent on Twitter (for now) to figure out the pulse of the web. OneRiot shut down in October, and the Scoopler team has moved on to celebrity spotting. The problem, of course, is that nobody has found a sustainable business model around indexing and searching the public portions of the real-time web. There is a market for this sort of data. Gnip is now in the business of collecting and selling real-time information, including some of the stuff sitting behind the dreaded nofollow wall (Twitter, Facebook, YouTube). I'm still hoping that at some point Bing adds their social feed to their public APIs.

Recently, Wired ran the provocative article "The Web is Dead," which explained how the open web was giving way to walled gardens and apps. Unfortunately, the real-time web, in the sense of a "web" that is both open and searchable, never made it out of its infancy.


OneRiot, R.I.P.

Stinky Teddy is a meta-search engine that utilizes search feeds from multiple search providers. The glue that brings it all together is the so-called real-time search engines. Using data from these real-time search engines I can figure out how people are using words and phrases in their current online conversations.

Twitter is an invaluable resource for this real-time data as it is the largest available source for real-time user-generated content. I highlight the word available because Facebook does not provide access to their users' real-time status updates. The one drawback of Twitter is that the contributors (Tweeters) are a highly biased sample of humanity, mostly techies and narcissists. Normal people don't tweet -- most average Joes and Janes on Twitter don't have enough real followers to justify the effort it takes to send out a Tweet (do you talk out loud when nobody is around to hear you?). Twitter by itself generally does not allow me to figure out what is important to "real" people.

So it was with great sadness that I read this week about the discontinuation of the OneRiot real-time search API. Technically they are launching their new Advertising Network, but hidden somewhere in the announcement is the demise of the search engine and search API.

I've been using the OneRiot search API for over a year now, but today I deactivated it on Stinky Teddy (the search API will be turned off in a couple of days). This is a big blow -- OneRiot did a superb job of surfacing fresh content that normal people care about. I used the OneRiot API with the "rest of the world besides Twitter" option. This is the world where U2 is more important than Justin Bieber. I'm not sure of the exact details of OneRiot's algorithm, but it worked very well from both an algorithmic perspective (great content) and a technical one (fast response, no outages, stable API). Goodbye old friend.

All is not lost. Collecta still helps me to capture the pulse of the normal world. Hopefully Bing opens up their social API (they are indexing Facebook public updates now). Also, I'm starting to index select parts of the web myself.


Core value: Best Result First

Today I watched an old video of Steve Jobs talking about branding that had risen to the top of Hacker News. In the video, he talks about how a company needs to have a core value, and that the brand should be about that core value. It got me thinking -- what is the core value of Stinky Teddy? We're not actually a company at this point, just an experiment. Nonetheless, we have a simple core value and it is best result first.

Let me elaborate. As Larry Page pointed out, the perfect search engine should have only a single result in many cases, and that one result is the exact link you are looking for. Our goal at Stinky Teddy, simply put, is to use all available information to try to figure out what that exact result is for you at a particular moment in time.

You might ask, don't all search engines try to do that? To see if "best result first" is a core value of your search engine, do a search for something highly commercial like "mortgage refinance process." In all likelihood, the first result is a sponsored link. Not what you want, and certainly not the best result.

Here's the video:


open source code on github

I've started the process of open-sourcing some of the code we've developed for Stinky Teddy. Eventually, we'll open source everything except for the ranking algorithms and intelligence. Stinky Teddy uses multiple search APIs, so a lot of effort went into normalizing the results they return. Our first stab at open-sourcing is a simple Java class that converts the various string representations of a date (used by different search APIs) into a java.util.Date and also into the date string format used by Lucene (org.apache.lucene.document.DateTools.dateToString()).
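To give a flavor of what this kind of date normalization involves, here is a minimal sketch. It is not the class published on github: the class name SearchDateNormalizer, the method names, and the list of input formats are all assumptions made for illustration. To keep the sketch free of dependencies, the Lucene-style string is built with SimpleDateFormat rather than by calling Lucene itself; with Lucene on the classpath, DateTools.dateToString(date, DateTools.Resolution.SECOND) should produce the same GMT yyyyMMddHHmmss layout.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical sketch: normalizes date strings returned by different search APIs. */
final class SearchDateNormalizer {

    // Illustrative patterns only (an assumption); the post does not list
    // the formats that the real class handles.
    private static final String[] PATTERNS = {
        "EEE, dd MMM yyyy HH:mm:ss Z",  // RFC 822 style: "Mon, 14 Feb 2011 18:30:00 +0000"
        "EEE MMM dd HH:mm:ss Z yyyy",   // "Mon Feb 14 18:30:00 +0000 2011"
        "yyyy-MM-dd'T'HH:mm:ssX"        // ISO 8601: "2011-02-14T18:30:00Z"
    };

    private SearchDateNormalizer() { }

    /** Returns the parsed Date, or null if none of the known formats matches. */
    static Date parse(String text) {
        if (text == null) {
            return null;
        }
        String s = text.trim();
        // Assumption: a bare 9- or 10-digit number is a Unix timestamp in seconds.
        if (s.matches("\\d{9,10}")) {
            return new Date(Long.parseLong(s) * 1000L);
        }
        for (String pattern : PATTERNS) {
            // SimpleDateFormat is not thread-safe, so create one per call.
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
            format.setLenient(false);
            ParsePosition pos = new ParsePosition(0);
            Date date = format.parse(s, pos);
            // Accept only if the whole string was consumed by this pattern.
            if (date != null && pos.getIndex() == s.length()) {
                return date;
            }
        }
        return null;
    }

    /** Formats a Date as a sortable GMT string with second resolution. */
    static String toLuceneString(Date date) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US);
        format.setTimeZone(TimeZone.getTimeZone("GMT"));
        return format.format(date);
    }
}
```

With this sketch, SearchDateNormalizer.parse("Mon, 14 Feb 2011 18:30:00 +0000") and SearchDateNormalizer.parse("2011-02-14T18:30:00Z") return the same instant, and toLuceneString turns that instant into 20110214183000. Trying the patterns in order and returning null on failure keeps the caller in charge of what to do with a result whose date cannot be read.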

I'm using github to host the code: Stinkteddy @ github.


"How to Save the News" by James Fallows

Yesterday I read an excellent article by James Fallows, "How to Save the News," in The Atlantic. The article discussed how Google was working to help news organizations monetize their content in the online age. The article prompted me to submit the following letter to the editor:

The ability to quickly find quality content is the primary reason that people use Google, and it is good to see that Google is committed to helping professional journalism survive ("How to Save the News" by James Fallows). Journalists provide a large share of the material that draws people to the search engine, both Google News and the main search page. Consumers know that Google (and other search engines like Bing and Yahoo!) provide easy one-stop access to the information they seek on a large variety of topics. In exchange for this convenience, consumers allow Google to run sponsored links above the search results for the small fraction of search queries that are commercially valuable. A good analogy is commercial radio. Radio stations aggregate quality content that consumers want (songs from various artists) in exchange for the right to subject the consumer to the occasional advertisement. The radio versus search engine analogy breaks down, however, when we note that those who create the content that attracts consumers to the radio stations, the songwriters, are compensated for their creations in the form of performance royalties paid by the radio station. The fact that such a performance royalty agreement does not exist between search engines and professional journalists is an accident of history, and need not be the case in the future. Google and other search engines should compensate content providers directly for the right to use their creations.

In order to create such a system, journalists need to realize that their creations are largely interchangeable from the perspective of Google or the average web surfer. If individual news providers start charging in some way, consumers and Google will simply move their attention to free news sites. This is why micro-payments are not the answer. Instead, news organizations need to bargain collectively and require that Google and other search engines pay for the right to index their content. If a large fraction of news organizations were to simultaneously remove their content from Google, it would seriously impact the quality of the Google product (and give Bing a huge advantage if they were to pay for that body of content). In a performance royalty system, news organizations would be paid a small fixed fee each time Google linked to their articles, similar to the way songwriters are compensated via ASCAP when their songs are played on the radio. Google will argue that the legal concept of fair use allows them to aggregate short portions of content from copyrighted sources and therefore prevents such a system from being enforceable, but it is not clear that fair use applies (the creation of the search index requires that an entire copy of the document be stored on the Google servers). Additionally, journalists cannot be forced to participate, just as songwriters need not join ASCAP.

Clearly, such a system will not be invented by Google as it is bad for their bottom line. Nonetheless, a performance royalty system would fairly compensate journalists for the value they provide both Google and consumers.


    © Stinky Teddy