Twitter has upgraded its search engine, making it possible to search through every tweet ever posted. There are plenty of reasons to look for a particular tweet. Maybe you want to trace back some personally relevant information, or you are doing a research project on social media use. Either way, the new feature is an important step forward compared with the limited previous version.
Words and hashtags are now searchable across the whole archive. Moreover, you can restrict a search to a specific period, regardless of how far back it is. Google and Topsy made it possible to search old tweets before this upgrade, but a tailor-made search engine should deliver results of much higher quality and reliability.
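To make the idea of a word or hashtag search limited to a period more concrete, here is a minimal sketch in Python. It is not Twitter's implementation: the sample tweets, the `tokenize`, `build_index` and `search` helpers, and the tiny in-memory index are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample data: (tweet_id, posting date, text).
TWEETS = [
    (1, date(2006, 3, 21), "just setting up my twttr"),
    (2, date(2010, 6, 11), "Watching the opening match #WorldCup"),
    (3, date(2014, 7, 13), "What a final! #WorldCup"),
]

def tokenize(text):
    """Split on whitespace, strip common punctuation, lowercase; hashtags keep their '#'."""
    return [word.strip(".,!?").lower() for word in text.split()]

def build_index(tweets):
    """Map each term to the list of (date, tweet_id) pairs where it occurs."""
    index = defaultdict(list)
    for tweet_id, posted, text in tweets:
        for term in set(tokenize(text)):
            index[term].append((posted, tweet_id))
    return index

def search(index, term, since, until):
    """Return ids of tweets containing `term` posted within [since, until], oldest first."""
    hits = index.get(term.lower(), [])
    return [tweet_id for posted, tweet_id in sorted(hits) if since <= posted <= until]

index = build_index(TWEETS)
```

With this toy index, `search(index, "#WorldCup", date(2010, 1, 1), date(2010, 12, 31))` looks only at tweets from 2010, however old the rest of the archive is.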
Yi Zhuang, a Search Infrastructure Engineer at Twitter, describes the new product in a lengthy blog post. He says that once a search is launched, roughly half a trillion documents are searched in less than 100 ms.
Until now, Twitter’s search engine was oriented toward recency, which meant that tweets older than one week stopped showing up in the results. The initial search service was implemented in 2011, about five years after the first tweet was posted. It may seem strange now, but Twitter has always been a realtime platform focused on the recent past. As a result, managing a growing archive was not a priority for a developing platform. The upgrade signals the company’s transition to a fully mature stage.
The tweets were stored in the meantime, however, and are finally available after several years of work. Twitter took the first steps two years ago, when an experimental batch of two billion tweets was put to the test. Last year, the company indexed even more tweets and refined the system. Most of the tweets were then indexed in 2014.
The previous search engine kept its index in memory, an approach that turned out to be too expensive for large-scale deployment. For the full archive, the company chose fast SSDs as the main storage option instead.
SSDs met the speed requirements, but the upgrade had another core problem to solve. Besides indexing a huge archive, the system needs to archive realtime tweets as well. The solution was to use machines running Hadoop MapReduce to archive both the older and the newer tweets in parallel.
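The following Python sketch shows the general map-and-reduce pattern applied to two batches at once. It is only an imitation in plain Python, not Hadoop and not Twitter's pipeline: the batches, the function names and the final `merge` step are all hypothetical, and a thread pool stands in for a cluster of machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(tweet):
    """Emit (term, tweet_id) pairs for one tweet."""
    tweet_id, text = tweet
    return [(term, tweet_id) for term in set(text.lower().split())]

def reduce_phase(pairs):
    """Group tweet ids by term into sorted posting lists."""
    postings = defaultdict(set)
    for term, tweet_id in pairs:
        postings[term].add(tweet_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def index_batch(tweets):
    """Run the map step, then the reduce step, over one batch of tweets."""
    pairs = [pair for tweet in tweets for pair in map_phase(tweet)]
    return reduce_phase(pairs)

def merge(*partial_indexes):
    """Combine per-batch indexes into a single term -> tweet ids mapping."""
    merged = defaultdict(list)
    for partial in partial_indexes:
        for term, ids in partial.items():
            merged[term].extend(ids)
    return {term: sorted(set(ids)) for term, ids in merged.items()}

# Hypothetical batches: (tweet_id, text).
ARCHIVE = [(1, "hello world"), (2, "hello twitter")]
RECENT = [(3, "hello again"), (4, "twitter search")]

# Both batches go through the same job; threads stand in for separate machines.
with ThreadPoolExecutor(max_workers=2) as pool:
    archive_index, recent_index = pool.map(index_batch, [ARCHIVE, RECENT])

full_index = merge(archive_index, recent_index)
```

Because the old and the new tweets pass through the same map and reduce functions, neither batch has to wait for the other, and the partial results can be combined afterwards. In CPython the threads do not speed up this CPU-bound toy; they only illustrate the structure.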