Today I’d like to talk about search engines and their design criteria. Yes, we have Google, Bing, Yahoo, DuckDuckGo, Yandex and many more. But is it easy to build a search engine? The answer is clearly NO. If you think you can easily build a search engine, it’s like comparing a Ferrari to the 1769 Cugnot Steam Trolley (Jonathan Holguinisburg). Yes, the difference is that big.
Let me elaborate on what I mean. Let’s start with the first phase: crawling.
Crawling
You need to fetch web pages to get their contents. Web pages contain text, images, videos, PDFs and, most importantly, links to other pages. When Google proposed PageRank, the links on a web page played a critical role in the algorithm. They not only represent relationships between pages, they are also the gates to new pages. I won’t go into the details of PageRank, but it states that if a page is referenced by many other pages, it is an important page (it has a higher rank). So from a crawling point of view, links are the heart of the whole process.
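To make the idea concrete, here is a tiny, hypothetical sketch of the iterative PageRank computation on a made-up three-page link graph. The graph, the damping factor value and the iteration count are illustrative assumptions only, not how Google actually runs it:

```python
# Toy PageRank iteration on a tiny, made-up link graph.
# Keys are pages, values are the pages they link to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}

damping = 0.85                       # the usual damping factor from the PageRank paper
rank = {page: 1.0 / len(links) for page in links}

for _ in range(20):                  # a handful of iterations is enough for a toy graph
    new_rank = {page: (1 - damping) / len(links) for page in links}
    for page, outgoing in links.items():
        share = rank[page] / len(outgoing)   # a page splits its rank among its out-links
        for target in outgoing:
            new_rank[target] += damping * share
    rank = new_rank

print(rank)   # pages referenced by many others end up with the higher score
```

The point is simply that rank flows along links, which is exactly why the crawler has to care so much about them.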
The crawling process starts with a seed file. The seed file contains the base URL(s) to crawl initially. The crawler reads the URLs from the seed file and fetches their contents. This sounds very simple, doesn’t it? But it is an iceberg and you’re on the Titanic. It is fine if you deal with a small number of URLs, but in reality it goes far beyond our estimations.
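As a rough sketch of one crawl round (the file name `seeds.txt` and the naive link extraction are my own assumptions, not any particular crawler’s behavior), the basic loop looks something like this:

```python
# Minimal single-round crawl: read seeds, fetch pages, collect newly discovered links.
from urllib.request import urlopen
from urllib.parse import urljoin
import re

with open("seeds.txt") as f:                       # hypothetical seed file, one URL per line
    seeds = [line.strip() for line in f if line.strip()]

discovered = set()
for url in seeds:
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
    except Exception:
        continue                                   # dead or slow servers are the norm
    for href in re.findall(r'href="([^"]+)"', html):
        discovered.add(urljoin(url, href))         # every page is a gate to new pages

# "discovered" feeds the next crawl round — and it is usually much bigger than the seed list.
```

Nothing more than a loop and a regex, and yet this is exactly where the iceberg starts.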
When I type “a” in Google and hit enter, I get about 25 billion results.
So clearly we have plenty of URLs. What problems may arise with such a big number of URLs?
- Parallel crawling
- Storing such a volume of data
- Updating/deleting/creating links and URLs
- Network problems (slow connections)
- Dynamically generated web page issues (DB connection errors, different contents for the same URL)
Perhaps STORAGE and parallel crawling are the most challenging ones.
After Google published PageRank, Nutch (a Java-based crawler) and Hadoop (a distributed big-data platform, with HDFS as its distributed file system) were born.
One could put those into their toolbox and say, “I can crawl and go distributed, because I have all I need!” I would say: yes boy, that’s how you say mama 🙂
Using, configuring and administering Hadoop is OK for small volumes of data. Beyond that, you need to tune kernel parameters to allow enough threads and processes, plus many other configuration details, to use it successfully and effectively. Using it effectively means CPU, disk and network utilization should be optimal. If 20% of your cluster is not performing optimally, you should not complain about the electricity bill and personnel costs.
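As a small illustration of the OS-level side of this (the target number below is just an example, not a recommendation): every socket a fetcher thread opens consumes a file descriptor, so it is worth checking the per-process limit before launching hundreds of fetchers.

```python
# Check (and, if allowed, raise) the open-file-descriptor limit before starting many fetchers.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")

wanted = 4096                          # example target; real clusters tune this per workload
if soft < wanted <= hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
```

Forgetting details like this is usually how that underperforming 20% of the cluster happens.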
Volume is not the only issue, even if your tool (Nutch or another) is mature and robust enough. You’ll see failed map/reduce jobs for many reasons. You’ll lose a long-running crawl because of one stupid link, and hackers would love that! Yes, you may have 1 billion URLs and a single URL may fail your whole crawl.
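This forces a very defensive style of fetching. A rough sketch (the retry policy is an arbitrary assumption): one hopeless URL gets logged and skipped, it never aborts the batch.

```python
# One malformed or hostile URL must not kill a crawl that has been running for hours.
from urllib.request import urlopen

def fetch(url, retries=2):
    """Return page bytes, or None if the URL is hopeless. Never raise."""
    for _ in range(retries + 1):
        try:
            return urlopen(url, timeout=10).read()
        except Exception:
            continue                   # retry a couple of times, then give up quietly
    return None

urls = ["http://example.com/", "http://0.0.0.0/broken"]
pages = {url: fetch(url) for url in urls}
failed = [u for u, body in pages.items() if body is None]   # log these, don't crash on them
```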
You crawl your seed list, you get new links, and you add these new links to your seed list. This growth is exponential. I mean, after the 5th or 6th run you may find yourself with a crawl job that runs for 3 hours. To lower the adrenaline level (if the job fails you’ll lose 3 hours of cluster time: 3 hours of electricity, 3 hours of wasted cluster time, etc.), you’ll limit the number of URLs to crawl per run.
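A sketch of that limiting step (the scoring heuristic here is a placeholder I made up; real crawlers score by in-link counts, host quotas, freshness and so on): score the frontier and take only the top N URLs for the next round.

```python
# Keep each crawl round bounded: rank the frontier and take at most TOP_N URLs.
TOP_N = 100_000

def score(url):
    # Placeholder heuristic: prefer shorter URLs. Real scoring uses link data, not length.
    return -len(url)

def next_batch(frontier):
    """Pick at most TOP_N URLs for the next crawl round, highest score first."""
    return sorted(frontier, key=score, reverse=True)[:TOP_N]

frontier = ["http://example.com/a", "http://example.com/some/very/deep/page"]
print(next_batch(frontier))
```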
After that another issue arises: you have limited resources (a limited number of servers, for instance). You can NOT crawl the entire web unless you’re Google, Bing, Yahoo or Yandex, right?
A little clarification about the scale involved: as of 2011, Google was estimated to have 900,000 physical servers (http://www.datacenterknowledge.com/archives/2011/08/01/report-google-uses-about-900000-servers/).
So you cannot just add any link to your seed list. So WHAT IS YOUR STRATEGY for adding links?
Language-based crawling can be used; I was a co-author of a paper on that, by the way (ieeexplore.ieee.org/iel7/6820096/6830164/06830532.pdf).
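Just to illustrate the general idea (this is a hypothetical toy, not what the paper actually does, and the choice of Turkish and of a stopword count as the signal are my own assumptions): a language gate decides whether a page’s links are worth adding to the frontier at all.

```python
# Hypothetical language gate: only enqueue links from pages whose text looks like the
# target language. A real system would use a proper language identifier, not stopwords.
TURKISH_STOPWORDS = {"ve", "bir", "bu", "için", "ile", "olarak", "daha", "çok"}

def looks_turkish(text, threshold=5):
    words = text.lower().split()
    hits = sum(1 for w in words if w in TURKISH_STOPWORDS)
    return hits >= threshold

def should_enqueue(url, page_text):
    # Drop everything outside the target language before it bloats the seed list.
    return looks_turkish(page_text)
```

Whatever the strategy is, the point is that you need one; “add every link you see” stops being an option very quickly.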
Another decision must be made about crawling cycles, i.e. when to re-crawl. One possible approach is to have multiple crawlers running at the same time. But how will you coordinate the crawlers, how will you keep the crawling fair, how will the crawlers update the seed list simultaneously, and many more questions…
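One simple way to think about re-crawl cycles (the weekly interval below is an arbitrary example) is to store a “next due” time per URL and only hand out URLs that are due; coordinating several crawlers on top of that is where the hard part really begins.

```python
# Sketch of a re-crawl schedule: each URL carries the next time it should be fetched.
import time

RECRAWL_INTERVAL = 7 * 24 * 3600       # arbitrary example: revisit each URL weekly

schedule = {}                          # url -> next time it should be fetched

def due_urls(now=None):
    now = now or time.time()
    return [u for u, t in schedule.items() if t <= now]

def mark_fetched(url, now=None):
    now = now or time.time()
    schedule[url] = now + RECRAWL_INTERVAL

mark_fetched("http://example.com/")
print(due_urls())                      # empty until a week has passed
```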
This little illustration is meant to show you the difficulties of the simplest part of a search engine.
I did not even mention indexing, ranking, search suggestions or image search.
If I had the time, I would like to write about those as well.
One more thing:
Technology is a living thing. It is not the same as yesterday and will not be the same tomorrow. Lots of tools and technologies either did not exist or were not mature a few years ago. They were created because of our needs, and they matured because of us: either we reported the bugs and feature requests, or we fixed and coded them ourselves. Nowadays a high school student can easily build and code a Raspberry Pi to control the temperature of his room. A few years ago, cell phones with three-line dot-matrix screens were luxury devices. The world is changing, bro(s).