Designing a search engine

I want to design my own search engine or spider. Where do I start? What software do I need to get started? Do you know of an online manual?

There are no manuals on how to do this. Search engines and spiders are living, breathing things. As a simple starter for the spider (also called a robot): you first have to determine how you are going to "spider" sites. Then you have to write the utilities that fetch the information from each site, retrieve and store keywords, and process all of the levels within that site. The utilities must also be capable of following links from the initial page down through the site and storing the keywords you are looking for from each page returned. Your software will also need to process frames and other types of initial pages, including Macromedia Flash menu systems, in order to drill down into the site. It will also need to relate keywords to other keywords and related information (similar to a thesaurus), including misspelled words. It will then need to recognize differences in document types, based on what the server returns, to determine the language and character set of each page. You also need to respect the robots.txt file and its format in order to know what you should and should not process.
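As a rough illustration of the spider steps described above (honor robots.txt, follow links and frames down through one site, store the words found on each page), here is a minimal Python sketch using only the standard library. The names PageParser, crawl, and ExampleBot are made up for this example, and fetch is a function you supply to download a page. It ignores Flash, character sets, and misspellings entirely, so treat it as a starting point rather than a finished robot.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


class PageParser(HTMLParser):
    """Collects link targets (including frame sources) and words on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Follow ordinary links and also frame/iframe sources.
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in ("frame", "iframe") and attrs.get("src"):
            self.links.append(attrs["src"])

    def handle_data(self, data):
        # Simplification: every text node counts, including script text.
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def crawl(start_url, fetch, robots_txt="", agent="ExampleBot", max_pages=50):
    """Breadth-first crawl of a single site; returns {url: [words]}.

    fetch(url) is caller-supplied and returns HTML text, or None on failure.
    robots_txt is the text of the site's robots.txt file (hypothetical input).
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    site = urlparse(start_url).netloc
    queue, seen, pages = deque([start_url]), {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if not rules.can_fetch(agent, url):
            continue  # robots.txt says this path is off limits
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        pages[url] = parser.words
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # absolute, no fragment
            # Stay on the starting site and never queue a page twice.
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

In a real robot, fetch would wrap something like urllib.request.urlopen and the robots.txt text would be downloaded from the site first. Passing both in keeps the crawling logic easy to test without touching the network.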

For the search engine: it will need to quickly return the number of matches for a search, along with titles, portions of each page (stored locally or cached), and a percentage of match. It needs to handle advanced options for simple logic (for example AND, OR, and NOT). It also needs to recognize when multiple hits are at the same site and return only one hit from that site. It will then need to display that information in a format that is as neat and clean as possible, so that users can scan and select the links they want. Lastly, it will need to keep track of click-throughs so that you can report which sites are displayed the most and which sites your users actually visit. I think that's it.
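To make the search side a little more concrete, here is a small Python sketch of the pieces listed above: a keyword index, simple AND/OR logic, a percentage of match, one hit per site, and counters for how often a result is shown and clicked. TinyIndex and its method names are invented for this example, and the scoring (share of query words found on the page) is just one simple assumption. Real engines rank far more carefully and would also return page excerpts.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of URLs containing it
        self.titles = {}                  # URL -> page title
        self.shown = Counter()            # how often each URL was displayed
        self.clicked = Counter()          # how often each URL was visited

    def add(self, url, title, text):
        self.titles[url] = title
        for word in tokenize(title + " " + text):
            self.postings[word].add(url)

    def search(self, query, mode="or"):
        """Return [(percent_match, title, url)], at most one hit per site."""
        terms = tokenize(query)
        if not terms:
            return []
        sets = [self.postings.get(term, set()) for term in terms]
        urls = set.intersection(*sets) if mode == "and" else set.union(*sets)
        # Score: percentage of the query words that appear on the page.
        scored = [(100 * sum(url in s for s in sets) // len(terms), url)
                  for url in urls]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        results, seen_sites = [], set()
        for score, url in scored:
            site = urlparse(url).netloc
            if site in seen_sites:
                continue  # keep only the best hit from each site
            seen_sites.add(site)
            self.shown[url] += 1
            results.append((score, self.titles[url], url))
        return results

    def record_click(self, url):
        """Call this when a user follows a result link."""
        self.clicked[url] += 1
```

Comparing the shown and clicked counters gives the click-through report: which pages are displayed most often, and which ones users actually visit.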


