How do search engines work?
Search engines are the main way people find information on the web, and in particular where to find a given product or service. If we are to get our web sites onto the first results pages, rather than languishing on the later, often unread pages, we need to know how the search engines produce their results. Unfortunately, the large search engines such as Google, Yahoo, and Live Search do not publish detailed information on how they produce the results pages when we type in a query.
However, the basic principles are known. Briefly, a Web search engine is a suite of software programs designed to search for information on the World Wide Web. The information may consist of web pages, images or other types of files. Some search engines also look for data available in newsgroups, databases, or open directories. Unlike Web directories, which are maintained by human editors, search engines either operate algorithmically (Google, for example) or use a mixture of algorithmic and human input.
You can find more detailed information on search engine technology on Wikipedia.
Google
Google, which currently handles around 80% of search queries, says that: "The software behind our search technology conducts a series of simultaneous calculations requiring only a fraction of a second. Traditional search engines rely heavily on how often a word appears on a web page. We use more than 200 signals, including our patented PageRank algorithm, to examine the entire link structure of the web and determine which pages are most important. We then conduct hypertext-matching analysis to determine which pages are relevant to the specific search being conducted. By combining overall importance and query-specific relevance, we're able to put the most relevant and reliable results first."
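To give a feel for what "examining the link structure" can mean, here is a minimal sketch of the simplified PageRank calculation as it is commonly described in public sources. It is not Google's production ranking, which combines many undisclosed signals. The four-page link graph, the function name pagerank, and the parameter names are all assumptions made for illustration; the damping factor of 0.85 is simply the value usually quoted for the simplified model.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank.

    links maps each page to the list of pages it links to.
    Every link target must itself be a key of links.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal rank

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical site: every page links back to "home".
graph = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home"],
    "blog": ["home"],
}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph "home" ends up with the highest score because every other page links to it, while "blog", which nothing links to, keeps only the baseline share. That is the intuition behind treating links as votes for importance.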
Search Engine Components
A search engine consists of:
- A web crawler: a program or automated script that browses the World Wide Web in a methodical, automated manner. In general, the crawler starts with a list of URLs to visit, called seeds. As the crawler visits these URLs, it identifies the links on each page and adds them to its list of URLs. These URLs are then recursively visited according to a set of policies. In addition, the crawler gathers information from the web pages it visits and stores this data for later indexing by the search engine.
- Search engine indexing: the process of collecting, parsing and storing data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, information technology, physics and computer science. An alternative name for this process, in the context of search engines designed to find web pages on the internet, is Web indexing. The results of the indexing are stored in a database that can be accessed by the other parts of the search engine to answer our queries.
- A user interface: this is the part of the search engine that you interact with. It takes your queries and interrogates the index database to get the relevant URLs. This information is then combined with a sample from each of the relevant pages and presented to you in the form of a list. This list contains the URL of each web page and the sample text relevant to your query, and is generally provided as a series of pages, each containing perhaps 20 results. Most users will only look at the first couple of pages of results, rarely venturing past the third page, so it is important that our web page appears within those first few pages of results.
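The three components above can be sketched end to end in a few lines of Python. This is only a toy under stated assumptions: the "web" is a hypothetical in-memory dictionary (PAGES) rather than real HTTP fetching, the only crawl policy is a page limit, the index is a plain inverted index of words to URLs, and results are ordered alphabetically by URL rather than ranked. All names (PAGES, crawl, build_index, search) are invented for illustration.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example/": ("cheap garden tools for sale", ["a.example/spades", "b.example/"]),
    "a.example/spades": ("garden spades and forks", ["a.example/"]),
    "b.example/": ("reviews of garden tools", ["a.example/spades", "c.example/"]),
    "c.example/": ("holiday cottages to rent", []),
}


def crawl(seeds, max_pages=100):
    """Crawler: start from the seeds, follow links, store each page's text."""
    frontier = deque(seeds)   # URLs still to visit
    seen = set(seeds)         # URLs already queued, to avoid revisiting
    store = {}
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        if url not in PAGES:  # stands in for a page that cannot be fetched
            continue
        text, links = PAGES[url]
        store[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return store


def build_index(store):
    """Indexer: build an inverted index mapping each word to the URLs containing it."""
    index = defaultdict(set)
    for url, text in store.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, store, query, per_page=20):
    """User interface: return (URL, sample text) for pages containing every query word."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return [(url, store[url][:60]) for url in sorted(matches)][:per_page]


store = crawl(["a.example/"])
index = build_index(store)
results = search(index, store, "garden tools")
for url, sample in results:
    print(url, "-", sample)
```

A real engine would replace the alphabetical ordering in search with a relevance ranking, which is exactly the part the large search engines keep to themselves.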