The architecture of a common Web search engine contains a front-end process and a back-end process, as shown in Figure 1. In the front-end process, the user enters the search words into the search engine interface, which is usually a Web page with an input box. The application then parses the search request into a form that the search engine can understand, and then the search engine executes the search operation on the index files. After ranking, the search engine interface returns the search results to the user. In the back-end process, a spider or robot fetches the Web pages from the Internet, and then the indexing subsystem parses the Web pages and stores them into the index files. If you want to use Lucene to build a Web search application, the final architecture will be similar to that shown in Figure 1.
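To make the two processes concrete, here is a minimal sketch in plain Java. It does not use Lucene: the class MiniSearchEngine, its methods, the example URLs, and the ranking rule (sum of term counts) are all assumptions made for illustration. The back end is modeled by indexPage, which parses a fetched page into an in-memory index, and the front end by search, which parses the request, looks up the index, ranks the hits, and returns them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy engine; a real application would delegate this work to Lucene.
class MiniSearchEngine {
    // Stand-in for the index files: term -> (page URL -> occurrences of the term).
    private final Map<String, Map<String, Integer>> index = new HashMap<>();

    // Back end: parse a page that the spider fetched and store its terms in the index.
    void indexPage(String url, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            index.computeIfAbsent(term, t -> new HashMap<>()).merge(url, 1, Integer::sum);
        }
    }

    // Front end: parse the request, search the index, rank, and return the URLs.
    List<String> search(String request) {
        // TreeMap keeps URLs in alphabetical order, which settles ties predictably.
        Map<String, Integer> scores = new TreeMap<>();
        for (String term : request.toLowerCase().split("\\W+")) {
            Map<String, Integer> postings = index.get(term);
            if (postings == null) continue; // term never indexed
            postings.forEach((url, count) -> scores.merge(url, count, Integer::sum));
        }
        List<String> results = new ArrayList<>(scores.keySet());
        // Assumed ranking rule: pages with more matching term occurrences come first.
        results.sort((a, b) -> scores.get(b) - scores.get(a));
        return results;
    }

    public static void main(String[] args) {
        MiniSearchEngine engine = new MiniSearchEngine();
        engine.indexPage("http://example.com/a", "Lucene is a search library. Lucene is written in Java.");
        engine.indexPage("http://example.com/b", "Java is a programming language.");
        System.out.println(engine.search("Java Lucene"));
    }
}
```

The point of the split is that indexing and searching only share the index: pages can be fetched and indexed on their own schedule while queries are answered from whatever the index currently holds.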

Implement advanced search with Lucene
Lucene supports several kinds of advanced searches, which I'll discuss in this section. I'll then demonstrate how to implement these searches with Lucene's Application Programming Interfaces (APIs).
Most search engines provide Boolean operators so users can compose queries. Typical Boolean operators are AND, OR, and NOT. Lucene provides five Boolean operators: AND, OR, NOT, plus (+), and minus (-). I'll describe each of these operators.
- OR: If you want to search for documents that contain the word "A" or the word "B," use the OR operator. Keep in mind that if you don't put any Boolean operator between two search words, the OR operator will be added between them automatically. For example, "Java OR Lucene" and "Java Lucene" both return documents that contain "Java," "Lucene," or both.
- AND: If you want to search for documents that contain all of several words, use the AND operator. For example, "Java AND Lucene" returns all documents that contain both "Java" and "Lucene."
- NOT: Documents that contain the search word immediately after the NOT operator won't be retrieved. For example, if you want to search for documents that contain "Java" but not "Lucene," you may use the query "Java NOT Lucene." You cannot use this operator with only one term. For example, the query "NOT Java" returns no results.
- +: The function of this operator is similar to the AND operator, but it only applies to the word immediately following it. For example, if you want to search for documents that must contain "Java" and may contain "Lucene," you can use the query "+Java Lucene."
- -: The function of this operator is the same as the NOT operator. The query "Java -Lucene" returns all of the documents that contain "Java" but not "Lucene."
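The five operators above boil down to three roles a term can play: required, optional, or prohibited. The following is a minimal sketch of that idea in plain Java. It is a toy model, not Lucene's implementation, and it does not call Lucene: the names MUST, SHOULD, and MUST_NOT are borrowed from Lucene's BooleanClause.Occur values, while the class BooleanQuerySketch, its simplified parser, and the sample documents are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model of the operator semantics; it is not Lucene code.
class BooleanQuerySketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    static final class Clause {
        Occur occur;
        final String term;
        Clause(Occur occur, String term) { this.occur = occur; this.term = term; }
    }

    // Sample documents (assumed); a hit is reported as the document's position in this list.
    static final List<String> DOCS = Arrays.asList(
            "Java programming basics",
            "Lucene in action",
            "Searching with Java and Lucene",
            "Python cookbook");

    // Parses a simplified form of the syntax: whitespace-separated terms and operators.
    static List<Clause> parse(String query) {
        List<Clause> clauses = new ArrayList<>();
        Occur next = Occur.SHOULD; // no operator between two words means OR
        for (String token : query.trim().split("\\s+")) {
            if (token.equals("AND")) {
                // AND makes the preceding term required as well as the following one.
                if (!clauses.isEmpty()) {
                    Clause last = clauses.get(clauses.size() - 1);
                    if (last.occur == Occur.SHOULD) last.occur = Occur.MUST;
                }
                next = Occur.MUST;
            } else if (token.equals("OR")) {
                next = Occur.SHOULD;
            } else if (token.equals("NOT")) {
                next = Occur.MUST_NOT;
            } else {
                Occur occur = next;
                String term = token;
                if (token.startsWith("+")) { occur = Occur.MUST; term = token.substring(1); }
                else if (token.startsWith("-")) { occur = Occur.MUST_NOT; term = token.substring(1); }
                if (!term.isEmpty()) clauses.add(new Clause(occur, term.toLowerCase()));
                next = Occur.SHOULD; // an operator only affects the term right after it
            }
        }
        return clauses;
    }

    static boolean matches(List<Clause> clauses, Set<String> words) {
        boolean hasMust = false;
        boolean shouldHit = false;
        for (Clause c : clauses) {
            boolean present = words.contains(c.term);
            if (c.occur == Occur.MUST) {
                hasMust = true;
                if (!present) return false;
            } else if (c.occur == Occur.MUST_NOT) {
                if (present) return false;
            } else {
                shouldHit |= present;
            }
        }
        // A query with only prohibited terms has nothing to select, so it matches nothing.
        return hasMust || shouldHit;
    }

    static List<Integer> search(String query, List<String> docs) {
        List<Clause> clauses = parse(query);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            Set<String> words = new HashSet<>(Arrays.asList(docs.get(i).toLowerCase().split("\\W+")));
            if (matches(clauses, words)) hits.add(i);
        }
        return hits;
    }

    public static void main(String[] args) {
        for (String q : new String[] {"Java OR Lucene", "Java AND Lucene", "Java NOT Lucene",
                                      "NOT Java", "+Java Lucene", "Java -Lucene"}) {
            System.out.println(q + " -> " + search(q, DOCS));
        }
    }
}
```

With the sample documents, "Java AND Lucene" matches only the third document, "Java NOT Lucene" and "Java -Lucene" both match only the first, "+Java Lucene" matches the first and third, and "NOT Java" matches nothing, which mirrors the behavior described in the list above.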