In this project, we will design and implement a mini search engine that is used to search through a collection of documents. The data structures used are files for storing, hash tables for indexing, and trees for searching the documents.
The documents will be stored as files. Given a set of texts and a query, the search engine will locate all the documents that contain the keywords in that query. The purpose of this project is to provide an overview of how a search engine works and to gain hands-on experience in using hash tables, files and trees.
The documents stored as files will be indexed by their words (tokens) using hash functions, so that the documents containing a given word can be retrieved without scanning every file.
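A minimal sketch of such an index follows. It assumes the documents are plain-text `.txt` files in a single directory and that a token is any run of lowercase letters or digits. The names `tokenize` and `build_index` are illustrative, and Python's built-in dictionary stands in for the hash table.

```python
import re
from collections import defaultdict
from pathlib import Path


def tokenize(text):
    """Split text into lowercase word tokens (assumed tokenization rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(doc_dir):
    """Map each token to the set of file names that contain it."""
    index = defaultdict(set)  # a dict is a hash table keyed by token
    for path in sorted(Path(doc_dir).glob("*.txt")):
        for token in tokenize(path.read_text(encoding="utf-8")):
            index[token].add(path.name)
    return index
```

Looking up `index[word]` then gives the documents for that word in expected constant time, independent of the number of files.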
Searching will be done using trees. Depending on the efficiency and complexity of the algorithm, we will use AVL trees or another balanced binary search tree. To allow efficient searching, a list of the documents in which it occurs will be stored for every word. Queries may contain the simple Boolean operators AND and OR, which behave like the corresponding logical operators. For each such query, the documents that satisfy it will be displayed.
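The idea of a search tree whose nodes carry per-word document lists can be sketched as below. This is a plain binary search tree, not an AVL tree: rebalancing is omitted to keep the example short, so the worst-case depth is not bounded here. The names `Node`, `insert` and `search` are illustrative.

```python
class Node:
    """One tree node: a word and the documents in which it occurs."""

    def __init__(self, word):
        self.word = word
        self.docs = []      # list of documents containing this word
        self.left = None
        self.right = None


def insert(root, word, doc):
    """Record that doc contains word; return the subtree root."""
    if root is None:
        root = Node(word)
    if word < root.word:
        root.left = insert(root.left, word, doc)
    elif word > root.word:
        root.right = insert(root.right, word, doc)
    elif doc not in root.docs:
        root.docs.append(doc)
    return root


def search(root, word):
    """Return the document list for word, or an empty list if absent."""
    while root is not None:
        if word == root.word:
            return root.docs
        root = root.left if word < root.word else root.right
    return []


root = None
for word, doc in [("tree", "d1"), ("hash", "d1"), ("tree", "d2")]:
    root = insert(root, word, doc)
print(search(root, "tree"))  # ['d1', 'd2']
```

An AVL version would keep the same `search` and add height tracking and rotations to `insert`.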
For instance, a query:
Keyword1 AND Keyword2 -- retrieves all documents that contain both keywords.
Keyword1 OR Keyword2 -- retrieves all documents that contain at least one of the two keywords.
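One way to evaluate such queries is with set intersection and union over the per-word document sets. The sketch below assumes an index that maps each lowercase word to a set of document names, and it evaluates operators strictly left to right with no precedence or parentheses. The function name `evaluate` and the sample index are hypothetical.

```python
def evaluate(query, index):
    """Evaluate 'w1 AND w2 OR w3 ...' left to right over a word->docs index."""
    tokens = query.split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0].lower(), ()))
    # Remaining tokens come in (operator, keyword) pairs.
    for op, word in zip(tokens[1::2], tokens[2::2]):
        docs = set(index.get(word.lower(), ()))
        if op == "AND":
            result &= docs      # keep documents containing both
        elif op == "OR":
            result |= docs      # keep documents containing either
        else:
            raise ValueError(f"unknown operator: {op}")
    return result


# Hypothetical index for illustration only.
index = {"hash": {"d1", "d2"}, "tree": {"d2", "d3"}}
both = evaluate("hash AND tree", index)    # {'d2'}
either = evaluate("hash OR tree", index)   # {'d1', 'd2', 'd3'}
```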