1. Complete code in a compressed archive (zip, tgz, etc.)
2. A readme file with a complete description of the software used, plus installation, compilation, and execution instructions
3. A document with the results for the questions below.
Develop a specialized Web crawler.
Test your crawler only on the data in:
Make sure that your crawler is not allowed to get out of this directory! Yes, there is a robots.txt file, and it must be obeyed. Note that it is in a non-standard location.
The required input to your program is N, the limit on the number of pages to retrieve, and a list of stop words (of your choosing) to exclude.
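Since the robots.txt file is not at the usual server root, one option is to fetch it explicitly and feed its contents to Python's standard `urllib.robotparser`. A minimal sketch, using a made-up robots.txt body and hypothetical URLs (the real location is given by the assignment):

```python
from urllib import robotparser

# Hypothetical robots.txt content; the real file lives at a
# non-standard location, so fetch it explicitly instead of
# relying on the default /robots.txt lookup.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check candidate URLs before downloading them.
print(rp.can_fetch("*", "http://example.com/data/page.html"))    # True
print(rp.can_fetch("*", "http://example.com/private/x.html"))    # False
```

Checking `can_fetch` on every candidate URL, in addition to a directory-prefix check, keeps the crawler inside the allowed area.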
Perform case insensitive matching.
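Case-insensitive matching combined with stop-word exclusion can be handled by lowercasing the text before tokenizing. A small sketch, with an example stop-word list (the actual list is your choice):

```python
import re

# Example stop words of our choosing; the assignment leaves the list open.
STOP_WORDS = {"the", "a", "an", "of", "and"}

def tokens(text):
    # Lowercase first so "The", "THE", and "the" all match the same entry.
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOP_WORDS]

print(tokens("The Crawler AND the Web"))  # ['crawler', 'web']
```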
You can assume that there are no errors in the input, but your code should be robust to errors in the Web pages it crawls. If an error is encountered, it is fine simply to skip the page where it occurred.
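The skip-on-error policy is easy to centralize in a single fetch helper that returns `None` for any page it cannot retrieve or decode; a sketch using the standard library:

```python
from urllib.request import urlopen
from urllib.error import URLError

def fetch(url, timeout=10):
    """Return the page body as text, or None if the page is broken."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (URLError, ValueError, OSError):
        # Any problem with the page (bad URL, network error, etc.):
        # skip it, per the assignment's error policy.
        return None
```

The crawl loop then just tests for `None` and moves on, so one malformed page never kills the run.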
Efficiency: don’t be ridiculously inefficient, but there is no need for turbo-charged algorithms or implementations. You don’t need to worry about memory constraints; if your program runs out of space and dies on a large file, that’s OK. You do not have to use multiple threads; sequential downloading is fine.
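Putting the pieces together, a simple sequential breadth-first crawl that stops after N successful retrievals is all that is required. A sketch under stated assumptions: `fetch`, `extract_links`, and `allowed` are hypothetical hooks supplied by the caller (downloading, link extraction, and the directory/robots.txt check, respectively):

```python
from collections import deque

def crawl(seed, n_pages, fetch, extract_links, allowed):
    """Sequential BFS crawl; stops after n_pages successful fetches.

    fetch(url)               -> page text, or None on error (page skipped)
    extract_links(html, url) -> iterable of absolute URLs found in html
    allowed(url)             -> True if url is inside the permitted directory
                                and not forbidden by robots.txt
    """
    frontier = deque([seed])
    seen = {seed}
    fetched = 0
    while frontier and fetched < n_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:          # broken page: skip it and continue
            continue
        fetched += 1
        for link in extract_links(html, url):
            if link not in seen and allowed(link):
                seen.add(link)
                frontier.append(link)
    return fetched
```

A `deque` frontier with a `seen` set gives each page at most one download, which is all the efficiency the assignment asks for.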