Specifically, we discuss the crawler architecture and one of its most important variations, focused crawling, along with its application to language-specific crawling. Finally, we discuss some experiments with language-specific crawling performed in Nutch.
Crawling
Focused Crawling
Probabilistic Ranking Methods
In Chapter 4 we discuss the Probabilistic Ranking Principle, some models based on it, and some variations of the principle.
Aim and Motivation
Overview of the Report
The basic algorithm of crawling can be described as follows: given a set of initial URLs (Uniform Resource Locators), the crawler downloads the corresponding web pages, extracts the outgoing hyperlinks from those pages, and recursively downloads the linked pages. The crawler must also follow certain rules: it should not download the same web page repeatedly, yet it must revisit pages often enough to keep its copies fresh.
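The basic algorithm above can be sketched as a breadth-first traversal. Here `fetch_page` and `extract_links` are hypothetical stand-ins for a real HTTP client and HTML link extractor, so the sketch stays self-contained:

```python
from collections import deque

def crawl(seed_urls, fetch_page, extract_links, max_pages=100):
    """Breadth-first crawl: download pages, extract links, recurse.

    A `seen` set ensures no URL is downloaded twice."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        page = fetch_page(url)            # download the page
        pages[url] = page
        for link in extract_links(page):  # outgoing hyperlinks
            if link not in seen:          # avoid repeat downloads
                seen.add(link)
                frontier.append(link)
    return pages
```

A real crawler adds politeness delays, freshness-driven revisits, and the frontier and duplicate-elimination components described below.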
Crawler Architecture
- URL Frontier
- Duplicate URL elimination
- DNS Resolution
- Robots Exclusion Protocol
Some web pages appear to change constantly, but most of the time the changes are only ads or reader comments. If there is no entry for the host in the table, the host is assigned to an empty back queue, the URL is pushed onto that queue, and the process terminates.
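The back-queue assignment just described can be sketched as follows. This is a simplification of a Mercator-style frontier; the queue count and data structures are illustrative:

```python
class BackQueues:
    """Assign each host to a dedicated back queue for politeness.

    `table` maps a host to its queue index; a URL from an unseen host
    goes to an empty, unassigned back queue, if one is available."""
    def __init__(self, num_queues):
        self.queues = [[] for _ in range(num_queues)]
        self.table = {}

    def push(self, host, url):
        if host not in self.table:
            # find an empty back queue for the new host
            for i, q in enumerate(self.queues):
                if not q and i not in self.table.values():
                    self.table[host] = i
                    break
            else:
                return False  # all queues busy; caller must wait
        self.queues[self.table[host]].append(url)
        return True
```

Keeping all URLs of one host in one queue lets the crawler enforce a per-host delay by draining each queue on its own schedule.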

Distributed Crawling
This protocol is called the Robots Exclusion Protocol, and HOST/robots.txt is where this information is provided. Since this information is needed repeatedly, for every URL from that host, it makes sense to keep a cache of these files. We then have to decide on a cache eviction policy (such as least recently used); some web servers themselves specify an expiration date for their robots.txt file.
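Such a cache can be sketched with the standard library's robots.txt parser. The eviction policy here is a simple fixed-size LRU, and fetching is stubbed out behind a callable so the example stays offline:

```python
from collections import OrderedDict
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """LRU cache of parsed HOST/robots.txt files."""
    def __init__(self, fetch_robots, capacity=100):
        self.fetch_robots = fetch_robots  # host -> robots.txt text
        self.capacity = capacity
        self.cache = OrderedDict()

    def allowed(self, host, path, agent="*"):
        if host in self.cache:
            self.cache.move_to_end(host)  # mark as recently used
        else:
            rp = RobotFileParser()
            rp.parse(self.fetch_robots(host).splitlines())
            self.cache[host] = rp
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
        return self.cache[host].can_fetch(agent, path)
```

A production cache would also honor any expiration date the server declares, re-fetching the file once the cached copy becomes stale.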
Summary
User’s View
Taxonomy selection and refinement: The URLs provided by the user are classified with the initial classifier and the results are shown to the user. Interactive exploration: The system then suggests some URLs that are closer to the examples provided by the user for classification. The user can manually classify some of them and mark some of them as 'good', which means they are relevant.
Feedback: at intervals, the hub list is displayed to the user, who marks hubs as relevant or not relevant; this feedback is fed back to the classifier.
Classification
The user marks the classification results that look correct as 'good' and corrects the misclassified results before marking those as 'good' as well. Hard focus: the formulas above are used to find the topic c* with the highest probability of containing the document. If c* or one of its ancestors is marked 'good', the extracted URLs are added to the frontier; otherwise the crawl is pruned at d.
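The hard-focus rule can be sketched as follows. The taxonomy is represented as a child-to-parent map, and the classifier that produces c* is assumed to run elsewhere:

```python
def ancestors(topic, parent):
    """Yield a topic and all its ancestors in the taxonomy."""
    while topic is not None:
        yield topic
        topic = parent.get(topic)

def hard_focus(extracted_urls, c_star, parent, good, frontier):
    """Add extracted URLs to the frontier only if c* or one of its
    ancestors is marked 'good'; otherwise prune the crawl here."""
    if any(t in good for t in ancestors(c_star, parent)):
        frontier.extend(extracted_urls)
        return True
    return False
```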
The URLs retrieved from this page are then prioritized according to this importance.
Using Context Graphs
Generating Context Graphs
Classification
A confidence threshold is set for each layer, so that if a document's probability of belonging to each of the classes fails to exceed the corresponding threshold, the document is classified as belonging to none of them.
Crawling
Web Page Classification Techniques
Applications
Hierarchical Classification
Positive Example Based Learning
Before describing the M-C algorithm, we define the following notation: POS is the positive data set, and M_i(neg) is a subset of negative examples, where M_i(neg) is farther from POS than M_{i+1}(neg). In the mapping stage, we list all the features that appear in most of the positive data but rarely occur in the unlabeled data. We then filter out any possible positive data points from the unlabeled data, which leaves only strong negative examples: these form M_1(neg).
Convergence stage: An SVM is built with POS as positive data and NEG as negative data.
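The mapping stage just described can be sketched in pure Python. Documents are modeled as sets of features, the frequency threshold is illustrative, and the convergence-stage SVM is left out:

```python
def strong_negatives(pos_docs, unlabeled_docs, threshold=0.5):
    """Mapping stage of the M-C algorithm: find features frequent in
    POS but rare in the unlabeled data, then keep only unlabeled
    documents containing none of them -- the strong negatives M1(neg)."""
    def freq(feature, docs):
        return sum(feature in d for d in docs) / len(docs)
    # features typical of positive documents but rare elsewhere
    pos_features = {f for d in pos_docs for f in d
                    if freq(f, pos_docs) >= threshold
                    and freq(f, unlabeled_docs) < threshold}
    # unlabeled docs with no positive feature are strong negatives
    return [d for d in unlabeled_docs if not pos_features & d]
```

The convergence stage would then train an SVM on POS versus these strong negatives and iterate, growing NEG until it stops changing.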
Language Specific Crawling
Using Focused Crawling Mechanism
- Simple Strategy
- Limited Distance Strategy
The focused crawler relies on topical locality between web pages: pages belonging to a topic tend to be located close to each other in the link graph. Character encoding: one method is to use the character-encoding scheme declared in the web page's META tag. Hard focus: the relevance of a page is 1 if it is in the desired language and 0 otherwise.
Extracted URLs are pushed to the frontier only if the relevance score is 1; otherwise crawling along that path is cut off.
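In code, the rule reduces to a binary gate. Here `detect_language` is a hypothetical detector (for example, one based on the META-tag charset or on character n-gram statistics):

```python
def language_hard_focus(page_text, extracted_urls, detect_language,
                        target_lang, frontier):
    """Relevance is 1 when the page is in the target language, else 0;
    extracted URLs are pushed to the frontier only when relevance is 1."""
    relevance = 1 if detect_language(page_text) == target_lang else 0
    if relevance == 1:
        frontier.extend(extracted_urls)
    return relevance
```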
Detecting Indian Languages
Summary
Binary Independence Model
The term binary refers to the fact that documents and queries are represented as binary term-incidence vectors. The term independence refers to the assumption that term occurrences are independent of each other. After making the Naive Bayes conditional independence assumption, estimating the probabilities with the measures given below, and simplifying, we obtain the following formula for the importance of a document d for a query.
The resulting quantity is called the Retrieval Status Value (RSV):

RSV_d = Σ_{t ∈ q∩d} log [ (s / (S − s)) / ((df_t − s) / ((N − df_t) − (S − s))) ]   (4.2)

where N is the total number of documents, S is the number of relevant documents, s is the number of relevant documents in which term t is present, and df_t is the total number of documents containing term t. To avoid a zero inside the logarithm, 1/2 is added to each count.
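Equation (4.2) with the 1/2 smoothing can be computed directly; a minimal sketch:

```python
import math

def rsv_weight(N, S, s, df_t):
    """Term weight from equation (4.2), with 1/2 added to each count
    to avoid zeros inside the logarithm."""
    return math.log(((s + 0.5) / (S - s + 0.5)) /
                    ((df_t - s + 0.5) / (N - df_t - S + s + 0.5)))

def rsv(query_terms, doc_terms, stats):
    """RSV of a document: sum of weights over query terms it contains.
    `stats` maps term -> (N, S, s, df_t)."""
    return sum(rsv_weight(*stats[t])
               for t in query_terms if t in doc_terms)
```

A term frequent among relevant documents but rare elsewhere gets a positive weight; a term rare among relevant documents but common elsewhere gets a negative one.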
Non Binary Model : Okapi BM25
This assumption is far from true, especially in the case of interactive information retrieval, which we discuss in the next section. The first and simplest possibility is that the score for document d is the sum of the idf scores of the query terms present in d. In addition to the variables shared with the binary independence model of the previous section, tf_td is the frequency of term t in document d, L_d is the length of the document, L_ave is the average document length over the collection, and k1 is a parameter that controls the scaling of the document term frequency.
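Using those variables, a BM25 scorer can be sketched as follows. Since the full formula is not spelled out here, I assume the common b length-normalization parameter and idf = log(N/df_t); exact formulations vary across systems:

```python
import math

def bm25(query_terms, doc, N, df, L_ave, k1=1.5, b=0.75):
    """Okapi BM25: idf weighting with saturated, length-normalized
    term frequency. `doc` is a list of tokens; `df` maps term -> df_t."""
    L_d = len(doc)
    score = 0.0
    for t in query_terms:
        tf = doc.count(t)  # tf_td: frequency of t in d
        if tf == 0 or t not in df:
            continue
        idf = math.log(N / df[t])
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * L_d / L_ave))
    return score
```

The k1 term makes the contribution of tf_td saturate rather than grow linearly, and b interpolates between no length normalization (b = 0) and full normalization by L_d/L_ave (b = 1).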
PRP for IIR
Cost Model
E(c_ij) = e_ij + p_ij (q_ij b_ij + (1 − q_ij) g_ij)   (4.5)

To maximize the expected utility of the situation, we must consider the entire set of choices. For brevity, we denote the expected benefit of accepting c_ij, q_ij b_ij + (1 − q_ij) g_ij, by a_ij; then E(c_ij) = e_ij + p_ij a_ij. Working through the calculation, we obtain the PRP for IIR, which says that we should present the choices sorted by decreasing values of this expected benefit E(c_ij).
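A minimal sketch of this ranking rule, with the dictionary keys mirroring the symbols e_ij, p_ij, q_ij, b_ij, g_ij (the numbers in the usage below are made up):

```python
def expected_benefit(choice):
    """E(c_ij) = e_ij + p_ij * a_ij, where
    a_ij = q_ij * b_ij + (1 - q_ij) * g_ij."""
    a = choice["q"] * choice["b"] + (1 - choice["q"]) * choice["g"]
    return choice["e"] + choice["p"] * a

def rank_choices(choices):
    """PRP for IIR: present choices in decreasing expected benefit."""
    return sorted(choices, key=expected_benefit, reverse=True)
```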
Thus, we get the classic PRP, where the documents are sorted according to the decreasing values of their probability of relevance.
Probabilistic Model for Fewer Relevant Documents
Greedy Approach
Summary
Profile-based: first the system asks the user to fill in a profile capturing the user's interests and activities, which is then used for personalization. Collaboration-based: based on a user's actions within the system, the system discovers users who perform similar actions and ranks results using the other actions of those similar users. Learning-based: the system uses the user's interactions with the system, such as previous searches, and learns patterns that are unique to that user.
In profile-based systems, the information about the user is taken explicitly, while here it is taken implicitly and the model is constantly evolving.
Automated Analysis of Interests and Activities
Let R be the number of documents in the index, and r_i the number of documents in R containing term i. Corpus representation: N is the number of documents on the web, and n_i is the number of documents that contain term i. An alternative is to set N to the number of documents containing the original search terms, and n_i to the number of documents containing both the search terms and term i.
We obtain estimates of these parameters by issuing single-word queries to the search engine and using the number of results reported by it.
Ontology Based Framework
The simplest way to use this is to consider each document in this index as relevant to the user. We can experiment with different subsets of the index, such as including only pages that the user has viewed recently, or including only web pages. The query is expanded by including terms that are closer to the query in the relevant documents.
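A minimal sketch of that expansion step, in which "closeness" to the query is approximated by co-occurrence with the query terms in the user's relevant documents (an illustrative choice, not the system's actual measure):

```python
from collections import Counter

def expand_query(query_terms, relevant_docs, k=3):
    """Expand the query with the k terms that co-occur most often with
    the query terms in the user's relevant (indexed) documents."""
    counts = Counter()
    for doc in relevant_docs:  # doc: list of tokens
        if any(t in doc for t in query_terms):
            counts.update(t for t in doc if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(k)]
```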
Ambiguity in user requests and system responses is used to approximate this uncertainty.
Summary
In this chapter, we discuss several evaluation methods that are commonly used in information retrieval systems. Evaluation requires standard test collections that everyone can use. Conference and Labs of the Evaluation Forum (CLEF): this forum provides test collections for evaluating IR systems on European languages, including cross-language retrieval between them.
Forum for Information Retrieval Evaluation (FIRE): Its primary goal is to encourage research in South Asian language information access technologies by providing large-scale reusable test collections for cross-language IR experiments.
Unranked Retrieval Sets
Knowledge Base Acceleration Track: develops techniques to improve the effectiveness of human knowledge-base curators by proposing extensions to the existing KB. NII Testbeds and Community for Information Access Research (NTCIR): these collections are similar in size to those provided by TREC, but cover East Asian languages and focus on cross-language information retrieval. The F measure is the harmonic mean of precision and recall; the harmonic mean is closer to the minimum of the two when the difference between them is large.
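The set-based measures behind that observation can be sketched as follows; F1 is the harmonic mean of precision and recall:

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based evaluation: precision, recall, and F1 (their harmonic
    mean). The harmonic mean stays close to the smaller of the two."""
    tp = len(retrieved & relevant)  # relevant documents retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because the harmonic mean punishes imbalance, a system cannot score well on F1 by maximizing one measure at the expense of the other.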
Ranked Retrieval Results
Here, Z_k is the normalization factor chosen so that the NDCG value at k equals 1 when the ranking is perfect.
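A minimal sketch of NDCG with log2 discounting (gain and discount formulations vary across systems; this is one common choice):

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k results."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """NDCG: DCG divided by Z_k, the DCG of the ideal (perfectly
    ordered) list, so a perfect ranking scores 1 at k."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```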
Summary
Once we build the project, we can find the Nutch installation in the runtime/local directory at the root of the source tree. Crawl.java is the starting point: it contains the main function, which calls the crawler components in the order of the crawling process; the Injector, Generator, Fetcher, Parser, and CrawlDbUpdater are the most important. The Fetcher takes a directory of segments as input and fetches the content of the URLs listed in it.
Using the information about which URLs in the last segment were fetched and which were not, it updates the crawldb.
Language Specific Crawling
The Injector takes as input a directory containing the list of initial URLs and the crawldb directory, applies the URL normalizers and filters, and initializes the crawldb. The Generator reads the crawldb and generates a new segment with a crawl_generate subdirectory containing information about the URLs to be fetched. There is already a LanguageIdentifier plugin that detects the language of a page and adds it to the URL metadata.
This results in the crawl being pruned at that stage and the outbound links from that web page not being added to the fetch list for the next iteration.
Results and Observations
In this report, we discussed the crawling architecture, focused crawling, probabilistic ranking models, personalized IR, and evaluation methods for IR systems. For crawling, we discussed the crawler architecture and saw how it can be extended to a distributed crawler. For focused crawling, we saw two kinds of focused crawler: one that uses a hierarchical topic taxonomy and another that models the context around relevant pages.
In Probabilistic Ranking Models we discussed the Probabilistic Ranking Principle and some models based on it.
Future Work