Web Crawling and IR - CFILT, IIT Bombay


Academic year: 2023

Specifically, we discuss the crawler architecture and one of its most important variations, focused crawling, along with its application to language-specific crawling. Finally, we discuss some experiments on language-specific crawling performed in Nutch.


Focused Crawling

Probabilistic Ranking Methods

In Chapter 4 we discuss the Probabilistic Ranking Principle, some models based on it, and some variations of the principle.

Aim and Motivation

Overview of the Report

The basic algorithm of crawling can be described as follows: given a set of initial URLs (Uniform Resource Locators), the crawler downloads all the corresponding web pages, extracts the outgoing hyperlinks from those pages, and recursively downloads the linked pages. The crawler must also follow certain rules: it should not download the same web page repeatedly, while still re-visiting pages often enough to maintain freshness.
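The basic loop above can be sketched as follows. `fetch` and `extract_links` are assumed helper functions (network download and HTML link extraction) that are not described in the report; the sketch shows only the frontier logic.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Breadth-first crawl: download pages, extract outlinks, recurse."""
    frontier = deque(seed_urls)       # URLs waiting to be downloaded
    seen = set(seed_urls)             # duplicate-URL elimination
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        pages[url] = fetch(url)       # download the page
        for link in extract_links(url, pages[url]):
            if link not in seen:      # each URL is downloaded at most once
                seen.add(link)
                frontier.append(link)
    return pages
```

A real crawler adds politeness delays, robots.txt checks, and re-crawl scheduling on top of this skeleton.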

Crawler Architecture

  • URL Frontier
  • Duplicate URL elimination
  • DNS Resolution
  • Robots Exclusion Protocol

Some web pages seem to be constantly changing, but most of the time the changes are only in ads or reader comments. When a new URL arrives at the frontier, its host is looked up in the host-to-queue table; if there is no entry in the table, the host is assigned to an empty back queue, the URL is pushed into that back queue, and the process terminates.
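The host-to-back-queue assignment described above might look like the following sketch; the table and queue structures are simplified assumptions, and a full frontier would also handle the case where every back queue is busy by parking the URL in a front queue.

```python
from collections import deque
from urllib.parse import urlsplit

def route_to_back_queue(url, host_table, back_queues):
    """Route a URL to the back queue holding its host. For politeness,
    each back queue contains URLs of a single host only."""
    host = urlsplit(url).hostname
    if host in host_table:                     # host already has a queue
        back_queues[host_table[host]].append(url)
        return
    for qid, queue in enumerate(back_queues):  # no entry in the table:
        if not queue and qid not in host_table.values():
            host_table[host] = qid             # assign host to an empty queue
            queue.append(url)
            return
    # all back queues are occupied; a full implementation would hold
    # the URL in a front queue until a back queue drains
```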

Figure 2.2: URL Frontier

Distributed Crawling

This protocol is called the Robots Exclusion Protocol, and HOST/robots.txt is where this information is provided. Since this information is needed repeatedly, for every URL from that host, it makes sense to keep a cache of these files. We then have to decide on a cache eviction policy (such as least recently used); some web servers themselves specify an expiration date for their robots.txt file.
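A cache combining the two policies mentioned above (LRU eviction plus per-entry expiry) might be sketched as follows. The `fetch` callable, which returns the robots.txt lines for a host, is an assumed helper; a real crawler would download the file over HTTP and honour any server-specified expiry instead of a fixed TTL.

```python
import time
from collections import OrderedDict
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """LRU cache of per-host robots.txt parsers, with expiry (a sketch)."""

    def __init__(self, fetch, max_entries=100, ttl=3600):
        self.fetch = fetch            # host -> iterable of robots.txt lines
        self.max_entries = max_entries
        self.ttl = ttl                # seconds until a cached entry expires
        self._cache = OrderedDict()   # host -> (parser, expires_at)

    def _parser(self, host):
        now = time.time()
        entry = self._cache.get(host)
        if entry and entry[1] > now:
            self._cache.move_to_end(host)        # LRU touch
            return entry[0]
        rp = RobotFileParser()
        rp.parse(self.fetch(host))               # (re)load the rules
        self._cache[host] = (rp, now + self.ttl)
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)      # evict least recently used
        return rp

    def allowed(self, host, url, agent="*"):
        return self._parser(host).can_fetch(agent, url)
```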


User’s View

Taxonomy selection and refinement: The URLs provided by the user are classified with the initial classifier and the results are shown to the user. Interactive exploration: The system then suggests some URLs that are closer to the examples provided by the user for classification. The user can manually classify some of them and mark some of them as 'good', which means they are relevant.

Feedback: The hub list is displayed to the user at intervals; the user marks hubs as relevant or not relevant, and this feedback is fed back to the classifier.


The user marks the classification results that seem correct as 'good' and corrects the misclassified results. Hard focus: The above formulas are used to find the topic c* with the highest probability that the document belongs to that topic. If an ancestor of c* is marked good, the extracted URLs are added to the frontier; otherwise the crawl is pruned at d.
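The hard-focus rule can be sketched as below, assuming the taxonomy is given as a child-to-parent map and that "ancestor" includes c* itself; all names are illustrative, not from the report.

```python
def hard_focus_allows(topic_probs, parent, good_topics):
    """Return True if the page's outlinks should enter the frontier:
    pick the most probable topic c* and walk up the taxonomy looking
    for a node marked 'good'. If none is found, the crawl is pruned."""
    c_star = max(topic_probs, key=topic_probs.get)
    node = c_star
    while node is not None:
        if node in good_topics:
            return True
        node = parent.get(node)   # None once we pass the taxonomy root
    return False
```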

The URLs retrieved from this page are then prioritized according to this importance.

Using Context Graphs

Generating Context Graphs


A confidence limit is set for each layer, so that if a document's probability of belonging to all the classes does not exceed the corresponding thresholds, it is classified as belonging to none of the classes.


Web Page Classification Techniques


Hierarchical Classification

Positive Example Based Learning

Before describing the M-C algorithm, we define the following notation: POS is the positive data set, and Mi(neg) are subsets of negative examples, ordered so that Mi(neg) is farther from POS than Mi+1(neg). In the mapping stage, we list all the features that appear in most of the positive data but rarely occur in the unlabeled data. We then filter out all possibly positive data points from the unlabeled data, which leaves us with only strong negatives; these form M1(neg).

Convergence stage: An SVM is built with POS as positive data and NEG as negative data.
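The mapping stage described above might be sketched as follows, with documents represented as sets of features. The thresholds `pos_min` and `unl_max` are illustrative assumptions; a real implementation would then train the convergence-stage SVM with POS against the resulting M1(neg).

```python
def mapping_stage(pos_docs, unlabeled_docs, pos_min=0.6, unl_max=0.1):
    """Find features frequent in POS but rare in the unlabeled data,
    then keep only the unlabeled docs containing none of those
    features: these are the strong negatives M1(neg)."""
    def freq(feature, docs):
        return sum(feature in d for d in docs) / len(docs)

    vocab = set().union(*pos_docs)
    pos_features = {f for f in vocab
                    if freq(f, pos_docs) >= pos_min
                    and freq(f, unlabeled_docs) <= unl_max}
    m1_neg = [d for d in unlabeled_docs if not (d & pos_features)]
    return pos_features, m1_neg
```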

Language Specific Crawling

Using Focused Crawling Mechanism

  • Simple Strategy
  • Limited Distance Strategy

But the focused crawler is based on the fact that there is topical locality between web pages, which means that web pages belonging to a topic tend to be located close to each other. Character encoding: One method is to use the character-encoding scheme found in the web page's META tag. Hard focus: The relevance of a page is 1 if it is in the desired language and 0 otherwise.

Extracted URLs are pushed to the frontier only if the relevance score is 1; otherwise crawling along that path is cut off.

Detecting Indian Languages


Binary Independence Model

The term binary corresponds to the fact that documents and queries are represented as binary term-incidence vectors. The term independence corresponds to the assumption that term occurrences are independent of each other. After making the Naive Bayes conditional-independence assumption, estimating the probabilities with the measures given below, and simplifying, we get the following formula for the importance of a document d for a query.

The resulting quantity is called the Retrieval Status Value (RSV). The weight of each term t is

c_t = log [ (s / (S − s)) / ((df_t − s) / ((N − df_t) − (S − s))) ]   (4.2)

where N is the total number of documents, S is the number of relevant documents, s is the number of relevant documents in which the term t is present, and df_t is the total number of documents containing the term t. To avoid the possibility of a zero inside the logarithm, 1/2 is added to each term.
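Equation 4.2 with the 1/2 smoothing can be computed as in this sketch; the variable names follow the definitions above.

```python
import math

def bim_term_weight(s, S, df_t, N):
    """Log-odds term weight c_t from equation 4.2, with 1/2 added to
    each count to avoid a zero inside the logarithm."""
    num = (s + 0.5) / (S - s + 0.5)
    den = (df_t - s + 0.5) / ((N - df_t) - (S - s) + 0.5)
    return math.log(num / den)
```

The RSV of a document is then the sum of c_t over the query terms it contains.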

Non-Binary Model: Okapi BM25

This assumption is far from true, especially in the case of interactive information retrieval, which we discuss in the next section. The first and simplest possibility is that the score for document d is the sum of the idf scores of the query terms present in d. In addition to the variables shared with the binary independence model from the previous section, tf_td is the frequency of term t in document d, L_d is the length of the document, L_ave is the average length over all documents, and k1 is a parameter that determines the document term-frequency scaling.
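A sketch of BM25 scoring using the variables defined above. The length-normalization parameter b is a standard part of BM25 not named in the text, and the simple log(N/df_t) idf estimate used here is one common variant; both are assumptions of this sketch.

```python
import math

def bm25_score(query_terms, tf_d, L_d, L_ave, df, N, k1=1.5, b=0.75):
    """Okapi BM25: idf weighting with saturated, length-normalized tf."""
    score = 0.0
    for t in query_terms:
        if tf_d.get(t, 0) == 0:
            continue
        idf = math.log(N / df[t])
        tf = tf_d[t]
        # tf saturates as it grows (k1) and is normalized by doc length (b)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * L_d / L_ave))
    return score
```

With identical term frequencies, a shorter document scores higher, which is the effect of the length normalization.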


Cost Model

E(c_ij) = e_ij + p_ij (q_ij b_ij + (1 − q_ij) g_ij)   (4.5)

To maximize the expected utility of the situation, we must consider the entire set of choices. For brevity, we denote the average benefit of accepting c_ij, namely q_ij b_ij + (1 − q_ij) g_ij, by a_ij; then E(c_ij) = e_ij + p_ij a_ij. Doing some calculations, we get the PRP for IIR, which says that we should sort the choices by decreasing values of their expected utility E(c_ij).

Thus, we get the classic PRP, where the documents are sorted according to the decreasing values of their probability of relevance.
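The ranking rule of equation 4.5 can be sketched as follows; the choice dictionaries are illustrative.

```python
def expected_utility(e, p, q, b, g):
    """E(c_ij) = e_ij + p_ij * a_ij, with a_ij = q_ij*b_ij + (1-q_ij)*g_ij."""
    a = q * b + (1 - q) * g      # average benefit of accepting the choice
    return e + p * a

def rank_choices(choices):
    """Sort choices (dicts of the five parameters) by decreasing E(c_ij)."""
    return sorted(choices, key=lambda c: expected_utility(**c), reverse=True)
```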

Probabilistic Model for Fewer Relevant Documents

Greedy Approach


  • Profile based: First the system asks the user to complete a profile in which information about the user's interests and activities is obtained, which is used in personalization.
  • Collaboration based: Based on a user's actions with the system, the system discovers users who perform similar actions and presents results based on other actions of those similar users.
  • Learning based: The system uses the user's interactions with the system, such as previous searches, and learns patterns that are unique to that user.

In profile-based systems, the information about the user is taken explicitly, while here it is taken implicitly and the model is constantly evolving.

Automated Analysis of Interests and Activities

Let R be the number of documents in the index and r_i be the number of documents in R containing the term i. Corpus representation: N is the number of documents on the web and n_i is the number of documents that contain the term i. Another alternative is to set N to the number of documents containing the original search terms, and n_i to the number of documents containing both the search terms and the term i.

We obtain estimates of these parameters by issuing single-word queries to the search engine and using the number of results reported by it.

Ontology Based Framework

The simplest way to use this is to consider each document in this index as relevant to the user. We can experiment with different subsets of the index, such as including only pages that the user has viewed recently, or including only web pages. The query is expanded by including terms that are closer to the query in the relevant documents.

Ambiguity in user requests and system responses is used to approximate this uncertainty.


In this chapter, we discuss several evaluation methods that are commonly used in various information retrieval systems. For this there is a need for standard test data collections that can be used by everyone. Conference and Laboratories of the Evaluation Forum (CLEF): This forum provides test collections for the evaluation of IR systems for European languages and for cross-language retrieval among them.

Forum for Information Retrieval Evaluation (FIRE): Its primary goal is to encourage research in South Asian language information access technologies by providing large-scale reusable test collections for cross-language IR experiments.

Unranked Retrieval Sets

Knowledge Base Acceleration Track: Develops techniques to improve the effectiveness of human knowledge base curators by proposing extensions to the existing KB. NII Testbeds and Community for Information Access Research (NTCIR): These collections are similar in size to the collections provided by TREC, but for East Asian languages, and are focused on cross-lingual information retrieval. The harmonic mean is closer to the minimum of the two measures when the difference between them is large.
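The harmonic-mean behaviour noted above is easy to verify: the F measure stays close to the smaller of precision and recall.

```python
def f_measure(precision, recall):
    """F1: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

For example, f_measure(0.9, 0.1) is 0.18, far closer to 0.1 than the arithmetic mean 0.5 would be.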

Ranked Retrieval Results

Here, Z_k is the normalization factor, calculated so that the NDCG value at k equals 1 when the ranking is perfect.
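A sketch of NDCG@k computed from a list of graded relevance values in ranked order; the log2(i+2) discount used here is the common convention, assumed rather than taken from the report.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k results."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """Normalize by Z_k, i.e. the DCG of the ideal (perfectly sorted)
    ranking, so a perfect ranking gets NDCG = 1."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```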


If we don't build the project, we can find the Nutch installation in the executable/local directory at the root of the source tree. Crawl.java is the entry point; its main function calls the crawler components in the order of the crawling process, of which Injector, Generator, Fetcher, Parser, and CrawlDbUpdater are the most important. The Fetcher takes a directory of segments as input and fetches the content of the URLs present in it.

Using the information about which URLs were fetched and which were not in the last segment, it updates the crawldb.

Language Specific Crawling

The Injector takes as input the directory containing the list of initial URLs and the crawldb directory, applies the normalizers and filters to the URLs, and initializes the crawldb. The Generator parses the crawldb and generates a new segment with a subdirectory containing information about the URLs to be fetched. There is already a LanguageIdentifier plugin that detects the language and adds it to the URL metadata.

This results in the crawl being pruned at that stage and the outbound links from that web page not being added to the fetch list for the next iteration.

Results and Observations

In this report, we discussed the crawling architecture, focused crawling, probabilistic ranking models, personalized IR, and evaluation methods for IR. For crawling, we discussed the architecture of the crawler and saw how it can be extended to a distributed crawler. For focused crawling, we saw two types of focused crawler: one that uses the hierarchical topic structure of the web, and another that models the context around relevant pages.

In Probabilistic Ranking Models we discussed the Probabilistic Ranking Principle and some models based on it.

Future Work


Figure 2.1: Basic architecture of crawler
Figure 2.3: URL splitting in Distributed Crawler
Figure 3.1: Context Graph

