• No results found

information from the document.

N/A
N/A
Protected

Academic year: 2023

Share "information from the document."

Copied!
36
0
0

Loading.... (view fulltext now)

Full text

(1)

Information Retrieval Information Extraction and

Akhil Deshmukh (06D5007) Narendra Kumar (06D05008) Kumar Lav (06D05012)

image: google

(2)

IR and IE

 IR is the process of selecting documents from the collection.

 IE is retrieval of specific

information from the document.

 IE transform the unstructured data into the structured.

 They together provide powerful

tool for text processing.

(3)

contd ....

IR and IE can be seen in our day to day life.

 The email-inbox

 The searches we make on internet.

 This seminar itself is a result of IR

and IE.

(4)

Information retrieval Tasks Information retrieval Tasks

• Search Engines (eg. Google and Yahoo).

• Feedback.

• Meta search.

• Data clustering and Visualization search

results.

(5)

Clustering

of search

results

(6)

IR measures IR measures

 Division of documents.

 Precision

 Recall

 F-score.

(7)

Models for IR Models for IR

 Language Models

In LM, the idea is to compute the conditional probability P(q | d), i.e. the probability of generating a query q given the observation of a document d.

 Vector Space Model (VSM)

VSM is a algebraic model, for representing

objects as vectors of identifiers.

(8)

Model 1

 Given a query q, we want to compute the relevance of document s. We find the probability -

p(r=1| q, d) , where r is the binary label for ‘relevant’.

 Applying Bayes rule - p(r = 1| q, d) = p(d, q | r=1) p(r=1) p(q , d)

We make two assumptions –

1. p(q | d, r =1) is computed from unigram model created from d,

2. p(q | d, r = 0) = p(q | r = 0)

Here we consider the ratio of log values –

log p(r = 1 | q, d) = log p(q | d, r=1)

+

log p( r=1 | d) p(r = 0 | q, d) p(q | d, r=0) p(r=0 | d)

Disadvantages

Language Models for IR

Language Models for IR

(9)

Vector Space Model for IR Vector Space Model for IR

 Each document d is represented as a scalar tf.idf, known as tf-idf scheme.

 Where “ tf ” is normalize term frequency, and “ idf “ is the inverse document

frequency.

tfw =

c(w, d) / max

v

c(v, d) , c =count

idfw =

log(N/n

w

) N = no. Of documents in collection

n = no. Of documents containing w

so we have tf . idf

w

= tf

w

× idf

w

.

(10)

Fig: http://www.miislita.com/term-vector/term-vector-3.html

(11)

sim(d, q) = cos( )

=

d

T

q

=

j

W

d,j

x W

q,j

|d| |q|

|d| |q|

here

|d|= ∑

j

W

d,j 2 ,

|q| = ∑

j

W

q,j 2

and Wd,j = weight of term j in document d

i.e. W

d,j

= tf

d,j

x idf

j

Wq,j = weight of term j in query q

• Normalization ??

• Documents are ranked by their cosine

similarity to q

(12)

Trigger Language Model

Here we calculate the probability p(q| M

d

).

where Md is the language model for document d.

a) Estimation

p(q

i

| M

d

) = tf(q

i

, d) , tf(q

i

, d)=

frequency of q

i

in d

dl

d

dl

d

= total no. of terms in d.

b) Zero counts

p(q

i

| M

d

) = p(q

i

| M

d

) , if tf(q

i

, d) >0 = cf

qi

, otherwise cs

cfqi = raw count of qi in all documents

cs = total no. of terms in all documents

c) Smoothing: calculate pavg(qi) .

d)Final term probability, p#(qi | Md) = p(qi | Md)(1-R) x pavg(qi)R , if tf(qi, d)>0

= cfqi , otherwise

cs

(13)

e) Finally:

p(q | Md) = πqi∈q p#(qi | Md) x πqi∈q [1- p#(qi | Md)]

To minimize the smoothing risk, R = [1/(1+f)] x [f/(1+f)]

tf

Where f is the mean term frequency of term,

f = pavg(qi) x dld

Fig by : ZHANG Jun-lin, SUN Le QU Wei-min, SUN Yu-

fang,2002

(14)

Information Extraction (IE)

 Task of locating specific information from a natural language document.

 Convert information from different text documents into database

entries (structured format).

 It runs on the top of IR systems.

(15)

Is IE and IR same ?

 IE systems do not understand the text in the input documents.

 IR retrieves relevant documents, IE extracts relevant information

 IE sees text as structure, for IR they are bag of words.

Both, together provide powerful

tools for text processing.

(16)

Extraction

Free text:

◦ Natural language. Articles in newspapers etc.

◦ Uses NLP techniques. like POS tagging.

◦ Rules based on syntactic and semantic information.

Structured Text:

◦ Text as in database or tables.

◦ Extraction is easy and direct.

◦ Rules are based on the structure of the

text.

(17)

Extraction

contd...

 Semi-structured:

◦ are between free and structured text.

◦ comprises of phrases.

◦ regular expressions are used.

Information may be :

 single-slot : single information.

eg. Micrsoft CEO Bill Gates

 multi-slot : set of information.

eg. Powai: 1 BHK Rs.7,000 p.m., 2 BHK Rs.

20,000 p.m.

(18)

IE System Design

Two main approaches :

Knowledge engineering :

◦ grammars expressing rules are constructed by hand.

◦ specific to application.

◦ rely on the skill of knowledge engineer.

Automatic training :

◦ based on training-testing paradigm.

◦ highly depend on the corpus for

accuracy.

(19)

Learning pattern by LP (LP) 2

It learns rules by generalising over a set of examples marked via SGML tags in a training corpus

It induces two types of symbolic rules:

tagging rules :

◦ inserts single SGML tag.

◦ selects a tag in the training corpus and extracts from the text a window of w words to the left and w words to the right.

◦ Windows are tagged by SGML tags.

◦ Uses Context rules.

◦ eg. Seminar at 9 pm. After tagging Seminar at

<time> 9 </time> pm

(20)

(LP) 2 contd ..

 Correction rules:

◦ shifts wrongly positioned tags to the correct positions.

◦ Identical to the tagging rules except:

 patterns match also the tags inserted by the tagging rules

 actions shift misplaced tags rather than

adding new ones.

(21)

(LP) 2 contd ..

 Generalisation:

It generalises the initially induced

rules. Each generalisation is tested on the training corpus and the best k are retained.

 adding the wildcard characters like in 4 pm * pm

 Using POS tagger NLP knowledge

(22)

(LP) 2 contd ..

 Information extraction:

Information is extracted in following ways:

 best pool rules are used to tag.

 correction rules are applied to remove errors.

 tags are validated i.e. uncoupled tags are

removed.

(23)

(LP) 2 contd ..

NLP based.

[7]

(24)

WHISK

System name

Text style

Multi-slot Need syntax

Need trim

WI Struct Yes No No

SRV Semi No No No

RAPIER Semi No No No

AutoSlog Free No Yes Yes

CRYSTAL Free Yes Yes Yes

CRYSTAL Semi Yes Yes Yes

WHISK Struct Yes No No

WHISK Semi Yes No No

WHISK Free Yes Yes No

IE

Systems

[6]

(25)

WHISK:

 Input Data.

 WHISK rule representation.

 WHISK Algorithm.

(26)

Input Data:

 Structured data Input:

eg: Profiles on web host, consider facebook, orkut etc.

 Semi-Structured:

eg: Ads in newspaper.

 Free text:

eg: Any random article.

 Single-slot / Multi-slot

(27)

WHISK Rules:

 For Structured and semi- structured text:

eg:

Consider an add for renting house. The corresponding rule will be of the form:

ID:: 1

Pattern:: * ( Digit ) ‘R’ * ‚’$’ ( Number )

OutPut:: Rental {Bedrooms $1} {Price $2}

Disjunction allowed:

Bdrm = (brs | br | bds | bdrm | bd | bedrooms | bedroom | bed)

Thus the newly formed rule will be:

ID:: 2

Pattern:: * ( Nghbr ) * ( Digit ) ‘ ‘ Bdrm * ‘$‘ ( Number ) Output:: Rental {Neighborhood $1} {Bedrooms $2}

{Price $}

We have used Nghbr, what does it stand for?

(28)

Contd:

 Extension of the rules for grammatical text:

• Pre-processing required, Syntactic analysis.

@S[ {SUBJ @PN[ C. Vincent Protho ]PN , @PS[ chairman and chief excutive officer ]

of this maker of semiconductots, } {VB @Passive was named @nam }

{PP to the additional post of @PS[ president ]PS , } {REL_V succeeding @succeed @PN[ John W. Smith ]PN ,

who resigned @resign to pursue @pursu other interests. }

]@S 8910130051-1

• The corresponding rule pattern will be:

ID:: 3

Pattern:: *(Person)* ‘@Passive‘ *F ‘named‘ * {PP *F (Position) *

‘@succeed ‚ (Person)

Output:: Succession {PersonIn $1} {Post $2} {PersonOut $3}

(29)

WHISK Algorithm:

 Features:

◦ A supervised learning algorithm

◦ Tagging process is interleaved with the learning.

◦ Presents the user with a batch of instances to tag.

◦ Induces a set of rules from the expanded training set.

 Progress:

◦ Begins with a reservoir of untagged instances.

◦ At each iteration a set of untagged instances are selected from the reservoir and presented to the user to annotate.

◦ The user adds a tag for each case frame to be extracted from the instance, ‘Training Instances’

(30)

Algorithm:

@S[ Capitol Hill – 1 br twnhne. Fplc D/W W/D. Undrgrnd pkg incl

$675. 3 BR,

upper flr of turn of ctry HOME. Incl gar, grt N. Hill loc $995. (206) 999-9999 <br>

<i> <font size=-2> (This ad last ran on 08/03/97.) </font> </i>

<hr>

]@S 5

@@TAGS Rental {Neighborhood Capitol Hill} {Bedrooms 1} {Price 675}

@@TAGS Rental {Neighborhood Capitol Hill} {Bedrooms 3} {Price 995}

WHISK(Reservoir) RuleSet = NULL Training = NULL

Repeat at user‘s request

Select a batch of NewInst from Reservoir (User tags the NewInst)

Add NewInst to Training

Discard rules with errors on NewInst For each Inst in Training

For each Tag of Inst

If Tag is not covered by RuleSet

Rule = GROW_RULE(Inst, Tag, Training) Prune RuleSet

(31)

Anchoring the extraction Slot:

Empty Rule: “ * ( * ) * ( * ) * ( * ) * “ Anchoring Slot 1:

Base_1: * ( Nghbr )

Base_2: ‘@start‘ ( * ) ‘ -‘

Anchoring Slot 2:

Base_1: * ( Nghbr ) * ( Digit )

Base_2: * ( Nghbr ) * ‘- ‘ ( * ) ‘ br‘

Anchoring Slot 3:

Base_1: * ( Nghbr ) * ( Digit ) * ( Number ) Base_2: * ( Nghbr ) * ( Digit ) * ‘$‘ ( * ) ‘.‘

Contd:

(32)

GROW_RULE(Inst, Tag, Training)

Rule = empty rule (terms replaced by wildcards) For i = 1 to number of slots in Tag

ANCHOR(Rule, Inst, Tag, Training, i)

Do until Rule makes no errors on Training or no improvement in Laplacian EXTEND_RULE(Rule, Inst, Tag, Training)

ANCHOR(Rule, Inst, Tag, Training, i)

Base_1 = Rule + terms just within extraction i Test first i slots of Base_1 on Training

While Base_1 does not cover Tag

EXTEND_RULE(Base_1, Inst, Tag, Training) Base_2 = Rule + terms just outside extraction i Test first i slots of Base_2 on Training

While Base_2 does not cover Tag

EXTEND_RULE(Base_2, Inst, Tag, Training) Rule = Base_1

If Base_2 covers more of Training than Base_1 Rule = Base_2

Laplacian = (e +1) / (n +1), where e is the number of errors and n is the number of

extractions made on the training set

Contd:

(33)

EXTEND_RULE(Rule, Inst, Tag, Training) Best_Rule = NULL

Best_L = 1.0

If Laplacian of Rule within error tolerance Best_Rule = Rule

Best_L = Laplacian of Rule For each Term in Inst

Proposed = Rule +Term Test Proposed on Training

If Laplacian of Proposed < Best_L Best_Rule = Proposed

Best_L = Laplacian of Proposed Rule = Best_Rule

WHISK represents a form of hill climbing, and cannot guarantee that the rules it grows on are optimal.

Contd:

(34)

Interactive preparation of training

 Selecting informative instances:

 In each iteration of WHISK, a batch of instances is selected from the reservoir of untagged instances, presented to the user for tagging, and then added to the training set.

 Classes of untagged instances:

Instances covered by an existing rule

Instances that are near misses of a rule

Instances not covered by any rule

 When to stop tagging?

(35)

Conclusion

 Till now the IR is done with the

assumption of term independence, work is going on to overcome this assumption.

 Ability to use dynamic knowledge in IR.

 Visual IR and IE will improve our

ability to use the WWW.

(36)

References References

1. Xiaojin Zhu, “Advanced NLP: Information Retrieval”, CS838-1, WISC,2007.

2. Dr. Ian Soboroff, ”Information Retrieval”, CMSC 491I/691I UMBC, 2002

3. Nazli Goharian, “Language Models”, CS429, Illinois Institute of Technology, 2007

4. Fabio Ciravegna, “(LP)2, an Adaptive Algorithm for Information Extraction from Web-related Texts”,IJCAI- 2001 Workshop on Adaptive Text Extraction and

Mining , 2001.

5. Stephen Soderland, “Learning Information Extraction Rules for Semi-Structured and Free Text”,Machine Learning 34, 233–272 ,1999

6. Line Eikvil, “Information Extraction from world wide web”, A Survey. No. 945, ISBN 82-539-0429-0, pp. 39, 1999

References

Related documents

These gains in crop production are unprecedented which is why 5 million small farmers in India in 2008 elected to plant 7.6 million hectares of Bt cotton which

INDEPENDENT MONITORING BOARD | RECOMMENDED ACTION.. Rationale: Repeatedly, in field surveys, from front-line polio workers, and in meeting after meeting, it has become clear that

Of those who have used the internet to access information and advice about health, the most trustworthy sources are considered to be the NHS website (81 per cent), charity

Harmonization of requirements of national legislation on international road transport, including requirements for vehicles and road infrastructure ..... Promoting the implementation

To break the impasse, the World Bank’s Energy Sector Management Assistance Program (ESMAP), in collaboration with Loughborough University and in consultation with multiple

Angola Benin Burkina Faso Burundi Central African Republic Chad Comoros Democratic Republic of the Congo Djibouti Eritrea Ethiopia Gambia Guinea Guinea-Bissau Haiti Lesotho

The scan line algorithm which is based on the platform of calculating the coordinate of the line in the image and then finding the non background pixels in those lines and

Daystar Downloaded from www.worldscientific.com by INDIAN INSTITUTE OF ASTROPHYSICS BANGALORE on 02/02/21.. Re-use and distribution is strictly not permitted, except for Open