
Contents - CFILT, IIT Bombay


Academic year: 2023


Introduction

  • Named Entities
  • Named Entity Recognition and Classification (NER)
  • Applications of NER
    • Introduction to Cross-Lingual Information Access (CLIA)
  • Challenges in NER for Indian Languages
  • Roadmap

Most research on NER systems is structured as taking an unannotated block of text, for example: “The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. …”

Cross-Lingual Information Access (CLIA) is a mission-mode project to be executed by a consortium of academic and research institutions and industry partners in India.

Although we can gain much insight from the methods used for English, many issues make the nature of the problem different for Indian languages.

Another characteristic of these languages is that most of them use scripts of Brahmi origin, which are highly phonetic, a property that could be exploited for multilingual NER. The number of common nouns that can also be used as names (proper nouns) is very large for Indian languages, in contrast to European languages, where a larger share of first names are not used as ordinary words. Moreover, how often such words occur as common nouns relative to person names is more or less unpredictable.

Due to the alpha-syllabic nature of the Indian scripts, abbreviations can be expressed through a sequence of letters or syllables. Most importantly, there is a serious lack of labeled data for machine learning.

In Chapter 3, we describe the work done at IIT Bombay on NER for Indian languages, especially the CLGIN system.

Figure 1.1: CLIA Architecture

NER Survey

  • General Observations
  • Language factor
  • Textual genre or domain factor
  • Entity type factor
  • Learning methods
    • Supervised learning
    • Semi-supervised learning
    • Unsupervised learning
  • Feature space for Named Entity Recognition
    • Word-level features
    • List lookup features
  • Corpus Statistics and NER
    • Informativeness measures and statistics
    • Application to NER

The factors of textual genre (journalistic, scientific, informal, etc.) and domain (gardening, sports, business, etc.) have not been extensively investigated in the NER literature.

In the term "Named Entity", the word "Named" restricts the task to those entities for which one or more rigid designators stand for the referent. There is general agreement in the NER community to include temporal expressions and some numerical expressions, such as amounts of money and other types of units.

While some instances of these types are good examples of rigid designators (e.g., 2001 is the year 2001 of the Gregorian calendar), many are not (e.g., June refers to a month of an undefined year: last June, this June, June 2020, etc.). Similarly, fine-grained subcategories of "person", such as "politician" and "entertainer", appear in [FL01] and […]. The "miscellaneous" type is used in the CoNLL conferences and covers proper names that fall outside the classic "enamex" types.

A baseline supervised learning (SL) method that is often proposed consists of tagging words in a test corpus when they are annotated as entities in the training corpus.

For example, the passage "The Robots of Dawn, by Isaac Asimov (Paperback)" would make it possible to find, on the same site, "The Ants, by Bernard Werber (Paperback)".

Riloff and Jones note that the performance of this algorithm can degrade quickly when noise is introduced into the entity list or the pattern list.
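The supervised baseline mentioned earlier in this passage (tagging test words that were annotated as entities in training) can be sketched in a few lines. This is a minimal illustration, assuming token-level (word, tag) training pairs and an "O" tag for non-entities; the function names and the toy data are hypothetical and are not taken from the surveyed systems.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Memorize, for each training word, the entity tag it most often carried."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            if tag != "O":  # only remember words annotated as entities
                counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(lexicon, words):
    """Tag a test sentence: memorized words get their tag, all others get "O"."""
    return [(w, lexicon.get(w, "O")) for w in words]

# Toy training data (hypothetical)
train = [[("Isaac", "PER"), ("Asimov", "PER"), ("wrote", "O"), ("novels", "O")]]
lexicon = train_baseline(train)
print(tag_baseline(lexicon, ["Asimov", "wrote", "Foundation"]))
# [('Asimov', 'PER'), ('wrote', 'O'), ('Foundation', 'O')]
```

The unseen name "Foundation" is left untagged, which is exactly the weakness that makes this a baseline: it cannot recognize entities that never appeared in the training corpus.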

For example, when X is a sequence of capitalized words, the query "such as X" is searched on the web and, in the retrieved documents, the noun immediately preceding the query can be selected as the hypernym of X.

The summarized pattern feature is a condensed form of the character-type pattern in which consecutive character types are not repeated in the mapped string.

Common nouns listed in a dictionary are useful, for example, in disambiguating capitalized words in ambiguous positions (e.g., at the beginning of a sentence).
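The condensed pattern feature described above can be illustrated as follows. This is a sketch under an assumed mapping convention (uppercase to "A", lowercase to "a", digits to "0", everything else to "-"); the function names are hypothetical.

```python
from itertools import groupby

def char_type(ch):
    """Map one character to a coarse type symbol (assumed convention)."""
    if ch.isupper():
        return "A"
    if ch.islower():
        return "a"
    if ch.isdigit():
        return "0"
    return "-"  # punctuation and any other character

def pattern(word):
    """Full pattern: one type symbol per character."""
    return "".join(char_type(ch) for ch in word)

def summarized_pattern(word):
    """Condensed pattern: runs of the same type symbol are collapsed to one."""
    return "".join(symbol for symbol, _ in groupby(pattern(word)))

print(pattern("Machine-223"))             # Aaaaaaa-000
print(summarized_pattern("Machine-223"))  # Aa-0
```

The condensed form generalizes better across words of different lengths, since "Machine-223" and "Unit-7" both map to "Aa-0".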

With measures or scores that indicate how topic-oriented or "informative" each word in the corpus is, we can identify named entities in individual documents of the corpus. The principle behind the IDF measure is that the fewer documents a word appears in, the more likely it is to be highly relevant to those documents, and the greater its information content. It is computed as IDF(w) = log(D / Dw), where Dw is the number of documents in which the word w appears and D is the total number of documents in the corpus.
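As a small worked example of the IDF measure, the sketch below computes log(D / Dw) over a toy tokenized corpus. The base-2 logarithm, the function name, and the toy documents are assumptions made for illustration.

```python
import math

def idf(documents):
    """Return IDF(w) = log2(D / Dw) for every word in a tokenized corpus."""
    D = len(documents)
    doc_freq = {}
    for doc in documents:
        for w in set(doc):  # count each word once per document
            doc_freq[w] = doc_freq.get(w, 0) + 1
    return {w: math.log2(D / dw) for w, dw in doc_freq.items()}

# Toy corpus (hypothetical)
docs = [
    ["delhi", "is", "a", "city"],
    ["mumbai", "is", "a", "city"],
    ["delhi", "is", "old"],
    ["this", "is", "it"],
]
scores = idf(docs)
print(scores["is"], scores["delhi"], scores["mumbai"])  # 0.0 1.0 2.0
```

A word that appears in every document ("is") scores 0, while a word confined to a single document ("mumbai") scores highest.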

The concept of expected IDF based on the frequency of words in the corpus was introduced by [CHU95]. Most current approaches to NEI/NER do not use global distributional features of words (e.g., information content, term co-occurrence statistics, etc.) computed over a large corpus.

Table 2.2: List lookup features for NER
2.6.2.2 Words that are typical of organization names

The CLGIN system

MEMM Based NER System for Hindi

  • TagSet
  • Features Used
  • Results
  • Using Foreign Language Word Information for NER

The system first skips the entire run of NNPs (proper nouns) and then checks the next word after them.

These features included: ORG dict, ORG add next, ORG next2 add, ORG prev2 add, ORG nextcontext1, ORG nextcontext2, ORG prevcontext1 and ORG prevcontext2.

Context words: previous and next words.

For each tag, we generate the probability of reaching that tag based on possible previous tags. At each step, for each label, we select the path that has the greatest chance of reaching the current state. As the numbers show, there is a significant improvement in the accuracy of detecting organization names.
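The decoding step described above (for each tag, keep only the most probable path reaching it) is a Viterbi-style dynamic program. The sketch below is an illustration under assumptions, not the CLGIN implementation: `trans_prob` is a hypothetical callback standing in for the trained MEMM's P(tag | previous tag, word), and the toy probabilities are invented.

```python
def viterbi(n_words, tags, trans_prob, start="<s>"):
    """Return the most probable tag sequence for a sentence of n_words words.

    trans_prob(prev_tag, tag, i) is assumed to return P(tag | prev_tag, word i).
    """
    if n_words == 0:
        return []
    # best[i][tag] = (probability of the best path ending in tag at word i, previous tag)
    best = [{t: (trans_prob(start, t, 0), None) for t in tags}]
    for i in range(1, n_words):
        column = {}
        for t in tags:
            # keep only the most probable way of reaching tag t at position i
            column[t] = max((best[i - 1][p][0] * trans_prob(p, t, i), p) for p in tags)
        best.append(column)
    # follow the back-pointers from the best final tag
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(n_words - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

def toy_prob(prev, tag, i):
    """Hypothetical local distributions for a three-word sentence."""
    if i == 0:
        dist = {"ORG": 0.7, "O": 0.3}
    elif i == 1:
        dist = {"ORG": 0.8, "O": 0.2} if prev == "ORG" else {"ORG": 0.3, "O": 0.7}
    else:
        dist = {"ORG": 0.1, "O": 0.9}
    return dist[tag]

print(viterbi(3, ["ORG", "O"], toy_prob))  # ['ORG', 'ORG', 'O']
```

Real systems usually work with log-probabilities to avoid underflow on long sentences; plain products are used here only to keep the example short.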

Information on "Foreign Language Words" has been found to be a useful feature for identifying organization tags.

Fig 3.2: Accuracy figures for CLGIN system

Combining Global and Local Characteristics for NEI and NER

  • CLGIN Approach
  • Tagging using Global Distribution (NEIG)
    • Information Measure/Score
    • Heuristics for Pruning and Augmenting NE List
  • Performance Comparison of NEIG and CLGIN Approaches (Training and Test
  • Performance Comparison of Baseline, NEIG and CLGIN (Training and Test

The output of the MEMM system thus obtained is the final output of the CLGIN approach. NEs are highly relevant words in a document [CCR02] and are expected to have a high information content [RJ05]. In this step, the top few words with a high information score are selected as NEs (threshold is set using a development set).

Of all the measures, Residual IDF performed best and was used to generate the ranked list of words expected to be NEs.

In this step, the following pruning and augmentation heuristics are applied to the ranked NE list. From the previous step, a list of words with a high information score (say, the top t) is taken.
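The ranking by Residual IDF can be sketched as follows. The source does not spell out the formula, so this example assumes the usual formulation: observed IDF minus the IDF expected under a Poisson model, where the expected document frequency follows from the word's total corpus frequency. The function names and toy corpus are hypothetical.

```python
import math
from collections import Counter

def residual_idf(documents):
    """Residual IDF = observed IDF - IDF expected under a Poisson model (assumed form)."""
    D = len(documents)
    coll_freq = Counter(w for doc in documents for w in doc)      # total occurrences
    doc_freq = Counter(w for doc in documents for w in set(doc))  # documents containing w
    scores = {}
    for w, cf in coll_freq.items():
        observed = -math.log2(doc_freq[w] / D)
        # Poisson probability that a document contains w at least once
        expected = -math.log2(1 - math.exp(-cf / D))
        scores[w] = observed - expected
    return scores

def rank_by_information(documents):
    """Words sorted from most to least informative."""
    scores = residual_idf(documents)
    return sorted(scores, key=scores.get, reverse=True)

# Toy corpus (hypothetical): "gandhi" is bursty, "the" is spread evenly
docs = [["gandhi"] * 4 + ["the"], ["the", "river"], ["the", "road"], ["the", "river"]]
ranked = rank_by_information(docs)
print(ranked[0], ranked[-1])  # gandhi the
```

Both "gandhi" and "the" occur four times, but "gandhi" is concentrated in one document, so its observed IDF exceeds what the Poisson model predicts and it rises to the top of the list.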

In this step, t more words are taken and, for each word w, a vector whose size is the number of distinct words in the corpus is created. Each component of the vector is the frequency with which the corresponding corpus word occurs in the context (of three words) of w. It was observed that the NEs were grouped in some clusters and common words in other clusters.
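The context vectors described above can be built as in the sketch below. It assumes a window of three words on each side of the target (the source says only "three words") and uses cosine similarity as one possible basis for clustering; the clustering algorithm itself is not specified in the text, and all names here are hypothetical.

```python
import math
from collections import Counter

def context_vectors(documents, targets, window=3):
    """Build one corpus-vocabulary-sized context frequency vector per target word."""
    vocab = sorted({w for doc in documents for w in doc})
    counts = {t: Counter() for t in targets}
    for doc in documents:
        for pos, w in enumerate(doc):
            if w in counts:
                lo, hi = max(0, pos - window), min(len(doc), pos + window + 1)
                # count every neighbour inside the window, excluding the target itself
                counts[w].update(doc[j] for j in range(lo, hi) if j != pos)
    return vocab, {t: [counts[t][v] for v in vocab] for t in targets}

def cosine(u, v):
    """Cosine similarity between two frequency vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy corpus (hypothetical)
docs = [["shri", "ram", "went", "home"], ["shri", "mohan", "went", "home"]]
vocab, vectors = context_vectors(docs, ["ram", "mohan"])
print(vocab)           # ['home', 'mohan', 'ram', 'shri', 'went']
print(vectors["ram"])  # [1, 0, 0, 1, 1]
```

Here "ram" and "mohan" share identical contexts, so their vectors coincide and any similarity-based clustering would place them together.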

A cluster is marked as an NE cluster if the average rank of the top-ranked 50% of the words within the cluster is low (< t/2), and the words in that cluster are added as NEs. Conversely, if most of the words in a cluster have a higher rank, i.e., lower information content, they are removed from the NE set. Words that end in certain common suffixes, such as (on) and (yenge), are also removed from the NE list.
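The cluster-marking rule can be sketched as a small predicate. This rests on two assumptions about the wording above: the threshold is read as t/2, and "50% of the top ranked words" is taken to mean the better-ranked half of the cluster. The function name and toy ranks are hypothetical.

```python
def is_ne_cluster(cluster, rank, t):
    """Decide whether a cluster should be marked as an NE cluster.

    rank[w] is the 1-based position of w in the information-ranked list
    (1 = most informative). Assumed rule: the mean rank of the better-ranked
    half of the cluster must be below t / 2.
    """
    if not cluster:
        return False
    ranks = sorted(rank[w] for w in cluster)
    top_half = ranks[: max(1, len(ranks) // 2)]
    return sum(top_half) / len(top_half) < t / 2

# Toy ranks (hypothetical)
rank = {"ram": 1, "delhi": 3, "gaya": 40, "karna": 90}
print(is_ne_cluster(["ram", "delhi", "gaya"], rank, t=10))  # True
print(is_ne_cluster(["gaya", "karna"], rank, t=10))         # False
```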

The randomization resulted in document mixing, with each partition containing documents from all books. In this experiment, however, we divided the documents into two groups: documents from a few books (genre: History and History) were placed in one group and the rest (genre: Biography and Essay) in another. NEIG and CLGIN come close to S-MEMM, and the CLGIN results are significantly better than those of the baseline system.

Table 3.3: Summary of Approaches
3.2.1 CLGIN Approach

Unsupervised Approach for NER: Clustering Based on Distributional Similarity

  • Results

In Chapter 2, we then presented various approaches that have been developed over time for named entity recognition.

Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods.

[LI07] David Nadeau, Satoshi Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, Vol.

Early results for named entity recognition with conditional random fields, feature induction, and web-enhanced lexicons.

[SH10] Shalini Gupta, Pushpak Bhattacharyya. Think globally, apply locally: using distributional features for Hindi named entity identification. NEWS '10: Proceedings of the 2010 Named Entities Workshop.

Table 3.6: Cluster 1

Table 3.1: Tagset for MEMM system
