Language Technologies

doi: 10.2498/iti.2013.0563

Domain Dependence of Statistical Named Entity Recognition and Classification in Croatian Texts

Željko Agić, Božo Bekavac

Abstract

This paper investigates the influence of text domain selection on statistical named entity recognition and classification in Croatian texts. Two datasets of Croatian newspaper texts from different text domains were manually annotated for named entities and used for training and testing the Stanford NER system, which performs named entity recognition by sequence labeling with conditional random fields (CRF). State-of-the-art scores were observed in both domains. The experiment establishes a strong preference for systems trained on mixed text domains. The top-performing system achieved an overall F1-score of 0.876 on mixed-domain test sets, scoring 0.899 in one of the selected domains and 0.852 in the other. The best F1-scores for the individual domains were 0.910 and 0.858.

Keywords

text domain, domain dependence, named entity recognition, Croatian language

Full text is available at IEEE Xplore digital library.


doi: 10.2498/iti.2013.0565

NLTK Tagger for Albanian using Iterative Approach

Arbana Kadriu

Abstract

This paper presents research on a tagging model for Albanian texts, built with the NLTK toolkit. The model cascades three taggers with backoff. We use a dictionary of around 32,000 words together with their corresponding POS tags, as well as a set of regular expression rules. A lemmatization module is implemented to convert nouns and verbs to their lemmas. The text is first tagged with a unigram tagger based on the dictionary. This serves as the baseline tagger for a regular expression tagger. Words that were lemmatized incorrectly are then corrected, which yields a third, lookup tagger. This tagger is used with the first and second taggers as backoff.
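The backoff cascade described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the author's implementation and does not use NLTK itself (NLTK's sequential taggers accept a `backoff` argument that plays the same role). The lexicon entries, the correction table, the regular expression rules, the tag names, and the order of the cascade (corrections first, then the dictionary, then the rules) are all assumptions made for illustration.

```python
import re

def lookup_tagger(table, backoff=None):
    """Tag a word from a word-to-tag table; defer to `backoff` when the word is unknown."""
    def tag(word):
        found = table.get(word.lower())
        if found is None and backoff is not None:
            return backoff(word)
        return found
    return tag

def regexp_tagger(rules, backoff=None):
    """Tag a word with the first rule whose pattern matches the whole word."""
    def tag(word):
        for pattern, pos in rules:
            if re.fullmatch(pattern, word):
                return pos
        return backoff(word) if backoff is not None else None
    return tag

# Hypothetical toy data; the abstract mentions a dictionary of around 32,000 words.
LEXICON = {"shtëpi": "NOUN", "dhe": "CONJ", "mirë": "ADJ"}
# Hypothetical correction table that overrides a dictionary tag.
CORRECTIONS = {"mirë": "ADV"}
# Hypothetical rules: digit strings are numbers, words ending in "oj" are verbs.
RULES = [(r"\d+", "NUM"), (r".+oj", "VERB")]

# Assumed order: corrections -> dictionary -> regular expression rules.
rules_tagger = regexp_tagger(RULES)
dictionary_tagger = lookup_tagger(LEXICON, backoff=rules_tagger)
cascade = lookup_tagger(CORRECTIONS, backoff=dictionary_tagger)

def tag_sentence(words):
    """Return (word, tag) pairs; the tag is None when no tagger in the chain knows the word."""
    return [(word, cascade(word)) for word in words]
```

Calling `tag_sentence(["Shtëpi", "mirë", "2013"])` walks each word down the chain until some tagger answers, so a correction entry takes priority over the dictionary, and the rules only see words that neither table contains.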

Keywords

POS tagging, Albanian language, NLTK

Full text is available at IEEE Xplore digital library.


doi: 10.2498/iti.2013.0510

Query Expansion Using Mutual Information Between Words in Question and Its Answer

Kanako Komiya, Yuji Abe, Yoshiyuki Kotani

Abstract

Question Answering (QA) is the task of answering natural language questions with an adequate sentence. This paper proposes an improvement to query expansion in the relevant document retrieval module. We propose a modified mutual information measure for query expansion: it is calculated between two words of a question and a word in its answer in a Q&A site corpus, so that unsuitable words are not chosen. Experiments with a Japanese Q&A site corpus showed that the proposed method significantly improved accuracy and MRR (Mean Reciprocal Rank).
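The abstract does not give the modified measure itself, so the following is only a rough sketch of the general idea under stated assumptions: it uses standard pointwise mutual information (PMI), estimated from document-level co-occurrence in question-answer pairs, between an unordered pair of question words and a single answer word. The function names, the toy English corpus, and the choice of PMI are all hypothetical and should not be read as the authors' method.

```python
import math
from collections import Counter
from itertools import combinations

def question_pairs(words):
    """Unordered pairs of distinct question words, in a canonical (sorted) order."""
    return set(combinations(sorted(set(words)), 2))

def pmi_table(qa_pairs):
    """Map (question word pair, answer word) to PMI estimated over the Q&A pairs."""
    n = len(qa_pairs)
    pair_counts, answer_counts, joint_counts = Counter(), Counter(), Counter()
    for q_words, a_words in qa_pairs:
        q_pairs, a_set = question_pairs(q_words), set(a_words)
        pair_counts.update(q_pairs)
        answer_counts.update(a_set)
        joint_counts.update((qp, a) for qp in q_pairs for a in a_set)
    # PMI = log2( P(pair, answer) / (P(pair) * P(answer)) )
    return {
        (qp, a): math.log2(c * n / (pair_counts[qp] * answer_counts[a]))
        for (qp, a), c in joint_counts.items()
    }

def expand(query_words, table, k=1):
    """Return the k answer words with the highest PMI for any word pair in the query."""
    wanted = question_pairs(query_words)
    best = {}
    for (qp, a), score in table.items():
        if qp in wanted:
            best[a] = max(score, best.get(a, float("-inf")))
    # Sort by descending score, then alphabetically to keep ties deterministic.
    return [a for a, _ in sorted(best.items(), key=lambda item: (-item[1], item[0]))[:k]]

# Hypothetical toy corpus of (question words, answer words).
CORPUS = [
    (["how", "boil", "egg"], ["water", "minutes"]),
    (["how", "boil", "rice"], ["water", "lid"]),
    (["how", "fix", "bike"], ["wrench", "minutes"]),
    (["how", "fix", "pot"], ["glue", "lid"]),
]
```

With this toy corpus, `expand(["how", "boil"], pmi_table(CORPUS))` ranks answer words by how strongly they are associated with the word pair rather than with either question word alone, which is one way to keep loosely related words out of the expanded query.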

Keywords

Question Answering, query expansion, mutual information

Full text is available at IEEE Xplore digital library.