Language Technologies

doi: 10.2498/iti.2012.0411

Can we beat Google Translate?

Marija Brkić, Božena Bašić Mikulić, Maja Matetić

Abstract

This paper presents a machine translation evaluation study for the Croatian-English language pair. In-domain and out-of-domain translations from Croatian into English have been obtained from Google Translate, from our own statistical machine translation system LegTran, and from a professional translator. These translations have been evaluated by six different automatic metrics. The gains obtained from increasing the number of reference translations have been explored and measured. System-level correlation between the automatic evaluation metrics is given, and the significance of the results is discussed. Bootstrapping, approximate randomization and the sign test have been used for confidence intervals and hypothesis testing.
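As an illustration of the bootstrapping mentioned in the abstract, the following is a minimal sketch of a percentile bootstrap confidence interval for the difference between two systems. It is not the authors' procedure: it assumes each system has a per-segment score on the same test segments and that the quantity of interest is the mean difference, which is a simplification for corpus-level metrics such as BLEU. The function name `bootstrap_ci`, its parameters and the scores in the usage lines are hypothetical.

```python
import random


def bootstrap_ci(scores_a, scores_b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-segment difference (A - B).

    Assumption: scores_a[i] and scores_b[i] score the same test segment.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample segment indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = max(lo_idx, int((1 - alpha / 2) * n_resamples) - 1)
    return diffs[lo_idx], diffs[hi_idx]


# Hypothetical per-segment scores for two systems on four segments.
system_a = [0.5, 0.6, 0.7, 0.4]
system_b = [0.3, 0.4, 0.5, 0.3]
low, high = bootstrap_ci(system_a, system_b)
print(low, high)
```

If the resulting interval excludes zero, the difference would usually be read as significant at the chosen level, subject to the assumptions above.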

Keywords

automatic evaluation, BLEU, F-measure, Google Translate, NIST, PER, reference set, SMT (statistical machine translation), TER, translation evaluation, WER

Full text is available at IEEE Xplore digital library.


doi: 10.2498/iti.2012.0467

Cross-Lingual Document Similarity

Andrej Muhič, Jan Rupnik, Primož Škraba

Abstract

In this paper we investigate how to compute similarities between documents written in different languages, based on a weakly aligned multilingual collection of documents. The cross-lingual similarities are computed from an aligned set of basis vectors obtained by applying either latent semantic indexing or the k-means algorithm to an aligned multilingual corpus. We evaluated the methods on two data sets: Wikipedia and the European Parliament Proceedings Parallel Corpus.
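The comparison step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the aligned basis vectors are already available (for example from latent semantic indexing or k-means), that documents are term-weight vectors over each language's vocabulary, and that similarity is the cosine between the projected coordinates. All function names and the toy bases are hypothetical.

```python
import math


def project(doc_vec, basis):
    """Coordinates of a term-weight vector with respect to each basis vector."""
    return [sum(d * b for d, b in zip(doc_vec, vec)) for vec in basis]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def cross_lingual_similarity(doc_a, basis_a, doc_b, basis_b):
    """Compare two documents through their language-specific aligned bases.

    Assumption: basis_a[i] and basis_b[i] describe the same shared concept i.
    """
    return cosine(project(doc_a, basis_a), project(doc_b, basis_b))


# Toy data: language A has a 3-term vocabulary, language B a 2-term one,
# and both bases span the same 2 shared concepts.
basis_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
basis_b = [[0.0, 1.0], [1.0, 0.0]]
print(cross_lingual_similarity([2.0, 0.0, 1.0], basis_a, [0.0, 3.0], basis_b))
```

Because the i-th basis vector in each language is taken to represent the same concept, the projected coordinates are directly comparable even though the vocabularies differ.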

Keywords

cross-lingual, similarity, LSI, k-means, Wikipedia, information retrieval

Full text is available at IEEE Xplore digital library.


doi: 10.2498/iti.2012.0413

Using Google Search Engine for Word Frequency Analysis

Krešimir Pavlina

Abstract

This paper presents the results of research examining the correlation between the frequency of the 100 most commonly used words in three corpora of the Croatian language and the number of web pages containing those words returned by the Google search engine.
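A correlation of this kind could be computed as in the sketch below. The abstract does not state which correlation coefficient was used, so the Pearson coefficient here is an assumption, and the word frequencies and page counts in the usage lines are invented placeholders rather than data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values: corpus frequency and reported page count per word.
corpus_frequency = [5200, 4100, 3900, 2500, 1800]
page_count = [910000, 870000, 650000, 480000, 300000]
print(pearson(corpus_frequency, page_count))
```

Since both word frequencies and page counts tend to be heavily skewed, applying the same function to log-transformed values or to ranks may be a reasonable variation.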

Keywords

word frequency, Google, correlation

Full text is available at IEEE Xplore digital library.


doi: 10.2498/iti.2012.0386

Natural Language Processing Resources: Using Semantic Web Technologies

Sandi Pohorec, Ines Čeh, Milan Zorman, Marjan Mernik, Peter Kokol

Abstract

Natural language processing relies heavily on resources. The most common usage scenarios include using the resources for automated lexical tagging or named entity recognition. Manually annotated language resources are also used for benchmarking new automated approaches. To allow any processing on a large scale, and considering the complexity of natural language (words can have multiple meanings within the same general context), the resources have to be quite large. In this paper we focus on lexical resources in ontology form.

Keywords

natural language processing, lexical resource, semantic web, ontology, Linked Data

Full text is available at IEEE Xplore digital library.