Language Technologies
Can we beat Google Translate?
Marija Brkić, Božena Bašić Mikulić, Maja Matetić
Abstract
This paper presents a machine translation evaluation study for the Croatian-English language pair. In-domain and out-of-domain translations from Croatian into English have been obtained from Google Translate, from our own statistical machine translation system LegTran, and from a professional translator. These translations have been evaluated by six different automatic metrics. The gains obtained from increasing the number of reference translations have been explored and measured. The system-level correlation between the automatic evaluation metrics is given and the significance of the results is discussed. Bootstrapping, approximate randomization and the sign test have been used for confidence intervals and hypothesis testing.
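As an illustration of the bootstrapping mentioned in the abstract, the following minimal sketch computes a percentile confidence interval for the difference between two systems. It assumes hypothetical per-sentence scores for both systems on the same test sentences; the function name and parameters are illustrative and are not taken from the paper. Corpus-level metrics such as BLEU would have to be recomputed on each resampled test set rather than averaged over sentences.

```python
import random


def bootstrap_ci(scores_a, scores_b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-sentence score difference (A - B).

    scores_a and scores_b are hypothetical sentence-level metric scores for
    two systems, aligned so that index i refers to the same source sentence.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample sentence indices with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the resulting interval excludes zero, the difference between the two systems would usually be read as significant at the chosen level.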
Keywords
automatic evaluation, BLEU, F-measure, Google Translate, NIST, PER, reference set, SMT (statistical machine translation), TER, translation evaluation, WER
Full text is available at IEEE Xplore digital library.
Cross-Lingual Document Similarity
Andrej Muhič, Jan Rupnik, Primož Škraba
Abstract
In this paper we investigated how to compute similarities between documents written in different languages based on a weakly aligned multi-lingual collection of documents. Computing the cross-lingual similarities is based on an aligned set of basis vectors obtained by either latent semantic indexing or the k-means algorithm on an aligned multi-lingual corpus. We evaluated the methods on two data sets: Wikipedia and the European Parliament Proceedings Parallel Corpus.
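The idea of comparing documents through an aligned set of basis vectors can be sketched as follows. This is a toy illustration under stated assumptions: the basis vectors below are written by hand rather than obtained by latent semantic indexing or k-means, the projection is a plain dot product, and all names (`project`, `cosine`, `cross_lingual_similarity`, `basis_en`, `basis_de`) are hypothetical and not taken from the paper.

```python
import math


def project(doc, basis):
    """Coordinates of a bag-of-words document with respect to one language's basis."""
    return [sum(w * vec.get(term, 0.0) for term, w in doc.items()) for vec in basis]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy aligned basis (an assumption): basis_en[i] and basis_de[i] describe topic i.
basis_en = [{"parliament": 1.0, "vote": 1.0}, {"football": 1.0, "goal": 1.0}]
basis_de = [{"parlament": 1.0, "abstimmung": 1.0}, {"fussball": 1.0, "tor": 1.0}]


def cross_lingual_similarity(doc_en, doc_de):
    """Compare an English and a German document in the shared topic space."""
    return cosine(project(doc_en, basis_en), project(doc_de, basis_de))
```

Because the i-th basis vector of each language represents the same topic, the projected coordinates are directly comparable even though the two vocabularies do not overlap.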
Keywords
cross-lingual, similarity, LSI, k-means, Wikipedia, information retrieval
Full text is available at IEEE Xplore digital library.
Using Google Search Engine for Word Frequency Analysis
Krešimir Pavlina
Abstract
This paper presents the results of research examining the correlation between the frequency of the 100 most commonly used words in three corpora of the Croatian language and the number of web pages containing those words returned by the Google search engine.
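The kind of comparison described in the abstract can be sketched as a correlation coefficient between two aligned lists of numbers, one with corpus frequencies and one with search hit counts. The abstract does not state which coefficient was used; the sketch below assumes Pearson's r, and the function name `pearson` is illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers.

    xs could hold corpus frequencies and ys the hit counts for the same
    words, in the same order (both hypothetical inputs).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A rank-based coefficient such as Spearman's could be obtained by applying the same function to the ranks of the values instead of the raw counts.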
Keywords
word frequency, Google, correlation
Full text is available at IEEE Xplore digital library.
Natural Language Processing Resources: Using Semantic Web Technologies
Sandi Pohorec, Ines Čeh, Milan Zorman, Marjan Mernik, Peter Kokol
Abstract
Natural language processing relies heavily on resources. The most common usage scenarios include using the resources for automated lexical tagging or named entity recognition. Manually annotated language resources are also used for benchmarking new automated approaches. To allow processing on a large scale, and given the complexity of natural language (words can have multiple meanings within the same general context), the resources have to be quite large. In this paper we focus on lexical resources in ontology form.
Keywords
natural language processing, lexical resource, semantic web, ontology, Linked Data
Full text is available at IEEE Xplore digital library.