Eiríkur Rögnvaldsson

Eiríkur Rögnvaldsson is professor of Icelandic language at the University of Iceland, and has been one of the chief movers of the Icelandic language technology programme. At SaLTMiL'08 he presented a paper called "Icelandic Language Technology Ten Years Later", showing "how the authorities, industry, and academia can fruitfully cooperate to build language technology resources and tools from scratch in a relatively short time for a relatively small budget."

In 1998 there was, to all intents and purposes, no Icelandic language technology. The Icelandic government commissioned a report into what should be done to rectify this situation. As a result, the government launched the Language Technology Program in 2000, to "support institutions and companies to create basic resources for Icelandic language technology work". One proposal was to develop education and training in language technology, and to this end a Master's programme in language technology has been launched at the University of Iceland.

The original proposals involved a five-year plan for kickstarting Icelandic language technology. Seven "priority tasks" were specified:

  1. Software translation - localised versions of Windows XP, Vista and Office 2007 have been launched, along with various flavours of GNU/Linux.
  2. Icelandic characters - nowadays most TV sets and mobile phones are compliant with ISO 8859-1, which contains all the necessary Icelandic characters.
  3. Morphological and syntactic parsing - see below.
    1. A balanced corpus
    2. A semantically annotated lexicon
  4. Spelling and grammar checkers - there is a range of word-based spell-checking programs available, and work is currently being undertaken on a context-sensitive spell-checker.
  5. Text-to-speech system - a system was developed by the University of Iceland, Iceland Telecom and Hex Software (powered by Nuance speech technology), and came on to the market in 2007.
  6. Speech recognition - an isolated word speech recogniser has been developed by the University of Iceland and four telecommunication companies. It has a 97% recognition rate.
  7. Machine translation - limited progress from isolated experiments.
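
The character-set point in item 2 is easy to check for yourself. The following is only a minimal sketch, not anything from the talk: the alphabet string and the helper name `unencodable` are my own, and it simply asks Python which letters of the modern Icelandic alphabet fail to encode in a given character set.

```python
# Letters of the modern Icelandic alphabet (32 letters), lower case.
# This string is typed in by hand for illustration, not taken from the talk.
ICELANDIC_LOWER = "aábdðeéfghiíjklmnoóprstuúvxyýþæö"
ICELANDIC_ALPHABET = ICELANDIC_LOWER + ICELANDIC_LOWER.upper()


def unencodable(text, encoding):
    """Return the characters of `text` that cannot be encoded in `encoding`."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# Every letter, in both cases, fits into ISO 8859-1.
latin1_missing = unencodable(ICELANDIC_ALPHABET, "iso-8859-1")

# Plain ASCII, by contrast, lacks the accented vowels plus ð, þ, æ and ö.
ascii_missing = unencodable(ICELANDIC_LOWER, "ascii")

if __name__ == "__main__":
    print("Missing from ISO 8859-1:", latin1_missing)
    print("Missing from ASCII:", ascii_missing)
```

The contrast with ASCII shows why this was worth listing as a priority task: a device limited to the basic 7-bit set cannot display ordinary Icelandic text correctly.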

Three major projects were funded by the Language Technology Program in the area of syntactic and morphological parsing:

  • The Institute of Lexicography is building a full-form morphological database, which now contains 259,000 lexemes and 5.9 million inflectional forms.
  • Three POS taggers (TnT, MXPOST and fnTBL) were trained on a manually tagged corpus of 500,000 words, also at the Institute of Lexicography.
  • Frisk Software is developing an HPSG-based parser, to be used in grammar and style-checking software.
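
To make "full-form" concrete: such a database lists every inflected form explicitly rather than deriving forms by rule, so a tool can look up a word form and get back its lexeme and grammatical tag. The sketch below is purely hypothetical and does not reflect the Institute's actual data format. The names `PARADIGMS`, `build_form_index` and the tag labels are invented for illustration, using the eight indefinite forms of the noun "hestur" (horse). It also works out the average number of forms per lexeme from the figures quoted above.

```python
from collections import defaultdict

# Hypothetical toy entry: indefinite singular and plural forms of "hestur".
# A real full-form database would hold many more forms and richer tags.
PARADIGMS = {
    "hestur": [
        ("hestur", "nom.sg"), ("hest", "acc.sg"),
        ("hesti", "dat.sg"), ("hests", "gen.sg"),
        ("hestar", "nom.pl"), ("hesta", "acc.pl"),
        ("hestum", "dat.pl"), ("hesta", "gen.pl"),
    ],
}


def build_form_index(paradigms):
    """Map each inflected form to every (lexeme, tag) reading it can have."""
    index = defaultdict(list)
    for lemma, forms in paradigms.items():
        for form, tag in forms:
            index[form].append((lemma, tag))
    return dict(index)


FORM_INDEX = build_form_index(PARADIGMS)

# Average forms per lexeme, from the figures reported for the database.
forms_per_lexeme = 5_900_000 / 259_000

if __name__ == "__main__":
    print(FORM_INDEX["hesta"])  # an ambiguous form: two possible readings
    print(f"{forms_per_lexeme:.1f} forms per lexeme on average")
```

Even this toy entry shows why tagging is needed on top of lookup: "hesta" is both accusative and genitive plural, and only context can decide between the two.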

In addition, the Institute of Lexicography is building a 25-million-word balanced, morphologically tagged corpus of Modern Icelandic text, including transcribed spoken language, to be completed in 2008. No real progress has been made on a semantically annotated lexicon (like PAROLE/SIMPLE), but the ISLEX database is under development, and will contain 50,000 Icelandic words with their equivalents in Danish, Norwegian and Swedish.
