Arboretum:
A syntactic tree corpus of Danish

Goals and organisation

The Arboretum project, launched in 2002, aims at building a large, multi-format Danish treebank from automatically parsed and manually revised running text data. It is based at the Institute of Language and Communication (ISK), University of Southern Denmark (SDU), and carried out within the VISL-project framework, where it is currently integrated with the Norfa-supported Constraint Grammar project PaNoLa (Parsing Nordic Languages).

Arboretum is affiliated with the Nordic Treebank Network, and notationally compatible with its sister projects for Portuguese (Floresta Sintá(c)tica) and French (L'Arboratoire).

Current project members include:

  • Eckhard Bick, project leader (automatic analysis, grammars, lexicography)
  • Camilla Pedersen, student researcher (CG-revision, CG-to-Tree revision)
  • Ina Størner Rasmussen, student researcher (CG-revision, CG-to-Tree revision)
  • Dorte Lønsmann, student researcher (Tree-revision)
  • Kim Ebensgaard Jensen, student researcher (Tree-revision)
  • Søren Harder, Ph.D. student (Dependency transformation)

    Data and annotation

    The text data used in the Arboretum project are portions of the Danish mixed genre corpora Corpus90 and Corpus2000, both compiled by DSL and grammatically annotated by Eckhard Bick (VISL). The current project strives to go beyond the earlier annotation by using a multi-stage hybrid system (DanPars) containing both Constraint Grammar and PSG modules. Thus, (a) ordinary syntactic CG-output, enhanced by special structural CG-rules, is used as non-terminal input for (b) a function-based PSG and (c) a Prolog dependency specifier. By eventually making all three formats available in parallel, Arboretum aims at bridging these different paradigms, exploiting their respective strengths and compensating for their weaknesses, at both the processing and the descriptive level.
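
    As a rough illustration of step (c), the sketch below turns flat CG-style function tags into dependency links. It is written in Python rather than Prolog and is not the project's dependency specifier. The Token class, the head_index function, the word-class labels and the attachment heuristic (attach to the nearest candidate head in the direction of the tag's arrow) are all assumptions made for this example; only the arrow convention of tags such as @>N, @SUBJ> and @<ACC follows general CG practice.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    form: str  # word form
    pos: str   # word class, e.g. "N", "V", "ART", "ADJ" (illustrative labels)
    func: str  # CG-style function tag, e.g. "@SUBJ>", "@>N", "@<ACC"


def head_index(tokens: List[Token], i: int) -> Optional[int]:
    """Return the index of token i's head, or None if it is the root.

    Toy heuristic (an assumption, not the DanGram rules): the arrow in the
    tag points towards the head, and the nearest matching token wins.
    """
    func = tokens[i].func.lstrip("@")
    if func.startswith(">"):      # "@>N": head is an N to the right
        target, step = func[1:], 1
    elif func.endswith("<"):      # "@N<": head is an N to the left
        target, step = func[:-1], -1
    elif func.endswith(">"):      # "@SUBJ>": clause-level, verb to the right
        target, step = "V", 1
    elif func.startswith("<"):    # "@<ACC": clause-level, verb to the left
        target, step = "V", -1
    else:                         # no arrow, e.g. "@FMV": treat as root
        return None
    j = i + step
    while 0 <= j < len(tokens):
        if tokens[j].pos == target:
            return j
        j += step
    return None


# "Den gamle mand læser avisen" (The old man reads the newspaper)
sentence = [
    Token("Den", "ART", "@>N"),
    Token("gamle", "ADJ", "@>N"),
    Token("mand", "N", "@SUBJ>"),
    Token("læser", "V", "@FMV"),
    Token("avisen", "N", "@<ACC"),
]
heads = [head_index(sentence, i) for i in range(len(sentence))]
```

    Here both prenominal modifiers attach to "mand", while subject and object attach to the finite verb, which is left without a head. A real specifier would need to handle coordination, subclauses and ambiguous attachment, which this heuristic ignores.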

    A full description of both CG-tags and VISL-treetags is provided in the info folder of the Danish VISL-section. Some documentation and evaluation of the DanGram parser and its precursor, Portuguese PALAVRAS, is available as downloadable PostScript files.

    Revision

    The final size of the Arboretum treebank is targeted at 10 million words. However, since this size obviously rules out complete manual revision, a triage system is envisioned, where a few hundred thousand words (the "Botanical Garden") will be linguistically revised, serving as a "gold standard" benchmark for category definitions, structural examples, teaching and - possibly - training of probabilistic systems. A larger chunk (the "Plantation forest", a few million words) will be partially revised, either for major function categories or structural well-formedness. Finally, the remaining data (the unkempt "Jungle") will be analysed with the updated DanGram-parser, implementing as much as possible of what has been learned and decided during the revision phase.

    The revision technique currently employed mirrors the modular hierarchy of the parsing system. Thus, CG-output is revised first and then fed into the automatic PSG grammar, which limits error propagation from the former to the latter and minimises parse failures. Also, the PSG-rules, being less robust than CG-rules, serve as a kind of stringency filter, since parse failures will often point out tagging inconsistencies from earlier levels. These are then manually corrected in the CG-format, allowing a better PSG-reparse, and so on. Finally, all sentences are reviewed again in tree-format, using VISL's graphical tree-inspection and -manipulation tools. So far, roughly 100,000 words have been linguistically revised this way.
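
    The parse-and-correct cycle described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the project's tooling: toy_psg_parse stands in for the PSG stage and merely checks for exactly one finite main verb, manual_fix stands in for a human reviser, and the bracketed output format is invented for the example.

```python
from typing import Callable, List, Optional, Tuple

Tagged = List[Tuple[str, str]]  # (word form, CG function tag)


def toy_psg_parse(sentence: Tagged) -> Optional[str]:
    """Hypothetical stand-in for the PSG stage: succeed only if the
    sentence contains exactly one finite main verb (@FMV)."""
    verbs = [form for form, tag in sentence if tag == "@FMV"]
    if len(verbs) != 1:
        return None  # parse failure, hinting at a tagging inconsistency
    inner = " ".join(f"{tag}:{form}" for form, tag in sentence)
    return f"(STA {inner})"


def manual_fix(sentence: Tagged) -> Tagged:
    """Hypothetical stand-in for manual correction in the CG format."""
    fixes = {("avisen", "@FMV"): "@<ACC"}
    return [(form, fixes.get((form, tag), tag)) for form, tag in sentence]


def revise(
    sentence: Tagged,
    parse: Callable[[Tagged], Optional[str]],
    correct: Callable[[Tagged], Tagged],
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Alternate parsing and correction until the parse succeeds or the
    round limit is reached; return the tree and the rounds of correction."""
    rounds = 0
    while True:
        tree = parse(sentence)
        if tree is not None or rounds == max_rounds:
            return tree, rounds
        sentence = correct(sentence)
        rounds += 1


# A draft in which "avisen" was wrongly tagged as a second main verb.
draft = [("manden", "@SUBJ>"), ("læser", "@FMV"), ("avisen", "@FMV")]
tree, rounds = revise(draft, toy_psg_parse, manual_fix)
```

    The first parse fails because of the duplicate @FMV tag, the correction step repairs it, and the second parse succeeds after one round. The final review in tree-format would follow this loop as a separate manual step.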

    Online access

    The revised part of the Arboretum trees can be searched through a special search interface, which allows both text and pattern searches, including form & function searches and node sequences. The same material is also searchable through VISL's general search interface, where it is used for beefing up the smaller and more domesticated teaching corpus of hand-crafted sentences. The CG-version of the revised part of Arboretum has been partially integrated into the general corpus interface for Korpus90/2000, where hits should return revised material before unrevised data.
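
    To make the idea of form & function searches and node-sequence searches concrete, here is a small sketch over an in-memory tree. It does not describe the actual search interface or its query syntax. The Node class, the find and find_sequence functions, and the function and form labels used in the example tree are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    function: str                 # syntactic function, e.g. "S", "P", "Od"
    form: str                     # form category, e.g. "g", "n", "v-fin"
    word: Optional[str] = None    # word form for leaf nodes
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all its descendants in depth-first order."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root: Node, function: Optional[str] = None,
         form: Optional[str] = None) -> List[Node]:
    """Form & function search: match nodes on either or both labels."""
    return [n for n in walk(root)
            if (function is None or n.function == function)
            and (form is None or n.form == form)]


def find_sequence(root: Node, pattern: List[Tuple[str, str]]) -> List[Node]:
    """Node-sequence search: return nodes with adjacent children whose
    (function, form) labels match the pattern."""
    hits = []
    for n in walk(root):
        labels = [(c.function, c.form) for c in n.children]
        for i in range(len(labels) - len(pattern) + 1):
            if labels[i:i + len(pattern)] == pattern:
                hits.append(n)
                break
    return hits


# "Den gamle mand læser avisen", with illustrative labels.
subject = Node("S", "g", children=[
    Node("DN", "art", "Den"),
    Node("DN", "adj", "gamle"),
    Node("H", "n", "mand"),
])
root = Node("STA", "fcl", children=[
    subject,
    Node("P", "v-fin", "læser"),
    Node("Od", "n", "avisen"),
])

nouns = [n.word for n in find(root, form="n")]
subject_then_verb = find_sequence(root, [("S", "g"), ("P", "v-fin")])
```

    Searching by form alone returns both nouns regardless of function, while the sequence search returns the clause node whose children contain a subject group immediately followed by a finite verb.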