Corpus Eye

Corpus sources and copyright

We are grateful to all organisations and individuals who have provided/licensed corpus texts for use at the Institute of Language and Communication (ISK) at the University of Southern Denmark.

Credits: For your publications or other references, please use the text and provider details listed below. For annotation and site credits, see also our work credits page.

Please note that corpus search engines are meant to provide researchers with language data and statistics, not running text. Thus, ordinary copyright still holds. This implies for instance that you mustn't try to extract larger, contiguous text portions from any of the corpora.


Danish corpus-sources:

LOKE, an online news and literature magazine, copyright Arne Herløv Petersen
Parliamentary debates from the Danish Folketing, kindly provided by webmaster Benny Høyer
Udklipsbureauet, prose fiction by Ole Dalgaard
Bar el Gazel, prose fiction by Ole Dalgaard
Litteraturvidenskaben siden nykritikken, by Ole Sauerberg (2000)
Ret og pligt i det 17. århundrede, by Knud E. Korff (1996)
Skalk is a Danish journal of archaeology (ISSN 0560-1894)
Munk-korpus, by Ulrik Petersen (2006)
Europarl is the Danish part (both original and translated) of the European Parliament Proceedings Parallel Corpus 1996-2003, which is the brainchild of Philipp Koehn and is available without copyright at his website.
Wikipedia is the Danish part of a 9-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, only statistics from grammatically annotated sentence concordances.
Korpus 90/2000, a mixed-genre "quote corpus", was compiled by DSL (Det Danske Sprog- og Litteraturselskab) and grammatically annotated with VISL tools in a joint venture framework. The text corpus, Den Danske Ordbogs Citatkorpus (DDOC-korpus), is a subset of the Den Danske Ordbog (DDO) korpus. The DDOC corpus was constructed in three steps: 1. automatic orthographic sentence chunking (though with some errors, due in particular to ambiguous full stops), 2. removal of a random third of the sentences, 3. randomised ordering of the remaining sentences. Another subset of the DDO corpus is Korpus 90, which contains all DDO texts from 1988-92 (25 million running word forms). Korpus 90/2000 is accessible at the website of the Korpus 2000 project.
The Leipzig corpus is the Danish section of the Leipzig Corpora Collection, compiled from Internet sources at the University of Leipzig.
Information is a newspaper corpus consisting of 14,780 articles from the publicly searchable archive of the Danish daily newspaper Information (1996-2008). The corpus contains about 92 million words and was kindly made available by Johannes Wehner in the context of a proposed joint semantic web project involving Information, VISL and GrammarSoft.
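
The three construction steps described above for the DDOC corpus (Korpus 90/2000) can be illustrated with a minimal sketch. This is not the procedure DSL actually used, which is not documented here. The function names, the regular expression, the seed parameter and the reading of "a random third" as a third of the sentences rounded down are all assumptions made for illustration.

```python
import random
import re


def naive_sentence_split(text):
    """Step 1 (assumed rule): split after '.', '!' or '?' followed by whitespace.

    Like any purely orthographic chunker, this mis-splits at ambiguous
    full stops, e.g. after abbreviations.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_quote_corpus(text, seed=0):
    """Steps 2 and 3: drop a random third of the sentences, then shuffle the rest."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    sentences = naive_sentence_split(text)
    n_drop = len(sentences) // 3  # assumption: "a third" rounded down
    dropped = set(rng.sample(range(len(sentences)), n_drop))
    kept = [s for i, s in enumerate(sentences) if i not in dropped]
    rng.shuffle(kept)  # randomised order, so running text cannot be reassembled
    return kept
```

Removing sentences and shuffling the remainder means that no contiguous passage of the source texts can be recovered, which fits the copyright note at the top of this page.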


Portuguese corpus-sources:

Europarl is the Portuguese part (both original and translated) of the European Parliament Proceedings Parallel Corpus 1996-2003, which is the brainchild of Philipp Koehn and is available without copyright at his website.
Wikipedia is the Portuguese part of a 9-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, only statistics from grammatically annotated sentence concordances.
CETEMPúblico: A large corpus of European Portuguese (1991-1998), containing articles and other material from the Público newspaper (180 million words). The corpus was compiled by Linguateca and is freely available online. It was morphosyntactically annotated with the PALAVRAS parser as part of the AC/DC project, a joint venture between VISL and the Processamento computacional do português initiative.
CETENFolha: A corpus of Brazilian Portuguese, containing one year's collection of the Folha de São Paulo newspaper (1994), about 25 million words. Like CETEMPúblico, this is a Linguateca corpus, PALAVRAS-annotated within the AC/DC project, and available online.
COLONIA: A corpus of historical Brazilian Portuguese (100 texts, ~ 5 million words), covering the entire period of colonial Brazil. The corpus was developed at the University of Cologne and provided by Marcos Zampieri. For more information, see the project's homepage. The CorpusEye annotation retains original wordforms but internally attempts orthographical normalisation to allow POS and syntactic tagging.

Some other Portuguese texts at this site are corpus samples that have been tagged with the PALAVRAS parser for testing and evaluation purposes, in cooperation with the following research teams:
speech data: Annotated data from the C-ORAL-Brasil project and the CORDIAL-SIN project
historical texts: The TYCHO BRAHE Corpus of Historical Portuguese
modern texts: The NILC project
ad corpus: 580 advertisements from the 2005 and 2006 editions of the Portuguese journals Activa, Lux Woman, GQ, Visão and Caras Decoração, collected by Alexandra Pinto at FLUP, comprising texts of at least 15 content words.


German corpus-sources:
Europarl is the German part (both original and translated) of the European Parliament Proceedings Parallel Corpus 1996-2003, which is the brainchild of Philipp Koehn and is available without copyright at his website.
Wikipedia is the German part of a 9-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, only statistics from grammatically annotated sentence concordances.
bzk: Bonner Zeitungskorpus. Password protected for ISK-researchers only.
mak: Mannheimer Korpora. Password protected for ISK-researchers only.
ecide3: Frankfurter Rundschau newspaper text (ca. 1992), as compiled for the multilingual ECI-collection by the European chapter of the Association for Computational Linguistics (EACL). Password protected for ISK-researchers only.

English corpus-sources:
Europarl is the English part (both original and translated) of the European Parliament Proceedings Parallel Corpus 1996-2003, which is the brainchild of Philipp Koehn and is available without copyright at his website.
Wikipedia is the English part of a 9-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, only statistics from grammatically annotated sentence concordances.
bnc: British National Corpus. Password protected for ISK-researchers only.
KEMPE: The 'Korpus of Early Modern Playtexts in English' was initially compiled by Lene B. Petersen and Marcus X. Dahl, in association with VISL, SDU, 2001-2003. The fully searchable version of the corpus was prepared by Lene B. Petersen and Eckhard Bick in July 2004, and may be freely accessed online without a password. Please report any mis-tagged word forms to lene.petersen@uwe.ac.uk.
Chat is a corpus of 4 different chat logs from Project JJ (http://www.projectjj.com), administered by Tino Didriksen. The logs were collected between August 2002 and August 2004, and cover the topics (a) Harry Potter, (b) Goth Chat, (c) X Underground and (d) Amarantus: War in New York.
Enron e-mails is a corpus of corporate e-mails, called the Enron Email Dataset, and made available for research by William Cohen on his website. The data was originally made public, and posted to the web, by the (US) Federal Energy Regulatory Commission during its investigation (history and credits).
Wikipedia Talkpages is a "speech-like" corpus of Wikipedia author discussions, called the Wikipedia Talk Page Conversations, and made available for research by Cristian Danescu-Niculescu-Mizil, Lillian Lee, Bo Pang, and Jon Kleinberg at their corpus download page, together with their article Echoes of power: Language effects and power differences in social interaction.
Supreme Court Dialogs is a speech corpus of about 50,000 utterances by 300 participants in 204 lawsuits. The corpus was made available by Cristian Danescu-Niculescu-Mizil, Lillian Lee, Bo Pang, and Jon Kleinberg at their corpus download page, together with their article Echoes of power: Language effects and power differences in social interaction, building on earlier work by Timothy W. Hawes in his M.A. thesis.

French corpus-sources:
Europarl is the French part (both original and translated) of the European Parliament Proceedings Parallel Corpus 1996-2003, which is the brainchild of Philipp Koehn and is available without copyright at his website.
Wikipedia is the French part of a 9-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, only statistics from grammatically annotated sentence concordances.
ecifr1: Le Monde newspaper text (1989/1990), as compiled for the multilingual ECI-collection by the European chapter of the Association for Computational Linguistics (EACL). Password protected for ISK-researchers only.
Ananas, also password-protected, consists of part of ecifr1 as well as other news excerpts from the Ananas project. Joint venture with Susanne Salmon-Alt (ATILF - Loria-LED

Spanish corpus-sources:
Europarl is the Spanish part (both original and translated) of the European Parliament Proceedings Parallel Corpus 1996-2003, which is the brainchild of Philipp Koehn and is available without copyright at his website.
Wikipedia is the Spanish part of a 9-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, only statistics from grammatically annotated sentence concordances.
ecies2: El Diario Sur newspaper text (April & September 1991), as compiled for the multilingual ECI-collection by the European chapter of the Association for Computational Linguistics (EACL). Password protected for ISK-researchers only.

Esperanto corpus-sources:
TTT Internet corpus: This corpus was compiled from a random crawl of Esperanto pages on the internet, performed once in 2004 and once in 2009. Other-language and binary sections were filtered out, and dozens of encoding conventions were unified into iso-latin-1. For better or worse, the Internet corpus differs from the literary, wiki and news corpora in that it contains a larger portion of non-standard language usage, typing errors etc.
Wikipedia: Wikipedia-2005 is the Esperanto part of a 9-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, only statistics from grammatically annotated sentence concordances. Wikipedia-2010 is an equivalent, later (and hence larger) snapshot, using the raw version of Apertium's Esperanto Wikipedia corpus, compiled by Jacob Norfalk.
Eventoj: Newspaper text from the internet version of the fortnightly Eventoj magazine (1992-2002), published by the Budapest-based LINGVO studio. Together with other material and Esperanto services, the Eventoj archives are accessible at the Eventoj Esperanto Center. Corpus use was kindly permitted by László Szilvási.
Monato: Monthly news magazine with international topics. Monato, a kind of Esperanto "Newsweek", is published by Flandra Esperanto-Ligo and has a 25-year history. Files with compiled back-issue articles are available online at Edmund Grimley Evans' Tekstaro page.
Zamenhof classics: Esperanto texts from the Zamenhof period (Biblio, Andersen's and Grimm's Fabeloj, La Faraono, Proverbaro, Revizoro and Marta). The texts are available electronically on CD-ROM, kindly provided by Wolfram Diestel.
Esperanto literature: Esperanto literature on the internet, searched through eLibrejo.

Italian corpus-sources:
Wikipedia is the Italian part of a 9-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, only statistics from grammatically annotated sentence concordances.

Romanian corpus-sources:
The Business corpus was compiled by Arina Greavu from Revista Capital (1998-2005) and Adevarul Economic (1999-2004). Only out-of-context concordance quotes are provided, not entire articles.

Swedish corpus-sources:
GöteborgsPosten is a Swedish newspaper corpus compiled by Leif Grönqvist from 12 annual collections (1992-2003) of Göteborgs-Posten. In un-annotated form, the corpus is also searchable at Språkbanken's website. The CorpusEye search interface does allow grammatical/syntactic searches, but will only show single-sentence concordances, without context.
Europarl is the Swedish part (both original and translated) of the European Parliament Proceedings Parallel Corpus 1996-2003, which is the brainchild of Philipp Koehn and is available without copyright at his website.
The Leipzig corpus is the Swedish section of the Leipzig Corpora Collection, compiled from Internet sources at the University of Leipzig.

Norwegian corpus-sources:
Wikipedia is the Norwegian part of a 9-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, only statistics from grammatically annotated sentence concordances.
The Leipzig corpus is the Norwegian section of the Leipzig Corpora Collection, compiled from Internet sources at the University of Leipzig.

Other sources:
Some further material, for some of the languages, was provided by ISK members and integrated into this site to allow easier internal access for statistical and distributional research. Please feel free to contact us if you have corpus material yourself that you would like to have annotated and made searchable through this site.