Manual Window

Please park me in a convenient corner of your screen

This search engine has two search modes, one for pure-text corpora (Danish, English, German, Spanish), yielding traditional concordances, and another one for tagged corpora (Portuguese or Danish), using colour notation for PoS and syntactic indexing. In this frame window, you can find information and examples on how to run effective searches in both modes.


(1) Pure-text corpora

This version is a grep-based string tool, which means that you can use it in either intuitive or elaborate mode. If you are content with simple word searches, just type your text in the window and hit the "search" button. Sentence start is marked by triple hyphens (- - -). The search formalism allows regular expressions, so if you want to run flexible searches, try some of the following:

. means 'any character, exactly 1'
? means '0 or 1 instances of preceding letter or expression'
* means '0 or more instances of preceding letter or expression, longest match'
*? means '0 or more instances of preceding letter or expression, shortest match'
+ means '1 or more instances of preceding letter or expression'
[] surrounds sets of characters
[^ ] surrounds sets of NOT-ed characters
() surrounds expressions
\w means a word character (alphanumeric), same as [a-zA-Z_0-9]

From this, you can build, for example:

[^ ] means 'any character but a space'
\w+ or [^ ]+ means 'any word'
[^\w] means a non-word character, including spaces, tabs and newlines
[A-Z]+ means capital letters
[0-9]+ means numbers
[aeiou] means a vowel
[^aeiou ] means anything but a vowel and a blank
( [aeiou][^ ]+s)( [aeiou][^ ]+s)+[ ] means a sequence of 2 or more plural words with initial vowels
( ([gc][aeiou])[^ ]+)( \2[^ ]+)+[ ] finds you "gaelic" alliteration rhymes
\w(\w+)-\w\1, surrounded by single spaces, finds you lots of "willy-nilly"-constructions as well as some "cut-out" and "four-hour" cases. [a-z]([a-z]+)\-[a-z]\1 does the same, but avoids soccer results, since digits no longer match.
Remember spaces for word boundaries!
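
The patterns above can be tried out offline before sending them to the server. The following is a minimal sketch in Python, assuming that Python's re module behaves like the engine's grep dialect for these constructs (this has not been checked against the server); the sample text and all variable names are made up for illustration.

```python
import re

# Toy sentence; single spaces act as word boundaries, as the manual advises.
text = " the willy-nilly plan gave good games a four-hour slot and a cut-out shape "

# '*' takes the longest match, '*?' the shortest.
longest = re.search(r"g.*s", text).group()
shortest = re.search(r"g.*?s", text).group()

# \1 repeats whatever the first parenthesis matched, which yields the rhyme.
# re.ASCII keeps \w equal to [a-zA-Z_0-9], as defined in the list above.
rhyme = re.compile(r" \w(\w+)-\w\1 ", re.ASCII)
rhyme_hits = [m.group().strip() for m in rhyme.finditer(text)]

# Letters-only variant: digit pairs such as "10-20" no longer match.
letters_only = re.compile(r" [a-z]([a-z]+)-[a-z]\1 ")
score = " it ended 10-20 after all "

print(shortest)
print(rhyme_hits)
print(rhyme.search(score) is not None, letters_only.search(score) is not None)
```

Note that each match consumes its trailing space, so two hits separated by only one space would not both be found by a left-to-right scan like the one used here.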

Try to force triple alliterations, as an exercise, and you will notice that complex searches can take a very long time, especially if the server is serving many people at the same time. For example, the triple "gaelic" alliterations took about 7 minutes in a triple run on the BNC with its 600 MB of data.
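
One possible way to approach the exercise is sketched below in Python. The double pattern is copied from the manual; the triple pattern is only an assumption (it simply repeats the back-referenced group once more) and is not necessarily the query used for the timing quoted above. The two sample sentences are invented.

```python
import re

# Double alliteration, as given in the manual: a word starting with g/c plus
# a vowel, followed by one or more words starting with the same two letters.
double = re.compile(r"( ([gc][aeiou])[^ ]+)( \2[^ ]+)+[ ]")

# Hypothetical triple version: demand at least two further words with \2.
triple = re.compile(r"( ([gc][aeiou])[^ ]+)( \2[^ ]+)( \2[^ ]+)+[ ]")

# Invented sentences, using the "- - -" sentence start marker.
sentences = [
    "- - - they sold good gold today ",
    "- - - a cat can catch cold quickly ",
]

double_hits = [s for s in sentences if double.search(s)]
triple_hits = [s for s in sentences if triple.search(s)]

print(len(double_hits), len(triple_hits))
```

Every extra back-referenced group gives the matcher more positions to try and retry, which is one plausible reason why such searches get slow on a large corpus.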

Remember that this search engine is sentence-based and will not match across sentence boundaries. In the initial concordance, you will be shown an 80-letter context from any sentence that matches your query, followed by a source ID consisting of the text segment (id), text type (n) and/or paragraph (p), as well as a sentence number (s) within that segment. By clicking the sentence's radio button, however, you can add the full sentence context "on demand", shown in the lower frame (context window). More running context is available by clicking on the rectangular id-buttons.
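
The sentence-based behaviour can be pictured with a small Python sketch. It is not the server's code: the function name concordance, the way the 80-character window is centred on the match, and the source-ID strings are all assumptions made for illustration.

```python
import re

def concordance(sentences, pattern, width=80):
    """Return (source_id, context) pairs, one per matching sentence."""
    regex = re.compile(pattern)
    lines = []
    for source_id, sentence in sentences:
        # Each sentence is searched on its own, so no match can span two.
        match = regex.search(sentence)
        if match is None:
            continue
        # Assumed layout: a window of at most `width` characters around the hit.
        centre = (match.start() + match.end()) // 2
        start = max(0, centre - width // 2)
        lines.append((source_id, sentence[start:start + width]))
    return lines

# Invented mini-corpus with made-up source IDs.
corpus = [
    ("id=A01 s=1", "- - - The cat sat on the mat ."),
    ("id=A01 s=2", "- - - It slept ."),
]

hits = concordance(corpus, r" cat ")
across = concordance(corpus, r"mat \. - - - It")
```

Here the second query finds nothing, because its two halves sit in different sentences.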


(2) Tagged corpora

This version of the search engine handles searches for all kinds of morphological and syntactic information. More specifically, one can search for words, base forms, grammatical category, inflexional features and syntactic functions on the word and clause levels:

word forms are enclosed in double quotes
base forms are enclosed in single quotes (not in the Danish quote corpus)
morphological tags relating to the same word are separated by blank spaces
words or tag chains for individual words are separated by an underscore
sentence start is marked by ">>>" as a "word form"
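
The quoting and separator rules above can be summed up in a few lines of Python. The helpers word_slot and build_query are hypothetical and not part of the search engine; they only assemble the query string you would type, and the order of items within one word (word form, base form, tags) is an assumption.

```python
def word_slot(form=None, base=None, tags=()):
    """Describe one word: "form" in double quotes, 'base' in single quotes, then tags."""
    parts = []
    if form is not None:
        parts.append('"%s"' % form)
    if base is not None:
        parts.append("'%s'" % base)
    parts.extend(tags)
    # Items relating to the same word are separated by blank spaces.
    return " ".join(parts)

def build_query(*slots):
    """Join the descriptions of consecutive words with underscores."""
    return "_".join(slots)

# Two queries that also appear among the manual's own examples.
plural_query = build_query(word_slot(form="bilist.*", tags=["P"]))
clause_query = build_query(
    word_slot(tags=["ADV"]),
    word_slot(tags=["INFM"]),
    word_slot(tags=["@ICL", "@<ACC"]),
)

print(plural_query)
print(clause_query)
```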

Some regular expression conventions work in this version, too. Thus, regular expressions can be used in word form strings, e.g. "køb(e|r|te)" or "køb[ert]+". Between searched-for tag strings, use '_._' for exactly one dummy word, '_.?_' for one optional dummy word, '_.*_' for any number of optional dummy words (including none), and '_.+_' for one or more obligatory dummy words. For tags other than word forms, regular expressions do not work in full, but alternatives can be expressed within parentheses, separated by '|', e.g. '(DET|ADJ)_PROP (@SUBJ>|@<ACC)'.

On the tagged Danish quote corpus, try for instance the following:
ADV_INFM_@ICL @<ACC to find pre-infinitive adverbs in connection with non-finite direct object clauses
"bilist.*" P to find plural forms of 'bilist'
@>>P_.+_PRP_(^@P<|@>N) to find "stranded" prepositions with fronted arguments
@>>P_.+_PRP_PRP ditto followed by yet another preposition
"hvor" ADV_(^V)+_@FS @N< to find postnominal relative clauses introduced by the relative adverb "hvor"
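
To read such a query, it can help to cut it at the underscores and look at one word at a time. The Python sketch below does this for two of the examples; the function describe is hypothetical, the reading of '.*' as "zero or more" is an assumption, and the simple split presumes that no word form itself contains an underscore.

```python
# Dummy-word slots, paraphrased from the manual.
DUMMIES = {
    ".": "exactly one dummy word",
    ".?": "one optional dummy word",
    ".*": "zero or more dummy words",
    ".+": "one or more dummy words",
}

def describe(query):
    """Split a tagged-corpus query into per-word slots and label each one."""
    slots = query.split("_")
    return [(slot, DUMMIES.get(slot, "word specification")) for slot in slots]

stranded = describe("@>>P_.+_PRP_PRP")
relative = describe('"hvor" ADV_(^V)+_@FS @N<')

for slot, label in stranded:
    print(slot, "->", label)
```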