Il Traduttore Intemperante: Corpus Linguistics and Corpora

What is corpus linguistics (I)?

Corpus linguistics is a method of carrying out linguistic analyses. Because it can be used to investigate many kinds of linguistic questions and has been shown to yield highly interesting, fundamental, and often surprising new insights about language, it has become one of the most widespread methods of linguistic investigation in recent years.

What data do linguists use to investigate linguistic phenomena?

Roughly, four types of data for linguistic analysis can be distinguished:

1) data gained by intuition

a) the researcher's own intuition (“introspection”)

b) other people's (“informants'”) intuition (accessed, for example, by elicitation tests)

2) naturally occurring language

a) randomly collected texts or occurrences (“anecdotal evidence”)

b) systematic collections of texts (“corpora”)

(For further reading on corpora vs. intuition, see Fillmore 1992)

What is a corpus?

A corpus can be defined as a systematic collection of naturally occurring texts (of both written and spoken language).

“Systematic” means that the structure and contents of the corpus follow certain extralinguistic principles (“sampling principles”, i.e. principles on the basis of which the texts included were chosen). For example, a corpus is often restricted to certain text types, to one or several varieties of English, and to a certain time span. If several subcategories (e.g. several text types, varieties etc.) are represented in a corpus, these are often represented by the same amount of text. “Systematic” also means that information on the exact composition of the corpus is available to the researcher (including the number of words in each category and in the whole corpus, how the texts included in the corpus were sampled etc.).
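The composition information described here (the number of words in each category and in the whole corpus) can be tabulated with a few lines of code. The following is a minimal sketch in Python, not a description of how any particular corpus was compiled. The sample records, the category names, and the function name `composition` are hypothetical, and words are counted by a crude whitespace split.

```python
from collections import defaultdict

# Hypothetical sample records: (category, text). A real corpus would
# contain far more text per category.
samples = [
    ("press", "Prices rose sharply last week."),
    ("press", "The minister resigned on Monday."),
    ("fiction", "She opened the door and waited."),
    ("fiction", "Nobody answered, so she left again."),
]

def composition(samples):
    """Return the number of words per category (whitespace-separated tokens)."""
    words = defaultdict(int)
    for category, text in samples:
        words[category] += len(text.split())
    return dict(words)

table = composition(samples)
total = sum(table.values())

# Print each category with its word count and its share of the whole corpus.
for category, n in table.items():
    print(f"{category:10}{n:5}{n / total:8.1%}")
print(f"{'total':10}{total:5}")
```

A table of this kind makes it easy to check whether the subcategories are represented by roughly the same amount of text.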

Although “corpus” can refer to any systematic text collection, it is commonly used in a narrower sense today, and is often only used to refer to systematic text collections that have been computerized.

What is corpus linguistics (II)?

Corpus linguistics is thus the analysis of naturally occurring language on the basis of computerized corpora. Usually, the analysis is performed with the help of a computer, i.e. with specialised software, and takes into account the frequency of the phenomena investigated.
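What "taking frequency into account" means in practice can be illustrated with a small sketch. The Python example below counts word forms in a toy corpus and reports each form's share of all tokens. The two-sentence corpus, the function names `tokenize` and `frequencies`, and the simple tokenization rule are assumptions made for illustration only; dedicated corpus software handles tokenization and searching far more carefully.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequencies(texts):
    """Count how often each word form occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

# Hypothetical miniature "corpus"; a real corpus holds far more text.
toy_corpus = [
    "The cat sat on the mat.",
    "The dog didn't sit on the mat.",
]

counts = frequencies(toy_corpus)
total = sum(counts.values())

# Show the three most frequent word forms with absolute and relative frequency.
for word, n in counts.most_common(3):
    print(word, n, round(n / total, 3))
```

Relative frequencies (here, the share of all tokens) are what allow counts from corpora of different sizes to be compared.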

What corpora are there?

There are many types of corpora, which can be used for different kinds of analyses (cf. Kennedy 1998). Some (not necessarily mutually exclusive) examples of corpus types are (for a description of the individual corpora see below):

- general/reference corpora (vs. specialized corpora)

(e.g. BNC = British National Corpus, or Bank of English):

aim at representing a language or variety as a whole (contain both spoken and written language, different text types etc.)

- historical corpora (vs. corpora of present-day language)

(e.g. Helsinki Corpus, ARCHER)

Corpus Linguistics: A Practical Introduction

Nadja Nesselhauf, October 2005