CorpusMate

CorpusMate, the proposed corpus platform, is developed by Dr. Peter Crosthwaite (UQ) and Dr. Vít Baisa, creator of the popular corpus platforms SkELL and Versatext. It will provide a streamlined, simplified language data analysis experience for younger (L2) learners, incorporating the best features of currently available tools into an integrated digital environment designed specifically for secondary school learners.

The KWIC pattern view was inspired by the work of Prof. Laurence Anthony (Waseda), the creator of AntConc v4, where this view was first introduced. We are grateful to Laurence for his support.
Anthony, L. (2022). What can corpus software do? In A. O’Keeffe & M. McCarthy (Eds.), The Routledge Handbook of Corpus Linguistics. Abingdon, UK: Routledge Press.

Help on queries

Apart from plain words and phrases (e.g. similarity), queries can use three wildcard symbols:

* (asterisk), which is interpreted as any word, e.g. in the most * way

? (question mark), which stands for zero or one occurrence of the preceding word, e.g. a small? thing

/ (forward slash), which allows an alternation (either of two words), e.g. in the/a small
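The three wildcards can be illustrated with a small matcher. This is only a sketch of the semantics described above, not CorpusMate's actual query engine: the function names (parse_query, match_at, find_matches) are hypothetical, and the text is split naively on whitespace, so punctuation is not handled.

```python
from typing import FrozenSet, List, Optional, Tuple

# One query slot: (allowed words, or None for "any word"; whether the slot is optional).
Slot = Tuple[Optional[FrozenSet[str]], bool]


def parse_query(query: str) -> List[Slot]:
    """Turn a query such as 'in the/a small? thing' into a list of slots."""
    slots: List[Slot] = []
    for item in query.lower().split():
        # A trailing '?' marks the preceding word as optional.
        optional = item.endswith("?") and len(item) > 1
        if optional:
            item = item[:-1]
        # '*' accepts any word; '/' separates alternatives.
        words = None if item == "*" else frozenset(item.split("/"))
        slots.append((words, optional))
    return slots


def match_at(slots: List[Slot], tokens: List[str], start: int) -> Optional[int]:
    """Return the end index of a match beginning at `start`, or None."""
    def step(si: int, ti: int) -> Optional[int]:
        if si == len(slots):
            return ti
        words, optional = slots[si]
        # First try to consume the current token with this slot.
        if ti < len(tokens) and (words is None or tokens[ti] in words):
            end = step(si + 1, ti + 1)
            if end is not None:
                return end
        # Otherwise skip the slot, but only if it is optional.
        return step(si + 1, ti) if optional else None

    return step(0, start)


def find_matches(query: str, text: str) -> List[str]:
    """Return every matching word sequence in `text` (naive whitespace tokens)."""
    tokens = text.lower().split()
    slots = parse_query(query)
    hits = []
    for i in range(len(tokens)):
        end = match_at(slots, tokens, i)
        if end is not None and end > i:
            hits.append(" ".join(tokens[i:end]))
    return hits


print(find_matches("a small? thing", "it is a small thing and a thing"))
# ['a small thing', 'a thing']
print(find_matches("in the/a small", "in a small town in the small hours"))
# ['in a small', 'in the small']
```

Representing each query position as a slot keeps the three symbols independent, so they can be combined in one query (e.g. in the/a * way).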

The Corpus

The corpus has been compiled from six different resources (see below).

The texts were normalized, tokenized and PoS tagged.

The texts were checked for profanity with a special tool and all offending sentences were filtered out from the corpus.
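Sentence-level filtering of this kind can be sketched as follows. The tool actually used is not named here, so everything in this example is an assumption: the blocklist is a harmless placeholder, the sentence splitter is deliberately naive, and the function names are hypothetical.

```python
import re
from typing import List, Set

# Placeholder blocklist; the real tool and its word list are not described in this document.
BLOCKLIST: Set[str] = {"darn", "heck"}


def split_sentences(text: str) -> List[str]:
    """Naively split on whitespace that follows sentence-final punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def is_clean(sentence: str, blocklist: Set[str] = BLOCKLIST) -> bool:
    """True if no word of the sentence appears in the blocklist."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return not any(word in blocklist for word in words)


def filter_sentences(text: str, blocklist: Set[str] = BLOCKLIST) -> List[str]:
    """Keep only the sentences that contain no blocklisted word."""
    return [s for s in split_sentences(text) if is_clean(s, blocklist)]


print(filter_sentences("This is fine. What the heck is that? All good here."))
# ['This is fine.', 'All good here.']
```

Dropping whole sentences rather than masking single words keeps the remaining text grammatical, which matters when the sentences are shown to learners as examples.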

BAWE

The corpus is a record of proficient university-level student writing from around 2000. The disciplines in the metadata were mapped onto a predefined list of topics (see the table below). More information can be found at the project's website.

BASE

The BASE corpus consists of 160 lectures and 38 seminars recorded in a variety of university departments. The disciplines in the metadata were mapped onto a predefined list of topics (see the table below). More information can be found at the project's website.

TED Talks

Transcripts of TED talks. The original keywords were mapped onto the predefined topics (see the table below).

Simple English Wikipedia

The whole Simple English Wikipedia was processed, but only articles containing preselected keywords (branches of science) were processed further. The MediaWiki format used internally in all Wikipedias was converted into HTML with pandoc. The HTML was then cleaned with a series of custom scripts and turned into its final form. The topics were estimated from keywords in the articles themselves, which were then mapped onto the predefined topics (see the table below).
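The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's scripts: the keyword-to-topic table is invented for the example, the function names are hypothetical, and mediawiki_to_html simply pipes text through the pandoc command-line tool, which must be installed separately.

```python
import re
import subprocess
from typing import Dict, List

# Invented keyword-to-topic table; the project's real keyword lists are not given here.
KEYWORD_TOPICS: Dict[str, str] = {
    "physics": "Physics",
    "chemistry": "Chemistry",
    "biology": "Biology",
}


def mediawiki_to_html(wikitext: str) -> str:
    """Convert MediaWiki markup to HTML by piping it through the pandoc CLI."""
    result = subprocess.run(
        ["pandoc", "--from", "mediawiki", "--to", "html"],
        input=wikitext,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def estimate_topics(text: str, keyword_topics: Dict[str, str] = KEYWORD_TOPICS) -> List[str]:
    """Return the sorted topics whose keyword occurs in the text (case-insensitive)."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted({topic for kw, topic in keyword_topics.items() if kw in words})


def select_articles(articles: Dict[str, str],
                    keyword_topics: Dict[str, str] = KEYWORD_TOPICS) -> Dict[str, List[str]]:
    """Keep only articles mentioning a preselected keyword, mapped to their topics."""
    selected = {}
    for title, text in articles.items():
        topics = estimate_topics(text, keyword_topics)
        if topics:
            selected[title] = topics
    return selected


articles = {
    "Atom": "An atom is studied in physics and chemistry.",
    "Cat": "A cat is a pet.",
}
print(select_articles(articles))
# {'Atom': ['Chemistry', 'Physics']}
```

Filtering by keyword before conversion avoids running pandoc on articles that would be discarded anyway; whether the original pipeline ordered the steps this way is not stated.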

BBC Teach

BBC Teach contains transcriptions of learning videos from the BBC Teach website. The texts were scraped from the website. Topics were taken from the website and from the descriptions of the videos.

Elsevier

The Elsevier OA CC-BY Corpus is a collection of 40,000 scientific articles from across Elsevier's journals. More information can be found at the dataset website. The scientific fields from the metadata were mapped onto the project's topics.

Statistics

Data sources

Collection Tokens Documents
Wiki 13,494,855 40,134
Elsevier 11,969,773 2,200
BAWE 7,402,159 2,761
TED 5,925,929 2,456
BASE 1,680,673 198
BBC 436,610 520
Sum 40,909,999 48,269

Topics

Topic Tokens Documents
Health and Medicine 15,158,033 9,706
Culture, Arts and Music 13,753,903 8,658
Biology 13,105,827 14,465
Science 11,460,623 14,431
Technology 10,616,895 7,443
History 10,421,808 11,050
Physics 9,943,324 6,314
Geography, Agriculture and Environment 9,235,409 5,872
Business and Economics 9,015,322 4,824
Society 8,606,908 4,615
Politics 5,727,551 3,411
Engineering 5,632,925 2,548
Psychology 4,880,119 2,306
Mathematics 4,330,219 9,235
Law 4,286,358 2,687
English Language and Literature 3,955,349 3,995
Chemistry 3,520,567 4,003
Education 2,422,862 2,655
Architecture, Planning and Design 2,040,627 2,710
No topic 898,925 9,808
Journalism 699,257 611

Modality

Mode Tokens Documents
written 32,866,787 45,095
spoken 8,043,212 3,174

Register

Register Tokens Documents
academic 21,052,605 5,159
general 19,857,394 43,110
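
The modality and register totals can be reproduced from the per-source figures in the Data sources table. The grouping of sources used below is an inference from the numbers, not something stated explicitly on this page, and the variable and function names are chosen for illustration only.

```python
from typing import Iterable, Tuple

# (tokens, documents) per source, copied from the Data sources table.
SOURCES = {
    "Wiki": (13_494_855, 40_134),
    "Elsevier": (11_969_773, 2_200),
    "BAWE": (7_402_159, 2_761),
    "TED": (5_925_929, 2_456),
    "BASE": (1_680_673, 198),
    "BBC": (436_610, 520),
}

# Inferred groupings (an assumption that happens to match the published totals).
GROUPS = {
    "written": ["Wiki", "Elsevier", "BAWE"],
    "spoken": ["TED", "BASE", "BBC"],
    "academic": ["Elsevier", "BAWE", "BASE"],
    "general": ["Wiki", "TED", "BBC"],
}


def total(names: Iterable[str]) -> Tuple[int, int]:
    """Sum tokens and documents over the named sources."""
    names = list(names)
    tokens = sum(SOURCES[n][0] for n in names)
    documents = sum(SOURCES[n][1] for n in names)
    return tokens, documents


print("all sources", total(SOURCES))
for label, names in GROUPS.items():
    print(label, total(names))
```

The topic table cannot be checked the same way: its rows add up to more than the corpus total.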