by Tony McEnery and Andrew Hardie; published by Cambridge University Press, 2012  


Part 1: Corpus Linguistics

What is corpus linguistics?

Corpus linguistics is a field which focuses upon a set of procedures, or methods, for studying language. We can take a corpus-based approach to many areas of linguistics. Importantly, the development of corpus linguistics has also spawned new theories of language – theories which draw their inspiration from attested language use and the findings drawn from it.

But corpus linguistics is not a monolithic, consensually agreed set of methods and procedures. It is in fact a heterogeneous field – although there are some basic generalisations that we can make.

A concordance in the AntConc tool

The main features of corpus linguistics

Research in corpus linguistics deals with some set of machine-readable texts which is deemed an appropriate basis on which to study a particular set of research questions. The set of texts, or corpus, is usually of a size which defies analysis by hand and eye alone within any reasonable timeframe. For this reason, corpora are invariably exploited using software search tools. Concordancers allow users to look at words in context. Other tools allow the production of frequency data, for example a word frequency list, which lists all words appearing in a corpus and specifies how many times each one occurs in that corpus. Concordances and frequency data exemplify, respectively, the two forms of analysis, qualitative and quantitative, that are equally important to corpus linguistics.
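The two kinds of output described above can be illustrated with a minimal Python sketch. It is not how AntConc or any other concordancer is implemented; the function names, the simple tokenization rule, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first; ties are sorted alphabetically."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def concordance(tokens, node, width=3):
    """Return key-word-in-context lines with `width` tokens on either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Invented sample text standing in for a corpus.
sample = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(sample)

for word, count in frequency_list(tokens)[:3]:
    print(word, count)          # quantitative view: how often each word occurs

for line in concordance(tokens, "sat", width=2):
    print(line)                 # qualitative view: each occurrence in its context
```

A real corpus tool would add proper tokenization, sorting of concordance lines, and support for corpora far too large to hold in a single string.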

A word frequency list in the AntConc tool

Different types of corpus study

The following features effectively distinguish different types of studies in corpus linguistics:

This page was last modified on Thursday 26 May 2011 at 3:49 am.


Department of Linguistics and English Language, Lancaster University, United Kingdom


The Oxford Handbook of Linguistic Analysis


8 Corpus-Based and Corpus-driven Analyses of Language Variation and Use

Douglas Biber is Regents' Professor of English (Applied Linguistics) at Northern Arizona University. His research efforts have focused on corpus linguistics, English grammar, and register variation (in English and cross-linguistic; synchronic and diachronic). His publications include books on register variation and corpus linguistics published by Cambridge University Press (1988, 1995, 1998, to appear), the co-authored Longman Grammar of Spoken and Written English (1999), and more recent studies of language use in university settings and discourse structure investigated from a corpus perspective (both published by Benjamins: 2006 and 2007).

  • Published: 18 September 2012

Corpus linguistics is a research approach that has developed over the past few decades to support empirical investigations of language variation and use, resulting in research findings which have much greater generalizability and validity than would otherwise be feasible. Corpus studies have used two major research approaches: ‘corpus-based’ and ‘corpus-driven’. Corpus-based research assumes the validity of linguistic forms and structures derived from linguistic theory. The primary goal of research is to analyse the systematic patterns of variation and use for those pre-defined linguistic features. Corpus-driven research is more inductive, so that the linguistic constructs themselves emerge from analysis of a corpus. This chapter illustrates the kinds of analyses and perspectives on language use possible from both corpus-based and corpus-driven approaches.

8.1 Introduction

Corpus linguistics is a research approach that has developed over the past several decades to support empirical investigations of language variation and use, resulting in research findings that have much greater generalizability and validity than would otherwise be feasible. Corpus linguistics is not in itself a model of language. In fact, at one level it can be regarded as primarily a methodological approach:

  • it is empirical, analyzing the actual patterns of use in natural texts;
  • it utilizes a large and principled collection of natural texts, known as a “corpus”, as the basis for analysis;
  • it makes extensive use of computers for analysis, using both automatic and interactive techniques;
  • it depends on both quantitative and qualitative analytical techniques (Biber et al. 1998: 4).

At the same time, corpus linguistics is much more than a methodological approach: these methodological innovations have enabled researchers to ask fundamentally different kinds of research questions, sometimes resulting in radically different perspectives on language variation and use from those taken in previous research. Corpus linguistic research offers strong support for the view that language variation is systematic and can be described using empirical, quantitative methods. Variation often involves complex patterns consisting of the interaction among several different linguistic parameters, but, in the end, it is systematic. Beyond this, the major contribution of corpus linguistics is to document the existence of linguistic constructs that are not recognized by current linguistic theories. Research of this type—referred to as a “corpus-driven” approach—identifies strong tendencies for words and grammatical constructions to pattern together in particular ways, while other theoretically possible combinations rarely occur. Corpus-driven research has shown that these tendencies are much stronger and more pervasive than previously suspected and that they usually have semantic or functional associations (see section 8.3 below).
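The idea that some word combinations occur far more often than chance would predict can be made concrete with an association score. The sketch below uses pointwise mutual information (PMI) over adjacent word pairs; PMI is one commonly used measure and is chosen here only as an illustration, not as the method used in the studies this chapter discusses. The function name and the toy token sequence are assumptions.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Score each adjacent word pair by pointwise mutual information (base 2).

    PMI compares how often the pair was observed with how often it would be
    expected if the two words occurred independently of each other.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), pair_count in bigrams.items():
        p_pair = pair_count / n_pairs
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Invented toy data: "strong tea" recurs, the reverse order occurs only by accident.
tokens = "strong tea strong tea powerful computer strong tea".split()
scores = bigram_pmi(tokens)

# List pairs from most to least strongly associated.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

On a sample this small the scores are not meaningful; PMI is also known to inflate rare pairs, so studies on real corpora typically apply a minimum frequency threshold or use a different statistic.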

In some ways, corpus research can be seen as a logical extension of quantitative research in sociolinguistics begun in the 1960s (e.g., Labov 1966), which rejected “free variation” as an adequate account of linguistic choice and argued instead for the existence of linguistic variable rules (see Chambers and Trudgill 1980: 59–61; 146–9). However, research in corpus linguistics differs from quantitative sociolinguistic research in at least two major ways:

(1) Quantitative sociolinguistics has focused on a relatively small range of varieties: usually the social dialects that exist within a single city, with secondary attention given to the set of “styles” that occur during a sociolinguistic interview. In contrast, corpus research has investigated the patterns of variation among a much wider range of varieties, including spoken and written registers as well as dialects.

Corpus-based dialect studies have investigated national varieties, regional dialects within a country, and social dialects. However, the biggest difference from quantitative sociolinguistics here has to do with the investigation of situationally-defined varieties: “registers”. Quantitative sociolinguistics has restricted itself to the investigation of only spoken varieties, and considered only a few “styles”, which speakers produce during the course of a sociolinguistic interview (e.g., telling a story vs. reading a word list). In contrast, corpus-based research investigates the patterns of variation among the full set of spoken and written registers in a language. In speech, these include casual face-to-face conversation, service encounters, lectures, sermons, political debates, etc.; and, in writing, these include email messages, text-messaging, newspaper editorials, academic research articles, etc.

(2) Quantitative sociolinguistics has focused on analysis of “linguistic variables”, defined such that the variants must have identical referential meaning. Related to this restriction, quantitative sociolinguistic research has focused exclusively on nonfunctional variation. For these reasons, most quantitative sociolinguistic research has focused on phonological variables, such as [t] vs. [θ]. Sociolinguistic variation is described as indexing different social varieties, but there is no possibility of functional explanations for why a particular linguistic variant would be preferred in one variety over another.

In contrast, corpus research considers all aspects of language variation and choice, including the choice among roughly synonymous words (e.g., big, large, great), and the choice among related grammatical constructions (e.g., active vs. passive voice, dative movement, particle movement with phrasal verbs, extraposed vs. subject complement clauses). Corpus-based research goes even further, investigating distributional differences in the extent to which varieties rely on core grammatical features (e.g., the relative frequency of nouns, verbs, prepositional phrases, etc.). All of these aspects of linguistic variation are interpreted in functional terms, attempting to explain the linguistic patterns by reference to communicative and situational differences among the varieties. In fact, much corpus-based research is based on the premise that language variation is functional: that we choose to use particular linguistic features because those forms fit the communicative context of the text, whether in conversation, a political speech, a newspaper editorial, or an academic research article.
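Comparing relative frequencies across varieties requires normalizing raw counts, because the samples being compared rarely have the same length. The sketch below computes a rate per 1,000 words for the three near-synonyms mentioned above in two register samples. The two "registers" are invented one-sentence stand-ins, and the function name and tokenization rule are assumptions made for illustration.

```python
import re
from collections import Counter

def rate_per_thousand(text, targets):
    """Return the frequency of each target word per 1,000 running words of `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {word: 0.0 for word in targets}
    counts = Counter(tokens)
    return {word: 1000 * counts[word] / len(tokens) for word in targets}

# Invented toy samples; a real study would use many texts per register.
registers = {
    "conversation": "that was a big dog a really big dog and a great day",
    "academic": "a large sample and a large effect suggest a great deal of variation",
}

targets = ["big", "large", "great"]
rates = {name: rate_per_thousand(text, targets) for name, text in registers.items()}

for name, by_word in rates.items():
    print(name, {word: round(rate, 1) for word, rate in by_word.items()})
```

The same normalization applies to grammatical categories such as nouns or verbs once the texts have been tagged; the normalizing basis (per 1,000 or per million words) is a presentation choice that should match the size of the texts being compared.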

In both of these regards, corpus-based research is actually more similar to research in functional linguistics than research in quantitative sociolinguistics. By studying linguistic variation in naturally occurring discourse, functional linguists have been able to identify systematic differences in the use of linguistic variants. An early study of this type is Prince (1978), who compares the distribution and discourse functions of WH-clefts and it-clefts in spoken and written texts. Thompson and Schiffrin have carried out numerous studies in this research tradition: Thompson on detached participial clauses (1983), adverbial purpose clauses (1985), omission of the complementizer that (Thompson and Mulac 1991a; 1991b), relative clauses (Fox and Thompson 1990); and Schiffrin on verb tense (1981), causal sequences (1985a), and discourse markers (1985b). Other early studies of this type include Ward (1990) on VP preposing, Collins (1995) on dative alternation, and Myhill (1995; 1997) on modal verbs.

More recently, researchers on discourse and grammar have begun to use the tools and techniques available from corpus linguistics, with its greater emphasis on the representativeness of the language sample, and its computational tools for investigating distributional patterns across registers and across discourse contexts in large text collections (see Biber et al. 1998 ; Kennedy 1998 ; Meyer 2002 ; and McEnery et al. 2006 ). There are a number of book-length treatments reporting corpus-based investigations of grammar and discourse: for example, Tottie ( 1991 a ) on negation, Collins ( 1991 ) on clefts, Mair ( 1990 ) on infinitival complement clauses, Meyer ( 1992 ) on apposition, Mindt ( 1995 ) on modal verbs, Hunston and Francis ( 2000 ) on pattern grammar, Aijmer ( 2002 ) on discourse particles, Rohdenburg and Mondorf ( 2003 ) on grammatical variation, Lindquist and Mair ( 2004 ) on grammaticalization, Mahlberg ( 2005 ) on general nouns, and Römer ( 2005 ) on progressives.

A central concern for corpus-based studies is the representativeness of the corpus (see Biber 1993 ; Biber et al. 1998 : 246–50; McEnery et al. 2006 : 13–21, 125–30). Two considerations are crucial for corpus design: size and composition. First, corpora need to be large enough to accurately represent the distribution of linguistic features. Second, the texts in a corpus must be deliberately sampled to represent the registers in the target domain of use.

Corpus studies have used two major research approaches: “corpus-based” and “corpus-driven”. Corpus-based research assumes the validity of linguistic forms and structures derived from linguistic theory; the primary goal of research is to analyze the systematic patterns of variation and use for those predefined linguistic features. One of the major general findings from corpus-based research is that descriptions of grammatical variation and use are usually not valid for the language as a whole. Rather, characteristics of the textual environment interact with register differences, so that strong patterns in one register often represent weak patterns in other registers. As a result, most corpus-based studies of grammatical variation include consideration of register differences. The recent Longman Grammar of Spoken and Written English (Biber et al. 1999 ) is the most comprehensive reference work of this kind, applying corpus-based analyses to show how any grammatical feature can be described for its patterns of use across discourse contexts and across spoken and written registers.

In contrast, “corpus-driven” research is more inductive, so that the linguistic constructs themselves emerge from analysis of a corpus. The availability of very large, representative corpora, combined with computational tools for analysis, makes it possible to approach linguistic variation from this radically different perspective. The corpus-driven approach differs from the standard practice of linguistics in that it makes minimal a priori assumptions regarding the linguistic features that should be employed for the corpus analysis. In its most basic form, corpus-driven analysis assumes only the existence of words, while concepts like “phrase” and “clause” have no a priori status. Rather, co-occurrence patterns among words, discovered from the corpus analysis, are the basis for subsequent linguistic descriptions.

The following sections illustrate the kinds of analyses and perspectives on language use possible from both corpus-based and corpus-driven approaches. Section 8.2 illustrates the corpus-based approach, which documents the systematic patterns of language use, often showing that intuitions about use are wrong. Section 8.3 then illustrates the corpus-driven approach, showing how corpus research can uncover linguistic units that are not detectable using the standard methods of linguistic analysis.

8.2 Corpus-based research studies

As noted above, the corpus-based approach has some of the same basic goals as research in functional linguistics generally, to describe and explain linguistic patterns of variation and use. The goal is not to discover new linguistic features but rather to discover the systematic patterns of use that govern the linguistic features recognized by standard linguistic theory.

One major contribution of the corpus-based approach is that it establishes the centrality of register for descriptions of language use. That is, corpus-based research has shown that almost any linguistic feature or variant is distributed and used in dramatically different ways across different registers. Taken together, corpus-based studies challenge the utility of general linguistic descriptions of a language; rather, these studies have shown that any linguistic description that disregards register is incomplete or sometimes even misleading.

Considered within the larger context of quantitative social science research, the major strengths of the corpus-based approach are its high reliability and external validity. The use of computational tools ensures high reliability, since a computer program should make the same analytical decision every time it encounters the same linguistic phenomenon. More importantly, the corpus itself is deliberately constructed and evaluated for the extent to which it represents the target domain (e.g., a register or dialect). Thus, the linguistic patterns of use described in corpus-based analysis are generalizable, explicitly addressing issues of external validity.

However, judged by the normal interests of linguists, the greater contribution of the corpus-based approach is that it often produces surprising findings that run directly counter to our prior intuitions. That is, as linguists we often have strong intuitions about language use (in addition to intuitions about grammaticality), believing that we have a good sense of what is normal in discourse. While it is difficult to evaluate intuitions about grammaticality, intuitions about use are open to empirical investigation. Corpus-based research is ideally suited for this task, since one of the main research goals of this approach is to empirically identify the linguistic patterns that are extremely frequent or rare in discourse from a particular variety. And when such empirical investigations are conducted, they often reveal patterns that are directly counter to our prior expectations.

A simple case study of this type, taken from the Longman Grammar of Spoken and Written English (Biber et al. 1999 : 460–3), concerns the distribution of verb aspect in English conversation. There are three aspects distinguished in English verb phrases:

Simple aspect: Do you like it?
Progressive aspect: I was running around the house like a maniac.
Perfect aspect: You haven't even gone yet.

The question to consider is this: which grammatical aspect is most common in face-to-face conversation?

It is much easier to illustrate the unreliability of intuitions in a spoken lecture because audience members can be forced to commit to an answer before seeing the corpus findings. For full effect, the reader here should concretely decide on an answer before reading further.

Hundreds of linguists have been polled on this question, and the overwhelming majority have selected progressive aspect as the most common verb aspect in English conversation. In fact, as Figure 8.1 shows, progressive aspect is more common in conversation than in other registers. The contrast with academic prose is especially noteworthy: progressive aspect is rare in academic prose but common in conversation.

However, as Figure 8.2 shows, it is not at all correct to conclude that progressive aspect is the most common choice in conversation. Rather, simple aspect is clearly the unmarked choice. In fact, simple aspect verb phrases are more than 20 times as common as progressives in conversation.

The following conversation illustrates this extreme reliance on simple aspect ( underlined ) in contrast to the much more specialized use of progressive aspect (in bold italics ):

Jan: Well girls we better open the presents, I'm going to fall asleep.
Kris: I know.
Amanda: Okay, right after he rolls out this last batch.
Rita: Your face is really hot. Why are you leaving it, we're not leaving till Sunday are we?
Jan: Which ever day you prefer, Saturday or Sunday.
Rita: When are you leaving?
Amanda: Sunday morning.
Rita: Oh, well we don't have to do it right away.
Kris: Oh well let's just do it.
Rita: I'd rather wait till I feel like it.
Jan: But we're doing it.
Kris: Just do and be done with it. Smoke a joint <laugh>.
Jan: Rita that'd help you sleep.
Rita: No Jan I don't think so.
Amanda: They used to make me sleep.
Rita: No that would make my mind race, yeah, typical.
Jan: Okay let's do the Christmas.
Rita: If I drink
Amanda: Okay.
Rita: If I smoke, anything, makes my mind race.
Amanda: These tins are the last ones.
Jan: It's just a little something Rita.
Rita: You go overboard. Now, don't you make us feel guilty.

Figure 8.1 Distribution of progressive aspect verb phrases across registers

As the conversational excerpt above shows, verbs of all types tend to occur with simple aspect rather than progressive aspect, including stative relational verbs (e.g., be ), mental verbs (e.g., know, prefer, feel, think ), verbs of facilitation or causation (e.g., let, help, make ), and activity verbs (e.g., do, open, fall, roll, wait, smoke, sleep, race, drink, go ). There are a few particular verbs that occur more often with progressive aspect than simple aspect, such as bleeding, chasing, shopping, dancing, dripping, marching, raining, sweating, chatting, joking, moaning, looking forward to, studying, lurking (see Biber et al. 1999 : 471–5). However, the normal style of discourse in conversation relies on simple aspect verbs (usually present tense), with shifts into progressive aspect being used to mark specialized meanings.

Figure 8.2 Distribution of aspect types across registers

A second case study—focusing on dependent clause types—illustrates how corpus-based research has established the centrality of register for descriptions of language use. Dependent clauses are often regarded as one of the best measures of grammatical complexity. In some approaches, all dependent clause types are grouped together as manifesting complexity, as with the use of t-unit length to measure language development. Further, there is a strong expectation that writing manifests a much greater use of dependent clauses than speech. So, for example, students are expected to develop increasing use of dependent clauses as they progress in their academic writing skills (see, for example, Wolfe-Quintero et al. 1998 ).

Figure 8.3 Distribution of dependent clause types across registers

Corpus-based research has shown that these predictions are based on faulty intuitions about use. That is, different dependent clause types are used and distributed in dramatically different ways, and some dependent clause types are actually much more common in conversation than in academic writing. Thus, the practice of treating all types of dependent clause as a single unified construct has no basis in actual language use.

For example, Figure 8.3 compares the use of dependent clause types in five spoken and written registers: conversation, university office hours, university teaching, university textbooks, and academic prose. Relative clauses follow the expected pattern of being much more common in academic writing and textbooks than in conversation (and office hours). Class teaching is intermediate between conversation and academic writing in the use of relative clauses. However, the other two clause types—adverbial clauses and complement clauses—are much more common in conversation than in academic writing. Office hours are interesting here because they are even more sharply distinguished from writing, with extremely frequent use of adverbial clauses and complement clauses. Class teaching is very similar to conversation in the frequent use of complement clauses and finite adverbial clauses.

Closer consideration of these patterns shows that they are interpretable in functional terms. For example, in conversation both adverbial and complement clauses occur with a highly restricted range of forms. Most adverbial clauses in conversation are finite, with especially high frequencies of if -clauses and because -clauses. Similarly, most complement clauses in conversation are finite ( that -clauses and WH-clauses). In most cases, these complement clauses are controlled by a verb that expresses a “stance” relative to the proposition contained in the complement clause (e.g., I thought that …, I don't know why … ).

In general, these distributional patterns conform to the general reliance on clausal rather than phrasal syntax in conversation (see Biber and Conrad to appear) and the communicative purposes of focusing on personal experience and activities rather than conveying more abstract information. These kinds of findings are typical of other corpus-based research, showing how the patterns of linguistic variation are systematically distributed in ways that have clear functional interpretations but are often not anticipated ahead of time.

8.3 Corpus-driven research studies

While corpus-based studies uncover surprising patterns of variation, corpus-driven analyses exploit the potential of a corpus to identify linguistic categories and units that have not been previously recognized. That is, in a corpus-driven analysis, the “descriptions aim to be comprehensive with respect to corpus evidence” (Tognini-Bonelli 2001 : 84), so that even the “linguistic categories” are derived “systematically from the recurrent patterns and the frequency distributions that emerge from language in context” (Tognini-Bonelli 2001 : 87).

In its most extreme form, the corpus-driven approach assumes only the existence of word forms; grammatical classes and syntactic structures have no a priori status in the analysis. In fact, even inflected variants of the same lemma are treated separately, with the underlying claim that each word form has its own grammar and its own meanings. So, for example, Stubbs ( 1993 : 16) cites the example of eye vs. eyes , taken from Sinclair ( 1991 b ). The plural form eyes often refers to the physical body part and is modified by an attributive adjective (e.g., blue eyes ) or a possessive determiner (e.g., your eyes ). In contrast, the singular form rarely refers to a specific body part but is commonly used in fixed expressions, like make eye contact, keep an eye on/out, catch your eye, in my mind's eye. Thus, some corpus-driven research has challenged the utility of the notion of lemma , arguing instead that each word form tends to occur in distinctive grammatical contexts and tends to have distinct meanings and uses.
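The word-form claim can be explored mechanically by profiling each form separately instead of pooling forms under a lemma. The following is a minimal sketch in Python, assuming a tokenized, lower-cased corpus. The toy sentences and the function name left_neighbours are hypothetical, invented here for illustration; they are not data or procedures from Sinclair or Stubbs.

```python
from collections import Counter

def left_neighbours(tokens, target):
    """Count the words that immediately precede each occurrence of `target`."""
    return Counter(
        tokens[i - 1] for i in range(1, len(tokens)) if tokens[i] == target
    )

# Hypothetical toy corpus, invented for illustration only.
text = (
    "she has blue eyes . keep an eye on him . close your eyes . "
    "try to make eye contact . he kept an eye out . her blue eyes shone ."
)
tokens = text.split()

# Each word form is profiled on its own, with no lemmatization step.
eye_profile = left_neighbours(tokens, "eye")
eyes_profile = left_neighbours(tokens, "eyes")

print(eye_profile.most_common())
print(eyes_profile.most_common())
```

In a real study the same comparison would be run over a large corpus and would usually look at a wider window than one word to the left; the point of the sketch is only that the two forms are never merged before counting.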

In actual practice, a fairly wide range of methodologies have been used under the umbrella of corpus-driven research. These methodologies can all be distinguished from corpus-based research by the nature of their central research goals:

corpus-driven research: attempting to uncover new linguistic constructs through inductive analysis of corpora;

corpus-based research: attempting to describe the systematic patterns of variation and use for linguistic features and constructs that have been previously identified by linguistic theory.

However, corpus-driven methodologies can differ from one study to the next in three key respects:

the extent to which they are based on analysis of lemmas vs. each word form;

the extent to which they are based on previously defined linguistic constructs (e.g., part-of-speech categories and syntactic structures) vs. simple sequences of words;

the role of frequency evidence in the analysis.

The following sections survey some major corpus-driven studies, introducing the contributions that result from this research approach while also describing the key methodological differences within this general approach. Section 8.3.1 illustrates one specific type of analysis undertaken from an extreme corpus-driven approach: the investigation of “lexical bundles”, which are the most common recurrent sequences of word forms in a register. It turns out that these word sequences have distinctive structural and functional correlates, even though they rarely correspond to complete linguistic structures recognized by current linguistic theories.

Next, section 8.3.2 surveys research done within the framework of “pattern grammar”. These studies adopt a more hybrid approach: they assume the existence of some grammatical classes (e.g., verb, noun) and basic syntactic structures, but they are corpus-driven in that they focus on the linguistic units that emerge from corpus analysis, with a primary focus on the inter-relation of words, grammar, and meaning. Frequency plays a relatively minor role in analyses done within this framework. In fact, as discussed in section 8.3.3 , there is something of a disconnect between theoretical discussions of the corpus-driven approach, where analyses are based on “recurrent patterns” and “frequency distributions” (Tognini-Bonelli 2001 : 87), and the actual practice of scholars working in pattern grammar, which has focused much more on form-meaning associations with relatively little accountability to quantitative evidence from the corpus.

Finally, section 8.3.4 introduces Multi-Dimensional analysis, which might also be considered a hybrid approach: it assumes the validity of predefined grammatical categories (e.g., nominalizations, past tense verbs) and syntactic features (e.g., WH relative clauses, conditional adverbial clauses), but it uses frequency-based corpus-driven methods to discover the underlying parameters of linguistic variation that best distinguish among spoken and written registers.

8.3.1 Lexical bundles

As noted above, the strictest form of corpus-driven analysis assumes only the existence of word forms. Some researchers interested in the study of formulaic language have adopted this approach, beginning with simple word forms and giving priority to frequency, to identify recurrent word sequences (e.g., Salem 1987 ; Altenberg and Eeg-Olofsson 1990 ; Altenberg 1998 ; Butler 1998 ; and Schmitt et al. 2004 ). Several of these studies have investigated recurrent word sequences under the rubric of “lexical bundles”, comparing their characteristics in different spoken and written registers (e.g., Biber et al. 1999 , Chapter 13; Biber and Conrad 1999 ; Biber et al. 2004 ; Cortes 2002 ; 2004 ; Partington and Morley 2004 ; Nesi and Basturkmen 2006 ; Biber and Barbieri 2007 ; Tracy-Ventura et al. 2007 ; and Biber et al. to appear).

Lexical bundles are defined as the multi-word sequences that recur most frequently and are distributed widely across different texts. Lexical bundles in English conversation are word sequences like I don't know if or I just wanted to. They are usually neither structurally complete nor idiomatic in meaning.

The initial analysis of lexical bundles in English (Biber et al. 1999 , Chapter 13) compared the frequent word sequences in conversation and academic prose, based on analysis of c. 5-million-word sub-corpora from each register. Figure 8.4 shows the overall distribution of all 3-word and 4-word lexical bundles occurring more than 10 times per million words (distributed across at least five different texts). Not surprisingly, there are almost 10 times as many 3-word bundles as 4-word bundles. It is perhaps more surprising that there are many more lexical bundles in conversation than in academic writing.

Lexical bundles are identified using a corpus-driven approach, based solely on distributional criteria (rate of occurrence of word sequences and their distribution across texts). As a result, lexical bundles are not necessarily complete structural units recognized by current linguistic theories. However, once they have been identified using corpus-driven techniques, it is possible to carry out an interpretive analysis to determine if they have any systematic structural and functional characteristics.
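The two distributional criteria described above (a rate of occurrence per million words and a minimum number of different texts) can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure used for the Longman Grammar: the function name lexical_bundles, its parameter names, and the toy texts are assumptions introduced here, and tokenization is assumed to have been done already.

```python
from collections import Counter, defaultdict

def lexical_bundles(texts, n=4, min_per_million=10, min_texts=5):
    """Return n-word sequences that pass a rate and a dispersion threshold.

    `texts` is a list of token lists, one list per text.
    The result maps each qualifying sequence to its rate per million words.
    """
    counts = Counter()
    text_range = defaultdict(set)
    total_tokens = 0
    for text_id, tokens in enumerate(texts):
        total_tokens += len(tokens)
        # Sequences are taken within a text only, never across text boundaries.
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            counts[gram] += 1
            text_range[gram].add(text_id)
    bundles = {}
    for gram, freq in counts.items():
        rate = freq * 1_000_000 / total_tokens  # occurrences per million words
        if rate > min_per_million and len(text_range[gram]) >= min_texts:
            bundles[gram] = rate
    return bundles

# Hypothetical toy corpus: five short texts sharing one opening, plus one other.
texts = [f"well i don't know if text{k} is ready".split() for k in range(5)]
texts.append("something else entirely appears in this text".split())

found = lexical_bundles(texts, n=4)
print(sorted(" ".join(gram) for gram in found))
```

With a corpus this small, every sequence trivially exceeds the rate threshold, so the dispersion criterion does all the filtering; on sub-corpora of several million words both criteria become meaningful. Note that nothing in the function refers to grammatical structure, which is why the resulting sequences need not be complete phrases or clauses.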

This post-hoc analysis shows that lexical bundles differ from the formulaic expressions identified using traditional methods in three major respects. First, lexical bundles are by definition extremely common. Second, most lexical bundles are not idiomatic in meaning and not perceptually salient. For example, the meanings of bundles like do you want to or I don't know what are transparent from the individual words. And, finally, lexical bundles usually do not represent a complete structural unit. For example, Biber et al. ( 1999 : 993–1000) found that only 15% of the lexical bundles in conversation can be regarded as complete phrases or clauses, while less than 5% of the lexical bundles in academic prose represent complete structural units. Instead, most lexical bundles bridge two structural units: they begin at a clause or phrase boundary, but the last words of the bundle are the beginning elements of a second structural unit. Most of the bundles in speech bridge two clauses (e.g., I want to know, well that's what I ), while bundles in writing usually bridge two phrases (e.g., in the case of, the base of the ).

Figure 8.4 Number of different lexical bundles in English (occurring more than 10 times per million words)

In contrast, the formulaic expressions recognized by linguistic theory are usually complete structural units and idiomatic in meaning. However, corpus analysis shows that formulaic expressions with those characteristics are usually quite rare. For example, idioms such as kick the bucket and a slap in the face are rarely attested in natural conversation. (Idioms are occasionally used in fictional dialogue, but even there they are not common; see Biber et al. 1999 : 1024–6).

Although most lexical bundles are not complete structural units, they do usually have strong grammatical correlates. For example, bundles like you want me to are constructed from verbs and clause components, while bundles like in the case of are constructed from noun phrase and prepositional phrase components. In English, two major structural types of lexical bundle can be distinguished: clausal and phrasal. Many clausal bundles simply incorporate verb phrase fragments, such as it's going to be and what do you think. Other clausal bundles are composed of dependent clause fragments rather than simple verb phrase fragments, such as when we get to and that I want to. In contrast, phrasal bundles either consist of noun phrase components, usually ending with the start of a postmodifier (e.g., the end of the, those of you who ), or prepositional phrase components with embedded modifiers (e.g., of the things that ).

Figure 8.5 plots the distribution of these lexical bundle types across registers, showing that the structural correlates of lexical bundles in conversation are strikingly different from those in academic prose. (Figure 8.5 is based on a detailed analysis of the 4-word bundles that occur more than 40 times per million words.) In conversation, almost 90% of all common lexical bundles are declarative or interrogative clause segments. In fact, c. 50% of these lexical bundles begin with a personal pronoun + verb phrase (such as I don't know why, I thought that was ). An additional 19% of the bundles consist of an extended verb phrase fragment (e.g., have a look at ), while another 17% of the bundles are question fragments (e.g., can I have a ). In contrast, the lexical bundles in academic prose are phrasal rather than clausal. Almost 70% of the common bundles in academic prose consist of a noun phrase with an embedded prepositional phrase fragment (e.g., the nature of the ) or a sequence that bridges across two prepositional phrases (e.g., as a result of ).

Although they are neither idiomatic nor structurally complete, lexical bundles are important building blocks in discourse. Lexical bundles often provide a kind of pragmatic “head” for larger phrases and clauses; the bundle functions as a discourse frame for the expression of new information in the following slot. That is, the lexical bundle usually expresses stance or textual meanings, while the remainder of the phrase/clause expresses new propositional information that has been framed by the lexical bundle. In this way, lexical bundles provide interpretive frames for the developing discourse. For example,

I want you to write a very brief summary of his lecture.
Hermeneutic efforts are provoked by the fact that the interweaving of system integration and social integration […] keeps societal processes transparent …

Three primary discourse functions can be distinguished for lexical bundles in English: (1) stance expressions, (2) discourse organizers, and (3) referential expressions (see Biber et al. 2004 ). Stance bundles express epistemic evaluations or attitudinal/modality meanings:

Epistemic lexical bundles:
I don't know what the voltage is here.
I thought it was the other way around.

Attitudinal/modality bundles:
I don't want to deliver bad news to her.
All you have to do is work on it.

Figure 8.5 Distribution of lexical bundles across structural types (4-word bundles occurring more than 40 times per million words)

Discourse-organizing bundles function to indicate the overall discourse structure: introducing topics, topic elaboration/clarification, confirmation checks, etc.:

What I want to do is quickly run through the exercise …
Yes, you know there was more of a playful thing with it, you know what I mean?

Finally, referential bundles specify an entity or single out some particular attribute of an entity as especially important:

Students must define and constantly refine the nature of the problem.
She's in that office down there, at the end of the hall.

Figure 8.6 shows that the typical discourse functions of lexical bundles are strikingly different in conversation vs. academic writing: most bundles are used for stance functions in conversation, with a number also being used for discourse-organizing functions. In contrast, most bundles are used for referential functions in academic prose. These findings indicate that formulaic expressions develop to serve the most important communicative needs of a register. It further turns out that there is a strong association between structural type and functional type for these lexical bundles: most stance bundles employ verbs or clause fragments, while most referential bundles are composed of noun phrase and prepositional phrase fragments.

Figure 8.6 Distribution of lexical bundles across functional types (4-word bundles occurring more than 40 times per million words)

In summary, a minimalist corpus-driven approach, beginning with only the existence of word forms, shows that words in English co-occur in highly frequent fixed sequences. These sequences are not complete constituents recognized by traditional theories, but they are readily interpretable in both structural and functional terms.

8.3.2 The interdependence of lexis, grammar, and meaning: Pattern grammar

Many scholars working within a corpus-driven framework have focused on the meaning and use of particular words, arguing that lexis, grammar, and meaning are fundamentally intertwined (e.g., Francis et al. 1996 ; 1998 ; Hunston and Francis 1998 ; 2000 ; Sinclair 1991a ; Stubbs 1993 ; and Tognini-Bonelli 2001 ). The best-developed application of corpus-driven research with these goals is the “pattern grammar” reference book series (e.g., Francis et al. 1996 ; 1998 ; see also Hunston and Francis 2000 ).

The pattern grammar studies might actually be considered hybrids, combining corpus-based and corpus-driven methodologies. They are corpus-based in that they assume the existence (and definition) of basic part-of-speech categories and some syntactic constructions, but they are corpus-driven in that they focus primarily on the construct of the grammatical pattern: “a phraseology frequently associated with (a sense of) a word … Patterns and lexis are mutually dependent, in that each pattern occurs with a restricted set of lexical items, and each lexical item occurs with a restricted set of patterns. In addition, patterns are closely associated with meaning, firstly because in many cases different senses of words are distinguished by their typical occurrence in different patterns; and secondly because words which share a given pattern tend also to share an aspect of meaning” (Hunston and Francis 2000 : 3). Thus, a pattern is a combination of words that “occurs relatively frequently”, is “dependent on a particular word choice”, and has “a clear meaning associated with it” (Hunston and Francis 2000 : 37). Grammatical patterns are not necessarily complete structures (phrases or clauses) recognized by linguistic theory. Thus, following the central defining characteristic of corpus-driven research given above, the pattern grammar studies attempt to uncover new linguistic constructs—the patterns —through inductive analysis of corpora.

A central claim of this framework is that grammatical patterns have inherent meaning, shared across the set of words that can occur in a pattern. For example, many of the verbs that occur in the grammatical pattern V+ over +NP express meanings relating to conflict or disagreement, such as bicker, disagree, fight, quarrel, quibble , and wrangle (see Hunston and Francis 2000 : 43–4); thus it can be argued that the grammatical pattern itself somehow entails this meaning.
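A first, rough approximation of how candidate verbs for a frame such as V + over + NP might be collected from a part-of-speech-tagged corpus is sketched below in Python. Everything here is an assumption made for illustration: the simplified tag labels, the NP_START set, the function name verbs_in_v_over_np, and the hand-tagged sample are hypothetical and are not taken from the pattern grammar studies. In particular, the sketch matches surface sequences of word forms only; it does not lemmatize and cannot tell a complement from an adverbial, a distinction the pattern grammar approach relies on.

```python
from collections import Counter

# Assumed simplified tags that can open a noun phrase (hypothetical tag set).
NP_START = {"DET", "ADJ", "NOUN", "PRON"}

def verbs_in_v_over_np(tagged):
    """Count verbs directly followed by 'over' and the start of a noun phrase."""
    hits = Counter()
    for (w1, t1), (w2, _), (_, t3) in zip(tagged, tagged[1:], tagged[2:]):
        if t1 == "VERB" and w2.lower() == "over" and t3 in NP_START:
            hits[w1.lower()] += 1
    return hits

# Hypothetical hand-tagged sample, invented for illustration only.
tagged = [
    ("They", "PRON"), ("quarrel", "VERB"), ("over", "ADP"), ("money", "NOUN"),
    (".", "PUNCT"),
    ("We", "PRON"), ("bicker", "VERB"), ("over", "ADP"), ("the", "DET"),
    ("bill", "NOUN"), (".", "PUNCT"),
    ("They", "PRON"), ("quarrel", "VERB"), ("over", "ADP"), ("every", "DET"),
    ("detail", "NOUN"), (".", "PUNCT"),
    ("It", "PRON"), ("is", "VERB"), ("over", "ADV"), (".", "PUNCT"),
]

verb_counts = verbs_in_v_over_np(tagged)
print(verb_counts.most_common())
```

The resulting verb list is only raw material: deciding whether the verbs share a meaning such as conflict or disagreement remains an interpretive step carried out by the analyst.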

The pattern grammar reference books (Francis et al. 1996 ; 1998 ) have attempted to provide a comprehensive catalog of the grammatical patterns for verbs, nouns, and adjectives in English. These books show that there are systematic regularities in the associations between grammatical frames, sets of words, and particular meanings on a much larger scale than it could have been possible to anticipate before the introduction of large-scale corpus analysis. For example, the reference book on grammatical patterns for verbs (Francis et al. 1996 ) includes over 700 different patterns and catalogs the use of over 4,000 verbs with respect to those patterns. The reference book on grammatical patterns for nouns and adjectives (Francis et al. 1998 ) is similar in scope, with over 200 patterns used to describe the use of over 8,000 nouns and adjectives.

The pattern grammar reference books do not address some of the stronger theoretical claims that have been associated with the corpus-driven approach. For example, “patterns” are based on analysis of lemmas rather than individual word forms, and thus the pattern grammar studies provide no support for the general claim that each word form has its own grammar. 1

The pattern grammar studies also do not support the strong version of the claim that each grammatical pattern has its own meaning. In fact, it is rarely the case that a grammatical frame corresponds to a single meaning domain. However, these studies do provide extensive support for a weaker form of the claim, documenting how the words that occur in a grammatical frame belong to a relatively small set of meaning groups. For example, the adjectives that occur in the grammatical frame ADJ in N mostly fall into several major meaning groups, such as:

adjectives that express high interest or participation:

e.g., absorbed, embroiled, engaged, engrossed, enmeshed, immersed, interested, involved, mixed up, wrapped up

adjectives that express a deficit:

e.g., deficient, lacking, wanting

adjectives that express an amount or degree:

e.g., awash, high, low, poor, rich

adjectives that express proficiency or fluency:

e.g., fluent, proficient, schooled, skilful, skilled, versed

adjectives that express that something is covered:

e.g., bathed, clad, clothed, coated, plastered, shrouded, smothered

(see Francis et al. 1998 : 444–51; Hunston and Francis 2000 : 75–6).

As noted above, the methodology used for the pattern grammar studies relaxes the strict requirements of corpus-driven methodology. First, predefined grammatical constructs are used in the approach, including basic grammatical classes, phrase types, and even distinctions that require a priori syntactic analysis. In addition, frequency plays only a minor role in the analysis, and some word combinations that occur frequently are not regarded as patterns at all. For example, nouns followed by the complementizer that are analyzed as patterns (e.g., fact, claim, stipulation, expectation, disgust, problem , etc.), but nouns followed by the relative pronoun that do not constitute a pattern, even if the combination is frequent (e.g., extent, way, thing, questions, evidence, factors + that ). Similarly, prepositions are analyzed for their syntactic function in the sequence noun + preposition, to distinguish prepositional phrases functioning as adverbials (which do not count as part of any pattern) from prepositional phrases that complement the preceding noun (which do constitute a pattern). So, for example, the combinations for the pattern ADJ in N listed above all include a prepositional phrase that complements the adjective. In contrast, when the prepositional phrase has an adverbial function, it is analyzed as not representing a pattern, even if the combination is frequent. Thus, the following adjectives do not belong to any pattern when they occur in the combination ADJ in N , even though they occur frequently and represent relatively coherent meaning groups:

adamant, firm, resolute, steadfast, unequivocal

loud, vehement, vocal, vociferous

(see Hunston and Francis 2000 : 76).

Regardless of the specific methodological considerations, the corpus-driven approach as realized in the pattern grammar studies has shown that there are systematic regularities in the associations between grammatical frames, sets of words, and particular meanings, on a much more comprehensive scale than could have been anticipated before the availability of large corpora and corpus-analysis tools.

8.3.3 The role of frequency in corpus-driven analysis

Surprisingly, one major difference among corpus-driven studies concerns the role of frequency evidence. Nearly every description of the corpus-driven approach includes mention of frequency, as in: (a) the “linguistic categories” are derived “systematically from the recurrent patterns and the frequency distributions that emerge from language in context” (Tognini-Bonelli 2001 : 87); (b) in a grammar pattern, “a combination of words occurs relatively frequently” (Hunston and Francis 2000 : 37).

In the study of lexical bundles, frequency evidence is primary. This framework can be regarded as the most extreme test of the corpus-driven approach, addressing the question of whether the most commonly occurring sequences of word forms can be interpreted as linguistically significant units. In contrast, frequency is not actually important in pattern grammar studies. On the one hand, frequent word combinations are not included in the pattern analysis if they represent different syntactic constructions, as described in the last section. The combination satisfaction that provides another example of this type. When the that initiates a complement clause, this combination is one of the realizations of the “happiness” N that pattern (Francis et al. 1998 : 111), as in:

One should of course record one's satisfaction that the two leaders got on well together .

However, it is much more frequent for the combination satisfaction that to represent different syntactic constructions, as in:

(a) The satisfaction provided by conformity is in competition with the often more immediate satisfaction that can be provided by crime.

(b) He then proved to his own satisfaction that all such endeavours were doomed to failure.

In (a), the word that initiates a relative clause, and in (b), the that initiates a verb complement clause controlled by proved. Neither of these combinations is analyzed as belonging to a pattern, even though they are more frequent than the combination of satisfaction followed by a that noun complement clause.

Thus, frequency is not a decisive factor in identifying “patterns”, despite the definition that requires that the combination of words in a pattern must occur “relatively frequently”. Instead, the criteria that a grammatical pattern must be associated with a particular set of words and have a clear meaning are more decisive (see Hunston and Francis 2000 : 67–76).

In fact, some corpus-driven linguists interested in the lexis–grammar interface have overtly argued against the importance of frequency. For example, Sinclair notes that

some numbers are more important than others. Certainly the distinction between 0 and 1 is fundamental, being the occurrence or non-occurrence of a phenomenon. The distinction between 1 and more than one is also of great importance … [because even two unconnected tokens constitute] the recurrence of a linguistic event …, [which] permits the reasonable assumption that the event can be systematically related to a unit of meaning. In the study of meaning it is not usually necessary to go much beyond the recognition of recurrence [i.e., two independent tokens] …. (Sinclair 2001 : 343–4)

Similarly, Tognini-Bonelli notes that

It is therefore appropriate to set up as the minimum sufficient condition for a pattern of occurrence to merit a place in the description of the language, that it occurs at least twice, and the occurrences appear to be independent of each other …. (Tognini-Bonelli 2001 : 89)
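The recurrence criterion quoted above can be illustrated with a short Python sketch. It assumes that "independent" occurrences can be approximated as occurrences in different texts; that simplification is introduced here and is not a definition given by Sinclair or Tognini-Bonelli. The function name and the toy observations are hypothetical.

```python
from collections import defaultdict

def recurrent_patterns(observations, min_texts=2):
    """Return the patterns attested in at least `min_texts` distinct texts.

    `observations` is an iterable of (text_id, pattern) pairs.
    Assumption (not from the quoted sources): two occurrences count as
    independent when they come from different texts.
    """
    texts_by_pattern = defaultdict(set)
    for text_id, pattern in observations:
        texts_by_pattern[pattern].add(text_id)
    return sorted(p for p, ids in texts_by_pattern.items() if len(ids) >= min_texts)

# Invented observations for illustration only.
observations = [
    ("text_a", "satisfaction that"),
    ("text_b", "satisfaction that"),
    ("text_a", "fluent in"),
    ("text_a", "fluent in"),  # same text twice: not treated as independent here
    ("text_c", "awash in"),
]

recurrent = recurrent_patterns(observations)
print(recurrent)
```

Under this reading, a combination repeated within a single text does not meet the threshold, whereas one attested once in each of two texts does.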

Thus, there is some tension here between the underlying definition of the corpus-driven approach, which derives linguistic categories from “recurrent patterns” and “frequency distributions” (Tognini-Bonelli 2001 : 87), and the actual practice of scholars working on pattern grammar and the lexis–grammar–meaning interconnection, which has focused much more on form–meaning associations with relatively little accountability to quantitative distributional patterns in a corpus. Here again, we see the central defining characteristic of corpus-driven research to be the shared goal of identifying new linguistic constructs through inductive analysis of a corpus, regardless of differences in the specific methodological approaches.

8.3.4 Linguistic “dimensions” of register variation

As discussed in section 8.2 above, corpus research has been used to describe particular linguistic features and their variants, showing how these features vary in their distribution and patterns of use across registers. This relationship can also be approached from the opposite perspective, with a focus on describing the registers rather than describing the use of particular linguistic features.

It turns out, though, that the distribution of individual linguistic features cannot reliably distinguish among registers. There are simply too many different linguistic characteristics to consider, and individual features often have idiosyncratic distributions. Instead, sociolinguistic research has argued that register descriptions must be based on linguistic co-occurrence patterns (see, for example, Ervin-Tripp 1972 ; Hymes 1974; Brown and Fraser 1979: 38–9; Halliday 1988: 162).

Multi-Dimensional (MD) analysis is a corpus-driven methodological approach that identifies the frequent linguistic co-occurrence patterns in a language, relying on inductive empirical/quantitative analysis (see, for example, Biber 1988 ; 1995). Frequency plays a central role in the analysis, since each dimension represents a constellation of linguistic features that frequently co-occur in texts. These “dimensions” of variation can be regarded as linguistic constructs not previously recognized by linguistic theory. Thus, although the framework was developed to describe patterns of register variation (rather than the meaning and use of individual words), MD analysis is clearly a corpus-driven methodology in that the linguistic constructs—the “dimensions”—emerge from analysis of linguistic co-occurrence patterns in the corpus.

The set of co-occurring linguistic features that comprise each dimension is identified quantitatively. That is, based on the actual distributions of linguistic features in a large corpus of texts, statistical techniques (specifically factor analysis) are used to identify the sets of linguistic features that frequently co-occur in texts.
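Factor analysis works from the pairwise correlations among feature frequencies across texts. As a rough illustration of what co-occurrence means quantitatively, the following Python sketch computes Pearson correlations for three features. It does not perform the factor extraction itself, and the feature names and per-text rates are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical rates per 1,000 words in five texts (not real corpus data).
rates = {
    "contractions": [30.0, 25.0, 5.0, 2.0, 1.0],
    "private_verbs": [28.0, 22.0, 8.0, 4.0, 3.0],
    "nouns": [140.0, 160.0, 250.0, 290.0, 300.0],
}

features = sorted(rates)
correlations = {
    (a, b): pearson(rates[a], rates[b])
    for a in features
    for b in features
    if a < b
}

for (a, b), r in correlations.items():
    print(f"{a} ~ {b}: r = {r:+.2f}")
```

In this toy data, contractions and private verbs rise and fall together, while both are inversely related to nouns. A factor analysis would group such features on one factor with opposite signs.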

The original MD analyses investigated the relations among general spoken and written registers in English, based on analysis of the LOB (Lancaster–Oslo–Bergen) Corpus (15 written registers) and the London–Lund Corpus (six spoken registers). Sixty-seven different linguistic features were analyzed computationally in each text of the corpus. Then, the co-occurrence patterns among those linguistic features were analyzed using factor analysis, identifying the underlying parameters of variation: the factors or “dimensions”. In the 1988 MD analysis, the 67 linguistic features were reduced to seven underlying dimensions. (The technical details of the factor analysis are given in Biber 1988 , Chapters 4–5; see also Biber 1995 , Chapter 5).

The dimensions are interpreted functionally, based on the assumption that linguistic co-occurrence reflects underlying communicative functions. That is, linguistic features occur together in texts because they serve related communicative functions.

The most important features on Dimensions 1–5 in the 1988 MD analysis are:

Dimension 1: Involved vs. Informational Production

Positive features: mental (private) verbs, that complementizer deletion, contractions, present tense verbs, WH-questions, 1st and 2nd person pronouns, pronoun it , indefinite pronouns, do as pro-verb, demonstrative pronouns, emphatics, hedges, amplifiers, discourse particles, causative subordination, sentence relatives, WH-clauses

Negative features: nouns, long words, prepositions, type/token ratio, attributive adjectives

Dimension 2: Narrative vs. Non-narrative Discourse

Positive features: past tense verbs, 3rd person pronouns, perfect aspect verbs, communication verbs

Negative features: present tense verbs, attributive adjectives

Dimension 3: Situation-dependent vs. Elaborated Reference

Positive features: time adverbials, place adverbials, other adverbs

Negative features: WH-relative clauses (subject gaps, object gaps), phrasal coordination, nominalizations

Dimension 4: Overt Expression of Argumentation

Positive features: prediction modals, necessity modals, possibility modals, suasive verbs, conditional subordination, split auxiliaries

Dimension 5: Abstract/Impersonal Style

Positive features: conjuncts, agentless passives, BY-passives, past participial adverbial clauses, past participial postnominal clauses, other adverbial subordinators

Each dimension can have “positive” and “negative” features. Rather than reflecting importance, positive and negative signs identify two groupings of features that occur in a complementary pattern as part of the same dimension. That is, when the positive features occur together frequently in a text, the negative features are markedly less frequent in that text, and vice versa.

On Dimension 1, the interpretation of the negative features is relatively straightforward. Nouns, word length, prepositional phrases, type/token ratio, and attributive adjectives all reflect an informational focus, a careful integration of information in a text, and precise lexical choice. Text Sample 1 illustrates these co-occurring linguistic characteristics in an academic article:

Text Sample 1. Technical academic prose

Apart from these very general group-related aspects, there are also individual aspects that need to be considered. Empirical data show that similar processes can be guided quite differently by users with different views on the purpose of the communication.

This text sample is typical of written expository prose in its dense integration of information: frequent nouns and long words, with most nouns being modified by attributive adjectives or prepositional phrases (e.g., general group-related aspects, individual aspects, empirical data, similar processes, users with different views on the purpose of the communication ).

The set of positive features on Dimension 1 is more complex, although all of these features have been associated with interpersonal interaction, a focus on personal stance, and real-time production circumstances. For example, first and second person pronouns, WH-questions, emphatics, amplifiers, and sentence relatives can all be interpreted as reflecting interpersonal interaction and the involved expression of personal stance (feelings and attitudes). Other positive features are associated with the constraints of real time production, resulting in a reduced surface form, a generalized or uncertain presentation of information, and a generally “fragmented” production of text; these include that -deletions, contractions, pro-verb DO, the pronominal forms, and final (stranded) prepositions. Text Sample 2 illustrates the use of positive Dimension 1 features in a workplace conversation:

Text Sample 2. Conversation at a reception at work

Sabrina: I'm dying of thirst.
Suzanna: Mm, hmm. Do you need some M & Ms?
Sabrina: Desperately. <laugh> Ooh, thank you. Ooh, you're so generous.
Suzanna: Hey I try.
Sabrina: Let me have my Snapple first. Is that cold-cold?
Suzanna: I don't know but there should be ice on uh, <unclear>.
Sabrina: I don't want to seem like I don't want to work and I don't want to seem like a stuffed shirt or whatever but I think this is really boring.
Suzanna: I know.
Sabrina: I would like to leave here as early as possible today, go to our rooms, and pick up this thing at eight o'clock in the morning.
Suzanna: Mm, hmm.

Overall, Factor 1 represents a dimension marking interactional, stance-focused, and generalized content (the positive features mentioned earlier) vs. high informational density and precise word choice (the negative features). Two separate communicative parameters seem to be represented here: the primary purpose of the writer/speaker (involved vs. informational), and the production circumstances (those restricted by real-time constraints vs. those enabling careful editing possibilities). Reflecting both of these parameters, the interpretive label “Involved vs. Informational Production” was proposed for the dimension underlying this factor.

The second major step in interpreting a dimension is to consider the similarities and differences among registers with respect to the set of co-occurring linguistic features. To achieve this, dimension scores are computed for each text, by summing the individual scores of the features that co-occur on a dimension (see Biber 1988 : 93–7). For example, the Dimension 1 score for each text was computed by adding together the frequencies of private verbs, that -deletions, contractions, present tense verbs, etc.—the features with positive loadings—and then subtracting the frequencies of nouns, word length, prepositions, etc.—the features with negative loadings.
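The computation described above can be sketched in Python as follows. The sketch assumes that each feature rate is standardized across the corpus (converted to a z-score) before summing, so that very frequent features such as nouns do not swamp rarer ones; the authoritative procedure is the one set out in Biber (1988: 93–7). The shortened feature lists, the rates, and the function names are all invented for illustration.

```python
from statistics import mean, pstdev

# Hypothetical, shortened feature sets; the fuller lists appear under Dimension 1.
POSITIVE = ["private_verbs", "contractions", "present_tense"]
NEGATIVE = ["nouns", "prepositions"]

def standardize(values):
    """Convert raw rates to z-scores (mean 0, standard deviation 1)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def dimension_scores(corpus, positive, negative):
    """Per text: sum the standardized positive features, subtract the negative ones."""
    z = {f: standardize([text[f] for text in corpus]) for f in positive + negative}
    return [
        sum(z[f][i] for f in positive) - sum(z[f][i] for f in negative)
        for i in range(len(corpus))
    ]

# Invented rates per 1,000 words for three texts.
corpus = [
    {"private_verbs": 30, "contractions": 40, "present_tense": 120,
     "nouns": 140, "prepositions": 80},    # conversation-like
    {"private_verbs": 5, "contractions": 1, "present_tense": 60,
     "nouns": 300, "prepositions": 140},   # academic-prose-like
    {"private_verbs": 15, "contractions": 12, "present_tense": 85,
     "nouns": 220, "prepositions": 110},   # intermediate
]

scores = dimension_scores(corpus, POSITIVE, NEGATIVE)
for text_number, score in enumerate(scores, start=1):
    print(f"text {text_number}: {score:+.2f}")
```

The conversation-like text receives the highest score and the academic-prose-like text the lowest, with the intermediate text falling between them.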

Once a dimension score is computed for each text, the mean dimension score for each register can be computed. Plots of these mean dimension scores allow linguistic characterization of any given register, comparison of the relations between any two registers, and a fuller functional interpretation of the underlying dimension.
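A minimal Python sketch of this aggregation step is given below. It groups per-text dimension scores by register and averages them; the register labels follow the text, but the scores themselves are invented and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean

def register_means(scored_texts):
    """Average dimension scores per register.

    `scored_texts` is an iterable of (register, dimension_score) pairs.
    """
    by_register = defaultdict(list)
    for register, score in scored_texts:
        by_register[register].append(score)
    return {register: mean(values) for register, values in by_register.items()}

# Invented per-text Dimension 1 scores, for illustration only.
scored_texts = [
    ("conversation", 36.0), ("conversation", 34.0),
    ("fiction", 0.5), ("fiction", -0.5),
    ("academic prose", -15.0), ("academic prose", -13.0),
]

means = register_means(scored_texts)

# List registers from the most positive to the most negative mean score.
for register, value in sorted(means.items(), key=lambda item: item[1], reverse=True):
    print(f"{value:+6.1f}  {register}")
```

Sorting the registers by their mean score gives the ordering that a plot of mean dimension scores displays.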

For example, Figure 8.7 plots the mean dimension scores of registers along Dimension 1 from the 1988 MD analysis. The registers with large positive values (such as face-to-face and telephone conversations) have high frequencies of present tense verbs, private verbs, first and second person pronouns, contractions, etc. (the features with salient positive weights on Dimension 1). At the same time, registers with large positive values have markedly low frequencies of nouns, prepositional phrases, long words, etc. (the features with salient negative weights on Dimension 1). Registers with large negative values (such as academic prose, press reportage and official documents) have the opposite linguistic characteristics: very high frequencies of nouns, prepositional phrases, etc., plus low frequencies of private verbs, contractions, etc.

The relations among registers shown in Figure 8.7 confirm the interpretation of Dimension 1 as distinguishing among texts along a continuum of involved vs. informational production. At the positive extreme, conversations are highly interactive and involved, with the language produced under real-time circumstances. Registers such as public conversations (interviews and panel discussions) are intermediate: they have a relatively informational purpose, but participants interact with one another and are still constrained by real time production. Finally, at the negative extreme, registers such as academic prose are non-interactive but highly informational in purpose, produced under controlled circumstances that permit extensive revision and editing.

Figure 8.7 shows that there is a large range of variation among spoken registers with respect to the linguistic features that comprise Dimension 1 (“Involved vs. Informational Production”). Conversation has extremely large positive Dimension 1 scores; spontaneous speeches and interviews have moderately large positive scores; while prepared speeches and broadcasts have scores around 0.0 (reflecting a balance of positive and negative linguistic features on this dimension). The written registers similarly show an extensive range of variation along Dimension 1. Expository informational registers, like official documents and academic prose, have very large negative scores; the fiction registers have scores around 0.0; while personal letters have a relatively large positive score.

Figure 8.7. Mean scores of registers along Dimension 1: Involved vs. Informational Production (adapted from Figure 7.1 in Biber 1988)

Note: Underlining denotes written registers; capitalization denotes spoken registers; F = 111.9, p < .0001, r² = 84.3%.

This distribution shows that no single register can be taken as representative of the spoken or written mode. At the extremes, written informational prose is dramatically different from spoken conversation with respect to Dimension 1 scores. But written personal letters are relatively similar to spoken conversation, while spoken prepared speeches share some Dimension 1 characteristics with written fictional registers. Taken together, these Dimension 1 patterns indicate that there is extensive overlap between the spoken and written modes in these linguistic characteristics, while the extremes of each mode (i.e., conversation vs. informational prose) are sharply distinguished from one another.
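The F and r² values reported in the note to Figure 8.7 summarize how strongly register membership predicts Dimension 1 scores. Assuming that they come from a one-way analysis of variance with register as the grouping factor (the note itself does not spell out the procedure), the two statistics can be computed as in the following Python sketch. The scores are invented and the function name is hypothetical.

```python
def anova_f_and_r2(groups):
    """One-way ANOVA on `groups`, a dict mapping register -> list of scores.

    Returns (F, r2), where r2 is the share of total variation in the
    scores that is accounted for by differences between registers.
    """
    all_scores = [s for scores in groups.values() for s in scores]
    n, k = len(all_scores), len(groups)
    grand_mean = sum(all_scores) / n

    ss_between = 0.0
    ss_within = 0.0
    for scores in groups.values():
        group_mean = sum(scores) / len(scores)
        ss_between += len(scores) * (group_mean - grand_mean) ** 2
        ss_within += sum((s - group_mean) ** 2 for s in scores)

    f_value = (ss_between / (k - 1)) / (ss_within / (n - k))
    r2 = ss_between / (ss_between + ss_within)
    return f_value, r2

# Invented Dimension 1 scores for three registers, two texts each.
groups = {
    "conversation": [34.0, 36.0],
    "fiction": [-1.0, 1.0],
    "academic prose": [-16.0, -14.0],
}

f_value, r2 = anova_f_and_r2(groups)
print(f"F = {f_value:.1f}, r2 = {r2:.1%}")
```

A large F and a high r² indicate that most of the variation in dimension scores lies between registers, with comparatively little variation among texts of the same register.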

The overall comparison of speech and writing resulting from the 1988 MD analysis is actually much more complex because six separate dimensions of variation were identified and each of these defines a different set of relations among spoken and written registers. For example, Dimension 2 is interpreted as “Narrative vs. Non-narrative Concerns”. The positive features (past tense verbs, third person pronouns, perfect aspect verbs, communication verbs, and present participial clauses) are associated with past time narration. In contrast, the negative features (present tense verbs and attributive adjectives) have non-narrative communicative functions.

The distribution of registers along Dimension 2, shown in Figure 8.8 , further supports its interpretation as Narrative vs. Non-narrative Concerns. All types of fiction have markedly high positive scores, reflecting their emphasis on narrating events. In contrast, registers which are typically more concerned with events currently in progress (e.g., broadcasts) or with building arguments rather than narrating (e.g., academic prose) have negative scores on this dimension. Finally, some registers have scores around 0.0, reflecting a mix of narrative and other features. For example, face-to-face conversation will often switch back and forth between narration of past events and discussion of current interactions.

Each of the dimensions in the analysis can be interpreted in a similar way. Overall, the 1988 MD analysis showed that English registers vary along several underlying dimensions associated with different functional considerations, including: interactiveness, involvement and personal stance, production circumstances, informational density, informational elaboration, narrative purposes, situated reference, persuasiveness or argumentation, and impersonal presentation of information.

Figure 8.8. Mean scores for registers along Dimension 2: Narrative vs. Non-Narrative Discourse (adapted from Figure 7.2 in Biber 1988)

Note: Underlining denotes written registers; capitalization denotes spoken registers; F = 32.3, p < .0001, r² = 60.8%.

Many studies have applied the 1988 dimensions of variation to study the linguistic characteristics of more specialized registers and discourse domains. For example:

However, other MD studies have undertaken new corpus-driven analyses to identify the distinctive sets of co-occurring linguistic features that occur in a particular discourse domain or in a language other than English. The following section surveys some of those studies.

8.3.4.1 Comparison of the multi-dimensional patterns across discourse domains and languages

Numerous other studies have undertaken complete MD analyses, using factor analysis to identify the dimensions of variation operating in a particular discourse domain in English, rather than applying the dimensions from the 1988 MD analysis (e.g., Biber 1992 ; 2001 ; 2006 ; 2008 ; Biber and Jones 2006 ; Biber et al. 2007 ; Friginal 2008 ; 2009 ; Kanoksilapatham 2007 ; Crossley and Louwerse 2007 ; Reppen 2001 ).

Given that each of these studies is based on a different corpus of texts, representing a different discourse domain, it is reasonable to expect that they would each identify a unique set of dimensions. This expectation is reinforced by the fact that the more recent studies have included additional linguistic features not used in earlier MD studies (e.g., semantic classes of nouns and verbs). However, despite these differences in design and research focus, there are certain striking similarities in the set of dimensions identified by these studies.

Most importantly, in nearly all of these studies, the first dimension identified by the factor analysis is associated with an informational focus vs. a personal focus (personal involvement/stance, interactivity, and/or real-time production features). For example:

It is perhaps not surprising that Dimension 1 in the original 1988 MD analysis was strongly associated with an informational vs. (inter)personal focus, given that the corpus in that study ranged from spoken conversational texts to written expository texts. For the same reason, it is somewhat predictable that a similar dimension would have emerged from the study of 18th-century written and speech-based registers. It is somewhat more surprising that academic spoken and written registers would be defined by a similar linguistic dimension (and especially surprising that classroom teaching is similar to conversation, and strikingly different from academic writing, in the use of these linguistic features). And it was completely unexpected that a similar oral/literate dimension—realized by essentially the same set of co-occurring linguistic features—would be fundamentally important in highly restricted discourse domains, including studies of job interviews, elementary school registers, and variations among the different kinds of conversation.

A second parameter found in most MD analyses corresponds to narrative discourse, reflected by the co-occurrence of features like past tense, third person pronouns, perfect aspect, and communication verbs (see, for example, the Biber 2006 study of university registers; Biber 2001 on 18th-century registers; and the Biber 2008 study of conversation text types). In some studies, a similar narrative dimension emerged with additional special characteristics. For example, in Reppen's ( 2001 ) study of elementary school registers, “narrative” features like past tense, perfect aspect, and communication verbs co-occurred with once-occurring words and a high type/token ratio; in this corpus, history textbooks rely on a specialized and diverse vocabulary to narrate past events. In the job interview corpus (White 1994 ), the narrative dimension reflected a fundamental opposition between personal/specific past events and experiences (past tense verbs co-occurring with first person singular pronouns) vs. general practice and expectations (present tense verbs co-occurring with first person plural pronouns). In Biber and Kurjian's ( 2007 ) study of web text types, narrative features co-occurred with features of stance and personal involvement on the first dimension, distinguishing personal narrative web pages (e.g., personal blogs) from the various kinds of more informational web pages.

At the same time, most of these studies have identified some dimensions that are unique to the particular discourse domain. For example, the factor analysis in Reppen (1994) identified a dimension of “Other-directed idea justification” in elementary student registers. The features on this dimension include second person pronouns, conditional clauses, and prediction modals; these features commonly co-occur in certain kinds of student writings (e.g., If you wanted to watch TV a lot you would not get very much done ).

The factor analysis in Biber's ( 2006 ) study of university spoken and written registers identified four dimensions. Two of these are similar linguistically and functionally to dimensions found in other MD studies: Dimension 1: “Oral vs. literate discourse”; and Dimension 3: “Narrative orientation”. However, the other two dimensions are specialized to the university discourse domain: Dimension 2 is interpreted as “Procedural vs. content-focused discourse”. The co-occurring “procedural” features include modals, causative verbs, second person pronouns, and verbs of desire + to -clause; these features are especially common in classroom management talk, course syllabi, and other institutional writing. The complementary “content-focused” features include rare nouns, rare adjectives, and simple occurrence verbs; these co-occurring features are typical of textbooks, and especially common in natural science textbooks. Dimension 4, interpreted as “Academic stance”, consists of features like stance adverbials (factual, attitudinal, likelihood) and stance nouns + that -clause; classroom teaching and classroom management talk are especially marked on this dimension.

A final example comes from Biber's ( 2008 ) MD analysis of conversational text types, which identified a dimension of “stance-focused vs. context-focused discourse”. Stance-focused conversational texts were marked by the co-occurrence of that -deletions, mental verbs, factual verbs + that -clause, likelihood verbs + that -clause, likelihood adverbs, etc. In contrast, context-focused texts had high frequencies of nouns and WH-questions, used to inquire about past events or future plans. The text type analysis identified different sets of conversations characterized by one or the other of these two extremes.

In sum, corpus-driven MD studies of English registers have uncovered both surprising similarities and notable differences in the underlying dimensions of variation. Two parameters seem to be fundamentally important, regardless of the discourse domain: a dimension associated with informational focus vs. (inter)personal focus, and a dimension associated with narrative discourse. At the same time, these MD studies have uncovered dimensions particular to the communicative functions and priorities of each different domain of use.

These same general patterns have emerged from MD studies of languages other than English, including Nukulaelae Tuvaluan (Besnier 1988 ); Korean (Kim and Biber 1994 ); Somali (Biber and Hared 1992 ; 1994 ); Taiwanese (Jang 1998 ); Spanish (Biber et al. 2006 ; Biber and Tracy-Ventura 2007 ; Parodi 2007 ); Czech (Kodytek 2008), and Dagbani (Purvis 2008 ). Taken together, these studies provide the first comprehensive investigations of register variation in non-western languages.

Biber ( 1995 ) synthesizes several of these studies to investigate the extent to which the underlying dimensions of variation and the relations among registers are configured in similar ways across languages. These languages show striking similarities in their basic patterns of register variation, as reflected by:

the co-occurring linguistic features that define the dimensions of variation in each language;

the functional considerations represented by those dimensions;

the linguistic/functional relations among analogous registers.

For example, similar to the full MD analyses of English, these MD studies have all identified dimensions associated with informational vs. (inter)personal purposes, and with narrative discourse.

At the same time, each of these MD analyses has identified dimensions that are unique to a language, reflecting the particular communicative priorities of that language and culture. For example, the MD analysis of Somali identified a dimension interpreted as “Distanced, directive interaction”, represented by optative clauses, first and second person pronouns, directional preverbal particles, and other case particles. Only one register is especially marked for the frequent use of these co-occurring features in Somali: personal letters. This dimension reflects the particular communicative priorities of personal letters in Somali, which are typically interactive as well as explicitly directive.

The cross-linguistic comparisons further show that languages as diverse as English and Somali have undergone similar patterns of historical evolution following the introduction of written registers. For example, specialist written registers in both languages have evolved over time to styles with an increasingly dense use of noun phrase modification. Historical shifts in the use of dependent clauses are also surprising: in both languages, certain types of clausal embedding (especially complement clauses) turn out to be associated with spoken registers rather than written registers.

These synchronic and diachronic similarities raise the possibility of universals of register variation. Synchronically, such universals reflect the operation of underlying form/function associations tied to basic aspects of human communication; and diachronically, such universals relate to the historical development of written registers in response to the pressures of modernization and language adaptation.

8.4 Conclusion

The present chapter has illustrated how corpus analysis contributes to the description of language use, in many cases allowing us to think about language patterns in fundamentally new ways. Corpus-based analyses are the most traditional, employing the grammatical categories recognized by other linguistic theories but investigating their patterns of variation and use empirically. Such analyses have shown repeatedly that our intuitions about the patterns of use are often inaccurate, although the patterns themselves are highly systematic and explainable in functional terms.

Corpus-driven approaches are even more innovative, using corpus analysis to uncover linguistic constructs that are not recognized by traditional linguistic theories. Here again, corpus analyses have uncovered strong, systematic patterns of use, but even in this case the underlying constructs had not been anticipated by earlier theoretical frameworks.

In sum, corpus investigations show that our intuitions as linguists are not adequate for the task of identifying and characterizing linguistic phenomena relating to language use. Rather, corpus analysis has shown that language use is patterned much more extensively, and in much more complex ways, than previously anticipated.

Other studies that advocate this position have been based on a few selected case studies (e.g., Sinclair 1991b on eye vs. eyes; Tognini-Bonelli 2001 : 92–8 on facing vs. faced , and saper vs. sapere in Italian). These case studies clearly show that word forms belonging to the same lemma do sometimes have their own distinct grammar and meaning. However, no empirical study to date has investigated the extent to which this situation holds across the full set of word forms and lemmas in a language. (In contrast, the pattern grammar reference books seem to implicitly suggest that most inflected word forms that belong to a single lemma “pattern” in similar ways.)




Research in Corpus Linguistics

Research output : Chapter in Book/Report/Conference proceeding › Chapter

Corpus linguistics is a research approach that has developed over the past few decades to support empirical investigations of language variation and use, resulting in research findings that have much greater generalizability and validity than would otherwise be feasible. Corpus linguistics is not in itself a model of language. Rather, it can be regarded as primarily a methodological approach; it is empirical, analyzing the actual patterns of use in natural texts. It utilizes a large and principled collection of natural texts, known as a corpus, as the basis for analysis. At the same time, corpus linguistics is more than a methodological approach, because these methodological innovations have enabled researchers to ask fundamentally different kinds of research questions, sometimes resulting in radically different perspectives on language variation and use from those taken in previous research. Corpus linguistic research offers strong support for the view that language variation is systematic and can be described using empirical, quantitative methods.

  • Corpus linguistics
  • Empirical methods
  • Language variation
  • Natural texts
  • Quantitative methods

ASJC Scopus subject areas

  • General Arts and Humanities
  • General Social Sciences

Access to Document

  • 10.1093/oxfordhb/9780195384253.013.0038

T1 - Research in Corpus Linguistics

AU - Biber, Douglas

AU - Reppen, Randi

AU - Friginal, Eric

N1 - Publisher Copyright: © 2010 by Oxford University Press, Inc. All rights reserved.

PY - 2012/9/18

Y1 - 2012/9/18

KW - Corpus linguistics

KW - Empirical methods

KW - Language variation

KW - Natural texts

KW - Quantitative methods

KW - Research

UR - http://www.scopus.com/inward/record.url?scp=84923292922&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84923292922&partnerID=8YFLogxK

U2 - 10.1093/oxfordhb/9780195384253.013.0038

DO - 10.1093/oxfordhb/9780195384253.013.0038

M3 - Chapter

AN - SCOPUS:84923292922

SN - 9780195384253

BT - The Oxford Handbook of Applied Linguistics, (2 Ed.)

PB - Oxford University Press

Cognitive Linguistics: Fostering English Language Proficiency in Higher Education

  • Regular Article
  • Published: 08 May 2024

  • Changjiang Tang 1  

Theoretical linguistics, particularly within the domain of cognitive linguistic (CL) theories, serves as a comprehensive framework for understanding language interpretation and addressing fundamental questions about its nature. Within the framework of theoretical linguistics, this study focuses on linguistic theories that delve into cognitive processes. Specifically, it explores how CL theories contribute to the development of English language (EL) skills in college students. To achieve this goal, a well-structured questionnaire method was employed to gather insights from 190 college students, and the collected data were analyzed using SPSS. The study adopts a quantitative descriptive research approach with a cross-sectional research design. The chosen methodology involves a questionnaire survey method, specifically utilizing a closed-ended 5-point Likert scale for participant responses. The corpus linguistics-focused curriculum enhances college students’ writing complexity over traditional methods. This research contributes to the field of cognitive linguistics by not only emphasizing its role in EL development but also by addressing the integration of a corpus-based approach in English teaching. The study findings indicate frequent corpus-based language exploration correlates positively with students’ confidence in written and spoken English. Furthermore, the analysis results highlight the effectiveness of integrating CL techniques into EL teaching materials, showcasing improvements in students’ practical language skills and proficiency.

Data availability.

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Author information

Authors and affiliations.

School of General Education, Guangzhou Vocational College of Technology & Business Guangzhou, Guangzhou, China

Changjiang Tang

Corresponding author

Correspondence to Changjiang Tang .

Ethics declarations

Conflict of interest, human and animal rights.

This article does not contain any studies with human or animal subjects performed by any of the authors.

Informed Consent

Informed consent was obtained from all individual participants included in the study.

Consent to Participate

Consent for publication, additional information, publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Tang, C. Cognitive Linguistics: Fostering English Language Proficiency in Higher Education. Asia-Pacific Edu Res (2024). https://doi.org/10.1007/s40299-024-00833-6

Accepted: 15 February 2024

Published: 08 May 2024

DOI: https://doi.org/10.1007/s40299-024-00833-6

  • English language
  • Cognitive linguistics
  • Reading skills
  • Language proficiency

COMMENTS

  1. Corpus Linguistics (Chapter 5)

    Summary. Chapter 5 describes the fundamental research questions, empirical approaches and findings of corpus linguistics. Basically, it is an empirical approach investigating language use in its natural context with different types of corpora as its data base. Methodological issues include considerations on corpus linguistic approaches, types ...

  2. PDF An Introduction to Corpus Linguistics

    corpus linguistics serves to answer two fundamental research questions: 1. What particular patterns are associated with lexical or grammatical features? 2. How do these patterns differ within varieties and registers? Many notable scholars have, of course, contributed to the development of modern-day corpus linguistics: Leech, Biber, Johansson ...

  3. Research in Corpus Linguistics

    At the same time, corpus linguistics is more than a methodological approach, because these methodological innovations have enabled researchers to ask fundamentally different kinds of research questions, sometimes resulting in radically different perspectives on language variation and use from those taken in previous research. Corpus linguistic ...

  4. PDF Corpus Linguistics

    2.2 Are corpora the answer to all research questions in linguistics? 2.3 Corpus annotation; 2.4 Introducing concordances; 2.5 A historical overview of corpus analysis tools; 2.6 Statistics in corpus linguistics; 2.7 Summary; Further reading; Practical activities; Questions for discussion; 3 The web, laws and ethics; 3.1 ...

  5. Working with Corpora Small and Large: Qualitative and Quantitative

    As Paul Baker (2006, 175) points out, any corpus linguistic study "involves a great deal of human choice at every stage: forming research questions, designing and building corpora, deciding which techniques to use, interpreting the results and framing explanations for them." The remainder of this section discusses the arguably substantive ...

  6. Corpus Linguistics: Method, theory and practice

    The main features of corpus linguistics. Research in corpus linguistics deals with some set of machine-readable texts which is deemed an appropriate basis on which to study a particular research question. The set of texts or corpus is usually of a size which defies analysis by hand and eye alone within any reasonable timeframe.

  7. Corpus-Based and Corpus-driven Analyses of Language Variation and Use

    At the same time, corpus linguistics is much more than a methodological approach: these methodological innovations have enabled researchers to ask fundamentally different kinds of research questions, sometimes resulting in radically different perspectives on language variation and use from those taken in previous research. Corpus linguistic ...

  8. Review of Corpus Linguistics for Education: A Guide for Research

    Corpus Linguistics for Education shows that corpus linguistics research is not only useful in the field of linguistics but also in other fields, such as education. Researchers can use this book as a guideline for conducting educational research by adopting a linguistics-based corpus. It describes in detail how corpora can be used to explore ...

  9. (PDF) Research Trends in Corpus Linguistics: A Bibliometric Analysis of

    This paper uses a bibliometric analysis to map the field of Corpus Linguistics (CL) research in arts and humanities over the last 20 years, while tracking changes in the popular CL research topics ...

  10. Research in Corpus Linguistics

    At the same time, corpus linguistics is more than a methodological approach, because these methodological innovations have enabled researchers to ask fundamentally different kinds of research questions, sometimes resulting in radically different perspectives on language variation and use from those taken in previous research. Corpus linguistic ...

  11. 102 questions with answers in CORPUS LINGUISTICS

    Answer. Hi, Yes, you can utilise corpus linguistics on its own to examine the representation of Christian identity in a text. Qualitative analysis can centre on keyword frequency, collocations ...

  12. (PDF) CORPUS METHODS IN LANGUAGE STUDIES

    It defines corpus linguistics, explores its theoretical background, and discusses the steps and procedures involved in building and analyzing corpora. ... analyst's research aim and questions, a ...

  13. Understanding corpus linguistics

    The authors discuss language documentation methods and the way fieldwork data are turned into a corpus. Some research questions which are appropriate for small corpora are introduced and exemplified by the authors. The last chapter of the book discusses the relationship between corpus linguistics and linguistic typology.

  14. Understanding corpus linguistics

    contexts. Understanding Corpus Linguistics is an introduction to the goals, methods, and achievements of corpus linguistics, which is written mainly as a textbook for undergraduate and graduate students, while advanced scholars also could benefit from reading it. The authors, who have been engaged in corpus linguistics as their

  15. Corpus Linguistics

    The term corpus linguistics refers to corpus-based linguistic studies in general (Biber et al., 1998; Tognini-Bonelli, 2001, among others). Archetypical corpus work existed well before the modern digital era, as exemplified by the early attempts of word indexing and concordancing of the Christian Bible in the thirteenth century.

  16. Research in Corpus Linguistics

    View All Issues. Research in Corpus Linguistics (RiCL, ISSN 2243-4712) is a scholarly peer-reviewed international scientific journal published annually, aiming at the publication of contributions which contain empirical analyses of data from different languages and from different theoretical perspectives and frameworks.

  17. PDF Research Questions in Linguistics

    Jane Sunderland. Chapter outline. This chapter takes as given that research questions, appropriately designed and worded, are the key to any good empirical research project. Starting with why we need research questions (as opposed to topics or even hypotheses), I explore where they might come from, and propose different types ...

  18. (PDF) What is Corpus Linguistics?

    Corpus linguistics is one of the fastest-growing methodologies in contemporary linguistics. In a conversational format, this article answers a few questions that corpus linguists regularly face ...

  19. Research Questions in Language Education and Applied Linguistics

    Mohebbi and Coombe's book, Research Questions in Language Education and Applied Linguistics: A Reference Guide, helps budding researchers take the first step and develop a solid research question. As the field of language education evolves, we need continual research to improve our instructional and assessment practices and our understanding ...

  20. Linguistic Variations Between Translated and Non-Translated English

    To answer the research questions, a series of one-way ANOVAs were conducted to measure the differences between the subcorpora of COCS on the six-dimension scores. The homogeneity of variance was checked using Levene's Test for equality of variances. ... Corpus Linguistics and Linguistic Theory, 15(2), 347-382. Google Scholar. Huang D., Wang ...

  21. Corpus Linguistics and Corpus-Based Research and Its Implication in

    Corpus linguistics is a research approach that has developed over the past few decades to support empirical investigations of language variation and use, resulting in research findings that are ...

  22. Cognitive Linguistics: Fostering English Language Proficiency in Higher

    To understand the enhancement of the EL and the perspectives of college students on CLs, the above research questions were proposed, and the answers to these proposed research questions are identified and demonstrated in the following sections. ... Future research should investigate synergies between corpus linguistics, research tourism, and ...

  23. Digital Research Methods for Translation Studies. Julie McDonough Dolmaya

    Yuhua Fang, Jinqiao Zhou. Published in Digital Scholarship in the… 9 May 2024. Linguistics, Computer Science. DOI: 10.1093/llc/fqae026.