Small Sample Research: Considerations Beyond Statistical Power

  • Published: 19 August 2015
  • Volume 16, pages 1033–1036 (2015)


  • Kathleen E. Etz 1 &
  • Judith A. Arroyo 2  


Small sample research presents a challenge to current standards of design and analytic approaches and the underlying notions of what constitutes good prevention science. Yet, small sample research is critically important as the research questions posed in small samples often represent serious health concerns in vulnerable and underrepresented populations. This commentary considers the Special Section on small sample research and also highlights additional challenges that arise in small sample research not considered in the Special Section, including generalizability, determining what constitutes knowledge, and ensuring that research designs match community desires. It also points to opportunities afforded by small sample research, such as a focus on and increased understanding of context and the emphasis it may place on alternatives to the randomized clinical trial. The commentary urges the development and adoption of innovative strategies to conduct research with small samples.


Small sample research presents a direct challenge to current standards of design and analytic approaches and the underlying notions of what constitutes good prevention science. While we can have confidence that our scientific methods have the ability to answer many research questions, we have been limited in our ability to take on research with small samples because we have not developed or adopted the means to support rigorous small sample research. This Special Section identifies some tools that can be used for small sample research. It reminds us that progress in this area will likely require expansion of our ideas of what constitutes rigor in analysis and design strategies that address the unique characteristics and accompanying challenges of small sample research. Advances will also require making room for the adoption of innovative design and statistical analysis approaches. The collection of papers makes a significant contribution to the literature and marks a major development in the field.

Innovations in small sample research are particularly critical because the research questions posed in small samples often focus on serious health concerns in vulnerable populations. Individuals most at risk for or afflicted by health disparities (e.g., racial and ethnic minorities) are by definition small in number when compared to the larger, dominant society. The current state of the art in design and statistical analysis in prevention science, which is highly dependent on large samples, has severely handicapped investigation of health disparities in these smaller populations. Unless we develop research techniques suitable for small group design and expand our concepts of what design and analytic strategies provide sufficient scientific rigor, health disparities will continue to lay waste to populations that live in smaller communities or who are difficult to recruit in large numbers. Particularly when considering high-risk, low base rate behaviors such as recurrent binge drinking or chronic drug use, investigators are often limited by small populations in many health disparity groups and by small numbers of potential participants in towns, villages, and rural communities. Even in larger, urban settings, researchers may experience constraints on recruitment such as difficulty identifying a sufficiently large sample, distrust of research, lack of transportation or time outside of work hours, or language issues. Until now, small sample sizes and the lack of accepted tools for small sample research have decreased our ability to harness the power of science to research preventive solutions to health disparities. The collection of articles in this Special Section helps to address this by bringing together multiple strategies and demonstrating their strength in addressing research questions with small samples.

Small sample research issues also arise in multi-level, group-based, or community-level intervention research (Trickett et al. 2011 ). An example of this is a study that uses a media campaign and compares the efficacy of that campaign across communities. In such cases, the unit of analysis is the group, and the limited number of units that can be feasibly involved in a study makes multi-level intervention research inevitably an analysis of small samples. The increasingly recognized importance of intervening in communities at multiple levels (Frohlich and Potvin 2008 ) and the desire to understand the efficacy and effectiveness of multi-level interventions (Hawe 1994 ) increase the need to devise strategies for assessing interventions conducted with small samples.

The Special Section makes a major contribution to small sample research, identifying tools that can be used to address small sample design and analytic challenges. The articles here can be grouped into four areas: (1) identification of refinements in statistical applications and measurement that can facilitate analyses with small samples, (2) alternatives to randomized clinical trial (RCT) designs that maintain rigor while maximizing power, (3) use of qualitative and mixed methods, and (4) Bayesian analysis. The Special Section provides a range of alternative strategies to those that are currently employed with larger samples. The first and last papers in the Special Section (Fok et al. 2015 ; Henry et al. 2015a ) examine and elaborate on the contributions of these articles to the field. As this is considered elsewhere, we will focus our comments more on issues that are not already covered but that will be increasingly important as this field moves forward.

One challenge that is not addressed by the papers in this Special Section is the generalizability of small sample research findings, particularly when working with culturally distinct populations. Generalizability poses a different obstacle than those associated with design and analysis, in that it is not related to rigor or the confidence we can have in our conclusions. Rather, it limits our ability to assume the results will apply to populations other than those from whom a sample is drawn and, as such, can limit the application of the work. The need to discover prevention solutions for all people, even if they happen to be members of a small population, begs questions of the value of generalizability and of the importance ascribed to it. Further, existing research raises long-standing important questions about whether knowledge produced under highly controlled conditions can generalize to ethnoculturally diverse communities (Atkins et al. 2006 ; Beeker et al. 1998 ; Green and Glasgow 2006 ). Regardless, the inability to generalize beyond a small population can present a barrier to funding. When grant applications are reviewed, projects that are not seen as widely generalizable often receive poor ratings. Scientists conducting small sample research with culturally distinct groups are frequently stymied by how they can justify their research when it is not generalizable to large segments of the population. In some instances, the question that drives the research is that which limits generalizability. For example, research projects on cultural adaptations of established interventions are often highly specific. An adaptation that might be efficacious in one small sample might not be so in other contexts. This is particularly the case if the adaptation integrates local culture, such as preparing for winter and subsistence activities in Alaska or integrating the horse culture of the Great Plains. Even if local adaptation is not necessary, dissemination research to ascertain the efficacy and/or effectiveness of mainstream, evidence-based interventions when applied to diverse groups will be difficult to conduct if we cannot address concerns about generalizability.

It is not readily apparent how to address issues of generalizability, but it is clear that this will be challenging and will require creativity. One potential strategy is to go beyond questions of intervention efficacy to address additional research questions that have the potential to advance the field more generally. For example, Allen and colleagues’ ( 2014 ) scientific investigations extended beyond development of a prevention intervention in Alaska Native villages to identification and testing of the underlying prevention processes that were at the core of the culturally specific intervention. This isolation of the key components of the prevention process has the potential to inform and generalize across settings. The development of new statistical tools for small culturally distinct samples might also be helpful in other research contexts. Similarly, the identification of the most potent prevention processes for adaptation also might generalize. As small sample research evolves, we must remain open to how this work has the potential to be highly valuable despite recognizing that not all aspects of it will generalize and also take care to identify what can be applied generally.

While not exclusive to small sample research, additional difficulties that can arise in conducting research in some small, culturally distinct samples are the questions of what constitutes knowledge and how to include alternative forms of knowledge (e.g., indigenous ways of knowing, folk wisdom) in health research (Aikenhead and Ogawa 2007 ; Gone 2012 ). For many culturally distinct communities that turn to research to address their health challenges, the need for large samples and methods demanded by mainstream science might be incongruent with local epistemologies and cultural understandings of how the knowledge to inform prevention is generated and standards of evidence are established. Making sense of how or whether indigenous knowledge and western scientific approaches can work together is an immense challenge. The Henry, Dymnicki, Mohatt, Kelly, and Allen article in this Special Section recommends combining qualitative and quantitative methods as one way to address this conundrum. However, this strategy is not sufficient to address all of the challenges encountered by those who seek to integrate traditional knowledge into modern scientific inquiry. For culturally distinct groups who value forms of knowledge other than those generated by western science, the research team, including the community members, will need to work together to identify ways to best ensure that culturally valued knowledge is incorporated into the research endeavor. The scientific field will need to make room for approaches that stem from the integration of culturally valued knowledge.

Ensuring that the research design and methods correspond to community needs and desires can present an additional challenge. Investigations conducted with small, culturally distinct groups often use community-based participatory research (CBPR) approaches (Minkler and Wallerstein 2008 ). True CBPR mandates that community partners be equal participants in every phase of the research, including study design. From an academic researcher’s perspective, the primary obstacle for small sample research may be insufficient statistical power to conduct a classic RCT. However, for the small group partner, the primary obstacle may be the RCT design itself. Many communities will not allow a RCT because assignment of some community members to a no-treatment control condition can violate culturally based ethical principles that demand that all participants be treated equally. Particularly in communities experiencing severe health disparities, community members may want every person to receive the active intervention. While the RCT has become the gold standard because it is believed to be the most rigorous test of intervention efficacy, it is clear the RCT does not serve the needs of all communities.

While presenting challenges for current methods, it is important to note that small sample research can also expand our horizons. For example, attempts to truly comprehend culturally distinct groups will lead to a better understanding of the role of context in health outcomes. Current approaches more often attempt to control for extraneous variables rather than work to more accurately model potentially rich contextual variables. This blinds us to cultural differences between and among small groups that might contribute to outcomes and improve health. Analytical strategies that mask these nuances will fail to detect information about risk and resilience factors that could impact intervention. Multi-level intervention research (which we pointed out earlier qualifies as small sample research) that focuses on contextual changes as well as or instead of change in the individual will also inform our understanding of context, elucidating how to effectively intervene to change context to promote health outcomes. Thus, considering how prevailing methods limit our work in small samples can also expose ways that alternative methods may advance our science more broadly by enhancing both our understanding of context and how to intervene in context.

Small sample science requires us to consider alternatives to the RCT, and this consideration introduces additional opportunities. The last paper in this Special Section (Henry et al. 2015b) notes compelling critiques of the RCT. Small sample research demands that we incorporate alternative strategies that, compared with the classic RCT, may in some instances use the available information more efficiently and may be better aligned with community desires. Alternative designs for small sample research may offer means to enhance and ensure scientific rigor without depending on the RCT design (Srinivasan et al. 2015). It is important to consider what alternative approaches can contribute rather than adhering rigidly to the RCT.

New challenges require innovative solutions. Innovation is the foundation of scientific advances. It is one of only five National Institutes of Health grant review criteria. Despite the value to science of innovation, research grant application reviewers are often skeptical of new strategies and are reluctant to support risk taking in science. As a field, we seem accustomed to the use of certain methods and statistics, generally accepting and rarely questioning if they are the best approach. Yet, it is clear that common methods that work well with large samples are not always appropriate for small samples. Progress will demand that new approaches be well justified and also that the field supports innovation and the testing of alternative approaches. Srinivasan and colleagues ( 2015 ) further recommend that it might be necessary to offer training to grant application peer reviewers on innovative small sample research methods, thus ensuring that they are knowledgeable in this area and score grant applications appropriately. Alternative approaches need to be accepted into the repertoire of available design and assessment tools. The articles in this Special Section all highlight such innovation for small sample research.

It would be a failure of science and the imagination if newly discovered or re-discovered (i.e., Bayesian) strategies are not employed to facilitate rigorous assessment of interventions in small samples. It is imperative that the tools of science do not limit our ability to address pressing public health questions. New approaches can be used to address contemporary research questions, including providing solutions to the undue burden of disease that can and often does occur in small populations. It must be the pressing nature of the questions, not the limitations of our methods, that determines what science is undertaken (see also Srinivasan et al. 2015 ). While small sample research presents a challenge for prevailing scientific approaches, the papers in this Special Section identify ways to move this science forward with rigor. It is imperative that the field accommodates these advances, and continues to be innovative in response to the challenge of small sample research, to ensure that science can provide answers for those most in need.

Aikenhead, G. S., & Ogawa, M. (2007). Indigenous knowledge and science revisited. Cultural Studies of Science Education, 2, 539–620.

Allen, J., Mohatt, G. V., Fok, C. C. T., Henry, D., Burkett, R., & People Awakening Project. (2014). A protective factors model for alcohol abuse and suicide prevention among Alaska Native youth. American Journal of Community Psychology, 54, 125–139.

Atkins, M. S., Frazier, S. L., & Cappella, E. (2006). Hybrid research models: Natural opportunities for examining mental health in context. Clinical Psychology Review, 13, 105–108.

Beeker, C., Guenther-Grey, C., & Raj, A. (1998). Community empowerment paradigm drift and the primary prevention of HIV/AIDS. Social Science & Medicine, 46, 831–842.

Fok, C. C. T., Henry, D., & Allen, J. (2015). Maybe small is too small a term: Introduction to advancing small sample prevention science. Prevention Science.

Frohlich, K. L., & Potvin, L. (2008). Transcending the known in public health practice: The inequality paradox: The population approach and vulnerable populations. American Journal of Public Health, 98, 216–221.

Gone, J. P. (2012). Indigenous traditional knowledge and substance abuse treatment outcomes: The problem of efficacy evaluation. American Journal of Drug and Alcohol Abuse, 38, 493–497.

Green, L. W., & Glasgow, R. E. (2006). Evaluating the relevance, generalization, and applicability of research: Issues in external validation and translation methodology. Evaluation & the Health Professions, 29, 126–153.

Hawe, P. (1994). Capturing the meaning of “community” in community intervention evaluation: Some contributions from community psychology. Health Promotion International, 9, 199–210.

Henry, D., Dymnicki, A. B., Mohatt, N., Kelly, J. G., & Allen, J. (2015a). Clustering methods with qualitative data: A mixed methods approach for prevention research with small samples. Prevention Science. doi: 10.1007/s11121-015-0561-z

Henry, D., Fok, C. C. T., & Allen, J. (2015b). Why small is too small a term: Prevention science for health disparities, culturally distinct groups, and community-level intervention. Prevention Science.

Minkler, M., & Wallerstein, N. (Eds.). (2008). Community-based participatory research for health: From process to outcomes (2nd ed.). San Francisco: Jossey-Bass.

Srinivasan, S., Moser, R. P., Willis, G., Riley, W., Alexander, M., Berrigan, D., & Kobrin, S. (2015). Small is essential: Importance of subpopulation research in cancer control. American Journal of Public Health, 105, 371–373.

Trickett, E. J., Beehler, S., Deutsch, C., Green, L. W., Hawe, P., McLeroy, K., Miller, R. L., Rapkin, B. D., Schensul, J. J., Schulz, A. J., & Trimble, J. E. (2011). Advancing the science of community-level interventions. American Journal of Public Health, 101, 1410–1419.


Compliance with Ethical Standards

No external funding supported this work.

Conflict of Interest

The authors declare that they have no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent

Because this article is a commentary, informed consent is not applicable.

Author information

Authors and Affiliations

National Institute on Drug Abuse, National Institutes of Health, 6001 Executive Blvd., Bethesda, MD, 20852, USA

Kathleen E. Etz

National Institute on Alcohol Abuse and Alcoholism, National Institutes of Health, 5635 Fishers Lane, Bethesda, MD, 20852, USA

Judith A. Arroyo


Corresponding author

Correspondence to Kathleen E. Etz .

Additional information

The opinions and conclusions here represent those of the authors and do not represent the National Institutes of Health, the National Institute on Drug Abuse, the National Institute on Alcohol Abuse and Alcoholism, or the US Government.


About this article

Etz, K. E., Arroyo, J. A. Small Sample Research: Considerations Beyond Statistical Power. Prev Sci 16, 1033–1036 (2015). https://doi.org/10.1007/s11121-015-0585-4


Published: 19 August 2015

Issue Date: October 2015

DOI: https://doi.org/10.1007/s11121-015-0585-4



The Disadvantages of a Small Sample Size

Researchers and scientists conducting surveys and performing experiments must adhere to certain procedural guidelines and rules in order to ensure accuracy by avoiding sampling errors such as large variability, bias or undercoverage. Sampling errors can significantly affect the precision and interpretation of the results, which can in turn lead to high costs for businesses or government agencies, or harm to populations of people or living organisms being studied.

TL;DR (Too Long; Didn't Read)

To conduct a survey properly, you need to determine your sample group. This sample group should include individuals who are relevant to the survey's topic. You want to survey as large a sample size as possible; smaller sample sizes become progressively less representative of the entire population.

A small sample size can also lead to cases of bias, such as non-response bias, which occurs when some of the intended subjects never respond or never have the opportunity to participate in the survey. Alternatively, voluntary response bias occurs when only a small number of non-representative subjects have the opportunity to participate in the survey, usually because they are the only ones who know about it.

Sample Size

In the case of researchers conducting surveys, for example, sample size is essential. To conduct a survey properly, you need to determine your sample group. This sample group should include individuals who are relevant to the survey's topic.

For instance, if you are conducting a survey on whether a certain kitchen cleaner is preferred over another brand, then you should survey a large number of people who use kitchen cleaners. The only way to achieve 100 percent accurate results is to survey every single person who uses kitchen cleaners; however, as this is not feasible, you will need to survey as large a sample group as possible.

Disadvantage 1: Variability

Variability is described by the standard deviation of the population. For a sample, the relevant quantity is the standard error: an indication of how far the results computed from your sample might fall from the true values in the population. You want to survey as large a sample size as possible, because the smaller the sample, the larger the standard error and the less representative the sample is of the entire population, so the less accurate your results are likely to be.
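To make this concrete, the short simulation below (not part of the original article; the population mean and SD are arbitrary illustrative values) shows how the spread of sample means, the empirical standard error, shrinks as the sample size grows.

    # Illustrative sketch only: how sample size affects the variability of a survey estimate.
    # The population mean and SD below are arbitrary values chosen for demonstration.
    import numpy as np

    rng = np.random.default_rng(0)
    population = rng.normal(loc=50, scale=10, size=100_000)  # hypothetical population

    for n in (10, 100, 1000):
        # Draw many samples of size n and record each sample mean.
        means = [rng.choice(population, size=n, replace=False).mean() for _ in range(2000)]
        print(f"n={n:5d}  spread of sample means (empirical standard error): {np.std(means):.2f}")

    # The printed spread shrinks roughly as 1/sqrt(n): larger samples give estimates
    # that cluster more tightly around the true population mean.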

Disadvantage 2: Undercoverage Bias

A small sample size also affects the reliability of a survey's results because it leads to higher variability, which may lead to bias. The most common case of bias is a result of non-response, which occurs when some of the intended subjects never respond or never get the opportunity to participate in the survey. For example, if you call 100 people between 2 and 5 p.m. and ask whether they feel that they have enough free time in their daily schedule, most of the respondents might say "yes." This sample, and the results, are biased, as most workers are at their jobs during these hours.

People who are at work and unable to answer the phone may have a different answer to the survey than people who are able to answer the phone in the afternoon. These people will not be included in the survey, and the survey's accuracy will suffer from non-response. Not only does your survey suffer because of its timing, but the small number of subjects cannot make up for this deficiency.

Disadvantage 3: Voluntary Response Bias

Voluntary response bias is another disadvantage that comes with a small sample size. If you post a survey on your kitchen cleaner website, then only a small number of people have access to or knowledge about your survey, and it is likely that those who do participate will do so because they feel strongly about the topic. Therefore, the results of the survey will be skewed to reflect the opinions of those who visit the website. If an individual is on a company's website, then it is likely that he supports the company; he may, for example, be looking for coupons or promotions from that manufacturer. A survey posted only on its website limits the number of people who will participate to those who already had an interest in their products, which causes a voluntary response bias.



About the Author

A.E. Simmons has worked as a freelance writer since 2009. She specializes in business, consumer products, home economics and sports and recreation. Simmons is a student in the Kenan-Flagler Business School at the University of North Carolina at Chapel Hill.



Why is a small sample size not enough?


Ying Cao, Ronald C Chen, Aaron J Katz, Why is a small sample size not enough?, The Oncologist , Volume 29, Issue 9, September 2024, Pages 761–763, https://doi.org/10.1093/oncolo/oyae162


Clinical studies are often limited by resources available, which results in constraints on sample size. We use simulated data to illustrate study implications when the sample size is too small.

Using 2 theoretical populations each with N  = 1000, we randomly sample 10 from each population and conduct a statistical comparison, to help make a conclusion about whether the 2 populations are different. This exercise is repeated for a total of 4 studies: 2 concluded that the 2 populations are statistically significantly different, while 2 showed no statistically significant difference.

Our simulated examples demonstrate that sample sizes play important roles in clinical research. The results and conclusions, in terms of estimates of means, medians, Pearson correlations, chi-square test, and P values, are unreliable with small samples.

A sample comprises the individuals from whom we collect data and represents a share of the population (N) for whom we want to draw conclusions (eg, women with breast cancer).

The sample size ( n ) is the number of individual people, experimental units, or other elements included in a sample, and is a central concept in statistical applications to clinical research. Given that researchers often have limited resources (financial and personnel) and time to conduct a study, it is not feasible to collect data from an entire population and, in some cases, only possible to obtain information from a seemingly small sample of individuals.

There is no universal agreement, and it remains controversial as to what number designates a small sample size. Some researchers consider a sample of n  = 30 to be “small” while others use n  = 20 or n  = 10 to distinguish a small sample size.

“Small” is also relative in statistical analysis. For example, in genome-wide association studies and microbiome research, although the sample size ( n ) is often in the hundreds or even thousands of observations, the number of markers ( p ) of interest (eg, single-nucleotide polymorphisms) is typically in the hundreds of thousands, creating a “large p small n ” conundrum that necessitates the use of advanced statistical techniques for analysis. 1

To illustrate some points, we use simulated data representing 2 different theoretical populations (group 1 and group 2) 2 with a normal distribution for each of the populations ( N  = 1000 for each). Group 1 population has an asymptotic mean = 0 and SD = 1, while the group 2 population has an asymptotic mean = 0.5 and SD = 0.5. This is the entire population and therefore represents the “truth.” Now we randomly select 10 values (10 data points) from each of the normal distributions and perform a (nonparametric) Wilcoxon rank-sum test (also known as Mann-Whitney U test) to examine whether both groups come from the same population or have the same shape. We repeat this exercise multiple times ( Figure 1 ).

Figure 1. (A) Two random samples of n = 10 each were drawn from 2 normally distributed populations each with N = 1000. Population 1 has mean 0 and SD 1, and population 2 has mean 0.5 and SD 0.5. (B–D) New random samples drawn using the same methodology as in panel (A).

These 4 results do not support a firm conclusion as to whether the 2 population distributions are either statistically the same or different. Why? Because in 2 of the random samples drawn, as shown in Figure 1a , 1b , the median values differ between the 2 groups, suggesting the population distributions are significantly ( P value < 0.05) different; still, in the other 2 random samples, as shown in Figure 1c , 1d , the medians are close together, which suggests the population distributions are similar (the P values are much larger than .05).
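The original simulations were run in R; the sketch below is a rough Python equivalent of the exercise described above (arbitrary random seed and illustrative draws, so it will not reproduce the published panels exactly), repeatedly drawing 10 observations per group and applying the Wilcoxon rank-sum test.

    # Rough re-creation of the simulation described above (illustrative only).
    import numpy as np
    from scipy.stats import ranksums

    rng = np.random.default_rng(42)
    pop1 = rng.normal(0.0, 1.0, size=1000)   # population 1: mean 0, SD 1
    pop2 = rng.normal(0.5, 0.5, size=1000)   # population 2: mean 0.5, SD 0.5

    for study in range(1, 5):                # four "studies", as in Figure 1
        s1 = rng.choice(pop1, size=10, replace=False)
        s2 = rng.choice(pop2, size=10, replace=False)
        stat, p = ranksums(s1, s2)
        print(f"Study {study}: medians {np.median(s1):.2f} vs {np.median(s2):.2f}, P = {p:.3f}")

    # With only 10 observations per group, some repetitions reach P < .05 and others do not,
    # even though the two populations genuinely differ.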

Results from further simulations (not shown) demonstrate that once the sample size reaches n  = 50, the results from the Wilcoxon rank-sum test (with continuity correction) begin to approach those of the 2-sample t -test (with Welch correction for unequal variances), which indicates that the randomly drawn samples are starting to follow a normal distribution. As the sample size increases, the results of the Wilcoxon rank-sum test and 2-sample t -tests continue to converge. This yields an explicit confirmation of the large sample theory (asymptotic approximation).
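A simplified sketch of that convergence (again in Python rather than R, and omitting the continuity correction mentioned in the text): as the per-group sample size grows, the Wilcoxon rank-sum test and the Welch two-sample t-test increasingly reach the same conclusion.

    # Sketch: agreement between the Wilcoxon rank-sum test and the Welch two-sample t-test
    # as the per-group sample size grows (populations as in the example above).
    import numpy as np
    from scipy.stats import ranksums, ttest_ind

    rng = np.random.default_rng(7)
    for n in (10, 50, 200):
        reject_w = reject_t = agree = 0
        for _ in range(1000):
            s1 = rng.normal(0.0, 1.0, size=n)
            s2 = rng.normal(0.5, 0.5, size=n)
            p_w = ranksums(s1, s2).pvalue
            p_t = ttest_ind(s1, s2, equal_var=False).pvalue   # Welch correction
            reject_w += p_w < 0.05
            reject_t += p_t < 0.05
            agree += (p_w < 0.05) == (p_t < 0.05)
        print(f"n={n:4d}  reject rate Wilcoxon={reject_w/1000:.2f}  "
              f"Welch t={reject_t/1000:.2f}  agreement={agree/1000:.2f}")

    # At n = 10 the two tests often disagree and frequently miss the true difference;
    # by n = 50 and above their conclusions largely coincide.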

These observations are directly relevant to clinicians and clinical research. For example, an investigator wants to compare the survival outcomes of patients with stage 1 lung cancer treated with lobectomy or stereotactic body radiation therapy. With small sample sizes (eg, 10 patients in each treatment group), there can be random variation in the results; thus, multiple studies of small sample sizes might provide different/opposite findings. With larger sample sizes, such random variation would be reduced and thereby provide more valid results.

This same concept also applies to estimates of other statistics, including the Pearson correlation coefficient r , chi-square test, and related P values.

Our simulated example demonstrates that sample sizes play important roles in clinical research. The results and conclusions, in terms of estimates of means, medians, Pearson correlations, chi-square test, and P values, are unreliable with small samples. Even when “statistically significant”, small sample size studies might provide spurious results. Thus, caution is needed when interpreting results from small studies.

Ying Cao performed the data simulations. All authors contributed to the conception and design, manuscript writing, revision of the original submission, and final approval of the manuscript.

None declared.

The authors indicated no financial relationships.

The data were created by computer algorithms in the R software environment (R Core Team), 3 and are therefore not directly related to any clinical resources or patients.

1. Hastie T, Tibshirani R, Friedman J. Chapter 18. In: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer Series in Statistics. Springer; 2016.

2. Halsey L, Curran-Everett D, Bowler S, et al. The fickle P value generates irreproducible results. Nat Methods. 2015;12(3):179-185.

3. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; 2022. https://www.R-project.org/


Sample Size for Survey Research: Review and Recommendations

  • Journal of Applied Structural Equation Modeling 4(2): i–xx

Mumtaz Ali Memon (Sohar University), Hiram Ting (i-CATS University College), Jun-Hwa Cheah (University of East Anglia), T. Ramayah (Universiti Sains Malaysia)



Best Practices for Using Statistics on Small Sample Sizes


You may have heard that you cannot use statistics with a small sample size. Put simply, this is wrong, but it's a common misconception.

There are appropriate statistical methods to deal with small sample sizes.

Although one researcher’s “small” is another’s large, when I refer to small sample sizes I mean studies that have typically between 5 and 30 users total—a size very common in usability studies .

But user research isn’t the only field that deals with small sample sizes. Studies involving fMRIs, which cost a lot to operate, have limited sample sizes as well [pdf] as do studies using laboratory animals.

While there are equations that allow us to properly handle small “n” studies, it’s important to know that there are limitations to these smaller sample studies: you are limited to seeing big differences or big “effects.”

To put it another way, statistical analysis with small samples is like making astronomical observations with binoculars . You are limited to seeing big things: planets, stars, moons and the occasional comet.  But just because you don’t have access to a high-powered telescope doesn’t mean you cannot conduct astronomy. Galileo, in fact, discovered Jupiter’s moons with a telescope with the same power as many of today’s binoculars .

Just as with statistics, just because you don’t have a large sample size doesn’t mean you cannot use statistics. Again, the key limitation is that you are limited to detecting large differences between designs or measures.

Fortunately, in user-experience research we are often most concerned about these big differences—differences users are likely to notice, such as changes in the navigation structure or the improvement of a search results page.

Here are the procedures which we’ve tested for common, small-sample user research, and we will cover them all at the UX Boot Camp in Denver next month.

If you need to compare completion rates, task times, and rating scale data for two independent groups, there are two procedures you can use for small and large sample sizes.  The right one depends on the type of data you have: continuous or discrete-binary.

Comparing Means : If your data is generally continuous (not binary), such as task time or rating scales, use the two sample t-test . It’s been shown to be accurate for small sample sizes.

Comparing Two Proportions : If your data is binary (pass/fail, yes/no), then use the N-1 Two Proportion Test. This is a variation on the better known Chi-Square test (it is algebraically equivalent to the N-1 Chi-Square test). When expected cell counts fall below one, the Fisher Exact Test tends to perform better. The online calculator handles this for you and we discuss the procedure in Chapter 5 of Quantifying the User Experience .
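As a rough illustration of these two procedures, here is a sketch with made-up data (not the online calculator's code). The N-1 two-proportion test is implemented here as the Pearson chi-square statistic scaled by (N-1)/N, which is how the adjustment is commonly described; treat it as an approximation rather than a definitive implementation.

    # Sketch of the two comparison procedures with hypothetical data.
    from scipy import stats

    # 1) Continuous data (e.g., task times or rating scales): two-sample t-test.
    group_a = [42, 55, 37, 60, 48, 51]          # hypothetical task times, in seconds
    group_b = [58, 63, 49, 72, 66, 54]
    t, p_t = stats.ttest_ind(group_a, group_b, equal_var=False)
    print(f"two-sample t-test: t = {t:.2f}, p = {p_t:.3f}")

    # 2) Binary data (pass/fail): N-1 two-proportion test, computed here as the
    #    Pearson chi-square statistic scaled by (N-1)/N.
    def n_minus_1_two_proportion(x1, n1, x2, n2):
        table = [[x1, n1 - x1], [x2, n2 - x2]]
        chi2, _, _, _ = stats.chi2_contingency(table, correction=False)
        n = n1 + n2
        chi2_adj = chi2 * (n - 1) / n
        return chi2_adj, stats.chi2.sf(chi2_adj, df=1)

    chi2_adj, p_prop = n_minus_1_two_proportion(x1=9, n1=12, x2=4, n2=11)  # 9/12 vs 4/11 pass
    print(f"N-1 two-proportion test: chi-square = {chi2_adj:.2f}, p = {p_prop:.3f}")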

Confidence Intervals

When you want to know what the plausible range is for the user population from a sample of data, you’ll want to generate a confidence interval . While the confidence interval width will be rather wide (usually 20 to 30 percentage points), the upper or lower boundary of the intervals can be very helpful in establishing how often something will occur in the total user population.

For example, if you wanted to know if users would read a sheet that said “Read this first” when installing a printer, and six out of eight users didn't read the sheet in an installation study, you'd know that at least 40% of all users would likely skip the sheet, a substantial proportion.

There are three approaches to computing confidence intervals based on whether your data is binary, task-time or continuous.

Confidence interval around a mean : If your data is generally continuous (not binary) such as rating scales, order amounts in dollars, or the number of page views, the confidence interval is based on the t-distribution (which takes into account sample size).

Confidence interval around task-time :  Task time data is positively skewed . There is a lower boundary of 0 seconds. It’s not uncommon for some users to take 10 to 20 times longer than other users to complete the same task. To handle this skew, the time data needs to be log-transformed   and the confidence interval is computed on the log-data, then transformed back when reporting. The online calculator handles all this.

Confidence interval around a binary measure: For an accurate confidence interval around binary measures like completion rate or yes/no questions, the Adjusted Wald interval performs well for all sample sizes.
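For instance, here is a minimal sketch of an adjusted-Wald interval, using the common "add z^2/2 successes and z^2/2 failures" adjustment; the 6-of-8 figures echo the printer-sheet example above, and the online calculator's exact bounds may differ slightly.

    # Sketch of an adjusted-Wald confidence interval for a binary completion rate.
    import math

    def adjusted_wald_ci(successes, n, z=1.96):   # z = 1.96 for a 95% interval
        p_adj = (successes + z**2 / 2) / (n + z**2)
        half_width = z * math.sqrt(p_adj * (1 - p_adj) / (n + z**2))
        return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

    low, high = adjusted_wald_ci(successes=6, n=8)   # 6 of 8 users skipped the sheet
    print(f"95% adjusted-Wald interval: {low:.0%} to {high:.0%}")   # roughly 40% to 94%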

Point Estimates (The Best Averages)

The “best” estimate for reporting an average time or average completion rate for any study may vary depending on the study goals.  Keep in mind that even the “best” single estimate will still differ from the actual average, so using confidence intervals provides a better method for estimating the unknown population average.

For the best overall average for small sample sizes, we have two recommendations for task-time and completion rates, and a more general recommendation for all sample sizes for rating scales.

Completion Rate : For small-sample completion rates, there are only a few possible values for each task. For example, with five users attempting a task, the only possible outcomes are 0%, 20%, 40%, 60%, 80% and 100% success. It’s not uncommon to have 100% completion rates with five users. There’s something about reporting perfect success at this sample size that doesn’t resonate well. It sounds too good to be true.

We experimented [pdf] with several estimators with small sample sizes and found the LaPlace estimator and the simple proportion (referred to as the Maximum Likelihood Estimator) generally work well for the usability test data we examined. When you want the best estimate, the calculator will generate it based on our findings.

Rating Scales : Rating scales are a funny type of metric, in that most of them are bounded on both ends (e.g. 1 to 5, 1 to 7 or 1 to 10) unless you are Spinal Tap of course. For small and large sample sizes, we’ve found reporting the mean to be the best average over the median [pdf] . There are in fact many ways to report the scores from rating scales, including top-two boxes . The one you report depends on both the sensitivity as well as what’s used in an organization.

Average Time : One long task time can skew the arithmetic mean and make it a poor measure of the middle. In such situations, the median is a better indicator of the typical or “average” time. Unfortunately, the median tends to be less accurate and more biased than the mean when sample sizes are less than about 25. In these circumstances, the geometric mean (average of the log values transformed back) tends to be a better measure of the middle. When sample sizes get above 25, the median works fine.
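A brief sketch of these point estimates with hypothetical data: the LaPlace estimate simply adds one success and one failure to the observed counts, and the geometric mean averages the log times and transforms the result back.

    # Sketch of the recommended small-sample point estimates (made-up data).
    import math

    # Completion rate: 5 of 5 users succeeded.
    successes, n = 5, 5
    mle = successes / n                      # simple proportion (Maximum Likelihood Estimate)
    laplace = (successes + 1) / (n + 2)      # LaPlace estimate
    print(f"MLE = {mle:.0%}, LaPlace = {laplace:.0%}")          # 100% vs 86%

    # Average task time: one slow user skews the arithmetic mean.
    times = [38, 45, 52, 61, 74, 310]        # seconds
    geo_mean = math.exp(sum(math.log(t) for t in times) / len(times))
    print(f"arithmetic mean = {sum(times) / len(times):.0f}s, geometric mean = {geo_mean:.0f}s")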



Implications of Small Samples for Generalization: Adjustments and Rules of Thumb

Affiliations

1. Department of Human Development, Teachers College, Columbia University, New York, NY, USA.
2. University of Chicago - Urban Labs, Chicago, IL, USA.
3. Institute for Policy Research, Northwestern University, Evanston, IL, USA.
4. Quantitative Methods Division, Graduate School of Education, University of Pennsylvania, Philadelphia, PA, USA.

  • PMID: 27402612
  • DOI: 10.1177/0193841X16655665

Background: Policy makers and researchers are frequently interested in understanding how effective a particular intervention may be for a specific population. One approach is to assess the degree of similarity between the sample in an experiment and the population. Another approach is to combine information from the experiment and the population to estimate the population average treatment effect (PATE).

Method: Several methods for assessing the similarity between a sample and population currently exist as well as methods estimating the PATE. In this article, we investigate properties of six of these methods and statistics in the small sample sizes common in education research (i.e., 10-70 sites), evaluating the utility of rules of thumb developed from observational studies in the generalization case.

Result: In small random samples, large differences between the sample and population can arise simply by chance and many of the statistics commonly used in generalization are a function of both sample size and the number of covariates being compared. The rules of thumb developed in observational studies (which are commonly applied in generalization) are much too conservative given the small sample sizes found in generalization.

Conclusion: This article implies that sharp inferences to large populations from small experiments are difficult even with probability sampling. Features of random samples should be kept in mind when evaluating the extent to which results from experiments conducted on nonrandom samples might generalize.

Keywords: content area; education; methodological development.



  • Research article
  • Open access
  • Published: 21 November 2018

Characterising and justifying sample size sufficiency in interview-based studies: systematic analysis of qualitative health research over a 15-year period

  • Konstantina Vasileiou   ORCID: orcid.org/0000-0001-5047-3920 1 ,
  • Julie Barnett 1 ,
  • Susan Thorpe 2 &
  • Terry Young 3  

BMC Medical Research Methodology, volume 18, Article number: 148 (2018)


Choosing a suitable sample size in qualitative research is an area of conceptual debate and practical uncertainty. That sample size principles, guidelines and tools have been developed to enable researchers to set, and justify the acceptability of, their sample size is an indication that the issue constitutes an important marker of the quality of qualitative research. Nevertheless, research shows that sample size sufficiency reporting is often poor, if not absent, across a range of disciplinary fields.

A systematic analysis of single-interview-per-participant designs within three health-related journals from the disciplines of psychology, sociology and medicine, over a 15-year period, was conducted to examine whether and how sample sizes were justified and how sample size was characterised and discussed by authors. Data pertinent to sample size were extracted and analysed using qualitative and quantitative analytic techniques.

Our findings demonstrate that provision of sample size justifications in qualitative health research is limited; is not contingent on the number of interviews; and relates to the journal of publication. Defence of sample size was most frequently supported across all three journals with reference to the principle of saturation and to pragmatic considerations. Qualitative sample sizes were predominantly – and often without justification – characterised as insufficient (i.e., ‘small’) and discussed in the context of study limitations. Sample size insufficiency was seen to threaten the validity and generalizability of studies’ results, with the latter being frequently conceived in nomothetic terms.

Conclusions

We recommend, firstly, that qualitative health researchers be more transparent about evaluations of their sample size sufficiency, situating these within broader and more encompassing assessments of data adequacy . Secondly, we invite researchers critically to consider how saturation parameters found in prior methodological studies and sample size community norms might best inform, and apply to, their own project and encourage that data adequacy is best appraised with reference to features that are intrinsic to the study at hand. Finally, those reviewing papers have a vital role in supporting and encouraging transparent study-specific reporting.


Sample adequacy in qualitative inquiry pertains to the appropriateness of the sample composition and size. It is an important consideration in evaluations of the quality and trustworthiness of much qualitative research [ 1 ] and is implicated – particularly for research that is situated within a post-positivist tradition and retains a degree of commitment to realist ontological premises – in appraisals of validity and generalizability [ 2 , 3 , 4 , 5 ].

Samples in qualitative research tend to be small in order to support the depth of case-oriented analysis that is fundamental to this mode of inquiry [ 5 ]. Additionally, qualitative samples are purposive, that is, selected by virtue of their capacity to provide richly-textured information, relevant to the phenomenon under investigation. As a result, purposive sampling [ 6 , 7 ] – as opposed to probability sampling employed in quantitative research – selects ‘information-rich’ cases [ 8 ]. Indeed, recent research demonstrates the greater efficiency of purposive sampling compared to random sampling in qualitative studies [ 9 ], supporting related assertions long put forward by qualitative methodologists.

Sample size in qualitative research has been the subject of enduring discussions [ 4 , 10 , 11 ]. Whilst the quantitative research community has established relatively straightforward statistics-based rules to set sample sizes precisely, the intricacies of qualitative sample size determination and assessment arise from the methodological, theoretical, epistemological, and ideological pluralism that characterises qualitative inquiry (for a discussion focused on the discipline of psychology see [ 12 ]). This militates against clear-cut guidelines that can be applied invariably. Despite these challenges, various conceptual developments have sought to address this issue, offering guidance and principles [ 4 , 10 , 11 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 ]; more recently, an evidence-based approach to sample size determination has sought to ground the discussion empirically [ 21 , 22 , 23 , 24 , 25 , 26 , 27 , 28 , 29 , 30 , 31 , 32 , 33 , 34 , 35 ].

Focusing on single-interview-per-participant qualitative designs, the present study aims to further contribute to the dialogue of sample size in qualitative research by offering empirical evidence around justification practices associated with sample size. We next review the existing conceptual and empirical literature on sample size determination.

Sample size in qualitative research: Conceptual developments and empirical investigations

Qualitative research experts argue that there is no straightforward answer to the question of ‘how many’ and that sample size is contingent on a number of factors relating to epistemological, methodological and practical issues [ 36 ]. Sandelowski [ 4 ] recommends that qualitative sample sizes are large enough to allow the unfolding of a ‘new and richly textured understanding’ of the phenomenon under study, but small enough so that the ‘deep, case-oriented analysis’ (p. 183) of qualitative data is not precluded. Morse [ 11 ] posits that the more useable data are collected from each person, the fewer participants are needed. She invites researchers to take into account parameters such as the scope of the study, the nature of the topic (i.e. complexity, accessibility), the quality of data, and the study design. Indeed, the level of structure of questions in qualitative interviewing has been found to influence the richness of data generated [ 37 ] and so requires attention; empirical research shows that open questions asked later in the interview tend to produce richer data [ 37 ].

Beyond such guidance, specific numerical recommendations have also been proffered, often based on experts’ experience of qualitative research. For example, Green and Thorogood [ 38 ] maintain that the experience of most qualitative researchers conducting an interview-based study with a fairly specific research question is that little new information is generated after interviewing 20 people or so belonging to one analytically relevant participant ‘category’ (pp. 102–104). Ritchie et al. [ 39 ] suggest that studies employing individual interviews conduct no more than 50 interviews so that researchers are able to manage the complexity of the analytic task. Similarly, Britten [ 40 ] notes that large interview studies will often comprise 50 to 60 people. Experts have also offered numerical guidelines tailored to different theoretical and methodological traditions and specific research approaches, e.g. grounded theory, phenomenology [ 11 , 41 ]. More recently, a quantitative tool was proposed [ 42 ] to support a priori sample size determination based on estimates of the prevalence of themes in the population. Nevertheless, this more formulaic approach raised criticisms relating to assumptions about the conceptual [ 43 ] and ontological status of ‘themes’ [ 44 ] and the linearity ascribed to the processes of sampling, data collection and data analysis [ 45 ].
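
To make the logic behind such prevalence-based tools concrete, the sketch below (a minimal illustration of ours, not the tool published in [ 42 ]) works through the binomial reasoning these proposals typically rest on: if a theme is held by a proportion θ of the population, the probability that it surfaces at least once in n randomly selected interviews is 1 − (1 − θ)^n, so one can solve for the smallest n that reaches a desired probability. The function name and default values are purely illustrative.

```python
import math

def min_interviews_for_theme(prevalence: float, desired_prob: float = 0.95) -> int:
    """Smallest n such that a theme held by `prevalence` of the population
    appears at least once among n sampled interviewees with probability
    >= desired_prob, i.e. 1 - (1 - prevalence)**n >= desired_prob."""
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    return math.ceil(math.log(1 - desired_prob) / math.log(1 - prevalence))

# A theme held by 10% of the population needs about 29 interviews to have
# a 95% chance of appearing at least once under these assumptions.
print(min_interviews_for_theme(0.10, 0.95))  # -> 29
```

Under this simple model, rarer themes drive the required number of interviews up sharply, which is one reason the approach has been criticised for treating themes as fixed, countable entities.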

In terms of principles, Lincoln and Guba [ 17 ] proposed that sample size determination be guided by the criterion of informational redundancy, that is, sampling can be terminated when no new information is elicited by sampling more units. Following the logic of informational comprehensiveness, Malterud et al. [ 18 ] introduced the concept of information power as a pragmatic guiding principle, suggesting that the more information power the sample provides, the smaller the sample size needs to be, and vice versa.

Undoubtedly, the most widely used principle for determining sample size and evaluating its sufficiency is that of saturation . The notion of saturation originates in grounded theory [ 15 ] – a qualitative methodological approach explicitly concerned with empirically-derived theory development – and is inextricably linked to theoretical sampling. Theoretical sampling describes an iterative process of data collection, data analysis and theory development whereby data collection is governed by emerging theory rather than predefined characteristics of the population. Grounded theory saturation (often called theoretical saturation) concerns the theoretical categories – as opposed to data – that are being developed and becomes evident when ‘gathering fresh data no longer sparks new theoretical insights, nor reveals new properties of your core theoretical categories’ [ 46 p. 113]. Saturation in grounded theory, therefore, does not equate to the more common focus on data repetition and moves beyond a singular focus on sample size as the justification of sampling adequacy [ 46 , 47 ]. Sample size in grounded theory cannot be determined a priori as it is contingent on the evolving theoretical categories.

Saturation – often under the terms of ‘data’ or ‘thematic’ saturation – has diffused into several qualitative communities beyond its origins in grounded theory. Alongside the expansion of its meaning, being variously equated with ‘no new data’, ‘no new themes’, and ‘no new codes’, saturation has emerged as the ‘gold standard’ in qualitative inquiry [ 2 , 26 ]. Nevertheless, and as Morse [ 48 ] asserts, whilst saturation is the most frequently invoked ‘guarantee of qualitative rigor’, ‘it is the one we know least about’ (p. 587). Certainly researchers caution that saturation is less applicable to, or appropriate for, particular types of qualitative research (e.g. conversation analysis, [ 49 ]; phenomenological research, [ 50 ]) whilst others reject the concept altogether [ 19 , 51 ].

Methodological studies in this area aim to provide guidance about saturation and develop a practical application of processes that ‘operationalise’ and evidence saturation. Guest, Bunce, and Johnson [ 26 ] analysed 60 interviews and found that saturation of themes was reached by the twelfth interview. They noted that their sample was relatively homogeneous, their research aims focused, so studies of more heterogeneous samples and with a broader scope would be likely to need a larger size to achieve saturation. Extending the enquiry to multi-site, cross-cultural research, Hagaman and Wutich [ 28 ] showed that sample sizes of 20 to 40 interviews were required to achieve data saturation of meta-themes that cut across research sites. In a theory-driven content analysis, Francis et al. [ 25 ] reached data saturation at the 17th interview for all their pre-determined theoretical constructs. The authors further proposed two main principles upon which specification of saturation be based: (a) researchers should a priori specify an initial analysis sample (e.g. 10 interviews) which will be used for the first round of analysis and (b) a stopping criterion , that is, a number of interviews (e.g. 3) that needs to be further conducted, the analysis of which will not yield any new themes or ideas. For greater transparency, Francis et al. [ 25 ] recommend that researchers present cumulative frequency graphs supporting their judgment that saturation was achieved. A comparative method for themes saturation (CoMeTS) has also been suggested [ 23 ] whereby the findings of each new interview are compared with those that have already emerged and if it does not yield any new theme, the ‘saturated terrain’ is assumed to have been established. Because the order in which interviews are analysed can influence saturation thresholds depending on the richness of the data, Constantinou et al. [ 23 ] recommend reordering and re-analysing interviews to confirm saturation. Hennink, Kaiser and Marconi’s [ 29 ] methodological study sheds further light on the problem of specifying and demonstrating saturation. Their analysis of interview data showed that code saturation (i.e. the point at which no additional issues are identified) was achieved at 9 interviews, but meaning saturation (i.e. the point at which no further dimensions, nuances, or insights of issues are identified) required 16–24 interviews. Although breadth can be achieved relatively soon, especially for high-prevalence and concrete codes, depth requires additional data, especially for codes of a more conceptual nature.
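
As a concrete illustration of how an initial analysis sample plus a stopping criterion might be operationalised, the sketch below declares saturation once a specified number of consecutive additional interviews contributes no new themes. This is our own minimal rendering of the two principles described above, not code from Francis et al. [ 25 ]; the function and parameter names are hypothetical, and real projects would track themes through a coding framework rather than plain sets.

```python
from typing import List, Set

def saturation_reached(theme_sets: List[Set[str]],
                       initial_sample: int = 10,
                       stopping_criterion: int = 3) -> bool:
    """theme_sets holds, per interview in analysis order, the set of themes coded.
    Analyse an initial batch, then report saturation once `stopping_criterion`
    consecutive further interviews add no theme not already seen."""
    if len(theme_sets) < initial_sample + stopping_criterion:
        return False
    seen: Set[str] = set()
    for themes in theme_sets[:initial_sample]:
        seen |= themes
    run_without_new = 0
    for themes in theme_sets[initial_sample:]:
        if themes - seen:              # this interview introduced new themes
            seen |= themes
            run_without_new = 0
        else:
            run_without_new += 1
            if run_without_new >= stopping_criterion:
                return True
    return False
```

A cumulative frequency graph of the number of themes in `seen` after each interview, of the kind Francis et al. recommend, would make the same judgement visible to readers.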

Critiquing the concept of saturation, Nelson [ 19 ] proposes five conceptual depth criteria in grounded theory projects to assess the robustness of the developing theory: (a) theoretical concepts should be supported by a wide range of evidence drawn from the data; (b) be demonstrably part of a network of inter-connected concepts; (c) demonstrate subtlety; (d) resonate with existing literature; and (e) can be successfully submitted to tests of external validity.

Other work has sought to examine practices of sample size reporting and sufficiency assessment across a range of disciplinary fields and research domains, from nutrition [ 34 ] and health education [ 32 ], to education and the health sciences [ 22 , 27 ], information systems [ 30 ], organisation and workplace studies [ 33 ], human computer interaction [ 21 ], and accounting studies [ 24 ]. Others investigated PhD qualitative studies [ 31 ] and grounded theory studies [ 35 ]. Incomplete and imprecise sample size reporting is commonly pinpointed by these investigations whilst assessment and justifications of sample size sufficiency are even more sporadic.

Sobal [ 34 ] examined the sample size of qualitative studies published in the Journal of Nutrition Education over a period of 30 years. Studies that employed individual interviews ( n  = 30) had an average sample size of 45 individuals and none of these explicitly reported whether saturation was sought and/or attained with their sample size. A minority of articles discussed how sample-related limitations (most often concerning the type of sample rather than its size) limited generalizability. A further systematic analysis [ 32 ] of health education research over 20 years demonstrated that interview-based studies averaged 104 participants (range 2 to 720 interviewees). However, 40% did not report the number of participants. An examination of 83 qualitative interview studies in leading information systems journals [ 30 ] indicated little defence of sample sizes on the basis of recommendations by qualitative methodologists, prior relevant work, or the criterion of saturation. Rather, sample size seemed to correlate with factors such as the journal of publication or the region of study (US vs Europe vs Asia). These results led the authors to call for more rigor in determining and reporting sample size in qualitative information systems research and to recommend optimal sample size ranges for grounded theory (i.e. 20–30 interviews) and single case (i.e. 15–30 interviews) projects.

Similarly, fewer than 10% of articles in organisation and workplace studies provided a sample size justification relating to existing recommendations by methodologists, prior relevant work, or saturation [ 33 ], whilst only 17% of focus group studies in health-related journals provided an explanation of sample size (i.e. number of focus groups), with saturation being the most frequently invoked argument, followed by published sample size recommendations and practical reasons [ 22 ]. The notion of saturation was also invoked by 11 out of the 51 most highly cited studies that Guetterman [ 27 ] reviewed in the fields of education and health sciences, of which six were grounded theory studies, four phenomenological and one a narrative inquiry. Finally, analysing 641 interview-based articles in accounting, Dai et al. [ 24 ] called for more rigor since a significant minority of studies did not report a precise sample size.

Despite increasing attention to rigor in qualitative research (e.g. [ 52 ]) and more extensive methodological and analytical disclosures that seek to validate qualitative work [ 24 ], sample size reporting and sufficiency assessment remain inconsistent and partial, if not absent, across a range of research domains.

Objectives of the present study

The present study sought to enrich existing systematic analyses of the customs and practices of sample size reporting and justification by focusing on qualitative research relating to health. Additionally, this study attempted to expand previous empirical investigations by examining how qualitative sample sizes are characterised and discussed in academic narratives. Qualitative health research is an inter-disciplinary field that, due to its affiliation with medical sciences, often faces views and positions reflective of a quantitative ethos. Thus qualitative health research constitutes an emblematic case that may help to unfold underlying philosophical and methodological differences across the scientific community that are crystallised in considerations of sample size. The present research, therefore, incorporates a comparative element on the basis of three different disciplines engaging with qualitative health research: medicine, psychology, and sociology. We chose to focus our analysis on single-interview-per-participant designs as this is not only a popular and widespread methodological choice in qualitative health research, but also the method where consideration of sample size – defined as the number of interviewees – is particularly salient.

Study design

A structured search for articles reporting cross-sectional, interview-based qualitative studies was carried out and eligible reports were systematically reviewed and analysed employing both quantitative and qualitative analytic techniques.

We selected journals which (a) follow a peer review process, (b) are considered high quality and influential in their field as reflected in journal metrics, and (c) are receptive to, and publish, qualitative research (Additional File  1 presents the journals’ editorial positions in relation to qualitative research and sample considerations where available). Three health-related journals were chosen, each representing a different disciplinary field; the British Medical Journal (BMJ) representing medicine, the British Journal of Health Psychology (BJHP) representing psychology, and the Sociology of Health & Illness (SHI) representing sociology.

Search strategy to identify studies

Employing the search function of each individual journal, we used the terms ‘interview*’ AND ‘qualitative’ and limited the results to articles published between 1 January 2003 and 22 September 2017 (i.e. a 15-year review period).

Eligibility criteria

To be eligible for inclusion in the review, the article had to report a cross-sectional study design. Longitudinal studies were thus excluded whilst studies conducted within a broader research programme (e.g. interview studies nested in a trial, as part of a broader ethnography, as part of a longitudinal research) were included if they reported only single-time qualitative interviews. The method of data collection had to be individual, synchronous qualitative interviews (i.e. group interviews, structured interviews and e-mail interviews over a period of time were excluded), and the data had to be analysed qualitatively (i.e. studies that quantified their qualitative data were excluded). Mixed method studies and articles reporting more than one qualitative method of data collection (e.g. individual interviews and focus groups) were excluded. Figure  1 , a PRISMA flow diagram [ 53 ], shows the number of: articles obtained from the searches and screened; papers assessed for eligibility; and articles included in the review (Additional File  2 provides the full list of articles included in the review and their unique identifying code – e.g. BMJ01, BJHP02, SHI03). One review author (KV) assessed the eligibility of all papers identified from the searches. When in doubt, discussions about retaining or excluding articles were held between KV and JB in regular meetings, and decisions were jointly made.

Figure 1. PRISMA flow diagram

Data extraction and analysis

A data extraction form was developed (see Additional File  3 ) recording three areas of information: (a) information about the article (e.g. authors, title, journal, year of publication etc.); (b) information about the aims of the study, the sample size and any justification for this, the participant characteristics, the sampling technique and any sample-related observations or comments made by the authors; and (c) information about the method or technique(s) of data analysis, the number of researchers involved in the analysis, the potential use of software, and any discussion around epistemological considerations. The Abstract, Methods and Discussion (and/or Conclusion) sections of each article were examined by one author (KV) who extracted all the relevant information. This was directly copied from the articles and, when appropriate, comments, notes and initial thoughts were written down.

To examine the kinds of sample size justifications provided by articles, an inductive content analysis [ 54 ] was initially conducted. On the basis of this analysis, the categories that expressed qualitatively different sample size justifications were developed.

We also extracted or coded quantitative data regarding the following aspects:

Journal and year of publication

Number of interviews

Number of participants

Presence of sample size justification(s) (Yes/No)

Presence of a particular sample size justification category (Yes/No), and

Number of sample size justifications provided

Descriptive and inferential statistical analyses were used to explore these data.
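
As an illustration of how such extracted records might be organised for analysis, the sketch below tabulates a few hypothetical entries using the variables listed above and produces simple per-journal summaries. The rows, values, and column names are invented for demonstration only; they are not the study's dataset.

```python
import pandas as pd

# Hypothetical coded records with the variables extracted above.
records = pd.DataFrame([
    {"journal": "BMJ",  "year": 2005, "n_interviews": 21, "justified": True,  "n_justifications": 1},
    {"journal": "BMJ",  "year": 2011, "n_interviews": 35, "justified": False, "n_justifications": 0},
    {"journal": "BJHP", "year": 2010, "n_interviews": 13, "justified": True,  "n_justifications": 2},
    {"journal": "BJHP", "year": 2014, "n_interviews": 10, "justified": True,  "n_justifications": 1},
    {"journal": "SHI",  "year": 2008, "n_interviews": 40, "justified": False, "n_justifications": 0},
    {"journal": "SHI",  "year": 2016, "n_interviews": 28, "justified": False, "n_justifications": 0},
])

# Descriptive summaries: sample sizes and the proportion of articles
# providing any sample size justification, per journal.
print(records.groupby("journal")["n_interviews"].describe())
print(records.groupby("journal")["justified"].mean())
```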

A thematic analysis [ 55 ] was then performed on all scientific narratives that discussed or commented on the sample size of the study. These narratives were evident both in papers that justified their sample size and those that did not. To identify these narratives, in addition to the methods sections, the discussion sections of the reviewed articles were also examined and relevant data were extracted and analysed.

In total, 214 articles – 21 in the BMJ, 53 in the BJHP and 140 in the SHI – were eligible for inclusion in the review. Table  1 provides basic information about the sample sizes – measured in number of interviews – of the studies reviewed across the three journals. Figure  2 depicts the number of eligible articles published each year per journal.

Figure 2. Number of eligible articles published each year per journal

The publication of qualitative studies in the BMJ was significantly reduced from 2012 onwards and this appears to coincide with the initiation of the BMJ Open to which qualitative studies were possibly directed.

Pairwise comparisons following a significant Kruskal-Wallis test indicated that the studies published in the BJHP had significantly ( p  < .001) smaller sample sizes than those published either in the BMJ or the SHI. Sample sizes of BMJ and SHI articles did not differ significantly from each other.
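
The pairwise procedure is not specified in the text; one straightforward way to implement an omnibus Kruskal-Wallis test followed by pairwise Mann-Whitney comparisons is sketched below with SciPy, using invented sample-size vectors and a simple Bonferroni adjustment purely for illustration.

```python
from scipy import stats

# Invented numbers of interviews per article, grouped by journal.
bmj  = [21, 35, 28, 40, 19, 33]
bjhp = [10, 13,  8, 15, 12,  9]
shi  = [25, 30, 45, 22, 38, 27]

# Omnibus Kruskal-Wallis test across the three journals.
h, p = stats.kruskal(bmj, bjhp, shi)
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.4f}")

# Pairwise Mann-Whitney U follow-ups with a Bonferroni correction
# over the three comparisons (one common, conservative choice).
pairs = {"BMJ vs BJHP": (bmj, bjhp), "BMJ vs SHI": (bmj, shi), "BJHP vs SHI": (bjhp, shi)}
for label, (a, b) in pairs.items():
    u, p_pair = stats.mannwhitneyu(a, b, alternative="two-sided")
    print(f"{label}: U = {u:.1f}, adjusted p = {min(p_pair * len(pairs), 1.0):.4f}")
```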

Sample size justifications: Results from the quantitative and qualitative content analysis

Ten (47.6%) of the 21 BMJ studies, 26 (49.1%) of the 53 BJHP papers and 24 (17.1%) of the 140 SHI articles provided some sort of sample size justification. As shown in Table  2 , the majority of articles which justified their sample size provided one justification (70% of articles); fourteen studies (25%) provided two distinct justifications; one study (1.7%) gave three justifications and two studies (3.3%) expressed four distinct justifications.

There was no association between the number of interviews (i.e. sample size) conducted and the provision of a justification (rpb = .054, p  = .433). Within journals, Mann-Whitney tests indicated that sample sizes of ‘justifying’ and ‘non-justifying’ articles in the BMJ and SHI did not differ significantly from each other. In the BJHP, ‘justifying’ articles ( Mean rank  = 31.3) had significantly larger sample sizes than ‘non-justifying’ studies ( Mean rank  = 22.7; U = 237.000, p  < .05).

There was a significant association between the journal a paper was published in and the provision of a justification (χ 2 (2) = 23.83, p  < .001). BJHP studies provided a sample size justification significantly more often than would be expected ( z  = 2.9); SHI studies significantly less often ( z  = − 2.4). If an article was published in the BJHP, the odds of providing a justification were 4.8 times higher than if published in the SHI. Similarly if published in the BMJ, the odds of a study justifying its sample size were 4.5 times higher than in the SHI.
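
Using the counts reported above (10 of 21 BMJ, 26 of 53 BJHP and 24 of 140 SHI articles providing a justification), the association test and unadjusted odds ratios can be reproduced along the following lines. The chi-square statistic matches the value reported, while the simple cross-product odds ratios are only indicative, since the published figures may have been estimated slightly differently (for example, from a regression model).

```python
import numpy as np
from scipy import stats

# Rows: BMJ, BJHP, SHI; columns: justification provided, not provided.
table = np.array([[10,  11],
                  [26,  27],
                  [24, 116]])

chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.5f}")   # approximately chi2(2) = 23.83

# Unadjusted odds of providing a justification, relative to SHI.
odds_shi = 24 / 116
for name, (j, nj) in {"BMJ": (10, 11), "BJHP": (26, 27)}.items():
    print(f"{name} vs SHI: odds ratio = {(j / nj) / odds_shi:.2f}")
```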

The qualitative content analysis of the scientific narratives identified eleven different sample size justifications. These are described below and illustrated with excerpts from relevant articles. By way of a summary, the frequency with which these were deployed across the three journals is indicated in Table  3 .

Saturation

Saturation was the most commonly invoked principle (55.4% of all justifications) deployed by studies across all three journals to justify the sufficiency of their sample size. In the BMJ, two studies claimed that they achieved data saturation (BMJ17; BMJ18) and one article referred descriptively to achieving saturation without explicitly using the term (BMJ13). Interestingly, BMJ13 included data in the analysis beyond the point of saturation in search of ‘unusual/deviant observations’ and with a view to establishing findings consistency.

Thirty three women were approached to take part in the interview study. Twenty seven agreed and 21 (aged 21–64, median 40) were interviewed before data saturation was reached (one tape failure meant that 20 interviews were available for analysis). (BMJ17). No new topics were identified following analysis of approximately two thirds of the interviews; however, all interviews were coded in order to develop a better understanding of how characteristic the views and reported behaviours were, and also to collect further examples of unusual/deviant observations. (BMJ13).

Two articles reported pre-determining their sample size with a view to achieving data saturation (BMJ08 – see extract in section In line with existing research ; BMJ15 – see extract in section Pragmatic considerations ) without further specifying if this was achieved. One paper claimed theoretical saturation (BMJ06) conceived as being when “no further recurring themes emerging from the analysis” whilst another study argued that although the analytic categories were highly saturated, it was not possible to determine whether theoretical saturation had been achieved (BMJ04). One article (BMJ18) cited a reference to support its position on saturation.

In the BJHP, six articles claimed that they achieved data saturation (BJHP21; BJHP32; BJHP39; BJHP48; BJHP49; BJHP52) and one article stated that, given their sample size and the guidelines for achieving data saturation, it anticipated that saturation would be attained (BJHP50).

Recruitment continued until data saturation was reached, defined as the point at which no new themes emerged. (BJHP48). It has previously been recommended that qualitative studies require a minimum sample size of at least 12 to reach data saturation (Clarke & Braun, 2013; Fugard & Potts, 2014; Guest, Bunce, & Johnson, 2006) Therefore, a sample of 13 was deemed sufficient for the qualitative analysis and scale of this study. (BJHP50).

Two studies argued that they achieved thematic saturation (BJHP28 – see extract in section Sample size guidelines ; BJHP31) and one (BJHP30) article, explicitly concerned with theory development and deploying theoretical sampling, claimed both theoretical and data saturation.

The final sample size was determined by thematic saturation, the point at which new data appears to no longer contribute to the findings due to repetition of themes and comments by participants (Morse, 1995). At this point, data generation was terminated. (BJHP31).

Five studies argued that they achieved (BJHP05; BJHP33; BJHP40; BJHP13 – see extract in section Pragmatic considerations ) or anticipated (BJHP46) saturation without any further specification of the term. BJHP17 referred descriptively to a state of achieved saturation without specifically using the term. Saturation of coding, but not saturation of themes, was claimed to have been reached by one article (BJHP18). Two articles explicitly stated that they did not achieve saturation, instead citing a level of theme completeness (BJHP27) or the replication of themes (BJHP53) as arguments for the sufficiency of their sample size.

Furthermore, data collection ceased on pragmatic grounds rather than at the point when saturation point was reached. Despite this, although nuances within sub-themes were still emerging towards the end of data analysis, the themes themselves were being replicated indicating a level of completeness. (BJHP27).

Finally, one article criticised and explicitly renounced the notion of data saturation claiming that, on the contrary, the criterion of theoretical sufficiency determined its sample size (BJHP16).

According to the original Grounded Theory texts, data collection should continue until there are no new discoveries ( i.e. , ‘data saturation’; Glaser & Strauss, 1967). However, recent revisions of this process have discussed how it is rare that data collection is an exhaustive process and researchers should rely on how well their data are able to create a sufficient theoretical account or ‘theoretical sufficiency’ (Dey, 1999). For this study, it was decided that theoretical sufficiency would guide recruitment, rather than looking for data saturation. (BJHP16).

Ten out of the 20 BJHP articles that employed the argument of saturation used one or more citations relating to this principle.

In the SHI, one article (SHI01) claimed that it achieved category saturation based on authors’ judgment.

This number was not fixed in advance, but was guided by the sampling strategy and the judgement, based on the analysis of the data, of the point at which ‘category saturation’ was achieved. (SHI01).

Three articles described a state of achieved saturation without using the term or specifying what sort of saturation they had achieved (i.e. data, theoretical, thematic saturation) (SHI04; SHI13; SHI30) whilst another four articles explicitly stated that they achieved saturation (SHI100; SHI125; SHI136; SHI137). Two papers stated that they achieved data saturation (SHI73 – see extract in section Sample size guidelines ; SHI113), two claimed theoretical saturation (SHI78; SHI115) and two referred to achieving thematic saturation (SHI87; SHI139) or to saturated themes (SHI29; SHI50).

Recruitment and analysis ceased once theoretical saturation was reached in the categories described below (Lincoln and Guba 1985). (SHI115). The respondents’ quotes drawn on below were chosen as representative, and illustrate saturated themes. (SHI50).

One article stated that thematic saturation was anticipated with its sample size (SHI94). Briefly referring to the difficulty in pinpointing achievement of theoretical saturation, SHI32 (see extract in section Richness and volume of data ) defended the sufficiency of its sample size on the basis of “the high degree of consensus [that] had begun to emerge among those interviewed”, suggesting that information from interviews was being replicated. Finally, SHI112 (see extract in section Further sampling to check findings consistency ) argued that it achieved saturation of discursive patterns . Seven of the 19 SHI articles cited references to support their position on saturation (see Additional File  4 for the full list of citations used by articles to support their position on saturation across the three journals).

Overall, it is clear that the concept of saturation encompassed a wide range of variants expressed in terms such as saturation, data saturation, thematic saturation, theoretical saturation, category saturation, saturation of coding, saturation of discursive themes, theme completeness. It is noteworthy, however, that although these various claims were sometimes supported with reference to the literature, they were not evidenced in relation to the study at hand.

Pragmatic considerations

The determination of sample size on the basis of pragmatic considerations was the second most frequently invoked argument (9.6% of all justifications) appearing in all three journals. In the BMJ, one article (BMJ15) appealed to pragmatic reasons, relating to time constraints and the difficulty of accessing certain study populations, to justify the determination of its sample size.

On the basis of the researchers’ previous experience and the literature, [30, 31] we estimated that recruitment of 15–20 patients at each site would achieve data saturation when data from each site were analysed separately. We set a target of seven to 10 caregivers per site because of time constraints and the anticipated difficulty of accessing caregivers at some home based care services. This gave a target sample of 75–100 patients and 35–50 caregivers overall. (BMJ15).

In the BJHP, four articles mentioned pragmatic considerations relating to time or financial constraints (BJHP27 – see extract in section Saturation ; BJHP53), the participant response rate (BJHP13), and the fixed (and thus limited) size of the participant pool from which interviewees were sampled (BJHP18).

We had aimed to continue interviewing until we had reached saturation, a point whereby further data collection would yield no further themes. In practice, the number of individuals volunteering to participate dictated when recruitment into the study ceased (15 young people, 15 parents). Nonetheless, by the last few interviews, significant repetition of concepts was occurring, suggesting ample sampling. (BJHP13).

Finally, three SHI articles explained their sample size with reference to practical aspects: time constraints and project manageability (SHI56), limited availability of respondents and project resources (SHI131), and time constraints (SHI113).

The size of the sample was largely determined by the availability of respondents and resources to complete the study. Its composition reflected, as far as practicable, our interest in how contextual factors (for example, gender relations and ethnicity) mediated the illness experience. (SHI131).

Qualities of the analysis

This sample size justification (8.4% of all justifications) was mainly employed by BJHP articles and referred to an intensive, idiographic and/or latently focused analysis, i.e. that moved beyond description. More specifically, six articles defended their sample size on the basis of an intensive analysis of transcripts and/or the idiographic focus of the study/analysis. Four of these papers (BJHP02; BJHP19; BJHP24; BJHP47) adopted an Interpretative Phenomenological Analysis (IPA) approach.

The current study employed a sample of 10 in keeping with the aim of exploring each participant’s account (Smith et al. , 1999). (BJHP19).

BJHP47 explicitly renounced the notion of saturation within an IPA approach. The other two BJHP articles conducted thematic analysis (BJHP34; BJHP38). The level of analysis – i.e. latent as opposed to a more superficial descriptive analysis – was also invoked as a justification by BJHP38 alongside the argument of an intensive analysis of individual transcripts.

The resulting sample size was at the lower end of the range of sample sizes employed in thematic analysis (Braun & Clarke, 2013). This was in order to enable significant reflection, dialogue, and time on each transcript and was in line with the more latent level of analysis employed, to identify underlying ideas, rather than a more superficial descriptive analysis (Braun & Clarke, 2006). (BJHP38).

Finally, one BMJ paper (BMJ21) defended its sample size with reference to the complexity of the analytic task.

We stopped recruitment when we reached 30–35 interviews, owing to the depth and duration of interviews, richness of data, and complexity of the analytical task. (BMJ21).

Meet sampling requirements

Meeting sampling requirements (7.2% of all justifications) was another argument employed by two BMJ and four SHI articles to explain their sample size. Achieving maximum variation sampling in terms of specific interviewee characteristics determined and explained the sample size of two BMJ studies (BMJ02; BMJ16 – see extract in section Meet research design requirements ).

Recruitment continued until sampling frame requirements were met for diversity in age, sex, ethnicity, frequency of attendance, and health status. (BMJ02).

Regarding the SHI articles, two papers explained their numbers on the basis of their sampling strategy (SHI01- see extract in section Saturation ; SHI23) whilst sampling requirements that would help attain sample heterogeneity in terms of a particular characteristic of interest was cited by one paper (SHI127).

The combination of matching the recruitment sites for the quantitative research and the additional purposive criteria led to 104 phase 2 interviews (Internet (OLC): 21; Internet (FTF): 20; Gyms (FTF): 23; HIV testing (FTF): 20; HIV treatment (FTF): 20). (SHI23). Of the fifty interviews conducted, thirty were translated from Spanish into English. These thirty, from which we draw our findings, were chosen for translation based on heterogeneity in depressive symptomology and educational attainment. (SHI127).

Finally, the pre-determination of sample size on the basis of sampling requirements was stated by one article though this was not used to justify the number of interviews (SHI10).

Sample size guidelines

Five BJHP articles (BJHP28; BJHP38 – see extract in section Qualities of the analysis ; BJHP46; BJHP47; BJHP50 – see extract in section Saturation ) and one SHI paper (SHI73) relied on citing existing sample size guidelines or norms within research traditions to determine and subsequently defend their sample size (7.2% of all justifications).

Sample size guidelines suggested a range between 20 and 30 interviews to be adequate (Creswell, 1998). Interviewer and note taker agreed that thematic saturation, the point at which no new concepts emerge from subsequent interviews (Patton, 2002), was achieved following completion of 20 interviews. (BJHP28). Interviewing continued until we deemed data saturation to have been reached (the point at which no new themes were emerging). Researchers have proposed 30 as an approximate or working number of interviews at which one could expect to be reaching theoretical saturation when using a semi-structured interview approach (Morse 2000), although this can vary depending on the heterogeneity of respondents interviewed and complexity of the issues explored. (SHI73).

In line with existing research

Sample sizes of published literature in the area of the subject matter under investigation (3.5% of all justifications) were used by 2 BMJ articles as guidance and a precedent for determining and defending their own sample size (BMJ08; BMJ15 – see extract in section Pragmatic considerations ).

We drew participants from a list of prisoners who were scheduled for release each week, sampling them until we reached the target of 35 cases, with a view to achieving data saturation within the scope of the study and sufficient follow-up interviews and in line with recent studies [8–10]. (BMJ08).

Similarly, BJHP38 (see extract in section Qualities of the analysis ) claimed that its sample size was within the range of sample sizes of published studies that use its analytic approach.

Richness and volume of data

BMJ21 (see extract in section Qualities of the analysis ) and SHI32 referred to the richness, detailed nature, and volume of data collected (2.3% of all justifications) to justify the sufficiency of their sample size.

Although there were more potential interviewees from those contacted by postcode selection, it was decided to stop recruitment after the 10th interview and focus on analysis of this sample. The material collected was considerable and, given the focused nature of the study, extremely detailed. Moreover, a high degree of consensus had begun to emerge among those interviewed, and while it is always difficult to judge at what point ‘theoretical saturation’ has been reached, or how many interviews would be required to uncover exception(s), it was felt the number was sufficient to satisfy the aims of this small in-depth investigation (Strauss and Corbin 1990). (SHI32).

Meet research design requirements

Determination of sample size so that it is in line with, and serves the requirements of, the research design (2.3% of all justifications) that the study adopted was another justification used by 2 BMJ papers (BMJ16; BMJ08 – see extract in section In line with existing research ).

We aimed for diverse, maximum variation samples [20] totalling 80 respondents from different social backgrounds and ethnic groups and those bereaved due to different types of suicide and traumatic death. We could have interviewed a smaller sample at different points in time (a qualitative longitudinal study) but chose instead to seek a broad range of experiences by interviewing those bereaved many years ago and others bereaved more recently; those bereaved in different circumstances and with different relations to the deceased; and people who lived in different parts of the UK; with different support systems and coroners’ procedures (see Tables 1 and 2 for more details). (BMJ16).

Researchers’ previous experience

The researchers’ previous experience (possibly referring to experience with qualitative research) was invoked by BMJ15 (see extract in section Pragmatic considerations ) as a justification for the determination of sample size.

Nature of study

One BJHP paper argued that the sample size was appropriate for the exploratory nature of the study (BJHP38).

A sample of eight participants was deemed appropriate because of the exploratory nature of this research and the focus on identifying underlying ideas about the topic. (BJHP38).

Further sampling to check findings consistency

Finally, SHI112 argued that once it had achieved saturation of discursive patterns, further sampling was decided and conducted to check for consistency of the findings.

Within each of the age-stratified groups, interviews were randomly sampled until saturation of discursive patterns was achieved. This resulted in a sample of 67 interviews. Once this sample had been analysed, one further interview from each age-stratified group was randomly chosen to check for consistency of the findings. Using this approach it was possible to more carefully explore children’s discourse about the ‘I’, agency, relationality and power in the thematic areas, revealing the subtle discursive variations described in this article. (SHI112).

Thematic analysis of passages discussing sample size

This analysis resulted in two overarching thematic areas; the first concerned the variation in the characterisation of sample size sufficiency, and the second related to the perceived threats deriving from sample size insufficiency.

Characterisations of sample size sufficiency

The analysis showed that there were three main characterisations of the sample size in the articles that provided relevant comments and discussion: (a) the vast majority of these qualitative studies ( n  = 42) considered their sample size as ‘small’ and this was seen and discussed as a limitation; only two articles viewed their small sample size as desirable and appropriate; (b) a minority of articles ( n  = 4) proclaimed that their achieved sample size was ‘sufficient’; and (c) finally, a small group of studies ( n  = 5) characterised their sample size as ‘large’. Whilst achieving a ‘large’ sample size was sometimes viewed positively because it led to richer results, there were also occasions when a large sample size was problematic rather than desirable.

‘Small’ but why and for whom?

A number of articles which characterised their sample size as ‘small’ did so against an implicit or explicit quantitative framework of reference. Interestingly, three studies that claimed to have achieved data saturation or ‘theoretical sufficiency’ with their sample size, discussed or noted as a limitation in their discussion their ‘small’ sample size, raising the question of why, or for whom, the sample size was considered small given that the qualitative criterion of saturation had been satisfied.

The current study has a number of limitations. The sample size was small (n = 11) and, however, large enough for no new themes to emerge. (BJHP39). The study has two principal limitations. The first of these relates to the small number of respondents who took part in the study. (SHI73).

Other articles appeared to accept and acknowledge that their sample was flawed because of its small size (as well as other compositional ‘deficits’ e.g. non-representativeness, biases, self-selection) or anticipated that they might be criticized for their small sample size. It seemed that the imagined audience – perhaps reviewer or reader – was one inclined to hold the tenets of quantitative research, and certainly one to whom it was important to indicate the recognition that small samples were likely to be problematic. That one’s sample might be thought small was often construed as a limitation couched in a discourse of regret or apology.

Very occasionally, the articulation of the small size as a limitation was explicitly aligned against an espoused positivist framework and quantitative research.

This study has some limitations. Firstly, the 100 incidents sample represents a small number of the total number of serious incidents that occurs every year. 26 We sent out a nationwide invitation and do not know why more people did not volunteer for the study. Our lack of epidemiological knowledge about healthcare incidents, however, means that determining an appropriate sample size continues to be difficult. (BMJ20).

Indicative of an apparent oscillation of qualitative researchers between the different requirements and protocols demarcating the quantitative and qualitative worlds, there were a few instances of articles which briefly recognised their ‘small’ sample size as a limitation, but then defended their study on more qualitative grounds, such as their ability and success at capturing the complexity of experience and delving into the idiographic, and at generating particularly rich data.

This research, while limited in size, has sought to capture some of the complexity attached to men’s attitudes and experiences concerning incomes and material circumstances. (SHI35). Our numbers are small because negotiating access to social networks was slow and labour intensive, but our methods generated exceptionally rich data. (BMJ21). This study could be criticised for using a small and unrepresentative sample. Given that older adults have been ignored in the research concerning suntanning, fair-skinned older adults are the most likely to experience skin cancer, and women privilege appearance over health when it comes to sunbathing practices, our study offers depth and richness of data in a demographic group much in need of research attention. (SHI57).

‘Good enough’ sample sizes

Only four articles expressed some degree of confidence that their achieved sample size was sufficient. For example, SHI139, in line with the justification of thematic saturation that it offered, expressed trust in its sample size sufficiency despite the poor response rate. Similarly, BJHP04, which did not provide a sample size justification, argued that it targeted a larger sample size in order to eventually recruit a sufficient number of interviewees, due to anticipated low response rate.

Twenty-three people with type I diabetes from the target population of 133 ( i.e. 17.3%) consented to participate but four did not then respond to further contacts (total N = 19). The relatively low response rate was anticipated, due to the busy life-styles of young people in the age range, the geographical constraints, and the time required to participate in a semi-structured interview, so a larger target sample allowed a sufficient number of participants to be recruited. (BJHP04).

Two other articles (BJHP35; SHI32) linked the claimed sufficiency to the scope (i.e. ‘small, in-depth investigation’), aims and nature (i.e. ‘exploratory’) of their studies, thus anchoring their numbers to the particular context of their research. Nevertheless, claims of sample size sufficiency were sometimes undermined when they were juxtaposed with an acknowledgement that a larger sample size would be more scientifically productive.

Although our sample size was sufficient for this exploratory study, a more diverse sample including participants with lower socioeconomic status and more ethnic variation would be informative. A larger sample could also ensure inclusion of a more representative range of apps operating on a wider range of platforms. (BJHP35).

‘Large’ sample sizes - Promise or peril?

Three articles (BMJ13; BJHP05; BJHP48) which all provided the justification of saturation, characterised their sample size as ‘large’ and narrated this oversufficiency in positive terms as it allowed richer data and findings and enhanced the potential for generalisation. The type of generalisation aspired to (BJHP48) was not further specified however.

This study used rich data provided by a relatively large sample of expert informants on an important but under-researched topic. (BMJ13). Qualitative research provides a unique opportunity to understand a clinical problem from the patient’s perspective. This study had a large diverse sample, recruited through a range of locations and used in-depth interviews which enhance the richness and generalizability of the results. (BJHP48).

And whilst a ‘large’ sample size was endorsed and valued by some qualitative researchers, within the psychological tradition of IPA, a ‘large’ sample size was counter-normative and therefore needed to be justified. Four BJHP studies, all adopting IPA, expressed the appropriateness or desirability of ‘small’ sample sizes (BJHP41; BJHP45) or hastened to explain why they included a larger than typical sample size (BJHP32; BJHP47). For example, BJHP32 below provides a rationale for how an IPA study can accommodate a large sample size and how this was indeed suitable for the purposes of the particular research. To strengthen the explanation for choosing a non-normative sample size, previous IPA research citing a similar sample size approach is used as a precedent.

Small scale IPA studies allow in-depth analysis which would not be possible with larger samples (Smith et al. , 2009). (BJHP41). Although IPA generally involves intense scrutiny of a small number of transcripts, it was decided to recruit a larger diverse sample as this is the first qualitative study of this population in the United Kingdom (as far as we know) and we wanted to gain an overview. Indeed, Smith, Flowers, and Larkin (2009) agree that IPA is suitable for larger groups. However, the emphasis changes from an in-depth individualistic analysis to one in which common themes from shared experiences of a group of people can be elicited and used to understand the network of relationships between themes that emerge from the interviews. This large-scale format of IPA has been used by other researchers in the field of false-positive research. Baillie, Smith, Hewison, and Mason (2000) conducted an IPA study, with 24 participants, of ultrasound screening for chromosomal abnormality; they found that this larger number of participants enabled them to produce a more refined and cohesive account. (BJHP32).

The IPA articles found in the BJHP were the only instances where a ‘small’ sample size was advocated and a ‘large’ sample size problematized and defended. These IPA studies illustrate that the characterisation of sample size sufficiency can be a function of researchers’ theoretical and epistemological commitments rather than the result of an ‘objective’ sample size assessment.

Threats from sample size insufficiency

As shown above, the majority of articles that commented on their sample size simultaneously characterised it as small and problematic. On those occasions when authors did not simply cite their ‘small’ sample size as a study limitation but rather continued and provided an account of how and why a small sample size was problematic, two important scientific qualities of the research seemed to be threatened: the generalizability and validity of results.

Generalizability

Those who characterised their sample as ‘small’ connected this to the limited potential for generalisation of the results. Other features related to the sample – often some kind of compositional particularity – were also linked to limited potential for generalisation. Though the form of generalisation being referred to was not always explicitly articulated (see BJHP09), generalisation was mostly conceived in nomothetic terms, that is, it concerned the potential to draw inferences from the sample to the broader study population (‘representational generalisation’ – see BJHP31) and less often to other populations or cultures.

It must be noted that samples are small and whilst in both groups the majority of those women eligible participated, generalizability cannot be assumed. (BJHP09). The study’s limitations should be acknowledged: Data are presented from interviews with a relatively small group of participants, and thus, the views are not necessarily generalizable to all patients and clinicians. In particular, patients were only recruited from secondary care services where COFP diagnoses are typically confirmed. The sample therefore is unlikely to represent the full spectrum of patients, particularly those who are not referred to, or who have been discharged from dental services. (BJHP31).

Without explicitly using the term generalisation, two SHI articles noted how their ‘small’ sample size imposed limits on ‘the extent that we can extrapolate from these participants’ accounts’ (SHI114) or to the possibility ‘to draw far-reaching conclusions from the results’ (SHI124).

Interestingly, only a minority of articles alluded to, or invoked, a type of generalisation that is aligned with qualitative research, that is, idiographic generalisation (i.e. generalisation that can be made from and about cases [ 5 ]). These articles, all published in the discipline of sociology, defended their findings in terms of the possibility of drawing logical and conceptual inferences to other contexts and of generating understanding that has the potential to advance knowledge, despite their ‘small’ size. One article (SHI139) clearly contrasted nomothetic (statistical) generalisation to idiographic generalisation, arguing that the lack of statistical generalizability does not nullify the ability of qualitative research to still be relevant beyond the sample studied.

Further, these data do not need to be statistically generalisable for us to draw inferences that may advance medicalisation analyses (Charmaz 2014). These data may be seen as an opportunity to generate further hypotheses and are a unique application of the medicalisation framework. (SHI139). Although a small-scale qualitative study related to school counselling, this analysis can be usefully regarded as a case study of the successful utilisation of mental health-related resources by adolescents. As many of the issues explored are of relevance to mental health stigma more generally, it may also provide insights into adult engagement in services. It shows how a sociological analysis, which uses positioning theory to examine how people negotiate, partially accept and simultaneously resist stigmatisation in relation to mental health concerns, can contribute to an elucidation of the social processes and narrative constructions which may maintain as well as bridge the mental health service gap. (SHI103).

Only one article (SHI30) used the term transferability to argue for the potential of wider relevance of the results which was thought to be more the product of the composition of the sample (i.e. diverse sample), rather than the sample size.

Validity

The second major concern that arose from a ‘small’ sample size pertained to the internal validity of findings (i.e. here the term is used to denote the ‘truth’ or credibility of research findings). Authors expressed uncertainty about the degree of confidence in particular aspects or patterns of their results, primarily those that concerned some form of differentiation on the basis of relevant participant characteristics.

The information source preferred seemed to vary according to parents’ education; however, the sample size is too small to draw conclusions about such patterns. (SHI80). Although our numbers were too small to demonstrate gender differences with any certainty, it does seem that the biomedical and erotic scripts may be more common in the accounts of men and the relational script more common in the accounts of women. (SHI81).

In other instances, articles expressed uncertainty about whether their results accounted for the full spectrum and variation of the phenomenon under investigation. In other words, a ‘small’ sample size (alongside compositional ‘deficits’ such as a not statistically representative sample) was seen to threaten the ‘content validity’ of the results which in turn led to constructions of the study conclusions as tentative.

Data collection ceased on pragmatic grounds rather than when no new information appeared to be obtained ( i.e. , saturation point). As such, care should be taken not to overstate the findings. Whilst the themes from the initial interviews seemed to be replicated in the later interviews, further interviews may have identified additional themes or provided more nuanced explanations. (BJHP53). …it should be acknowledged that this study was based on a small sample of self-selected couples in enduring marriages who were not broadly representative of the population. Thus, participants may not be representative of couples that experience postnatal PTSD. It is therefore unlikely that all the key themes have been identified and explored. For example, couples who were excluded from the study because the male partner declined to participate may have been experiencing greater interpersonal difficulties. (BJHP03).

In other instances, articles attempted to preserve a degree of credibility of their results, despite the recognition that the sample size was ‘small’. Clarity and sharpness of emerging themes and alignment with previous relevant work were the arguments employed to warrant the validity of the results.

This study focused on British Chinese carers of patients with affective disorders, using a qualitative methodology to synthesise the sociocultural representations of illness within this community. Despite the small sample size, clear themes emerged from the narratives that were sufficient for this exploratory investigation. (SHI98).

The present study sought to examine how qualitative sample sizes in health-related research are characterised and justified. In line with previous studies [ 22 , 30 , 33 , 34 ] the findings demonstrate that reporting of sample size sufficiency is limited; just over 50% of articles in the BMJ and BJHP and 82% in the SHI did not provide any sample size justification. Providing a sample size justification was not related to the number of interviews conducted, but it was associated with the journal that the article was published in, indicating the influence of disciplinary or publishing norms, also reported in prior research [ 30 ]. This lack of transparency about sample size sufficiency is problematic given that most qualitative researchers would agree that it is an important marker of quality [ 56 , 57 ]. Moreover, and with the rise of qualitative research in social sciences, efforts to synthesise existing evidence and assess its quality are obstructed by poor reporting [ 58 , 59 ].

When authors justified their sample size, our findings indicate that sufficiency was mostly appraised with reference to features that were intrinsic to the study, in agreement with general advice on sample size determination [ 4 , 11 , 36 ]. The principle of saturation was the most commonly invoked argument [ 22 ] accounting for 55% of all justifications. A wide range of variants of saturation was evident corroborating the proliferation of the meaning of the term [ 49 ] and reflecting different underlying conceptualisations or models of saturation [ 20 ]. Nevertheless, claims of saturation were never substantiated in relation to procedures conducted in the study itself, endorsing similar observations in the literature [ 25 , 30 , 47 ]. Claims of saturation were sometimes supported with citations of other literature, suggesting a removal of the concept away from the characteristics of the study at hand. Pragmatic considerations, such as resource constraints or participant response rate and availability, was the second most frequently used argument accounting for approximately 10% of justifications and another 23% of justifications also represented intrinsic-to-the-study characteristics (i.e. qualities of the analysis, meeting sampling or research design requirements, richness and volume of the data obtained, nature of study, further sampling to check findings consistency).

Only, 12% of mentions of sample size justification pertained to arguments that were external to the study at hand, in the form of existing sample size guidelines and prior research that sets precedents. Whilst community norms and prior research can establish useful rules of thumb for estimating sample sizes [ 60 ] – and reveal what sizes are more likely to be acceptable within research communities – researchers should avoid adopting these norms uncritically, especially when such guidelines [e.g. 30 , 35 ], might be based on research that does not provide adequate evidence of sample size sufficiency. Similarly, whilst methodological research that seeks to demonstrate the achievement of saturation is invaluable since it explicates the parameters upon which saturation is contingent and indicates when a research project is likely to require a smaller or a larger sample [e.g. 29 ], specific numbers at which saturation was achieved within these projects cannot be routinely extrapolated for other projects. We concur with existing views [ 11 , 36 ] that the consideration of the characteristics of the study at hand, such as the epistemological and theoretical approach, the nature of the phenomenon under investigation, the aims and scope of the study, the quality and richness of data, or the researcher’s experience and skills of conducting qualitative research, should be the primary guide in determining sample size and assessing its sufficiency.

Moreover, although numbers in qualitative research are not unimportant [ 61 ], sample size should not be considered alone but be embedded in the more encompassing examination of data adequacy [ 56 , 57 ]. Erickson’s [ 62 ] dimensions of ‘evidentiary adequacy’ are useful here. He explains the concept in terms of adequate amounts of evidence, adequate variety in kinds of evidence, adequate interpretive status of evidence, adequate disconfirming evidence, and adequate discrepant case analysis. All dimensions might not be relevant across all qualitative research designs, but this illustrates the thickness of the concept of data adequacy, taking it beyond sample size.

The present research also demonstrated that sample sizes were commonly seen as ‘small’ and insufficient and discussed as limitation. Often unjustified (and in two cases incongruent with their own claims of saturation) these findings imply that sample size in qualitative health research is often adversely judged (or expected to be judged) against an implicit, yet omnipresent, quasi-quantitative standpoint. Indeed there were a few instances in our data where authors appeared, possibly in response to reviewers, to resist to some sort of quantification of their results. This implicit reference point became more apparent when authors discussed the threats deriving from an insufficient sample size. Whilst the concerns about internal validity might be legitimate to the extent that qualitative research projects, which are broadly related to realism, are set to examine phenomena in sufficient breadth and depth, the concerns around generalizability revealed a conceptualisation that is not compatible with purposive sampling. The limited potential for generalisation, as a result of a small sample size, was often discussed in nomothetic, statistical terms. Only occasionally was analytic or idiographic generalisation invoked to warrant the value of the study’s findings [ 5 , 17 ].

Strengths and limitations of the present study

We note, first, the limited number of health-related journals reviewed, so that only a ‘snapshot’ of qualitative health research has been captured. Examining additional disciplines (e.g. nursing sciences) as well as inter-disciplinary journals would add to the findings of this analysis. Nevertheless, our study is the first to provide some comparative insights on the basis of disciplines that are differently attached to the legacy of positivism and analysed literature published over a lengthy period of time (15 years). Guetterman [ 27 ] also examined health-related literature but this analysis was restricted to 26 most highly cited articles published over a period of five years whilst Carlsen and Glenton’s [ 22 ] study concentrated on focus groups health research. Moreover, although it was our intention to examine sample size justification in relation to the epistemological and theoretical positions of articles, this proved to be challenging largely due to absence of relevant information, or the difficulty into discerning clearly articles’ positions [ 63 ] and classifying them under specific approaches (e.g. studies often combined elements from different theoretical and epistemological traditions). We believe that such an analysis would yield useful insights as it links the methodological issue of sample size to the broader philosophical stance of the research. Despite these limitations, the analysis of the characterisation of sample size and of the threats seen to accrue from insufficient sample size, enriches our understanding of sample size (in)sufficiency argumentation by linking it to other features of the research. As the peer-review process becomes increasingly public, future research could usefully examine how reporting around sample size sufficiency and data adequacy might be influenced by the interactions between authors and reviewers.

The past decade has seen a growing appetite in qualitative research for an evidence-based approach to sample size determination and to evaluations of the sufficiency of sample size. Despite the conceptual and methodological developments in the area, the findings of the present study confirm previous studies in concluding that appraisals of sample size sufficiency are either absent or poorly substantiated. To ensure and maintain high quality research that will encourage greater appreciation of qualitative work in health-related sciences [ 64 ], we argue that qualitative researchers should be more transparent and thorough in their evaluation of sample size as part of their appraisal of data adequacy. We would encourage the practice of appraising sample size sufficiency with close reference to the study at hand and would thus caution against responding to the growing methodological research in this area with a decontextualised application of sample size numerical guidelines, norms and principles. Although researchers might find sample size community norms serve as useful rules of thumb, we recommend methodological knowledge is used to critically consider how saturation and other parameters that affect sample size sufficiency pertain to the specifics of the particular project. Those reviewing papers have a vital role in encouraging transparent study-specific reporting. The review process should support authors to exercise nuanced judgments in decisions about sample size determination in the context of the range of factors that influence sample size sufficiency and the specifics of a particular study. In light of the growing methodological evidence in the area, transparent presentation of such evidence-based judgement is crucial and in time should surely obviate the seemingly routine practice of citing the ‘small’ size of qualitative samples among the study limitations.

A non-parametric test of difference for independent samples was performed since the variable number of interviews violated assumptions of normality according to the standardized scores of skewness and kurtosis (BMJ: z skewness = 3.23, z kurtosis = 1.52; BJHP: z skewness = 4.73, z kurtosis = 4.85; SHI: z skewness = 12.04, z kurtosis = 21.72) and the Shapiro-Wilk test of normality ( p  < .001).

Abbreviations

British Journal of Health Psychology

British Medical Journal

Interpretative Phenomenological Analysis

Sociology of Health & Illness

Spencer L, Ritchie J, Lewis J, Dillon L. Quality in qualitative evaluation: a framework for assessing research evidence. National Centre for Social Research 2003 https://www.heacademy.ac.uk/system/files/166_policy_hub_a_quality_framework.pdf Accessed 11 May 2018.

Fusch PI, Ness LR. Are we there yet? Data saturation in qualitative research Qual Rep. 2015;20(9):1408–16.

Google Scholar  

Robinson OC. Sampling in interview-based qualitative research: a theoretical and practical guide. Qual Res Psychol. 2014;11(1):25–41.

Article   Google Scholar  

Sandelowski M. Sample size in qualitative research. Res Nurs Health. 1995;18(2):179–83.

Article   CAS   Google Scholar  

Sandelowski M. One is the liveliest number: the case orientation of qualitative research. Res Nurs Health. 1996;19(6):525–9.

Luborsky MR, Rubinstein RL. Sampling in qualitative research: rationale, issues. and methods Res Aging. 1995;17(1):89–113.

Marshall MN. Sampling for qualitative research. Fam Pract. 1996;13(6):522–6.

Patton MQ. Qualitative evaluation and research methods. 2nd ed. Newbury Park, CA: Sage; 1990.

van Rijnsoever FJ. (I Can’t get no) saturation: a simulation and guidelines for sample sizes in qualitative research. PLoS One. 2017;12(7):e0181689.

Morse JM. The significance of saturation. Qual Health Res. 1995;5(2):147–9.

Morse JM. Determining sample size. Qual Health Res. 2000;10(1):3–5.

Gergen KJ, Josselson R, Freeman M. The promises of qualitative inquiry. Am Psychol. 2015;70(1):1–9.

Borsci S, Macredie RD, Barnett J, Martin J, Kuljis J, Young T. Reviewing and extending the five-user assumption: a grounded procedure for interaction evaluation. ACM Trans Comput Hum Interact. 2013;20(5):29.

Borsci S, Macredie RD, Martin JL, Young T. How many testers are needed to assure the usability of medical devices? Expert Rev Med Devices. 2014;11(5):513–25.

Glaser BG, Strauss AL. The discovery of grounded theory: strategies for qualitative research. Chicago, IL: Aldine; 1967.

Kerr C, Nixon A, Wild D. Assessing and demonstrating data saturation in qualitative inquiry supporting patient-reported outcomes research. Expert Rev Pharmacoecon Outcomes Res. 2010;10(3):269–81.

Lincoln YS, Guba EG. Naturalistic inquiry. London: Sage; 1985.

Book   Google Scholar  

Malterud K, Siersma VD, Guassora AD. Sample size in qualitative interview studies: guided by information power. Qual Health Res. 2015;26:1753–60.

Nelson J. Using conceptual depth criteria: addressing the challenge of reaching saturation in qualitative research. Qual Res. 2017;17(5):554–70.

Saunders B, Sim J, Kingstone T, Baker S, Waterfield J, Bartlam B, et al. Saturation in qualitative research: exploring its conceptualization and operationalization. Qual Quant. 2017. https://doi.org/10.1007/s11135-017-0574-8 .

Caine K. Local standards for sample size at CHI. In Proceedings of the 2016 CHI conference on human factors in computing systems. 2016;981–992. ACM.

Carlsen B, Glenton C. What about N? A methodological study of sample-size reporting in focus group studies. BMC Med Res Methodol. 2011;11(1):26.

Constantinou CS, Georgiou M, Perdikogianni M. A comparative method for themes saturation (CoMeTS) in qualitative interviews. Qual Res. 2017;17(5):571–88.

Dai NT, Free C, Gendron Y. Interview-based research in accounting 2000–2014: a review. November 2016. https://ssrn.com/abstract=2711022 or https://doi.org/10.2139/ssrn.2711022 . Accessed 17 May 2018.

Francis JJ, Johnston M, Robertson C, Glidewell L, Entwistle V, Eccles MP, et al. What is an adequate sample size? Operationalising data saturation for theory-based interview studies. Psychol Health. 2010;25(10):1229–45.

Guest G, Bunce A, Johnson L. How many interviews are enough? An experiment with data saturation and variability. Field Methods. 2006;18(1):59–82.

Guetterman TC. Descriptions of sampling practices within five approaches to qualitative research in education and the health sciences. Forum Qual Soc Res. 2015;16(2):25. http://nbn-resolving.de/urn:nbn:de:0114-fqs1502256 . Accessed 17 May 2018.

Hagaman AK, Wutich A. How many interviews are enough to identify metathemes in multisited and cross-cultural research? Another perspective on guest, bunce, and Johnson’s (2006) landmark study. Field Methods. 2017;29(1):23–41.

Hennink MM, Kaiser BN, Marconi VC. Code saturation versus meaning saturation: how many interviews are enough? Qual Health Res. 2017;27(4):591–608.

Marshall B, Cardon P, Poddar A, Fontenot R. Does sample size matter in qualitative research?: a review of qualitative interviews in IS research. J Comput Inform Syst. 2013;54(1):11–22.

Mason M. Sample size and saturation in PhD studies using qualitative interviews. Forum Qual Soc Res 2010;11(3):8. http://nbn-resolving.de/urn:nbn:de:0114-fqs100387 . Accessed 17 May 2018.

Safman RM, Sobal J. Qualitative sample extensiveness in health education research. Health Educ Behav. 2004;31(1):9–21.

Saunders MN, Townsend K. Reporting and justifying the number of interview participants in organization and workplace research. Br J Manag. 2016;27(4):836–52.

Sobal J. 2001. Sample extensiveness in qualitative nutrition education research. J Nutr Educ. 2001;33(4):184–92.

Thomson SB. 2010. Sample size and grounded theory. JOAAG. 2010;5(1). http://www.joaag.com/uploads/5_1__Research_Note_1_Thomson.pdf . Accessed 17 May 2018.

Baker SE, Edwards R. How many qualitative interviews is enough?: expert voices and early career reflections on sampling and cases in qualitative research. National Centre for Research Methods Review Paper. 2012; http://eprints.ncrm.ac.uk/2273/4/how_many_interviews.pdf . Accessed 17 May 2018.

Ogden J, Cornwell D. The role of topic, interviewee, and question in predicting rich interview data in the field of health research. Sociol Health Illn. 2010;32(7):1059–71.

Green J, Thorogood N. Qualitative methods for health research. London: Sage; 2004.

Ritchie J, Lewis J, Elam G. Designing and selecting samples. In: Ritchie J, Lewis J, editors. Qualitative research practice: a guide for social science students and researchers. London: Sage; 2003. p. 77–108.

Britten N. Qualitative research: qualitative interviews in medical research. BMJ. 1995;311(6999):251–3.

Creswell JW. Qualitative inquiry and research design: choosing among five approaches. 2nd ed. London: Sage; 2007.

Fugard AJ, Potts HW. Supporting thinking on sample sizes for thematic analyses: a quantitative tool. Int J Soc Res Methodol. 2015;18(6):669–84.

Emmel N. Themes, variables, and the limits to calculating sample size in qualitative research: a response to Fugard and Potts. Int J Soc Res Methodol. 2015;18(6):685–6.

Braun V, Clarke V. (Mis) conceptualising themes, thematic analysis, and other problems with Fugard and Potts’ (2015) sample-size tool for thematic analysis. Int J Soc Res Methodol. 2016;19(6):739–43.

Hammersley M. Sampling and thematic analysis: a response to Fugard and Potts. Int J Soc Res Methodol. 2015;18(6):687–8.

Charmaz K. Constructing grounded theory: a practical guide through qualitative analysis. London: Sage; 2006.

Bowen GA. Naturalistic inquiry and the saturation concept: a research note. Qual Res. 2008;8(1):137–52.

Morse JM. Data were saturated. Qual Health Res. 2015;25(5):587–8.

O’Reilly M, Parker N. ‘Unsatisfactory saturation’: a critical exploration of the notion of saturated sample sizes in qualitative research. Qual Res. 2013;13(2):190–7.

Manen M, Higgins I, Riet P. A conversation with max van Manen on phenomenology in its original sense. Nurs Health Sci. 2016;18(1):4–7.

Dey I. Grounding grounded theory. San Francisco, CA: Academic Press; 1999.

Hays DG, Wood C, Dahl H, Kirk-Jenkins A. Methodological rigor in journal of counseling & development qualitative research articles: a 15-year review. J Couns Dev. 2016;94(2):172–83.

Moher D, Liberati A, Tetzlaff J, Altman DG, Prisma Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med 2009; 6(7): e1000097.

Hsieh HF, Shannon SE. Three approaches to qualitative content analysis. Qual Health Res. 2005;15(9):1277–88.

Boyatzis RE. Transforming qualitative information: thematic analysis and code development. Thousand Oaks, CA: Sage; 1998.

Levitt HM, Motulsky SL, Wertz FJ, Morrow SL, Ponterotto JG. Recommendations for designing and reviewing qualitative research in psychology: promoting methodological integrity. Qual Psychol. 2017;4(1):2–22.

Morrow SL. Quality and trustworthiness in qualitative research in counseling psychology. J Couns Psychol. 2005;52(2):250–60.

Barroso J, Sandelowski M. Sample reporting in qualitative studies of women with HIV infection. Field Methods. 2003;15(4):386–404.

Glenton C, Carlsen B, Lewin S, Munthe-Kaas H, Colvin CJ, Tunçalp Ö, et al. Applying GRADE-CERQual to qualitative evidence synthesis findings—paper 5: how to assess adequacy of data. Implement Sci. 2018;13(Suppl 1):14.

Onwuegbuzie AJ. Leech NL. A call for qualitative power analyses. Qual Quant. 2007;41(1):105–21.

Sandelowski M. Real qualitative researchers do not count: the use of numbers in qualitative research. Res Nurs Health. 2001;24(3):230–40.

Erickson F. Qualitative methods in research on teaching. In: Wittrock M, editor. Handbook of research on teaching. 3rd ed. New York: Macmillan; 1986. p. 119–61.

Bradbury-Jones C, Taylor J, Herber O. How theory is used and articulated in qualitative research: development of a new typology. Soc Sci Med. 2014;120:135–41.

Greenhalgh T, Annandale E, Ashcroft R, Barlow J, Black N, Bleakley A, et al. An open letter to the BMJ editors on qualitative research. BMJ. 2016;i563:352.

Download references

Acknowledgments

We would like to thank Dr. Paula Smith and Katharine Lee for their comments on a previous draft of this paper as well as Natalie Ann Mitchell and Meron Teferra for assisting us with data extraction.

This research was initially conceived of and partly conducted with financial support from the Multidisciplinary Assessment of Technology Centre for Healthcare (MATCH) programme (EP/F063822/1 and EP/G012393/1). The research continued and was completed independent of any support. The funding body did not have any role in the study design, the collection, analysis and interpretation of the data, in the writing of the paper, and in the decision to submit the manuscript for publication. The views expressed are those of the authors alone.

Availability of data and materials

Supporting data can be accessed in the original publications. Additional File 2 lists all eligible studies that were included in the present analysis.

Author information

Authors and affiliations.

Department of Psychology, University of Bath, Building 10 West, Claverton Down, Bath, BA2 7AY, UK

Konstantina Vasileiou & Julie Barnett

School of Psychology, Newcastle University, Ridley Building 1, Queen Victoria Road, Newcastle upon Tyne, NE1 7RU, UK

Susan Thorpe

Department of Computer Science, Brunel University London, Wilfred Brown Building 108, Uxbridge, UB8 3PH, UK

Terry Young

You can also search for this author in PubMed   Google Scholar

Contributions

JB and TY conceived the study; KV, JB, and TY designed the study; KV identified the articles and extracted the data; KV and JB assessed eligibility of articles; KV, JB, ST, and TY contributed to the analysis of the data, discussed the findings and early drafts of the paper; KV developed the final manuscript; KV, JB, ST, and TY read and approved the manuscript.

Corresponding author

Correspondence to Konstantina Vasileiou .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

Terry Young is an academic who undertakes research and occasional consultancy in the areas of health technology assessment, information systems, and service design. He is unaware of any direct conflict of interest with respect to this paper. All other authors have no competing interests to declare.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional Files

Additional file 1:.

Editorial positions on qualitative research and sample considerations (where available). (DOCX 12 kb)

Additional File 2:

List of eligible articles included in the review ( N  = 214). (DOCX 38 kb)

Additional File 3:

Data Extraction Form. (DOCX 15 kb)

Additional File 4:

Citations used by articles to support their position on saturation. (DOCX 14 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article.

Vasileiou, K., Barnett, J., Thorpe, S. et al. Characterising and justifying sample size sufficiency in interview-based studies: systematic analysis of qualitative health research over a 15-year period. BMC Med Res Methodol 18 , 148 (2018). https://doi.org/10.1186/s12874-018-0594-7

Download citation

Received : 22 May 2018

Accepted : 29 October 2018

Published : 21 November 2018

DOI : https://doi.org/10.1186/s12874-018-0594-7

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Sample size
  • Sample size justification
  • Sample size characterisation
  • Data adequacy
  • Qualitative health research
  • Qualitative interviews
  • Systematic analysis

BMC Medical Research Methodology

ISSN: 1471-2288

limitations of small sample size in quantitative research

limitations of small sample size in quantitative research

Research Limitations 101 📖

A Plain-Language Explainer (With Practical Examples)

By: Derek Jansen (MBA) | Expert Reviewer: Dr. Eunice Rautenbach | May 2024

Research limitations are one of those things that students tend to avoid digging into, and understandably so. No one likes to critique their own study and point out weaknesses. Nevertheless, being able to understand the limitations of your study – and, just as importantly, the implications thereof – a is a critically important skill.

In this post, we’ll unpack some of the most common research limitations you’re likely to encounter, so that you can approach your project with confidence.

Overview: Research Limitations 101

  • What are research limitations ?
  • Access – based limitations
  • Temporal & financial limitations
  • Sample & sampling limitations
  • Design limitations
  • Researcher limitations
  • Key takeaways

What (exactly) are “research limitations”?

At the simplest level, research limitations (also referred to as “the limitations of the study”) are the constraints and challenges that will invariably influence your ability to conduct your study and draw reliable conclusions .

Research limitations are inevitable. Absolutely no study is perfect and limitations are an inherent part of any research design. These limitations can stem from a variety of sources , including access to data, methodological choices, and the more mundane constraints of budget and time. So, there’s no use trying to escape them – what matters is that you can recognise them.

Acknowledging and understanding these limitations is crucial, not just for the integrity of your research, but also for your development as a scholar. That probably sounds a bit rich, but realistically, having a strong understanding of the limitations of any given study helps you handle the inevitable obstacles professionally and transparently, which in turn builds trust with your audience and academic peers.

Simply put, recognising and discussing the limitations of your study demonstrates that you know what you’re doing , and that you’ve considered the results of your project within the context of these limitations. In other words, discussing the limitations is a sign of credibility and strength – not weakness. Contrary to the common misconception, highlighting your limitations (or rather, your study’s limitations) will earn you (rather than cost you) marks.

So, with that foundation laid, let’s have a look at some of the most common research limitations you’re likely to encounter – and how to go about managing them as effectively as possible.

Need a helping hand?

limitations of small sample size in quantitative research

Limitation #1: Access To Information

One of the first hurdles you might encounter is limited access to necessary information. For example, you may have trouble getting access to specific literature or niche data sets. This situation can manifest due to several reasons, including paywalls, copyright and licensing issues or language barriers.

To minimise situations like these, it’s useful to try to leverage your university’s resource pool to the greatest extent possible. In practical terms, this means engaging with your university’s librarian and/or potentially utilising interlibrary loans to get access to restricted resources. If this sounds foreign to you, have a chat with your librarian 🙃

In emerging fields or highly specific study areas, you might find that there’s very little existing research (i.e., literature) on your topic. This scenario, while challenging, also offers a unique opportunity to contribute significantly to your field , as it indicates that there’s a significant research gap .

All of that said, be sure to conduct an exhaustive search using a variety of keywords and Boolean operators before assuming that there’s a lack of literature. Also, remember to snowball your literature base . In other words, scan the reference lists of the handful of papers that are directly relevant and then scan those references for more sources. You can also consider using tools like Litmaps and Connected Papers (see video below).

Limitation #2: Time & Money

Almost every researcher will face time and budget constraints at some point. Naturally, these limitations can affect the depth and breadth of your research – but they don’t need to be a death sentence.

Effective planning is crucial to managing both the temporal and financial aspects of your study. In practical terms, utilising tools like Gantt charts can help you visualise and plan your research timeline realistically, thereby reducing the risk of any nasty surprises. Always take a conservative stance when it comes to timelines, especially if you’re new to academic research. As a rule of thumb, things will generally take twice as long as you expect – so, prepare for the worst-case scenario.

If budget is a concern, you might want to consider exploring small research grants or adjusting the scope of your study so that it fits within a realistic budget. Trimming back might sound unattractive, but keep in mind that a smaller, well-planned study can often be more impactful than a larger, poorly planned project.

If you find yourself in a position where you’ve already run out of cash, don’t panic. There’s usually a pivot opportunity hidden somewhere within your project. Engage with your research advisor or faculty to explore potential solutions – don’t make any major changes without first consulting your institution.

Free Webinar: Research Methodology 101

Limitation #3: Sample Size & Composition

As we’ve discussed before , the size and representativeness of your sample are crucial , especially in quantitative research where the robustness of your conclusions often depends on these factors. All too often though, students run into issues achieving a sufficient sample size and composition.

To ensure adequacy in terms of your sample size, it’s important to plan for potential dropouts by oversampling from the outset . In other words, if you aim for a final sample size of 100 participants, aim to recruit 120-140 to account for unexpected challenges. If you still find yourself short on participants, consider whether you could complement your dataset with secondary data or data from an adjacent sample – for example, participants from another city or country. That said, be sure to engage with your research advisor before making any changes to your approach.

A related issue that you may run into is sample composition. In other words, you may have trouble securing a random sample that’s representative of your population of interest. In cases like this, you might again want to look at ways to complement your dataset with other sources, but if that’s not possible, it’s not the end of the world. As with all limitations, you’ll just need to recognise this limitation in your final write-up and be sure to interpret your results accordingly. In other words, don’t claim generalisability of your results if your sample isn’t random.

Limitation #4: Methodological Limitations

As we alluded earlier, every methodological choice comes with its own set of limitations . For example, you can’t claim causality if you’re using a descriptive or correlational research design. Similarly, as we saw in the previous example, you can’t claim generalisability if you’re using a non-random sampling approach.

Making good methodological choices is all about understanding (and accepting) the inherent trade-offs . In the vast majority of cases, you won’t be able to adopt the “perfect” methodology – and that’s okay. What’s important is that you select a methodology that aligns with your research aims and research questions , as well as the practical constraints at play (e.g., time, money, equipment access, etc.). Just as importantly, you must recognise and articulate the limitations of your chosen methods, and justify why they were the most suitable, given your specific context.

Limitation #5: Researcher (In)experience 

A discussion about research limitations would not be complete without mentioning the researcher (that’s you!). Whether we like to admit it or not, researcher inexperience and personal biases can subtly (and sometimes not so subtly) influence the interpretation and presentation of data within a study. This is especially true when it comes to dissertations and theses , as these are most commonly undertaken by first-time (or relatively fresh) researchers.

When it comes to dealing with this specific limitation, it’s important to remember the adage “ We don’t know what we don’t know ”. In other words, recognise and embrace your (relative) ignorance and subjectivity – and interpret your study’s results within that context . Simply put, don’t be overly confident in drawing conclusions from your study – especially when they contradict existing literature.

Cultivating a culture of reflexivity within your research practices can help reduce subjectivity and keep you a bit more “rooted” in the data. In practical terms, this simply means making an effort to become aware of how your perspectives and experiences may have shaped the research process and outcomes.

As with any new endeavour in life, it’s useful to garner as many outsider perspectives as possible. Of course, your university-assigned research advisor will play a large role in this respect, but it’s also a good idea to seek out feedback and critique from other academics. To this end, you might consider approaching other faculty at your institution, joining an online group, or even working with a private coach .

Your inexperience and personal biases can subtly (but significantly) influence how you interpret your data and draw your conclusions.

Key Takeaways

Understanding and effectively navigating research limitations is key to conducting credible and reliable academic work. By acknowledging and addressing these limitations upfront, you not only enhance the integrity of your research, but also demonstrate your academic maturity and professionalism.

Whether you’re working on a dissertation, thesis or any other type of formal academic research, remember the five most common research limitations and interpret your data while keeping them in mind.

  • Access to Information (literature and data)
  • Time and money
  • Sample size and composition
  • Research design and methodology
  • Researcher (in)experience and bias

If you need a hand identifying and mitigating the limitations within your study, check out our 1:1 private coaching service .

Literature Review Course

Psst… there’s more!

This post is an extract from our bestselling short course, Methodology Bootcamp . If you want to work smart, you don't want to miss this .

Submit a Comment Cancel reply

Your email address will not be published. Required fields are marked *

Save my name, email, and website in this browser for the next time I comment.

  • Print Friendly

chrome icon

What are the limitations of using a small sample size in research?  

Insight from top 4 papers.

Using a small sample size in research has several limitations. Firstly, it may lead to underpowered studies, which are unreliable and may result in faulty conclusions and wasted resources [??] . Secondly, estimating sample size based on only two time points can be inadequate and biased, leading to systematically under- or over-estimated sample sizes [??] . Additionally, small sample sizes may not be representative of the population, resulting in imprecise results and questioning the validity of the study [??] . Moreover, small sample sizes may not be able to account for the complexity of the statistical model needed to answer the research question, especially in studies with latent variables and nested observations [??] . It is important to calculate the appropriate sample size during the planning phase of the study and report the details of the sample size calculation to ensure transparency and reliability of the results [??] .

Source Papers (4)

TitleInsight
,   PDF Talk with Paper
,   - Talk with Paper
, , , , , , , , ,   - Talk with Paper
, ,   PDF Talk with Paper

Related Questions

Yes, a small sample size does impact sampling error. Small sample sizes can lead to biased results due to sampling error, as they are less likely to be representative of the population, resulting in overestimation of probabilities of detection (POD) . In genetic programming (GP), small populations also tend to incorporate sampling error, affecting the reliability of search outcomes, making large initial population sizes necessary to reduce sampling error . Furthermore, in meta-analyses, sampling error from small sample sizes can introduce substantial bias in results, especially for effect sizes like standardized mean difference, odds ratio, risk ratio, and risk difference, emphasizing the need to consider sampling error in such analyses . Empirical studies in population genomics have shown that sample size strongly influences the accuracy of estimated parameters, with effective population size being particularly underestimated at low sample sizes, highlighting the impact of sample size on demographic inferences .

Statistical analysis with small samples can be challenging due to issues like low statistical power and potential bias. Research suggests that Bayesian methods can be beneficial for small samples, but caution is advised when using default priors as they may lead to biased results . Methods like selecting representative subplans from factorial experiments can aid in accurate parameter estimation with limited data . However, small trials in fields like football research are common but may lack the power to detect significant effects . Concerns arise when presenting data from small samples without rigorous peer review, as they can lead to overestimation of effects and unreliable conclusions . Nevertheless, innovative approaches like fitting curvilinear regressions to small data samples can provide valuable insights into child growth even with restricted resources . Therefore, while conducting statistical analysis with small samples is necessary in some situations, careful consideration and methodological rigor are crucial to ensure the validity and reliability of the results.

Limitations of using a small sample size in qualitative research include challenges related to achieving thorough data coverage, representing all themes adequately, and fully realizing the dimensionality of themes . Small sample sizes can lead to insufficient representation of codes and themes, potentially compromising the validity and generalizability of study results . Additionally, small samples may not capture the full complexity of the phenomenon under study, hindering the depth of insights gained from the research . Insufficient sample sizes can also impede the establishment of qualitative reliability and validity, impacting the robustness of the study findings . Overall, the use of small sample sizes in qualitative research can limit the richness and comprehensiveness of the data collected, affecting the overall quality and credibility of the research outcomes.

Small sample sizes in research, such as in IPF studies, can lead to various limitations and biases. These include low statistical power, potential publication bias, overestimation of effects, and sampling error leading to bias in meta-analysis results. Small trials in elite football research face challenges due to the conflict between small participant numbers and the importance of detecting even tiny effects. Additionally, small sample sizes in probability of detection (POD) analysis can introduce risks due to biased samples, leading to overestimation of POD. It is crucial to consider these limitations and biases associated with small sample sizes in IPF research to ensure the reliability and validity of study findings.

A small sample size can significantly impact the power of a study. In research, statistical power is the likelihood of correctly rejecting a false null hypothesis. With a small sample size, the study may lack the ability to detect true effects or relationships, leading to low statistical power. This increases the chances of false negatives, where real effects go undetected. Moreover, small sample sizes can result in inflated effect sizes, making findings seem more significant than they are in reality. Adequate sample sizes are crucial for ensuring the reliability and generalizability of study results, as they directly influence the study's ability to detect true effects and relationships with accuracy. Therefore, careful consideration and planning of sample sizes are essential to enhance the power and validity of research findings.

Trending Questions

Self-fulfilling prophecies (SFPs) are prevalent in educational settings, significantly influencing student outcomes based on teachers' expectations. Research indicates that these expectations can shape students' academic performance, motivation, and self-esteem, creating a cycle where beliefs manifest into reality. ## Impact of Teacher Expectations - Teachers' beliefs about students can lead to varying academic achievements. Inaccurate high expectations correlate with improved performance, while low expectations can hinder achievement. - A study found that teachers' attitudes towards special education students affected their assessments, although the relationship was complex and not statistically significant. ## Mechanisms of SFPs - SFPs operate through labeling and feedback, where teachers' perceptions can alter their interactions with students, ultimately affecting student motivation and self-efficacy. - The social cognitive theory underlines how these expectations can create a feedback loop, reinforcing students' beliefs about their capabilities. While the evidence supports the existence of SFPs, some studies suggest that the relationship between teacher expectations and student performance may not always be straightforward, indicating a need for further exploration into the complexities of these dynamics.

Misconceptions surrounding statistical significance are widespread and significantly impact scientific research. Many researchers overemphasize P values, often interpreting them as definitive proof of effect, which can lead to erroneous conclusions and poor reproducibility of findings. ## Misinterpretation of P Values - A common misconception is that a P value below 0.05 guarantees a meaningful effect, neglecting the importance of effect size. - Researchers frequently engage in "P-hacking," manipulating data analysis to achieve statistically significant results. ## Small Sample Size Issues - Statistically significant results from small samples can mislead researchers into believing there is a real effect, despite the high risk of Type I errors. - Small samples often lack the power to detect true effects, leading to overconfidence in findings. ## Recommendations for Improvement - Experts suggest adopting alternative statistical methods, such as Bayesian statistics, to mitigate reliance on P values. - Emphasizing comprehensive reporting and careful planning of statistical analyses can enhance the integrity of research findings. Despite these prevalent misconceptions, some argue that statistical significance still holds value in hypothesis testing, provided it is interpreted with caution and in conjunction with other metrics.

Chemical hazards on construction sites pose significant risks to workers and the environment. These hazards arise from various materials and practices, including the use of toxic compounds in construction materials and improper handling of hazardous substances. Understanding these risks is crucial for improving safety protocols and minimizing exposure. ## Types of Chemical Hazards - **Hazardous Compounds**: Common materials like concrete, plastics, and asbestos often contain harmful chemicals such as arsenic, lead, and chromium, which can leach into the environment. - **Toxic Chemicals**: Construction sites frequently utilize toxic agents that can adversely affect worker health, leading to long-term health issues. ## Exposure Risks - **Worker Behavior**: Unsafe actions and inadequate site conditions contribute significantly to chemical exposure, with improper storage and labeling of hazardous materials being critical factors. - **Perception of Risk**: Many workers underestimate the dangers of chemical exposure, which can lead to complacency regarding safety measures. While the construction industry is aware of these hazards, the challenge remains in effectively managing and mitigating risks to ensure worker safety and environmental protection.

To ensure a system's scalability and maintainability, several design choices must be strategically implemented. These choices encompass architectural frameworks, access techniques, and maintenance considerations that collectively enhance performance and longevity. ## Scalability Considerations - **Hybrid Access Techniques**: Utilizing Optical Code Division Multiple Access (OCDMA) combined with Time Division Multiplexing (OTDM) significantly increases network scalability, allowing for a higher number of simultaneous users without compromising data rates. - **Distributed Multi-Agent Systems**: Designing systems with distributed nodes enables dynamic reconfiguration, allowing agents to join or leave without degrading performance, thus enhancing scalability. ## Maintainability Strategies - **Architectural Patterns**: Implementing standard architectural patterns is crucial for improving software maintainability. These patterns provide a robust foundation that can adapt to changes and upgrades over time. - **Design for Accessibility**: Incorporating design factors that facilitate easy access to components can reduce maintenance time and costs, as demonstrated in the case of steam boiler systems. While these design choices focus on scalability and maintainability, it is essential to balance them with initial reliability and performance metrics to avoid potential pitfalls in system design.

The challenges in instructional practices are multifaceted, impacting both teachers and instructional leaders. These challenges stem from various factors, including resource limitations, technological advancements, and the need for effective leadership. ## Resource Limitations - Many teachers face significant constraints due to heavy workloads, which limit their time for developing instructional materials. - Inadequate resources and insufficient administrative support hinder effective instructional leadership, leading to poor performance in schools. ## Technological Advancements - Teachers often struggle to keep pace with rapid technological changes, which necessitate ongoing training and support to enhance their skills in instructional material development. ## Leadership Challenges - Instructional leaders frequently encounter internal challenges, such as limited experience and knowledge, which affect their ability to guide teachers effectively. - External pressures, including negative attitudes from parents and community members, further complicate the implementation of instructional practices. While these challenges are significant, they also present opportunities for improvement through targeted training, resource allocation, and enhanced collaboration among educational stakeholders. Addressing these issues is crucial for fostering effective instructional practices and improving educational outcomes.

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here .

Loading metrics

Open Access

Peer-reviewed

Research Article

A fault diagnosis method based on an improved diffusion model under limited sample conditions

Roles Conceptualization, Methodology, Resources, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

Affiliations Key Laboratory of Networked Control Systems, Chinese Academy of Sciences, Shenyang, China, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China

Roles Conceptualization, Methodology, Software, Validation, Visualization, Writing – original draft

* E-mail: [email protected]

ORCID logo

Roles Data curation, Formal analysis, Validation, Visualization

Roles Formal analysis, Funding acquisition, Investigation, Project administration, Supervision

Affiliations Key Laboratory of Networked Control Systems, Chinese Academy of Sciences, Shenyang, China, State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, Liaoning Province, China

Roles Investigation, Project administration, Resources

Affiliation PipeChina Institute of Science and Technology, Langfang, China

  • Qiushi Wang, 
  • Zhicheng Sun, 
  • Yueming Zhu, 
  • Dong Li, 

PLOS

  • Published: September 3, 2024
  • https://doi.org/10.1371/journal.pone.0309714
  • Reader Comments

Fig 1

As a critical component in mechanical systems, the operational status of rolling bearings plays a pivotal role in ensuring the stability and safety of the entire system. However, in practical applications, the fault diagnosis of rolling bearings often encounters limitations due to the constraint of sample size, leading to suboptimal diagnostic accuracy. This article proposes a rolling bearing fault diagnosis method based on an improved denoising diffusion probability model (DDPM) to address this issue. The practical value of this research lies in its ability to address the limitation of small sample sizes in rolling bearing fault diagnosis. By leveraging DDPM to generate one-dimensional vibration data, the proposed method significantly enriches the datasets and consequently enhances the generalization capability of the diagnostic model. During the model training process, we innovatively introduce the feature differences between the original vibration data and the predicted vibration data generated based on prediction noise into the loss function, making the generated data more directional and targeted. In addition, this article adopts a one-dimensional convolutional neural network (1D-CNN) to construct a fault diagnosis model to more accurately extract and focus on key feature information related to faults. The experimental results show that this method can effectively improve the accuracy and reliability of rolling bearing fault diagnosis, providing new ideas and methods for fault detection and prevention in industrial applications. This advancement in diagnostic technology has the potential to significantly reduce the risk of system failures, enhance operational efficiency, and lower maintenance costs, thus contributing significantly to the safety and efficiency of mechanical systems.

Citation: Wang Q, Sun Z, Zhu Y, Li D, Ma Y (2024) A fault diagnosis method based on an improved diffusion model under limited sample conditions. PLoS ONE 19(9): e0309714. https://doi.org/10.1371/journal.pone.0309714

Editor: Lei Zhang, Beijing Institute of Technology, CHINA

Received: April 29, 2024; Accepted: August 14, 2024; Published: September 3, 2024

Copyright: © 2024 Wang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All dataset files are available from https://engineering.case.edu/bearingdatacenter/download-data-file and doi: 10.3390/s130608013 .

Funding: Applied Basic Research Project of Liaoning Province, China (Grant No. 2023JH2/101300183). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

In a wide range of applications, such as industrial manufacturing, aerospace and automotive engineering, rolling bearings play a key and indispensable role. However, due to the complexity and variability of their operating environment, as well as possible improper maintenance and other problems, rolling bearings often become the most common failure components in rotating machinery [ 1 ]. As a core component in mechanical equipment, the health condition of rolling bearings has a profound impact on the performance and stability of the entire system [ 1 , 2 ]. Therefore, timely and accurate fault diagnosis of rolling bearings is an indispensable part of ensuring the stable operation of mechanical equipment [ 3 ].

The process of fault diagnosis for rolling bearings typically encompasses a variety of methods and technologies. Currently, these methods and techniques can be systematically grouped into three main categories: model-based diagnostic methods, data-based analysis techniques [ 4 ], and hybrid integrated diagnostic strategies [ 5 , 6 ]. Model-based diagnostic methods aim to simulate the actual running state and potential failure modes of bearings by constructing physical or mathematical models of the bearings, so as to realize fault prediction and accurate diagnosis of bearings in operation [ 7 ]. On the other hand, data-based analysis technology relies on bearing operating data collected by sensors in real time, and with the help of data analysis tools and pattern recognition algorithms, a comprehensive assessment of the bearing’s health status is carried out. However, with the increasing complexity of modern equipment, it has become increasingly difficult to construct models that can accurately reflect failure mechanisms, which to some extent limits the application of physical models in the field of fault diagnosis. Therefore, data-based fault diagnosis methods are currently favored as mainstream diagnostic techniques in practical applications due to their flexibility and practicality [ 8 ].

With rapid advancements in science and technology, the field of data-based fault diagnosis is experiencing unprecedented changes. With this wave of change, deep learning-based fault diagnosis methods have garnered significant interest and application [ 9 , 10 ]. This is attributed to their ability to automatically extract and process features from raw vibration data, showing notable potential for practical application [ 11 ]. Guo et al. [ 12 ] improves the comprehensiveness and accuracy of fault diagnosis by fusing the time-domain and time-frequency-domain features of signals through parallel network deployment, while combining the anomalous attention mechanism of AT and the attributes of CBAM to form a dual attention mechanism. Chen et al. [ 13 ] proposed a bearing fault diagnosis algorithm based on multisource sensor data and an improved long short-term memory network (LSTM), which can effectively fuse features and cope with noise interference, improving diagnostic accuracy. Shao et al. [ 14 ] proposed a high-precision deep learning algorithm for machine fault diagnosis based on transfer learning, which converts sensor data into images, extracts features through pretrained networks, and fine tunes the network architecture. Chen et al. [ 15 ] combined CNN with transfer learning and proposed a transferable CNN algorithm that reuses prior knowledge to improve the learning performance of deep models in mechanical fault diagnosis. Xiao et al. [ 16 ] proposed a fault diagnosis algorithm based on a graph neural network (GNN). The algorithm constructs a graph through sample similarity, uses a GNN for feature mapping, fuses neighbor feature information, and then inputs the mapped samples into the basic detector for fault detection. Meanwhile, the attention mechanism, which has made a large splash in the field of natural language processing and computer vision, is now being actively explored and applied to the field of fault diagnosis by researchers, such as channel attention [ 17 ], spatial attention [ 18 ], self-attention [ 19 ], CBAM [ 20 ], and coordinate attention [ 21 ], which have led to new breakthroughs in fault diagnosis technology [ 22 ].

However, the above research relies heavily on laboratory environments where faults are artificially created to generate large amounts of fault data. Conversely, in real production environments, rolling bearings are shut down immediately when they fail, and companies tend to adopt preventive maintenance, which makes it difficult to collect fault data. In the realm of fault diagnosis [ 23 ], a large amount of normal data and a relatively small amount of fault data often occur during the monitoring process [ 24 ]. To solve this challenge, researchers have made many efforts and attempts. Yan et al. [ 25 ] proposed a deep regularized variational autoencoder (DRVAE) fault diagnosis method to optimize the VAE through regularization techniques, solve its overfitting problem, and enhance the feature learning capability of the model. Zhao et al. [ 26 ] proposed an improved generative adversarial network (GAN), which optimized the training process and improved the diagnostic performance by introducing auxiliary classifiers and autoencoder-based similarity estimation. Qiu et al. [ 27 ] proposed an auxiliary classifier generative adversarial network (ACGAN) to achieve controllable generation of category labels. Zhang et al. [ 28 ] proposed a CVAE-GAN model that enhances the GAN generator stability via a VAE encoder and introduces sample labeling to improve the training efficiency. When comparing the above methods, the GAN is deficient due to its instability in the training process and its susceptibility to pattern collapse, while the VAE is limited by the limited diversity of its generated data [ 29 ]. In contrast, a generative model called the denoising diffusion probabilistic model (DDPM) performs well in improving the quality and diversity of generated samples, and its training process is more stable and reliable. Cui et al. [ 30 ] proposed a fault diagnosis algorithm based on a symmetrized dot pattern (SDP) and DDPM, which converts one-dimensional vibration data into SDP and uses DDPM to generate samples to construct a datasets with significant and balanced features, thereby achieving accurate fault diagnosis. Yang et al. [ 31 ] generated more realistic and diverse generated samples based on DDPM and time-frequency maps of vibration data and mixed the real data with the generated data for fault diagnosis. However, methods using image data lead to the loss of temporal features when processing vibration data and lack additional guidance for the diffusion generation process.

Although significant results have been achieved in small-sample fault diagnosis, obtaining high-quality fault samples remains challenging. In particular, most current methods focus on sample generation from image data, while fault sample generation techniques for raw one-dimensional data are still underdeveloped, which limits further improvements in fault diagnosis performance. Therefore, this paper proposes an improved DDPM fault diagnosis method based on one-dimensional vibration data, aiming to solve the above problems and improve diagnostic performance.

The contributions of this paper can be summarized as follows.

  • To address the problem of low model accuracy caused by insufficient fault data in real-world rolling bearing fault diagnosis, an improved 1D-DDPM model is proposed for generating fault samples.
  • A feature difference loss function is introduced into the training process of the 1D-DDPM model to make the generated data more targeted and to improve the quality of the generated samples.
  • Combining the data generation ability of 1D-DDPM with the feature extraction ability of the convolutional neural network, a 1D-DDPM-CNN fault diagnosis method is constructed; experiments show that this method achieves effective and accurate fault diagnosis on limited-sample datasets.

The paper is organized as follows: the Methods section presents the methodology employed in this study, the Results and discussion section reports and discusses the experimental results, and the Conclusion section concludes the paper.

This article proposes a rolling bearing fault diagnosis method that integrates one-dimensional DDPM and CNN, aiming to solve the problem of scarce fault data in real production environments. This method first uses one-dimensional DDPM to generate fault data, then mixes the original data with the generated data, and finally uses one-dimensional CNN for fault diagnosis. The specific process of the algorithm is shown in Fig 1 .


https://doi.org/10.1371/journal.pone.0309714.g001

Diffusion models have been explored by researchers since 2015, and their theory and applications have gradually matured over several years of development. In 2020, Jonathan Ho and colleagues introduced the DDPM on the basis of this earlier work, with refinements to its mathematical formulation, bringing new momentum to related fields. At present, however, the DDPM is mainly applied to image data generation, and its application to the key area of fault diagnosis, especially fault diagnosis based on one-dimensional vibration data, remains insufficient. As the direct carrier of equipment vibration information, raw one-dimensional vibration data contain rich detail. Therefore, this paper explores the application of the DDPM to one-dimensional vibration data to exploit its unique advantages in the field of fault diagnosis.

The structure of 1D-DDPM is shown in Fig 2 .


https://doi.org/10.1371/journal.pone.0309714.g002

The forward process of 1D-DDPM gradually adds Gaussian noise to the original vibration data x_0 over T steps, with each step defined by q(x_t | x_{t−1}) = N(x_t; √(1−β_t) x_{t−1}, β_t I), where β_t is the noise schedule at step t.


The reverse process gradually denoises data drawn from a standard normal distribution to generate one-dimensional vibration data. However, because determining the true data distribution would require the complete dataset, q ( x t−1 | x t ) cannot be computed directly. Therefore, a neural network parameterized by θ is used to approximate this distribution: p θ ( x t−1 | x t ) is assumed to be Gaussian, with its mean μ θ and variance Σ θ both taking x t and t as inputs.

p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), Σ_θ(x_t, t))

In the reverse diffusion process, given x_t and x_0, x_{t−1} can be calculated from the posterior diffusion conditional probability.

q(x_{t−1} | x_t, x_0) = N(x_{t−1}; μ̃_t(x_t, x_0), β̃_t I), where μ̃_t and β̃_t are determined by the noise schedule {β_t}.

The specific calculation formulas for each feature are listed in Table 1 .


https://doi.org/10.1371/journal.pone.0309714.t001

One-dimensional convolutional neural networks (1D-CNNs), as variants of convolutional neural networks, excel at capturing local relationships in sequence data. They not only reduce model complexity and avoid tedious manual feature extraction, but also substantially reduce the number of required weights. Therefore, this article uses a 1D-CNN to process the one-dimensional vibration data; its structural diagram is shown in Fig 3 .


https://doi.org/10.1371/journal.pone.0309714.g003


After the convolutional and pooling layers extract features from the input signal, the fully connected layer maps the output of the pooling layer to the final output of the model, and the softmax function converts this output into class probabilities to complete the fault diagnosis task.

Results and discussion

To comprehensively verify the effectiveness and superiority of the proposed method, we designed an extensive set of experiments. Two representative datasets were chosen to test the generalization ability and stability of the method. In addition, to evaluate the proposed method objectively, we conducted comparative experiments with popular generative models, namely the variational autoencoder (VAE) and the generative adversarial network (GAN). We also compared against a CNN fault diagnosis algorithm that does not use a generative model, to highlight the advantages of our approach in dealing with data scarcity.

For the hardware configurations of the experiments, we chose a high-performance computing environment, including a Windows 10 operating system, an RTX 3090 GPU, and a Core i7-12700K processor. These configurations provide sufficient computing resources for the experiments and ensure the accuracy and reliability of the results.

During the experiments, all the fault diagnosis models were trained iteratively for 100 training cycles, and a learning rate of 0.001 and an Adam optimizer were used for parameter optimization and model tuning to ensure that the models were fully trained and converged. We chose Python as the development language and relied on the powerful deep learning framework TensorFlow to build and train the models to ensure the smooth execution of the experiments and the accuracy of the results.

Case 1: CWRU bearing dataset

Data description.

The CWRU bearing fault diagnosis dataset, provided by Case Western Reserve University, is widely used for bearing fault detection and diagnosis. It contains vibration signal data covering the normal operating state of bearings and three common fault states, namely inner race faults, outer race faults, and rolling element faults. These fault states simulate different failures that may occur in real industrial settings.

In this paper, we selected the 48 kHz drive-end bearing fault data from the CWRU dataset as the experimental data. Specifically, three fault types were selected: inner race fault, outer race fault, and rolling element fault, at a speed of 1730 rpm, a load of 3 hp, and a fault diameter of 0.021 inches (21 mils). For the outer race fault, we selected the data with the fault located at the 12 o'clock position. The dataset is described further in Table 2 .


https://doi.org/10.1371/journal.pone.0309714.t002

To construct the experimental data, we combined the fault data with the normal data. For each fault type, we randomly selected 200 samples, each containing 1024 vibration data points: 100 samples were used as training samples for the generative model, and the other 100 samples were used as the test set to evaluate the performance of the fault diagnosis model. For the normal class, since no data generation is needed, 200 randomly selected samples were used for training and a further 100 samples for testing.

To address the difficulty classifiers have in handling categorical attributes, we adopted one-hot encoding instead of real-number encoding. This encoding converts each attribute value into a binary vector in which only one element is 1 and the rest are 0, allowing the classifier to process and learn the class labels more effectively.

With this experimental design and data preprocessing, the CWRU bearing dataset can be used for research on bearing fault detection and diagnosis, helping us extract fault-related features and train models to automatically identify and classify different types of bearing faults.

The performance of the 1D-DDPM.

The accuracy of deep learning-based fault diagnosis methods depends on the number of samples, and the 1D-DDPM sample generation method proposed in this paper can effectively supplement fault samples. However, although the generated samples may be close to real samples in their statistical characteristics, they are produced by algorithmic simulation: they cannot fully replicate the complexity and diversity of faults in real environments and cannot completely replace real samples. To evaluate the impact of the number of generated samples on fault diagnosis accuracy, a series of experiments was conducted.

In these experiments, we first generated three types of fault samples using 1D-DDPM and mixed them with real samples to construct a training set. To ensure the consistency of the results, all data augmentation is performed only on the training set, while the test set contains only the original real samples. The purpose of doing so is to eliminate the interference of other factors when evaluating the impact of the generated sample size on accuracy. In addition, the number of normal samples used for training is not fixed but matches the number of faulty samples. This approach can ensure a balance between normal and faulty samples in the experiment, avoiding the impact of data imbalance on the results. Each experiment was repeated 10 times, and the average accuracy was recorded in Table 3 .


https://doi.org/10.1371/journal.pone.0309714.t003

Fifteen experiments were conducted: experiments 1 to 5 used only real data, experiments 6 to 10 mixed generated data with real data, and experiments 11 to 15 used only generated data for fault diagnosis.

From experiments 1 to 5, it can be observed that accuracy decreases as the number of training samples decreases, indicating that when no samples are generated, the sample count has a significant impact on accuracy.

Experiments 6 to 10 show that accuracy improves as the number of generated samples increases; the highest fault diagnosis accuracy is obtained when the ratio of real samples to generated samples is 1:1.5. Generating excessive fault data does not further improve the accuracy of the diagnostic model, as these samples may contain redundant information in addition to fault information. This further indicates that generated samples cannot completely replace real samples.

Through experiments 11 to 15, it can be observed that using only generated samples for training without using real samples resulted in a decrease in accuracy compared to experiments 1 to 5. This indicates that although generating samples can fit the distribution of real data, there are still certain limitations in fault diagnosis.

In summary, mixing generated data with real data yields better fault diagnosis performance than either data source alone. However, although the generated data fit the distribution of the real data well, a gap remains in fault diagnosis performance compared with real data. This gap is mainly reflected in two aspects: noise and dynamic characteristics. Regarding noise, the generated data are similar to the real data in distribution but lack the randomness and complex noise of real-world data, which affects the robustness of the fault diagnosis model. Regarding dynamic characteristics, real data often contain complex dynamic processes that are difficult for the generative model to capture fully, leading to a gap in diagnostic performance. The gap between the generated and original data is visualized in Fig 4 , which shows that the original data are more concentrated in distribution than the generated data.


https://doi.org/10.1371/journal.pone.0309714.g004

The method of mixing generated data with real data performs well in improving fault diagnosis performance, and the effect is optimal when the ratio of real samples to generated samples is 1:1.5. Although sample generation effectively compensates for insufficient samples, attention must still be paid to the remaining gap in fault diagnosis performance between generated and real data.

Fault diagnosis of rolling bearings based on 1D-DDPM-CNN

To further evaluate the feasibility and effectiveness of the proposed method, comparative experiments were conducted. First, different sample generation methods were used to generate fault samples, with a ratio of real samples to generated samples of 1:1.5. Next, the mixed dataset was used to train the fault diagnosis model. Table 4 describes the data in detail.


https://doi.org/10.1371/journal.pone.0309714.t004

Through the above experimental setup, we evaluate the impact of different sample generation methods on fault diagnosis models.

In this paper, a fault diagnosis model is constructed based on a 1D-CNN, and the hyperparameter settings are shown in Table 5 .


https://doi.org/10.1371/journal.pone.0309714.t005

The 1D-CNN uses 32 convolution kernels, each of length 32. For the pooling operations, a pooling window of size 3 is used. To reduce overfitting, the dropout rate is set to 0.2, which randomly discards a portion of the neurons in the network. The initial learning rate is set to 0.001 and is halved every 5 iterations. The batch size is 16 and the number of epochs is 30. Categorical cross-entropy is used as the loss function to update the model parameters. Fig 5 shows the fault diagnosis training results for the different generative models.


https://doi.org/10.1371/journal.pone.0309714.g005

The figure shows that after 10 iterations, the training accuracy and loss of the four methods converged, with training accuracy reaching 100% in all cases. However, compared with the algorithm that uses only real samples, the algorithms that mix generated samples with real samples converge faster and achieve higher accuracy. In particular, the 1D-DDPM-based algorithm used in this article performs best among these methods, demonstrating the effectiveness and superiority of our approach.

Table 6 shows the accuracy, recall, and F1 values of the four fault diagnosis methods on the test set. The proposed method performs best on all three indicators, further confirming its effectiveness.


https://doi.org/10.1371/journal.pone.0309714.t006

To further validate the effectiveness of the developed method, Fig 6 shows the confusion matrices obtained when a CNN model is used for fault diagnosis after fault samples are supplemented by the three generative models (1D-DDPM, 1D-VAE, and 1D-GAN) and mixed with real samples. The y-axis of each confusion matrix represents the actual labels and the x-axis the predicted labels; the main diagonal elements give the number of correctly classified samples in each category. The diagnostic accuracy for each running state is shown in Table 7 .


https://doi.org/10.1371/journal.pone.0309714.g006


https://doi.org/10.1371/journal.pone.0309714.t007

All four algorithms recognize normal samples with 100% accuracy. For the fault classes, however, where generated samples were mixed with real samples, recognition accuracy did not always reach 100%, indicating that although the generative models effectively supplement the missing fault samples, the quality of the generated samples still falls short of that of real samples. For the three fault classes, the algorithm proposed in this paper achieves the highest accuracy, further demonstrating its effectiveness and superiority.

Case 2: JNU bearing dataset

The JNU bearing dataset, from Jiangnan University, is a comprehensive collection of bearing running-state data [ 32 ]. It was generated on a centrifugal fan system test bed equipped with a Mitsubishi SB-JR induction motor, in which a fault was intentionally introduced into one of the bearings. Accelerometers were positioned perpendicular to the bearings to capture the vibration signals. The dataset covers four running states: normal, inner race fault, outer race fault, and rolling element fault. The vibration acceleration signals were captured at a sampling frequency of 50 kHz at rotational speeds of 600, 800, and 1000 rpm, providing a rich resource for analysis. For this study, the four states at 600 rpm were selected for the fault diagnosis experiments; the details are given in Table 8 .


https://doi.org/10.1371/journal.pone.0309714.t008

Fault diagnosis of rolling bearings based on 1D-DDPM-CNN.

The optimal ratio verified on the CWRU dataset, i.e., 1:1.5 real to generated data, was adopted. We generated 150 simulated samples for each of the three fault types from 100 real samples per type and added 250 normal samples, which together form the fault diagnosis training set. For the test set, each running state contains 100 samples so that model performance can be evaluated comprehensively. In this experiment, we tested the proposed 1D-DDPM-CNN model, a VAE-CNN model, a GAN-CNN model, and a traditional CNN model trained without generated samples. The results are shown in Table 9 , and the corresponding confusion matrices are shown in Fig 7 . The proposed algorithm achieves the best accuracy, recall, and F1, which fully verifies its efficiency and practicability in fault diagnosis tasks.


https://doi.org/10.1371/journal.pone.0309714.g007


https://doi.org/10.1371/journal.pone.0309714.t009

Comparison with other methods.

To further evaluate the performance of the proposed method and its effectiveness in fault diagnosis, we conducted a comparative study against five existing, representative methods: two generative models (DCGAN and ERGAN) and three nongenerative models (DTL-Res2Net-CBAM, SMOTE, and DCNN). The experiments used the widely recognized CWRU dataset to ensure reliable and broadly applicable results. For the generative models, 100 real samples plus the corresponding generated samples were used for each fault type to construct the training set; for the nongenerative models, only the 100 real samples were used. The test set contained another 100 real samples per class to evaluate and compare all models. The results are shown in Table 10 .


https://doi.org/10.1371/journal.pone.0309714.t010

The experimental results show that the proposed method achieves the highest accuracy among both the generative and nongenerative baselines, further confirming its effectiveness.

The rolling bearing fault diagnosis method based on the improved denoising diffusion probability model (DDPM) proposed in this article has achieved significant results under limited sample conditions. By using DDPM to generate one-dimensional vibration data, the problem of insufficient data has been effectively solved, and the generalization ability of the model has improved. At the same time, introducing feature differences into the loss function makes the generated data more directional and improves diagnostic accuracy. In addition, the CNN model can better capture key features and enhance the robustness of the model.

However, there are still some shortcomings in this study. Under extremely sparse sample conditions, the performance of this method has not been fully validated, and its adaptability needs to be further strengthened. Moreover, this method also has certain limitations in terms of model transfer performance, making it difficult to directly apply to the fault diagnosis of rolling bearings of different models or working environments.

In future work, we will focus on addressing the issues mentioned above. First, we will optimize the model structure to enhance its universality and generalization ability so that it adapts more effectively to various fault modes. Second, we will explore fault diagnosis methods designed specifically for extremely sparse sample conditions to improve performance in such scenarios. In addition, we will investigate strategies to improve the transferability of the model so that it can be applied more widely to fault diagnosis of rolling bearings in different scenarios.




Sample size: how many participants do I need in my research? *

Jeovany Martínez-Mesa

1 Latin American Cooperative Oncology Group - Porto Alegre (RS), Brazil.

David Alejandro González-Chica

2 Universidade Federal de Santa Catarina (UFSC) - Florianópolis (SC), Brazil.

João Luiz Bastos

Renan Rangel Bonamigo

3 Universidade Federal de Ciências da Saúde de Porto Alegre (UFCSPA) - Porto Alegre (RS), Brazil.

Rodrigo Pereira Duquia

The importance of estimating sample sizes is rarely understood by researchers when planning a study. This paper aims to highlight the centrality of sample size estimations in health research. Examples that help in understanding the basic concepts involved in their calculation are presented. The scenarios covered are based more on epidemiological reasoning than on mathematical formulae. Proper calculation of the number of participants in a study diminishes the likelihood of errors, which are often associated with adverse consequences in economic, ethical and health terms.

INTRODUCTION

Investigations in the health field are oriented by research problems or questions, which should be clearly defined in the study project. Sample size calculation is an essential item to be included in the project to reduce the probability of error, respect ethical standards, define the logistics of the study and, last but not least, improve its success rates, when evaluated by funding agencies.

Let us imagine that a group of investigators decides to study the frequency of sunscreen use and how the use of this product is distributed in the "population". In order to carry out this task, the authors define two research questions, each involving a distinct sample size calculation: 1) What is the proportion of people that use sunscreen in the population?; and 2) Are there differences in the use of sunscreen between men and women, between individuals that are white or of another skin color group, between the wealthiest and the poorest, or between people with more and fewer years of schooling? Before doing the calculations, it will be necessary to review a few fundamental concepts and identify the parameters required to determine them.

WHAT DO WE MEAN, WHEN WE TALK ABOUT POPULATIONS?

First of all, we must define what a population is. A population is a group of individuals restricted to a geographical region (neighborhood, city, state, country, continent etc.), or to certain institutions (hospitals, schools, health centers etc.); that is, a set of individuals that have at least one characteristic in common. The target population corresponds to the portion of the previously mentioned population about which one intends to draw conclusions, that is to say, the part of the population whose characteristics are the object of interest of the investigator. Finally, the study population is that which will actually take part in the study, which will be evaluated and will allow conclusions to be drawn about the target population, as long as it is representative of the latter. Figure 1 demonstrates how these concepts are interrelated.


Graphic representation of the concepts of population, target population and study population

We will now separately consider the required parameters for sample size calculation in studies that aim at estimating the frequency of events (prevalence of health outcomes or behaviors, for example), to test associations between risk/protective factors and dichotomous health conditions (yes/no), as well as with health outcomes measured in numerical scales. 1 The formulas used for these calculations may be obtained from different sources - we recommend using the free online software OpenEpi ( www.openepi.com ). 2

WHICH PARAMETERS DOES SAMPLE SIZE CALCULATION DEPEND UPON FOR A STUDY THAT AIMS AT ESTIMATING THE FREQUENCY OF HEALTH OUTCOMES, BEHAVIORS OR CONDITIONS?

When approaching the first research question defined at the beginning of this article (What is the proportion of people that use sunscreen?), the investigators need to conduct a prevalence study. In order to do this, some parameters must be defined to calculate the sample size, as demonstrated in chart 1 .

Description of different parameters to be considered in the calculation of sample size for a study aiming at estimating the frequency of health outcomes, behaviors or conditions

Population size: Total population size from which the sample will be drawn and about which researchers will draw conclusions (target population). Information regarding population size may be obtained based on secondary data from hospitals, health centers, census surveys (population, schools etc.). The smaller the target population (for example, less than 100 individuals), the larger the sample size will proportionally be.

Expected prevalence of outcome or event of interest: The study outcome must be a percentage, that is, a number that varies from 0% to 100%. Information regarding expected prevalence rates should be obtained from the literature or by carrying out a pilot-study. When this information is not available in the literature or a pilot-study cannot be carried out, the value that maximizes sample size is used (50% for a fixed value of sample error).

Sample error for estimate: The value we are willing to accept as error in the estimate obtained by the study. The smaller the sample error, the larger the sample size and the greater the precision. In health studies, values between two and five percentage points are usually recommended.

Significance level: It is the probability that the expected prevalence will be within the error margin being established. The higher the confidence level (greater expected precision), the larger will be the sample size. This parameter is usually fixed as 95%.

Design effect: It is necessary when the study participants are chosen by cluster selection procedures. This means that, instead of the participants being individually selected (simple, systematic or stratified sampling), they are first divided and randomly selected in groups (census tracts, neighborhood, households, days of the week, etc.) and later the individuals are selected within these groups. Thus, greater similarity is expected among the respondents within a group than in the general population. This generates loss of precision, which needs to be compensated by a sample size adjustment (increase). The principle is that the total estimated variance may have been reduced as a consequence of cluster selection. The value of the design effect may be obtained from the literature. When not available, a value between 1.5 and 2.0 may be determined and the investigators should evaluate, after the study is completed, the actual design effect and report it in their publications. The greater the homogeneity within each group (the more similar the respondents are within each cluster), the greater the design effect will be and the larger the sample size required to increase precision. In studies that do not use cluster selection procedures (simple, systematic or stratified sampling), the design effect is considered as null or 1.0.

Chart 2 presents some sample size simulations, according to the outcome prevalence, sample error and the type of target population investigated. The same basic question was used in this table (prevalence of sunscreen use), but considering three different situations (at work, while doing sports or at the beach), as in the study by Duquia et al. conducted in the city of Pelotas, state of Rio Grande do Sul, in 2005. 3

Sample size calculation to estimate the frequency (prevalence) of sunscreen use in the population, considering different scenarios but keeping the significance level (95%) and the design effect (1.0) constant

   
   
Health center users investigated in a single day (population = 100)90 59 96789780
All users in the area covered by a health center (population size = 1,000)464 122 687260707278
All users from the areas covered by all health centers in a city (population size = 10,000)796 137 17943381937370
The entire city population (N = 40.000)847 138 20723472265381

p.p.= percentage points

The calculations show that, by holding the sample error and the significance level constant, the higher the expected prevalence, the larger will be the required sample size. However, when the expected prevalence surpasses 50%, the required sample size progressively diminishes - the sample size for an expected prevalence of 10% is the same as that for an expected prevalence of 90%.

The investigator should also define beforehand the precision level to be accepted for the investigated event (sample error) and the confidence level of this result (usually 95%). Chart 2 demonstrates that, holding the expected prevalence constant, the higher the precision (smaller sample error) and the higher the confidence level (in this case, 95% was considered for all calculations), the larger also will be the required sample size.

Chart 2 also demonstrates that there is a direct relationship between the target population size and the number of individuals to be included in the sample. Nevertheless, when the target population size is sufficiently large, that is, surpasses an arbitrary value (for example, one million individuals), the resulting sample size tends to stabilize. The smaller the target population, the proportionally larger the sample will be; in some cases, the sample may even correspond to the total number of individuals from the target population - in these cases, it may be more convenient to study the entire target population, carrying out a census survey, rather than a study based on a sample of the population.

SAMPLE CALCULATION TO TEST THE ASSOCIATION BETWEEN TWO VARIABLES: HYPOTHESES AND TYPES OF ERROR

When the study objective is to investigate whether there are differences in sunscreen use according to sociodemographic characteristics (such as, for example, between men and women), the existence of association between explanatory variables (exposure or independent variables, in this case sociodemographic variables) and a dependent or outcome variable (use of sunscreen) is what is under consideration.

In these cases, we need first to understand what the hypotheses are, as well as the types of error that may result from their acceptance or refutation. A hypothesis is a "supposition arrived at from observation or reflection, that leads to refutable predictions". 4 In other words, it is a statement that may be questioned or tested and that may be falsified in scientific studies.

In scientific studies, there are two types of hypothesis: the null hypothesis (H 0 ) or original supposition that we assume to be true for a given situation, and the alternative hypothesis (H A ) or additional explanation for the same situation, which we believe may replace the original supposition. In the health field, H 0 is frequently defined as the equality or absence of difference in the outcome of interest between the studied groups (for example, sunscreen use is equal in men and women). On the other hand, H A assumes the existence of difference between groups. H A is called two-tailed when it is expected that the difference between the groups will occur in any direction (men using more sunscreen than women or vice-versa). However, if the investigator expects to find that a specific group uses more sunscreen than the other, he will be testing a one-tailed H A .

In the sample investigated by Duquia et al., the frequency of sunscreen use at the beach was greater in men (32.7%) than in women (26.2%). 3 Although this is what was observed in the sample, that is, men do wear more sunscreen than women, the investigators must decide whether they refute or accept H 0 in the target population (which contends that there is no difference in sunscreen use according to sex). Given that the entire target population is hardly ever investigated to confirm or refute the difference observed in the sample, the authors have to be aware that, independently of their decision (accepting or refuting H 0 ), their conclusion may be wrong, as can be seen in figure 2 .


Types of possible results when performing a hypothesis test

In case the investigators conclude that both in the target population and in the sample sunscreen use is also different between men and women (rejecting H 0 ), they may be making a type I or Alpha error, which is the probability of rejecting H 0 based on sample results when, in the target population, H 0 is true (the difference between men and women regarding sunscreen use found in the sample is not observed in the target population). If the authors conclude that there are no differences between the groups (accepting H 0 ), the investigators may be making a type II or Beta error, which is the probability of accepting H 0 when, in the target population, H 0 is false (that is, H A is true) or, in other words, the probability of stating that the frequency of sunscreen use is equal between the sexes, when it is different in the same groups of the target population.

In order to accept or refute H 0 , the investigators need to previously define which is the maximum probability of type I and II errors that they are willing to incorporate into their results. In general, the type I error is fixed at a maximum value of 5% (0.05 or confidence level of 95%), since the consequences originated from this type of error are considered more harmful. For example, to state that an exposure/intervention affects a health condition, when this does not happen in the target population may bring about behaviors or actions (therapeutic changes, implementation of intervention programs etc.) with adverse consequences in ethical, economic and health terms. In the study conducted by Duquia et al., when the authors contend that the use of sunscreen was different according to sex, the p value presented (<0.001) indicates that the probability of not observing such difference in the target population is less that 0.1% (confidence level >99.9%). 3

Although the type II or Beta error is less harmful, it should also be avoided, since if a study contends that a given exposure/intervention does not affect the outcome, when this effect actually exists in the target population, the consequence may be that a new medication with better therapeutic effects is not administered or that some aspects related to the etiology of the damage are not considered. This is the reason why the value of the type II error is usually fixed at a maximum value of 20% (or 0.20). In publications, this value tends to be mentioned as the power of the study, which is the ability of the test to detect a difference, when in fact it exists in the target population (usually fixed at 80%, as a result of the 1-Beta calculation).

SAMPLE CALCULATION FOR STUDIES THAT AIM AT TESTING THE ASSOCIATION BETWEEN A RISK/PROTECTIVE FACTOR AND AN OUTCOME, EVALUATED DICHOTOMOUSLY

In cases where the exposure variables are dichotomous (intervention/control, man/woman, rich/poor etc.) and so is the outcome (negative/positive outcome, to use sunscreen or not), the required parameters for sample size calculation are those described in chart 3 . Following the previously mentioned example, it would be interesting to know whether sex, skin color, schooling level and income are associated with the use of sunscreen at work, while doing sports and at the beach. Thus, when the four exposure variables are crossed with the three outcomes, there are 12 different questions to be answered and, consequently, an equal number of sample size calculations to be performed. Using the information in the article by Duquia et al. 3 for the prevalence of exposures and outcomes, sample size calculations were simulated for each of these situations ( Chart 4 ).

Type I or Alpha error: It is the probability of rejecting H0 based on the sample when H0 is true in the target population. It is expressed by the p value and is usually fixed at a maximum of 5% (p<0.05). For sample size calculation, the confidence level may be adopted (usually 95%), calculated as 1-Alpha. The smaller the Alpha error (the greater the confidence level), the larger will be the sample size.

Statistical Power (1-Beta): It is the ability of the test to detect a difference in the sample when it exists in the target population. Calculated as 1-Beta. The greater the power, the larger the required sample size will be. A value between 80% and 90% is usually used.

Relationship between non-exposed/exposed groups in the sample: It indicates the existing relationship between non-exposed and exposed groups in the sample. For observational studies, the data are usually obtained from the scientific literature. In intervention studies, the value 1:1 is frequently adopted, indicating that half of the individuals will receive the intervention and the other half will be the control or comparison group. Some intervention studies may use a larger number of controls than of individuals receiving the intervention. The more distant this ratio is from one, the larger will be the required sample size.

Prevalence of outcome in the non-exposed group (percentage of positives among the non-exposed): Proportion of individuals with the disease (outcome) among those non-exposed to the risk factor (or that are part of the control group). Data usually obtained from the literature. When this information is not available but there is information on general prevalence/incidence in the population, this value may be used in the sample size calculation (values attributed to the control group in intervention studies) or estimated based on the following formula: PONE = pO / (pNE + (pE * PR)), where pO = prevalence of outcome; pNE = percentage of non-exposed; pE = percentage of exposed; PR = prevalence ratio (usually a value between 1.5 and 2.0).

Expected prevalence ratio: Relationship between the prevalence of disease in the exposed (intervention) group and the prevalence of disease in the non-exposed group, indicating how many times the prevalence is expected to be higher (or lower) in the exposed compared to the non-exposed group. It is the value that the investigators intend to find as HA, with the corresponding H0 equal to one (similar prevalence of the outcome in both exposed and non-exposed groups). For the sample size estimates, the expected outcome prevalence in the non-exposed group may be used, or the expected difference in prevalence between the exposed and the non-exposed groups. Usually, a value between 1.50 and 2.00 is used (exposure as a risk factor) or between 0.50 and 0.75 (protective factor). For intervention studies, the clinical relevance of this value should be considered. The smaller the prevalence ratio (the smaller the expected difference between the groups), the larger the required sample size.

Type of statistical test: The test may be one-tailed or two-tailed, depending on the type of HA. Two-tailed tests require larger sample sizes.

Ho - null hypothesis; Ha - alternative hypothesis

      
      
      
Female: 56%(E) n=1298n=388 n=487n=134 n=136n=28
Male:44%(NE) n=1738n=519 n=652n=179 n=181n=38
         
      
White: 82%(E) n=2630n=822 n=970n=276 n=275n=49
Other: 18%(NE) n=3520n=1100 n=1299n=370 n=368n=66
         
      
0-4 years: 25%(E) n=1340n=366 n=488n=131 n=138ND
>4 anos: 75%(NE) n=1795n=490 n=654n=175 n=184ND
         
      
≤133: 50%(E) n=1228n=360 n=458n=124 n=128n=28
>133: 50%(NE) n=1644n=480 n=612n=166 n=170n=36
         

E=exposed group; NE=non-exposed group; r=NE/E relationship; PONE=prevalence of outcome in the non-exposed group (percentage of positives in the non-exposed group), estimated based on the formula from chart 3 , considering a PR of 1.50; PR=prevalence ratio/incidence or expected relative risk; n=minimum necessary sample size; ND=value could not be determined, as the prevalence of the outcome in the exposed would be above 100%, according to the specified parameters.

Estimates show that studies with more power or that intend to find a difference of a lower magnitude in the frequency of the outcome (in this case, the prevalence rates) between exposed and non-exposed groups require larger sample sizes. For these reasons, in sample size calculations, an effect measure between 1.5 and 2.0 (for risk factors) or between 0.50 and 0.75 (for protective factors), and an 80% power are frequently used.

Considering the values in each column of chart 4 , we may also conclude that, when the non-exposed/exposed relationship moves away from one (similar proportions of exposed and non-exposed individuals in the sample), the sample size increases. For this reason, intervention studies usually work with the same proportion of individuals in the intervention and control groups. Upon analysis of the values on each line, it can be concluded that there is an inverse relationship between the prevalence of the outcome and the required sample size.

Based on these estimates, assuming that the authors intended to test all of these associations, it would be necessary to choose the largest estimated sample size (2,630 subjects). In case the required sample size is larger than the target population, the investigators may decide to perform a multicenter study, lengthen the period for data collection, modify the research question or face the possibility of not having sufficient power to draw valid conclusions.

Additional aspects need to be considered in the previous estimates to arrive at the final sample size, which may include the possibility of refusals and/or losses in the study (an additional 10-15%), the need for adjustments for confounding factors (an additional 10-20%, applicable to observational studies), the possibility of effect modification (which implies an analysis of subgroups and the need to duplicate or triplicate the sample size), as well as the existence of design effects (multiplication of sample size by 1.5 to 2.0) in case of cluster sampling.

SAMPLE CALCULATIONS FOR STUDIES THAT AIM AT TESTING THE ASSOCIATION BETWEEN A DICHOTOMOUS EXPOSURE AND A NUMERICAL OUTCOME

Suppose that the investigators intend to evaluate whether the daily quantity of sunscreen used (in grams), the time of daily exposure to sunlight (in minutes) or a laboratory parameter (such as vitamin D levels) differ according to the socio-demographic variables mentioned. In all of these cases, the outcomes are numerical variables (discrete or continuous) 1 , and the objective is to answer whether the mean outcome in the exposed/intervention group is different from the non-exposed/control group.

In this case, the first three parameters from chart 3 (alpha error, power of the study, and the relationship between non-exposed/exposed groups) are required, and the conclusions about their influence on the final sample size also apply. In addition to defining the expected outcome means in each group, or the expected mean difference between the non-exposed/exposed groups (usually at least 15% of the mean value in the non-exposed group), investigators also need to define the standard deviation value for each group. There is a direct relationship between the standard deviation and the sample size, which is why the sample size would be overestimated for asymmetric (skewed) variables. In such cases, the option may be to estimate sample sizes based on specific calculations for asymmetric variables, or the investigators may choose to use a percentage of the median value (for example, 25%) as a substitute for the standard deviation.

SAMPLE SIZE CALCULATIONS FOR OTHER TYPES OF STUDY

There are also specific calculations for some other quantitative studies, such as those aiming to assess correlations (exposure and outcome are numerical variables), time until the event (death, cure, relapse etc.) or the validity of diagnostic tests, but they are not described in this article, given that they were discussed elsewhere. 5

Sample size calculation is always an essential step during the planning of scientific studies. An insufficient or small sample size may not be able to demonstrate the desired difference, or estimate the frequency of the event of interest with acceptable precision. A very large sample may add to the complexity of the study, and its associated costs, rendering it unfeasible. Both situations are ethically unacceptable and should be avoided by the investigator.

Conflict of Interest: None

Financial Support: None

* Work carried out at the Latin American Cooperative Oncology Group (LACOG), Universidade Federal de Santa Catarina (UFSC), and Universidade Federal de Ciências da Saúde de Porto Alegre (UFCSPA), Brazil.

How to cite this article: Martínez-Mesa J, González-Chica DA, Bastos JL, Bonamigo RR, Duquia RP. Sample size: how many participants do I need in my research? An Bras Dermatol. 2014;89(4):609-15.

  • Systematic Review
  • Open access
  • Published: 07 September 2024

Assessing the effectiveness of greater occipital nerve block in chronic migraine: a systematic review and meta-analysis

  • Muhamad Saqlain Mustafa 1 , 7 ,
  • Shafin bin Amin 2 , 8 ,
  • Aashish Kumar 2 , 8 ,
  • Muhammad Ashir Shafique 1 , 7 ,
  • Syeda Mahrukh Fatima Zaidi 3 , 9 ,
  • Syed Ali Arsal 2 , 8 ,
  • Burhanudin Sohail Rangwala 1 , 7 ,
  • Muhammad Faheem Iqbal 3 , 9 ,
  • Adarsh Raja 2 , 8 ,
  • Abdul Haseeb 1 , 7 ,
  • Inshal Jawed 3 , 9 ,
  • Khabab Abbasher Hussien Mohamed Ahmed   ORCID: orcid.org/0000-0003-4608-5321 5 ,
  • Syed Muhammad Sinaan Ali 6 , 10 &
  • Giustino Varrassi 4  

BMC Neurology volume 24, Article number: 330 (2024)


Background & aims

Chronic migraine poses a global health burden, particularly affecting young women, and has substantial societal implications. This study aimed to assess the efficacy of Greater Occipital Nerve Block (GONB) in individuals with chronic migraine, focusing on the impact of local anesthetics compared with placebo.

A meta-analysis and systematic review were conducted following the PRISMA principles and Cochrane Collaboration methods. Eligible studies included case-control, cohort, and randomized control trials in adults with chronic migraine, adhering to the International Classification of Headache Disorders, third edition (ICHD3). Primary efficacy outcomes included headache frequency, duration, and intensity along with safety assessments.

Literature searches across multiple databases yielded eight studies for qualitative analysis, with five included in the final quantitative analysis. A remarkable reduction in headache intensity and frequency during the first and second months of treatment with GONB using local anesthetics compared to placebo has been reported. The incidence of adverse events did not differ significantly between the intervention and placebo groups.

The analysis emphasized the safety and efficacy of GONB, albeit with a cautious interpretation due to the limited number of studies and relatively small sample size. This study advocates for further research exploring various drugs, frequencies, and treatment plans to enhance the robustness and applicability of GONB for chronic migraine management.


Introduction

Among headache disorders, migraine ranks second worldwide in terms of disability and is the leading cause of disability among young women, according to the Global Burden of Disease 2019 data [ 1 ]. Recent findings indicate that the global prevalence of migraine is approximately 15%, which translates to 4.9% of all ill health measured in years lived with disability (YLDs) [ 2 ]. Women are more likely to experience migraine than men, particularly those aged 15–49 years [ 3 ]. Migraine has a substantial societal and financial impact owing to both direct and indirect costs resulting from decreased productivity and missed work [ 4 ].

Migraine is a complex neurovascular disorder that affects sensory processing and is characterized by a range of symptoms, with headache being the most common [ 5 ]. Chronic migraine (CM) is defined as headache occurring on at least 15 days per month for more than three months, with the features of migraine headache on at least 8 days per month [ 6 ]. Several medications are available for the preventive treatment of migraine, including anticonvulsants, antidepressants, beta-blockers, calcium channel blockers, botulinum toxin A, and, more recently, drugs that block the calcitonin gene-related peptide (CGRP) pathway (i.e., monoclonal antibodies and antagonists) [ 7 ]. Despite the potential of anti-CGRP monoclonal antibodies (mAbs) in managing chronic migraine, a considerable proportion of patients do not respond to this treatment [ 8 ]; approximately 25% of patients are unresponsive to anti-CGRP monoclonal antibodies [ 9 ].

An important component of the brainstem, the Trigeminocervical Complex (TCC) acts as a central processing unit for pain and sensory data from the head and neck. It is the point of convergence of the upper cervical spinal nerves and the trigeminal nerve, which supplies sensation to the face, head, and some regions of the neck [ 10 , 11 ].

One of the TCC’s defining features is the convergence of the occipital and trigeminal nerves. The trigeminal nerve transmits sensory data from the face, scalp, and meninges through its three main branches (ophthalmic, maxillary, and mandibular), while the occipital nerves, which originate from the upper cervical spinal roots, transmit sensation from the back of the head [ 10 , 11 ]. The network formed where these neurons converge at the TCC allows wide-ranging integration of sensory inputs from the head and neck, and because the TCC processes pain signals originating from the head and neck, it is crucial to migraine pain processing [ 10 , 11 ]. By integrating sensory data, especially pain, from the face, head, and neck, the TCC is an important piece of the migraine jigsaw when it comes to interpreting the location and intensity of pain. The trigeminal, occipital, and TCC pathways are intricately intertwined. During a migraine episode, a series of neurological events is set off, beginning with stimulation of the trigeminal nerve; this activation amplifies pain signals by causing the release of inflammatory chemicals around the TCC and the blood vessels in the brain [ 10 , 11 ]. The occipital nerves may also be affected, particularly if the headache radiates to the back of the head. Because of this connection, pain signals from the occipital area further stimulate the TCC, worsening the migraine sensation and producing a feedback loop [ 10 , 11 ].

The main sensory nerve serving the occipital region is the Greater Occipital Nerve (GON), which predominantly originates from the C2 dorsal root. The GON block is used in acute and preventive headache treatment because it targets the anatomical and functional connections between the trigeminal and cervical fibers within the trigeminocervical complex (TCC). The rationale for using GON blocks is based on the integration of sensory neurons from C2 in the upper cervical spinal cord with neurons in the trigeminal nucleus caudalis (TNC); however, the precise mechanisms by which GON blocks may affect the TCC and potentially reduce its activity are still being researched [ 12 ], and there is currently no standard protocol for GONB. Local anesthetics act by preventing the activation of voltage-gated sodium channels, which reduces the transmission of sensory signals from areas innervated by the greater occipital nerve, such as the medial region of the posterior scalp [ 13 , 14 ], thereby preventing the activation of convergent neurons in the trigemino-cervical complex. Combination therapy with corticosteroids may reduce inflammation and thereby attenuate pain, although this role of corticosteroids remains under debate.

The current management of chronic migraines is inadequate, as it lacks clear guidelines despite the various treatment options available. The evidence supporting the efficacy of GONB in preventing chronic migraines is limited and not recent [ 15 , 16 , 17 ]. However, the emergence of new clinical trials offers a promising opportunity for this study to provide valuable insights to healthcare providers. This study aims to fill the knowledge gaps by conducting a comprehensive systematic review and meta-analysis, providing healthcare professionals with a more complete understanding of the collective results of this approach for the treatment of chronic migraines.

A meta-analysis and a comprehensive systematic review were conducted to assess the efficacy of GONB in patients with CM, adhering to Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [ 18 ]. The PICO framework, a cornerstone of evidence-based medicine, organizes clinical questions and study designs into Population, Intervention, Comparison, and Outcome. In our research on chronic migraine treatment, we examine the efficacy of greater occipital nerve block (Intervention) with local anesthetics alone versus a placebo (Comparison) among adults with chronic migraine (Population), focusing on changes in migraine intensity measured by VAS, frequency, and adverse effects (Outcome).

Eligibility criteria

Inclusion criteria for studies considered in this meta-analysis encompassed randomized controlled trials (RCTs) evaluating the efficacy of greater occipital nerve block (GONB) with local anesthetics alone compared to a placebo in adult individuals diagnosed with chronic migraine. Studies were required to report outcomes including changes in migraine intensity measured by Visual Analog Scale (VAS), frequency of migraine episodes, and documentation of adverse effects. Exclusion criteria comprised studies that incorporated corticosteroids in conjunction with local anesthetics for GONB, non-randomized or non-controlled trials, studies with insufficient data for outcome assessment, and those involving populations other than adults with chronic migraine.

The primary efficacy endpoints were the change in headache intensity (measured on any scale) and the frequency of headache (days per month) in the intervention group compared with the placebo group at a given time point. To assess safety, the analysis focused on the number of participants in each group who experienced at least one adverse event (AE).

Literature search and study selection

A systematic search of PubMed, Medline, Scopus, Embase, Cochrane, Web of Science, and PsycINFO was performed up to June 2023 by two authors (AR and AH). All languages and publication dates were considered, and the search strategy combined free-text and controlled terms relating to migraine and GONB, using the keywords 'Chronic migraine', 'Migraine', and 'Greater Occipital Nerve Block'. Duplicates were eliminated, and the titles and abstracts of the remaining articles were assessed to identify relevant studies. Subsequently, a full-text assessment was performed by two independent investigators (AK and BSR), and any discrepancies were resolved by a third investigator (MSM). The PRISMA flowchart (Fig.  1 ) illustrates the selection process.

Figure 1. PRISMA flow chart

Data extraction

We used a standard Microsoft Excel 2021 spreadsheet to gather data from each included study in a predetermined format. Two independent investigators (MAS and SMFZ) collected the following information from each study: author, year of publication, population, intervention and comparison drugs, techniques, primary and secondary outcomes, funding, and potential conflicts of interest. If a disagreement arose, a third investigator (GV) made the final decision.

Statistical analysis

Statistical analyses were conducted using Review Manager 5.3.22 and Comprehensive Meta-Analysis. To account for anticipated between-study heterogeneity, we employed random-effects models in our meta-analyses of continuous outcomes. Effect sizes were reported as weighted mean differences (MD) with 95% confidence intervals (CI) for trials with similar outcomes. The I² statistic was used to assess the statistical heterogeneity of the pooled estimates, with values below 40% regarded as indicating heterogeneity that may not be important. Owing to the limited number of included papers, we were unable to carry out subgroup analyses or a funnel plot assessment of publication bias.
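To make the pooling procedure concrete, the following minimal Python sketch implements a random-effects meta-analysis of mean differences with the DerSimonian-Laird estimator, reporting the pooled estimate, 95% CI, and I². The per-trial mean differences and standard errors are invented for illustration and are not the study data; dedicated software such as Review Manager performs equivalent calculations internally.

```python
# A minimal sketch of random-effects pooling (DerSimonian-Laird estimator)
# using made-up mean differences and standard errors for three hypothetical
# trials -- not the data from the included studies.
import numpy as np
from scipy import stats

md = np.array([-1.2, -0.8, -1.5])   # per-trial mean differences (hypothetical)
se = np.array([0.40, 0.35, 0.50])   # per-trial standard errors (hypothetical)

w_fixed = 1.0 / se**2                           # inverse-variance weights
pooled_fixed = np.sum(w_fixed * md) / np.sum(w_fixed)

# Cochran's Q and I^2 quantify between-study heterogeneity.
Q = np.sum(w_fixed * (md - pooled_fixed) ** 2)
df = len(md) - 1
I2 = max(0.0, (Q - df) / Q) * 100 if Q > 0 else 0.0

# DerSimonian-Laird estimate of the between-study variance tau^2.
C = np.sum(w_fixed) - np.sum(w_fixed**2) / np.sum(w_fixed)
tau2 = max(0.0, (Q - df) / C)

w_rand = 1.0 / (se**2 + tau2)                   # random-effects weights
pooled = np.sum(w_rand * md) / np.sum(w_rand)
pooled_se = np.sqrt(1.0 / np.sum(w_rand))
ci_low, ci_high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
p_value = 2 * (1 - stats.norm.cdf(abs(pooled / pooled_se)))

print(f"Pooled MD = {pooled:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f}), "
      f"p = {p_value:.4f}, I^2 = {I2:.1f}%")
```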

Results

Study selection

The initial literature search yielded 3174 records. After removal of duplicate entries, 1964 articles remained. These were screened by title and abstract against the inclusion criteria, and those that did not meet them were excluded. Full-text screening was then performed for the remaining 30 studies, and studies that did not meet the inclusion criteria were excluded. The final quantitative analysis included five studies; three further studies were included only in the qualitative assessment because they involved different interventions (for example, the addition of corticosteroids). The study selection process is illustrated in the PRISMA flowchart (Fig.  1 ).

Quality assessment

The quality of the included RCTs was assessed using the Cochrane Risk of Bias tool, which rates each study as low, unclear, or high risk of bias across seven domains (random sequence generation, allocation concealment, blinding of participants and personnel, blinding of outcome assessment, incomplete outcome data, selective reporting, and other bias). Following this evaluation, all studies included in our analysis were classified as low risk across these domains. A detailed presentation of the risk of bias assessment is shown in Fig.  2 .

Figure 2. Risk of bias assessment: (A) qualitative, (B) quantitative

Study and patient characteristics

All included studies assessed outcomes in patients aged 18–75 years, with a total of 268 patients across the studies. The intervention group in three studies [ 19 , 20 , 21 ] received bupivacaine 0.5% 1.5 ml with or without 1 ml of saline (0.9%); one study [ 22 ] used lidocaine 2% 1 ml with 1 ml of saline (0.9%); and one study [ 23 ] used lidocaine 2% 2 ml in the intervention group. In the control groups, 0.9% saline (1.5, 2, or 2.5 ml) was used as a placebo. The studies differed in their follow-up procedures: two studies followed patients for 4 weeks, one for up to 2 months, and two assessed patients every month for up to 3 months. A summary of patients' baseline characteristics is provided in Table  1 .

Effect of GONB on headache intensity

In the initial month following GONB treatment, the meta-analysis of three studies showed a significant reduction in headache intensity as measured by the Visual Analog Scale (VAS). The standardized mean difference (SMD) was -0.653 (95% confidence interval (CI) -0.996 to -0.311; p = 0.0001), indicating a greater reduction in headache intensity in the local anesthetic group than in the placebo group. The I² value of 0% indicates no observed heterogeneity, that is, consistent results across the studies analyzed (Fig.  3 ).

Figure 3. Forest plot illustrating the effect of GONB on headache intensity, evaluated using VAS within the initial month
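For readers unfamiliar with how a standardized mean difference is derived from trial summary statistics, the Python sketch below computes Hedges' g with an approximate 95% CI. The group means, standard deviations, and sample sizes are hypothetical and are not taken from the included trials.

```python
# A minimal sketch of computing a standardized mean difference (Hedges' g)
# with an approximate 95% CI from summary statistics; the numbers below are
# hypothetical, not values from the included trials.
import math

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """SMD between intervention (1) and placebo (2) with small-sample correction."""
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sd_pooled                      # Cohen's d
    j = 1 - 3 / (4 * (n1 + n2) - 9)                # Hedges' small-sample correction
    g = j * d
    # Approximate large-sample variance of g.
    var_g = (n1 + n2) / (n1 * n2) + g**2 / (2 * (n1 + n2))
    se_g = math.sqrt(var_g)
    return g, g - 1.96 * se_g, g + 1.96 * se_g

# Hypothetical month-1 VAS scores: GONB group vs. saline group.
g, ci_low, ci_high = hedges_g(m1=4.1, sd1=2.0, n1=30, m2=5.5, sd2=2.1, n2=30)
print(f"Hedges' g = {g:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```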

In the second month, the analysis of five studies continued to show a significant reduction in headache intensity, with an SMD of -0.628 (95% CI -1.148 to -0.107; p = 0.018). However, the I² value increased to 74%, indicating substantial heterogeneity among the studies, driven primarily by one study (Inan et al.) with an outlier SMD of 0.136 (Fig.  4 ). A leave-one-out analysis was conducted to address this issue and is shown in Fig.  5 .

Figure 4. Forest plot illustrating the impact of GONB on headache intensity, evaluated using VAS during the second month

Figure 5. Forest plot illustrating the effect of GONB on headache frequency within the initial month
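The leave-one-out analysis referred to above can be sketched as follows: the random-effects estimate and I² are recomputed with each study omitted in turn, so an influential outlier reveals itself as a large shift when it is dropped. The effect sizes and standard errors below are hypothetical, with one deliberate outlier included to mimic the pattern described; they are not values from the included trials.

```python
# A minimal sketch of a leave-one-out sensitivity analysis for a
# random-effects meta-analysis (DerSimonian-Laird pooling).
import numpy as np

def pool_dl(y, se):
    """Random-effects pooling; returns the pooled effect and I^2 (%)."""
    w = 1.0 / se**2
    mu_fixed = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - mu_fixed) ** 2)
    df = len(y) - 1
    I2 = max(0.0, (Q - df) / Q) * 100 if Q > 0 else 0.0
    C = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (Q - df) / C)
    w_rand = 1.0 / (se**2 + tau2)
    return np.sum(w_rand * y) / np.sum(w_rand), I2

labels = ["Study A", "Study B", "Study C", "Study D", "Outlier"]
smd = np.array([-0.70, -0.55, -0.80, -0.60, 0.14])   # hypothetical SMDs
se  = np.array([0.25, 0.30, 0.28, 0.26, 0.27])       # hypothetical SEs

full_est, full_i2 = pool_dl(smd, se)
print(f"All studies: SMD = {full_est:.3f}, I^2 = {full_i2:.1f}%")
for i, name in enumerate(labels):
    mask = np.arange(len(smd)) != i
    est, i2 = pool_dl(smd[mask], se[mask])
    print(f"Omitting {name}: SMD = {est:.3f}, I^2 = {i2:.1f}%")
```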

Headache frequency

Within the initial month, the analysis of two studies showed a significant reduction in headache frequency, with an SMD of -0.755 (95% CI -1.133 to -0.377; p = 0.0001), indicating a notable decrease in headache frequency in the local anesthetic group compared with the placebo group. The I² value of 0% indicates no heterogeneity between the studies, suggesting consistent results (Fig.  6 ).

Figure 6. Forest plot illustrating the impact of GONB on headache frequency during the second month

At the two-month mark, the analysis of four studies also showed a significant reduction in headache frequency with an SMD of -0.577 (95% CI -0.887 to -0.266; p  = 0.0001). The low I² value of 8.9% indicates minimal heterogeneity among the studies, reinforcing the consistency of the observed effect (Fig.  7 ).

Figure 7. Forest plot displaying adverse events associated with the use of GONB

Adverse events

The meta-analysis of two studies on adverse events revealed no significant difference between the GONB treatment and placebo groups. The odds ratio (OR) was 1.379 with a 95% CI of 0.599 to 3.177 and a p-value of 0.450. The confidence interval crosses one, indicating that there is no clear increased risk of adverse events associated with GONB treatment. Additionally, the I² value of 0% suggests no heterogeneity between the studies, indicating consistent findings regarding the safety profile of GONB (Fig.  7 ).
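As an illustration of how such a pooled odds ratio can be obtained from per-trial adverse-event counts, the sketch below pools log odds ratios by inverse-variance weighting (fixed-effect pooling is used here for simplicity, consistent with the negligible heterogeneity reported). The 2x2 counts are hypothetical, chosen only so that the resulting CI crosses 1, as in the reported result.

```python
# A minimal sketch of pooling adverse-event counts as an odds ratio on the
# log scale (inverse-variance weighting); the counts are hypothetical.
import math

# Each row: (events_GONB, total_GONB, events_placebo, total_placebo)
trials = [(6, 36, 4, 36), (5, 44, 4, 40)]

log_ors, weights = [], []
for e1, n1, e2, n2 in trials:
    a, b = e1, n1 - e1            # GONB: events / non-events
    c, d = e2, n2 - e2            # placebo: events / non-events
    log_or = math.log((a * d) / (b * c))
    var = 1 / a + 1 / b + 1 / c + 1 / d   # Woolf variance of the log-OR
    log_ors.append(log_or)
    weights.append(1 / var)

pooled_log_or = sum(w * x for w, x in zip(weights, log_ors)) / sum(weights)
se = math.sqrt(1 / sum(weights))
odds_ratio = math.exp(pooled_log_or)
ci_low = math.exp(pooled_log_or - 1.96 * se)
ci_high = math.exp(pooled_log_or + 1.96 * se)
print(f"Pooled OR = {odds_ratio:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```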

Discussion

We conducted an updated meta-analysis of GONB in patients with CM, incorporating findings from five RCTs. All RCTs used local anesthetics for GONB, with 0.9% saline serving as the placebo. Our study focused on evaluating the effect of GONB on headache frequency, headache intensity, and associated adverse effects. The results demonstrated the beneficial effects of local anesthetics in reducing both the frequency and intensity of headaches during the first and second months of treatment, whereas the outcomes related to adverse effects did not reach statistical significance. The included studies employed two distinct local anesthetics, 0.5% bupivacaine and 2% lidocaine, suggesting that either agent may yield positive outcomes compared with placebo. Despite these positive results, we interpret the evidence with caution because its certainty was assessed as low. Additional studies are therefore warranted to substantiate our findings and to strengthen the reliability of the conclusions drawn from this meta-analysis.

Our meta-analysis demonstrated that GONB significantly reduces both headache intensity and frequency in the first and second months post-treatment compared with placebo. During the first month, the studies consistently showed a marked reduction in headache intensity with no observed heterogeneity, indicating uniform results across the studies analyzed. In the second month, the reduction in headache intensity remained significant, although some heterogeneity was introduced by an outlier study. Similarly, the analysis revealed a notable decrease in headache frequency within the first month, again with consistent findings and no heterogeneity between studies. By the second month, the reduction in headache frequency remained significant, with minimal heterogeneity, reinforcing the consistency of the treatment effect. Furthermore, the analysis of adverse events indicated no significant difference between the GONB and placebo groups, suggesting that GONB does not increase the risk of adverse events; the studies consistently supported the safety profile of GONB, with no observed heterogeneity. In terms of both safety and efficacy, our findings suggest that the use of local anesthetics in GONB is generally safe, as we did not identify any notable adverse effects in the intervention group. However, the certainty of this evidence is moderate, primarily because the adverse-event results did not reach statistical significance, which may reflect the limited number of studies and the relatively short follow-up.

In our updated meta-analysis, building on the original study by Velásquez-Rimachi et al. [ 24 ], we included an additional RCT, contributing to a more comprehensive quantitative analysis. Although most of our findings align with those of Velásquez-Rimachi et al. [ 24 ] in demonstrating the safety and effectiveness of GONB for chronic migraine, some variations should be acknowledged. Velásquez-Rimachi et al. highlighted occasional negative effects associated with local anesthetics but found no remarkable side effects, whereas our analysis of adverse events did not reach statistical significance. A noteworthy distinction lies in the consideration of adjuvants: our study did not account for steroids or other adjuvants, whereas Velásquez-Rimachi et al. considered steroids for every study outcome. This discrepancy underscores the need for further exploration and standardization of variables in future research to establish a more definitive understanding of the safety and efficacy of GONB in the management of chronic migraine.

Our findings strongly suggest that GONB is a safe and effective method for treating migraine. This assertion is consistent with existing research characterizing GONB as a highly effective and safe therapy with minimal adverse effects and recommending its consideration when alternative treatments are unsuccessful [ 21 ]. This viewpoint is further supported by another study that affirms our findings, emphasizing a preference for GONB in cases of resistant migraine [ 22 ]. Moreover, evidence suggests the potential applicability of GONB in the treatment of various types of headaches [ 23 , 25 ]. A retrospective cohort study also indicated that GONB may be beneficial for acute migraine episodes, albeit with the caveat that adverse effects tended to occur during the procedure rather than during the follow-up period [ 26 ]. Additional observational studies [ 25 , 27 ] reinforce our findings. However, one study comparing GONB with placebo for migraine prevention found no marked change in headache frequency, although GONB still played a notable role in lowering intensity [ 28 ]. Notably, these studies often underscored the benefits of GONB with the adjunct use of steroids. In a randomized controlled trial of bilateral GONB, administration of a local anesthetic was associated with lower headache frequency, reduced intensity, and increased pressure pain thresholds, although the study predominantly involved female participants [ 29 ]. It is essential to acknowledge that trials exclusively assessing local anesthetics alone in GONB remain scarce, as steroids are commonly employed as adjuvants in the majority of studies. This suggests the need for further investigation to delineate the unique contribution of local anesthetics to GONB outcomes.

Prior research has emphasized the need to compare various GONB treatment plans, incorporating diverse anesthetics and adjuncts, in order to comprehensively evaluate effectiveness, the need for additional intervention, and safety. It is important to note that we did not incorporate any adjuncts, which prevents us from commenting on their potential impact on treatment outcomes; this underscores the need for further exploration of how such additions may influence the overall efficacy and safety of GONB. Most trials in our analysis used weekly injections, so comprehensive data for comparing different injection frequencies are lacking, although some studies have suggested potential advantages of monthly use [ 26 ]. The American Headache Society has also expressed interest in, and support for, nerve blocks for headache treatment; this endorsement highlights the growing recognition of nerve blocks as a valuable therapeutic option for managing headaches [ 30 , 31 ].

The included studies present diverse methodologies in terms of dosage, injection sites, duration and timing of the intervention, and primary endpoints for evaluating the efficacy of GONB in migraine treatment. The administration and composition of the GONB differed substantially across the studies. For example, Gul et al. [ 20 ] used 0.5% bupivacaine diluted in 1 ml, while Inan et al. [ 19 ] used a slightly larger volume of the same concentration; Ozer et al. [ 22 ] combined 2% lidocaine with saline, and Ashkenazi et al. [ 32 ] mixed lidocaine and bupivacaine. These variations could lead to differences in efficacy and side effects. The addition of corticosteroids, as in Dilli et al. [ 33 ], introduces another variable that may enhance the anti-inflammatory effect but could also influence the outcome independently of the nerve block's anesthetic action. Although all studies targeted the GON, the exact injection sites varied slightly: most, such as those by Gul et al. [ 20 ], Inan et al. [ 19 ], and Cuadrado et al. [ 34 ], selected a site approximately 2 cm lateral and 2 cm inferior to the external occipital protuberance, whereas Palamar et al. [ 21 ] used ultrasound guidance, which might improve accuracy and potentially efficacy. Ashkenazi et al. [ 32 ] included additional trigger point injections (TPIs), which makes it harder to isolate the specific effects of the GONB.

The administration of GONB also varied in frequency and duration. Some studies, such as those by Gul et al. [ 20 ] and Inan et al. [ 19 ], administered the blocks weekly for four weeks, whereas Chowdhury et al. [ 23 ] extended the injections over 12 weeks, and Cuadrado et al. [ 34 ] and Dilli et al. [ 33 ] examined single administrations. These discrepancies in timing may affect both short-term and long-term outcomes, with more frequent administration potentially providing more sustained relief but also increasing the risk of cumulative side effects. The primary endpoints of the studies varied but generally included measures of headache frequency and intensity. For instance, Gul et al. [ 20 ] and Palamar et al. [ 21 ] focused on the number of headache days per month, Inan et al. [ 19 ] assessed both frequency and intensity, Ozer et al. [ 22 ] and Cuadrado et al. [ 34 ] emphasized the reduction in headache frequency, and Dilli et al. [ 33 ] used a 50% reduction in migraine frequency as the measure of success. This variation in endpoints underscores the multifaceted nature of migraine impact and the importance of selecting appropriate, consistent measures for evaluating treatment efficacy.

Despite the differences in methodology, the studies collectively indicate that GONB can effectively decrease the frequency and severity of migraines. The consistent reporting of substantial improvements across a range of dosages, injection techniques, and primary outcomes reinforces the potential usefulness of GONB in clinical practice. However, the variation in methodologies highlights the need for standardized protocols to improve the comparability and generalizability of the findings. While the reviewed studies indicate promising outcomes for GONB in migraine treatment, the variability in dosage, injection sites, administration timing, and primary endpoints necessitates caution.

Examining these administration frequencies is particularly important given the invasive nature of the procedure, as they offer valuable insight into its safety profile. An essential aspect of chronic migraine management is patient adherence, which contributes markedly to treatment success, so it is important to assess adherence to GONB. Unfortunately, we could not find research reporting participants who discontinued treatment because of side effects, which hinders our ability to judge the tolerability of the intervention. Another unresolved question concerns the choice between unilateral and bilateral GONB and their relative efficacy: a retrospective cohort study comparing patients who underwent bilateral versus unilateral GONB found the two approaches equally effective [ 35 ], but a definitive conclusion remains elusive because additional evidence from diverse studies is lacking. Addressing these gaps would contribute substantially to refining the optimal parameters of GONB for improved outcomes in chronic migraine management. Longitudinal studies and studies on the frequency of nerve block use are needed to assess long-term efficacy.

Limitations

Although this meta-analysis offers valuable insights, it is crucial to acknowledge its limitations. First, the small sample size, resulting from the limited availability of new studies, may compromise the reliability and accuracy of our findings; incorporating more studies could alleviate this concern, but the scarcity of available data remains an issue. Second, the absence of sufficient data from recent trials prevented consideration of baseline characteristics and hindered our ability to perform meta-regression, underscoring the importance of comprehensive data collection in future studies. Third, not accounting for the medications patients were taking around the time of the procedure may introduce confounding; although the existing data are insufficient to draw definitive conclusions, recognizing and addressing this aspect in future research is essential for a more nuanced understanding. Moreover, this meta-analysis did not explicitly address patient comorbidities, which could influence the safety of the procedure in different patient groups; future studies should examine these aspects to provide a more comprehensive assessment of the safety profile in diverse populations. In conclusion, although this meta-analysis provides valuable insights, researchers must remain cognizant of these limitations, and addressing them in future studies will enhance the robustness and applicability of the findings in clinical settings.

Conclusion

Based on our investigation, we found that the administration of greater occipital nerve blocks (GONB) with local anesthetic leads to a notable reduction in both the intensity and frequency of headaches compared with placebo. Our research also underscores the effectiveness of GONB and affirms its satisfactory safety profile. However, our confidence in these findings is tempered by the limited number of studies and the relatively modest sample size underpinning our conclusions. We therefore advocate that future studies broaden their scope by incorporating larger and more diverse samples and by exploring a range of drugs, injection frequencies, and treatment plans, thereby augmenting the robustness and applicability of the results and providing a more comprehensive understanding of the potential benefits of GONB for headache management.

Data availability

The data supporting the findings of this study are available within the article and its supplementary files.

References

1. Steiner TJ, Stovner LJ, Jensen R, Uluduz D, Katsarava Z. Migraine remains second among the world's causes of disability, and first among young women: findings from GBD2019. J Headache Pain. 2020;21(1):137.

2. Steiner TJ, Stovner LJ. Global epidemiology of migraine and its implications for public health and health policy. Nat Rev Neurol. 2023;19(2):109–17.

3. Stovner LJ, Nichols E, Steiner TJ, Abd-Allah F, Abdelalim A, Al-Raddadi RM, et al. Global, regional, and national burden of migraine and tension-type headache, 1990–2016: a systematic analysis for the Global Burden of Disease Study 2016. Lancet Neurol. 2018;17(11):954–76.

4. Ferrari MD. The economic burden of migraine to society. PharmacoEconomics. 1998;13(6):667–76.

5. Goadsby PJ, Holland PR. An update: pathophysiology of migraine. Neurol Clin. 2019;37(4):651–71.

6. The International Classification of Headache Disorders, 3rd edition (beta version). Cephalalgia. 2013;33(9):629–808.

7. Urits I, Yilmaz M, Bahrun E, Merley C, Scoon L, Lassiter G, et al. Utilization of B12 for the treatment of chronic migraine. Best Pract Res Clin Anaesthesiol. 2020;34(3):479–91.

8. Hong JB, Lange KS, Overeem LH, Triller P, Raffaelli B, Reuter U. A scoping review and meta-analysis of anti-CGRP monoclonal antibodies: predicting response. Pharmaceuticals (Basel). 2023;16(7).

9. Han L, Liu Y, Xiong H, Hong P. CGRP monoclonal antibody for preventive treatment of chronic migraine: an update of meta-analysis. Brain Behav. 2019;9(2):e01215.

10. Al-Khazali HM, Krøll LS, Ashina H, Melo-Carrillo A, Burstein R, Amin FM, et al. Neck pain and headache: pathophysiology, treatments and future directions. Musculoskeletal Science and Practice. 2023;66:102804. https://doi.org/10.1016/j.msksp.2023.102804

11. Bartsch T, Goadsby PJ. The trigeminocervical complex and migraine: current concepts and synthesis. Curr Pain Headache Rep. 2003;7:371–6. https://doi.org/10.1007/s11916-003-0036-y

12. Chowdhury D, Datta D, Mundra A. Role of greater occipital nerve block in headache disorders: a narrative review. Neurol India. 2021;69(Supplement):S228–56.

13. Anthony M. Headache and the greater occipital nerve. Clin Neurol Neurosurg. 1992;94(4):297–301.

14. Selekler MH. [Greater occipital nerve blockade: trigeminicervical system and clinical applications in primary headaches]. Agri. 2008;20(3):6–13.

15. Shauly O, Gould DJ, Sahai-Srivastava S, Patel KM. Greater occipital nerve block for the treatment of chronic migraine headaches: a systematic review and meta-analysis. Plast Reconstr Surg. 2019;144(4):943–52.

16. Tang Y, Kang J, Zhang Y, Zhang X. Influence of greater occipital nerve block on pain severity in migraine patients: a systematic review and meta-analysis. Am J Emerg Med. 2017;35(11):1750–4.

17. Zhang H, Yang X, Lin Y, Chen L, Ye H. The efficacy of greater occipital nerve block for the treatment of migraine: a systematic review and meta-analysis. Clin Neurol Neurosurg. 2018;165:129–33.

18. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71.

19. Inan LE, Inan N, Karadaş Ö, Gül HL, Erdemoğlu AK, Türkel Y, et al. Greater occipital nerve blockade for the treatment of chronic migraine: a randomized, multicenter, double-blind, and placebo-controlled study. Acta Neurol Scand. 2015;132(4):270–7.

20. Gul HL, Ozon AO, Karadas O, Koc G, Inan LE. The efficacy of greater occipital nerve blockade in chronic migraine: a placebo-controlled study. Acta Neurol Scand. 2017;136(2):138–44.

21. Palamar D, Uluduz D, Saip S, Erden G, Unalan H, Akarirmak U. Ultrasound-guided greater occipital nerve block: an efficient technique in chronic refractory migraine without aura? Pain Physician. 2015;18(2):153–62.

22. Özer D, Bölük C, Türk Börü Ü, Altun D, Taşdemir M, Köseoğlu Toksoy C. Greater occipital and supraorbital nerve blockade for the preventive treatment of migraine: a single-blind, randomized, placebo-controlled study. Curr Med Res Opin. 2019;35(5):909–15.

23. Chowdhury D, Tomar A, Deorari V, Duggal A, Krishnan A, Koul A. Greater occipital nerve blockade for the preventive treatment of chronic migraine: a randomized double-blind placebo-controlled study. Cephalalgia. 2023;43(2):3331024221143541.

24. Velásquez-Rimachi V, Chachaima-Mar J, Cárdenas-Baltazar EC, Loayza-Vidalon A, Morán-Mariños C, Pacheco-Barrios K, et al. Greater occipital nerve block for chronic migraine patients: a meta-analysis. Acta Neurol Scand. 2022;146(2):101–14.

25. Bovim G, Sand T. Cervicogenic headache, migraine without aura and tension-type headache. Diagnostic blockade of greater occipital and supra-orbital nerves. Pain. 1992;51(1):43–8.

26. Allen SM, Mookadam F, Cha SS, Freeman JA, Starling AJ, Mookadam M. Greater occipital nerve block for acute treatment of migraine headache: a large retrospective cohort study. J Am Board Fam Med. 2018;31(2):211–8.

27. Austin M, Hinson MR. Occipital nerve block. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024.

28. Afridi SK, Shields KG, Bhola R, Goadsby PJ. Greater occipital nerve injection in primary headache syndromes – prolonged effects from a single injection. Pain. 2006;122(1–2):126–9.

29. Santos Lasaosa S, Cuadrado Pérez ML, Guerrero Peral AL, Huerta Villanueva M, Porta-Etessam J, Pozo-Rosich P, et al. Consensus recommendations for anaesthetic peripheral nerve block. Neurologia. 2017;32(5):316–30.

30. Rothrock JF. Occipital nerve blocks. Headache. 2010;50(5):917–8. https://doi.org/10.1111/j.1526-4610.2010.01668.x

31. Blumenfeld A, Ashkenazi A, Napchan U, Bender SD, Klein BC, Berliner R, et al. Expert consensus recommendations for the performance of peripheral nerve blocks for headaches – a narrative review. Headache. 2013;53(3):437–46. https://doi.org/10.1111/head.12053

32. Ashkenazi A, Matro R, Shaw JW, Abbas MA, Silberstein SD. Greater occipital nerve block using local anaesthetics alone or with triamcinolone for transformed migraine: a randomised comparative study. J Neurol Neurosurg Psychiatry. 2008;79(4):415–7.

33. Dilli E, Halker R, Vargas B, Hentz J, Radam T, Rogers R, et al. Occipital nerve block for the short-term preventive treatment of migraine: a randomized, double-blinded, placebo-controlled study. Cephalalgia. 2015;35(11):959–68.

34. Cuadrado ML, Aledo-Serrano Á, Navarro P, López-Ruiz P, Fernández-de-Las-Peñas C, González-Suárez I, et al. Short-term effects of greater occipital nerve blocks in chronic migraine: a double-blind, randomised, placebo-controlled clinical trial. Cephalalgia. 2017;37(9):864–72.

35. Karaoğlan M, Durmuş İE, Küçükçay B, Takmaz SA, İnan LE. Comparison of the clinical efficacy of bilateral and unilateral GON blockade at the C2 level in chronic migraine. Neurol Sci. 2022;43(5):3297–303.


Acknowledgements

We would like to thank the Paolo Procacci Foundation for their support.

Funding

The study was funded by the Paolo Procacci Foundation.

Author information

Authors and affiliations

Department of Medicine, Jinnah Sindh Medical University, Karachi, 75510, Pakistan

Muhamad Saqlain Mustafa, Muhammad Ashir Shafique, Burhanudin Sohail Rangwala & Abdul Haseeb

Department of Medicine, Shaheed Mohtarma Benazir Bhutto Medical College, Karachi, 75400, Pakistan

Shafin bin Amin, Aashish Kumar, Syed Ali Arsal & Adarsh Raja

Department of Medicine, Dow University of Health Science, Karachi, 74200, Pakistan

Syeda Mahrukh Fatima Zaidi, Muhammad Faheem Iqbal & Inshal Jawed

Fondazione Paolo Procacci, Roma, 00193, Italy

Giustino Varrassi

Faculty of Medicine, University of Khartoum, Khartoum, 11111, Sudan

Khabab Abbasher Hussien Mohamed Ahmed

Department of Medicine, Liaquat National Hospital and Medical College, Karachi, Pakistan

Syed Muhammad Sinaan Ali


Contributions

M.S.M., S.B.A., A.K., M.A.S., S.M.F.Z., S.A.A. and B.S.R. wrote the main manuscript, visualized, validated and analyzed data. M.F.I., A.R., A.H., I.J., K.A.H.M.A., S.M.S.A. and G.V. wrote the main manuscript, conceived, visualized, validated, reviewed and edited. All authors made a significant contribution to the work reported, whether in the conception, study design, execution, acquisition of data, analysis and interpretation, or in all these areas; took part in drafting, revising or critically reviewing the article; gave final approval of the version to be published; agreed on the journal to which the article has been submitted; and agree to be accountable for all aspects of the work.

Corresponding author

Correspondence to Khabab Abbasher Hussien Mohamed Ahmed .

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Mustafa, M.S., bin Amin, S., Kumar, A. et al. Assessing the effectiveness of greater occipital nerve block in chronic migraine: a systematic review and meta-analysis. BMC Neurol 24, 330 (2024). https://doi.org/10.1186/s12883-024-03834-6


Received : 19 April 2024

Accepted : 28 August 2024

Published : 07 September 2024

DOI : https://doi.org/10.1186/s12883-024-03834-6


Keywords

  • Chronic migraine
  • Greater occipital nerve block (GONB)
  • Local anesthetics

