Biostatistics for Biomedical Research

Frank E Harrell Jr

Department of Biostatistics School of Medicine Vanderbilt University

May 23, 2024

The book is aimed at exposing biomedical researchers to modern biostatistical methods and statistical graphics, highlighting those methods that make fewer assumptions, including nonparametric statistics and robust statistical measures. In addition to covering traditional estimation and inferential techniques, the course contrasts those with the Bayesian approach, and also includes several components that have been increasingly important in the past few years, such as challenges of high-dimensional data analysis, modeling for observational treatment comparisons, analysis of differential treatment effect (heterogeneity of treatment effect), statistical methods for biomarker research, medical diagnostic research, and methods for reproducible research. A glossary of statistical terms for non-statisticians is listed under Other Information below. BBR course R Workflow is a useful companion to this book, especially for those needing to manipulate data in preparation for analysis and for those interested in embedding statistical analyses in state-of-the-art reproducible reports.

For information about adding annotations, comments, and questions inside the text click here: Comments

Symbols Used in the Right Margin of the Text

Blue symbols in the right margin starting with ABD designate section numbers (and occasionally page numbers preceded by \(p\)) in The Analysis of Biological Data, Second Edition by MC Whitlock and D Schluter, Greenwood Village CO, Roberts and Company, 2015.
Right blue symbols starting with RMS designate section numbers in Regression Modeling Strategies, 2nd ed. by FE Harrell, Springer, 2015.

Boxed blue text in the right margin represents a mnemonic key for linking to discussions about that section in datamethods . Anyone starting a new discussion about a topic related to the section should include the mnemonic somewhere in the posting. When you click on the blue boxed text the datamethods search result of all topics containing that mnemonic will appear, and the user can navigate from it to the topic of interest to read or add content.
An audio player symbol indicates that narration elaborating on the notes is available for the section. Red letters and numbers in the right margin are cues referred to within the audio recordings.
blog in the right margin is a link to a blog entry that further discusses the topic.

Other Information

YouTube channel BBRcourse for these notes
Discussion board about the overall course
Go directly to a YouTube video for BBR Session n by going to bit.ly/yt-bbrn
Glossary of statistical terms
Datamethods discussion board
Statistical papers written for clinical researchers
Statistical Thinking blog
Statistical Thinking News

Acknowledgement

This material grew largely out of teaching clinical scholars and in Master of Science in Clinical Investigation programs at Duke University, University of Virginia, and Vanderbilt University. I benefitted immensely from lecture notes from colleagues such as Kerry Lee of Duke University. Thanks also go to Vanderbilt Biostatistics colleague James C. Slaughter who made several contributions to an earlier version of the book at hbiostat.org/doc/bbr.pdf.

Date	Sections	Changes	Thanks To
2024-04-16		New subsection on KCCQ ceiling effect problem
2024-04-16		New subsection on optimal model to replace change score
2023-11-10		Fixed mixed effects ordinal model for paired rank test by using quadrature
2023-09-22		New section on bootstrapping importance ranks using one-at-a-time feature modeling
2023-09-16		New section on estimation of correlation matrices
2023-07-28		New section on using models for paired data
2023-07-26		Added example of ordinal model for 2-way ANOVA
2023-06-22		Added big picture
2023-06-16		Added more to section on how many covariates to add
2023-04-27		New section on sample size for ECDF
2023-04-05		Added confidence bands
2023-03-30		Fixed bug in simulation graphics
2023-03-29		New link to clinical trial design resource
2023-03-13		New subsection on the decline effect
2023-02-19		Added link to resources for learning probability
2022-12-29		Added single-axis nomogram example
2022-12-28		Started to add old study questions to end of selected chapters
2022-12-03		New section with real example of misleading change score
2022-11-27		New section on importance of current status vs. baseline status and irrelevance of change for patients
2022-08-02		Quote about weaknesses in sens and spec; link to CrossValidated discussion
2022-08-31		New material on sample size vs. P(correct sign on r)

Source Code

Open access
Published: 15 May 2023

Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

Jörg Rahnenführer ORCID: orcid.org/0000-0002-8947-440X 1 ,
Riccardo De Bin ORCID: orcid.org/0000-0002-7441-6880 2 ,
Axel Benner ORCID: orcid.org/0000-0002-7238-6956 3 ,
Federico Ambrogi ORCID: orcid.org/0000-0001-9358-011X 4 , 5 ,
Lara Lusa ORCID: orcid.org/0000-0002-8981-2421 6 , 7 ,
Anne-Laure Boulesteix ORCID: orcid.org/0000-0002-2729-0947 8 ,
Eugenia Migliavacca 9 ,
Harald Binder ORCID: orcid.org/0000-0002-5666-8662 10 ,
Stefan Michiels ORCID: orcid.org/0000-0002-6963-2968 11 , 12 ,
Willi Sauerbrei ORCID: orcid.org/0000-0002-6792-4123 10 &
Lisa McShane ORCID: orcid.org/0000-0001-8195-3206 13

for topic group “High-dimensional data” (TG9) of the STRATOS initiative

BMC Medicine volume 21, Article number: 182 (2023)

In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions.

Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 “High-dimensional data” of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD.

The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided.

Conclusions

This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.

The goal of the topic group TG9 “High-dimensional data” (HDD) of the STRATOS (STRengthening Analytical Thinking for Observational Studies) [ 1 ] initiative is to provide guidance for planning, conducting, analyzing, and reporting studies involving high-dimensional biomedical data. The increasing availability and use of “big” data in biomedical research, characterized by “large n ” (independent observations) and/or “large p ” (number of dimensions of a measurement or number of variables associated with each independent observation), has created a need for the development and novel application of statistical methods and computational algorithms. Either large n or p may present difficulties for data storage or computations, but large p presents several major statistical challenges and opportunities [ 2 ]. The dimension p can range from several dozen to millions. The situation of very large p is the focus of TG9 and this paper. Throughout the paper, “ p ” will refer to the number of variables and the term “subject” will be used broadly to refer to independent observations, including human or animal subjects, or biospecimens derived from them; or other independent experimental or observational units. Researchers who design and analyze such studies need a basic understanding of the commonly used analysis methods and should be aware of pitfalls when statistical methods that are established in the low-dimensional setting cannot, or should not, be used in the HDD setting.

This overview, a product of STRATOS topic group TG9, provides a gentle introduction to fundamental concepts in the analysis of HDD, in the setting of observational studies in biomedical research. The focus is on analytical methods; however, issues related to study design, interpretation, transportability of findings, and clinical usefulness of results should also be considered as briefly discussed throughout this paper.

The STRATOS initiative and the STRATOS topic group TG9 “High-dimensional data”

The STRATOS initiative ( www.stratos-initiative.org ) is a large collaboration involving experts in many different areas of biostatistical research. The objective of STRATOS is to provide accessible and sound guidance for the design and analysis of observational studies [ 1 ]. This guidance is intended for applied statisticians and other data analysts with varying levels of statistical training, experience and interests. TG9 is one of nine topic groups of STRATOS and deals with aspects of HDD analysis.

Main issues addressed by TG9 often overlap with those of other TGs, but in the work of TG9 there is always a focus on the HDD aspect. Sometimes TG9 guidance will build upon that of other TGs to adapt it for relevance to HDD (see the “ Discussion ” section), but also completely new issues arise and require novel statistical approaches.

High-dimensional data are now ubiquitous in biomedical research, very frequently in the context of observational studies. Particularly omics data, i.e., high-throughput molecular data (e.g., genomics, transcriptomics, proteomics, and metabolomics) have provided new insights into biological processes and disease pathogenesis and have furthered the development of precision medicine approaches [ 3 ]. Rapidly expanding stores of electronic health records contain not only standard demographic, clinical, and laboratory data collected through a patient history, but also information from potentially many different providers involved in a patient’s care [ 4 ]. Data may be derived from multiple sources and can be represented in many different forms. Collectively, these data can be leveraged to support programs in comparative effectiveness and health outcomes research, and to monitor public health. Many statistical methods that are discussed here may be applied to health records data as well as to omics data, but our primary focus here is on the analysis of omics data.

Simultaneously, advances in statistical methodology and machine learning methods have contributed to improved approaches for data mining, statistical inference, and prediction in the HDD setting. Strong collaborations between data and computational scientists (e.g., statisticians, computational biologists, bioinformaticians, and computer scientists) and other biomedical scientists (e.g., clinicians and biologists) are essential for optimal generation, management, processing, analysis, and interpretation of these high-dimensional biomedical data [ 5 ].

Credibility and importance of research findings from biomedical studies involving HDD can be better judged when there is understanding of various approaches for statistical design and analysis along with their strengths and weaknesses. While this overview directly aims to improve understanding, simultaneously this guidance implies what information is necessary to report to fully appreciate how a study was designed, conducted, and analyzed. Whether study results prompt further pre-clinical or early clinical work, or translation to clinical use, ability to judge quality, credibility, and relevance of those results is critical. It is important to avoid sending research programs down unproductive paths or allowing flawed research results such as poorly performing prognostic models or therapy selection algorithms generated from HDD to be implemented clinically [ 6 ]. Historically, research involving biomarkers and prognostic modelling has been criticized for lack of rigor, reproducibility, and clinical relevance [ 7 , 8 , 9 , 10 ], and for poor reporting [ 11 , 12 ]. At least as many deficiencies are also common in biomedical research involving HDD. The goal of STRATOS TG9 is to reduce these deficiencies, and improve rigor and reproducibility, by providing widely accessible didactic materials pertinent to studies involving HDD.

Study design

In any observational study, including in the HDD setting, study design plays a crucial role in relation to the research question. A first important point is the precise definition of the target population and the sampling procedure. The subjects included in a study (or biospecimens derived from them) may be selected from the population by a random or other statistically designed sampling procedure (e.g., case–control, case-cohort), or may simply represent a “convenience” sample. It is therefore important to understand whether the subjects are representative of the target population, how the variables associated with subjects were measured or ascertained, and whether there are potential confounding factors. Failure to account for confounding factors or minimize bias in subject or variable ascertainment can lead to useless or misleading results.

Outcome-dependent sampling is rather common in observational studies, particularly for those investigating risk factors for relatively uncommon diseases or outcomes. Examples include classical matched or unmatched case–control designs along with two-phase sampling from a cohort (case-cohort or nested case–control). Another often-used strategy oversamples long survivors, or, for continuous outcomes, subjects with high and low values of the outcome variable. When any such sampling strategies are employed, it is important to use inferential procedures [ 13 , 14 ] that properly account for the sampling design.

Laboratory experiments generating high-dimensional assay data should adhere to the same best practices as traditional controlled experiments measuring only one or a few analytes, including randomization, replication, blocking, and quality monitoring. Arguably, careful design might be even more important in the setting of HDD generation because HDD assays may be especially sensitive to technical artifacts. Even when a study is primarily observational yet involves analysis of stored biospecimens using omics assays, good design principles should be followed when performing the assays. Best practices include randomizing biospecimens to assay batches to avoid confounding assay batch effects with other factors of interest. For unmatched case–control studies, balancing (randomizing) cases and controls into batches may provide important advantages for reducing the influence of batch effects [ 15 ]. For matched case–control studies or studies involving analysis of serial specimens from each subject, grouping matched or longitudinal sets within the same assay batch can be a convenient way to control for batch effects.
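As a minimal sketch of the balanced randomization described above for an unmatched case–control study, the following Python code shuffles cases and controls separately and deals them round-robin into assay batches. The function name assign_batches, the specimen labels, and the batch count are hypothetical choices for illustration and are not taken from the paper or from [ 15 ].

```python
import random

def assign_batches(case_ids, control_ids, n_batches, seed=0):
    """Randomly assign specimens to batches, balancing cases and controls."""
    rng = random.Random(seed)
    batches = [[] for _ in range(n_batches)]
    offset = 0
    for group in (case_ids, control_ids):
        shuffled = list(group)
        rng.shuffle(shuffled)
        # Deal the shuffled specimens round-robin so that every batch
        # receives (nearly) the same number of specimens from this group.
        for i, specimen in enumerate(shuffled):
            batches[(i + offset) % n_batches].append(specimen)
        offset += len(shuffled)
    # Randomize the run order within each batch as well.
    for batch in batches:
        rng.shuffle(batch)
    return batches

# Hypothetical specimen labels: 12 cases and 12 controls, 4 assay batches.
cases = [f"case{i:02d}" for i in range(12)]
controls = [f"ctrl{i:02d}" for i in range(12)]
batches = assign_batches(cases, controls, n_batches=4)

for k, batch in enumerate(batches):
    n_cases = sum(s.startswith("case") for s in batch)
    print(f"batch {k}: {n_cases} cases, {len(batch) - n_cases} controls")
```

Because case status is spread evenly over batches, a batch effect cannot masquerade as a case–control difference. For matched designs the text recommends a different scheme, keeping each matched set within one batch, which this sketch does not implement.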

Another fundamental aspect of design is sample size, which here means the number of different subjects measured; measurements on different subjects are referred to as biological replicates. Whenever there is interest in making inference beyond an individual subject, e.g., assessing differential gene expression between groups of subjects with different phenotypes or exposed to different conditions such as treatments, biological replicates are required. In the HDD setting, standard sample size calculations generally do not apply. If statistical tests are performed one variable at a time (e.g., differential expression of each gene comparing two groups), then the number of tests performed for HDD is typically so large that a sample size calculation applying stringent multiplicity adjustment would lead to an enormous sample size. Alternative approaches to controlling false positive findings in HDD studies are discussed in section “ TEST: Identification of informative variables and multiple testing .” If the goal is to develop a risk or prognostic model using HDD, typical recommendations about the number of events required per variable break down [ 16 ]. Other sample size methods that require assumptions about the model are challenging to implement considering the complexity of models that might be used in HDD settings [ 17 , 18 ], as discussed in section “PRED2.4: Sample size considerations.” In reality, HDD studies are often conducted with inadequate sample size, which is an important reason why many results are not reproducible and never advance to use in practice [ 19 ].
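To illustrate how stringent multiplicity adjustment inflates the required sample size, the sketch below uses the textbook normal-approximation formula for a two-sided, two-sample comparison of means, n per group = 2 (z_{1-alpha/2} + z_{power})^2 / d^2, where d is the standardized mean difference. The choices d = 0.5, 80% power, 20,000 tests, and Bonferroni adjustment are illustrative assumptions, not values from the paper.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha, power=0.8):
    """Approximate per-group n for a two-sided two-sample z-test.

    effect_size is the standardized mean difference (difference / SD).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

n_tests = 20_000  # hypothetical number of genes tested one at a time
single = n_per_group(0.5, alpha=0.05)
adjusted = n_per_group(0.5, alpha=0.05 / n_tests)  # Bonferroni-adjusted level

print("per-group n, one test:", single)
print("per-group n, Bonferroni over", n_tests, "tests:", adjusted)
```

Under these assumptions the adjusted requirement is several times the single-test requirement, and the gap widens as the assumed effect size shrinks or the number of tests grows.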

It is important to distinguish technical from biological replicates. Technical replication refers to repeating the measurement process on the same subject. It should not be confused with sample size. Technical replicates are useful for evaluating the variability in the measurement process, which may be comprised of multiple steps each potentially contributing to the total error in the measurement. Reference [ 20 ] (Fig. 1 ) described the many steps in gene expression microarray analysis of mouse brains. Technical replication could theoretically be carried out at any of those steps. Sometimes measurements are repeated using an alternative non-high-throughput measurement technique (e.g., RT-PCR assay to measure expression or Sanger sequencing of a specific gene) as a form of measurement validation, but this must not be confused with other forms of validation such as clinical validation of a prediction model (see section “ PRED2: Assess performance and validate prediction models ”). In the presence of budget constraints, if the goal is to compare different biological conditions, it is advisable to invest in biological replicates. When biological samples are inexpensive compared to the cost of the measurement process, pooling is sometimes recommended as a way to reduce costs by making fewer total measurements [ 21 ]. However, caution is advised, as assumptions may be required about assay limits of detection or the correspondence between physical pooling and additivity of measurements [ 22 ]. The context of any technical replication must be carefully described along with any methods of summarizing over replicates in order to interpret results appropriately.

Correlogram of 12 male-specific genes expressed as log-counts-per-million from 69 lymphoblastoid cells derived from male (29) and female (40) Yoruba individuals. Variables (genes) are reordered to emphasize the similarity of their relations. Lower triangle: correlations shown by color and intensity of shading; upper triangle: by circle filled proportionally to the correlation strength. Given the symmetrical nature of a correlogram, often different representations are used for the lower and the upper triangles. Source for the data [ 27 ]

Design of a study should ideally be placed in the context of an overarching analysis plan. Each individual study should be designed to produce results of sufficient reliability that its results will inform next steps in the research project.

Structure of the paper

This paper is organized with respect to subtopics that are most relevant for the analysis of HDD, particularly motivated by typical aims of biomedical studies but also applicable more generally. These subtopics are initial data analysis (IDA and Preprocessing, section “ IDA: Initial data analysis and preprocessing ”), exploratory data analysis (EDA, section “ EDA: Exploratory data analysis ”), multiple testing (section “ TEST: Identification of informative variables and multiple testing ”), and prediction (section “ PRED: Prediction ”). For each subtopic, we discuss a list of main analytical goals. For each goal, basic explanations, at a minimally technical level, are provided for some commonly used analysis methods. Situations are identified where performance of some traditional, possibly more familiar, statistical methods might break down in the HDD setting or might not be possible to apply at all when p is larger than n. Strengths and limitations of competing approaches are discussed, and some of the gaps in the availability of adequate analytic tools are noted when relevant. Many key references are provided. It should be noted that throughout this paper we are concerned almost exclusively with cross-sectional or independent observations rather than longitudinal observations.

Topics in the paper are organized into sections according to the structure summarized in Table 1 , followed by a discussion of the importance of good reporting to improve transparency and reproducible research in the “ Discussion ” section and a summarizing discussion in the “ Conclusions ” section.

IDA: Initial data analysis and preprocessing

Initial data analysis (IDA) is an important first step in every data analysis and can be particularly challenging in HDD settings. IDA is a term for all steps of data inspection and screening after the analysis plan and data collection have been finished but before the statistical analyses are performed [ 23 , 24 ]. It focuses on understanding the context in which the data were collected, on data cleaning (see section “ IDA1: Identify inconsistent, suspicious or unexpected values ”), and on data screening (see section “ IDA2: Describe distributions of variables, and identify missing values and systematic effects due to data acquisition ”). Data cleaning refers to identifying and possibly correcting errors. Data screening includes reviewing the characteristics of the data that could affect the subsequent analysis plan, for example, describing distributions of variables, checking assumptions required for model fitting and hypothesis testing, describing missing values, and identifying the need for adjustments of systematic effects due to data collection. Systematic effects may include batch effects that are caused, e.g., by different technologies used for collecting the data or even by different technicians performing laboratory experiments; see section “IDA3.2: Batch correction” for details. Further, initial steps may include simplification of data, e.g., by excluding or collapsing variables, if deemed appropriate. Insights about the data gained from these screening steps might lead to refinement or updating of an analysis plan to ensure that the data are consistent with any assumptions or requirements of the proposed analysis strategies (see section “ IDA4: Simplify data and refine/update analysis plan if required ”). However, IDA should always be conducted independently of the analysis needed to address the research questions, in order to avoid biasing conclusions.

The term “data preprocessing” is often used in biomedical research involving analysis of HDD, especially in the omics field, to denote certain initial data cleaning and screening steps falling within the more general category of “initial data analysis.” Data preprocessing refers to the process of transforming “raw” data, obtained directly from the measurement instrument, into quantifications that are suitable for the subsequent statistical analysis. This includes detection and handling of incomplete, incorrect or inaccurate values, application of normalization methods that aim to remove systematic biases (e.g., assay batch effects), and transformations of variables [ 25 ].

A first step of the data cleaning and screening process is often to standardize the names or terms of variables and observations, especially for omics data compiled using different technologies. This type of standardization helps facilitate other, more complex downstream analyses and interpretation of results, as well as better online dissemination and archiving of data.

The IDA material is organized for ease of discussion, but the IDA process is typically iterative. Preprocessing is discussed in section “ IDA3: Preprocessing the data ,” but after preprocessing one may need to go back to the data cleaning and screening steps described in sections “ IDA1: Identify inconsistent, suspicious or unexpected values ” and “ IDA2: Describe distributions of variables, and identify missing values and systematic effects due to data acquisition .” Note also that some model-based methods used for the identification of informative variables incorporate normalization into the data analysis model (see section “ TEST: Identification of informative variables and multiple testing ”).

IDA1: Identify inconsistent, suspicious or unexpected values

Identification and handling of incomplete, incorrect, or inaccurate values is logically a first step in IDA. Attention is directed toward distinguishing aberrant values that clearly originate from the data collection or generation process from those that might reflect true biological variability. Both visual and analytical inspections of the data are used for the detection of such values.

IDA1.1: Visual inspection of univariate and multivariate distributions

Graphical displays are helpful to both understand the structure of the data and detect potential anomalies. For HDD, it is rarely feasible to conduct a detailed examination of the distribution of every variable individually. Visual displays might be constructed only after variables of interest have been identified, for example because a gene is differentially expressed between two experimental conditions or because a particular variable is identified to have an unusual distribution by calculation of summary statistics or has an outlier. A practical alternative is to first calculate scores (summary statistics) for each variable or pair of variables, and then select both typical and interesting atypical variables, with respect to distributions of the scores, for more detailed inspection of their univariate or bivariate distributions. Types of scores to be used in these analyses should include those that capture specific features of the distributions, including measures of location, dispersion, skewness, kurtosis for univariate distributions, linear relationships for bivariate distributions, and metrics to detect outliers or influential values (Table 2 ).
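The score-then-inspect strategy described above can be sketched in a few lines of Python. The code computes location, dispersion, skewness, excess kurtosis, and a simple outlier score (largest absolute z-score) for each variable, then ranks variables so that the most atypical ones are inspected first. The simulated data, the planted aberrant value, and the choice of ranking score are assumptions made for illustration; Table 2 of the paper lists the scores the authors actually recommend.

```python
import random
from statistics import mean, pstdev

def summary_scores(values):
    """Return location, dispersion, shape, and outlier scores for one variable."""
    m = mean(values)
    s = pstdev(values)
    z = [(v - m) / s for v in values]
    return {
        "mean": m,
        "sd": s,
        "skewness": mean([zi ** 3 for zi in z]),
        "excess_kurtosis": mean([zi ** 4 for zi in z]) - 3.0,
        "max_abs_z": max(abs(zi) for zi in z),
    }

# Simulated data: 200 variables measured on 50 subjects (illustrative only).
rng = random.Random(1)
n_subjects, n_variables = 50, 200
data = [[rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        for _ in range(n_variables)]
data[7][0] = 15.0  # plant one aberrant value in variable 7

scores = [summary_scores(row) for row in data]

# Rank variables by the outlier score and look at the top few in detail.
ranked = sorted(range(n_variables),
                key=lambda j: scores[j]["max_abs_z"], reverse=True)
print("variables to inspect first:", ranked[:5])
```

The same ranking can be repeated with skewness or kurtosis as the key, and a handful of typical variables from the middle of each ranking can be plotted alongside the extremes for comparison.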

Example of a heatmap, which is a data representation in the form of a map in which data values are color coded. Here, sequencing data from the 1000 Genomes Project [ 30 ] are visualized. Rows correspond to samples and are ordered by processing date, and columns represent genome location of the corresponding sequence. One can see that for the dates 243–254, orange color indicating high values is overrepresented, compared to blue color indicating low values. This demonstrates that so-called batch effects are present, i.e., systematic biases in the data, which are discussed in detail in section “IDA3.2: Batch correction.” Source for the data: [ 29 ]

IDA2: Describe distributions of variables, and identify missing values and systematic effects due to data acquisition

IDA2.1: Descriptive statistics

For understanding the structure of data, often univariate measures for location and scale of the variables are informative. In the HDD setting, graphical display is often helpful to scan these measures across the large number of variables, both for detecting anomalies in the data and for a general exploration of variable distributions and their consistency with assumptions required for certain analysis methods. An example of the use of boxplots and of smooth histograms for exploratory purposes can be found in [ 31 ].

Standardization of data values is often performed prior to data analyses. Typically, this refers to normalization with respect to scale and location (e.g., subtract mean or median and divide by standard deviation). This can be helpful to give variables similar weight, especially if they are measured on different scales. However, standardization removes information about absolute magnitude of effects, so it should not be used when the actual magnitude of differences is of interest (e.g., differences in mean expression values between two groups). Another caution is that HDD will typically contain a certain number of variables that are uninformative because they do not vary much across observations, with variability essentially reflecting noise in the data. Standardization of such variables can exaggerate the noise to give these variables undue influence in analyses that is on par with that of truly informative variables. It is often preferred to drop such uninformative variables at the start of analyses (Table 3 ).
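The caution about standardizing near-constant variables can be made concrete with a small sketch. Below, one hypothetical variable varies substantially across subjects and another is essentially flat; after z-scoring, both have unit standard deviation, so the flat variable's noise is scaled up to the same level as the informative signal. Filtering on the raw standard deviation before standardizing avoids this. The variable names, data values, and the threshold min_sd = 0.1 are illustrative assumptions.

```python
from statistics import mean, stdev

def standardize(values):
    """Center by the mean and scale by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def filter_low_variance(data, min_sd):
    """Keep only variables whose raw standard deviation is at least min_sd."""
    return {name: vals for name, vals in data.items() if stdev(vals) >= min_sd}

# Hypothetical measurements on six subjects.
data = {
    "informative": [2.0, 8.0, 3.0, 9.0, 2.5, 8.5],
    "flat": [5.00, 5.01, 4.99, 5.00, 5.01, 4.99],
}

for name, vals in data.items():
    print(name, "raw SD:", round(stdev(vals), 4),
          "SD after standardizing:", round(stdev(standardize(vals)), 4))

# Drop uninformative variables first, then standardize what remains.
kept = filter_low_variance(data, min_sd=0.1)
standardized = {name: standardize(vals) for name, vals in kept.items()}
print("variables kept:", sorted(standardized))
```

A sensible threshold depends on the measurement scale and assay, so in practice it would be chosen from the observed distribution of per-variable standard deviations rather than fixed in advance.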

Visualization of the insights obtained from an RLE plot, representing ( a ) log gene expression distributions for 27 samples (without performing quantile normalization) and ( b ) relative log gene expression distributions for the same 27 samples. The RLE plot makes it possible to highlight the unwanted variation due to the between-batch variation (cyan versus magenta boxplots) as well as the within-batch variation as suggested by both the difference in location (median further from 0) and spread (higher IQR) of the boxplots. This interpretation is under the often-plausible assumption that expression levels of most genes are unaffected by the biological factors of interest. Source: [ 32 ]
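A minimal sketch of the computation behind an RLE plot, assuming the usual definition: for each gene, subtract that gene's median log expression across samples, then summarize the resulting relative values per sample by median and IQR (the quantities a boxplot displays). The simulated data, with four samples in one batch and two samples shifted by one log unit in a second batch, are an illustrative assumption and not the data of [ 32 ].

```python
import random
from statistics import median, quantiles

def relative_log_expression(log_expr):
    """log_expr[g][s] is the log expression of gene g in sample s."""
    rle = []
    for gene_row in log_expr:
        gene_median = median(gene_row)
        rle.append([value - gene_median for value in gene_row])
    return rle

def per_sample_summary(rle):
    """Return (median, IQR) of the relative log expression for each sample."""
    summaries = []
    for s in range(len(rle[0])):
        column = [row[s] for row in rle]
        q1, _, q3 = quantiles(column, n=4)
        summaries.append((median(column), q3 - q1))
    return summaries

# Simulated data: 500 genes, 6 samples; the last two samples carry a batch shift.
rng = random.Random(0)
batch_shift = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
log_expr = []
for _ in range(500):
    base = rng.uniform(2.0, 10.0)  # gene-specific expression level
    log_expr.append([base + shift + rng.gauss(0.0, 0.2) for shift in batch_shift])

summary = per_sample_summary(relative_log_expression(log_expr))
for s, (med, iqr) in enumerate(summary):
    print(f"sample {s}: median RLE = {med:.2f}, IQR = {iqr:.2f}")
```

Samples whose median RLE sits far from 0, or whose IQR is much larger than that of the others, are candidates for unwanted technical variation under the assumption stated in the caption that most genes are unaffected by the biology of interest.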

Comparison of a scatterplot (left) and a Bland–Altman plot (right; called an MA plot for omics data) of the same data. In this example, the predicted values of two regression models (including and excluding a variable called FLGROSS) are compared. The scatterplot shows similar values for most observations, with points close to the diagonal. The Bland–Altman plot, with differences on the y -axis (for MA plots of omics data, typically log-ratios), better visualizes the dependence of the differences on the average value of the predictions (typically average log intensity for MA plots). The smoothing line in the example Bland–Altman plot indicates the shape of dependence of the differences on the average values. Source: [ 35 ]
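The coordinates of a Bland–Altman plot are simple to compute: for each pair of measurements, the x value is their average and the y value is their difference. The sketch below also returns the mean difference and the conventional 95% limits of agreement (mean difference plus or minus 1.96 standard deviations), which are a standard addition to such plots but are not mentioned in the caption above. The two prediction vectors are made-up numbers for illustration.

```python
from statistics import mean, stdev

def bland_altman(x, y):
    """Return averages, differences, mean difference, and limits of agreement."""
    averages = [(a + b) / 2 for a, b in zip(x, y)]
    differences = [a - b for a, b in zip(x, y)]
    bias = mean(differences)
    sd = stdev(differences)
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return averages, differences, bias, limits

# Hypothetical predictions from two models for the same four observations.
model_a = [1.0, 2.0, 3.0, 4.0]
model_b = [1.1, 1.9, 3.2, 3.8]

averages, differences, bias, limits = bland_altman(model_a, model_b)
for avg, diff in zip(averages, differences):
    print(f"average = {avg:.2f}, difference = {diff:+.2f}")
print("mean difference:", round(bias, 3))
print("limits of agreement:", tuple(round(v, 3) for v in limits))
```

For an MA plot of omics data the same computation would be applied to log intensities, so that the differences become log-ratios and the averages become average log intensities.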

IDA2.2: Tabulation of missing data

Missing values are ubiquitous in real-world data and may have major implications for choice of analysis methods and interpretation of results [ 36 ]. In fact, most multivariable and multivariate analysis methods have as their default requirement that values of all variables are available for all subjects, i.e., all observations are “complete.” An important early step in any analysis is tabulation of the missing values, i.e., the identification of the number of missing values per subject and per variable, respectively, to provide an overview of the missing data structure. In multi-omics integrative studies, high-dimensional data from different data types are collected for the same subjects. In such studies, small sample size caused by experimental and financial constraints, which can also vary between data types, can be the reason for missing data, which has to be taken into account in the subsequent statistical analysis.
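Tabulating missing values per subject and per variable can be sketched as follows. The subject identifiers, variable names, and the use of None to mark a missing value are illustrative assumptions; real data would typically come from a data frame or matrix with its own missing-value code.

```python
# Hypothetical records: subject -> {variable: value or None if missing}.
records = {
    "S1": {"geneA": 1.2, "geneB": None, "protX": 0.4},
    "S2": {"geneA": None, "geneB": None, "protX": 0.9},
    "S3": {"geneA": 0.7, "geneB": 2.1, "protX": None},
}

# Number of missing values for each subject.
missing_per_subject = {
    subject: sum(value is None for value in values.values())
    for subject, values in records.items()
}

# Number of missing values for each variable; a variable absent from a
# subject's record is also counted as missing.
variables = sorted({var for values in records.values() for var in values})
missing_per_variable = {
    var: sum(values.get(var) is None for values in records.values())
    for var in variables
}

# Subjects usable by methods that require complete observations.
complete_subjects = [s for s, k in missing_per_subject.items() if k == 0]

print("missing per subject: ", missing_per_subject)
print("missing per variable:", missing_per_variable)
print("complete subjects:   ", complete_subjects)
```

In this toy example no subject is complete, which shows how quickly a complete-case requirement can exhaust the sample when several data types are combined.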

IDA2.3: Analysis of control values

Laboratory assay measurements can be affected by technical artifacts related to factors such as reagent lots, equipment drift, or environmental conditions. Sometimes these artifacts can be detected, and potentially adjusted for, through use of control and reference standard samples, which have expected distributions of measurements. For single-analyte assays, a calibration process is typically performed to adjust raw measurements and produce final reported values (Table 4 ).

Visualization of calibration curves, representing the relationship between values of an analyte measured on a set of samples by some experimental assay (y-axis) and values obtained for those samples from some reference assay that is considered to represent truth and to be measured with negligible error (x-axis). The curve may be inverted to correct values obtained from the experimental assay to bring them closer to values of the analyte that would have been expected from the reference assay. Source: [ 37 ]
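
Inverting a calibration curve can be sketched under the simplifying assumption that the curve is a straight line fitted by ordinary least squares; real calibration curves are often nonlinear. The function names and the numbers below are hypothetical.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def calibrate(y_new, intercept, slope):
    """Invert the fitted line to map an experimental value to the reference scale."""
    return (y_new - intercept) / slope

# Hypothetical calibration samples: reference assay (x) and experimental assay (y).
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
experimental = [2.5, 4.5, 6.5, 8.5, 10.5]
intercept, slope = fit_line(reference, experimental)
corrected = calibrate(7.5, intercept, slope)
```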

IDA2.4: Graphical displays

Systematic artifact effects arising from data acquisition processes can often be detected with graphical displays that visualize the data in a comprehensive manner. A widely used graphical representation for multivariate data is a principal components plot, which is also useful in exploratory data analysis, as described in section “ EDA: Exploratory data analysis ” (Table 5 ).

Principal component analysis plot depicting 62 lymphoma samples represented by their first and second principal component calculated from gene expression profiles comprising expression levels of 4026 genes on each lymphoma sample. The samples have been annotated in the plot according to pathologic subtype: 11 B-cell chronic lymphocytic leukemia (B-CLL; blue squares), 9 follicular lymphoma (FL; black triangles), and 42 diffuse large B-cell lymphoma (DLCL; red dots). Source: [ 39 ]

Biplot constructed by superimposing a PCA plot of 62 lymphoma samples (see Fig. 6 ) onto a PCA plot of genes where first and second principal component for the genes are calculated from gene expression profiles comprising expression levels of the 62 samples for each gene. Genes are represented in the plot as small green dots. Genes representing the three classes well are indicated by numbers. Source: [ 39 ]

IDA3: Preprocessing the data

Data generated by omics assay technologies typically require preprocessing by specially tailored methods that are based on understanding of the sophisticated instrumentation and scientific underpinnings of the technologies. Omics data are some of the most frequently encountered HDD in biomedical settings and are the focus in this paper. However, similar challenges exist with other types of HDD in biomedical research. Notably, high-dimensional imaging data are becoming commonplace, with examples including those generated by digital radiography, PET scans, and magnetic resonance imaging. In the following, we explain the main principles of data preprocessing using omics data examples.

Omics technologies are highly sensitive to experimental conditions and can exhibit systematic technical effects due to time, place, equipment, environmental conditions, reagent lots, operators, etc. In general, the first step of preprocessing aims to obtain an “analyzable” signal from the “raw” measurements. Subsequently, the signal is separated from possible systematic technical effects. The corrected signal may then be transformed to fulfill certain distributional properties, e.g., approximating a normal distribution. Note that sometimes the transformation may be applied before correcting the signal.

Preprocessing aimed at removal of systematic effects is often conducted as a separate step, as part of the IDA process, before the statistical analysis for answering the research question is undertaken. If the data have already been corrected for systematic effects and retain only the signals of interest (e.g., treatment effects), then the preprocessed (“normalized”) measurements for the biological samples can be analyzed using statistical methods that are easily accessible to researchers. However, conducting normalization as a separate step has important disadvantages. For instance, the normalized values are estimates and often carry some uncertainty, which should be taken into account in the analysis of the normalized data, although doing so complicates the statistical analysis.

If inferential analysis is of interest, e.g., when comparing groups of samples to assess for biological differences, then a preferred approach is to consider normalization as part of a comprehensive statistical analysis model. The model is then used both to remove systematic technical differences and to quantify biological effects of interest (e.g., treatment effects). In that case, the uncertainty related to the normalization part of the analysis is naturally included in the estimates of uncertainty (standard errors) of the quantities of biological interest.

IDA3.1: Background subtraction and normalization

Omics data are prone to perturbations due to systematic effects induced by the measurement technology, also referred to as the assay platform . Many of these effects are unique to the assay platform, but there are some commonalities. A biological sample may have its gene expression profile measured using a single microarray or gene chip or its protein profile measured using a mass spectrometry system. The set of experimental conditions that gives rise to profiles such as these will be referred to here as an experimental run . However, even for the same sample, measurements obtained in different runs may differ due to factors such as different amounts of biological material input to the measurement system, settings on the instrumentation, environmental conditions in the laboratory, and so forth. These “between-run” differences may confound the “between sample” comparisons of scientific interest. Thus, these nuisance run effects should be removed to allow valid comparisons among data obtained in different runs. A generic preprocessing step aimed at removing between-run differences is often termed normalization . Even before normalization methods are applied, data generated by omics technologies generally require correction to subtract background noise from measurements to reveal their signal components. In Table 6 we introduce some basic methods for background subtraction and normalization.

Boxplots representing artificial distributions of values for 30 samples (subjects), before quantile normalization (top) and after quantile normalization (bottom), showing that all distributions are fully aligned with each other after the transformation. Source: [ 44 ]
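
A minimal sketch of quantile normalization may help to see why the distributions are fully aligned afterwards: each value is replaced by the mean, across samples, of the values that share its rank. The sketch assumes equal-length samples without missing values and does not treat ties specially; the function name and data are hypothetical.

```python
def quantile_normalize(samples):
    """samples: one list of values per sample, all of equal length.
    Returns normalized samples that share one common set of values."""
    n_values = len(samples[0])
    sorted_samples = [sorted(sample) for sample in samples]
    # Reference distribution: mean of the k-th smallest values across samples.
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples) for k in range(n_values)
    ]
    normalized = []
    for sample in samples:
        order = sorted(range(n_values), key=lambda i: sample[i])
        out = [0.0] * n_values
        for rank, index in enumerate(order):
            out[index] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical toy data: three samples with three measured variables each.
samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0], [3.0, 4.0, 8.0]]
normalized = quantile_normalize(samples)
```

After the transformation every sample contains the same set of values, only in a different order, so their boxplots coincide.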

IDA3.2: Batch correction

Another example of a systematic effect that is common to many technologies is a “batch effect.” The effect may arise when groups of biological samples (“batches”) have something in common in the way they are processed, e.g., same day or time of day, on same instrument, same operators, but these aspects are different for other groups of samples. Besides these measurement conditions, factors at play prior to measurement can cause batch effects. For example, clinical centers might differ in their standard operating procedures for processing, handling, and storing biospecimens, giving rise to pre-analytic factors that could influence downstream measurements. Patient characteristics, co-morbidities, or concomitant medications could additionally vary by batch, and may give rise to different distributions of measured values that have biological basis. Batch effects are widespread [ 29 ]. The challenge for batch correction is removal of nuisance effects such as those due to pre-analytic or technical factors while not inadvertently removing true biological differences. To facilitate appropriate correction, batch information such as dates, instrument, operator, and specimen collection sites should be recorded and patient factors might need to be taken into account in analyses. Above all, it is critical to avoid poor study designs in which important patient characteristics (including outcomes) are confounded with nuisance batch effects, as this could make it impossible to remove nuisance batch effects adequately.

Preprocessing of omics data aimed at removal of the aforementioned artifact effects poses several challenges. For instance, normalization is often data-driven and uses methods based on assumptions about the nature of the biological mechanisms. If those assumptions do not hold, then the methods might not work as intended. An example of a commonly made assumption in experiments involving genome-wide expression data is that most genes are not differentially expressed under the compared conditions. It may be challenging to verify whether such assumptions are correct.

The dependence of systematic effects on the platform raises an important issue for novel technologies, for which sources of measurement variation may not be fully established or understood. Out of convenience, preprocessing approaches developed for one platform have often been applied to other platforms. For example, normalization methods developed for microarrays are also used for proteomic [ 45 ] and metabolomic [ 46 ] mass spectrometry experiments. This might be reasonable in some settings, but the assumptions required for adequate performance of a normalization method should always be reviewed carefully for appropriateness prior to its application to another technology.

In addition, it is worth noting that preprocessing may need to be tailored to the analysis goals. For instance, it is problematic to remove batch effects when constructing a classification rule. This is because the measurements for a new sample presented for classification will most likely include effects from batches not represented in the data used to construct the classification rule. Consequently, a classification rule should be constructed using data that have not been batch corrected so that robustness to batch effects can be built in (Table 7 ).

Visualization of the effect of batch correction. Heatmaps of hierarchical clustering of sponge metagenomics data studying two tissues types (C and E) with 2 batches, before and after Combat batch correction. Without batch correction (top figure), the clustering is mainly driven by the batch effect. After correction, the clustering is driven by the tissue type (bottom figure). Source: [ 49 ]

IDA4: Simplify data and refine/update analysis plan if required

The findings from the IDA steps can have substantial impact on the choice of appropriate analytical methods for subsequent statistical analyses. Therefore, the analysis plan should be refined or updated as necessary and according to the relevant findings from the IDA analysis [ 23 ].

IDA4.1: Recoding

Recoding primarily refers to transformations of the (original, raw) data, which allow for easier handling for a specific purpose. This is particularly useful in HDD settings, in which simple representation of the information can be challenging and sometimes even impossible due to the large number of variables (Table 8 ).

IDA4.2: Variable filtering and exclusion of uninformative variables

Variable filtering refers to the exclusion of variables that are considered uninteresting, before the statistical analysis to address the main research question is even started. This practice is widespread in HDD analysis where any steps to reduce the dimensionality and complexity of models at the outset are appreciated. If many irrelevant variables are filtered out, the multiple testing problem (see section “ TEST: Identification of informative variables and multiple testing ”) is diminished, and the statistical power of subsequent analysis steps can substantially increase. However, as discussed below, caution is required when applying certain filtering strategies that may introduce bias (Table 9 ).

IDA4.3: Construction of new variables

Sometimes it is useful to construct new variables as an initial step of data analysis by combining the variables that are available in the dataset in a meaningful way, using expert knowledge. For example, in medical studies investigating factors affecting health, overweight status is often an important variable to consider in the analysis. Because weight and height must be considered together in assessing whether an individual is overweight, constructed variables like body mass index (BMI) have been used. The importance of fat distribution has also been recognized, and it has motivated the combined measure of waist-hip ratio (WHR). Instead of relying on the ability of statistical methods and algorithms to construct such variables implicitly, e.g., during a statistical modelling process, it is useful to be informed by expert knowledge and to include these constructed variables directly in analyses.
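
The two constructed variables mentioned above can be computed directly from the raw measurements. The sketch below uses the usual definitions (BMI as weight in kilograms divided by squared height in meters, WHR as waist circumference divided by hip circumference); the function names and example values are hypothetical.

```python
def body_mass_index(weight_kg, height_m):
    """BMI = weight (kg) / height (m) squared."""
    return weight_kg / height_m ** 2

def waist_hip_ratio(waist_cm, hip_cm):
    """WHR = waist circumference / hip circumference (same units)."""
    return waist_cm / hip_cm

# Hypothetical subject.
bmi = body_mass_index(70.0, 1.75)
whr = waist_hip_ratio(80.0, 100.0)
```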

Not all constructed variables are derived using expert knowledge. Some, like principal component scores (see section “IDA2.4: Graphical displays”), are constructed in an unsupervised manner meaning that they are constructed to capture features of the data based only on the explanatory variables without using dependent variables such as outcomes. These constructed variables are sometimes used as explanatory variables when building prediction models (see section “ PRED: Prediction ”), and they can also be used for exploratory data analysis (see section “ EDA: Exploratory data analysis ”). As discussed in section “IDA2.4: Graphical displays,” plots of (typically the first two) principal components are often helpful for detecting peculiarities in the data or problems such as batch effects. Some constructed variables are derived using outcomes or other dependent variables. Examples of outcome-informed constructed variables include supervised principal component [ 55 ], or partial least squares (PLS) scores (see section “PRED1.3: Dimension reduction” for further discussion). Sometimes new variables are constructed by discretization of continuous variables, but this practice is problematic and should generally be discouraged (Table 10 ).

IDA4.4: Removal of variables or observations due to missing values

The simplest approach to deal with missing data is a “complete case analysis.” That is, if a single variable is missing for an observation, the observation is fully excluded from the dataset. At best, basing analyses on only complete cases leads to a loss of statistical power; at worst, it can lead to substantially biased analyses. The impact of missing data will depend on how many cases have missing data, how many variables have missing values, how many values are missing, and whether the likelihood of missing values in a variable is related to the value of that variable or other variables. When few observations have missing values for few variables, then the impact on results of subsequent analyses may be limited, but when the number is large, the impact can be substantial.

A typical strategy for dealing with missing data is to exclude variables from the analysis that have a large number of missing values. Obviously, the possible relevance of such variables is neglected. Only when the missingness (the events that lead to a value being missing) is independent of both unobserved and observed values, i.e., the data are missing “completely at random” (MCAR), are the results of the complete case analysis (using likelihood-based methods) unbiased. When missing values depend on the unobserved values themselves (e.g., it is more likely that the measurement of a variable is missing when the value of the biomarker is very high or very low), then the missing values are said to be “missing not at random” (MNAR), and the resulting complete case analysis is biased.

Between the two extreme situations of MCAR and MNAR, there is a third possibility: missing values are called “missing at random” (MAR), when the missingness is independent of the unobserved values after controlling for the other available variables. One way to diagnose whether data are MCAR or MAR is to tabulate a missing value indicator against the values of other variables. As an example, if the value of a biomarker (e.g., gene expression level) is missing with higher frequency in males than in females, but within these strata, the missing values are missing completely at random, then it is likely a situation of MAR and not MCAR.
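
Tabulating a missing value indicator against another variable can be sketched as follows. The example assumes that missing values are encoded as None; the variable names and data are hypothetical. A clear difference in missingness rates between groups argues against MCAR, but such a tabulation cannot by itself rule out MNAR.

```python
from collections import Counter

def missing_rate_by_group(rows, variable, group):
    """Proportion of missing values of `variable` within each level of `group`."""
    total = Counter()
    missing = Counter()
    for row in rows:
        level = row[group]
        total[level] += 1
        if row[variable] is None:
            missing[level] += 1
    return {level: missing[level] / total[level] for level in total}

# Hypothetical toy data: biomarker missing more often in males than in females.
rows = [
    {"sex": "M", "biomarker": None},
    {"sex": "M", "biomarker": 2.3},
    {"sex": "M", "biomarker": None},
    {"sex": "M", "biomarker": 1.9},
    {"sex": "F", "biomarker": 2.0},
    {"sex": "F", "biomarker": 2.4},
    {"sex": "F", "biomarker": None},
    {"sex": "F", "biomarker": 1.7},
]
rates = missing_rate_by_group(rows, "biomarker", "sex")
```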

In HDD settings, when a large number of variables must be considered, complete case analysis may require exclusion of too many observations. To avoid this, common approaches involve first removing variables for which more than a specified percentage (e.g., 5 or 10%) of observations are missing and then removing observations for which more than a specified percentage (e.g., 2%) of variables have missing values. For studies with more complex designs, additional considerations may apply. For example, it is common in case–control studies to remove variables for which there is a large imbalance (e.g., more than 5 or 10% difference) in the percentage of missing values between cases and controls.
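
The two-stage filtering described above might be sketched as follows. The thresholds are the example values from the text and would need to be chosen for each study; the function name, the None encoding of missing values, and the toy data are assumptions made for illustration.

```python
def filter_by_missingness(data, max_var_missing=0.10, max_obs_missing=0.02):
    """data: list of observations, each a list of values with None for missing.
    Returns the indices of the variables and observations that are kept."""
    n_obs = len(data)
    n_var = len(data[0])
    # Stage 1: drop variables with too large a fraction of missing observations.
    keep_vars = [
        j for j in range(n_var)
        if sum(row[j] is None for row in data) / n_obs <= max_var_missing
    ]
    if not keep_vars:
        return [], []
    # Stage 2: among the remaining variables, drop observations with too many gaps.
    keep_obs = [
        i for i, row in enumerate(data)
        if sum(row[j] is None for j in keep_vars) / len(keep_vars) <= max_obs_missing
    ]
    return keep_vars, keep_obs

# Hypothetical toy data: 10 observations, 4 variables.
data = [[1.0, 2.0, 3.0, None if i % 2 == 0 else 4.0] for i in range(10)]
data[9][0] = None  # a single missing value in the first variable
keep_vars, keep_obs = filter_by_missingness(data)
```

In the toy data the last variable is removed because half of its values are missing, and the last observation is then removed because one of its three remaining values is missing.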

IDA4.5: Imputation

For MAR situations, methods more sophisticated than complete case analyses or dropping variables are recommended to use the information from all observations in the study and obtain less biased results. An example method is multiple imputation, which is described below. Although imputation is a useful strategy, it should be understood that no single approach for dealing with missing data is fully satisfactory. Thus, the best approach is to carefully select variables that are both informative and feasible to collect when designing studies and then work diligently to collect those data as completely as possible in order to minimize the amount of missing information. In the context of LDD, a framework for the treatment and reporting of missing data was proposed [ 58 ].

For HDD, performing a simple multivariable regression in high dimensions is typically not feasible. Therefore, most procedures for handling missing data in the HDD setting either involve a phase in which only those variables deemed important are selected for imputation, or use some form of regularized regression [ 59 ] instead of standard multivariable regression. The handling of missing data in HDD settings is an active topic of research. Many tailor-made imputation algorithms have already been developed; for an early overview in the context of gene expression measurements, see [ 60 ] (Table 11 ).

EDA: Exploratory data analysis

When performing statistical analyses, it is important to distinguish between exploratory data analysis (EDA) and confirmatory data analysis, as this has important consequences both for the selection of appropriate analytical methods and for the correct interpretation of the results. The starting point for confirmatory analysis is a hypothesis to be evaluated, whereas in EDA the goal is to provide an unbiased view of the data. Insights from EDA may then lead to development of new hypotheses that can be evaluated in subsequent confirmatory analyses on independent data.

Caution is necessary when performing statistical inference (e.g., feature selection as described in section “ TEST: Identification of informative variables and multiple testing ”) or model performance assessment following EDA when decisions to remove or modify observations from the analysis might depend on the observed relationships one is trying to confirm. For example, if outlier observations are removed from a dataset, the performance of a prediction model built only on the remaining observations is most probably an overly optimistic estimate of what the model performance would be on an independent dataset, which might contain different outliers.

Two major analytical goals for EDA are (1) to identify interesting data characteristics such as variables with extreme values, associations between variables, or representative subjects with usual values of variables, and (2) to gain insight into the structure of the data. Note that many of the methods used in EDA are also applied in IDA (like PCA; see section “IDA2.4: Graphical displays”). In this section, we focus on methods that are more specific to EDA. Note that many of the methods described in this section are generally designed and suitable for continuous data; only some can also be applied for discrete data.

EDA1: Identify interesting data characteristics

EDA can assist a researcher to identify interesting data characteristics that may lead to generation of specific scientific hypotheses that can be more fully evaluated in subsequent studies. Through EDA, a researcher might identify variables exhibiting extreme values or study subjects (observations) having extreme values of one or more variables or unusual combinations of values of two or more variables. EDA might also reveal intriguing associations between variables (e.g., levels of a certain protein tend to differ between two phenotypic classes). The two main classes of exploratory methods for identifying such interesting data characteristics are graphical displays and inspection of descriptive univariate and multivariate summary statistics. Graphical displays are discussed in sections “IDA2.1: Descriptive statistics,” “IDA2.4: Graphical displays,” and “EDA1.1: Graphical displays,” whereas descriptive statistics were already described in section “IDA2.1: Descriptive statistics” as tools for the initial data analysis (IDA). It should be noted that due to the potential for identification of many false positive signals in the HDD setting, findings from large-scale comparisons of descriptive summary statistics are often tempered by application of multiple testing methods as described later in section “ TEST: Identification of informative variables and multiple testing ,” even though the original intent was exploratory analysis.

To identify interesting data characteristics in low-dimensional data via visual or graphical methods, it is usually possible to inspect simple summary statistics and graphical displays of distributions of variables one, two, or three at a time, but for HDD this approach quickly becomes infeasible. For instance, the number of scatterplots for all pairs of p variables is p ( p − 1)/2, which already exceeds 1000 when p exceeds 45. Visual identification of interesting characteristics of HDD typically requires specialized graphical displays or reduction of data dimensionality.
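
The arithmetic behind this statement is easy to check. The short Python sketch below (function name hypothetical) counts the pairwise scatterplots for p variables and finds the smallest p for which the count exceeds 1000; the value of 20,000 variables is an arbitrary illustration of an omics-scale dataset.

```python
from math import comb

def n_pairwise_scatterplots(p):
    """Number of distinct pairs among p variables: p * (p - 1) / 2."""
    return comb(p, 2)

# Smallest number of variables for which more than 1000 scatterplots are needed.
smallest_p = next(p for p in range(2, 10_000) if n_pairwise_scatterplots(p) > 1000)

# Illustrative omics-scale example (the value 20,000 is an assumption).
n_plots_omics = n_pairwise_scatterplots(20_000)
```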

EDA1.1: Graphical displays

As mentioned in section “IDA2.4: Graphical displays,” one can use principal components (PCs) for exploratory analysis by first summarizing the information included in all variables through calculation of PC scores (which are linear combinations of the original variables) and then plotting in two or three dimensions the first several PC scores that capture the majority of variability in the data. This may allow identification of clusters of observations or individual observations with unusual configurations of variables warranting further inspection.

Another goal for HDD visualization is to produce a display in lower dimensions that preserves the distances (more generally degrees of “dissimilarity”) between observations such that the closest points remain the closest and the furthest remain the furthest. Alternative data reduction techniques have been developed to achieve this goal. These methods aim to translate the data in such a way that dissimilarities among points in the lower-dimensional space are as proportional as possible to those quantified in the original (high-dimensional) space. One such technique, multidimensional scaling, is described below. A variation of multidimensional scaling not discussed here is correspondence analysis, which is suitable for categorical variables and shows the relationships between variables based on data specified in a contingency table. Cox and Cox [ 62 ] provide descriptions of both multidimensional scaling and correspondence analysis (Table 12 ).

Two-dimensional visualization of a high-dimensional dataset using t-SNE. The dataset consists of 2700 single cells (peripheral blood mononuclear cells) that were sequenced on an Illumina NextSeq 500. The dataset is freely available from 10X Genomics. Points are colored by cell type. The plot shows that the cell types are locally well separated. Source: [ 71 ]

EDA2: Gain insight into the data structure

A global depiction of data to identify structure, including patterns or motifs, is another major goal of exploratory data analysis for HDD. Here, data structure is understood in a general sense: it refers to many aspects of the data that concern the arrangement or interrelation of the observations or variables of a dataset. Although a natural first step is to look at marginal distributions (e.g., univariate and bivariate) of all variables across observations, this approach is generally not feasible for HDD for reasons discussed above. Further, some structure may involve many different variables and not be discernible by examination of univariate, bivariate, or even trivariate distributions.

The data visualization techniques described in section “EDA1.1: Graphical displays” are often supplemented with additional approaches geared toward detection of certain kinds of structure, for example clusters. The goal of cluster analysis is to identify subgroups of observations or variables that are similar to each other, but different from others. Identification of prototypical observations to characterize each cluster might be of interest. The structure might also be multi-level. In this section, we focus on techniques that are useful to uncover structure that might be missed by examining only marginal distributions or low-dimensional representations of HDD.

EDA2.1: Cluster analysis

The goal of a cluster analysis is to assemble objects (observations or variables) into subgroups, termed clusters, such that similarities between members within the clusters are high (or, equivalently, distances are small), compared to similarities between members from different clusters. Sometimes, the goal is only to find dense, i.e., heavily populated, regions in the data space that correspond to modes of the data distribution. Alternatively, there may be interest in fully characterizing the structure. Cluster analyses typically require choice of a similarity metric (or, alternatively, distance metric) for pairs of objects (sometimes also for pairs of clusters), a clustering algorithm, and a criterion to determine the number of clusters. Some clustering approaches that have been successfully used for low-dimensional data, e.g., mixtures of low-dimensional parametric probability distributions such as multivariate normal mixtures, either cannot be applied at all or perform very poorly in the HDD setting. Approaches not suitable for HDD are not further discussed here.

For comparing similarity of objects (either variables or observations), the Pearson correlation coefficient or Euclidean distance are the most popular metrics. The Pearson correlation does not depend on the scale of the variables, but the Euclidean distance does. If each of the variables characterizing an object is first standardized across the set of objects (subtract mean and divide by standard deviation), then use of Pearson correlation and Euclidean distance metrics will produce equivalent results. The measure should be chosen deliberately. If only relative levels of the values are important, then Pearson correlation is suitable, but if absolute values matter, then Euclidean distance is appropriate. It is important to note that both metrics tend to be more heavily influenced by a few large differences or deviations than by a series of small ones because the values are squared. An important modification of the Pearson correlation is the Spearman (rank) correlation, where values of observations are first replaced by their corresponding ranks before calculating the Pearson correlation. With this adjustment, the results are less heavily influenced by extreme data values.
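
The link between the two metrics can be made concrete. Assuming that each of the two vectors being compared has been centered and scaled to unit standard deviation (using the divisor n) and is not constant, the squared Euclidean distance d² and the Pearson correlation r satisfy d² = 2n(1 − r), so both induce the same ordering of pairs. The Python sketch below checks this identity numerically and also computes the Spearman correlation by ranking first; the function names and data are hypothetical, and ties in the ranks are not handled.

```python
from math import sqrt

def standardize(v):
    """Center to mean 0 and scale to unit standard deviation (divisor n)."""
    n = len(v)
    m = sum(v) / n
    sd = sqrt(sum((x - m) ** 2 for x in v) / n)
    return [(x - m) / sd for x in v]

def pearson(x, y):
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def ranks(v):
    """Ranks 1..n; assumes there are no tied values."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    result = [0] * len(v)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

# Hypothetical data with one extreme value in x.
x = [1.0, 2.0, 3.0, 4.0, 100.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson(x, y)
d2 = squared_euclidean(standardize(x), standardize(y))
rho = spearman(x, y)
```

Because the Spearman correlation only uses ranks, replacing the extreme value 100 by any other value larger than 4 leaves rho unchanged, whereas r would change.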

In high-dimensional spaces, data are typically quite sparse. This means that distances between objects become large, a phenomenon often referred to as the curse of dimensionality. Therefore, the distance metrics may be prone to exaggeration by a few distant objects. Strategies to help avoid this problem include use of data reduction or variable selection before clustering (see section “IDA2.4: Graphical displays” for graphical displays for dimension reduction and section “PRED1.2: Variable selection” for variable selection and dimension reduction in the context of improving prediction models).

Clustering algorithms can be divided into hierarchical and partitioning methods. In hierarchical clustering, observations are iteratively grouped together into larger clusters (agglomerative hierarchical clustering) or clusters are subdivided into smaller clusters (divisive hierarchical clustering). So-called partitioning algorithms are centroid-based: they aggregate the observations around specific points (the centroids) such that observations related to the same centroid are as similar as possible, and observations related to different centroids are as different as possible. Hierarchical clustering algorithms provide a clustering for any number of clusters, whereas partitioning methods require an initial choice about the number of clusters present in the data. The most popular clustering algorithms are described in Table 13 .

Hierarchical clustering result displayed in a dendrogram, where heights in the tree at which the clusters are merged correspond to the between-cluster distances. Source: [ 73 ]

Visualization of the k-means algorithm with an example. Iteratively, observations are assigned to the cluster for which the squared Euclidean distance from the observation to the cluster centroid is minimized, and cluster centroids are computed based on the current cluster memberships. The iterative process continues until no observations are reassigned (as in the case of the last iteration in the figure). Source: [ 75 ]
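
The iteration described in the caption can be sketched in plain Python. This is a minimal illustration, not a production implementation: the initial centroids are supplied by the caller, an empty cluster simply keeps its previous centroid, and the function name and toy data are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centroids, max_iter=100):
    """Alternate between assignment and centroid update until nothing changes."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no observation was reassigned
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its current members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids

# Hypothetical toy data: two well-separated groups in two dimensions.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assignment, centroids = kmeans(points, [points[0], points[3]])
```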

Other methods for cluster analysis applied to biomedical data include fuzzy clustering and SOMs (self-organizing maps). In fuzzy clustering, objects can belong to multiple clusters. In SOMs (a type of neural network first introduced by Kohonen [ 77 ]), a meaningful topology (spatial relationships) between the cluster prototypes is assumed. This means that the clusters can be visualized as a two-dimensional “map,” so that observations in proximate clusters have more similar values than observations in clusters that are more distant. Since the assumptions for SOMs are not guaranteed to hold, the interpretation can easily be misleading, such that SOMs should only be used by experts in this field. In addition, SOMs can be very sensitive to starting node configurations.

For HDD, the computer runtime of such partitioning algorithms can present a challenge. For example, PAM cannot be applied if the number of objects to be clustered is very large, e.g., when clustering variables in omics data or observations in large health records data. This challenge motivated development of the algorithm CLARA (Clustering Large Applications) [ 78 ], which works on subsamples of the data. Distribution-based clustering methods provide another alternative where probabilistic distributions for the observations within the clusters are assumed (e.g., multivariate Gaussian in each cluster, but with different means and potentially different variances). Parameters of the mixture distribution are typically estimated with EM-type (expectation–maximization) iterative algorithms [ 79 ]. However, particularly (but not only) for HDD, the distributional assumptions are often difficult to verify and the algorithms may not converge to a suitable solution. Therefore, clusters might not be identified at all, or the results could be misleading due to incorrect assumptions about the data distributions.

Results produced by clustering algorithms are difficult to evaluate and often require subjective judgement. The validity of the results depends on the notion of a cluster, which varies between clustering algorithms, and this ambiguity carries through to estimation of the number of clusters (Table 14 ).

Example of a scree plot, which involves plotting some measure of within-cluster variation (here the total within sum of squares) on the y -axis and the number of clusters assumed in applying the algorithm on the x -axis. Source: [ 80 ]

Silhouette values for observations that are grouped into four clusters. Observations are sorted along the x -axis by decreasing silhouette value, grouped by the four clusters. The silhouette values for the observations of the first two clusters have very low values, indicating two not well-separated clusters. Source: [ 82 ]
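
The caption does not state how a silhouette value is computed. The sketch below uses the usual definition, s = (b - a) / max(a, b), where a is the mean distance from an observation to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. It assumes Euclidean distance, at least two clusters, and the common convention that an observation alone in its cluster gets the value 0; the function name `silhouette_values` is hypothetical.

```python
import math


def silhouette_values(points, labels):
    """Silhouette value of each observation for a given cluster labeling."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Convention assumed here: singleton clusters get silhouette 0.
            values.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        values.append((b - a) / max(a, b))
    return values
```

Values near 1 indicate observations that sit well inside their cluster; values near 0 or below indicate observations lying between clusters or closer to another cluster.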

Some clustering methods have been specifically developed to handle the typical large storage requirements and long run times for HDD settings. For example, CAST (Cluster Affinity Search Technique) [ 83 ] is especially useful for large numbers of observations or variables. Iteratively, clusters are constructed as follows. Choose a randomly selected observation not already assigned to a cluster and assign it to a newly defined cluster. Then repeat the following two steps until the set of observations assigned to this new cluster no longer changes. Add unassigned observations with average similarity to the current cluster members above a predefined threshold, and remove observations with average similarity below this threshold.
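
The add and remove steps described above can be sketched as follows. This is a simplified illustration of the procedure as worded in this section, not the published CAST implementation: it adds and removes observations in batches, always keeps the randomly chosen starting observation in its cluster so that every cluster is non-empty, and caps the number of sweeps to guarantee termination. The names `cast`, `average_similarity`, `similarity`, `threshold`, and `max_sweeps` are hypothetical.

```python
import random


def average_similarity(i, members, similarity):
    """Mean similarity between observation i and the given cluster members."""
    return sum(similarity(i, j) for j in members) / len(members)


def cast(n, similarity, threshold, seed=0, max_sweeps=100):
    """Cluster observations 0..n-1 given a pairwise similarity function."""
    rng = random.Random(seed)
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Open a new cluster with a randomly selected unassigned observation.
        start = rng.choice(sorted(unassigned))
        cluster = {start}
        for _ in range(max_sweeps):
            previous = set(cluster)
            # Add step: unassigned observations similar enough to the cluster.
            candidates = unassigned - cluster
            cluster |= {
                i for i in candidates
                if average_similarity(i, previous, similarity) > threshold
            }
            # Remove step: members whose average similarity fell too low.
            snapshot = set(cluster)
            cluster -= {
                i for i in snapshot
                if i != start
                and average_similarity(i, snapshot, similarity) < threshold
            }
            if cluster == previous:  # membership no longer changes
                break
        clusters.append(sorted(cluster))
        unassigned -= cluster
    return clusters
```

A usage example with one-dimensional values and a similarity of 1 / (1 + distance): `values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]` and `cast(6, lambda i, j: 1 / (1 + abs(values[i] - values[j])), threshold=0.5)` groups the first three and the last three observations.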

Another method is subspace clustering [ 84 ], where first subsets of variables are identified (called subspaces) and clusters are determined by defining regions of values based only on these variables. Then, iteratively, lower-dimensional subspaces are combined to higher-dimensional ones. In biclustering (or two-way clustering), first introduced by Hartigan [ 85 ], variables and observations are selected simultaneously to generate clusters that do not depend on all variables at the same time. Again, heuristic and stable algorithms are required to find approximate solutions in acceptable time (see, e.g., [ 86 ]).

Many traditional clustering methods are best suited for continuous variables, but there are several examples of HDD that are not continuous. One example is count data such as generated by RNA-Seq. Some examples of clustering methods that have been specifically developed for count data include those of Witten [ 87 ] and Si et al. [ 88 ], which are based on Poisson or negative binomial distributions. Cluster analysis based on deep learning has also been proposed [ 89 ]. That approach trains a deep neural network, extracts the resulting hidden variables, and uses them as the basis for clustering using standard methods like k-means.

EDA2.2: Prototypical samples

Often it is useful to construct prototypical observations that represent subgroups of observations. Prototypical observations are, for example, identified by some clustering algorithms. The motivation is to allow visualization or provide a summary of relevant characteristics of subgroups of observations. These summaries can be interpreted in the biomedical context, for example as a description of the characteristics of a typical patient who responds well to a particular therapy. Prototypical samples can be selected as central observations in their respective subgroups, or they can be newly constructed. When applying a k-means algorithm to separate observations into K clusters, centroids of each cluster are natural choices for prototypes. Similar to the principles of many cluster analysis approaches (see section “EDA2.1: Cluster analysis”), the construction of prototypical observations is done such that they are simultaneously as similar as possible to the observations of the same subgroup (cluster) and as different as possible from the observations of the other subgroups. Bien and Tibshirani [ 90 ] provide a nice overview of available methods, although their review is limited to classification problems. Prototypical observations can also be used to represent classes and then to predict the class of a new observation based on the similarities with these prototypical samples (see also section “ PRED: Prediction ”).
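
The two options mentioned above, selecting a central observation or constructing a new one, can be illustrated with a small sketch. It assumes Euclidean distance and observations stored as tuples; `centroid` constructs a new prototype as the coordinate-wise mean, while `medoid` selects the existing observation with the smallest total distance to the other members. Both function names are hypothetical.

```python
import math


def centroid(members):
    """Newly constructed prototype: coordinate-wise mean of the members."""
    return tuple(sum(c) / len(members) for c in zip(*members))


def medoid(members):
    """Selected prototype: the member with the smallest total distance to all members."""
    return min(members, key=lambda p: sum(math.dist(p, q) for q in members))


cluster = [(0, 0), (1, 0), (5, 0)]
print(centroid(cluster))  # a constructed point; need not be an observation
print(medoid(cluster))    # always one of the observations
```

The medoid is an actual patient or sample and can therefore be inspected directly, whereas the centroid may correspond to no observed unit.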

TEST: Identification of informative variables and multiple testing

In HDD analysis, one is often interested in identifying, among a large number of candidate variables, “informative variables.” These are associated with an outcome or with a set of other phenotype variables that characterize the study subjects. For example, one might wish to characterize which single-nucleotide polymorphisms are more often present in patients who experience severe side effects from a particular drug compared to patients without severe side effects. In drug sensitivity screens performed on bacterial cultures, one might aim to identify bacterial genes with expression significantly associated with degree of sensitivity to a new antibiotic. When comparing individuals with a particular disease to healthy volunteers, one might wish to identify circulating proteins that are present in different abundance. In all these cases, evaluation of the associations might be accomplished by conducting many statistical hypothesis tests, one per candidate variable. This represents a multiple testing situation.

Multiple testing scenarios commonly encountered in biomedical studies with HDD are divided here into three categories. Scenarios that consider each candidate variable individually and perform a similar evaluation or statistical test for each include the following three cases: (i) Identification of variables among a set of candidates that are associated with a single outcome or phenotype variable, i.e., related to outcome or phenotype classes (categorical) or correlated with a continuous phenotype variable or time-to-event outcome. (ii) Identification of candidate variables with a trajectory over time affected by experimental factors or exhibiting a prescribed pattern. (iii) Identification of candidate variables that are associated with a prespecified set of other variables, i.e., where the candidate variables are considered as dependent variables and the set of prespecified variables as independent “predictor” variables. To illustrate the concepts, much of the discussion here will focus on a simple example of scenario (i) in which two classes are being compared with respect to a very large number of variables. Methods discussed for scenario (i) that can be extended straightforwardly to scenarios (ii) and (iii) are noted.

Scientific goals may go beyond simply providing a list of individual variables exhibiting associations with an outcome, a phenotype, a collection of prespecified variables, or patterns over time. Frequently, there is interest in more globally characterizing the variables that were included in the identified list. For example, genes are organized into interconnected biological pathways. Expression of two different genes might exhibit similar associations because they are both regulated by certain other genes, because one lies downstream of the other in the same biological pathway, or because their products serve similar biological functions. Established organizational structures might be described by gene taxonomies such as Gene Ontology [ 91 ], KEGG [ 92 ], or BioCarta [ 93 ]. Gene set enrichment analysis (see section “ TEST3: Identify informative groups of variables ”) refers to approaches that exploit these expected associations. They were first proposed in the omics field for use with HDD gene expression data. Although these enrichment analysis strategies could be applied in a variety of HDD settings, subsequent discussion of these methods will be based on examples with high-dimensional gene expression data for which the concept of enrichment is intuitively clear.

TEST1: Identify variables informative for an outcome

Test1.1: test statistics: hypothesis testing for a single variable.

Before discussing multiple testing procedures, it is helpful to briefly review basic concepts in statistical hypothesis testing involving a single variable. A hypothesis test aims to decide whether the data support or refute a stated “null hypothesis.” Typical examples of simple null hypotheses are that the distribution of a variable is not different between two or more groups or that a variable is not associated with another variable. A hypothesis test is based on some statistic that will reflect strength of evidence for or against the null hypothesis. Knowing the distribution of the test statistic (e.g., normal distribution or binomial distribution) allows one to construct a hypothesis test based on that statistic for which the probability of drawing an incorrect conclusion is controlled. Type I error refers to erroneously rejecting the null hypothesis when it is actually true. Type II error refers to failing to reject the null hypothesis when it is actually false. Statistical power is defined as one minus the probability of a type II error. In general, one wants to control the probability of a type I error, denoted α , at a small value, while maintaining acceptably high power (or a low probability of a type II error). A conventional choice of \(\alpha\) for the single variable setting is 0.05, which means that the probability of a false positive decision, i.e., falsely rejecting the null hypothesis when it is true, is 0.05.

Hypothesis testing is often operationalized by calculation of a p -value from the observed data, which estimates the probability of observing a value of the test statistic that is at least as extreme as that observed, assuming that the null hypothesis was true. (Note the correct definition of a p -value stated here, in contrast to the common misinterpretation of a p -value as the probability that H 0 is true). A significance test is performed by comparing the computed p -value to the prespecified α level. When the p -value is less than or equal to α (e.g., 0.05 in the conventional setting), the null hypothesis is rejected; otherwise, it cannot be rejected.

It should be mentioned that sometimes the goal of a scientific study is to estimate certain parameters of interest, for example means or correlations, rather than to test hypotheses. In estimation settings, it is generally desired to provide intervals of uncertainty, such as confidence intervals, to accompany parameter estimates. Although errors in hypothesis testing have some relation to confidence interval coverage probabilities, most of the multiple testing procedures discussed in this section are not readily applicable to multiple estimation. Multiple estimation procedures are beyond the scope of the present discussion.

The t -test is an example of a widely used statistical test for a single variable. It is the basis for the modelling approaches described below that are extensions of hypothesis testing to multiple variables. Extensions particularly developed for HDD include limma, edgeR, and DESeq2, as discussed in section “TEST1.2: Modelling approaches: Hypothesis testing for multiple variables.”

Calculation of a p -value usually requires assumptions about the distribution of the test statistic. Sometimes that distribution can be derived from assumptions about the distributions of the variables. For example, the statistic of the t -test can be shown to have a t -distribution when the variables are normally distributed, and the within-group variances are the same for the classes being compared. Similar requirements hold for F-tests in analysis of variance and statistics associated with standard linear regression analysis. Although one can never be certain if these assumptions hold for real data, many test statistics can be shown by theoretical arguments to have an approximate normal distribution when sample size is sufficiently large (referred to as “asymptotic” approximation). An example asymptotic property is that a t -statistic has an approximate normal distribution for large sample sizes, even if the data are not normally distributed. Nonetheless, extra caution is necessary in the setting of HDD where the requirements for sample size to qualify as “large” are far greater. Extremes of a test statistic’s distribution are particularly prone to departures from data distributional assumptions, and this is exactly where accuracy is needed most when calculating the very small p -values upon which many multiple testing procedures for HDD rely.

When validity of assumptions required for familiar statistical tests is uncertain, for example that the data follow a normal distribution for the t -test or F-test, alternative tests broadly referred to as nonparametric tests may be preferable. Wilcoxon rank sum (equivalent to Mann–Whitney U) and signed rank tests are nonparametric alternatives to the two-sample t -test and paired t -test, respectively; the Kruskal–Wallis test is an alternative to the F-test in one-way ANOVA. These nonparametric tests are robust to outliers and do not require data to be normally distributed; nor do they require that their distribution is fully characterized by two parameters in the way that a mean and variance characterize a normal distribution. Many nonparametric tests are based on ranks of observed data rather than their actual values. Permutation tests, as described in Table 15 and below, comprise another class of nonparametric tests and are more generally applicable than rank-based tests.

A word of caution is in order to emphasize that correct permutation of the data is critical to validity of a permutation test. The permutations must preserve any structure in the data that is unrelated to the null hypothesis. For instance, if the goal is to test whether the mean of a variable is different between groups, but it is thought that the variances are different, then the simple permutation test described for the two-group comparison will not be appropriate because the permutations will change the variances as well as the means. If the groups are paired, e.g., variables are measured both before and after each subject receives an experimental drug, then the permutations would have to preserve that pairing by randomly “flipping” the before and after measurements within patients. Correct permutation might not be easy, or even feasible, for regression models with multiple predictors. For example, naively permuting the outcomes in a logistic or Cox regression model with many predictors to provide test statistics for individual predictor variables (adjusted for the other variables) would not provide valid permutation p -values because the correlation structure of the data, e.g., correlations of the outcome with other variables that are not the focus of the test, would not be preserved. Anderson and Legendre [ 94 ] discuss appropriateness and performance of various permutation testing strategies in the context of testing partial regression coefficients in multivariable regression models.
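
The paired case described above, in which the before and after measurements are randomly flipped within each patient, can be sketched as follows. This is a minimal Monte Carlo illustration under stated assumptions: the test statistic is the absolute mean of the within-patient differences, flips are drawn at random rather than enumerated exhaustively, and the p -value is the plain proportion of permutations at least as extreme as the observed data. The function name `paired_permutation_pvalue` and its parameters are hypothetical.

```python
import random


def paired_permutation_pvalue(before, after, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test for paired measurements."""
    rng = random.Random(seed)
    # Pairing is preserved by working with within-patient differences.
    diffs = [a - b for a, b in zip(after, before)]
    observed = abs(sum(diffs) / len(diffs))
    count = 0
    for _ in range(n_perm):
        # Flipping before/after within a patient flips the sign of the difference.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            count += 1
    return count / n_perm
```

With only a few patients the number of distinct sign patterns is small (2 to the power of the number of pairs), which limits how small the p -value can be; with eight pairs the smallest attainable two-sided value is 2 / 256.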

Nonparametric methods have advantages and disadvantages. In the context of statistical tests, their main advantages include their applicability in situations where little is understood about the likely distribution of the data, and their robustness to oddities in the data such as outliers. The main disadvantage of nonparametric methods is their reduced statistical power, particularly for small sample sizes, compared to a parametric test when distributional assumptions of that test are actually met. For HDD settings, parametric tests have additional appeal, when reasonably justified, due to the possibility to “borrow information” across variables by modelling relationships of parameters (e.g., means or variances) across variable-specific distributions; modelling approaches such as those discussed in section “TEST1.2: Modelling approaches: Hypothesis testing for multiple variables” can greatly increase statistical power for testing multiple hypotheses.

TEST1.2: Modelling approaches: Hypothesis testing for multiple variables

In the scenarios (i)-(iii) described in the introduction of section “ TEST: Identification of informative variables and multiple testing ”, the number of statistical analyses performed is equal to the number of variables. For omics data, the number of variables is often in the range of tens of thousands or even millions. Direct application of standard hypothesis testing approaches to each variable in the setting of HDD is problematic. As an illustration, consider conducting several thousand statistical tests (one per candidate variable), each using the classical α level of 0.05 to test for significance of an association between a single variable and an outcome or phenotype of interest. If the truth were that none of the candidate variables had an association with the outcome or phenotype of interest, then, on average, testing 20,000 variables would lead to 1000 false positive test results (0.05 times the 20,000 variables tested), clearly an unacceptably large number that would limit interpretability of the results. Control of the number of false positives, often termed “false discoveries” in the setting of HDD, is critical.
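
The arithmetic in this illustration can be checked with a short simulation. The sketch assumes that all null hypotheses are true, that the tests are independent, and that each p -value is uniformly distributed on the interval from 0 to 1 under the null hypothesis; the function names are hypothetical.

```python
import random


def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha


def simulate_false_positives(n_tests=20000, alpha=0.05, seed=0):
    """Count rejections among simulated null p-values (uniform on [0, 1))."""
    rng = random.Random(seed)
    return sum(rng.random() <= alpha for _ in range(n_tests))


print(expected_false_positives(20000, 0.05))  # 0.05 times 20,000
print(simulate_false_positives())             # varies around that value by chance
```

Any single simulated experiment will deviate somewhat from the expected count; the point is that the number of false positives is of the order of a thousand, not a handful.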

Several challenges are encountered in multiple testing for HDD omics data. One is that in order to control false positives when a very large number of statistical tests are performed, small α levels must be used, which limits statistical power. Another challenge is the mathematical difficulty of dealing with joint distributions of certain variable types such as counts, which are commonly generated by newer omics technologies such as RNA-Seq. Furthermore, sample sizes are often insufficient to rely on classical statistical asymptotic (large sample size) theory to provide tractable approximate distributions of test statistics required to appropriately control type I and II errors. Finally, the classical approach of limiting false positives by controlling the overall probability of any false positive findings is overly stringent when extremely large numbers of tests are performed. These challenges have spawned a wealth of innovative statistical approaches for multiple testing with HDD, which are described in the sections that follow.

The earliest technologies for high-dimensional gene expression analysis based on microarray platforms quantified gene expression by fluorescence intensities. After logarithmic transformation, these continuous intensity values are typically well approximated by a normal distribution. Many of the early methods developed for statistical analysis of microarray data relied on normally distributed data, the simplest example being use of t -tests to identify lists of differentially expressed genes with varying degrees of type I error control. Sample size in these early studies was usually relatively small, making it difficult to adequately control false discoveries and still maintain sufficient statistical power. Some of these methods were ad hoc or limited to simple experimental settings such as two-group comparisons, but advances in statistical methodology led to improved approaches for the analysis of HDD gene expression data (Table 16 ).

Sometimes a researcher is interested in identifying genes for which expression is not different between conditions, opposite the more typical goal to identify differentially expressed genes. This requires reversing the usual role of the null and alternative hypotheses. However, since it is impossible to statistically rule out very tiny effects, the null hypothesis that is tested for each gene is that its effect is larger than some user-specified minimum size. When implementing this procedure to identify genes with negligible effect, mean parameter shrinkage functions must be turned off.

TEST2: Multiple testing

Methods described in the previous section provide useful approaches to improve statistical power for testing individual variables (genes) and to appropriately model commonly encountered omics data. However, a final step is required to control false positives in HDD settings. Several multiple testing correction methods and their utility for HDD are discussed in this section.

TEST2.1: Control for false discoveries: Classical multiple testing corrections

A simple table illustrates the types of errors that can be encountered in multiple testing [ 100 ]. When testing m hypotheses, these are either true or false, and either rejected or not rejected, yielding four possibilities, which are displayed in Table 17 along with the numbers of hypotheses falling in each category.

In Table 17 , m represents the number of tests conducted; R represents the number of rejected hypotheses; V represents the number of tests for which type I errors were committed, or the number of false positives; and U represents the number of tests that correctly rejected the null hypothesis, or the number of true positives. Further, m 0 represents the total number of true null hypotheses; m 1 the total number of false null hypotheses; and m 1 − U represents the number of tests for which type II errors were committed. The goal of a multiple testing procedure is to control V while not too severely limiting U . If R = 0 , then no type I error can be committed. If m 0 = m , then rejection of any test constitutes a type I error and represents a false positive result.
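
As a concrete reading of this notation, the following sketch tabulates the quantities for a toy example in which the truth of each null hypothesis is known. In real data the truth is of course unknown, so V and U cannot be observed; the function name `error_counts` and the dictionary keys are hypothetical.

```python
def error_counts(null_is_true, rejected):
    """Tabulate m, m0, m1, R, V, U and the number of type II errors."""
    m = len(null_is_true)
    m0 = sum(null_is_true)          # true null hypotheses
    m1 = m - m0                     # false null hypotheses
    R = sum(rejected)               # rejected hypotheses
    # V: rejected although the null hypothesis is true (false positives).
    V = sum(t and r for t, r in zip(null_is_true, rejected))
    U = R - V                       # correctly rejected (true positives)
    return {"m": m, "m0": m0, "m1": m1, "R": R, "V": V, "U": U,
            "type_II": m1 - U}


# Toy example: five hypotheses, three of which are truly null.
truth = [True, True, True, False, False]
decisions = [True, False, False, True, False]
print(error_counts(truth, decisions))
```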

Classical multiple testing corrections that aim to control false discoveries by using more stringent (smaller) “critical” levels for significance testing may work well in situations with a few dozen tests or less. However, they can be problematic for HDD because they may be too stringent and severely limit statistical power for detecting associations that truly exist, particularly when sample sizes are not large.

The simplest approach to controlling false discoveries is the classical Bonferroni correction, where the critical level is adjusted by dividing it by the number of tests performed (see Table 18 ). Bonferroni correction is very stringent for several reasons. First, it is designed to control what is known as familywise error rate (FWER) , which refers to globally controlling the probability that any of the tests results in a false discovery. In terms of the notation in Table 17 , controlling the FWER at level α means requiring P ( V > 0) ≤ α. Despite its conservativeness, Bonferroni adjustment has become the standard approach for genome-wide association studies to control the genome-wide significance level. This enforces stringent control on the probability that any of the hundreds of thousands of genomic variants typically studied is falsely identified as associated with the phenotype of interest. Second, a simple Bonferroni correction is conservative in that it does not leverage information about potential correlations between the test statistics; nor does it account for the ordering of the p -values when applying the significance-testing threshold. When evaluating p -values in order from smallest to largest, it is natural to require smaller critical levels for declaring significance earlier in the list. These limitations of the Bonferroni correction have motivated development of modified approaches that are less stringent, as discussed next.

Some adjusted versions of Bonferroni correction that take p -value ordering into account have been proposed. Some, such as those proposed by Hochberg [ 101 ] and Hommel [ 102 ], require assumptions about the joint distribution of the p -values such as the nature of correlations, and those are not discussed here. However, the approach proposed by Holm [ 103 ] provides a simple improvement on the Bonferroni method that allows critical values for significance testing to depend on the ordering of the p -values while, like Bonferroni, requiring no assumptions about the joint distribution of the p -values. Holm’s approach is described in Table 18 .
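
Because Table 18 is not reproduced here, the following sketch shows the Bonferroni correction and the standard step-down form of Holm's procedure side by side. It assumes raw p -values are supplied as a list and returns one reject/do-not-reject decision per test; the function names are hypothetical.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Reject every hypothesis whose p-value is at most alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def holm_reject(pvalues, alpha=0.05):
    """Holm step-down: compare ordered p-values to alpha/m, alpha/(m-1), ..."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = [False] * m
    for rank, i in enumerate(order):        # rank = 0 for the smallest p-value
        if pvalues[i] <= alpha / (m - rank):
            rejected[i] = True
        else:
            break                           # stop at the first non-rejection
    return rejected


pvals = [0.001, 0.012, 0.015, 0.035, 0.20]
print(bonferroni_reject(pvals))
print(holm_reject(pvals))
```

In this made-up example the smallest p -value faces the same threshold under both methods, but Holm's thresholds relax as one moves down the ordered list, so Holm never rejects fewer hypotheses than Bonferroni.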

Several other methods of controlling the FWER have been proposed that require additional assumptions about the nature of correlations between test statistics or might only control false positives under a global null in which all hypotheses are null. Such tests are not guaranteed to always control the FWER when these assumptions do not hold and will not be discussed further here.

An appealing aspect of multiple testing procedures that control FWER is that one can make statements about the probability that an individual test falsely rejects the null hypothesis. Because the probability that any test among a collection of tests falsely rejects must be at least as large as the probability that a single randomly chosen test falsely rejects, control of FWER at level α automatically guarantees control of the type I error at level α for each individual test.

An important caveat about any multiple testing correction method that is based on p -values is that it relies on the validity of the p -values or the validity of the corresponding test procedures. As noted in the discussion of test statistics above in section “ TEST1: Identify variables informative for an outcome ,” ensuring sufficient accuracy of p -values based on specific (parametric) distributions can be challenging in HDD settings. Permutation tests can provide distribution-free options for multiple testing in some situations. They also offer the flexibility to handle HDD with variables of different types, e.g., variables could be a mix of categorical, count, or continuous data. However, permutation tests can be problematic for multiple testing in HDD settings as well, as it can be very computationally intensive to accurately compute p -values that might be very small.

Multivariate permutation tests are permutation tests that are applied for testing multiple hypotheses simultaneously. For each hypothesis, a test statistic is calculated, for example for simultaneously comparing the distribution of many omics variables between two phenotype classes. As in the univariate case, class labels are randomly reassigned to the observations (keeping the full profile of measurements intact for each observation), and then a p -value for each variable is computed as the proportion of permutations for which the corresponding test statistic is as extreme as or more extreme than the test statistic calculated from the original data. The popular Westfall-Young permutation procedure, as an example, is described in Table 18 . Multiple testing procedures can be applied to the collection of permutation p -values to control false discoveries just as if the p -values had been computed assuming parametric distributions for the variables.
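
The label-reassignment scheme can be sketched as follows. This is a basic per-variable permutation test, not the Westfall-Young procedure; it assumes two classes coded 0 and 1, uses the absolute difference in class means as the test statistic, and applies the same shuffled labels to all variables so that each observation's profile stays intact. The names `mean_difference` and `permutation_pvalues` are hypothetical.

```python
import random


def mean_difference(values, labels):
    """Absolute difference between the class-1 and class-0 means."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g0 = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))


def permutation_pvalues(data, labels, n_perm=1000, seed=0):
    """One permutation p-value per variable; rows of data are observations."""
    rng = random.Random(seed)
    columns = list(zip(*data))              # one tuple of values per variable
    observed = [mean_difference(col, labels) for col in columns]
    counts = [0] * len(columns)
    permuted = list(labels)
    for _ in range(n_perm):
        rng.shuffle(permuted)               # one relabeling shared by all variables
        for j, col in enumerate(columns):
            if mean_difference(col, permuted) >= observed[j]:
                counts[j] += 1
    # Proportion of permutations at least as extreme as the original data.
    return [c / n_perm for c in counts]
```

The resulting list of p -values can then be passed to a multiple testing correction. Note that the smallest attainable p -value is limited by the number of permutations, which is why very small thresholds require very many permutations.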

TEST2.2: Control for false discoveries: Methods motivated by HDD

Various multiple testing correction methods have been developed that are more appropriate for HDD than the classical Bonferroni-type methods. Usually, these approaches aim for a false discovery control that is less stringent than familywise error control, such as limiting the percentage of false discoveries (rather than aiming to avoid any false discoveries) in exchange for greater power to detect true discoveries. Many multiple testing methods for HDD are combined with methods such as those just discussed in section “TEST1.2: Modelling approaches: Hypothesis testing for multiple variables” that borrow information across variables (or tests) or that exploit correlations between candidate variables to increase statistical power. The growing amount of HDD stimulated development of a variety of innovative multiple testing procedures more appropriate for these data than traditional approaches.

To describe the various multiple testing approaches for HDD and the false discovery criteria that they control, it is helpful to focus again on one of the most frequent goals in omics data analysis, which is the identification of differentially expressed genes between two or more classes or conditions. The notation used in this section follows that defined in Table 17 .

Aiming to control type I error in terms of the FWER through application of classical Bonferroni-type methods becomes extremely challenging with increasing dimension of HDD due to low statistical power, as already discussed. These challenges motivated consideration of alternatives to classical control of type I error, most commonly control of the false discovery rate (FDR). The popular FDR is in principle the expected proportion of false discoveries among the rejected tests and is described in more detail below. The methods differ by the type of error they aim to control but share some operational aspects. Once the acceptable magnitude of error (e.g., FDR) has been specified, the (raw, uncorrected) p -values are calculated and next the variables are usually ranked based on their associated p -values. Those with p -values below a certain threshold are included in the list of the positive findings (rejecting their associated null hypotheses). This threshold can be fixed for all p -values, or it may depend on the ranking of the p -values. Equivalently, the p -values can be adjusted and then compared to the desired level of error control. There are several methods for FDR control, which differ in the adjustment applied to the p -values and in the threshold to which those p -values are compared.

As is common in statistics, some methods require additional assumptions and the claimed properties are only valid when those assumptions are met. In multiple testing, an important distinction is between methods that achieve weak control and those that achieve strong control. Weak control means that the method achieves the stated error control only when there are no true positives (i.e., all null hypotheses are true). In contrast, strong control means that the method achieves the stated control no matter how many of the null hypotheses are true or false. Only methods that provide a strong (general) control are discussed here. In multiple testing, it is also common to encounter assumptions about the dependence among variables or p -values; the assumption of independence among variables is unrealistic for omics data, where variables are often positively correlated.

In the following, we first define metrics to quantify false positives and then briefly present some of the methods that have been proposed to control them, focusing only on the essential concepts. We point the more technical reader to comprehensive reviews of multiple testing methods by Dudoit et al. [ 105 ] and more recently by Goeman and Solari [ 106 ]. A practical introduction providing illustrative examples with implementation in the R language is available in the book of Bretz et al. [ 107 ].

The FDR is a popular extension of the concept of type I error for HDD. Using the notation described in Table 17 , FDR is the expected (average) value of Q , i.e., FDR = E(Q) , where Q = V/R if R > 0 and Q = 0 if R = 0 [ 107 ]. Q is sometimes also called FDP (false discovery proportion). Since the case R = 0 is very uncommon in practical HDD applications, the FDR can be roughly thought of as the proportion of false positives among declared positives (i.e., among rejected tests). Controlling FDR is less stringent than controlling FWER, as FDR control inherently allows for some false positives. The goal of FDR control is to identify as many positive test results as possible, while accepting a relatively low proportion of false discoveries. In practice, common choices for FDR control are 5 or 10%.
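
The distinction between Q, which varies from one analysis to the next, and the FDR, which is its average, can be made concrete with a few lines of code. The pairs of V and R below are made-up numbers chosen only to illustrate the definition, including one case with R = 0; the function names are hypothetical.

```python
def false_discovery_proportion(V, R):
    """Q = V / R if at least one hypothesis is rejected, otherwise 0."""
    return V / R if R > 0 else 0.0


def average_fdp(outcomes):
    """Average of Q over repeated analyses, an empirical analogue of FDR = E(Q)."""
    proportions = [false_discovery_proportion(V, R) for V, R in outcomes]
    return sum(proportions) / len(proportions)


# Hypothetical (V, R) pairs from four repetitions of the same experiment.
outcomes = [(1, 10), (0, 8), (2, 10), (0, 0)]
print([false_discovery_proportion(V, R) for V, R in outcomes])
print(average_fdp(outcomes))
```

Individual values of Q range from 0 to 0.2 in this toy example even though their average is about 0.075, which illustrates that controlling the FDR does not bound the false discovery proportion of any single analysis.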

The Benjamini–Hochberg procedure [ 108 ] is the most widely used method for controlling the FDR. It is described in Table 19 . Notably, the adjusted threshold value used by the Benjamini–Hochberg method is identical to that used by the Bonferroni and Holm methods for the variable with the smallest p -value, but it is much larger for the others. It is generally true that lists of discoveries generated by procedures that control the FDR are much longer than those generated by methods that control the FWER at the same level. Yet, like the Bonferroni method, the original FDR method is conservative, effectively controlling the FDR at level α · m 0 / m ≤ α if the variables are independent. Many methods were proposed to improve the power of FDR by estimating this unknown proportion of true null hypotheses ( m 0 / m ) from data and using it to adapt the threshold value (see [ 100 ]). The original FDR [ 108 ], which was proposed for independent variables but proven to be valid under the assumption of a positive correlation of the p -values, was extended by Benjamini and Yekutieli [ 109 ] to handle more general dependencies. This more general procedure has lower thresholds and is more conservative. Several other methods were proposed to control FDR and some error rates closely related to FDR were defined [ 110 ]. Figure 15 illustrates how the Bonferroni and the Benjamini–Hochberg correction work.
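
Because Table 19 is not reproduced here, the following sketch shows the standard step-up form of the Benjamini–Hochberg procedure: order the p -values, find the largest rank k whose p -value is at most k · α / m, and reject the hypotheses with the k smallest p -values. The function name `benjamini_hochberg_reject` is hypothetical.

```python
def benjamini_hochberg_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; returns one decision per test."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_rank = 0
    for rank, i in enumerate(order, start=1):
        # Threshold grows linearly with the rank: alpha/m, 2*alpha/m, ...
        if pvalues[i] <= rank * alpha / m:
            largest_rank = rank
    rejected = [False] * m
    # Reject everything up to the largest rank that met its threshold,
    # even if some smaller p-values missed their own thresholds.
    for i in order[:largest_rank]:
        rejected[i] = True
    return rejected


pvals = [0.001, 0.012, 0.015, 0.035, 0.20]
print(benjamini_hochberg_reject(pvals))
```

For these made-up p -values the threshold for the smallest one is α / m = 0.01, the same as under Bonferroni, while the fourth smallest is compared to 4 · 0.05 / 5 = 0.04 and is therefore also rejected.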

Graphical illustration of how the Bonferroni and the Benjamini–Hochberg corrections work, for an example with 7129 tests and 0.05 as the desired significance level in each case. Applying Bonferroni, only the results of the tests with p -values smaller than 0.05 / 7129 (represented by a dotted line) provide evidence against the null hypothesis. For Benjamini–Hochberg, the significant genes are those whose tests yield p -values no larger than the largest p -value under the threshold, circled in green in the figure. The threshold is represented by the dashed line. The line has intercept 0 and slope 0.05 / 7129, where now 0.05 is the desired level of FDR control.

Many extensions and modifications of the FDR have been proposed. The most common criticism of FDR is that it controls only the average proportion of false positives, which might be very variable: in practice, the actual proportion Q of false positives derived from an analysis might differ substantially from the targeted FDR threshold, but the FDR methods do not provide an estimate of this variability. Readers are referred to Goeman and Solari [ 106 ] for discussion of methods that aim to control the false discovery proportion with a specified confidence. Other methods have been proposed for control of local FDR, a concept that allows a more powerful interpretation of q -values at the level of a single hypothesis rather than as a property of a list of variables [ 111 ]. In practice, the FDR-controlling approaches are some of the most widely used methods for multiple testing for omics data, despite some recognized limitations.
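
The gap between the controlled average (FDR) and the realized proportion Q in a single analysis can be made concrete with a small simulation. The sketch below is only illustrative: the number of tests, the share of true nulls, the effect size, the independence of the tests, and all function names are assumptions chosen for the example.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def bh_rejections(pvalues, alpha):
    """Indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for k, i in enumerate(order, start=1):
        if pvalues[i] <= k * alpha / m:
            k_max = k
    return order[:k_max]


def simulate_fdp(rng, m=200, m0=180, effect=3.0, alpha=0.10):
    """One simulated study; the first m0 hypotheses are true nulls."""
    z = [rng.gauss(0.0, 1.0) for _ in range(m0)]
    z += [rng.gauss(effect, 1.0) for _ in range(m - m0)]
    rejected = bh_rejections([two_sided_p(v) for v in z], alpha)
    v_false = sum(1 for i in rejected if i < m0)  # V: false discoveries
    r_total = len(rejected)                       # R: all discoveries
    return v_false / r_total if r_total > 0 else 0.0  # Q = V/R, or 0 if R = 0


rng = random.Random(2024)
fdps = [simulate_fdp(rng) for _ in range(500)]
print(f"mean of Q over runs (estimates the FDR): {mean(fdps):.3f}")
print(f"range of Q across runs: {min(fdps):.3f} to {max(fdps):.3f}")
```

The mean of Q across runs is what the procedure controls; the spread between the smallest and largest Q shows how far an individual analysis can be from that average.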

TEST2.3: Sample size considerations

Determination of an appropriate sample size for a study that will involve conducting an extremely large number of statistical tests is very challenging. Sample size methods must be tailored to the desired approach and criteria for error control. Both false positive (type I) and false negative (type II) errors need to be considered. Early in the emergence of omics data, sample size methods focused on FDR control [ 112 ], particularly for microarray technology [ 113 ]. A recent review mainly focusing on sequencing experiments also provides useful guidance [ 114 ].

TEST3: Identify informative groups of variables

The multiple testing problem is less severe when the interest is shifted to groups of variables instead of single variables, as described in the introduction of section “ TEST: Identification of informative variables and multiple testing ” as example (iii) of the main scenarios. In most cases, the groups are prespecified (e.g., genes belonging to the same biological pathway, genes with the same molecular function, or mutations on the same arm of a chromosome). A variable can belong to more than one group, and often the variables belonging to the same group are positively correlated. This type of analysis has the potential for greater statistical power and greater between-study reproducibility than a variable-by-variable analysis.

The methods tailored for the analysis of groups of variables can be divided into two broad classes [ 115 , 116 ]. The first class comprises competitive methods, which attempt to identify which variable groups have a stronger association with the outcome (or phenotype) than the other groups. The second class comprises self-contained methods, which try to identify which of the variable groups contain at least one variable that is associated with the outcome. Example approaches are described below. The popular gene set enrichment analysis (GSEA) and over-representation analysis (ORA) are mixed approaches, while topGO is a competitive method and the global test a self-contained method. In all cases, FWER or FDR can be controlled using any of the methods already described in section “ TEST2: Multiple testing ” on multiple testing. When applying multiple tests for groups of variables, multiplicity refers to the multiplicity of these groups, not of individual variables. To also allow examination of data from a single patient or a small number of samples, methods have been developed that score individual samples based on gene sets. Singscore [ 117 ] is one such approach. It is a rank-based single sample method that generates scores that are stable across a range of sample sizes (Table 20 ).
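
As a concrete instance of a group-level test, over-representation analysis is commonly carried out with a one-sided hypergeometric (Fisher-type) test on the overlap between a list of selected genes and a predefined gene set. The sketch below assumes that formulation; the function name and the counts are hypothetical.

```python
from math import comb


def ora_pvalue(n_universe, n_in_set, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap), where X is the
    number of gene-set members among n_selected genes drawn from n_universe."""
    upper = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 1000 measured genes, a pathway with 50 members,
# 100 genes declared significant, 12 of which belong to the pathway
# (5 would be expected by chance).
p_example = ora_pvalue(n_universe=1000, n_in_set=50, n_selected=100, n_overlap=12)
print(f"ORA p-value: {p_example:.4g}")
```

One such p-value is obtained per gene set, and the multiplicity correction is then applied across gene sets rather than across individual genes.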

Subgraph of the Gene Ontology (GO) induced by the top 5 GO terms identified by topGO (elim algorithm) for scoring GO terms for enrichment. Rectangles indicate the 5 most significant terms. Rectangle color represents the relative significance, ranging from dark red (most significant) to bright yellow (least significant). The top GO terms are spread across different areas of the GO graph, representing rather different biological processes. Source: [ 123 ]

PRED: Prediction

It is often of interest to build a prediction model that takes so-called “predictor variables” (sometimes also referred to as “independent variables”) as input and returns a prediction for a target variable of interest (sometimes also referred to as “dependent variable”) as output. This target variable, which refers either to the present state of the patient or to the future, may be a (binary or multi-categorical) class membership (e.g., treatment responder versus non-responder), a continuous variable (e.g., blood pressure or tumor size after therapy), an ordinal variable (e.g., WHO tumor grade), or a time-to-event (e.g., the overall survival time). Statistically more challenging cases of target variables that are not discussed in this paper are zero-inflated variables (typically continuous with additional frequent 0 values), continuous bounded variables (e.g., with values in [0,1]), or time-to-event variables in the presence of competing risks.

In the HDD setting, the number of candidate variables available to build the prediction model may be very large. This property has implications for construction of prediction models (section “ PRED1: Construct prediction models ”) and assessment and validation of their performance (section “ PRED2: Assess performance and validate prediction models ”). Detailed guidance for training, testing, and validation of HDD prediction models is provided by the IOM (Institute of Medicine of the National Academy of Sciences, USA) within a report that identifies best practices for the development, evaluation, and translation of omics-based tests into clinical practice [ 124 ]. However, that report does not contain detailed guidance on statistical approaches for construction of prediction models and assessment of their performance. In addition, statistical methodology has seen substantial developments during the last decade.

Many methods to assess model performance and validate prediction models have been developed for low-dimensional data and then adapted to HDD, so a good starting reference is the explanation and elaboration paper of the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) reporting guideline [ 125 ]. This section explains, expands, and elaborates on existing guidance to more comprehensively cover issues in prediction modelling with HDD.

PRED1: Construct prediction models

Researchers developing a prediction model primarily focus on how well the model predicts the outcome of interest, especially for new observations, e.g., for patients whose data were not used to build the prediction model. While this is the main concern, often the researchers are also interested in the interpretation of the model, for example identifying which variables contribute most to the prediction and in what way. From this perspective, models involving only a limited number of predictor variables (denoted as “sparse models”), which clearly distinguish informative variables from non-informative variables, may be preferred to models making use of all variables measured for all observations. This is a particularly big challenge in the HDD setting, where many candidate variables are available. Beyond the issue of interpretability, sparse models may be easier to apply in clinical practice, because fewer variables have to be measured or determined to use them than for non-sparse models. In the case of gene expression, for example, the measurement of, say, 10 genes can be easily performed in any lab using PCR techniques, while the measurement of genome-wide expression requires the use of high-throughput methods (see, e.g., [ 126 ]).

A model is said to be “complex” if it reflects many patterns present in the available data, for example, by considering many predictor variables or capturing non-linear effects. Overly complex models risk overfitting the data, i.e., adhere too specifically to the data at hand and identify spurious patterns randomly present in the data used for model development that will not be present in independent data (see, e.g., [ 127 ]). An overfitted model usually exhibits suboptimal prediction performance when subjected to appropriate unbiased evaluation methods, and interpreting such models can be misleading. In contrast, a model that is not complex enough underfits the data. It misses important patterns that might have been useful for the purpose of prediction. When fitting prediction models, in particular (but not only) in the HDD setting, the challenge thus is to identify the optimal level of model complexity that will yield interpretable models with good prediction performance on independent data (see, e.g., [ 128 , 129 ]).

The most straightforward statistical approach to construct a prediction model using several predictor variables simultaneously while taking into account their correlation is fitting a multivariable (generalized) regression model, for example a simple linear regression model in the case of an approximately normally distributed target variable. In linear regression, the regression coefficients are fitted such that the sum (for the n observations) of squared errors (i.e., of squared differences between the true value of the target variable and the predicted value) is minimal. Mathematically, this basic linear regression amounts to solving a system of n equations with p + 1 unknowns, where p stands for the number of predictor variables. Such a regression model, however, cannot be fitted if the number p + 1 of coefficients to fit (the intercept and one coefficient for each variable) exceeds the dataset size n . This dimension problem is complicated by the frequently occurring situation in which some of the p variables are highly correlated, i.e., they provide similar information. These correlations can cause instability with regard to which variables are deemed important contributors to the model and, thus, can influence model interpretability and performance.
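
The dimension problem can be seen in a toy example with n = 2 observations and p = 2 predictors, hence p + 1 = 3 coefficients. The data and candidate coefficient vectors below are made up for illustration; the point is only that several different coefficient vectors reach a sum of squared errors of exactly zero, so least squares cannot single one out.

```python
def sum_squared_errors(coefs, X, y):
    """SSE for an intercept followed by one coefficient per predictor."""
    intercept, *slopes = coefs
    sse = 0.0
    for row, target in zip(X, y):
        predicted = intercept + sum(b * x for b, x in zip(slopes, row))
        sse += (target - predicted) ** 2
    return sse


# Two observations (rows), two predictor variables (columns).
X = [[1.0, 0.0],
     [2.0, 1.0]]
y = [1.0, 3.0]

# Three different (intercept, b1, b2) vectors that all fit the data perfectly.
candidates = [(-1.0, 2.0, 0.0), (1.0, 0.0, 2.0), (0.0, 1.0, 1.0)]
errors = [sum_squared_errors(c, X, y) for c in candidates]
print(errors)  # [0.0, 0.0, 0.0]
```

Each candidate attributes the outcome to a different variable, which also illustrates why interpretation of such a fit would be unstable.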

Because the number of predictor variables p is usually larger than the number of patients n in HDD settings, basic regression models cannot be fitted directly. In this section, we briefly review some key strategies to deal with the dimension problem: variable selection, dimension reduction, statistical modelling (mainly through regularization methods), and algorithmic approaches (at the interface between statistics and machine learning). First, however, we discuss a preliminary step, variable transformation, that can be particularly helpful in the context of HDD analyses.

PRED1.1: Variable transformations

As mentioned in section “ IDA3: Preprocessing the data ,” data may be transformed to obtain certain distributional properties required for the methods that might be used in preprocessing or in downstream analyses of the preprocessed data. For example, (approximate) normal distributions for errors are a prerequisite for the application of tests such as the t -test or methods based on linear models such as ANOVA and linear regression [ 130 ]. Transformations may also be helpful to dampen the influence of peculiar or extreme observations and may put variables on scales that are more amenable to analysis. For example, one could transform a bounded variable to an unbounded range or convert multiplicative effects to additive effects. It is often preferable to apply suitable transformations first and then work with transformed variables (Table 21 ).

Note that centering and scaling were discussed in section “IDA3.1: Background subtraction and normalization” (referring to normalization), but there the transformation was applied to all values of an observation (a subject, e.g., a patient) to adjust for potential systematic effects and make different observations more comparable, whereas here the transformation is related to all values of a variable.
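
A few typical variable-wise transformations can be sketched as follows. The helper names and the example values are assumptions made for illustration; which transformation is appropriate depends on the variable and the downstream method.

```python
import math
from statistics import mean, stdev


def log2_transform(values):
    """Log transform: multiplicative effects (fold changes) become additive."""
    return [math.log2(v) for v in values]


def logit(proportion):
    """Map a value in (0, 1) to the whole real line."""
    return math.log(proportion / (1.0 - proportion))


def standardize(values):
    """Center and scale one variable across all observations."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


# Hypothetical intensities in which each value doubles the previous one.
intensities = [100.0, 200.0, 400.0, 800.0]
log_intensities = log2_transform(intensities)
steps = [b - a for a, b in zip(log_intensities, log_intensities[1:])]  # all about 1

# Hypothetical bounded values, e.g. proportions in (0, 1).
proportions = [0.05, 0.50, 0.95]
logits = [logit(p) for p in proportions]

z_scores = standardize(log_intensities)  # mean 0, standard deviation 1
```

Here the transformation is applied to all values of one variable, in line with the distinction from observation-wise normalization drawn above.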

PRED1.2: Variable selection

Variable selection refers to identification of a subset of predictor variables from all available variables, for purposes of building a prediction model. Note that terms such as variable selection, selection strategy, or stepwise procedures are often used in the statistical literature, whereas use of the terms feature selection, wrapper, and filter is more common in the machine learning community. Multiple strategies have been proposed in the statistical and machine learning areas; for recent reviews, see, e.g., Heinze et al. [ 132 ] and Singh et al. [ 133 ]. If available, subject matter knowledge should be included in the variable selection process. In many cases, however, variable selection is performed in a data-driven way, either with filter methods or with wrapper methods. In filter methods, the candidate predictor variables are considered one at a time, independently of each other. Those satisfying a given criterion (for example, those associated with the target variable or those showing sufficient variability across all patients) are selected, while the others are ignored in the remaining analyses (for a comparison of filter methods in classification tasks with HDD data, see [ 134 ]). In contrast, wrapper methods select a subset of variables that, taken in combination, yield good prediction performance (when used for prediction modelling with a considered method). The performance is assessed, e.g., through cross-validation (see section “PRED2.2: Internal and external validation”). Note that an embedded variable selection is also performed intrinsically with model building methods such as lasso and boosting (see section “PRED1.4: Statistical modelling”).
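
A filter method can be sketched as scoring each variable on its own and keeping the top-ranked ones. The example below uses the absolute Pearson correlation with the target as the criterion; that choice, the function names, and the data are assumptions made for illustration.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant variable."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def filter_top_k(columns, outcome, k):
    """Score every variable separately and keep the k highest-scoring ones."""
    scores = {name: abs(pearson(vals, outcome)) for name, vals in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical data: four variables measured on six patients.
outcome = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
columns = {
    "gene_a": [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],  # strongly positively related
    "gene_b": [5.0, 3.0, 6.0, 2.0, 4.0, 1.0],  # moderately related
    "gene_c": [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # constant, carries no information
    "gene_d": [6.2, 5.1, 3.9, 3.0, 2.2, 0.9],  # strongly negatively related
}
selected, scores = filter_top_k(columns, outcome, k=2)
print(sorted(selected))  # ['gene_a', 'gene_d']
```

Because this criterion uses the outcome, the selection step belongs inside any resampling loop used to assess model performance.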

When the variable selection process uses outcome data, care must be taken to avoid optimistic bias in apparent model performance estimates due to multiple testing issues such as those described in section “ TEST: Identification of informative variables and multiple testing .” It is critical that any data-driven variable selection steps are included as part of the model building process when model performance is assessed using any internal validation method; see Sachs and McShane [ 135 ] for a discussion of the use of “incomplete” cross-validation approaches and the bias inherent in such flawed approaches. Section “PRED2.2: Internal and external validation” provides a further discussion. With an emphasis on LDD, the topic group TG2 “Selection of variables and functional forms in multivariable analysis” of the STRATOS initiative raised several issues needing more research about the properties of variable selection procedures. The authors stressed that it is not straightforward to decide which variable selection approach to use under which circumstances [ 136 ]. These problems are exacerbated in HDD.

PRED1.3: Dimension reduction

Data reduction has many purposes, including easier data handling (see also sections “IDA2.4: Graphical displays” and “EDA2.1: Cluster analysis” for aspects regarding data reduction). Concerning prediction, data reduction can help to reduce redundant information that may lead to instability of prediction models, as noted at the beginning of section “ PRED1: Construct prediction models .” Data reduction may also facilitate explanation and interpretation by reducing the number of variables to consider. Note, however, that it may yield variables without a meaningful interpretation from a medical point of view [ 137 ].

In contrast to variable selection, the idea of dimension reduction is not to select variables but to build (a small number of) new variables, often called components , that summarize the information contained in the original variables. They can then be used as predictor variables for model building—possibly with a low-dimensional method. However, portability and feasibility of models generated using dimension reduction versus variable selection can be substantially different. To predict outcome using a model containing only a few selected variables, it is sufficient to measure these selected variables, while a model including derived components may require the measurement of all original variables. Consider, for example, deriving a prediction model from gene expression data generated using a microarray that measures 20,000 genes. There is a huge practical difference between using a model requiring input of expression levels of only 10 selected individual genes compared to using a model requiring input of 10 combination scores (components), each of which potentially requires knowledge of expression levels for all 20,000 genes.

The most well-known and widely used dimension reduction approaches are principal component analysis (PCA, see also section “ IDA3: Preprocessing the data ” for a description) and partial least squares (PLS), where the components are defined as linear combinations of the original variables [ 138 ]. While PCA constructs components that have maximal variance and thus capture the signals of all types contained in the data, PLS constructs new variables that have maximal covariance with the target variable of interest. PLS is said to be a supervised method, where the term “supervised” refers to the fact that the target variable determines the construction of the components. Note that dimension reduction can be combined with variable selection.
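
The difference between the unsupervised and the supervised construction can be sketched for the first component only. In the toy code below, the first PCA direction is obtained by power iteration on the cross-product matrix of the centered variables, and the first PLS direction is taken proportional to the covariances of each variable with the target. The data, the function names, and the omission of variable scaling are assumptions of this illustration.

```python
def center(column):
    mu = sum(column) / len(column)
    return [v - mu for v in column]


def normalize(vec):
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec]


def first_pca_weights(columns, n_iter=200):
    """Leading eigenvector of X'X by power iteration (ignores the target)."""
    cols = [center(c) for c in columns]
    p = len(cols)
    cross = [[sum(a * b for a, b in zip(cols[j], cols[k])) for k in range(p)]
             for j in range(p)]
    w = normalize([1.0] * p)
    for _ in range(n_iter):
        w = normalize([sum(cross[j][k] * w[k] for k in range(p)) for j in range(p)])
    return w


def first_pls_weights(columns, y):
    """Direction of maximal covariance with y, proportional to X'y."""
    y_c = center(y)
    return normalize([sum(a * b for a, b in zip(center(c), y_c)) for c in columns])


# Hypothetical data: x1 varies a lot but is unrelated to y; x2 tracks y closely.
x1 = [8.0, -8.0, 0.0, 0.0, -8.0, 8.0]
x2 = [0.9, 2.1, 2.9, 4.2, 5.0, 5.9]
x3 = [0.2, -0.1, 0.1, -0.2, 0.0, 0.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

pca_w = first_pca_weights([x1, x2, x3])   # dominated by x1 (largest variance)
pls_w = first_pls_weights([x1, x2, x3], y)  # dominated by x2 (related to y)
```

In both cases the component is a weighted sum over all variables, so computing it for a new patient generally requires all original variables to be measured.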

For HDD analysis, a knowledge-based data reduction may also be useful. There is often external knowledge available about the entities to be investigated, such as knowledge of signaling pathways when analyzing gene expression data, or knowledge on conserved regions when analyzing DNA sequencing data (see also section “ TEST3: Identify informative groups of variables ” for incorporating information about functional relationships between genes in multiple testing). Attempts to re-discover such knowledge from the data at hand when performing data reduction then will typically be less reliable compared to using a data reduction strategy that explicitly incorporates external information, even if the latter itself also is to some extent unreliable (Table 22 ).

PRED1.4: Statistical modelling

Several modifications of traditional regression methods are available to address common challenges encountered in HDD settings with p > n . In such settings, there is no unique mathematical solution for the standard regression parameter estimates. Traditional regression aims to find the parameters that minimize a sum of squared errors, which can be viewed as minimizing a type of “loss function.” Various modifications to this loss function can be made to permit a unique solution for the regression parameters in the HDD setting. The modifications described in this section impose mathematical constraints on regression coefficients. These constraints effectively limit the number of predictor variables included in the model or the magnitudes of their effects or both. Estimates obtained with such constraints are often referred to as “shrunken.” Some of these constraints can be shown to be equivalent to adjusting the covariance matrix (e.g., ridge regression; see [ 139 ]), but a variety of other constraints can be applied through specification of different loss functions; lasso [ 140 ] and elastic net [ 141 ] are two examples. Other methods, such as boosting [ 142 ], iteratively fit regression models that minimize a specified loss function at each stage. These various approaches usually lead to different models, each of which is optimal according to its corresponding criteria.
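
For linear regression, the modified loss functions mentioned above are usually written as a residual sum of squares plus a penalty term. The notation below (tuning parameters \lambda, \lambda_1, \lambda_2 and the absence of additional scaling constants) is an assumption of this sketch; conventions differ between references and software.

```latex
\begin{align*}
\text{ridge:} \quad
  &\min_{\beta_0,\beta}\; \sum_{i=1}^{n}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^{2}
   + \lambda\sum_{j=1}^{p}\beta_j^{2} \\
\text{lasso:} \quad
  &\min_{\beta_0,\beta}\; \sum_{i=1}^{n}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^{2}
   + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert \\
\text{elastic net:} \quad
  &\min_{\beta_0,\beta}\; \sum_{i=1}^{n}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^{2}
   + \lambda_1\sum_{j=1}^{p}\lvert\beta_j\rvert + \lambda_2\sum_{j=1}^{p}\beta_j^{2}
\end{align*}
```

With \lambda = 0 each criterion reduces to ordinary least squares. The squared penalty shrinks coefficient magnitudes, while the absolute-value penalty can set coefficients exactly to zero, which is the source of the variable selection property of the lasso.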

Numerous modifications of these basic approaches have been developed in the literature (especially for lasso, due to its variable selection property). Goals can be to recover desirable mathematical properties (e.g., the adaptive lasso [ 143 ] uses adaptive weights for penalizing different coefficients and estimates the correct model under some constraints) or to adapt the lasso to specific problems (e.g., the group lasso [ 144 ] allows predefined groups of variables to jointly be selected or not) (Table 23 ).

PRED1.5: Algorithms

Boosting can be seen both as a statistical method, when a statistical model is fitted, and as an algorithmic approach, when it is implemented as a black box. In the latter case, the prediction updates are unrelated to an underlying statistical model, and only aim at minimizing a loss function [ 147 ]. Several machine learning algorithms have been developed to provide prediction rules [ 148 ]. The prediction model is constructed without variable selection or dimension reduction as a preliminary step, in a fully data-driven way, i.e., (in contrast to statistical methods) without assuming a particular model for the dependence between target and predictor variables. These algorithmic approaches may allow more flexibility to handle aspects such as non-linear or interaction effects, but often they are also less interpretable.

Machine learning algorithms comprise a diverse collection of methods. They include, among others, methods based on consideration of nearest neighbors in the predictor space (such as kNN), decision trees for classification and for regression (tree-based methods based on recursive partitioning of the predictor space), random forests (ensembles of decision trees, i.e., sets of decision trees whose predictions are averaged), and more complex approaches such as deep learning (neural networks with different structures and typically a huge number of parameters). In the HDD setting, many of these machine learning methods have been successfully used, but one must be particularly careful if the methods require the estimation of a large number of parameters, which applies especially to deep learning. Here, the overfitting problem discussed above becomes even more severe. Unbeknownst to users, some software developed to implement complex algorithms could have faulty designs that produce incorrect or overfitted results; hence, algorithms must be carefully tested [ 149 ] (Table 24 ).
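
As a minimal example of such an algorithmic prediction rule, a k-nearest-neighbors classifier can be written in a few lines. The Euclidean distance, the majority vote, the function name, and the toy data are assumptions of this sketch.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, new_x, k=3):
    """Predict the majority class among the k training observations that are
    closest (in Euclidean distance) to new_x."""
    neighbors = sorted(
        (math.dist(row, new_x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-variable data for six patients.
train_X = [(1.0, 1.2), (0.8, 1.0), (1.3, 0.9), (5.0, 5.1), (4.8, 5.3), (5.2, 4.9)]
train_y = ["responder", "responder", "responder",
           "non-responder", "non-responder", "non-responder"]

prediction_near_first_group = knn_predict(train_X, train_y, (1.2, 0.9), k=3)
prediction_near_second_group = knn_predict(train_X, train_y, (5.1, 5.0), k=3)
```

No model for the dependence between target and predictors is specified; the prediction follows entirely from the stored training data and the choice of k.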

PRED1.6: Integrating multiple sources of information

A major challenge for HDD, both for omics data and for electronic health records, is the integrative analysis of different data types. For instance, multiple types of omics data including proteomic, transcriptomic, and genomic, may be measured on the same subject. For health records data, various variable types are combined, such as blood values, urine values, cardiography measurements (ECG or EKG), categorical diagnostic measurements, or a variety of demographic variables. This has implications for visualization and use of clustering methods, which are often designed for a single data type. Conducting and interpreting joint analyses of disparate variable types can be challenging. Richardson and coauthors [ 160 ] distinguish between “horizontal integration” applied to the same type of data across multiple studies and “vertical integration” applied to different types of data on the same sample of subjects. The distinction between horizontal and vertical refers to the fact that, usually, data from high-throughput experiments are organized with samples represented by columns and variables by rows.

Regarding horizontal integration, the meta-analytic approach of pooling summary measures of association is the most commonly used approach. For other applications, such as clustering, centering and standardization [ 161 ] or specific methods should be considered in order to deal with different normalizations and platforms for the different datasets; for clustering, see for example Huo et al. [ 162 ]. Vertical data integration is typically model-based, and the model used considers the specific characteristics of the data to be integrated and of the research question (whether exploratory or predictive).
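
One common way of pooling summary measures across studies is a fixed-effect, inverse-variance weighted average. The sketch below assumes that particular pooling rule, which is only one of several meta-analytic options; the function name and the numbers are hypothetical.

```python
def fixed_effect_pool(estimates, std_errors):
    """Inverse-variance weighted average of study-level estimates and its
    standard error (fixed-effect meta-analysis)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    pooled_se = (1.0 / sum(weights)) ** 0.5
    return pooled, pooled_se


# Hypothetical log odds ratios for one gene from three studies.
estimates = [0.5, 0.3, 0.4]
std_errors = [0.1, 0.2, 0.1]
pooled, pooled_se = fixed_effect_pool(estimates, std_errors)
print(f"pooled estimate {pooled:.3f}, standard error {pooled_se:.3f}")  # 0.433, 0.067
```

In an HDD analysis this calculation would be repeated for each variable, followed by a multiplicity correction across variables.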

In biomedicine, integration of multiple omics data types can provide deeper biological insights compared to individual omics in terms of disease subtyping, biomarker identification, and understanding of molecular mechanisms in diseases. For example, two different tissues from the same or different organism may carry an identical DNA sequence for a particular gene, but the gene may be inactivated by methylation in one of the tissues and not in the other; or the aberrant expression of one gene regulating the function of another downstream in the same biological pathway might be evident by observing the altered expression of the downstream gene at the RNA or protein level.

Richardson and coauthors [ 160 ] reviewed some vertical integrative analysis approaches, including integrative clustering and regression. The integrative clustering approach of Shen and coauthors [ 163 ], called iCluster, involves projection, via regression modelling, of the data onto scores representing a set of latent biological subtypes assumed common across data types. Resulting predicted biological subtype scores are clustered to identify latent subtype membership, and estimated coefficients from the fitted regression models can provide insights into data features that associate with certain subtypes. Mo and coauthors subsequently developed iCluster+ to allow for other non-continuous, non-Gaussian data types [ 164 ]. More complex Bayesian mixture modelling approaches have also been developed to offer greater flexibility to accommodate mixed data types (e.g., discrete mutation indicators in combination with continuous RNA or protein expression measures), provide metrics reflecting uncertainty about estimated underlying structure, and allow for elucidation of potentially different structure from different data types [ 165 , 166 , 167 , 168 ]. Integrative regression techniques are useful for supervised analyses of integrated data types, such as building a regression model for prediction of an outcome or phenotype. These methods make it possible to exploit structure inherent in different data types (e.g., DNA sequence location, functional categories of proteins, metabolic or signaling pathways) to effectively reduce the high dimensionality of the predictor variable space and thereby facilitate development of more parsimonious and interpretable models relating the multi-omics data to outcomes or phenotypes of interest. Multi-omics integration methods using autoencoders in a deep learning setting are reviewed by Benkirane and coauthors [ 169 ]. For more details, readers are referred to Richardson and coauthors [ 160 ] and references therein.

Although many prediction models for clinical outcomes have been developed based either on clinical data or (more recently) on high-throughput molecular data (e.g., omics), far fewer models have been developed to incorporate both data types through vertical integration. The paucity of such models in the literature and in clinical use persists despite suggestions that a suitable combination of clinical and molecular information might lead to models with better predictive abilities (e.g., [ 170 , 171 ]).

In many medical specialties, there are some widely available and accepted clinical predictors with predictive value already validated in several independent populations. Strategies to combine such established clinical predictors with different data types, including high-dimensional omics data, have been proposed [ 172 ]; some examples have been published [ 173 , 174 ], but applications are still rare. Volkmann and coauthors [ 174 ] investigated whether better use of the predictive value of clinical data has an influence on the added predictive value of molecular data. This concept can also be extended to multi-omics data [ 175 ].

Conceptually, it is obvious that incorporation of important clinical variables can potentially lead to better prediction models; thus, those variables should be considered in combination with molecular data. De Bin et al. [ 176 ] present strategies to combine low- and high-dimensional data in a regression prediction model, analyzing the influence of the complex correlation structure within and between the two data sources. In some situations, predictive value of molecular data might be fully captured through the clinical variables, thereby eliminating the need for the molecular data in the prediction model [ 172 ].

PRED2: Assess performance and validate prediction models

Perhaps even more than constructing predictive models and algorithms, evaluating their performance and validating them are key challenges. For HDD, not only the choice of suitable measures to assess and compare model performance (see section below), but also the way of computing these measures is generally not straightforward.

PRED2.1: Choice of performance measures

Prediction performance is typically assessed by comparing the true and the predicted values of the target variable. The comparison is based on specific metrics, mainly depending on the nature of the target variable. Typical metrics include mean squared error or mean absolute error for continuous target variables, area under the curve (AUC) or Brier score for binary target variables, and calibration plot and time-dependent Brier score for time-to-event variables. Such measures can be used to quantify the performance of a model (or algorithm) or to compare different models constructed using the same dataset. In most biomedical applications, the goal of a comparative assessment is to select a final model [ 177 , 178 ]. Models of absolute risk that depend on covariates have been used to design intervention studies, to counsel patients regarding their risks of disease or future disease-related events, and to inform clinical decisions. Several criteria related to “calibration” and “discriminatory power” have been proposed [ 179 , 180 ]. Often the main interest will be in the added value of biomarkers or gene signatures relative to an existing clinical prediction model. Several performance measures are available to quantify the added value [ 181 ].
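
Several of the metrics named above are simple to compute directly. The sketch below implements mean squared error, mean absolute error, the Brier score, and the AUC in its pairwise (Mann-Whitney) form; the function names and the four-patient example are assumptions made for illustration.

```python
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def brier_score(y_true, prob):
    """Mean squared difference between the 0/1 outcome and the predicted risk."""
    return mean_squared_error(y_true, prob)


def auc(y_true, score):
    """Proportion of (event, non-event) pairs in which the event has the
    higher score; ties count as one half."""
    pos = [s for t, s in zip(y_true, score) if t == 1]
    neg = [s for t, s in zip(y_true, score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical binary outcomes and predicted probabilities for four patients.
y = [0, 0, 1, 1]
prob = [0.1, 0.4, 0.35, 0.8]
print(auc(y, prob))          # 0.75
print(brier_score(y, prob))  # approximately 0.158
```

The AUC depends only on the ranking of the predictions (discrimination), whereas the Brier score also reflects how close the predicted probabilities are to the observed outcomes.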

For a clinical task, several very different models with equivalent prediction performance may be available. In this situation especially, though not only then, other aspects of the models can play an important role. Particularly noteworthy aspects of a model are sparsity, stability, interpretability, and practical usefulness [ 7 , 182 ]. Regarding sparsity, when selecting a final model from among several with comparable prediction performance, selection of the most parsimonious (e.g., the model with the smallest number of predictor variables) is preferred. Stability refers to the degree to which small changes in the data may produce large changes in the predictor output. A majority of predictors derived from HDD suffer from poor stability, irrespective of the method used to fit them, although some methods are more affected than others (see [ 183 ] for an overview of stability measures, and Sauerbrei et al. [ 184 ] for stability investigations of regression models for LDD and HDD). For HDD, the stability problem is due to the myriad ways to combine a set of predictor variables to derive similarly performing predictors. If the stability is found to be low, then interpretation of specific model components (the list of selected predictor variables, relationships between predictor variables, etc.) should be avoided. In terms of interpretability of the model, strong prior biological knowledge may also be taken into account, similarly to the aim of data reduction described above (Table 25 ).

Receiver operating characteristic (ROC) curve that illustrates the predictive performance of a gene signature including 227 genes for the prediction of chemotherapy response in serous ovarian cancer, obtained using the TCGA (The Cancer Genome Atlas) data set. The arrow indicates the sensitivity and specificity values obtained for a selected cutoff value that can serve as a threshold for patient stratification. In this example, the AUC is evaluated on the same data used to train the classifier, so it is likely to be overoptimistic. Source: [ 185 ]

Illustrations of different types of miscalibration, visualized by calibration plots. Illustrations are based on an outcome with a 25% event rate and a model with an area under the ROC curve (AUC or c-statistic) of 0.71. Calibration intercept and slope are indicated for each illustrative curve. a General over- or underestimation of predicted risks. b Predicted risks that are too extreme or not extreme enough. Source: [ 187 ]
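The calibration intercept and slope mentioned in this caption can be estimated by a logistic regression of the observed outcome on the logit of the predicted risk. The sketch below is an illustration under stated assumptions: the data are simulated, the helper names are hypothetical, and intercept and slope are estimated jointly for brevity (calibration-in-the-large is often estimated separately with the slope fixed at 1). A slope near 1 indicates predictions of appropriate spread; a slope below 1 indicates predictions that are too extreme.

```python
import numpy as np

def logit(p):
    p = np.asarray(p, float)
    return np.log(p / (1.0 - p))

def calibration_intercept_slope(y, p_hat, n_iter=50, tol=1e-10):
    """Fit logit P(y = 1) = a + b * logit(p_hat) by Newton-Raphson; return (a, b)."""
    y = np.asarray(y, float)
    X = np.column_stack([np.ones_like(y), logit(p_hat)])
    beta = np.zeros(2)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - mu)                             # score vector
        hess = X.T @ (X * (mu * (1.0 - mu))[:, None])     # observed information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return float(beta[0]), float(beta[1])

# Simulated example (assumption): outcomes generated from known true risks
rng = np.random.default_rng(1)
lp_true = rng.normal(-1.2, 1.0, size=2000)
p_true = 1.0 / (1.0 + np.exp(-lp_true))
y = rng.binomial(1, p_true)

a_ok, b_ok = calibration_intercept_slope(y, p_true)        # well-calibrated risks
p_extreme = 1.0 / (1.0 + np.exp(-2.0 * lp_true))            # logits doubled: too extreme
a_ext, b_ext = calibration_intercept_slope(y, p_extreme)
print(b_ok, b_ext)   # the second slope is half the first
```

Doubling the logits of the predictions halves the fitted slope, which is the pattern of predicted risks that are too extreme, as typically produced by overfitted models.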

PRED2.2: Internal and external validation

Whatever measure of model performance has been chosen, computing it on the same dataset that was used for constructing the model may lead to a dramatic over-estimation of the performance. Instead, one should assess prediction performance using independent data, i.e., data not used to construct the model [ 189 , 190 , 191 ]. One classical procedure is to split the given dataset into a training set and a test set, and then to construct the model using only the training set and evaluate the model using only the test set. This is one type of “internal validation,” in contrast to “external validation,” where data from independent patient cohorts are used [ 192 ].

Due to the typical instability of predictors developed using HDD, this sample splitting procedure is very risky, as in most cases the specific split heavily influences the result. Resampling techniques, such as cross-validation, subsampling and bootstrapping, can be less risky, although even those methods cannot avoid the impact of biases in the data introduced by faulty designs such as those that would confound batch effects with outcome variables. The common idea behind these procedures is to repeatedly use one part of the dataset as a training dataset, i.e., to construct a prediction model, and the other (non-overlapping) part as a test dataset to evaluate the constructed model. This process is repeated several times for different splits into training and test data to produce a more stable and representative estimate of model performance.
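The contrast between apparent performance and resampled performance can be sketched with k-fold cross-validation. The example below is a simplified illustration, not a recommended analysis pipeline: it uses ridge regression without an intercept, a fixed penalty, and simulated data in which the outcome is unrelated to the 200 predictors. All names and settings are assumptions made for the illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept, for brevity)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, lam, k=5, seed=0):
    """k-fold cross-validated mean squared prediction error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    sq_err = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        beta = ridge_fit(X[train_idx], y[train_idx], lam)   # fit on training part only
        sq_err.append((y[test_idx] - X[test_idx] @ beta) ** 2)
    return float(np.mean(np.concatenate(sq_err)))

# Simulated HDD-like setting (assumption): n = 60 subjects, p = 200 pure-noise predictors
rng = np.random.default_rng(42)
n, p = 60, 200
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

apparent = float(np.mean((y - X @ ridge_fit(X, y, 1.0)) ** 2))  # same data for fit and test
resampled = cv_mse(X, y, 1.0)
print(apparent, resampled)
```

Because p exceeds n, the model can reproduce the training outcomes almost exactly, so the apparent error is close to zero even though the predictors carry no information; the cross-validated error reveals this.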

For such approaches, a bias in the performance estimates must be considered (see also [ 135 ]). This bias occurs because the training data sample size is smaller than for the full dataset, and therefore prediction models built on the training dataset tend to have somewhat worse performance than a final model built on the full data. The latter is typically used for further evaluation. This bias becomes larger the smaller the training dataset is compared to the full dataset. This aspect is less relevant if the sample size of the full dataset and thus of the training dataset is very large.

One misleading practice is use of resampling procedures for multiple different prediction modelling methods or for different parameter values, and then reporting results for only the model with best performance. This practice leads to over-optimism in model performance because it neglects to acknowledge and account for the fact that the reported model was the result of another optimization process [ 201 ]. Such studies aiming to find “best” models occur quite frequently in the context of HDD. While it would be naive to expect that investigators will not try multiple approaches to develop a prediction model, the key is transparency in reporting how many models were actually produced and evaluated, and appropriately accounting for the additional selection step. One should either validate the final selected model using an independent dataset (see Table 26 ), or when such a dataset is not available, embed the selection process in the cross-validation procedure, i.e., perform a so-called nested cross-validation procedure [ 190 , 202 ]. Figure 19 [ 203 ] shows a schematic representation of a suitable process for developing a predictor, here specified for omics data, in which the discussed aspects are adequately taken into account.
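A nested cross-validation procedure can be sketched as follows. In this illustration the "selection step" is the choice of a ridge penalty from a small grid; the inner loop makes that choice using only the outer training data, and the outer loop evaluates the selected model on data that played no role in the selection. The grid, the simulated data, and all function names are assumptions made for the example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept, for brevity)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def fold_mse(X, y, train, test, lam):
    beta = ridge_fit(X[train], y[train], lam)
    return float(np.mean((y[test] - X[test] @ beta) ** 2))

def cv_error(X, y, folds, lam):
    """Mean test MSE over folds; each fold is predicted by a model fit to the others."""
    errs = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        errs.append(fold_mse(X, y, train, test, lam))
    return float(np.mean(errs))

def nested_cv(X, y, lambdas, k_outer=5, k_inner=5, seed=0):
    rng = np.random.default_rng(seed)
    outer = np.array_split(rng.permutation(len(y)), k_outer)
    outer_errs, chosen = [], []
    for i, test in enumerate(outer):
        train = np.concatenate([f for j, f in enumerate(outer) if j != i])
        # Inner loop: select the penalty using the outer training data only
        inner = np.array_split(rng.permutation(train), k_inner)
        inner_errs = [cv_error(X, y, inner, lam) for lam in lambdas]
        best = lambdas[int(np.argmin(inner_errs))]
        chosen.append(best)
        # Outer loop: evaluate the selected model on the untouched outer test fold
        outer_errs.append(fold_mse(X, y, train, test, best))
    return float(np.mean(outer_errs)), chosen

rng = np.random.default_rng(7)
n, p = 60, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lambdas = [0.1, 1.0, 10.0, 100.0, 1000.0]

# Misleading practice: run CV for every candidate and report only the best result
folds = np.array_split(np.random.default_rng(0).permutation(n), 5)
naive_best = min(cv_error(X, y, folds, lam) for lam in lambdas)

nested_estimate, chosen = nested_cv(X, y, lambdas)
print(naive_best, nested_estimate, chosen)
```

The quantity `naive_best` is the minimum over all candidates and is therefore biased toward optimism; `nested_estimate` accounts for the selection step because the selection is repeated inside every outer training set.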

Schematic representation of an appropriate omics predictor development process, with internal validation for improving prediction performance (left box) and external validation for assessing prediction performance on external data. Source: [ 203 ]

PRED2.3: Identification of influential points

Identification of possible influential observations, defined as those for which inclusion or exclusion in model development might substantially alter characteristics of the final model [ 204 ], is an important aspect of prediction modelling that is often neglected in HDD settings, and even frequently in low-dimensional settings. Model alterations can be related to variable selection (see, e.g., [ 205 ]), functional forms (e.g., [ 206 ]) or parameter estimation (e.g., [ 207 ]). Influential points can be outliers in some of the variables (observations suspiciously different from the rest, such that they are probably generated by a different mechanism [ 208 ]), but they do not need to be.

For HDD, identification of influential points is particularly difficult, due to data sparsity (the so-called “curse of dimensionality”) and, more generally, the increased difficulty in identifying data patterns, especially by graphical methods. Many available methods for influential point detection are extensions of traditional low-dimensional tools such as Cook’s distance [ 204 ] and the DFBETA / DFFITS measures [ 209 ]. Examples of adaptations to the high-dimensional framework of the former are Zhao et al. [ 210 ] and Wang and Li [ 211 ], of the latter Walker and Birch [ 212 ] and Rajaratnam et al. [ 213 ]. Focusing more on statistical models (see section “PRED1.4: Statistical modelling”), methods like those of Shi and Wang [ 214 ] and Hellton et al. [ 215 ] investigate the effect of influential points on the choice of the tuning parameters, again adapting existing low-dimensional approaches (the aforementioned DFFITS measure and the resampling approaches in De Bin et al. [ 205 ], respectively). The latter is an example of how cross-validation and subsampling can be used to detect influential points by tracking large changes in the estimates when one (or a few) observations are omitted. Although influential points and outliers can strongly affect the results of analyses in HDD settings, systematic checks for them seem to be often ignored in the literature, despite the availability of various techniques [ 216 , 217 ].
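For orientation, the classical low-dimensional version of Cook's distance can be sketched in a few lines. The example computes it for ordinary least squares in two ways: from the closed-form expression based on leverages, and by literally refitting the model with each observation omitted, which mirrors the leave-one-out idea described above. The simulated data and function names are assumptions; this is not one of the high-dimensional adaptations cited in the text.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for OLS from residuals and leverages (closed form)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverages: diagonal of the hat matrix
    s2 = resid @ resid / (n - p)                  # residual variance estimate
    return resid**2 * h / (p * s2 * (1.0 - h) ** 2)

def cooks_distance_loo(X, y):
    """Same quantity by brute force: refit without observation i and compare fits."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = np.sum((y - X @ beta) ** 2) / (n - p)
    d = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        shift = X @ (beta - beta_i)               # change in all fitted values
        d[i] = shift @ shift / (p * s2)
    return d

# Simulated data (assumption) with one planted influential observation at index 0
rng = np.random.default_rng(3)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 6.0, -5.0                            # high leverage and far from the trend
X = np.column_stack([np.ones(n), x])

d_formula = cooks_distance(X, y)
d_loo = cooks_distance_loo(X, y)
print(int(np.argmax(d_formula)))                  # index of the most influential point
```

The closed form avoids refitting and exists only for simple estimators such as OLS; for penalized or tuned models, the brute-force refitting approach is the one that generalizes, at higher computational cost.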

In a study on the classification of breast cancer subtypes, Segaert et al. [ 218 ] stress that classical statistical methods may fail to identify outliers and argue for robust classification methods that flag outliers. They propose the DetectDeviatingCells outlier detection technique. Specifically for HDD, Boulesteix et al. [ 216 ] propose a rank discrepancy measure that considers the difference between gene rankings for the original data and for a pretransformation that tries to eliminate the effect of extreme values. For survival data, Carrasquinha et al. [ 219 ] propose a rank product test to identify influential observations, and more techniques have been proposed recently. Fan [ 220 ] released the R package HighDimOut, which contains three high-dimensional outlier detection algorithms. However, none of the approaches seems to have gained popularity in practice. More guidance and comparisons of available approaches are needed.

PRED2.4: Sample size considerations

Recent guidelines for calculating sample size when developing a risk prediction model [ 16 ] are not specifically tailored for applications involving variable selection or shrinkage methods (such as LASSO or Ridge Regression). This is the situation in high-dimensional settings, where variable selection or dimension reduction is needed to identify prognostic variables or components. The available methods for sample size planning [ 17 , 18 ] and references therein are based either on simulations that require assumptions of feature independence or on the availability of a pilot dataset or preliminary data, but these methods are hardly used in practice. Moreover, penalized estimation has often been proposed for situations with potentially large overfitting problems, while recent evidence suggests that it yields unstable results, especially for small sample sizes [ 221 ], when overfitting is a major concern.

A practical sample size method for planning a preliminary study of a prognostic biomarker is suggested for microarray technology [ 113 ], which can be used in more general settings. When a gene signature is already available from previous exploratory studies, a formal calculation for a predictive model, including the gene signature and standard prognostic covariates, can be performed according to available guidelines, taking also into account the need for external validation [ 16 , 200 ].
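As a rough illustration of what a formal calculation for a prespecified model can look like, the sketch below implements one of the criteria described in the guideline cited as [ 16 ], as best understood here: the minimum sample size needed to keep expected shrinkage at or above a target S, namely n = P / ((S − 1) ln(1 − R²_CS / S)), where P is the number of candidate predictor parameters and R²_CS is the anticipated Cox-Snell R². The function name and the input values are assumptions, the guideline combines this with further criteria, and, as noted above, the approach is not tailored to settings with variable selection on HDD.

```python
import math

def min_n_for_shrinkage(n_params, r2_cs, shrinkage=0.9):
    """Minimum n so that expected uniform shrinkage is at least `shrinkage`.

    n = P / ((S - 1) * ln(1 - R2_CS / S)); inputs are anticipated values
    that must be justified by the investigator (assumption of this sketch).
    """
    if not (0.0 < shrinkage < 1.0):
        raise ValueError("shrinkage must lie strictly between 0 and 1")
    if not (0.0 < r2_cs < shrinkage):
        raise ValueError("r2_cs must lie strictly between 0 and shrinkage")
    return n_params / ((shrinkage - 1.0) * math.log(1.0 - r2_cs / shrinkage))

# Hypothetical planning values: 20 parameters (e.g., a signature score plus
# clinical covariates), anticipated Cox-Snell R^2 of 0.1, target shrinkage 0.9
n_required = min_n_for_shrinkage(20, 0.1, 0.9)
print(round(n_required, 1))
```

The required sample size grows linearly with the number of parameters, which is why such calculations become feasible once a gene signature has been condensed into one or a few predictors.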

Good reporting to improve transparency and reproducible research

Reporting of studies involving HDD can be particularly challenging and at the same time especially important due to the many potential pitfalls in the collection and analysis of complex HDD as described herein. Complete and transparent reporting of these studies is critical to allow independent researchers to evaluate how a study was designed, conducted, and analyzed so that quality and relevance of the findings can be judged and interpreted in appropriate context. Provision of data and computer code may be required to achieve full transparency.

Guidelines for reporting of many types of health research have already been developed and are largely applicable in HDD settings. Simera et al. [ 222 ] introduced the EQUATOR (Enhancing the QUAlity and Transparency Of health Research) network as an umbrella organization for the reporting of studies in the health sciences. Most relevant for HDD are the REporting recommendations for tumor MARKer prognostic studies (REMARK) [ 11 ] and TRIPOD for the reporting of multivariable prediction models for individual prognosis or diagnosis [ 12 ]. For both reporting guidelines, more detailed “explanation and elaboration” papers have been published [ 125 , 223 ], which also include several sections on statistical analyses. Furthermore, the two-part REMARK profile, a structured display that summarizes key aspects of a study, has been proposed to improve completeness and transparency of reporting, specifically of statistical analyses. The TRIPOD checklist distinguishes between model development and validation. Both guidelines were developed for markers and for models based on clinical data, with no more than a few dozen potential predictors in mind.

In an article stressing the importance of registering diagnostic and prognostic research, Altman [ 224 ] clearly expresses that non-reporting and misleading reporting do not just mislead researchers in the field, they also diminish the evidence base underpinning clinical practice and harm patients. To improve on such an unacceptable situation of non-transparency in studies, several initiatives including data pooling, registers, and journal requirements for protocols were started, see Peat et al. [ 225 ] for a detailed discussion with an emphasis on prognosis research.

Obviously, reporting of artificial intelligence and machine learning methods comes with a large number of additional challenges. Concerns have been raised that they are overhyped in clinical medicine (see, e.g., [ 226 ]) and, if not used with proper expertise, have methodological shortcomings, poor transparency, and poor reproducibility [ 227 ]. There is a strong need for applications of machine learning techniques to adhere to established methodological standards already defined in prediction model research [ 228 ].

In this section, we first summarize the content and the key messages of this overview paper. We also briefly present the relationships of the other topic groups of the STRATOS initiative to the HDD-focused TG9 group and discuss the importance of further collaboration.

Biomedical research has always relied on a combination of observational studies, carefully controlled laboratory experiments, and clinical trials, but the types of data generated and analyzed in these studies continue to evolve and now more often include HDD. The high dimensionality may result from new technologies such as omics assays, which are capable of comprehensive interrogation of biological specimens, or from increased ability to merge data from multiple information systems such as electronic health records or registries. HDD present many new challenges for statistical design and analysis of biomedical research studies. This overview provides a gentle introduction to basic concepts and useful strategies for design and analysis of studies involving HDD. Key points are summarized in the discussion that follows.

Study design for prospectively planned investigations and vigilance to detect (and avoid when possible) confounding in observational studies remain as important for studies involving HDD as for other studies. Consequences of inattention to these aspects can be particularly damaging when HDD are involved. While HDD may provide greater opportunity for discovery of new biological and clinical concepts and associations, they might also be more susceptible to the influence of ancillary confounding variables and technical artifacts. Therefore, initial examination of data for technical artifacts such as batch effects and for inconsistent, extreme, or suspicious values is critically important but simultaneously more challenging as the data dimension increases. New data visualization, detection, and correction or normalization methods have been adopted for HDD, as were described in section “ IDA: Initial data analysis and preprocessing ” of this overview. Techniques for data visualization and exploration such as those described in section “ EDA: Exploratory data analysis ” of this overview are also important to provide biological insights and support development of new scientific hypotheses from HDD. The initial steps and exploratory data analyses described in sections “ IDA: Initial data analysis and preprocessing ” and “ EDA: Exploratory data analysis ” are optimally performed when equipped with a good understanding of the data sources and data generation methods, for example assay technologies that produce omics data, and interpreted in collaboration with other scientists knowledgeable in the technology, biology, and clinical aspects.

Statistical analysis methods that were developed for traditional settings where the number of independent observations or subjects is substantially larger than the number of variables acquired are widely used by classically trained statisticians and others in a variety of applications, and their widespread use is supported by ready availability of software. Emergence of many new types of HDD has exposed the limitations of many traditional methods. Often, methods rely heavily on distributional assumptions such as normality, which may be unrealistic for data types generated by novel technologies such as omics assays. Many methods owe their robustness against violations of such assumptions to large sample sizes, yet the notion of what qualifies as “large n ” is dramatically different for HDD where even the most basic requirement n > p is not satisfied. Much traditional statistical methodology for addressing multivariate data has focused heavily on mathematically tractable joint distributions such as multivariate Gaussian or assumed that sample sizes were large enough that this served as a good approximation. As these are not reasonable assumptions for many types of HDD, many researchers opt for an alternative strategy of examining each of many variables one at a time. Yet, naively taking such an approach is fraught with danger of generating many false discoveries due to the extremely large number of variables examined. Traditional strategies for controlling false positive findings, such as controlling the FWER, are often impractical or overly stringent in view of the goals of many studies involving HDD, and this recognition has stimulated development of novel approaches for false discovery control. Section “ TEST: Identification of informative variables and multiple testing ” of the overview highlighted some of these many challenges and summarized some useful strategies to address them.
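The difference in stringency between FWER control and FDR control can be made concrete with a small sketch comparing the Bonferroni correction with the Benjamini-Hochberg step-up procedure. The p-values below are invented for illustration, and the function names are assumptions.

```python
def bonferroni_reject(pvals, alpha=0.05):
    """FWER control: reject when p <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg_reject(pvals, alpha=0.05):
    """FDR control: find the largest rank k with p_(k) <= k * alpha / m,
    then reject the k hypotheses with the smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices by ascending p-value
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Hypothetical p-values from ten one-at-a-time tests
pvals = [0.001, 0.008, 0.012, 0.019, 0.030, 0.041, 0.200, 0.500, 0.740, 0.900]
n_bonferroni = sum(bonferroni_reject(pvals))
n_bh = sum(benjamini_hochberg_reject(pvals))
print(n_bonferroni, n_bh)   # 1 4
```

With these values Bonferroni rejects a single hypothesis whereas Benjamini-Hochberg rejects four, at the price of tolerating a controlled expected proportion of false discoveries among the rejections.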

The last few decades have seen substantial progress in development of prediction modelling methodology, especially as applicable to HDD, and increased availability of free software to implement these methods has fueled their use. Available methods include a variety of statistically based approaches as well as a number of purely algorithmic approaches such as many machine learning methods. Prediction models developed from HDD have intrigued many researchers under the impression that with sufficiently large volumes of data one should be capable of predicting virtually anything. Numerous dramatic claims of performance have been made; unfortunately, these claims do not always withstand careful scrutiny. Section “ PRED: Prediction ” provided a review of several popular prediction modelling methods for HDD and stressed the importance of following proper procedures to assess and avoid model overfitting, which leads to prediction models that do not perform well outside of the data from which they were developed. Poor study design and faulty prediction modelling approaches that lead to spurious and overfitted models along with wildly inaccurate claims of their performance persist in the biomedical literature. Guidance provided in section “ PRED: Prediction ” aims to reduce this problem and promote successful development of useful prediction models.

Within the STRATOS initiative, there are currently nine topic groups (TGs), mainly concerned with LDD. Table 27 presents the relationship of the other STRATOS topic groups to TG9 group and how TG9 guidance will build upon that of other TGs to adapt it for relevance.

This overview aimed to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are newly embarking on research involving HDD or who are merely wanting to better evaluate and understand results of HDD analyses. Common approaches for the statistical analysis of high-dimensional biomedical data are described in 24 method tables; see Table 28 for a list of these tables. New methods to generate HDD or combine existing data resources to yield HDD will continue to evolve, and there will be continued need to develop new and improved computational and statistical analysis strategies to address new types of data and novel questions to be answered from those data. Basic concepts and strategies presented in this overview will remain relevant, and their wider grasp by the biomedical research community will hopefully lead to continued improvement in the quality, reliability, and value of studies involving HDD. Most importantly, strong collaborations between statisticians, computational scientists, and other biomedical researchers such as clinicians, public health experts, laboratorians, technology experts, bioinformaticians, and others that are relevant to each project, are essential to produce the most reliable and meaningful data and results.

Availability of data and materials

Not applicable.

Abbreviations

AIC: Akaike information criterion

ANOVA: Analysis of variance

ASW: Average silhouette width

AUC: Area under the curve

bagging: Bootstrap AGGregatING

BH: Benjamini-Hochberg

BIC: Bayesian information criterion

BMI: Body mass index

CAST: Cluster Affinity Search Technique

CLARA: Clustering Large Applications

ComBat: Methods for adjustment of batch effects

DESeq2: Differential gene expression analysis of RNA-seq data

EDA: Exploratory data analysis

EM: Expectation-maximization

EQUATOR: Enhancing the QUAlity and Transparency Of health Research

FDP: False discovery proportion

FDR: False discovery rate

FWER: Familywise error rate

GO: Gene ontology

GSEA: Gene set enrichment analysis

HDD: High-dimensional data

IDA: Initial data analysis

IOM: Institute of Medicine of the National Academy of Sciences, USA

LDD: Low-dimensional data

limma: Linear Models for Microarray Data

m/z: Mass-to-charge ratio

MA plot: Bland-Altman plot

MAD: Median absolute deviation to the median

MAE: Mean absolute (prediction) error

MAR: Missing at random

MCAR: Missing completely at random

MDS: Multidimensional scaling

MNAR: Missing not at random

MSE: Mean squared (prediction) error

NB: Negative binomial

NMDS: Non-Metric Multidimensional Scaling

PAM: Partitioning around medoids

PC: Principal component

PCA: Principal component analysis

PET: Positron emission tomography

PLS: Partial least squares

REMARK: REporting recommendations for tumor MARKer prognostic studies

RLE: Relative log expression

RMSE: Root mean squared error

ROC: Receiver operating characteristic

SNE: Stochastic neighbor embedding

SOM: Self-organizing maps

STRATOS: STRengthening Analytical Thinking for Observational Studies

SuperPC: Supervised principal components

SVA: Surrogate variable analysis

SVM: Support vector machine

TG: Topic group

topGO: Topology-based Gene Ontology enrichment analysis

TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis

t-SNE: T-Distributed Stochastic Neighbor Embedding

UMAP: Uniform Manifold Approximation and Projection

WHR: Waist-hip ratio

Sauerbrei W, Abrahamowicz M, Altman DG, le Cessie S, Carpenter J, on behalf of STRATOS initiative. STRengthening Analytical Thinking for Observational Studies: The STRATOS initiative. Stat Med. 2014;33:5413–32. https://doi.org/10.1002/sim.6265 .

Johnstone IM, Titterington DM. Statistical challenges of high-dimensional data. Philos Trans A Math Phys Eng Sci. 2009;367(1906):4237–53. https://doi.org/10.1098/rsta.2009.0159 .

McGrath S. The Influence of ‘Omics’ in Shaping Precision Medicine. EMJ Innov. 2018;2(1):50–5. https://doi.org/10.33590/emjinnov/10313528 .

Evans RS. Electronic Health Records: then, now, and in the future. Yearb Med Inform Suppl. 2016;1:48–61. https://doi.org/10.15265/IYS-2016-s006 .

Cowie MR, Blomster JI, Curtis LH, Duclaux S, Ford I, Fritz F, Goldman S, Janmohamed S, Kreuzer J, Leenay M, Michel A, Ong S, Pell JP, Southworth MR, Stough WG, Thoenes M, Zannad F, Zalewski A. Electronic health records to facilitate clinical research. Clin Res Cardiol. 2017;106(1):1–9. https://doi.org/10.1007/s00392-016-1025-6 .

McShane LM, Cavenagh MM, Lively TG, Eberhard DA, Bigbee WL, Williams PM, Mesirov JP, Polley MY, Kim KY, Tricoli JV, Taylor JM, Shuman DJ, Simon RM, Doroshow JH, Conley BA. Criteria for the use of omics-based predictors in clinical trials. Nature. 2013;502(7471):317–20. https://doi.org/10.1038/nature12564 .

Wyatt JC, Altman DG. Commentary: Prognostic models: clinically useful or quickly forgotten? BMJ. 1995;311:1539. https://doi.org/10.1136/bmj.311.7019.1539 .

Hand DJ. Classifier technology and the illusion of progress. Stat Sci. 2006;21(1):1–14. https://doi.org/10.1214/088342306000000060 .

Hernández B, Parnell A, Pennington SR. Why have so few proteomic biomarkers “survived” validation? (Sample size and independent validation considerations). Proteomics. 2014;14:1587–92. https://doi.org/10.1002/pmic.201300377 .

Kleinrouweler CE, Cheong-See FM, Collins GS, Kwee A, Thangaratinam S, Khan KS, Mol BW, Pajkrt E, Moons KG, Schuit E. Prognostic models in obstetrics: available, but far from applicable. Am J Obstet Gynecol. 2016;214(1):79-90.e36. https://doi.org/10.1016/j.ajog.2015.06.013 .

McShane LM, Altman DG, Sauerbrei W, Taube SE, Gion M, Clark GM, for the Statistics Subcommittee of the NCI-EORTC Working Group on Cancer Diagnostics. REporting recommendations for tumor MARKer prognostic studies (REMARK). J Natl Cancer Inst. 2005;97:1180–4. https://doi.org/10.1093/jnci/dji237 .

Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. BMC Med. 2015;13:1. https://doi.org/10.1186/s12916-014-0241-z .

Zhou H, Chen J, Rissanen TH, Korrick SA, Hu H, Salonen JT, Longnecker MP. Outcome-dependent sampling: an efficient sampling and inference procedure for studies with a continuous outcome. Epidemiology. 2007;18(4):461–8. https://doi.org/10.1097/EDE.0b013e31806462d3 .

Yu J, Liu Y, Cai J, Sandler DP, Zhou H. Outcome-dependent sampling design and inference for Cox’s proportional hazards model. J Stat Plan Inference. 2016;178:24–36. https://doi.org/10.1016/j.jspi.2016.05.001 .

Cairns DA. Statistical issues in quality control of proteomic analyses: good experimental design and planning. Proteomics. 2011;11(6):1037–48. https://doi.org/10.1002/pmic.201000579 .

Riley RD, Ensor J, Snell KIE, Harrell FE, Martin GP, Reitsma JB, Moons KGM, Collins G, van Smeden M. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020;368:m441. https://doi.org/10.1136/bmj.m441 .

Götte H, Zwiener I. Sample size planning for survival prediction with focus on high-dimensional data. Stat Med. 2013;32(5):787–807. https://doi.org/10.1002/sim.5550 .

Dobbin KK, Song X. Sample size requirements for training high-dimensional risk predictors. Biostatistics. 2013;14(4):639–52. https://doi.org/10.1093/biostatistics/kxt022 .

Maleki F, Ovens K, McQuillan I, Kusalik AJ. Size matters: how sample size affects the reproducibility and specificity of gene set analysis. Hum Genomics. 2019;13(Suppl 1):42. https://doi.org/10.1186/s40246-019-0226-2 .

Geschwind DH. Sharing gene expression data: an array of options. Nat Rev Neurosci. 2001;2(6):435–8. https://doi.org/10.1038/35077576 .

Kennedy RE, Cui X. Experimental Designs and ANOVA for Microarray Data. In: Handbook of Statistical Bioinformatics. Berlin, Heidelberg: Springer; 2011. p. 151–69.

Lusa L, Cappelletti V, Gariboldi M, Ferrario C, De Cecco L, Reid JF, Toffanin S, Gallus G, McShane LM, Daidone MG, Pierotti MA. Questioning the utility of pooling samples in microarray experiments with cell lines. Int J Biol Markers. 2006;21(2):67–73. https://doi.org/10.1177/172460080602100201 .

Huebner M, Vach W, le Cessie S. A systematic approach to initial data analysis is good research practice. J Thorac Cardiovasc Surg. 2016;151(1):25–7. https://doi.org/10.1016/j.jtcvs.2015.09.085 .

Huebner M, le Cessie S, Schmidt CO, Vach W. A contemporary conceptual framework for initial data analysis. Observational Studies. 2018;4:171–92. https://doi.org/10.1353/obs.2018.0014 .

Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S, editors. Bioinformatics and computational biology solutions using R and Bioconductor. New York: Springer Science & Business Media; 2005.

Friendly M. Corrgrams: Exploratory displays for correlation matrices. Am Stat. 2002;56(4):316–24. https://doi.org/10.1198/000313002533 .

Chen Y, McCarthy D, Ritchie M, Robinson M, Smyth G. edgeR: differential analysis of sequence read count data User’s Guide. Bioconductor.org. 2008. https://www.bioconductor.org/packages/devel/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf . Cited 2022 Nov 29.

Wilkinson L, Friendly M. The History of the Cluster Heat Map. Am Stat. 2009;63(2):179–84. https://doi.org/10.1198/tas.2009.0033 .

Leek JT, Scharpf R, Bravo H, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, Irizarry RA. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010;11:733–9. https://doi.org/10.1038/nrg2825 .

The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74. https://doi.org/10.1038/nature15393 .

Irizarry R, Love M. Data Analysis for the Life Sciences with R. CRC Press. 2016. https://doi.org/10.1201/9781315367002 .

Gandolfo LC, Speed TP. RLE plots: visualizing unwanted variation in high dimensional data. PLoS ONE. 2018;13(2):e0191629. https://doi.org/10.1371/journal.pone.0191629 .

Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;327(8476):307–10. https://doi.org/10.1016/S0140-6736(86)90837-8 .

Smyth GK, Speed T. Normalization of cDNA microarray data. Methods. 2003;31(4):265–73. https://doi.org/10.1016/S1046-2023(03)00155-5 .

Sauerbrei W, Buchholz A, Boulesteix AL, Binder H. On stability issues in deriving multivariable regression models. Biom J. 2015;57(4):531–55. https://doi.org/10.1002/bimj.201300222 .

Altman DG, Bland JM. Missing data. BMJ. 2007;334(7590):424. https://doi.org/10.1136/bmj.38977.682025.2C .

Findlay JWA, Dillard RF. Appropriate calibration curve fitting in ligand binding assays. AAPS J. 2007;9(2):E260–7. https://doi.org/10.1208/aapsj0902029 .

Pearson K. LIII. On lines and planes of closest fit to systems of points in space. London Edinburgh Dublin Philos Mag J Sci. 1901;2(11):559–72. https://doi.org/10.1080/14786440109462720 .

Park M, Lee JW, Lee JB, Song SH. Several biplot methods applied to gene expression data. J Stat Plan Inference. 2008;138(2):500–15. https://doi.org/10.1016/j.jspi.2007.06.019 .

Gabriel KR. The biplot graphic display of matrices with application to principal component analysis. Biometrika. 1971;58(3):453–67. https://doi.org/10.1093/biomet/58.3.453 .

Silver JD, Ritchie ME, Smyth GK. Microarray background correction: maximum likelihood estimation for the normal-exponential convolution. Biostatistics. 2009;10(2):352–63. https://doi.org/10.1093/biostatistics/kxn042 .

Coombes KR, Baggerly KA, Morris JS. Pre-processing mass spectrometry data. In: Dubitzky W, Granzow M, Berrar DP, editors. Fundamentals of data mining in genomics and proteomics. Boston: Springer; 2007. https://doi.org/10.1007/978-0-387-47509-7_4 .

Bolstad B, Irizarry R, Astrand M, Speed T. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19(2):185–93. https://doi.org/10.1093/bioinformatics/19.2.185 .

Monti S. Quantile normalization. Github.io. https://montilab.github.io/BS831/articles/docs/quantileNormalization.html . cited 2022 Nov 29

Oberg AL, Mahoney DW. Statistical methods for quantitative mass spectrometry proteomic experiments with labeling. BMC Bioinformatics. 2012;13(16):S7. https://doi.org/10.1186/1471-2105-13-S16-S7 .

Ejigu BA, Valkenborg D, Baggerman G, Vanaerschot M, Witters E, Dujardin JC, Burzykowski T, Berg M. Evaluation of normalization methods to pave the way towards large-scale LC-MS-based metabolomics profiling experiments. Omics J Integr Biol. 2013;17(9):473–85. https://doi.org/10.1089/omi.2013.0010 .

Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–27. https://doi.org/10.1093/biostatistics/kxj037 .

Zhang Y, Parmigiani G, Johnson WE. ComBat-Seq: batch effect adjustment for RNA-Seq count data. NAR Genom Bioinformatics. 2020;2(3):lqaa078. https://doi.org/10.1093/nargab/lqaa078 .

Wang Y, Lê Cao KA. Managing batch effects in microbiome data. Brief Bioinform. 2020;21(6):1954–70. https://doi.org/10.1093/bib/bbz105 .

Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by ‘Surrogate Variable Analysis.’ PLoS Genetics. 2007;3(9):e161. https://doi.org/10.1371/journal.pgen.0030161 .

Leek JT. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 2014;42(21):e161. https://doi.org/10.1093/nar/gku864 .

Bourgon R, Gentleman R, Huber W. Independent filtering increases detection power for high-throughput experiments. PNAS. 2010;107(21):9546–51. https://doi.org/10.1073/pnas.0914005107 .

Lusa L, Korn EL, McShane LM. A class comparison method with filtering-enhanced variable selection for high-dimensional data sets. Statist Med. 2008;27(28):5834–49. https://doi.org/10.1002/sim.3405 .

Hornung R, Bernau C, Truntzer C, Wilson R, Stadler T, Boulesteix AL. A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization. BMC Med Res Methodol. 2015;15:95. https://doi.org/10.1186/s12874-015-0088-9 .

Bair E, Tibshirani R. Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol. 2004;2(4):e108. https://doi.org/10.1371/journal.pbio.0020108 .

Greenland S. Avoiding power loss associated with categorization and ordinal scores in dose-response and trend analysis. Epidemiology. 1995;6(4):450–4. https://doi.org/10.1097/00001648-199507000-00025 .

Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Statist Med. 2006;25(1):127–41. https://doi.org/10.1002/sim.2331 .

Lee K, Tilling K, Cornish R, Carpenter J. Framework for the treatment and reporting of missing data in observational studies: the TARMOS framework. Int J Epidemiol. 2021;50(Supplement_1). https://doi.org/10.1093/ije/dyab168.371

Zhao Y, Long Q. Multiple imputation in the presence of high-dimensional data. Stat Methods Med Res. 2016;25(5):2021–35. https://doi.org/10.1177/0962280213511027 .

Aittokallio T. Dealing with missing values in large-scale studies: microarray data imputation and beyond. Brief Bioinform. 2010;11(2):253–64. https://doi.org/10.1093/bib/bbp059 .

White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Stat Med. 2011;30(4):377–99. https://doi.org/10.1002/sim.4067 .

Cox TF, Cox M. Multidimensional Scaling. Boca Raton: Chapman & Hall/CRC; 2001. https://doi.org/10.1007/978-3-540-33037-0_14 .

Torgerson WS. Multidimensional Scaling I: Theory and Method. Psychometrika. 1952;17:401–19. https://doi.org/10.1007/BF02288916 .

Gower JC. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika. 1966;53:325–38. https://doi.org/10.1093/biomet/53.3-4.325 .

Kruskal JB. Nonmetric multidimensional scaling: a numerical method. Psychometrika. 1964;29:115–29. https://doi.org/10.1007/BF02289694 .

Van der Maaten LJP, Hinton GE. Visualizing high-dimensional data using t-SNE. J Mach Learn Res. 2008;9:2579–605.

Hinton GE, Roweis ST. Stochastic neighbor embedding. In: Advances in Neural Information Processing Systems. 2003. p. 857–64.

McInnes L, Healy J, Saul N, Großberger L. UMAP: Uniform Manifold Approximation and Projection. J Open Source Softw. 2018;3(29):861. https://doi.org/10.21105/joss.00861 .

Becht E, McInnes L, Healy J, Dutertre CA, Kwok IWH, Ng LG, et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol. 2018;37(1):38–44. https://doi.org/10.1038/nbt.4314 .

Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006;313:504–7. https://doi.org/10.1126/science.1127647 .

“Seurat - Guided Clustering Tutorial”. https://satijalab.org/seurat/archive/v1.4/pbmc3k_tutorial.html . cited 2023 Mar 25

Rokach L, Maimon O. Clustering methods. In: Data mining and knowledge discovery handbook. New York: Springer; 2005. p. 321–52. https://doi.org/10.1007/0-387-25465-X_15 .

Amezquita RA, Lun ATL, Becht E, Carey VJ, Carpp LN, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pagès H, Smith ML, Huber W, Morgan M, Gottardo R, Hicks SC. Orchestrating single-cell analysis with Bioconductor. Nature Methods. 2020;17:137–45. https://doi.org/10.1038/s41592-019-0654-x ( http://bioconductor.org/books/3.12/OSCA/clustering.html ).

Lloyd S. Least squares quantization in PCM. IEEE Trans Inf Theory. 1982;28(2):129–37. https://doi.org/10.1109/tit.1982.1056489 .

“Machine learning - Clustering, Density based clustering and SOM”. Github.io. https://jhui.github.io/2017/01/15/Machine-learning-clustering/ . cited 2022 Nov 29

Kaufman L, Rousseeuw PJ. Clustering by means of medoids. In: Dodge Y, editor. Statistical Data Analysis Based on the L1-Norm and Related Methods. North-Holland; 1987. p. 405–16.

Kohonen T. Self-organized formation of topologically correct feature maps. Biol Cybern. 1982;43(1):59–69. https://doi.org/10.1007/bf00337288 .

Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis. Nashville: John Wiley & Sons; 2009.

McLachlan GJ, Peel D. Finite mixture models. New York: Springer; 2000.

Aletta F, Oberman T, Mitchell A, Tong H, Kang J. Assessing the changing urban sound environment during the COVID-19 lockdown period using short-term acoustic measurements. Noise Mapp. 2020;7(1):123–34. https://doi.org/10.1515/noise-2020-0011 .

Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53–65. https://doi.org/10.1016/0377-0427(87)90125-7 .

Ostrouchov G, Gerlovin H, Gagnon D. clustra: clustering trajectories. R-Project.Org. https://cran.r-project.org/web/packages/clustra/vignettes/clustra_vignette.html . cited 2022 Jan 16

Ben-Dor A, Shamir R, Yakhini Z. Clustering gene expression patterns. J Comput Biol. 1999;6(3–4):281–97. https://doi.org/10.1089/106652799318274 .

Kailing K, Kriegel HP, Kröger P. Density-connected subspace clustering for high-dimensional data. Proceedings of the 2004 SIAM International Conference on Data Mining. 2004;246–256. https://doi.org/10.1137/1.9781611972740.23

Hartigan JA. Direct clustering of a data matrix. J Am Stat Assoc. 1972;67(337):123–9. https://doi.org/10.1080/01621459.1972.10481214 .

Sill M, Kaiser S, Benner A, Kopp-Schneider A. Robust biclustering by sparse singular value decomposition incorporating stability selection. Bioinformatics. 2011;27:2089–97. https://doi.org/10.1093/bioinformatics/btr322 .

Witten DM. Classification and clustering of sequencing data using a Poisson model. Ann Appl Stat. 2011;5(4):2493–518. https://doi.org/10.1214/11-AOAS493 .

Si Y, Liu P, Li P, Brutnell TP. Model-based clustering for RNA-seq data. Bioinformatics. 2014;30(2):197–205. https://doi.org/10.1093/bioinformatics/btt632 .

Tian K, Zhou S, Guan J. DeepCluster: A general clustering framework based on deep learning. In: Machine Learning and Knowledge Discovery in Databases. Cham: Springer International Publishing; 2017. p. 809–25.

Bien J, Tibshirani R. Prototype Selection for Interpretable Classification. Ann Appl Stat. 2011;5(4):2403–24. https://doi.org/10.1214/11-AOAS495 .

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25(1):25–9. https://doi.org/10.1038/75556 .

Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28(1):27–30. https://doi.org/10.1093/nar/28.1.27 .

Rouillard AD, Gundersen GW, Fernandez NF, Wang Z, Monteiro CD, McDermott MG, Ma’ayan A. The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins. Database. 2016;2016:baw100. https://doi.org/10.1093/database/baw100 .

Anderson MJ, Legendre P. An empirical comparison of permutation methods for tests of partial regression coefficients in a linear model. J Stat Comput Simul. 1999;62(3):271–303. https://doi.org/10.1080/00949659908811936 .

Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3(1):Article3. https://doi.org/10.2202/1544-6115.1027 .

Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47. https://doi.org/10.1093/nar/gkv007 .

Kammers K, Cole RN, Tiengwe C, Ruczinski I. Detecting significant changes in protein abundance. EuPA Open Proteom. 2015;7:11–9. https://doi.org/10.1016/j.euprot.2015.02.002 .

Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40. https://doi.org/10.1093/bioinformatics/btp616 .

Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. https://doi.org/10.1186/s13059-014-0550-8 .

Goeman JJ, Solari A. Multiple testing for exploratory research. Statist Sci. 2011;26(4):584–97. https://doi.org/10.1214/11-STS356 .

Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika. 1988;75:800–2. https://doi.org/10.1093/biomet/75.4.800 .

Hommel G. A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika. 1988;75:383–6. https://doi.org/10.1093/biomet/75.2.383 .

Holm S. A simple sequentially rejective multiple test procedure. Scand J Statist. 1979;6:65–70 ( https://www.jstor.org/stable/4615733 ).

Westfall PH, Young SS. Resampling-based multiple testing: examples and methods for p-value adjustment. New York: Wiley; 1993.

Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in microarray experiments. Statist Sci. 2003;18(1):71–103. https://doi.org/10.1214/ss/1056397487 .

Goeman JJ, Solari A. Multiple hypothesis testing in genomics. Stat Med. 2014;33(11):1946–78. https://doi.org/10.1002/sim.6082 .

Bretz F, Hothorn T, Westfall P. Multiple comparisons using R. CRC Press. 2016. https://doi.org/10.1201/9781420010909 .

Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B. 1995;57(1):289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x .

Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann Stat. 2001;29(4):1165–88. https://doi.org/10.1214/aos/1013699998 .

Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci. 2003;100(16):9440–5. https://doi.org/10.1073/pnas.1530509100 .

Efron B. Robbins, empirical Bayes, and microarrays. Ann Stat. 2003;31(2):366–78. https://doi.org/10.1214/aos/1051027871 .

Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A. False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics. 2005;21(13):3017–24. https://doi.org/10.1093/bioinformatics/bti448 .

Dobbin K, Simon R. Sample size determination in microarray experiments for class comparison and prognostic classification. Biostatistics. 2005;6(1):27–38. https://doi.org/10.1093/biostatistics/kxh015 .

Li CI, Samuels DC, Zhao YY, Shyr Y, Guo Y. Power and sample size calculations for high-throughput sequencing-based experiments. Brief Bioinform. 2018;19(6):1247–55. https://doi.org/10.1093/bib/bbx061 .

Goeman JJ, Buehlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007;23(8):980–7. https://doi.org/10.1093/bioinformatics/btm051 .

Nam D, Kim SY. Gene-set approach for expression pattern analysis. Brief Bioinform. 2008;9(3):189–97. https://doi.org/10.1093/bib/bbn001 .

Foroutan M, Bhuva DD, Lyu R, Horan K, Cursons J, Davis MJ. Single sample scoring of molecular phenotypes. BMC Bioinformatics. 2018;19:404. https://doi.org/10.1186/s12859-018-2435-4 .

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005;102(43):15545–50. https://doi.org/10.1073/pnas.0506580102 .

Efron B, Tibshirani R. On testing the significance of sets of genes. Ann Appl Stat. 2007;1(1):107–29. https://doi.org/10.1214/07-AOAS101 .

Wieder C, Frainay C, Poupin N, Rodríguez-Mier P, Vinson F, Cooke J, Lai RPJ, Bundy JG, Jourdan F, Ebbels T. Pathway analysis in metabolomics: recommendations for the use of over-representation analysis. PLoS Comput Biol. 2021;17(9):e1009105. https://doi.org/10.1371/journal.pcbi.1009105 .

Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics. 2004;20(1):93–9. https://doi.org/10.1093/bioinformatics/btg382 .

Alexa A, Rahnenführer J, Lengauer T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics. 2006;22(13):1600–7. https://doi.org/10.1093/bioinformatics/btl140 .

Alexa A, Rahnenführer J. Gene set enrichment analysis with topGO. Bioconductor.org. 2023. https://bioconductor.org/packages/release/bioc/vignettes/topGO/inst/doc/topGO.pdf . cited 2023 Mar 31

Committee on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials, Board on Health Care Services, Institute of Medicine, Board on Health Sciences Policy. In: Micheel CM, Nass SJ, Omenn GS, editors. Evolution of translational omics: lessons learned and the path forward. Washington: National Academies Press; 2012. https://doi.org/10.17226/13297 .

Moons KG, Altman DG, Reitsma JB, Ioannidis JPA, Macaskill P, Steyerberg EW, Vickers AJ, Ransohoff DF, Collins GS. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): Explanation and Elaboration. Ann Intern Med. 2015;162:W1–73. https://doi.org/10.7326/M14-0698 .

Herold T, Jurinovic V, Metzeler KH, Boulesteix AL, Bergmann M, Seiler T, Mulaw M, Thoene S, Dufour A, Pasalic Z, Schmidberger M, Schmidt M, Schneider S, Kakadia PM, Feuring-Buske M, Braess J, Spiekermann K, Mansmann U, Hiddemann W, Buske C, Bohlander SK. An eight-gene expression signature for the prediction of survival and time to treatment in chronic lymphocytic leukemia. Leukemia. 2011;25:1639–45. https://doi.org/10.1038/leu.2011.125 .

Azzalini A, Scarpa B. Data analysis and data mining: an introduction. Cary: Oxford University Press; 2012.

Hastie T, Tibshirani R, Friedman JH. The elements of statistical learning: data mining, inference, and prediction. 2nd ed. New York: Springer; 2009. https://doi.org/10.1007/b94608 .

Boulesteix AL, Wright MN, Hoffmann S, König IR. Statistical learning approaches in the genetic epidemiology of complex diseases. Hum Genet. 2020;139(1):73–84. https://doi.org/10.1007/s00439-019-01996-9 .

Bland JM, Altman DG. Statistics notes: Transforming data. BMJ. 1996;312(7033):770. https://doi.org/10.1136/bmj.312.7033.770 .

Bland JM, Altman DG. Transformations, means, and confidence intervals. BMJ. 1996;312(7038):1079. https://doi.org/10.1136/bmj.312.7038.1079 .

Heinze G, Wallisch C, Dunkler D. Variable selection - a review and recommendations for the practicing statistician. Biom J. 2018;60(3):431–49. https://doi.org/10.1002/bimj.201700067 .

Singh AAGD, Balamurugan AAS, Leavline JEE. Literature review on feature selection methods for high-dimensional data. Int J Comput Appl. 2016;136(1):9–17. https://doi.org/10.5120/IJCA2016908317 .

Bommert AM, Sun X, Bischl B, Rahnenführer J, Lang M. Benchmark for filter methods for feature selection in high-dimensional classification data. Comput Stat Data Analysis. 2020;143:106839. https://doi.org/10.1016/j.csda.2019.106839 .

Sachs MC, McShane LM. Issues in developing multivariable molecular signatures for guiding clinical care decisions. J Biopharm Stat. 2016;26(6):1098–110. https://doi.org/10.1080/10543406.2016.1226329 .

Sauerbrei W, Perperoglou A, Schmid M, Abrahamowicz M, Becher H, Binder H, Dunkler D, Harrell FE Jr, Royston P, Heinze G, for TG2 of the STRATOS initiative. State of the art in selection of variables and functional forms in multivariable analysis - outstanding issues. Diagn Progn Res. 2020;4:3,1-18. https://doi.org/10.1186/s41512-020-00074-3 .

Van der Maaten L, Postma E, Van den Herik J. Dimensionality reduction: a comparative review. J Mach Learn Res. 2009;10:1–41.

Lee LC, Liong CY, Jemain AA. Partial least squares-discriminant analysis (PLS-DA) for classification of high-dimensional (HD) data: a review of contemporary practice strategies and knowledge gaps. Analyst. 2018;143(15):3526–39. https://doi.org/10.1039/C8AN00599K .

Hoerl AE, Kennard RW. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 1970;12(1):55–67. https://doi.org/10.1080/00401706.1970.10488634 .

Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Stat Soc Ser B (Methodol). 1996;58(1):267–88. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x .

Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B (Stat Methodol). 2005;67:301–20. https://doi.org/10.1111/j.1467-9868.2005.00503.x .

Friedman J, Hastie T, Tibshirani R. Additive logistic regression: a statistical view of boosting. Ann Stat. 2000;28:337–74. https://doi.org/10.1214/aos/1016218223 .

Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006;101(476):1418–29. https://doi.org/10.1198/016214506000000735 .

Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B (Stat Methodol). 2007;68(1):49–67. https://doi.org/10.1111/j.1467-9868.2005.00532.x .

Freund Y, Schapire R. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput System Sci. 1997;55:119–39. https://doi.org/10.1006/jcss.1997.1504 .

Mayr A, Hofner B, Schmid M. The importance of knowing when to stop. Methods Inf Med. 2012;51:178–86. https://doi.org/10.3414/ME11-02-0030 .

Mease D, Wyner A. Evidence contrary to the statistical view of boosting. J Mach Learn Res. 2008;9:131–56.

Singh A, Thakur N, Sharma A. A review of supervised machine learning algorithms. In: 3rd International Conference on Computing for Sustainable Global Development (INDIACom). New Delhi; 2016. p. 1310–5. https://ieeexplore.ieee.org/abstract/document/7724478 .

Malhotra R. A systematic review of machine learning techniques for software fault prediction. Appl Soft Comput. 2015;27:504–18. https://doi.org/10.1016/j.asoc.2014.11.023 .

Vapnik V. The nature of statistical learning theory. New York: Springer; 2014.

Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. Philadelphia: Chapman & Hall/CRC; 1984.

Schumacher M, Holländer N, Schwarzer G, Binder H, Sauerbrei W. Prognostic Factor Studies. In: Crowley J, Hoering A, editors. Handbook of Statistics in Clinical Oncology. 3rd ed. Chapman and Hall/CRC; 2012. p. 415–70.

Breiman L. Bagging Predictors. Mach Learn. 1996;24:123–40. https://doi.org/10.1023/A:1018054314350 .

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. https://doi.org/10.1023/A:1010933404324 .

Goldstein BA, Polley EC, Briggs FBS. Random Forests for Genetic Association Studies. Stat Appl Genet Mol Biol. 2011;10(1):32. https://doi.org/10.2202/1544-6115.1691 .

Fawagreh K, Gaber MM, Elyan E. Random forests: from early developments to recent advancements. Syst Sci Control Eng Open Access J. 2014;2(1):602–9. https://doi.org/10.1080/21642583.2014.956265 .

LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44. https://doi.org/10.1038/nature14539 .

Zhao ZQ, Zheng P, Xu ST, Wu X. Object detection with deep learning: a review. IEEE Trans Neural Netw Learn Syst. 2019;30(11):3212–32. https://doi.org/10.1109/TNNLS.2018.2876865 .

Miotto R, Wang F, Wang S, Jiang X, Dudley JT. Deep learning for healthcare: review, opportunities and challenges. Brief Bioinform. 2018;19(6):1236–46. https://doi.org/10.1093/bib/bbx044 .

Richardson S, Tseng GC, Sun W. Statistical Methods in Integrative Genomics. Annu Rev Stat Appl. 2016;3:181–209. https://doi.org/10.1146/annurev-statistics-041715-033506 .

Perou CM, Parker JS, Prat A, Ellis MJ, Bernard PS. Clinical implementation of the intrinsic subtypes of breast cancer. Lancet Oncol. 2010;11(8):718–9. https://doi.org/10.1016/S1470-2045(10)70176-5 .

Huo Z, Ding Y, Liu S, Oesterreich S, Tseng G. Meta-analytic framework for sparse K-means to identify disease subtypes in multiple transcriptomic studies. J Am Stat Assoc. 2016;111(513):27–42. https://doi.org/10.1080/01621459.2015.1086354 .

Shen R, Olshen AB, Ladanyi M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics. 2009;25:2906–12. https://doi.org/10.1093/bioinformatics/btp543 .

Mo Q, Wang S, Seshan VE, Olshen AB, Schultz N, Sander C, Powers RS, Ladanyi M, Shen R. Pattern discovery and cancer gene identification in integrated cancer genomic data. PNAS. 2013;110:4245–50. https://doi.org/10.1073/pnas.1208949110 .

Savage RS, Ghahramani Z, Griffin JE, Bernard J, Wild DL. Discovering transcriptional modules by Bayesian data integration. Bioinformatics. 2010;26:i158–67. https://doi.org/10.1093/bioinformatics/btq210 .

Yuan Y, Savage RS, Markowetz F. Patient-specific data fusion defines prognostic cancer subtypes. PLoS Comput Biol. 2011;7:e1002227. https://doi.org/10.1371/journal.pcbi.1002227 .

Kirk P, Griffin JE, Savage RS, Ghahramani Z, Wild DL. Bayesian correlated clustering to integrate multiple datasets. Bioinformatics. 2012;28:3290–7. https://doi.org/10.1093/bioinformatics/bts595 .

Lock E, Dunson D. Bayesian consensus clustering. Bioinformatics. 2013;29:2610–6. https://doi.org/10.1093/bioinformatics/btt425 .

Benkirane H, Pradat Y, Michiels S, Cournède PH. CustOmics: a versatile deep-learning based strategy for multi-omics integration. PLoS Comput Biol. 2023;19(3):e1010921. https://doi.org/10.1371/journal.pcbi.1010921 .

Binder H, Schumacher M. Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinformatics. 2008;9:14. https://doi.org/10.1186/1471-2105-9-14 .

Bøvelstad HM, Nygård S, Borgan Ø. Survival prediction from clinico-genomic models – a comparative study. BMC Bioinformatics. 2009;10:413. https://doi.org/10.1186/1471-2105-10-413 .

Boulesteix AL, Sauerbrei W. Added predictive value of high-throughput molecular data to clinical data and its validation. Brief Bioinform. 2011;12(3):215–29. https://doi.org/10.1093/bib/bbq085 .

De Bin R, Sauerbrei W, Boulesteix AL. Investigating the prediction ability of survival models based on both clinical and omics data: two case studies. Stat Med. 2014;33(30):5310–29. https://doi.org/10.1002/sim.6246 .

Volkmann A, De Bin R, Sauerbrei W, Boulesteix AL. A plea for taking all available clinical information into account when assessing the predictive value of omics data. BMC Med Res Methodol. 2019;19:162. https://doi.org/10.1186/s12874-019-0802-0 .

Van Karnebeek CDM, Wortmann SB, Tarailo-Graovac M, Langeveld M, Ferreira CR, van de Kamp JM, Hollak CE, Wasserman WW, Waterham HR, Wevers RA, Haack TB, Wanders RJA, Boycott KM. The role of the clinician in the multi-omics era: are you ready? J Inherit Metab Dis. 2018;41(3):571–82. https://doi.org/10.1007/s10545-017-0128-1 .

De Bin R, Boulesteix AL, Benner A, Becker N, Sauerbrei W. Combining clinical and molecular data in regression prediction models: insights from a simulation study. Brief Bioinform. 2020;21(6):1904–19. https://doi.org/10.1093/bib/bbz136 .

Schumacher M, Binder H, Gerds T. Assessment of survival prediction models based on microarray data. Bioinformatics. 2007;23:1768–74. https://doi.org/10.1093/bioinformatics/btm232 .

Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21:128–38. https://doi.org/10.1097/EDE.0b013e3181c30fb2 .

Gail MH, Pfeiffer RM. On criteria for evaluating models of absolute risk. Biostatistics. 2005;6(2):227–39. https://doi.org/10.1093/biostatistics/kxi005 .

Gerds TA, Cai T, Schumacher M. The performance of risk prediction models. Biom J. 2008;50:457–79. https://doi.org/10.1002/bimj.200810443 .

Cook NR. Quantifying the added value of new biomarkers: how and how not. Diagn Progn Res. 2018;2(1):14. https://doi.org/10.1186/s41512-018-0037-2 .

McShane LM, Altman DG, Sauerbrei W. Identification of clinically useful cancer prognostic factors: what are we missing? (Editorial). J Natl Cancer Inst. 2005;97:1023–5. https://doi.org/10.1093/jnci/dji193 .

Bommert AM, Rahnenführer J. Adjusted measures for feature selection stability for data sets with similar features. In: Machine Learning, Optimization, and Data Science. 2021. p. 203–14. https://doi.org/10.1007/978-3-030-64583-0_19 .

Sauerbrei W, Boulesteix AL, Binder H. Stability investigations of multivariable regression models derived from low-and high-dimensional data. J Biopharm Stat. 2011;21(6):1206–31. https://doi.org/10.1080/10543406.2011.629890 .

Liu Y, Sun Y, Broaddus R, Liu J, Sood AK, Shmulevich I, Zhang W. Integrated analysis of gene expression and tumor nuclear image profiles associated with chemotherapy response in serous ovarian carcinoma. PLoS One. 2012;7(5):e36383. https://doi.org/10.1371/journal.pone.0036383 .

Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950;78(1):1–3. https://doi.org/10.1175/1520-0493(1950)078%3c0001:VOFEIT%3e2.0.CO;2 .

Van Calster B, McLernon DJ, Van Smeden M, Wynants L, Steyerberg EW. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17(1):1–7. https://doi.org/10.1186/s12916-019-1466-7 .

Dziak JJ, Coffman DL, Lanza ST, Li R, Jermiin LS. Sensitivity and specificity of information criteria. Brief Bioinform. 2020;21(2):553–65. https://doi.org/10.1093/bib/bbz016 .

Steyerberg EW, Harrell FE Jr, Borsboom GJ, Eijkemans MJC, Vergouwe Y, Habbema JDF. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol. 2001;54:774–81. https://doi.org/10.1016/S0895-4356(01)00341-9 .

Simon RM, Subramanian J, Li MC, Menezes S. Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data. Brief Bioinform. 2011;12:203–14. https://doi.org/10.1093/bib/bbr001 .

Iba K, Shinozaki T, Maruo K, Noma H. Re-evaluation of the comparative effectiveness of bootstrap-based optimism correction methods in the development of multivariable clinical prediction models. BMC Med Res Methodol. 2021;21:9. https://doi.org/10.1186/s12874-020-01201-w .

Steyerberg EW, Harrell FE. Prediction models need appropriate internal, internal-external, and external validation. J Clin Epidemiol. 2016;69:245–7. https://doi.org/10.1016/j.jclinepi.2015.04.005 .

Arlot S, Celisse A. A survey of cross-validation procedures for model selection. Stat Surv. 2010;4:40–79. https://doi.org/10.1214/09-SS054 .

Efron B. Bootstrap Methods: Another Look at the Jackknife. Ann Statist. 1979;7(1):1–26. https://doi.org/10.1214/aos/1176344552 .

Efron B. Bootstrap methods: another look at the jackknife. In: Breakthroughs in statistics. New York: Springer; 1992. p. 569–93. https://doi.org/10.1007/978-1-4612-4380-9_41 .

Efron B, Tibshirani R. Improvements on cross-validation: the 632+ bootstrap method. J Am Stat Assoc. 1997;92(438):548–60. https://doi.org/10.1080/01621459.1997.10474007 .

Chernick MR. Bootstrap Methods. A Guide for Practitioners and Researchers. Hoboken: Wiley; 2008.

Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic information. Ann Intern Med. 1999;130(6):515–24. https://doi.org/10.7326/0003-4819-130-6-199903160-00016 .

Altman DG, Royston P. What do we mean by validating a prognostic model? Stat Med. 2000;19(4):453–73. https://doi.org/10.1002/(SICI)1097-0258(20000229)19:4%3c453::AID-SIM350%3e3.0.CO;2-5 .

Royston P, Altman DG. External validation of a Cox prognostic model: principles and methods. BMC Med Res Methodol. 2013;13(1):33. https://doi.org/10.1186/1471-2288-13-33 .

Boulesteix AL, Strobl C. Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction. BMC Med Res Methodol. 2009;9:85. https://doi.org/10.1186/1471-2288-9-85 .

Ruschhaupt M, Huber W, Poustka A, Mansmann U. A compendium to ensure computational reproducibility in high-dimensional classification tasks. Stat Appl Genet Mol Biol. 2004;3:37. https://doi.org/10.2202/1544-6115.1078 .

McShane LM, Polley M-YC. Development of omics-based clinical tests for prognosis and therapy selection: the challenge of achieving statistical robustness and clinical utility. Clin Trials. 2013;10(5):653–65. https://doi.org/10.1177/1740774513499458 .

Cook RD. Influential observations in linear regression. J Am Stat Assoc. 1979;74:169–74. https://doi.org/10.1080/01621459.1979.10481634 .

De Bin R, Boulesteix AL, Sauerbrei W. Detection of influential points as a byproduct of resampling-based variable selection procedures. Comput Stat Data Anal. 2017;116:19–31. https://doi.org/10.1016/j.csda.2017.07.001 .

Royston P, Sauerbrei W. Improving the robustness of fractional polynomial models by preliminary covariate transformation: a pragmatic approach. Comput Stat Data Anal. 2007;51:4240–53. https://doi.org/10.1016/j.csda.2006.05.006 .

Peña D. A new statistic for influence in linear regression. Technometrics. 2005;47:1–12. https://doi.org/10.1198/004017004000000662 .

Hawkins DM. Identification of Outliers. Chapman and Hall; 1980.

Belsley DA, Kuh E, Welsch RE. Regression diagnostics: identifying influential data and sources of collinearity. John Wiley & Sons; 1980.

Zhao J, Leng C, Li L, Wang H. High-dimensional influence measure. Ann Stat. 2013;41:2639–67. https://doi.org/10.1214/13-AOS1165 .

Wang T, Li Z. Outlier detection in high-dimensional regression model. Commun Stat Theory Methods. 2017;46:6947–58. https://doi.org/10.1080/03610926.2016.1140783 .

Walker E, Birch JB. Influence measures in ridge regression. Technometrics. 1988;30:221–7. https://doi.org/10.1080/00401706.1988.10488370 .

Rajaratnam B, Roberts S, Sparks D, Yu H. Influence diagnostics for high-dimensional lasso regression. J Comput Graph Stat. 2019;28(4):877–90. https://doi.org/10.1080/10618600.2019.1598869 .

Shi L, Wang X. Local influence in ridge regression. Comput Stat Data Anal. 1999;31:341–53. https://doi.org/10.1016/S0167-9473(99)00019-5 .

Hellton KH, Lingjarde C, De Bin R. Influence of single observations on the choice of the penalty parameter in ridge regression. arXiv preprint. 2019. arXiv:1911.03662.

Boulesteix AL, Guillemot V, Sauerbrei W. Use of pretransformation to cope with extreme values in important candidate features. Biom J. 2011;53(4):673–88. https://doi.org/10.1002/bimj.201000189 .

Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A. Feature selection for high-dimensional data. Cham: Springer International Publishing; 2015. https://doi.org/10.1007/978-3-319-21858-8 .

Segaert P, Lopes MB, Casimiro S, Vinga S, Rousseeuw PJ. Robust identification of target genes and outliers in triple-negative breast cancer data. Stat Methods Med Res. 2019;28(10–11):3042–56. https://doi.org/10.1177/0962280218794722 .

Carrasquinha E, Veríssimo A, Lopes MB, Vinga S. Identification of influential observations in high-dimensional cancer survival data through the rank product test. BioData Mining. 2018;11(1):1. https://doi.org/10.1186/s13040-018-0162-z .

Fan C. HighDimOut: Outlier Detection Algorithms for High-Dimensional Data. R package version 1.0.0. 2015. https://CRAN.R-project.org/package=HighDimOut .

Riley RD, Snell KIE, Martin GP, Whittle R, Archer L, Sperrin M, Collins GS. Penalization and shrinkage methods produced unreliable clinical prediction models especially when sample size was small. J Clin Epidemiol. 2021;132:88–96. https://doi.org/10.1016/j.jclinepi.2020.12.005 .

Simera I, Moher D, Hirst A, Hoey J, Schulz KF, Altman DG. Transparent and accurate reporting increases reliability, utility, and impact of your research: reporting guidelines and the EQUATOR Network. BMC Med. 2010;8:24. https://doi.org/10.1186/1741-7015-8-24 .

Altman DG, McShane L, Sauerbrei W, Taube SE. Reporting recommendations for tumor marker prognostic studies (REMARK): explanation and elaboration. PLoS Med. 2012;9(5):e1001216. https://doi.org/10.1371/journal.pmed.1001216 .

Altman DG. The time has come to register diagnostic and prognostic research. Clin Chem. 2014;60:580–2. https://doi.org/10.1373/clinchem.2013.220335 .

Peat G, Riley RD, Croft P, Morley KI, Kyzas PA, Moons KG, Perel P, Steyerberg EW, Schroter S, Altman DG, Hemingway H, for the PROGRESS Group. Improving the Transparency of Prognosis Research: The Role of Reporting, Data Sharing, Registration, and Protocols. PLoS Medicine. 2014;11(7):e1001671. https://doi.org/10.1371/journal.pmed.1001671 .

Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12–22. https://doi.org/10.1016/j.jclinepi.2019.02.004 .

Chen JH, Asch SM. Machine learning and prediction in medicine—beyond the peak of inflated expectations. N Engl J Med. 2017;376(26):2507–9. https://doi.org/10.1056/NEJMp1702071 .

Collins GS, Moons KGM. Reporting of artificial intelligence prediction models. Lancet. 2019;393:1577–9. https://doi.org/10.1016/S0140-6736(19)30037-6 .

Gail MH, Altman DG, Cadarette SM, Collins G, Evans SJ, Sekula P, Williamson E, Woodward M. Design choices for observational studies of the effect of exposure on disease incidence. BMJ Open. 2019;9:e031031. https://doi.org/10.1136/bmjopen-2019-031031 .

Huebner M, Vach W, le Cessie S, Schmidt CO, Lusa L, on behalf of the Topic Group “Initial Data Analysis” of the STRATOS Initiative. Hidden analyses: a review of reporting practice and recommendations for more transparent reporting of initial data analyses. BMC Med Res Methodol. 2020;20(1):1–10. https://doi.org/10.1186/s12874-020-00942-y .

Shaw PA, Deffner V, Keogh R, Tooze JA, Dodd KW, Küchenhoff H, Kipnis V, Freedman LS, on behalf of Measurement Error and Misclassification Topic Group (TG4) of the STRATOS Initiative. Epidemiologic analyses with error-prone exposures: review of current practice and recommendations. Ann Epidemiol. 2018;28(11):821–8. https://doi.org/10.1016/j.annepidem.2018.09.001 .

Andersen PK, Perme MP, van Houwelingen HC, Cook RJ, Joly P, Martinussen T, Taylor JMG, Therneau TM. Analysis of time-to-event for observational studies: Guidance to the use of intensity models. Stat Med. 2021;40(1):185–211. https://doi.org/10.1002/sim.8757 .

Wynants L, van Smeden M, McLernon DJ, Timmerman D, Steyerberg EW, Van Calster B, on behalf of the Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative. Three myths about risk thresholds for prediction models. BMC Med. 2019;17(192):1–7. https://doi.org/10.1186/s12916-019-1425-3 .

Goetghebeur E, le Cessie S, De Stavola B, Moodie EE, Waernbaum I, on behalf of the topic group Causal Inference (TG7) of the STRATOS initiative. Formulating causal questions and principled statistical answers. Stat Med. 2020;39(30):4922–48. https://doi.org/10.1002/sim.8741 .


Acknowledgements

We thank Milena Schwotzer for administrative assistance.

The views expressed in the paper do not necessarily represent views or policies of the National Cancer Institute, National Institutes of Health, or the U.S. Department of Health & Human Services.

Open Access funding provided by the National Institutes of Health (NIH). WS was partially supported by grant SA580/10–1 from the German Research Foundation (DFG). FA was partially supported by the Italian Ministry of Education, University and Research project PRIN 2017, prot. 20178S4EK9_004. RDB was partially supported by the Norwegian Research Council research-based innovation center BigInsight, project no 237718. ALB’s group was partially supported by individual grants from the German Research Foundation (DFG, BO3139) and by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A.

Author information

Authors and affiliations.

Department of Statistics, TU Dortmund University, Dortmund, Germany

Jörg Rahnenführer

Department of Mathematics, University of Oslo, Oslo, Norway

Riccardo De Bin

Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany

Axel Benner

Department of Clinical Sciences and Community Health, University of Milan, Milan, Italy

Federico Ambrogi

Scientific Directorate, IRCCS Policlinico San Donato, San Donato Milanese, Italy

Department of Mathematics, Faculty of Mathematics, Natural Sciences and Information Technology, University of Primorska, Koper, Slovenia

Institute of Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia

Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig Maximilian University of Munich, Munich, Germany

Anne-Laure Boulesteix

Nestle Research, EPFL Innovation Park, Lausanne, Switzerland

Eugenia Migliavacca

Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany

Harald Binder & Willi Sauerbrei

Service de Biostatistique et d’Épidémiologie, Gustave Roussy, Université Paris-Saclay, Villejuif, France

Stefan Michiels

Oncostat U1018, Inserm, Université Paris-Saclay, Labeled Ligue Contre le Cancer, Villejuif, France

Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, Bethesda, MD, USA

Lisa McShane


Contributions

JR, RDB, AB, FA, LL, ALB, EM, HB, SM, WS, and LS contributed to the conceptualization of the research, discussed the structure of the manuscript, conducted literature research, drafted the manuscript, and wrote text sections of the final manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Lisa McShane .

Ethics declarations

Ethics approval and consent to participate, consent for publication.

All authors have given their consent for the publication of this manuscript.

Competing interests

The authors declare that they have no competing interests. E.M. is an employee of Société des Produits Nestlé SA.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Rahnenführer, J., De Bin, R., Benner, A. et al. Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges. BMC Med 21 , 182 (2023). https://doi.org/10.1186/s12916-023-02858-y


Received : 28 December 2022

Accepted : 03 April 2023

Published : 15 May 2023

DOI : https://doi.org/10.1186/s12916-023-02858-y

STRATOS initiative
Analytical goals
Multiple testing

BMC Medicine

ISSN: 1741-7015

Statistical Methods for Biomedical Research

By (author):
Jiqian Fang ( Sun Yat-sen University, China )

This book consists of four parts with 32 chapters adapted for four short courses, from the basic to the advanced levels of medical statistics (biostatistics), ideal for biomedical students. Part 1 is a compulsory course on Basic Statistics covering descriptive statistics, parameter estimation and hypothesis testing, and simple correlation and regression. Part 2 is a selective course on Study Design and Implementation covering sampling surveys, interventional studies, observational studies, diagnostic studies, data sorting and article writing. Part 3 is a specially curated course on Multivariate Analyses covering complex analyses of variance, a variety of regressions and classical multivariate analyses. Part 4 is a seminar course on Introduction to Advanced Statistical Methods covering meta-analysis, time series, item response theory, structural equation models, multi-level models, bioinformatics, genetic statistics and data mining.

The main body of each chapter is followed by five practical sections: Report Writing, Case Discrimination, Computer Experiments, Frequently Asked Questions and Summary, and Practice & Think. In addition, there are two appendices. Appendix A includes introductions to SPSS, Excel and R. Appendix B includes all the programs, data and printouts for Computer Experiments, together with the Tests for Review and the reference answers for Case Discrimination and Practice & Think.

This book can be used as a textbook for biomedical students at both under- and postgraduate levels. It can also serve as an important guide for researchers, professionals and officers in the biomedical field.

Supplementary material

Appendix B Online (18.8 MB)

Appendix B includes all the programs, data, and printouts for Computer Experiments and the reference answers for Case Discrimination as well as Practice & Think.

Appendix C Online (146 KB)

Appendix C includes the Statistical Tables.

Sample Chapter(s) Chapter 0: INTRODUCTION


Introduction
Statistical Description
Probability Distribution
Parameter Estimation
Hypothesis Testing
Comparison of Means Between Two Groups
Comparison of Means Among Multiple Groups
Comparison of Distributions of Categorical Variables
Analysis for Association
Simple Linear Regression
Sampling Survey
Interventional Study
Clinical Trial
Observational Comparative Effectiveness Research
Diagnostic Test
Sample Size Estimation
Missing Data
Statistical Reporting Guidelines for Medical Papers
Analysis of Variance for Complicated Designs
Multiple Linear Regression
Logistic Regression
Survival Analysis
Discriminant Analysis and Classification Tree
Cluster Analysis
Principal Component Analysis and Factor Analysis
Meta Analysis
Time Series
Structural Equation Model
Multi-Level Model
Item Response Theory
Statistics for Genetics
Statistics for Bioinformatics
Data Mining
Introduction for Statistical Software
Supplement for the Text
Statistical Tables

FRONT MATTER

Pages: i–xii

https://doi.org/10.1142/9789811228872_fmatter

CHAPTER 0: INTRODUCTION

Pages: 1–22

https://doi.org/10.1142/9789811228872_0001

What is statistics? As defined in Webster’s Third New International Dictionary (1961), statistics is “A science dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.” According to the Dictionary of Epidemiology , 4th ed. (John M. Last, 2000), statistics is “The science and art of collecting, summarizing, and analyzing data that are subject to random variation”…

Part I: Basic Concepts and Methods

CHAPTER 1: STATISTICAL DESCRIPTION

Pages: 25–51

https://doi.org/10.1142/9789811228872_0002

Statistical analysis includes statistical descriptions and statistical inferences. For statistical description of the collected data, statistical charts, tables or statistical measures can be chosen.

CHAPTER 2: PROBABILITY DISTRIBUTION

Pages: 53–86

https://doi.org/10.1142/9789811228872_0003

In Chapter 1, univariate description was introduced using the frequency distribution table and histogram. However, these describe only the sample distribution. Because of variation among individuals, even two samples taken from the same population will have different frequency distributions. A very important further objective is to describe the population distribution. In this chapter, we introduce several population distributions commonly used in biomedical research, including the normal distribution for continuous random variables and the binomial and Poisson distributions for discrete random variables.

CHAPTER 3: PARAMETER ESTIMATION

Pages: 87–117

https://doi.org/10.1142/9789811228872_0004

The study population of interest is usually very large and may even be hypothetical (e.g., all patients who will be treated). Studying the entire population is time-consuming, laborious, and often impossible. Therefore, we usually collect sample data that represent the population in order to estimate population parameters. Chapter 1 showed how to use sample statistics, such as the sample mean and sample frequency, to estimate population parameters such as the population mean and population probability. These are point estimates. A point estimate does not make full use of the sample information and does not reflect the size of the sampling error; an interval estimate overcomes these shortcomings. This chapter first introduces the concepts of sampling distribution and sampling error, then introduces several commonly used sampling distributions, and finally introduces interval estimates of the population mean, population probability, and population standard deviation.
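As an illustration of the difference between a point estimate and an interval estimate, the following is a minimal sketch using only the Python standard library. It is not taken from the book: the function names `mean_ci` and `proportion_ci` are hypothetical, and both intervals rely on the large-sample normal approximation (for small samples a t quantile would normally replace the normal quantile).

```python
import math
from statistics import NormalDist, mean, stdev

def mean_ci(sample, level=0.95):
    """Large-sample (normal-approximation) confidence interval for a mean."""
    n = len(sample)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    se = stdev(sample) / math.sqrt(n)          # estimated standard error of the mean
    m = mean(sample)                           # point estimate
    return m - z * se, m + z * se

def proportion_ci(successes, n, level=0.95):
    """Wald (normal-approximation) interval for a population probability."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n                          # point estimate
    se = math.sqrt(p * (1 - p) / n)            # estimated standard error
    return p - z * se, p + z * se

if __name__ == "__main__":
    print(mean_ci([4.8, 5.1, 5.0, 4.9, 5.2]))
    print(proportion_ci(30, 100))
```

The width of each interval shrinks as the sample size grows, which is how the interval reflects the size of the sampling error.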

CHAPTER 4: HYPOTHESIS TESTING

Pages: 119–138

https://doi.org/10.1142/9789811228872_0005

The confidence interval estimation introduced in Chapter 3 estimates the range of population parameters (such as μ or π) based on the sampling distribution of sample statistics (such as X̄ or P). Hypothesis testing approaches statistical inference from a different perspective than interval estimation. This chapter discusses hypothesis testing through Example 4.1…

CHAPTER 5: COMPARISON OF MEANS BETWEEN TWO GROUPS

Pages: 139–168

https://doi.org/10.1142/9789811228872_0006

This chapter covers hypothesis testing, the second of two general areas of statistical inference; the other, parameter estimation, was discussed in Chapter 3. Hypothesis testing involving the difference between two population means is frequently employed in medical research. For example, to examine whether the average weight of primary school students in Beijing differs from that in Guangdong, a convenient approach is to draw a random sample from each region and then perform a hypothesis test based on the sample means. In terms of design, the two samples may be independent or paired. In terms of method, tests may be parametric or non-parametric. This chapter introduces statistical inference for quantitative variables between two groups.

CHAPTER 6: COMPARISON OF MEANS AMONG MULTIPLE GROUPS

Pages: 169–198

https://doi.org/10.1142/9789811228872_0007

In practice, we often need to compare quantitative data among more than two groups, for which the t-test and the Wilcoxon rank-sum test described in Chapter 5 cannot be used. Instead, we should use the one-way analysis of variance (ANOVA) and the Kruskal–Wallis test introduced in this chapter. One-way ANOVA is a parametric test, and the Kruskal–Wallis test is a non-parametric test that can be regarded as an extension of the Wilcoxon rank-sum test.
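To make the idea of one-way ANOVA concrete, here is a minimal sketch that computes the F statistic as the ratio of the between-group to the within-group mean square. The function name `one_way_anova_f` and the data are illustrative assumptions, not material from the book, and the sketch stops at the statistic because the standard library has no F distribution for the p-value.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic for one-way ANOVA: between-group MS / within-group MS."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

if __name__ == "__main__":
    # Hypothetical measurements from three groups
    print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

A large F indicates that the group means differ by more than the within-group variation would explain; the statistic is compared with an F distribution on (k − 1, n − k) degrees of freedom.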

CHAPTER 7: COMPARISON OF DISTRIBUTIONS OF CATEGORICAL VARIABLES

Pages: 199–226

https://doi.org/10.1142/9789811228872_0008

Qualitative variables refer to unordered categorical variables and ordered categorical variables. For example, blood type (A, B, O, AB), and gender (male, female) are unordered categorical variables, and efficacy (cure, improvement, inefficacy, death) and drug dose (high dose, medium dose, low dose) are ordered categorical variables. This chapter introduces the comparison methods commonly used for data of qualitative variables.
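One widely used method for comparing the distributions of unordered categorical variables is the Pearson chi-square test on a contingency table. The sketch below computes only the test statistic; the function name `chi_square_statistic` and the counts are hypothetical, and whether the chapter presents the test in exactly this form is an assumption.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column classifications were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

if __name__ == "__main__":
    # Hypothetical 2 x 2 table: rows are groups, columns are outcome categories
    print(chi_square_statistic([[10, 20], [20, 10]]))
```

The statistic is referred to a chi-square distribution with (r − 1)(c − 1) degrees of freedom.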

CHAPTER 8: ANALYSIS OF ASSOCIATION

Pages: 227–255

https://doi.org/10.1142/9789811228872_0009

In medical research, it is common to analyze the relationship between two random variables; for example, whether linear relationships exist between height and weight, body temperature and pulse, blood pressure and age, and the strength of these relationships. In this chapter, we discuss the linear relationships between two quantitative variables and two qualitative variables. Generally, a linear relationship between two continuous random variables is called a linear correlation or simple correlation, and a relationship between two categorical variables is called an association.
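The strength of a linear correlation between two quantitative variables is usually summarized by the Pearson correlation coefficient. The following sketch computes it from its definition; the function name `pearson_r` and the height and weight values are made up for illustration.

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-variation
    sxx = sum((a - mx) ** 2 for a in x)                   # variation of x
    syy = sum((b - my) ** 2 for b in y)                   # variation of y
    return sxy / math.sqrt(sxx * syy)

if __name__ == "__main__":
    heights = [150, 160, 170, 180]   # hypothetical heights in cm
    weights = [50, 58, 69, 80]       # hypothetical weights in kg
    print(pearson_r(heights, weights))
```

Values near 1 or −1 indicate a strong linear relationship, and values near 0 indicate little or no linear relationship.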

CHAPTER 9: SIMPLE LINEAR REGRESSION

Pages: 257–281

https://doi.org/10.1142/9789811228872_0010

The occurrence, development and changes in biomedical phenomena result from mutual restriction and the influence of many factors under certain conditions. For example, there are many influencing factors for high blood pressure, such as age, gender, family history, diet, smoking, alcoholism, and psychology. In adolescence, a person’s height in general increases with age, but is also influenced by elements such as parental height and nutrition. Determining the main influencing factors and the size of their effects are the questions with which we are concerned. In this chapter, the basic idea of regression analysis is introduced through a simple linear regression model with only one influencing factor. Multiple regression analysis is introduced in Chapter 19.
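A simple linear regression with one influencing factor can be fitted by ordinary least squares. The sketch below is one minimal way to do this with the standard library; the function name `fit_line` and the age and height data are hypothetical.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for the model y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx   # the fitted line passes through (mean x, mean y)
    return intercept, slope

if __name__ == "__main__":
    ages = [10, 11, 12, 13, 14]           # hypothetical ages in years
    heights = [138, 143, 149, 156, 163]   # hypothetical heights in cm
    intercept, slope = fit_line(ages, heights)
    print(f"height = {intercept:.1f} + {slope:.2f} * age")
```

The slope estimates the average change in the outcome per unit change in the influencing factor.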

Part II: Design and Implementation of Bio-medical Research

CHAPTER 10: SAMPLING SURVEY

Pages: 285–323

https://doi.org/10.1142/9789811228872_0011

With changes in the medical model and the disease spectrum, exploring the occurrence and progression of life and other health events in populations has become an important issue in public health. However, limited resources make it impossible to observe every individual in a study. Instead, we usually draw a random sample of individuals from the population and infer the characteristics of the population from the sample information. Such research is called a sampling survey. This chapter introduces the basic concepts and methods of sampling surveys.

CHAPTER 11: INTERVENTIONAL STUDY

Pages: 325–350

https://doi.org/10.1142/9789811228872_0012

In experimental studies, subjects are randomly assigned to several experimental groups. By comparing the effects of different experimental factors, one can determine whether those factors affect the outcomes. The most fundamental difference between experimental and observational studies is whether the experimental factors are controlled by the investigators…

CHAPTER 12: CLINICAL TRIAL

Pages: 351–383

https://doi.org/10.1142/9789811228872_0013

A clinical trial is a study that examines the potential efficacy and defines the safety profile of an intervention in the prevention, diagnosis and treatment of a specific disease under a set of strictly controlled conditions. It can be conducted in patients or in healthy volunteers. Clinical studies that examine drug absorption, distribution, metabolism, and excretion are also considered clinical trials…

CHAPTER 13: OBSERVATIONAL COMPARATIVE EFFECTIVENESS RESEARCH

Pages: 385–410

https://doi.org/10.1142/9789811228872_0014

Observational studies are widely applied in medical practice to observe and describe the distribution of diseases and health conditions among populations, and to explore the relationships between exposures and diseases by objectively performing field research and recording data while no interventions (treatments) are provided. Traditional observational studies include cross-sectional, case–control, and cohort studies. These studies are bolstered with the development of registration data and claims records. Methods based on such data have received increasing attention.

CHAPTER 14: DIAGNOSTIC TEST

Pages: 411–433

https://doi.org/10.1142/9789811228872_0015

In medical research, both clinical diagnosis and population screening rely on tests to reach a diagnostic or screening conclusion; these are commonly referred to as diagnostic tests or screening tests. Finding and developing better diagnostic or screening tests is an important direction for medical research, so the design and analysis of studies that evaluate them deserve careful attention.
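The accuracy of a diagnostic or screening test against a reference standard is commonly summarized from a 2 × 2 table. The sketch below shows the usual measures; the function name `diagnostic_measures` and the counts are hypothetical, and which measures the chapter emphasizes is an assumption.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Accuracy measures from a 2 x 2 table of test result by true disease status.

    tp: diseased, test positive      fp: non-diseased, test positive
    fn: diseased, test negative      tn: non-diseased, test negative
    """
    return {
        "sensitivity": tp / (tp + fn),  # proportion of diseased who test positive
        "specificity": tn / (tn + fp),  # proportion of non-diseased who test negative
        "ppv": tp / (tp + fp),          # proportion of test positives who are diseased
        "npv": tn / (tn + fn),          # proportion of test negatives who are non-diseased
    }

if __name__ == "__main__":
    # Hypothetical study: 100 diseased and 200 non-diseased subjects
    for name, value in diagnostic_measures(tp=90, fp=30, fn=10, tn=170).items():
        print(f"{name}: {value:.3f}")
```

Sensitivity and specificity describe the test itself, whereas the predictive values also depend on how common the disease is in the study sample.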

CHAPTER 15: SAMPLE SIZE ESTIMATION

Pages: 435–471

https://doi.org/10.1142/9789811228872_0016

Determining the number of subjects, commonly referred to as sample size estimation, is an unavoidable part of study design for both surveys and experiments. This chapter introduces the basic concepts of sample size estimation, common conditions for which sample size estimation is applicable, preconditions, and application examples of sample size estimation.
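As one example of sample size estimation, the sketch below applies the standard normal-approximation formula for comparing the means of two equal-sized groups, n = 2(z₁₋α/₂ + z₁₋β)²σ²/δ² per group. The function name `n_per_group` and the input values are hypothetical; the formula assumes a two-sided test, a common known standard deviation σ, and a target difference δ.

```python
import math
from statistics import NormalDist

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Subjects per group needed to detect a mean difference delta.

    Normal-approximation formula for a two-sided, two-sample comparison
    of means with common standard deviation sigma.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # quantile for the type I error rate
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)                            # round up to whole subjects

if __name__ == "__main__":
    # Hypothetical design: SD of 10 units, clinically relevant difference of 5 units
    print(n_per_group(sigma=10, delta=5))
```

Halving the target difference roughly quadruples the required sample size, because δ enters the formula squared.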

CHAPTER 16: MISSING DATA

Pages: 473–492

https://doi.org/10.1142/9789811228872_0017

Data were not collected during the study, for example because subjects dropped out of a clinical trial, information was omitted in a survey, or respondents did not respond.
Data were collected but are not usable, for example because of obvious input errors or logic errors.

In statistical analysis, the imputation of missing data is very important. In this chapter, common missing data imputation methods are introduced.

CHAPTER 17: STATISTICAL REPORTING GUIDELINES FOR MEDICAL PAPERS

Pages: 493–519

https://doi.org/10.1142/9789811228872_0018

In this chapter, we introduce international guidelines that should be followed when reporting statistical designs, methods, results and conclusions in medical papers, including the consolidated standards of reporting trials (CONSORT) for randomized controlled clinical trials, the strengthening the reporting of observational studies in epidemiology (STROBE) for observational studies, the standards for reporting diagnostic accuracy studies (STARD) for diagnostic accuracy studies, and the preferred reporting items for systematic reviews and meta-analyses (PRISMA) for systematic reviews and meta-analyses. The main components of these guidelines are flow diagrams and checklists. The standard use of statistical symbols is also introduced in this chapter.

Part III: Frequently Used Powerful Statistical Methods

CHAPTER 18: ANALYSIS OF VARIANCE FOR COMPLICATED DESIGNS

Pages: 523–567

https://doi.org/10.1142/9789811228872_0019

Completely randomized design conceptually belongs to one-factor design, which is insufficient to consider multiple factors in real-life studies. When two or more factors are involved, a more complex design should be used based on specific aims of the research. In Chapter 12, principles of some more complex designs were introduced, including randomized block design, crossover design, factorial design, and repeated measurement design. In this chapter, we will illustrate the analysis of variance (ANOVA) using data from studies with these designs.

CHAPTER 19: MULTIPLE LINEAR REGRESSION

Pages: 569–599

https://doi.org/10.1142/9789811228872_0020

The interaction between factors is very common in the field of biomedical research. For example, height is affected not only by genetic factors but also by nutritional status, physical exercise, living environment, etc.; blood pressure is related not only to age but also to family history, diet, and labor intensity; the health-related Quality of Life (QoL) score is influenced by physiological, psychological, social, environmental and many other factors. This chapter describes how multiple linear regression analysis can be used to quantitatively characterize the effects of multiple factors on an outcome variable such as height, blood pressure, or health-related QoL score.

CHAPTER 20: LOGISTIC REGRESSION

Pages: 601–635

https://doi.org/10.1142/9789811228872_0021

Multiple linear regression models are usually thought of as only being appropriate for continuous response variables that have a normal distribution with respect to the independent variables. Categorical variables are a common type of response variable in medical research, and examples include recurrence vs. non-recurrence, alive vs. dead, different tumor histological types (squamous carcinoma, adenocarcinoma, and large cell carcinoma), and therapeutic effects (cure, effective, improved, and ineffective). Because the relationships between such dependent variables and the independent variables are nonlinear, and the range of the dependent variable is not consistent with that of a linear combination of the independent variables, multiple linear regression is not suitable for analyzing categorical dependent variables. In this chapter, logistic regression is introduced as an effective method to analyze the relationship between categorical dependent variables and corresponding independent variables. Logistic regression includes unconditional logistic regression for grouped data and conditional logistic regression for paired data. According to the type of dependent variable, logistic regression can be divided into binary logistic regression, multinomial logistic regression, and ordinal logistic regression. This chapter introduces each of these methods in turn.
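The mismatch between the range of a linear combination and the range of a probability is what the logistic function resolves in binary logistic regression. The sketch below shows how a linear predictor is mapped to a probability and how a coefficient is read as an odds ratio. The function names and the coefficient values are hypothetical and are not fitted to any data.

```python
import math

def predicted_probability(intercept, coefficients, x):
    """Map a linear predictor onto (0, 1) with the logistic function."""
    linear_predictor = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1 / (1 + math.exp(-linear_predictor))

def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in the corresponding variable."""
    return math.exp(coefficient)

if __name__ == "__main__":
    # Hypothetical model: log-odds of recurrence = -2.0 + 0.05 * age + 0.7 * smoker
    print(predicted_probability(-2.0, [0.05, 0.7], [60, 1]))
    print(odds_ratio(0.7))   # odds ratio comparing smokers with non-smokers
```

Whatever value the linear predictor takes, the predicted probability stays between 0 and 1, which a multiple linear regression cannot guarantee.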

CHAPTER 21: SURVIVAL ANALYSIS

Pages: 637–672

https://doi.org/10.1142/9789811228872_0022

In logistic regression we are interested in how risk factors are associated with the presence or absence of an event. Sometimes, however, we are interested in how a risk factor or treatment affects the time to an event. Time-to-event data is analyzed in a number of applied fields such as laboratory studies of animals and clinical and epidemiological studies in humans. Survival analysis is a collection of statistical methods used to describe, explain, or predict the occurrence and timing of events. Often the event of interest is death, hence the name “survival analysis.” It has now become one of the most important fields in statistics.
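A common descriptive tool for time-to-event data is the Kaplan–Meier estimate of the survival function, which accounts for censored observations. The sketch below is a minimal implementation; the function name `kaplan_meier` and the follow-up data are hypothetical, and whether the chapter presents the estimator in this form is an assumption.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates at each distinct event time.

    times[i] is the follow-up time of subject i; events[i] is 1 if the
    event was observed at that time and 0 if the subject was censored.
    Returns a list of (time, estimated survival probability) pairs.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for ti in times if ti >= t)  # still under observation at t
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        if deaths > 0:
            survival *= 1 - deaths / at_risk         # conditional survival past t
            curve.append((t, survival))
    return curve

if __name__ == "__main__":
    # Hypothetical follow-up times in months; 0 marks a censored observation
    for t, s in kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]):
        print(t, round(s, 3))
```

Censored subjects contribute to the number at risk until their censoring time but never count as events, which is why simply discarding them would bias the estimate.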

CHAPTER 22: DISCRIMINANT ANALYSIS AND CLASSIFICATION TREE

Pages: 673–693

https://doi.org/10.1142/9789811228872_0023

In the medical field, doctors discriminate the types of diseases suffered by patients according to their symptoms, signs, and results of laboratory examinations. Disease diagnosis is a classification problem. For example, the differences between diabetic and healthy people can be found by their symptoms, signs, and laboratory examinations. If the differences are expressed as a classification rule, new patients with suspected diabetes can be diagnosed by this rule. Computer-based expert diagnostic systems work in this way…

CHAPTER 23: CLUSTER ANALYSIS

Pages: 695–714

https://doi.org/10.1142/9789811228872_0024

Cluster analysis is a multivariate statistical method, which aims to classify objects, such as samples and variables, into different groups based on the idea of “birds of a feather flock together.” Through cluster analysis, the objects in the same cluster are highly similar, whereas objects in different clusters are dissimilar…

CHAPTER 24: PRINCIPAL COMPONENT ANALYSIS AND FACTOR ANALYSIS

Pages: 715–737

https://doi.org/10.1142/9789811228872_0025

In medical research, investigators usually analyze problems systematically from multiple dimensions, so they may measure several observed variables to reflect different aspects of the subject. However, the variables differ in importance, and the information they provide may be correlated. Researchers therefore try to summarize these variables using fewer, uncorrelated underlying variables that reflect most of the information in the original variables while reducing the dimensionality of the data set. Cluster analysis of variables, covered in the previous chapter, also plays an important role in dimensionality reduction, but it generates no new variables. Both principal component analysis (PCA) and factor analysis (FA), explored in this chapter, generate new variables in the process of dimensionality reduction, and these can be used in further analysis.

Part IV: Selected Topics of Advanced Statistics

CHAPTER 25: META-ANALYSIS

Pages: 741–796

https://doi.org/10.1142/9789811228872_0026

The best possible synthesis of available information is essential for medical researchers, health policy-makers, clinicians and other decision-makers. During the past 30 years, meta-analysis, a statistical procedure for systematically combining and analyzing the results of multiple scientific studies, has been applied with increasing frequency in health-related contexts, especially in clinical trials.

CHAPTER 26: TIME SERIES

Pages: 797–822

https://doi.org/10.1142/9789811228872_0027

Time series analysis provides mathematical models for dynamic data, either to describe quantitatively how variables develop and change or to describe the quantitative relationships among variables from a dynamic perspective. In medical research, events are often observed at regular (often equidistant) intervals. Because random factors are involved, the measurements at those time points are random variables. Such random variables (or their observed values) arranged in time order are called a time series. For example, the incidence of influenza among children in a given area (Table 26.1) and the amount of antibiotics produced by a pharmaceutical company (Table 26.2) each constitute time series data.

CHAPTER 27: STRUCTURAL EQUATION MODEL

Pages: 823–848

https://doi.org/10.1142/9789811228872_0028

Structural equation modeling (SEM) is a method that integrates measurement and analysis. It examines latent constructs that are not directly observable using analysis of observed variables to estimate measurement error and evaluate the validity of the relationships modeled. In SEM, structural models are constructed to explore the relationship between latent variables. SEM has been widely used in psychology, sociology, behavioral science, and medicine in recent decades. This chapter introduces the process of SEM analysis and illustrates this process using specific cases.

CHAPTER 28: MULTI-LEVEL MODEL

Pages: 849–882

https://doi.org/10.1142/9789811228872_0029

As mentioned in the previous sections, an important assumption of the conventional regression model is that the observed outcomes of individuals are mutually independent. In practice, however, owing to subjective or objective factors, individuals observed within the same period or place often share characteristics, so their outcomes are no longer independent and the data have a hierarchical structure. Under such circumstances, analysis with a conventional regression model may give biased results. The multi-level models introduced in this chapter, such as the random intercepts model, the random slopes model, and the two-level logistic regression model, are more appropriate for this kind of data.

CHAPTER 29: ITEM RESPONSE THEORY

Pages: 883–907

https://doi.org/10.1142/9789811228872_0030

Item response theory (IRT) originated in the field of psychological and educational measurement. Since IRT was introduced in the middle of the 20th century, a set of mature models has been established, including unidimensional, multi-dimensional, and non-parametric models, among others. These models have been widely used in measurement. However, because of the complexity of the models and of parameter estimation, the application of IRT in other fields, especially biomedical fields, has been limited. The purpose of this chapter is to introduce IRT briefly so as to promote its application in biomedical research.

CHAPTER 30: STATISTICS FOR GENETICS

Pages: 909–937

https://doi.org/10.1142/9789811228872_0031

Recent developments in molecular genetics provide opportunities for genetic research on complex traits in humans. Many human diseases, such as insulin-dependent diabetes, hypertension and schizophrenia, are considered to have genetic effects. Gene mapping is extremely important for etiological studies and may lead to better treatments. The main goal of human disease genomics research is to reveal the genetic mechanisms of complex genetic diseases. Complex diseases, such as diabetes, obesity, osteoporosis, hypertension, and cardiovascular diseases, are generally determined by a variety of genetic and environmental factors and their interactions. Genetic epidemiology studies seek to elucidate the role of genetic and environmental factors in the occurrence of disease in populations. The surge in genetic epidemiology has been accompanied by the growing development of statistical methods. In this chapter, we focus on genetic association analysis, including some basic concepts and population-based and family-based association analysis.

CHAPTER 31: STATISTICS FOR BIOINFORMATICS

Pages: 939–968

https://doi.org/10.1142/9789811228872_0032

Large amounts of biomedical data have been generated and collected in recent decades owing to the rapid development of molecular biology technologies such as gene chips, next-generation sequencing, mass spectrometry, chromatography, protein–protein interaction detection, and new drug testing. Modern life science and medical research have entered a high-throughput, large-scale era of omics study. Since the launch of the Human Genome Project, bioinformatics has emerged as an interdisciplinary subject involving biology, statistics, and computer science, which focuses on storing, retrieving, and analyzing biological information by means of computers. Bioinformatics is one of the major frontier fields of the life sciences and natural sciences today, and it is also one of the core and most dynamic fields of the natural sciences in the 21st century. The GWAS study described in the previous chapter is a typical example of the application of statistical analysis techniques to bioinformatics research. We start this chapter with an introduction to gene expression data generated by gene chip technology and high-throughput sequencing (HTS) technology, and then introduce how to think about and solve bioinformatics problems from a statistical perspective.

CHAPTER 32: DATA MINING

Pages: 969–994

https://doi.org/10.1142/9789811228872_0033

Data mining is a complex process of extracting hidden information and knowledge from a large number of incomplete, noisy, fuzzy and random data. Data mining methods are widely derived from artificial intelligence, which integrates mathematics, statistics, computer technology, machine learning and other disciplines, and has grown up to be a hot technology in the era of massive data…

BACK MATTER

Pages: 997–1146

https://doi.org/10.1142/9789811228872_bmatter

Introduction to SPSS
Introduction to Excel for Statistics
Introduction to R Language
Program, Data and Output for Computer Experiments
Referenced Answer for Case Discrimination
Referenced Answer for Think and Practice
Frequently Used Tables
Others (Available Online)

Ji-Qian Fang was named a National Teaching Master of China by the Central Government of China in 2009 for his outstanding achievement in university education. Professor Fang received his BS in Mathematics from Fudan University, China in 1961, and his PhD in Biostatistics from the University of California at Berkeley, US in 1985. He served as Professor and Director of the Department of Biomathematics and Biostatistics at Beijing Medical University from 1985 to 1991, and since 1991 he has been Professor and Director of the Department of Medical Statistics at Sun Yat-Sen Medical University (now Sun Yat-Sen University). He was also an Adjunct Professor at the Chinese University of Hong Kong from 1993 to 2009.

His research interests include the statistical methods for research on chronic diseases, stochastic models for life phenomena, measurement of quality of life and actuarial studies for health service. Professor Fang has completed 19 National and International research projects, and has received 14 awards for progression in research from the Central and Provincial Governments of China.

He is the Chief Editor of nationally recommended textbooks and monographs, including Advanced Mathematics, Mathematical Statistics for Medicine, Health Statistics, Statistical Methods for Biomedical Research, Medical Statistics and Computer Experiments, Advanced Medical Statistics and Handbook of Medical Statistics. The course in Medical Statistics led by Professor Fang was recognized as a National Recommended Course in 2008 and a National Demonstration Bi-lingual Course in 2010.

Professor Fang is the founder of Group China of the International Biometric Society, and the founder of Committee of Medical Statistics Education of the Chinese Association of Health Informatics.

includes Statistical Tables.

Open access
Published: 02 July 2024

Integrating sex and gender into biomedical research requires policy and culture change

Alice Witt 1 ,
Marina Politis 1 ,
Robyn Norton 1 , 2 , 3 &
Kate Womersley 1 , 2

npj Women's Health volume 2, Article number: 23 (2024)

Health care

Most biomedical, health and care research does not adequately account for sex and gender dimensions of health and illness. Overlooking and disregarding the influence of sex and gender in research reduces scientific rigour and reproducibility, which leads to less effective treatments and worse health outcomes for all, particularly women and sex and gender diverse people. Historically, there has been minimal sex and gender policy innovation in UK medical research. To address this, stakeholders from across the UK research sector have been collaborating since spring 2023 to co-design a sex and gender policy framework to be implemented by research funders, as part of the MESSAGE (Medical Science Sex and Gender Equity) project. In the first Policy Lab, held in London in May 2023, 50 participants, including representatives from funding organisations, medical journals, regulators, clinicians, academics and people with lived experience, identified two key priorities for future action: 1) A whole system approach to policy change, and 2) Technical capacity-building and wider culture change efforts. In pursuing these priorities and collaborating cross-sectorally, UK stakeholders are engaged in an internationally innovative approach aimed at realising sustainable and impactful sex and gender policy change. Drawing on MESSAGE Policy Lab discussions, we set out key actions needed for the UK research sector to embed meaningful accounting for sex and gender as a new norm for research practice.

Historically and to this day, sex and gender considerations have not been adequately accounted for in biomedical, health and care research. This persists across data collection, analysis and reporting of findings in both bench and clinical research. A 2019 study found that just 49% of pre-clinical research studies reported using both male and female research subjects, and only 42% analysed their data by sex [1]. In clinical medicine, just 22% of phase I trial participants are female [2], with 67% of publications reporting data by sex [3]. Sex and gender data gaps in the evidence base reduce scientific accuracy and generalisability, leading to less effective treatments, increased risks of patient harm and worse outcomes. For example, of the ten drugs withdrawn from the US market between 1997 and 2000, eight posed a greater health risk to women than men due to trials having poorly accounted for sex and gender dimensions [4, 5]. Though these gaps predominantly impact women, girls and sex and gender diverse people, addressing sex and gender gaps in the evidence base and research practice will improve health outcomes for all people.

Research funder policies are an important lever for initiating change

Medical research funders around the world, most notably the Canadian Institutes of Health Research (CIHR), the National Institutes of Health in the US and Horizon Europe, have adopted policies to address sex and gender data gaps [6]. An evaluation of the CIHR policy (2010) showed that such interventions can have a considerable impact: in the first decade of the policy’s implementation, the proportion of funding applications that considered sex rose from 22% to 83%, and of those considering gender increased from 12% to 33% [7]. Since 2015, the number of Canadian research publications that account for sex and/or gender rose by 64% [8]. By contrast, in the UK there is no unified sex and gender guidance for researchers, and only one funder has a sex and gender policy – the Medical Research Council’s Embedding diversity in research design, which was launched in 2023 [9]. UK research funders, however, have now shown an appetite for change. Over 30 stakeholders have publicly stated their support for sex and gender policy action since December 2023, including major government and charitable funders [10, 11]. Over the past 12 months, organisations across the UK biomedical, health and care research sector have engaged in co-designing a policy framework for research funders to improve how research accounts for sex and gender dimensions. Led by The George Institute for Global Health’s MESSAGE (Medical Science Sex and Gender Equity) project, the co-design process took place over three Policy Labs in 2023–2024 (www.messageproject.co.uk).

At the first Lab, held in May 2023, 50 representatives from the UK biomedical, health and care research sector came together to set out a vision for policy efforts to improve how researchers account for sex and gender dimensions [12]. The group comprised representatives from research funding organisations, regulators and academic publishers, people with lived experience, researchers, clinicians and government officials. The Lab identified two key priorities for enhancing the impact of policy action:

1. Sex and gender policies should be designed and delivered through a whole system approach.

2. Technical capacity-building and culture change across the research sector is needed to support policy implementation.

A new norm for research practice to improve scientific rigour and reproducibility

Sex and gender are relevant to the vast majority of biomedical, health and care research questions, and accounting for these dimensions is an essential component of conducting robust, rigorous and reproducible science. Despite this, identifying potential sex and gender differences and similarities – rather than merely controlling for sex or gender – is currently considered a niche area of expertise, rather than a standard that ought to be widely practised. Policy interventions are vital for bringing about a paradigm shift in how the research community thinks about, conducts and values research, to centre explicit accounting for sex and gender dimensions as a core part of high-quality research practice.

To enhance scientific rigour and reproducibility, researchers must embed sex and gender considerations in their thinking at every stage of the research cycle, from study design and data collection to analysis and reporting of findings [13, 14]. Improved representation of all sexes and genders in research is a critical first step to address historic biases in research inclusion. Representation must be paired with a commitment to disaggregate data by sex and/or gender when analysing and reporting findings, which is rare in current practice [15, 16]. Accounting for sex and gender does not mean powering every study to identify statistically significant sex and/or gender differences, but rather designing studies in a way that allows possible sex and/or gender dimensions of health and disease to emerge in study data and inform future research. Simply reporting sex- and gender-disaggregated data will widen the collective knowledge of the research community, enabling future research, including meta-analyses, to study important sex and gender differences in more depth.
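As a purely illustrative sketch of what reporting disaggregated data can look like in practice, the following Python snippet tabulates sample size and response proportion separately by sex. The records, the field names (`sex`, `responded`) and the helper `disaggregate` are hypothetical and are not taken from the article or from any MESSAGE guidance; the point is only that group-level counts are reported descriptively, without testing for a difference.

```python
from collections import defaultdict

# Hypothetical participant records; field names and values are illustrative only.
records = [
    {"sex": "female", "responded": True},
    {"sex": "female", "responded": True},
    {"sex": "female", "responded": False},
    {"sex": "male", "responded": True},
    {"sex": "male", "responded": False},
]


def disaggregate(rows, by="sex"):
    """Return per-group sample size, responder count and response proportion."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row["responded"])
    return {
        level: {
            "n": len(values),
            "responders": sum(values),
            "proportion": sum(values) / len(values),
        }
        for level, values in groups.items()
    }


summary = disaggregate(records)

# Report each group descriptively; no significance test is implied.
for level, stats in sorted(summary.items()):
    print(f"{level}: n={stats['n']}, responders={stats['responders']}, "
          f"proportion={stats['proportion']:.2f}")
```

Reporting the raw counts alongside the proportions is a deliberate choice here: counts are what a later pooled analysis would need, even when a single study is too small to say anything about sex differences on its own.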

Most importantly, greater transparency is needed in published research about the sex and/or gender characteristics of the study sample, whether data analysis has accounted for these variables, and whether the way sex and/or gender has been accounted for is a strength or limitation of the study. This change in convention will have considerable impact by bringing clarity to the existing evidence base and highlighting data gaps for further research.

Sex and gender policies should be designed and delivered through a whole system approach

MESSAGE Policy Lab participants identified that a whole system approach is needed to change expectations for what high-quality research includes and how it should be evaluated. A coordinated and consistent approach between the many organisations that make up the medical research sector will ensure that new requirements are clear to researchers and are as easy as possible to adopt.

The group identified that funder policies should be the first step in this process because they shape researchers’ thinking at the first stage of study design. Once funder requirements are in place, regulators and publishers will be able to reflect those requirements in their guidance over time, as researchers become more familiar with and skilled at accounting for and reporting sex and gender considerations. Policy Lab participants stated that a joint roadmap and timeframe for sector-wide policy roll-out would support cohesion between the different organisations.

Changes in how sex and gender are considered in research require investment of resources, particularly by funders, to upskill researchers and to address gaps in the existing knowledge base. Internally, organisations must dedicate time to considering how sex and gender will be integrated in existing funding application review systems and must appoint dedicated members of staff to be responsible for policy implementation and evaluation. Lab participants emphasised that a whole system approach would help to reduce the burden of additional work by enabling actors to share and collaborate on strategies and resources.

Technical capacity-building and culture change across the research sector is needed to support policy implementation

The second priority outlined by Policy Lab participants was the need to provide clear guidance from the outset to support researchers to meet new policy expectations. Guidance should raise awareness about the relevance of sex and gender considerations to biomedical, health and care research, and provide researchers with the skills to implement these changes in their work. The fact that many published papers use the terms sex and gender interchangeably underscores the need to build researchers’ knowledge about the differences between usage of the two terms and their distinct relevance in the context of biomedical, health and care research [17, 18]. Supporting researchers to reflect on which term is most relevant to their research question will enable them to identify which research participants or subjects they need to include to develop an appropriately representative sample. Guidance on statistical considerations and best practice for procurement, recruitment and retention of a more diverse sample will also be important.

Wider advocacy efforts will be needed to bring sex and gender considerations to the fore of research thinking and to establish a new norm of transparency around reporting of sex and gender dimensions. A crucial component of this will be addressing misconceptions which have historically been used to justify the exclusion of female subjects and participants, such as the (disproven) contention that female hormonal variability throughout the oestrus cycle might obscure results [19, 20]. Likewise, women’s reproductive capacity has historically been used to justify their exclusion from drug trials, in large part due to the thalidomide scandal of the 1960s, which also damaged trust with the public [21]. Awareness-raising efforts must highlight that excluding women from clinical trials only to expose them to those very medicines and treatments outside of the comparatively safe laboratory environment increases the risk of patient harm. Moreover, concerns about reproductive capacity can now be addressed by modern contraception methods and do not apply to women who are post-menopausal.

Showcasing and rewarding examples of best practice in considering sex and gender will be important to establish role models, incentivise new ways of working, and attract further investment to fill knowledge gaps. Increasing the sex and gender diversity of the research workforce, and particularly of those in senior positions, will support this culture change [7].

Accounting for sex and gender will generate new opportunities for research impact

Improved accounting for sex and gender in biomedical, health and care research is essential for advancing the quality of scientific research. Funder policies will play an important role in initiating this process of change at the start of the research pipeline and, to embed this as a new norm for research practice, should be grounded in sector-wide collaboration and accompanied by capacity-building and culture change efforts. Enhanced integration of sex and gender considerations will trigger a virtuous cycle: as more researchers reflect on and describe the sex and gender characteristics of their study sample, increase the sex and gender equity of their research participants or subjects, and publish sex- and gender-disaggregated data, the evidence base will grow and strengthen. Reducing sex and gender data gaps in evidence will open new avenues for research, improve treatments and culminate in improved health outcomes for all.

References

1. Woitowich, N. C., Beery, A. & Woodruff, T. A 10-year follow-up study of sex inclusion in the biological sciences. eLife 9, e56344 (2020).

2. Labots, G., Jones, A., de Visser, S. J., Rissman, R. & Burggraaf, J. Gender differences in clinical registration trials: is there a real problem? Br. J. Clin. Pharmacol. 84, 700–707 (2018).

3. Sugimoto, C. R., Ahn, Y. Y., Smith, E., Macaluso, B. & Larivière, V. Factors affecting sex-related reporting in medical research: a cross-disciplinary bibliometric analysis. Lancet 393, 550–559 (2019).

4. Heinrich, J., Gahart, M. T., Rowe, E. J. & Bradley, L. A Letter to The Honorable Tom Harkin. Drug Safety: Most Drugs Withdrawn in Recent Years Had Greater Health Risks for Women. (U.S. Government Accountability Office, 2001). https://www.gao.gov/products/gao-01-286r.

5. Carey, J. L. et al. Drugs and medical devices: Adverse events and the impact on women’s health. Clin. Ther. 39, 10–22 (2017).

6. Hunt, L., Nielsen, M. W. & Schiebinger, L. A framework for sex, gender, and diversity analysis in research. Science 377, 1492–1495 (2022).

7. Haverfield, J. & Tannenbaum, C. A 10-year longitudinal evaluation of science policy interventions to promote sex and gender in health research. Health Res. Policy Syst. 19, 94 https://doi.org/10.1186/s12961-021-00741-x (2021).

8. Canadian Institutes of Health Research. A New Era of Sex and Gender Science: Impact Report 2015–2022. https://cihr-irsc.gc.ca/e/documents/igh_report_new_era_sgc-en.pdf (2023).

9. Medical Research Council. Embedding diversity in research design. Medical Research Council. https://www.ukri.org/publications/mrc-embedding-diversity-in-research-design-policy/embedding-diversity-in-research-design/ (2023).

10. Womersley, K. & Norton, R. UK medical research funders must do more to support sex and gender equity. BMJ 382, 1809 (2023).

11. Witt, A., Politis, M. & Womersley, K. A whole sector approach to policy change will accelerate integration of sex and gender in research. BMJ 383, p2913 (2024).

12. Witt, A. et al. MESSAGE Policy Lab 1 – 22nd May 2023. Medical Science Sex and Gender Equity. https://cdn.georgeinstitute.org/sites/default/files/documents/Policy%20Lab%201%20Report%20-%20MESSAGE.pdf (2023).

13. Tannenbaum, C., Ellis, R. P., Eyssel, F., Zou, J. & Schiebinger, L. Sex and gender analysis improves science and engineering. Nature 575, 137–146 (2019).

14. Accounting for sex and gender makes for better science. Nature 588, 196 https://doi.org/10.1038/d41586-020-03459-y (2020).

15. Raising the bar on sex and gender reporting in research. Nat. Commun. 13, 2845 https://doi.org/10.1038/s41467-022-30398-1 (2022).

16. Geller, S. E. et al. The more things change, the more they stay the same: A study to evaluate compliance with inclusion and assessment of women and minorities in randomised controlled trials. Acad. Med. 93, 630–635 (2018).

17. Peters, S. A. E. & Norton, R. Sex and gender reporting in global health: new editorial policies. BMJ Global Health 3, e001038 (2018).

18. Johnson, J., Sharman, Z., Vissandjée, B. & Stewart, B. Does a change in health research funding policy related to the integration of sex and gender have an impact? PLOS One 9, e99900 (2014).

19. Beery, A. K. Inclusion of females does not increase variability in rodent research studies. Curr. Opin. Behav. Sci. 23, 143–149 (2018).

20. Prendergast, B. J., Onishi, K. G. & Zucker, I. Female mice liberated for inclusion in neuroscience and biomedical research. Neurosci. Biobehav. Rev. 40, 1–5 (2014).

21. National Institutes of Health. NIH Inclusion Outreach Toolkit: How to Engage, Recruit and Retain Women in Clinical Research. https://orwh.od.nih.gov/toolkit/recruitment/history#3.

Author information

Authors and affiliations.

The George Institute for Global Health, London, UK

Alice Witt, Marina Politis, Robyn Norton & Kate Womersley

Imperial College London, London, UK

Robyn Norton & Kate Womersley

UNSW Sydney, Sydney, NSW, Australia

Robyn Norton


Contributions

A.W. and K.W. wrote the manuscript with input from M.P. and R.N. All authors reviewed the manuscript.

Corresponding author

Correspondence to Alice Witt.

Ethics declarations

Competing interests.

A.W., K.W. and M.P. receive funding from the Wellcome Trust, who have participated in MESSAGE Policy Labs. The funder had no role in the analysis of Policy Lab discussions or preparation of this manuscript.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Witt, A., Politis, M., Norton, R. et al. Integrating sex and gender into biomedical research requires policy and culture change. npj Womens Health 2, 23 (2024). https://doi.org/10.1038/s44294-024-00027-x

Received: 12 October 2023

Accepted: 18 June 2024

Published: 02 July 2024

DOI: https://doi.org/10.1038/s44294-024-00027-x


Whiting School of Engineering
Johns Hopkins School of Medicine

Johns Hopkins Biomedical Engineering
Research Areas

Biomedical Data Science

Biomedical Data Science involves the analysis of large-scale biomedical datasets to understand how living systems function. Our academic and research programs in Biomedical Data Science center on developing new data analysis technologies in order to understand disease mechanisms and provide improved health care at lower costs.

Education in Biomedical Data Science

Our curriculum trains students to extract knowledge from biomedical datasets of all sizes in order to understand and solve health-related problems. Students collaborate with faculty throughout the schools of Medicine and Engineering to develop novel cloud-based technologies and data analysis methods that will improve our ability to diagnose and treat diseases.

Research in Biomedical Data Science

Our students and faculty are pioneering new methods to analyze large-scale biomedical datasets, shedding new light on the function of living systems. Key research areas include:

Computational Science

Machine learning and data science, biomedical data, science as a service, biomedical clouds, core faculty.


Chromatography–Mass Spectrometry Analysis in Biomedical Research and Clinical Laboratory

Dear Colleagues,

It is recognized that chromatography–mass spectrometry introduced a revolution in biomedical research, offering specificity and sensitivity superior to that of other analytical techniques. It is currently an intensively developing scientific field that comprises the development and application of new methods using state-of-the-art equipment. The present Topic aims to cover the latest research trends and achievements of chromatography–mass spectrometry in biomedical, clinical, and pharmacological research by highlighting novel applications and novel approaches in sample treatment and instrumental analysis. Researchers working on all aspects of basic research and applications in biomedical and clinical sciences are cordially invited to contribute a research or review article in this Topic.

Dr. Constantinos K. Zacharis
Dr. Andreas Tsakalof
Topic Editors

chromatography–mass spectrometry in biomedical research
bioanalysis
biomarkers of disease
biomarkers of exposure
omics research: metabolomics, volatolomics, lipidomics, proteomics
drug development
therapeutic drug monitoring
biosample preparation

Journal Name	Impact Factor	Launched Year	First Decision (median)	APC
analytica	-	2020	12.8 Days	CHF 1000
jcm		2012	17.3 Days	CHF 2600
separations		2014	12.4 Days	CHF 2600
biomolecules		2011	16.3 Days	CHF 2700
molecules		1996	15.1 Days	CHF 2700
ijms		2000	18.1 Days	CHF 2900



Issues in biomedical research data management and analysis: needs and barriers

Affiliation.

1 University of Washington, Department of Medical Education and Biomedical Informatics, Box 357240, Seattle, WA 98195-7420, USA.
PMID: 17460139
PMCID: PMC2244904
DOI: 10.1197/jamia.M2114

Objectives: A. Identify the current state of data management needs of academic biomedical researchers. B. Explore their anticipated data management and analysis needs. C. Identify barriers to addressing those needs.

Design: A multimodal needs analysis was conducted using a combination of an online survey and in-depth one-on-one semi-structured interviews. Subjects were recruited via an e-mail list representing a wide range of academic biomedical researchers in the Pacific Northwest.

Measurements: The results from 286 survey respondents were used to provide triangulation of the qualitative analysis of data gathered from 15 semi-structured in-depth interviews.

Results: Three major themes were identified: 1) there continues to be widespread use of basic general-purpose applications for core data management; 2) there is broad perceived need for additional support in managing and analyzing large datasets; and 3) the barriers to acquiring currently available tools are most commonly related to financial burdens on small labs and unmet expectations of institutional support.

Conclusion: Themes identified in this study suggest that at least some common data management needs will best be served by improving access to basic level tools such that researchers can solve their own problems. Additionally, institutions and informaticians should focus on three components: 1) facilitate and encourage the use of modern data exchange models and standards, enabling researchers to leverage a common layer of interoperability and analysis; 2) improve the ability of researchers to maintain provenance of data and models as they evolve over time through tools and the leveraging of standards; and 3) develop and support information management service cores that could assist in these previous components while providing researchers with unique data analysis and information design support within a spectrum of informatics capabilities.

Figure: Primary roles by percentage of respondents.

Figure: Primary research interest by percentage of respondents.

Figure: Individuals reporting experiencing computational and informatics problem by lab size and percentage of…

Figure: Image archiving by sub-discipline by percentage of respondents.

Figure: Employee hours spent per-week by sub-discipline and percentage of respondents.


Grants and funding.

T15LM07442/LM/NLM NIH HHS/United States
DC02310/DC/NIDCD NIH HHS/United States
T15 LM007442/LM/NLM NIH HHS/United States
R01-HG02288/HG/NHGRI NIH HHS/United States
P20 LM007714/LM/NLM NIH HHS/United States
P20-LM007714/LM/NLM NIH HHS/United States
R01 DC002310/DC/NIDCD NIH HHS/United States
R01 HG002288/HG/NHGRI NIH HHS/United States


Data integration through canonical correlation analysis and its application to OMICs research


Information

Published in.

Elsevier Science

San Diego, CA, United States

Publication History

Author tags.

Canonical correlation analysis
Bioinformatics
Review-article


Business Analyst – Biomedical research with Generative AI Expertise

About the role.

Your responsibilities include but are not limited to

Domain Expertise:

Apply in-depth knowledge of biomedical research processes, including chemical, pre-clinical, and clinical stages, to provide valuable insights and recommendations.

Stay updated with the latest advancements in the biomedical field and AI technologies, ensuring the application of cutting-edge solutions.

Business Analysis:

Collaborate with customers to gather and document business requirements, ensuring alignment with organizational goals.

Conduct thorough analyses of business processes, identifying areas for improvement and optimization.

Develop business cases, feasibility studies, and cost-benefit analyses to support decision-making.

Data Analysis and Interpretation:

Leverage GenerativeAI and AI tools to analyze large datasets, extracting meaningful patterns and trends.

Present data-driven insights to collaborators, aiding in the development of critical initiatives and operational efficiencies.

Project Management:

Lead and manage projects from inception to completion, ensuring timely delivery and adherence to quality standards.

Coordinate with multi-functional teams, including IT, R&D, and operations, to implement solutions and drive project success.

Solution Development:

Work with technical teams to design and implement AI-driven solutions that address business needs and improve research outcomes.

Ensure that developed solutions are scalable, sustainable, and aligned with industry best practices.

Communicate effectively with both technical and non-technical customers, translating complex technical concepts into actionable insights.

Facilitate workshops, meetings, and presentations to engage customers and drive consensus.

Minimum Requirements

Bachelor's degree in Biomedical Sciences, Life Sciences, Computer Science, or a related field; Master's degree preferred.

Experience as a Business Analyst in the biomedical research domain, with a focus on chemical, pre-clinical, and clinical research.

Solid understanding of GenerativeAI and AI technologies, with practical experience in their application within the biomedical field.

Proficiency in data analysis tools and techniques, with the ability to interpret complex datasets.

Excellent project management skills, with a track record of optimally leading and delivering projects.

Strong analytical and problem-solving abilities, with a keen attention to detail.

Exceptional communication and interpersonal skills, with the ability to engage and influence customers at all levels.

Familiarity with regulatory requirements and industry standards in the pharmaceutical and biomedical research sectors.

Why consider Novartis? Our purpose is to reimagine medicine to improve and extend people’s lives and our vision is to become the most valued and trusted medicines company in the world. How can we achieve this? With our people. It is our associates that drive us each day to reach our ambitions. Be a part of this mission and join us! Learn more here: https://www.novartis.com/about/strategy/people-and-culture Commitment to Diversity and Inclusion: Novartis is committed to building an outstanding, inclusive work environment and diverse teams representative of the patients and communities we serve. Join our Novartis Network: If this role is not suitable to your experience or career goals but you wish to stay connected to hear more about Novartis and our career opportunities, join the Novartis Network here: https://talentnetwork.novartis.com/network

Why Novartis: Helping people with disease and their families takes more than innovative science. It takes a community of smart, passionate people like you. Collaborating, supporting and inspiring each other. Combining to achieve breakthroughs that change patients’ lives. Ready to create a brighter future together? https://www.novartis.com/about/strategy/people-and-culture


Morphological simulation tests the limits on phenotype discovery in 3D image analysis

Rachel A Roston, Sophie M Whikehart, Sara M Rolfe, A. Murat Maga

In the past few decades, advances in 3D imaging have created new opportunities for reverse genetic screens. Rapidly growing datasets of 3D images of genetic knockouts require high-throughput, automated computational approaches for identifying and characterizing new phenotypes. However, exploratory, discovery-oriented image analysis pipelines used to discover these phenotypes can be difficult to validate because, by their nature, the expected outcome is not known a priori. Introducing known morphological variation through simulation can help distinguish between real phenotypic differences and random variation; elucidate the effects of sample size; and test the sensitivity and reproducibility of morphometric analyses. Here we present a novel approach for 3D morphological simulation that uses open-source, open-access tools available in 3D Slicer, SlicerMorph, and Advanced Normalization Tools in R (ANTsR). While we focus on diffusible-iodine contrast-enhanced micro-CT (diceCT) images, this approach can be used on any volumetric image. We then use our simulated datasets to test whether tensor-based morphometry (TBM) can recover our introduced differences; to test how effect size and sample size affect detectability; and to determine the reproducibility of our results. In our approach to morphological simulation, we first generate a simulated deformation based on a reference image and then propagate this deformation to subjects using inverse transforms obtained from the registration of subjects to the reference. This produces a new dataset with a shifted population mean while retaining individual variability because each sample deforms more or less based on how different or similar it is from the reference. TBM is a widely-used technique that statistically compares local volume differences associated with local deformations. 
Our results showed that TBM recovered our introduced morphological differences, but that detectability was dependent on the effect size, the sample size, and the region of interest (ROI) included in the analysis. Detectability of subtle phenotypes can be improved both by increasing the sample size and by limiting analyses to specific body regions. However, it is not always feasible to increase sample sizes in screens of essential genes. Therefore, methodical use of ROIs is a promising way to increase the power of TBM to detect subtle phenotypes. Generating known morphological variation through simulation has broad applicability in developmental, evolutionary, and biomedical morphometrics and is a useful way to distinguish between a failure to detect morphological difference and a true lack of morphological difference. Morphological simulation can also be applied to AI-based supervised learning to augment datasets and overcome dataset limitations.
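The abstract describes TBM as comparing local volume differences associated with local deformations. One common way to quantify local volume change is the Jacobian determinant of the deformation; the following is a minimal NumPy sketch of that idea, not the authors' ANTsR pipeline. The function names, the `(3, X, Y, Z)` displacement layout, and the toy uniform-scaling field are assumptions made for illustration.

```python
import numpy as np


def jacobian_determinant(displacement, spacing=(1.0, 1.0, 1.0)):
    """Voxel-wise det(I + grad u) for a displacement field u of shape (3, X, Y, Z).

    Values above 1 indicate local expansion, values below 1 local contraction.
    """
    jac = np.empty(displacement.shape[1:] + (3, 3))
    for i in range(3):
        # grads[j] approximates d u_i / d x_j by finite differences.
        grads = np.gradient(displacement[i], *spacing)
        for j in range(3):
            jac[..., i, j] = grads[j] + (1.0 if i == j else 0.0)
    return np.linalg.det(jac)


def scaling_displacement(shape, factor):
    """Hypothetical test field u(x) = (factor - 1) * x: uniform scaling about the origin."""
    axes = [np.arange(n, dtype=float) for n in shape]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"))
    return (factor - 1.0) * grid


# No deformation: every voxel keeps its volume.
identity_det = jacobian_determinant(np.zeros((3, 8, 8, 8)))

# A 10% enlargement along each axis: volume grows by 1.1 ** 3 everywhere.
scaled_det = jacobian_determinant(scaling_displacement((8, 8, 8), 1.1))

# TBM-style analyses typically compare the log of this map between groups.
log_jacobian = np.log(scaled_det)
```

In a simulation study of the kind described, a known deformation like the scaling field plays the role of the introduced phenotype, and the analysis is judged by whether it recovers the corresponding volume change.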

Competing Interest Statement

The authors have declared no competing interest.

https://github.com/raroston/SimMorph

View the discussion thread.

Thank you for your interest in spreading the word about bioRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Citation Manager Formats

EndNote (tagged)
EndNote 8 (xml)
RefWorks Tagged
Ref Manager
Tweet Widget
Facebook Like
Google Plus One

Subject Area

Animal Behavior and Cognition (5425)
Biochemistry (12229)
Bioengineering (9168)
Bioinformatics (30236)
Biophysics (15503)
Cancer Biology (12627)
Cell Biology (18122)
Clinical Trials (138)
Developmental Biology (9771)
Ecology (14649)
Epidemiology (2067)
Evolutionary Biology (18817)
Genetics (12563)
Genomics (17247)
Immunology (12349)
Microbiology (29098)
Molecular Biology (12088)
Neuroscience (63376)
Paleontology (467)
Pathology (1943)
Pharmacology and Toxicology (3378)
Physiology (5207)
Plant Biology (10833)
Scientific Communication and Education (1711)
Synthetic Biology (3011)
Systems Biology (7553)
Zoology (1694)

J Med Internet Res
v.24(4); 2022 Apr

Understanding the Research Landscape of Deep Learning in Biomedical Science: Scientometric Analysis

Donghun Kim, Woojin Jung, Yongjun Zhu

1 Department of Library and Information Science, Sungkyunkwan University, Seoul, Republic of Korea

2 Department of Library and Information Science, Yonsei University, Seoul, Republic of Korea

Background

Advances in biomedical research using deep learning techniques have generated a large volume of related literature. However, there is a lack of scientometric studies that provide a bird’s-eye view of them. This absence has led to a partial and fragmented understanding of the field and its progress.

Objective

This study aimed to gain a quantitative and qualitative understanding of the scientific domain by analyzing diverse bibliographic entities that represent the research landscape from multiple perspectives and levels of granularity.

Methods

We searched and retrieved 978 deep learning studies in biomedicine from the PubMed database. A scientometric analysis was performed by analyzing the metadata, content of influential works, and cited references.

Results

In the process, we identified the current leading fields, major research topics and techniques, knowledge diffusion, and research collaboration. There was a predominant focus on applying deep learning, especially convolutional neural networks, to radiology and medical imaging, whereas a few studies focused on protein or genome analysis. Radiology and medical imaging also appeared to be the most significant knowledge sources and an important field in knowledge diffusion, followed by computer science and electrical engineering. A coauthorship analysis revealed various collaborations among engineering-oriented and biomedicine-oriented clusters of disciplines.

Conclusions

This study investigated the landscape of deep learning research in biomedicine and confirmed its interdisciplinary nature. Although it has been successful, we believe that there is a need for diverse applications in certain areas to further boost the contributions of deep learning in addressing biomedical research problems. We expect the results of this study to help researchers and communities better align their present and future work.

Introduction

Deep learning is a class of machine learning techniques based on neural networks with multiple processing layers that learn representations of data [ 1 , 2 ]. Stemming from shallow neural networks, many deep learning architectures, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have been developed for various purposes [ 3 ]. The exponentially growing amount of data in many fields and recent advances in graphics processing units have further expedited research progress in the field. Deep learning has been actively applied to tasks, such as natural language processing (NLP), speech recognition, and computer vision, in various domains [ 1 ] and has shown promising results in diverse areas of biomedicine, including radiology [ 4 ], neurology [ 2 ], cardiology [ 5 ], cancer detection and diagnosis [ 6 , 7 ], radiotherapy [ 8 ], and genomics and structural biology [ 9 - 11 ]. Medical image analysis is a field that has actively used deep learning. For example, successful applications have been made in diagnosis [ 12 ], lesion classification or detection [ 13 , 14 ], organ and other substructure localization or segmentation [ 15 , 16 ], and image registration [ 17 , 18 ]. In addition, deep learning has also made an impact on predicting protein structures [ 19 , 20 ] and genomic sequencing [ 21 - 23 ] for biomarker development and drug design.

Despite the increasing number of published biomedical studies on deep learning techniques and applications, there has been a lack of scientometric studies that both qualitatively and quantitatively explore, analyze, and summarize the relevant studies to provide a bird’s-eye view of them. Previous studies have mostly provided qualitative reviews [ 2 , 9 , 10 ], and the few available bibliometric analyses were limited in their scope in that the researchers focused on a subarea such as public health [ 24 ] or a particular journal [ 25 ]. The absence of a coherent lens through which we can examine the field from multiple perspectives and levels of granularity leads to a partial and fragmented understanding of the field and its progress. To fill this gap, the aim of this study is to perform a scientometric analysis of metadata, content, and citations to investigate current leading fields, research topics, and techniques, as well as research collaboration and knowledge diffusion in deep learning research in biomedicine. Specifically, we intend to examine (1) biomedical journals that had frequently published deep learning studies and their coverage of research areas, (2) diseases and other biomedical entities that have been frequently studied with deep learning and their relationships, (3) major deep learning architectures in biomedicine and their specific applications, (4) research collaborations among disciplines and organizations, and (5) knowledge diffusion among different areas of study.

Data were collected from PubMed, a citation and abstract database that includes biomedical literature from MEDLINE and other life science journals indexed with Medical Subject Heading (MeSH) terms [ 26 ]. MeSH is a hierarchically structured biomedical terminology with descriptors organized into 16 categories, with subcategories [ 27 ]. In this study, deep learning [MeSH Major Topic] was used as the query to search and download deep learning studies from PubMed. Restricting a MeSH term to major-topic status increases the precision of retrieval so that only studies that are highly relevant to the topic are found [ 28 ]. As of January 1, 2020, a total of 978 PubMed records with publication years ranging from 2016 to 2020 had been retrieved using the National Center for Biotechnology Information (NCBI) Entrez application programming interface. Entrez is a data retrieval system that can be accessed programmatically through Biopython's Bio.Entrez module to search and export records from NCBI databases, including PubMed [ 26 , 29 ]. The metadata of the collected bibliographic records included the PubMed identifier or PubMed ID, publication year, journal title and its electronic ISSN, MeSH descriptor terms, and author affiliations. We also downloaded the citation counts and references of each bibliographic record and drew on data sources other than PubMed: we collected citation counts of the downloaded bibliographic records from Google Scholar (last updated on February 8, 2020) and the subject categories of their publishing journals from the Web of Science (WoS) Core Collection database using the electronic ISSN.
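To make the retrieval step concrete, the sketch below builds the kind of request that an Entrez search for this query resolves to. It only constructs an E-utilities ESearch URL with the standard library and sends nothing over the network; the study itself reports using Biopython, so this is a stand-in rather than the authors' code. The helper name `build_esearch_url` and the paging values are assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Public NCBI E-utilities endpoint for searching an Entrez database.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=1000, retstart=0):
    """Return an ESearch URL for PubMed (hypothetical helper, no request is sent)."""
    params = {
        "db": "pubmed",        # search the PubMed database
        "term": term,          # query string, including the field tag
        "retmax": retmax,      # maximum number of PubMed IDs per page
        "retstart": retstart,  # offset of the first returned ID
        "retmode": "json",
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


# The query reported in the text: deep learning as a MeSH major topic.
query = "deep learning[MeSH Major Topic]"
url = build_esearch_url(query)

# Round-trip check that the query survives URL encoding.
decoded = parse_qs(urlparse(url).query)
```

The returned PubMed IDs would then be passed to a fetch call to download the full records; NCBI asks clients to identify themselves and to respect its rate limits when doing so.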

Detailed Methods

Metadata Analysis

Journals are an important unit of analysis in scientometrics and have been used to understand specific research areas and disciplines [ 30 ]. In this study, biomedical journals that published deep learning studies were grouped using the WoS Core Collection subject categories and analyzed to identify widely studied research areas and disciplines.

Disease-related MeSH terms were analyzed to identify major diseases that have been studied using deep learning. We mapped descriptors to their corresponding numbers in MeSH Tree Structures to identify higher level concepts for descriptors that were too specific and ensured that all the descriptors had the same level of specificity. Ultimately, all descriptors were mapped to 6-digit tree numbers (C00.000), and terms with >1 tree number were separately counted for all the categories they belonged to. In addition, we visualized the co-occurrence network of major MeSH descriptors using VOSviewer (version 1.6.15) [ 31 , 32 ] and its clustering technique [ 33 ] to understand the relationships among the biomedical entities, as well as the clusters they form together.
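The tree-number mapping described above can be sketched as truncating each MeSH tree number to its first two segments and counting each record once per resulting category. The tree numbers below are illustrative values that were not checked against a current MeSH release, and counting a category at most once per record is an assumption of this sketch.

```python
from collections import Counter


def truncate_tree_number(tree_number, depth=2):
    """Map a tree number such as 'C04.588.274.476' to its 6-digit form 'C04.588'."""
    return ".".join(tree_number.split(".")[:depth])


def count_disease_categories(records):
    """Count records per truncated category in the C (Diseases) branch.

    `records` is a list with one list of tree numbers per article. A descriptor
    with tree numbers in several branches contributes to every branch it belongs to.
    """
    counts = Counter()
    for tree_numbers in records:
        categories = {
            truncate_tree_number(t) for t in tree_numbers if t.startswith("C")
        }
        counts.update(categories)
    return counts


# Illustrative records only.
records = [
    ["C04.588.274.476", "C06.301.371"],  # one descriptor listed under two parents
    ["C04.588.180"],
    ["C10.228.140"],
]
category_counts = count_disease_categories(records)
```

The first record is counted under both `C04.588` and `C06.301`, which mirrors the rule that terms with more than one tree number are counted in every category they belong to.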

Author Affiliations

We analyzed author affiliations to understand the major organizations and academic disciplines that were active in deep learning research. The affiliations of 4908 authors extracted from PubMed records were recorded in various formats and manually standardized. We manually reviewed the affiliations to extract organizations, universities, schools, colleges, and departments. For authors with multiple affiliations, we selected the first one listed, which is usually the primary. We also analyzed coauthorships to investigate research collaboration among organizations and disciplines. All the organizations were grouped into one of the following categories: universities, hospitals, companies, or research institutes and government agencies to understand research collaboration among different sectors. We classified medical schools under hospitals as they are normally affiliated with each other. In the category of research institutes or government agencies, we included nonprofit private organizations or foundations and research centers that do not belong to a university, hospital, or company. We extracted academic disciplines from the department section or the school or college section when department information was unavailable. As the extracted disciplines were not coherent with multiple levels and combinations, data were first cleaned with OpenRefine (originally developed by Metaweb then Google), an interactive data transformation tool for profiling and cleaning messy data [ 34 ], and then manually grouped based on WoS categories and MeSH Tree Structures according to the following rules. We treated interdisciplinary fields and fields with high occurrence as separate disciplines from their broader fields and aggregated multiple fields that frequently co-occurred under a single department name into a single discipline after reviewing their disciplinary similarities.
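As a rough illustration of the coauthorship step, the sketch below keeps the first listed affiliation for each author and counts how often two different sectors appear on the same paper. The sector labels follow the four categories named above, while the data structure, function names, and example papers are hypothetical; the study's own standardization was done manually.

```python
from collections import Counter
from itertools import combinations


def primary_affiliation(affiliations):
    """Keep the first listed affiliation, treated here as the primary one."""
    return affiliations[0]


def sector_collaborations(papers):
    """Count papers on which each pair of distinct sectors co-occurs.

    `papers` is a list of papers; each paper is a list of authors; each author
    is a list of sector labels in the order the affiliations were listed.
    """
    pairs = Counter()
    for authors in papers:
        sectors = {primary_affiliation(a) for a in authors}
        for pair in combinations(sorted(sectors), 2):
            pairs[pair] += 1
    return pairs


# Hypothetical papers, already standardized to sector labels.
papers = [
    [["university"], ["hospital", "university"], ["university"]],
    [["hospital"], ["company"]],
    [["university"], ["hospital"]],
]
collaborations = sector_collaborations(papers)
```

Sorting the sectors before pairing keeps each pair in a single canonical order, so university and hospital collaborations are not split across two keys.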

Content Analysis

We identified influential studies by examining their citation counts in PubMed and Google Scholar. Citation counts from Google Scholar were considered in addition to PubMed as Google Scholar’s substantial citation data encompasses WoS and Scopus citations [ 35 ]. After sorting the articles in descending order of citations, the 2 sources showed a Spearman rank correlation coefficient of 0.883. From the PubMed top 150 list (ie, citation count >7) and Google Scholar top 150 list (ie, citation count >36), we selected the top 109 articles. Among these, we selected the sources that met the criteria for applying or developing deep learning models as the subjects of analysis to understand the major deep learning architectures in biomedicine and their applications. Specifically, we analyzed the research topics of the studies, the data and architectures used for those purposes, and how the black box problem was addressed.
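The Spearman rank correlation used to compare the two citation sources is the Pearson correlation of the ranks, with tied values sharing their average rank. A minimal standard-library sketch follows; the citation counts are made-up toy values, not data from the study.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy citation counts for the same five articles in two sources.
pubmed_citations = [12, 8, 30, 8, 15]
scholar_citations = [40, 37, 120, 50, 45]
rho = spearman(pubmed_citations, scholar_citations)
```

Because only ranks enter the calculation, the much larger Google Scholar counts do not by themselves lower the coefficient; it drops only when the two sources order the articles differently.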

Cited Reference Analysis

We collected the references from downloaded articles that had PubMed IDs. Citations represent the diffusion of knowledge from cited to citing publications; therefore, analyzing the highly cited references in deep learning studies in biomedicine allows for the investigation of disciplines and studies that have greatly influenced the field. Toward this end, we visualized networks of knowledge diffusion among WoS subjects using Gephi (v0.9.2) [ 36 ] and examined metrics such as modularity, PageRank score, and weighted outdegree, using modularity for community detection [ 37 ]. PageRank indicates the importance of a node by measuring the quantity and quality of its incoming edges [ 38 ], and weighted outdegree measures the total weight of the outgoing edges of a node. We also reviewed the contents of the 10 most highly cited influential works.
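The two node metrics can be illustrated on a small weighted directed graph. The sketch below is a plain power-iteration PageRank and a weighted outdegree sum written with the standard library; it is not Gephi's implementation, and the subject names, edge weights, damping factor, and edge direction (from the subject supplying knowledge to the subject receiving it) are assumptions.

```python
def weighted_outdegree(edges):
    """Sum of outgoing edge weights per node; nodes with no outgoing edges get 0."""
    out = {}
    for (src, dst), weight in edges.items():
        out[src] = out.get(src, 0.0) + weight
        out.setdefault(dst, 0.0)
    return out


def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration; edge weights must be positive."""
    out = weighted_outdegree(edges)
    nodes = sorted(out)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in nodes if out[v] == 0.0)
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for (src, dst), weight in edges.items():
            # Each node passes rank along its edges in proportion to edge weight.
            new_rank[dst] += damping * rank[src] * weight / out[src]
        rank = new_rank
    return rank


# Hypothetical diffusion network: (source subject, target subject) -> weight.
edges = {
    ("Radiology", "Computer Science"): 5,
    ("Computer Science", "Radiology"): 20,
    ("Computer Science", "Neurosciences"): 8,
    ("Radiology", "Neurosciences"): 2,
}
outdegree = weighted_outdegree(edges)
ranks = pagerank(edges)
```

In this toy graph, Computer Science has the largest weighted outdegree because it is the main source of outgoing weight, whereas PageRank rewards subjects that receive weight from well-ranked sources.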

On the basis of the data set, 315 biomedical journals have published deep learning studies, and Table 1 lists the top 10 journals selected based on publication size. Different WoS categories and MeSH terms are separated using semicolons.

Top 10 journals with the highest record counts.

Journal title	Web of Science category	National Library of Medicine catalog Medical Subject Heading term	Publisher	Record count, n
Bioinformatics	Biochemical Research Methods; Mathematical and Computational Biology; Biotechnology and Applied Microbiology	Computational Biology	BMC	38
	Multidisciplinary Sciences	Natural Science Disciplines	Nature Research	37
	Neurosciences; Computer Science, Artificial Intelligence	Nerve Net; Nervous System	Elsevier	35
Engineering in Medicine and Biology Society	N/A	Biomedical Engineering	IEEE	31
	Imaging Science and Photographic Technology; Engineering, Electrical and Electronic; Computer Science, Interdisciplinary Applications; Radiology, Nuclear Medicine, and Medical Imaging; Engineering, Biomedical	Electronics, Medical; Radiography	IEEE	30
	Chemistry, Analytical; Electrochemistry; Instruments and Instrumentation; Engineering, Electrical and Electronic	Biosensing Techniques	Multidisciplinary Digital Publishing Institute	26
	Biochemical Research Methods; Mathematical and Computational Biology; Biotechnology and Applied Microbiology	Computational Biology; Genome	Oxford University Press	22
	Biochemical Research Methods	Biomedical Research/methods; Research Design	Nature Research	21
	Radiology, Nuclear Medicine, and Medical Imaging	Biophysics	American Association of Physicists in Medicine	20
	Multidisciplinary Sciences	Medicine; Science	Public Library of Science	20

a BMC: BioMed Central.

b IEEE: Institute of Electrical and Electronics Engineers.

c N/A: not applicable.

Of the 978 records, 96 (9.8%) were not indexed in the WoS Core Collection and were excluded; the remaining 882 records were assigned an average of 2.02 (SD 1.19) categories each. The top 10 subject categories mostly pertained to (1) biomedicine, with 22.2% (196/882) of articles published in Radiology, Nuclear Medicine, and Medical Imaging (along with Engineering, Biomedical: 121/882, 13.7%; Mathematical and Computational Biology: 107/882, 12.1%; Biochemical Research Methods: 103/882, 11.7%; Biotechnology and Applied Microbiology: 76/882, 8.6%; Neurosciences: 74/882, 8.4%); (2) computer science and engineering (Computer Science, Interdisciplinary Applications: 112/882, 12.7%; Computer Science, Artificial Intelligence: 75/882, 8.5%; Engineering, Electrical and Electronic: 75/882, 8.5%); or (3) Multidisciplinary Sciences (82/882, 9.3%).

For the main MeSH term or descriptor, an average of 9 (SD 4.21) terms was assigned to each record as subjects. Among them, we present in Figure 1 the diseases that were extracted from the C category. In the figure, the area size is proportional to the record count, and the terms are categorized by color. In addition, terms under >1 category were counted multiple times. For instance, the term Digestive System Neoplasms has two parents in MeSH Tree Structures, Neoplasms and Digestive System Diseases , and as such, we counted articles in this category under Neoplasms by Site as well as under Digestive System Neoplasms . Owing to the limited space, 7 categories whose total record counts were ≤10 (eg, Congenital, Hereditary, and Neonatal Diseases and Abnormalities ; Nutritional and Metabolic Diseases ; and Stomatognathic Diseases ) were combined under the Others category, and individual diseases that had <10 record counts were summed up with each other in the same category to show only their total count (or with one of the diseases included as an example). In the process, we identified Neoplasms as the most frequently studied disease type, with a total of 199 studies.
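The multiple-counting rule described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the mapping and the records are hypothetical, and only the Digestive System Neoplasms entry mirrors the example given in the text.

```python
from collections import Counter

# Hypothetical excerpt of a term -> parent-categories mapping; only the
# Digestive System Neoplasms entry reflects the example given in the text.
parents = {
    "Digestive System Neoplasms": ["Neoplasms", "Digestive System Diseases"],
    "Glaucoma": ["Eye Diseases"],
}

# Hypothetical records, each listing the disease terms assigned to it.
records = [
    ["Digestive System Neoplasms"],
    ["Glaucoma"],
    ["Digestive System Neoplasms", "Glaucoma"],
]

def count_by_category(records, parents):
    """Count records per category; a term with several parents counts in each."""
    counts = Counter()
    for terms in records:
        categories = {cat for term in terms for cat in parents[term]}
        counts.update(categories)  # one count per record per category
    return counts

category_counts = count_by_category(records, parents)
```

Because a term with 2 parents contributes to both, the category totals can sum to more than the number of records.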

Figure 1. Disease-related Medical Subject Heading descriptors studied with deep learning.

We further constructed a co-occurrence network of the complete set of major MeSH descriptors assigned to the records to understand the relationships among the biomedical entities. To enhance legibility, we filtered out terms with <5 occurrences. Figure 2 presents the visualized network of nodes (100/966, 10.4% of the total terms) with 612 edges and 7 clusters. In the figure, the sizes of the nodes and edges are proportional to the number of occurrences, and the node color indicates the assigned cluster (although the term deep learning was considered nonexclusive to any cluster as it appeared in all records).
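A co-occurrence network of this kind can be built along the following lines. The sketch assumes that an edge is added for every pair of retained terms appearing in the same record and that the occurrence threshold is applied per term before edges are formed; the function name, the example records, and these details are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(records, min_occurrences=5):
    """Return (node counts, edge counts) for terms seen at least min_occurrences times."""
    # Count each term at most once per record.
    term_counts = Counter(t for terms in records for t in set(terms))
    kept = {t for t, c in term_counts.items() if c >= min_occurrences}
    edges = Counter()
    for terms in records:
        present = sorted(set(terms) & kept)
        # Every unordered pair of retained terms in a record co-occurs once.
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return {t: term_counts[t] for t in kept}, dict(edges)

# Hypothetical records: lists of major descriptors.
records = [["Deep Learning", "Neoplasms"]] * 5 + [["Deep Learning", "Glaucoma"]] * 2
nodes, edges = cooccurrence_network(records)
```

Here Glaucoma occurs only twice and is dropped, leaving 2 nodes joined by a single edge of weight 5.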

Figure 2. Co-occurrence network of the major Medical Subject Heading descriptors (number of nodes=100; number of edges=612; number of clusters=7).

As depicted in Figure 2 , each cluster comprised descriptors from two groups: (1) biomedical domains that deep learning was applied to, including body regions, related diseases, diagnostic imaging methods, and theoretical models, and (2) the purposes of deep learning and techniques used for the tasks, including diagnosis, analysis, and processing of biomedical data. In the first cluster, computer neural networks and software were studied for the purposes of computational biology , specifically protein sequence analysis , drug discovery , and drug design , to achieve precision medicine . These were relevant to the biomedical domains of (1) proteins , related visualization methods ( microscopy ), and biological models , and (2) neoplasms , related drugs ( antineoplastic agents ), and diagnostic imaging ( radiology ). In the second cluster, deep learning and statistical models were used for RNA sequence analysis and computer-assisted radiotherapy planning in relation to the domains of (1) genomics , RNA , and mutation , and (2) brain neoplasms and liver neoplasms . The third cluster comprised (1) heart structures ( heart ventricles ), cardiovascular diseases , and ultrasonography and (2) eye structures ( retina ), diseases ( glaucoma ), and ophthalmological diagnostic techniques . These had been studied for computer-assisted image interpretation using machine learning and deep learning algorithms . The biomedical domain group of the fourth cluster involved specific terms related to neoplasms such as type ( adenocarcinoma ), different regions ( breast neoplasms , lung neoplasms , and colorectal neoplasms ), and respective imaging methods ( mammography and X-ray computed tomography ) to which deep learning and support vector machines have been applied for the purpose of computer-assisted radiographic image interpretation and computer-assisted diagnosis . 
The fifth cluster included (1) brain disorders (Alzheimer disease), neuroimaging, and neurological models; (2) prostatic neoplasms; and (3) diagnostic magnetic resonance imaging and 3D imaging. Supervised machine learning had been used for computer-assisted image processing of these data. In the sixth cluster, automated pattern recognition and computer-assisted signal processing were studied with (1) human activities (eg, movement and face), (2) abnormal brain activities (epilepsy and seizures) and monitoring methods (electroencephalography), and (3) heart diseases and electrocardiography. In the last cluster, medical informatics, specifically data mining and NLP, including speech perception, had been applied to (1) electronic health records, related information storage and retrieval, and theoretical models and (2) skin diseases (skin neoplasms and melanoma) and diagnostic dermoscopy.

To investigate research collaboration within the field, we analyzed paper-based coauthorships using author affiliations with different levels of granularity, including organization and academic disciplines. We extracted organizations from 98.7% (4844/4908) of the total affiliations and visualized the collaboration of different organization types. The top 10 organizations with the largest publication records included Harvard University (37/844, 4.4%), Chinese Academy of Sciences (21/844, 2.5%; eg, Institute of Computing Technology, Institute of Automation, and Shenzhen Institutes of Advanced Technology), Seoul National University (21/844, 2.5%), Stanford University (20/844, 2.4%), Sun Yat-sen University (14/844, 1.7%; eg, Zhongshan Ophthalmic Center and Collaborative Innovation Center of Cancer Medicine), University of California San Diego (14/844, 1.7%; eg, Institute for Genomic Medicine, Shiley Eye Institute, and Institute for Brain and Mind), University of California San Francisco (14/844, 1.7%), University of Michigan (14/844, 1.7%), Yonsei University (14/844, 1.7%), and the University of Texas Health Science Center at Houston (12/844, 1.4%). The extracted organizations were assigned to one of the following four categories according to their main purpose: universities, hospitals, companies, or research institutes and government agencies. Among these, universities participated in most papers (567/844, 67.2%), followed by hospitals (429/844, 50.8%), companies (139/844, 16.5%), and research institutes or government agencies (88/844, 10.4%). We used a co-occurrence matrix to visualize the degrees of organizational collaboration, with the co-occurrence values log normalized to compare the relative differences ( Figure 3 ).
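The co-occurrence matrix of organization types might be computed as in the sketch below. The text does not state the exact normalization, so the use of log(1 + count) is an assumption, as are the example papers and the function name.

```python
import math
from collections import Counter
from itertools import combinations

def type_cooccurrence(papers):
    """Count how often two different organization types appear on the same paper."""
    counts = Counter()
    for types in papers:
        for a, b in combinations(sorted(set(types)), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical papers, each listing the organization types of its affiliations.
papers = [
    ["university", "hospital"],
    ["university", "hospital", "university"],
    ["hospital", "university"],
    ["university", "company"],
]

counts = type_cooccurrence(papers)
# Assumed normalization: log(1 + count), which compresses large counts so
# that rarer pairings remain visible in a heat map.
log_counts = {pair: math.log1p(c) for pair, c in counts.items()}
```

A logarithmic scale keeps the dominant university and hospital pairing from visually overwhelming the less frequent combinations.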

Figure 3. Collaboration of organization types.

From Figure 3 , we found that universities were the most active in collaborative research, particularly with hospitals, followed by companies and research institutes or government agencies. Hospitals also frequently collaborated with companies; however, research institutes or government agencies tended not to collaborate much as they published relatively fewer studies.

We also examined the collaborations among academic disciplines, which we could extract, as described in the Methods section, from 76.24% (3742/4908) of the total affiliations. Approximately half (ie, 386/756, 51.1%) of the papers involved collaboration between disciplines. Figure 4 depicts the network with 36 nodes (36/148, 24.3% of the total) and 267 edges after we filtered out disciplines with a weighted degree <10, where the weighted degree represents the number of times a discipline collaborated with other disciplines. In the figure, the node and edge sizes are proportional to the weighted degree and link strength, respectively, and the node color indicates the assigned cluster.

Figure 4. Collaboration network of academic disciplines (number of nodes=36; number of edges=267; number of clusters=6).

As shown in the figure, the academic disciplines were assigned to 1 of 6 clusters, including 1 engineering-oriented cluster (cluster 1) and other clusters that encompassed biomedical fields. We specifically looked at the degree of collaboration between the biomedical and engineering disciplines. Figure 4 shows that the most prominent collaboration was among Radiology, Medical Imaging, and Nuclear Medicine; Computer Science; and Electronics and Electrical Engineering. There were also strong links between Computer Science or Electronics and Electrical Engineering and Biomedical Informatics, Biomedical Engineering, and Pathology and Laboratory Medicine.

Among the top 10 disciplines in Figure 4 , the following three had published the most papers and had the highest weighted degree and degree centralities: Computer Science (number of papers=195, weighted degree=193, and degree centrality=32); Radiology, Medical Imaging, and Nuclear Medicine (number of papers=168, weighted degree=166, and degree centrality=30); and Electronics and Electrical Engineering (number of papers=161, weighted degree=160, and degree centrality=32). Meanwhile, some disciplines had high weighted degrees compared with their publication counts, indicating their activeness in collaborative research. These included Pathology and Laboratory Medicine (5th in link strength vs 8th in publications) and Public Health and Preventive Medicine (9th in link strength vs 15th in publications). A counterexample was Computational Biology , which was 12th in link strength but 7th in publications.
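The two network measures compared above can be computed from an edge list as in this minimal sketch. It assumes an undirected network in which each pair of disciplines is listed once, takes degree as the number of distinct collaborating disciplines, and takes weighted degree as the total number of collaborations; the edge weights are invented.

```python
from collections import defaultdict

# Hypothetical undirected collaboration edges: (discipline, discipline, papers).
# Each pair of disciplines is assumed to appear only once.
edges = [
    ("Computer Science", "Radiology", 4),
    ("Computer Science", "Pathology", 2),
    ("Radiology", "Pathology", 1),
    ("Computer Science", "Public Health", 1),
]

def degree_metrics(edges):
    """Return (degree, weighted degree) dictionaries for an undirected edge list."""
    degree = defaultdict(int)
    weighted = defaultdict(int)
    for a, b, w in edges:
        for node in (a, b):
            degree[node] += 1      # one more distinct neighbor
            weighted[node] += w    # add the strength of this link
    return dict(degree), dict(weighted)

degree, weighted = degree_metrics(edges)
# Disciplines ordered by weighted degree (link strength), highest first.
by_link_strength = sorted(weighted, key=weighted.get, reverse=True)
```

Ranking disciplines by weighted degree and comparing that order with a ranking by publication count gives the kind of contrast reported for Pathology and Laboratory Medicine and for Computational Biology.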

We analyzed the content of influential studies that had made significant contributions to the field through the application or development of deep learning architectures. We identified these studies by examining the citation counts from PubMed and Google Scholar, assigning the 109 most-cited records to one of the following categories: (1) review , (2) application of existing deep learning architectures to certain biomedical domains (denoted by A ), or (3) development of a novel deep learning model (denoted by D ). Table 2 summarizes the 92 papers assigned to the application or development category according to their research topic in descending order of citation count.

Table 2. Top 92 studies with the highest citation count under the application or development category, according to the research topic.

Research topic and number			Task type		Data		Deep learning architectures

	A1 [ ]	Classification		Retinal disease OCT and chest x-ray with pneumonia		Inception
	A2 [ ]	Segmentation and classification		Retinal disease OCT		U-net and CNN
	A3 [ ]	Classification		Melanoma dermoscopic images		Inception
	A4 [ ]	Survival prediction		Brain glioblastoma MRI		CNN_S
	A6 [ ]	Classification and segmentation		WSI of 13 cancer types		CNN with CAE and DeconvNet
	D1 [ ]	Segmentation		Brain MRI		ResNet based
	A7 [ ]	Prediction		Retinal fundus images with cardiovascular disease		Inception
	D2 [ ]	Tracking		Video of freely behaving animal		ResNet-based DeeperCut subset
	A8 [ ]	Classification		Colonoscopy video of colorectal polyps		Inception
	A9 [ ]	Classification		Lung cancer CT		CNN
	A10 [ ]	Classification and segmentation		Retinal OCT with macular disease		Encoder-decoder CNN
	D3 [ ]	Segmentation		Brain glioma MRI		CNN based
	D4 [ ]	Binding affinities prediction		Protein-ligand complexes as voxel		SqueezeNet based
	A11 [ ]	Survival classification		Brain glioma MRI, functional MRI, and DTI		CNN and mCNN
	A12 [ ]	Classification		Fundus images with glaucomatous optic neuropathy		Inception
	A13 [ ]	Classification		Chest radiographs with pneumonia		ResNet and CheXNet
	A14 [ ]	Classification and segmentation		Critical head abnormality CT		ResNet, U-net, and DeepLab
	A15 [ ]	Classification		Brain glioma MRI		ResNet
	D6 [ ]	Classification		Thoracic disease radiographs		DenseNet based
	A16 [ ]	Classification and segmentation		Echocardiogram video with cardiac disease		VGGNet and U-net
	A17 [ ]	Classification		Brain positron emission tomography with Alzheimer		Inception
	D7 [ ]	Classification		Breast cancer histopathological images		CNN based
	A18 [ ]	Classification		Skin tumor images		ResNet
	A19 [ ]	Classification and prediction		Chest CT with chronic obstructive pulmonary disease and acute respiratory disease		CNN
	A20 [ ]	Segmentation		Brain MRI with autism spectrum disorder		FCNN
	D8 [ ]	Segmentation		Fetal MRI and brain tumor MRI		Proposal network (P-Net) based
	A21 [ ]	Classification, prediction, and reconstruction		Natural movies and functional MRI of watching movies		AlexNet and De-CNN
	D9 [ ]	Detection and classification		Facial images with a genetic syndrome		CNN based
	A22 [ ]	Detection and segmentation		Microscopic images of cells		U-net
	A23 [ ]	Classification and localization		Breast cancer mammograms		Faster region-based CNN with VGGNet
	A24 [ ]	Segmentation and prediction		Lung cancer CT		Mask-RCNN, CNN with GoogLeNet and RetinaNet
	A26 [ ]	Classification		Lung cancer CT		CNN; fully connected NN; SAE
	A27 [ ]	Survival classification		Lung cancer CT		CNN
	A29 [ ]	Prediction		Polar maps of myocardial perfusion imaging with CAD		CNN
	A30 [ ]	Classification		Prostate cancer MRI		CNN
	D12 [ ]	Classification		Liver SWE with chronic hepatitis B		CNN based
	D14 [ ]	Segmentation		Liver cancer CT		DenseNet with U-net based
	A31 [ ]	Classification		Fundus images with macular degeneration		AlexNet, GoogLeNet, VGGNet, inception, ResNet, and inception-ResNet
	A32 [ ]	Classification		Bladder cancer CT		cuda-convnet
	A34 [ ]	Classification		Prostate cancer tissue microarray images		MobileNet
	D19 [ ]	Classification		Holographic microscopy of species		CNN based
	A36 [ ]	Survival classification		Chest CT		CNN
	D20 [ ]	Classification and localization		Malignant lung nodule radiographs		ResNet based
	A37 [ ]	Classification		Shoulder radiographs with proximal humerus fracture		ResNet
	A39 [ ]	Classification		Facial images of hetero and homosexual		VGG-Face
	A41 [ ]	Segmentation and classification		CAD CT angiography		CNN and CAE
	A42 [ ]	Classification and localization		Radiographs with fracture		U-net
	A43 [ ]	Binding classification		Peptide major histocompatibility complex as image-like array		CNN
	A44 [ ]	Detection		Lung nodule CT		CNN
	A45 [ ]	Classification		Confocal endomicroscopy video of oral cancer		LeNet
	A46 [ ]	Classification		WSI of prostate, skin, and breast cancer		MIL with ResNet and RNN
	D24 [ ]	Tracking		Video of freely behaving animal		FCNN based
	D25 [ ]	Segmentation		Fundus images with glaucoma		U-net based
	A47 [ ]	Segmentation and classification		Cardiac disease cine MRI		U-net; M-Net; Dense U-net; SVF-Net; Grid-Net; Dilated CNN
	D27 [ ]	Classification		Knee abnormality MRI		AlexNet based
	D28 [ ]	Binding affinities prediction		Protein-ligand complexes as grid		CNN based
	A50 [ ]	Segmentation		Autosomal dominant polycystic kidney disease CT		FCNN with VGGNet
	A51 [ ]	Segmentation and classification		Knee cartilage lesion MRI		VGGNet
	A52 [ ]	Classification		Mammograms		ResNet
	A54 [ ]	Prediction		CAD CT angiography		FCNN
	D31 [ ]	Classification and localization		WSI of lymph nodes in metastatic breast cancer		Inception based
	D35 [ ]	Classification		Fluorescence microscopic images of cells		FFNN based
	A56 [ ]	Classification		Retinal fundus images with diabetic retinopathy and breast mass mammography		ResNet; GoogLeNet

	A25 [ ]	Artifact reduction		Brain and abdomen CT and radial MR data		U-net
	A28 [ ]	Resolution enhancement		Fluorescence microscopic images		GAN with U-net and CNN
	D15 [ ]	Dealiasing		Compressed sensing brain lesion and cardiac MRI		GAN with U-net and VGGNet based
	D16 [ ]	Resolution enhancement		Superresolution localization microscopic images		GAN with U-net–based pix2pix network modified
	A33 [ ]	Reconstruction		Brain and pelvic MRI and CT		GAN with FCNN and CNN
	D18 [ ]	Artifact reduction		CT		CNN based
	A38 [ ]	Reconstruction		Contrast-enhanced brain MRI		Encoder-decoder CNN
	D22 [ ]	Reconstruction		Brain MR fingerprinting data		FFNN based
	D23 [ ]	Resolution enhancement		Hi-C matrix of chromosomes		CNN based
	A48 [ ]	Resolution enhancement		Brain tumor MRI		U-net
	D26 [ ]	Reconstruction		Lung vessels CT		CNN based
	D32 [ ]	Resolution enhancement		Knee MRI		CNN based
	D33 [ ]	Reconstruction		CT		CNN based
	D34 [ ]	Registration		Cardiac cine MRI and chest CT		CNN based

	D17 [ ]	Novel structures generation and property prediction		SMILES		Stack-RNN with GRU- and LSTM based
	A40 [ ]	Novel structures generation		SMILES		Variational AE; CNN- and RNN with GRU-based AAE
	D21 [ ]	Gene expression (variant effects) prediction		Genomic sequence		CNN based
	D30 [ ]	Novel structures generation and classification		SMILES		GAN with differentiable neural computer and CNN based
	A53 [ ]	Novel structures generation		SMILES		LSTM
	A57 [ ]	Classification		Antimicrobial peptide sequence		CNN with LSTM

	D13 [ ]	Contact prediction		Protein sequence to contact matrix		ResNet based

	A5 [ ]	Subtype identification (survival classification)		Multi-omics data from liver cancer		AE
	D5 [ ]	Phenotype prediction		Genotype		GoogLeNet and deeply supervised net based
	D10 [ ]	Survival prediction		Genomic profiles from cancer		FFNN based
	D11 [ ]	Drug synergies prediction		Gene expression profiles of cancer cell line and chemical descriptors of drugs		FFNN based
	A35 [ ]	NLP (classification)		Electronic health record with pediatric disease		Attention-based BLSTM
	A49 [ ]	Binding classification		Protein sequence as matrix and drug molecular fingerprint		SAE
	D29 [ ]	Classification		Electrocardiogram signal		BLSTM based
	A55 [ ]	Classification		Polysomnogram signal		CNN

a OCT: optical coherence tomography.

b CNN: convolutional neural network.

c MRI: magnetic resonance imaging.

d WSI: whole slide image.

e CAE: convolutional autoencoder.

f ResNet: residual networks.

g CT: computed tomography.

h DTI: diffusion tensor imaging.

i mCNN: multicolumn convolutional neural network.

j FCNN: fully convolutional neural network.

k SAE: stacked autoencoder.

l CAD: coronary artery disease.

m SWE: shear wave elastography.

n MIL: multiple instance learning.

o FFNN: feedforward neural network.

p MR: magnetic resonance.

q GAN: generative adversarial network.

r SMILES: simplified molecular input line-entry system.

s RNN: recurrent neural network.

t GRU: gated recurrent unit.

u LSTM: long short-term memory.

v AE: autoencoder.

w AAE: adversarial autoencoder.

x NLP: natural language processing.

y BLSTM: bidirectional long short-term memory.

Research Topics

In these studies, researchers applied or developed deep learning architectures mainly for the following purposes: image analysis, especially for diagnostic purposes, including the classification or prediction of diseases or survival, and the detection, localization, or segmentation of certain areas or abnormalities. The latter 3 tasks all aim to identify the location of an object of interest but differ in output: detection involves a single reference point, localization involves an area identified through a bounding box, saliency map, or heatmap, and segmentation involves a precise area with clear outlines identified through pixel-wise analysis. Meanwhile, in some studies, models for image analysis unrelated to diagnosis were proposed, such as classifying or segmenting cells in microscopic images and tracking moving animals in videos through pose estimation. Another major objective involved image processing for reconstructing or registering medical images. This included enhancing low-resolution images to high resolution, reconstructing images with different modalities or synthesized targets, reducing artifacts, dealiasing, and aligning medical images.
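The difference between the 3 location-related outputs can be made concrete with a toy example. Starting from a small binary mask (invented here), the sketch below derives a pixel set as the segmentation output, a bounding box as one possible localization output, and a centroid as one possible single reference point for detection; real models predict these outputs directly rather than deriving them from one another.

```python
# Hypothetical 4 x 5 binary mask; 1 marks pixels belonging to the object.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

# Segmentation: the precise set of object pixels as (row, column) pairs.
pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]

rows = [r for r, _ in pixels]
cols = [c for _, c in pixels]

# Localization: an area, here the tightest bounding box
# (top row, left column, bottom row, right column).
bbox = (min(rows), min(cols), max(rows), max(cols))

# Detection: a single reference point, here the centroid of the object.
centroid = (sum(rows) / len(rows), sum(cols) / len(cols))
```

The bounding box covers 6 pixels although only 5 belong to the object, which is the loss of precision that separates localization from segmentation.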

Meanwhile, several researchers used deep learning architectures to analyze molecules, proteins, and genomes for various purposes. These included drug design or discovery, specifically for generating novel molecular structures through sequence analysis and for predicting binding affinities through image analysis of complexes; understanding protein structure through image analysis of contact matrix; and predicting phenotypes, cancer survival, drug synergies, and genomic variant effects from genes or genomes. Finally, in some studies, deep learning was applied to the diagnostic classification of sequential data, including electrocardiogram or polysomnogram signals and electronic health records. In summary, in the reviewed literature, we identified a predominant focus on applying or developing deep learning models for image analysis regarding localization or diagnosis and image processing, with a few studies focusing on protein or genome analysis.

Deep Learning Architectures

Regarding the main architectures, most models were CNNs based on ≥1 CNN architecture, such as the fully convolutional neural network (FCNN) and its variants, including U-net; residual neural network (ResNet) and its variants; GoogLeNet (Inception v1) or Inception; VGGNet and its variants; and other architectures. Meanwhile, a few researchers based their models on feedforward neural networks that were not CNNs, including autoencoders (AEs) such as convolutional AE and stacked AE. Others adapted RNNs, including (bidirectional) long short-term memory and gated recurrent unit. Furthermore, models that combined RNNs or AEs with CNNs were also proposed.

Content analysis of the reviewed literature showed that different deep learning architectures were used for different research tasks. Models for classification or prediction tasks using images were predominantly CNN based, with most being ResNet and GoogLeNet or Inception. ResNet with shortcut connections [ 129 ] and GoogLeNet or Inception with 1×1 convolutions, factorized convolutions, and regularizations [ 130 , 131 ] allow networks of increased depth and width by mitigating problems such as vanishing gradients and high computational cost. These models mostly analyzed medical images from magnetic resonance imaging or computed tomography, with cancer-related images often used as input data for diagnostic classification, in addition to image-like representations of protein complexes. Meanwhile, when applying these tasks to data other than images, such as genomic or gene expression profiles and protein sequence matrices, researchers used feedforward neural networks, including AEs, that enabled semi- or unsupervised learning and dimensionality reduction.
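The idea behind a shortcut connection can be sketched in a few lines. The block below adds the input back to the output of 2 small fully connected layers, so the layers only need to learn a residual F(x); real ResNet blocks use convolutions and normalization layers, and the weights here are invented for illustration.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]

def residual_block(x, W1, W2):
    """Return x + F(x), where F(x) = W2 * relu(W1 * x)."""
    fx = matvec(W2, relu(matvec(W1, x)))
    return [a + b for a, b in zip(x, fx)]  # the shortcut adds the input back

x = [1.0, -2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# With all-zero weights the block reduces to the identity mapping.
same = residual_block(x, zeros, zeros)
# With identity weights, F(x) = relu(x), so the output is x + relu(x).
shifted = residual_block(x, identity, identity)
```

Because the block passes its input through unchanged when the learned residual is zero, stacking more blocks does not have to degrade the signal, which is one intuition for why such networks can be made deeper.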

Image analysis for segmentation and image processing were achieved through CNN-based architectures as well, with most of them being FCNNs, especially U-net. FCNNs produce an input-sized pixel-wise prediction by replacing the last fully connected layers with convolutional layers, making them advantageous for the abovementioned tasks [ 132 ], and U-net improves performance through long skip connections that concatenate feature maps from the encoder path to the decoder path [ 133 ]. In particular, for medical image processing tasks, a few researchers combined FCNNs (U-net) with other CNNs by adopting the generative adversarial network structure, which generates new instances that mimic the real data through an adversarial process between the generator and discriminator [ 134 ]. We found that images of the brain were often used as input data for these studies.
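The 2 mechanisms mentioned above can be illustrated with nested lists standing in for feature maps. In this sketch, a skip connection concatenates encoder and decoder feature maps along the channel axis, and a 1×1 convolution then produces one score per pixel, giving an output with the same height and width as the input maps. The feature values and weights are invented, and real networks operate on learned, much larger tensors.

```python
def concat_channels(encoder_maps, decoder_maps):
    """Skip connection: stack feature maps (lists of H x W grids) along channels."""
    h, w = len(encoder_maps[0]), len(encoder_maps[0][0])
    for ch in encoder_maps + decoder_maps:
        if len(ch) != h or any(len(row) != w for row in ch):
            raise ValueError("all feature maps must share the same height and width")
    return encoder_maps + decoder_maps

def pointwise_conv(maps, weights, bias=0.0):
    """1 x 1 convolution: a weighted sum over channels, computed at every pixel."""
    h, w = len(maps[0]), len(maps[0][0])
    return [
        [bias + sum(wt * ch[r][c] for wt, ch in zip(weights, maps)) for c in range(w)]
        for r in range(h)
    ]

# Hypothetical 2 x 2 feature maps: 1 encoder channel and 2 decoder channels.
encoder = [[[1, 0], [0, 1]]]
decoder = [[[2, 2], [2, 2]], [[0, 1], [1, 0]]]

merged = concat_channels(encoder, decoder)          # 3 channels
scores = pointwise_conv(merged, [1.0, 0.5, -1.0])   # one score per pixel
```

Because the final layer is applied at every pixel instead of to a flattened vector, the prediction keeps the spatial layout of the input, which is what makes this design suited to segmentation and image-to-image tasks.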

On the other hand, RNNs were applied to sequence analysis of the string representation of molecules (simplified molecular input line-entry system) and pattern analysis of sequential data such as signals. A few of these models, especially those generating novel molecular structures, combined RNNs with CNNs by adopting generative adversarial networks, including adversarial AE. In summary, the findings showed that the current deep learning models were predominantly CNN based, that most of them focused on analyzing medical image data, and that different architectures were preferred for different tasks.
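To show how a molecule string becomes a sequence that an RNN can consume, the sketch below one-hot encodes SMILES strings character by character. Character-level tokenization is a simplifying assumption made here: it would split two-letter atom symbols such as Cl or Br, which practical tokenizers handle as single tokens. The helper names are hypothetical.

```python
def build_vocab(smiles_list):
    """Map every character seen in the SMILES strings to an integer index."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i for i, ch in enumerate(chars)}

def one_hot(smiles, vocab):
    """Encode a SMILES string as a sequence of one-hot vectors, one per character."""
    return [[1 if vocab[ch] == i else 0 for i in range(len(vocab))] for ch in smiles]

# "CCO" is ethanol and "C=O" is formaldehyde.
smiles_list = ["CCO", "C=O"]
vocab = build_vocab(smiles_list)
encoded = one_hot("CCO", vocab)
```

Each row of the encoded sequence is fed to the recurrent network at one time step, and a generative model trained on such sequences emits new strings one character at a time.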

Among these studies, Table 3 shows, in detail, the objectives and the proposed methods of the 35 studies with novel model development.

Table 3. Content analysis of the top 35 records in the development category.

Number	Development objectives	Methods (proposed model)
D1	Segment brain anatomical structures in 3D MRI	Voxelwise Residual Network: trained through residual learning of volumetric feature representation and integrated with contextual information of different modalities and levels
D2	Estimate poses to track body parts in various animal behaviors	DeeperCut’s subset DeepLabCut: network fine-tuned on labeled body parts, with deconvolutional layers producing spatial probability densities to predict locations
D3	Predict isocitrate dehydrogenase 1 mutation in low-grade glioma with MRI radiomics analysis	Deep learning–based radiomics: segment tumor regions and directly extract radiomics image features from the last convolutional layer, which is encoded for feature selection and prediction
D4	Predict protein-ligand binding affinities represented by 3D descriptors	KDEEP: 3D network to predict binding affinity using voxel representation of protein-ligand complex with assigned property according to its atom type
D5	Predict phenotype from genotype through the biological hierarchy of cellular subsystems	DCell: visible neural network with structure following cellular subsystem hierarchy to predict cell growth phenotype and genetic interaction from genotype
D6	Classify and localize thoracic diseases in chest radiographs	DenseNet-based CheXNeXt: networks trained for each pathology to predict its presence and ensemble and localize indicative parts using class activation mappings
D7	Multi-classification of breast cancer from histopathological images	CSDCNN: trained through end-to-end learning of hierarchical feature representation and optimized feature space distance between breast cancer classes
D8	Interactive segmentation of 2D and 3D medical images fine-tuned on a specific image	Bounding box and image-specific fine-tuning–based segmentation: trained for interactive image segmentation using bounding box and fine-tuned for specific image with or without scribble and weighted loss function
D9	Facial image analysis for identifying phenotypes of genetic syndromes	DeepGestalt: preprocessed for face detection and multiple regions and extracts phenotype to predict syndromes per region and aggregate probabilities for classification
D10	Predict cancer outcomes with genomic profiles through survival models optimization	SurvivalNet: deep survival model with high-dimensional genomic input and Bayesian hyperparameter optimization, interpreted using risk backpropagation
D11	Predict synergy effect of novel drug combinations for cancer treatment	DeepSynergy: predicts drug synergy value using cancer cell line gene expressions and chemical descriptors, which are normalized and combined through conic layers
D12	Classify liver fibrosis stages in chronic hepatitis B using radiomics of SWE	DLRE: predict the probability of liver fibrosis stages with quantitative radiomics approach through automatic feature extraction from SWE images
D13	Predict protein residue contact map at pixel level with protein features	RaptorX-Contact: combined networks to learn contact occurrence patterns from sequential and pairwise protein features to predict contacts simultaneously at pixel level
D14	Segment liver and tumor in abdominal CT scans	Hybrid Densely connected U-net: 2D and 3D networks to extract intra- and interslice features with volumetric contexts, optimized through hybrid feature fusion layer
D15	Reconstruct compressed sensing MRI to dealiased image	DAGAN: conditional GAN stabilized by refinement learning, with the content loss combined adversarial loss incorporating frequency domain data
D16	Reconstruct sparse localization microscopy to superresolution image	Artificial Neural Network Accelerated–Photoactivated Localization Microscopy: trained with superresolution PALM as the target, compares reconstructed and target with loss functions containing conditional GAN
D17	Generate novel chemical compound design with desired properties	Reinforcement Learning for Structural Evolution: generate chemically feasible molecule as strings and predict its property, which is integrated with reinforcement learning to bias the design
D18	Reduce metal artifacts in reconstructed x-ray CT images	CNN-based Metal Artifact Reduction: trained on images processed by other Metal Artifact Reduction methods and generates prior images through tissue processing and replaces metal-affected projections
D19	Predict species to identify anthrax spores in single cell holographic images	HoloConvNet: trained with raw holographic images to directly recognize interspecies difference through representation learning using error backpropagation
D20	Classify and detect malignant pulmonary nodules in chest radiographs	Deep learning–based automatic detection: predict the probability of nodules per radiograph for classification and detect nodule location per nodule from activation value
D21	Predict tissue-specific gene expression and genomic variant effects on the expression	ExPecto: predict regulatory features from sequences and transform to spatial features and use linear models to predict tissue-specific expression and variant effects
D22	Reconstruct MRF to obtain tissue parameter maps	Deep reconstruction network: trained with a sparse dictionary that maps magnitude image to quantitative tissue parameter values for MRF reconstruction
D23	Generate high-resolution Hi-C interaction matrix of chromosomes from a low-resolution matrix	HiCPlus: predict high-resolution matrix through mapping regional interaction features of low-resolution to high-resolution submatrices using neighboring regions
D24	Estimate poses to track body parts of freely moving animals	LEAP: videos preprocessed for egocentric alignment and body parts labeled using GUI and predicts each location by confidence maps with probability distributions
D25	Jointly segment optic disc and cup in fundus images for glaucoma screening	M-Net: multi-scale network for generating multi-label segmentation prediction maps of disc and cup regions using polar transformation
D26	Reconstruct limited-view PAT to high-resolution 3D images	Deep gradient descent: learned iterative image reconstruction, incorporated with gradient information of the data fit separately computed from training
D27	Predict classifications of and localize knee injuries from MRI	MRNet: networks trained for each diagnosis according to a series to predict its presence and combine probabilities for classification using logistic regression
D28	Predict binding affinities between 3D structures of protein-ligand complexes	Pafnucy: structure-based prediction using 3D grid representation of molecular complexes with different orientations as having same atom types
D29	Classify electrocardiogram signals based on wavelet transform	Deep bidirectional LSTM network–based wavelet sequences: generate decomposed frequency subbands of electrocardiogram signal as sequences by wavelet-based layer and use as input for classification
D30	Generate novel small molecule structures with possible biological activity	Reinforced Adversarial Neural Computer: combined with GAN and reinforcement learning, generates sequences matching the key feature distributions in the training molecule data
D31	Detect and localize breast cancer metastasis in digitized lymph nodes slides	LYmph Node Assistant: predict the likelihood of tumor in tissue area and generate a heat map for slides identifying likely areas
D32	Transform low-resolution thick slice knee MRI to high-resolution thin slices	DeepResolve: trained to compute residual images, which are added to low-resolution images to generate their high-resolution images
D33	Reconstruct sparse-view CT to suppress artifact and preserve feature	Learned Experts’ Assessment–Based Reconstruction Network: iterative reconstruction using previous compressive sensing methods, with fields of expert-applied regularization terms learned iteration dependently
D34	Unsupervised affine and deformable aligning of medical images	Deep Learning Image Registration: multistage registration network and unsupervised training to predict transformation parameters using image similarity and create warped moving images
D35	Classify subcellular localization patterns of proteins in microscopy images	Localization Cellular Annotation Tool: predict localization per cell for image-based classification of multi-localizing proteins, combined with gamer annotations for transfer learning

a MRI: magnetic resonance imaging.

b CSDCNN: class structure-based deep convolutional neural network.

c SWE: shear wave elastography.

d DLRE: deep learning radiomics of elastography.

e CT: computed tomography.

f DAGAN: Dealiasing Generative Adversarial Networks.

g GAN: generative adversarial network.

h PALM: photoactivated localization microscopy.

i CNN: convolutional neural network.

j MRF: magnetic resonance fingerprinting.

k LEAP: LEAP Estimates Animal Pose.

l GUI: graphical user interface.

m PAT: photoacoustic tomography.

n LSTM: long short-term memory.

Black Box Problem

In quite a few of the reviewed studies, the black box problem of deep learning was partly addressed, as researchers implemented various methods to improve model interpretability. To understand the prediction results of image analysis models, most used one of the following two techniques to visualize the important regions: (1) activation-based heatmaps [ 45 , 54 , 65 , 70 ], especially class activation maps [ 57 , 61 , 77 , 92 ], and saliency maps [ 59 ] and (2) occlusion testing [ 39 , 75 , 82 , 94 ]. For models analyzing data other than images, there were no generally accepted techniques for model interpretation, and researchers suggested some methods, including adopting an interpretable hierarchical structure such as the cellular subsystem [ 122 ] or anatomical division [ 125 ], using backpropagation [ 123 ], observing gate activations of cells in the neural network [ 114 ], or investigating how corrupted input data affect the prediction and how identical predictions are made for different inputs [ 93 ]. As such, various methods were found to be used to tackle this well-known limitation of deep learning.
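
As a minimal sketch of occlusion testing, the second visualization technique mentioned above, the following Python slides a blank patch across an image and records how much a model's score drops; large drops mark regions the prediction depends on. The `toy_predict` function is a hypothetical stand-in for a trained network, and the patch size and fill value are illustrative assumptions rather than settings taken from any reviewed study.

```python
import numpy as np

def occlusion_map(image, predict, patch=4, fill=0.0):
    """Return a coarse heatmap of score drops caused by occluding each patch."""
    base = predict(image)
    h, w = image.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = fill  # blank out one patch
            heat[i // patch, j // patch] = base - predict(occluded)
    return heat

# Hypothetical stand-in for a trained classifier: the score is the mean
# intensity of the central 4x4 region, so only that region should matter.
def toy_predict(img):
    return float(img[4:8, 4:8].mean())

image = np.ones((12, 12))
heat = occlusion_map(image, toy_predict)
print(heat)
```

With this toy model, only the central cell of the heatmap is nonzero, which is the pattern an occlusion test is meant to reveal.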

On average, each examined deep learning study with at least one PubMed-indexed citation (429/978, 43.9%) had 25.8 (SD 20.0) citations. These cited references comprised 9373 unique records that were cited 1.27 times on average (SD 2.16). Excluding the records that were unindexed in the WoS Core Collection (755/9373, 8.06% of the unique records), an average of 1.77 (SD 1.07) categories were assigned to each of the remaining 8618 records. The top 10 WoS categories, which were assigned to the greatest number of total cited references, pertained to the following three major groups: (1) biomedicine ( Radiology, Nuclear Medicine, and Medical Imaging : 2025/11,033, 18.35%; Biochemical Research Methods : 1118/11,033, 10.13%; Mathematical and Computational Biology : 1066/11,033, 9.66%; Biochemistry and Molecular Biology : 1043/11,033, 9.45%; Engineering, Biomedical : 981/11,033, 8.89%; Biotechnology and Applied Microbiology : 916/11,033, 8.3%; Neurosciences : 844/11,033, 7.65%), (2) computer science and engineering ( Computer Science, Interdisciplinary Applications : 1041/11,033, 9.44%; Engineering, Electrical and Electronic : 645/11,033, 5.85%), and (3) Multidisciplinary Sciences (with 1411/11,033, 12.79% records).

To understand the intellectual structure of how knowledge is transferred among different areas of study through citations, we visualized the citation network of WoS subject categories. In the directed citation network shown in Figure 5 , the edges were directed clockwise with the source nodes as the WoS categories of the deep learning studies we examined and the target nodes as the WoS categories of the cited references from which knowledge was obtained. To enhance legibility, we filtered out categories with <100 weighted degrees, excluding self-loops, to form a network of 20 nodes (20/158, 12.7% of the total) and 59 edges (59/2380, 2.48% of the total). In the figure, the node color and size are proportional to the PageRank score (probability 0.85; ε=0.001; Figure 5 A) and weighted outdegree ( Figure 5 B), and the edge size and color are proportional to the link strength. PageRank considers not only the quantity but also the quality of incoming edges, identifying important exporters for knowledge diffusion based on how often and by which fields a node is cited. On the other hand, the weighted outdegree measures outgoing edges and identifies major knowledge importers that frequently cite other fields.

Citation network of the Web of Science subject categories assigned to the reviewed publications and their cited references according to (A) PageRank and (B) weighted outdegree (number of nodes=20; number of edges=59).

As depicted in Figure 5 A, categories with high PageRank scores mostly coincided with the frequently cited fields identified above and were grouped into two communities through modularity (upper half and lower half). The upper half region centered on Radiology, Nuclear Medicine, and Medical Imaging , which had the highest PageRank score (0.191) and proved to be a field with a significant influence on deep learning studies in biomedicine. Meanwhile, important knowledge exporters to this field included Engineering, Biomedical (0.134); Engineering, Electrical and Electronic (0.110); and Computer Science, Interdisciplinary Applications (0.091). The lower half region mainly comprised categories with comparable PageRank scores in which knowledge was frequently exchanged between one another, including Biochemical Research Methods (0.053), Multidisciplinary Sciences (0.053), Biochemistry and Molecular Biology (0.052), Biotechnology and Applied Microbiology (0.050), and Mathematical and Computational Biology (0.048). Specifically, in Figure 5 B, Mathematical and Computational Biology (1992), Biotechnology and Applied Microbiology (1836), and Biochemical Research Methods (1807) were identified as major knowledge importers with the highest weighted outdegrees, whereas Biochemistry and Molecular Biology (344) had a relatively low weighted outdegree, indicating its role as a source of knowledge for these fields.
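
The two node measures described above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the software the authors used: the category names and edge weights are invented, the damping factor 0.85 and tolerance 0.001 mirror the reported "probability 0.85; ε=0.001", and using edge weights inside PageRank is an assumption, since the text does not say whether the scores were weighted. Edges point from the citing category to the cited category.

```python
import numpy as np

def weighted_outdegree(edges, nodes):
    """Sum of outgoing edge weights per node, ignoring self-loops."""
    out = {n: 0.0 for n in nodes}
    for (src, dst), w in edges.items():
        if src != dst:
            out[src] += w
    return out

def pagerank(edges, nodes, damping=0.85, tol=1e-3, max_iter=200):
    """Weighted PageRank by power iteration; returns {node: score}."""
    idx = {n: k for k, n in enumerate(nodes)}
    n = len(nodes)
    W = np.zeros((n, n))
    for (src, dst), w in edges.items():
        if src != dst:                      # self-loops are excluded
            W[idx[src], idx[dst]] += w
    out = W.sum(axis=1)
    has_out = out > 0
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # Each node passes its rank along outgoing edges in proportion to weight.
        passed = (rank[has_out] / out[has_out]) @ W[has_out]
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = rank[~has_out].sum()
        new = (1 - damping) / n + damping * (passed + dangling / n)
        converged = np.abs(new - rank).sum() < tol
        rank = new
        if converged:
            break
    return dict(zip(nodes, rank))

# Invented toy network: (citing category, cited category) -> citation count.
nodes = ["Imaging", "Engineering", "CompBio"]
edges = {
    ("CompBio", "Imaging"): 30,
    ("CompBio", "Engineering"): 10,
    ("Engineering", "Imaging"): 20,
    ("Imaging", "Engineering"): 5,
    ("Imaging", "Imaging"): 50,   # self-loop, dropped by both measures
}
outdeg = weighted_outdegree(edges, nodes)
scores = pagerank(edges, nodes)
print(outdeg)
print(scores)
```

In this toy network, CompBio cites the most (highest weighted outdegree) but is never cited, so it receives only the baseline PageRank share. This is the importer pattern the text contrasts with exporters that score high on PageRank.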

We analyzed the 10 most frequently cited studies to gain an in-depth understanding of the most influential works and assigned these papers to one of the three categories: review, application, or development. Review articles provided comprehensive overviews of the development and applications of deep learning [ 1 , 3 ], with 1 focusing on applications to medical image analysis [ 4 ]. We summarize the 7 application (denoted by A ) or development (denoted by D ) studies in Table 4 .

Table 4. Content analysis matrix of the highly cited references in the application or development category.

Category	Citation count, n	Research topic: task type	Objectives	Methods (deep learning architectures)
A1 [ ]	53	Diagnostic image analysis: classification	Apply CNN to classifying skin lesions from clinical images	Inception version 3 fine-tuned end to end with images; tested against dermatologists on 2 binary classifications
A2 [ ]	51	Diagnostic image analysis: classification	Apply CNN to detecting referable diabetic retinopathy on retinal fundus images	Inception version 3 trained and validated using 2 data sets of images graded by ophthalmologists
D1 [ ]	34	Computer science	Develop a new gradient-based RNN to solve error backflow problems	LSTM achieved constant error flow through memory cells regulated by gate units; tested numerous times against other methods
D2 [ ]	33	Sequence analysis: binding (variant effects) prediction	Propose a predictive model for sequence specificities of DNA- and RNA-binding proteins	CNN-based DeepBind trained fully automatically through parallel implementation to predict and visualize binding specificities and variation effects
A3 [ ]	27	Diagnostic image analysis: classification	Evaluate factors of using CNNs for thoracoabdominal lymph node detection and interstitial lung disease classification	Compare performances of AlexNet, CifarNet, and GoogLeNet trained with transfer learning and different data set characteristics
D3 [ ]	23	Sequence analysis: chromatin profiles (variant effects) prediction	Propose a model for predicting noncoding variant effects from genomic sequence	CNN-based DeepSEA trained for chromatin profile prediction to estimate variant effects with single nucleotide sensitivity and prioritize functional variants
A4 [ ]	23	Diagnostic image analysis: classification	Evaluate CNNs for tuberculosis detection on chest radiographs	Compare performances of AlexNet and GoogLeNet and ensemble of 2 trained with transfer learning, augmented data set, and radiologist-augmented approach

a CNN: convolutional neural network.

b RNN: recurrent neural network.

c LSTM: long short-term memory.

In these studies, excluding the study by Hochreiter and Schmidhuber [ 135 ], whose research topic pertained to computer science, deep learning was used for diagnostic image analysis of various areas [ 12 - 14 , 136 ] and for sequence analysis of proteins [ 21 ] or genomes [ 22 ]. The main architectures implemented to achieve the different research objectives mostly comprised CNNs [ 12 - 14 , 136 ] or CNN-based novel models [ 21 , 22 ] and RNNs [ 135 ]. The findings indicated that these deep neural networks either outperformed previous methods or achieved a performance comparable with that of human experts.

Principal Findings

With the increase in biomedical research using deep learning techniques, we aimed to gain a quantitative and qualitative understanding of the scientific domain, as reflected in the published literature. For this purpose, we conducted a scientometric analysis of deep learning studies in biomedicine.

Through the metadata and content analyses of bibliographic records, we identified the current leading fields and research topics, the most prominent being radiology and medical imaging. Other biomedical fields that have led this domain included biomedical engineering, mathematical and computational biology, and biochemical research methods. As part of interdisciplinary research, computer science and electrical engineering were important fields as well. The major research topics that were studied included computer-assisted image interpretation and diagnosis (which involved localizing or segmenting certain areas for classifying or predicting diseases), image processing such as medical image reconstruction or registration, and sequence analysis of proteins or RNA to understand protein structure and discover or design drugs. These topics were particularly prevalent in their application to neoplasms.

Furthermore, although deep learning techniques that had been proposed for these themes were predominantly CNN based, different architectures were preferred for different research tasks. The findings showed that CNN-based models mostly focused on analyzing medical image data, with RNN architectures for sequential data analysis and AEs for unsupervised dimensionality reduction yet to be actively explored. Other deep learning methods, such as deep belief networks [ 137 , 138 ], deep Q network [ 139 ], and dictionary learning [ 140 ], have also been applied to biomedical research but were excluded from the content analysis because of low citation count. As deep learning is a rapidly evolving field, future biomedical researchers should pay attention to the emerging trends and stay aware of state-of-the-art models for enhanced performance, such as transformer-based models, including bidirectional encoder representations from transformers for NLP [ 141 ]; wav2vec for speech recognition [ 142 ]; and the Swin transformer for computer vision tasks of image classification, segmentation, and object detection [ 143 ].

The findings from the analysis of the cited references revealed patterns of knowledge diffusion. In the analysis, radiology and medical imaging appeared to be the most significant knowledge source and an important field in the knowledge diffusion network. Relatedly, we identified knowledge exporters to this field, including biomedical engineering, electrical engineering, and computer science, as important, despite their relatively low citation counts. Furthermore, citation patterns revealed clique-like relationships among the four fields—biochemical research methods, biochemistry and molecular biology, biotechnology and applied microbiology, and mathematical and computational biology—with each being a source of knowledge and diffusion for the others.

Beyond knowledge diffusion, knowledge integration was also encouraged through collaboration among authors from different organizations and academic disciplines. Coauthorship analysis revealed active research collaboration between universities and hospitals and between hospitals and companies. Separately, we identified an engineering-oriented cluster and biomedicine-oriented clusters of disciplines, among which we observed a range of disciplinary collaborations. The 2 most prominent collaborations involved radiology and medical imaging, computer science, and electrical engineering, which were the 3 disciplines most involved in publishing and collaboration. Meanwhile, pathology and public health showed a high ratio of collaborative research to publications, whereas computational biology showed a low ratio.

Limitations

This study has the following limitations that may have affected data analysis and interpretation. First, focusing only on published studies may have underrepresented the field. Second, publication data were only retrieved from PubMed; although PubMed is one of the largest databases for biomedical literature, other databases such as DBLP (DataBase systems and Logic Programming) may also include relevant studies. Third, the use of PubMed limited our data to biomedical journals and proceedings. Given that deep learning is an active research area in computer science, computer science conference articles are valuable sources of data that were not considered in this study. Finally, our current data retrieval strategy involved searching deep learning as the major MeSH term, which increased precision but may have omitted relevant studies that were not explicitly tagged as deep learning . We plan to expand our scope in future work to consider other bibliographic databases and search terms as well.

In this study, we investigated the landscape of deep learning research in biomedicine and identified major research topics, influential works, knowledge diffusion, and research collaboration through scientometric analyses. The results showed a predominant focus on research applying deep learning techniques, especially CNNs, to radiology and medical imaging and confirmed the interdisciplinary nature of this domain, especially between engineering and biomedical fields. However, diverse biomedical applications of deep learning in the fields of genetics and genomics, medical informatics focusing on text or speech data, and signal processing of various activities (eg, brain, heart, and human) will further boost the contribution of deep learning in addressing biomedical research problems. As such, although deep learning research in biomedicine has been successful, we believe that there is a need for further exploration, and we expect the results of this study to help researchers and communities better align their present and future work.

Abbreviations

AE	autoencoder
CNN	convolutional neural network
FCNN	fully convolutional neural network
MeSH	Medical Subject Heading
NLP	natural language processing
ResNet	residual neural network
RNN	recurrent neural network
WoS	Web of Science

Authors' Contributions: SN and YZ designed the study. SN, DK, and WJ analyzed the data. SN took the lead in the writing of the manuscript. YZ supervised and implemented the study. All authors contributed to critical edits and approved the final manuscript.

Conflicts of Interest: None declared.

Study explores pandemic-related shifts in alcohol sales across 16 U.S. states

Trends examined with machine-learning analysis, which could help inform public health policies

By PLOS ONE and the University at Buffalo

Release Date: December 20, 2021

Yingjie Hu, assistant professor of geography in the UB College of Arts and Sciences.

Brian Quigley, research assistant professor of medicine in the Jacobs School of Medicine and Biomedical Sciences at UB and the UB Clinical and Research Institute on Addictions.

Dane Taylor, assistant professor of mathematics in the UB College of Arts and Sciences.

BUFFALO, N.Y. — An analysis of data from 16 U.S. states suggests that the first few months of the COVID-19 pandemic saw increases in wine and spirit sales, accompanied by notable changes in the relationship between alcohol sales and people’s visits to businesses that sell alcohol.

University at Buffalo researchers Yingjie Hu, Brian M. Quigley and Dane Taylor present these findings in the open-access journal PLOS ONE on Dec. 17. The team notes that trends varied by state.

After U.S. states implemented stay-at-home orders and other restrictions to reduce the spread of COVID-19 in March 2020, anecdotes suggested an increase in alcohol sales. However, data-driven investigations into whether alcohol sales and use did indeed increase have produced mixed results.

To help clarify the potential impact of COVID-19 lockdowns and other social distancing measures on the dynamics of alcohol sales, Hu and colleagues conducted an analysis of relevant data from 16 U.S. states, comparing the period from March to June 2020 to the same period in 2018 and 2019.

“Anonymized human mobility data and geospatial analysis help us understand how people’s visiting behavior to alcohol outlets changed during the stay-at-home period of COVID-19, and how such behavior change varied across different geographic regions,” says Hu, PhD, an assistant professor of geography in the UB College of Arts and Sciences.

“Understanding how alcohol purchase behavior is changed by events such as COVID is important because heavy alcohol use is known to be associated with numerous social problems, especially within the home,” says Quigley, PhD, research assistant professor of medicine in the Jacobs School of Medicine and Biomedical Sciences at UB and the UB Clinical and Research Institute on Addictions.

Using a variety of analytical techniques, including machine-learning methods, they evaluated monthly alcohol sales data reported by the U.S. National Institute on Alcohol Abuse and Alcoholism (NIAAA), as well as anonymized mobility data from over 45 million smart mobile devices (mostly smartphones) indicating people’s visits to businesses where alcohol is sold. (The NIAAA data used in the study focuses on monthly sales of alcohol for 14 U.S. states. It includes sales of spirits, wine and beer, but not all states report data in all of those categories. The anonymized mobility data included information for these 14 states, plus two others.)

The analysis found that overall, sales of spirits and wine increased in the early months of the pandemic — by as much as 20-40% in some states in certain months — while beer sales declined overall compared to the same period during recent years. Meanwhile, people’s visits to bars and pubs declined, but visits to liquor stores increased.
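
The kind of comparison reported above can be illustrated with a short Python sketch. All figures here are invented for a hypothetical state and are not taken from the NIAAA data. Averaging 2018 and 2019 into a single baseline is also an assumption, since the text only says that 2020 was compared with the same period in those years.

```python
def pct_change_vs_baseline(current, baseline_values):
    """Percent change of `current` relative to the mean of the baseline values."""
    baseline = sum(baseline_values) / len(baseline_values)
    return 100.0 * (current - baseline) / baseline

# Invented March-June sales totals for one hypothetical state (arbitrary units).
sales = {
    "spirits": {"2018": 100.0, "2019": 110.0, "2020": 136.5},
    "beer": {"2018": 200.0, "2019": 200.0, "2020": 190.0},
}

change = {
    beverage: pct_change_vs_baseline(s["2020"], [s["2018"], s["2019"]])
    for beverage, s in sales.items()
}
for beverage, pct in change.items():
    print(f"{beverage}: {pct:+.1f}% vs 2018-2019 baseline")
```

With these made-up numbers, spirits come out 30% above the baseline and beer 5% below it, matching the direction of the pattern described in the release.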

Dynamics varied significantly across states. For example, while beer sales decreased in most states between March and June 2020 compared with the same months in recent years, they increased in Kansas, Arkansas and Texas. Meanwhile, Texas, Kentucky and Virginia showed sustained increases in their sales of both spirits and wine, which the authors suggest “can be alarming signals for problematic alcohol use.”

“If data can provide information about geographic areas in which alcohol use increases during certain types of events such as during severe weather, high unemployment, or events such as the COVID pandemic, this information can be useful to help prepare law enforcement, medical professionals and substance use disorder treatment providers to address alcohol-related issues associated with such times,” Quigley says.

Machine-learning assessments in the study point to a significant shift in the relationship between alcohol sales data and visits to various alcohol outlets. More research will be necessary to understand how people’s behaviors changed, but these findings suggest the possibility that some states may have seen an increase in online alcohol purchases or panic buying of spirits and wine.

The research team notes that the study has some limitations: For example, many states were not included in the NIAAA dataset, and the human mobility data was not able to capture alcohol sales at places such as grocery stores, where sales of alcohol are mixed with sales of other items. Nevertheless, these results provide insights into the potential effects of lockdown policies on alcohol use and could inform future public health policies to address alcohol-related social issues, the researchers say.

From a research methodology perspective, Taylor, PhD, UB assistant professor of mathematics, notes, “Interfacing new data sources such as anonymized human mobility data with public health challenges that are difficult or expensive to directly measure reveals new methodological challenges for applied machine learning research.”

COMMENTS

Evidence‐based statistical analysis and methods in biomedical research (SAMBR) checklists according to design features
Statistical analysis according to design features and objectives is essential to ensure the validity and reliability of the study findings and conclusions in biomedical research. Heterogeneity in reporting study design elements and conducting statistical analyses is often observed for the same study design and study objective in medical ...
Biostatistics Series Module 1: Basics of Biostatistics
Basics of Biostatistics. Application of statistical methods in biomedical research began more than 150 years ago. One of the early pioneers, Florence Nightingale, the icon of nursing, worked during the Crimean war of the 1850s to improve the methods of constructing mortality tables. The conclusions from her tables helped to change the practices ...
Statistical Analysis of Biomedical Data
The Correlation Coefficient of Regression Analysis. Another common research objective is to measure how much changes in one variable explain changes in another. Generally, scatter plots are used to illustrate this relation. ... This was a very non-mathematical overview of the everyday statistics used in biomedical research. The maths behind ...
Methods, techniques, assays and protocols
In applied biomedical research, methods and protocols are indispensable for unravelling the workings of biomedically relevant biological systems (molecular, cellular, and at the organ and whole ...
Basic Biomedical Sciences Research
Basic biomedical research, which addresses mechanisms that underlie the formation and function of living organisms, ranging from the study of single molecules to complex integrated functions of humans, contributes profoundly to our knowledge of how disease, trauma, or genetic defects alter normal physiological and behavioral processes. Recent advances in molecular biology techniques and ...
Biostatistics for Biomedical Research
The book is aimed at exposing biomedical researchers to modern biostatistical methods and statistical graphics, highlighting those methods that make fewer assumptions, including nonparametric statistics and robust statistical measures. In addition to covering traditional estimation and inferential techniques, the course contrasts those with the ...
Statistical analysis of high-dimensional biomedical data: a gentle
The goal of the topic group TG9 "High-dimensional data" (HDD) of the STRATOS (STRengthening Analytical Thinking for Observational Studies) [] initiative is to provide guidance for planning, conducting, analyzing, and reporting studies involving high-dimensional biomedical data.The increasing availability and use of "big" data in biomedical research, characterized by "large n ...
Statistical Methods for Biomedical Research
This book consists of four parts with 32 chapters adapted for four short courses, from the basic to the advanced levels of medical statistics (biostatistics), ideal for biomedical students. Part 1 is a compulsory course of Basic Statistics with descriptive statistics, parameter estimation and hypothesis test, simple correlation and regression.
Statistical Analysis in Preclinical Biomedical Research
Abstract. Statistics is the science of quantitative methods that guide experimental data collection, interpretation, and presentation. Statistics has a central role in the biomedical sciences, with appropriate statistical practices leading to an enhanced probability of reproducibility and the avoidance of false positives. Statistics is often ...
5 Key trends shaping statistical analysis in biomedical research
Statistical analysis is a crucial component of biomedical research, providing insights and guiding decisions. In this blog post, we'll explore five key trends that are currently shaping statistical analysis in the field. So grab a cup of coffee, sit back, and let's dive in! 1. Increased Interest in Bayesian Statistics.
Biomedical Analysis
Biomedical Analysis is an international, interdisciplinary, scientific, peer-reviewed Open Access journal. It aims to publish high-quality articles in the field of biomedical engineering, bioanalytical chemistry, biochemistry, genetics, biology, biomaterials, and medicine. Its objects involve the interaction of chemical, biological, and medical ...
The landscape of biomedical research: Patterns
We used this atlas as an exploration tool to study the biomedical research landscape, generating hypotheses that we later confirmed using the original high-dimensional data. Using five distinct examples—the emergence of the COVID-19 literature, the evolution of the neuroscience discipline, the uptake of machine learning, the gender imbalance ...
Evidence-based statistical analysis and methods in biomedical research
Background: Statistical analysis according to design features and objectives is essential to ensure the validity and reliability of the study findings and conclusions in biomedical research. Heterogeneity in reporting study design elements and conducting statistical analyses is often observed for the same study design and study objective in medical literatures.
Building on current progress to shape the future of biomedical science
The past 200 years have seen rapid advances in western biomedicine. A model arising from western Europe and North America, current biomedical science is largely driven by efforts to prevent or cure diseases. It uses hierarchies of evidence generated from observational and experimental research,1 and is arguably driven by the interests of scientists who hold this underlying philosophy, with ...
Raman spectroscopy and multivariate regression analysis in biomedical
Raman spectroscopy and multivariate regression analysis in biomedical research, medical diagnosis, and clinical analysis. ... Recent innovations in the use of Raman spectroscopy for chemical analysis in human specimens are discussed. Applications of Raman spectroscopy in cancer immunotherapy, cancer imaging, and detecting disease biomarkers in ...
Issues in Biomedical Research Data Management and Analysis: Needs and
Introduction. Rapid advances in analytical technology coupled with widespread access to large amounts of highly detailed, heterogeneous and often public biomedical research data have dramatically increased the difficulties faced by biomedical investigators in acquiring, archiving, annotating, and analyzing data. 1 Recognition of this fact is reflected in a number of large scale initiatives by ...
Integrating sex and gender into biomedical research requires ...
Most biomedical, health and care research does not adequately account for sex and gender dimensions of health and illness. ... A framework for sex, gender, and diversity analysis in research ...
Biomedical Data Science
Biomedical Data Science involves the analysis of large-scale biomedical datasets to understand how living systems function. Our academic and research programs in Biomedical Data Science center on developing new data analysis technologies in order to understand disease mechanisms and provide improved health care at lower costs.
Chromatography-Mass Spectrometry Analysis in Biomedical Research and
The present Topic aims to cover the latest research trends and achievements of chromatography-mass spectrometry in biomedical, clinical, and pharmacological research by highlighting novel applications and novel approaches in sample treatment and instrumental analysis. Researchers working on all aspects of basic research and applications in ...
Issues in Biomedical Research Data Management and Analysis: Needs and
Needs Analysis. Three broad data management and analysis themes emerged from the analysis of the interview data within the context of the survey responses: 1) current state of data management and analysis at the laboratory level; 2) anticipated data management and analysis needs; 3) barriers to addressing those needs.
Evidence‐based statistical analysis and methods in biomedical research
Statistical analysis according to design features and objectives is essential to ensure the validity and reliability of the study findings and conclusions in biomedical research. Heterogeneity in reporting study design elements and conducting statistical analyses is often observed for the same study design and study objective in medical ...
Issues in biomedical research data management and analysis ...
Objectives: A. Identify the current state of data management needs of academic biomedical researchers. B. Explore their anticipated data management and analysis needs. C. Identify barriers to addressing those needs. Design: A multimodal needs analysis was conducted using a combination of an online survey and in-depth one-on-one semi-structured interviews.
Data integration through canonical correlation analysis and its
The dynamic development of high-throughput methods, and with them the availability of large and constantly growing data resources, forces the development of new analytical approaches that allow the review of the analyzed processes, taking into account data from various levels of the organization of living organisms.
Statistical analysis of high-dimensional biomedical data: a gentle
The term "data preprocessing" is often used in biomedical research involving analysis of HDD, especially in the omics field, to denote certain initial data cleaning and screening steps falling within the more general category of "initial data analysis." Data preprocessing refers to the process of transforming "raw" data, obtained ...
Biomedical research with Generative AI Expertise
Apply in-depth knowledge of biomedical research processes, including chemical, pre-clinical, and clinical stages, to provide valuable insights and recommendations. Stay updated with the latest advancements in the biomedical field and AI technologies, ensuring the application of cutting-edge solutions.
Morphological simulation tests the limits on phenotype ...
In the past few decades, advances in 3D imaging have created new opportunities for reverse genetic screens. Rapidly growing datasets of 3D images of genetic knockouts require high-throughput, automated computational approaches for identifying and characterizing new phenotypes. However, exploratory, discovery-oriented image analysis pipelines used to discover these phenotypes can be difficult ...
A novel temporal-frequency combination pattern optimization approach
Computer Methods in Biomechanics and Biomedical Engineering ... filter banks. Subsequently, we introduce the Pearson-Fisher combinational method along with Discriminant Correlation Analysis (DCA) for joint feature selection and fusion. ... This work was supported by National Key Research and Development Program of China (No.2021ZD0113204 ...
Understanding the Research Landscape of Deep Learning in Biomedical
Advances in biomedical research using deep learning techniques have generated a large volume of related literature. However, there is a lack of scientometric studies that provide a bird's-eye view of them. ... Coauthorship analysis revealed active research collaboration between universities and hospitals and between hospitals and companies ...
Study explores pandemic-related shifts in alcohol sales across 16 U.S
"Understanding how alcohol purchase behavior is changed by events such as COVID is important because heavy alcohol use is known to be associated with numerous social problems, especially within the home," says Quigley, PhD, research assistant professor of medicine in the Jacobs School of Medicine and Biomedical Sciences at UB and the UB ...