Artificial intelligence in systematic reviews: promising when appropriately used

BMJ Open, Volume 13, Issue 7

  • http://orcid.org/0000-0003-1727-0608 Sanne H B van Dijk 1 , 2 ,
  • Marjolein G J Brusse-Keizer 1 , 3 ,
  • Charlotte C Bucsán 2 , 4 ,
  • http://orcid.org/0000-0003-1071-6769 Job van der Palen 3 , 4 ,
  • Carine J M Doggen 1 , 5 ,
  • http://orcid.org/0000-0002-2276-5691 Anke Lenferink 1 , 2 , 5
  • 1 Health Technology & Services Research, Technical Medical Centre , University of Twente , Enschede , The Netherlands
  • 2 Pulmonary Medicine , Medisch Spectrum Twente , Enschede , The Netherlands
  • 3 Medical School Twente , Medisch Spectrum Twente , Enschede , The Netherlands
  • 4 Cognition, Data & Education, Faculty of Behavioural, Management & Social Sciences , University of Twente , Enschede , The Netherlands
  • 5 Clinical Research Centre , Rijnstate Hospital , Arnhem , The Netherlands
  • Correspondence to Dr Anke Lenferink; a.lenferink@utwente.nl

Background Systematic reviews provide a structured overview of the available evidence in medical-scientific research. However, due to the increasing medical-scientific research output, conducting a systematic review is a time-consuming task. To accelerate the review process, artificial intelligence (AI) can be used. In this communication paper, we suggest how to conduct a transparent and reliable systematic review using the AI tool ‘ASReview’ for title and abstract screening.

Methods Use of the AI tool consisted of several steps. First, the tool required training of its algorithm with several prelabelled articles prior to screening. Next, using a researcher-in-the-loop algorithm, the AI tool proposed the article with the highest probability of being relevant. The reviewer then decided on the relevancy of each proposed article. This process was continued until the stopping criterion was reached. All articles labelled relevant by the reviewer were screened on full text.

Results Considerations to ensure methodological quality when using AI in systematic reviews included: the choice of whether to use AI, the need for both deduplication and checking of inter-reviewer agreement, how to choose a stopping criterion and the quality of reporting. Using the tool in our review resulted in considerable time saved: only 23% of the articles were assessed by the reviewer.

Conclusion The AI tool is a promising innovation for the current systematic reviewing practice, as long as it is appropriately used and methodological quality can be assured.

PROSPERO registration number CRD42022283952.

  • systematic review
  • statistics & research methods
  • information technology

This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 Unported (CC BY 4.0) license, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the licence is given, and indication of whether changes were made. See:  https://creativecommons.org/licenses/by/4.0/ .

https://doi.org/10.1136/bmjopen-2023-072254


Strengths and limitations of this study

Potential pitfalls regarding the use of artificial intelligence in systematic reviewing were identified.

Remedies for each pitfall were provided to ensure methodological quality.

A time-efficient approach is suggested on how to conduct a transparent and reliable systematic review using an artificial intelligence tool.

The artificial intelligence tool described in the paper was not evaluated for its accuracy.

Medical-scientific research output has grown exponentially since the very first medical papers were published. 1–3 The output in the field of clinical medicine keeps increasing. 4 To illustrate, a quick PubMed search for ‘cardiology’ shows a fivefold increase in annual publications, from 10 420 (2007) to 52 537 (2021). Although medical-scientific output does not grow at a higher rate than that of other scientific fields, 1–3 it is the largest output of any field. 3 Staying updated by reading all published articles is therefore not feasible. Systematic reviews, however, provide up-to-date and accessible summaries of evidence, as they synthesise previously published results in a transparent and reproducible manner. 5 6 Hence, conclusions can be drawn that provide the highest considered level of evidence in medical research. 5 7 Systematic reviews are therefore not only crucial in science, but also have a large impact on clinical practice and policy-making. 6 They are, however, highly labour-intensive to conduct, because a large number of articles must be screened, which consumes substantial research resources. Efficient and innovative reviewing methods are therefore desired. 8

An open-source artificial intelligence (AI) tool ‘ASReview’ 9 was published in 2021 to facilitate the title and abstract screening process in systematic reviews. The tool enables researchers to conduct systematic reviews more efficiently: simulations have already shown its time-saving potential. 9–11 We used the tool in the study selection of our own systematic review and came across scenarios that needed consideration to prevent loss of methodological quality. In this communication paper, we provide a reliable and transparent AI-supported systematic reviewing approach.

We first describe how the AI tool was used in a systematic review conducted by our research group. For more detailed information regarding the searches and eligibility criteria of the review, we refer to the protocol (PROSPERO registry: CRD42022283952). Subsequently, when deciding on the AI screening-related methodology, we applied remedies to foreseen scenarios and their pitfalls to maintain a reliable and transparent approach. These potential scenarios, pitfalls and remedies are discussed in the Results section.

In our systematic review, the AI tool ‘ASReview’ (V.0.17.1) 9 was used for the screening of titles and abstracts by the first reviewer (SHBvD). The tool uses an active researcher-in-the-loop machine learning algorithm to rank the articles from high to low probability of eligibility for inclusion by text mining. The AI tool offers several classifier models by which the relevancy of the articles can be determined. 9 In a simulation study using six large systematic review datasets on various topics, a Naïve Bayes (NB) classifier combined with term frequency-inverse document frequency (TF-IDF) feature extraction outperformed other model settings. 10 The NB classifier estimates the probability of an article being relevant, based on TF-IDF measurements. TF-IDF weighs how often a word occurs within an article against the number of articles in which the word appears. 12 This combination of NB and TF-IDF was chosen for our systematic review.
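
As an illustration of this model combination (a minimal sketch only, re-implemented with scikit-learn rather than ASReview's own code; the example texts and labels are hypothetical), TF-IDF features can feed an NB classifier that scores unseen titles and abstracts by their probability of relevance and ranks them accordingly:

```python
# Illustrative sketch only: approximates the NB + TF-IDF combination described above
# with scikit-learn; this is not ASReview's own code. Texts and labels are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

labelled_texts = [
    "Telemonitoring for early detection of COPD exacerbations",    # labelled relevant
    "Surgical outcomes of hip replacement in older adults",        # labelled irrelevant
]
labels = [1, 0]
unlabelled_texts = [
    "Machine learning to predict exacerbations in chronic lung disease",
    "Dietary patterns and cardiovascular risk in middle-aged adults",
]

vectorizer = TfidfVectorizer()                      # TF-IDF feature extraction
X_train = vectorizer.fit_transform(labelled_texts)
X_pool = vectorizer.transform(unlabelled_texts)

classifier = MultinomialNB().fit(X_train, labels)   # Naive Bayes classifier
relevance = classifier.predict_proba(X_pool)[:, 1]  # estimated probability of relevance

# Rank the unseen records from most to least probably relevant
for idx in relevance.argsort()[::-1]:
    print(f"{relevance[idx]:.2f}  {unlabelled_texts[idx]}")
```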

Before the AI tool can be used for the screening of relevant articles, its algorithm needs training with at least one relevant and one irrelevant article (ie, prior knowledge). It is assumed that the more prior knowledge, the better the algorithm is trained at the start of the screening process, and the faster it will identify relevant articles. 9 In our review, the prior knowledge consisted of three relevant articles 13–15 selected from a systematic review on the topic 16 and three randomly picked irrelevant articles.

After training with the prior knowledge, the AI tool made a first ranking of all unlabelled articles (ie, articles not yet decided on eligibility) from highest to lowest probability of being relevant. The first reviewer read the title and abstract of the top-ranked article and made a decision (‘relevant’ or ‘irrelevant’) following the eligibility criteria. Next, the AI tool took this additional knowledge into account and made a new ranking. Again, the next top-ranked article was proposed to the reviewer, who made a decision regarding eligibility. This process of the AI making rankings and the reviewer making decisions, also called ‘researcher-in-the-loop’, was repeated until the predefined data-driven stopping criterion of, in our case, 100 consecutive irrelevant articles was reached. After the reviewer had rejected what the AI tool put forward as ‘most probably relevant’ a hundred times in a row, it was assumed that there were no relevant articles left in the unseen part of the dataset.
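
The researcher-in-the-loop cycle with such a data-driven stopping criterion can be sketched roughly as follows (an illustrative outline only, not ASReview's implementation; `retrain_and_rank` and `ask_reviewer` are hypothetical placeholders for the model update and the human decision):

```python
# Minimal outline of the researcher-in-the-loop cycle with a data-driven stopping
# criterion (here: 100 consecutive irrelevant decisions). Not ASReview's implementation;
# `retrain_and_rank` and `ask_reviewer` are hypothetical placeholders.
STOP_AFTER_N_IRRELEVANT = 100

def screen(pool, prior_relevant, prior_irrelevant, retrain_and_rank, ask_reviewer):
    # Start from the prior knowledge: a few labelled relevant and irrelevant articles
    labelled = {**{r: 1 for r in prior_relevant}, **{r: 0 for r in prior_irrelevant}}
    consecutive_irrelevant = 0
    while pool and consecutive_irrelevant < STOP_AFTER_N_IRRELEVANT:
        # Retrain on all decisions so far and propose the most probably relevant record
        top_record = retrain_and_rank(labelled, pool)[0]
        decision = ask_reviewer(top_record)          # 1 = relevant, 0 = irrelevant
        labelled[top_record] = decision
        pool.remove(top_record)
        consecutive_irrelevant = 0 if decision == 1 else consecutive_irrelevant + 1
    # Articles labelled relevant proceed to full-text screening
    return [record for record, label in labelled.items() if label == 1]
```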

The articles that were labelled relevant during the title and abstract screening were each screened on full text independently by two reviewers (SHBvD and MGJB-K, AL, JvdP, CJMD, CCB) to minimise the influence of subjectivity on inclusion. Disagreements regarding inclusion were solved by a third independent reviewer.

How to maintain reliability and transparency when using AI in title and abstract screening

A summary of the potential scenarios, and their pitfalls and remedies, when using the AI tool in a systematic review is given in table 1 . These potential scenarios should not be ignored, but acted on to maintain reliability and transparency. Figure 1 shows when and where to act during the screening process, reflected by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flowchart, 17 from literature search results to publishing the review.


Figure 1 Flowchart showing when and where to act when using ASReview in systematic reviewing. Adapted from the PRISMA flowchart by Haddaway et al . 17


Table 1 Per-scenario overview of potential pitfalls and how to prevent these when using ASReview in a systematic review

In our systematic review, a first set of potentially relevant articles was identified by means of broad literature searches in several scientific databases, yielding 8456 articles, enough to expect the AI tool to be efficient for the title and abstract screening (scenario ① was avoided, see table 1 ). Subsequently, this complete set of articles was uploaded in reference manager EndNote X9 18 and review manager Covidence, 19 where 3761 duplicate articles were removed. Given that EndNote has quite low sensitivity in identifying duplicates, additional deduplication in Covidence was considered beneficial. 20 Deduplication is usually applied in systematic reviewing, 20 but is even more important prior to the use of AI: multiple decisions on duplicates of the same article weigh more heavily than a single decision and would therefore disproportionately influence the classification, and possibly the results ( table 1 , scenario ② ). In our review, a deduplicated set of articles was uploaded in the AI tool. Prior to the actual AI-supported title and abstract screening, the reviewers (SHBvD and AL, MGJB-K) trained themselves with a small selection of 74 articles. The first reviewer became familiar with the ASReview software, and all three reviewers learnt how to apply the eligibility criteria, to minimise personal influence on the article selection ( table 1 , scenario ③ ).
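
As a rough sketch of what deduplication before uploading to the AI tool involves (a simplification of what EndNote and Covidence actually do; the file and column names are hypothetical), duplicates can be removed by matching on normalised titles:

```python
# Rough sketch of deduplicating a merged search export before uploading it to the AI
# tool. Matching on a normalised title is a simplification of what EndNote and
# Covidence do; the file and column names are hypothetical.
import pandas as pd

records = pd.read_csv("search_results.csv")   # merged exports from several databases

# Normalise titles so that trivial differences (case, punctuation) do not hide duplicates
records["title_key"] = (records["title"].str.lower()
                        .str.replace(r"[^a-z0-9 ]", "", regex=True)
                        .str.strip())

deduplicated = records.drop_duplicates(subset="title_key").drop(columns="title_key")
deduplicated.to_csv("search_results_deduplicated.csv", index=False)
print(f"Removed {len(records) - len(deduplicated)} duplicate records")
```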

Defining the stopping criterion used in the screening process is left to the reviewer. 9 An optimal stopping criterion in active learning is considered a perfectly balanced trade-off between the cost (in terms of time spent) of screening one more article and the gain in predictive performance (in terms of identifying a new relevant article) that one more decision could bring. 21 The optimal stopping criterion in systematic reviewing would be the moment at which screening additional articles no longer results in more relevant articles being identified. 22 Therefore, in our review, we predetermined a data-driven stopping criterion for the title and abstract screening of ‘100 consecutive irrelevant articles’, to prevent the screening from being stopped before, or long after, all relevant articles had been identified ( table 1 , scenario ④ ).

Because the stopping criterion was reached after 1063 of the 4695 articles, only part of the total set of articles was seen. This approach might therefore be sensitive to mistakes when articles are screened by only one reviewer, influencing the algorithm and possibly resulting in an incomplete selection of articles ( table 1 , scenario ③ ). 23 As a remedy, second reviewers (AL, MGJB-K) checked 20% of the titles and abstracts seen by the first reviewer. This 20% had a ratio of relevant versus irrelevant articles comparable to that of all articles seen. The percentage agreement and Cohen’s kappa (κ), a measure of inter-reviewer agreement above chance, were calculated to express the reliability of the decisions taken. 24 The reviewers agreed on 96% of the decisions and κ was 0.83. A κ of at least 0.6 is generally considered high, 24 and it was therefore assumed that the algorithm was reliably trained by the first reviewer.
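
A short sketch of how these two agreement measures can be computed for the double-screened subset (the decision lists below are hypothetical placeholders, not our data):

```python
# Sketch: inter-reviewer agreement on the double-screened subset.
# The decision lists are hypothetical placeholders (1 = relevant, 0 = irrelevant).
from sklearn.metrics import cohen_kappa_score

first_reviewer  = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
second_reviewer = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]

agreement = sum(a == b for a, b in zip(first_reviewer, second_reviewer)) / len(first_reviewer)
kappa = cohen_kappa_score(first_reviewer, second_reviewer)   # chance-corrected agreement
print(f"Percent agreement: {agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```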

The reporting of the use of the AI tool should be transparent. If the choices made regarding the use of the AI tool are not entirely reported ( table 1 , scenario ⑤ ), the reader will not be able to properly assess the methodology of the review, and review results may even be graded as low-quality due to the lack of transparent reporting. The ASReview tool offers the possibility to extract a data file providing insight into all decisions made during the screening process, in contrast to various other ‘black box’ AI-reviewing tools. 9 This file will be published alongside our systematic review to provide full transparency of our AI-supported screening. This way, the screening with AI is reproducible (remedy to scenario ⑥ , table 1 ).

Results of AI-supported study selection in a systematic review

We experienced an efficient process of title and abstract screening in our systematic review. The screening started from a database of 4695 articles; the stopping criterion was reached after 1063 articles, so 23% were seen. Figure 2A shows the proportion of articles identified as relevant at any point during the AI-supported screening process. It can be observed that the articles were indeed prioritised by the active learning algorithm: in the beginning, relatively many relevant articles were found, but this decreased as the stopping criterion (vertical red line) was approached. Figure 2B compares the screening progress when using the AI tool versus manual screening. At the moment the stopping criterion was reached, manual screening of the same number of titles and abstracts would have identified approximately 32 relevant records, compared with 142 articles labelled relevant using the AI tool. After the inter-reviewer agreement check, 142 articles proceeded to the full-text reviewing phase, of which 65 were excluded because they were not articles in an original research format, and three because the full text could not be retrieved. After full-text reviewing of the remaining 74 articles, 18 articles from 13 individual studies were included in our review. After snowballing, one additional article from a study already included was added.

Figure 2 Relevant articles identified after a certain number of titles and abstracts were screened using the AI tool compared with manual screening.

In our systematic review, the AI tool considerably reduced the number of articles to screen. Since the AI tool is offered open source, many researchers may benefit from its time-saving potential in selecting articles. Choices in several scenarios regarding the use of AI, however, are still left to the researcher and need consideration to prevent pitfalls. These include the choice of whether or not to use AI by weighing the costs against the benefits, the importance of deduplication, double screening to check inter-reviewer agreement, a data-driven stopping criterion to optimally use the algorithm’s predictive performance, and the quality of reporting of the AI-related methodology chosen. This communication paper is, to our knowledge, the first to elaborately explain and discuss these choices regarding the application of this AI tool in an example systematic review.

The main advantage of using the AI tool is the amount of time saved. Indeed, in our study, only 23% of the total number of articles were screened before the predefined stopping criterion was met. Assuming that all relevant articles were found, the AI tool saved 77% of the title and abstract screening time. However, time should be invested to become acquainted with the tool. Whether the expected screening time saved outweighs this investment is context-dependent (eg, researcher’s digital skills, systematic reviewing skills, topic knowledge). An additional advantage is that research questions previously unanswerable, due to the insurmountable number of articles to screen in a ‘classic’ (ie, manual) review, can now actually be answered. An example of the latter is a review screening over 60 000 articles, 25 which would probably never have been performed without AI supporting the article selection.

Since its introduction in 2021, the ASReview tool has been applied in seven published reviews. 25–31 An important note is that only one 25 clearly reported the AI-related choices in the Methods and a complete and transparent flowchart reflecting the study selection process in the Results section. Two reviews reported a relatively small number (<400) of articles to screen, 26 27 in which more than 75% of the articles were screened before the stopping criterion was met, so the time saved was limited. Three reviews reported many initial articles (>6000) 25 28 29 and one reported 892 articles, 31 of which only 5%–10% needed to be screened; in these reviews the AI tool saved an impressive amount of screening time. In our systematic review, 3% of the articles were labelled relevant during the title and abstract screening and eventually <1% of all initial articles were included. These percentages are low, and in line with the three above-mentioned reviews (1%–2% and 0%–1%, respectively). 25 28 29 Still, relevancy and inclusion rates are much lower than in ‘classic’ systematic reviews. A study evaluating the screening process in 25 ‘classic’ systematic reviews showed that approximately 18% of articles were labelled relevant and 5% were actually included in the reviews. 32 This difference is probably due to the more narrow literature searches conducted in ‘classic’ reviews for feasibility purposes, compared with AI-supported reviews, resulting in a higher proportion of included articles.

In this paper, we show how we applied the AI tool, but we did not evaluate it in terms of accuracy. This means we have to deal with a certain degree of uncertainty. Despite the data-driven stopping criterion, there is a chance that relevant articles were missed, as 77% of the articles were automatically excluded. If this were the case, it could, first, be due to incorrect decisions by the reviewer, which would have undesirably influenced the training of the algorithm and thereby the labelling of articles as (ir)relevant and the order in which they were presented. Relevant articles could therefore have remained unseen if the stopping criterion was reached before they were presented to the reviewer. As a remedy, in our systematic review, 20% of the articles screened by the first reviewer were also assessed by a second reviewer to determine inter-reviewer reliability, which was high. It should be noted, though, that ‘classic’ title and abstract screening is not necessarily better than using AI, as medical-scientific researchers tend to assess one out of nine abstracts wrongly. 32 Second, the AI tool may not have properly ranked the articles from relevant to irrelevant. However, given that simulations have previously demonstrated this AI tool’s accuracy, 9–11 this was not considered plausible. Since our study applied, but did not evaluate, the AI tool, we encourage future studies evaluating the performance of the tool across different scientific disciplines and contexts, since research suggests that the tool’s performance depends on the context, for example, the complexity of the research question. 33 This would not only enrich the knowledge about the AI tool, but also increase certainty about using it. Future studies should also investigate the effects of choices made regarding the amount of prior knowledge provided to the tool, the number of articles defining the stopping criterion and how duplicate screening is best performed, to guide future users of the tool.

Although various researcher-in-the-loop AI tools for title and abstract screening have been developed over the years, 9 23 34 they often do not develop into usable, mature software, 34 which impedes the permanent implementation of AI in research practice. For medical-scientific research practice, it would therefore be helpful if large systematic review institutions, like Cochrane and PRISMA, would consider ‘officially’ making AI part of systematic reviewing practice. When guidelines on the use of AI in systematic reviews are made available and widely recognised, AI-supported systematic reviews can be uniformly conducted and transparently reported. Only then can we really benefit from AI’s time-saving potential and reduce our research time waste.

Our experience with the AI tool during the title and abstract screening was positive, as it greatly accelerated the literature selection process. However, users should apply appropriate remedies to scenarios that may threaten the methodological quality of the review. We have provided an overview of these scenarios, their pitfalls and remedies, to encourage reliable use and transparent reporting of AI in systematic reviewing. Given the importance of systematic reviews for medical guidelines and practice, and to ensure that they can continue to be conducted in the future, we consider this tool an important addition to the review process.

Ethics approval

Not applicable.

Contributors SHBvD proposed the methodology and conducted the study selection. MGJB-K, CJMD and AL critically reflected on the methodology. MGJB-K and AL contributed substantially to the study selection. CCB, JvdP and CJMD contributed to the study selection. The manuscript was primarily prepared by SHBvD and critically revised by all authors. All authors read and approved the final manuscript.

Funding The systematic review is conducted as part of the RE-SAMPLE project. RE-SAMPLE has received funding from the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 965315).

Competing interests None declared.

Provenance and peer review Not commissioned; externally peer reviewed.


Open access | Published: 01 February 2021

An open source machine learning framework for efficient and transparent systematic reviews

  • Rens van de Schoot   ORCID: orcid.org/0000-0001-7736-2091 1 ,
  • Jonathan de Bruin   ORCID: orcid.org/0000-0002-4297-0502 2 ,
  • Raoul Schram 2 ,
  • Parisa Zahedi   ORCID: orcid.org/0000-0002-1610-3149 2 ,
  • Jan de Boer   ORCID: orcid.org/0000-0002-0531-3888 3 ,
  • Felix Weijdema   ORCID: orcid.org/0000-0001-5150-1102 3 ,
  • Bianca Kramer   ORCID: orcid.org/0000-0002-5965-6560 3 ,
  • Martijn Huijts   ORCID: orcid.org/0000-0002-8353-0853 4 ,
  • Maarten Hoogerwerf   ORCID: orcid.org/0000-0003-1498-2052 2 ,
  • Gerbrich Ferdinands   ORCID: orcid.org/0000-0002-4998-3293 1 ,
  • Albert Harkema   ORCID: orcid.org/0000-0002-7091-1147 1 ,
  • Joukje Willemsen   ORCID: orcid.org/0000-0002-7260-0828 1 ,
  • Yongchao Ma   ORCID: orcid.org/0000-0003-4100-5468 1 ,
  • Qixiang Fang   ORCID: orcid.org/0000-0003-2689-6653 1 ,
  • Sybren Hindriks 1 ,
  • Lars Tummers   ORCID: orcid.org/0000-0001-9940-9874 5 &
  • Daniel L. Oberski   ORCID: orcid.org/0000-0001-7467-2297 1 , 6  

Nature Machine Intelligence volume 3, pages 125–133 (2021)


  • Computational biology and bioinformatics
  • Computer science
  • Medical research

A preprint version of the article is available at arXiv.

To help researchers conduct a systematic review or meta-analysis as efficiently and transparently as possible, we designed a tool to accelerate the step of screening titles and abstracts. For many tasks—including but not limited to systematic reviews and meta-analyses—the scientific literature needs to be checked systematically. Scholars and practitioners currently screen thousands of studies by hand to determine which studies to include in their review or meta-analysis. This is error prone and inefficient because of extremely imbalanced data: only a fraction of the screened studies is relevant. The future of systematic reviewing will be an interaction with machine learning algorithms to deal with the enormous increase of available text. We therefore developed an open source machine learning-aided pipeline applying active learning: ASReview. We demonstrate by means of simulation studies that active learning can yield far more efficient reviewing than manual reviewing while providing high quality. Furthermore, we describe the options of the free and open source research software and present the results from user experience tests. We invite the community to contribute to open source projects such as our own that provide measurable and reproducible improvements over current practice.


With the emergence of online publishing, the number of scientific manuscripts on many topics is skyrocketing 1 . All of these textual data present opportunities to scholars and practitioners while simultaneously confronting them with new challenges. To obtain comprehensive overviews of the relevant topics, scholars often conduct systematic reviews and meta-analyses 2 . The process entails several explicit and, ideally, reproducible steps, including identifying all likely relevant publications in a standardized way, extracting data from eligible studies and synthesizing the results. Systematic reviews differ from traditional literature reviews in that they are more replicable and transparent 3 , 4 . Such systematic overviews of literature on a specific topic are pivotal not only for scholars, but also for clinicians, policy-makers, journalists and, ultimately, the general public 5 , 6 , 7 .

Given that screening the entire research literature on a given topic is too labour intensive, scholars often develop quite narrow searches. Developing a search strategy for a systematic review is an iterative process aimed at balancing recall and precision 8 , 9 ; that is, including as many potentially relevant studies as possible while simultaneously limiting the total number of studies retrieved. The vast number of publications in the field of study often leads to a relatively precise search, with the risk of missing relevant studies. The process of systematic reviewing is error prone and extremely time intensive 10 . In fact, if the literature of a field is growing faster than the amount of time available for systematic reviews, adequate manual review of this field then becomes impossible 11 .

The rapidly evolving field of machine learning has aided researchers by allowing the development of software tools that assist in developing systematic reviews 11 , 12 , 13 , 14 . Machine learning offers approaches to overcome the manual and time-consuming screening of large numbers of studies by prioritizing relevant studies via active learning 15 . Active learning is a type of machine learning in which a model can choose the data points (for example, records obtained from a systematic search) it would like to learn from and thereby drastically reduce the total number of records that require manual screening 16 , 17 , 18 . In most so-called human-in-the-loop 19 machine-learning applications, the interaction between the machine-learning algorithm and the human is used to train a model with a minimum number of labelling tasks. Unique to systematic reviewing is that not only do all relevant records (that is, titles and abstracts) need to be seen by a researcher, but an extremely diverse range of concepts also needs to be learned, thereby requiring flexibility in the modelling approach as well as careful error evaluation 11 . In the case of systematic reviewing, the algorithm(s) are interactively optimized for finding the most relevant records, instead of finding the most accurate model. The term researcher-in-the-loop was introduced 20 as a special case of human-in-the-loop with three unique components: (1) the primary output of the process is a selection of the records, not a trained machine learning model; (2) all records in the relevant selection are seen by a human at the end of the process 21 ; (3) the use-case requires a reproducible workflow and complete transparency 22 .

Existing tools that implement such an active learning cycle for systematic reviewing are described in Table 1 ; see the Supplementary Information for an overview of all of the software that we considered (note that this list was based on a review of software tools 12 ). However, existing tools have two main drawbacks. First, many are closed source applications with black box algorithms, which is problematic as transparency and data ownership are essential in the era of open science 22 . Second, to our knowledge, existing tools lack the necessary flexibility to deal with the large range of possible concepts to be learned by a screening machine. For example, in systematic reviews, the optimal type of classifier will depend on variable parameters, such as the proportion of relevant publications in the initial search and the complexity of the inclusion criteria used by the researcher 23 . For this reason, any successful system must allow for a wide range of classifier types. Benchmark testing is crucial to understand the real-world performance of any machine learning-aided system, but such benchmark options are currently mostly lacking.

In this paper we present ASReview, an open source machine learning-aided pipeline with active learning for systematic reviews. The goal of ASReview is to help scholars and practitioners to get an overview of the most relevant records for their work as efficiently as possible while being transparent in the process. The open, free and ready-to-use software ASReview addresses all concerns mentioned above: it is open source, uses active learning and allows multiple machine learning models. It also has a benchmark mode, which is especially useful for comparing and designing algorithms. Furthermore, it is intended to be easily extensible, allowing third parties to add modules that enhance the pipeline. Although we focus this paper on systematic reviews, ASReview can handle any text source.

In what follows, we first present the pipeline for manual versus machine learning-aided systematic reviews. We then show how ASReview has been set up and how ASReview can be used in different workflows by presenting several real-world use cases. We subsequently demonstrate the results of simulations that benchmark performance and present the results of a series of user-experience tests. Finally, we discuss future directions.

Pipeline for manual and machine learning-aided systematic reviews

The pipeline of a systematic review without active learning traditionally starts with researchers doing a comprehensive search in multiple databases 24 , using free text words as well as controlled vocabulary to retrieve potentially relevant references. The researcher then typically verifies that the key papers they expect to find are indeed included in the search results. The researcher downloads a file with records containing the text to be screened; in the case of systematic reviewing, this contains the titles and abstracts (and potentially other metadata such as the authors’ names, journal name and DOI) of potentially relevant references, which is imported into a reference manager. Ideally, two or more researchers then screen the records’ titles and abstracts on the basis of the eligibility criteria established beforehand 4 . After all records have been screened, the full texts of the potentially relevant records are read to determine which of them will ultimately be included in the review. Most records are excluded in the title and abstract phase. Typically, only a small fraction of the records belong to the relevant class, making title and abstract screening an important bottleneck in the systematic reviewing process 25 . For instance, a recent study analysed 10,115 records and excluded 9,847 after title and abstract screening, a drop of more than 95% 26 . ASReview therefore focuses on this labour-intensive step.

The research pipeline of ASReview is depicted in Fig. 1 . The researcher starts with a search exactly as described above and subsequently uploads a file containing the records (that is, metadata containing the text of the titles and abstracts) into the software. Prior knowledge is then selected, which is used for training the first model and presenting the first record to the researcher. As screening is a binary classification problem, the reviewer must select at least one key record to include and at least one to exclude on the basis of background knowledge. More prior knowledge may result in improved efficiency of the active learning process.

Figure 1 The ASReview pipeline. The symbols indicate whether the action is taken by a human, a computer, or whether both options are available.

A machine learning classifier is trained to predict study relevance (labels) from a representation of the record text (feature space) on the basis of the prior knowledge. We have purposefully chosen not to include an author name or citation network representation in the feature space to prevent authority bias in the inclusions. In the active learning cycle, the software presents one new record to be screened and labelled by the user. The user’s binary label (1 for relevant versus 0 for irrelevant) is subsequently used to train a new model, after which a new record is presented to the user. This cycle continues until a user-specified stopping criterion has been reached. The user then has a file with (1) records labelled as either relevant or irrelevant and (2) unlabelled records ordered from most to least probably relevant as predicted by the current model. This set-up helps to move through a large database much more quickly than in the manual process, while the decision process simultaneously remains transparent.

Software implementation for ASReview

The source code 27 of ASReview is available open source under an Apache 2.0 license, including documentation 28 . Compiled and packaged versions of the software are available on the Python Package Index 29 or Docker Hub 30 . The free and ready-to-use software ASReview implements oracle, simulation and exploration modes. The oracle mode is used to perform a systematic review with interaction by the user, the simulation mode is used for simulation of the ASReview performance on existing datasets, and the exploration mode can be used for teaching purposes and includes several preloaded labelled datasets.

The oracle mode presents records to the researcher, who classifies them. Multiple file formats are supported: (1) RIS files, as used by digital libraries such as IEEE Xplore, Scopus and ScienceDirect; the citation managers Mendeley, RefWorks, Zotero and EndNote also support the RIS format. (2) Tabular datasets with the .csv, .xlsx and .xls file extensions. CSV files should be comma separated and UTF-8 encoded; for CSV files, the software accepts a set of predetermined labels in line with those used in RIS files. Each record in the dataset should hold the metadata on, for example, a scientific publication. Mandatory metadata is text and can, for example, be titles or abstracts from scientific papers. If available, both are used to train the model, but at least one is needed. An advanced option is available that splits the titles and abstracts in the feature-extraction step and weights the two feature matrices independently (for TF–IDF only). Other metadata such as author, date, DOI and keywords are optional but not used for training the models. When using ASReview in the simulation or exploration mode, an additional binary variable is required to indicate historical labelling decisions. This column, which is automatically detected, can also be used in the oracle mode as background knowledge from a previous selection of relevant papers before entering the active learning cycle. If unavailable, the user has to select at least one relevant record, which can be identified by searching the pool of records. At least one irrelevant record should also be identified; the software allows the user to search for specific records or presents random records that are most likely to be irrelevant given the extremely imbalanced data.
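
For illustration, a minimal tabular dataset might be written as follows (a sketch under assumptions: the exact name recognised for the optional label column varies, so consult the ASReview documentation; the records themselves are hypothetical):

```python
# Sketch: writing a minimal CSV dataset for screening. The 'title' and 'abstract'
# columns follow the convention described above; the name of the optional label column
# is an assumption here -- consult the ASReview documentation for the recognised variants.
import pandas as pd

records = pd.DataFrame(
    {
        "title": [
            "Telemonitoring for early detection of COPD exacerbations",
            "Hip replacement outcomes in older adults",
        ],
        "abstract": [
            "We evaluate a home telemonitoring programme ...",
            "A retrospective cohort study of surgical outcomes ...",
        ],
        "doi": ["10.1000/example.1", "10.1000/example.2"],   # optional metadata
        "label_included": [1, 0],                            # optional prior decisions
    }
)
records.to_csv("my_review_dataset.csv", index=False)         # comma separated, UTF-8
```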

The software has a simple yet extensible default model: a naive Bayes classifier, TF–IDF feature extraction, a dynamic resampling balance strategy 31 and certainty-based sampling 17 , 32 for the query strategy. These defaults were chosen on the basis of their consistently high performance in benchmark experiments across several datasets 31 . Moreover, the low computation time of these default settings makes them attractive in applications, given that the software should be able to run locally. Users can change the settings, shown in Table 2 , and technical details are described in our documentation 28 . Users can also add their own classifiers, feature extraction techniques, query strategies and balance strategies.

ASReview has a number of implemented features (see Table 2 ). First, there are several classifiers available: (1) naive Bayes; (2) support vector machines; (3) logistic regression; (4) neural networks; (5) random forests; (6) LSTM-base, which consists of an embedding layer, an LSTM layer with one output, a dense layer and a single sigmoid output node; and (7) LSTM-pool, which consists of an embedding layer, an LSTM layer with many outputs, a max pooling layer and a single sigmoid output node. The feature extraction techniques available are Doc2Vec 33 , embedding LSTM, embedding with IDF or TF–IDF 34 (the default is unigram, with the option to run n -grams while other parameters are set to the defaults of Scikit-learn 35 ) and sBERT 36 . The available query strategies for the active learning part are (1) random selection, ignoring model-assigned probabilities; (2) uncertainty-based sampling, which chooses the most uncertain record according to the model (that is, closest to 0.5 probability); (3) certainty-based sampling (max in ASReview), which chooses the record most likely to be included according to the model; and (4) mixed sampling, which uses a combination of random and certainty-based sampling.
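
A small sketch of how the uncertainty-based and certainty-based ('max') query strategies differ in which record they propose next, given model-assigned relevance probabilities (illustrative values only):

```python
# Sketch: which record the uncertainty-based and certainty-based ('max') query
# strategies would propose next, given hypothetical model-assigned probabilities.
import numpy as np

proba = np.array([0.02, 0.51, 0.97, 0.10, 0.76])   # P(relevant) per unlabelled record

uncertainty_pick = int(np.argmin(np.abs(proba - 0.5)))   # closest to 0.5
certainty_pick = int(np.argmax(proba))                   # most likely to be relevant

print(f"Uncertainty sampling proposes record {uncertainty_pick} (p={proba[uncertainty_pick]:.2f})")
print(f"Certainty sampling proposes record {certainty_pick} (p={proba[certainty_pick]:.2f})")
```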

There are several balance strategies that rebalance and reorder the training data. This is necessary because the data are typically extremely imbalanced. We have implemented the following balance strategies: (1) full sampling, which uses all of the labelled records; (2) undersampling the irrelevant records so that the included and excluded records are in some particular ratio (closer to one); and (3) dynamic resampling, a novel method similar to undersampling in that it decreases the imbalance of the training data 31 . However, in dynamic resampling, the number of irrelevant records is decreased, whereas the number of relevant records is increased by duplication such that the total number of records in the training data remains the same. The ratio between relevant and irrelevant records is not fixed over iterations, but dynamically updated depending on the number of labelled records, the total number of records and the ratio between relevant and irrelevant records. Details on all of the described algorithms can be found in the code and documentation referred to above.
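
A rough sketch of a dynamic-resampling-style rebalancing step (illustrative only: the target share of relevant records is a fixed hypothetical value here, whereas ASReview updates the ratio dynamically as described above):

```python
# Rough sketch of a dynamic-resampling-style rebalancing step: irrelevant records are
# undersampled and relevant records are duplicated so that the training set keeps
# (roughly) its original size but a larger share of relevant records. The target share
# is a fixed hypothetical value here; ASReview updates the ratio dynamically.
import random

def rebalance(relevant, irrelevant, target_relevant_share=0.3, seed=0):
    rng = random.Random(seed)
    total = len(relevant) + len(irrelevant)
    n_relevant = max(1, int(target_relevant_share * total))
    n_irrelevant = total - n_relevant
    oversampled_relevant = [rng.choice(relevant) for _ in range(n_relevant)]
    undersampled_irrelevant = rng.sample(irrelevant, min(n_irrelevant, len(irrelevant)))
    return oversampled_relevant + undersampled_irrelevant
```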

By default, ASReview converts the records’ texts into a document-term matrix; terms are converted to lowercase and no stop words are removed by default (but this can be changed). As the document-term matrix is identical in each iteration of the active learning cycle, it is generated in advance of model training and stored in the (active learning) state file. Each row of the document-term matrix can easily be requested from the state file. Records are internally identified by their row number in the input dataset. In oracle mode, the record that is selected to be classified is retrieved from the state file and the record text and other metadata (such as title and abstract) are retrieved from the original dataset (from the file or the computer’s memory). ASReview can run on your local computer, or on a (self-hosted) local or remote server. Data (all records and their labels) remain on the user’s computer. Data ownership and confidentiality are crucial and no data are processed or used in any way by third parties. This is unique in comparison with some of the existing systems, as shown in the last column of Table 1 .

Real-world use cases and high-level function descriptions

Below we highlight a number of real-world use cases and high-level function descriptions for using the pipeline of ASReview.

ASReview can be integrated in classic systematic reviews or meta-analyses. Such reviews or meta-analyses entail several explicit and reproducible steps, as outlined in the PRISMA guidelines 4 . Scholars identify all likely relevant publications in a standardized way, screen retrieved publications to select eligible studies on the basis of defined eligibility criteria, extract data from eligible studies and synthesize the results. ASReview fits into this process, particularly in the abstract screening phase. ASReview does not replace the initial step of collecting all potentially relevant studies. As such, results from ASReview depend on the quality of the initial search process, including the selection of databases 24 and the construction of comprehensive searches using keywords and controlled vocabulary. However, ASReview can be used to broaden the scope of the search (by keyword expansion or omitting limitations in the search query), resulting in a higher number of initial papers, to limit the risk of missing relevant papers during the search (that is, more focus on recall instead of precision).

Furthermore, many reviewers nowadays move towards meta-reviews when analysing very large literature streams, that is, systematic reviews of systematic reviews 37 . This can be problematic as the various reviews included could use different eligibility criteria and are therefore not always directly comparable. Due to the efficiency of ASReview, scholars using the tool could conduct the study by analysing the papers directly instead of using the systematic reviews. Furthermore, ASReview supports the rapid update of a systematic review. The included papers from the initial review are used to train the machine learning model before screening of the updated set of papers starts. This allows the researcher to quickly screen the updated set of papers on the basis of decisions made in the initial run.

As an example case, let us look at the current literature on COVID-19 and the coronavirus. An enormous number of papers are being published on COVID-19. It is very time consuming to manually find relevant papers (for example, to develop treatment guidelines). This is especially problematic as urgent overviews are required. Medical guidelines rely on comprehensive systematic reviews, but the medical literature is growing at breakneck pace and the quality of the research is not universally adequate for summarization into policy 38 . Such reviews must entail adequate protocols with explicit and reproducible steps, including identifying all potentially relevant papers, extracting data from eligible studies, assessing potential for bias and synthesizing the results into medical guidelines. Researchers need to screen (tens of) thousands of COVID-19-related studies by hand to find relevant papers to include in their overview. Using ASReview, this can be done far more efficiently by selecting key papers that match their (COVID-19) research question in the first step; this starts the active learning cycle and leads to the most relevant COVID-19 papers for their research question being presented next. A plug-in was therefore developed for ASReview 39 , which contains three databases that are updated automatically whenever a new version is released by the owners of the data: (1) the CORD-19 database, developed by the Allen Institute for AI, with publications on COVID-19 and other coronavirus research (for example, SARS and MERS) from PubMed Central, the WHO COVID-19 database of publications, the preprint servers bioRxiv and medRxiv, and papers contributed by specific publishers 40 . The CORD-19 dataset is updated daily by the Allen Institute for AI and also updated daily in the plug-in. (2) In addition to the full dataset, we automatically construct a daily subset of the database with studies published after December 1st, 2019 to search for relevant papers published during the COVID-19 crisis. (3) A separate dataset of COVID-19-related preprints, containing metadata of preprints from over 15 preprint servers across disciplines, published since January 1st, 2020 41 . The preprint dataset is updated weekly by the maintainers and then automatically updated in ASReview as well. As this dataset is not readily available to researchers through regular search engines (for example, PubMed), its inclusion in ASReview provides added value to researchers interested in COVID-19 research, especially if they want a quick way to screen preprints specifically.

Simulation study

To evaluate the performance of ASReview on a labelled dataset, users can employ the simulation mode. As an example, we ran simulations based on four labelled datasets with version 0.7.2 of ASReview. All scripts to reproduce the results in this paper can be found on Zenodo ( https://doi.org/10.5281/zenodo.4024122 ) 42 , whereas the results are available at OSF ( https://doi.org/10.17605/OSF.IO/2JKD6 ) 43 .

First, we analysed the performance for a study systematically describing studies that performed viral metagenomic next-generation sequencing in common livestock such as cattle, small ruminants, poultry and pigs 44 . Studies were retrieved from Embase ( n  = 1,806), Medline ( n  = 1,384), Cochrane Central ( n  = 1), Web of Science ( n  = 977) and Google Scholar ( n  = 200, the top relevant references). After deduplication this led to 2,481 studies obtained in the initial search, of which 120 were inclusions (4.84%).

A second simulation study was performed on the results for a systematic review of studies on fault prediction in software engineering 45 . Studies were obtained from the ACM Digital Library, IEEExplore and the ISI Web of Science. Furthermore, a snowballing strategy and a manual search were conducted, resulting in a total of 8,911 publications, of which 104 were included in the systematic review (1.2%).

A third simulation study was performed on a review of longitudinal studies that applied unsupervised machine learning techniques to longitudinal data of self-reported symptoms of post-traumatic stress assessed after trauma exposure 46 , 47 ; 5,782 studies were obtained by searching PubMed, Embase, PsychInfo and Scopus and through a snowballing strategy in which both the references and the citations of the included papers were screened. Thirty-eight studies were included in the review (0.66%).

A fourth simulation study was performed on the results for a systematic review on the efficacy of angiotensin-converting enzyme inhibitors, from a study collecting various systematic review datasets from the medical sciences 15 . The collection is a subset of 2,544 publications from the TREC 2004 Genomics Track document corpus 48 . This is a static subset from all MEDLINE records from 1994 through 2003, which allows for replicability of results. Forty-one publications were included in the review (1.6%).

Performance metrics

We evaluated the four datasets using three performance metrics. We first assess the work saved over sampling (WSS), which is the percentage reduction in the number of records that need to be screened achieved by using active learning instead of screening records at random. WSS is measured at a given level of recall of relevant records, for example 95%, indicating the work reduction in screening effort at the cost of failing to detect 5% of the relevant records. For some researchers it is essential that all relevant literature on the topic is retrieved; this entails that the recall should be 100% (that is, WSS@100%). We also assess the proportion of relevant references found after having screened the first 10% of the records (RRF10%). This is a useful metric for getting a quick overview of the relevant literature.
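
Under simplified definitions, the two metrics can be computed from the screening order as sketched below (`labels` is a hypothetical list of 1/0 relevance labels in the order in which the records were presented):

```python
# Sketch of the two metrics under simplified definitions. `labels` is a hypothetical
# list of 1/0 relevance labels in the order in which the records were screened.
import math

def wss(labels, recall=0.95):
    """Work saved over sampling at the given recall level (e.g. WSS@95%)."""
    n, n_relevant = len(labels), sum(labels)
    needed = math.ceil(recall * n_relevant)
    found = screened = 0
    for screened, label in enumerate(labels, start=1):
        found += label
        if found >= needed:
            break
    return (n - screened) / n - (1 - recall)

def rrf(labels, fraction=0.10):
    """Proportion of relevant records found after screening the first `fraction`."""
    n_first = math.ceil(fraction * len(labels))
    return sum(labels[:n_first]) / sum(labels)
```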

For every dataset, 15 runs were performed with one random inclusion and one random exclusion (see Fig. 2 ). The classical review performance with randomly found inclusions is shown by the dashed line. The average work saved over sampling at 95% recall for ASReview is 83% and ranges from 67% to 92%. Hence, 95% of the eligible studies will be found after screening only between 8% and 33% of the studies. Furthermore, the proportion of relevant abstracts found after reading 10% of the abstracts ranges from 70% to 100%. In short, our software would have saved many hours of work.

Figure 2 a–d, Results of the simulation study for a study systematically reviewing studies that performed viral metagenomic next-generation sequencing in common livestock ( a ), a systematic review of studies on fault prediction in software engineering ( b ), longitudinal studies that applied unsupervised machine learning techniques to longitudinal data of self-reported symptoms of post-traumatic stress assessed after trauma exposure ( c ), and a systematic review on the efficacy of angiotensin-converting enzyme inhibitors ( d ). Fifteen runs (shown with separate lines) were performed for every dataset, with only one random inclusion and one random exclusion. The classical review performances with randomly found inclusions are shown by the dashed lines.

Usability testing (user experience testing)

We conducted a series of user experience tests to learn from end users how they experience the software and implement it in their workflow. The study was approved by the Ethics Committee of the Faculty of Social and Behavioral Sciences of Utrecht University (ID 20-104).

Unstructured interviews

The first user experience (UX) test—carried out in December 2019—was conducted with an academic research team in a substantive research field (public administration and organizational science) that has conducted various systematic reviews and meta-analyses. It was composed of three university professors (ranging from assistant to full) and three PhD candidates. In one 3.5 h session, the participants used the software and provided feedback via unstructured interviews and group discussions. The goal was to provide feedback on installing the software and testing the performance on their own data. After these sessions we prioritized the feedback in a meeting with the ASReview team, which resulted in the release of v.0.4 and v.0.6. An overview of all releases can be found on GitHub 27 .

A second UX test was conducted with four experienced researchers developing medical guidelines based on classical systematic reviews, and two experienced reviewers working at a pharmaceutical non-profit organization who work on updating reviews with new data. In four sessions, held in February to March 2020, these users tested the software following our testing protocol. After each session we implemented the feedback provided by the experts and asked them to review the software again. The main feedback was about how to upload datasets and select prior papers. Their feedback resulted in the release of v.0.7 and v.0.9.

Systematic UX test

In May 2020 we conducted a systematic UX test. Two groups of users were distinguished: unexperienced users and experienced users who had already used ASReview. Due to the COVID-19 lockdown, the usability tests were conducted via video calling, with one person giving instructions to the participant and one person observing, a set-up called human-moderated remote testing 49 . During the tests, one person (SH) asked the questions and helped the participant with the tasks, while the other person, a user experience professional at the IT department of Utrecht University (MH), observed and made notes.

To analyse the notes, thematic analysis was used, which is a method to analyse data by dividing the information into subjects that each have a different meaning 50 , using the NVivo 12 software 51 . When something went wrong, the text was coded as showstopper; when something did not go smoothly, the text was coded as doubtful; and when something went well, the subject was coded as superb. The features the participants requested for future versions of the ASReview tool were discussed with the lead engineer of the ASReview team and were submitted to GitHub as issues or feature requests.

The answers to the quantitative questions can be found at the Open Science Framework 52 . The participants ( N  = 11) rated the tool with a grade of 7.9 (s.d. = 0.9) on a scale from one to ten (Table 2 ). The unexperienced users on average rated the tool with an 8.0 (s.d. = 1.1, N  = 6). The experienced users on average rated the tool with a 7.8 (s.d. = 0.9, N  = 5). The participants described the usability test with words such as helpful, accessible, fun, clear and obvious.

The UX tests resulted in the new releases v0.10 and v0.10.1 and the major release v0.11, which is a major revision of the graphical user interface. The documentation has been upgraded to make installing and launching ASReview more straightforward. We made setting up a project, selecting a dataset and finding past knowledge more intuitive and flexible. We also added a project dashboard with information on screening progress and advanced settings.

Continuous input via the open source community

Finally, the ASReview development team receives continuous feedback from the open science community about, among other things, the user experience. In every new release we implement features listed by our users. Recurring UX tests are performed to keep up with the needs of users and improve the value of the tool.

We designed a system to accelerate the step of screening titles and abstracts to help researchers conduct a systematic review or meta-analysis as efficiently and transparently as possible. Our system uses active learning to train a machine learning model that predicts relevance from texts using a limited number of labelled examples. The classifier, feature extraction technique, balance strategy and active learning query strategy are flexible. We provide an open source software implementation, ASReview, and compared its performance with state-of-the-art systems across a wide range of real-world systematic reviewing applications. Based on our experiments, ASReview provides defaults on its parameters, which exhibited good performance on average across the applications we examined. However, we stress that in practical applications, these defaults should be carefully examined; for this purpose, the software provides a simulation mode to users. We encourage users and developers to perform further evaluation of the proposed approach in their application, and to take advantage of the open source nature of the project by contributing further developments.
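To make this workflow concrete, the following sketch implements one certainty-based active learning cycle in Python with scikit-learn, using the same kind of building blocks (TF-IDF features and a Naive Bayes classifier) that ASReview offers as defaults. It illustrates the general approach only; it is not ASReview's internal code, and the records, prior labels and ask_reviewer stub are placeholders.

```python
# Minimal certainty-based active learning loop for title/abstract screening.
# Illustrative sketch only: the records, prior labels and ask_reviewer stub are
# placeholders, not ASReview's internal implementation.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

records = [
    "effect of telemonitoring on copd exacerbations ...",        # placeholder abstracts
    "galaxy cluster formation in cosmological simulations ...",
    "home-based exercise programmes for patients with copd ...",
]
prior = {0: 1, 1: 0}  # prior knowledge: record index -> label (1 relevant, 0 irrelevant)

def ask_reviewer(text: str) -> int:
    """Placeholder for the human decision on the proposed record."""
    return int(input(f"Relevant (1) or irrelevant (0)? {text[:60]} "))

X = TfidfVectorizer(stop_words="english").fit_transform(records)
labelled = dict(prior)
unlabelled = set(range(len(records))) - set(prior)

while unlabelled:  # in practice a stopping criterion ends the loop earlier
    model = MultinomialNB().fit(X[list(labelled)], list(labelled.values()))
    idx = np.fromiter(unlabelled, dtype=int)
    relevant_col = list(model.classes_).index(1)
    scores = model.predict_proba(X[idx])[:, relevant_col]
    top = int(idx[np.argmax(scores)])          # record most likely to be relevant
    labelled[top] = ask_reviewer(records[top])
    unlabelled.remove(top)
```

Swapping MultinomialNB or TfidfVectorizer for another classifier or feature extractor changes the ranking behaviour but not the loop itself, which is the flexibility described above.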

Drawbacks of machine learning-based screening systems, including our own, remain. First, although the active learning step greatly reduces the number of manuscripts that must be screened, it also prevents a straightforward evaluation of the system’s error rates without further onerous labelling. Providing users with an accurate estimate of the system’s error rate in the application at hand is therefore a pressing open problem. Second, although, as argued above, the use of such systems is not limited in principle to reviewing, no empirical benchmarks of actual performance in these other situations yet exist to our knowledge. Third, machine learning-based screening systems automate the screening step only; although the screening step is time-consuming and a good target for automation, it is just one part of a much larger process, including the initial search, data extraction, coding for risk of bias, summarizing results and so on. Although some other works, similar to our own, have looked at (semi-)automating some of these steps in isolation 53 , 54 , to our knowledge the field is still far removed from an integrated system that would truly automate the review process while guaranteeing the quality of the produced evidence synthesis. Integrating the various tools that are currently under development to aid the systematic reviewing pipeline is therefore a worthwhile topic for future development.

Possible future research could also focus on the performance of identifying full text articles with different document length and domain-specific terminologies or even other types of text, such as newspaper articles and court cases. When the selection of past knowledge is not possible based on expert knowledge, alternative methods could be explored. For example, unsupervised learning or pseudolabelling algorithms could be used to improve training 55 , 56 . In addition, as the NLP community pushes forward the state of the art in feature extraction methods, these are easily added to our system as well. In all cases, performance benefits should be carefully evaluated using benchmarks for the task at hand. To this end, common benchmark challenges should be constructed that allow for an even comparison of the various tools now available. To facilitate such a benchmark, we have constructed a repository of publicly available systematic reviewing datasets 57 .

The future of systematic reviewing will be an interaction with machine learning algorithms to deal with the enormous increase of available text. We invite the community to contribute to open source projects such as our own, as well as to common benchmark challenges, so that we can provide measurable and reproducible improvement over current practice.

Data availability

The results described in this paper are available at the Open Science Framework ( https://doi.org/10.17605/OSF.IO/2JKD6 ) 43 . The answers to the quantitative questions of the UX test can be found at the Open Science Framework (OSF.IO/7PQNM) 52 .

Code availability

All code to reproduce the results described in this paper can be found on Zenodo ( https://doi.org/10.5281/zenodo.4024122 ) 42 . All code for the software ASReview is available under an Apache 2.0 license ( https://doi.org/10.5281/zenodo.3345592 ) 27 , is maintained on GitHub 63 and includes documentation ( https://doi.org/10.5281/zenodo.4287120 ) 28 .

Bornmann, L. & Mutz, R. Growth rates of modern science: a bibliometric analysis based on the number of publications and cited references. J. Assoc. Inf. Sci. Technol. 66 , 2215–2222 (2015).


Gough, D., Oliver, S. & Thomas, J. An Introduction to Systematic Reviews (Sage, 2017).

Cooper, H. Research Synthesis and Meta-analysis: A Step-by-Step Approach (SAGE Publications, 2015).

Liberati, A. et al. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. J. Clin. Epidemiol. 62 , e1–e34 (2009).

Boaz, A. et al. Systematic Reviews: What have They Got to Offer Evidence Based Policy and Practice? (ESRC UK Centre for Evidence Based Policy and Practice London, 2002).

Oliver, S., Dickson, K. & Bangpan, M. Systematic Reviews: Making Them Policy Relevant. A Briefing for Policy Makers and Systematic Reviewers (UCL Institute of Education, 2015).

Petticrew, M. Systematic reviews from astronomy to zoology: myths and misconceptions. Brit. Med. J. 322 , 98–101 (2001).

Lefebvre, C., Manheimer, E. & Glanville, J. in Cochrane Handbook for Systematic Reviews of Interventions (eds. Higgins, J. P. & Green, S.) 95–150 (John Wiley & Sons, 2008); https://doi.org/10.1002/9780470712184.ch6 .

Sampson, M., Tetzlaff, J. & Urquhart, C. Precision of healthcare systematic review searches in a cross-sectional sample. Res. Synth. Methods 2 , 119–125 (2011).

Wang, Z., Nayfeh, T., Tetzlaff, J., O’Blenis, P. & Murad, M. H. Error rates of human reviewers during abstract screening in systematic reviews. PLoS ONE 15 , e0227742 (2020).

Marshall, I. J. & Wallace, B. C. Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. Syst. Rev. 8 , 163 (2019).

Harrison, H., Griffin, S. J., Kuhn, I. & Usher-Smith, J. A. Software tools to support title and abstract screening for systematic reviews in healthcare: an evaluation. BMC Med. Res. Methodol. 20 , 7 (2020).

O’Mara-Eves, A., Thomas, J., McNaught, J., Miwa, M. & Ananiadou, S. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst. Rev. 4 , 5 (2015).

Wallace, B. C., Trikalinos, T. A., Lau, J., Brodley, C. & Schmid, C. H. Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinf. 11 , 55 (2010).

Cohen, A. M., Hersh, W. R., Peterson, K. & Yen, P.-Y. Reducing workload in systematic review preparation using automated citation classification. J. Am. Med. Inform. Assoc. 13 , 206–219 (2006).

Kremer, J., Steenstrup Pedersen, K. & Igel, C. Active learning with support vector machines. WIREs Data Min. Knowl. Discov. 4 , 313–326 (2014).

Miwa, M., Thomas, J., O’Mara-Eves, A. & Ananiadou, S. Reducing systematic review workload through certainty-based screening. J. Biomed. Inform. 51 , 242–253 (2014).

Settles, B. Active Learning Literature Survey (Minds@UW, 2009); https://minds.wisconsin.edu/handle/1793/60660

Holzinger, A. Interactive machine learning for health informatics: when do we need the human-in-the-loop? Brain Inform. 3 , 119–131 (2016).

Van de Schoot, R. & De Bruin, J. Researcher-in-the-loop for Systematic Reviewing of Text Databases (Zenodo, 2020); https://doi.org/10.5281/zenodo.4013207

Kim, D., Seo, D., Cho, S. & Kang, P. Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec. Inf. Sci. 477 , 15–29 (2019).

Nosek, B. A. et al. Promoting an open research culture. Science 348 , 1422–1425 (2015).

Kilicoglu, H., Demner-Fushman, D., Rindflesch, T. C., Wilczynski, N. L. & Haynes, R. B. Towards automatic recognition of scientifically rigorous clinical research evidence. J. Am. Med. Inform. Assoc. 16 , 25–31 (2009).

Gusenbauer, M. & Haddaway, N. R. Which academic search systems are suitable for systematic reviews or meta‐analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources. Res. Synth. Methods 11 , 181–217 (2020).

Borah, R., Brown, A. W., Capers, P. L. & Kaiser, K. A. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open 7 , e012545 (2017).

de Vries, H., Bekkers, V. & Tummers, L. Innovation in the Public Sector: a systematic review and future research agenda. Public Adm. 94 , 146–166 (2016).

Van de Schoot, R. et al. ASReview: Active Learning for Systematic Reviews (Zenodo, 2020); https://doi.org/10.5281/zenodo.3345592

De Bruin, J. et al. ASReview Software Documentation 0.14 (Zenodo, 2020); https://doi.org/10.5281/zenodo.4287120

ASReview PyPI Package (ASReview Core Development Team, 2020); https://pypi.org/project/asreview/

Docker container for ASReview (ASReview Core Development Team, 2020); https://hub.docker.com/r/asreview/asreview

Ferdinands, G. et al. Active Learning for Screening Prioritization in Systematic Reviews—A Simulation Study (OSF Preprints, 2020); https://doi.org/10.31219/osf.io/w6qbg

Fu, J. H. & Lee, S. L. Certainty-enhanced active learning for improving imbalanced data classification. In 2011 IEEE 11th International Conference on Data Mining Workshops 405–412 (IEEE, 2011).

Le, Q. V. & Mikolov, T. Distributed representations of sentences and documents. Preprint at https://arxiv.org/abs/1405.4053 (2014).

Ramos, J. Using TF–IDF to determine word relevance in document queries. In Proc. 1st Instructional Conference on Machine Learning Vol. 242, 133–142 (ICML, 2003).

Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12 , 2825–2830 (2011).


Reimers, N. & Gurevych, I. Sentence-BERT: sentence embeddings using siamese BERT-networks. Preprint at https://arxiv.org/abs/1908.10084 (2019).

Smith, V., Devane, D., Begley, C. M. & Clarke, M. Methodology in conducting a systematic review of systematic reviews of healthcare interventions. BMC Med. Res. Methodol. 11 , 15 (2011).

Wynants, L. et al. Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal. Brit. Med. J . 369 , 1328 (2020).

Van de Schoot, R. et al. Extension for COVID-19 Related Datasets in ASReview (Zenodo, 2020). https://doi.org/10.5281/zenodo.3891420 .

Lu Wang, L. et al. CORD-19: The COVID-19 open research dataset. Preprint at https://arxiv.org/abs/2004.10706 (2020).

Fraser, N. & Kramer, B. Covid19_preprints (FigShare, 2020); https://doi.org/10.6084/m9.figshare.12033672.v18

Ferdinands, G., Schram, R., Van de Schoot, R. & De Bruin, J. Scripts for ‘ASReview: Open Source Software for Efficient and Transparent Active Learning for Systematic Reviews’ (Zenodo, 2020); https://doi.org/10.5281/zenodo.4024122

Ferdinands, G., Schram, R., van de Schoot, R. & de Bruin, J. Results for ‘ASReview: Open Source Software for Efficient and Transparent Active Learning for Systematic Reviews’ (OSF, 2020); https://doi.org/10.17605/OSF.IO/2JKD6

Kwok, K. T. T., Nieuwenhuijse, D. F., Phan, M. V. T. & Koopmans, M. P. G. Virus metagenomics in farm animals: a systematic review. Viruses 12 , 107 (2020).

Hall, T., Beecham, S., Bowes, D., Gray, D. & Counsell, S. A systematic literature review on fault prediction performance in software engineering. IEEE Trans. Softw. Eng. 38 , 1276–1304 (2012).

van de Schoot, R., Sijbrandij, M., Winter, S. D., Depaoli, S. & Vermunt, J. K. The GRoLTS-Checklist: guidelines for reporting on latent trajectory studies. Struct. Equ. Model. Multidiscip. J. 24 , 451–467 (2017).


van de Schoot, R. et al. Bayesian PTSD-trajectory analysis with informed priors based on a systematic literature search and expert elicitation. Multivar. Behav. Res. 53 , 267–291 (2018).

Cohen, A. M., Bhupatiraju, R. T. & Hersh, W. R. Feature generation, feature selection, classifiers, and conceptual drift for biomedical document triage. In Proc. 13th Text Retrieval Conference (TREC, 2004).

Vasalou, A., Ng, B. D., Wiemer-Hastings, P. & Oshlyansky, L. Human-moderated remote user testing: protocols and applications. In 8th ERCIM Workshop, User Interfaces for All Vol. 19 (ERCIM, 2004).

Joffe, H. in Qualitative Research Methods in Mental Health and Psychotherapy: A Guide for Students and Practitioners (eds Harper, D. & Thompson, A. R.) Ch. 15 (Wiley, 2012).

NVivo v. 12 (QSR International Pty, 2019).

Hindriks, S., Huijts, M. & van de Schoot, R. Data for UX-test ASReview - June 2020. OSF https://doi.org/10.17605/OSF.IO/7PQNM (2020).

Marshall, I. J., Kuiper, J. & Wallace, B. C. RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. J. Am. Med. Inform. Assoc. 23 , 193–201 (2016).

Nallapati, R., Zhou, B., dos Santos, C. N., Gulcehre, Ç. & Xiang, B. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proc. 20th SIGNLL Conference on Computational Natural Language Learning 280–290 (Association for Computational Linguistics, 2016).

Xie, Q., Dai, Z., Hovy, E., Luong, M.-T. & Le, Q. V. Unsupervised data augmentation for consistency training. Preprint at https://arxiv.org/abs/1904.12848 (2019).

Ratner, A. et al. Snorkel: rapid training data creation with weak supervision. VLDB J. 29 , 709–730 (2020).

Systematic Review Datasets (ASReview Core Development Team, 2020); https://github.com/asreview/systematic-review-datasets

Wallace, B. C., Small, K., Brodley, C. E., Lau, J. & Trikalinos, T. A. Deploying an interactive machine learning system in an evidence-based practice center: Abstrackr. In Proc. 2nd ACM SIGHIT International Health Informatics Symposium 819–824 (Association for Computing Machinery, 2012).

Cheng, S. H. et al. Using machine learning to advance synthesis and use of conservation and environmental evidence. Conserv. Biol. 32 , 762–764 (2018).

Yu, Z., Kraft, N. & Menzies, T. Finding better active learners for faster literature reviews. Empir. Softw. Eng . 23 , 3161–3186 (2018).

Ouzzani, M., Hammady, H., Fedorowicz, Z. & Elmagarmid, A. Rayyan—a web and mobile app for systematic reviews. Syst. Rev. 5 , 210 (2016).

Przybyła, P. et al. Prioritising references for systematic reviews with RobotAnalyst: a user study. Res. Synth. Methods 9 , 470–488 (2018).

ASReview: Active learning for Systematic Reviews (ASReview Core Development Team, 2020); https://github.com/asreview/asreview


Acknowledgements

We would like to thank the Utrecht University Library, focus area Applied Data Science, and departments of Information and Technology Services, Test and Quality Services, and Methodology and Statistics, for their support. We also want to thank all researchers who shared data, participated in our user experience tests or who gave us feedback on ASReview in other ways. Furthermore, we would like to thank the editors and reviewers for providing constructive feedback. This project was funded by the Innovation Fund for IT in Research Projects, Utrecht University, the Netherlands.

Author information

Authors and affiliations

Department of Methodology and Statistics, Faculty of Social and Behavioral Sciences, Utrecht University, Utrecht, the Netherlands

Rens van de Schoot, Gerbrich Ferdinands, Albert Harkema, Joukje Willemsen, Yongchao Ma, Qixiang Fang, Sybren Hindriks & Daniel L. Oberski

Department of Research and Data Management Services, Information Technology Services, Utrecht University, Utrecht, the Netherlands

Jonathan de Bruin, Raoul Schram, Parisa Zahedi & Maarten Hoogerwerf

Utrecht University Library, Utrecht University, Utrecht, the Netherlands

Jan de Boer, Felix Weijdema & Bianca Kramer

Department of Test and Quality Services, Information Technology Services, Utrecht University, Utrecht, the Netherlands

Martijn Huijts

School of Governance, Faculty of Law, Economics and Governance, Utrecht University, Utrecht, the Netherlands

Lars Tummers

Department of Biostatistics, Data management and Data Science, Julius Center, University Medical Center Utrecht, Utrecht, the Netherlands

Daniel L. Oberski


Contributions

R.v.d.S. and D.O. originally designed the project, with later input from L.T. J.d.Br. is the lead engineer and software architect and supervises the code base on GitHub. R.S. coded the algorithms and simulation studies. P.Z. coded the very first version of the software. J.d.Bo., F.W. and B.K. developed the systematic review pipeline. M.Huijts led the UX tests, supported by S.H. M.Hoogerwerf developed the architecture of the produced (meta)data. G.F. conducted the simulation study together with R.S. A.H. performed the literature search comparing the different tools together with G.F. J.W. designed all the artwork and helped with formatting the manuscript. Y.M. and Q.F. are responsible for the preprocessing of the metadata under the supervision of J.d.Br. R.v.d.S., D.O. and L.T. wrote the paper with input from all authors. Each co-author has written parts of the manuscript.

Corresponding author

Correspondence to Rens van de Schoot .

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Machine Intelligence thanks Jian Wu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information


Overview of software tools supporting systematic reviews.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

van de Schoot, R., de Bruin, J., Schram, R. et al. An open source machine learning framework for efficient and transparent systematic reviews. Nat Mach Intell 3 , 125–133 (2021). https://doi.org/10.1038/s42256-020-00287-7


Received : 04 June 2020

Accepted : 17 December 2020

Published : 01 February 2021

Issue Date : February 2021

DOI : https://doi.org/10.1038/s42256-020-00287-7


Artificial intelligence for systematic literature reviews: opportunities and challenges

This manuscript presents a comprehensive review of the use of Artificial Intelligence (AI) in Systematic Literature Reviews (SLRs). An SLR is a rigorous and organised methodology that assesses and integrates prior research on a given topic. Numerous tools have been developed to assist and partially automate the SLR process. The increasing role of AI in this field shows great potential in providing more effective support for researchers, moving towards the semi-automatic creation of literature reviews. Our study focuses on how AI techniques are applied in the semi-automation of SLRs, specifically in the screening and extraction phases. We examine 21 leading SLR tools using a framework that combines 23 traditional features with 11 AI features. We also analyse 11 recent tools that leverage large language models for searching the literature and assisting academic writing. Finally, the paper discusses current trends in the field, outlines key research challenges, and suggests directions for future research. We highlight three primary research challenges: integrating advanced AI solutions, such as large language models and knowledge graphs, improving usability, and developing a standardised evaluation framework. We also propose best practices to ensure more robust evaluations in terms of performance, usability, and transparency. Overall, this review offers a detailed overview of AI-enhanced SLR tools for researchers and practitioners, providing a foundation for the development of next-generation AI solutions in this field.

Introduction

On this page, we report the tables we developed while conducting our analysis of the Systematic Literature Review Tools.

Systematic Literature Review Tools analysed through AI and Generic Features

Screening Phase of Systematic Literature Review Tools analysed through AI Features

  • Extraction Phase of Systematic Literature Review Tools analysed through AI Features
  • Systematic Literature Review Tools analysed based on AI Features

Figures of the paper

PRISMA checklist

  • Codebase Snowballing

How to cite

In this section, we report three tables that describe the 21 systematic literature review tools examined according to both generic and AI-based features. In Section 1.1 and Section 1.2, we present the analysis of the AI features for the screening and the extraction phases, respectively. In Section 1.3, we report the analysis of the tools according to the generic features. A copy of these tables is persistently available on ORKG.

Tool Research Field SLR Task Human Interaction Approach Text Representation Input Output Minimum Requirement Model Execution Pre-screening Support Post-Screening Support
Abstrackr Any Classification of relevant papers. Ranking papers as relevant and irrelevant. ML classifier (Active Learning) based on SVM applying constrained weight space. Bag of words. Title & Abstract Pre classification of papers based on inclusion probability. - Synchronous: the model updates in the background as soon as a new paper is added. Keywords search: It searches by keywords that can be coloured by level of relevance. NO
ASReview Any Classification of relevant papers. Ranking papers as relevant and irrelevant. ML classifier (Active Learning) based on Naive Bayes (default), SVM, logistic regression, Random Forest or simple NN. Bag of words.
Embeddings: SentenceBERT, doc2vec.
Title & Abstract Pre classification of papers based on inclusion probability (likelihood of relevance from 0 to 1). For values greater than 0.5 the papers are marked as relevant. Relevant papers: 1.
Irrelevant papers: 1.
Synchronous Keywords search: Basic search through keywords NO
Colandr Any Task 1: Classification of relevant papers.
Task 2: Identification of the category attributed to the paper by the user.
Task 1: Ranking papers as relevant and irrelevant.
Task 2: The user has to assign categories (tags) to the papers.
Task 1: Similarity-based method: Identifies co-occurrences of words around the keywords selected by the user (Active Learning).
Task 2: NER for Locations (Active Learning). ML classifier for other tags based on logistic regression (Active Learning).
Task 1: Embeddings: Word2vec.
Task 2: Embeddings: Glove
Task 1: Title & Abstract
Task 2: Full content
Task 1: Pre classification of papers based on inclusion probability (relevance).
Task 2: Listing of sentences that best predict the label (High, Medium, Low).
Task 1: 10 relevant papers and 10 irrelevant papers.
Task 2: Minimum 50 papers.
Synchronous for both tasks Keywords search: It searches by keywords that can be coloured by level of relevance. NO
Covidence Any Task 1: Classification of relevant papers.
Task 2: Identification of biomedical studies (RCTs).
Task 1: Ranking papers as relevant and irrelevant.
Task 2: No user interaction.
Task 1: ML classifier (Active Learning) based on two SVM ensembles.
Task 2: ML classifier (Supervised Learning) based on SVM (RCT classifier).
Bag of words for both tasks: ngrams. Task 1: Title & Abstract
Task 2: Title & Abstract
Task 1: Pre classification of papers based on inclusion probability (relevancy prediction).
Task 2: Pre classification of papers based on inclusion probability (from 0 to 1). For values greater than 0.24 the papers are marked as Possible RCT.
Task 1: 2 relevant papers and 2 irrelevant papers.
Task 2: Not Applicable.
- Keywords search: Basic search through keywords NO
DistillerSR Any Classification of relevant papers. Ranking papers as relevant and irrelevant. ML classifiers (Active Learning) based on SVM or Naive Bayes. Bag of words. Title & Abstract Naive Bayes: Pre classification of papers based on inclusion probability (prediction score from 0 to 1). For values greater than or equal to 0.5 the papers are likely to be marked as relevant.
SVM: Pre classification of papers based on binary classification (from 0 to 1, then defined as Include, Exclude, or Cannot Decide).
Relevant papers: 10.
Irrelevant papers: 40.
- Keywords search: Basic search through keywords None
EPPI-Reviewer Any Task 1: Classification of relevant papers.
Task 2: Identification of biomedical studies (RCTs, Systematic Reviews, Economic Evaluations, COVID-19 categories, long COVID).
Task 1: Ranking papers as relevant and irrelevant.
Task 2: No user interaction.
Task 1: ML classifier (Active Learning) based on SVM.
Task 2: ML classifier (Supervised Learning) based on SVM for Cochrane RCT. For Original RCT, Systematic Review, Economic Evaluations, COVID-19 categories, and Identify Long COVID the information is not available.
Task 1: Bag of words (ngrams).
Task 2: The Cochrane RCT classifier uses bag of words. For the other approaches the information is not available.
Task 1: Title & Abstract
Task 2: Title & Abstract
Task 1: Pre classification of papers based on inclusion probability (probability score from 0 to 100).
Task 2: Pre classification of papers based on inclusion probability (from 0 to 1). For values greater than 0.24 the papers are marked as Possible RCT.
Task 1: 5 relevant papers. Number of irrelevant papers not available.
Task 2: Not Applicable
Synchronous for both tasks Keywords search: It searches by keywords that could be highlighted;
Tags: It searches by terms or tags assigned after reading the paper.
NO
FAST2 Any Classification of relevant papers. Ranking papers as relevant and irrelevant. ML classifier (Active Learning) based on SVM. Bag of words. Title & Abstract Pre classification of papers based on inclusion probability (decision score from 0 to 1). - Synchronous NO NO
Iris.ai Any Clustering of Abstracts Task 1: Exploratory search: When the researcher is a novice or is exploring a new field.
1.1 Provide a seed idea (300-500 words) or a title or abstract of a paper.
1.2 Select the relevant papers from the visual map manually or using the search filter. In the latter they can narrow down the results based on topics or concepts using the analyze tool.

Task 2: Advanced search: When the researcher has expertise on the topic.
2.1 Dataset selection (online database or their own records).
2.2 Clustering of results with the search filter option, which allows them to apply the analyze tool and/or the context filter.
2.3 Selection of the relevant papers.
2.4 Repeat steps 2.2 and 2.3 until they consider it appropriate to stop.
Similarity-based method: Matches the fingerprint of the text or abstract with the fingerprints of the papers of the databases CORE, PubMed, US Patent or CORDIS (Unsupervised Learning).
The fingerprint is a vector representation of the most meaningful words and their synonyms of the text or abstract.
Embeddings. Title & Abstract Pre classification of papers based on inclusion probability (relevance score from 0 to 1, with thresholds of 0.4 and 0.9). Not Applicable Synchronous Keywords search: Basic search through keywords;
Topic Groups: from a global topic (what topics do these articles fall within from an overall scientific level) as well as a specific topic (within this reading list, what topics do the articles fall within) based on visual maps.
Abstractive summarisation based on the selected papers.
LitSuggest Biomedicine Classification of relevant papers. Ranking papers as relevant and irrelevant. ML classifier (Active Learning) based on logistic regression. Bag of words. Title & Abstract Pre classification of papers based on inclusion probability (relevance prediction from 0 to 1). - Synchronous NO NO
Nested Knowledge Any Classification of relevant papers. Ranking papers as relevant and irrelevant. ML classifier (Active Learning) which is unknown. - Title & Abstract Pre classification of papers based on inclusion probability (inclusion prediction). - - Keywords search: Basic search through keywords;
PICO identification: Highlights the parts of a PICO question in the abstract;
Ontology based on the user classification.
PICOPortal Any Task 1: Classification of relevant papers.
Task 2: Identification of biomedical studies (RCTs).
Task 1: Ranking papers as relevant and irrelevant.
Task 2: No user interaction.
Task 1: ML classifier (Active Learning) based on SVM.
Task 2: ML classifier (Supervised Learning) based on a voting system implementing a decision tree or deep learning.
Embeddings for Task 2: BioBERT.
No information regarding Task 1.
Task 1: Title & Abstract
Task 2: Title & Abstract
Task 1: Pre classification of papers based on inclusion probability.
Task 2: Pre classification of papers based on inclusion probability (from 0 to 1).
- Synchronous for both tasks PICO identification: Highlights the parts of a PICO question in the abstract. NO
pitts.ai Biomedicine Identification of biomedical studies (RCTs). No user interaction. ML classifier (Supervised Learning) based on SVM. Embeddings: SciBERT Title & Abstract Pre classification of papers based on inclusion probability (prediction probability from 0 to 1). Not Applicable Synchronous NO NO
Rayyan Any Classification of relevant papers. Ranking papers as relevant and irrelevant. ML classifier (Active Learning) based on SVM. Bag of words: ngrams Title & Abstract Pre classification of papers based on inclusion probability (score probability).
The score is based on a 5 star scheme, higher score identifies the relevant papers.
Relevant papers: 5.
Irrelevant papers: 5.
Synchronous Keywords search: It searches by keywords that could be highlighted;
Other searches: It searches by author or publication year;
Boolean Operator: It searches by the combination of boolean operators (AND, OR, NOT) with keywords.
PICO identification: Highlights the parts of a PICO question in the abstract.
Location facet: Extracts the study locations (highly applicable in biomedical studies).
Topics: Extracts topics enriching them with MeSH terms.
Biomedical keywords: Prepopulates a set of keywords and phrases (highly applicable in RCT).
NO
Research Screener Any Classification of relevant papers. Ranking papers as relevant and irrelevant. ML classifier (Active Learning) based on an unknown algorithm. Embeddings: paragraph embedding Title & Abstract Pre classification of papers based on inclusion probability (inclusion probability). Relevant papers: 1.
Irrelevant papers: Information not available.
Synchronous NO NO
RobotAnalyst Any Classification of relevant papers. Ranking papers as relevant and irrelevant. ML classifier (Active Learning) based on SVM. Bag of words. Title & Abstract Pre classification of papers based on inclusion probability (inclusion confidence from 0 to 1). - Synchronous Topic modelling: It assigns a topic to a paper based on the most recurrent terms, which could be shared by other papers;
Clustering: groups papers by the most common terms;
Keywords/Term search: searches by keywords or terms (noun phrases);
Other criteria: Document id, Publication Year, Author, Type of Publication, Journal, Notes, Time of screening decision, retrieval method;
NO
RobotReviewer/RobotSearch Biomedicine Identification of biomedical studies (RCTs). No user interaction. ML classifier (Active Learning) based on SVM. Embeddings: SciBERT Title & Abstract Pre classification of papers based on inclusion probability (prediction probability from 0 to 1). Relevant papers: NA.
Irrelevant papers: NA.
- PICO model: It colours the different PICO elements. NO
SWIFT-Active Screener Any Classification of relevant papers. Ranking papers as relevant and irrelevant. ML classifier (Active Learning) based on log-linear. Bag of words. Title & Abstract Pre classification of papers based on inclusion probability (inclusion rate from 0 to 1). Relevant papers: 1.
Irrelevant papers: 1.
Asynchronous: The model updates every 30 papers. There must be a gap of 2 minutes between the last model built and the current model built. Keywords search: Basic search through keywords NO
SWIFT-Review Biomedicine Classification of relevant papers. Ranking papers as relevant and irrelevant. ML classifier (Active Learning) based on log-linear. Bag of words. Title & Abstract Pre classification of papers based on inclusion probability (priority ranking from 0 to 1). Relevant papers: 1.
Irrelevant papers: 1.
Synchronous Keywords search: Basic search through keywords.
Topic modelling: It assigns a topic to a paper based on the most recurrent terms, which could be shared by other papers; Keyword search: searches by keywords or tags;
NO
SysRev.com Any Classification of relevant papers. Ranking papers as relevant and irrelevant. ML classifier (Active Learning) based on a customized architecture similar to cross attention. - Title & Abstract Pre classification of papers based on inclusion probability (prediction probability). Relevant papers: 30.
Irrelevant papers: 30.
Asynchronous: The model updates nightly. NO NO

-: No information available

Extraction phase of Systematic Literature Review Tools analysed through AI Features

Tool Research Field SLR Task Approach Text Representation Input Output
RobotReviewer/RobotSearch Biomedical Identifies risks of bias: how reliable are the results? ML classifier, combining a linear model and a Convolutional Neural Network (CNN) model.
These models are trained on a dataset containing manually annotated sentences stating the level of bias.
Bag of words: ngrams.
Embeddings: embedding layer from CNN Model.
Full-text paper. Risk of bias classification (as Low, High, Unclear)
ExaCT Biomedical NER of Randomised Controlled Trials Task 1: ML classifier based on SVM to identify sentences regarding a controlled trial.
Task 2: Rule-based detection to identify the 21 CONSORT categories.
Bag of words: ngrams. Full-text paper. Possible RCT entities
Dextr Environmental Health Science Task 1: NER of animal studies.
Task 2: Entity linking of animal studies.
Task 1: ML Classifier implementing a neural network model based on bidirectional LSTM with a Conditional Random Field (BI-LSTM-CRF) architecture.
Task 2: Linking according to a customised ontology
Task 1: Embeddings: GloVe, ELMo.
Task 2: Not Applicable.
Title and Abstracts Task 1: Possible animal entities.
Task 2: Relationships of animal models and exposures vs experiments, or endpoints vs experiments.
Iris.ai Any Task 1: NER of entities selected by the user.
Task 2: Entity linking of the identified entities.
Task 1: ML classifier. Algorithm is unknown.
Task 2: Uses a knowledge graph to represent the relations within the entities of the paper or between the entities of the table. The technical implementation is unknown.
Task 1: Embeddings: word embedding.
Task 2: Not Applicable.
Full-text paper. Task 1: Possible entities based on a confidence interval.
Task 2: Additional semantics on the extracted entities.

Systematic Literature Review Tools analysed based on General Features

Tool Multiplatform Multiple user roles Multiple user support Project auditing Project progress Authentication Status of software Automated full-text retrieval Automated search Snowballing Manual reference importing Manually attaching full-text Reference importing Deduplication Discrepancy resolving In-/excluding references Reference labelling & comments Screening phases Exporting results Flow diagram creation Protocol Living/updatable Free to use SLR stage
Abstrackr Yes Single 2 Yes Limited Basic Stable release No None No Yes No PMID; csv; xml No Yes No Yes Title & Abstract csv; xml No No No Yes Screening
Colandr Yes Single 2 No Limited Basic Stable release No None No No No txt; ris; bib No Yes Yes Yes Title & Abstract csv No Yes No Yes Screening
DistillerSR Yes Multiple >1 Yes Limited Basic Stable release Yes PubMed No Yes Yes csv; enlx; ris; zip; zip(japic) Yes Yes Yes Yes Title & Abstract + Full Content - Yes No No No Screening
EPPI-Reviewer Yes Multiple >1 Yes Detailed Basic Stable release No PubMed No Yes Yes ris; PMID; ciw; Yes Yes Yes Yes Title & Abstract + Full Content The screened papers go to the next stage, which is information extraction No No No No Screening
LitSuggest Yes Single No No Limited Basic Stable release No PubMed No No No PMID No No No No Title & Abstract tsv No No Yes Yes Screening
Nested Knowledge Yes Multiple >1 Yes Detailed Basic Stable release Yes PubMed; Europe PMC; DOAJ; ClinicalTrials.gov No No Yes nBIB; ris Yes Yes Yes Yes Title & Abstract csv; ris Yes Yes No No Screening
Rayyan Yes Multiple >1 Yes Detailed Basic Stable release No None No Yes Yes enw; xml; nbib; ciw; ris; bib; csv Yes Yes Yes Yes Title & Abstract + Full Content ris; bib; enw; csv Yes No No Yes Screening
RobotAnalyst Yes Single No No Limited Basic Stable release No PubMed No Yes No txt; nbib; ris No No Yes No Title & Abstract ris No No No Yes Screening
SWIFT-Active Screener Yes Multiple >1 Yes Detailed Basic Stable release No None No No Yes PMID; xml; ris Yes Yes Yes Yes Title & Abstract + Full Content csv; json No Yes No No Screening
SWIFT-Review Yes Single No No No Basic Stable release No None No No No PMID; xml No No No No Title & Abstract txt No No No Yes Screening
FAST2 Yes Single No No No None Stable release No None No No No - No No No No Title & Abstract No export available No No No Yes Screening
ASReview Yes Single >1 No Detailed None Stable release No None No No No ris; csv; xlsx No No Yes No Title & Abstract csv; tsv; ris No No No Yes Screening
Research Screener Yes Multiple >1 No Limited Basic Stable release No None No No No xml Yes Yes Yes No Title & Abstract xml No No No Yes Screening
pitts.ai Yes Multiple >1 No Limited Basic Stable release No PubMed No No No ris No Yes Yes No Title & Abstract No export available No No No No Screening
SysRev.com Yes Multiple >1 Yes Limited Basic Stable release No PubMed No No Yes pmid; ris; enlx; json No Yes Yes Yes Title & Abstract xml; csv No No No No Screening
Covidence Yes Multiple >1 No Limited Basic Stable release No None No No Yes xml; crs; ris Yes Yes Yes Yes Title & Abstract + Full Content csv; ris Yes No No No Screening
RobotReviewer/RobotSearch Yes Single No No No None Stable release No None No No No pdf No No No No Title & Abstract No export available No No No Yes Screening + Extraction
Iris.ai Yes Single No Yes No Basic Stable release No CORE; PubMed; US Patent Office; CORDIS No No No bibtex No No No No Title & Abstract - No No No No Screening + Extraction
PICO Portal Yes Multiple >1 Yes Detailed Basic Stable release Yes None No No Yes csv; bibtex; ris; enw; xml; xls; txt; ciw Yes Yes Yes Yes Title & Abstract - No Yes No Yes Screening
Dextr Yes Single No No No Basic Stable release NA None NA NA NA ris, pdf NA No NA NA Not applicable csv; zip NA NA NA Yes Extraction
ExaCT Yes Single No No No Basic Stable release NA None NA NA NA xml NA No NA NA Not applicable No export available NA NA NA Yes Extraction

-: No information available
NA: Not applicable because the tools are specifically for extraction

In this section we attach all the figures of the manuscript in high definition (300 DPI).


In the following table we report our PRISMA checklist, using the model from "Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 2021;372:n71. doi: 10.1136/bmj.n71".

    
Section and Topic | Item # | Checklist item | Location where item is reported
TITLE
Title | 1 | Identify the report as a systematic review. | 1
ABSTRACT
Abstract | 2 | See the PRISMA 2020 for Abstracts checklist. | 1
INTRODUCTION
Rationale | 3 | Describe the rationale for the review in the context of existing knowledge. | 2
Objectives | 4 | Provide an explicit statement of the objective(s) or question(s) the review addresses. | 2
METHODS
Eligibility criteria | 5 | Specify the inclusion and exclusion criteria for the review and how studies were grouped for the syntheses. | 5
Information sources | 6 | Specify all databases, registers, websites, organisations, reference lists and other sources searched or consulted to identify studies. Specify the date when each source was last searched or consulted. | 5,6
Search strategy | 7 | Present the full search strategies for all databases, registers and websites, including any filters and limits used. | 5,6
Selection process | 8 | Specify the methods used to decide whether a study met the inclusion criteria of the review, including how many reviewers screened each record and each report retrieved, whether they worked independently, and if applicable, details of automation tools used in the process. | 5,6
Data collection process | 9 | Specify the methods used to collect data from reports, including how many reviewers collected data from each report, whether they worked independently, any processes for obtaining or confirming data from study investigators, and if applicable, details of automation tools used in the process. | 5,6,23
Data items | 10a | List and define all outcomes for which data were sought. Specify whether all results that were compatible with each outcome domain in each study were sought (e.g. for all measures, time points, analyses), and if not, the methods used to decide which results to collect. | N/A
Data items | 10b | List and define all other variables for which data were sought (e.g. participant and intervention characteristics, funding sources). Describe any assumptions made about any missing or unclear information. | 8,10,11
Study risk of bias assessment | 11 | Specify the methods used to assess risk of bias in the included studies, including details of the tool(s) used, how many reviewers assessed each study and whether they worked independently, and if applicable, details of automation tools used in the process. | 5,6
Effect measures | 12 | Specify for each outcome the effect measure(s) (e.g. risk ratio, mean difference) used in the synthesis or presentation of results. | N/A
Synthesis methods | 13a | Describe the processes used to decide which studies were eligible for each synthesis (e.g. tabulating the study intervention characteristics and comparing against the planned groups for each synthesis (item #5)). | 23
Synthesis methods | 13b | Describe any methods required to prepare the data for presentation or synthesis, such as handling of missing summary statistics, or data conversions. | N/A
Synthesis methods | 13c | Describe any methods used to tabulate or visually display results of individual studies and syntheses. | N/A
Synthesis methods | 13d | Describe any methods used to synthesize results and provide a rationale for the choice(s). If meta-analysis was performed, describe the model(s), method(s) to identify the presence and extent of statistical heterogeneity, and software package(s) used. | N/A
Synthesis methods | 13e | Describe any methods used to explore possible causes of heterogeneity among study results (e.g. subgroup analysis, meta-regression). | N/A
Synthesis methods | 13f | Describe any sensitivity analyses conducted to assess robustness of the synthesized results. | N/A
Reporting bias assessment | 14 | Describe any methods used to assess risk of bias due to missing results in a synthesis (arising from reporting biases). | 21, 22, 23
Certainty assessment | 15 | Describe any methods used to assess certainty (or confidence) in the body of evidence for an outcome. | N/A
RESULTS
Study selection | 16a | Describe the results of the search and selection process, from the number of records identified in the search to the number of studies included in the review, ideally using a flow diagram. | Fig 1
Study selection | 16b | Cite studies that might appear to meet the inclusion criteria, but which were excluded, and explain why they were excluded. | 7
Study characteristics | 17 | Cite each included study and present its characteristics. | Appendix A
Risk of bias in studies | 18 | Present assessments of risk of bias for each included study. | N/A
Results of individual studies | 19 | For all outcomes, present, for each study: (a) summary statistics for each group (where appropriate) and (b) an effect estimate and its precision (e.g. confidence/credible interval), ideally using structured tables or plots. | Appendix A
Results of syntheses | 20a | For each synthesis, briefly summarise the characteristics and risk of bias among contributing studies. | 12-21
Results of syntheses | 20b | Present results of all statistical syntheses conducted. If meta-analysis was done, present for each the summary estimate and its precision (e.g. confidence/credible interval) and measures of statistical heterogeneity. If comparing groups, describe the direction of the effect. | Table 4
Results of syntheses | 20c | Present results of all investigations of possible causes of heterogeneity among study results. | N/A
Results of syntheses | 20d | Present results of all sensitivity analyses conducted to assess the robustness of the synthesized results. | N/A
Reporting biases | 21 | Present assessments of risk of bias due to missing results (arising from reporting biases) for each synthesis assessed. | N/A
Certainty of evidence | 22 | Present assessments of certainty (or confidence) in the body of evidence for each outcome assessed. | N/A
DISCUSSION
Discussion | 23a | Provide a general interpretation of the results in the context of other evidence. | 23-31
Discussion | 23b | Discuss any limitations of the evidence included in the review. | 21-23
Discussion | 23c | Discuss any limitations of the review processes used. | 21-23
Discussion | 23d | Discuss implications of the results for practice, policy, and future research. | 23-31
OTHER INFORMATION
Registration and protocol | 24a | Provide registration information for the review, including register name and registration number, or state that the review was not registered. | Not registered
Registration and protocol | 24b | Indicate where the review protocol can be accessed, or state that a protocol was not prepared. | Not registered
Registration and protocol | 24c | Describe and explain any amendments to information provided at registration or in the protocol. | N/A
Support | 25 | Describe sources of financial or non-financial support for the review, and the role of the funders or sponsors in the review. | N/A
Competing interests | 26 | Declare any competing interests of review authors. | None
Availability of data, code and other materials | 27 | Report which of the following are publicly available and where they can be found: template data collection forms; data extracted from included studies; data used for all analyses; analytic code; any other materials used in the review. | Supplementary Material, Appendix A

Here is the codebase we developed for the snowballing search on Semantic Scholar.
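As a rough, hypothetical illustration of what such a snowballing script involves (this is not the authors' codebase, and the seed DOI below is a placeholder), the public Semantic Scholar Graph API can be queried for a paper's references (backward snowballing) and citations (forward snowballing):

```python
# Minimal backward/forward snowballing via the Semantic Scholar Graph API.
# Illustrative sketch only; the seed DOI is a placeholder and the public API is
# rate-limited for unauthenticated use.
import requests

BASE = "https://api.semanticscholar.org/graph/v1/paper"
seed = "DOI:10.1000/placeholder"  # placeholder identifier of the seed paper

def snowball(paper_id: str, direction: str = "references", limit: int = 100) -> list[dict]:
    """Return title/year/externalIds of papers cited by (references) or citing (citations) the seed."""
    url = f"{BASE}/{paper_id}/{direction}"
    params = {"fields": "title,year,externalIds", "limit": limit}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    key = "citedPaper" if direction == "references" else "citingPaper"
    return [item[key] for item in response.json().get("data", [])]

backward = snowball(seed, "references")   # papers the seed cites
forward = snowball(seed, "citations")     # papers citing the seed
print(len(backward), len(forward))
```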

F. Bolaños Burgos, A. Salatino, F. Osborne, and E. Motta. Artificial intelligence for systematic literature reviews: Opportunities and challenges. Submitted to Artificial Intelligence Review, 2024.

This work is licensed under CC BY 4.0


Artificial intelligence in systematic reviews: promising when appropriately used


Strengths and limitations of this study

  • Potential pitfalls regarding the use of artificial intelligence in systematic reviewing were identified.
  • Remedies for each pitfall were provided to ensure methodological quality. A time-efficient approach is suggested on how to conduct a transparent and reliable systematic review using an artificial intelligence tool.
  • The artificial intelligence tool described in the paper was not evaluated for its accuracy.

Medical-scientific research output has grown exponentially since the very first medical papers were published. 1–3 The output in the field of clinical medicine has increased and keeps doing so. 4 To illustrate, a quick PubMed search for ‘cardiology’ shows a fivefold increase in annual publications from 10 420 (2007) to 52 537 (2021). Although the medical-scientific output growth rate is not higher when compared with other scientific fields, 1–3 this field creates the largest output. 3 Staying updated by reading all published articles is therefore not feasible. However, systematic reviews facilitate up-to-date and accessible summaries of evidence, as they synthesise previously published results in a transparent and reproducible manner. 5 6 Hence, conclusions can be drawn that provide the highest considered level of evidence in medical research. 5 7 Therefore, systematic reviews are not only crucial in science, but they have a large impact on clinical practice and policy-making as well. 6 They are, however, highly labour-intensive to conduct due to the necessity of screening a large number of articles, which results in a high consumption of research resources. Thus, efficient and innovative reviewing methods are desired. 8

An open-source artificial intelligence (AI) tool ‘ASReview’ 9 was published in 2021 to facilitate the title and abstract screening process in systematic reviews. Applying this tool enables researchers to conduct systematic reviews more efficiently: simulations already showed its time-saving potential. 9–11 We used the tool in the study selection of our own systematic review and came across scenarios that needed consideration to prevent loss of methodological quality. In this communication paper, we provide a reliable and transparent AI-supported systematic reviewing approach.

We first describe how the AI tool was used in a systematic review conducted by our research group. For more detailed information regarding searches and eligibility criteria of the review, we refer to the protocol (PROSPERO registry: CRD42022283952). Subsequently, when deciding on the AI screening-related methodology, we applied appropriate remedies against foreseen scenarios and their pitfalls to maintain a reliable and transparent approach. These potential scenarios, pitfalls and remedies will be discussed in the Results section.

In our systematic review, the AI tool ‘ASReview’ (V.0.17.1) 9 was used for the screening of titles and abstracts by the first reviewer (SHBvD). The tool uses an active researcher-in-the-loop machine learning algorithm to rank the articles from high to low probability of eligibility for inclusion by text mining. The AI tool offers several classifier models by which the relevancy of the included articles can be determined. 9 In a simulation study using six large systematic review datasets on various topics, a Naïve Bayes (NB) classifier combined with term frequency-inverse document frequency (TF-IDF) feature extraction outperformed other model settings. 10 The NB classifier estimates the probability of an article being relevant, based on TF-IDF measurements. TF-IDF weighs how often a word occurs in an article against the number of articles in which that word appears, so that words characteristic of a specific article receive more weight. 12 This combination of NB and TF-IDF was chosen for our systematic review.
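As a point of reference, the standard TF-IDF weighting can be written as follows; exact smoothing and normalisation details vary between implementations, so this is an illustration rather than the tool's exact computation.

```latex
% Standard TF-IDF weight of term t in article d, for a collection D of N articles.
% Smoothing and normalisation details vary between implementations.
\mathrm{tfidf}(t,d,D) \;=\; \underbrace{\mathrm{tf}(t,d)}_{\text{frequency of } t \text{ in } d}
\;\times\; \underbrace{\log\frac{N}{n_t}}_{\text{inverse document frequency}},
\qquad n_t = \bigl|\{\, d' \in D : t \in d' \,\}\bigr|
```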

Before the AI tool can be used for the screening of relevant articles, its algorithm needs training with at least one relevant and one irrelevant article (ie, prior knowledge). It is assumed that the more prior knowledge, the better the algorithm is trained at the start of the screening process, and the faster it will identify relevant articles. 9 In our review, the prior knowledge consisted of three relevant articles 13–15 selected from a systematic review on the topic 16 and three randomly picked irrelevant articles.

After training with the prior knowledge, the AI tool made a first ranking of all unlabelled articles (ie, articles not yet decided on eligibility) from highest to lowest probability of being relevant. The first reviewer read the title and abstract of the number one ranked article and made a decision (‘relevant’ or ‘irrelevant’) following the eligibility criteria. Next, the AI tool took this additional knowledge into account and made a new ranking. Again, the next top-ranked article was proposed to the reviewer, who made a decision regarding eligibility. This process of the AI making rankings and the reviewer making decisions, also called ‘researcher-in-the-loop’, was repeated until the predefined data-driven stopping criterion of – in our case – 100 consecutive irrelevant articles was reached. After the reviewer had rejected what the AI tool put forward as ‘most probably relevant’ a hundred times, it was assumed that there were no relevant articles left in the unseen part of the dataset.
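The loop can be summarised in a few lines of code. The following is a minimal, illustrative sketch of such a researcher-in-the-loop procedure with a consecutive-irrelevant stopping criterion; it is not the ASReview implementation, and the `ask_reviewer` callable is a placeholder for the human decision.

```python
# Illustrative researcher-in-the-loop screening loop with a stopping criterion
# of N consecutive irrelevant decisions (a sketch, not the ASReview implementation).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

STOP_AFTER_N_IRRELEVANT = 100

def screen(unlabelled, prior_texts, prior_labels, ask_reviewer):
    """unlabelled: list of title+abstract strings not yet decided on.
    prior_labels must contain at least one 1 (relevant) and one 0 (irrelevant).
    ask_reviewer: callable returning 1 or 0 for a single record (the human decision)."""
    texts, labels = list(prior_texts), list(prior_labels)
    remaining, included = list(unlabelled), []
    consecutive_irrelevant = 0

    while remaining and consecutive_irrelevant < STOP_AFTER_N_IRRELEVANT:
        # Re-rank the remaining records after every decision
        vectorizer = TfidfVectorizer(stop_words="english")
        model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)
        scores = model.predict_proba(vectorizer.transform(remaining))[:, 1]

        top = max(range(len(remaining)), key=lambda i: scores[i])  # most probably relevant
        record = remaining.pop(top)
        decision = ask_reviewer(record)

        texts.append(record)
        labels.append(decision)
        if decision == 1:
            included.append(record)
            consecutive_irrelevant = 0
        else:
            consecutive_irrelevant += 1
    return included
```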

The articles that were labelled relevant during the title and abstract screening were each screened on full text independently by two reviewers (SHBvD and MGJB-K, AL, JvdP, CJMD, CCB) to minimise the influence of subjectivity on inclusion. Disagreements regarding inclusion were solved by a third independent reviewer.

How to maintain reliability and transparency when using AI in title and abstract screening

A summary of the potential scenarios, and their pitfalls and remedies, when using the AI tool in a systematic review is given in table 1 . These potential scenarios should not be ignored, but acted on to maintain reliability and transparency. Figure 1 shows when and where to act on during the screening process reflected by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flowchart, 17 from literature search results to publishing the review.

Figure 1  Flowchart showing when and where to act when using ASReview in systematic reviewing. Adapted from the PRISMA flowchart by Haddaway et al. 17

Table 1  Per-scenario overview of potential pitfalls and how to prevent these when using ASReview in a systematic review

Scenario ①: Only a small (ie, manually feasible*) number of articles (with possibly a high proportion relevant) available for screening. Pitfall: time wasted by considering AI-related choices and software training, with no time saved by using AI. Remedy: do not use AI; conduct manual screening.

Scenario ②: Presence of duplicate articles in ASReview. Pitfall: unequal weighing of labelled articles in AI-supported screening. Remedy: apply deduplication methods before using AI.

Scenario ③: Reviewer’s own opinion, expertise or mistakes influence(s) the AI algorithm’s article selection. Pitfall: not all relevant articles are included, potentially introducing selection bias. Remedy: reviewer training in title and abstract screening; perform (partial) double screening and check inter-reviewer agreement.

Scenario ④: AI-supported screening is stopped before, or a long time after, all relevant articles are found. Pitfall: not all relevant articles are included, potentially introducing selection bias, or time is wasted. Remedy: formulate a data-driven stopping criterion (ie, a number of consecutive irrelevant articles).

Scenario ⑤: AI-related choices not (completely) described. Pitfall: irreproducible results, leading to a low-quality systematic review. Remedy: describe and substantiate the choices that are made.

Scenario ⑥: Study selection is not transparent. Pitfall: irreproducible results (black box algorithm), leading to a low-quality systematic review. Remedy: publish open data (ie, the extracted file with all decisions).

*What is considered manually feasible is highly context-dependent (ie, the intended workload and/or reviewers available).

In our systematic review, broad literature searches in several scientific databases identified a first set of potentially relevant articles, yielding 8456 articles, enough to expect the AI tool to be efficient in the title and abstract screening (scenario ① was avoided, see table 1). Subsequently, this complete set of articles was uploaded into reference manager EndNote X9 18 and review manager Covidence, 19 where 3761 duplicate articles were removed. Given that EndNote has quite low sensitivity in identifying duplicates, additional deduplication in Covidence was considered beneficial. 20 Deduplication is usually applied in systematic reviewing, 20 but it is all the more important prior to the use of AI: because multiple decisions regarding duplicates of the same article weigh more heavily than a single decision, duplicates can disproportionately influence the classification and possibly the results (table 1, scenario ②). In our review, the deduplicated set of articles was uploaded into the AI tool. Prior to the actual AI-supported title and abstract screening, the reviewers (SHBvD and AL, MGJB-K) trained themselves with a small selection of 74 articles. The first reviewer became familiar with the ASReview software, and all three reviewers learnt how to apply the eligibility criteria, to minimise personal influence on the article selection (table 1, scenario ③).
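Deduplication itself was carried out in EndNote and Covidence; purely as an illustration of the principle, the sketch below shows one simple way to collapse duplicate records on a normalised DOI or title before uploading them to the AI tool. The record fields and example data are hypothetical.

```python
# Illustration only: collapse duplicate records on a normalised DOI or title
# before uploading them to the AI tool (fields and example data are hypothetical).
import re

def normalise(title: str) -> str:
    """Lower-case and strip punctuation so near-identical records collapse to one key."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def deduplicate(records):
    """records: list of dicts with a 'title' and optionally a 'doi'."""
    seen, unique = set(), []
    for record in records:
        key = (record.get("doi") or "").lower() or normalise(record["title"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

records = [
    {"title": "Artificial intelligence in systematic reviews", "doi": "10.1136/bmjopen-2023-072254"},
    {"title": "Artificial Intelligence in Systematic Reviews.", "doi": "10.1136/BMJOPEN-2023-072254"},
]
print(len(deduplicate(records)))  # -> 1
```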

Defining the stopping criterion used in the screening process is left to the reviewer. 9 An optimal stopping criterion in active learning is considered a perfectly balanced trade-off between a certain cost (in terms of time spent) of screening one more article versus the predictive performance (in terms of identifying a new relevant article) that could be increased by adding one more decision. 21 The optimal stopping criterion in systematic reviewing would be the moment that screening additional articles will not result in more relevant articles being identified. 22 Therefore, in our review, we predetermined a data-driven stopping criterion for the title and abstract screening as ‘100 consecutive irrelevant articles’ in order to prevent the screening from being stopped before or a long time after all relevant articles were identified ( table 1 , scenario ④ ).
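Applied to the sequence of screening decisions, such a data-driven criterion is straightforward to express. The sketch below, with a toy threshold and hypothetical decisions, finds the point at which a run of N consecutive irrelevant decisions would first stop the screening.

```python
# Sketch: given the reviewer's decisions in screening order (1 = relevant,
# 0 = irrelevant), find where a criterion of N consecutive irrelevant
# decisions would first stop the screening. Decisions here are hypothetical.
def stopping_point(decisions, n_consecutive=100):
    streak = 0
    for screened, decision in enumerate(decisions, start=1):
        streak = 0 if decision == 1 else streak + 1
        if streak == n_consecutive:
            return screened  # number of articles screened when the criterion is met
    return None  # criterion never reached: all articles would be screened

# Toy example with a threshold of 3 consecutive irrelevant decisions
print(stopping_point([1, 0, 1, 0, 0, 0, 1, 0], n_consecutive=3))  # -> 6
```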

Because the stopping criterion was reached after 1063 of the 4695 articles, only part of the total number of articles was seen. This approach might therefore be sensitive to mistakes when articles are screened by only one reviewer, influencing the algorithm and possibly resulting in an incomplete selection of articles (table 1, scenario ③). 23 As a remedy, second reviewers (AL, MGJB-K) checked 20% of the titles and abstracts seen by the first reviewer. This 20% had a relevant-to-irrelevant ratio comparable to that of all articles seen. The percentage agreement and Cohen’s kappa (κ), a measure of inter-reviewer agreement above chance, were calculated to express the reliability of the decisions taken. 24 The reviewers agreed on 96% of the decisions, and κ was 0.83. A κ of at least 0.6 is generally considered high, 24 and thus it was assumed that the algorithm was reliably trained by the first reviewer.
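The agreement check can be reproduced with standard tooling; the following sketch computes the percentage agreement and Cohen’s κ for two hypothetical sets of decisions on the same articles, using scikit-learn’s `cohen_kappa_score`.

```python
# Sketch of the inter-reviewer agreement check on the double-screened subset:
# percentage agreement and Cohen's kappa (decisions below are hypothetical).
from sklearn.metrics import cohen_kappa_score

first_reviewer  = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # 1 = relevant, 0 = irrelevant
second_reviewer = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]

agreement = sum(a == b for a, b in zip(first_reviewer, second_reviewer)) / len(first_reviewer)
kappa = cohen_kappa_score(first_reviewer, second_reviewer)
print(f"Agreement: {agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```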

The reporting of the use of the AI tool should be transparent. If the choices made regarding the use of the AI tool are not entirely reported ( table 1 , scenario ⑤ ), the reader will not be able to properly assess the methodology of the review, and review results may even be graded as low-quality due to the lack of transparent reporting. The ASReview tool offers the possibility to extract a data file providing insight into all decisions made during the screening process, in contrast to various other ‘black box’ AI-reviewing tools. 9 This file will be published alongside our systematic review to provide full transparency of our AI-supported screening. This way, the screening with AI is reproducible (remedy to scenario ⑥ , table 1 ).

Results of AI-supported study selection in a systematic review

We experienced an efficient process of title and abstract screening in our systematic review. Whereas the screening started from a database of 4695 articles, the stopping criterion was reached after 1063 articles, so 23% were seen. Figure 2A shows the proportion of articles identified as relevant at any point during the AI-supported screening process. It can be observed that the articles were indeed prioritised by the active learning algorithm: in the beginning, relatively many relevant articles were found, and this decreased as the stopping criterion (vertical red line) was approached. Figure 2B compares the screening progress when using the AI tool versus manual screening. At the moment the stopping criterion was reached, approximately 32 relevant records would have been found had the titles and abstracts been screened manually, compared with 142 articles labelled relevant using the AI tool. After the inter-reviewer agreement check, 142 articles proceeded to the full-text reviewing phase, of which 65 were excluded because they were not articles with an original research format, and three because the full text could not be retrieved. After full-text reviewing of the remaining 74 articles, 18 articles from 13 individual studies were included in our review. After snowballing, one additional article from a study already included was added.

Figure 2  Relevant articles identified after a certain number of titles and abstracts were screened using the AI tool compared with manual screening.

In our systematic review, the AI tool considerably reduced the number of articles to be screened. Since the AI tool is offered open source, many researchers may benefit from its time-saving potential in selecting articles. Choices in several scenarios regarding the use of AI, however, are still left open to the researcher and need consideration to prevent pitfalls. These include the choice whether or not to use AI by weighing the costs against the benefits, the importance of deduplication, double screening to check inter-reviewer agreement, a data-driven stopping criterion to optimally use the algorithm’s predictive performance, and the quality of reporting of the AI-related methodology chosen. This communication paper is, to our knowledge, the first to explain and discuss these choices in detail for the application of this AI tool in an example systematic review.

The main advantage of using the AI tool is the amount of time saved. Indeed, in our study, only 23% of the total number of articles were screened before the predefined stopping criterion was met. Assuming that all relevant articles were found, the AI tool saved 77% of the title and abstract screening time. However, time must be invested to become acquainted with the tool. Whether the expected screening time saved outweighs this investment is context-dependent (eg, the researcher’s digital skills, systematic reviewing skills and topic knowledge). An additional advantage is that research questions previously unanswerable due to the insurmountable number of articles to screen in a ‘classic’ (ie, manual) review now become answerable. An example of the latter is a review screening over 60 000 articles, 25 which would probably never have been performed without AI supporting the article selection.

Since its introduction in 2021, the ASReview tool has been applied in seven published reviews. 25–31 An important note is that only one 25 clearly reported the AI-related choices in the Methods and a complete and transparent flowchart reflecting the study selection process in the Results section. Two reviews reported a relatively small number (<400) of articles to screen, 26 27 of which more than 75% were screened before the stopping criterion was met, so the amount of time saved was limited. Three reviews reported many initial articles (>6000) 25 28 29 and one reported 892 articles, 31 of which only 5%–10% needed to be screened; in these reviews, the AI tool saved an impressive amount of screening time. In our systematic review, 3% of the articles were labelled relevant during the title and abstract screening and, eventually, <1% of all initial articles were included. These percentages are low, and are in line with the three above-mentioned reviews (1%–2% and 0%–1%, respectively). 25 28 29 Still, relevancy and inclusion rates are much lower when compared with ‘classic’ systematic reviews. A study evaluating the screening process in 25 ‘classic’ systematic reviews showed that approximately 18% of articles were labelled relevant and 5% were actually included. 32 This difference is probably due to the narrower literature searches in ‘classic’ reviews, performed for feasibility purposes, compared with AI-supported reviews, resulting in a higher proportion of included articles.

In this paper, we show how we applied the AI tool, but we did not evaluate it in terms of accuracy. This means that we have to deal with a certain degree of uncertainty. Despite the data-driven stopping criterion, there is a chance that relevant articles were missed, as 77% of the articles were excluded without being seen. If relevant articles were indeed missed, this could, first, be due to incorrect decisions by the reviewer, which would have undesirably influenced the training of the algorithm and thereby the labelling of articles as (ir)relevant and the order in which they were presented to the reviewer. Relevant articles could therefore have remained unseen if the stopping criterion was reached before they were presented. As a remedy, in our own systematic review, 20% of the articles screened by the first reviewer were also assessed by a second reviewer to determine inter-reviewer reliability, which was high. It should be noted, though, that ‘classic’ title and abstract screening is not necessarily better than using AI, as medical-scientific researchers tend to assess one out of nine abstracts wrongly. 32 Second, the AI tool may not have properly ranked the articles from highly relevant to irrelevant. However, given that earlier simulations demonstrated this AI tool’s accuracy, 9–11 this was not considered plausible. Since our study applied, but did not evaluate, the AI tool, we encourage future studies evaluating the performance of the tool across different scientific disciplines and contexts, since research suggests that the tool’s performance depends on the context, for example, the complexity of the research question. 33 This would not only enrich the knowledge about the AI tool, but also increase certainty about using it. Future studies should also investigate the effects of the choices made regarding the amount of prior knowledge provided to the tool, the number of articles defining the stopping criterion, and how duplicate screening is best performed, to guide future users of the tool.

Although various researcher-in-the-loop AI tools for title and abstract screening have been developed over the years, 9 23 34 they often do not develop into usable, mature software, 34 which hinders the permanent implementation of AI in research practice. For medical-scientific research practice, it would therefore be helpful if large systematic review institutions, like Cochrane and PRISMA, would consider ‘officially’ making AI part of systematic reviewing practice. When guidelines on the use of AI in systematic reviews are made available and widely recognised, AI-supported systematic reviews can be uniformly conducted and transparently reported. Only then can we really benefit from AI’s time-saving potential and reduce our research time waste.

Our experience with the AI tool during the title and abstract screening was positive, as it greatly accelerated the literature selection process. However, users should apply appropriate remedies to scenarios that may threaten the methodological quality of the review. We provided an overview of these scenarios, their pitfalls and remedies, which encourage reliable use and transparent reporting of AI in systematic reviewing. Given the importance of systematic reviews for medical guidelines and practice, and to ensure that they can continue to be conducted in the future, we consider this tool an important addition to the review process.

Supplementary Material

Contributors: SHBvD proposed the methodology and conducted the study selection. MGJB-K, CJMD and AL critically reflected on the methodology. MGJB-K and AL contributed substantially to the study selection. CCB, JvdP and CJMD contributed to the study selection. The manuscript was primarily prepared by SHBvD and critically revised by all authors. All authors read and approved the final manuscript.

Funding: The systematic review is conducted as part of the RE-SAMPLE project. RE-SAMPLE has received funding from the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 965315).

Competing interests: None declared.

Provenance and peer review: Not commissioned; externally peer reviewed.

Ethics approval

Not applicable.

Purdue University


Artificial Intelligence (AI)

AI for Systematic Review


Various AI tools are invaluable throughout the systematic review or evidence synthesis process. While the consensus acknowledges the significant utility of AI tools across different review stages, it's imperative to grasp their inherent biases and weaknesses. Moreover, ethical considerations such as copyright and intellectual property must be at the forefront.

  • Application ChatGPT in conducting systematic reviews and meta-analyses
  • Are ChatGPT and large language models “the answer” to bringing us closer to systematic review automation?
  • Artificial intelligence in systematic reviews: promising when appropriately used
  • Harnessing the power of ChatGPT for automating systematic review process: methodology, case study, limitations, and future directions
  • In-depth evaluation of machine learning methods for semi-automating article screening in a systematic review of mechanistic
  • Tools to support the automation of systematic reviews: a scoping review
  • The use of a large language model to create plain language summaries of evidence reviews in healthcare: A feasibility study
  • Using artificial intelligence methods for systematic review in health sciences: A systematic review

AI Tools for Systematic Review

  • DistillerSR — Securely automate every stage of your literature review to produce evidence-based research faster, more accurately, and more transparently at scale.
  • Rayyan — A web tool designed to help researchers working on systematic reviews, scoping reviews and other knowledge synthesis projects by dramatically speeding up the process of screening and selecting studies.
  • RobotReviewer — A machine learning system which aims to automate evidence synthesis.


A systematic review of artificial intelligence technologies used for story writing

  • Published: 05 April 2023
  • Volume 28 , pages 14361–14397, ( 2023 )


  • Xiaoxuan Fang   ORCID: orcid.org/0000-0002-3953-3993 1 ,
  • Davy Tsz Kit Ng   ORCID: orcid.org/0000-0002-2380-7814 2 ,
  • Jac Ka Lok Leung   ORCID: orcid.org/0000-0001-6490-7005 3 &
  • Samuel Kai Wah Chu   ORCID: orcid.org/0000-0003-1557-2776 2  


With the digital revolution of artificial intelligence (AI) in language education, the way people write and create stories has been transformed in recent years. Although recent studies have started to examine the roles of AI in literacy, there is a lack of systematic review on how it has been applied and what has been achieved in story-writing. This paper reviews the literature on the use of AI in story-writing during the last 5 years. The discussion explores the year of publication, countries of implementation, educational levels, participants and research methodology. In terms of research context, most studies were carried out in universities in the United States, and children and adult learners were the two most common participant groups. Most studies involved the collection and analysis of quantitative data. Subsequently, the mechanisms of using AI for story-writing are investigated in terms of the types, approaches and roles of AI, and the pedagogies used in the learning context of AI-supported story-writing are discussed. Finally, the benefits of using AI in story-writing are pointed out. The findings show that the literature has paid most attention to learners’ creativity, writing skills, presentation skills, motivation and satisfaction. The review also suggests that human-AI collaboration can effectively improve story creation, and some studies trained high-level AI to help students write better stories. As findings from the current body of research are not conclusive, more work is needed to explore the challenges of using AI in story-writing. Lastly, a set of limitations and recommendations for future research is summarised in this study.



Data availability

The authors declare that the data supporting the findings of this study are available within the article and its supplementary information files.

Alhussain, A. I., & Azmi, A. M. (2021). Automatic story generation: a survey of approaches. ACM Computing Surveys (CSUR) , 54 (5), 1–38.

Bai, B., Wang, J., & Zhou, H. (2021). An intervention study to improve primary school students’ self-regulated strategy use in English writing through e-learning in Hong Kong. Computer Assisted Language Learning , 1–23.

Biermann, O. C., Ma, N. F., & Yoon, D. (2022, June). From Tool to Companion: Storywriters Want AI Writers to Respect Their Personal Values and Writing Strategies. In Designing Interactive Systems Conference (pp. 1209–1227).

Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3 (2), 77–101.


Braun, V., & Clarke, V. (2012) Thematic analysis. In H. Cooper (Ed.), Handbook of research methods in psychology . (Vol. 2: Research Designs, pp. 57–71). Washington, DC: APA Books.

Cambre, J., Reig, S., Kravitz, Q., & Kulkarni, C. (2020, July). " All Rise for the AI Director" Eliciting Possible Futures of Voice Technology through Story Completion. In Proceedings of the 2020 ACM Designing Interactive Systems Conference (pp. 2051–2064).

Candello, H., Pichiliani, M., Wessel, M., Pinhanez, C., & Muller, M. (2019, November). Teaching robots to act and converse in physical spaces: participatory design fictions with museum guides. In Proceedings of the Halfway to the Future Symposium 2019 (pp. 1–4).

Chang, T. S., Li, Y., Huang, H. W., & Whitfield, B. (2021, March). Exploring EFL students' writing performance and their acceptance of AI-based automated writing feedback. In 2021 2nd International Conference on Education Development and Studies (pp. 31–35). Association for Computing Machinery.

Chen, L., Chen, P., & Lin, Z. (2020). Artificial intelligence in education: A review. IEEE Access, 8 , 75,264–75,278.

Chen, Z. H., & Liu, W. Y. (2021). A six-stage story structure approach for elementary students’ story production: Quality, interest, and attitude. Computer Assisted Language Learning, 34 (1–2), 184–207.

Cheong, Y. G., Park, W. H., & Yu, H. Y. (2018, October). A Demonstration of an Intelligent Storytelling System. In Proceedings of the 26th ACM international conference on Multimedia (pp. 1258–1259).

Chung, J. J. Y., Kim, W., Yoo, K. M., Lee, H., Adar, E., & Chang, M. (2022, April). TaleBrush: Sketching Stories with Generative Pretrained Language Models. In CHI Conference on Human Factors in Computing Systems (pp. 1–19).

Chow, P. S. (2020). Ghost in the (Hollywood) machine: Emergent applications of artificial intelligence in the film industry. NECSUS_European Journal of Media Studies , 9 (1), 193–214.

Clark, E., Ross, A. S., Tan, C., Ji, Y., & Smith, N. A. (2018, March). Creative writing with a machine in the loop: Case studies on slogans and stories. In 23rd International Conference on Intelligent User Interfaces (pp. 329–340).

Coenen, A., Davis, L., Ippolito, D., Reif, E., & Yuan, A. (2021). Wordcraft: a Human-AI Collaborative Editor for story writing. arXiv preprint arXiv:2107.07430 .

Cohen, L., Manion, L., & Morrison, K. (2002). Research methods in education . Routledge.


Creswell, J. W. (2012). Educational research: Planning. Conducting, and Evaluating, 260 , 375–382.


Crompton, H., Jones, M. V., & Burke, D. (2022). Affordances and challenges of artificial intelligence in K-12 education: a systematic review. Journal of Research on Technology in Education , 1–21.

Dahlström, H. (2019). Digital writing tools from the student perspective. Education and Information Technologies, 24 (2), 1563–1581.

Del-Moral-Pérez, M. E., Villalustre-Martínez, L., & Neira-Piñeiro, M. D. R. (2019). Teachers’ perception about the contribution of collaborative creation of digital storytelling to the communicative and digital competence in primary education schoolchildren. Computer Assisted Language Learning, 32 (4), 342–365.

Frich, J., MacDonald Vermeulen, L., Remy, C., Biskjaer, M. M., & Dalsgaard, P. (2019, May). Mapping the landscape of creativity support tools in HCI. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1–18).

Gagliano, P., Blustein, C., & Oppenheim, D. (2021). Agence, a dynamic film about (and with) artificial intelligence. In ACM SIGGRAPH 2021 Immersive Pavilion (pp. 1–2).

Gala, K., Somaiya, M., Gopani, M., & Joshi, A. (2021, September). Picture Tales: An Approach for Story Generation Using a Series of Images. In 2021 IEEE 4th International Conference on Computing, Power and Communication Technologies (GUCON) (pp. 1–5). IEEE.

Gero, K. I., Liu, V., & Chilton, L. (2022, June). Sparks: Inspiration for science writing using language models. In Designing Interactive Systems Conference (pp. 1002–1019).

Goldfarb-Tarrant, S., Feng, H., & Peng, N. (2019). Plan, write, and revise: an interactive system for open-domain story generation. arXiv preprint arXiv:1904.02357 .

Guan, J., Huang, F., Zhao, Z., Zhu, X., & Huang, M. (2020). A knowledge-enhanced pretraining model for commonsense story generation. Transactions of the Association for Computational Linguistics, 8 , 93–108.

Herrera-González, B. D., Gelbukh, A., & Calvo, H. (2020, October). Automatic Story Generation: State of the Art and Recent Trends. In Mexican International Conference on Artificial Intelligence (pp. 81–91). Springer, Cham.

Hsu, T. Y., Hsu, Y. C., & Huang, T. H. (2019, May). On how users edit computer-generated visual stories. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1–6).

Ibáñez, M. B., & Delgado-Kloos, C. (2018). Augmented reality for STEM learning: A systematic review. Computers & Education, 123 , 109–123.

Issa, L., & Jusoh, S. (2019, October). Applying ontology in computational creativity approach for generating a story. In 2019 2nd International Conference on new Trends in Computing Sciences (ICTCS) (pp. 1–6). IEEE.

Karlimah, K., Hamdu, G., Pratiwi, V., Herdiansah, H., & Kurniawan, D. (2021, July). The development of motion comic storyboard based on digital literacy and elementary school mathematics ability in the new normal era during covid-19 pandemic. In Journal of Physics: Conference Series (Vol. 1987, No. 1, p. 012026). IOP Publishing.

Keller, J. M. (1984). The use of the ARCS model of motivation in teacher training. In K.S.A.J. Trott (Ed.), Aspects of educational technology volume XVII: Staff development and career updating . Kogan Page.

Keller, J. M. (1987). Development and use of the ARCS model of instructional design. Journal of Instructional Development, 10 (3), 2–10.


Keller, J. M. (2009). Motivational design for learning and performance: The ARCS model approach . Springer Science & Business Media.

Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., & Socher, R. (2019). Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858 .

Kılıçkaya, F. (2020). Learners’ perceptions of collaborative digital graphic writing based on semantic mapping. Computer Assisted Language Learning, 33 (1–2), 58–84.

Klimashevskaia, A., Gadgil, R., Gerrity, T., Khosmood, F., Gütl, C., & Howe, P. (2021, November). Automatic News Article Generation from Legislative Proceedings: A Phenom-Based Approach. In International Conference on Statistical Language and Speech Processing (pp. 15–26). Springer, Cham.

Lee, M., Liang, P., & Yang, Q. (2022, April). Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities. In CHI Conference on Human Factors in Computing Systems (pp. 1–19).

Li, X., & Zhang, B. (2020, October). AI poem case analysis: Take ancient Chinese poems as an example. In Proceedings of the 2020 Conference on Artificial Intelligence and Healthcare (pp. 132–136).

Lin, P. Y., Chai, C. S., Jong, M. S. Y., Dai, Y., Guo, Y., & Qin, J. (2021). Modeling the structural relationship among primary students’ motivation to learn artificial intelligence. Computers and Education: Artificial Intelligence, 2 , 100006.

Lin, J. W., & Chang, R. G. (2022). Chinese story generation of sentence format control based on multi-channel word embedding and novel data format. Soft Computing, 26 (5), 2179–2196.

Liu, C., Hou, J., Tu, Y. F., Wang, Y., & Hwang, G. J. (2021). Incorporating a reflective thinking promoting mechanism into artificial intelligence-supported English writing environments. Interactive Learning Environments , 1–19.

Min, K., Dang, M., & Moon, H. (2021). Deep learning-based short story generation for an image using the encoder-decoder structure. IEEE Access, 9 , 113,550–113,557.

Ng, D. T. K., & Chu, S. K. W. (2021). Motivating students to learn STEM via engaging flight simulation activities. Journal of Science Education and Technology, 30 (5), 608–629.

Ng, D. T. K., Leung, J. K. L., Chu, S. K. W., & Qiao, M. S. (2021). Conceptualizing AI literacy: An exploratory review. Computers and Education: Artificial Intelligence, 2 , 100041.

Ng, D. T. K., Luo, W., Chan, H. M. Y., & Chu, S. K. W. (2022). Using digital story writing as a pedagogy to develop AI literacy among primary students. Computers and Education: Artificial Intelligence, 3 , 100054.

Nichols, E., Gao, L., Vasylkiv, Y., & Gomez, R. (2021). Design and Analysis of a Collaborative Story Generation Game for Social Robots. Frontiers in Computer Science , 74.

Noceti, N., Odone, F., Marsella, A., Moro, M., & Nicora, E. (2020, July). Tangible Coding for kids with AI inside. In Adjunct Publication of the 28th ACM Conference on User Modeling, Adaptation and Personalization (pp. 163–166).

Ochieng, P. A. (2009). An analysis of the strengths and limitation of qualitative and quantitative research paradigms. Problems of Education in the 21st Century , 13 , 13.

Osone, H., Lu, J. L., & Ochiai, Y. (2021, May). BunCho: AI Supported Story Co-Creation via Unsupervised Multitask Learning to Increase Writers’ Creativity in Japanese. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1–10).

Ouyang, F., Zheng, L., & Jiao, P. (2022). Artificial intelligence in online higher education: A systematic review of empirical research from 2011 to 2020. Education and Information Technologies , 1–33.

Park, W., & Park, K. (2018, February). Story creation and design algorithm in unity. In 2018 20th International Conference on Advanced Communication Technology (ICACT) (pp. 444–447). IEEE

Peng, N., Ghazvininejad, M., May, J., & Knight, K. (2018, June). Towards controllable story generation. In Proceedings of the First Workshop on Storytelling (pp. 43–49).

Petticrew, M., & Roberts, H. (2008). Systematic reviews in the social sciences: A practical guide . John Wiley & Sons.

Refat, N., Rahman, M. A., Asyhari, A. T., Kurniawan, I. F., Bhuiyan, M. Z. A., & Kassim, H. (2019). Interactive learning experience-driven smart communications networks for cognitive load management in grammar learning context. IEEE Access , 7 , 64,545–64,557.

Roemmele, M., & Gordon, A. S. (2018, March). Automated assistance for creative writing with an rnn language model. In Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion (pp. 1–2).

Shakeri, H., Neustaedter, C., & DiPaola, S. (2021, October). SAGA: Collaborative Storytelling with GPT-3. In Companion Publication of the 2021 Conference on Computer Supported Cooperative Work and Social Computing (pp. 163–166).

Su, J., & Yang, W. (2022). Artificial intelligence in early childhood education: A scoping review. Computers and Education: Artificial Intelligence , 100049.

Suh, S., & An, P. (2022, March). Leveraging Generative Conversational AI to Develop a Creative Learning Environment for Computational Thinking. In 27th International Conference on Intelligent User Interfaces (pp. 73–76).

Takacs, Z. K., Swart, E. K., & Bus, A. G. (2015). Benefits and pitfalls of multimedia and interactive features in technology-enhanced storybooks: A meta-analysis. Review of Educational Research, 85 (4), 698–739.

Tanrıkulu, F. (2022). Students’ perceptions about the effects of collaborative digital storytelling on writing skills. Computer Assisted Language Learning, 35 (5–6), 1090–1105.

Transformer Jr, G. P., Note, E. X., Spellchecker, M. S., & Yampolskiy, R. (2020). When Should Co-Authorship Be Given to AI?  PhilArchive . https://philarchive.org/archive/GPTWSCv1

Tricco, A. C., Lillie, E., Zarin, W., O'Brien, K. K., Colquhoun, H., Levac, D., ... & Straus, S. E. (2018). PRISMA extension for scoping reviews (PRISMA-ScR): checklist and explanation. Annals of internal medicine , 169 (7), 467–473.

Tsou, W., & Tsai, S. C. (2022). Interactive learning for professional development of bilingual education by a blended instructional approach. Interactive Learning Environments , 1–13.

Valls-Vargas, J., Zhu, J., & Ontanón, S. (2014, September). Toward automatic role identification in unannotated folk tales. In Tenth Artificial Intelligence and Interactive Digital Entertainment Conference .

Wang, Y. (2021, May). The Application of Artificial Intelligence in Chinese News Media. In 2021 2nd International Conference on Artificial Intelligence and Information Systems (pp. 1–4).

Watcharapunyawong, S., & Usaha, S. (2013). Thai EFL Students’ Writing Errors in Different Text Types: The Interference of the First Language. English Language Teaching, 6 (1), 67–78.

Wicke, P., & Veale, T. (2021, March). Are You Not Entertained? Computational Storytelling With Non-verbal Interaction. In Companion of the 2021 ACM/IEEE International Conference on Human-Robot Interaction (pp. 200–204).

Woo, D. J., Wang, Y., & Susanto, H. (2022). Student-AI Creative Writing: Pedagogical Strategies for Applying Natural Language Generation in Schools. EdArXiv. June , 3 .

Wu, J., & Chen, D. T. V. (2020). A systematic review of educational digital storytelling. Computers & Education, 147 , 103786.

Xu, P., Patwary, M., Shoeybi, M., Puri, R., Fung, P., Anandkumar, A., & Catanzaro, B. (2020). MEGATRON-CNTRL: Controllable story generation with external knowledge using large-scale language models. arXiv preprint arXiv:2010.00840 .

Xu, Z., Banerjee, M., Ramirez, G., Zhu, G., & Wijekumar, K. (2019). The effectiveness of educational technology applications on adult English language learners’ writing quality: A meta-analysis. Computer Assisted Language Learning, 32 (1–2), 132–162.

Young, R. M., Ware, S. G., Cassell, B. A., & Robertson, J. (2013). Plans and planning in narrative generation: A review of plan-based approaches to the generation of story, discourse and interactivity in narratives. Sprache Und Datenverarbeitung, Special Issue on Formal and Computational Models of Narrative, 37 (1–2), 41–64.

Yu, M. (2021). The Dilemmas and Reform of Translation Education in the Age of Artificial Intelligence. In 2021 2nd International Conference on Artificial Intelligence and Education (ICAIE) (pp. 40–44). IEEE.

Yuan, A., Coenen, A., Reif, E., & Ippolito, D. (2022, March). Wordcraft: story writing With Large Language Models. In 27th International Conference on Intelligent User Interfaces (pp. 841–852).

Zainuddin, Z., Chu, S. K. W., Shujahat, M., & Perera, C. J. (2020). The impact of gamification on learning and instruction: A systematic review of empirical evidence. Educational Research Review, 30 , 100326.

Zhang, M. (2020, October). Application of Artificial Intelligence Interactive storytelling in Animated. In 2020 International Conference on Control, Robotics and Intelligent System (pp. 37–41).

Zhang, C., Yao, C., Liu, J., Zhou, Z., Zhang, W., Liu, L., ... & Wang, G. (2021, May). StoryDrawer: A Co-Creative Agent Supporting Children's Storytelling through Collaborative Drawing. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1–6).

Zhang, C., Yao, C., Wu, J., Lin, W., Liu, L., Yan, G., & Ying, F. (2022, April). StoryDrawer: A Child–AI Collaborative Drawing System to Support Children's Creative Visual Storytelling. In CHI Conference on Human Factors in Computing Systems (pp. 1–15).


Author information

Authors and Affiliations

Department of Curriculum and Instruction, The Education University of Hong Kong, Hong Kong, China

Xiaoxuan Fang

Faculty of Education, The University of Hong Kong, Hong Kong, China

Davy Tsz Kit Ng & Samuel Kai Wah Chu

Division of Integrative Systems and Design, The Hong Kong University of Science and Technology, Hong Kong, China

Jac Ka Lok Leung


Corresponding author

Correspondence to Xiaoxuan Fang .

Ethics declarations

Conflict of interest.

No potential conflict of interest was reported by the author.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix. Reviewed studies and information

Each entry lists: author(s) and year of publication; type of publication; country; educational level; participant/research object; methodology; type, approach and role of the AI technologies; pedagogy; and benefit.

1. Biermann et al. ( ). Empirical study; Canada. Educational level: Adult education. Participants/research object: 20 adults (7 hobbyists and 13 professional writers). Methodology: Qualitative. AI technology: AI-writers. Approach: Modern large-scale language models. Role: Story collaborator. Pedagogy: Human-AI collaborative writing. Benefit: Help writers have good productivity and complete challenging writing tasks.

2. Cambre et al. ( ). Empirical study; USA. Educational level: Adult education. Participants/research object: 149 adults. Methodology: Mixed methods. AI technology: Voice assistant. Approach: No mention. Role: No mention. Pedagogy: No mention. Benefit: No mention.

3. Cheong et al. ( ). Descriptive study; Korea. Educational level: Not applicable. Participants/research object: Not applicable. Methodology: Not applicable. AI technology: AI planner. Approach: The Unity game engine. Role: Story animator. Pedagogy: Not applicable. Benefit: Generate the story with humans and then visualize it as a 3D animation; encourage the user to interact with the story by manipulating props or characters.

4. Chung et al. ( ). Empirical study; USA. Educational level: Adult education. Participants/research object: 14 adults (7 female and 7 male). Methodology: Qualitative. AI technology: TaleBrush. Approach: GPT-Neo. Role: Story generator. Pedagogy: Human-AI collaborative writing. Benefit: Collaborate with writers to generate stories based on their intentions about characters’ fortune; maintain the novelty of generated sentences; inspire writers’ ideas when the AI application generates stories.

5. Clark et al. ( ). Empirical study; USA. Educational level: Adult education. Participants/research object: 36 adults. Methodology: Mixed methods. AI technology: Machine-in-the-loop systems. Approach: A neural language model. Role: Story co-creator. Pedagogy: Human-AI collaborative writing. Benefit: Generate suggestions based on writers’ story context.

6. Coenen et al. ( ). Descriptive study; USA. Educational level: Not applicable. Participants/research object: Not applicable. Methodology: Not applicable. AI technology: Wordcraft (an AI-assisted editor). Approach: Neural language generation. Role: Story collaborator. Pedagogy: Not applicable. Benefit: Collaborate with writers to complete a story.

7. Gala et al. ( ). Empirical study; India. Educational level: Adult education. Participants/research object: No mention. Methodology: Quantitative. AI technology: An encoder-decoder architecture. Approach: Recurrent neural networks. Role: Story generator. Pedagogy: No mention. Benefit: Generate grammatically correct and sensible stories automatically based on images.

8. Goldfarb-Tarrant et al. ( ). Empirical study; USA. Educational level: Adult education. Participants/research object: 30 workers. Methodology: Quantitative. AI technology: A neural narrative generation system. Approach: A conditional language model implemented with LSTMs. Role: Story co-creator. Pedagogy: Human-AI collaborative writing. Benefit: Interact with humans to generate stories.

9. Hsu et al. ( ). Empirical study; USA. Educational level: Adult education. Participants/research object: 197 workers. Methodology: Quantitative. AI technology: State-of-the-art visual storytelling models. Approach: Human-centered computer (supported storytelling system). Role: Story collaborator. Pedagogy: Human-AI collaborative writing. Benefit: Assist writers in generating a machine-generated visual story based on the visual storytelling model using the VIST dataset (with photos); collaborate with users to generate good-quality and understandable stories by reducing word redundancy and increasing lexical diversity.

10. Issa and Jusoh ( ). Descriptive study; Jordan. Educational level: Not applicable. Participants/research object: Not applicable. Methodology: Not applicable. AI technology: Markov chain model (a model that uses statistics to determine a sequence of elements according to certain rules or history). Approach: Natural language generation. Role: Story generator. Pedagogy: Not applicable. Benefit: Generate educational stories automatically based on users setting characters.

11. Karlimah et al. ( ). Empirical study; Indonesia. Educational level: Primary education. Participants/research object: 25 primary students. Methodology: Quantitative. AI technology: Motion comic storyboard. Approach: Drawing applications, animation applications and project export applications. Role: Storyteller. Pedagogy: Technology-mediated story creation. Benefit: Improve primary students’ math skills by developing a motion comic prototype about fractions.

12. Lee et al. ( ). Empirical study; USA. Educational level: Adult education. Participants/research object: 63 writers. Methodology: Quantitative. AI technology: CoAuthor. Approach: GPT-3. Role: Story collaborator. Pedagogy: Human-AI collaborative writing. Benefit: Collaborate with writers to generate stories; enhance writers’ story language, ideation and collaboration capabilities.

13. Min et al. ( ). Descriptive study; Korea. Educational level: Not applicable. Participants/research object: Stories from two datasets. Methodology: Not applicable. AI technology: Visual story writer model. Approach: Recurrent neural network structure and encoder-decoder model. Role: Story generator. Pedagogy: Not applicable. Benefit: Generate several captions describing story contents based on the input images; these captions are then used to create a sequence of sentences to make a short story as the output.

14. Nichols et al. ( ). Empirical study; Canada. Educational level: Higher education. Participants/research object: 122 adults (workers and university students). Methodology: Mixed methods. AI technology: AI agent. Approach: A large-scale neural language model. Role: Story collaborator. Pedagogy: Human-AI collaborative writing. Benefit: Assist writers in generating sufficiently human-like utterances and propose a sample-and-rank approach to improve utterance quality.

15. Noceti et al. ( ). Empirical study; Italy. Educational level: Preschool and primary education. Participants/research object: Over 1000 participants (parents and children aged 5–8). Methodology: Qualitative. AI technology: Triangle coding. Approach: Triangle language. Role: Story generator. Pedagogy: Human-AI collaborative writing. Benefit: Help the interpretation of the shapes sequence and the generation of a fantasy sentence or a small story.

16. Osone et al. ( ). Empirical study; Japan. Educational level: Adult education. Participants/research object: 48 adults (16 writers and 32 readers). Methodology: Quantitative. AI technology: BunCho (an AI-supported story co-creation system). Approach: GPT-2. Role: Story co-creator. Pedagogy: Human-AI collaborative writing. Benefit: Assist Japanese novelists in creating high-level and creative writing, enhance the writing of synopses (creativity, interestingness, comprehensibility, grammatical correctness, consistency of sentences), improve common creativity metrics and broaden their stories.

17. Park and Park ( ). Descriptive study; Korea. Educational level: Not applicable. Participants/research object: Not applicable. Methodology: Not applicable. AI technology: Intelligent narrative story creation systems. Approach: Unity game engine. Role: Story co-creator. Pedagogy: Not applicable. Benefit: Enhance user satisfaction through VR in storytelling activities.

18. Peng et al. ( ). Empirical study; USA. Educational level: No mention. Participants/research object: 98,162 stories. Methodology: Quantitative. AI technology: An analyze-to-generate story framework. Approach: A conditional language model. Role: Story generator. Pedagogy: No mention. Benefit: Generate stories based on control factors extracted from existing stories to reflect a user’s intent; provide a good interaction environment for users.

19. Roemmele et al. ( ). Empirical study; USA. Educational level: Adult education. Participants/research object: 139 adults. Methodology: Quantitative. AI technology: Creative Help. Approach: Recurrent neural network language model (RNN LM). Role: Story collaborator. Pedagogy: Human-AI collaborative writing. Benefit: Assist writers in generating more grammatical and coherent sentences, making writing the story easier, influencing its content more and being more helpful overall; the authors made significantly fewer changes to the sentence suggestions.

20. Shakeri et al. ( ). Empirical study; Canada. Educational level: Adult education. Participants/research object: 2 adults. Methodology: Qualitative. AI technology: SAGA (an asynchronous collaborative storytelling system). Approach: GPT-3. Role: Story collaborator. Pedagogy: Human-AI collaborative writing. Benefit: Collaborate with users to generate stories based on their prompt forming the basis of the story, which can include the setting, genre of the story and even descriptions of the characters.

21. Suh and An ( ). Descriptive study; Canada. Educational level: Not applicable. Participants/research object: Not applicable. Methodology: Not applicable. AI technology: CodeToon (generative conversational AI). Approach: GPT-3. Role: Story co-creator. Pedagogy: Not applicable. Benefit: Assist students’ learning, creative and sensemaking processes in a visual programming environment where users can create comics from code; encourage out-of-the-box ideas and motivate users to participate actively in this co-creative process.

22. Wicke and Veale ( ). Descriptive study; Ireland. Educational level: Adult education. Participants/research object: 2 robots and 1 human. Methodology: Not applicable. AI technology: A multi-modal storytelling system. Approach: Scéalability storytelling framework. Role: Story generator. Pedagogy: Not applicable. Benefit: Collaborate with robots to develop stories based on users’ emotions and gestures.

23. Xu et al. ( ). Empirical study; China. Educational level: No mention. Participants/research object: 98,161 stories. Methodology: Mixed methods. AI technology: MEGATRON-CNTRL (a novel framework). Approach: Large-scale language models. Role: Story generator. Pedagogy: No mention. Benefit: Assist writers in generating more fluent, consistent and coherent stories with less repetition and higher diversity.

24. Yuan et al. ( ). Empirical study; USA. Educational level: Adult education. Participants/research object: 25 hobbyist writers. Methodology: Mixed methods. AI technology: Wordcraft (an AI-assisted editor). Approach: A generative language model. Role: Story co-creator. Pedagogy: Human-AI collaborative writing. Benefit: Respond to writers’ custom requests via open-ended conversation and express them in natural language; generate suggestions for writers in the creative process.

25. Zhang ( ). Descriptive study; China. Educational level: Not applicable. Participants/research object: Not applicable. Methodology: Not applicable. AI technology: An application of AI interactive storytelling in animation. Approach: Unreal game engine. Role: Story animator. Pedagogy: Not applicable. Benefit: Use a 3D animation system to present the storyline setting by characteristics and scenario development.

26. Zhang et al. ( ). Empirical study; China. Educational level: Preschool and primary education. Participants/research object: 10 children aged 5–10. Methodology: Mixed methods. AI technology: StoryDrawer. Approach: A co-creative agent. Role: Story co-creator. Pedagogy: Human-AI collaborative writing. Benefit: Assist children’s oral and drawing skills.

27. Zhang et al. ( ). Empirical study; China. Educational level: Preschool and primary education. Participants/research object: 24 participants (12 parents and 12 children aged 6–10). Methodology: Qualitative. AI technology: StoryDrawer. Approach: A context-based voice agent and two AI-driven collaborative strategies. Role: Story collaborator. Pedagogy: Human-AI collaborative writing. Benefit: Inspire participants’ creative, elaborate ideas and contribute to their creative outcomes during an engaging visual storytelling experience.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Fang, X., Ng, D.T.K., Leung, J.K.L. et al. A systematic review of artificial intelligence technologies used for story writing. Educ Inf Technol 28 , 14361–14397 (2023). https://doi.org/10.1007/s10639-023-11741-5

Download citation

Received : 27 December 2022

Accepted : 20 March 2023

Published : 05 April 2023

Issue Date : November 2023

DOI : https://doi.org/10.1007/s10639-023-11741-5


  • Digital story writing
  • AI story generators
  • Human-AI story collaboration
  • AI-supported story-writing
  • Literature review

Role of AI in Systematic Literature Review

Sumalatha G


Systematic literature reviews help researchers stay up to date by synthesizing the highest level of evidence for their research questions. However, the end-to-end process is highly resource-intensive and time-consuming.

It typically requires a year or more, as it is a long-term study comprising multiple stages, from protocol registration to submission for publication.

With the ever-expanding growth of research paper publications, conducting a manual systematic literature review would be difficult and impractical.

However, with the help of Artificial intelligence, i.e., AI software designed to automate systematic review, researchers can expedite the whole systematic search process for rapid reviews.

Using AI in systematic reviews is revolutionizing the process, making it more practical and sustainable.

Incorporating AI into the process not only expedites the systematic literature reviews but also reduces human errors, and comes as a cost-effective systematic approach.

In this article, we’ll learn more about the role of AI in systematic reviews and the different types of AI software that assist in automating the  systematic review findings process .

What is a Systematic Review?

A systematic literature review, or systematic review, is an in-depth analysis of existing studies conducted to answer a specific research question. It generates an evidence-based synthesis by employing a thorough search strategy that identifies, screens, and analyzes all the relevant publications (particularly in healthcare research), including both published and unpublished scientific literature.

Earlier, systematic reviews were mostly done manually. Researchers would sift through a vast number of existing research articles, choose the ones that were pertinent based on predetermined criteria, extract data, and then synthesize the results. This laborious method required a lot of time and resources and was also subject to bias.

However, the practice of conducting manual systematic reviews has become increasingly impractical due to the upsurge in the number of research articles. That is where AI comes into the picture, providing a scalable way to accelerate the entire process and manage enormous amounts of data more effectively.

Let’s understand how AI automates the process of Systematic Literature Review tasks!

Role of Artificial Intelligence in Systematic Literature Reviews

1. Automates the systematic review process

Natural Language Processing (NLP) algorithms used in AI are adept at scouring through large databases to find relevant studies for systematic reviews. These algorithms can recognize keyword trends, comprehend context, and exclude irrelevant papers to ensure that only significant scientific literature is picked for evaluation.

2. Data synthesis and extraction

Machine learning models are trained to extract key findings from a set of scientific articles automatically. With the help of AI, data extraction becomes more reliable, complete, and consistent by identifying patterns in the presentation of data across different articles.

3. Detection of research bias

Research bias can potentially hamper the research findings and lead to distorted conclusions. So, AI algorithms are now being used to identify biases in particular research, such as publication bias or selection bias in qualitative research. By detecting these primary research biases, reviewers can account for them in their synthesis, which involves combining the results of multiple studies to draw more robust and reliable conclusions.

4. Stay informed with the new literature

Systematic reviews can easily become obsolete because of the dynamic and ever-growing nature of research. However, AI tools can constantly search databases for the latest research and support the evidence synthesis and meta-analyses of existing literature. When relevant papers are published, the AI can flag them for incorporation into the present review to keep it up to date.

List of AI Tools for Systematic Literature Review

Here’s a thorough overview of the best AI software for systematic literature reviews.

1. SciSpace Literature Review


With SciSpace Literature Review software, you can save time by automating the literature review process. It helps you land on pertinent studies and dig out the most relevant information required for the evidence-based synthesis. You just need to enter the keyword or the topic and the tool pulls out all the papers that are vital for your research. It reduces your intensive task of sifting through papers manually by automating the entire process using the AI algorithm.

Key Features

  • It comes with an AI research assistant, SciSpace ChatPDF , that explains the text and math used in the paper and also answers any follow-up questions you may have.
  • You can interact with any paper in 75+ different languages.
  • You can create custom columns and draw specific insights based on the different research stages.

Refer to this detailed blog that helps you understand the SciSpace Literature review workspace thoroughly.

2. DistillerSR Literature review software


DistillerSR is a prominent systematic literature review software. Its AI-powered capabilities streamline the identification, extraction, and management of research. Its collaboration, and built-in project management features make it easier for researchers to speed up the literature review process.

Key features:

  • AI-driven de-duplication: Removes duplicate research to ensure that reviewers are only evaluating original content.
  • Automatic data extraction: Pulls relevant data automatically from studies, dramatically reducing the need for manual input.
  • Collaborative workspace: Its real-time team collaboration is made possible through collaborative workspaces, which ensure the team's synchrony.

3. EPPI-Reviewer 6


The key characteristic of EPPI-Reviewer is its sophisticated text and data mining capabilities. The platform effectively handles both qualitative and quantitative data by incorporating machine learning algorithms, ensuring a thorough review and quality assessment throughout.

  • Aids in statistical synthesis: Combines data from numerous studies into pooled findings, facilitating a comprehensive review.
  • Cloud-based collaboration: Enables remote collaboration between reviewers, ensuring a smooth literature review process.

4. Rayyan — AI software for systematic review


Rayyan is renowned literature review software for systematic reviews. It is known for its user-friendly design and powerful AI-driven capabilities, and its layout ensures that even those unfamiliar with systematic review methods can easily follow the procedure.

  • Intelligent screening: Prioritizes studies for review using machine learning.
  • Blinding capability: Prevents bias in reviews by concealing author and publication information.
  • Database integration: Sources records seamlessly from major databases such as PubMed and Scopus.

5. Covidence


Covidence stands out because it claims to reduce the time required to perform systematic reviews by 30%. It also places a strong emphasis on user experience: beyond its AI capabilities, its design makes navigating the systematic review process quick and straightforward.

  • Data visualization support: Offers graphs and charts for simple data understanding.
  • Feedback mechanism: Based on the user feedback loop, the tool allows for continuous improvement.

Benefits of Using AI Tools in Systematic Literature Reviews

The advent of AI-powered tools in systematic literature reviews has myriad benefits, and below are a few of the prominent ones:

Efficiency: These AI-powered tools' automated features significantly cut down on the time required to complete a review.

Accuracy: AI algorithms reduce human error in data extraction and statistical analysis.

Comprehensiveness: By scanning enormous amounts of data, these tools help ensure that no vital study is overlooked or ignored.

Collaboration: The majority of AI tools come with built-in project management functions that support team collaboration, providing uniform and consistent outcomes.

Despite the massive benefits of AI, careful consideration must be given to its usage. The quality of AI algorithms is determined by the data used to train them, so it is essential that they are trained on diverse, high-quality datasets to prevent biases or mistakes.

In addition, even though AI can speed up the systematic review process, human oversight remains indispensable. Researchers must validate AI outputs to ensure they are accurate and relevant.

Parting thoughts!

Artificial Intelligence is redefining the way we conduct systematic reviews. It has the potential to completely transform the way we synthesize research findings by automating labor-intensive activities, identifying biases, and constantly updating reviews and analyses.

While some of the hindrances of traditional narrative reviews may persist, the fusion of human expertise and AI capabilities can usher in a new era of swift, precise, and impactful systematic reviews.



Searching for Systematic Reviews & Evidence Synthesis: AI tools in evidence synthesis


Introduction

A variety of AI tools can be used during the systematic review or evidence synthesis process. These may be used to assist with developing a search strategy; locating relevant articles or resources; or during the data screening, data extraction or synthesis stage. They can also be used to draft plain language summaries.

The overall consensus is that AI tools can be very useful at different stages of a systematic or other evidence review, but it is important to fully understand any bias or weakness they may bring to the process. In many cases, new AI tools that previous research has not assessed rigorously should be used in conjunction with existing validated methods. It is also essential to consider ethical, copyright and intellectual property issues, for example if the process involves uploading data or the full text of articles to an AI tool.

Below are some recently published articles on the topic:

  • Alshami, A.; Elsayed, M.; Ali, E.; Eltoukhy, A.E.E.; Zayed, T. Harnessing the Power of ChatGPT for Automating Systematic Review Process: Methodology, Case Study, Limitations, and Future Directions . Systems 2023, 11, 351. https://doi.org/10.3390/systems11070351 Explores the use of ChatGPT in (1) Preparation of Boolean research terms and article collection, (2) Abstract screening and articles categorization, (3) Full-text filtering and information extraction, and (4) Content analysis to identify trends, challenges, gaps, and proposed solutions.
  • Blaizot, A, Veettil, SK, Saidoung, P, et al. Using artificial intelligence methods for systematic review in health sciences: A systematic review. Res Syn Meth. 2022; 13(3): 353-362. doi: 10.1002/jrsm.1553 This review delineates automated tools and platforms that employ artificial intelligence (AI) approaches and evaluates the reported benefits and challenges in using such methods. It reports the usage of Rayyan, Robot Reviewer, EPPI-Reviewer, K-means, SWIFT-Review, SWIFT-Active Screener, Abstrackr, WordStat, Qualitative Data Analysis (QDA) Miner and NLP, and assesses the quality of the reviews that used them.
  • Janka H, Metzendorf M-I. High precision but variable recall – comparing the performance of five deduplication tools . JEAHIL [Internet]. 17Mar.2024 [cited 28Mar.2024];20(1):12-7. Available from: http://ojs.eahil.eu/ojs/index.php/JEAHIL/article/view/607  
  • Kebede, MM, Le Cornet, C, Fortner, RT.  In-depth evaluation of machine learning methods for semi-automating article screening in a systematic review of mechanistic literature.   Res Syn Meth . 2023; 14(2): 156-172. doi: 10.1002/jrsm.1589 "We aimed to evaluate the performance of supervised machine learning algorithms in predicting articles relevant for full-text review in a systematic review." "Implementing machine learning approaches in title/abstract screening should be investigated further toward refining these tools and automating their implementation"  

Khalil H, Ameen D, Zarnegar A. Tools to support the automation of systematic reviews: a scoping review .  J Clin Epidemiol  2022;  144:  22-42  https://www.jclinepi.com/article/S0895-4356(21)00402-9/fulltext  "The current scoping review identified that LitSuggest, Rayyan, Abstractr, BIBOT, R software, RobotAnalyst, DistillerSR, ExaCT and NetMetaXL have potential to be used for the automation of systematic reviews. However, they are not without limitations. The review also identified other studies that employed algorithms that have not yet been developed into user friendly tools. Some of these algorithms showed high validity and reliability but their use is conditional on user knowledge of computer science and algorithms."

Khraisha Q, Put S, Kappenberg J, Warraitch A, Hadfield K.  Can large language models replace humans in systematic reviews? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages .  Res Syn Meth . 2024; 1-11. doi: 10.1002/jrsm.1715 "Although our findings indicate that, currently, substantial caution should be exercised if LLMs are being used to conduct systematic reviews, they also offer preliminary evidence that, for certain review tasks delivered under specific conditions, LLMs can rival human performance."

Mahuli, S., Rai, A., Mahuli, A. et al. Application ChatGPT in conducting systematic reviews and meta-analyses . Br Dent J 235, 90–92 (2023). https://doi.org/10.1038/s41415-023-6132-y Explores using ChatGPT for conducting Risk of Bias analysis and data extraction from a randomised controlled trial.

Ovelman, C., Kugley, S., Gartlehner, G., & Viswanathan, M. (2024). The use of a large language model to create plain language summaries of evidence reviews in healthcare: A feasibility study . Cochrane Evidence Synthesis and Methods, 2(2), e12041.  https://onlinelibrary.wiley.com/doi/abs/10.1002/cesm.12041 

Qureshi, R., Shaughnessy, D., Gill, K.A.R.  et al.   Are ChatGPT and large language models “the answer” to bringing us closer to systematic review automation? .  Syst Rev   12 , 72 (2023). https://doi.org/10.1186/s13643-023-02243-z "Our experience from exploring the responses of ChatGPT suggest that while ChatGPT and LLMs show some promise for aiding in SR-related tasks, the technology is in its infancy and needs much development for such applications. Furthermore, we advise that great caution should be taken by non-content experts in using these tools due to much of the output appearing, at a high level, to be valid, while much is erroneous and in need of active vetting."

van Dijk SHB, Brusse-Keizer MGJ, Bucsán CC, et al. Artificial intelligence in systematic reviews: promising when appropriately used . BMJ Open 2023;13:e072254. doi: 10.1136/bmjopen-2023-072254  Suggests how to conduct a transparent and reliable systematic review using the AI tool ‘ASReview’ in the title and abstract screening.

An update on machine learning AI in systematic reviews

June 2023 webinar including a panel discussion exploring the use of machine learning AI in Covidence (screening & data extraction tool).

CLEAR Framework for Prompt Engineering

  • The CLEAR path: A framework for enhancing information literacy through prompt engineering. This article introduces the CLEAR Framework for Prompt Engineering, designed to optimize interactions with AI language models like ChatGPT. The framework encompasses five core principles (Concise, Logical, Explicit, Adaptive, and Reflective) that facilitate more effective AI-generated content evaluation and creation. Lo, L. S. (2023). The CLEAR path: A framework for enhancing information literacy through prompt engineering. The Journal of Academic Librarianship, 49(4), 102720.

Selection of AI tools used in Evidence Synthesis

  • Systematic Review Toolbox The Systematic Review Toolbox is an online catalogue of tools that support various tasks within the systematic review and wider evidence synthesis process.
  • Rayyan Free web-tool designed to speed up the process of screening and selecting studies
  • Abstrackr Aids in citation screening. Please note you will need to create a free account before accessing the tool.
  • DistillerSR An online application designed to automate all stages of the systematic literature reviews. Priced packages available (please note we cannot offer support on using this system).
  • ExaCT Information extraction system. The system is trained to find key information from scientific clinical trial publications, namely the descriptions of the trial's interventions, population, outcome measures, funding sources, and other critical characteristics. Please note you will need to request a free account.
  • RobotReviewer RobotReviewer is a machine learning system which aims to support evidence synthesis. The demonstration website allows users to upload RCT articles and see automatically determined information concerning the trial conduct (the 'PICO', study design, and whether there is a risk of bias).

Selection of tools to support the automation of systematic reviews (2022)

Khalil H, Ameen D, Zarnegar A. Tools to support the automation of systematic reviews: a scoping review. J Clin Epidemiol. 2022 Apr;144:22-42. doi: 10.1016/j.jclinepi.2021.12.005. Epub 2021 Dec 8. PMID: 34896236.  https://www.sciencedirect.com/science/article/pii/S0895435621004029?ref=pdf_download&fr=RR-2&rr=821cfdcf2d377762#tbl0004 [accessed 06-11-23].

Summary of validated tools available for each stage of the review


King’s guidance on generative AI for teaching, assessment and feedback

  • King’s guidance on generative AI for teaching, assessment and feedback. This comprehensive guidance aims to support the adoption and integration of generative AI at different institutional levels: macro (university), meso (department, programme, module), and micro (individual lecturers, especially those with assessment roles).

Leveraging GPT-4 for Systematic Reviews

Recording of a 1-hour webinar exploring Artificial Intelligence (AI) and its potential impact on the process of systematic reviews (August 15th, 2023). Note: PICO Portal is a systematic review platform that leverages artificial intelligence to accelerate research and innovation.

Moderator: Dr Greg Martin. Presenters: Eitan Agai, PICO Portal Founder & AI Expert; Riaz Qureshi, U. of Colorado Anschutz Medical Campus; Kevin Kallmes, Chief Executive Officer, Cofounder; Jeff Johnson, Chief Design Officer.

PAIR (problem, AI, interaction, reflection) framework guidance

  • PAIR (problem, AI, interaction, reflection) framework guidance The framework is designed to be (i) simple, providing a straightforward structure to harness the potential of generative AI; (ii) customisable, allowing adaptation to align with specific learning objectives and student characteristics; and (iii) compatible, building on established pedagogical approaches such as problem/inquiry-based learning and active learning, making it suitable for different disciplines.

Artificial intelligence (AI) technologies in Cochrane

  • Web Clinic: Artificial intelligence (AI) technologies in Cochrane. The session was delivered in May 2024; the webinar videos and the accompanying slides [PDF] are available to download, as are recordings from other Methods Support Unit web clinics. Part 1: How Cochrane currently uses machine learning: implementing innovative technology. Part 2: What generative AI is, the opportunities it brings and the challenges regarding its safe use. Part 3: Cochrane's focus on the responsible use of AI in systematic reviews. Part 4: Questions and answers.

Systematic Literature Review (SLR) using AI involves leveraging artificial intelligence techniques to automate and expedite the process of reviewing and synthesizing large volumes of scholarly literature.

drshahizan/SLR-FC


Systematic Literature Review (SLR) Using AI


Session       | Date
1a (basic)    | 23 and 24 Jan 2024
1b (advanced) | 4 and 5 Feb 2024
2             | 18 Feb 2024 and 3 Mar 2024
3             | 5 Mar 2024
4a            | 19 Mar 2024
4b            | 26 Mar 2024
4c            | 2 April 2024

Contribution 🛠️

Please create an Issue for any improvements, suggestions or errors in the content.

You can also contact me on LinkedIn for any other queries or feedback.



Use of AI in Systematic Literature Reviews


Systematic literature reviews (SLRs) generate comprehensive, evidence-based outputs in response to specific research questions using a structured and transparent approach. In clinical research, SLRs can support the holistic evaluation of the efficacy, safety, and cost-effectiveness of a medicinal product by combining findings from multiple clinical studies. These reviews provide clinicians, regulators, policy makers, economists, epidemiologists, and health technology assessment (HTA) bodies with more meaningful information than typical review articles.

An SLR starts with creating a search strategy that is free of selection bias. This strategy is used to find relevant research articles for review and analysis. However, in this digital era, where enormous amounts of data are just a click away, conducting an SLR can be a daunting and laborious task. Literature searches can retrieve thousands of publications, and manually reviewing and analysing these is time-consuming. By the time an SLR is published, its results may already be outdated and out of context.

Artificial intelligence (AI) can significantly improve the efficiency and accuracy of SLRs. Typically, an SLR is conducted using the following steps:

[Figure: typical steps in conducting an SLR]

AI can support some of these steps very efficiently, as detailed below:

Automated Search and Screening

The predominant use of AI in SLRs is to help streamline the initial steps of the process by automating the search for relevant literature and the screening of articles against predefined eligibility criteria. Search engines now utilise AI, especially by strengthening Retrieval-Augmented Generation (RAG) frameworks through large language models. These can support the preparation of a search with complex queries that surpass the capabilities of a conventional keyword-based literature search.


Typically, machine learning classifiers are used to find the more pertinent articles. These classifiers are trained on an initial set of user-selected papers and are then iteratively retrained as the reviewer confirms or corrects the automatic classifications, improving their ability to identify further relevant literature.

Automated tools take an article's title, abstract, or full text as input, and Natural Language Processing (NLP) algorithms then analyse titles, abstracts, and keywords to determine their relevance to the research topic. A statistical selection process can also be incorporated to detect the key terms characterising each cluster of citations, scoring each citation by the keywords it contains so that the screener can make relevance decisions more efficiently. The resulting clusters emphasise the most representative terms and thereby better support the inclusion or omission of a publication from the analysis.
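To make the classifier loop above concrete, the minimal sketch below trains a scikit-learn model on a small set of reviewer-labelled records and ranks the remaining records by predicted relevance. All data, labels, and model choices are illustrative assumptions rather than the pipeline of any particular tool.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical records: title plus abstract text for each retrieved study.
abstracts = [
    "Antidepressant effects in a rat model of depression with saline controls.",
    "Cost-effectiveness of biologic therapy in rheumatoid arthritis.",
    "Depressive-like behaviour after chronic mild stress in mice versus controls.",
    "A narrative review of machine learning applications in cardiology.",
]
labelled_idx = [0, 1]   # records the reviewer has already screened
labels = [1, 0]         # 1 = include, 0 = exclude

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)

# Train on the labelled seed set, then rank the unscreened records so the most
# likely relevant one is shown to the reviewer next; retraining on the growing
# set of decisions gives the iterative loop described above.
clf = LogisticRegression(max_iter=1000).fit(X[labelled_idx], labels)
scores = clf.predict_proba(X)[:, 1]
unlabelled = [i for i in range(len(abstracts)) if i not in labelled_idx]
next_record = max(unlabelled, key=lambda i: scores[i])
print("Next record for the reviewer:", next_record, round(float(scores[next_record]), 2))
```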

Data Extraction and Evidence Generation

In health research, researchers apply various protocols for literature review depending on the type of report to be generated. These include PICO (population, intervention, comparison, outcome), PCC (population, context, concept), PICODR (the elements of PICO plus duration and results), and PIBOSO (population, intervention, background, outcome, study design, and other).


Using AI, information is extracted from a predefined set of fields for interventional, diagnostic, or prognostic SLRs. For instance, NLP algorithms can extract key information such as study methodologies, findings, and statistical data, which can then be synthesised and analysed to draw meaningful conclusions. Knowledge graphs are produced that improve research paper classification and characterisation. AI technologies use a domain ontology to organise the data, offering a formal description of the variable types and the associations between them.
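As a very simple illustration of automated field extraction, the sketch below pulls a few structured items (sample size, p-values, effect measures) out of an abstract with regular expressions. Production systems typically rely on trained named-entity recognition or question-answering models rather than hand-written patterns; the abstract text and patterns here are invented for the example.

```python
import re

abstract = (
    "We randomised 248 patients with COPD to an integrated care intervention or usual care. "
    "The intervention reduced hospitalisations (HR 0.72, p=0.03)."
)

# Rule-based extraction of a few fields that an SLR data-extraction form might require.
sample_size = re.search(r"\b(\d{2,6})\s+(?:patients|participants|subjects)\b", abstract)
p_values = re.findall(r"p\s*[=<]\s*0?\.\d+", abstract)
effect_measures = re.findall(r"\b(?:HR|OR|RR)\s*[\d.]+", abstract)

print("Sample size:", sample_size.group(1) if sample_size else None)
print("p-values:", p_values)
print("Effect measures:", effect_measures)
```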

Quality Assessment

It is important to minimise selection bias and maximise the external and internal validity of the publications selected for an SLR. To do this, systematic errors should be reduced to the greatest possible extent.

A quality assessment provides evidence of the overall strength and reliability of an SLR. AI can help assess the quality of included studies by analysing factors such as study design, sample size, and methodology.


Machine learning algorithms can be trained on existing datasets to recognise patterns indicative of high-quality research, helping researchers to evaluate the reliability of the evidence more efficiently. Most validated checklists, such as PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses), recommend evaluation of the bias in literature search and selection by an independent reviewer. Combining a manual quality check with automated screening therefore remains important for identifying gaps and discrepancies, which can be resolved by evaluating the conflicts between the screener's and the reviewer's decisions.
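The idea of learning quality signals from labelled examples can be sketched as follows; the features, data, and model are purely hypothetical, and a real appraisal would still be checked by a reviewer.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical study-level features: [randomised?, blinded?, log10(sample size)].
X_train = [[1, 1, 2.4], [1, 0, 2.0], [0, 0, 1.3], [0, 1, 1.7], [1, 1, 3.1], [0, 0, 1.0]]
y_train = [1, 1, 0, 0, 1, 0]   # 1 = judged adequate quality by human reviewers

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Probability that a new, unseen study would be judged adequate quality.
print(model.predict_proba([[1, 0, 2.7]])[0, 1])
```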

Analysis, Data Visualization, and Preparation of the Report

AI-powered tools can support the meta-analysis of data extracted from multiple studies, enabling researchers to quantitatively synthesize findings and assess the overall effect sizes. AI can facilitate the organisation and categorisation of large volumes of literature through semantic analysis and clustering techniques. By identifying common themes and relationships among studies, AI algorithms can help researchers gain deeper insights into the existing literature, evaluate the current status of the topic of research, and identify gaps or areas for further investigation.

Additionally, AI-driven visualization techniques can be used to present complex information in a more accessible and comprehensible manner, facilitating understanding and decision-making.


AI technology uses algorithms and models to find patterns, trends, and outliers in data, and correlations amongst diverse variables. Insights and recommendations generated from SLR data may help researchers understand the implications of gaps in current knowledge, processes, research, and policies. By incorporating feedback from researchers and updating algorithms based on new data, AI can continuously enhance the accuracy and efficiency of SLRs and improve the quality of research outputs.
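A minimal sketch of the clustering idea mentioned above is shown below, grouping a handful of invented abstracts into themes with TF-IDF features and k-means; real reviews would cluster hundreds or thousands of records, and the cluster labels would still need human interpretation.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented abstracts standing in for the records of a review.
docs = [
    "telehealth intervention improves heart failure self-management",
    "remote monitoring reduces heart failure readmissions",
    "machine learning predicts COPD exacerbations from wearable data",
    "deep learning model estimates exacerbation risk in COPD",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Report the most characteristic terms of each cluster as a rough "theme" label.
terms = vec.get_feature_names_out()
for c in range(km.n_clusters):
    top_terms = [terms[i] for i in km.cluster_centers_[c].argsort()[::-1][:3]]
    members = [i for i, lab in enumerate(km.labels_) if lab == c]
    print(f"Cluster {c}: documents {members}, top terms {top_terms}")
```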

Allied uses of Artificial Intelligence

An SLR report usually includes many abbreviated terms and acronyms, and these must all be listed. AI can be used to automate the creation of the abbreviations list. In addition, AI can assist with automated quality checks of the data presented in an SLR.

Overall, the integration of AI into SLRs looks promising, especially for enhancing the efficiency, accuracy, and comprehensiveness of the review process, enabling researchers to leverage the vast amount of available literature more effectively.

To know more…

Write to us if you want to know more about our customised and cost-effective AI solutions for SLRs, targeted literature reviews, and meta-analyses: [email protected]

Open access | Published: 21 August 2024

Evaluating the effectiveness of large language models in abstract screening: a comparative analysis

Michael Li, Jianping Sun & Xianming Tan (ORCID: orcid.org/0000-0002-5478-2269)

Systematic Reviews, volume 13, Article number: 219 (2024)


This study aimed to evaluate the performance of large language models (LLMs) in the task of abstract screening in systematic review and meta-analysis studies, exploring their effectiveness, efficiency, and potential integration into existing human expert-based workflows.

We developed automation scripts in Python to interact with the APIs of several LLM tools, including ChatGPT v4.0, ChatGPT v3.5, Google PaLM 2, and Meta Llama 2, as well as the latest tools ChatGPT v4.0 turbo, ChatGPT v3.5 turbo, Google Gemini 1.0 pro, Meta Llama 3, and Claude 3. This study focused on three databases of abstracts and used them as benchmarks to evaluate the performance of these LLM tools in terms of sensitivity, specificity, and overall accuracy. The results of the LLM tools were compared to human-curated inclusion decisions, the gold standard for systematic review and meta-analysis studies.

Different LLM tools had varying abilities in abstract screening. ChatGPT v4.0 demonstrated remarkable performance, with balanced sensitivity and specificity and an overall accuracy consistently reaching or exceeding 90%, indicating a high potential for LLMs in abstract screening tasks. The study found that LLMs could provide reliable results with minimal human effort and thus serve as a cost-effective and efficient alternative to traditional abstract screening methods.

While LLM tools are not yet ready to completely replace human experts in abstract screening, they show great promise in revolutionizing the process. They can serve as autonomous AI reviewers, contribute to collaborative workflows with human experts, and integrate with hybrid approaches to develop custom tools for increased efficiency. As technology continues to advance, LLMs are poised to play an increasingly important role in abstract screening, reshaping the workflow of systematic review and meta-analysis studies.


Introduction

A systematic review is a rigorous form of research that collates and synthesizes all existing evidence on a specific research question [ 1 ]. It stands as a cornerstone not just in medical research but across diverse academic disciplines. Unlike traditional literature reviews, systematic reviews follow a comprehensive and standardized process, such as the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guideline [ 2 ], designed to minimize bias and ensure reproducibility. Therefore, these reviews are recognized as one of the zenith levels of evidence in evidence-based research [ 3 ] and play a pivotal role in shaping clinical guidelines, healthcare policies, and informing medical decisions [ 4 ].

Commencing with a well-articulated research question, a typical systematic review launches an exhaustive search strategy that subsequently sweeps through databases, such as PubMed and Embase, supplemented by additional sources such as clinical trial registries and pertinent article reference lists, with aims to capture all relevant studies and mitigate bias. Predetermined inclusion and exclusion criteria guide the subsequent screening and selection of studies, encompassing facts like study design, patient demographics, and intervention types. Reviewers, working independently, appraise each study’s eligibility, reconciling disagreements through discussions or third-party reviews. Data extraction and synthesis follow, either through meta-analysis or narrative synthesis, depending on the heterogeneity of the selected studies.

The practice of conducting systematic reviews has gained substantial popularity with considerable demand within the academic community. A notable reference [ 5 ] to this trend found that approximately 75 trials and potentially 11 systematic reviews are disseminated daily, based on data from around 2010. Moreover, a query for “systematic review” in the Google Scholar database yields approximately 17,000 entries that have been published within the year 2023 alone, which translates into about 51 systematic reviews per day, as observed on November 26, 2023. This expanding volume of literature underscores the critical role that systematic reviews play in the consolidation of research findings across various fields of study.

Despite their pivotal role, executing systematic reviews remains a formidable task due to the abstract screening, a key phase that can be overwhelmingly time-consuming due to its volume. For example, in Polanin et al. [ 6 ], the authors reported that the research staff screened 29,846 abstracts independently (14,923 unique citations were double screened) over the course of 189 days. In addition, the variability in reporting standards, use of jargon, and varied study designs can further complicate the abstract screening process [ 7 ]. Nevertheless, the repetitive nature of the task, combined with cognitive fatigue, can lead to human errors [ 8 , 9 ]. Recent advancements in machine learning (ML) and deep learning propose possible solutions to these challenges. However, traditional ML models, while promising, require domain-specific training, a time-consuming process that often demands manual labeling of datasets [ 10 ].

Advancements in natural language processing (NLP) and artificial intelligence (AI) are opening doors to address challenges in systematic reviews. Large language models (LLMs) like ChatGPT [ 11 ], PaLM [ 12 ], Llama [ 13 ], and Claude [ 14 ] are capturing the research community’s attention. Their collective potential, especially their capability to operate without exhaustive domain-specific training, makes them prime candidates for revolutionizing the systematic review process.

While each of the aforementioned AI tools brings its unique capabilities to the table, the fundamental question remains: How do they stack up, both individually and collectively, in the realm of abstract screening, against the human expert-based process? In pursuit of answers, this research seeks to thoroughly investigate the potential of ChatGPT, Google PaLM, Llama, and Claude in automating the crucial abstract screening phase integral to systematic reviews. Our goal is to rigorously compare the performance of these advanced AI-driven methods with existing machine learning (ML)-based approaches. In doing so, we aim to develop AI strategies that masterfully balance efficiency and accuracy with minimal human intervention, ultimately transforming systematic review practice across disciplines.

The use of NLP for abstract screening is established [ 10 , 15 ]. However, the application of LLMs specifically for this task is a nascent field [ 16 , 17 ]. This emerging area offers significant potential to improve efficiency and accuracy. Our study aims to fill this gap by providing a comprehensive analysis of LLM capabilities in abstract screening, laying the groundwork for future research and application. This is particularly relevant considering the rapid evolution of this technology, highlighting its potential to streamline systematic reviews now and in the future.

The remainder of this paper is structured to provide a comprehensive exploration of our topic. We begin with an in-depth examination of existing methods for abstract screening, including both manual and NLP-based approaches, laying the groundwork for understanding the current state of the field. We then introduce the use of large language model (LLM) tools for abstract screening, detailing our experimental design to meticulously evaluate their performance in this context. Subsequent sections present our empirical findings and results, shedding light on the capabilities and limitations of the AI tools in question. Finally, we engage in a thoughtful discussion, reflecting on the implications of our findings and considering the future trajectory of abstract screening in systematic reviews.

Existing approaches to abstract screening in systematic reviews

In the vast realm of systematic reviews, the critical task of abstract screening serves as a foundational step in curating the highest quality of evidence [ 2 ]. However, this process often presents significant challenges because it involves sifting through large volumes of literature to identify those that align with predefined criteria. Over time, various methodologies, ranging from manual evaluations to sophisticated AI-driven techniques, have been proposed to address the complexities of this task. In this section, we describe the existing approaches, their operational mechanisms, and their associated advantages and disadvantages.

Manual approach

Historically, the process of abstract screening was firmly rooted in manual evaluations. In this conventional approach, individual reviewers would scrutinize each abstract against predefined criteria [ 1 ]. The meticulous nature of this method required that multiple experts independently evaluate the literature to ensure both reliability and reduced biases [ 8 ]. While the depth of human expertise brought about nuanced understanding, the manual nature of this method made it both time-consuming and, at times, prone to human error [ 6 , 9 ].

NLP-based approach

As technology evolved, the field witnessed the incorporation of natural language processing (NLP) to automate abstract screening [ 10 ]. In this framework, abstract text undergoes preprocessing and vectorization. Supervised machine learning models, notably the support vector machine (SVM) and the random forest (RF), are then trained on this vectorized data to classify literature based on specific criteria [ 15 ]. The strength of this approach lies in its potential for efficiency. However, its efficacy and accuracy hinge heavily on the availability of a well-curated, labeled training set.

Zero-shot classification

A more recent and innovative approach is zero-shot classification, which was notably highlighted by Xian et al. [ 18 ]. Eschewing the need for an extensive labeled training dataset, zero-shot classification offers the allure of categorizing abstracts without prior domain-specific training. By calculating a probability score for each abstract, researchers obtain a dynamic measure of its alignment with predefined categories. While it does not require training for the machine learning model, the tradeoff is a loss of sensitivity and potential omissions of pertinent studies [ 19 ]. In this study, for a given set of abstracts, we first obtain embeddings (using OpenAI’s text embedding API, the babbage model) for abstracts and for a pre-specified description of an ideal study to include. We indeed use the inclusion/exclusion criteria (see “Examples of abstract screening by using LLMs,” “Automated workflow for streamlining abstract screening via ChatGPT and other tools,” and “ Study design ” sections) as the description of an ideal study to include. We then compute the cosine similarity score between each abstract (its embedding) and the pre-specified description (its embedding). We claim the top 10% (in terms of similarity) of abstracts as positive (i.e., a study should be included).
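The ranking step described above can be sketched in a few lines. The authors used OpenAI's text embedding API (the babbage model); the sketch below swaps in a freely available sentence-transformers model, so the model name and example texts are assumptions made for illustration rather than a reproduction of the study's setup.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Description of an "ideal" study (the eligibility criteria) plus two example abstracts.
criteria = ("Primary data from an animal model of depression or a depressive-like "
            "phenotype, with an appropriate control group.")
abstracts = [
    "Wistar-Kyoto rats show endogenous depressive behaviour reversed by antidepressants ...",
    "Regional regulation of cerebral ornithine decarboxylase after electrical stimulation ...",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([criteria] + abstracts)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # unit-normalise for cosine similarity

scores = emb[1:] @ emb[0]            # cosine similarity of each abstract to the criteria
cutoff = np.quantile(scores, 0.90)   # top 10% of abstracts treated as provisional inclusions
for abstract, score in zip(abstracts, scores):
    print(round(float(score), 3), bool(score >= cutoff), abstract[:60])
```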

Hybrid approach

To reconcile the strengths and limitations of these models, a hybrid methodology has emerged [ 18 ]. It cleverly marries the rapid categorization capabilities of zero-shot classification with the precision of traditional machine learning models such as SVM and RF. Here, zero-shot classification provides an initial curation of all abstracts in the training set. Then, based on zero-shot classification’s curation, the abstracts with classification score higher than a pre-specified threshold are selected to undergo manual review to rectify zero-shot classification. The rectified classification then serves as a foundation for training traditional models, which are then employed for screening the broader dataset (e.g., in the testing set). The hybrid approach balances the speed of zero-shot and the precision of traditional ML, and potentially offers enhanced accuracy at reduced human efforts. However, this approach involves the use of multiple methodologies and still relies on well-curated, labeled training data (in this case, a subset of the whole training set). In this study, for a given set of abstracts, we rectify the “positive” abstracts (i.e., the top 10% of abstracts) identified by zero-shot, and then fit a classification model for rectified labels to the text embeddings of the corresponding abstracts using SVM. This learned classification model is then used to predict the class of other abstracts. We added the R code for the zero-shot and hybrid approaches to the GitHub repository https://github.com/mikeli380/LLMAbstractScreening .
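For the classification step of the hybrid approach, a minimal Python sketch is shown below. The authors provide R code in their repository and used OpenAI embeddings; the random stand-in embeddings and label counts here are illustrative assumptions only.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
embeddings = rng.random((200, 384))               # stand-in for abstract text embeddings
rectified_idx = np.arange(20)                     # top-ranked abstracts checked by hand
rectified_labels = np.array([1] * 6 + [0] * 14)   # reviewer-corrected include/exclude labels

# Fit an SVM to the manually rectified subset, then screen the remaining abstracts.
svm = SVC(kernel="linear").fit(embeddings[rectified_idx], rectified_labels)
remaining = np.delete(embeddings, rectified_idx, axis=0)
predictions = svm.predict(remaining)
print(predictions[:10], int(predictions.sum()), "predicted inclusions")
```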

Active learning approach

Active learning [ 20 , 21 , 22 ] is an innovative approach to machine learning that optimizes the training process by allowing the model to selectively query a human annotator for labels on the most informative data points. This method is particularly advantageous in scenarios where labeled data is scarce or expensive to obtain. Active learning models can start with minimal datasets, often requiring only one relevant and one irrelevant abstract, making them particularly suitable for tasks such as abstract screening, where the acquisition of large labeled datasets can be prohibitive.

The main advantage of active learning is its efficiency. By focusing on the most informative samples, it reduces the amount of data that needs to be labeled while still effectively training the model. This can significantly reduce the time and resources required for the annotation process. However, the effectiveness of active learning depends heavily on the initial selection of samples and the criteria used to determine the informativeness of subsequent data points. If not well calibrated, the model may request labels for data that do not significantly improve its performance, resulting in an inefficient use of resources. In addition, the iterative nature of querying and updating the model based on new labels can introduce complexity into the training process, potentially requiring more sophisticated infrastructure and oversight than traditional supervised learning methods.

While this study did not test active learning approaches for abstract screening, readers interested in exploring this methodology further are referred to [ 23 ] for detailed information on the application of active learning in abstract screening contexts.
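For orientation, the core query step of such a loop can be illustrated with uncertainty sampling: train on the seed labels, then ask the reviewer to label the abstract whose predicted probability is closest to 0.5. The data and model below are invented for the sketch and were not part of this study.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "rat model of depression with an appropriate control group",
    "narrative review of cardiac imaging techniques",
    "chronic stress induces depressive-like behaviour in mice",
    "cost analysis of telehealth for diabetes management",
]
labelled = {0: 1, 1: 0}   # seed decisions: 1 = include, 0 = exclude

X = TfidfVectorizer().fit_transform(docs)
clf = LogisticRegression().fit(X[list(labelled)], list(labelled.values()))

# Query the unlabelled abstract the model is least certain about.
pool = [i for i in range(len(docs)) if i not in labelled]
probs = clf.predict_proba(X[pool])[:, 1]
query = pool[int(np.argmin(np.abs(probs - 0.5)))]
print("Ask the reviewer to label:", docs[query])
```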

Large language models

Amidst the evolving methodologies described in the “ Existing approaches to abstract screening in systematic reviews ” section, modern AI tools based on large language models (LLMs), such as ChatGPT, PaLM, Llama, and Claude, are emerging as potential game-changers. Grounded in advanced language processing capabilities, these tools can be tailored to evaluate abstracts against nuanced criteria and offer detailed assessments and classifications. Their prowess signals transformative potential for abstract screening. In this section, we first present two examples to illustrate the potential of LLMs in the context of abstract screening, and then precede with our in-depth discussion on the study’s workflow and design that aims to critically investigate the performance of LLMs in this domain.

Examples of abstract screening by using LLMs

In this subsection, we demonstrate the process of LLM-based abstract screening using ChatGPT with two specified abstracts: (1) Millard et al. [ 24 ] and (2) Zawia et al. [ 25 ]. Using other LLM tools such as Llama, Google PaLM, or Claude for abstract screening is similar and will be skipped for brevity.

In constructing the prompts, we adopted a standardized approach to mimic a typical interaction between a senior researcher and a research assistant. Each prompt was designed to include three critical elements: a brief statement of the topic under review, the exact eligibility criteria as specified in the study guidelines, and the text of the abstract to be reviewed. This approach was designed to simulate a realistic and practical scenario that reflects common academic practices. In addition, to ensure consistency and objectivity in the responses, we required that the results of the LLM be in a binary “yes” or “no” format, directly indicating whether each abstract met the inclusion criteria. This format was chosen to eliminate any ambiguity in the model’s decisions and to facilitate straightforward analysis of its accuracy.

Abstract (1)

ChatGPT prompt: “I would like you to help me with conducting a systematic review on animal depression. I will provide the title and abstract for one journal article and would like you to screen the paper for inclusion. Here are the inclusion criteria: Does the study provide primary data of an animal model of depression or depressive-like phenotype with an appropriate control group? Here is the abstract:

Wistar-Kyoto (WKY) rats show endogenous depressive behavior that can be reversed by antidepressants. Given that WKYs exhibit decreased sensitivity to some antidepressants and treatment-resistant depressed patients often show hypothalamic-pituitary-thyroid (HPT) dysregulation, we examined the behavioral and HPT hormonal responses of WKYs to altered thyroid status. “Euthyroid” WKYs had elevated basal plasma TSH and T(3) levels as compared to Wistars. Hypothyroidism increased TSH levels more in WKYs than in Wistars and increased response latency in the open field test (OFT) of WKYs only. Administration of T(4) and T(3) suppressed plasma TSH equally in both strains. Wistars responded to increased T(3) levels with decreased response latency and increased activity in the OFT, but increased immobility in the forced swim test. In contrast, WKYs responded only to the high T(3) levels with decreased response latency in the OFT. These results suggest the existence of a decreased central nervous system sensitivity to thyroid hormones in WKYs that could be related to their depressive behavior.”

ChatGPT v4 response (accessed on 10/15/2023):

Given the abstract and the inclusion criteria, this study appears to meet the inclusion criteria for your systematic review on animal models of depression.

Consideration for inclusion:

 1. Animal model and depressive-like phenotype:

  • The study utilizes Wistar-Kyoto (WKY) rats which are mentioned to show endogenous depressive behavior.

  • This study investigates behaviors and hormonal responses associated with depression in this particular animal model, with a focus on the role of thyroid hormones.

 2. Primary data and appropriate control group:

  • The study provides primary data on behavioral and hormonal responses of WKY rats, comparing them to Wistars (likely serving as the control group), under altered thyroid status, allowing for comparisons between the depressive-like phenotype and controls.

  • The utilization of both “Euthyroid” WKYs and Wistars as controls helps in effectively investigating the behavioral changes and responses in WKY rats due to alterations in thyroid status.

 3. Behavioral test indicating depression:

  • Depressive behavior and responses are assessed using standardized tests like the open field test (OFT) and the forced swim test, which are conventional methods to investigate depressive-like behaviors in animal models.

  • The mention of antidepressant’s impact on WKY rats indicates a focus on depressive-like states.

 4. Relevance to human condition:

  • Insights gained from this study on decreased central nervous system sensitivity to thyroid hormones and its possible relation to depressive behavior in WKYs can have implications in understanding treatment-resistant depression in humans, making it valuable for the review.

Conclusion:

Given the provided abstract, this study does seem to fit the inclusion criteria for a systematic review on animal depression, as it provides primary data on an animal model demonstrating depressive-like behavior, includes an appropriate control group, and employs standardized tests to study the depressive-like states and their relation to thyroid hormones. The detailed exploration of depressive behaviors and hormonal imbalances in relation to altered thyroid status can offer valuable insights for understanding and treating depression.

Abstract (2)

The induction of the activity of cerebral ornithine decarboxylase (EC 4.1.1.17) and mRNA by electrical stimulation exhibits regional differences. The effects of the enzyme inhibitor difluoromethylornithine on these regional variations was examined. Administration of this inhibitor resulted in pronounced depression of both basal and induced activity of ornithine decarboxylase in the hippocampus. Basal activity of the enzyme in the neocortex and the cerebellum appeared to be resistant to difluoromethylornithine but the induced enzyme activity was sensitive to the effects of this inhibitor. Susceptibility to difluoromethylornithine may be directly correlated with a slower turnover rate for ornithine decarboxylase. These results suggest that ornithine decarboxylase in the hippocampus may possess a longer half-life than its counterparts in other regions of the brain. Pretreatment with difluoromethylornithine had no effect on the induced ornithine decarboxylase mRNA in the neocortex. Thus, elevated activity of ornithine decarboxylase enzyme, due to electrical stimulation, appears to not have any effect on either the transcription or the decay rate of the induced ornithine decarboxylase mRNA. These findings support the concept of region-specific regulation of cerebral ornithine decarboxylase. ”

Given the provided abstract and the inclusion criteria, this study does not appear to meet the criteria for inclusion in a systematic review on animal models of depression or depressive-like phenotypes with an appropriate control group.

Consideration for inclusion:

 1. Depression or depressive-like phenotype:

  • The abstract does not provide any information or data related to depression or depressive-like behavior or phenotypes in animals. It is mainly focused on the activity of cerebral ornithine decarboxylase and its regional differences in response to electrical stimulation and inhibitor administration.

 2. Animal model:

  • There is no mention of the use of any specific animal model or any behavioral assays to assess depressive-like symptoms or states in the provided abstract.

 3. Behavioral assessment and primary data:

  • The abstract does not contain any details about behavioral assessments related to depression or depressive-like states. It primarily provides data on the effects of enzyme inhibitor difluoromethylornithine on cerebral ornithine decarboxylase activity and mRNA, but does not relate these findings to depression or depressive-like behaviors.

 4. Relevance to depression:

  • While the study explores biochemical aspects in brain regions like the hippocampus, which is implicated in mood disorders like depression, the connection between the findings and depressive symptoms or states is not established in the provided abstract.

Conclusion:

This article does not seem to fit the inclusion criteria for a systematic review on animal models of depression or depressive-like phenotypes as it lacks primary data on depressive-like behaviors or phenotypes in animals, and there is no mention of the use of animal models representative of depression or an appropriate control group for such a comparison.

From the above two examples presented, it is evident that ChatGPT adeptly comprehends the posed queries and efficiently extracted the inclusion and exclusion criteria. It not only makes clear decisions regarding study inclusion but also articulates its rationale by cross-referencing each criterion.

It is worth noting that the same ChatGPT prompt might result in slightly different responses between different users due to the variability in its response generation, but the conclusion for the eligibility of an abstract is typically consistent. In addition, enhancing the specificity of the criteria within the prompts can further refine the screening process. To illustrate, consider the following enhanced prompts:

Enhanced ChatGPT prompt for abstract (1): “I seek assistance with a systematic review focused on animal depression. I’ll provide the abstract of a particular journal article and would appreciate an assessment for its inclusion based on the following criteria:

The study employs an animal model reflecting depression or a depressive-like phenotype.

It incorporates an appropriate control group.

Primary data is presented within the study.

and here are the exclusion criteria:

The study bypasses the use of an animal model.

The utilized animal model doesn’t characterize depression or a depressive-like phenotype.

Absence of a control group within the study.

The study is primarily a review, devoid of primary data.

Here is the abstract: …”.

Let’s assess the provided abstract based on the inclusion and exclusion criteria:

Inclusion criteria:

 1. Animal model reflecting depression: The study uses Wistar-Kyoto (WKY) rats, which are described as showing “endogenous depressive behavior.” Thus, this criterion is met.

 2. Appropriate control group: The study compares WKY rats to Wistars (another strain), which serves as a comparison or control group. This criterion is met.

 3. Primary data: The abstract mentions specific results such as the elevated basal plasma TSH and T(3) levels in “Euthyroid” WKYs compared to Wistars and the different responses in the open field test (OFT). This suggests the presence of primary data. Thus, this criterion is met.

Exclusion criteria:

 1. Bypasses the use of an animal model: The study uses WKY rats as an animal model. Thus, this criterion is not met, which is good for inclusion.

 2. Animal model doesn’t characterize depression: The WKY rats show “endogenous depressive behavior.” Thus, this criterion is not met, which again supports inclusion.

 3. Absence of a control group: The study does use a control group (Wistars). Thus, this criterion is not met, supporting inclusion.

 4. Study is primarily a review: The abstract presents experimental data and does not indicate that the study is a review. Thus, this criterion is not met, which is favorable for inclusion.

Conclusion:

Given that the abstract meets all the inclusion criteria and none of the exclusion criteria, it should be considered in your systematic review on animal depression.

With these revised prompts applied to the previously discussed abstracts, the conclusions remain consistent. Nonetheless, we contend that this enhanced structure not only provides a more rigorous framework but also aligns more closely with widely accepted practices in systematic review criteria definition.

Automated workflow for streamlining abstract screening via ChatGPT and other tools

For systematic reviews, particularly when dealing with voluminous data, efficient workflows are paramount. The ChatGPT API (application programming interface) offers a dynamic solution, enabling the automation of abstract screening on a large scale, which circumvents the labor-intensive process of manually inputting abstracts into a chatbot interface. In this subsection, we present an automated workflow for streamlining abstract screening via ChatGPT. Note, though this automated workflow uses ChatGPT as the platform, analogous workflows work for other AI platforms like PaLM, Llama, and Claude.

Automated workflow:

Data collection: The preliminary step entails accumulating a list of titles and abstracts. By utilizing carefully crafted keywords, we retrieve these from PubMed and other pertinent databases. This comprehensive approach ensures the potential inclusion of all relevant studies for a subsequent detailed screening. It is worth noting that while this list is expansive, most of these studies may not find their way into the final meta-analysis post-screening.

Automation through Python: We have devised a Python script aimed at harnessing the capabilities of ChatGPT for evaluating the amassed abstracts.

This script interacts with the ChatGPT API (specifically, the GPT-4 version) and, when furnished with tailored prompts, extracts structured responses from ChatGPT.

Typically, the AI’s response commences with a succinct summary, delves into explanations aligned with each criterion, and concludes with a decisive judgment, as exemplified by the examples in the “ Examples of abstract screening by using LLMs ” section.

This automated process efficiently saves ChatGPT’s verdicts on each abstract for ensuing analyses. For instance, it extracts the final decisions regarding the inclusion or exclusion of studies and determines the stance on each pre-specified criterion for every abstract, as exemplified by the last example in the “ Examples of abstract screening by using LLMs ” section.

Additionally, to ascertain the efficiency and cost-effectiveness of this methodology, the script also monitors the time, token usage, and the financial implications of querying the OpenAI API.

In essence, we envision this procedure as delegating the meticulous task of poring over scientific summaries to an AI assistant. This virtual entity meticulously sifts through each summary, determining its alignment with stipulated criteria.
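A condensed sketch of such a screening script, using the current OpenAI Python SDK, is shown below. The model name, prompt wording, and one-word answer format are illustrative assumptions rather than the authors' exact implementation; the usage field is read to track token consumption, as described above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

criteria = ("Include only studies providing primary data from an animal model of "
            "depression or a depressive-like phenotype with an appropriate control group.")

def screen(abstract: str) -> tuple[str, int]:
    """Return the model's include/exclude verdict and the number of tokens used."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,   # deterministic output
        messages=[
            {"role": "system", "content": "You screen abstracts for a systematic review."},
            {"role": "user", "content": f"Criteria: {criteria}\n\nAbstract: {abstract}\n\n"
                                        "Answer with a single word: include or exclude."},
        ],
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict, response.usage.total_tokens

verdict, tokens = screen("Wistar-Kyoto rats show endogenous depressive behaviour ...")
print(verdict, tokens)
```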

Tuning parameters in the ChatGPT API: The effectiveness of the ChatGPT API is not only dependent on the input data; it is also significantly influenced by adjustable parameters that can refine the responses. Parameters such as temperature, top k, and top p critically affect model performance by modulating the randomness and focus of the output. While fine-tuning these parameters can improve results, it requires significant technical expertise and resources. Defaults, which are rigorously tested by developers, strike a balance between output quality and ease of use, making LLMs more accessible to a wider range of users without the need for complex parameter optimization. While customization holds promise for custom applications, the default settings provide an efficient and practical solution that facilitates wider adoption of LLM technologies.

Given the complexity of fine-tuning these parameters to optimize performance specifically for abstract screening, our study primarily used the recommended default settings for these parameters as provided by the respective platforms (detailed in Table 1 ). This approach was chosen to maintain the feasibility of our experiments and to ensure that our findings are applicable to typical deployment scenarios.

The automated workflow described applies to other LLM tools or different versions of the same tools. Throughout our study, we have tested several popular LLM tools available to us. The specific LLM models and their versions used at the time of our first submission are detailed in Table  1 , along with the tuning parameters.

The field of large language models (LLMs) has evolved rapidly since we started this study in 2023. New models are frequently released, and existing versions are constantly updated. To account for these advances, we have expanded our analysis to include results from the latest versions of previously studied models, as well as a few entirely new ones. As a best practice, we set the temperature parameter to 0 for all latest models. Table 1 now includes these latest models and their versions.

A brief explanation of these parameters is as follows:

Temperature: The temperature controls the randomness of the outputs, with a range from 0 to 2. A temperature value greater than 1 is random and 0 is deterministic. The maximum temperature of 2 gives the most creative and variable outputs.

Max length: The max length is the maximum number of tokens for the model to generate as a response. A single word is generally 2–3 tokens.

Stop sequences: This parameter controls which tokens or phrases will stop the LLM from generating more text.

Top p: When generating text, the model samples from the top p percentage of most likely tokens to generate. The top p is the cumulative probability cutoff for the model’s selection of tokens to generate. Lower top p values mean sampling from a smaller, more top-weighted nucleus.

Top k: When generating text, the model samples from the top k most likely tokens. When deciding the next word to output, the model will select the most likely word when top k is lower.

Frequency penalty: This frequency penalty parameter controls how the LLM penalizes tokens that have already appeared in the input and output text. A frequency penalty of 0 implies that the frequency of the tokens does not impact the LLM and will generate them based on their probability.

Presence penalty: This parameter controls how the LLM penalizes tokens that have not been used recently. A presence penalty of 0 means that the LLM does not care about the recency of the tokens and will generate them based on their probability.
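For concreteness, the snippet below shows where these settings are supplied in an OpenAI chat completion request; the values are illustrative, and top k is not exposed by this particular API (it appears in other providers' interfaces, such as PaLM).

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Screen this abstract against the criteria ..."}],
    temperature=0,          # deterministic output
    max_tokens=512,         # "max length" of the generated response
    top_p=1,                # sample from the full probability mass
    frequency_penalty=0,    # no penalty on frequently generated tokens
    presence_penalty=0,     # no penalty on previously used tokens
    stop=None,              # no custom stop sequences
)
print(response.choices[0].message.content)
```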

Study design

In our pursuit to assess ChatGPT’s proficiency in abstract screening, we selected certain benchmark databases that have existing performance data from other methodologies. This selection aids in a comparative analysis of performance.

In selecting the systematic reviews for our study, we used a systematic approach guided by specific criteria to ensure relevance and reliability. These studies were selected from the publicly available SYNERGY [ 23 ] dataset, which contains 26 systematic reviews from different disciplines. Key selection criteria included:

Clarity and conciseness of eligibility criteria: The selected studies had well-defined and explicit eligibility criteria. This clarity is essential for accurate replication of the study selection process, which is critical for assessing the performance of LLM tools in an analogous real-world application.

Completeness and cleanliness of data: We ensured that the selected reviews had complete datasets, with all necessary information on included and excluded studies clearly documented, minimizing the risk of ambiguities affecting our analysis.

In addition, to comply with the AMSTAR-2 [ 26 ] guidelines, in particular point 5, we reviewed the methodologies of these reviews to confirm the selection of studies was performed in duplicate and disagreements were resolved by consensus. While our analysis assumes that these systematic reviews adhere to high standards, we recognize the inherent limitations of using pre-existing datasets as a proxy for gold standards in the discussion section.

We picked the following 3 databases from the publicly available SYNERGY dataset [ 23 ]:

Bannach-Brown 2016 [ 27 ]—topic: use of animal models to study depressive behavior

Human-curated (gold standard) results: 1258 excluded abstracts and 230 included abstracts.

We randomly selected 100 excluded abstracts and 100 included abstracts for screening by LLM tools.

Meijboom 2021 [ 28 ]—topic: retransitioning of etanercept in patients with a rheumatic disease

Human-curated (gold standard) results: 599 excluded abstracts and 32 included abstracts.

We randomly selected 100 excluded abstracts and all 32 included abstracts for screening by LLM tools.

Menon 2022 [ 29 ]—topic: methodological rigor of systematic reviews in environmental health

Human-curated (gold standard) results: 896 excluded abstracts and 73 included abstracts.

We randomly selected 100 excluded abstracts and all 73 included abstracts for screening by LLM tools.

For each chosen database, abstracts were categorized as either “cases” (those included based on the gold standard) or “controls” (those excluded per the gold standard). From each category, we randomly selected 100 abstracts (or used all abstracts if a category contained fewer than 100). These abstracts were evaluated by ChatGPT (v4.0) as per our established workflow. Subsequently, ChatGPT’s decisions were compared against the gold standard to determine sensitivity, specificity, and overall accuracy. The same abstracts were also processed using the other LLM tools listed in Table 1 to record their respective verdicts.
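As an illustration of this sampling step, the sketch below shows one way the per-category random selection could be implemented; the record structure, function name, and seed are our assumptions rather than details taken from the authors’ code.

```python
import random

def sample_abstracts(included, excluded, n=100, seed=2023):
    """Sample up to n abstracts from each gold-standard category.

    `included` and `excluded` are lists of abstract records (e.g., dicts with
    title and abstract fields); if a category has fewer than n records, all
    of them are used, mirroring the rule described above.
    """
    rng = random.Random(seed)
    cases = list(included) if len(included) <= n else rng.sample(included, n)
    controls = list(excluded) if len(excluded) <= n else rng.sample(excluded, n)
    return cases, controls
```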

Statistical analysis

To quantify the efficacy of ChatGPT and other AI tools for each database, we calculated the following metrics: (1) sensitivity, (2) specificity, and (3) overall accuracy, where sensitivity is defined as the number of true positives divided by the sum of true positives and false negatives, specificity as the number of true negatives divided by the sum of true negatives and false positives, and accuracy as the sum of true positives and true negatives divided by the total number of abstracts. For each metric, the associated 95% confidence interval was also determined. Although it is common in the field to report the F1 score, recall, and precision, we believe sensitivity and specificity are more appropriate given this study design; moreover, the F1 score, recall, and precision can be derived from sensitivity and specificity together with the known numbers of cases and controls.
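A minimal sketch of these calculations follows; the analyses in the paper were run in R, so this Python version (with illustrative counts and a simple Wald interval) is only meant to make the definitions explicit.

```python
import math

def prop_ci(successes, total, z=1.96):
    """Point estimate and Wald 95% CI for a proportion; an exact binomial
    interval could be substituted, e.g., scipy.stats.binomtest(...).proportion_ci()."""
    p = successes / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half), min(1.0, p + half)

# Illustrative counts for 100 cases and 100 controls
tp, fn, tn, fp = 93, 7, 96, 4
sensitivity = prop_ci(tp, tp + fn)              # TP / (TP + FN)
specificity = prop_ci(tn, tn + fp)              # TN / (TN + FP)
accuracy = prop_ci(tp + tn, tp + fn + tn + fp)  # (TP + TN) / total
```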

Furthermore, to explore the potential of a unified decision-making process, we combined the decisions from all AI tools using a voting mechanism. The majority decision across the tools was taken as the final verdict for each abstract. For this consolidated approach, we again computed sensitivity, specificity, overall accuracy, and the associated 95% CIs for each database. We also explore the use of latent class analysis (LCA), a model-based clustering approach, to derive consolidated decisions. More details on this LCA approach are provided in the “ Beyond majority voting ” section.
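A sketch of the voting rule is shown below, assuming each tool returns a binary include/exclude decision; how ties are broken is our assumption, since the text does not specify it.

```python
def majority_vote(decisions):
    """Combine binary include/exclude decisions (True = include) from several
    LLM tools into one verdict; ties are resolved in favour of inclusion here,
    which is an assumption rather than the authors' stated rule."""
    yes = sum(bool(d) for d in decisions)
    return 2 * yes >= len(decisions)

# e.g., verdicts from three tools for one abstract
combined = majority_vote([True, False, True])  # -> True
```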

For a given database, 100 cases and 100 controls yield a two-sided 95% confidence interval with a half-width equal to 0.048 when the underlying sensitivity (specificity) is approximately 95%.

All statistical analyses were conducted using the R statistical software (version 4.3.1). All tests were two-sided with an alpha level set at 0.05 unless otherwise mentioned.

To improve the transparency and reproducibility of studies using AI tools, we have included the TRIPOD + AI checklist [30] in our report. This checklist has been adapted to reflect the specifics of our research, which focuses on the evaluation of large language models for abstract screening rather than diagnostic or prognostic modeling. The completed checklist is presented in Table S1 to provide readers with a comprehensive overview of our study’s adherence to established reporting standards.

Results

We present the results for each of the three databases. For each database, we first give the prompts used when calling the LLM tools to screen an abstract, then the performance data (accuracy, sensitivity, and specificity for each method or LLM tool), followed by a summary of the performance and a comparison of the different methods against ChatGPT v4.0.

Results on the Bannach-Brown 2016 database (see Table  2 )

The prompts we used for screening abstracts in this database are as follows:

Conduct a systematic review on animal depression. I provide the title and abstract for one journal article. Provide an overall assessment based on eligibility criteria with only one word answer yes or no with no explanation. Then, for each inclusion or exclusion criterion, answer with only one word, yes if it is included by the inclusion criterion or excluded by the exclusion criterion, and answer no if it does not meet the inclusion criterion or not excluded by the exclusion criterion. After answering all the criteria with yes or no, then provide an overall explanation.

Here is the eligibility criteria: Inclusion Criteria: 1. Any article providing primary data of an animal model of depression or depressive-like phenotype with an appropriate control group (specified above). 2. Animals of all ages, sexes and species, where depression-like phenotype intended to mimic the human condition have been induced. Including animal models where depressive-like phenotypes are induced in the presence of a comorbidity (e.g. obesity or cancer). 3. All studies that claim to model depression or depressive-like phenotypes in animals. Studies that induce depressive behavior or model depression and that also test a treatment or intervention (prior or subsequent to model induction), with no exclusion criteria based on dosage, timing or frequency. 4. Studies measuring behavioral, anatomical and structural, electrophysiological, histological and/or neurochemical outcomes and where genomic, proteomic or metabolomic outcomes are measured in addition to behavioral, anatomical, electrophysiological, histological or neurochemical outcomes. Exclusion Criteria: 1. Review article, editorials, case reports, letters or comments, conference or seminar abstracts, studies providing primary data but not appropriate control group. 2. Human studies and ex vivo, in vitro or in silico studies. Studies will be excluded if authors state an intention to induce or investigate only anxiety or anxious behavior. Studies will be excluded if there is no experimental intervention on the animals (e.g. purely observational studies). 3. Studies that investigate treatments or interventions, but no depressive behavior or model of depression is induced (e.g. toxicity and side-effect studies). 4. Where metabolic outcome measures are the primary outcome measure of a study. Where genomic, proteomic, metabolic or metabolomic outcomes are the sole outcome measures in a study, they will be excluded.

Here is the abstract:
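For readers who want to reproduce this step programmatically, the sketch below shows one way the fixed instructions above and a single article’s title and abstract might be combined into a prompt, and how the one-word overall decision could be read from the model’s reply; the helper names are ours, not taken from the authors’ code.

```python
def build_prompt(task_intro, criteria, title, abstract):
    """Assemble the screening prompt for one article from the fixed
    instruction text, the eligibility criteria, and the title/abstract."""
    return (
        f"{task_intro}\n\n"
        f"Here is the eligibility criteria: {criteria}\n\n"
        f"Here is the abstract:\nTitle: {title}\nAbstract: {abstract}"
    )

def parse_overall_decision(reply):
    """Read the one-word overall assessment (yes/no) that the prompt asks the
    model to give first; anything else is treated as an unusable response."""
    tokens = reply.strip().split()
    if not tokens:
        return None
    first = tokens[0].lower().strip(".,:;")
    return first == "yes" if first in ("yes", "no") else None
```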

Among all the LLM tools we tested, ChatGPT v4.0 stood out with the highest accuracy (0.945) and specificity (0.960), and satisfactory sensitivity (0.930). The combined decision based on majority voting improved sensitivity (0.970) considerably but lowered specificity (0.870). Comparatively, the zero-shot method was less effective across these metrics, while the hybrid method attained superior accuracy (0.961) and specificity (0.982), albeit with reduced sensitivity (0.843). We acknowledge that our findings regarding the zero-shot and hybrid methods differ from those reported in [18] for the same set of abstracts. This discrepancy could be attributed to the use of distinct sets of embeddings in our analysis compared to the previous study.

For this database, the newer versions of ChatGPT (3.5-Turbo and 4-Turbo) did not improve performance over ChatGPT (v4.0). Gemini-1.0-pro (vs. PaLM 2) and Llama 3 (vs. Llama 2) improved over their older versions, but did not surpass the performance of ChatGPT (v4.0). Claude 3 performed well, but still did not surpass the performance of ChatGPT (v4.0).

Comparison between LLM tools. We compared the performance (sensitivity and specificity) between ChatGPTv4.0 and other LLM tools using the McNemar test and found that ChatGPTv4.0 performed significantly better ( p value = 0.002) than Google PaLM 2 in terms of sensitivity; ChatGPTv4.0 performed significantly better than ChatGPTv3.5 ( p value = 0.008) and better than Llama-2 ( p value < 0.001) in terms of specificity. Combining the decisions of different LLM tools using majority voting did not improve the overall accuracy compared to ChatGPTv4.0. Specifically, there was no statistically significant difference ( p value = 0.134) in sensitivity between the combined decision (majority voting) and ChatGPTv4.0, and the combined decision was significantly worse ( p value = 0.008) than ChatGPTv4.0 in terms of specificity.

Comparison between ChatGPT v4.0 and zero-shot and hybrid methods: We assessed the performance (sensitivity and specificity) of ChatGPT v4.0 against both the zero-shot and hybrid approaches using the McNemar test. Specifically, we aligned the screening results from 100 cases and 100 controls as per the ChatGPT v4.0 method and similarly for the zero-shot and hybrid methods, testing for inconsistencies between these approaches as previously done. Our analysis revealed that ChatGPT v4.0 significantly outperformed the zero-shot method in sensitivity ( p value < 0.001) but showed comparable effectiveness in specificity ( p value = 0.37). Additionally, ChatGPT v4.0 demonstrated marginally superior sensitivity compared to the hybrid method ( p value = 0.07), while its performance in specificity was similar ( p value = 1.00).
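As an illustration of how such a paired comparison can be set up (the paper’s tests were run in R; this Python sketch uses statsmodels and is not the authors’ code), the McNemar test is applied to per-abstract correctness indicators from two methods:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def paired_mcnemar(correct_a, correct_b):
    """McNemar test on paired per-abstract outcomes (True = correct call).
    For sensitivity, restrict both vectors to gold-standard includes;
    for specificity, restrict them to gold-standard excludes."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    table = [[int(np.sum(a & b)), int(np.sum(a & ~b))],
             [int(np.sum(~a & b)), int(np.sum(~a & ~b))]]
    return mcnemar(table, exact=True).pvalue
```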

Results on the Meijboom 2021 database (see Table  3 )

Conduct a systematic review on transitioning patients from an originator to a corresponding biosimilar.

I provide the title and abstract for one journal article. Provide an overall assessment based on eligibility criteria with only one word answer yes or no, with no explanation. Then, for each inclusion or exclusion criterion, answer with only one word, yes if it is included by the inclusion criterion or excluded by the exclusion criterion, and answer no if it does not meet the inclusion criterion or not excluded by the exclusion criterion.

After answering all the criteria with yes or no, then provide an overall explanation.

Here is the eligibility criteria:

Articles were included if they met the following criteria:

Study involved transitioning from a TNFα inhibitor (including etanercept, infliximab, and adalimumab) originator to a biosimilar

The number of patients who retransitioned was reported or could be calculated

The article was an original research article published in a peer-reviewed journal

The article included baseline characteristics of the patients who transitioned

The article was written in English

The full-text version of the article could be obtained.

Transitioning was defined as patients in whom the biosimilar was introduced after the originator, without treatment with other drugs in between. Retransitioning was defined as restarting the originator directly after discontinuing a biosimilar, without treatment with other drugs in between. In summary, transitioning was defined as switching from the originator to a biosimilar; retransitioning was defined as switching from the originator to a biosimilar and back to the originator. Both transitioning and retransitioning involved changes with the same active biological substance.

Among all the LLM tools we tested, ChatGPT v4.0 again achieved the highest accuracy (0.840), although its specificity (0.860) and sensitivity (0.812) were more modest. Compared with ChatGPT v4.0, the combined decision based on majority voting did not improve overall accuracy (0.720), but improved sensitivity (1.000) at the expense of specificity (0.630).

Comparison between LLM tools. We compared the performance (sensitivity and specificity) between ChatGPT v4.0 and the other LLM tools using the McNemar test and found that ChatGPT v4.0 performed significantly better (p value < 0.001) than Google PaLM 2, but significantly worse than ChatGPT v3.5 (p value = 0.001) and Llama 2, in terms of sensitivity; ChatGPT v4.0 performed significantly better than ChatGPT v3.5 (p value < 0.001) and Llama 2 (p value < 0.001), but worse than Google PaLM 2 (p value = 0.002), in terms of specificity. Combining the decisions of different LLM tools using majority voting did not improve the overall accuracy compared with ChatGPT v4.0. Specifically, there was a statistically significant difference (p value = 0.008) in sensitivity between the combined decision (majority voting) and ChatGPT v4.0, and the combined decision was not significantly worse (p value > 0.50) than ChatGPT v4.0 in terms of specificity.

For this database, the newer versions of ChatGPT (3.5-Turbo and 4-Turbo) did not improve performance over ChatGPT (v4.0), and Gemini-1.0-pro (vs. PaLM 2) did not improve performance either. However, Llama 3 (vs. Llama 2) improved over its older version and surpassed the performance of ChatGPT (v4.0). Claude 3 also slightly surpassed the performance of ChatGPT (v4.0).

Comparison between ChatGPT v4.0 and zero-shot and hybrid methods: We evaluated the performance of ChatGPT v4.0, focusing on sensitivity and specificity, in comparison with the zero-shot and hybrid approaches, employing the McNemar test as described above. In this analysis, we aligned the screening results from 32 cases and 100 controls for the tests. Our findings indicated that ChatGPT v4.0 significantly surpassed the zero-shot method in sensitivity ( p value = 0.0002) and exhibited marginally improved specificity ( p value = 0.099). Furthermore, ChatGPT v4.0 showed notably higher sensitivity than the hybrid method ( p value < 0.001), although its specificity was comparatively lower.

Results on the Menon 2022 database (see Table  4 )

“Conduct a systematic review on the methodological rigour of systematic reviews in environmental health.

I provide the title and abstract for one journal article.

Provide an overall assessment based on eligibility criteria with only one word answer yes or no, with no explanation.

Then, for each inclusion or exclusion criterion, answer with only one word, yes if it is included by the inclusion criterion or excluded by the exclusion criterion and answer no if it does not meet the inclusion criterion or not excluded by the exclusion criterion.

Here are the eligibility criteria:

To be eligible for inclusion in the SR sample, documents had to fulfill the following criteria:

Identify explicitly as a “systematic review” in their title

Assess the effect of a non-acute, non-communicable, environmental exposure on a health outcome. Environmental exposures can include air and water pollutants, radiation, noise, occupational hazards, lifestyle factors (like diet or physical activity) and lifestyle choices influenced by family and peers (like substance use), social and economic factors (like stress from work or living conditions).

Include studies in people or mammalian models

Be available in HTML format

Abstract X”

Among all the LLM tools we tested, ChatGPT v4.0 stood out with the highest accuracy (0.913) and sensitivity (0.932), but not the highest specificity (0.900). Compared with ChatGPT v4.0, the combined decision based on majority voting did not improve overall accuracy (0.884) or sensitivity (0.808), but improved specificity (0.940).

Comparison between LLM tools. We compared the performance (sensitivity and specificity) between ChatGPT v4.0 and the other LLM tools using the McNemar test and found that ChatGPT v4.0 performed significantly better than ChatGPT v3.5 (p value < 0.001), Google PaLM 2, and Llama 2 (p value = 0.02) in terms of sensitivity; ChatGPT v4.0 performed worse than ChatGPT v3.5 and Google PaLM 2 in terms of specificity. Combining the decisions of different LLM tools using majority voting did not improve the overall accuracy compared with ChatGPT v4.0. Specifically, there was a statistically significant difference (p value = 0.008) in sensitivity between the combined decision (majority voting) and ChatGPT v4.0, and the combined decision was not significantly different (p value = 0.134) from ChatGPT v4.0 in terms of specificity.

For this database, the newer versions of ChatGPT (3.5-Turbo and 4-Turbo) did not improve performance over ChatGPT (v4.0). However, both Gemini-1.0-pro (vs. PaLM 2) and Llama 3 (vs. Llama 2) improved over their older versions and surpassed the performance of ChatGPT (v4.0). Claude 3 also performed well but did not surpass the performance of ChatGPT (v4.0).

Comparison between ChatGPT v4.0 and zero-shot and hybrid methods: We aligned the screening results from 73 cases and 100 controls based on the ChatGPT v4.0 method, and similarly for the zero-shot and hybrid methods, to test for inconsistencies between these approaches, using the McNemar test as done in previous assessments. Our analysis showed that ChatGPT v4.0 significantly outperformed the zero-shot method in both sensitivity ( p value < 0.001) and specificity ( p value = 0.016). In comparison with the hybrid method, ChatGPT v4.0 also demonstrated superior sensitivity ( p value < 0.001) and better specificity ( p value = 0.04).

Monetary cost and time cost

To use the ChatGPT API or other LLM tools, the owners of these platforms charge a predetermined rate for access to the corresponding APIs. These fees are calculated in USD per thousand tokens, where tokens are the basic units used by these LLM platforms to quantify text length. In this context, a token can represent a word, a punctuation mark, or a character. The financial cost of screening 200 abstracts was approximately $6 for ChatGPT v4.0, $0.2 for ChatGPT v3.5, $10 for Llama 2 (using Replicate), while Google PaLM 2 offered its services for free to invited developers. Thus, the cumulative cost of evaluating 200 abstracts across all platforms was approximately $16.2. The cumulative cost of evaluating 200 abstracts across all latest models ($3 for GPT-4-Turbo, $0.05 for GPT-3.5-Turbo, free for Gemini-1.0-pro, $0.05 for Llama-3, $4 for Claude) was less, approximately $7.1. In terms of time efficiency, processing 200 abstracts with each of these LLM tools took approximately 10–20 min using a single thread. However, it is imperative to recognize that abstract screening lends itself well to parallelization. Consequently, one could significantly speed up the process by setting up multiple threads to simultaneously screen different subsets of abstracts, thereby reducing the overall time required for completion. This parallel approach not only increases efficiency, but also ensures that large amounts of data can be processed in a timely manner, making LLM tools even more attractive for large-scale abstract screening tasks. In summary, the monetary and time costs of using LLM tools for abstract screening are negligible compared to manual labeling.
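The cost arithmetic and the parallelization idea can be sketched as follows; the token count per abstract, the per-1K-token rate, and the worker count are illustrative assumptions, and any real run should respect each platform’s rate limits.

```python
from concurrent.futures import ThreadPoolExecutor

def screening_cost(n_abstracts, tokens_per_abstract, usd_per_1k_tokens):
    """Rough cost estimate for one tool (prompt + response tokens combined)."""
    return n_abstracts * tokens_per_abstract / 1000 * usd_per_1k_tokens

# e.g., 200 abstracts at ~1,000 tokens each and a hypothetical $0.03 per 1K tokens
print(screening_cost(200, 1000, 0.03))  # -> 6.0 (in line with the GPT-4 figure above)

def screen_all(abstracts, screen_one, workers=8):
    """Screen abstracts concurrently; `screen_one` wraps a single API call."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(screen_one, abstracts))
```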

Beyond majority voting

We have expanded our analysis to include a variety of approaches for synthesizing decisions across different LLM tools. Our methodology is inspired by the concept of combining multiple diagnostic tests in the absence of a gold standard, akin to situations where human expert consensus is not available. Several publications discuss such scenarios [31, 32]; building on this work, we propose the use of latent class analysis (LCA) models.

Latent class analysis (LCA) is a statistical method used to identify subgroups within a population, which are not directly observed (hence “latent”) [ 33 , 34 , 35 ]. It is particularly useful when the research interest lies in categorizing individuals into mutually exclusive groups based on their responses to multiple observed variables. In the context of abstract screening, LCA can offer a sophisticated means of integrating decisions from different LLM tools without relying on a gold standard, typically provided by human expert consensus. This approach assumes that the unobserved subgroups (or “latent classes”) explain the dependence between the observed decisions made by each of the LLM tools.

Utilizing the LCA model, we treat the decisions from all LLM tools as dichotomous variables, corresponding to the adherence to each inclusion or exclusion criterion, as well as the overall decision. For instance, within the Bannach-Brown 2016 database (BB2016), there are eight criteria in total with four criteria each for inclusion and exclusion and one overall decision for inclusion, resulting in a total of nine binary items per LLM tool. In our analysis, we incorporated decisions from GPT v4.0, v3.5, and Llama 2. Decisions from Google PaLM 2 were excluded due to a high frequency (10% or more) of incomplete responses. Consequently, for the Bannach-Brown 2016 database, we worked with 27 binary items. For other databases such as Meijboom 2021 (Mj2021) and Menon 2022 (Me2022), the binary items totaled 21 and 15, respectively. It is important to note that LCA models were fitted to the binary data of each database independently.

The LCA model fitting process enables us to calculate the posterior probabilities of each abstract belonging to specific latent classes or subgroups. Abstracts are then categorized based on these probabilities, with assignment to the class for which an abstract has the highest posterior membership probability. The determination of the number of latent classes is a critical step in the LCA model fitting, which requires a priori specification. In our evaluation, we explored models with class numbers ranging from 2 to 6 and utilized the Bayesian information criterion (BIC) to identify the most “optimal” LCA model for our datasets.
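For readers unfamiliar with LCA, the sketch below spells out the core EM iteration for conditionally independent binary items and the BIC used for model selection; it is a compact stand-in for the standard software an R-based analysis would rely on, not the authors’ implementation, and the example data are random placeholders.

```python
import numpy as np

def fit_lca(X, n_classes, n_iter=500, tol=1e-6, seed=0):
    """EM for a latent class model with conditionally independent binary items.
    X is an (N, J) array of 0/1 decisions; returns posterior memberships and BIC."""
    rng = np.random.default_rng(seed)
    N, J = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)               # class priors
    theta = rng.uniform(0.25, 0.75, size=(n_classes, J))   # item probabilities per class
    ll_old = -np.inf
    for _ in range(n_iter):
        # E-step: posterior class membership for each abstract
        log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T + np.log(pi)
        m = log_lik.max(axis=1, keepdims=True)
        post = np.exp(log_lik - m)
        ll = float((m.squeeze() + np.log(post.sum(axis=1))).sum())
        post /= post.sum(axis=1, keepdims=True)
        # M-step: update class sizes and item probabilities
        nk = post.sum(axis=0)
        pi = np.clip(nk / N, 1e-12, 1)
        theta = np.clip((post.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
        if abs(ll - ll_old) < tol:
            break
        ll_old = ll
    k = (n_classes - 1) + n_classes * J                    # free parameters
    bic = -2 * ll + k * np.log(N)
    return post, bic

# Choose the number of classes (2-6) by the smallest BIC, then assign each
# abstract to its highest-posterior class, as described above.
X = (np.random.default_rng(1).uniform(size=(200, 9)) < 0.5).astype(int)  # placeholder data
best_c = min(range(2, 7), key=lambda c: fit_lca(X, c)[1])
```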

Table 5 shows that after applying the Bayesian information criterion (BIC) to determine the most appropriate model for our data, we identified a 3-class model as the best fit for the binary decisions derived from the BB2016 database. Similarly, a 4-class model was optimal for the Mj2021 database, while a 3-class model was again best for the Me2022 database. The confusion matrices generated by the selected LCA models for each database provided a clear juxtaposition between the LLM-assigned classes and the actual labels of the abstracts (see Table  6 ).

The performance metrics derived from these models are noteworthy. For the BB2016 database, if we categorize abstracts in class 2 as “included” and those in classes 1 or 3 as “excluded”, the LCA model achieved a sensitivity of 93% and a specificity of 96%, indicating a high degree of accuracy in classifying relevant and irrelevant abstracts. In the case of the Mj2021 database, if we view class 1 as the “included” group and classes 2 and 3 as the “excluded” group, the model achieved a perfect sensitivity of 100%, meaning that it correctly identified all relevant abstracts, although the specificity was lower at 79%, indicating a higher rate of false positives. Similarly, for the Me2022 database, considering class 1 to be the “excluded” category and classes 2 and 3 to be the “included” group, the model showed a sensitivity of 94.5% and a specificity of 83%, a good balance between identifying relevant abstracts and avoiding false inclusions.

These results highlight the robustness of the latent class analysis approach for the abstract screening task, providing an effective method for classifying abstracts when a gold standard is not available. The varying levels of sensitivity and specificity across databases underscore the need to tailor the LCA model to the specific characteristics of each dataset, as well as the need for further research.

We have incorporated results from the latest LLM models to increase the robustness and relevance of our findings. However, to maintain consistency and comparability with our original analyses, we have not incorporated these new results into the previously established majority voting or latent class analysis (LCA) approaches. Instead, we have chosen to make all raw data, including results from these newer models, freely available in our GitHub repository. This approach allows interested researchers and practitioners to conduct further investigations or apply alternative methods of analysis. By providing access to this additional data, we aim to promote transparency and enable the broader community to engage with and potentially build upon our work.

Discussion

This study began with a rigorous exploration of the capabilities of large language models (LLMs) in abstract screening. We used automation scripts developed in Python to interact with the APIs of several LLM tools, including ChatGPT v4.0, Google PaLM 2, and Meta Llama 2, as well as the latest versions of these tools. Our central goal was to evaluate the efficiency and accuracy of these tools across three different databases of abstracts, leading us to a nuanced understanding of their potential in this context.

Large language models (LLMs), particularly ChatGPT, have garnered global attention since their inception. Employing LLMs for abstract screening in systematic reviews is an innovative concept [ 32 , 33 ] and remains underexplored. This study presents the first comprehensive evaluation of LLMs applied to systematic review processes. The findings are encouraging, suggesting that LLMs could revolutionize abstract screening. Specifically, ChatGPT v4.0 exhibited stellar performance across three test scenarios, achieving an accuracy of at least 85%. Furthermore, it attained sensitivity and specificity rates ranging from 80% to an impressive 95%. These exceptional outcomes highlight the substantial promise of LLMs in abstract screening, offering an efficient and capable alternative to the conventional, laborious approaches that typically necessitate extensive human annotation.

However, it is important to acknowledge that we are still in the early stages of integrating LLM tools into the abstract screening field, and they are not without their imperfections: for example, even the best-performing LLM tool, ChatGPT v4.0, occasionally excluded a substantial share (roughly 20%) of relevant studies. These tools are not a universal solution to all the challenges associated with abstract screening, and they are not ready to completely replace human expertise in this area. Instead, they should be embraced as invaluable assistants in the abstract screening process.

In discussing the limitations of our study, it is important to recognize the constraints associated with our dataset selection and model configuration. We used only three databases from the SYNERGY set, limiting the generalizability of our findings across disciplines and datasets. In addition, the reliance on human-curated labels as a gold standard, while necessary, introduces potential biases due to the retrospective nature of our analysis. These labels may contain errors, and the methodology used in the original reviews to resolve discrepancies may affect the validity of our conclusions.

The parameters chosen for our LLMs—temperature, top k, top p, and prompts—were set to defaults or based on natural conversations to balance output quality with ease of use. While this approach minimizes the need for technical expertise, it may not be optimal for all screening scenarios. In addition, the reliance of LLMs on abstracts alone, as opposed to full-text evaluations, presents a fundamental challenge; critical data influencing inclusion or exclusion may not be present in the abstracts, potentially compromising screening accuracy.

In addition, the rapid development of LLMs and their “black box” nature pose challenges to the transparency that is essential in scientific settings. The environmental impact of using these computationally intensive models is also significant [ 36 ], requiring sustainability considerations. Future research should focus on refining these tools to increase transparency and efficiency, and ensure their responsible development and use in systematic reviews.

Our research suggests that LLM tools are ready to take on a role in abstract screening and are poised to have an immediate and positive impact on the process. Their integration into abstract screening can manifest itself in a few ways. They can serve as autonomous AI reviewers, adding an extra layer of scrutiny and ensuring thoroughness. Our findings suggest that a collective decision, such as one derived from majority voting, can sometimes improve sensitivity, underscoring the potential of LLM tools as a reliable safeguard against oversight, ensuring both comprehensiveness and accuracy.

In addition, LLM tools can facilitate a synergistic partnership with human experts. They are adept at identifying “high-risk” abstracts where different LLM tools have reached different judgments, flagging them for further human evaluation, and promoting a careful and error-free screening process while minimizing human effort.

Another exciting prospect is the integration of LLM tools with hybrid and active learning approaches. In this scenario, LLM tools could autonomously annotate abstracts in the training set, minimizing the need for human labeling. Subsequently, these labeled abstracts could be used to train custom NLP models, paving the way for a streamlined and efficient abstract screening process, resulting in significant time and resource savings. Further research is needed to understand how the “uncertainty” in LLM-based decisions, when human-curated labels are not used, affects the performance of the hybrid approach. We also note from the test examples that the performance of hybrid approaches varies widely from example to example and depends on the text embedding tools used. Extensive research is needed to perfect these hybrid approaches.

Another future research direction is to explore how to fine-tune different versions of LLM tools and how to derive collective decisions. One idea is that by using different tuning parameters, one could propose different versions of LLM tools, and if they perform similarly in terms of accuracy but give different decisions (e.g., not highly correlated), we expect that better collective decisions would be obtained, as observed in many ensemble learning experiences [ 37 ]. However, this could be costly and require a lot of exploration.

More importantly, it would be particularly valuable to prospectively test these LLM tools, or their integration with other approaches such as zero-shot classification and active learning, in real-world meta-analysis and systematic review projects. This would provide deeper insights into their practical utility and effectiveness, and a more comprehensive understanding of their impact in live research environments.

In summary, while LLM tools may not be able to fully replace human experts in abstract screening, their ability to transform the screening process is both undeniable and imminent. With continued advances in technology and ongoing refinement, these tools will play a critical role in the future of abstract screening, ushering in a new era of efficiency and effectiveness.

Availability of data and materials

The datasets analyzed during the current study are available in the SYNERGY repository, https://github.com/asreview/synergy-dataset . We have created a GitHub repository https://github.com/mikeli380/LLMAbstractScreening to host our code (R/Python), abstracts, and intermediate results so that interested readers can replicate our findings.

Higgins JPT, et al. Cochrane handbook for systematic reviews of interventions. 2019.


Moher D, et al. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7). https://doi.org/10.1371/journal.pmed.1000097 .

Sackett DL, Rosenberg WMC, Gray JAM, Haynes RB,  Richardson WS. Evidence based medicine: what it is and what it isn’t. 1996. Clin Orthop Relat Res. 2007;455.  https://doi.org/10.1136/bmj.312.7023.71 .

Lavis JN, Posada FB, Haines PA, Osei E. Use of research to inform public policymaking. Lancet. 2004;364(9445). https://doi.org/10.1016/S0140-6736(04)17317-0 .

Bastian H, Glasziou P,  Chalmers I. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up?. PLoS Med. 2010;7(9).   https://doi.org/10.1371/journal.pmed.1000326 .

Polanin JR, Pigott TD, Espelage DL, Grotpeter JK. Best practice guidelines for abstract screening large-evidence systematic reviews and meta-analyses. Res Synth Methods. 2019;10(3). https://doi.org/10.1002/jrsm.1354 .

Turner EH, Matthews AM, Linardatos E, Tell RA, Rosenthal R. Selective publication of antidepressant trials and its influence on apparent efficacy. N Engl J Med. 2008;358(3). https://doi.org/10.1056/nejmsa065779 .

Edwards P, Clarke M, DiGuiseppi C, Pratap S, Roberts I, Wentz R. Identification of randomized controlled trials in systematic reviews: accuracy and reliability of screening records. Stat Med. 2002;21(11). https://doi.org/10.1002/sim.1190 .

Wang Z, Nayfeh T, Tetzlaff J, O’Blenis P, Murad MH. Error rates of human reviewers during abstract screening in systematic reviews. PLoS ONE. 2020;15(1). https://doi.org/10.1371/journal.pone.0227742 .

O’Mara-Eves A, Thomas J, McNaught J, Miwa M, Ananiadou S. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst Rev. 2015;4(1). https://doi.org/10.1186/2046-4053-4-5 .

OpenAI. Introducing ChatGPT Plus. OpenAI Blog; 2023. https://openai.com/blog/chatgpt-plus . Accessed 8 Nov 2023.

Google DeepMind. Introducing PaLM 2. Google Blog; 2023.  https://blog.google/technology/ai/google-palm-2-ai-large-language-model/ . Accessed 8 Nov 2023.

Meta AI Research. Meta and Microsoft introduce the next generation of Llama. Facebook AI Blog; 2023. https://ai.meta.com/blog/llama-2/ . Accessed 8 Nov 2023.

Claude.ai. Claude AI API reference. 2024. https://docs.anthropic.com/en/api/getting-started . Accessed 4 Jun 2024.

Wallace BC, Small K, Brodley CE, Lau J, Trikalinos TA. Deploying an interactive machine learning system in an evidence-based practice center: Abstrackr. 2012. https://doi.org/10.1145/2110363.2110464 .

Alshami A, Elsayed M, Ali E, Eltoukhy AEE, Zayed T. Harnessing the power of ChatGPT for automating systematic review process: methodology, case study, limitations, and future directions. Systems. 2023;11(7). https://doi.org/10.3390/systems11070351 .

Qureshi R, Shaughnessy D, Gill KAR, Robinson KA, Li T, Agai E. Are ChatGPT and large language models ‘the answer’ to bringing us closer to systematic review automation?  Syst Rev. 2023;12(1).   https://doi.org/10.1186/s13643-023-02243-z .

Xian Y, Lampert CH, Schiele B, Akata Z. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE Trans Pattern Anal Mach Intell. 2019;41(9). https://doi.org/10.1109/TPAMI.2018.2857768 .

Moreno-Garcia CF, Jayne C, Elyan E, Aceves-Martins M. A novel application of machine learning and zero-shot classification methods for automated abstract screening in systematic reviews. Decis Anal J. 2023;6.  https://doi.org/10.1016/j.dajour.2023.100162 .

Ferdinands G, et al. Performance of active learning models for screening prioritization in systematic reviews: a simulation study into the Average Time to Discover relevant records. Syst Rev. 2023;12(1). https://doi.org/10.1186/s13643-023-02257-7 .

Boetje J, van de Schoot R. The SAFE procedure: a practical stopping heuristic for active learning-based screening in systematic reviews and meta-analyses. Syst Rev. 2024;13(1). https://doi.org/10.1186/s13643-024-02502-7 .

Yang Y, Loog M. A benchmark and comparison of active learning for logistic regression. Pattern Recognit. 2018;83. https://doi.org/10.1016/j.patcog.2018.06.004 .

De Bruin RJ, Ma Y, Ferdinands G, Teijema J, Van de Schoot R. SYNERGY - open machine learning dataset on study selection in systematic reviews. DataverseNL; 2023. https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/HE6NAQ .

Millard SJ, Weston-Green K, Newell KA. The Wistar-Kyoto rat model of endogenous depression: a tool for exploring treatment resistance with an urgent need to focus on sex differences. Prog Neuropsychopharmacol Biol Psychiatry. 2020;101. https://doi.org/10.1016/j.pnpbp.2020.109908 .

Zawia NH, Mattia CJ, Bondy SC. Differential effects of difluoromethylornithine on basal and induced activity of cerebral ornithine decarboxylase and mRNA. Neuropharmacology. 1991;30(4). https://doi.org/10.1016/0028-3908(91)90058-J .

Shea BJ, et al. AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ. 2017;358. https://doi.org/10.1136/bmj.j4008 .

Bannach-Brown A, Liao J, Wegener G, Macleod M. Understanding in vivo modelling of depression in non-human animals: a systematic review protocol. Evid Based Preclin Med. 2016;3(2). https://doi.org/10.1002/ebm2.24 .

Meijboom RW, Gardarsdottir H, Becker ML, ten Wolde S, Egberts TCG, Giezen TJ. Incidence of and reasons and determinants associated with retransitioning from biosimilar etanercept to originator etanercept. BioDrugs. 2021;35(6). https://doi.org/10.1007/s40259-021-00501-x .

Menon JML, Struijs F, Whaley P. The methodological rigour of systematic reviews in environmental health. Crit Rev Toxicol. 2022;52(3). https://doi.org/10.1080/10408444.2022.2082917 .

Collins GS, et al. Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ Open. 2021;11(7). https://doi.org/10.1136/bmjopen-2020-048008 .

Joseph L, Gyorkos TW, Coupal L. Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. Am J Epidemiol. 1995;141(3). https://doi.org/10.1093/oxfordjournals.aje.a117428 .

Hui SL, Walter SD. Estimating the error rates of diagnostic tests. Biometrics. 1980;36(1). https://doi.org/10.2307/2530508 .

Dendukuri N, Joseph L. Bayesian approaches to modeling the conditional dependence between multiple diagnostic tests. Biometrics. 2001;57(1). https://doi.org/10.1111/j.0006-341X.2001.00158.x .

Collins LM, Lanza ST. Latent class and latent transition analysis: with applications in the social, behavioral, and health sciences. 2010.


McCutcheon AL. Latent class analysis. Newbury Park: Sage; 1987.

Rillig MC, Ågerstrand M, Bi M, Gould KA, Sauerland U. Risks and benefits of large language models for the environment. Environm Sci Technol. 2023;57(9). https://doi.org/10.1021/acs.est.3c01106 .

Sagi O, Rokach L. Ensemble learning: a survey.  Wiley Interdiscip Rev Data Min Knowl Discov. 2018;8(4). https://doi.org/10.1002/widm.1249 .


Acknowledgements

The authors thank the reviewers and the Editor for their insightful comments and suggestions, which have greatly improved the quality of this work.

Funding

Not applicable.

Author information

Authors and Affiliations

Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA

Michael Li & Xianming Tan

Department of Mathematics and Statistics, University of North Carolina at Greensboro, Greensboro, NC, 27402, USA

Jianping Sun

Department of Biostatistics and Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA

Xianming Tan


Contributions

XT conceived the paper. ML and XT devised the study. ML, JS, and XT analyzed the datasets. ML, JS, and XT wrote the manuscript. All authors edited and approved the final manuscript.

Corresponding author

Correspondence to Xianming Tan .

Ethics declarations

Ethics approval and consent to participate

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Material 1: Table S1. TRIPOD + AI checklist.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article.

Li, M., Sun, J. & Tan, X. Evaluating the effectiveness of large language models in abstract screening: a comparative analysis. Syst Rev 13 , 219 (2024). https://doi.org/10.1186/s13643-024-02609-x


Received : 20 December 2023

Accepted : 11 July 2024

Published : 21 August 2024

DOI : https://doi.org/10.1186/s13643-024-02609-x


Keywords

  • Abstract screening
  • Automated text analysis
  • Google PaLM




Title: Streamlining the Selection Phase of Systematic Literature Reviews (SLRs) Using AI-Enabled GPT-4 Assistant API

Abstract: The escalating volume of academic literature presents a formidable challenge in staying updated with the newest research developments. Addressing this, this study introduces a pioneering AI-based tool, configured specifically to streamline the efficiency of the article selection phase in Systematic Literature Reviews (SLRs). Utilizing the robust capabilities of OpenAI's GPT-4 Assistant API, the tool successfully homogenizes the article selection process across a broad array of academic disciplines. Implemented through a tripartite approach consisting of data preparation, AI-mediated article assessment, and structured result presentation, this tool significantly accelerates the time-consuming task of literature reviews. Importantly, this tool could be highly beneficial in fields such as management and economics, where the SLR process involves substantial human judgment. The adoption of a standard GPT model can substantially reduce potential biases and enhance the speed and precision of the SLR selection phase. This not only amplifies researcher productivity and accuracy but also denotes a considerable stride forward in the way academic research is conducted amidst the surging body of scholarly publications.
Comments: 11 pages, 5 figures
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)




  • Wang, D.; Wang, P.; Liu, Y. The Emergence Process of Construction Project Resilience: A Social Network Analysis Approach. Buildings 2022 , 12 , 822. [ Google Scholar ] [ CrossRef ]
  • Trinh, M.T.; Feng, Y. A Maturity Model for Resilient Safety Culture Development in Construction Companies. Buildings 2022 , 12 , 733. [ Google Scholar ] [ CrossRef ]
  • Feng, Y.; Trinh, M.T. Developing Resilient Safety Culture for Construction Projects. J. Constr. Eng. Manag. 2019 , 145 , 04019069. [ Google Scholar ] [ CrossRef ]
  • Saldanha, M.C.W.; Araújo, L.L.F.; Arcuri, R.; Vidal, M.C.R.; De Carvalho, P.V.R.; De Carvalho, R.J.M. Identifying Routes and Organizational Practices for Resilient Performance: A Study in the Construction Industry. Cogn. Technol. Work 2022 , 24 , 521–535. [ Google Scholar ] [ CrossRef ]
  • Trinh, M.T.; Feng, Y. Impact of Project Complexity on Construction Safety Performance: Moderating Role of Resilient Safety Culture. J. Constr. Eng. Manag. 2020 , 146 , 04019103. [ Google Scholar ] [ CrossRef ]
  • Chih, Y.-Y.; Hsiao, C.Y.-L.; Zolghadr, A.; Naderpajouh, N. Resilience of Organizations in the Construction Industry in the Face of COVID-19 Disturbances: Dynamic Capabilities Perspective. J. Manag. Eng. 2022 , 38 , 04022002. [ Google Scholar ] [ CrossRef ]
  • Malik, A.; Khan, K.I.A.; Qayyum, S.; Ullah, F.; Maqsoom, A. Resilient Capabilities to Tackle Supply Chain Risks: Managing Integration Complexities in Construction Projects. Buildings 2022 , 12 , 1322. [ Google Scholar ] [ CrossRef ]
  • Nassereddine, H.; Seo, K.W.; Rybkowski, Z.K.; Schranz, C.; Urban, H. Propositions for a Resilient, Post-COVID-19 Future for the AEC Industry. Front. Built Environ. 2021 , 7 , 687021. [ Google Scholar ] [ CrossRef ]
  • Chen, Y.; McCabe, B.; Hyatt, D. A Resilience Safety Climate Model Predicting Construction Safety Performance. Saf. Sci. 2018 , 109 , 434–445. [ Google Scholar ] [ CrossRef ]
  • Gao, L.; Luo, X.; Wang, Y.; Zhang, N.; Deng, X. Retention in Challenging International Construction Assignments: Role of Expatriate Resilience. J. Constr. Eng. Manag. 2024 , 150 , 04023158. [ Google Scholar ] [ CrossRef ]
  • Liu, K.; Liu, Y.; Kou, Y. Study on Construction Safety Management in Megaprojects from the Perspective of Resilient Governance. Saf. Sci. 2024 , 173 , 106442. [ Google Scholar ] [ CrossRef ]
  • Xu, S.; Lin, B.; Zou, P.X.W. Examining Construction Group’s Safety Attitude Resilience under Major Disruptions: An Agent-Based Modelling Approach. Saf. Sci. 2023 , 161 , 106071. [ Google Scholar ] [ CrossRef ]
  • Cai, B.; Xie, M.; Liu, Y.; Liu, Y.; Feng, Q. Availability-Based Engineering Resilience Metric and Its Corresponding Evaluation Methodology. Reliab. Eng. Syst. Saf. 2018 , 172 , 216–224. [ Google Scholar ] [ CrossRef ]
  • Kilanitis, I.; Sextos, A. Integrated Seismic Risk and Resilience Assessment of Roadway Networks in Earthquake Prone Areas. Bull. Earthq. Eng. 2019 , 17 , 181–210. [ Google Scholar ] [ CrossRef ]
  • Wang, H.; Zhou, J.; Dun, Z.; Cheng, J.; Li, H.; Dun, Z. Resilience Evaluation of High-Speed Railway Subgrade Construction Systems in Goaf Sites. Sustainability 2022 , 14 , 7806. [ Google Scholar ] [ CrossRef ]
  • Zhang, W.; Wang, N.; Nicholson, C. Resilience-Based Post-Disaster Recovery Strategies for Road-Bridge Networks. Struct. Infrastruct. Eng. 2017 , 13 , 1404–1413. [ Google Scholar ] [ CrossRef ]
  • Zhang, W.; Wang, N. Resilience-Based Risk Mitigation for Road Networks. Struct. Saf. 2016 , 62 , 57–65. [ Google Scholar ] [ CrossRef ]
  • Dong, Y.; Frangopol, D.M. Risk and Resilience Assessment of Bridges under Mainshock and Aftershocks Incorporating Uncertainties. Eng. Struct. 2015 , 83 , 198–208. [ Google Scholar ] [ CrossRef ]
  • Alipour, A.; Shafei, B. Seismic Resilience of Transportation Networks with Deteriorating Components. J. Struct. Eng. 2016 , 142 , C4015015. [ Google Scholar ] [ CrossRef ]
  • Chen, Z.; Hammad, A.W.A.; Alyami, M. Building Construction Supply Chain Resilience under Supply and Demand Uncertainties. Autom. Constr. 2024 , 158 , 105190. [ Google Scholar ] [ CrossRef ]
  • Xie, D.; Xin, J.; Wang, H.; Xiao, L. Identifying Critical Factors Affecting the Resilience of Additive Manufacturing Architecture Supply Chain. Buildings 2023 , 13 , 997. [ Google Scholar ] [ CrossRef ]
  • Cherian, T.M.; Mathivathanan, D.; Arun SJ, C.J.; Ramasubramaniam, M.; Alathur, S. Influence of Supply Chain Resilience, Information Technology Capabilities and Agility on Cost and Delivery Performance in Construction Supply Chains: An Indian Perspective. Int. J. Logist. Manag. 2022 , 34 , 1050–1076. [ Google Scholar ] [ CrossRef ]
  • Ekanayake, E.M.A.C.; Shen, G.Q.P.; Kumaraswamy, M.M.; Owusu, E.K.; Saka, A.B. Modeling Supply Chain Resilience in Industrialized Construction: A Hong Kong Case. J. Constr. Eng. Manag. 2021 , 147 , 05021009. [ Google Scholar ] [ CrossRef ]
  • Cheng, S.; Zhou, X.; Zhang, Y.; Duan, M.; Gao, J. Study on Resilience Factors and Enhancement Strategies in Prefabricated Building Supply Chains. Buildings 2024 , 14 , 195. [ Google Scholar ] [ CrossRef ]
  • Wehbe, F.; Hattab, M.A.; Hamzeh, F. Exploring Associations between Resilience and Construction Safety Performance in Safety Networks. Saf. Sci. 2016 , 82 , 338–351. [ Google Scholar ] [ CrossRef ]
  • Guo, Q.; Amin, S.; Hao, Q.; Haas, O. Resilience Assessment of Safety System at Subway Construction Sites Applying Analytic Network Process and Extension Cloud Models. Reliab. Eng. Syst. Saf. 2020 , 201 , 106956. [ Google Scholar ] [ CrossRef ]
  • Argyroudis, S.A.; Mitoulis, S.A.; Chatzi, E.; Baker, J.W.; Brilakis, I.; Gkoumas, K.; Vousdoukas, M.; Hynes, W.; Carluccio, S.; Keou, O.; et al. Digital Technologies Can Enhance Climate Resilience of Critical Infrastructure. Clim. Risk Manag. 2022 , 35 , 100387. [ Google Scholar ] [ CrossRef ]
  • Iannacone, L.; Sharma, N.; Tabandeh, A.; Gardoni, P. Modeling Time-Varying Reliability and Resilience of Deteriorating Infrastructure. Reliab. Eng. Syst. Saf. 2022 , 217 , 108074. [ Google Scholar ] [ CrossRef ]
  • Lounis, Z.; McAllister, T.P. Risk-Based Decision Making for Sustainable and Resilient Infrastructure Systems. J. Struct. Eng. 2016 , 142 , F4016005. [ Google Scholar ] [ CrossRef ]
  • Dao, J.; Ng, S.T.; Yang, Y.; Zhou, S.; Xu, F.J.; Skitmore, M. Semantic Framework for Interdependent Infrastructure Resilience Decision Support. Autom. Constr. 2021 , 130 , 103852. [ Google Scholar ] [ CrossRef ]
  • Burton, H.V.; Deierlein, G.; Lallemant, D.; Lin, T. Framework for Incorporating Probabilistic Building Performance in the Assessment of Community Seismic Resilience. J. Struct. Eng. 2016 , 142 , C4015007. [ Google Scholar ] [ CrossRef ]
  • Franchin, P.; Cavalieri, F. Probabilistic Assessment of Civil Infrastructure Resilience to Earthquakes. Comput. Aided Civ. Eng 2015 , 30 , 583–600. [ Google Scholar ] [ CrossRef ]
  • Wei, W.; Mojtahedi, M.; Yazdani, M.; Kabirifar, K. The Alignment of Australia’s National Construction Code and the Sendai Framework for Disaster Risk Reduction in Achieving Resilient Buildings and Communities. Buildings 2021 , 11 , 429. [ Google Scholar ] [ CrossRef ]
  • Yang, Y.; Ng, S.T.; Xu, F.J.; Skitmore, M. Towards Sustainable and Resilient High Density Cities through Better Integration of Infrastructure Networks. Sustain. Cities Soc. 2018 , 42 , 407–422. [ Google Scholar ] [ CrossRef ]
  • Jiang, S.; Ling, F.Y.Y.; Ma, G. Fostering Resilience in Project Teams: Adaptive Structuration Perspective. J. Manag. Eng. 2024 , 40 , 04023047. [ Google Scholar ] [ CrossRef ]
  • Lu, C.; Yu, D.; Luo, Q.; Xu, C. A Study of the Effects of Job Stress on the Psychosocial Safety Behavior of Construction Workers: The Mediating Role of Psychological Resilience. Buildings 2023 , 13 , 1930. [ Google Scholar ] [ CrossRef ]
  • Aasen, A.F.; Klakegg, O.J. Human Resilience and Cultural Change in the Construction Industry: Communication and Relationships in a Time of Enforced Adaptation. Front. Built Environ. 2023 , 9 , 1287483. [ Google Scholar ] [ CrossRef ]
  • Chen, Y.; McCabe, B.; Hyatt, D. Impact of Individual Resilience and Safety Climate on Safety Performance and Psychological Stress of Construction Workers: A Case Study of the Ontario Construction Industry. J. Saf. Res. 2017 , 61 , 167–176. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Nwaogu, J.M.; Chan, A.P.C. The Impact of Coping Strategies and Individual Resilience on Anxiety and Depression among Construction Supervisors. Buildings 2022 , 12 , 2148. [ Google Scholar ] [ CrossRef ]
  • Wang, T.-K.; Zhang, Q.; Chong, H.-Y.; Wang, X. Integrated Supplier Selection Framework in a Resilient Construction Supply Chain: An Approach via Analytic Hierarchy Process (AHP) and Grey Relational Analysis (GRA). Sustainability 2017 , 9 , 289. [ Google Scholar ] [ CrossRef ]
  • Shishodia, A.; Verma, P.; Dixit, V. Supplier Evaluation for Resilient Project Driven Supply Chain. Comput. Ind. Eng. 2019 , 129 , 465–478. [ Google Scholar ] [ CrossRef ]
  • Thurber, C. An Introduction to Seismology, Earthquakes, and Earth Structure. EoS Trans. 2003 , 84 , 209–210. [ Google Scholar ] [ CrossRef ]
  • Morse, S.S.; Mazet, J.A.; Woolhouse, M.; Parrish, C.R.; Carroll, D.; Karesh, W.B.; Zambrana-Torrelio, C.; Lipkin, W.I.; Daszak, P. Prediction and Prevention of the next Pandemic Zoonosis. Lancet 2012 , 380 , 1956–1965. [ Google Scholar ] [ CrossRef ]
  • Cutter, S.L.; Boruff, B.J.; Shirley, W.L. Social Vulnerability to Environmental Hazards*. Soc. Sci. Q. 2003 , 84 , 242–261. [ Google Scholar ] [ CrossRef ]
  • Sauter, S.L.; Murphy, L.R.; Hurrell, J.J. Prevention of Work-Related Psychological Disorders. A National Strategy Proposed by the National Institute for Occupational Safety and Health (NIOSH). Am. Psychol. 1990 , 45 , 1146–1158. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Daboun, O.; Md Yusof, A.; Khoso, A.R. Relationship Management in Construction Projects: Systematic Literature Review. Eng. Manag. J. 2023 , 35 , 120–143. [ Google Scholar ] [ CrossRef ]
  • Karuppiah, K.; Sankaranarayanan, B.; Ali, S.M. A Systematic Review of Sustainable Business Models: Opportunities, Challenges, and Future Research Directions. Decis. Anal. J. 2023 , 8 , 100272. [ Google Scholar ] [ CrossRef ]
  • Grzybowska, K.; Tubis, A.A. Supply Chain Resilience in Reality VUCA-An International Delphi Study. Sustainability 2022 , 14 , 10711. [ Google Scholar ] [ CrossRef ]
  • Li, K.; Ma, J.; Gao, J.; Xu, C.; Li, W.; Mao, Y.; Jiang, S. Resilience Assessment of Urban Distribution Network Under Heavy Rain: A Knowledge- Informed Data-Driven Approach. IEEE Access 2023 , 11 , 63741–63750. [ Google Scholar ] [ CrossRef ]
  • Yang, Y.; Ng, S.T.; Zhou, S.; Xu, F.J.; Li, D.; Li, H. A Federated Pre-Event Community Resilience Approach for Assessing Physical and Social Sub-Systems: An Extreme Rainfall Case in Hong Kong. Sustain. Cities Soc. 2020 , 52 , 101859. [ Google Scholar ] [ CrossRef ]
  • Bozza, A.; Asprone, D.; Fabbrocino, F. Urban Resilience: A Civil Engineering Perspective. Sustainability 2017 , 9 , 103. [ Google Scholar ] [ CrossRef ]
  • Hao, H.; Bi, K.; Chen, W.; Pham, T.M.; Li, J. Towards next Generation Design of Sustainable, Durable, Multi-Hazard Resistant, Resilient, and Smart Civil Engineering Structures. Eng. Struct. 2023 , 277 , 115477. [ Google Scholar ] [ CrossRef ]
  • Herrera, H. Resilience for Whom? The Problem Structuring Process of the Resilience Analysis. Sustainability 2017 , 9 , 1196. [ Google Scholar ] [ CrossRef ]
  • Juncos, A.E. Resilience in Peacebuilding: Contesting Uncertainty, Ambiguity, and Complexity. Contemp. Secur. Policy 2018 , 39 , 559–574. [ Google Scholar ] [ CrossRef ]
  • Li, Y.; Lei, S.; Sun, W.; Hu, C.; Hou, Y. A Distributionally Robust Resilience Enhancement Strategy for Distribution Networks Considering Decision-Dependent Contingencies. IEEE Trans. Smart Grid 2024 , 15 , 1450–1465. [ Google Scholar ] [ CrossRef ]
  • Dolla, T.; Jain, K.; Delhi, V.S.K. Strategies for Digital Transformation in Construction Projects: Stakeholders’ Perceptions and Actor Dynamics for Industry 4.0. J. Inf. Technol. Constr. 2023 , 28 , 151–175. [ Google Scholar ] [ CrossRef ]
  • Browder, R.E.; Dwyer, S.M.; Koch, H. Upgrading Adaptation: How Digital Transformation Promotes Organizational Resilience. Strateg. Entrep. J. 2024 , 18 , 128–164. [ Google Scholar ] [ CrossRef ]
  • Yang, Y.; Ng, S.T.; Li, N.; Xu, X.; Xu, P.; Xu, F.J. Adapting HLA-Based Co-Simulation for Interdependent Infrastructure Resilience Management. Autom. Constr. 2023 , 150 , 104860. [ Google Scholar ] [ CrossRef ]
  • Rathnasiri, P.; Adeniyi, O.; Thurairajah, N. Data-Driven Approaches to Built Environment Flood Resilience: A Scientometric and Critical Review. Adv. Eng. Inform. 2023 , 57 , 102085. [ Google Scholar ] [ CrossRef ]
  • Heidari, A.; Peyvastehgar, Y.; Amanzadegan, M. A Systematic Review of the BIM in Construction: From Smart Building Management to Interoperability of BIM & AI. Archit. Sci. Rev. 2024 , 67 , 237–254. [ Google Scholar ] [ CrossRef ]

Database | Search String | Number of Records
Scopus | (TITLE-ABS-KEY(“construction industry” OR “civil engineering” OR “construction engineering” OR “architecture engineering” OR “construction sector” OR “building industry” OR “construction project”) AND TITLE-ABS-KEY(“resilience” OR “adaptive capacity” OR “robustness” OR “recoverability” OR “durability”) AND TITLE(“resilience” OR “resilient”)) | 451
Web of Science | (TS = (“construction industry” OR “civil engineering” OR “construction engineering” OR “architecture engineering” OR “construction sector” OR “building industry” OR “construction project”) AND TS = (“resilience” OR “adaptive capacity” OR “robustness” OR “recoverability” OR “durability”) AND TI = (“resilience” OR “resilient”)) | 763
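
The Scopus and Web of Science searches above retrieved 451 and 763 records respectively, and the two exports overlap, so a merge-and-deduplicate step normally precedes title/abstract screening. Below is a minimal sketch of that step in Python with pandas; the file names (scopus.csv, wos.csv) and column names (Title, Year, DOI) are assumptions for illustration and are not taken from the original search protocol.

```python
import pandas as pd

# File names and column names are assumptions for illustration, not from the source.
scopus = pd.read_csv("scopus.csv", usecols=["Title", "Year", "DOI"])
wos = pd.read_csv("wos.csv", usecols=["Title", "Year", "DOI"])

records = pd.concat([scopus, wos], ignore_index=True)

# Normalise titles so casing, punctuation, and spacing differences do not hide duplicates.
records["title_key"] = (
    records["Title"]
    .str.lower()
    .str.replace(r"[^a-z0-9 ]", "", regex=True)
    .str.split()
    .str.join(" ")
)

deduped = records.drop_duplicates(subset="title_key").drop(columns="title_key")
deduped.to_csv("records_for_screening.csv", index=False)
print(f"{len(records)} records retrieved, {len(deduped)} unique titles kept for screening")
```

Deduplicating on a normalised title is a common first pass; DOI-based matching or fuzzy matching can be layered on top when exports are less consistent.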
Main Category | Subcategory | Number of Samples | Total
Specific Class | Specific Construction Projects | 3 | 18
Specific Class | Concrete Facilities and Systems | 4 |
Specific Class | Definite Geographic Locations | 4 |
Specific Class | Specific Infrastructure Systems | 7 |
General Class | Industries and Fields | 18 | 52
General Class | Concepts and Methods | 4 |
General Class | Extensive Networks and Systems | 19 |
General Class | Social and Community Impact | 11 |
Main Category | Subcategory | Number of Samples | Total
Natural and environmental factors | Geological events | 16 | 27
Natural and environmental factors | Climate and weather events | 8 |
Natural and environmental factors | Biological events | 3 |
Human and external factors | Economic and supply chain risks | 3 | 49
Human and external factors | Job and workplace-related risks | 6 |
Human and external factors | Security and safety risks | 10 |
Human and external factors | System and infrastructure risks | 11 |
Human and external factors | Uncertainties | 19 |
Main Category | Subcategory | Number of Samples | Total
Management and Operations | Safety and Risk Management | 32 | 60
Management and Operations | Operational Management | 21 |
Management and Operations | Human and Community Management | 5 |
Management and Operations | Resource and Inventory Management | 2 |
Performance and Development | Structural and System Performance | 7 | 17
Performance and Development | Network Performance | 2 |
Performance and Development | Sustainability and Environmental Management | 6 |
Performance and Development | Quality and Improvement Management | 2 |
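
In the three classification tables above, the Total column reads as the sum of the subcategory counts within each main category (for example, 3 + 4 + 4 + 7 = 18 for Specific Class). The short check below, with the counts transcribed from the tables, verifies that reading; the grouping itself is an interpretation of the flattened page layout rather than an explicit statement in the source.

```python
# Counts transcribed from the tables above; the subcategory-to-total grouping is an
# interpretation of the flattened page layout, not an explicit statement in the source.
categories = {
    "Specific Class": ([3, 4, 4, 7], 18),
    "General Class": ([18, 4, 19, 11], 52),
    "Natural and environmental factors": ([16, 8, 3], 27),
    "Human and external factors": ([3, 6, 10, 11, 19], 49),
    "Management and Operations": ([32, 21, 5, 2], 60),
    "Performance and Development": ([7, 2, 6, 2], 17),
}

for name, (counts, reported_total) in categories.items():
    status = "OK" if sum(counts) == reported_total else "MISMATCH"
    print(f"{name}: sum={sum(counts)}, reported total={reported_total} -> {status}")
```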

Share and Cite

Li, J.; Yu, H.; Deng, X. A Systematic Review of the Evolution of the Concept of Resilience in the Construction Industry. Buildings 2024, 14, 2643. https://doi.org/10.3390/buildings14092643



Diagnostic accuracy of artificial intelligence models in detecting osteoporosis using dental images: a systematic review and meta-analysis

Affiliations.

  • 1 Department of Oral and Maxillofacial Radiology, School of Dentistry, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
  • 2 School of Medicine, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
  • 3 Elderly Health Research Center, Endocrinology and Metabolism Population Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran.
  • 4 School of Dentistry, Research Institute for Dental Sciences, Mkhitar Heratsi Yerevan State Medical University, Yerevan, Armenia.
  • 5 Dentofacial Deformities Research Center, Research Institute of Dental Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
  • 6 Dental Research Center, Research Institute for Dental Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran. [email protected].
  • PMID: 39177815
  • DOI: 10.1007/s00198-024-07229-8

The current study aimed to systematically review the literature on the accuracy of artificial intelligence (AI) models for osteoporosis (OP) diagnosis using dental images. A thorough literature search was executed in October 2022 and updated in November 2023 across multiple databases, including PubMed, Scopus, Web of Science, and Google Scholar. The research targeted studies using AI models for OP diagnosis from dental radiographs. The main outcomes were the sensitivity and specificity of AI models regarding OP diagnosis. The "meta" package from the R Foundation was selected for statistical analysis. A random-effects model, along with 95% confidence intervals, was utilized to estimate pooled values. The Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool was employed for risk of bias and applicability assessment. Among 640 records, 22 studies were included in the qualitative analysis and 12 in the meta-analysis. The overall sensitivity for AI-assisted OP diagnosis was 0.85 (95% CI, 0.70-0.93), while the pooled specificity equaled 0.95 (95% CI, 0.91-0.97). Conventional algorithms led to a pooled sensitivity of 0.82 (95% CI, 0.57-0.94) and a pooled specificity of 0.96 (95% CI, 0.93-0.97). Deep convolutional neural networks exhibited a pooled sensitivity of 0.87 (95% CI, 0.68-0.95) and a pooled specificity of 0.92 (95% CI, 0.83-0.96). This systematic review corroborates the accuracy of AI in OP diagnosis using dental images. Future research should expand sample sizes in test and training datasets and standardize imaging techniques to establish the reliability of AI-assisted methods in OP diagnosis through dental images.

Keywords: Artificial intelligence; Bone density; Deep learning; Osteoporosis; Panoramic radiography.

© 2024. International Osteoporosis Foundation and Bone Health and Osteoporosis Foundation.
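
The pooled sensitivity and specificity above were obtained with a random-effects model in R's "meta" package. As a rough illustration of the underlying mechanics rather than a reproduction of that analysis, the sketch below pools per-study sensitivities on the logit scale with a DerSimonian-Laird estimate of between-study variance; the study counts are invented for illustration only, and a production analysis would typically prefer exact binomial or bivariate sensitivity/specificity models.

```python
import numpy as np

def pool_proportions_dl(events, totals):
    """Random-effects (DerSimonian-Laird) pooling of proportions on the logit scale.

    Returns the back-transformed pooled proportion and an approximate 95% CI.
    """
    events = np.asarray(events, dtype=float)
    totals = np.asarray(totals, dtype=float)

    # Continuity-corrected proportions, logit transform, and approximate variances.
    p = (events + 0.5) / (totals + 1.0)
    y = np.log(p / (1.0 - p))
    v = 1.0 / (events + 0.5) + 1.0 / (totals - events + 0.5)

    # Fixed-effect estimate and Cochran's Q, then the DL between-study variance tau^2.
    w = 1.0 / v
    y_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fixed) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)

    # Random-effects pooling and 95% confidence interval on the logit scale.
    w_re = 1.0 / (v + tau2)
    y_re = np.sum(w_re * y) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    expit = lambda x: 1.0 / (1.0 + np.exp(-x))
    return expit(y_re), (expit(y_re - 1.96 * se), expit(y_re + 1.96 * se))

# Invented per-study true positives and diseased counts, for illustration only.
sens, ci = pool_proportions_dl(events=[18, 42, 27, 66], totals=[20, 50, 30, 80])
print(f"pooled sensitivity ~ {sens:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```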




COMMENTS

  1. Artificial intelligence in systematic reviews: promising when

    Background Systematic reviews provide a structured overview of the available evidence in medical-scientific research. However, due to the increasing medical-scientific research output, it is a time-consuming task to conduct systematic reviews. To accelerate this process, artificial intelligence (AI) can be used in the review process. In this communication paper, we suggest how to conduct a ...

  2. Artificial intelligence to automate the systematic review of scientific

    Artificial intelligence (AI) has acquired notorious relevance in modern computing as it effectively solves complex tasks traditionally done by humans. AI provides methods to represent and infer knowledge, efficiently manipulate texts and learn from vast amount of data. These characteristics are applicable in many activities that human find laborious or repetitive, as is the case of the ...

  3. Cheap, Quick, and Rigorous: Artificial Intelligence and the Systematic

    The systematic literature review (SLR) is the gold standard in providing research a firm evidence foundation to support decision-making. ... This technique provides a working example of how researchers can use AI and MLTs to automate the data synthesis and abstraction stage of their SLR, and aid in increasing the speed, frugality, and rigour ...

  4. Generative AI: A systematic review using topic modelling techniques

    These reviews aid in identifying and comprehending the sources, focuses, and context of the extensive literature available (Alsharhan et al., 2023, pp. 1-13; Fui-Hoon Nah et al., 2023; Singh & Singh, 2023), but a comprehensive and interdisciplinary systematic literature review on GAI is necessary, going beyond the examination of individual ...

  5. Using artificial intelligence methods for systematic review in health

    This review delineated automated tools and platforms that employ artificial intelligence (AI) approaches and evaluated the reported benefits and challenges in using such methods. A search was conducted in 4 databases (Medline, Embase, CDSR, and Epistemonikos) up to April 2021 for systematic reviews and other related reviews implementing AI methods.

  6. System for Systematic Literature Review Using Multiple AI agents

    The Systematic Literature Review (SLR) is a fundamental component of academic research, offering a comprehensive and unbiased overview of existing literature on a specific topic Keele et al. ().It involves a structured methodology for identifying, evaluating, and synthesizing all relevant research to address clearly defined research questions Kitchenham et al. ().

  7. Artificial intelligence in education: A systematic literature review

    Systematic review of 22 papers: Reviewed AI definition, pedagogical models used, reasons for using AI, the use of automated scoring, and comparison of assessment evaluation accuracy between AI use and non-use: 2010-2020: AI for student assessment: Hwang & Tu, 2021: Bibliometric mapping analysis and systematic review of 43 papers

  8. PRISMA AI reporting guidelines for systematic reviews and meta ...

    The PRISMA-AI implementations will assist stakeholders interested in using AI-related information in systematic reviews by creating a framework for reviewers that evaluates the quality of the data ...

  9. [2402.08565] Artificial Intelligence for Literature Reviews

    This manuscript presents a comprehensive review of the use of Artificial Intelligence (AI) in Systematic Literature Reviews (SLRs). A SLR is a rigorous and organised methodology that assesses and integrates previous research on a given topic. Numerous tools have been developed to assist and partially automate the SLR process. The increasing role of AI in this field shows great potential in ...

  10. How to optimize the systematic review process using AI tools

    A comprehensive literature review is a key component of any systematic review, and must be complete and thorough. ... However, the potential pitfalls of using AI tools in systematic reviews can be mitigated to a significant extent. AI-based tools that provide citations and sources, such as Elicit, ...

  11. An open source machine learning framework for efficient and ...

    It is a challenging task for any research field to screen the literature and determine what needs to be included in a systematic review in a transparent way. A new open source machine learning ...

  12. A Systematic Review and Meta-Analysis of Artificial Intelligence Tools

    As part of this investigation, a systematic literature review was conducted using the PRISMA methodology ... a PRISMA-based systematic literature review was conducted. AI has the potential to transform many industries, including medicine and healthcare. Large language models, such as ChatGPT, have attracted attention for their ability to ...

  13. Artificial Intelligence for Literature Reviews: Opportunities and

    Abstract. This manuscript presents a comprehensive review of the use of Artificial Intelligence (AI) in Systematic Literature Reviews (SLRs). A SLR is a rigorous and organised methodology that assesses and integrates prior research on a given topic. Numerous tools have been developed to assist and partially automate the SLR process.

  14. System for systematic literature review using multiple AI agents

    Systematic Literature Reviews (SLRs) have become the foundation of evidence-based studies, enabling researchers to identify, classify, and combine existing studies based on specific research questions. Conducting an SLR is largely a manual process. Over the previous years, researchers have made significant progress in automating certain phases of the SLR process, aiming to reduce the effort ...

  15. Artificial intelligence in systematic reviews: promising when

    Systematic reviews provide a structured overview of the available evidence in medical-scientific research. However, due to the increasing medical-scientific research output, it is a time-consuming task to conduct systematic reviews. To accelerate this process, artificial intelligence (AI) can be used in the review process.

  16. Systematic literature review on opportunities, challenges, and future

    Only articles relevant to AIEd were selected for this review. To identify relevant published articles, three of the authors collaboratively discussed and developed the criteria depicted in Fig. 1.Based on the previous studies (Nigam et al., 2021), the search query [("AI" OR "artificial intelligence") AND "education"] was used to include papers with these terms in the titles ...

  17. AI for Systematic Review

    Securely automate every stage of your literature review to produce evidence-based research faster, more accurately, and more transparently at scale. Rayyan: a web-tool designed to help researchers working on systematic reviews, scoping reviews and other knowledge synthesis projects, by dramatically speeding up the process of screening and ...

  18. A systematic review of artificial intelligence technologies used for

    This review adopts the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews guidelines (Tricco et al., 2018) to review the literature on using AI in story-writing. As shown in Fig. 1, this review involved four main steps: (1) identification, (2) screening, (3) eligibility, and (4) inclusion. The PRISMA criteria was adopted because it is a scientific ...

  19. Role of AI in Systematic Literature Review

    Using AI in systematic reviews is revolutionizing the process, making it more practical and sustainable. Incorporating AI into the process not only expedites the systematic literature reviews but also reduces human errors, and comes as a cost-effective systematic approach. In this article, we'll learn more about the role of AI in systematic ...

  20. AI tools in evidence synthesis

    A variety of AI tools can be used during the systematic review or evidence synthesis process. These may be used to assist with developing a search strategy; locating relevant articles or resources; or during the data screening, data extraction or synthesis stage. They can also be used to draft plain language summaries. The overall consensus is that the AI tools can be very useful in different ...

  21. Systematic Literature Review (SLR) Using AI

    Systematic Literature Review (SLR) using AI involves leveraging artificial intelligence techniques to automate and expedite the process of reviewing and synthesizing large volumes of scholarly literature. Topics: zotero, systematic-literature-reviews, literature-search, obsidian-md, ai-tools.

  22. Use of AI in Systematic Literature Reviews

    The predominant use of AI in SLRs is to help streamline the initial steps of the process by automating the search for relevant literature and screening articles based on predefined eligibility criteria. Search engines now utilise AI, especially by strengthening Retrieval-Augmented Generation (RAG) frameworks through large language models.

  23. Evaluating the effectiveness of large language models in abstract

    Amidst the evolving methodologies described in the "Existing approaches to abstract screening in systematic reviews" section, modern AI tools based on large language models (LLMs), such as ChatGPT, PaLM, Llama, and Claude, are emerging as potential game-changers. Grounded in advanced language processing capabilities, these tools can be tailored to evaluate abstracts against nuanced ... (a minimal sketch of this kind of LLM-assisted screening is given after this list)

  24. A Systematic Literature Review on AI Safety: Identifying Trends

    Artificial intelligence (AI) is revolutionizing many aspects of our lives, except it raises fundamental safety and ethical issues. In this survey paper, we review the current state of research on safe and trustworthy AI. This work provides a structured and systematic overview of AI safety. In which, we emphasize the significance of designing AI systems with safety focus, encompassing elements ...

  25. Streamlining the Selection Phase of Systematic Literature Reviews (SLRs

    The escalating volume of academic literature presents a formidable challenge in staying updated with the newest research developments. Addressing this, this study introduces a pioneering AI-based tool, configured specifically to streamline the efficiency of the article selection phase in Systematic Literature Reviews (SLRs). Utilizing the robust capabilities of OpenAI's GPT-4 Assistant API ...

  26. Ethics Guidelines for Using AI-based Algorithms in Recruiting

    systematic literature review (keyword and backwards search) on existing ethics guidelines and principles for AI and extracted aspects that seemed applicable to guide recruiting processed. Based on 28 relevant papers we derived actionable guidelines for using AI-based algorithms in recruiting processes. We

  27. Buildings

    This study encourages the use of advanced methodologies like big data, AI, and high-level architecture (HLA) federated simulations to predict disruptions and optimize resilience strategies, thus providing a robust foundation for handling future uncertainties. ... A Systematic Literature Review and a Framework for Future Research. Eur. Manag. J ...

  28. Diagnostic accuracy of artificial intelligence models in ...

    The current study aimed to systematically review the literature on the accuracy of artificial intelligence (AI) models for osteoporosis (OP) diagnosis using dental images. A thorough literature search was executed in October 2022 and updated in November 2023 across multiple databases, including PubMed, Scopus, Web of Science, and Google Scholar ...

  29. Adoption and uses of cloud computing in academic libraries: A

    This study aims to synthesise the findings of research on cloud computing adoption and use in libraries. This systematic literature review is based on Preferred Reporting Items for Systematic Revie...
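
Several entries in the list above (for example, items 23 and 25) describe using large language models to judge titles and abstracts against eligibility criteria. The following is a minimal sketch of that idea using the OpenAI Python client; the model name, prompt wording, and criteria are placeholders chosen for illustration, none of it is taken from the tools cited above, and any INCLUDE/EXCLUDE suggestion would still need confirmation by a human reviewer.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Hypothetical eligibility criteria; in a real review these come from the protocol.
CRITERIA = (
    "Include only records that (1) report a systematic review or screening study and "
    "(2) evaluate a large language model for title/abstract screening."
)

def screen_record(title: str, abstract: str) -> str:
    """Ask the model for an INCLUDE/EXCLUDE suggestion; a human reviewer makes the final call."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # model name is an assumption; any chat-capable model works
        messages=[
            {
                "role": "system",
                "content": "You screen records for a systematic review. "
                           "Reply with INCLUDE or EXCLUDE followed by one short reason.",
            },
            {
                "role": "user",
                "content": f"Criteria:\n{CRITERIA}\n\nTitle: {title}\n\nAbstract: {abstract}",
            },
        ],
    )
    return response.choices[0].message.content

print(screen_record("Example title", "Example abstract text ..."))
```

In practice such suggestions are logged alongside the reviewer's own decisions so that agreement between model and human can be measured before any automation is trusted.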