Perform (partial) double screening and check inter-reviewer agreement
*What is considered manually feasible is highly context-dependent (ie, the intended workload and/or reviewers available).
In our systematic review, broad literature searches in several scientific databases identified a first set of potentially relevant articles, yielding 8456 articles, enough to expect the AI tool to be efficient in the title and abstract screening (scenario ① was avoided, see table 1 ). This complete set of articles was subsequently uploaded into the reference manager EndNote X9 18 and the review manager Covidence, 19 where 3761 duplicate articles were removed. Given that EndNote has rather low sensitivity in identifying duplicates, additional deduplication in Covidence was considered beneficial. 20 Deduplication is standard practice in systematic reviewing, 20 but becomes even more important prior to the use of AI: because multiple decisions on the same duplicate article carry more weight than a single decision, duplicates disproportionately influence the classification and possibly the results ( table 1 , scenario ② ). In our review, the deduplicated set of articles was uploaded into the AI tool. Prior to the actual AI-supported title and abstract screening, the reviewers (SHBvD, AL and MGJB-K) trained themselves on a small selection of 74 articles. The first reviewer became familiar with the ASReview software, and all three reviewers learnt how to apply the eligibility criteria, to minimise personal influence on the article selection ( table 1 , scenario ③ ).
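To illustrate the deduplication step, a minimal, generic sketch is shown below. It assumes the combined search results have been exported to a CSV file with 'title', 'year' and 'doi' columns; the file and column names are placeholders, and this is not how EndNote, Covidence or ASReview deduplicate.

```python
import re
import pandas as pd

def normalise_title(title: str) -> str:
    """Lower-case a title and collapse punctuation/whitespace so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", str(title).lower()).strip()

records = pd.read_csv("search_results.csv")  # hypothetical export of the combined searches
records["title_key"] = records["title"].map(normalise_title)

# Prefer DOI-based matching where a DOI is present; fall back to normalised title plus year.
with_doi = records.dropna(subset=["doi"]).drop_duplicates(subset=["doi"])
without_doi = records[records["doi"].isna()]
deduplicated = (
    pd.concat([with_doi, without_doi])
    .drop_duplicates(subset=["title_key", "year"])
    .drop(columns=["title_key"])
)

print(f"{len(records) - len(deduplicated)} duplicates removed")
deduplicated.to_csv("deduplicated_for_ai_screening.csv", index=False)
```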
Defining the stopping criterion used in the screening process is left to the reviewer. 9 An optimal stopping criterion in active learning balances the cost of screening one more article (in terms of time spent) against the predictive performance (in terms of identifying a new relevant article) that could be gained by adding one more decision. 21 The optimal stopping criterion in systematic reviewing would be the moment at which screening additional articles no longer identifies more relevant articles. 22 In our review, we therefore predetermined a data-driven stopping criterion for the title and abstract screening of ‘100 consecutive irrelevant articles’, to prevent the screening from being stopped either before all relevant articles were identified or long after ( table 1 , scenario ④ ).
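Such a rule is simple to operationalise. The sketch below is a minimal illustration of a data-driven stopping criterion of this kind; it assumes the screening decisions (1 = relevant, 0 = irrelevant) are available in the order the tool presented them, and it is not the implementation used by ASReview.

```python
def stopping_criterion_reached(decisions, n_consecutive_irrelevant=100):
    """Return True once a run of `n_consecutive_irrelevant` irrelevant labels occurs."""
    streak = 0
    for label in decisions:
        streak = 0 if label == 1 else streak + 1
        if streak >= n_consecutive_irrelevant:
            return True
    return False

# Example: the final 100 decisions are all 'irrelevant', so screening would stop here.
example_decisions = [1, 0, 1] + [0] * 100
print(stopping_criterion_reached(example_decisions))  # True
```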
Because the stopping criterion was reached after 1063 of the 4695 articles, only part of the total set of articles was seen. This approach might therefore be sensitive to mistakes when articles are screened by only one reviewer, which would influence the algorithm and could result in an incomplete selection of articles ( table 1 , scenario ③ ). 23 As a remedy, second reviewers (AL, MGJB-K) checked 20% of the titles and abstracts seen by the first reviewer. This 20% had a ratio of relevant versus irrelevant articles comparable to that of all articles seen. The percentage agreement and Cohen’s Kappa (κ), a measure of inter-reviewer agreement above chance, were calculated to express the reliability of the decisions taken. 24 The reviewers agreed on 96% of the decisions and κ was 0.83. A κ of at least 0.6 is generally considered high, 24 and it was therefore assumed that the algorithm was reliably trained by the first reviewer.
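The agreement check amounts to a short calculation. The sketch below shows how the percentage agreement and Cohen’s kappa could be computed for the doubly screened subset; the label lists are invented for illustration, and scikit-learn’s cohen_kappa_score is used for the kappa arithmetic.

```python
from sklearn.metrics import cohen_kappa_score

# 1 = relevant, 0 = irrelevant, in the same article order for both reviewers (illustrative data).
reviewer_1 = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
reviewer_2 = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

agreement = sum(a == b for a, b in zip(reviewer_1, reviewer_2)) / len(reviewer_1)
kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(f"Percentage agreement: {agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```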
The reporting of the use of the AI tool should be transparent. If the choices made regarding the use of the AI tool are not entirely reported ( table 1 , scenario ⑤ ), the reader will not be able to properly assess the methodology of the review, and review results may even be graded as low-quality due to the lack of transparent reporting. The ASReview tool offers the possibility to extract a data file providing insight into all decisions made during the screening process, in contrast to various other ‘black box’ AI-reviewing tools. 9 This file will be published alongside our systematic review to provide full transparency of our AI-supported screening. This way, the screening with AI is reproducible (remedy to scenario ⑥ , table 1 ).
We experienced an efficient process of title and abstract screening in our systematic review. Although the screening started from a database of 4695 articles, the stopping criterion was reached after 1063 articles, so 23% were seen. Figure 2A shows the proportion of articles identified as relevant at any point during the AI-supported screening process. The articles were indeed prioritised by the active learning algorithm: relatively many relevant articles were found at the beginning, and this number decreased as the stopping criterion (vertical red line) was approached. Figure 2B compares the screening progress when using the AI tool versus manual screening. By the time the stopping criterion was reached, manual screening of the same number of titles and abstracts would have identified approximately 32 relevant records, compared with 142 articles labelled relevant using the AI tool. After the inter-reviewer agreement check, 142 articles proceeded to the full text reviewing phase, of which 65 were excluded because they were not articles with an original research format, and three because the full text could not be retrieved. After full text reviewing of the remaining 74 articles, 18 articles from 13 individual studies were included in our review. After snowballing, one additional article from a study already included was added.
Relevant articles identified after a certain number of titles and abstracts were screened using the AI tool compared with manual screening.
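A comparison of this kind can be reproduced from a screening export: the cumulative number of relevant articles found after n screened articles under the AI-prioritised order, against the expectation under a random (manual) screening order. The label sequence below is hypothetical and constructed only to mirror the counts reported above; it is not the actual screening data.

```python
import numpy as np

# Hypothetical screening order: 1 = labelled relevant, 0 = irrelevant (4695 articles, 142 relevant).
labels_in_ai_order = np.array([1] * 30 + [0] * 20 + [1] * 80 + [0] * 900 + [1] * 32 + [0] * 3633)
n_total = len(labels_in_ai_order)
n_relevant = labels_in_ai_order.sum()

cumulative_ai = np.cumsum(labels_in_ai_order)
screened = np.arange(1, n_total + 1)
# Under a random order, relevant articles are found roughly in proportion to articles screened.
expected_random = screened * n_relevant / n_total

n_seen = 1063  # articles screened when the stopping criterion was reached
print(f"AI-prioritised order: {cumulative_ai[n_seen - 1]} relevant articles after {n_seen} screened")
print(f"Random order (expected): {expected_random[n_seen - 1]:.0f} relevant articles")
```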
In our systematic review, the AI tool considerably reduced the number of articles to screen. Since the AI tool is available open source, many researchers may benefit from its time-saving potential in selecting articles. Several choices regarding the use of AI, however, are still left to the researcher and need consideration to prevent pitfalls. These include whether or not to use AI at all, weighing the costs against the benefits; the importance of deduplication; double screening to check inter-reviewer agreement; a data-driven stopping criterion to make optimal use of the algorithm’s predictive performance; and the quality of reporting of the AI-related methodology chosen. To our knowledge, this communication paper is the first to explain and discuss these choices in detail using an example systematic review.
The main advantage of using the AI tool is the amount of time saved. In our study, only 23% of the total number of articles were screened before the predefined stopping criterion was met. Assuming that all relevant articles were found, the AI tool saved 77% of the time for title and abstract screening. However, time must be invested to become acquainted with the tool. Whether the expected screening time saved outweighs this investment is context-dependent (eg, the researcher’s digital skills, systematic reviewing skills and topic knowledge). An additional advantage is that research questions previously unanswerable because of the insurmountable number of articles to screen in a ‘classic’ (ie, manual) review now become answerable. An example of the latter is a review screening over 60 000 articles, 25 which would probably never have been performed without AI supporting the article selection.
Since its introduction in 2021, the ASReview tool has been applied in seven published reviews. 25–31 Notably, only one 25 clearly reported the AI-related choices in the Methods and a complete and transparent flowchart of the study selection process in the Results. Two reviews reported a relatively small number (<400) of articles to screen, 26 27 in which more than 75% of the articles had to be screened before the stopping criterion was met, so the time saved was limited. Three reviews reported many initial articles (>6000) 25 28 29 and one reported 892 articles, 31 of which only 5%–10% needed to be screened; in these reviews, the AI tool saved an impressive amount of screening time. In our systematic review, 3% of the articles were labelled relevant during the title and abstract screening and eventually <1% of all initial articles were included. These percentages are low and in line with the three above-mentioned reviews (1%–2% and 0%–1%, respectively). 25 28 29 Still, relevancy and inclusion rates are much lower than in ‘classic’ systematic reviews. A study evaluating the screening process in 25 ‘classic’ systematic reviews showed that approximately 18% of articles were labelled relevant and 5% were actually included. 32 This difference probably reflects the narrower literature searches used in ‘classic’ reviews for feasibility reasons, which result in a higher proportion of included articles.
In this paper, we show how we applied the AI tool, but we did not evaluate it in terms of accuracy. This means we have to accept a certain degree of uncertainty. Despite the data-driven stopping criterion, there is a chance that relevant articles were missed, as 77% of the articles were automatically excluded. If this happened, a first possible cause is wrong decisions by the reviewer, which would have undesirably influenced the training of the algorithm and thereby the labelling of articles as (ir)relevant and the order in which they were presented; relevant articles could then have remained unseen if the stopping criterion was reached before they were presented. As a remedy, in our own systematic review, 20% of the articles screened by the first reviewer were also assessed by a second reviewer, and the inter-reviewer reliability was high. It should be noted, though, that ‘classic’ title and abstract screening is not necessarily better than using AI, as medical-scientific researchers tend to assess one in nine abstracts wrongly. 32 A second possible cause is that the AI tool did not properly rank the articles from highly relevant to irrelevant. However, given that earlier simulations demonstrated this AI tool’s accuracy, 9–11 this was not considered plausible. Since our study applied, but did not evaluate, the AI tool, we encourage future studies evaluating the performance of the tool across different scientific disciplines and contexts, since research suggests that the tool’s performance depends on the context, for example, the complexity of the research question. 33 This would not only enrich the knowledge about the AI tool, but also increase certainty about using it. Future studies should also investigate the effects of choices regarding the amount of prior knowledge provided to the tool, the number of articles defining the stopping criterion and how duplicate screening is best performed, to guide future users of the tool.
Although various researcher-in-the-loop AI tools for title and abstract screening have been developed over the years, 9 23 34 they often do not mature into usable software, 34 which impedes the permanent implementation of AI in research practice. For medical-scientific research practice, it would therefore be helpful if large systematic review institutions, such as Cochrane and PRISMA, considered ‘officially’ making AI part of systematic reviewing practice. When guidelines on the use of AI in systematic reviews are made available and widely recognised, AI-supported systematic reviews can be uniformly conducted and transparently reported. Only then can we really benefit from AI’s time-saving potential and reduce our research time waste.
Our experience with the AI tool during the title and abstract screening was positive, as it greatly accelerated the literature selection process. However, users should apply appropriate remedies to the scenarios that may threaten the methodological quality of the review. We provided an overview of these scenarios, their pitfalls and remedies, which encourage reliable use and transparent reporting of AI in systematic reviewing. To ensure that systematic reviews can continue to be conducted in the future, and given their importance for medical guidelines and practice, we consider this tool an important addition to the review process.
Contributors: SHBvD proposed the methodology and conducted the study selection. MGJB-K, CJMD and AL critically reflected on the methodology. MGJB-K and AL contributed substantially to the study selection. CCB, JvdP and CJMD contributed to the study selection. The manuscript was primarily prepared by SHBvD and critically revised by all authors. All authors read and approved the final manuscript.
Funding: The systematic review is conducted as part of the RE-SAMPLE project. RE-SAMPLE has received funding from the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 965315).
Competing interests: None declared.
Provenance and peer review: Not commissioned; externally peer reviewed.
Not applicable.
AI for systematic review
Various AI tools can be valuable throughout the systematic review or evidence synthesis process. While there is broad consensus that AI tools are useful across different review stages, it is important to understand their inherent biases and weaknesses. Ethical considerations such as copyright and intellectual property must also be kept at the forefront.
With the digital revolution of artificial intelligence (AI) in language education, the way people write and create stories has been transformed in recent years. Although recent studies have started to examine the roles of AI in literacy, there is a lack of systematic reviews examining how it has been applied and what has been achieved in story-writing. This paper reviews the literature on the use of AI in story-writing during the last 5 years. The discussion covers the year of publication, countries of implementation, educational levels, participants and research methodology. In terms of research context, most studies were carried out in universities in the United States, and children and adult learners were the two most common participant groups. Most studies involved the collection and analysis of quantitative data. The mechanisms of using AI for story-writing are then investigated in terms of the types, approaches, and roles of AI, and the pedagogies used in the learning context of AI-supported story-writing are discussed. Finally, the benefits of using AI in story-writing are pointed out. The findings show that the literature has paid most attention to learners’ creativity, writing skills, presentation skills, motivation, and satisfaction. The review also suggests that human-AI collaboration can effectively improve story creation, and some studies have trained high-level AI to help students write better stories. As findings from the current body of research are not conclusive, more work is needed to explore the challenges of using AI in story-writing. Lastly, a set of limitations and recommendations for future research is summarized in this study.
The authors declare that the data supporting the findings of this study are available within the article and its supplementary information files.
Alhussain, A. I., & Azmi, A. M. (2021). Automatic story generation: a survey of approaches. ACM Computing Surveys (CSUR) , 54 (5), 1–38.
Bai, B., Wang, J., & Zhou, H. (2021). An intervention study to improve primary school students’ self-regulated strategy use in English writing through e-learning in Hong Kong. Computer Assisted Language Learning , 1–23.
Biermann, O. C., Ma, N. F., & Yoon, D. (2022, June). From Tool to Companion: Storywriters Want AI Writers to Respect Their Personal Values and Writing Strategies. In Designing Interactive Systems Conference (pp. 1209–1227).
Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3 (2), 77–101.
Braun, V., & Clarke, V. (2012) Thematic analysis. In H. Cooper (Ed.), Handbook of research methods in psychology . (Vol. 2: Research Designs, pp. 57–71). Washington, DC: APA Books.
Cambre, J., Reig, S., Kravitz, Q., & Kulkarni, C. (2020, July). " All Rise for the AI Director" Eliciting Possible Futures of Voice Technology through Story Completion. In Proceedings of the 2020 ACM Designing Interactive Systems Conference (pp. 2051–2064).
Candello, H., Pichiliani, M., Wessel, M., Pinhanez, C., & Muller, M. (2019, November). Teaching robots to act and converse in physical spaces: participatory design fictions with museum guides. In Proceedings of the Halfway to the Future Symposium 2019 (pp. 1–4).
Chang, T. S., Li, Y., Huang, H. W., & Whitfield, B. (2021, March). Exploring EFL students' writing performance and their acceptance of AI-based automated writing feedback. In 2021 2nd International Conference on Education Development and Studies (pp. 31–35). Association for Computing Machinery.
Chen, L., Chen, P., & Lin, Z. (2020). Artificial intelligence in education: A review. IEEE Access, 8 , 75,264–75,278.
Chen, Z. H., & Liu, W. Y. (2021). A six-stage story structure approach for elementary students’ story production: Quality, interest, and attitude. Computer Assisted Language Learning, 34 (1–2), 184–207.
Cheong, Y. G., Park, W. H., & Yu, H. Y. (2018, October). A Demonstration of an Intelligent Storytelling System. In Proceedings of the 26th ACM international conference on Multimedia (pp. 1258–1259).
Chung, J. J. Y., Kim, W., Yoo, K. M., Lee, H., Adar, E., & Chang, M. (2022, April). TaleBrush: Sketching Stories with Generative Pretrained Language Models. In CHI Conference on Human Factors in Computing Systems (pp. 1–19).
Chow, P. S. (2020). Ghost in the (Hollywood) machine: Emergent applications of artificial intelligence in the film industry. NECSUS_European Journal of Media Studies , 9 (1), 193–214.
Clark, E., Ross, A. S., Tan, C., Ji, Y., & Smith, N. A. (2018, March). Creative writing with a machine in the loop: Case studies on slogans and stories. In 23rd International Conference on Intelligent User Interfaces (pp. 329–340).
Coenen, A., Davis, L., Ippolito, D., Reif, E., & Yuan, A. (2021). Wordcraft: a Human-AI Collaborative Editor for story writing. arXiv preprint arXiv:2107.07430 .
Cohen, L., Manion, L., & Morrison, K. (2002). Research methods in education . Routledge.
Creswell, J. W. (2012). Educational research: Planning. Conducting, and Evaluating, 260 , 375–382.
Crompton, H., Jones, M. V., & Burke, D. (2022). Affordances and challenges of artificial intelligence in K-12 education: a systematic review. Journal of Research on Technology in Education , 1–21.
Dahlström, H. (2019). Digital writing tools from the student perspective. Education and Information Technologies, 24 (2), 1563–1581.
Del-Moral-Pérez, M. E., Villalustre-Martínez, L., & Neira-Piñeiro, M. D. R. (2019). Teachers’ perception about the contribution of collaborative creation of digital storytelling to the communicative and digital competence in primary education schoolchildren. Computer Assisted Language Learning, 32 (4), 342–365.
Frich, J., MacDonald Vermeulen, L., Remy, C., Biskjaer, M. M., & Dalsgaard, P. (2019, May). Mapping the landscape of creativity support tools in HCI. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1–18).
Gagliano, P., Blustein, C., & Oppenheim, D. (2021). Agence, a dynamic film about (and with) artificial intelligence. In ACM SIGGRAPH 2021 Immersive Pavilion (pp. 1–2).
Gala, K., Somaiya, M., Gopani, M., & Joshi, A. (2021, September). Picture Tales: An Approach for Story Generation Using a Series of Images. In 2021 IEEE 4th International Conference on Computing, Power and Communication Technologies (GUCON) (pp. 1–5). IEEE.
Gero, K. I., Liu, V., & Chilton, L. (2022, June). Sparks: Inspiration for science writing using language models. In Designing Interactive Systems Conference (pp. 1002–1019).
Goldfarb-Tarrant, S., Feng, H., & Peng, N. (2019). Plan, write, and revise: an interactive system for open-domain story generation. arXiv preprint arXiv:1904.02357 .
Guan, J., Huang, F., Zhao, Z., Zhu, X., & Huang, M. (2020). A knowledge-enhanced pretraining model for commonsense story generation. Transactions of the Association for Computational Linguistics, 8 , 93–108.
Herrera-González, B. D., Gelbukh, A., & Calvo, H. (2020, October). Automatic Story Generation: State of the Art and Recent Trends. In Mexican International Conference on Artificial Intelligence (pp. 81–91). Springer, Cham.
Hsu, T. Y., Hsu, Y. C., & Huang, T. H. (2019, May). On how users edit computer-generated visual stories. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1–6).
Ibáñez, M. B., & Delgado-Kloos, C. (2018). Augmented reality for STEM learning: A systematic review. Computers & Education, 123 , 109–123.
Issa, L., & Jusoh, S. (2019, October). Applying ontology in computational creativity approach for generating a story. In 2019 2nd International Conference on new Trends in Computing Sciences (ICTCS) (pp. 1–6). IEEE.
Karlimah, K., Hamdu, G., Pratiwi, V., Herdiansah, H., & Kurniawan, D. (2021, July). The development of motion comic storyboard based on digital literacy and elementary school mathematics ability in the new normal era during covid-19 pandemic. In Journal of Physics: Conference Series (Vol. 1987, No. 1, p. 012026). IOP Publishing.
Keller, J. M. (1984). The use of the ARCS model of motivation in teacher training. In K.S.A.J. Trott (Ed.), Aspects of educational technology volume XVII: Staff development and career updating . Kogan Page.
Keller, J. M. (1987). Development and use of the ARCS model of instructional design. Journal of Instructional Development, 10 (3), 2–10.
Keller, J. M. (2009). Motivational design for learning and performance: The ARCS model approach . Springer Science & Business Media.
Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., & Socher, R. (2019). Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858 .
Kılıçkaya, F. (2020). Learners’ perceptions of collaborative digital graphic writing based on semantic mapping. Computer Assisted Language Learning, 33 (1–2), 58–84.
Klimashevskaia, A., Gadgil, R., Gerrity, T., Khosmood, F., Gütl, C., & Howe, P. (2021, November). Automatic News Article Generation from Legislative Proceedings: A Phenom-Based Approach. In International Conference on Statistical Language and Speech Processing (pp. 15–26). Springer, Cham.
Lee, M., Liang, P., & Yang, Q. (2022, April). Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities. In CHI Conference on Human Factors in Computing Systems (pp. 1–19).
Li, X., & Zhang, B. (2020, October). AI poem case analysis: Take ancient Chinese poems as an example. In Proceedings of the 2020 Conference on Artificial Intelligence and Healthcare (pp. 132–136).
Lin, P. Y., Chai, C. S., Jong, M. S. Y., Dai, Y., Guo, Y., & Qin, J. (2021). Modeling the structural relationship among primary students’ motivation to learn artificial intelligence. Computers and Education: Artificial Intelligence, 2 , 100006.
Lin, J. W., & Chang, R. G. (2022). Chinese story generation of sentence format control based on multi-channel word embedding and novel data format. Soft Computing, 26 (5), 2179–2196.
Liu, C., Hou, J., Tu, Y. F., Wang, Y., & Hwang, G. J. (2021). Incorporating a reflective thinking promoting mechanism into artificial intelligence-supported English writing environments. Interactive Learning Environments , 1–19.
Min, K., Dang, M., & Moon, H. (2021). Deep learning-based short story generation for an image using the encoder-decoder structure. IEEE Access, 9 , 113,550–113,557.
Ng, D. T. K., & Chu, S. K. W. (2021). Motivating students to learn STEM via engaging flight simulation activities. Journal of Science Education and Technology, 30 (5), 608–629.
Ng, D. T. K., Leung, J. K. L., Chu, S. K. W., & Qiao, M. S. (2021). Conceptualizing AI literacy: An exploratory review. Computers and Education: Artificial Intelligence, 2 , 100041.
Ng, D. T. K., Luo, W., Chan, H. M. Y., & Chu, S. K. W. (2022). Using digital story writing as a pedagogy to develop AI literacy among primary students. Computers and Education: Artificial Intelligence, 3 , 100054.
Nichols, E., Gao, L., Vasylkiv, Y., & Gomez, R. (2021). Design and Analysis of a Collaborative Story Generation Game for Social Robots. Frontiers in Computer Science , 74.
Noceti, N., Odone, F., Marsella, A., Moro, M., & Nicora, E. (2020, July). Tangible Coding for kids with AI inside. In Adjunct Publication of the 28th ACM Conference on User Modeling, Adaptation and Personalization (pp. 163–166).
Ochieng, P. A. (2009). An analysis of the strengths and limitation of qualitative and quantitative research paradigms. Problems of Education in the 21st Century , 13 , 13.
Osone, H., Lu, J. L., & Ochiai, Y. (2021, May). BunCho: AI Supported Story Co-Creation via Unsupervised Multitask Learning to Increase Writers’ Creativity in Japanese. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1–10).
Ouyang, F., Zheng, L., & Jiao, P. (2022). Artificial intelligence in online higher education: A systematic review of empirical research from 2011 to 2020. Education and Information Technologies , 1–33.
Park, W., & Park, K. (2018, February). Story creation and design algorithm in unity. In 2018 20th International Conference on Advanced Communication Technology (ICACT) (pp. 444–447). IEEE
Peng, N., Ghazvininejad, M., May, J., & Knight, K. (2018, June). Towards controllable story generation. In Proceedings of the First Workshop on Storytelling (pp. 43–49).
Petticrew, M., & Roberts, H. (2008). Systematic reviews in the social sciences: A practical guide . John Wiley & Sons.
Refat, N., Rahman, M. A., Asyhari, A. T., Kurniawan, I. F., Bhuiyan, M. Z. A., & Kassim, H. (2019). Interactive learning experience-driven smart communications networks for cognitive load management in grammar learning context. IEEE Access , 7 , 64,545–64,557.
Roemmele, M., & Gordon, A. S. (2018, March). Automated assistance for creative writing with an rnn language model. In Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion (pp. 1–2).
Shakeri, H., Neustaedter, C., & DiPaola, S. (2021, October). SAGA: Collaborative Storytelling with GPT-3. In Companion Publication of the 2021 Conference on Computer Supported Cooperative Work and Social Computing (pp. 163–166).
Su, J., & Yang, W. (2022). Artificial intelligence in early childhood education: A scoping review. Computers and Education: Artificial Intelligence , 100049.
Suh, S., & An, P. (2022, March). Leveraging Generative Conversational AI to Develop a Creative Learning Environment for Computational Thinking. In 27th International Conference on Intelligent User Interfaces (pp. 73–76).
Takacs, Z. K., Swart, E. K., & Bus, A. G. (2015). Benefits and pitfalls of multimedia and interactive features in technology-enhanced storybooks: A meta-analysis. Review of Educational Research, 85 (4), 698–739.
Tanrıkulu, F. (2022). Students’ perceptions about the effects of collaborative digital storytelling on writing skills. Computer Assisted Language Learning, 35 (5–6), 1090–1105.
Transformer Jr, G. P., Note, E. X., Spellchecker, M. S., & Yampolskiy, R. (2020). When Should Co-Authorship Be Given to AI? PhilArchive . https://philarchive.org/archive/GPTWSCv1
Tricco, A. C., Lillie, E., Zarin, W., O'Brien, K. K., Colquhoun, H., Levac, D., ... & Straus, S. E. (2018). PRISMA extension for scoping reviews (PRISMA-ScR): checklist and explanation. Annals of internal medicine , 169 (7), 467–473.
Tsou, W., & Tsai, S. C. (2022). Interactive learning for professional development of bilingual education by a blended instructional approach. Interactive Learning Environments , 1–13.
Valls-Vargas, J., Zhu, J., & Ontanón, S. (2014, September). Toward automatic role identification in unannotated folk tales. In Tenth Artificial Intelligence and Interactive Digital Entertainment Conference .
Wang, Y. (2021, May). The Application of Artificial Intelligence in Chinese News Media. In 2021 2nd International Conference on Artificial Intelligence and Information Systems (pp. 1–4).
Watcharapunyawong, S., & Usaha, S. (2013). Thai EFL Students’ Writing Errors in Different Text Types: The Interference of the First Language. English Language Teaching, 6 (1), 67–78.
Wicke, P., & Veale, T. (2021, March). Are You Not Entertained? Computational Storytelling With Non-verbal Interaction. In Companion of the 2021 ACM/IEEE International Conference on Human-Robot Interaction (pp. 200–204).
Woo, D. J., Wang, Y., & Susanto, H. (2022). Student-AI Creative Writing: Pedagogical Strategies for Applying Natural Language Generation in Schools. EdArXiv. June , 3 .
Wu, J., & Chen, D. T. V. (2020). A systematic review of educational digital storytelling. Computers & Education, 147 , 103786.
Xu, P., Patwary, M., Shoeybi, M., Puri, R., Fung, P., Anandkumar, A., & Catanzaro, B. (2020). MEGATRON-CNTRL: Controllable story generation with external knowledge using large-scale language models. arXiv preprint arXiv:2010.00840 .
Xu, Z., Banerjee, M., Ramirez, G., Zhu, G., & Wijekumar, K. (2019). The effectiveness of educational technology applications on adult English language learners’ writing quality: A meta-analysis. Computer Assisted Language Learning, 32 (1–2), 132–162.
Young, R. M., Ware, S. G., Cassell, B. A., & Robertson, J. (2013). Plans and planning in narrative generation: A review of plan-based approaches to the generation of story, discourse and interactivity in narratives. Sprache Und Datenverarbeitung, Special Issue on Formal and Computational Models of Narrative, 37 (1–2), 41–64.
Yu, M. (2021). The Dilemmas and Reform of Translation Education in the Age of Artificial Intelligence. In 2021 2nd International Conference on Artificial Intelligence and Education (ICAIE) (pp. 40–44). IEEE.
Yuan, A., Coenen, A., Reif, E., & Ippolito, D. (2022, March). Wordcraft: story writing With Large Language Models. In 27th International Conference on Intelligent User Interfaces (pp. 841–852).
Zainuddin, Z., Chu, S. K. W., Shujahat, M., & Perera, C. J. (2020). The impact of gamification on learning and instruction: A systematic review of empirical evidence. Educational Research Review, 30 , 100326.
Zhang, M. (2020, October). Application of Artificial Intelligence Interactive storytelling in Animated. In 2020 International Conference on Control, Robotics and Intelligent System (pp. 37–41).
Zhang, C., Yao, C., Liu, J., Zhou, Z., Zhang, W., Liu, L., ... & Wang, G. (2021, May). StoryDrawer: A Co-Creative Agent Supporting Children's Storytelling through Collaborative Drawing. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1–6).
Zhang, C., Yao, C., Wu, J., Lin, W., Liu, L., Yan, G., & Ying, F. (2022, April). StoryDrawer: A Child–AI Collaborative Drawing System to Support Children's Creative Visual Storytelling. In CHI Conference on Human Factors in Computing Systems (pp. 1–15).
Authors and affiliations.
Department of Curriculum and Instruction, The Education University of Hong Kong, Hong Kong, China
Xiaoxuan Fang
Faculty of Education, The University of Hong Kong, Hong Kong, China
Davy Tsz Kit Ng & Samuel Kai Wah Chu
Division of Integrative Systems and Design, The Hong Kong University of Science and Technology, Hong Kong, China
Jac Ka Lok Leung
Correspondence to Xiaoxuan Fang .
Conflict of interest.
No potential conflict of interest was reported by the author.
Publisher's note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
SN | Author(s) and year of publication | Type of publication | Country | Educational level | Participant/Research object | Methodology | Type of AI technologies | Approach of AI technologies | Role of AI technologies | Pedagogy | Benefit |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Biermann et al. ( ) | Empirical study | Canada | Adult education | 20 adults (7 hobbyists and 13 professional writers) | Qualitative | AI-writers | Modern large-scale language models | Story collaborator | Human-AI collaborative writing | Help writers have good productivity and complete challenging writing tasks |
2 | Cambre et al. ( ) | Empirical study | USA | Adult education | 149 adults | Mixed methods | Voice assistant | No mention | No mention | No mention | No mention |
3 | Cheong et al. ( ) | Descriptive study | Korea | Not applicable | Not applicable | Not applicable | AI planner | The unity game engine | Story animator | Not applicable | Generate the story with humans and then visualize it as a 3D animation Encourage the user to interact with the story by manipulating props or characters |
4 | Chung et al. ( ) | Empirical study | USA | Adult education | 14 adults (7 female and 7 male) | Qualitative | TaleBrush | GPT-Neo | Story generator | Human-AI collaborative writing | Collaborate with writers to generate stories based on their intentions about characters’ fortune Maintain the novelty of generated sentences Inspire writers’ ideas when the AI application generates stories |
5 | Clark et al. ( ) | Empirical study | USA | Adult education | 36 adults | Mixed methods | Machine in the loop systems | A neural language model | Story co-creator | Human-AI collaborative writing | Generate suggestions based on writers’ story context |
6 | Coenen et al. ( ) | Descriptive study | USA | Not applicable | Not applicable | Not applicable | Wordcraft (an AI-assisted editor) | Neural language generation | Story collaborator | Not applicable | Collaborate with writers to complete a story |
7 | Gala et al. ( ) | Empirical study | India | Adult education | No mention | Quantitative | An encoder-decoder architecture | Recurrent neural networks | Story generator | No mention | Generate grammatically correct and sensible stories automatically based on images |
8 | Goldfarb-Tarrant et al. ( ) | Empirical study | USA | Adult education | 30 workers | Quantitative | A neural narrative generation system | A conditional language models implemented with LSTMs | Story co-creator | Human-AI collaborative writing | Interact with humans to generate stories |
9 | Hsu et al. ( ) | Empirical study | USA | Adult education | 197 workers | Quantitative | State-of-the-art visual storytelling models | Human-centered computer (supported storytelling system) | Story collaborator | Human-AI collaborative writing | Assist writers in generating machine-generated visual story based on the visual storytelling model using the VIST dataset (with photos) Collaborate with users to generate good quality and understandable stories by reducing word redundancy and increasing lexical diversity |
10 | Issa and Jusoh ( ) | Descriptive study | Jordan | Not applicable | Not applicable | Not applicable | Markov chain model (a model that hires statistics in determining a sequence of elements according to certain rules or history) | Natural language generation | Story generator | Not applicable | Generate educational stories automatically based on users setting characters |
11 | Karlimah et al. ( ) | Empirical study | Indonesia | Primary education | 25 primary students | Quantitative | Motion comic storyboard | Drawing applications, animation applications, and project export applications | Storyteller | Technology-mediated story creation | Improve primary students’ math skills by developing a motion comic prototype about fractions |
12 | Lee et al. ( ) | Empirical study | USA | Adult education | 63 writers | Quantitative | CoAuthor | GPT-3 | Story collaborator | Human-AI collaborative writing | Collaborate with writers to generate stories Enhance writers’ story language, ideation, and collaboration capabilities |
13 | Min et al. ( ) | Descriptive study | Korea | Not applicable | Stories from two datasets | Not applicable | Visual story writer model | Recurrent neural network structure and encoder-decoder model | Story generator | Not applicable | Generate several captions describing story contents based on the input images. These captions are then used to create a sequence of sentences to make a short story as the output |
14 | Nichols et al. ( ) | Empirical study | Canada | Higher education | 122 adults (workers and university students) | Mixed methods | AI agent | A large-scale neural language model | Story collaborator | Human-AI collaborative writing | Assist writers in generating sufficiently human-like utterances and propose a sample-and-rank approach to improve utterance quality |
15 | Noceti et al. ( ) | Empirical study | Italy | Preschool and primary education | Over 1000 participants (parents and children aged 5–8) | Qualitative | Triangle coding | Triangle language | Story generator | Human-AI collaborative writing | Help an interpretation of the shapes sequence and the generation of a fantasy sentence or a small story |
16 | Osone et al. ( ) | Empirical study | Japan | Adult education | 48 adults (16 writers and 32 readers) | Quantitative | BunCho (an AI supported story co-creation system) | GPT-2 | Story co-creator | Human-AI collaborative writing | Assist Japanese novelists in creating high-level and creative writing, enhance affective-enjoyed writing synopses (creativity, interestingness, comprehensibility, grammatical correctness, consistency of sentences), improve common metrics-creativity, and broadened their stories |
17 | Park and Park ( ) | Descriptive study | Korea | Not applicable | Not applicable | Not applicable | Intelligent narrative story creation systems | Unity game engine | Story co-creator | Not applicable | Enhance the satisfaction of users through VR in storytelling activities |
18 | Peng et al. ( ) | Empirical study | USA | No mention | 98,162 stories | Quantitative | An analyze-to-generate story framework | A conditional language model | Story generator | No mention | Generate stories based on control factors extracted from existing stories to reflect a user’s intent Provide a good interaction environment for users |
19 | Roemmele et al. ( ) | Empirical study | USA | Adult education | 139 adults | Quantitative | Creative Help | Recurrent neural network language model (RNN LM) | Story collaborator | Human-AI collaborative writing | Assist writers in generating more grammatical and coherent sentences, writing the story easier, influenced its content more, and were more helpful overall. The authors made significantly fewer changes to the sentence suggestions |
20 | Shakeri et al. ( ) | Empirical study | Canada | Adult education | 2 adults | Qualitative | SAGA (an asynchronous collaborative storytelling system) | GPT-3 | Story collaborator | Human-AI collaborative writing | Collaborate with users to generate stories based on their prompt in terms of the basis of the story, and can include things like the setting, genre of the story, and even descriptions of the characters |
21 | Suh and An ( ) | Descriptive study | Canada | Not applicable | Not applicable | Not applicable | CodeToon (generative conversational AI) | GPT-3 | Story co-creator | Not applicable | Assist students’ learning, creative, and sensemaking process in a visual programming environment where users can create comics from code Encourage out-of-the-box ideas and motivate users to participate actively in this co-creative process |
22 | Wicke and Veale ( ) | Descriptive study | Ireland | Adult education | 2 robots and 1 human | Not applicable | A multi-modal storytelling system | Scéalability storytelling framework | Story generator | Not applicable | Collaborate with robots to develop stories based on users’ emotions and gestures |
23 | Xu et al. ( ) | Empirical study | China | No mention | 98,161 stories | Mixed methods | MEGATRON-CNTRL (a novel framework) | Large-scale language models | Story generator | No mention | Assist writers in generating more fluent, consistent, and coherent stories with less repetition and higher diversity |
24 | Yuan et al. ( ) | Empirical study | USA | Adult education | 25 hobbyist writers | Mixed methods | Wordcraft (an AI-assisted editor) | A generative language model | Story co-creator | Human-AI collaborative writing | Respond to writers’ custom requests via open-ended conversation and express them in natural language Generate suggestions for writers in the creative process |
25 | Zhang ( ) | Descriptive study | China | Not applicable | Not applicable | Not applicable | An application of AI Interactive storytelling in Animation | UnrealTM game engine | Story animator | Not applicable | Use 3D animation system to present storyline setting by characteristics and scenario development |
26 | Zhang et al. ( ) | Empirical study | China | Preschool and primary education | 10 children aged 5–10 | Mixed methods | StoryDrawer | A co-creative agent | Story co-creator | Human-AI collaborative writing | Assist children’s oral and drawing skills |
27 | Zhang et al. ( ) | Empirical study | China | Preschool and primary education | 24 participants (12 parents and 12 Children aged 6–10) | Qualitative | StoryDrawer | A context-based voice agent and two AI-driven collaborative strategies | Story collaborator | Human-AI collaborative writing | Inspire participants’ creative, elaborate ideas, and contribute to their creative outcomes during an engaging visual storytelling experience |
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Fang, X., Ng, D.T.K., Leung, J.K.L. et al. A systematic review of artificial intelligence technologies used for story writing. Educ Inf Technol 28 , 14361–14397 (2023). https://doi.org/10.1007/s10639-023-11741-5
Received : 27 December 2022
Accepted : 20 March 2023
Published : 05 April 2023
Issue Date : November 2023
DOI : https://doi.org/10.1007/s10639-023-11741-5
Systematic literature reviews help researchers stay up to date by synthesizing the highest level of evidence for a research question. However, the end-to-end process is highly resource-intensive and time-consuming.

It typically requires a year or more, since it is a long-term undertaking with multiple stages, from protocol registration to submission for publication.

With the ever-expanding growth of research publications, conducting a manual systematic literature review is becoming difficult and impractical.

With the help of artificial intelligence, that is, AI software designed to automate parts of the systematic review, researchers can expedite the whole search and screening process, including for rapid reviews.

Using AI in systematic reviews is transforming the process, making it more practical and sustainable.

Incorporating AI not only expedites systematic literature reviews but also reduces human error and lowers costs.

In this article, we'll look at the role of AI in systematic reviews and the different types of AI software that assist in automating the systematic review process.
A systematic literature review, or systematic review, is an in-depth analysis of existing studies conducted to answer a specific research question. It generates an evidence-based synthesis by employing a thorough search strategy that identifies, screens, and analyzes all the relevant publications (particularly in healthcare research), including both published and unpublished scientific literature.

Earlier, systematic reviews were mostly done manually. Researchers sifted through a vast number of research articles, chose the pertinent ones based on predetermined criteria, extracted data, and then synthesized the results. This laborious method required a lot of time and resources and was also subject to bias.

However, manual systematic reviews have become increasingly impractical due to the upsurge in the number of research articles. This is where AI comes into the picture, providing a scalable way to accelerate the entire process and manage enormous amounts of data more effectively.
Let's look at how AI automates systematic literature review tasks.

1. Automates the systematic review process
Natural Language Processing (NLP) algorithms used in AI are adept at scouring through large databases to find relevant studies for systematic reviews. These algorithms can recognize keyword trends, comprehend context, and exclude irrelevant papers to ensure that only significant scientific literature is picked for evaluation.
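As a rough illustration of this kind of relevance screening, the sketch below ranks candidate abstracts by TF-IDF similarity to the review question. It is a generic example using scikit-learn, not the algorithm of any particular screening tool, and the topic and abstracts are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

research_topic = "artificial intelligence for title and abstract screening in systematic reviews"
abstracts = [
    "We evaluate a machine learning tool for prioritising abstracts in systematic reviews.",
    "A randomised trial of a new antihypertensive drug in older adults.",
    "Active learning reduces screening workload in evidence synthesis projects.",
]

# Vectorise the topic and the abstracts together, then score each abstract against the topic.
vectoriser = TfidfVectorizer(stop_words="english")
matrix = vectoriser.fit_transform([research_topic] + abstracts)
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()

for score, abstract in sorted(zip(scores, abstracts), reverse=True):
    print(f"{score:.2f}  {abstract}")
```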
Machine learning models are trained to extract key findings from a set of scientific articles automatically. With the help of AI, data extraction becomes more reliable, complete, and consistent by identifying patterns in the presentation of data across different articles.
Research bias can distort findings and lead to misleading conclusions. AI algorithms are now being used to identify biases in individual studies, such as publication bias or selection bias in qualitative research. By detecting these biases, reviewers can account for them in the synthesis, in which the results of multiple studies are combined to draw more robust and reliable conclusions.
Systematic reviews can easily become obsolete because of the dynamic and ever-growing nature of research. With the help of AI tools, however, databases can be searched continuously for the latest research to support evidence synthesis and meta-analyses of the existing literature. When relevant papers are published, they can be incorporated into the current review to keep it up to date.
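A minimal sketch of such an update check is shown below. It assumes PubMed is queried through the NCBI E-utilities esearch endpoint; the query string and date window are placeholders, and production use should respect NCBI rate limits and supply an API key.

```python
import requests

def new_pubmed_records(query: str, days_back: int = 30) -> list[str]:
    """Return PMIDs of records added to PubMed for `query` in the last `days_back` days."""
    response = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
        params={
            "db": "pubmed",
            "term": query,          # placeholder search string
            "reldate": days_back,   # restrict to recently added records
            "datetype": "edat",
            "retmax": 200,
            "retmode": "json",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["esearchresult"]["idlist"]

print(new_pubmed_records("systematic review AND artificial intelligence"))
```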
Here’s a thorough overview of the best AI software for systematic literature reviews.
With SciSpace Literature Review software, you can save time by automating the literature review process. It helps you find pertinent studies and extract the most relevant information required for evidence-based synthesis. You enter a keyword or topic, and the tool retrieves the papers relevant to your research, reducing the intensive task of sifting through papers manually by automating much of the process with AI.
Refer to this detailed blog that helps you understand the SciSpace Literature review workspace thoroughly.
DistillerSR is a prominent systematic literature review software package. Its AI-powered capabilities streamline the identification, extraction, and management of research, and its collaboration and built-in project management features make it easier for researchers to speed up the literature review process.
The key characteristic of EPPI-Reviewer is its sophisticated text and data mining capabilities. The platform effectively handles both qualitative and quantitative data by incorporating machine learning algorithms, ensuring a thorough review and quality assessment throughout.
Rayyan is a renowned literature review software for systematic review. It’s known for its user-friendly design and powerful AI-driven capabilities. Its design and layout ensure that even those unfamiliar with systematic review methods can easily follow the procedure.
Covidence stands out because it claims to reduce the time required to perform systematic reviews by 30%. It also places strong emphasis on user experience: beyond its AI capabilities, its design ensures smooth navigation through the systematic review process.
The advent of AI-powered tools in systematic literature reviews has myriad benefits, and below are a few of the prominent ones:
Efficiency: These AI-powered tools' automated features significantly cut down on the time required to complete a review.
Accuracy: AI algorithms reduce human error in data extraction and statistical analysis.

Comprehensiveness: By scanning enormous amounts of data, these tools help ensure that no vital study is overlooked.
Collaboration: The majority of AI tools come with built-in project management functions that support team collaboration, providing uniform and consistent outcomes.
Despite the massive benefits of AI, careful consideration must be given to its usage. The data used to train AI algorithms determines their quality. It is essential to make sure that these algorithms are trained on several high-quality datasets in order to prevent biases or mistakes.
In addition, even though AI can speed up the systematic review process, human oversight remains essential. Researchers must validate AI outputs to ensure they are accurate and relevant.
Artificial Intelligence is redefining the way we conduct systematic reviews. It has the potential to completely transform the way we synthesize research findings by automating labor-intensive activities, identifying biases, and constantly updating reviews and analyses.
While hindrances seen in traditional narrative reviews may still persist, the fusion of human expertise and AI capabilities can usher in a new era of swift, precise, and meaningful systematic reviews.
A variety of AI tools can be used during the systematic review or evidence synthesis process. These may be used to assist with developing a search strategy; locating relevant articles or resources; or during the data screening, data extraction or synthesis stage. They can also be used to draft plain language summaries.
The overall consensus is that AI tools can be very useful at different stages of a systematic or other evidence review, but it is important to fully understand any bias and weakness they may bring to the process. In many cases, new AI tools that previous research has not assessed rigorously should be used in conjunction with existing validated methods. It is also essential to consider ethical, copyright and intellectual property issues, for example, if the process involves uploading data or the full text of articles to an AI tool.
Below are some recently published articles on the topic:
Khalil H, Ameen D, Zarnegar A. Tools to support the automation of systematic reviews: a scoping review . J Clin Epidemiol 2022; 144: 22-42 https://www.jclinepi.com/article/S0895-4356(21)00402-9/fulltext "The current scoping review identified that LitSuggest, Rayyan, Abstractr, BIBOT, R software, RobotAnalyst, DistillerSR, ExaCT and NetMetaXL have potential to be used for the automation of systematic reviews. However, they are not without limitations. The review also identified other studies that employed algorithms that have not yet been developed into user friendly tools. Some of these algorithms showed high validity and reliability but their use is conditional on user knowledge of computer science and algorithms."
Khraisha Q, Put S, Kappenberg J, Warraitch A, Hadfield K. Can large language models replace humans in systematic reviews? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages . Res Syn Meth . 2024; 1-11. doi: 10.1002/jrsm.1715 "Although our findings indicate that, currently, substantial caution should be exercised if LLMs are being used to conduct systematic reviews, they also offer preliminary evidence that, for certain review tasks delivered under specific conditions, LLMs can rival human performance."
Mahuli, S., Rai, A., Mahuli, A. et al. Application ChatGPT in conducting systematic reviews and meta-analyses . Br Dent J 235, 90–92 (2023). https://doi.org/10.1038/s41415-023-6132-y Explores using ChatGPT for conducting Risk of Bias analysis and data extraction from a randomised controlled trial.
Ovelman, C., Kugley, S., Gartlehner, G., & Viswanathan, M. (2024). The use of a large language model to create plain language summaries of evidence reviews in healthcare: A feasibility study . Cochrane Evidence Synthesis and Methods, 2(2), e12041. https://onlinelibrary.wiley.com/doi/abs/10.1002/cesm.12041
Qureshi, R., Shaughnessy, D., Gill, K.A.R. et al. Are ChatGPT and large language models “the answer” to bringing us closer to systematic review automation? . Syst Rev 12 , 72 (2023). https://doi.org/10.1186/s13643-023-02243-z "Our experience from exploring the responses of ChatGPT suggest that while ChatGPT and LLMs show some promise for aiding in SR-related tasks, the technology is in its infancy and needs much development for such applications. Furthermore, we advise that great caution should be taken by non-content experts in using these tools due to much of the output appearing, at a high level, to be valid, while much is erroneous and in need of active vetting."
van Dijk SHB, Brusse-Keizer MGJ, Bucsán CC, et al. Artificial intelligence in systematic reviews: promising when appropriately used . BMJ Open 2023;13:e072254. doi: 10.1136/bmjopen-2023-072254 Suggests how to conduct a transparent and reliable systematic review using the AI tool ‘ASReview’ in the title and abstract screening.
June 2023 webinar including a panel discussion exploring the use of machine learning AI in Covidence (screening & data extraction tool).
Khalil H, Ameen D, Zarnegar A. Tools to support the automation of systematic reviews: a scoping review. J Clin Epidemiol. 2022 Apr;144:22-42. doi: 10.1016/j.jclinepi.2021.12.005. Epub 2021 Dec 8. PMID: 34896236. https://www.sciencedirect.com/science/article/pii/S0895435621004029?ref=pdf_download&fr=RR-2&rr=821cfdcf2d377762#tbl0004 [accessed 06-11-23].
Summary of validated tools available for each stage of the review
Recording of 1 hour webinar exploring Artificial Intelligence (AI) and its potential impact on the process of systematic reviews (August 15th, 2023). Note PICO Portal is a systematic review platform that leverages artificial intelligence to accelerate research and innovation.
Moderator: Dr Greg Martin. Presenters: Eitan Agai - PICO Portal Founder & AI Expert; Riaz Qureshi - U. of Colorado Anschutz Medical Campus; Kevin Kallmes - Chief Executive Officer, Cofounder; Jeff Johnson - Chief Design Officer.
Systematic Literature Review (SLR) using AI involves leveraging artificial intelligence techniques to automate and expedite the process of reviewing and synthesizing large volumes of scholarly literature.
Session | Topic | Date
---|---|---
1a (basic) | | 23 and 24 Jan 2024
1b (advance) | | 4 and 5 Feb 2024
2 | | 18 Feb 2024 and 3 Mar 2024
3 | | 5 Mar 2024
4a | | 19 Mar 2024
4b | | 26 Mar 2024
4c | | 2 April 2024
Please create an Issue for any improvements, suggestions or errors in the content.
You can also contact me on LinkedIn for any other queries or feedback.
Systematic literature reviews (SLRs) generate comprehensive, evidence-based outputs in response to specific research questions using a structured and transparent approach. In clinical research, SLRs can support the holistic evaluation of the efficacy, safety, and cost-effectiveness of a medicinal product by combining findings from multiple clinical studies. These reviews provide clinicians, regulators, policy makers, economists, epidemiologists, and health technology assessment (HTA) bodies with more meaningful information than typical review articles.
An SLR starts with creating a search strategy which is free of selection bias. This strategy is used to find relevant research articles for review and analysis. However, in this digital era, where enormous amounts of data are just a click away, conducting an SLR can be a daunting and laborious task. Literature searches can retrieve thousands of publications and therefore manually reviewing and analysing these can be time-consuming. By the time an SLR is published, the results may be outdated and out of context.
Artificial intelligence (AI) can significantly improve the efficiency and accuracy of SLRs. Typically, an SLR proceeds through a series of steps: defining the research question and search strategy, searching the literature, screening the results, extracting data, assessing study quality, and analysing and reporting the findings.
AI can support some of these steps very efficiently, as detailed below:
Automated Search and Screening
The predominant use of AI in SLRs is to help streamline the initial steps of the process by automating the search for relevant literature and screening articles against predefined eligibility criteria. Search engines increasingly use AI, notably Retrieval-Augmented Generation (RAG) frameworks built on large language models. These can support the preparation of complex search queries that surpass the capabilities of a conventional keyword-based literature search.
Typically, machine learning classifiers are used to find the most pertinent articles. These classifiers are trained on an initial set of user-selected papers and then iteratively retrained as new screening decisions are made, improving their ability to identify further relevant literature.
Automated tools take an article’s title, abstract, or full text as input, and Natural Language Processing (NLP) algorithms analyse titles, abstracts, and keywords to determine their relevance to the research topic. These techniques can also incorporate a statistical selection process that detects the key terms characterising each cluster of articles and scores each citation by the number of keywords present, enabling the screener to make relevance decisions more efficiently. The resulting clusters emphasise the most representative terms and thereby better support the inclusion or exclusion of a publication from the analysis.
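To make this concrete, the sketch below shows one common way such a screening classifier can be built: titles and abstracts already labelled by a reviewer train a TF-IDF plus logistic-regression model, which then ranks the unscreened records by predicted relevance. The choice of scikit-learn and the example texts are illustrative assumptions, not a description of any specific commercial tool.

```python
# Minimal sketch of ML-assisted screening: train on a handful of human
# include/exclude decisions, then rank unscreened abstracts by predicted relevance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labelled_texts = [
    "effect of exercise on depression in older adults: a randomised trial",
    "maintenance scheduling optimisation in manufacturing plants",
]
labelled_decisions = [1, 0]  # 1 = include, 0 = exclude
unscreened_texts = [
    "cognitive behavioural therapy for depression: a randomised controlled trial",
    "supply chain optimisation using metaheuristics",
]

vectorizer = TfidfVectorizer(stop_words="english")
X_labelled = vectorizer.fit_transform(labelled_texts)
X_unscreened = vectorizer.transform(unscreened_texts)

model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_labelled, labelled_decisions)

# Present the records most likely to be relevant to the screener first.
scores = model.predict_proba(X_unscreened)[:, 1]
for score, text in sorted(zip(scores, unscreened_texts), reverse=True):
    print(f"{score:.2f}  {text}")
```

In practice the training set is refreshed as the reviewer keeps screening, which is the iterative retraining described above.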
Data Extraction and Evidence Generation
In health research, researchers apply various protocols for literature review depending on the type of report to be generated. These include PICO (population, intervention, comparison, outcome), PCC (population, context, concept), PICODR (PICO plus duration and results), and PIBOSO (population, intervention, background, outcome, study design, and other).
Using AI, information can be extracted into a predefined set of fields for interventional, diagnostic, or prognostic SLRs. For instance, NLP algorithms can extract key information such as study methodologies, findings, and statistical data, which can then be synthesised and analysed to draw meaningful conclusions. Knowledge graphs can be produced that improve research paper classification and characterisation. AI technologies use a domain ontology to organise the data, which offers a formal description of the variable types and their associations with each other.
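As a hedged illustration of LLM-assisted extraction (not the workflow of any particular product), the snippet below asks a chat model to return the PICO elements of an abstract as JSON using the openai Python package; the model name, prompt wording, and the assumption that the reply is bare JSON are illustrative choices.

```python
# Illustrative sketch: extract PICO elements from an abstract with an LLM.
# Model name and prompt are assumptions; the reply is assumed to be bare JSON.
import json
from openai import OpenAI  # assumes the openai v1.x Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_pico(abstract: str) -> dict:
    prompt = (
        "Extract the PICO elements (population, intervention, comparison, outcome) "
        "from the abstract below. Respond only with a JSON object using exactly "
        "those four keys; use null when an element is not reported.\n\n"
        "Abstract:\n" + abstract
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        temperature=0,        # deterministic output suits extraction tasks
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```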
Quality Assessment
It is important to minimise selection bias and maximise the external and internal validity of the publications selected for an SLR. To do this, systematic errors should be reduced to the largest possible extent.
A quality assessment provides evidence of the overall strength and reliability of an SLR. AI can help assess the quality of included studies by analysing factors such as study design, sample size, and methodology.
Machine learning algorithms can be trained on existing datasets to recognise patterns indicative of high-quality research, helping researchers to evaluate the reliability of the evidence more efficiently. Most validated checklists, such as PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses), recommend evaluation of bias in the literature search and selection by an independent reviewer. Therefore, combining a manual quality check with automated screening is important to understand any gaps and discrepancies, which can be resolved by evaluating the conflicts between the screener's and the reviewer's decisions.
Analysis, Data Visualization, and Preparation of the Report
AI-powered tools can support the meta-analysis of data extracted from multiple studies, enabling researchers to quantitatively synthesize findings and assess the overall effect sizes. AI can facilitate the organisation and categorisation of large volumes of literature through semantic analysis and clustering techniques. By identifying common themes and relationships among studies, AI algorithms can help researchers gain deeper insights into the existing literature, evaluate the current status of the topic of research, and identify gaps or areas for further investigation.
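For the quantitative synthesis step mentioned above, the minimal sketch below pools study-level effect estimates with fixed-effect inverse-variance weighting; the numbers are made up and serve only to show the kind of calculation an AI-assisted pipeline would feed.

```python
# Minimal fixed-effect (inverse-variance) pooling of study effect estimates,
# e.g. log odds ratios extracted from individual studies. Values are made up.
import math

effects = [0.42, 0.30, 0.55]       # per-study effect estimates (e.g. log OR)
std_errors = [0.15, 0.20, 0.10]    # their standard errors

weights = [1 / se**2 for se in std_errors]
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

ci_low, ci_high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled effect = {pooled:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```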
Additionally, AI-driven visualization techniques can be used to present complex information in a more accessible and comprehensible manner, facilitating understanding and decision-making.
AI technology uses algorithms and models to find patterns and trends, outliers in data, and correlations amongst diverse variables. Insights and recommendations generated from SLR data may help researchers understand the implications of gaps in current knowledge, processes, research, and policies. By incorporating feedback from researchers and updating algorithms based on new data, AI can continuously enhance the accuracy and efficiency of SLRs and improve the quality of research outputs.
Allied uses of Artificial Intelligence
An SLR report usually includes many abbreviated terms and acronyms, and these must all be listed. AI can be used to automate the creation of the abbreviations list. In addition, AI can assist with automated quality checks of the data presented in an SLR.
Overall, the integration of AI into SLRs looks promising, especially for enhancing the efficiency, accuracy, and comprehensiveness of the review process, enabling researchers to leverage the vast amount of available literature more effectively.
Write to us at [email protected] if you want to know more about our customised and cost-effective AI solutions for SLRs, targeted literature reviews, and meta-analyses.
Systematic Reviews, volume 13, Article number: 219 (2024)
This study aimed to evaluate the performance of large language models (LLMs) in the task of abstract screening in systematic review and meta-analysis studies, exploring their effectiveness, efficiency, and potential integration into existing human expert-based workflows.
We developed automation scripts in Python to interact with the APIs of several LLM tools, including ChatGPT v4.0, ChatGPT v3.5, Google PaLM 2, and Meta Llama 2, as well as more recent tools including ChatGPT v4.0 turbo, ChatGPT v3.5 turbo, Google Gemini 1.0 pro, Meta Llama 3, and Claude 3. This study focused on three databases of abstracts and used them as benchmarks to evaluate the performance of these LLM tools in terms of sensitivity, specificity, and overall accuracy. The results of the LLM tools were compared to human-curated inclusion decisions, the gold standard for systematic review and meta-analysis studies.
Different LLM tools had varying abilities in abstract screening. ChatGPT v4.0 demonstrated remarkable performance, with balanced sensitivity and specificity, and overall accuracy consistently reaching or exceeding 90%, indicating a high potential for LLMs in abstract screening tasks. The study found that LLMs could provide reliable results with minimal human effort and thus serve as a cost-effective and efficient alternative to traditional abstract screening methods.
While LLM tools are not yet ready to completely replace human experts in abstract screening, they show great promise in revolutionizing the process. They can serve as autonomous AI reviewers, contribute to collaborative workflows with human experts, and integrate with hybrid approaches to develop custom tools for increased efficiency. As technology continues to advance, LLMs are poised to play an increasingly important role in abstract screening, reshaping the workflow of systematic review and meta-analysis studies.
A systematic review is a rigorous form of research that collates and synthesizes all existing evidence on a specific research question [ 1 ]. It stands as a cornerstone not just in medical research but across diverse academic disciplines. Unlike traditional literature reviews, systematic reviews follow a comprehensive and standardized process, such as the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guideline [ 2 ], designed to minimize bias and ensure reproducibility. Therefore, these reviews are recognized as one of the highest levels of evidence in evidence-based research [ 3 ] and play a pivotal role in shaping clinical guidelines, healthcare policies, and informing medical decisions [ 4 ].
Commencing with a well-articulated research question, a typical systematic review launches an exhaustive search strategy that subsequently sweeps through databases, such as PubMed and Embase, supplemented by additional sources such as clinical trial registries and pertinent article reference lists, with the aim of capturing all relevant studies and mitigating bias. Predetermined inclusion and exclusion criteria guide the subsequent screening and selection of studies, encompassing factors like study design, patient demographics, and intervention types. Reviewers, working independently, appraise each study’s eligibility, reconciling disagreements through discussions or third-party reviews. Data extraction and synthesis follow, either through meta-analysis or narrative synthesis, depending on the heterogeneity of the selected studies.
The practice of conducting systematic reviews has gained substantial popularity, with considerable demand within the academic community. A notable study of this trend [ 5 ] found that approximately 75 trials and potentially 11 systematic reviews were disseminated daily, based on data from around 2010. Moreover, a query for “systematic review” in the Google Scholar database yields approximately 17,000 entries published within the year 2023 alone, which translates into about 51 systematic reviews per day, as observed on November 26, 2023. This expanding volume of literature underscores the critical role that systematic reviews play in the consolidation of research findings across various fields of study.
Despite their pivotal role, executing systematic reviews remains a formidable task, largely because of abstract screening, a key phase that can be overwhelmingly time-consuming due to its volume. For example, in Polanin et al. [ 6 ], the authors reported that the research staff screened 29,846 abstracts independently (14,923 unique citations were double screened) over the course of 189 days. In addition, the variability in reporting standards, use of jargon, and varied study designs can further complicate the abstract screening process [ 7 ]. Moreover, the repetitive nature of the task, combined with cognitive fatigue, can lead to human errors [ 8 , 9 ]. Recent advancements in machine learning (ML) and deep learning propose possible solutions to these challenges. However, traditional ML models, while promising, require domain-specific training, a time-consuming process that often demands manual labeling of datasets [ 10 ].
Advancements in natural language processing (NLP) and artificial intelligence (AI) are opening doors to address challenges in systematic reviews. Large language models (LLMs) like ChatGPT [ 11 ], PaLM [ 12 ], Llama [ 13 ], and Claude [ 14 ] are capturing the research community’s attention. Their collective potential, especially their capability to operate without exhaustive domain-specific training, makes them prime candidates for revolutionizing the systematic review process.
While each of the aforementioned AI tools brings its unique capabilities to the table, the fundamental question remains: How do they stack up, both individually and collectively, in the realm of abstract screening, against the human expert-based process? In pursuit of answers, this research seeks to thoroughly investigate the potential of ChatGPT, Google PaLM, Llama, and Claude in automating the crucial abstract screening phase integral to systematic reviews. Our goal is to rigorously compare the performance of these advanced AI-driven methods with existing machine learning (ML)-based approaches. In doing so, we aim to develop AI strategies that masterfully balance efficiency and accuracy with minimal human intervention, ultimately transforming systematic review practice across disciplines.
The use of NLP for abstract screening is established [ 10 , 15 ]. However, the application of LLMs specifically for this task is a nascent field [ 16 , 17 ]. This emerging area offers significant potential to improve efficiency and accuracy. Our study aims to fill this gap by providing a comprehensive analysis of LLM capabilities in abstract screening, laying the groundwork for future research and application. This is particularly relevant considering the rapid evolution of this technology, highlighting its potential to streamline systematic reviews now and in the future.
The remainder of this paper is structured to provide a comprehensive exploration of our topic. We begin with an in-depth examination of existing methods for abstract screening, including both manual and NLP-based approaches, laying the groundwork for understanding the current state of the field. We then introduce the use of large language model (LLM) tools for abstract screening, detailing our experimental design to meticulously evaluate their performance in this context. Subsequent sections present our empirical findings and results, shedding light on the capabilities and limitations of the AI tools in question. Finally, we engage in a thoughtful discussion, reflecting on the implications of our findings and considering the future trajectory of abstract screening in systematic reviews.
In the vast realm of systematic reviews, the critical task of abstract screening serves as a foundational step in curating the highest quality of evidence [ 2 ]. However, this process often presents significant challenges because it involves sifting through large volumes of literature to identify those articles that align with predefined criteria. Over time, various methodologies, ranging from manual evaluations to sophisticated AI-driven techniques, have been proposed to address the complexities of this task. In this section, we describe the existing approaches, their operational mechanisms, and their associated advantages and disadvantages.
Historically, the process of abstract screening was firmly rooted in manual evaluations. In this conventional approach, individual reviewers would scrutinize each abstract against predefined criteria [ 1 ]. The meticulous nature of this method required that multiple experts independently evaluate the literature to ensure both reliability and reduced biases [ 8 ]. While the depth of human expertise brought about nuanced understanding, the manual nature of this method made it both time-consuming and, at times, prone to human error [ 6 , 9 ].
As technology evolved, the field witnessed the incorporation of natural language processing (NLP) to automate abstract screening [ 10 ]. In this framework, abstract text undergoes preprocessing and vectorization. Supervised machine learning models, notably the support vector machine (SVM) and the random forest (RF), are then trained on this vectorized data to classify literature based on specific criteria [ 15 ]. The strength of this approach lies in its potential for efficiency. However, its efficacy and accuracy hinge heavily on the availability of a well-curated, labeled training set.
A more recent and innovative approach is zero-shot classification, which was notably highlighted by Xian et al. [ 18 ]. Eschewing the need for an extensive labeled training dataset, zero-shot classification offers the allure of categorizing abstracts without prior domain-specific training. By calculating a probability score for each abstract, researchers obtain a dynamic measure of its alignment with predefined categories. While it does not require training for the machine learning model, the tradeoff is a loss of sensitivity and potential omissions of pertinent studies [ 19 ]. In this study, for a given set of abstracts, we first obtain embeddings (using OpenAI’s text embedding API, the babbage model) for abstracts and for a pre-specified description of an ideal study to include. We indeed use the inclusion/exclusion criteria (see “Examples of abstract screening by using LLMs,” “Automated workflow for streamlining abstract screening via ChatGPT and other tools,” and “ Study design ” sections) as the description of an ideal study to include. We then compute the cosine similarity score between each abstract (its embedding) and the pre-specified description (its embedding). We claim the top 10% (in terms of similarity) of abstracts as positive (i.e., a study should be included).
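A minimal Python sketch of this zero-shot idea is given below: each abstract and a description of the ideal study are embedded, abstracts are ranked by their cosine similarity to that description, and the top 10% are flagged as candidate inclusions. The embedding model name is an assumption (the study reports using an OpenAI babbage text embedding), and the texts are placeholders.

```python
# Zero-shot screening sketch: rank abstracts by cosine similarity between their
# embeddings and the embedding of a description of the ideal study to include.
import numpy as np
from openai import OpenAI  # assumes the openai v1.x Python package

client = OpenAI()

def embed(texts, model="text-embedding-3-small"):  # assumed embedding model
    response = client.embeddings.create(model=model, input=texts)
    return np.array([item.embedding for item in response.data])

criteria = ("Primary data from an animal model of depression or depressive-like "
            "phenotype with an appropriate control group.")
abstracts = ["abstract text 1 ...", "abstract text 2 ...", "abstract text 3 ..."]

criterion_vec = embed([criteria])[0]
abstract_vecs = embed(abstracts)

similarities = abstract_vecs @ criterion_vec / (
    np.linalg.norm(abstract_vecs, axis=1) * np.linalg.norm(criterion_vec)
)
cutoff = np.quantile(similarities, 0.90)  # top 10% treated as candidate inclusions
candidate_inclusions = similarities >= cutoff
```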
To reconcile the strengths and limitations of these models, a hybrid methodology has emerged [ 18 ]. It cleverly marries the rapid categorization capabilities of zero-shot classification with the precision of traditional machine learning models such as SVM and RF. Here, zero-shot classification provides an initial curation of all abstracts in the training set. Then, based on zero-shot classification’s curation, the abstracts with classification score higher than a pre-specified threshold are selected to undergo manual review to rectify zero-shot classification. The rectified classification then serves as a foundation for training traditional models, which are then employed for screening the broader dataset (e.g., in the testing set). The hybrid approach balances the speed of zero-shot and the precision of traditional ML, and potentially offers enhanced accuracy at reduced human efforts. However, this approach involves the use of multiple methodologies and still relies on well-curated, labeled training data (in this case, a subset of the whole training set). In this study, for a given set of abstracts, we rectify the “positive” abstracts (i.e., the top 10% of abstracts) identified by zero-shot, and then fit a classification model for rectified labels to the text embeddings of the corresponding abstracts using SVM. This learned classification model is then used to predict the class of other abstracts. We added the R code for the zero-shot and hybrid approaches to the GitHub repository https://github.com/mikeli380/LLMAbstractScreening .
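The authors provide R code for the zero-shot and hybrid approaches in the repository above; the snippet below is only a hedged Python analogue of the hybrid step, with synthetic embeddings, synthetic zero-shot scores, and randomly generated "human-corrected" labels standing in for real data.

```python
# Hybrid screening sketch: zero-shot scores flag a small "positive" subset, a
# human corrects those labels, an SVM is fitted to their embeddings, and the
# fitted model screens the abstracts nobody reviewed. All data are synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 256))      # stand-in for abstract embeddings
zero_shot_scores = rng.uniform(size=500)      # stand-in for cosine similarities

reviewed = zero_shot_scores >= np.quantile(zero_shot_scores, 0.90)  # top 10%
corrected_labels = rng.integers(0, 2, reviewed.sum())  # human-corrected 0/1 labels

svm = SVC(kernel="linear", probability=True)
svm.fit(embeddings[reviewed], corrected_labels)

# Predicted inclusion probabilities for the abstracts never reviewed by a human.
include_probability = svm.predict_proba(embeddings[~reviewed])[:, 1]
```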
Active learning [ 20 , 21 , 22 ] is an innovative approach to machine learning that optimizes the training process by allowing the model to selectively query a human annotator for labels on the most informative data points. This method is particularly advantageous in scenarios where labeled data is scarce or expensive to obtain. Active learning models can start with minimal datasets, often requiring only one relevant and one irrelevant abstract, making them particularly suitable for tasks such as abstract screening, where the acquisition of large labeled datasets can be prohibitive.
The main advantage of active learning is its efficiency. By focusing on the most informative samples, it reduces the amount of data that needs to be labeled while still effectively training the model. This can significantly reduce the time and resources required for the annotation process. However, the effectiveness of active learning depends heavily on the initial selection of samples and the criteria used to determine the informativeness of subsequent data points. If not well calibrated, the model may request labels for data that do not significantly improve its performance, resulting in an inefficient use of resources. In addition, the iterative nature of querying and updating the model based on new labels can introduce complexity into the training process, potentially requiring more sophisticated infrastructure and oversight than traditional supervised learning methods.
While this study did not test active learning approaches for abstract screening, readers interested in exploring this methodology further are referred to [ 23 ] for detailed information on the application of active learning in abstract screening contexts.
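Although active learning was not evaluated in this study, a minimal uncertainty-sampling loop of the kind described above might look as follows; the embeddings, ground-truth labels, seed records, and query budget are all synthetic stand-ins.

```python
# Uncertainty-sampling sketch: start from one relevant and one irrelevant seed,
# then repeatedly query the label of the abstract the model is least sure about.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 64))               # stand-in for abstract embeddings
true_labels = (X[:, 0] > 1.0).astype(int)     # synthetic "human" decisions

labelled = [int(np.argmin(true_labels)), int(np.argmax(true_labels))]  # seeds
for _ in range(20):                           # query budget of 20 abstracts
    model = LogisticRegression(max_iter=1000)
    model.fit(X[labelled], true_labels[labelled])
    proba = model.predict_proba(X)[:, 1]
    uncertainty = np.abs(proba - 0.5)
    uncertainty[labelled] = np.inf            # never re-query labelled records
    query = int(np.argmin(uncertainty))       # most uncertain abstract
    labelled.append(query)                    # its "human" label becomes available
```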
Amidst the evolving methodologies described in the “ Existing approaches to abstract screening in systematic reviews ” section, modern AI tools based on large language models (LLMs), such as ChatGPT, PaLM, Llama, and Claude, are emerging as potential game-changers. Grounded in advanced language processing capabilities, these tools can be tailored to evaluate abstracts against nuanced criteria and offer detailed assessments and classifications. Their prowess signals transformative potential for abstract screening. In this section, we first present two examples to illustrate the potential of LLMs in the context of abstract screening, and then proceed with our in-depth discussion of the study’s workflow and design, which aims to critically investigate the performance of LLMs in this domain.
In this subsection, we demonstrate the process of LLM-based abstract screening using ChatGPT with two specified abstracts: (1) Millard et al. [ 24 ] and (2) Zawia et al. [ 25 ]. Using other LLM tools such as Llama, Google PaLM, or Claude for abstract screening is similar and will be skipped for brevity.
In constructing the prompts, we adopted a standardized approach to mimic a typical interaction between a senior researcher and a research assistant. Each prompt was designed to include three critical elements: a brief statement of the topic under review, the exact eligibility criteria as specified in the study guidelines, and the text of the abstract to be reviewed. This approach was designed to simulate a realistic and practical scenario that reflects common academic practices. In addition, to ensure consistency and objectivity in the responses, we required that the results of the LLM be in a binary “yes” or “no” format, directly indicating whether each abstract met the inclusion criteria. This format was chosen to eliminate any ambiguity in the model’s decisions and to facilitate straightforward analysis of its accuracy.
ChatGPT prompt: “I would like you to help me with conducting a systematic review on animal depression. I will provide the title and abstract for one journal article and would like you to screen the paper for inclusion. Here are the inclusion criteria: Does the study provide primary data of an animal model of depression or depressive-like phenotype with an appropriate control group? Here is the abstract:
Wistar-Kyoto (WKY) rats show endogenous depressive behavior that can be reversed by antidepressants. Given that WKYs exhibit decreased sensitivity to some antidepressants and treatment-resistant depressed patients often show hypothalamic-pituitary-thyroid (HPT) dysregulation, we examined the behavioral and HPT hormonal responses of WKYs to altered thyroid status. “Euthyroid” WKYs had elevated basal plasma TSH and T(3) levels as compared to Wistars. Hypothyroidism increased TSH levels more in WKYs than in Wistars and increased response latency in the open field test (OFT) of WKYs only. Administration of T(4) and T(3) suppressed plasma TSH equally in both strains. Wistars responded to increased T(3) levels with decreased response latency and increased activity in the OFT, but increased immobility in the forced swim test. In contrast, WKYs responded only to the high T(3) levels with decreased response latency in the OFT. These results suggest the existence of a decreased central nervous system sensitivity to thyroid hormones in WKYs that could be related to their depressive behavior.”
ChatGPT v4 response (accessed on 10/15/2023):
Given the abstract and the inclusion criteria, this study appears to meet the inclusion criteria for your systematic review on animal models of depression.
Consideration for inclusion:
1. Animal model and depressive-like phenotype:
• The study utilizes Wistar-Kyoto (WKY) rats which are mentioned to show endogenous depressive behavior.
• This study investigates behaviors and hormonal responses associated with depression in this particular animal model, with a focus on the role of thyroid hormones.
2. Primary data and appropriate control group:
• The study provides primary data on behavioral and hormonal responses of WKY rats, comparing them to Wistars (likely serving as the control group), under altered thyroid status, allowing for comparisons between the depressive-like phenotype and controls.
• The utilization of both “Euthyroid” WKYs and Wistars as controls helps in effectively investigating the behavioral changes and responses in WKY rats due to alterations in thyroid status.
3. Behavioral test indicating depression:
• Depressive behavior and responses are assessed using standardized tests like the open field test (OFT) and the forced swim test, which are conventional methods to investigate depressive-like behaviors in animal models.
• The mention of antidepressant’s impact on WKY rats indicates a focus on depressive-like states.
4. Relevance to human condition:
• Insights gained from this study on decreased central nervous system sensitivity to thyroid hormones and its possible relation to depressive behavior in WKYs can have implications in understanding treatment-resistant depression in humans, making it valuable for the review.
Conclusion: Given the provided abstract, this study does seem to fit the inclusion criteria for a systematic review on animal depression, as it provides primary data on an animal model demonstrating depressive-like behavior, includes an appropriate control group, and employs standardized tests to study the depressive-like states and their relation to thyroid hormones. The detailed exploration of depressive behaviors and hormonal imbalances in relation to altered thyroid status can offer valuable insights for understanding and treating depression.
ChatGPT prompt for abstract (2) (same instructions and eligibility criterion, with the second abstract): “The induction of the activity of cerebral ornithine decarboxylase (EC 4.1.1.17) and mRNA by electrical stimulation exhibits regional differences. The effects of the enzyme inhibitor difluoromethylornithine on these regional variations was examined. Administration of this inhibitor resulted in pronounced depression of both basal and induced activity of ornithine decarboxylase in the hippocampus. Basal activity of the enzyme in the neocortex and the cerebellum appeared to be resistant to difluoromethylornithine but the induced enzyme activity was sensitive to the effects of this inhibitor. Susceptibility to difluoromethylornithine may be directly correlated with a slower turnover rate for ornithine decarboxylase. These results suggest that ornithine decarboxylase in the hippocampus may possess a longer half-life than its counterparts in other regions of the brain. Pretreatment with difluoromethylornithine had no effect on the induced ornithine decarboxylase mRNA in the neocortex. Thus, elevated activity of ornithine decarboxylase enzyme, due to electrical stimulation, appears to not have any effect on either the transcription or the decay rate of the induced ornithine decarboxylase mRNA. These findings support the concept of region-specific regulation of cerebral ornithine decarboxylase.”
ChatGPT v4 response:
Given the provided abstract and the inclusion criteria, this study does not appear to meet the criteria for inclusion in a systematic review on animal models of depression or depressive-like phenotypes with an appropriate control group.
Consideration for inclusion:
1. Depression or depressive-like phenotype:
• The abstract does not provide any information or data related to depression or depressive-like behavior or phenotypes in animals. It is mainly focused on the activity of cerebral ornithine decarboxylase and its regional differences in response to electrical stimulation and inhibitor administration.
2. Animal model:
• There is no mention of the use of any specific animal model or any behavioral assays to assess depressive-like symptoms or states in the provided abstract.
3. Behavioral assessment and primary data:
• The abstract does not contain any details about behavioral assessments related to depression or depressive-like states. It primarily provides data on the effects of the enzyme inhibitor difluoromethylornithine on cerebral ornithine decarboxylase activity and mRNA, but does not relate these findings to depression or depressive-like behaviors.
4. Relevance to depression:
• While the study explores biochemical aspects in brain regions like the hippocampus, which is implicated in mood disorders like depression, the connection between the findings and depressive symptoms or states is not established in the provided abstract.
Conclusion: This article does not seem to fit the inclusion criteria for a systematic review on animal models of depression or depressive-like phenotypes as it lacks primary data on depressive-like behaviors or phenotypes in animals, and there is no mention of the use of animal models representative of depression or an appropriate control group for such a comparison.
From the two examples presented above, it is evident that ChatGPT adeptly comprehends the posed queries and efficiently extracts the inclusion and exclusion criteria. It not only makes clear decisions regarding study inclusion but also articulates its rationale by cross-referencing each criterion.
It is worth noting that the same ChatGPT prompt might result in slightly different responses between different users due to the variability in its response generation, but the conclusion for the eligibility of an abstract is typically consistent. In addition, enhancing the specificity of the criteria within the prompts can further refine the screening process. To illustrate, consider the following enhanced prompts:
Enhanced ChatGPT prompt for abstract (1): “I seek assistance with a systematic review focused on animal depression. I’ll provide the abstract of a particular journal article and would appreciate an assessment for its inclusion based on the following criteria:
The study employs an animal model reflecting depression or a depressive-like phenotype.
It incorporates an appropriate control group.
Primary data is presented within the study.
and here are the exclusion criteria:
The study bypasses the use of an animal model.
The utilized animal model doesn’t characterize depression or a depressive-like phenotype.
Absence of a control group within the study.
The study is primarily a review, devoid of primary data.
Here is the abstract: …”.
Let’s assess the provided abstract based on the inclusion and exclusion criteria:
Inclusion criteria:
1. Animal model reflecting depression: The study uses Wistar-Kyoto (WKY) rats, which are described as showing “endogenous depressive behavior.” Thus, this criterion is met.
2. Appropriate control group: The study compares WKY rats to Wistars (another strain), which serves as a comparison or control group. This criterion is met.
3. Primary data: The abstract mentions specific results such as the elevated basal plasma TSH and T(3) levels in “Euthyroid” WKYs compared to Wistars and the different responses in the open field test (OFT). This suggests the presence of primary data. Thus, this criterion is met.
Exclusion criteria:
1. Bypasses the use of an animal model: The study uses WKY rats as an animal model. Thus, this criterion is not met, which is good for inclusion.
2. Animal model doesn’t characterize depression: The WKY rats show “endogenous depressive behavior.” Thus, this criterion is not met, which again supports inclusion.
3. Absence of a control group: The study does use a control group (Wistars). Thus, this criterion is not met, supporting inclusion.
4. Study is primarily a review: The abstract presents experimental data and does not indicate that the study is a review. Thus, this criterion is not met, which is favorable for inclusion.
Conclusion: Given that the abstract meets all the inclusion criteria and none of the exclusion criteria, it should be considered in your systematic review on animal depression.
With these revised prompts applied to the previously discussed abstracts, the conclusions remain consistent. Nonetheless, we contend that this enhanced structure not only provides a more rigorous framework but also aligns more closely with widely accepted practices in systematic review criteria definition.
For systematic reviews, particularly when dealing with voluminous data, efficient workflows are paramount. The ChatGPT API (application programming interface) offers a dynamic solution, enabling the automation of abstract screening on a large scale, which circumvents the labor-intensive process of manually inputting abstracts into a chatbot interface. In this subsection, we present an automated workflow for streamlining abstract screening via ChatGPT. Note, though this automated workflow uses ChatGPT as the platform, analogous workflows work for other AI platforms like PaLM, Llama, and Claude.
Automated workflow:
Data collection: The preliminary step entails accumulating a list of titles and abstracts. By utilizing carefully crafted keywords, we retrieve these from PubMed and other pertinent databases. This comprehensive approach ensures the potential inclusion of all relevant studies for a subsequent detailed screening. It is worth noting that while this list is expansive, most of these studies may not find their way into the final meta-analysis post-screening.
Automation through Python: We have devised a Python script aimed at harnessing the capabilities of ChatGPT for evaluating the amassed abstracts (a minimal illustrative sketch follows this workflow description).
This script interacts with the ChatGPT API (specifically, the GPT-4 version) and, when furnished with tailored prompts, extracts structured responses from ChatGPT.
Typically, the AI’s response commences with a succinct summary, delves into explanations aligned with each criterion, and concludes with a decisive judgment, as exemplified by the examples in the “ Examples of abstract screening by using LLMs ” section.
This automated process efficiently saves ChatGPT’s verdicts on each abstract for ensuing analyses. For instance, it extracts the final decisions regarding the inclusion or exclusion of studies and determines the stance on each pre-specified criterion for every abstract, as exemplified by the last example in the “ Examples of abstract screening by using LLMs ” section.
Additionally, to ascertain the efficiency and cost-effectiveness of this methodology, the script also monitors the time, token usage, and the financial implications of querying the OpenAI API.
In essence, we envision this procedure as delegating the meticulous task of poring over scientific summaries to an AI assistant. This virtual entity meticulously sifts through each summary, determining its alignment with stipulated criteria.
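A minimal sketch of such a screening script, assuming the openai v1.x Python package, is shown below; the prompt wrapper, model name, and CSV file layout are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of an automated screening loop: read abstracts from a CSV file, ask the
# model for a verdict on each, and record the verdict and elapsed time.
import csv
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Conduct a systematic review on animal depression. I provide the title and "
    "abstract for one journal article. Provide an overall assessment based on the "
    "eligibility criteria with only one word, yes or no, then explain.\n\n"
    "Here are the eligibility criteria: <criteria text here>\n\n"
    "Here is the abstract:\n{abstract}"
)

def screen(abstract: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": PROMPT.format(abstract=abstract)}],
    )
    return response.choices[0].message.content

with open("abstracts.csv", newline="") as fin, \
        open("decisions.csv", "w", newline="") as fout:
    writer = csv.writer(fout)
    for record_id, abstract in csv.reader(fin):  # assumes rows of: id, abstract
        start = time.time()
        verdict = screen(abstract)
        writer.writerow([record_id, verdict, round(time.time() - start, 1)])
```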
Tuning parameters in the ChatGPT API: The effectiveness of the ChatGPT API is not only dependent on the input data; it is also significantly influenced by adjustable parameters that can refine the responses. Parameters such as temperature, top k, and top p critically affect model performance by modulating the randomness and focus of the output. While fine-tuning these parameters can improve results, it requires significant technical expertise and resources. Defaults, which are rigorously tested by developers, strike a balance between output quality and ease of use, making LLMs more accessible to a wider range of users without the need for complex parameter optimization. While customization holds promise for custom applications, the default settings provide an efficient and practical solution that facilitates wider adoption of LLM technologies.
Given the complexity of fine-tuning these parameters to optimize performance specifically for abstract screening, our study primarily used the recommended default settings for these parameters as provided by the respective platforms (detailed in Table 1 ). This approach was chosen to maintain the feasibility of our experiments and to ensure that our findings are applicable to typical deployment scenarios.
The automated workflow described applies to other LLM tools or different versions of the same tools. Throughout our study, we have tested several popular LLM tools available to us. The specific LLM models and their versions used at the time of our first submission are detailed in Table 1 , along with the tuning parameters.
The field of large language models (LLMs) has evolved rapidly since we started this study in 2023. New models are frequently released, and existing versions are constantly updated. To account for these advances, we have expanded our analysis to include results from the latest versions of previously studied models, as well as a few entirely new ones. As a best practice, we set the temperature parameter to 0 for all latest models. Table 1 now includes these latest models and their versions.
A brief explanation of these parameters is as follows:
Temperature: The temperature controls the randomness of the outputs, over a range from 0 to 2. A value of 0 is essentially deterministic, values above 1 become increasingly random, and the maximum temperature of 2 gives the most creative and variable outputs.
Max length: The max length is the maximum number of tokens for the model to generate as a response. A single word is generally 2–3 tokens.
Stop sequences: This parameter controls which tokens or phrases will stop the LLM from generating more text.
Top p: When generating text, the model samples only from the smallest set of most likely tokens whose cumulative probability reaches p. The top p value is thus a cumulative probability cutoff for the model’s selection of tokens; lower top p values mean sampling from a smaller, more top-weighted nucleus.
Top k: When generating text, the model samples from the k most likely tokens. Lower top k values restrict the choice to fewer, more probable tokens.
Frequency penalty: This frequency penalty parameter controls how the LLM penalizes tokens that have already appeared in the input and output text. A frequency penalty of 0 implies that the frequency of the tokens does not impact the LLM and will generate them based on their probability.
Presence penalty: This parameter controls how the LLM penalizes tokens that have already appeared in the text so far, regardless of how often. A presence penalty of 0 means that a token’s prior appearance does not affect the likelihood of the LLM generating it again.
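As a concrete illustration, these parameters map onto arguments of the OpenAI chat completions call roughly as shown below (top k is not exposed by the OpenAI API, but is by PaLM/Gemini and Llama); the values are illustrative, not the platform defaults listed in Table 1.

```python
# How the tuning parameters above appear in an OpenAI chat completions call.
# Values are illustrative only.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Screen this abstract: ..."}],
    temperature=0,        # randomness of the output (0 = near-deterministic)
    top_p=1,              # nucleus sampling: cumulative probability cutoff
    max_tokens=1024,      # maximum length of the generated response
    frequency_penalty=0,  # penalise tokens in proportion to how often they appeared
    presence_penalty=0,   # penalise tokens that have appeared at all
    stop=None,            # optional stop sequences
)
print(response.choices[0].message.content)
```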
In our pursuit to assess ChatGPT’s proficiency in abstract screening, we selected certain benchmark databases that have existing performance data from other methodologies. This selection aids in a comparative analysis of performance.
In selecting the systematic reviews for our study, we used a systematic approach guided by specific criteria to ensure relevance and reliability. These studies were selected from the publicly available SYNERGY [ 23 ] dataset, which contains 26 systematic reviews from different disciplines. Key selection criteria included:
Clarity and conciseness of eligibility criteria: The selected studies had well-defined and explicit eligibility criteria. This clarity is essential for accurate replication of the study selection process, which is critical for assessing the performance of LLM tools in an analogous real-world application.
Completeness and cleanliness of data: We ensured that the selected reviews had complete datasets, with all necessary information on included and excluded studies clearly documented, minimizing the risk of ambiguities affecting our analysis.
In addition, to comply with the AMSTAR-2 [ 26 ] guidelines, in particular point 5, we reviewed the methodologies of these reviews to confirm the selection of studies was performed in duplicate and disagreements were resolved by consensus. While our analysis assumes that these systematic reviews adhere to high standards, we recognize the inherent limitations of using pre-existing datasets as a proxy for gold standards in the discussion section.
We picked the following 3 databases from the publicly available SYNERGY dataset [ 23 ]:
Bannach-Brown 2016 [ 27 ]—topic: use of animal models to study depressive behavior
Human-curated (gold standard) results: 1258 excluded abstracts and 230 included abstracts.
We randomly selected 100 excluded abstracts and 100 included abstracts for screening by LLM tools.
Meijboom 2021 [ 28 ]—topic: retransitioning of etanercept in patients with a rheumatic disease
Human-curated (gold standard) results: 599 excluded abstracts and 32 included abstracts.
We randomly selected 100 excluded abstracts and all 32 included abstracts for screening by LLM tools.
Menon 2022 [ 29 ]—topic: methodological rigor of systematic reviews in environmental health
Human-curated (gold standard) results: 896 excluded abstracts and 73 included abstracts.
We randomly selected 100 excluded abstracts and all 73 included abstracts for screening by LLM tools.
For each chosen database, abstracts were categorized as either “cases” (those included based on a gold standard) or “controls” (those excluded per the gold standard). From each category, we randomly selected 100 abstracts (we used all abstracts if a category contained fewer than 100 abstracts in a database). These abstracts underwent evaluation by ChatGPT (v4.0) as per our established workflow. Subsequently, ChatGPT’s decisions were compared against the gold standard to determine sensitivity, specificity, and overall accuracy. The same abstracts were also processed using the other LLM tools listed in Table 1 to record their respective verdicts.
To quantify the efficacy of ChatGPT and other AI tools for each database, we calculated the following metrics: (1) sensitivity, (2) specificity, and (3) overall accuracy, where sensitivity is defined as the number of true positives divided by the sum of true positives and false negatives, specificity as the number of true negatives divided by the sum of true negatives and false positives, and accuracy as the sum of true positives and true negatives divided by the total number of abstracts. For each metric, associated 95% confidence intervals were also determined. Although it is very common in the field to report F1 score, recall rate, and precision, we believe it is more appropriate to report sensitivity and specificity given this study design. In addition, F1 score, recall rate, and precision can be derived from sensitivity and specificity.
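The snippet below illustrates these metrics for a vector of LLM decisions compared against the gold standard; the normal-approximation confidence interval is used here purely for illustration and is not necessarily the interval method used in the paper.

```python
# Sensitivity, specificity, and accuracy of screening decisions against the
# human-curated gold standard, with normal-approximation 95% CIs (illustrative).
import math

def metric_with_ci(successes: int, total: int):
    p = successes / total
    half_width = 1.96 * math.sqrt(p * (1 - p) / total)
    return p, (p - half_width, p + half_width)

def screening_metrics(predicted, gold):
    tp = sum(p == 1 and g == 1 for p, g in zip(predicted, gold))
    tn = sum(p == 0 and g == 0 for p, g in zip(predicted, gold))
    fp = sum(p == 1 and g == 0 for p, g in zip(predicted, gold))
    fn = sum(p == 0 and g == 1 for p, g in zip(predicted, gold))
    return {
        "sensitivity": metric_with_ci(tp, tp + fn),
        "specificity": metric_with_ci(tn, tn + fp),
        "accuracy": metric_with_ci(tp + tn, len(gold)),
    }

# Example: 1 = include, 0 = exclude.
print(screening_metrics(predicted=[1, 1, 0, 0, 1, 0], gold=[1, 0, 0, 0, 1, 1]))
```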
Furthermore, to explore the potential of a unified decision-making process, we combined the decisions from all AI tools using a voting mechanism. The majority decision across the tools was taken as the final verdict for each abstract. For this consolidated approach, we again computed sensitivity, specificity, overall accuracy, and the associated 95% CIs for each database. We also explore the use of latent class analysis (LCA), a model-based clustering approach, to derive consolidated decisions. More details on this LCA approach are provided in the “ Beyond majority voting ” section.
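A majority vote over the tools' binary decisions can be computed as below; how ties are broken (here, toward inclusion) is an assumption, since the paper's tie-breaking rule is not given here.

```python
# Majority vote across LLM tools for one abstract (1 = include, 0 = exclude).
# Ties are broken toward inclusion here, which is an assumption.
def majority_vote(decisions):
    return int(sum(decisions) >= len(decisions) / 2)

combined = majority_vote([1, 0, 1, 1])  # e.g. decisions from four LLM tools
```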
For a given database, 100 cases and 100 controls yield a two-sided 95% confidence interval with a half-width equal to 0.048 when the underlying sensitivity (specificity) is approximately 95%.
All statistical analyses were conducted using the R statistical software (version 4.3.1). All tests were two-sided with an alpha level set at 0.05 unless otherwise mentioned.
To improve the transparency and reproducibility of studies using AI tools, we have included the TRIPOD + AI checklist [ 30 ] in our report. This checklist has been adapted to reflect the specifics of our research, which focuses on the evaluation of large language models for abstract screening rather than diagnostic or prognostic modeling. The completed checklist is presented in Table S 1 , to provide readers with a comprehensive overview of our study’s adherence to established reporting standards.
We present the results for each of the 3 databases. For each, we first present the prompts we used when we called LLM tools to screen an abstract, then present the performance data (accuracy, sensitivity, and specificity for each method or LLM tool), followed by a summary of the performance, and a comparison of the performance of the different methods to that based on ChatGPT v4.0.
The prompts we used for screening abstracts in this database are as follows:
Conduct a systematic review on animal depression. I provide the title and abstract for one journal article. Provide an overall assessment based on eligibility criteria with only one word answer yes or no with no explanation. Then, for each inclusion or exclusion criterion, answer with only one word, yes if it is included by the inclusion criterion or excluded by the exclusion criterion, and answer no if it does not meet the inclusion criterion or not excluded by the exclusion criterion. After answering all the criteria with yes or no, then provide an overall explanation.
Here is the eligibility criteria: Inclusion Criteria: 1. Any article providing primary data of an animal model of depression or depressive-like phenotype with an appropriate control group (specified above). 2. Animals of all ages, sexes and species, where depression-like phenotype intended to mimic the human condition have been induced. Including animal models where depressive-like phenotypes are induced in the presence of a comorbidity (e.g. obesity or cancer). 3. All studies that claim to model depression or depressive-like phenotypes in animals. Studies that induce depressive behavior or model depression and that also test a treatment or intervention (prior or subsequent to model induction), with no exclusion criteria based on dosage, timing or frequency. 4. Studies measuring behavioral, anatomical and structural, electrophysiological, histological and/or neurochemical outcomes and where genomic, proteomic or metabolomic outcomes are measured in addition to behavioral, anatomical, electrophysiological, histological or neurochemical outcomes. Exclusion Criteria: 1. Review article, editorials, case reports, letters or comments, conference or seminar abstracts, studies providing primary data but not appropriate control group. 2. Human studies and ex vivo, in vitro or in silico studies. Studies will be excluded if authors state an intention to induce or investigate only anxiety or anxious behavior. Studies will be excluded if there is no experimental intervention on the animals (e.g. purely observational studies). 3. Studies that investigate treatments or interventions, but no depressive behavior or model of depression is induced (e.g. toxicity and side-effect studies). 4. Where metabolic outcome measures are the primary outcome measure of a study. Where genomic, proteomic, metabolic or metabolomic outcomes are the sole outcome measures in a study, they will be excluded.
Here is the abstract:
Among all the LLM tools we tested, ChatGPT v4.0 stood out with the highest accuracy (0.945) and specificity (0.960), and satisfactory sensitivity (0.930). The combined decision using majority voting improved sensitivity (0.970) considerably but lowered specificity (0.870). Comparatively, the zero-shot method was less effective across these metrics, while the hybrid method attained superior accuracy (0.961) and specificity (0.982), albeit with reduced sensitivity (0.843). We acknowledge that our findings regarding the zero-shot and hybrid methods differ from those reported in [ 18 ] for the same set of abstracts. This discrepancy could be attributed to the use of distinct sets of embeddings in our analysis compared to the previous study.
For this database, the newer versions of ChatGPT (3.5-Turbo and 4-Turbo) did not improve performance over ChatGPT (v4.0). Gemini-1.0-pro (vs. PaLM 2) and Llama 3 (vs. Llama 2) improved over their older versions, but did not surpass the performance of ChatGPT (v4.0). Claude 3 performed well, but still did not surpass the performance of ChatGPT (v4.0).
Comparison between LLM tools. We compared the performance (sensitivity and specificity) between ChatGPTv4.0 and other LLM tools using the McNemar test and found that ChatGPTv4.0 performed significantly better ( p value = 0.002) than Google PaLM 2 in terms of sensitivity; ChatGPTv4.0 performed significantly better than ChatGPTv3.5 ( p value = 0.008) and better than Llama-2 ( p value < 0.001) in terms of specificity. Combining the decisions of different LLM tools using majority voting did not improve the overall accuracy compared to ChatGPTv4.0. Specifically, there was no statistically significant difference ( p value = 0.134) in sensitivity between the combined decision (majority voting) and ChatGPTv4.0, and the combined decision was significantly worse ( p value = 0.008) than ChatGPTv4.0 in terms of specificity.
Comparison between ChatGPT v4.0 and zero-shot and hybrid methods: We assessed the performance (sensitivity and specificity) of ChatGPT v4.0 against both the zero-shot and hybrid approaches using the McNemar test. Specifically, we aligned the screening results from 100 cases and 100 controls as per the ChatGPT v4.0 method and similarly for the zero-shot and hybrid methods, testing for inconsistencies between these approaches as previously done. Our analysis revealed that ChatGPT v4.0 significantly outperformed the zero-shot method in sensitivity ( p value < 0.001) but showed comparable effectiveness in specificity ( p value = 0.37). Additionally, ChatGPT v4.0 demonstrated marginally superior sensitivity compared to the hybrid method ( p value = 0.07), while its performance in specificity was similar ( p value = 1.00).
Conduct a systematic review on transitioning patients from an originator to a corresponding biosimilar.
I provide the title and abstract for one journal article. Provide an overall assessment based on eligibility criteria with only one word answer yes or no, with no explanation. Then, for each inclusion or exclusion criterion, answer with only one word, yes if it is included by the inclusion criterion or excluded by the exclusion criterion, and answer no if it does not meet the inclusion criterion or not excluded by the exclusion criterion.
After answering all the criteria with yes or no, then provide an overall explanation.
Here is the eligibility criteria:
Articles were included if they met the following criteria:
Study involved transitioning from a TNFα inhibitor (including etanercept, infliximab, and adalimumab) originator to a biosimilar
The number of patients who retransitioned was reported or could be calculated
The article was an original research article published in a peer-reviewed journal
The article included baseline characteristics of the patients who transitioned
The article was written in English
The full-text version of the article could be obtained.
Transitioning was defined as patients in whom the biosimilar was introduced after the originator, without treatment with other drugs in between. Retransitioning was defined as restarting the originator directly after discontinuing a biosimilar, without treatment with other drugs in between. In summary, transitioning was defined as switching from the originator to a biosimilar; retransitioning was defined as switching from the originator to a biosimilar and back to the originator. Both transitioning and retransitioning involved changes with the same active biological substance.
Among all the LLM tools we tested, ChatGPT v4.0 stood out with the highest accuracy (0.840), although its specificity (0.860) and sensitivity (0.812) were not the highest. Compared to ChatGPTv4.0, the combined decision using majority voting did not improve overall accuracy (0.720), but improved sensitivity (1.000) at the expense of specificity (0.630).
Comparison between LLM tools. We compared the performance (sensitivity and specificity) between ChatGPTv4.0 and other LLM tools using the McNemar test and found that ChatGPTv4.0 performed significantly better ( p value < 0.001) than Google PaLM 2, but significantly worse than ChatGPT3.5 ( p value = 0.001) and Llama 2, in terms of sensitivity; ChatGPTv4.0 performed significantly better than ChatGPTv3.5 ( p value < 0.001) and better than Llama 2 ( p value < 0.001), but worse than Google PaLM 2 ( p value = 0.002), in terms of specificity. Combining the decisions of different LLM tools using majority voting did not improve the overall accuracy compared to ChatGPTv4.0. Specifically, there was a statistically significant difference ( p value = 0.008) in sensitivity between the combined decision (majority voting) and ChatGPTv4.0, and the combined decision was not significantly worse ( p value > 0.50) than ChatGPTv4.0 in terms of specificity.
For this database, the newer versions of ChatGPT (3.5-Turbo and 4-Turbo) did not improve performance over ChatGPT (v4.0), and Gemini-1.0-pro (vs. PaLM 2) did not improve performance either. However, Llama 3 (vs. Llama 2) improved over its older version and surpassed the performance of ChatGPT (v4.0). Claude 3 also slightly surpassed the performance of ChatGPT (v4.0).
Comparison between ChatGPT v4.0 and zero-shot and hybrid methods: We evaluated the performance of ChatGPT v4.0, focusing on sensitivity and specificity, in comparison with the zero-shot and hybrid approaches, employing the McNemar test as described above. In this analysis, we aligned the screening results from 32 cases and 100 controls for the tests. Our findings indicated that ChatGPT v4.0 significantly surpassed the zero-shot method in sensitivity ( p value = 0.0002) and exhibited marginally improved specificity ( p value = 0.099). Furthermore, ChatGPT v4.0 showed notably higher sensitivity than the hybrid method ( p value < 0.001), although its specificity was comparatively lower.
“Conduct a systematic review on the methodological rigour of systematic reviews in environmental health.
I provide the title and abstract for one journal article.
Provide an overall assessment based on eligibility criteria with only one word answer yes or no, with no explanation.
Then, for each inclusion or exclusion criterion, answer with only one word, yes if it is included by the inclusion criterion or excluded by the exclusion criterion and answer no if it does not meet the inclusion criterion or not excluded by the exclusion criterion.
Here are the eligibility criteria:
To be eligible for inclusion in the SR sample, documents had to fulfill the following criteria:
Identify explicitly as a “systematic review” in their title
Assess the effect of a non-acute, non-communicable, environmental exposure on a health outcome. Environmental exposures can include air and water pollutants, radiation, noise, occupational hazards, lifestyle factors (like diet or physical activity) and lifestyle choices influenced by family and peers (like substance use), social and economic factors (like stress from work or living conditions).
Include studies in people or mammalian models
Be available in HTML format
Abstract X”
Among all the LLM tools we tested, ChatGPT v4.0 stood out with the highest accuracy (0.913) and sensitivity (0.932), but not specificity (0.900). Compared to ChatGPTv4.0, the combined decision using majority voting did not improve overall accuracy (0.884) or sensitivity (0.808), but improved specificity (0.940).
Comparison between LLM tools. We compared the performance (sensitivity and specificity) between ChatGPTv4.0 and other LLM tools using the McNemar test and found that ChatGPTv4.0 performed significantly better than ChatGPT3.5 ( p value < 0.001), Google PaLM 2, and Llama 2 ( p value = 0.02) in terms of sensitivity; ChatGPTv4.0 performed worse than ChatGPTv3.5 and Google PaLM 2 in terms of specificity. Combining the decisions of different LLM tools using majority voting did not improve the overall accuracy compared to ChatGPTv4.0. Specifically, there was a statistically significant difference ( p value = 0.008) in sensitivity between the combined decision (majority voting) and ChatGPTv4.0, and the combined decision was not significantly different ( p value = 0.134) from ChatGPTv4.0 in terms of specificity.
For this database, the newer versions of ChatGPT (3.5-Turbo and 4-Turbo) did not improve performance over ChatGPT (v4.0). However, both Gemini-1.0-pro (vs. PaLM 2) and Llama 3 (vs. Llama 2) improved over their older versions and surpassed the performance of ChatGPT (v4.0). Claude 3 also performed well but did not surpass the performance of ChatGPT (v4.0).
Comparison between ChatGPT v4.0 and zero-shot and hybrid methods: We aligned the screening results from 73 cases and 100 controls based on the ChatGPT v4.0 method, and similarly for the zero-shot and hybrid methods, to test for inconsistencies between these approaches, using the McNemar test as done in previous assessments. Our analysis showed that ChatGPT v4.0 significantly outperformed the zero-shot method in both sensitivity ( p value < 0.001) and specificity ( p value = 0.016). In comparison with the hybrid method, ChatGPT v4.0 also demonstrated superior sensitivity ( p value < 0.001) and better specificity ( p value = 0.04).
To use the ChatGPT API or other LLM tools, the platform owners charge a predetermined rate for API access. These fees are quoted in USD per thousand tokens, tokens being the basic units these platforms use to quantify text length; a token can correspond to a word, part of a word, or a punctuation mark. The cost of screening 200 abstracts was approximately $6 for ChatGPT v4.0, $0.20 for ChatGPT v3.5, and $10 for Llama 2 (via Replicate), while Google PaLM 2 was free to invited developers; the cumulative cost across all platforms was therefore approximately $16.2. The cumulative cost of evaluating the same 200 abstracts across the latest models ($3 for GPT-4-Turbo, $0.05 for GPT-3.5-Turbo, free for Gemini-1.0-pro, $0.05 for Llama-3, $4 for Claude) was lower, approximately $7.1. In terms of time efficiency, processing 200 abstracts with each of these LLM tools took approximately 10–20 min using a single thread. However, abstract screening lends itself well to parallelization: setting up multiple threads to screen different subsets of abstracts simultaneously can substantially reduce the overall time required. This parallel approach not only increases efficiency but also ensures that large volumes of abstracts can be processed in a timely manner, making LLM tools even more attractive for large-scale screening tasks. In summary, the monetary and time costs of using LLM tools for abstract screening are negligible compared with manual labeling.
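As a hedged sketch of the parallelization mentioned above: because each abstract is screened independently, the API calls can be dispatched from a thread pool. The screen_abstract helper refers to the illustrative function sketched earlier, and the records structure is assumed, not taken from the study's code.

```python
# Sketch: dispatch independent screening calls from a thread pool.
# `records` is assumed to be a list of {'title': ..., 'abstract': ...} dicts.
from concurrent.futures import ThreadPoolExecutor

def screen_many(records: list[dict], max_workers: int = 8) -> list[str]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(screen_abstract, r["title"], r["abstract"])
                   for r in records]
        return [f.result() for f in futures]  # verdicts in the original order
```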
We have expanded our analysis to include a variety of approaches for synthesizing decisions across different LLM tools. Our methodology is inspired by the literature on combining multiple diagnostic tests in the absence of a gold standard, akin to situations where human expert consensus is not available. Several publications discuss such scenarios [ 31 , 32 ]; among these approaches, we chose to use latent class analysis (LCA) models.
Latent class analysis (LCA) is a statistical method used to identify subgroups within a population, which are not directly observed (hence “latent”) [ 33 , 34 , 35 ]. It is particularly useful when the research interest lies in categorizing individuals into mutually exclusive groups based on their responses to multiple observed variables. In the context of abstract screening, LCA can offer a sophisticated means of integrating decisions from different LLM tools without relying on a gold standard, typically provided by human expert consensus. This approach assumes that the unobserved subgroups (or “latent classes”) explain the dependence between the observed decisions made by each of the LLM tools.
Utilizing the LCA model, we treat the decisions from all LLM tools as dichotomous variables, corresponding to the adherence to each inclusion or exclusion criterion, as well as the overall decision. For instance, within the Bannach-Brown 2016 database (BB2016), there are eight criteria in total with four criteria each for inclusion and exclusion and one overall decision for inclusion, resulting in a total of nine binary items per LLM tool. In our analysis, we incorporated decisions from GPT v4.0, v3.5, and Llama 2. Decisions from Google PaLM 2 were excluded due to a high frequency (10% or more) of incomplete responses. Consequently, for the Bannach-Brown 2016 database, we worked with 27 binary items. For other databases such as Meijboom 2021 (Mj2021) and Menon 2022 (Me2022), the binary items totaled 21 and 15, respectively. It is important to note that LCA models were fitted to the binary data of each database independently.
The LCA model fitting process enables us to calculate the posterior probabilities of each abstract belonging to specific latent classes or subgroups. Abstracts are then categorized based on these probabilities, with assignment to the class for which an abstract has the highest posterior membership probability. The determination of the number of latent classes is a critical step in the LCA model fitting, which requires a priori specification. In our evaluation, we explored models with class numbers ranging from 2 to 6 and utilized the Bayesian information criterion (BIC) to identify the most “optimal” LCA model for our datasets.
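To make the model concrete, the following is a minimal numpy implementation of a binary latent class model fitted by EM, with BIC computed for model selection. It is an illustrative re-implementation under standard LCA assumptions (conditional independence of items given class), not the exact software used in the study.

```python
# Illustrative numpy/EM implementation of a binary latent class model with BIC;
# not the exact software used in the study.
import numpy as np
from scipy.special import logsumexp

def fit_lca(X, n_classes, n_iter=500, tol=1e-6, seed=0):
    """X: (n_abstracts, n_items) 0/1 matrix of LLM decisions.

    Returns class prevalences pi, item probabilities theta, posterior class
    memberships, the log-likelihood, and the BIC.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, J = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)             # class prevalences
    theta = rng.uniform(0.3, 0.7, size=(n_classes, J))   # P(item = 1 | class)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior membership probabilities for each abstract
        log_p = (np.log(pi)[None, :]
                 + X @ np.log(theta).T
                 + (1 - X) @ np.log(1 - theta).T)         # shape (n, n_classes)
        ll = logsumexp(log_p, axis=1).sum()
        resp = np.exp(log_p - logsumexp(log_p, axis=1, keepdims=True))
        # M-step: re-estimate prevalences and item response probabilities
        pi = resp.mean(axis=0)
        theta = np.clip((resp.T @ X) / resp.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    n_params = (n_classes - 1) + n_classes * J
    bic = -2 * ll + n_params * np.log(n)
    return pi, theta, resp, ll, bic

# Model selection over 2-6 classes, as described in the text:
# best_bic, best_k = min((fit_lca(X, k)[-1], k) for k in range(2, 7))
```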
Table 5 shows that after applying the Bayesian information criterion (BIC) to determine the most appropriate model for our data, we identified a 3-class model as the best fit for the binary decisions derived from the BB2016 database. Similarly, a 4-class model was optimal for the Mj2021 database, while a 3-class model was again best for the Me2022 database. The confusion matrices generated by the selected LCA models for each database provided a clear juxtaposition between the LLM-assigned classes and the actual labels of the abstracts (see Table 6 ).
The performance metrics derived from these models are noteworthy. For the BB2016 database, if we categorize abstracts in class 2 as "included" and those in classes 1 and 3 as "excluded", the LCA model achieved a sensitivity of 93% and a specificity of 96%, indicating a high degree of accuracy in classifying relevant and irrelevant abstracts. For the Mj2021 database, if we treat class 1 as the "included" group and classes 2 and 3 as the "excluded" group, the model achieved a perfect sensitivity of 100%, correctly identifying all relevant abstracts, although the specificity was lower at 79%, indicating a higher rate of false positives. Similarly, for the Me2022 database, treating class 1 as the "excluded" category and classes 2 and 3 as the "included" group, the model showed a sensitivity of 94.5% and a specificity of 83%, a good balance between identifying relevant abstracts and avoiding false inclusions.
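Once each abstract is assigned to a class and the classes are mapped to "included" or "excluded", sensitivity and specificity follow directly from the confusion matrix. The short sketch below assumes boolean arrays of predicted and gold inclusion labels; the class-to-label mapping shown in the comment is the BB2016 example from the text.

```python
# Sketch: map LCA classes to include/exclude and score against gold labels.
import numpy as np

def sens_spec(pred_include: np.ndarray, gold_include: np.ndarray):
    """Both arguments are boolean arrays of the same length."""
    tp = np.sum(pred_include & gold_include)
    fn = np.sum(~pred_include & gold_include)
    tn = np.sum(~pred_include & ~gold_include)
    fp = np.sum(pred_include & ~gold_include)
    return tp / (tp + fn), tn / (tn + fp)

# e.g. for BB2016 (class 2 = included): pred_include = (assigned_class == 2)
```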
These results highlight the robustness of the latent class analysis approach for the abstract screening task, providing an effective way to classify abstracts when a gold standard is not available. The varying levels of sensitivity and specificity across databases underscore the need to tailor the LCA model to the specific characteristics of each dataset and point to directions for further research.
We have incorporated results from the latest LLM models to increase the robustness and relevance of our findings. However, to maintain consistency and comparability with our original analyses, we have not incorporated these new results into the previously established majority voting or latent class analysis (LCA) approaches. Instead, we have chosen to make all raw data, including results from these newer models, freely available in our GitHub repository. This approach allows interested researchers and practitioners to conduct further investigations or apply alternative methods of analysis. By providing access to this additional data, we aim to promote transparency and enable the broader community to engage with and potentially build upon our work.
This study began with a rigorous exploration of the capabilities of large language models (LLMs) in abstract screening. We used automation scripts developed in Python to interact with the APIs of several LLM tools, including ChatGPT v4.0, Google PaLM 2, and Meta Llama 2, as well as the latest versions of these tools. Our central goal was to evaluate the efficiency and accuracy of these tools across three different databases of abstracts, leading to a nuanced understanding of their potential in this context.
Large language models (LLMs), particularly ChatGPT, have garnered global attention since their inception. Employing LLMs for abstract screening in systematic reviews is an innovative concept [ 32 , 33 ] and remains underexplored. This study presents the first comprehensive evaluation of LLMs applied to systematic review processes. The findings are encouraging, suggesting that LLMs could revolutionize abstract screening. Specifically, ChatGPT v4.0 exhibited stellar performance across three test scenarios, achieving an accuracy of at least 85%. Furthermore, it attained sensitivity and specificity rates ranging from 80% to an impressive 95%. These exceptional outcomes highlight the substantial promise of LLMs in abstract screening, offering an efficient and capable alternative to the conventional, laborious approaches that typically necessitate extensive human annotation.
However, it is important to acknowledge that we are still in the early stages of integrating LLM tools into the abstract screening field, and they are not without imperfections; for example, even the best-performing LLM tool, ChatGPT v4.0, occasionally excluded a substantial share (around 20%) of relevant studies. These tools are not a universal solution to all the challenges associated with abstract screening, and they are not ready to completely replace human expertise in this area. Instead, they should be embraced as invaluable assistants in the abstract screening process.
In discussing the limitations of our study, it is important to recognize the constraints associated with our dataset selection and model configuration. We used only three databases from the SYNERGY set, limiting the generalizability of our findings across disciplines and datasets. In addition, the reliance on human-curated labels as a gold standard, while necessary, introduces potential biases due to the retrospective nature of our analysis. These labels may contain errors, and the methodology used in the original reviews to resolve discrepancies may affect the validity of our conclusions.
The parameters chosen for our LLMs—temperature, top k, top p, and prompts—were set to defaults or based on natural conversations to balance output quality with ease of use. While this approach minimizes the need for technical expertise, it may not be optimal for all screening scenarios. In addition, the reliance of LLMs on abstracts alone, as opposed to full-text evaluations, presents a fundamental challenge; critical data influencing inclusion or exclusion may not be present in the abstracts, potentially compromising screening accuracy.
In addition, the rapid development of LLMs and their “black box” nature pose challenges to the transparency that is essential in scientific settings. The environmental impact of using these computationally intensive models is also significant [ 36 ], requiring sustainability considerations. Future research should focus on refining these tools to increase transparency and efficiency, and ensure their responsible development and use in systematic reviews.
Our research suggests that LLM tools are ready to take on a role in abstract screening and are poised to have an immediate and positive impact on the process. Their integration into abstract screening can manifest itself in a few ways. They can serve as autonomous AI reviewers, adding an extra layer of scrutiny and ensuring thoroughness. Our findings suggest that a collective decision, such as one derived from majority voting, can sometimes improve sensitivity, underscoring the potential of LLM tools as a reliable safeguard against oversight, ensuring both comprehensiveness and accuracy.
In addition, LLM tools can facilitate a synergistic partnership with human experts. They are adept at identifying “high-risk” abstracts where different LLM tools have reached different judgments, flagging them for further human evaluation, and promoting a careful and error-free screening process while minimizing human effort.
Another exciting prospect is the integration of LLM tools with hybrid and active learning approaches. In this scenario, LLM tools could autonomously annotate abstracts in the training set, minimizing the need for human labeling. These labeled abstracts could then be used to train custom NLP models, paving the way for a streamlined and efficient abstract screening process and significant time and resource savings. Further research is needed to understand how the "uncertainty" in LLM-based decisions, when human-curated labels are not used, affects the performance of the hybrid approach. We also note from the test examples that the performance of hybrid approaches varies widely from example to example and depends on the text embedding tools used. Extensive research is needed to perfect these hybrid approaches.
Another future research direction is to explore how to fine-tune different versions of LLM tools and how to derive collective decisions. One idea is that by using different tuning parameters, one could propose different versions of LLM tools, and if they perform similarly in terms of accuracy but give different decisions (e.g., not highly correlated), we expect that better collective decisions would be obtained, as observed in many ensemble learning experiences [ 37 ]. However, this could be costly and require a lot of exploration.
More importantly, it would be particularly valuable to prospectively test these LLM tools, or their integration with approaches such as zero-shot classification or active learning, in real-world meta-analysis and systematic review projects. This would provide deeper insight into their practical utility and effectiveness, and a more comprehensive understanding of their impact in live research environments.
In summary, while LLM tools may not be able to fully replace human experts in abstract screening, their ability to transform the screening process is both undeniable and imminent. With continued advances in technology and ongoing refinement, these tools will play a critical role in the future of abstract screening, ushering in a new era of efficiency and effectiveness.
The datasets analyzed during the current study are available in the SYNERGY repository, https://github.com/asreview/synergy-dataset . We have created a GitHub repository https://github.com/mikeli380/LLMAbstractScreening to host our code (R/Python), abstracts, and intermediate results so that interested readers can replicate our findings.
Higgins JPT, et al. Cochrane handbook for systematic reviews of interventions. 2019.
Moher D, et al. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7). https://doi.org/10.1371/journal.pmed.1000097 .
Sackett DL, Rosenberg WMC, Gray JAM, Haynes RB, Richardson WS. Evidence based medicine: what it is and what it isn't. BMJ. 1996;312(7023). https://doi.org/10.1136/bmj.312.7023.71 .
Lavis JN, Posada FB, Haines PA, Osei E. Use of research to inform public policymaking. Lancet. 2004;364(9445). https://doi.org/10.1016/S0140-6736(04)17317-0 .
Bastian H, Glasziou P, Chalmers I. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up? PLoS Med. 2010;7(9). https://doi.org/10.1371/journal.pmed.1000326 .
Polanin JR, Pigott TD, Espelage DL, Grotpeter JK. Best practice guidelines for abstract screening large-evidence systematic reviews and meta-analyses. Res Synth Methods. 2019;10(3). https://doi.org/10.1002/jrsm.1354 .
Turner EH, Matthews AM, Linardatos E, Tell RA, Rosenthal R. Selective publication of antidepressant trials and its influence on apparent efficacy. N Engl J Med. 2008;358(3). https://doi.org/10.1056/nejmsa065779 .
Edwards P, Clarke M, DiGuiseppi C, Pratap S, Roberts I, Wentz R. Identification of randomized controlled trials in systematic reviews: accuracy and reliability of screening records. Stat Med. 2002;21(11). https://doi.org/10.1002/sim.1190 .
Wang Z, Nayfeh T, Tetzlaff J, O’Blenis P, Murad MH. Error rates of human reviewers during abstract screening in systematic reviews. PLoS ONE. 2020;15(1). https://doi.org/10.1371/journal.pone.0227742 .
O’Mara-Eves A, Thomas J, McNaught J, Miwa M, Ananiadou S. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst Rev. 2015;4(1). https://doi.org/10.1186/2046-4053-4-5 .
OpenAI. Introducing ChatGPT Plus. OpenAI Blog; 2023. https://openai.com/blog/chatgpt-plus . Accessed 8 Nov 2023.
Google DeepMind. Introducing PaLM 2. Google Blog; 2023. https://blog.google/technology/ai/google-palm-2-ai-large-language-model/ . Accessed 8 Nov 2023.
Meta AI Research. Meta and Microsoft introduce the next generation of Llama. Facebook AI Blog; 2023. https://ai.meta.com/blog/llama-2/ . Accessed 8 Nov 2023.
Claude.ai. Claude AI API reference. 2024. https://docs.anthropic.com/en/api/getting-started . Accessed 4 Jun 2024.
Wallace BC, Small K, Brodley CE, Lau J, Trikalinos TA. Deploying an interactive machine learning system in an evidence-based practice center: Abstrackr. 2012. https://doi.org/10.1145/2110363.2110464 .
Alshami A, Elsayed M, Ali E, Eltoukhy AEE, Zayed T. Harnessing the power of ChatGPT for automating systematic review process: methodology, case study, limitations, and future directions. Systems. 2023;11(7). https://doi.org/10.3390/systems11070351 .
Qureshi R, Shaughnessy D, Gill KAR, Robinson KA, Li T, Agai E. Are ChatGPT and large language models ‘the answer’ to bringing us closer to systematic review automation? Syst Rev. 2023;12(1). https://doi.org/10.1186/s13643-023-02243-z .
Xian Y, Lampert CH, Schiele B, Akata Z. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE Trans Pattern Anal Mach Intell. 2019;41(9). https://doi.org/10.1109/TPAMI.2018.2857768 .
Moreno-Garcia CF, Jayne C, Elyan E, Aceves-Martins M. A novel application of machine learning and zero-shot classification methods for automated abstract screening in systematic reviews. Decis Anal J. 2023;6. https://doi.org/10.1016/j.dajour.2023.100162 .
Ferdinands G, et al. Performance of active learning models for screening prioritization in systematic reviews: a simulation study into the Average Time to Discover relevant records. Syst Rev. 2023;12(1). https://doi.org/10.1186/s13643-023-02257-7 .
Boetje J, van de Schoot R. The SAFE procedure: a practical stopping heuristic for active learning-based screening in systematic reviews and meta-analyses. Syst Rev. 2024;13(1). https://doi.org/10.1186/s13643-024-02502-7 .
Yang Y, Loog M. A benchmark and comparison of active learning for logistic regression. Pattern Recognit. 2018;83. https://doi.org/10.1016/j.patcog.2018.06.004 .
De Bruin RJ, Ma Y, Ferdinands G, Teijema J, Van de Schoot R. SYNERGY - open machine learning dataset on study selection in systematic reviews. DataverseNL; 2023. https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/HE6NAQ .
Millard SJ, Weston-Green K, Newell KA. The Wistar-Kyoto rat model of endogenous depression: a tool for exploring treatment resistance with an urgent need to focus on sex differences. Prog Neuropsychopharmacol Biol Psychiatry. 2020;101. https://doi.org/10.1016/j.pnpbp.2020.109908 .
Zawia NH, Mattia CJ, Bondy SC. Differential effects of difluoromethylornithine on basal and induced activity of cerebral ornithine decarboxylase and mRNA. Neuropharmacology. 1991;30(4). https://doi.org/10.1016/0028-3908(91)90058-J .
Shea BJ, et al. AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ. 2017;358. https://doi.org/10.1136/bmj.j4008 .
Bannach-Brown A, Liao J, Wegener G, Macleod M. Understanding in vivo modelling of depression in non-human animals: a systematic review protocol. Evid Based Preclin Med. 2016;3(2). https://doi.org/10.1002/ebm2.24 .
Meijboom RW, Gardarsdottir H, Becker ML, ten Wolde S, Egberts TCG, Giezen TJ. Incidence of and reasons and determinants associated with retransitioning from biosimilar etanercept to originator etanercept. BioDrugs. 2021;35(6). https://doi.org/10.1007/s40259-021-00501-x .
Menon JML, Struijs F, Whaley P. The methodological rigour of systematic reviews in environmental health. Crit Rev Toxicol. 2022;52(3). https://doi.org/10.1080/10408444.2022.2082917 .
Collins GS, et al. Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ Open. 2021;11(7). https://doi.org/10.1136/bmjopen-2020-048008 .
Joseph L, Gyorkos TW, Coupal L. Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. Am J Epidemiol. 1995;141(3). https://doi.org/10.1093/oxfordjournals.aje.a117428 .
Hui SL, Walter SD. Estimating the error rates of diagnostic tests. Biometrics. 1980;36(1). https://doi.org/10.2307/2530508 .
Dendukuri N, Joseph L. Bayesian approaches to modeling the conditional dependence between multiple diagnostic tests. Biometrics. 2001;57(1). https://doi.org/10.1111/j.0006-341X.2001.00158.x .
Collins LM, Lanza ST. Latent class and latent transition analysis: with applications in the social, behavioral, and health sciences. 2010.
McCutcheon AL. Latent class analysis. Newbury Park: Sage; 1987.
Rillig MC, Ågerstrand M, Bi M, Gould KA, Sauerland U. Risks and benefits of large language models for the environment. Environ Sci Technol. 2023;57(9). https://doi.org/10.1021/acs.est.3c01106 .
Sagi O, Rokach L. Ensemble learning: a survey. Wiley Interdiscip Rev Data Min Knowl Discov. 2018;8(4). https://doi.org/10.1002/widm.1249 .
The authors thank the reviewers and the Editor for their insightful comments and suggestions, which have greatly improved the quality of this work.
Authors and affiliations.
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
Michael Li & Xianming Tan
Department of Mathematics and Statistics, University of North Carolina at Greensboro, Greensboro, NC, 27402, USA
Jianping Sun
Department of Biostatistics and Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
Xianming Tan
XT conceived the paper. ML and XT devised the study. ML, JS, and XT analyzed the datasets. ML, JS, and XT wrote the manuscript. All authors edited and approved the final manuscript.
Correspondence to Xianming Tan .
Competing interests.
The authors declare that they have no competing interests.
Publisher’s note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary material 1: Table S1. TRIPOD+AI checklist.
Li, M., Sun, J. & Tan, X. Evaluating the effectiveness of large language models in abstract screening: a comparative analysis. Syst Rev 13 , 219 (2024). https://doi.org/10.1186/s13643-024-02609-x
Received: 20 December 2023. Accepted: 11 July 2024. Published: 21 August 2024.
Title: Streamlining the Selection Phase of Systematic Literature Reviews (SLRs) Using AI-Enabled GPT-4 Assistant API.
Abstract: The escalating volume of academic literature presents a formidable challenge in staying updated with the newest research developments. Addressing this, this study introduces a pioneering AI-based tool, configured specifically to streamline the efficiency of the article selection phase in Systematic Literature Reviews (SLRs). Utilizing the robust capabilities of OpenAI's GPT-4 Assistant API, the tool successfully homogenizes the article selection process across a broad array of academic disciplines. Implemented through a tripartite approach consisting of data preparation, AI-mediated article assessment, and structured result presentation, this tool significantly accelerates the time-consuming task of literature reviews. Importantly, this tool could be highly beneficial in fields such as management and economics, where the SLR process involves substantial human judgment. The adoption of a standard GPT model can substantially reduce potential biases and enhance the speed and precision of the SLR selection phase. This not only amplifies researcher productivity and accuracy but also denotes a considerable stride forward in the way academic research is conducted amidst the surging body of scholarly publications.
A Systematic Review of the Evolution of the Concept of Resilience in the Construction Industry
| Database | Search Strings | Number |
|---|---|---|
| Scopus | (TITLE-ABS-KEY(“construction industry” OR “civil engineering” OR “construction engineering” OR “architecture engineering” OR “construction sector” OR “building industry” OR “construction project”) AND TITLE-ABS-KEY(“resilience” OR “adaptive capacity” OR “robustness” OR “recoverability” OR “durability”) AND TITLE (“resilience” OR “resilient”)) | 451 |
| Web of Science | (TS = (“construction industry” OR “civil engineering” OR “construction engineering” OR “architecture engineering” OR “construction sector” OR “building industry” OR “construction project”) AND TS = (“resilience” OR “adaptive capacity” OR “robustness” OR “recoverability” OR “durability”) AND TI = (“resilience” OR “resilient”)) | 763 |
| Main Categories | Subcategories | Number of Samples | Total |
|---|---|---|---|
| Specific Class | Specific Construction Projects | 3 | 18 |
| | Concrete Facilities and Systems | 4 | |
| | Definite Geographic Locations | 4 | |
| | Specific Infrastructure Systems | 7 | |
| General Class | Industries and Fields | 18 | 52 |
| | Concepts and Methods | 4 | |
| | Extensive Networks and Systems | 19 | |
| | Social and Community Impact | 11 | |
| Main Categories | Subcategories | Number of Samples | Total |
|---|---|---|---|
| Natural and environmental factors | Geological events | 16 | 27 |
| | Climate and weather events | 8 | |
| | Biological events | 3 | |
| Human and external factors | Economic and supply chain risks | 3 | 49 |
| | Job and workplace-related risks | 6 | |
| | Security and safety risks | 10 | |
| | System and infrastructure risks | 11 | |
| | Uncertainties | 19 | |
| Main Categories | Subcategories | Number of Samples | Total |
|---|---|---|---|
| Management and Operations | Safety and Risk Management | 32 | 60 |
| | Operational Management | 21 | |
| | Human and Community Management | 5 | |
| | Resource and Inventory Management | 2 | |
| Performance and Development | Structural and System Performance | 7 | 17 |
| | Network Performance | 2 | |
| | Sustainability and Environmental Management | 6 | |
| | Quality and Improvement Management | 2 | |
Li, J.; Yu, H.; Deng, X. A Systematic Review of the Evolution of the Concept of Resilience in the Construction Industry. Buildings 2024 , 14 , 2643. https://doi.org/10.3390/buildings14092643
The current study aimed to systematically review the literature on the accuracy of artificial intelligence (AI) models for osteoporosis (OP) diagnosis using dental images. A thorough literature search was executed in October 2022 and updated in November 2023 across multiple databases, including PubMed, Scopus, Web of Science, and Google Scholar. The research targeted studies using AI models for OP diagnosis from dental radiographs. The main outcomes were the sensitivity and specificity of AI models regarding OP diagnosis. The "meta" package from the R Foundation was selected for statistical analysis. A random-effects model, along with 95% confidence intervals, was utilized to estimate pooled values. The Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool was employed for risk of bias and applicability assessment. Among 640 records, 22 studies were included in the qualitative analysis and 12 in the meta-analysis. The overall sensitivity for AI-assisted OP diagnosis was 0.85 (95% CI, 0.70-0.93), while the pooled specificity equaled 0.95 (95% CI, 0.91-0.97). Conventional algorithms led to a pooled sensitivity of 0.82 (95% CI, 0.57-0.94) and a pooled specificity of 0.96 (95% CI, 0.93-0.97). Deep convolutional neural networks exhibited a pooled sensitivity of 0.87 (95% CI, 0.68-0.95) and a pooled specificity of 0.92 (95% CI, 0.83-0.96). This systematic review corroborates the accuracy of AI in OP diagnosis using dental images. Future research should expand sample sizes in test and training datasets and standardize imaging techniques to establish the reliability of AI-assisted methods in OP diagnosis through dental images.
Keywords: Artificial intelligence; Bone density; Deep learning; Osteoporosis; Panoramic radiography.
Background Systematic reviews provide a structured overview of the available evidence in medical-scientific research. However, due to the increasing medical-scientific research output, it is a time-consuming task to conduct systematic reviews. To accelerate this process, artificial intelligence (AI) can be used in the review process. In this communication paper, we suggest how to conduct a ...
Artificial intelligence (AI) has acquired notable relevance in modern computing as it effectively solves complex tasks traditionally done by humans. AI provides methods to represent and infer knowledge, efficiently manipulate texts and learn from vast amounts of data. These characteristics are applicable in many activities that humans find laborious or repetitive, as is the case of the ...
The systematic literature review (SLR) is the gold standard for providing research with a firm evidence foundation to support decision-making. ... This technique provides a working example of how researchers can use AI and MLTs to automate the data synthesis and abstraction stage of their SLR, and aid in increasing the speed, frugality, and rigour ...
These reviews aid in identifying and comprehending the sources, focuses, and context of the extensive literature available (Alsharhan et al., 2023, pp. 1-13; Fui-Hoon Nah et al., 2023; Singh & Singh, 2023), but a comprehensive and interdisciplinary systematic literature review on GAI is necessary, going beyond the examination of individual ...
This review delineated automated tools and platforms that employ artificial intelligence (AI) approaches and evaluated the reported benefits and challenges in using such methods. A search was conducted in 4 databases (Medline, Embase, CDSR, and Epistemonikos) up to April 2021 for systematic reviews and other related reviews implementing AI methods.
The Systematic Literature Review (SLR) is a fundamental component of academic research, offering a comprehensive and unbiased overview of existing literature on a specific topic Keele et al. ().It involves a structured methodology for identifying, evaluating, and synthesizing all relevant research to address clearly defined research questions Kitchenham et al. ().
Systematic review of 22 papers: Reviewed AI definition, pedagogical models used, reasons for using AI, the use of automated scoring, and comparison of assessment evaluation accuracy between AI use and non-use: 2010-2020: AI for student assessment: Hwang & Tu, 2021: Bibliometric mapping analysis and systematic review of 43 papers
The PRISMA-AI implementations will assist stakeholders interested in using AI-related information in systematic reviews by creating a framework for reviewers that evaluates the quality of the data ...
This manuscript presents a comprehensive review of the use of Artificial Intelligence (AI) in Systematic Literature Reviews (SLRs). A SLR is a rigorous and organised methodology that assesses and integrates previous research on a given topic. Numerous tools have been developed to assist and partially automate the SLR process. The increasing role of AI in this field shows great potential in ...
A comprehensive literature review is a key component of any systematic review, and must be complete and thorough. ... However, the potential pitfalls of using AI tools in systematic reviews can be mitigated to a significant extent. AI-based tools that provide citations and sources, such as Elicit, ...
It is a challenging task for any research field to screen the literature and determine what needs to be included in a systematic review in a transparent way. A new open source machine learning ...
As part of this investigation, a systematic literature review was conducted using the PRISMA methodology ... a PRISMA-based systematic literature review was conducted. AI has the potential to transform many industries, including medicine and healthcare. Large language models, such as ChatGPT, have attracted attention for their ability to ...
Systematic Literature Reviews (SLRs) have become the foundation of evidence-based studies, enabling researchers to identify, classify, and combine existing studies based on specific research questions. Conducting an SLR is largely a manual process. Over the previous years, researchers have made significant progress in automating certain phases of the SLR process, aiming to reduce the effort ...
Only articles relevant to AIEd were selected for this review. To identify relevant published articles, three of the authors collaboratively discussed and developed the criteria depicted in Fig. 1.Based on the previous studies (Nigam et al., 2021), the search query [("AI" OR "artificial intelligence") AND "education"] was used to include papers with these terms in the titles ...
Securely automate every stage of your literature review to produce evidence-based research faster, more accurately, and more transparently at scale. Rayyan: a web-tool designed to help researchers working on systematic reviews, scoping reviews and other knowledge synthesis projects by dramatically speeding up the process of screening and ...
This review adoptes the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews guidelines (Tricco et al., 2018) to review the literature on using AI in story-writing.As shown in Fig. 1, this review involved four main steps: (1) identification, (2) screening, (3) eligibility, and (4) inclusion.The PRISMA criteria was adopted because it is a scientific ...
Using AI in systematic reviews is revolutionizing the process, making it more practical and sustainable. Incorporating AI into the process not only expedites the systematic literature reviews but also reduces human errors, and comes as a cost-effective systematic approach. In this article, we'll learn more about the role of AI in systematic ...
A variety of AI tools can be used during the systematic review or evidence synthesis process. These may be used to assist with developing a search strategy; locating relevant articles or resources; or during the data screening, data extraction or synthesis stage.They can also be used to draft plain language summaries.. The overall consensus is that the AI tools can be very useful in different ...
Systematic Literature Review (SLR) using AI involves leveraging artificial intelligence techniques to automate and expedite the process of reviewing and synthesizing large volumes of scholarly literature. Topics: zotero, systematic-literature-reviews, literature-search, obsidian-md, ai-tools.
The predominant use of AI in SLRs is to help streamline the initial steps of the process by automating the search for relevant literature and screening articles based on predefined eligibility criteria. Search engines now utilise AI, especially by strengthening Retrieval-Augmented Generation (RAG) frameworks through large language models.
Amidst the evolving methodologies described in the "Existing approaches to abstract screening in systematic reviews" section, modern AI tools based on large language models (LLMs), such as ChatGPT, PaLM, Llama, and Claude, are emerging as potential game-changers. Grounded in advanced language processing capabilities, these tools can be tailored to evaluate abstracts against nuanced ...
Artificial intelligence (AI) is revolutionizing many aspects of our lives, except it raises fundamental safety and ethical issues. In this survey paper, we review the current state of research on safe and trustworthy AI. This work provides a structured and systematic overview of AI safety. In which, we emphasize the significance of designing AI systems with safety focus, encompassing elements ...
systematic literature review (keyword and backwards search) on existing ethics guidelines and principles for AI and extracted aspects that seemed applicable to guide recruiting processes. Based on 28 relevant papers we derived actionable guidelines for using AI-based algorithms in recruiting processes.
This study encourages the use of advanced methodologies like big data, AI, and high-level architecture (HLA) federated simulations to predict disruptions and optimize resilience strategies, thus providing a robust foundation for handling future uncertainties. ... A Systematic Literature Review and a Framework for Future Research. Eur. Manag. J ...
This study aims to synthesise the findings of research on cloud computing adoption and use in libraries. This systematic literature review is based on Preferred Reporting Items for Systematic Revie...