
Predictive analytics is a branch of advanced analytics that makes predictions about future outcomes using historical data combined with statistical modeling, data mining techniques and machine learning.

Companies employ predictive analytics to find patterns in data, identify risks, and uncover opportunities. Predictive analytics is often associated with big data and data science.

Today, companies are inundated with data, from log files to images and video, and all of it resides in disparate data repositories across the organization. To gain insights from this data, data scientists use deep learning and machine learning algorithms to find patterns and make predictions about future events. These statistical techniques include logistic and linear regression models, neural networks and decision trees. Some modeling techniques use initial predictive learnings to make further predictive insights.



Predictive analytics models are designed to assess historical data, discover patterns, observe trends, and use that information to predict future trends. Popular predictive analytics models include classification, clustering, and time series models.

Classification models

Classification models fall under the branch of supervised machine learning. These models categorize data based on historical data, describing relationships within a given dataset. For example, a classification model can sort customers or prospects into groups for segmentation purposes. It can also answer questions with binary outputs, such as yes/no or true/false; popular use cases include fraud detection and credit risk evaluation. Types of classification models include logistic regression, decision trees, random forest, neural networks, and naïve Bayes.
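To make the binary-output idea concrete, here is a minimal sketch of a logistic regression classifier trained from scratch with plain gradient descent. The transaction amounts, labels, learning rate, and 0.5 decision threshold are all invented for the demo; a real fraud model would use a vetted library and many more features.

```python
import math

def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit w, b for P(y=1|x) = sigmoid(w*x + b) on 1-D inputs."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x   # gradient of the log-loss w.r.t. w
            gb += (p - y)       # ...and w.r.t. b
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def predict(w, b, x):
    """Binary decision at the conventional 0.5 probability threshold."""
    return 1 if 1.0 / (1.0 + math.exp(-(w * x + b))) >= 0.5 else 0

# Toy fraud-style data: larger (scaled) transaction amounts are flagged.
amounts = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
labels  = [0,   0,   0,   1,   1,   1]
w, b = train_logistic(amounts, labels)
print(predict(w, b, 0.2), predict(w, b, 0.85))  # small amount -> 0, large -> 1
```

The same fit generalizes to many predictors by making `w` a vector; libraries such as scikit-learn handle that (plus regularization) for you.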

Clustering models

Clustering models fall under unsupervised learning. They group data based on similar attributes. For example, an e-commerce site can use such a model to separate customers into similar groups based on common features and develop distinct marketing strategies for each group. Common clustering algorithms include k-means clustering, mean-shift clustering, density-based spatial clustering of applications with noise (DBSCAN), expectation-maximization (EM) clustering using Gaussian mixture models (GMM), and hierarchical clustering.
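As an illustrative sketch only, the snippet below runs k-means on one-dimensional "customer spend" values. Real segmentation would use multi-dimensional features and randomized initialization; here the centroids start at the min and max to keep the toy example deterministic.

```python
def kmeans_1d(values, k=2, iters=20):
    # Deterministic init for the demo: centroids at the extremes.
    centroids = [min(values), max(values)][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each point to its nearest centroid.
            idx = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[idx].append(v)
        # Move each centroid to the mean of its assigned points.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

spend = [12, 15, 14, 90, 95, 88]            # two obvious segments
centroids, clusters = kmeans_1d(spend)
print(sorted(round(c) for c in centroids))  # → [14, 91]
```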

Time series models

Time series models use various data inputs at a specific time frequency, such as daily, weekly, or monthly. It is common to plot the dependent variable over time to assess the data for seasonality, trends, and cyclical behavior, which may indicate the need for specific transformations and model types. Autoregressive (AR), moving average (MA), ARMA, and ARIMA models are all frequently used time series models. For example, a call center can use a time series model to forecast how many calls it will receive per hour at different times of day.
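As a rough sketch of the autoregressive idea (not the full ARIMA machinery), the snippet below fits an AR(1) model, y_t = a·y_{t-1} + c, by ordinary least squares on the lagged series and forecasts one step ahead. The hourly call counts are hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of y_t = a * y_{t-1} + c."""
    x = series[:-1]          # lagged values
    y = series[1:]           # current values
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    c = my - a * mx
    return a, c

calls = [100, 110, 105, 115, 112, 120, 118]   # toy hourly call volumes
a, c = fit_ar1(calls)
forecast = a * calls[-1] + c                  # next-hour prediction
print(round(forecast))
```

In practice a library such as statsmodels would also handle differencing, seasonality, and model-order selection.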

Predictive analytics can be deployed across industries for a range of business problems. Below are a few industry use cases that illustrate how predictive analytics can inform decision-making in real-world situations.

  • Banking: Financial services firms use machine learning and quantitative tools to make predictions about prospects and customers. With this information, banks can answer questions such as: Who is likely to default on a loan? Which customers pose high or low risk? Which customers are the most lucrative targets for resources and marketing spend? Which transactions are fraudulent?
  • Healthcare: Predictive analytics in healthcare is used to detect and manage the care of chronically ill patients, as well as to track specific infections such as sepsis. Geisinger Health used predictive analytics to mine health records to learn more about how sepsis is diagnosed and treated. Geisinger created a predictive model based on health records for more than 10,000 patients who had previously been diagnosed with sepsis. The model yielded impressive results, correctly identifying patients with a high likelihood of survival.
  • Human resources (HR): HR teams use predictive analytics and employee survey metrics to match prospective job applicants, reduce employee turnover and increase employee engagement. This combination of quantitative and qualitative data allows businesses to reduce their recruiting costs and increase employee satisfaction, which is particularly useful when labor markets are volatile.
  • Marketing and sales: While marketing and sales teams are very familiar with business intelligence reports to understand historical sales performance, predictive analytics enables companies to be more proactive in the way that they engage with their clients across the customer lifecycle. For example, churn predictions can enable sales teams to identify dissatisfied clients sooner, enabling them to initiate conversations to promote retention. Marketing teams can leverage predictive data analysis for cross-sell strategies, and this commonly manifests itself through a recommendation engine on a brand’s website.
  • Supply chain: Businesses commonly use predictive analytics to manage product inventory and set pricing strategies. This type of predictive analysis helps companies meet customer demand without overstocking warehouses. It also enables companies to assess the cost and return on their products over time. If one part of a given product becomes more expensive to import, companies can project the long-term impact on revenue if they do or do not pass on additional costs to their customer base. For a deeper look at a case study, you can read more about how FleetPride used this type of data analytics to inform their decision making on their inventory of parts for excavators and tractor trailers. Past shipping orders enabled them to plan more precisely to set appropriate supply thresholds based on demand.

An organization that knows what to expect based on past patterns has a business advantage in managing inventories, workforce, marketing campaigns, and most other facets of operation.

  • Security: Every modern organization must be concerned with keeping data secure. A combination of automation and predictive analytics improves security. Specific patterns associated with suspicious and unusual end user behavior can trigger specific security procedures.
  • Risk reduction: In addition to keeping data secure, most businesses are working to reduce their risk profiles. For example, a company that extends credit can use data analytics to better understand if a customer poses a higher-than-average risk of defaulting. Other companies may use predictive analytics to better understand whether their insurance coverage is adequate. 
  • Operational efficiency: More efficient workflows translate to improved profit margins. For example, knowing when a delivery vehicle in a fleet will need maintenance, before it breaks down on the side of the road, means deliveries are made on time without the added costs of towing the vehicle and bringing in another employee to complete the delivery.
  • Improved decision making: Running any business involves making calculated decisions. Any expansion or addition to a product line or other form of growth requires balancing the inherent risk with the potential outcome. Predictive analytics can provide insight to inform the decision-making process and offer a competitive advantage.

IBM Watson® Studio empowers data scientists, developers and analysts to build, run and manage AI models, and optimize decisions anywhere on IBM Cloud Pak for Data.

IBM® SPSS® Statistics is a powerful statistical software platform. It offers a user-friendly interface and a robust set of features that lets your organization quickly extract actionable insights from your data.

IBM® SPSS® Modeler is a leading visual data science and machine learning (ML) solution designed to help enterprises accelerate time to value by speeding up operational tasks for data scientists.


IBM SPSS Statistics offers advanced statistical analysis, a vast library of machine learning algorithms, text analysis, open-source extensibility, integration with big data and seamless deployment into applications.




Business Insights

Harvard Business School Online's Business Insights Blog provides the career insights you need to achieve your goals and gain confidence in your business skills.


What Is Predictive Analytics? 5 Examples


26 Oct 2021

Data analytics (the practice of examining data to answer questions, identify trends, and extract insights) can provide you with the information necessary to strategize and make impactful business decisions.

There are four key types of data analytics:

  • Descriptive, which answers the question, “What happened?”
  • Diagnostic, which answers the question, “Why did this happen?”
  • Prescriptive, which answers the question, “What should we do next?”
  • Predictive, which answers the question, “What might happen in the future?”

The ability to predict future events and trends is crucial across industries. Predictive analytics appears more often than you might assume—from your weekly weather forecast to algorithm-enabled medical advancements. Here’s an overview of predictive analytics to get you started on the path to data-informed strategy formulation and decision-making.


What Is Predictive Analytics?

Predictive analytics is the use of data to predict future trends and events. It uses historical data to forecast potential scenarios that can help drive strategic decisions.

The predictions could be for the near future—for instance, predicting the malfunction of a piece of machinery later that day—or the more distant future, such as predicting your company’s cash flows for the upcoming year.

Predictive analysis can be conducted manually or using machine-learning algorithms. Either way, historical data is used to make assumptions about the future.

One predictive analytics tool is regression analysis, which can determine the relationship between two variables (single linear regression) or three or more variables (multiple regression). The relationships between variables are written as a mathematical equation that can help predict the outcome should one variable change.
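A minimal sketch of single linear regression, using the least-squares closed form: the slope is the covariance of x and y divided by the variance of x. The ad-spend and sales figures are hypothetical and chosen to fit the line y = 2x + 1 exactly, so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return slope, intercept

ad_spend = [1, 2, 3, 4, 5]
sales    = [3, 5, 7, 9, 11]        # exactly y = 2x + 1 for a clean demo
slope, intercept = fit_line(ad_spend, sales)
print(slope, intercept)            # → 2.0 1.0
prediction = slope * 6 + intercept # predicted sales at spend = 6
print(prediction)                  # → 13.0
```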

“Regression allows us to gain insights into the structure of that relationship and provides measures of how well the data fit that relationship,” says Harvard Business School Professor Jan Hammond, who teaches the online course Business Analytics, one of the three courses that make up the Credential of Readiness (CORe) program. “Such insights can prove extremely valuable for analyzing historical trends and developing forecasts.”

Forecasting can enable you to make better decisions and formulate data-informed strategies. Here are several examples of predictive analytics in action to inspire you to use it at your organization.


5 Examples of Predictive Analytics in Action

1. Finance: Forecasting Future Cash Flow

Every business needs to keep periodic financial records, and predictive analytics can play a big role in forecasting your organization’s future health. Using historical data from previous financial statements, as well as data from the broader industry, you can project sales, revenue, and expenses to craft a picture of the future and make decisions.

HBS Professor V.G. Narayanan mentions the importance of forecasting in the course Financial Accounting, which is also part of CORe.

“Managers need to be looking ahead in order to plan for the future health of their business,” Narayanan says. “No matter the field in which you work, there is always a great amount of uncertainty involved in this process.”

2. Entertainment & Hospitality: Determining Staffing Needs

One example explored in Business Analytics is casino and hotel operator Caesars Entertainment’s use of predictive analytics to determine venue staffing needs at specific times.

In entertainment and hospitality, customer influx and outflux depend on various factors, all of which play into how many staff members a venue or hotel needs at a given time. Overstaffing costs money, and understaffing could result in a bad customer experience, overworked employees, and costly mistakes.

To predict the number of hotel check-ins on a given day, a team developed a multiple regression model that considered several factors. This model enabled Caesars to staff its hotels and casinos and avoid overstaffing to the best of its ability.
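The exact factors Caesars used are not public, so as a hedged illustration the sketch below fits a multiple regression by solving the normal equations (XᵀX)b = Xᵀy with Gaussian elimination. The predictors (say, expected arrivals and a day-of-week index) and the check-in counts are invented, chosen so the true coefficients are 5, 2, and 3.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_multiple(X, y):
    """Least-squares coefficients for y ~ X (first column is all 1s)."""
    cols = list(zip(*X))
    XtX = [[sum(a * b for a, b in zip(c1, c2)) for c2 in cols] for c1 in cols]
    Xty = [sum(a * b for a, b in zip(c, y)) for c in cols]
    return solve(XtX, Xty)

# Columns: intercept, hypothetical arrivals index, hypothetical day index.
X = [[1, 1, 1], [1, 2, 1], [1, 1, 2], [1, 2, 2], [1, 3, 1]]
y = [10, 12, 13, 15, 14]            # exactly 5 + 2*x1 + 3*x2
coef = fit_multiple(X, y)
print([round(c, 6) for c in coef])  # → [5.0, 2.0, 3.0]
```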

3. Marketing: Behavioral Targeting

In marketing, consumer data is abundant and leveraged to create content, advertisements, and strategies to better reach potential customers where they are. By examining historical behavioral data and using it to predict what will happen in the future, you engage in predictive analytics.

Predictive analytics can be applied in marketing to forecast sales trends at various times of the year and plan campaigns accordingly.

Additionally, historical behavioral data can help you predict a lead’s likelihood of moving down the funnel from awareness to purchase. For instance, you could use a single linear regression model to determine that the number of content offerings a lead engages with predicts—with a statistically significant level of certainty—their likelihood of converting to a customer down the line. With this knowledge, you can plan targeted ads at various points in the customer’s lifecycle.

Related: What Is Marketing Analytics?

4. Manufacturing: Preventing Malfunction

While the examples above use predictive analytics to take action based on likely scenarios, you can also use predictive analytics to prevent unwanted or harmful situations from occurring. For instance, in the manufacturing field, algorithms can be trained using historical data to accurately predict when a piece of machinery will likely malfunction.

When the criteria for an upcoming malfunction are met, the algorithm is triggered to alert an employee who can stop the machine and potentially save the company thousands, if not millions, of dollars in damaged product and repair costs. This analysis predicts malfunction scenarios in the moment rather than months or years in advance.
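A purely illustrative sketch of the trigger logic described above: sensor readings feed a rolling average, and an alert fires whenever the averaged trend crosses a limit. The sensor values, window size, and limit are all hypothetical; real predictive-maintenance systems learn such thresholds from labeled failure data.

```python
from collections import deque

def monitor(readings, window=3, limit=75.0):
    """Return (time step, rolling average) pairs where an alert fires."""
    recent = deque(maxlen=window)
    alerts = []
    for t, value in enumerate(readings):
        recent.append(value)
        avg = sum(recent) / len(recent)
        if len(recent) == window and avg > limit:
            alerts.append((t, round(avg, 1)))
    return alerts

vibration = [60, 62, 61, 70, 78, 85, 90]   # readings drifting toward failure
print(monitor(vibration))                  # → [(5, 77.7), (6, 84.3)]
```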

Some algorithms even recommend fixes and optimizations to avoid future malfunctions and improve efficiency, saving time, money, and effort. This is an example of prescriptive analytics; more often than not, one or more types of analytics are used in tandem to solve a problem.

5. Health Care: Early Detection of Allergic Reactions

Another example of using algorithms for rapid, predictive analytics for prevention comes from the health care industry. The Wyss Institute at Harvard University partnered with the KeepSmilin4Abbie Foundation to develop a wearable piece of technology that predicts an anaphylactic allergic reaction and automatically administers life-saving epinephrine.

The sensor, called AbbieSense, detects early physiological signs of anaphylaxis as predictors of an ensuing reaction—and it does so far quicker than a human can. When a reaction is predicted to occur, an algorithmic response is triggered. The algorithm can predict the reaction’s severity, alert the individual and caregivers, and automatically inject epinephrine when necessary. The technology’s ability to predict the reaction at a faster speed than manual detection could save lives.


Using Data to Strategize for the Future

No matter your industry, predictive analytics can provide the insights needed to make your next move. Whether you’re driving financial decisions, formulating marketing strategies, changing your course of action, or working to save lives, building a foundation in analytical skills can serve you well.

For hands-on practice and a deeper understanding of how you can put analytics to work for your organization, consider taking Business Analytics, one of the three online courses that make up HBS Online’s CORe program.

Do you want to become a data-driven professional? Explore our eight-week Business Analytics course and our three-course Credential of Readiness (CORe) program to deepen your analytical skills and apply them to real-world business problems.




Ann Transl Med. 2020 Feb; 8(4).

Predictive analytics in the era of big data: opportunities and challenges

Big data have changed the way we generate, manage, analyze and leverage data in every industry. Clinical medicine is no exception: large volumes of data are generated from electronic healthcare records, wearable devices and insurance companies ( 1 ). This has greatly changed the way we perform clinical studies. Instead of data entry and curation being performed manually, information technology has significantly improved the efficiency of data management. With such a large volume of data, many clinical questions can be addressed using big data analytics ( 1 - 3 ). Three steps are typically involved in big data analytics ( Table 1 ). The first step is the formulation of a clinical question ( 4 ), which can be categorized into three types: (I) epidemiological questions on prevalence, incidence and risk factors; (II) the effectiveness and/or safety of an intervention; and (III) predictive analytics. The second step is the design of a study, which transforms the clinical question into a study design. For example, the prevalence of catheter-related blood stream infection (CRBSI), as well as its risk factors, can be addressed with a retrospective or prospective cohort study; a case-control design can also be used to identify risk factors. Effectiveness can be addressed by a randomized controlled trial or an observational study. The third step involves statistical analysis and/or modelling using data collected under the chosen design.

Among these big data analytics, predictive analytics is becoming increasingly important in clinical medicine ( 5 ). Its uses include, but are not limited to, risk stratification, differential diagnosis (classification), prognosis, prediction of disease occurrence and prediction of the effectiveness of a given intervention ( 6 - 8 ). In other words, predictive analytics spans the whole course of disease, from prevention through diagnosis and treatment to prognosis. For example, from the perspective of disease prevention, smoking is a strong risk factor for the development of lung cancer, so modifying this factor can help reduce lung cancer risk. If a patient is diagnosed with lung cancer, risk stratification using genetic and clinical features in a predictive model can help determine whether surgical intervention and/or chemotherapy should be used. Finally, accurate prediction of the long-term outcome is important for communication with family members and for medical decision making. The literature on clinical prediction has thus grown rapidly in recent years. Conventionally, predictors are entered into a generalized linear model to estimate a vector of coefficients, and the resulting model can be generalized to samples that were not used for training ( 9 ). However, the model training process is not straightforward, and no single approach fits all situations. For example, the generalized linear model is easy to interpret for a subject-matter audience, but it cannot automatically capture high-order relationships among covariates ( 10 ). In contrast, sophisticated neural networks and deep learning approaches are capable of modeling almost any mathematical function, but at the cost of interpretability: these models are considered black-box algorithms because domain experts cannot easily understand how the predictors/features influence the outcome/label ( 11 , 12 ).

In a recent special report published in the Annals of Translational Medicine , Zhou and colleagues provided a comprehensive tutorial on how to perform predictive modeling ( 13 ). There are 16 sections covering variable selection (feature engineering), model calibration, utility, and nomograms for ease of clinical application. They also discussed some challenging conditions, such as the presence of competing risks and the curse of dimensionality. Potential readers of this report include clinical investigators, physicians, and even statisticians. More importantly, the R code for each step of modeling is provided and well explained. For beginners with limited experience in R coding, this can be a good starting point.

However, I want to clarify that the authors have conflated parametric and non-parametric modeling in the first chart. First, consider the formal definition of parametric and non-parametric modeling from the textbook Artificial Intelligence: A Modern Approach ( 14 ), which states:

“ A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs. ”

By this definition, neural networks should be classified as a parametric modeling approach, because there is a fixed set of weights attached to the nodes of the network ( 15 ). In fact, a neural network with only one layer and an identity activation is simply a linear regression model, and the latter is the prototypical parametric model. The purpose of training a neural network is to estimate the weights and bias for each node; the weighted sum is then passed to the nodes of the next layer, usually through a non-linear activation function that transforms the signal. Other machine learning methods, such as k-nearest neighbors and decision trees, can safely be classified as non-parametric models.
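The point above can be made concrete with a single "neuron": a weighted sum plus a bias, passed through an activation function. With the identity activation it is exactly a linear-regression prediction; swapping in a sigmoid turns the same fixed parameter set into a logistic unit. The weights below are hypothetical, not trained.

```python
import math

def neuron(inputs, weights, bias, activation=lambda z: z):
    """Weighted sum plus bias, passed through an activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

x = [2.0, 3.0]
w = [0.5, -0.25]

# Identity activation: exactly a linear regression prediction.
print(neuron(x, w, bias=1.0))                         # → 1.25

# Sigmoid activation: the same parametric form becomes a logistic unit.
sigmoid = lambda z: 1 / (1 + math.exp(-z))
print(round(neuron(x, w, bias=1.0, activation=sigmoid), 3))
```

Either way the parameter count is fixed in advance, which is what makes the model parametric.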

Furthermore, in the prediction model evaluation branch, the authors classified drawing a nomogram and building prediction scores as part of the evaluation process. I have to argue that these two approaches do not evaluate the model at all. Risk scores and nomograms are simply presentations of trained prediction models so that they can be used in clinical practice ( 16 ); they have nothing to do with the calibration or discrimination of the model. A nomogram and/or risk score should be built after the final model has been confirmed by a variety of validation methods. Validation in the training set and in an external set cannot be considered conceptually parallel: external validation should be considered more robust for identifying overfitting than internal validation, no matter which internal procedure is used (there are many statistical methods for validating a model when only a single dataset is available, such as cross-validation, simple split and leave-one-out) ( 17 ).
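As a sketch of the internal validation procedures just mentioned, the snippet below generates k-fold train/test index splits with index arithmetic only (no modeling library), and shows that leave-one-out is simply k-fold with k equal to the sample size.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield sorted(train), sorted(test)

splits = list(k_fold_indices(6, 3))
print(splits[0])   # → ([1, 2, 4, 5], [0, 3])

# Leave-one-out is just k-fold with k equal to the sample size.
loo = list(k_fold_indices(4, 4))
print(len(loo))    # → 4
```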

In conclusion, this comprehensive tutorial is timely in the era of big data, as it provides practical tools for conducting predictive analytics. With more advanced information technology being applied to patients, large volumes of data can be collected with ease, and interest in leveraging big data to advance healthcare is growing. Predictive analytics is the cornerstone of precision medicine, in which patients with different clinical characteristics and genetic backgrounds are treated differently. Although there are many challenges in leveraging big data to advance healthcare ( 18 , 19 ), the opportunities are equally abundant.

Acknowledgments

Funding: Z Zhang received funding from Zhejiang Province Public Welfare Technology Application Research Project (CN) (LGF18H150005) and the National Natural Science Foundation of China (Grant No. 81901929).

Ethical Statement: The author is accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Provenance: This is an invited article commissioned by the Editorial Office, Annals of Translational Medicine .

Conflicts of Interest: The author has no conflicts of interest to declare.

Precision Health Analytics With Predictive Analytics and Implementation Research: JACC State-of-the-Art Review

  • PMID: 32674794
  • DOI: 10.1016/j.jacc.2020.05.043

Emerging data science techniques of predictive analytics expand the quality and quantity of complex data relevant to human health and provide opportunities for understanding and control of conditions such as heart, lung, blood, and sleep disorders. To realize these opportunities, the information sources, the data science tools that use the information, and the application of resulting analytics to health and health care issues will require implementation research methods to define benefits, harms, reach, and sustainability; and to understand related resource utilization implications to inform policymakers. This JACC State-of-the-Art Review is based on a workshop convened by the National Heart, Lung, and Blood Institute to explore predictive analytics in the context of implementation science. It highlights precision medicine and precision public health as complementary and compelling applications of predictive analytics, and addresses future research and training endeavors that might further foster the application of predictive analytics in clinical medicine and public health.

Keywords: exposome; genome; implementation research; predictive analytics; social determinants.

Copyright © 2020 American College of Cardiology Foundation. All rights reserved.


Grad Coach

Research Topics & Ideas: Data Science

50 Topic Ideas To Kickstart Your Research Project

Research topics and ideas about data science and big data analytics

If you’re just starting out exploring data science-related topics for your dissertation, thesis or research project, you’ve come to the right place. In this post, we’ll help kickstart your research by providing a hearty list of data science and analytics-related research ideas , including examples from recent studies.

PS – This is just the start…

We know it’s exciting to run through a list of research topics, but please keep in mind that this list is just a starting point. The topic ideas provided here are intentionally broad and generic, so you will need to develop them further. Nevertheless, they should inspire some ideas for your project.

To develop a suitable research topic, you’ll need to identify a clear and convincing research gap, along with a viable plan to fill it. If this sounds foreign to you, check out our free research topic webinar, which explores how to find and refine a high-quality research topic from scratch. Alternatively, consider our 1-on-1 coaching service.


Data Science-Related Research Topics

  • Developing machine learning models for real-time fraud detection in online transactions.
  • The use of big data analytics in predicting and managing urban traffic flow.
  • Investigating the effectiveness of data mining techniques in identifying early signs of mental health issues from social media usage.
  • The application of predictive analytics in personalizing cancer treatment plans.
  • Analyzing consumer behavior through big data to enhance retail marketing strategies.
  • The role of data science in optimizing renewable energy generation from wind farms.
  • Developing natural language processing algorithms for real-time news aggregation and summarization.
  • The application of big data in monitoring and predicting epidemic outbreaks.
  • Investigating the use of machine learning in automating credit scoring for microfinance.
  • The role of data analytics in improving patient care in telemedicine.
  • Developing AI-driven models for predictive maintenance in the manufacturing industry.
  • The use of big data analytics in enhancing cybersecurity threat intelligence.
  • Investigating the impact of sentiment analysis on brand reputation management.
  • The application of data science in optimizing logistics and supply chain operations.
  • Developing deep learning techniques for image recognition in medical diagnostics.
  • The role of big data in analyzing climate change impacts on agricultural productivity.
  • Investigating the use of data analytics in optimizing energy consumption in smart buildings.
  • The application of machine learning in detecting plagiarism in academic works.
  • Analyzing social media data for trends in political opinion and electoral predictions.
  • The role of big data in enhancing sports performance analytics.
  • Developing data-driven strategies for effective water resource management.
  • The use of big data in improving customer experience in the banking sector.
  • Investigating the application of data science in fraud detection in insurance claims.
  • The role of predictive analytics in financial market risk assessment.
  • Developing AI models for early detection of network vulnerabilities.


Data Science Research Ideas (Continued)

  • The application of big data in public transportation systems for route optimization.
  • Investigating the impact of big data analytics on e-commerce recommendation systems.
  • The use of data mining techniques in understanding consumer preferences in the entertainment industry.
  • Developing predictive models for real estate pricing and market trends.
  • The role of big data in tracking and managing environmental pollution.
  • Investigating the use of data analytics in improving airline operational efficiency.
  • The application of machine learning in optimizing pharmaceutical drug discovery.
  • Analyzing online customer reviews to inform product development in the tech industry.
  • The role of data science in crime prediction and prevention strategies.
  • Developing models for analyzing financial time series data for investment strategies.
  • The use of big data in assessing the impact of educational policies on student performance.
  • Investigating the effectiveness of data visualization techniques in business reporting.
  • The application of data analytics in human resource management and talent acquisition.
  • Developing algorithms for anomaly detection in network traffic data.
  • The role of machine learning in enhancing personalized online learning experiences.
  • Investigating the use of big data in urban planning and smart city development.
  • The application of predictive analytics in weather forecasting and disaster management.
  • Analyzing consumer data to drive innovations in the automotive industry.
  • The role of data science in optimizing content delivery networks for streaming services.
  • Developing machine learning models for automated text classification in legal documents.
  • The use of big data in tracking global supply chain disruptions.
  • Investigating the application of data analytics in personalized nutrition and fitness.
  • The role of big data in enhancing the accuracy of geological surveying for natural resource exploration.
  • Developing predictive models for customer churn in the telecommunications industry.
  • The application of data science in optimizing advertisement placement and reach.

Recent Data Science-Related Studies

While the ideas we’ve presented above are a decent starting point for finding a research topic, they are fairly generic. So, it helps to look at actual studies in the data science and analytics space to see how this all comes together in practice.

Below, we’ve included a selection of recent studies to help refine your thinking and show what a well-scoped research topic looks like.

  • Data Science in Healthcare: COVID-19 and Beyond (Hulsen, 2022)
  • Auto-ML Web-application for Automated Machine Learning Algorithm Training and evaluation (Mukherjee & Rao, 2022)
  • Survey on Statistics and ML in Data Science and Effect in Businesses (Reddy et al., 2022)
  • Visualization in Data Science VDS @ KDD 2022 (Plant et al., 2022)
  • An Essay on How Data Science Can Strengthen Business (Santos, 2023)
  • A Deep study of Data science related problems, application and machine learning algorithms utilized in Data science (Ranjani et al., 2022)
  • You Teach WHAT in Your Data Science Course?!? (Posner & Kerby-Helm, 2022)
  • Statistical Analysis for the Traffic Police Activity: Nashville, Tennessee, USA (Tufail & Gul, 2022)
  • Data Management and Visual Information Processing in Financial Organization using Machine Learning (Balamurugan et al., 2022)
  • A Proposal of an Interactive Web Application Tool QuickViz: To Automate Exploratory Data Analysis (Pitroda, 2022)
  • Applications of Data Science in Respective Engineering Domains (Rasool & Chaudhary, 2022)
  • Jupyter Notebooks for Introducing Data Science to Novice Users (Fruchart et al., 2022)
  • Towards a Systematic Review of Data Science Programs: Themes, Courses, and Ethics (Nellore & Zimmer, 2022)
  • Application of data science and bioinformatics in healthcare technologies (Veeranki & Varshney, 2022)
  • TAPS Responsibility Matrix: A tool for responsible data science by design (Urovi et al., 2023)
  • Data Detectives: A Data Science Program for Middle Grade Learners (Thompson & Irgens, 2022)
  • MACHINE LEARNING FOR NON-MAJORS: A WHITE BOX APPROACH (Mike & Hazzan, 2022)
  • COMPONENTS OF DATA SCIENCE AND ITS APPLICATIONS (Paul et al., 2022)
  • Analysis on the Application of Data Science in Business Analytics (Wang, 2022)

As you can see, these research topics are a lot more focused than the generic ideas we presented earlier. To develop a high-quality research topic of your own, you’ll need to zero in on a specific context with clearly defined variables of interest.

Get 1-On-1 Help

If you’re still unsure about how to find a quality research topic, check out our Research Topic Kickstarter service, which is the perfect starting point for developing a unique, well-justified research topic.




Predictive Analytics: Recently Published Documents


Predictive Analytics of Energy Usage by IoT-Based Smart Home Appliances for Green Urban Development

Green IoT primarily focuses on increasing IoT sustainability by reducing the large amount of energy required by IoT devices. Whether increasing the efficiency of these devices or conserving energy, predictive analytics is the cornerstone for creating value and insight from large IoT data. This work aims at providing predictive models driven by data collected from various sensors to model the energy usage of appliances in an IoT-based smart home environment. Specifically, we address the prediction problem from two perspectives. Firstly, an overall energy consumption model is developed using both linear and non-linear regression techniques to identify the most relevant features in predicting the energy consumption of appliances. The performances of the proposed models are assessed using a publicly available dataset comprising historical measurements from various humidity and temperature sensors, along with total energy consumption data from appliances in an IoT-based smart home setup. A comparison of the prediction results shows that LSTM regression outperforms the other linear and ensemble regression models, achieving a high coefficient of determination (R²) on both the training (96.2%) and test (96.1%) data for the selected features. Secondly, we develop a multi-step time-series model using the autoregressive integrated moving average (ARIMA) technique to effectively forecast future energy consumption based on past usage history. Overall, the proposed predictive models will enable consumers to minimize the energy usage of home appliances, and energy providers to better plan and forecast future energy demand to facilitate green urban development.
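The ARIMA technique mentioned above builds on an autoregressive core. As a minimal stdlib-only sketch (the readings and function names are invented for illustration, not taken from the study), a first-order autoregressive model can be fit by least squares and rolled forward to forecast future usage:

```python
# Minimal sketch of the autoregressive (AR) core behind ARIMA-style
# forecasting: fit x[t] ≈ phi * x[t-1] by least squares, then roll forward.
# A full ARIMA model adds differencing and moving-average terms.

def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1]."""
    num = sum(series[t - 1] * series[t] for t in range(1, len(series)))
    den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return num / den

def forecast(series, phi, steps):
    """Roll the fitted AR(1) model forward `steps` points."""
    out, last = [], series[-1]
    for _ in range(steps):
        last = phi * last
        out.append(last)
    return out

usage = [1.0, 2.0, 4.0, 8.0, 16.0]   # toy energy-usage readings
phi = fit_ar1(usage)                  # -> 2.0 for this doubling series
print(forecast(usage, phi, 2))        # -> [32.0, 64.0]
```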

Influence of AI and Machine Learning in Insurance Sector

The aim of this research is to identify the influence, usage, and benefits of AI (artificial intelligence) and ML (machine learning) using big data analytics in the insurance sector. The insurance sector is highly volatile, shaped by events such as Brexit, the COVID-19 pandemic, climate change, and volcanic disruptions. This research paper explores the potential scope and use cases for AI, ML, and big data processing in the insurance sector, including automated claim processing, fraud prevention, predictive analytics, and trend analysis of possible causes of business losses or gains. An empirical quantitative research method is used to verify the model with a sample from the UK insurance sector. This research concludes with practical insights for insurance companies using AI, ML, big data processing, and cloud computing to improve client satisfaction, predictive analysis, and trend detection.

Can HRM predict mental health crises? Using HR analytics to unpack the link between employment and suicidal thoughts and behaviors

Purpose: The aim of this research is to determine the extent to which the human resource (HR) function can screen and potentially predict suicidal employees and offer preventative mental health assistance.

Design/methodology/approach: Drawing from the 2019 National Survey of Drug Use and Health (N = 56,136), this paper employs multivariate binary logistic regression to model the work-related predictors of suicidal ideation, planning and attempts.

Findings: The results indicate that known periods of joblessness, the total number of sick days and absenteeism over the last 12 months are significantly associated with various suicidal outcomes while controlling for key psychosocial correlates. The results also indicate that employee assistance programs are associated with a significantly reduced likelihood of suicidal ideation. These findings are consistent with conservation of resources theory.

Research limitations/implications: This research demonstrates preliminarily that the HR function can unobtrusively detect employee mental health crises by collecting data on key predictors.

Originality/value: In the era of COVID-19, employers have a duty of care to safeguard employee mental health. To this end, the authors offer an innovative way through which the HR function can employ predictive analytics to address mental health crises before they result in tragedy.

An AI-Enabled Predictive Analytics Dashboard for Acute Neurosurgical Referrals

Abstract Healthcare dashboards make key information about service and clinical outcomes available to staff in an easy-to-understand format. Most dashboards are limited to providing insights based on group-level inference, rather than individual prediction. Here, we evaluate a dashboard which could analyze and forecast acute neurosurgical referrals based on 10,033 referrals made to a large volume tertiary neurosciences center in central London, U.K., from the start of the Covid-19 pandemic lockdown period until October 2021. As anticipated, referral volumes significantly increased in this period, largely due to an increase in spinal referrals. Applying a range of validated time-series forecasting methods, we found that referrals were projected to increase beyond this time-point. Using a mixed-methods approach, we determined that the dashboard was usable, feasible, and acceptable to key stakeholders. Dashboards provide an effective way of visualizing acute surgical referral data and for predicting future volume without the need for data-science expertise.

Price Bubbles in the Real Estate Markets - Analysis and Prediction

The article concerns the issue of price bubbles on the markets, with particular emphasis on the specificity of the real estate market. More than a decade after the subprime crisis, there is still no sufficiently accurate method to predict price movements, their culmination and, eventually, the burst of price and speculative bubbles on the markets. Hence, the main goal of the article is to present the possibility of early detection of price bubbles and their consequences from the point of view of the surveyed managers. The following research hypothesis was verified: price bubbles on the real estate market cannot be excluded, therefore constant monitoring and predictive analytics of this market are needed. In addition to standard research methods (desk research and statistical analysis), the authors conducted their own survey of a group of randomly selected managers from Portugal and Poland concerning their attitudes toward crises and price bubbles. The results show that managers in the two countries differ in how they relate the effects of price bubbles to the activities of their own companies, but are alike in that about 40% of respondents in each expect quick detection and deactivation of emerging bubbles by the government or central bank. Nearly 40% of Polish and Portuguese managers claimed that the consequences of crises must include increased responsibility of managers for their decisions, especially those leading to failures.

Covid-19 Impact and Implications on Traffic: Smart Predictive Analytics for Mobility Navigation

Empirical Study on Classifiers for Earlier Prediction of COVID-19 Infection Cure and Death Rate in the Indian States

Machine Learning methods can play a key role in predicting the spread of respiratory infection with the help of predictive analytics. Machine Learning techniques help mine data to better estimate and predict the COVID-19 infection status. A Fine-tuned Ensemble Classification approach for predicting the death and cure rates of patients from infection using Machine Learning techniques has been proposed for different states of India. The proposed classification model is applied to the recent COVID-19 dataset for India, and a performance evaluation of various state-of-the-art classifiers to the proposed model is performed. The classifiers forecasted the patients’ infection status in different regions to better plan resources and response care systems. The appropriate classification of the output class based on the extracted input features is essential to achieve accurate results of classifiers. The experimental outcome exhibits that the proposed Hybrid Model reached a maximum F1-score of 94% compared to Ensembles and other classifiers like Support Vector Machine, Decision Trees, and Gaussian Naïve Bayes on a dataset of 5004 instances through 10-fold cross-validation for predicting the right class. The feasibility of automated prediction for COVID-19 infection cure and death rates in the Indian states was demonstrated.
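The F1-score used to compare these classifiers is the harmonic mean of precision and recall. A minimal sketch, using invented labels rather than the paper's COVID-19 data:

```python
# Sketch of the F1-score metric: the harmonic mean of precision and
# recall, computed from predicted vs. true binary labels (toy values).

def f1_score(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
print(f1_score(y_true, y_pred))  # -> ~0.8 (precision 0.8, recall 0.8)
```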

People Analytics: Augmenting Horizon from Predictive Analytics to Prescriptive Analytics

Analytics techniques: descriptive analytics, predictive analytics, and prescriptive analytics.

Unlocking Drivers for Employee Engagement Through Human Resource Analytics

The authors have discussed in detail the meaning of employee engagement and its relevance for the organizations in the present scenario. The authors also highlighted the various factors that predict the employee engagement of the employees in the varied organizations. The authors have emphasized on the role that HR analytics can play to identify the reasons for low level of engagement among employees and suggesting ways to improve the same using predictive analytics. The authors have also advocated the benefits that organizations can reap by making use of HR analytics in measuring the engagement levels of the employees and improving the engagement levels of diverse workforce in the existing organizations. The authors have also proposed the future perspectives of the proposed study that help the organizations and officials from the top management to tap the benefits of analytics in the function of human resource management and to address the upcoming issues related to employee behavior.


Applying Predictive Analytics on Research Information to Enhance Funding Discovery and Strengthen Collaboration in Project Proposals

  • Conference paper
  • First Online: 11 May 2021


  • Dang Vu Nguyen Hai (ORCID: orcid.org/0000-0002-5496-3633)
  • Martin Gaedke (ORCID: orcid.org/0000-0002-6729-2912)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 12706))

Included in the following conference series:

  • International Conference on Web Engineering


In academic and industrial research, writing a project proposal is one of the essential but time-consuming activities. Nevertheless, most proposals end in rejection. Moreover, research funding is getting more competitive these days. Funding agencies are increasingly looking for more extensive and more interdisciplinary research proposals. To increase the funding success rate, this PhD project focuses on three open challenges: poor data quality, inefficient funding discovery, and ineffective collaborative team building. We envision a Predictive Analytics-based approach that involves analyzing research information and using statistical and machine learning models that can assure data quality, increase funding discovery efficiency, and improve the effectiveness of collaboration building. Accordingly, the goal of this PhD project is to support the decision-making process to maximize the funding success rates of universities.



Acknowledgements

This PhD project is supported by the project IB20 Fis Heavy/TU Chemnitz/259038, funded by the Saxon State Ministry for Science and Art. In addition, we would like to thank André Langer, Maik Benndorf and Sebastian Heil for their support during the writing process of this Symposium.

Author information

Authors and affiliations

Chemnitz University of Technology, Chemnitz, Germany

Dang Vu Nguyen Hai & Martin Gaedke


Corresponding author

Correspondence to Dang Vu Nguyen Hai.

Editor information

Editors and affiliations

Dipartimento di Elettronica, Politecnico di Milano, Milan, Italy

Marco Brambilla

E2S UPPA, LIUPPA, Université de Pau et des Pays de l’Adour, Anglet, France

Richard Chbeir

Econometric Institute, Erasmus University Rotterdam, Rotterdam, The Netherlands

Flavius Frasincar

Inria Saclay-Île-de-France, Institut Polytechnique de Paris, Palaiseau, France

Ioana Manolescu


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Vu Nguyen Hai, D., Gaedke, M. (2021). Applying Predictive Analytics on Research Information to Enhance Funding Discovery and Strengthen Collaboration in Project Proposals. In: Brambilla, M., Chbeir, R., Frasincar, F., Manolescu, I. (eds) Web Engineering. ICWE 2021. Lecture Notes in Computer Science(), vol 12706. Springer, Cham. https://doi.org/10.1007/978-3-030-74296-6_37


Print ISBN: 978-3-030-74295-9

Online ISBN: 978-3-030-74296-6




Predictive Analytics: Definition, Model Types, and Uses


Erika Rasure is globally recognized as a leading consumer economics subject matter expert, researcher, and educator. She is a financial therapist and transformational coach, with a special interest in helping women learn how to invest.



Predictive analytics is the use of statistics and modeling techniques to forecast future outcomes. Current and historical data patterns are examined and plotted to determine the likelihood that those patterns will repeat.

Businesses use predictive analytics to fine-tune their operations and decide whether new products are worth the investment. Investors use predictive analytics to decide where to put their money. Internet retailers use predictive analytics to fine-tune purchase recommendations to their users and increase sales.

Key Takeaways

  • Industries from insurance to marketing use predictive techniques to make important decisions.
  • Predictive models help make weather forecasts, develop video games, translate voice-to-text messages, make customer service decisions, and develop investment portfolios.
  • Predictive analytics determines a likely outcome based on an examination of current and historical data.
  • Decision trees, regression, and neural networks all are types of predictive models.
  • People often confuse predictive analytics with machine learning even though the two are different disciplines.

Understanding Predictive Analytics

Predictive analytics looks for past patterns to measure the likelihood that those patterns will reoccur. It draws on a series of techniques to make these determinations, including artificial intelligence (AI), data mining, machine learning, modeling, and statistics. For instance, data mining involves the analysis of large sets of data to detect patterns in it. Text analysis does the same using large blocks of text.

Predictive models are used for many applications, including weather forecasts, creating video games, translating voice to text, customer service, and investment portfolio strategies. All of these applications use descriptive statistical models of existing data to make predictions about future data.

Predictive analytics helps businesses manage inventory, develop marketing strategies, and forecast sales. It also helps businesses survive, especially in highly competitive industries such as health care and retail. Investors and financial professionals draw on this technology to help craft investment portfolios and reduce their overall risk potential.

These models determine relationships, patterns, and structures in data that are used to draw conclusions as to how changes in the underlying processes that generate the data will change the results. Predictive models build on these descriptive models and look at past data to determine the likelihood of certain future outcomes, given current conditions or a set of expected future conditions.

Uses of Predictive Analytics

Predictive analytics is a decision-making tool in many industries. Following are some examples.

Manufacturing

Forecasting is essential in manufacturing to optimize the use of resources in a supply chain. Critical spokes of the supply chain wheel, whether inventory management or the shop floor, require accurate forecasts to function.

Predictive modeling is often used to clean and optimize the quality of data used for such forecasts. Modeling ensures that more data can be ingested by the system, including from customer-facing operations, to ensure a more accurate forecast.

Credit

Credit scoring makes extensive use of predictive analytics. When a consumer or business applies for credit, data on the applicant's credit history and the credit record of borrowers with similar characteristics are used to predict the risk that the applicant might fail to repay any new credit that is approved.
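As a minimal sketch of this idea, a logistic regression can map a single applicant characteristic to a default probability. The feature (a debt ratio), the data, and the learning-rate settings below are invented for illustration, not drawn from any real scoring model:

```python
import math

# Toy credit-scoring sketch: logistic regression on one invented feature
# (debt ratio), trained by stochastic gradient descent on log-loss.

def train(xs, ys, lr=0.5, epochs=5000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x      # gradient of log-loss w.r.t. w
            b -= lr * (p - y)          # gradient of log-loss w.r.t. b
    return w, b

def default_probability(x, w, b):
    return 1 / (1 + math.exp(-(w * x + b)))

debt_ratio = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
defaulted  = [0,   0,   0,   1,   1,   1]
w, b = train(debt_ratio, defaulted)
print(default_probability(0.15, w, b))  # low predicted risk
print(default_probability(0.85, w, b))  # high predicted risk
```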

Underwriting

Data and predictive analytics play an important role in underwriting. Insurance companies examine applications for new policies to determine the likelihood of having to pay out for a future claim. The analysis is based on the current risk pool of similar policyholders as well as past events that have resulted in payouts.

Predictive models that compare applicant characteristics with data about past policyholders and claims are routinely used by actuaries.

Marketing

Marketing professionals planning a new campaign look at how consumers have reacted to the overall economy. They can use these shifts in demographics to determine if the current mix of products will entice consumers to make a purchase.

Stock Traders

Active traders look at a variety of historical metrics when deciding whether to buy a particular stock or other asset.

Moving averages, bands, and breakpoints all are based on historical data and are used to forecast future price movements.
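A simple moving average, one of the metrics mentioned above, can be computed in a few lines (the price series is illustrative):

```python
# Simple moving average (SMA): the mean of the last `window` prices,
# slid across the series. Toy prices, for illustration only.

def sma(prices, window):
    return [sum(prices[i - window:i]) / window
            for i in range(window, len(prices) + 1)]

prices = [10, 11, 12, 13, 14]
print(sma(prices, 3))  # -> [11.0, 12.0, 13.0]
```

Traders typically compare a short-window SMA against a long-window one; a crossover is read as a possible trend change.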

Fraud Detection

Financial services firms use predictive analytics to examine transactions for irregular trends and patterns. The irregularities pinpointed can then be examined as potential signs of fraudulent activity.

This may be done by analyzing activity between bank accounts or analyzing when certain transactions occur.
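One simple way to pinpoint such irregularities is a z-score rule: flag any transaction that deviates from the mean by more than a set number of standard deviations. A sketch with toy amounts (real fraud systems use far richer models and features):

```python
import statistics

# Flag transactions whose amount is more than `threshold` population
# standard deviations away from the mean. Amounts are toy values.

def flag_outliers(amounts, threshold=2.0):
    mean = statistics.mean(amounts)
    sd = statistics.pstdev(amounts)
    return [a for a in amounts if abs(a - mean) > threshold * sd]

amounts = [100, 102, 98, 101, 99, 500]
print(flag_outliers(amounts))  # -> [500]
```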

Supply Chain

Supply chain analytics is used to manage inventory levels and set pricing strategies. Supply chain predictive analytics use historical data and statistical models to forecast future supply chain performance, demand, and potential disruptions.

This helps businesses proactively identify and address risks, optimize resources and processes, and improve decision-making. Companies can forecast what materials should be on hand at any given moment and whether there will be any shortages.

Human Resources

Human resources uses predictive analytics to improve processes such as forecasting future workforce skill requirements and identifying factors that contribute to high staff turnover.

Predictive analytics can also analyze an employee's performance, skills, and preferences to predict their career progression and help with career development.

Predictive Analytics vs. Machine Learning

A common misconception is that predictive analytics and machine learning are the same. Predictive analytics helps us understand possible future occurrences by analyzing the past. At its core, predictive analytics encompasses a series of statistical techniques (including machine learning, predictive modeling, and data mining) and uses statistics (both historical and current) to estimate, or predict, future outcomes.

Thus, machine learning is one tool used within predictive analytics.

Machine learning is a subfield of computer science that means "the programming of a digital computer to behave in a way which, if done by human beings or animals, would be described as involving the process of learning." That's a 1959 definition by Arthur Samuel, a pioneer in computer gaming and artificial intelligence.

The most common predictive models include decision trees, regressions (linear and logistic), and neural networks, which underpin the emerging field of deep learning methods and technologies.

Types of Predictive Analytical Models

There are three common techniques used in predictive analytics: decision trees, neural networks, and regression.

Decision Trees

If you want to understand what leads to someone's decisions, you may find it useful to build a decision tree.

This type of model places data into different sections based on certain variables, such as price or market capitalization. Just as the name implies, it looks like a tree with individual branches and leaves. Branches indicate the choices available while individual leaves represent a particular decision.

Decision trees are easy to understand and dissect. They're useful when you need to make a decision quickly.
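To make the branch-and-leaf idea concrete, here is a hand-written toy tree in Python. The variables (price, market capitalization) follow the example above, but the thresholds and labels are invented for illustration and are not investment advice:

```python
# Branches are tests on variables; leaves are the final decisions.
def classify_stock(price, market_cap):
    if market_cap >= 10_000_000_000:   # branch: is it a large-cap stock?
        if price < 50:                 # branch: is the price low?
            return "buy"               # leaf
        return "hold"                  # leaf
    if price < 20:                     # branch: cheap small-cap?
        return "watch"                 # leaf
    return "avoid"                     # leaf

print(classify_stock(price=45, market_cap=25_000_000_000))  # → buy
```

Models like this are easy to read aloud as a series of questions, which is why decision trees are so easy to dissect.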

Regression

This is the model used most often in statistical analysis. Use it when you want to decipher patterns in large sets of data and when there's a linear relationship between the inputs.

This method works by figuring out a formula, which represents the relationship between all the inputs found in the dataset.

For example, you can use regression to figure out how price and other key factors can shape the performance of a stock.
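The "formula" a simple one-input regression finds can be sketched with ordinary least squares in plain Python; the data points below are fabricated for illustration:

```python
# Ordinary least squares for one input: find the slope and intercept
# that best relate an input (e.g., a market factor) to an output
# (e.g., a stock's return).
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Fabricated (input, output) observations.
slope, intercept = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(slope, intercept)
```

The fitted line `y = slope * x + intercept` is the formula that "represents the relationship between all the inputs found in the dataset."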

Neural Networks

Neural networks were developed as a form of predictive analytics by imitating the way the human brain works. This model can deal with complex data relationships using artificial intelligence and pattern recognition.

Use this method if you face any of several hurdles. For example, you may have too much data on hand, lack the formula you need to find a relationship between the inputs and outputs in your dataset, or need to make predictions rather than come up with explanations.

If you've already used decision trees and regression as models, you can confirm your findings with neural networks.
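A full neural network is beyond a short example, but its basic building block, a single logistic neuron trained by gradient descent, can be sketched in plain Python. The AND pattern, learning rate, and iteration count below are illustrative choices, not a prescription:

```python
import math
import random

# One artificial neuron (a logistic unit) learning the AND pattern
# from four labeled examples via gradient descent.
random.seed(0)
w1, w2, b = random.random(), random.random(), 0.0
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def predict(x1, x2):
    return 1 / (1 + math.exp(-(w1 * x1 + w2 * x2 + b)))

for _ in range(5000):
    for (x1, x2), target in data:
        err = predict(x1, x2) - target   # cross-entropy gradient term
        w1 -= 0.5 * err * x1             # 0.5 is the learning rate
        w2 -= 0.5 * err * x2
        b -= 0.5 * err

print([round(predict(x1, x2)) for (x1, x2), _ in data])
```

Real networks stack many such units in layers, which is what lets them capture the complex, non-linear relationships mentioned above.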

Cluster Models

Clustering is a method of aggregating data that share similar attributes. For example, Amazon.com can cluster sales based on the quantity purchased, or on the average account age of its consumers.

By separating data into similar groups based on shared features, analysts may be able to identify other characteristics that define future activity.
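A minimal sketch of the idea is a two-cluster k-means on a single feature; the order quantities below are invented for illustration:

```python
# Two-cluster k-means on one feature (order quantity): records are
# repeatedly assigned to the nearest center, and centers are recomputed
# as group means. Assumes both groups stay non-empty, which holds for
# this toy data because the starting centers are the min and max.
def kmeans_1d(values, iters=20):
    c1, c2 = min(values), max(values)
    for _ in range(iters):
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        c1 = sum(g1) / len(g1)
        c2 = sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

quantities = [1, 2, 2, 3, 40, 42, 45]
print(kmeans_1d(quantities))  # → ([1, 2, 2, 3], [40, 42, 45])
```

Real clustering runs on many features at once, but the assign-then-recompute loop is the same.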

Time Series Modeling

In some cases, data relates to time, and certain predictive models rely on the relationship between events and when they occur. These types of models assess inputs at specific frequencies, such as daily, weekly, or monthly iterations.

Then, analytical models can seek seasonality, trends, or behavioral patterns based on timing.

This type of predictive model is useful to predict when peak customer service periods are needed or when specific sales can be expected to jump.
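One simple way to surface such seasonality is to average the same period across cycles, for instance to find which weekday tends to need the most support staff. The call volumes below are invented for illustration:

```python
# Average the same weekday across two weeks of call volumes to find
# the peak customer service period.
calls = [120, 95, 90, 88, 130, 60, 45,   # week 1 (Mon..Sun)
         118, 99, 92, 85, 135, 58, 50]   # week 2 (Mon..Sun)

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
averages = {
    day: (calls[i] + calls[i + 7]) / 2 for i, day in enumerate(days)
}
peak = max(averages, key=averages.get)
print(peak, averages[peak])  # → Fri 132.5
```

Production time-series models add trend and holiday effects on top of this basic averaging idea.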

How Businesses Can Use Predictive Analytics

As noted above, predictive analysis can be used in a number of different applications. Businesses can capitalize on models to help advance their interests and improve their operations. Predictive models are frequently used by businesses to help improve customer service and outreach.

Executives and business owners can take advantage of this kind of statistical analysis to determine customer behavior. For instance, the owner of a business can use predictive techniques to identify and target regular customers who might otherwise defect to a competitor.

Predictive analytics plays a key role in advertising and marketing. Companies can use models to determine which customers are likely to respond positively to marketing and sales campaigns. Business owners can save money by targeting customers who will respond positively rather than running blanket campaigns.

Benefits of Predictive Analytics

As mentioned above, predictive analytics can help anticipate outcomes when there are no obvious answers available.

Investors, financial professionals, and business leaders use models to help reduce risk. For instance, an investor or an advisor can use models to help craft an investment portfolio with an appropriate level of risk, considering factors such as age, family responsibilities, and goals.

Businesses use them to keep their costs down. They can determine the likelihood of success or failure of a product before it is developed. Or they can set aside capital for production improvements before the manufacturing process begins.

Criticism of Predictive Analytics

The use of predictive analytics has been criticized and, in some cases, legally restricted due to perceived inequities in its outcomes. Most commonly, this involves predictive models that result in statistical discrimination against racial or ethnic groups in areas such as credit scoring, home lending, employment, or risk of criminal behavior.

A famous example of this is the now illegal practice of redlining in home lending by banks. Regardless of the accuracy of the predictions, their use is discouraged as they perpetuate discriminatory lending practices and contribute to the decline of redlined neighborhoods.

How Does Netflix Use Predictive Analytics?

Data collection is important to a company like Netflix. It collects data from its customers based on their behavior and past viewing patterns. It uses that information to make recommendations based on their preferences.

This is the basis of the "Because you watched..." lists you'll find on the site. Other sites, notably Amazon, use their data for "Others who bought this also bought..." lists.

What Are the 3 Pillars of Data Analytics?

The three pillars of data analytics are the needs of the entity that is using the model, the data and technology used to study it, and the actions and insights that result from the analysis.

What Is Predictive Analytics Good For?

Predictive analytics is good for forecasting, risk management, customer behavior analytics, fraud detection, and operational optimization. Predictive analytics can help organizations improve decision-making, optimize processes, and increase efficiency and profitability. This branch of analytics is used to leverage data to forecast what may happen in the future.

What Is the Best Model for Predictive Analytics?

The best model for predictive analytics depends on several factors, such as the type of data, the objective of the analysis, the complexity of the problem, and the desired accuracy of the results. Depending on these factors, the best choice may be linear regression, a neural network, clustering, or a decision tree.

The goal of predictive analytics is to make predictions about future events, then use those predictions to improve decision-making. Predictive analytics is used in a variety of industries including finance, healthcare, marketing, and retail. Different methods are used in predictive analytics such as regression analysis, decision trees, or neural networks.

Predictive Analytics Today. "What Is Predictive Analysis?"

IBM. "Predictive Analytics."

Global Newswire. "Trends in Predictive Analytics Market Size & Share Will Reach $10.95 Billion by 2022."

PwC. "Big Data: Innovation in Investing."

Samuel, Arthur. "Some Studies in Machine Learning Using the Game of Checkers." IBM Journal of Research and Development, vol. 3, no. 3, July 1959, pp. 210-229.

SAS. "Predictive Analysis."

Logi Analytics. "What Is Predictive Analysis?"

Utreee. "What Is Predictive Analytics, Its Benefits and Challenges?"

18 Great Articles About Predictive Analytics

Vincent Granville

  • March 13, 2018 at 1:30 pm

This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, Hadoop, decision trees, ensembles, correlation, outliers, Python, R, TensorFlow, SVM, data reduction, feature selection, experimental design, time series, cross-validation, model fitting, dataviz, AI and many more. To keep receiving these articles, sign up on DSC.

  • Differences between Data Mining and Predictive Analytics  
  • Automated Predictive Analytics – What Could Possibly Go Wrong?  +
  • Predictive Analytics in the Supply Chain  
  • Predictive Analytics Goes to College – to Predict Student Success  
  • Hype Cycle History on Predictive Analytics  
  • Predictive Analytics for Beginners  
  • An Intro to Predictive Analytics: Can I predict the future?  
  • Prescriptive versus Predictive Analytics  
  • The Ultimate Guide for Choosing Algorithms for Predictive Modeling  
  • Unraveling Real-Time Predictive Analytics  
  • Financial Firms Embrace Predictive Analytics  
  • Is Predictive Analytics Mainstream?  
  • What is Predictive Analytics?  
  • Predictive Analytics Demystified  
  • Predictive Analytics in Excel  
  • Interpreting Predictive Analytics with a Grain of Salt  
  • Predictive Analytics Strategy  
  • Predictive Analytics and Sensor Data  


Source for picture: article flagged with a +


  • Survey Paper
  • Open access
  • Published: 25 July 2020

Predictive big data analytics for supply chain demand forecasting: methods, applications, and research opportunities

  • Mahya Seyedan
  • Fereshteh Mafakheri (ORCID: orcid.org/0000-0002-7991-4635)

Journal of Big Data, volume 7, article number 53 (2020)


Big data analytics (BDA) in supply chain management (SCM) is receiving growing attention because BDA has a wide range of applications in SCM, including customer behavior analysis, trend analysis, and demand prediction. In this survey, we investigate predictive BDA applications in supply chain demand forecasting to propose a classification of these applications, identify the gaps, and provide insights for future research. We classify these algorithms and their applications in supply chain management into time-series forecasting, clustering, K-nearest-neighbors, neural networks, regression analysis, support vector machines, and support vector regression. This survey also points to the fact that the literature is particularly lacking on the applications of BDA for demand forecasting in the case of closed-loop supply chains (CLSCs) and accordingly highlights avenues for future research.

Introduction

Nowadays, businesses adopt ever-increasing precision marketing efforts to remain competitive and to maintain or grow their margin of profit. As such, forecasting models have been widely applied in precision marketing to understand and fulfill customer needs and expectations [ 1 ]. In doing so, there is growing attention to the analysis of consumption behavior and preferences, using forecasts obtained from customer data and transaction records, in order to manage product supply chains (SC) accordingly [ 2 , 3 ].

Supply chain management (SCM) focuses on the flow of goods, services, and information from points of origin to customers through a chain of entities and activities that are connected to one another [ 4 ]. In typical SCM problems, it is assumed that capacity, demand, and cost are known parameters [ 5 ]. However, this is not the case in reality, as there are uncertainties arising from variations in customers’ demand, supplies transportation, organizational risks, and lead times. Demand uncertainties, in particular, have the greatest influence on SC performance, with widespread effects on production scheduling, inventory planning, and transportation [ 6 ]. In this sense, demand forecasting is a key approach in addressing uncertainties in supply chains [ 7 , 8 , 9 ].

A variety of statistical analysis techniques have been used for demand forecasting in SCM including time-series analysis and regression analysis [ 10 ]. With the advancements in information technologies and improved computational efficiencies, big data analytics (BDA) has emerged as a means of arriving at more precise predictions that better reflect customer needs, facilitate assessment of SC performance, improve the efficiency of SC, reduce reaction time, and support SC risk assessment [ 11 ].

The focus of this meta-research (literature review) paper is on “demand forecasting” in supply chains. The characteristics of demand data in today’s ever-expanding and sporadic global supply chains make the adoption of big data analytics (and machine learning) approaches a necessity for demand forecasting. The digitization of supply chains [ 12 ] and the incorporation of Blockchain technologies [ 13 ] for better tracking of supply chains further highlight the role of big data analytics. Supply chain data is high-dimensional, generated across many points in the chain for varied purposes (products, supplier capacities, orders, shipments, customers, retailers, etc.), in high volumes due to the plurality of suppliers, products, and customers, and in high velocity, reflected by the many transactions continuously processed across supply chain networks. Given such complexities, there has been a departure from conventional (statistical) demand forecasting approaches that work by identifying statistically meaningful trends (characterized by mean and variance attributes) across historical data [ 14 ], towards intelligent forecasts that can learn from historical data and intelligently evolve to predict the ever-changing demand in supply chains [ 15 ]. This capability is established using big data analytics techniques that extract forecasting rules by discovering the underlying relationships among demand data across supply chain networks [ 16 ]. These techniques are computationally intensive to process and require complex machine-programmed algorithms [ 17 ].

With SCM efforts aiming at satisfying customer demand while minimizing the total cost of supply, applying machine-learning/data analytics algorithms could facilitate precise (data-driven) demand forecasts and align supply chain activities with these predictions to improve efficiency and satisfaction. Reflecting on these opportunities, in this paper, first a taxonomy of data sources in SCM is proposed. Then, the importance of demand management in SCs is investigated. A meta-research (literature review) on BDA applications in SC demand forecasting is explored according to categories of the algorithms utilized. This review paves the path to a critical discussion of BDA applications in SCM, highlighting a number of key findings and summarizing the existing challenges and gaps in BDA applications for demand forecasting in SCs. On that basis, the paper concludes by presenting a number of avenues for future research.

Data in supply chains

Data in the context of supply chains can be categorized into customer, shipping, delivery, order, sale, store, and product data [ 18 ]. Figure  1 provides the taxonomy of supply chain data. As such, SC data originates from different (and segmented) sources such as sales, inventory, manufacturing, warehousing, and transportation. In this sense, competition, price volatilities, technological development, and varying customer commitments could lead to underestimation or overestimation of demand in established forecasts [ 19 ]. Therefore, to increase the precision of demand forecast, supply chain data shall be carefully analyzed to enhance knowledge about market trends, customer behavior, suppliers and technologies. Extracting trends and patterns from such data and using them to improve accuracy of future predictions can help minimize supply chain costs [ 20 , 21 ].

Figure 1: Taxonomy of supply chain data

Analysis of supply chain data has become a complex task due to (1) the increasing multiplicity of SC entities, (2) the growing diversity of SC configurations depending on the homogeneity or heterogeneity of products, (3) interdependencies among these entities, (4) uncertainties in the dynamical behavior of these components, (5) lack of information as it relates to SC entities [ 11 ], (6) networked manufacturing/production entities, due to their increasing coordination and cooperation to achieve a high level of customization and adaptation to varying customers’ needs [ 22 ], and finally (7) the increasing adoption of supply chain digitization practices (and use of Blockchain technologies) to track the activities across supply chains [ 12 , 13 ].

Big data analytics (BDA) has been increasingly applied in the management of SCs [ 23 ]: for procurement management (e.g., supplier selection [ 24 ], sourcing cost improvement [ 25 ], sourcing risk management [ 26 ]), product research and development [ 27 ], production planning and control [ 28 ], quality management [ 29 ], maintenance and diagnosis [ 30 ], warehousing [ 31 ], order picking [ 32 ], inventory control [ 33 ], logistics/transportation (e.g., intelligent transportation systems [ 34 ], logistics planning [ 35 ], in-transit inventory management [ 36 ]), and demand management (e.g., demand forecasting [ 37 ], demand sensing [ 38 ], and demand shaping [ 39 ]). A key application of BDA in SCM is to provide accurate forecasting, especially demand forecasting, with the aim of reducing the bullwhip effect [ 14 , 40 , 41 , 42 ].

Big data is defined as high-volume, high-velocity, high-variety, high-value, and high-veracity data requiring innovative forms of information processing that enable enhanced insights, decision making, and process automation [ 43 ]. Volume refers to the extensive size of data collected from multiple sources (spatial dimension) and over an extended period of time (temporal dimension) in SCs. For example, in the case of freight data, we have ERP/WMS order and item-level data, tracking data, and freight invoice data. These data are generated from sensors, bar codes, enterprise resource planning (ERP), and database technologies. Velocity can be defined as the rate of generation and delivery of specific data; in other words, it refers to the speed of data collection, the reliability of data transfer, the efficiency of data storage, and the speed of discovering useful knowledge as it relates to decision-making models and algorithms. Variety refers to generating varied types of data from diverse sources such as the Internet of Things (IoT), mobile devices, online social networks, and so on. For instance, the vast data from SCM are usually variable due to the diverse sources and heterogeneous formats, particularly resulting from the use of various sensors in manufacturing sites, highways, retailer shops, and facilitated warehouses. Value refers to the nature of the data that must be discovered to support decision-making. It is the most important, yet the most elusive, of the five Vs. Veracity refers to the quality of data, which must be accurate and trustworthy, with the knowledge that uncertainty and unreliability may exist in many data sources. Veracity deals with conformity and accuracy of data. Data should be integrated from disparate sources and formats, filtered, and validated [ 23 , 44 , 45 ]. In summary, big data analytics techniques can deal with collections of large and complex datasets that are difficult to process and analyze using traditional techniques [ 46 ].

The literature points to multiple sources of big data across the supply chains, with varied trade-offs among volume, velocity, variety, value, and veracity attributes [ 47 ]. We have summarized these sources and trade-offs in Table  1 . Although demand forecasts in supply chains occupy the lower bounds of volume, velocity, and variety, these forecasts can use data from all sources across the supply chains, from low-volume/variety/velocity on-the-shelf inventory reports to high-volume/variety/velocity supply chain tracking information provided through IoT. This combination of data sources used in SC demand forecasts, with their diverse temporal and spatial attributes, places a greater emphasis on the use of big data analytics in supply chains in general, and in demand forecasting efforts in particular.

The big data analytics applications in supply chain demand forecasting have been reported in both categories of supervised and unsupervised learning. In supervised learning, data will be associated with labels, meaning that the inputs and outputs are known. The supervised learning algorithms identify the underlying relationships between the inputs and outputs in an effort to map the inputs to corresponding outputs given a new unlabeled dataset [ 48 ]. For example, in case of a supervised learning model for demand forecasting, future demand can be predicted based on the historical data on product demand [ 41 ]. In unsupervised learning, data are unlabeled (i.e. unknown output), and the BDA algorithms try to find the underlying patterns among unlabeled data [ 48 ] by analyzing the inputs and their interrelationships. Customer segmentation is an example of unsupervised learning in supply chains that clusters different groups of customers based on their similarity [ 49 ]. Many machine-learning/data analytics algorithms can facilitate both supervised learning (extracting the input–output relationships) and unsupervised learning (extracting inputs, outputs and their relationships) [ 41 ].
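The supervised/unsupervised distinction can be sketched with two toy examples in Python; the demand values and labels are invented, and real BDA pipelines are far richer:

```python
# Supervised: labeled (demand, label) pairs let us map a new input to
# an output -- here with a 1-nearest-neighbor rule.
labeled = [(5, "low"), (8, "low"), (40, "high"), (55, "high")]

def predict_label(x):
    return min(labeled, key=lambda p: abs(p[0] - x))[1]

# Unsupervised: no labels, so we only group unlabeled inputs by
# similarity -- here by splitting at the midpoint of the range.
def group(values):
    mid = (min(values) + max(values)) / 2
    return [v for v in values if v < mid], [v for v in values if v >= mid]

print(predict_label(47))      # → high
print(group([5, 8, 40, 55]))  # → ([5, 8], [40, 55])
```

The supervised function needs the known outputs to learn from; the unsupervised one discovers structure (here, two demand regimes) without them, as in customer segmentation.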

Demand management in supply chains

The term “demand management” emerged in practice in the late 1980s and early 1990s. Traditionally, there are two approaches to demand management: a forward approach, which looks at potential demand over the next several years, and a backward approach, which relies on past or ongoing capabilities in responding to demand [ 50 ].

In forward demand management, the focus will be on demand forecasting and planning, data management, and marketing strategies. Demand forecasting and planning refer to predicting the quantities and timings of customers’ requests. Such predictions aim at achieving customers’ satisfaction by meeting their needs in a timely manner [ 51 ]. Accurate demand forecasting could improve the efficiency and robustness of production processes (and the associated supply chains) as the resources will be aligned with requirements leading to reduction of inventories and wastes [ 52 , 53 ].

In light of the above facts, there are many approaches proposed in the literature and in practice for demand forecasting and planning. Spreadsheet models, statistical methods (like moving averages), and benchmark-based judgments are among these approaches. Today, the most widely used demand forecasting and planning tool is Excel. The most widespread problem with spreadsheet models used for demand forecasting is that they are not scalable to large-scale data. In addition, the complexities and uncertainties in SCM (with multiplicity and variability of demand and supply) cannot be extracted, analyzed, and addressed through simple statistical methods such as moving averages or exponential smoothing [ 50 ]. During the past decade, traditional solutions for SC demand forecasting and planning have faced many difficulties in driving costs down and reducing inventories [ 50 ]. Although, in some cases, the suggested solutions have improved days payable, they have pushed up SC costs as a burden to suppliers.
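For reference, simple exponential smoothing, one of the basic statistical methods mentioned above, takes only a few lines; the demand series below is invented for illustration:

```python
# Simple exponential smoothing: each forecast is a weighted blend of
# the latest observation and the previous forecast. The smoothing
# factor alpha (0.3 here) is an illustrative choice.
def exp_smooth(demand, alpha=0.3):
    forecast = demand[0]
    for d in demand[1:]:
        forecast = alpha * d + (1 - alpha) * forecast
    return forecast

print(exp_smooth([100, 110, 105, 120, 115]))
```

The limitation the text describes is visible here: the method tracks one level with one parameter and cannot capture the multiplicity and variability of real SC demand and supply.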

The era of big data and high computing analytics has enabled data processing at a large scale that is efficient, fast, easy, and with reduced concerns about data storage and collection due to cloud services. The emergence of new technologies in data storage and analytics and the abundance of quality data have created new opportunities for data-driven demand forecasting and planning. Demand forecast accuracy can be significantly improved with data-mining algorithms and tools that can sift through data, analyze the results, and learn about the relationships involved. This could lead to highly accurate demand forecasting models that learn from data and are scalable for application in SCM. In the following section, a review of BDA applications in SCM is presented. These applications are categorized based on the employed techniques in establishing the data-drive demand forecasts.

BDA for demand forecasting in SCM

This survey aims at reviewing the articles published in the area of demand and sales forecasting in SC in the presence of big data to provide a classification of the literature based on algorithms utilized as well as a survey of applications. To the best of our knowledge, no comprehensive review of the literature specifically on SC demand forecasting has been conducted with a focus on classification of techniques of data analytics and machine learning. In doing so, we performed a thorough search of the existing literature, through Scopus, Google Scholar, and Elsevier, with publication dates ranging from 2005 to 2019. The keywords used for the search were supply chain, demand forecasting, sales forecasting, big data analytics, and machine learning.

Figure  2 shows the trend analysis of publications in demand forecasting for SCs from 2005 to 2019. There is a steadily increasing trend in the number of publications over this period, and such growth is expected to continue in 2020. Reviewing the past 15 years of research on big data analysis/machine learning applications in SC demand forecasting, we identified 64 research papers (excluding books, book chapters, and review papers) and categorized them with respect to the methodologies adopted for demand forecasting. The five most frequently used techniques are listed in Table  2 , which includes the “Neural Network,” “Regression”, “Time-series forecasting (ARIMA)”, “Support Vector Machine”, and “Decision Tree” methods. This table implies the growing use of big data analysis techniques in SC demand forecasting. It shall be mentioned that a few articles used several of these techniques.

Figure 2: Distribution of literature in supply chain demand forecasting from 2005 to 2019

It shall be mentioned that there are literature review papers exploring the use of big data analytics in SCM [ 10 , 16 , 23 , 54 , 55 , 56 , 57 , 58 , 59 , 60 , 61 , 62 , 63 , 64 , 65 , 66 , 67 ]. However, this study focuses on the specific topic of “demand forecasting” in SCM to explore BDA applications in line with this particular subtopic in SCM.

As Hofmann and Rutschmann [ 58 ] indicated in their literature review, the key questions to answer are why, what and how big data analytics/machine-learning algorithms could enhance forecasts’ accuracy in comparison to conventional statistical forecasting approaches.

Conventional methods have faced a number of limitations for demand forecasting in the context of SCs. There are many parameters influencing demand in supply chains; however, many of them were not captured in studies using conventional methods, for the sake of simplicity. In this regard, the forecasts could only provide a partial understanding of demand variations in supply chains. In addition, the unexplained demand variations could simply be treated as statistical noise. Conventional approaches could provide shorter processing times in exchange for a compromise on the robustness and accuracy of predictions. Conventional SC demand forecasting is mostly done manually, with high reliance on the planner’s skills and domain knowledge. It would be worthwhile to fully automate the forecasting process to reduce such dependency [ 58 ]. Finally, data-driven techniques could learn to incorporate non-linear behaviors and could thus provide better approximations in demand forecasting compared to conventional methods that are mostly derived from linear models. There is a significant level of non-linearity in demand behavior in SCs, particularly due to competition among suppliers, the bullwhip effect, and mismatch between supply and demand [ 40 ].

To extract valuable knowledge from a vast amount of data, BDA is used as an advanced analytics technique to obtain the data needed for decision-making. Reduced operational costs, improved SC agility, and increased customer satisfaction are mentioned among the benefits of applying BDA in SCM [ 68 ]. Researchers used various BDA techniques and algorithms in SCM context, such as classification, scenario analysis, and optimization [ 23 ]. Machine-learning techniques have been used to forecast demand in SCs, subject to uncertainties in prices, markets, competitors, and customer behaviors, in order to manage SCs in a more efficient and profitable manner [ 40 ].

BDA has been applied in all stages of supply chains, including procurement, warehousing, logistics/transportation, manufacturing, and sales management. BDA consists of descriptive analytics, predictive analytics, and prescriptive analytics. Descriptive analysis is defined as describing and categorizing what happened in the past. Predictive analytics are used to predict future events and discover predictive patterns within data by using mathematical algorithms such as data mining, web mining, and text mining. Prescriptive analytics apply data and mathematical algorithms for decision-making. Multi-criteria decision-making, optimization, and simulation are among the prescriptive analytics tools that help to improve the accuracy of forecasting [ 10 ].

Predictive analytics are the ones mostly utilized in SC demand and procurement forecasting [ 23 ]. In this sense, in the following subsections, we will review various predictive big data analytics approaches, presented in the literature for demand forecasting in SCM, categorized based on the employed data analytics/machine learning technique/algorithm, with elaborations of their purpose and applications (summarized in Table  3 ).

Time-series forecasting

Time series are methodologies for mining complex, sequential data types. Time-series data consist of long sequences of numeric values recorded at equal time intervals (e.g., per minute, per hour, or per day). Many natural and human-made processes, such as stock markets, medical diagnosis, or natural phenomena, can generate time-series data [ 48 ].

In demand forecasting using time series, demand is recorded over time at equally sized intervals [69, 70]. Combinations of time-series methods with product or market features have attracted much attention in demand forecasting with BDA. Ma et al. [71] proposed and developed a demand trend-mining algorithm for predictive life-cycle design. Their method combined three models: (a) a decision-tree model for large-scale historical-data classification, (b) a discrete choice analysis for present and past demand modeling, and (c) an automated time-series forecasting model for future trend analysis. They tested and applied this three-level approach to smartphone design, manufacturing, and remanufacturing.

A time-series approach was used to forecast search traffic (service demand) subject to changes in consumer attitudes [37]. Demand forecasting has also been achieved through time-series models using exponential smoothing with covariates (ESCov) to provide short-term, mid-term, and long-term demand trend predictions in chemical-industry SCs [7]. In addition, Hamiche et al. [72] used a customer-responsive time-series approach for SC demand forecasting.

For perishable products with short life cycles, accurate short-term forecasting is extremely critical. Da Veiga et al. [73] forecasted the demand for a group of perishable dairy products using autoregressive integrated moving average (ARIMA) and Holt-Winters (HW) models. The results were compared based on the mean absolute percentage error (MAPE) and the Theil inequality index (U-Theil). The HW model showed better goodness-of-fit on both performance metrics.

With ARIMA, prediction accuracy can diminish where there is a high level of uncertainty in the future patterns of parameters [42, 74, 75, 76]. HW forecasting can yield better accuracy than ARIMA [73], and HW is simple and easy to use. However, the data horizon cannot be larger than one seasonal cycle; otherwise, forecast accuracy decreases sharply, because the inputs of an HW model are themselves predicted values subject to longer-term inaccuracies and uncertainties [45, 73].
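To make the HW recursion concrete, the following is a minimal sketch of additive Holt-Winters smoothing; the demand series, seasonal length, and smoothing parameters (alpha, beta, gamma) are hypothetical illustration values, not figures from the studies cited above.

```python
def holt_winters_additive(series, season_len, alpha=0.3, beta=0.1, gamma=0.2, horizon=4):
    """Additive Holt-Winters: maintain level, trend, and seasonal components."""
    # Initialize level, trend, and seasonal indices from the first two seasons.
    level = sum(series[:season_len]) / season_len
    trend = (sum(series[season_len:2 * season_len]) -
             sum(series[:season_len])) / season_len ** 2
    seasonal = [series[i] - level for i in range(season_len)]

    for t, y in enumerate(series):
        s = seasonal[t % season_len]
        last_level = level
        # Each component is an exponentially weighted update of its estimate.
        level = alpha * (y - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[t % season_len] = gamma * (y - level) + (1 - gamma) * s

    # Forecast h steps ahead by extrapolating level + trend + seasonal index.
    return [level + (h + 1) * trend + seasonal[(len(series) + h) % season_len]
            for h in range(horizon)]

# Synthetic demand with a 4-period seasonal cycle and a mild upward trend.
demand = [100, 120, 140, 110, 105, 126, 146, 115, 111, 131, 152, 121]
forecast = holt_winters_additive(demand, season_len=4)
```

Note how the forecast inputs are themselves smoothed estimates, which is the source of the longer-horizon inaccuracy discussed above.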

Clustering analysis

Clustering analysis is a data-analysis approach that partitions a group of data objects into subgroups based on their similarities. Applications of clustering analysis have been reported in business analytics, pattern recognition, and web development [48]. Han et al. [48] emphasized that, using clustering, customers can be organized into groups (clusters) such that customers within a group present similar characteristics.

A key target of demand forecasting is to identify the demand behavior of customers. Extracting similar behaviors from historical data leads to the recognition of customer clusters or segments. Clustering algorithms such as K-means, self-organizing maps (SOMs), and fuzzy clustering have been used to segment customers with similar behavior. Clustering enhances the accuracy of SC demand forecasting, as the predictions are established for each segment of similar customers. As a limitation, clustering methods tend to identify customers who do not follow a pattern as outliers [74, 77].
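A minimal illustration of the segmentation idea: the sketch below is a plain textbook K-means on invented two-feature customer records, not the specific variant used in any cited study.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 2-D points (e.g., order-size and order-frequency features)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2 +
                                            (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to its cluster's mean.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, clusters

# Two obvious customer segments: low-volume and high-volume buyers.
customers = [(1, 2), (2, 1), (1, 1), (10, 12), (11, 10), (12, 11)]
centers, clusters = kmeans(customers, k=2)
```

A separate forecast would then be fitted per cluster; a customer far from every center would surface as the kind of outlier noted above.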

Hierarchical forecasts of sales data are performed by clustering and categorizing sales patterns. Multivariate ARIMA models have been used for demand forecasting based on point-of-sale data in industrial bakery chains [19]. These bakery goods are ordered and clustered daily, with a continuous need for demand forecasts to avoid both shortages and waste [19]. Fuel-demand forecasting in thermal power plants is another domain with applications of clustering methods: electricity-consumption patterns are derived by clustering consumers, and on that basis the demand for the required fuel is established [77].

K-nearest-neighbor (KNN)

KNN is a classification method that has been widely used for pattern recognition. The KNN algorithm identifies the similarity of a given object to the surrounding objects (called tuples) by generating a similarity index. These tuples are described by n attributes, so each tuple corresponds to a point in an n-dimensional space. The KNN algorithm searches for the k tuples that are closest to a given tuple [48]. These similarity-based classifications lead to the formation of clusters containing similar objects. KNN can also be integrated into regression-analysis problems [78] and used for dimensionality reduction of the data [79]. In the realm of SC demand forecasting, Nikolopoulos et al. [80] applied KNN to forecast sporadic demand in an automotive spare-parts supply chain. In another study, KNN was used to forecast future demand trends for Walmart's supply-chain planning [81].
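The KNN logic described above, used in its regression form, can be sketched for demand estimation as follows; the feature set (price, promotion flag) and the historical records are invented for illustration.

```python
def knn_forecast(history, query, k=3):
    """Predict demand for a query feature tuple as the average target of the
    k most similar historical tuples (Euclidean distance in feature space)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # Sort records by squared distance to the query and keep the k nearest.
    nearest = sorted(history, key=lambda rec: dist2(rec[0], query))[:k]
    return sum(target for _, target in nearest) / k

# Hypothetical records: (price, promotion_flag) features -> units sold.
history = [((10.0, 0), 100), ((10.5, 0), 95), ((9.0, 1), 160),
           ((9.5, 1), 150), ((11.0, 0), 90), ((8.5, 1), 170)]
pred = knn_forecast(history, query=(9.2, 1), k=3)   # averages the 3 nearest
```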

Artificial neural networks

In artificial neural networks, a set of neurons (input/output units) is connected across layers in order to establish a mapping of inputs to outputs by finding the underlying correlations between them. Configuring such a network can become a complex problem, due to the high number of layers and neurons as well as the variability of their types (linear or nonlinear), and therefore follows a data-driven learning process. In doing so, each neuron is associated with weights that are tuned through a training step [48]. In the end, a weighted network with a minimal number of neurons that maps the inputs to outputs with minimum fitting error (deviation) is identified.
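The weight-tuning process can be illustrated at its smallest scale with a single sigmoid unit trained by gradient descent on squared error; this toy example (a linearly separable AND-style rule with invented data) is far simpler than the multi-layer networks in the studies reviewed below.

```python
import math
import random

def train_neuron(samples, epochs=5000, lr=0.5, seed=1):
    """Train one sigmoid unit by per-sample gradient descent on squared error:
    the weights and bias are tuned so inputs map to targets."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in range(len(samples[0][0]))]
    b = 0.0
    for _ in range(epochs):
        for x, t in samples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            y = 1 / (1 + math.exp(-z))          # sigmoid activation
            grad = (y - t) * y * (1 - y)        # dE/dz for squared error
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    # Return a predictor using the final trained weights.
    return lambda x: 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

# Learn a "high demand only when both signals are on" rule (logical AND).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
predict = train_neuron(data)
```

Backpropagation extends exactly this weight update through multiple layers via the chain rule.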

As the literature reveals, artificial neural networks (ANNs) are widely applied for demand forecasting [82, 83, 84, 85]. To improve the accuracy of ANN-based demand predictions, Liu et al. [86] proposed a combination of a grey model and a stacked autoencoder, applied to a case study of predicting demand in a Brazilian logistics company subject to transportation disruption [87]. Amirkolaii et al. [88] applied neural networks to forecasting spare-parts demand in order to minimize supply-chain shortages. In this spare-parts supply chain, although there were multiple suppliers to satisfy demand for a variety of spare parts, demand was subject to high variability due to a varying number of customers and their varying needs. Their ANN-based forecasting approach covered (1) 1 input demand feature with 1 stock-keeping unit (SKU), (2) 1 input demand feature with all SKUs, (3) 16 input demand features with 1 SKU, and (4) 16 input demand features with all SKUs. They applied neural networks with backpropagation and compared the results with a number of benchmarks, reporting the mean square error (MSE) for each configuration scenario.

Huang et al. [89] compared a backpropagation (BP) neural network and a linear regression analysis for forecasting e-logistics demand in urban and rural areas of China using data from 1997 to 2015. Comparing the mean absolute error (MAE) and the average relative errors of the two methods, they showed that the BP neural network reached higher accuracy (lower differences between predicted and actual data). This is because a sigmoid transfer function was used in the hidden layer of the BP network, which can capture nonlinear relationships such as those in their case study, whereas linear regression works well only for linear problems.

ANNs have also been applied to demand forecasting for server models, with demand predicted one week ahead of order arrivals. In this regard, Saha et al. [90] proposed an ANN-based forecasting model using 52 weeks of time-series data fitted through both BP and radial basis function (RBF) networks. An RBF network is similar to a BP network except for its activation/transfer function: it follows a feed-forward process using a radial basis function. RBF networks achieve faster training and convergence of the ANN weights than BP networks, without compromising forecasting precision.

Researchers have combined ANN-based machine-learning algorithms with optimization models to draw optimal courses of action, strategies, or decisions for the future. Chang et al. [91] employed a genetic algorithm (GA) in the training phase of a neural network, using sales/supply-chain data from the printed-circuit-board industry in Taiwan, and presented an evolving neural-network forecasting model. They proposed a GA-based cost-function optimization to arrive at the neural-network configuration with the best prediction precision for sales forecasts. The proposed model was then compared with backpropagation and linear-regression approaches using three performance indices, MAPE, mean absolute deviation (MAD), and total cost deviation (TCD), demonstrating its superior prediction precision.

Regression analysis

Regression models generate continuous-valued functions used for prediction. These methods predict the value of a response (dependent) variable with respect to one or more predictor (independent) variables. There are various forms of regression analysis, such as linear, multiple, weighted, symbolic (random), polynomial, nonparametric, and robust. The latter approach is useful when errors fail to satisfy normality conditions or when the data contain a significant number of outliers [48].
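As a minimal instance of regression-based prediction, the closed-form least-squares fit of a simple linear model is sketched below on an invented demand series; the multiple, symbolic, and robust variants discussed here generalize this same idea.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (simple linear regression)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of (x, y) divided by the variance of x.
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
         sum((x - mx) ** 2 for x in xs))
    a = my - b * mx          # intercept passes through the means
    return a, b

# Toy series where demand rises by ~2 units per period.
periods = [1, 2, 3, 4, 5]
demand = [12, 14, 16, 18, 20]
a, b = fit_line(periods, demand)
forecast_p6 = a + b * 6      # predict demand for period 6
```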

Merkuryeva et al. [92] analyzed three approaches to demand forecasting in the pharmaceutical industry: a simple moving-average model, multiple linear regression, and symbolic regression with the search conducted through evolutionary genetic programming. In their experiment, symbolic regression exhibited the best fit with the lowest error.

Because perishable products must be sold within a very short preservation time, demand forecasting for this type of product has drawn increasing attention. Yang and Sutrisno [93] applied and compared regression analysis and neural-network techniques to derive demand forecasts for perishable goods. They concluded that accurate daily forecasts are achievable from the sales figures of the first few hours of the day using either method.

Support vector machine (SVM)

SVM is an algorithm that uses a nonlinear mapping to transform a set of training data into a higher dimension (data classes). SVM searches for an optimal separating hyperplane that can separate one resulting class from another [48]. Villegas et al. [94] tested the applicability of SVMs to demand forecasting in household and personal-care SCs with a dataset comprising 229 weekly demand series in the UK. Wu [95] applied an SVM, using particle swarm optimization (PSO) to search for the best separating hyperplane, to classify data related to car sales and forecast the demand in each cluster.
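To illustrate the hyperplane search in the simplest (linear) setting, the sketch below trains a linear SVM by sub-gradient descent on the regularized hinge loss; the data and hyperparameters are hypothetical, and the cited studies use more sophisticated variants (e.g., PSO-tuned or nonlinear-kernel SVMs).

```python
def train_linear_svm(samples, epochs=200, lr=0.1, lam=0.01):
    """Linear SVM via sub-gradient descent on the regularized hinge loss;
    labels must be +1/-1. Returns the hyperplane parameters (w, b)."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:   # point inside the margin: hinge term is active
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:            # only the regularizer shrinks w
                w = [wi - lr * lam * wi for wi in w]
    return w, b

# Two separable classes, e.g. "low demand" (-1) vs "high demand" (+1) weeks.
data = [((1.0, 1.0), -1), ((1.5, 0.5), -1), ((2.0, 1.5), -1),
        ((4.0, 4.0), +1), ((4.5, 3.5), +1), ((5.0, 4.5), +1)]
w, b = train_linear_svm(data)
side = lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b   # sign gives the class
```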

Support vector regression (SVR)

Continuous-variable prediction problems can be solved by support vector regression (SVR), a regression implementation of SVM. The main idea behind SVR is the computation of a linear regression function within a high-dimensional feature space. SVR has been applied to financial/cost prediction, handwritten-digit recognition, speaker identification, and object recognition [48].

Guanghui [96] used the SVR method for SC demand prediction. The use of SVR in demand forecasting can yield a lower mean square error than RBF neural networks, because the optimization (cost) function in SVR disregards points beyond a margin of distance from the training set. This method therefore leads to higher forecast accuracy, although, like SVM, it is directly applicable only to two-class problems (such as normal-versus-anomaly detection/estimation). Sarhani and El Afia [97] sought to forecast SC demand using SVR and applied particle swarm optimization (PSO) and a GA to optimize the SVR parameters. The SVR-PSO and SVR-GA approaches were compared with respect to prediction accuracy using MAPE. The results showed superior performance by PSO, in terms of both time intensity and MAPE, when configuring the SVR parameters.
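The property noted above (points within a margin of the fit do not contribute to the SVR cost) comes from the epsilon-insensitive loss, sketched below on invented numbers for contrast with squared error, which penalizes every residual.

```python
def eps_insensitive_loss(y_true, y_pred, eps=5.0):
    """SVR's loss: residuals inside the eps-tube cost nothing, so well-fitted
    points do not influence the model (unlike squared error)."""
    return [max(0.0, abs(t - p) - eps) for t, p in zip(y_true, y_pred)]

actual    = [100, 104, 98, 130]   # last point is a large miss
predicted = [101, 100, 99, 110]
losses = eps_insensitive_loss(actual, predicted)
# Only the fourth point (residual 20 > eps) incurs any loss.
```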

Mixed approaches

Some works in the literature have used a combination of the aforementioned techniques. In these studies, the data flow through a sequence of algorithms, and the outputs of one stage become the inputs of the next. The outputs are explanatory, in the form of qualitative and quantitative information, with useful information extracted at each stage. Examples of such studies include [15, 98, 99, 100, 101, 102, 103, 104, 105].

In more complex supply chains with several points of supply, different warehouses, varied customers, and several products, demand forecasting becomes a high-dimensional problem. To address this issue, Islek and Oguducu [100] applied a clustering technique called bipartite-graph clustering to analyze the sales patterns of different products, then combined a moving-average model and a Bayesian belief network to improve the accuracy of demand forecasting for each cluster. Kilimci et al. [101] developed an intelligent demand-forecasting system by applying time-series and regression methods, a support vector regression algorithm, and a deep-learning model in sequence. They dealt with a case involving a large amount of data: 155 features over 875 million records. First, they used principal component analysis for dimension reduction; then data clustering was performed; this was followed by demand forecasting for each cluster using a novel decision-integration strategy called boosting ensemble. They concluded that combining a deep neural network with a boosting strategy yielded the best accuracy, minimizing the prediction error for demand forecasting.
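A highly simplified sketch of such a sequential pipeline, on invented per-customer demand series: customers are first split into segments by demand variability (a crude stand-in for the clustering stage), and a moving average then forecasts each segment. The cited studies use far richer stages (PCA, bipartite-graph clustering, deep models) in the same chained pattern.

```python
def pipeline_forecast(series_by_customer, k_window=3):
    """Two-stage sketch: (1) segment customers into 'stable' vs 'volatile'
    by demand variance, (2) forecast each segment with a moving average
    over its pooled recent demand."""
    def variance(s):
        m = sum(s) / len(s)
        return sum((x - m) ** 2 for x in s) / len(s)

    # Stage 1: crude clustering by a variance threshold (the median variance).
    threshold = sorted(variance(s) for s in series_by_customer.values())[
        len(series_by_customer) // 2]
    segments = {"stable": [], "volatile": []}
    for s in series_by_customer.values():
        segments["stable" if variance(s) < threshold else "volatile"].append(s)

    # Stage 2: moving-average forecast per segment from the last k_window points.
    forecasts = {}
    for name, group in segments.items():
        recent = [x for s in group for x in s[-k_window:]]
        forecasts[name] = sum(recent) / len(recent) if recent else None
    return forecasts

series = {"c1": [10, 11, 10, 12], "c2": [9, 10, 11, 10],
          "c3": [5, 30, 2, 40], "c4": [50, 1, 45, 3]}
out = pipeline_forecast(series)
```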

Chen and Lu [98] combined the clustering algorithms SOM, growing hierarchical self-organizing map (GHSOM), and K-means with two machine-learning techniques, SVR and extreme learning machine (ELM), for sales forecasting of computers. The authors found that the combination of GHSOM and ELM yielded better accuracy and performance in demand forecasts for their computer-retailing case study. Forecasting difficulties also occur in cases with high product variety. For such products in an SC, patterns of sales can be extracted for clustered products; then, for each cluster, a machine-learning technique such as SVR can be employed to further improve prediction accuracy [104].

Brentan et al. [106] used and analyzed various BDA techniques for demand prediction, including support vector machines (SVM) and adaptive neural fuzzy inference systems (ANFIS). They combined the predicted values derived from each machine-learning technique through a linear regression process to arrive at an average prediction value adopted as the benchmark forecast. The performance (accuracy) of each technique was then analyzed with respect to its root mean square error (RMSE) and MAE values, obtained by comparing the target values with the predicted ones.
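The combination-and-evaluation step can be illustrated as follows; the two model outputs are invented, and a plain average stands in for the regression-based combiner used by Brentan et al.

```python
import math

def rmse(actual, pred):
    """Root mean square error: penalizes large deviations quadratically."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def mae(actual, pred):
    """Mean absolute error: average magnitude of the deviations."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

actual  = [100, 110, 120, 130]
model_a = [ 98, 113, 118, 133]   # small errors in one direction
model_b = [104, 107, 124, 127]   # errors in the opposite direction
# Simple average ensemble: opposing errors partially cancel.
ensemble = [(a + b) / 2 for a, b in zip(model_a, model_b)]

scores = {name: (rmse(actual, p), mae(actual, p))
          for name, p in [("A", model_a), ("B", model_b), ("avg", ensemble)]}
```

Here the combined forecast scores lower on both RMSE and MAE than either individual model, which is the rationale for benchmarking against the combined prediction.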

In summary, Table  3 provides an overview of the recent literature on the application of Predictive BDA in demand forecasting.

Discussions

The data produced in SCs contain a great deal of useful knowledge. Analysis of such massive data can help to forecast trends in customer behavior, markets, prices, and so on, helping organizations better adapt to competitive environments. To forecast demand in an SC in the presence of big data, different predictive BDA algorithms have been used. These algorithms provide predictive analytics through time-series approaches, auto-regressive methods, and associative forecasting methods [10]. The demand forecasts from these BDA methods can be integrated with product design attributes as well as with online search-traffic mapping to incorporate customer and price information [37, 71].

Predictive BDA algorithms

Most of the studies reviewed developed and used a particular data-mining algorithm for their case studies. However, very few comparative studies are available in the literature to provide a benchmark for understanding the advantages and disadvantages of these methodologies. Additionally, as depicted by Table 3, there is no clear relationship between the choice of BDA algorithm/method and the application domain or category.

Predictive BDA applicability

Most data-driven models used in the literature consider only historical data. Such backward-looking forecasting ignores new trends and the highs and lows of different economic environments. Organizational factors, such as reputation and marketing strategies, as well as internal risks (related to the availability of SCM resources), can also greatly influence demand [107] and thus contribute to the inaccuracy of BDA-based demand predictions built on historical data. Incorporating driving factors outside the historical data, such as economic instability, inflation, and purchasing power, could help adjust the predictions to unseen future demand scenarios. Combining predictive algorithms with optimization or simulation can equip the models with prescriptive capabilities in response to future scenarios and expectations.

Predictive BDA in closed-loop supply chains (CLSC)

The combination of forward and reverse flows of material in an SC is referred to as a closed-loop supply chain (CLSC). A CLSC is more complex than a traditional SC because it comprises the forward and reverse SC simultaneously [108]. Economic impact, environmental impact, and social responsibility are three significant factors in designing a CLSC network that includes product recycling, remanufacturing, and refurbishment functions. The complexity of a CLSC, compared with a common SC, results from the coordination between backward and forward flows: transportation cost, holding cost, and demand forecasting become challenging because of uncertainties in the information flows from the forward chain to the reverse one. In addition, uncertainties about the rate of returned products and the efficiencies of recycling, remanufacturing, and refurbishment functions are among the main barriers to establishing predictions for the reverse flow [5, 6, 109]. As such, one key finding from this literature survey is that CLSCs particularly suffer from a lack of quality data for remanufacturing. Remanufacturing refers to the disassembly, cleaning, inspection, storage, reconditioning, replacement, and reassembly of products. As a result of these data deficiencies, optimal scheduling of remanufacturing functions is cumbersome, owing to uncertainties in the quality and quantity of used products as well as in the timing of returns and delivery delays.

IoT-based approaches can overcome the difficulties of collecting data in a CLSC. In an IoT environment, objects are monitored and controlled remotely across existing network infrastructures. This enables more direct integration between the physical world and computer-based systems. The results include improved efficiency, accuracy, and economic benefit across SCs [ 50 , 54 , 110 ].

Radio frequency identification (RFID) is another technology that has become very popular in SCs. RFID can be used for automation of processes in an SC, and it is useful for coordination of forecasts in CLSCs with dispersed points of return and varied quantities and qualities of returned used products [ 10 , 111 , 112 , 113 , 114 ].

Conclusions

The growing need for customer-behavior analysis and demand forecasting is driven by globalization and increasing market competition, as well as by the surge in supply-chain digitization practices. In this study, we performed a thorough review of the applications of predictive big data analytics (BDA) in SC demand forecasting. The survey overviewed the BDA methods applied to supply-chain demand forecasting and provided a comparative categorization of them. We collected and analyzed these studies with respect to the methods and techniques used in demand prediction. Seven mainstream techniques were identified and studied, with their pros and cons. Neural networks and regression analysis were observed to be the two most commonly employed techniques. The review also pointed out that optimization models or simulation can be used to improve the accuracy of forecasting by formulating and optimizing a cost function for fitting the predictions to data.

One key finding from reviewing the existing literature is that very limited research has been conducted on the applications of BDA in CLSCs and reverse logistics, although there are key benefits in adopting a data-driven approach to the design and management of CLSCs. Owing to increasing environmental awareness and government incentives, a vast quantity of returned (used) products, of various types and conditions, is nowadays collected, received, and sorted at many collection points. The associated uncertainties have a direct impact on the cost-efficiency of remanufacturing processes, the final price of refurbished products, and the demand for these products [115]. As such, the design and operation of CLSCs present a case for big data analytics from both supply- and demand-forecasting perspectives.

Availability of data and materials

The paper presents a review of the literature extracted from main scientific databases without presenting data.

Abbreviations

ANFIS: Adaptive neural fuzzy inference system

ARIMA: Autoregressive integrated moving average

ANN: Artificial neural network

BDA: Big data analytics

BP: Backpropagation

CLSC: Closed-loop supply chain

ELM: Extreme learning machine

ERP: Enterprise resource planning

GA: Genetic algorithm

GHSOM: Growing hierarchical self-organizing map

HW: Holt-Winters

IoT: Internet of things

KNN: K-nearest-neighbor

MAD: Mean absolute deviation

MAE: Mean absolute error

MAPE: Mean absolute percentage error

MSE: Mean square error

RMSE: Root mean square error

RBF: Radial basis function

PSO: Particle swarm optimization

SOM: Self-organizing map

SKU: Stock-keeping unit

SCA: Supply chain analytics

SC: Supply chain

SCM: Supply chain management

SVM: Support vector machine

SVR: Support vector regression

TCD: Total cost deviation

U-Theil: Theil inequality index

You Z, Si Y-W, Zhang D, Zeng X, Leung SCH, Li T. A decision-making framework for precision marketing. Expert Syst Appl. 2015;42(7):3357–67. https://doi.org/10.1016/J.ESWA.2014.12.022 .


Guo ZX, Wong WK, Li M. A multivariate intelligent decision-making model for retail sales forecasting. Decis Support Syst. 2013;55(1):247–55. https://doi.org/10.1016/J.DSS.2013.01.026 .

Wei J-T, Lee M-C, Chen H-K, Wu H-H. Customer relationship management in the hairdressing industry: an application of data mining techniques. Expert Syst Appl. 2013;40(18):7513–8. https://doi.org/10.1016/J.ESWA.2013.07.053 .

Lu LX, Swaminathan JM. Supply chain management. Int Encycl Soc Behav Sci. 2015. https://doi.org/10.1016/B978-0-08-097086-8.73032-7 .

Gholizadeh H, Tajdin A, Javadian N. A closed-loop supply chain robust optimization for disposable appliances. Neural Comput Appl. 2018. https://doi.org/10.1007/s00521-018-3847-9 .

Tosarkani BM, Amin SH. A possibilistic solution to configure a battery closed-loop supply chain: multi-objective approach. Expert Syst Appl. 2018;92:12–26. https://doi.org/10.1016/J.ESWA.2017.09.039 .

Blackburn R, Lurz K, Priese B, Göb R, Darkow IL. A predictive analytics approach for demand forecasting in the process industry. Int Trans Oper Res. 2015;22(3):407–28. https://doi.org/10.1111/itor.12122 .


Boulaksil Y. Safety stock placement in supply chains with demand forecast updates. Oper Res Perspect. 2016;3:27–31. https://doi.org/10.1016/J.ORP.2016.07.001 .


Tang CS. Perspectives in supply chain risk management. Int J Prod Econ. 2006;103(2):451–88. https://doi.org/10.1016/J.IJPE.2005.12.006 .

Wang G, Gunasekaran A, Ngai EWT, Papadopoulos T. Big data analytics in logistics and supply chain management: certain investigations for research and applications. Int J Prod Econ. 2016;176:98–110. https://doi.org/10.1016/J.IJPE.2016.03.014 .

Awwad M, Kulkarni P, Bapna R, Marathe A. Big data analytics in supply chain: a literature review. In: Proceedings of the international conference on industrial engineering and operations management, 2018(SEP); 2018, p. 418–25.

Büyüközkan G, Göçer F. Digital Supply Chain: literature review and a proposed framework for future research. Comput Ind. 2018;97:157–77.

Kshetri N. 1 Blockchain’s roles in meeting key supply chain management objectives. Int J Inf Manage. 2018;39:80–9.

Michna Z, Disney SM, Nielsen P. The impact of stochastic lead times on the bullwhip effect under correlated demand and moving average forecasts. Omega. 2019. https://doi.org/10.1016/J.OMEGA.2019.02.002 .

Zhu Y, Zhao Y, Zhang J, Geng N, Huang D. Spring onion seed demand forecasting using a hybrid Holt-Winters and support vector machine model. PLoS ONE. 2019;14(7):e0219889. https://doi.org/10.1371/journal.pone.0219889 .

Govindan K, Cheng TCE, Mishra N, Shukla N. Big data analytics and application for logistics and supply chain management. Transport Res Part E Logist Transport Rev. 2018;114:343–9. https://doi.org/10.1016/J.TRE.2018.03.011 .

Bohanec M, Kljajić Borštnar M, Robnik-Šikonja M. Explaining machine learning models in sales predictions. Expert Syst Appl. 2017;71:416–28. https://doi.org/10.1016/J.ESWA.2016.11.010 .

Constante F, Silva F, Pereira A. DataCo smart supply chain for big data analysis. Mendeley Data. 2019. https://doi.org/10.17632/8gx2fvg2k6.5 .

Huber J, Gossmann A, Stuckenschmidt H. Cluster-based hierarchical demand forecasting for perishable goods. Expert Syst Appl. 2017;76:140–51. https://doi.org/10.1016/J.ESWA.2017.01.022 .

Ali MM, Babai MZ, Boylan JE, Syntetos AA. Supply chain forecasting when information is not shared. Eur J Oper Res. 2017;260(3):984–94. https://doi.org/10.1016/J.EJOR.2016.11.046 .

Bian W, Shang J, Zhang J. Two-way information sharing under supply chain competition. Int J Prod Econ. 2016;178:82–94. https://doi.org/10.1016/J.IJPE.2016.04.025 .

Mourtzis D. Challenges and future perspectives for the life cycle of manufacturing networks in the mass customisation era. Logist Res. 2016;9(1):2.

Nguyen T, Zhou L, Spiegler V, Ieromonachou P, Lin Y. Big data analytics in supply chain management: a state-of-the-art literature review. Comput Oper Res. 2018;98:254–64. https://doi.org/10.1016/J.COR.2017.07.004 .

Choi Y, Lee H, Irani Z. Big data-driven fuzzy cognitive map for prioritising IT service procurement in the public sector. Ann Oper Res. 2018;270(1–2):75–104. https://doi.org/10.1007/s10479-016-2281-6 .

Huang YY, Handfield RB. Measuring the benefits of erp on supply management maturity model: a “big data” method. Int J Oper Prod Manage. 2015;35(1):2–25. https://doi.org/10.1108/IJOPM-07-2013-0341 .

Miroslav M, Miloš M, Velimir Š, Božo D, Đorđe L. Semantic technologies on the mission: preventing corruption in public procurement. Comput Ind. 2014;65(5):878–90. https://doi.org/10.1016/J.COMPIND.2014.02.003 .

Zhang Y, Ren S, Liu Y, Si S. A big data analytics architecture for cleaner manufacturing and maintenance processes of complex products. J Clean Prod. 2017;142:626–41. https://doi.org/10.1016/J.JCLEPRO.2016.07.123 .

Shu Y, Ming L, Cheng F, Zhang Z, Zhao J. Abnormal situation management: challenges and opportunities in the big data era. Comput Chem Eng. 2016;91:104–13. https://doi.org/10.1016/J.COMPCHEMENG.2016.04.011 .

Krumeich J, Werth D, Loos P. Prescriptive control of business processes: new potentials through predictive analytics of big data in the process manufacturing industry. Bus Inform Syst Eng. 2016;58(4):261–80. https://doi.org/10.1007/s12599-015-0412-2 .

Guo SY, Ding LY, Luo HB, Jiang XY. A Big-Data-based platform of workers’ behavior: observations from the field. Accid Anal Prev. 2016;93:299–309. https://doi.org/10.1016/J.AAP.2015.09.024 .

Chuang Y-F, Chia S-H, Wong J-Y. Enhancing order-picking efficiency through data mining and assignment approaches. WSEAS Transactions on Business and Economics. 2014;11(1):52–64.


Ballestín F, Pérez Á, Lino P, Quintanilla S, Valls V. Static and dynamic policies with RFID for the scheduling of retrieval and storage warehouse operations. Comput Ind Eng. 2013;66(4):696–709. https://doi.org/10.1016/J.CIE.2013.09.020 .

Alyahya S, Wang Q, Bennett N. Application and integration of an RFID-enabled warehousing management system—a feasibility study. J Ind Inform Integr. 2016;4:15–25. https://doi.org/10.1016/J.JII.2016.08.001 .

Cui J, Liu F, Hu J, Janssens D, Wets G, Cools M. Identifying mismatch between urban travel demand and transport network services using GPS data: a case study in the fast growing Chinese city of Harbin. Neurocomputing. 2016;181:4–18. https://doi.org/10.1016/J.NEUCOM.2015.08.100 .

Shan Z, Zhu Q. Camera location for real-time traffic state estimation in urban road network using big GPS data. Neurocomputing. 2015;169:134–43. https://doi.org/10.1016/J.NEUCOM.2014.11.093 .

Ting SL, Tse YK, Ho GTS, Chung SH, Pang G. Mining logistics data to assure the quality in a sustainable food supply chain: a case in the red wine industry. Int J Prod Econ. 2014;152:200–9. https://doi.org/10.1016/J.IJPE.2013.12.010 .

Jun S-P, Park D-H, Yeom J. The possibility of using search traffic information to explore consumer product attitudes and forecast consumer preference. Technol Forecast Soc Chang. 2014;86:237–53. https://doi.org/10.1016/J.TECHFORE.2013.10.021 .

He W, Wu H, Yan G, Akula V, Shen J. A novel social media competitive analytics framework with sentiment benchmarks. Inform Manage. 2015;52(7):801–12. https://doi.org/10.1016/J.IM.2015.04.006 .

Marine-Roig E, Anton Clavé S. Tourism analytics with massive user-generated content: a case study of Barcelona. J Destination Market Manage. 2015;4(3):162–72. https://doi.org/10.1016/J.JDMM.2015.06.004 .

Carbonneau R, Laframboise K, Vahidov R. Application of machine learning techniques for supply chain demand forecasting. Eur J Oper Res. 2008;184(3):1140–54. https://doi.org/10.1016/J.EJOR.2006.12.004 .


Munir K. Cloud computing and big data: technologies, applications and security, vol. 49. Berlin: Springer; 2019.

Rostami-Tabar B, Babai MZ, Ali M, Boylan JE. The impact of temporal aggregation on supply chains with ARMA(1,1) demand processes. Eur J Oper Res. 2019;273(3):920–32. https://doi.org/10.1016/J.EJOR.2018.09.010 .

Beyer MA, Laney D. The importance of ‘big data’: a definition. Stamford: Gartner; 2012. p. 2014–8.

Benabdellah AC, Benghabrit A, Bouhaddou I, Zemmouri EM. Big data for supply chain management: opportunities and challenges. In: Proceedings of IEEE/ACS international conference on computer systems and applications, AICCSA, no. 11, p. 20–26; 2016. https://doi.org/10.1109/AICCSA.2016.7945828 .

Kumar M. Applied big data analytics in operations management. Appl Big Data Anal Oper Manage. 2016. https://doi.org/10.4018/978-1-5225-0886-1 .

Zhong RY, Huang GQ, Lan S, Dai QY, Chen X, Zhang T. A big data approach for logistics trajectory discovery from RFID-enabled production data. Int J Prod Econ. 2015;165:260–72. https://doi.org/10.1016/J.IJPE.2015.02.014 .

Varela IR, Tjahjono B. Big data analytics in supply chain management: trends and related research. In: 6th international conference on operations and supply chain management, vol. 1, no. 1, p. 2013–4; 2014. https://doi.org/10.13140/RG.2.1.4935.2563 .

Han J, Kamber M, Pei J. Data mining: concepts and techniques. Burlington: Morgan Kaufmann Publishers; 2013. https://doi.org/10.1016/B978-0-12-381479-1.00001-0 .


Arunachalam D, Kumar N. Benefit-based consumer segmentation and performance evaluation of clustering approaches: an evidence of data-driven decision-making. Expert Syst Appl. 2018;111:11–34. https://doi.org/10.1016/J.ESWA.2018.03.007 .

Chase CW. Next generation demand management: people, process, analytics, and technology. Hoboken: Wiley; 2016.

Book   Google Scholar  

SAS Institute. Demand-driven forecasting and planning: take responsiveness to the next level. 13; 2014. https://www.sas.com/content/dam/SAS/en_us/doc/whitepaper2/demand-driven-forecasting-planning-107477.pdf .

Acar Y, Gardner ES. Forecasting method selection in a global supply chain. Int J Forecast. 2012;28(4):842–8. https://doi.org/10.1016/J.IJFORECAST.2011.11.003 .

Ma S, Fildes R, Huang T. Demand forecasting with high dimensional data: the case of SKU retail sales forecasting with intra- and inter-category promotional information. Eur J Oper Res. 2016;249(1):245–57. https://doi.org/10.1016/J.EJOR.2015.08.029 .

Addo-Tenkorang R, Helo PT. Big data applications in operations/supply-chain management: a literature review. Comput Ind Eng. 2016;101:528–43. https://doi.org/10.1016/J.CIE.2016.09.023 .

Agrawal S, Singh RK, Murtaza Q. A literature review and perspectives in reverse logistics. Resour Conserv Recycl. 2015;97:76–92. https://doi.org/10.1016/J.RESCONREC.2015.02.009 .

Gunasekaran A, Kumar Tiwari M, Dubey R, Fosso Wamba S. Big data and predictive analytics applications in supply chain management. Comput Ind Eng. 2016;101:525–7. https://doi.org/10.1016/J.CIE.2016.10.020 .

Hazen BT, Skipper JB, Ezell JD, Boone CA. Big data and predictive analytics for supply chain sustainability: a theory-driven research agenda. Comput Ind Eng. 2016;101:592–8. https://doi.org/10.1016/J.CIE.2016.06.030 .

Hofmann E, Rutschmann E. Big data analytics and demand forecasting in supply chains: a conceptual analysis. Int J Logist Manage. 2018;29(2):739–66. https://doi.org/10.1108/IJLM-04-2017-0088 .

Jain A, Sanders NR. Forecasting sales in the supply chain: consumer analytics in the big data era. Int J Forecast. 2019;35(1):170–80. https://doi.org/10.1016/J.IJFORECAST.2018.09.003 .

Jin J, Liu Y, Ji P, Kwong CK. Review on recent advances in information mining from big consumer opinion data for product design. J Comput Inf Sci Eng. 2018;19(1):010801. https://doi.org/10.1115/1.4041087 .

Kumar R, Mahto D. Industrial forecasting support systems and technologies in practice: a review. Glob J Res Eng. 2013;13(4):17–33.

MathSciNet   Google Scholar  

Mishra D, Gunasekaran A, Papadopoulos T, Childe SJ. Big Data and supply chain management: a review and bibliometric analysis. Ann Oper Res. 2016;270(1):313–36. https://doi.org/10.1007/s10479-016-2236-y .

Ren S, Zhang Y, Liu Y, Sakao T, Huisingh D, Almeida CMVB. A comprehensive review of big data analytics throughout product lifecycle to support sustainable smart manufacturing: a framework, challenges and future research directions. J Clean Prod. 2019;210:1343–65. https://doi.org/10.1016/J.JCLEPRO.2018.11.025 .

Singh Jain AD, Mehta I, Mitra J, Agrawal S. Application of big data in supply chain management. Mater Today Proc. 2017;4(2):1106–15. https://doi.org/10.1016/J.MATPR.2017.01.126 .

Souza GC. Supply chain analytics. Bus Horiz. 2014;57(5):595–605. https://doi.org/10.1016/J.BUSHOR.2014.06.004 .

Tiwari S, Wee HM, Daryanto Y. Big data analytics in supply chain management between 2010 and 2016: insights to industries. Comput Ind Eng. 2018;115:319–30. https://doi.org/10.1016/J.CIE.2017.11.017 .

Zhong RY, Newman ST, Huang GQ, Lan S. Big Data for supply chain management in the service and manufacturing sectors: challenges, opportunities, and future perspectives. Comput Ind Eng. 2016;101:572–91. https://doi.org/10.1016/J.CIE.2016.07.013 .

Ramanathan U, Subramanian N, Parrott G. Role of social media in retail network operations and marketing to enhance customer satisfaction. Int J Oper Prod Manage. 2017;37(1):105–23. https://doi.org/10.1108/IJOPM-03-2015-0153 .

Coursera. Supply chain planning. Coursera E-Learning; 2019. https://www.coursera.org/learn/planning .

Villegas MA, Pedregal DJ. Supply chain decision support systems based on a novel hierarchical forecasting approach. Decis Support Syst. 2018;114:29–36. https://doi.org/10.1016/J.DSS.2018.08.003 .

Ma J, Kwak M, Kim HM. Demand trend mining for predictive life cycle design. J Clean Prod. 2014;68:189–99. https://doi.org/10.1016/J.JCLEPRO.2014.01.026 .

Hamiche K, Abouaïssa H, Goncalves G, Hsu T. A robust and easy approach for demand forecasting in supply chains. IFAC-PapersOnLine. 2018;51(11):1732–7. https://doi.org/10.1016/J.IFACOL.2018.08.206 .

Da Veiga CP, Da Veiga CRP, Catapan A, Tortato U, Da Silva WV. Demand forecasting in food retail: a comparison between the Holt-Winters and ARIMA models. WSEAS Trans Bus Econ. 2014;11(1):608–14.

Murray PW, Agard B, Barajas MA. Forecasting supply chain demand by clustering customers. IFAC-PapersOnLine. 2015;48(3):1834–9. https://doi.org/10.1016/J.IFACOL.2015.06.353 .

Ramos P, Santos N, Rebelo R. Performance of state space and ARIMA models for consumer retail sales forecasting. Robot Comput Integr Manuf. 2015;34:151–63. https://doi.org/10.1016/J.RCIM.2014.12.015 .

Schaer O, Kourentzes N. Demand forecasting with user-generated online information. Int J Forecast. 2019;35(1):197–212. https://doi.org/10.1016/J.IJFORECAST.2018.03.005 .

Pang Y, Yao B, Zhou X, Zhang Y, Xu Y, Tan Z. Hierarchical electricity time series forecasting for integrating consumption patterns analysis and aggregation consistency; 2018. In: IJCAI international joint conference on artificial intelligence; 2018, p. 3506–12.

Goyal R, Chandra P, Singh Y. Suitability of KNN regression in the development of interaction based software fault prediction models. IERI Procedia. 2014;6:15–21. https://doi.org/10.1016/J.IERI.2014.03.004 .

Runkler TA. Data analytics (models and algorithms for intelligent data analysis). In: Revista Espanola de las Enfermedades del Aparato Digestivo (Vol. 26, Issue 4). Springer Fachmedien Wiesbaden; 2016. https://doi.org/10.1007/978-3-658-14075-5 .

Nikolopoulos KI, Babai MZ, Bozos K. Forecasting supply chain sporadic demand with nearest neighbor approaches. Int J Prod Econ. 2016;177:139–48. https://doi.org/10.1016/j.ijpe.2016.04.013 .

Gaur M, Goel S, Jain E. Comparison between nearest Neighbours and Bayesian network for demand forecasting in supply chain management. In: 2015 international conference on computing for sustainable global development, INDIACom 2015, May; 2015, p. 1433–6.

Burney SMA, Ali SM, Burney S. A survey of soft computing applications for decision making in supply chain management. In: 2017 IEEE 3rd international conference on engineering technologies and social sciences, ICETSS 2017, 2018, p. 1–6. https://doi.org/10.1109/ICETSS.2017.8324158 .

González Perea R, Camacho Poyato E, Montesinos P, Rodríguez Díaz JA. Optimisation of water demand forecasting by artificial intelligence with short data sets. Biosyst Eng. 2019;177:59–66. https://doi.org/10.1016/J.BIOSYSTEMSENG.2018.03.011 .

Vhatkar S, Dias J. Oral-care goods sales forecasting using artificial neural network model. Procedia Comput Sci. 2016;79:238–43. https://doi.org/10.1016/J.PROCS.2016.03.031 .

Wong WK, Guo ZX. A hybrid intelligent model for medium-term sales forecasting in fashion retail supply chains using extreme learning machine and harmony search algorithm. Int J Prod Econ. 2010;128(2):614–24. https://doi.org/10.1016/J.IJPE.2010.07.008 .

Liu C, Shu T, Chen S, Wang S, Lai KK, Gan L. An improved grey neural network model for predicting transportation disruptions. Expert Syst Appl. 2016;45:331–40. https://doi.org/10.1016/J.ESWA.2015.09.052 .

Yuan WJ, Chen JH, Cao JJ, Jin ZY. Forecast of logistics demand based on grey deep neural network model. Proc Int Conf Mach Learn Cybern. 2018;1:251–6. https://doi.org/10.1109/ICMLC.2018.8527006 .

Amirkolaii KN, Baboli A, Shahzad MK, Tonadre R. Demand forecasting for irregular demands in business aircraft spare parts supply chains by using artificial intelligence (AI). IFAC-PapersOnLine. 2017;50(1):15221–6. https://doi.org/10.1016/J.IFACOL.2017.08.2371 .

Huang L, Xie G, Li D, Zou C. Predicting and analyzing e-logistics demand in urban and rural areas: an empirical approach on historical data of China. Int J Performabil Eng. 2018;14(7):1550–9. https://doi.org/10.23940/ijpe.18.07.p19.15501559 .

Saha C, Lam SS, Boldrin W. Demand forecasting for server manufacturing using neural networks. In: Proceedings of the 2014 industrial and systems engineering research conference, June 2014; 2015.

Chang P-C, Wang Y-W, Tsai C-Y. Evolving neural network for printed circuit board sales forecasting. Expert Syst Appl. 2005;29(1):83–92. https://doi.org/10.1016/J.ESWA.2005.01.012 .

Merkuryeva G, Valberga A, Smirnov A. Demand forecasting in pharmaceutical supply chains: a case study. Procedia Comput Sci. 2019;149:3–10. https://doi.org/10.1016/J.PROCS.2019.01.100 .

Yang CL, Sutrisno H. Short-term sales forecast of perishable goods for franchise business. In: 2018 10th international conference on knowledge and smart technology: cybernetics in the next decades, KST 2018, p. 101–5; 2018. https://doi.org/10.1109/KST.2018.8426091 .

Villegas MA, Pedregal DJ, Trapero JR. A support vector machine for model selection in demand forecasting applications. Comput Ind Eng. 2018;121:1–7. https://doi.org/10.1016/J.CIE.2018.04.042 .

Wu Q. The hybrid forecasting model based on chaotic mapping, genetic algorithm and support vector machine. Expert Syst Appl. 2010;37(2):1776–83. https://doi.org/10.1016/J.ESWA.2009.07.054 .

Guanghui W. Demand forecasting of supply chain based on support vector regression method. Procedia Eng. 2012;29:280–4. https://doi.org/10.1016/J.PROENG.2011.12.707 .

Sarhani M, El Afia A. Intelligent system based support vector regression for supply chain demand forecasting. In: 2014 2nd world conference on complex systems, WCCS 2014; 2015, p. 79–83. https://doi.org/10.1109/ICoCS.2014.7060941 .

Chen IF, Lu CJ. Sales forecasting by combining clustering and machine-learning techniques for computer retailing. Neural Comput Appl. 2017;28(9):2633–47. https://doi.org/10.1007/s00521-016-2215-x .

Fasli M, Kovalchuk Y. Learning approaches for developing successful seller strategies in dynamic supply chain management. Inf Sci. 2011;181(16):3411–26. https://doi.org/10.1016/J.INS.2011.04.014 .

Islek I, Oguducu SG. A retail demand forecasting model based on data mining techniques. In: IEEE international symposium on industrial electronics; 2015, p. 55–60. https://doi.org/10.1109/ISIE.2015.7281443 .

Kilimci ZH, Akyuz AO, Uysal M, Akyokus S, Uysal MO, Atak Bulbul B, Ekmis MA. An improved demand forecasting model using deep learning approach and proposed decision integration strategy for supply chain. Complexity. 2019;2019:1–15. https://doi.org/10.1155/2019/9067367 .

Loureiro ALD, Miguéis VL, da Silva LFM. Exploring the use of deep neural networks for sales forecasting in fashion retail. Decis Support Syst. 2018;114:81–93. https://doi.org/10.1016/J.DSS.2018.08.010 .

Punam K, Pamula R, Jain PK. A two-level statistical model for big mart sales prediction. In: 2018 international conference on computing, power and communication technologies, GUCON 2018; 2019. https://doi.org/10.1109/GUCON.2018.8675060 .

Puspita PE, İnkaya T, Akansel M. Clustering-based Sales Forecasting in a Forklift Distributor. In: Uluslararası Muhendislik Arastirma ve Gelistirme Dergisi, 1–17; 2019. https://doi.org/10.29137/umagd.473977 .

Thomassey S. Sales forecasts in clothing industry: the key success factor of the supply chain management. Int J Prod Econ. 2010;128(2):470–83. https://doi.org/10.1016/J.IJPE.2010.07.018 .

Brentan BM, Ribeiro L, Izquierdo J, Ambrosio JK, Luvizotto E, Herrera M. Committee machines for hourly water demand forecasting in water supply systems. Math Probl Eng. 2019;2019:1–11. https://doi.org/10.1155/2019/9765468 .

Mafakheri F, Breton M, Chauhan S. Project-to-organization matching: an integrated risk assessment approach. Int J IT Project Manage. 2012;3(3):45–59. https://doi.org/10.4018/jitpm.2012070104 .

Mafakheri F, Nasiri F. Revenue sharing coordination in reverse logistics. J Clean Prod. 2013;59:185–96. https://doi.org/10.1016/J.JCLEPRO.2013.06.031 .

Bogataj M. Closed Loop Supply Chain (CLSC): economics, modelling, management and control. Int J Prod Econ. 2017;183:319–21. https://doi.org/10.1016/J.IJPE.2016.11.020 .

Hopkins J, Hawking P. Big Data Analytics and IoT in logistics: a case study. Int J Logist Manage. 2018;29(2):575–91. https://doi.org/10.1108/IJLM-05-2017-0109 .

de Oliveira CM, Soares PJSR, Morales G, Arica J, Matias IO. RFID and its applications on supply chain in Brazil: a structured literature review (2006–2016). Espacios. 2017;38(31). https://www.scopus.com/inward/record.uri?eid=2-s2.0-85021922345&partnerID=40&md5=f062191611541391ded4cdb73eea55cb .

Griva A, Bardaki C, Pramatari K, Papakiriakopoulos D. Retail business analytics: customer visit segmentation using market basket data. Expert Syst Appl. 2018;100:1–16. https://doi.org/10.1016/J.ESWA.2018.01.029 .

Lee CKM, Ho W, Ho GTS, Lau HCW. Design and development of logistics workflow systems for demand management with RFID. Expert Syst Appl. 2011;38(5):5428–37. https://doi.org/10.1016/J.ESWA.2010.10.012 .

Mohebi E, Marquez L. Application of machine learning and RFID in the stability optimization of perishable foods; 2008.

Jiao Z, Ran L, Zhang Y, Li Z, Zhang W. Data-driven approaches to integrated closed-loop sustainable supply chain design under multi-uncertainties. J Clean Prod. 2018;185:105–27.

Levis AA, Papageorgiou LG. Customer demand forecasting via support vector regression analysis. Chem Eng Res Des. 2005;83(8):1009–18. https://doi.org/10.1205/CHERD.04246 .

Chi H-M, Ersoy OK, Moskowitz H, Ward J. Modeling and optimizing a vendor managed replenishment system using machine learning and genetic algorithms. Eur J Oper Res. 2007;180(1):174–93. https://doi.org/10.1016/J.EJOR.2006.03.040 .

Sun Z-L, Choi T-M, Au K-F, Yu Y. Sales forecasting using extreme learning machine with applications in fashion retailing. Decis Support Syst. 2008;46(1):411–9. https://doi.org/10.1016/J.DSS.2008.07.009 .

Efendigil T, Önüt S, Kahraman C. A decision support system for demand forecasting with artificial neural networks and neuro-fuzzy models: a comparative analysis. Expert Syst Appl. 2009;36(3):6697–707. https://doi.org/10.1016/J.ESWA.2008.08.058 .

Lee CC, Ou-Yang C. A neural networks approach for forecasting the supplier’s bid prices in supplier selection negotiation process. Expert Syst Appl. 2009;36(2):2961–70. https://doi.org/10.1016/J.ESWA.2008.01.063 .

Chen F-L, Chen Y-C, Kuo J-Y. Applying Moving back-propagation neural network and Moving fuzzy-neuron network to predict the requirement of critical spare parts. Expert Syst Appl. 2010;37(9):6695–704. https://doi.org/10.1016/J.ESWA.2010.04.037 .

Wu Q. Product demand forecasts using wavelet kernel support vector machine and particle swarm optimization in manufacture system. J Comput Appl Math. 2010;233(10):2481–91. https://doi.org/10.1016/J.CAM.2009.10.030 .

Babai MZ, Ali MM, Boylan JE, Syntetos AA. Forecasting and inventory performance in a two-stage supply chain with ARIMA(0,1,1) demand: theory and empirical analysis. Int J Prod Econ. 2013;143(2):463–71. https://doi.org/10.1016/J.IJPE.2011.09.004 .

Kourentzes N. Intermittent demand forecasts with neural networks. Int J Prod Econ. 2013;143(1):198–206. https://doi.org/10.1016/J.IJPE.2013.01.009 .

Lau HCW, Ho GTS, Zhao Y. A demand forecast model using a combination of surrogate data analysis and optimal neural network approach. Decis Support Syst. 2013;54(3):1404–16. https://doi.org/10.1016/J.DSS.2012.12.008 .

Arunraj NS, Ahrens D. A hybrid seasonal autoregressive integrated moving average and quantile regression for daily food sales forecasting. Int J Prod Econ. 2015;170:321–35. https://doi.org/10.1016/J.IJPE.2015.09.039 .

Di Pillo G, Latorre V, Lucidi S, Procacci E. An application of support vector machines to sales forecasting under promotions. 4OR. 2016. https://doi.org/10.1007/s10288-016-0316-0 .

da Veiga CP, da Veiga CRP, Puchalski W, dos Coelho LS, Tortato U. Demand forecasting based on natural computing approaches applied to the foodstuff retail segment. J Retail Consumer Serv. 2016;31:174–81. https://doi.org/10.1016/J.JRETCONSER.2016.03.008 .

Chawla A, Singh A, Lamba A, Gangwani N, Soni U. Demand forecasting using artificial neural networks—a case study of American retail corporation. In: Applications of artificial intelligence techniques in wind power generation. Integrated Computer-Aided Engineering; 2018, p. 79–90. https://doi.org/10.3233/ica-2001-8305 .

Pereira MM, Machado RL, Ignacio Pires SR, Pereira Dantas MJ, Zaluski PR, Frazzon EM. Forecasting scrap tires returns in closed-loop supply chains in Brazil. J Clean Prod. 2018;188:741–50. https://doi.org/10.1016/J.JCLEPRO.2018.04.026 .

Fanoodi B, Malmir B, Jahantigh FF. Reducing demand uncertainty in the platelet supply chain through artificial neural networks and ARIMA models. Comput Biol Med. 2019;113:103415. https://doi.org/10.1016/J.COMPBIOMED.2019.103415 .

Sharma R, Singhal P. Demand forecasting of engine oil for automotive and industrial lubricant manufacturing company using neural network. Mater Today Proc. 2019;18:2308–14. https://doi.org/10.1016/J.MATPR.2019.07.013 .

Tanizaki T, Hoshino T, Shimmura T, Takenaka T. Demand forecasting in restaurants using machine learning and statistical analysis. Procedia CIRP. 2019;79:679–83. https://doi.org/10.1016/J.PROCIR.2019.02.042 .

Wang C-H, Chen J-Y. Demand forecasting and financial estimation considering the interactive dynamics of semiconductor supply-chain companies. Comput Ind Eng. 2019;138:106104. https://doi.org/10.1016/J.CIE.2019.106104 .

Download references

Acknowledgements

The authors are very thankful to the anonymous reviewers, whose comments and suggestions were very helpful in improving the quality of the manuscript.

Author information

Authors and affiliations

Concordia Institute for Information Systems Engineering (CIISE), Concordia University, Montreal, H3G 1M8, Canada

Mahya Seyedan & Fereshteh Mafakheri

Contributions

The authors contributed equally to the writing of the paper. The first author conducted the literature search. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Fereshteh Mafakheri.

Ethics declarations

Ethics approval

Not applicable.

Competing interests

The authors declare no competing or conflicting interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Seyedan, M., Mafakheri, F. Predictive big data analytics for supply chain demand forecasting: methods, applications, and research opportunities. J Big Data 7, 53 (2020). https://doi.org/10.1186/s40537-020-00329-2


Received: 05 April 2020

Accepted: 17 July 2020

Published: 25 July 2020

DOI: https://doi.org/10.1186/s40537-020-00329-2


  • Demand forecasting
  • Closed-loop supply chains
  • Machine-learning


  • Open access
  • Published: 29 August 2023

Healthcare predictive analytics using machine learning and deep learning techniques: a survey

  • Mohammed Badawy (ORCID: orcid.org/0000-0001-9494-1386),
  • Nagy Ramadan &
  • Hesham Ahmed Hefny

Journal of Electrical Systems and Information Technology, volume 10, Article number: 40 (2023)


Healthcare prediction has been a significant factor in saving lives in recent years. In the domain of health care, there is rapid development of intelligent systems for analyzing complicated data relationships and transforming them into real information for use in the prediction process. Consequently, artificial intelligence is rapidly transforming the healthcare industry, and with it comes the role of systems based on machine learning and deep learning that diagnose and predict diseases, whether from clinical data or from images; these systems provide tremendous clinical support by simulating human perception and can even diagnose diseases that are difficult to detect by human intelligence. Predictive analytics for healthcare is a critical imperative in the healthcare industry. It can significantly affect the accuracy of disease prediction, which may save patients' lives in the case of an accurate and timely prediction; on the contrary, an incorrect prediction may endanger patients' lives. Therefore, diseases must be accurately predicted and estimated, and reliable and efficient methods for healthcare predictive analysis are essential. Hence, this paper aims to present a comprehensive survey of existing machine learning and deep learning approaches utilized in healthcare prediction and to identify the inherent obstacles to applying these approaches in the healthcare domain.

Introduction

Each day, human existence evolves, yet the health of each generation either improves or deteriorates. There are always uncertainties in life. We occasionally encounter individuals with fatal health problems caused by the late detection of diseases. Among the adult population, chronic liver disease affects more than 50 million individuals worldwide. However, if the sickness is diagnosed early, it can be stopped. Disease prediction based on machine learning can be utilized to identify common diseases at an earlier stage. Currently, health is a secondary concern for many people, which has led to numerous problems. Many patients cannot afford to see a doctor, and others are extremely busy and on a tight schedule, yet ignoring recurring symptoms for an extended length of time can have significant health repercussions [ 1 ].

Diseases are a global issue; thus, medical specialists and researchers are exerting their utmost efforts to reduce disease-related mortality. In recent years, predictive analytics models have played a pivotal role in the medical profession because of the increasing volume of healthcare data from a wide range of disparate and incompatible data sources. Nonetheless, processing, storing, and analyzing the massive amount of historical data and the constant inflow of streaming data created by healthcare services has become an unprecedented challenge for traditional database storage [ 2 , 3 , 4 ]. A medical diagnosis is a form of problem-solving and a crucial and significant issue in the real world. Illness diagnosis is the process of translating observational evidence into disease names. The evidence comprises data received from evaluating a patient and substances generated from the patient; illnesses are conceptual medical entities that detect anomalies in the observed evidence [ 5 ].

Healthcare is the collective effort of society to ensure, provide, finance, and promote health. In the twentieth century, there was a significant shift toward the ideal of wellness and the prevention of sickness and incapacity. The delivery of healthcare services entails organized public or private efforts to aid persons in regaining health and preventing disease and impairment [ 6 ]. Health care can be described as standardized rules that help evaluate actions or situations that affect decision-making [ 7 ]. Healthcare is a multi-dimensional system. The basic goal of health care is to diagnose and treat illnesses or disabilities. A healthcare system’s key components are health experts (physicians or nurses), health facilities (clinics and hospitals that provide medications and other diagnostic services), and a funding institution to support the first two [ 8 ].

With the introduction of systems based on computers, the digitalization of all medical records and the evaluation of clinical data in healthcare systems have become widespread routine practices. The phrase "electronic health records" was chosen by the Institute of Medicine, a division of the National Academies of Sciences, Engineering, and Medicine, in 2003 to define the records that continued to enhance the healthcare sector for the benefit of both patients and physicians. Electronic Health Records (EHR) are "computerized medical records for patients that include all information in an individual's past, present, or future that occurs in an electronic system used to capture, store, retrieve, and link data primarily to offer healthcare and health-related services," according to Murphy, Hanken, and Waters [ 8 ].

Daily, healthcare services produce an enormous amount of data, making it increasingly complicated to analyze and handle in conventional ways. Using machine learning and deep learning, this data may be properly analyzed to generate actionable insights. In addition, genomics, medical data, social media data, environmental data, and other data sources can be used to supplement healthcare data. Figure 1 provides a visual picture of these data sources. The four key healthcare applications that can benefit from machine learning are prognosis, diagnosis, therapy, and clinical workflow, as outlined in the following section [ 9 ].

figure 1

Illustration of heterogeneous sources contributing to healthcare data [ 9 ]

The long-term investment in developing novel technologies based on machine learning as well as deep learning techniques to improve the health of individuals via the prediction of future events reflects the increased interest in predictive analytics techniques to enhance healthcare. Clinical predictive models, as they have been formerly referred to, assisted in the diagnosis of people with an increased probability of disease. These prediction algorithms are utilized to make clinical treatment decisions and counsel patients based on some patient characteristics [ 10 ].

The concept of medical care is used to stress the organization and administration of curative care, which is a subset of health care. The ecology of medical care was first introduced by White in 1961. White also proposed a framework for perceiving patterns of health concerning symptoms experienced by populations of interest, along with individuals’ choices in getting medical treatment. In this framework, it is possible to calculate the proportion of the population that used medical services over a specific period of time. The "ecology of medical care" theory has become widely accepted in academic circles over the past few decades [ 6 ].

Medical personnel usually face new problems, changing tasks, and frequent interruptions because of the system's dynamism and scalability. This variability often makes disease recognition a secondary concern for medical experts. Moreover, the clinical interpretation of medical data is a challenging task from an epistemological point of view. This not only applies to professionals with extensive experience but also to representatives, such as young physician assistants, with varied or little experience [ 11 ]. The limited time available to medical personnel, the speedy progression of diseases, and the fluctuating patient dynamics make diagnosis a particularly complex process. However, a precise method of diagnosis is critical to ensuring speedy treatment and, thus, patient safety [ 12 ].

Predictive analytics for health care is a critical industry requirement. It can have a significant impact on the accuracy of disease prediction, which can save patients' lives in the case of an accurate and timely prediction but can also endanger patients' lives in the case of an incorrect prediction. Diseases must therefore be accurately predicted and estimated. As a result, dependable and efficient methods for healthcare predictive analysis are required.

The purpose of this paper is to present a comprehensive review of common machine learning and deep learning techniques that are utilized in healthcare prediction, in addition to identifying the inherent obstacles that are associated with applying these approaches in the healthcare domain.

The rest of the paper is organized as follows: Section "Background" gives a theoretical background on artificial intelligence, machine learning, and deep learning techniques. Section "Disease prediction with analytics" outlines the survey methodology and presents a literature review of machine learning as well as deep learning approaches employed in healthcare prediction. Section "Results and Discussion" gives a discussion of the results of previous works related to healthcare prediction. Section "Challenges" covers the existing challenges related to the topic of this survey. Finally, Section "Conclusion" concludes the paper.

Background

The extensive research and development of cutting-edge tools based on machine learning and deep learning for predicting individual health outcomes demonstrates the increased interest in predictive analytics techniques to improve health care. Clinical predictive models assisted physicians in better identifying and treating patients who were at a higher risk of developing a serious illness. Based on a variety of factors unique to each individual patient, these prediction algorithms are used to advise patients and guide clinical practice.

Artificial intelligence (AI) is the ability of a system to interpret data, and it makes use of computers and machines to improve humans' capacity for decision-making, problem-solving, and technological innovation [ 13 ]. Figure  2 depicts machine learning and deep learning as subsets of AI.

figure 2

AI, ML, and DL

Machine learning

Machine learning (ML) is a subfield of AI that aims to develop predictive algorithms based on the idea that machines should have the capability to access data and learn on their own [ 14 ]. ML utilizes algorithms, methods, and processes to detect basic correlations within data and create descriptive and predictive tools that process those correlations. ML is usually associated with data mining, pattern recognition, and deep learning. Although there are no clear boundaries between these areas and they often overlap, it is generally accepted that deep learning is a relatively new subfield of ML that uses extensive computational algorithms and large amounts of data to define complex relationships within data. As shown in Fig.  3 , ML algorithms can be divided into three categories: supervised learning, unsupervised learning, and reinforcement learning [ 15 ].

figure 3

Different types of machine learning algorithms

Supervised learning

Supervised learning is an ML model for investigating the input–output correlation of a system based on a given set of training examples that pair inputs with outputs [ 16 ]. The model is trained with a labeled dataset, much as a student learns fundamental math from a teacher. This kind of learning requires labeled data in which the correct answers the algorithm should output are known in advance [ 17 ]. The most widely used supervised learning-based techniques include linear regression, logistic regression, decision trees, random forests, support vector machines, K-nearest neighbor, and naive Bayes.
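For illustration, the train-then-predict pattern shared by all of these supervised techniques can be sketched in a few lines of Python. The toy data and the one-nearest-neighbor rule below are hypothetical, chosen only to show how a model learns from labeled input–output pairs:

```python
# Minimal sketch of supervised learning: a one-nearest-neighbor classifier
# "trained" on hypothetical labeled examples (input, label).
train = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]

def predict(x):
    # Return the label of the training input closest to x.
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

print(predict(1.5), predict(8.5))  # → low high
```

Unseen inputs are classified purely from the labeled examples, which is exactly the input–output pairing described above.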

A. Linear regression

Linear regression is a statistical method commonly used in predictive investigations. It forecasts the dependent (output) variable Y from the independent (input) variable X. Assuming continuous, real, and numeric parameters, the relationship between X and Y is represented as shown in Eq.  1 :

Y = mX + c    (1)

where m indicates the slope and c indicates the intercept. According to Eq.  1 , the association between the independent parameter (X) and the dependent parameter (Y) can be inferred [ 18 ].

The advantage of linear regression is that it is straightforward to learn, and overfitting is easy to eliminate through regularization. A drawback of linear regression is that it is not convenient for nonlinear relationships; because it greatly simplifies real-world problems, it is not recommended for most practical applications [ 19 ]. The implementation tools utilized in linear regression are Python, R, MATLAB, and Excel.

As shown in Fig.  4 , observations (highlighted in red) deviate randomly (deviations shown in green) from the underlying relationship (shown in yellow) between the independent variable (x) and the dependent variable (y) [ 20 ].

figure 4

Linear regression model
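
The least-squares fit behind Eq. 1 can be written out in a few lines. The sketch below is purely illustrative (plain Python, made-up sample numbers), computing m as cov(X, Y)/var(X) and c from the means:

```python
# Least-squares fit of Y = mX + c (Eq. 1) from scratch.
# Illustrative sketch only; the tools named in the text
# (Python/scikit-learn, R, MATLAB, Excel) are used in practice.

def fit_line(xs, ys):
    """Return slope m and intercept c minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # m = covariance(X, Y) / variance(X)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    m = cov / var
    c = mean_y - m * mean_x
    return m, c

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]   # roughly y = 2x + 1 plus noise
m, c = fit_line(xs, ys)
print(round(m, 2), round(c, 2))   # prints: 1.94 1.15
```

The fitted m and c are then plugged into Eq. 1 to predict Y for new values of X.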

B. Logistic regression

Logistic regression, also known as the logistic model, investigates the correlation between many independent variables and a categorical dependent variable, and calculates the probability of an event by fitting the data to a logistic curve [ 21 ]. The discrete outcome must be binary, i.e., have only two values: true or false, 0 or 1, or yes or no. Logistic regression is used to predict categorical variables and to solve classification problems. It can be implemented using various tools such as R, Python, Java, and MATLAB [ 18 ]. Logistic regression has many benefits; for example, it models the relationship between the dependent and independent variables well and is simple to understand. On the other hand, it can only predict discrete output, is not suitable for nonlinear data, and is sensitive to outliers [ 22 ].
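
As a rough illustration of the idea, the sketch below fits a one-feature logistic model by gradient descent on the log-loss; the data and hyperparameters are invented for the example, and a real project would use the tools listed above:

```python
import math

# Minimal logistic regression with one feature, trained by gradient
# descent on the log-loss. Didactic sketch only.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit (w, b) so that sigmoid(w*x + b) approximates P(y=1 | x)."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            # Gradient of the log-loss for a single sample.
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]          # binary outcome (e.g., no/yes)
w, b = train_logistic(xs, ys)
print(sigmoid(w * 1.0 + b) < 0.5, sigmoid(w * 3.5 + b) > 0.5)
```

The fitted curve assigns probabilities below 0.5 to the low-valued inputs and above 0.5 to the high-valued ones, which is exactly the binary decision described above.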

C. Decision tree

The decision tree (DT) is a supervised learning technique used for classification. It compares attribute values in order, either ascending or descending [ 23 ]. As a tree-based strategy, DT defines each path starting from the root using a data-separating sequence until a Boolean conclusion is reached at a leaf node [ 24 , 25 ]. DT is a hierarchical representation of knowledge interactions that contains nodes and links: nodes represent the attributes being tested, and links represent the relations employed to classify [ 26 , 27 ]. An example of a DT is presented in Fig.  5 .

figure 5

Example of a DT

DTs have various drawbacks: complexity increases as the number of classes grows, small modifications to the data may lead to a different tree architecture, and training the data takes more processing time [ 18 ]. The implementation tools used in DT are Python (Scikit-Learn), RStudio, Orange, KNIME, and Weka [ 22 ].
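
The core operation a DT repeats at every node, choosing the data split that minimizes impurity, can be sketched as a single Gini-based split on one feature (a depth-1 "stump"; toy data, illustrative only):

```python
# One Gini-impurity split -- the operation a decision tree applies
# recursively at each node. Depth-1 sketch for illustration only.

def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split(xs, ys):
    """Return the threshold on x minimizing weighted Gini impurity."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]      # perfectly separable at x <= 3
print(best_split(xs, ys))    # prints: 3
```

A full DT would apply `best_split` recursively to the left and right partitions until each leaf reaches a Boolean conclusion, as described above.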

D. Random forest

Random forest (RF) is a simple technique that produces correct results most of the time. It may be utilized for both classification and regression: the algorithm builds an ensemble of DTs and blends their outputs [ 28 ].

In the RF classifier, the higher the number of trees in the forest, the more accurate the results. RF generates a collection of DTs, called the forest, and combines them to achieve more accurate prediction results. Each DT is built on only a part of the given dataset, and the RF brings the individual trees together to reach the optimal decision [ 18 ].

As indicated in Fig.  6 , RF randomly selects a subset of features from the data, and from each subset it generates n random trees [ 20 ]. RF will combine the results from all DTs and provide them in the final output.

figure 6

Random forest architecture

Two parameters are used for tuning RF models: mtry, the number of randomly selected features considered at each split, and ntree, the number of trees in the model. The mtry parameter involves a trade-off: large values raise the correlation between trees but enhance per-tree accuracy [ 29 ].

The RF works with a labeled dataset to do predictions and build a model. The final model is utilized to classify unlabeled data. The model integrates the concept of bagging with a random selection of traits to build variance-controlled DTs [ 30 ].

RF offers significant benefits. First, it can be utilized for determining the relevance of the variables in a regression and classification task [ 31 , 32 ]. This relevance is measured on a scale, based on the impurity drop at each node used for data segmentation [ 33 ]. Second, it automates missing values contained in the data and resolves the overfitting problem of DT. Finally, RF can efficiently handle huge datasets. On the other side, RF suffers from drawbacks; for example, it needs more computing and resources to generate the output results and it requires training effort due to the multiple DTs involved in it. The implementation tools used in RF are Python Scikit-Learn and R [ 18 ].
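
The bagging idea described above can be sketched with bootstrap-sampled threshold "stumps" combined by majority vote. This is a toy illustration only: real RFs grow full trees and also randomly subsample features (mtry), which this one-feature sketch omits:

```python
import random

# Random forest in miniature: decision stumps trained on bootstrap
# samples, combined by majority vote. Didactic sketch only.

def train_stump(xs, ys):
    """Best threshold stump (predict 1 when x > t) by training accuracy."""
    best = None
    for t in xs:
        acc = sum((x > t) == bool(y) for x, y in zip(xs, ys)) / len(ys)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best[0]

def train_forest(xs, ys, n_trees=25, seed=0):
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_trees):
        # Bootstrap: sample the training set with replacement.
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
        thresholds.append(train_stump([xs[i] for i in idx],
                                      [ys[i] for i in idx]))
    return thresholds

def predict(thresholds, x):
    votes = sum(x > t for t in thresholds)
    return int(votes * 2 > len(thresholds))   # majority vote

xs = [1, 2, 3, 4, 10, 11, 12, 13]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(xs, ys)
print(predict(forest, 0), predict(forest, 20))   # prints: 0 1
```

Even though each bootstrap sample sees only part of the data, the vote across trees recovers the correct decision, which is the variance-reduction effect described above.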

E. Support vector machine

The support vector machine (SVM) is a supervised ML technique for classification problems and regression models. SVM is a linear model that offers solutions to both linear and nonlinear problems, as shown in Fig.  7 . Its foundation is the idea of margin calculation: the dataset is divided into several groups, and relations are built between them [ 18 ].

figure 7

Support vector machine

SVM is a statistics-based learning method that follows the principle of structural risk minimization and aims to locate decision bounds, also known as hyperplanes, that can optimally separate classes by finding a hyperplane in a usable N-dimensional space that explicitly classifies data points [ 34 , 35 , 36 ]. SVM indicates the decision boundary between two classes by defining the value of each data point, in particular the support vector points placed on the boundary between the respective classes [ 37 ].

SVM has several advantages; for example, it works well with both semi-structured and unstructured data. The kernel trick is a strong point of SVM. Moreover, it can handle any complex problem given the right kernel function, it can handle high-dimensional data, and it generalizes well, with less risk of overfitting. On the other hand, SVM has several downsides: training time increases on large datasets, choosing the right kernel function is difficult, and it does not work well with noisy data. Implementation tools used in SVM include SVMlight with C, LibSVM with Python, MATLAB or Ruby, SAS, Kernlab, Scikit-Learn, and Weka [ 22 ].
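
A linear soft-margin SVM can be sketched as (sub)gradient descent on the hinge loss. This didactic version (toy data, hand-picked hyperparameters) learns only a linear boundary, with no kernel trick; the tools listed above are what one would use in practice:

```python
# Linear soft-margin SVM trained by (sub)gradient descent on the
# hinge loss. Didactic sketch only.

def train_svm(points, labels, lr=0.01, lam=1e-4, epochs=3000):
    """labels must be +1/-1; returns weight vector w and bias b."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            # L2 regularization shrinks w at every step.
            w = [wi * (1 - lr * lam) for wi in w]
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:   # margin violated
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

points = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 6), (6, 5)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_svm(points, labels)

def decide(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

print([decide(p) for p in points])
```

The learned hyperplane separates the two groups with a margin, which is the margin-calculation idea mentioned above.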

F. K-nearest neighbor

K-nearest neighbor (KNN) is an "instance-based" or non-generalizing learning algorithm, often known as a "lazy learning" algorithm [ 38 ]. KNN is used for solving classification problems. To predict the target label of new test data, KNN determines the distance from the new test point to the nearest training data class labels, given a value of K, as shown in Fig.  8 . It then counts the nearest data points using the K value and determines the label of the new test data point. To set the number of nearest training points considered, KNN usually chooses K = n^(1/2), where n is the size of the dataset [ 22 ].

figure 8

K-nearest neighbor

KNN has many benefits; for example, it is sufficiently powerful if the training data are large, it is simple and flexible with respect to attributes and distance functions, and it can handle multi-class datasets. Its drawbacks include the difficulty of choosing the appropriate K value, the tedium of choosing a distance function type for a particular dataset, and a somewhat high computation cost, since distances to all the training data points must be computed. The implementation tools used in KNN are Python (Scikit-Learn), WEKA, R, KNIME, and Orange [ 22 ].
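
The distance-and-vote procedure, including the K = n^(1/2) rule of thumb mentioned above, can be sketched in plain Python (toy data, illustrative only):

```python
import math
from collections import Counter

# KNN from scratch with K = sqrt(n). Didactic sketch only; see e.g.
# scikit-learn's KNN tools mentioned in the text for real use.

def knn_predict(train_x, train_y, query):
    k = max(1, round(math.sqrt(len(train_x))))
    # Distance from the query to every training point ("lazy" learning:
    # all work happens at prediction time).
    dists = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    top = [y for _, y in dists[:k]]
    return Counter(top).most_common(1)[0][0]   # majority label

train_x = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_x, train_y, (1.5, 1.5)))   # prints: A
```

Note the cost drawback above is visible here: every prediction recomputes distances to the whole training set.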

G. Naive Bayes

Naive Bayes (NB) is based on the probabilistic model of Bayes' theorem and is simple to set up, since no complex recursive parameter estimation is needed, making it suitable for huge datasets [ 39 ]. NB determines the degree of class membership based on a given class designation [ 40 ]. It scans the data once, and thus, classification is easy [ 41 ]. Simply put, the NB classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. It is mainly used for text classification [ 42 ].

NB has great benefits: it is easy to implement, can provide good results even with little training data, can manage both continuous and discrete data, is well suited to multi-class prediction problems, and irrelevant features do not strongly affect its predictions. On the other hand, NB has the following drawbacks: it assumes that all features are independent, which is rarely true in real-world problems; it suffers from the zero-frequency problem; and its predictions are not always accurate. Implementation tools are WEKA, Python, RStudio, and Mahout [ 22 ].
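
The independence assumption makes NB compact to implement. The sketch below is a minimal Gaussian NB for continuous features (per-class feature means and variances; toy data, illustrative only):

```python
import math
from collections import defaultdict

# Gaussian naive Bayes from scratch: each class keeps per-feature
# means/variances, and features are assumed independent (the "naive"
# assumption described in the text). Didactic sketch only.

def train_nb(samples, labels):
    stats = {}
    by_class = defaultdict(list)
    for s, y in zip(samples, labels):
        by_class[y].append(s)
    for y, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        varis = [sum((v - m) ** 2 for v in col) / n + 1e-9
                 for col, m in zip(zip(*rows), means)]
        stats[y] = (math.log(n / len(samples)), means, varis)
    return stats

def predict_nb(stats, x):
    def log_post(y):
        log_prior, means, varis = stats[y]
        # log prior + sum of per-feature Gaussian log-likelihoods.
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, means, varis)
        )
    return max(stats, key=log_post)

samples = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2),
           (5.0, 8.0), (5.2, 7.8), (4.8, 8.2)]
labels = [0, 0, 0, 1, 1, 1]
stats = train_nb(samples, labels)
print(predict_nb(stats, (1.1, 2.1)), predict_nb(stats, (5.1, 8.1)))
```

Training is a single pass over the data, which matches the "scans the data once" property noted above.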

To summarize the previously discussed models, Table 1 demonstrates the advantages and disadvantages of each model.

Unsupervised learning

Unlike supervised learning, there are no correct answers and no teachers in unsupervised learning [ 42 ]. It follows the concept that a machine can learn to understand complex processes and patterns on its own without external guidance. This approach is particularly useful in cases where experts have no knowledge of what to look for in the data and the data itself do not include the objectives. The machine predicts the outcome based on past experiences and learns to predict the real-valued outcome from the information previously provided, as shown in Fig.  9 .

figure 9

Workflow of unsupervised learning [ 23 ]

Unsupervised learning is widely used in the processing of multimedia content, as clustering and partitioning of data in the lack of class labels is often a requirement [ 43 ]. Some of the most popular unsupervised learning-based approaches are k-means, principal component analysis (PCA), and apriori algorithm.

A. K-means

The k-means algorithm is a common partitioning method [ 44 ] and one of the most popular unsupervised learning algorithms that deal with the well-known clustering problem. The procedure classifies a particular dataset into a certain number of preselected clusters (k) [ 45 ]. The pseudocode of the k-means algorithm is shown in Pseudocode 1.

Pseudocode 1: The k-means algorithm

K-means has several benefits, such as being more computationally efficient than hierarchical clustering when there are many variables, providing more compact clusters than hierarchical clustering when a small k is used, and being easy to implement, with clustering results that are easy to comprehend. However, k-means has disadvantages, such as the difficulty of predicting the value of K. Performance is also affected because different starting positions lead to different final clusters. The algorithm only reaches a local optimum, and there is no single solution for a given K value, so k-means must be run multiple times (20–100 times) and the result with the minimum J picked [ 19 ].
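
The assign/update loop, including the advice above to run the algorithm several times and keep the result with minimum J, can be sketched as follows (plain Python, toy 2D points):

```python
import random

# K-means with multiple random restarts, keeping the run with minimum
# total within-cluster squared distance J. Didactic sketch only.

def kmeans(points, k, restarts=20, iters=100, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        centers = rng.sample(points, k)
        for _ in range(iters):
            # Assignment step: each point joins its nearest center.
            groups = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda i: (p[0]-centers[i][0])**2
                                               + (p[1]-centers[i][1])**2)
                groups[i].append(p)
            # Update step: move each center to its group mean.
            new_centers = [
                (sum(p[0] for p in g)/len(g), sum(p[1] for p in g)/len(g))
                if g else centers[i]
                for i, g in enumerate(groups)
            ]
            if new_centers == centers:   # converged
                break
            centers = new_centers
        # J: total squared distance of points to their nearest center.
        j = sum(min((p[0]-c[0])**2 + (p[1]-c[1])**2 for c in centers)
                for p in points)
        if best is None or j < best[0]:
            best = (j, centers)
    return best[1]

points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centers = sorted(kmeans(points, 2))
print(centers)
```

With two well-separated groups, the restarts reliably find the two cluster means, illustrating why repeated runs with minimum J are recommended.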

B. Principal component analysis

In modern data analysis, principal component analysis (PCA) is an essential tool as it provides a guide for extracting the most important information from a dataset, compressing the data size by keeping only those important features without losing much information, and simplifying the description of a dataset [ 46 , 47 ].

PCA is frequently used to reduce data dimensions before applying classification models. Moreover, unsupervised methods, such as dimensionality reduction or clustering algorithms, are commonly used for data visualization, detection of common trends or behaviors, and decreasing the data quantity, to name a few [ 48 ].

PCA converts the 2D data into 1D data. This is done by changing the set of variables into new variables known as principal components (PC) which are orthogonal [ 23 ]. In PCA, data dimensions are reduced to make calculations faster and easier. To illustrate how PCA works, let us consider an example of 2D data. When these data are plotted on a graph, it will take two axes. Applying PCA, the data turn into 1D. This process is illustrated in Fig.  10 [ 49 ].

figure 10

Visualization of data before and after applying PCA [ 49 ]
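
For the 2D-to-1D case described above, the leading eigenvector of the 2x2 covariance matrix has a closed form, so the projection can be sketched in plain Python (toy data, illustrative only):

```python
import math

# PCA on 2D data: project centered points onto the leading eigenvector
# of the covariance matrix, reducing 2D to 1D. Didactic sketch using
# the closed-form eigenvector of a symmetric 2x2 matrix.

def pca_1d(points):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Covariance matrix [[a, b], [b, c]] of the centered data.
    a = sum((p[0]-mx)**2 for p in points) / n
    b = sum((p[0]-mx)*(p[1]-my) for p in points) / n
    c = sum((p[1]-my)**2 for p in points) / n
    # Leading eigenvalue/eigenvector of the symmetric 2x2 matrix.
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    vx, vy = (b, lam - a) if b else (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    # 1D score: projection of each centered point onto the component.
    return [(p[0]-mx)*vx + (p[1]-my)*vy for p in points]

points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8)]
scores = pca_1d(points)
print([round(s, 2) for s in scores])
```

Because the toy points lie roughly along a line, the single principal component preserves their ordering while halving the number of coordinates, which is the compression idea described above.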

C. Apriori algorithm

The apriori algorithm is considered an important algorithm; it was first introduced by R. Agrawal and R. Srikant and published in [ 50 , 51 ].

The principle of the apriori algorithm is its candidate-generation strategy: it creates candidate (k + 1)-item sets based on the frequent k-item sets. Apriori uses an iterative, level-wise search, in which k-item sets are employed to explore (k + 1)-item sets. First, the set of frequent 1-item sets is produced by scanning the dataset to count each item and collecting the items that meet minimum support; the resulting set is called L1. Then L1 is used to find L2, the set of frequent 2-item sets, L2 is used to find L3, and so on until no frequent k-item sets can be found. Finding each Lk requires a full scan of the dataset. To improve the efficiency of the level-wise generation of frequent item sets, a key property called the apriori property is used to reduce the search space: all non-empty subsets of a frequent item set must also be frequent. A two-step technique is used to identify frequent item sets: the join and prune operations [ 52 ].

Although it is simple, the apriori algorithm suffers from several drawbacks. The main limitation is the time wasted handling many candidate sets containing a lot of redundant item sets. It also performs poorly with low minimum support or large item sets, and multiple scans of the data are needed for mining, which often yields irrelevant items; in addition, it has difficulty discovering individual elements of events [ 53 , 54 ].
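
The level-wise join-and-prune procedure can be sketched directly from the description above (toy transactions; a minimal illustration, not an optimized implementation):

```python
from itertools import combinations

# Level-wise apriori frequent-itemset mining: L_k is joined into
# candidate (k+1)-item sets, pruned using the apriori property
# ("all subsets of a frequent set are frequent"). Didactic sketch.

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(itemset):
        return sum(itemset <= t for t in transactions)

    # L1: frequent single items.
    level = [frozenset([i]) for i in sorted(items)
             if support(frozenset([i])) >= min_support]
    frequent = list(level)
    k = 1
    while level:
        # Join step: unions of L_k members that share k-1 items.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in set(level)
                             for s in combinations(c, k))}
        level = [c for c in candidates if support(c) >= min_support]
        frequent.extend(level)
        k += 1
    return frequent

transactions = [{"bread", "milk"},
                {"bread", "milk", "eggs"},
                {"bread", "eggs"},
                {"milk", "eggs"}]
result = {tuple(sorted(s)) for s in apriori(transactions, min_support=2)}
print(sorted(result))
```

Each pass over `transactions` inside `support` corresponds to the full dataset scan per level noted above, which is exactly where the algorithm's cost drawback comes from.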

To summarize the previously discussed models, Table 2 demonstrates the advantages and disadvantages of each model.

Reinforcement learning

Reinforcement learning (RL) is different from supervised and unsupervised learning. It is a goal-oriented learning approach. RL is closely related to an agent (controller) that takes responsibility for the learning process to achieve a goal. The agent chooses actions, and as a result, the environment changes its state and returns rewards. Rewards are positive or negative numerical values, and the agent's goal is to maximize the rewards accumulated over time. A task is a complete specification of an environment that identifies how rewards are generated [ 55 ]. Some of the most popular reinforcement learning-based algorithms are the Q-learning algorithm and the Monte Carlo tree search (MCTS).

A. Q-learning

Q-learning is a type of model-free RL. It can be considered an asynchronous dynamic programming approach. It enables agents to learn how to act optimally in Markovian domains by exploring the effects of actions, without the need to build maps of the domain [ 56 ]. It represents an incremental method of dynamic programming that imposes low computational requirements, working through successive improvement of the evaluation of the quality of individual actions in particular states [ 57 ].

Q-learning is strongly employed in information theory, and related investigations are ongoing. Recently, Q-learning combined with information theory has been employed in different disciplines such as natural language processing (NLP), pattern recognition, anomaly detection, and image classification [ 58 , 59 , 60 ]. Moreover, a framework has been created to provide a satisfying response based on the user's utterance using RL in a voice interaction system [ 61 ]. Furthermore, a high-resolution deep learning-based prediction system for local rainfall has been constructed [ 62 ].

The advantage of Q-learning is that the reward value can be identified effectively in a multi-agent environment, as the agents in ant Q-learning interact with each other. The problem with Q-learning is that its output can get stuck in a local minimum, as agents just take the shortest path [ 63 ].
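
The tabular update rule Q(s,a) <- Q(s,a) + alpha * (r + gamma * max Q(s',.) - Q(s,a)) can be sketched on a toy task (the 5-state corridor, rewards, and hyperparameters below are all invented for the example):

```python
import random

# Tabular Q-learning on a toy 5-state corridor: the agent starts at
# state 0 and earns a reward of +1 only on reaching state 4.
# Didactic sketch only.

N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                       # move left / move right

def step(state, action):
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            # Epsilon-greedy exploration, otherwise greedy action.
            a = rng.randrange(2) if rng.random() < eps \
                else max(range(2), key=lambda i: q[s][i])
            s2, r = step(s, ACTIONS[a])
            # Q-learning update toward the bootstrapped target.
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = train()
policy = ["right" if q[s][1] > q[s][0] else "left" for s in range(GOAL)]
print(policy)
```

After training, the greedy policy moves right in every state, showing how the reward propagates backward from the goal through successive Q-value updates, with no model of the environment required.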

B. Monte Carlo tree search

Monte Carlo tree search (MCTS) is an effective technique for solving sequential selection problems. Its strategy is based on a smart tree search that balances exploration and exploitation. MCTS draws random samples in the form of simulations and keeps action statistics to make better-educated choices in each future iteration. MCTS is a decision-making algorithm employed to search huge, complex tree-like spaces. In such trees, each node refers to a state, also referred to as a problem configuration, while edges represent transitions from one state to another [ 64 ].

The MCTS is related directly to cases that can be represented by a Markov decision process (MDP), which is a type of discrete-time random control process. Some modifications of the MCTS make it possible to apply it to partially observable Markov decision processes (POMDP) [ 65 ]. Recently, MCTS coupled with deep RL became the base of AlphaGo developed by Google DeepMind and documented in [ 66 ]. The basic MCTS method is conceptually simple, as shown in Fig.  11 .

figure 11

Basic MCTS process

The tree is constructed progressively and unevenly. For each iteration of the method, the tree policy is used to find the most critical node of the current tree; the tree policy seeks to strike a balance between exploration and exploitation. A simulation is then run from the selected node, and the search tree is updated according to the obtained results. This comprises adding a child node that matches the selected node's action and updating the statistics of its ancestors. During the simulation, moves are made according to some default policy, which in the simplest case is to make uniformly random moves. The benefit of MCTS is that the values of intermediate states do not need to be evaluated, which significantly minimizes the amount of domain knowledge required [ 67 ].

To summarize the previously discussed models, Table 3 demonstrates the advantages and disadvantages of each model.

Deep learning

Over the past decades, ML has had a significant impact on our daily lives, with examples including efficient computer vision, web search, and recognition of optical characters. In addition, by applying ML approaches, AI at the human level has also been improved [ 68 , 69 , 70 ]. However, when it comes to the mechanisms of human information processing (such as sound and vision), the performance of traditional ML algorithms is far from satisfactory. The idea of deep learning (DL) was formed in the late 20th century, inspired by the deep hierarchical structures of human voice recognition and production systems. The breakthrough in DL came in 2006, when Hinton built a deep-structured learning architecture called the deep belief network (DBN) [ 71 ].

The performance of classifiers using DL has improved extensively with the increased complexity of data compared to classical learning methods. Figure  12 shows the performance of classic ML algorithms and DL methods [ 72 ]. The performance of typical ML algorithms plateaus once the training data reach a threshold, but DL keeps improving its performance as the complexity of the data increases [ 73 ].

figure 12

Performance of deep learning concerning the complexity of data

DL (deep ML, or deep-structured learning) is a subset of ML that involves a collection of algorithms attempting to represent high-level abstractions of data through a model that has complicated structures or is otherwise composed of numerous nonlinear transformations. The most important characteristic of DL is the depth of the network. Another essential aspect of DL is the ability to replace handcrafted features with features generated by efficient algorithms for unsupervised or semi-supervised feature learning and hierarchical feature extraction [ 74 ].

DL has significantly advanced the latest technologies in a variety of applications, including machine translation, speech, and visual object recognition, NLP, and text automation, using multilayer artificial neural networks (ANNs) [ 15 ].

Different DL designs in the past two decades give enormous potential for employment in various sectors such as automatic voice recognition, computer vision, NLP, and bioinformatics. This section discusses the most common architectures of DL such as convolutional neural networks (CNNs), long short-term memory (LSTM), and recurrent convolution neural networks (RCNNs) [ 75 ].

A. Convolutional neural network

CNNs are special types of neural networks inspired by the human visual cortex and used in computer vision. A CNN is a feed-forward neural network in which information moves exclusively in the forward direction [ 76 ]. CNNs are frequently applied in face recognition, human organ localization, text analysis, and biological image recognition [ 77 ].

Since CNN was first created in 1989, it has done well in disease diagnosis over the past three decades [ 78 ]. Figure  13 depicts the general architecture of a CNN composed of feature extractors and a classifier. Each layer of the network accepts the output of the previous layer as input and passes it on to the next layer in feature extraction layers. A typical CNN architecture consists of three types of layers: convolution, pooling, and classification. There are two types of layers at the network's low and middle levels: convolutional layers and pooling layers. Even-numbered layers are used for convolutions, while odd-numbered layers are used for pooling operations. The convolution and pooling layers' output nodes are categorized in a two-dimensional plane called feature mapping. Each layer level is typically generated by combining one or more previous layers [ 79 ].

figure 13

Architecture of CNN [ 79 ]

CNNs have many benefits: they resemble the human visual processing system, their structure is highly optimized for processing 2D and 3D images, and they are effective at learning and extracting abstract information from 2D data. The max-pooling layer in a CNN is efficient at absorbing shape anisotropy. Furthermore, CNNs are constructed from sparse connections with tied weights and contain far fewer parameters than a fully connected network of equal size. CNNs are trained using a gradient-based learning algorithm and are less susceptible to the vanishing gradient problem, because the gradient-based approach trains the entire network to directly minimize the error criterion, allowing CNNs to produce highly optimized weights [ 79 ].
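
The two feature-extraction layer types described above, convolution and max pooling, can be sketched in miniature. The kernel here is a hand-set vertical-edge detector rather than a learned one, and the image is a toy example:

```python
# A valid 2D convolution followed by 2x2 max pooling, in pure Python.
# Real CNNs learn their kernels during training; this kernel is set
# by hand purely for illustration.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i+di][j+dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def max_pool2x2(fmap):
    return [[max(fmap[i][j], fmap[i][j+1], fmap[i+1][j], fmap[i+1][j+1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

# A 6x6 image with a bright vertical stripe in column 2.
image = [[0, 0, 9, 0, 0, 0] for _ in range(6)]
# Hand-set vertical-edge kernel.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]
fmap = conv2d(image, kernel)    # 4x4 feature map
print(max_pool2x2(fmap))        # prints: [[27, 0], [27, 0]]
```

The convolution responds strongly where the stripe's edge sits under the kernel, and pooling keeps only the strongest local responses, shrinking the feature map passed to the next layer.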

B. Long short-term memory

LSTM is a special type of recurrent neural network (RNN) with internal memory and multiplicative gates. Since the original LSTM was introduced in 1997 by Sepp Hochreiter and Jürgen Schmidhuber, a variety of LSTM cell configurations have been described [ 80 ].

LSTM has contributed to the development of well-known software such as Alexa, Siri, Cortana, Google Translate, and Google voice assistant [ 81 ]. LSTM is an implementation of RNN with a special connection between nodes. The special components within the LSTM unit include the input, output, and forget gates. Figure  14 depicts a single LSTM cell.

figure 14

LSTM unit [ 82 ]

x_t = input vector at time t.

h_(t-1) = previous hidden state.

c_(t-1) = previous memory state.

h_t = current hidden state.

c_t = current memory state.

[x] = multiplication operation.

[+] = addition operation.

LSTM is an RNN module that handles the vanishing gradient problem. In general, RNNs use LSTM to eliminate propagation errors, which allows the RNN to learn over multiple time steps. LSTM is characterized by cells that hold information outside the recurrent network; the basic principle of LSTM is this cell state. A cell is like a memory in a computer: it decides when data should be stored, written, read, or erased via the LSTM gates [ 82 ]. Many network architectures use LSTM, such as bidirectional LSTM, hierarchical and attention-based LSTM, convolutional LSTM, autoencoder LSTM, grid LSTM, and cross-modal and associative LSTM [ 83 ].
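
Using the variables defined above, one step of an LSTM cell for scalar inputs can be sketched as follows; the gate weights are arbitrary illustrative numbers, not trained values:

```python
import math

# One step of the LSTM cell of Fig. 14 for scalar inputs, mapping
# (x_t, h_(t-1), c_(t-1)) to (h_t, c_t). Weights are arbitrary
# illustrative values; a trained network would learn them.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    # Forget, input, and output gates plus the candidate memory,
    # each a function of the current input and previous hidden state.
    f = sigmoid(W["f"][0] * x_t + W["f"][1] * h_prev + W["f"][2])
    i = sigmoid(W["i"][0] * x_t + W["i"][1] * h_prev + W["i"][2])
    o = sigmoid(W["o"][0] * x_t + W["o"][1] * h_prev + W["o"][2])
    g = math.tanh(W["g"][0] * x_t + W["g"][1] * h_prev + W["g"][2])
    c_t = f * c_prev + i * g          # gated update of the memory state
    h_t = o * math.tanh(c_t)          # expose part of it as the output
    return h_t, c_t

W = {name: (0.5, 0.4, 0.1) for name in ("f", "i", "o", "g")}
h, c = 0.0, 0.0
for x in (1.0, -0.5, 0.3):            # a short input sequence
    h, c = lstm_step(x, h, c, W)
print(round(h, 3), round(c, 3))
```

The forget gate f decides how much of c_(t-1) to keep, the input gate i decides how much new information to write, and the output gate o decides how much of the cell state to read out, which is the store/write/read/erase behavior described above.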

Bidirectional LSTM networks move the state vector forward and backward in both directions, which implies that dependencies are considered in both temporal directions. As a result of inverse state propagation, expected future correlations can be included in the network's current output [ 84 ]. Bidirectional LSTM is investigated and analyzed here because it encapsulates spatially and temporally scattered information and can tolerate incomplete inputs via a flexible cell-state vector propagation communication mechanism. Based on the detected gaps in the data, this filtering mechanism reidentifies the connections between cells for each data sequence. Figure  15 depicts the architecture. A bidirectional network is used in this study to process properties from multiple dimensions in a parallel and integrated architecture [ 83 ].

figure 15

(left) Bidirectional LSTM and (right) filter mechanism for processing incomplete data [ 84 ]

Hierarchical LSTM networks solve multi-dimensional problems by breaking them down into subproblems and organizing them in a hierarchical structure. This has the advantage of focusing on a single or multiple subproblems. This is accomplished by adjusting the weights within the network to generate a certain level of interest [ 83 ]. A weighting-based attention mechanism that analyzes and filters input sequences is also used in hierarchical LSTM networks for long-term dependency prediction [ 85 ].

Convolutional LSTM reduces and filters input data collected over a longer period using convolutional operations applied in LSTM networks or the LSTM cell architecture directly. Furthermore, due to their distinct characteristics, convolutional LSTM networks are useful for modeling many quantities such as spatially and temporally distributed relationships. However, many quantities can be expected collectively in terms of reduced feature representation. Decoding or decoherence layers are required to predict different output quantities not as features but based on their parent units [ 83 ].

The LSTM autoencoder solves the problem of predicting high-dimensional parameters by shrinking and expanding the network [ 86 ]. The autoencoder architecture is separately trained with the aim of accurate reconstruction of the input data as reported in [ 87 ]. Only the encoder is used during testing and commissioning to extract the low-dimensional properties that are transmitted to the LSTM. The LSTM was extended to multimodal prediction using this strategy. To compress the input data and cell states, the encoder and decoder are directly integrated into the LSTM cell architecture. This combined reduction improves the flow of information in the cell and results in an improved cell state update mechanism for both short-term and long-term dependency [ 83 ].

Grid long short-term memory is a network of LSTM cells organized into a multi-dimensional grid that can be applied to sequences, vectors, or higher-dimensional data like images [ 88 ]. Grid LSTM has connections to the spatial or temporal dimensions of input sequences. Thus, connections of different dimensions within cells extend the normal flow of information. As a result, grid LSTM is appropriate for the parallel prediction of several output quantities that may be independent, linear, or nonlinear. The network's dimensions and structure are influenced by the nature of the input data and the goal of the prediction [ 89 ].

A novel method for the collaborative prediction of numerous quantities is the cross-modal and associative LSTM. It uses several standard LSTMs to separately model different quantities. To calculate the dependencies of the quantities, these LSTM streams communicate with one another via recursive connections. The chosen layers' outputs are added as new inputs to the layers before and after them in other streams. Consequently, a multimodal forecast can be made. The benefit of this approach is that the correlation vectors that are produced have the same dimensions as the input vectors. As a result, neither the parameter space nor the computation time increases [ 90 ].

C. Recurrent convolution neural network

CNN is a key method for handling various computer vision challenges. In recent years, a new generation of CNNs has been developed, the recurrent convolution neural network (RCNN), which is inspired by large-scale recurrent connections in the visual systems of animals. The recurrent convolutional layer (RCL) is the main feature of RCNN, which integrates repetitive connections among neurons in the normal convolutional layer. With the increase in the number of repetitive computations, the receptive domains (RFs) of neurons in the RCL expand infinitely, which is contrary to biological facts [ 91 ].

The RCNN prototype was proposed by Ming Liang and Xiaolin Hu [ 92 , 93 ], and the structure is illustrated in Fig.  16 , in which both feed-forward and recurrent connections have local connectivity and weights shared between distinct sites. This design is quite like the recurrent multilayer perceptron (RMLP) concept often used for dynamic control [ 94 , 95 ] (Fig.  16 , middle). As with the distinction between MLP and CNN, the primary difference is that the full connections of RMLP are replaced with shared local connections. For this reason, the proposed model is known as RCNN [ 96 ].

figure 16

Illustration of the architectures of CNN, RMLP, and RCNN [ 85 ]

figure 17

Illustration of the total number of reviewed papers

The main unit of RCNN is the RCL. RCLs develop through discrete time steps. RCNN offers three basic advantages. First, it allows each unit to incorporate context information from an arbitrarily large area in the current layer. Second, recurrent connections increase the depth of the network while keeping the number of adjustable parameters constant through weight sharing, consistent with the trend of modern CNN architectures to grow deeper with a relatively small number of parameters. Third, the time-unfolded RCNN is a CNN with many paths between the input layer and the output layer, which makes learning simple: on one hand, longer paths make it possible for the model to learn very complex features; on the other hand, shorter paths may improve the backward gradient flow during training [ 91 ].

To summarize the previously discussed models, Table 4 demonstrates the advantages and disadvantages of each model.

Disease prediction with analytics

The studies discussed in this paper have been presented and published in high-quality journals and international conferences by IEEE, Springer, Elsevier, and other major scientific publishers such as Hindawi, Frontiers, Taylor & Francis, and MDPI. The search engines used were Google Scholar, Scopus, and ScienceDirect. All selected papers cover the period from 2019 to 2022. Machine learning, deep learning, health care, surgery, cardiology, radiology, hepatology, and nephrology are some of the terms used to search for these studies. The studies chosen for this survey are concerned with the use of machine learning and deep learning algorithms in healthcare prediction, and both empirical and review articles were considered. This section discusses existing research efforts that address healthcare prediction using various ML and DL techniques, with a detailed discussion of the methods and algorithms used for prediction, their performance metrics, and the tools used to build each model.

ML-based healthcare prediction

To predict diabetes patients, the authors of [ 97 ] utilized a framework to develop and evaluate ML classification models such as logistic regression, KNN, SVM, and RF. The methods were implemented on the Pima Indian Diabetes Database (PIDD), which has 768 rows and 9 columns, and the best forecast accuracy was 83%. The results indicate that logistic regression outperformed the other ML algorithms. However, only a structured dataset was selected and unstructured data were not considered; the model should also be applied in other healthcare domains such as heart disease and COVID-19; and additional factors should be considered for diabetes prediction, such as family history of diabetes, smoking habits, and physical inactivity.
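The four-classifier comparison described above can be sketched with scikit-learn. The snippet below is a minimal illustration only: it uses a synthetic dataset shaped like the 768-row, 8-feature PIDD rather than the real data, so its scores say nothing about the 83% reported in [ 97 ].

```python
# Sketch: comparing the four classifiers from [97] on a synthetic stand-in
# for the Pima dataset (assumption: the real PIDD would be loaded instead;
# make_classification only mimics its shape).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_tr)          # scaling matters for KNN and SVM
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(random_state=0),
}
# Fit each model and record its held-out accuracy
accuracies = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
print(accuracies)
```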

The authors created a diagnosis system in [ 98 ] that uses two different datasets (from Frankfurt Hospital in Germany and the PIDD provided by the UCI ML repository) and four prediction models (RF, SVM, NB, and DT) to predict diabetes. The SVM algorithm performed best with an accuracy of 83.1 percent. Some aspects of this study could be improved: using a DL approach to predict diabetes may achieve better results, and the model should be tested in other healthcare domains such as heart disease and COVID-19 prediction datasets.

In [ 99 ], the authors proposed three ML methods (logistic regression, DT, and boosted RF) to assess COVID-19 using OpenData resources from Mexico and Brazil. To predict recovery and death, the proposed model incorporates only the COVID-19 patient's geographical, social, and economic conditions, as well as clinical risk factors, medical reports, and demographic data. On the dataset utilized, the model for Mexico has 93 percent accuracy and an F1 score of 0.79, while the model for Brazil has 69 percent accuracy and an F1 score of 0.75. The three ML algorithms were examined, and the acquired results showed that logistic regression is the best way of processing the data. The authors should address authentication and privacy management of the collected data.

A new model for predicting type 2 diabetes using a network approach and ML techniques (logistic regression, SVM, NB, KNN, decision tree, RF, XGBoost, and ANN) was presented by the authors in [ 100 ]. To predict the risk of type 2 diabetes, the healthcare data of 1,028 type 2 diabetes patients and 1,028 non-type 2 diabetes patients were extracted from de-identified data. The experimental findings reveal the models' effectiveness, with an area under the curve (AUC) ranging from 0.79 to 0.91; the RF model achieved higher accuracy than the others. This study relies only on a dataset providing hospital admission and discharge summaries from one insurance company; external hospital visits and information from other insurers are missing for people with multiple insurance providers.

The authors of [ 101 ] proposed a healthcare management system that patients can use to schedule appointments with doctors and verify prescriptions. It uses ML to detect ailments and suggest medicines. ML models including DT, RF, logistic regression, and NB classifiers were applied to diabetes, heart disease, chronic kidney disease, and liver datasets. On the heart dataset, logistic regression had the highest accuracy at 98.5 percent, while the DT classifier had the lowest at 92 percent. On the liver dataset, logistic regression achieved the maximum accuracy of 75.17%. On the chronic kidney disease dataset, logistic regression, RF, and Gaussian NB all performed with an accuracy of 100%; such perfect accuracy should be verified using k-fold cross-validation to test the reliability of the models. On the diabetes dataset, random forest achieved the maximum accuracy of 83.67 percent. The authors should include a hospital directory so that various hospitals and clinics can be accessed through a single portal. Additionally, image datasets could be included to allow image processing of reports and the deployment of DL to detect diseases.
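The k-fold cross-validation check suggested above for a suspicious 100% result can be sketched as follows; the data are a synthetic stand-in for the chronic kidney disease records, so the scores are illustrative only.

```python
# Sketch: vetting a suspiciously perfect score with 5-fold cross-validation.
# A single lucky train/test split can report 100%; averaging over k folds
# gives a more honest estimate (data here are synthetic, not the CKD set).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=24, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())  # mean accuracy and spread across folds
```

A large spread across the folds, or a mean well below the single-split figure, signals that the perfect score was an artifact of the split rather than a property of the model.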

In [ 102 ], the authors developed an ML model to predict the occurrence of type 2 diabetes in the following year (Y + 1) using factors in the present year (Y). The dataset was obtained as electronic health records from a private medical institute between 2013 and 2018. The authors applied logistic regression, RF, SVM, XGBoost, and ensemble ML algorithms to classify outcomes as non-diabetic, prediabetic, or diabetic. Feature selection was applied to distinguish the three classes efficiently; FPG, HbA1c, triglycerides, BMI, gamma-GTP, gender, age, uric acid, smoking, drinking, physical activity, and family history were among the features selected. According to the experimental results, the maximum accuracy was 73% from RF, while the lowest was 71% from the logistic regression model. The authors used only one dataset; additional data sources should be applied to verify the models developed in this study.

The authors of [ 103 ] classified the diabetes dataset using SVM and NB algorithms with feature selection to improve the models' accuracy. The PIDD, taken from the UCI repository, was used for analysis. For training and testing, the authors employed k-fold cross-validation; the SVM classifier performed better than the NB method, offering around 91% correct predictions. However, the authors acknowledge that the work should be extended to a more recent dataset containing additional attributes and rows.

K-means clustering is an unsupervised ML algorithm that the authors of [ 104 ] applied to detect heart disease at an early stage using the UCI heart disease dataset, with PCA used for dimensionality reduction. The method achieved early cardiac disease prediction with 94.06% accuracy. The proposed technique should be evaluated with more than one algorithm and on more than one dataset.
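A minimal sketch of this kind of pipeline, standardization, PCA for dimensionality reduction, then k-means with two clusters (disease vs. no disease), is shown below; the blob data are a synthetic stand-in for the UCI heart disease records.

```python
# Sketch of a [104]-style pipeline: standardize, reduce dimensionality with
# PCA, then cluster with k-means (k=2: disease vs. no disease). Synthetic
# blobs stand in for the 303-patient, 13-feature UCI heart disease data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=303, n_features=13, centers=2, random_state=0)
X = StandardScaler().fit_transform(X)             # zero mean, unit variance
X2 = PCA(n_components=2).fit_transform(X)         # keep 2 principal components
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X2)
print(np.bincount(labels))                        # size of each cluster
```

Because k-means is unsupervised, the cluster labels would still need to be mapped to the disease/no-disease classes (e.g., via the known diagnoses of a labeled subset) before an accuracy figure like 94.06% can be computed.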

In [ 105 ], the authors constructed a predictive model for the classification of diabetes data using the logistic regression classification technique. The dataset includes 459 patients for training and 128 cases for testing. The prediction accuracy using logistic regression was 92%. The main limitation of this research is that the authors did not compare the model with other diabetes prediction algorithms, so its relative performance cannot be confirmed.

The authors of [ 106 ] developed a prediction model that analyzes the user's symptoms and predicts the disease using ML algorithms (DT, RF, and NB classifiers). The purpose of this study was to solve health-related problems by allowing medical professionals to predict diseases at an early stage. The dataset is a sample of 4920 patient records covering 41 diagnosed illnesses, with the 41 disorders included as the dependent variable. All algorithms achieved the same accuracy score of 95.12%. The authors noticed that overfitting occurred when all 132 symptoms from the original dataset were assessed instead of 95: the tree appeared to memorize the provided dataset and thus failed to classify new data. As a result, just 95 symptoms were assessed during the data-cleansing process, with the best ones being chosen.
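The symptom-pruning step described above, cutting 132 candidate features down to 95 before fitting a tree, can be sketched with a standard univariate feature selector; the synthetic data and the choice of `SelectKBest` are illustrative assumptions, not the authors' exact procedure.

```python
# Sketch: pruning symptom features before fitting a decision tree, as in
# [106] where reducing 132 symptoms to 95 curbed overfitting. SelectKBest
# keeps the 95 most informative columns (data here are synthetic).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=132, n_informative=20,
                           random_state=0)
X_sel = SelectKBest(mutual_info_classif, k=95).fit_transform(X, y)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_sel, y)
print(X_sel.shape)  # 500 rows, 95 selected features
```

Capping the tree depth (here `max_depth=5`) is a complementary guard against the memorization behavior the authors observed.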

In [ 107 ], the authors built a decision-making system that assists practitioners in anticipating cardiac problems, delivering automated predictions about the condition of the patient's heart through a simpler method of exact classification. Four algorithms (KNN, RF, DT, and NB) were implemented, all on the Cleveland Heart Disease dataset. The accuracy varies across classification methods; the maximum, almost 94 percent, was obtained with the KNN algorithm combined with a correlation factor. The presented technique should be extended to leverage more than one dataset and to forecast different diseases.

The authors of [ 108 ] used the Cleveland dataset, which includes 303 cases and 76 attributes, to test four different classification strategies: NB, SVM, DT, and KNN. Only 14 of the 76 attributes were put through the testing process. The authors performed data preprocessing to remove noisy data. KNN obtained the greatest accuracy at 90.79 percent. More sophisticated models are needed to improve the accuracy of early heart disease prediction.

The authors of [ 109 ] proposed a model to predict heart disease using a cardiovascular dataset, which was classified by applying supervised machine learning algorithms (DT, NB, logistic regression, RF, SVM, and KNN). The results reveal that the DT classification model predicted cardiovascular disorders better than the other algorithms, with an accuracy of 73 percent. The authors highlighted that ensemble ML techniques employing the CVD dataset could generate a better illness prediction model.

In [ 110 ], the authors attempted to increase the accuracy of heart disease prediction by applying logistic regression to a healthcare dataset to determine whether patients have heart illness. The dataset was acquired from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The model reached a prediction accuracy of 87 percent. The authors acknowledge the model could be improved with more data and the use of more ML models.

Because breast cancer affects one in every 28 women in India, the authors of [ 111 ] presented an accurate classification technique to examine a breast cancer dataset containing 569 rows and 32 columns, along with a heart disease dataset and a lung cancer dataset. The research offered a novel approach to feature selection based on genetic algorithms combined with SVM classification. The classifier achieved an accuracy of 81.82% on the lung cancer dataset and 78.93% on the diabetes dataset. It should be noted that the size, kind, and source of the data used are not indicated.

In [ 112 ], the authors predicted the risk factors that cause heart disease using the K-means clustering algorithm, a common unsupervised ML technique, and analyzed the results with a visualization tool. They used the Cleveland heart disease dataset, which has 76 features for 303 patients; the subset analyzed holds 209 records with 8 attributes, such as age, chest pain type, blood pressure, blood glucose level, resting ECG, and heart rate. The authors forecast cardiac disease by considering only the primary characteristics of the four types of chest pain.

The aim of the article [ 113 ] was to report the advantages of using a variety of data mining (DM) methods and validated heart disease survival prediction models. From their observations, the authors report that logistic regression and NB achieved the highest accuracy on the high-dimensional Cleveland hospital dataset, while DT and RF produce better results on low-dimensional datasets. RF delivers higher accuracy than the DT classifier, as it is an optimized learning algorithm. The authors mentioned that this work can be extended to other ML algorithms, and that the model could be developed in a distributed environment such as Map-Reduce, Apache Mahout, and HBase.

In [ 114 ], the authors proposed a single hybrid algorithm for predicting heart disease that combines the techniques used into one algorithm. The presented method has three phases: a preprocessing phase, a classification phase, and a diagnosis phase. They employed the Cleveland database and the NB, SVM, KNN, NN, J4.8, RF, and GA algorithms. NB and SVM always perform better than the others, whereas the rest depend on the specified features. The results attained an accuracy of 89.2 percent. Note that the dataset is small; hence the system could not be trained adequately, which limited the accuracy of the method.

Using six algorithms (logistic regression, KNN, DT, SVM, NB, and RF), the authors of [ 115 ] explored different data representations to better understand how to use clinical data for predicting liver disease. The original dataset, taken from the northeast of Andhra Pradesh, India, includes 583 liver patient records, of which 75.64 percent are male and 24.36 percent female. The analysis indicated that the logistic regression classifier delivers the highest accuracy of 75 percent based on the F1 measure for forecasting liver illness, while NB gives the lowest precision of 53 percent. The authors merely studied a few prominent supervised ML algorithms; more algorithms could be picked to create a more exact model of liver disease prediction, and performance could be steadily improved.

In [ 116 ], the authors aimed to predict coronary heart disease (CHD) based on historical medical data using ML technology. The goal of this study was to use three supervised learning approaches, NB, SVM, and DT, to find correlations in CHD data that could help improve prediction rates. The dataset contains a retrospective sample of males from KEEL, a high-risk heart disease location in the Western Cape of South Africa. NB was the most accurate among the three models, while SVM and DT (J48) outperformed NB with a specificity rate of 82 percent but showed an inadequate sensitivity rate of less than 50 percent.

With the help of DM and network analysis methods, the authors of [ 117 ] created a chronic disease risk prediction framework, developed and evaluated in the Australian healthcare system, to predict type 2 diabetes risk. It uses a private healthcare funds dataset from Australia spanning six years and three different predictive algorithms (regression, parameter optimization, and DT). The accuracy of the prediction ranges from 82 to 87 percent. The dataset's source is hospital admission and discharge summaries; as a result, it provides no information about general physician visits or future diagnoses.

DL-based healthcare prediction

With the help of DL algorithms such as CNN for automatic feature extraction and illness prediction, the authors of [ 118 ] proposed a system for predicting patients with the more common chronic diseases, using KNN for distance calculation to locate the exact match in the dataset and produce the final sickness prediction. The dataset was structured from a combination of disease symptoms, a person's living habits, and doctor consultations, which is acceptable for this general disease prediction. The study utilized the Indian chronic kidney disease dataset, comprising 400 instances, 24 attributes, and 2 classes, retrieved from the UCI ML repository. Finally, a comparative study of the proposed system against algorithms such as NB, DT, and logistic regression was demonstrated. The findings showed that the proposed system gives an accuracy of 95%, higher than the other methods. The proposed technique should be applied using more than one dataset.

In [ 119 ], the authors developed a DL approach that uses chest radiography images to differentiate between patients with mild, pneumonia, and COVID-19 infections, providing a valid mechanism for COVID-19 diagnosis. Image-enhancing techniques were used to increase the intensity of the chest X-ray images and eliminate noise. Two distinct DL approaches based on a pretrained neural network model (ResNet-50) for COVID-19 identification using chest X-ray (CXR) pictures are proposed to minimize overfitting and increase the overall capabilities of the suggested DL systems. The authors emphasized that tests on a vast and challenging dataset encompassing many COVID-19 cases are necessary to establish the efficacy of the suggested system.

Diabetes disease prediction was the topic of the article [ 120 ], in which the authors presented a cuckoo search-based deep LSTM classifier. The deep convLSTM classifier is combined with cuckoo search optimization, a nature-inspired method, to predict disease accurately by transferring information and thereby reducing time consumption. The PIMA dataset, provided by the National Institute of Diabetes and Digestive and Kidney Diseases, is used to predict the onset of diabetes; it is made up of independent variables including insulin level, age, and BMI index, as well as one dependent variable. The new technique was compared to traditional methods, and the results showed that the proposed method achieved 97.591 percent accuracy, 95.874 percent sensitivity, and 97.094 percent specificity. The authors noted that more datasets are needed, as well as new approaches to improve the classifier's effectiveness.

In [ 121 ], the authors presented a wavelet-based convolutional neural network to handle data limitations during the rapid emergence of COVID-19. By investigating the influence of discrete wavelet transform decomposition up to 4 levels, the model demonstrated the capability of multi-resolution analysis for detecting COVID-19 in chest X-rays. The wavelet sub-bands are the CNN's inputs at each decomposition level. COVID-19 chest X-ray-12 is a collection of 1,944 chest X-ray pictures divided into 12 groups, compiled from two open-source datasets (the National Institutes of Health dataset, containing X-rays of several pneumonia-related diseases, and a COVID-19 dataset collected from the Radiological Society of North America). COVID-Neuro wavelet, the suggested model, was trained alongside other well-known ImageNet pre-trained models on COVID-CXR-12. The authors hope to investigate the effects of wavelet functions other than the Haar wavelet.

A CNN framework for COVID-19 identification using computed tomography images was developed by the authors of [ 122 ]. The proposed framework employs a public CT dataset of 2482 CT images from patients of both classes. The system attained an accuracy of 96.16 percent and a recall of 95.41 percent after training on only 20 percent of the dataset. The authors stated that use of the framework should be extended to multimodal medical images in the future.

Using an LSTM network enhanced by two processes to perform multi-label classification based on patients' clinical visit records, the authors of [ 123 ] performed multi-disease prediction for intelligent clinical decision support. A massive dataset of electronic health records was collected from a prominent hospital in southeast China. According to the model evaluation results, the suggested LSTM approach outperforms several standard and DL models in predicting future disease diagnoses: the F1 score rises from 78.9 and 86.4 percent with the state-of-the-art conventional and DL models, respectively, to 88.0 percent with the suggested technique. The authors stated that the model's prediction performance could be enhanced further by including new input variables, and that the method uses only one data source in order to reduce computational complexity.
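Multi-label diagnosis prediction of this kind can be sketched with a simple one-vs-rest linear classifier standing in for the LSTM; the toy visit features and diagnosis labels below are illustrative assumptions, not the hospital EHR data.

```python
# Sketch: multi-label diagnosis prediction in the spirit of [123], with a
# linear one-vs-rest classifier standing in for the LSTM. Each patient can
# carry several future diagnoses at once (toy data, not the EHR dataset).
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                    # features from past visits
# Hypothetical diagnosis sets per patient (illustrative labels only)
raw = [["diabetes"] if x[0] > 0 else ["hypertension", "ckd"] for x in X]
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(raw)                        # one indicator column per diagnosis
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = clf.predict(X[:5])                         # one binary row per patient
print(mlb.classes_, pred.shape)
```

The LSTM in [ 123 ] replaces the linear classifier with a sequence model over visit records, but the multi-label framing (one binary output per diagnosis) is the same.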

In [ 124 ], the authors introduced an approach to creating a supervised ANN structure based on subnets (groups of neurons) instead of layers, which predicts disease effectively in the case of small datasets. The model was evaluated using textual data and compared to multilayer perceptron (MLP) and LSTM recurrent neural network models on three small-scale publicly accessible benchmark datasets. On the Iris dataset, the experimental findings reached 97% classification accuracy, compared to 92% for a three-layer RNN (LSTM); on the diabetes dataset, the model had a lower error rate (81) than the RNN (LSTM), which has a high error rate of 84, and the MLP. However, the method has not been implemented on large textual and image datasets, which limits its applicability.

The authors of [ 125 ] presented a novel AI and Internet of Things (IoT) convergence-based disease detection model for a smart healthcare system. Data collection, preprocessing, categorization, and parameter optimization are all stages of the proposed model. IoT devices, such as wearables and sensors, collect data, which AI algorithms then use to diagnose diseases. A forest technique is then used to remove any outliers found in the patient data. Healthcare data were used to assess the performance of the CSO-LSTM model, which achieved a maximum accuracy of 96.16% on heart disease diagnoses and 97.26% on diabetes diagnoses. This method offered greater prediction accuracy for heart disease and diabetes diagnosis, but it lacks a feature selection mechanism and hence requires extensive computations.

The global health crisis posed by coronaviruses was the subject of [ 126 ]. The authors aimed at detecting disease in people whose X-rays had been selected as potential COVID-19 candidates. Chest X-rays of people with COVID-19, viral pneumonia, and healthy people are included in the dataset. The study compared the performance of two families of DL algorithms, CNN and RNN, evaluating a total of 657 chest X-ray images for the diagnosis of COVID-19. VGG19 was the most successful model, with a 95% accuracy rate, successfully categorizing COVID-19 patients, healthy individuals, and viral pneumonia cases; InceptionV3 was the least successful approach on this dataset. According to the authors, the success percentage can be improved through better data collection, the use of lung tomography in addition to chest radiography, and the creation of numerous DL models.

In [ 127 ], the authors developed a method based on the RNN algorithm for predicting blood glucose levels of diabetics up to one hour in the future, which requires the patient's glucose level history. The Ohio T1DM dataset for blood glucose level prediction, which includes blood glucose values for six people with type 1 diabetes, was used to train and assess the approach. The distribution features were further honed through studies that revealed the procedure's certainty-estimate nature. The authors point out that they can only evaluate prediction targets with sufficient glucose level history; thus, they cannot anticipate the initial levels after a gap, which limits the prediction's quality.
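History-based glucose forecasting of this kind can be sketched with a sliding window over past readings; here a linear autoregressive model fit by least squares stands in for the RNN, and the simulated glucose trace is an illustrative assumption.

```python
# Sketch: history-based glucose forecasting in the spirit of [127], with a
# linear autoregressive model standing in for the RNN. Each training row is
# a sliding window of past readings; the target is the reading h steps ahead.
import numpy as np

def make_windows(series, window, horizon):
    """Build (window -> value at window+horizon-1 ahead) training pairs."""
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i:i + window])
        y.append(series[i + window + horizon - 1])
    return np.array(X), np.array(y)

rng = np.random.default_rng(0)
# Simulated glucose trace: slow oscillation around 120 mg/dL plus noise
glucose = 120 + 25 * np.sin(np.linspace(0, 12, 400)) + rng.normal(0, 2, 400)
X, y = make_windows(glucose, window=12, horizon=6)  # 6 steps ~ 30 min at 5-min sampling
A = np.c_[X, np.ones(len(X))]                       # design matrix with bias term
coef, *_ = np.linalg.lstsq(A, y, rcond=None)        # least-squares AR fit
pred = A @ coef
rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
print(round(rmse, 2))
```

The same windowing scheme feeds the RNN-based approaches in [ 127 ] and [ 130 ]; only the regressor differs, and the limitation noted above (no prediction without enough history) is visible here in the rows dropped at the start of the series.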

To build a new deep anomaly detection model for fast, reliable screening, the authors of [ 128 ] used an 18-layer residual CNN pre-trained on ImageNet with a different anomaly detection mechanism for the classification of COVID-19. On the X-ray dataset, which contains 100 images from 70 COVID-19 patients and 1431 images from 1008 non-COVID-19 pneumonia subjects, the model obtains a sensitivity of 90.00 percent with a specificity of 87.84 percent, or a sensitivity of 96.00 percent with a specificity of 70.65 percent. The authors noted that the model still has certain flaws, such as missing 4% of COVID-19 cases and having a 30% false positive rate. In addition, more clinical data are required to confirm and improve the model's usefulness.

In [ 129 ], the authors developed COVIDX-Net, a novel DL framework that allows radiologists to diagnose COVID-19 in X-ray images automatically. Seven algorithms (MobileNetV2, ResNetV2, VGG19, DenseNet201, InceptionV3, Inception, and Xception) were evaluated using a small dataset of 50 photographs. Each deep neural network model can classify the patient's status as a negative or positive COVID-19 case based on the normalized intensities of the X-ray image. The F1-scores for the VGG19 and dense convolutional network (DenseNet) models were 0.89 and 0.91, respectively. With an F1-score of 0.67, the InceptionV3 model has the weakest classification performance.

The authors of [ 130 ] designed a DL approach for delivering 30-min predictions of future glucose levels based on a dilated RNN (DRNN). The performance of the DRNN models was evaluated using data from two electronic health record datasets: OhioT1DM from clinical trials and an in silico dataset from the UVA-Padova simulator. It outperformed established glucose prediction approaches such as neural networks (NNs), support vector regression (SVR), and autoregressive models (ARX). The results demonstrated significantly improved glucose prediction performance, although some limits remain: the authors created a data-driven model that relies heavily on past EHR, and the quality of the data has a significant impact on the accuracy of the prediction. Clinical datasets are limited in number and often restricted, and because certain data fields are manually entered, they are occasionally incorrect.

In [ 131 ], the authors utilized a deep neural network (DNN) on 15,099 stroke patients to predict stroke death based on medical history and human behaviors, using large-scale electronic health information. The Korea Centers for Disease Control and Prevention collected the data from 2013 to 2016 from around 150 hospitals in the country, each with more than 100 beds. Gender, age, type of insurance, mode of admission, necessary brain surgery, area, length of hospital stay, hospital location, number of hospital beds, stroke kind, and CCI were the 11 variables in the DL model. To automatically create features from the data and identify risk factors for stroke, the researchers used a DNN with scaled principal component analysis (PCA). The data were divided into a training set (66%) and a testing set (34%), with 30 percent of the training samples used for validation. The DNN examines the variables of interest, while scaled PCA improves the DNN's continuous inputs. The sensitivity, specificity, and AUC values were 64.32%, 85.56%, and 83.48%, respectively.

The authors of [ 132 ] proposed GluNet, an approach to glucose forecasting. This method uses a personalized DNN to forecast the probabilistic distribution of short-term glucose measurements for people with type 1 diabetes based on their historical data, including insulin doses, meal information, glucose measurements, and a variety of other factors. It utilizes recent DL techniques and consists of four components: data preprocessing, label transform/recovery, dilated CNN, and post-processing. The authors ran the models on subjects from the OhioT1DM dataset. The outcomes revealed significant improvements over previous procedures in a comprehensive comparison of root mean square error (RMSE) and time lag at 60-min prediction horizons (PH), and small time lags at shorter prediction horizons, for the virtual adult participants. If the PH is properly matched to the lag between input and output, the user may learn to control the system more effectively, and it achieves good performance. Additionally, GluNet was validated on two clinical datasets, attaining competitive RMSE and time lag at 60-min and 30-min PHs. The authors point out that the model does not consider physiological knowledge, and that they need to test GluNet with larger prediction horizons and use it to predict overnight hypoglycemia.

The authors of [ 133 ] proposed VMD-IPSO-LSTM, a short-term strategy for predicting blood glucose. Initially, the variational modal decomposition (VMD) technique decomposed the blood glucose series into intrinsic mode functions (IMF) in various frequency bands. Long short-term memory networks then constructed a prediction mechanism for each IMF. Because the time window length, learning rate, and neuron count are difficult to set, an improved PSO approach optimized these parameters. The improved LSTM network predicted each IMF, and the predicted subsequences were superimposed in the final step to arrive at the ultimate prediction result. The data of 56 participants were chosen as experimental data from among 451 diabetes mellitus patients. The experiments revealed improved prediction accuracy at 30, 45, and 60 min; the RMSE and MAPE were lower than those of VMD-PSO-LSTM, VMD-LSTM, and plain LSTM, indicating that the suggested model is effective. The longer horizon and higher accuracy of the predictions give patients and doctors more time to improve the effectiveness of diabetes therapy and manage blood glucose levels. The authors noted remaining challenges, such as increased calculation volume and operation time, and aim to reduce the time needed to estimate short-term glucose levels.

To speed up diagnosis and cut down on mistakes, the authors of [ 134 ] proposed a new paradigm for primary COVID-19 detection based on a radiological review of chest radiography (chest X-ray). The authors used a dataset of chest X-rays from verified COVID-19 patients (408 images), confirmed pneumonia patients (4273 images), and healthy people (1590 images), 6271 people in total, to perform three-class image classification. To fulfill this image categorization task, the authors used CNNs and transfer learning. Across all folds of data, the model's accuracy ranged from 93.90 percent to 98.37 percent; even the lowest accuracy, 93.90 percent, is still quite good. The authors note a restriction, particularly when it comes to adopting such a model at a large scale for practical usage.

In [ 135 ], the authors proposed DL models for predicting the number of COVID-19-positive cases in Indian states. The Ministry of Health and Family Welfare dataset contains time series of confirmed COVID-19 cases for each of the 28 states and 4 union territories (32 regions in total) since March 14, 2020. This dataset was used to conduct an exploratory analysis of the increase in the number of positive cases in India. RNN-based LSTMs were used as prediction models. Deep LSTM, convolutional LSTM, and bidirectional LSTM models were tested on the 32 states/union territories, and the model with the best accuracy was chosen based on absolute error. Bidirectional LSTM produced the best performance in terms of prediction errors, while convolutional LSTM produced the worst. Daily and weekly forecasts were calculated for all states, and bi-LSTM produced accurate results (error less than 3%) for short-term prediction (1-3 days).

With the goal of increasing the reliability and precision of type 1 diabetes predictions, the authors of [ 136 ] proposed a new method based on CNNs and DL, centered on extracting behavioral patterns; numerous observations of identical behaviors were used to fill in the gaps in the data. The suggested model was trained and verified using data from 759 people with type 1 diabetes who visited Sheffield Teaching Hospitals between 2013 and 2015. Each item in the training set comprised a subject's type 1 diabetes test, demographic data (age, gender, years with diabetes), and the final 84 days (12 weeks) of self-monitored blood glucose (SMBG) measurements preceding the test. The authors note that prediction accuracy deteriorates in the presence of insufficient data and certain physiological specificities.

The authors of [ 137 ] constructed a framework using the PIDD, whose participants are all female and at least 21 years old. The PIDD comprises 768 instances, with 268 samples diagnosed as diabetic and 500 as non-diabetic, and contains the eight most important characteristics that lead to diabetes prediction. The accuracy of functional classifiers such as ANN, NB, DT, and DL ranges between 90 and 98 percent. On the PIMA dataset, DL had the best results for diabetes onset among the four, with an accuracy rate of 98.07 percent. The technique uses a variety of classifiers to accurately predict the disease, but it fails to diagnose it at an early stage.

To summarize the works discussed in this section, we categorize them by disease, along with the techniques used to predict each disease, the datasets used, and the main findings, as shown in Table 5.

Results and discussion

This study conducted a systematic review of the latest developments in ML and DL for healthcare prediction, focusing on healthcare forecasting and on how relevant and robust ML and DL can be in that setting. A total of 41 papers were reviewed: 21 on ML and 20 on DL, as depicted in Fig. 17.

The reviewed papers were classified by the disease predicted; as a result, five diseases were discussed: diabetes, COVID-19, heart disease, liver disease, and chronic kidney disease. Table 6 lists the number of reviewed papers for each disease together with the prediction techniques adopted for it.

Table 6 provides a comprehensive summary of the ML and DL models used for disease prediction: the number of studies conducted on each disease, the techniques employed, and the highest accuracy attained. As the table shows, the best diagnostic accuracy varies by disease. For diabetes, a DL model achieved 98.07% accuracy; for COVID-19, a logistic regression model achieved 98.5%; for heart disease, the CSO-LSTM model achieved 96.16%; for liver disease, a logistic regression model achieved 75%; and for predicting multiple diseases, a logistic regression model achieved 98.5%. Note that these are merely the best accuracies reported among the studies included in this survey. The size and quality of the datasets used to train and validate the models must also be considered: models trained on larger and more diverse datasets are more likely to generalize well to new data. Overall, the results in Table 6 indicate that ML and DL models can predict disease accurately, but the candidate models and techniques should be weighed carefully when selecting one for a specific disease.

Although ML and DL have made incredible strides in recent years, they still have a long way to go before they can effectively solve the fundamental problems plaguing healthcare systems. Some of the challenges of applying ML and DL approaches to healthcare prediction are discussed here.

The primary challenge is handling biomedical data streams: significant amounts of new medical data are generated rapidly, and the healthcare industry as a whole is evolving quickly. Examples of such real-time biological signals include measurements of blood pressure, oxygen saturation, and glucose levels. Although some DL architecture variants have attempted to address this problem, many challenges remain before rapidly evolving, massive streams of data can be analyzed effectively, including memory consumption, feature selection, missing data, and computational complexity. Another challenge for ML and DL is tackling the complexity of the healthcare domain.
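One common way to cope with such streams is incremental learning, where the model updates on mini-batches as they arrive instead of refitting on the full history. The scikit-learn sketch below uses `partial_fit` on synthetic readings; the three features and the labeling rule are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy stream of vital-sign readings (e.g., blood pressure, SpO2, glucose)
# arriving in mini-batches; all values here are synthetic.
rng = np.random.default_rng(1)
clf = SGDClassifier(random_state=1)
classes = np.array([0, 1])  # e.g., stable vs at-risk

for _ in range(50):                      # 50 mini-batches of 32 readings
    X = rng.normal(size=(32, 3))
    y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)  # synthetic label rule
    # partial_fit updates the model in place: no stored history, no refit
    clf.partial_fit(X, y, classes=classes)

X_new = rng.normal(size=(5, 3))
preds = clf.predict(X_new)               # predictions on fresh readings
print(preds)
```

This keeps memory use constant per batch, which addresses the consumption problem above, though feature selection and missing data still need separate handling.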

Healthcare and biomedical research present more intricate challenges than other fields. Much is still unknown about the origins, transmission, and cures of these highly diverse diseases, and collecting sufficient data is hard because there are not always enough patients. One way forward is exhaustive patient profiling, innovative data processing, and the incorporation of additional datasets: researchers can process each dataset independently using an appropriate DL technique and then represent the results in a unified model to extract patient data.

Conclusion

The use of ML and DL techniques for healthcare prediction has the potential to change the way traditional healthcare services are delivered. In ML and DL applications, healthcare data is the most significant component of medical care systems. This paper presented a comprehensive review of the most significant ML and DL techniques employed in healthcare predictive analytics and discussed the obstacles and challenges of applying them in the healthcare domain. For this survey, a total of 41 papers covering the period from 2019 to 2022 were selected and thoroughly reviewed, and the methodology of each paper was discussed in detail. The reviewed studies show that AI techniques (ML and DL) play a significant role in accurately diagnosing diseases and help anticipate and analyze healthcare data by linking hundreds of clinical records and rebuilding a patient's history from them. This work advances research in healthcare predictive analytics using ML and DL approaches and contributes to the literature by serving as a resource for other academics and researchers.

Availability of data and materials

Not applicable.

Abbreviations

AI: Artificial Intelligence

ML: Machine Learning

DT: Decision Tree

EHR: Electronic Health Records

RF: Random Forest

SVM: Support Vector Machine

KNN: K-Nearest Neighbor

NB: Naive Bayes

RL: Reinforcement Learning

NLP: Natural Language Processing

MCTS: Monte Carlo Tree Search

POMDP: Partially Observable Markov Decision Processes

DL: Deep Learning

DBN: Deep Belief Network

ANN: Artificial Neural Networks

CNN: Convolutional Neural Networks

LSTM: Long Short-Term Memory

RCNN: Recurrent Convolution Neural Networks

RNN: Recurrent Neural Networks

RCL: Recurrent Convolutional Layer

RD: Receptive Domains

RMLP: Recurrent Multilayer Perceptron

PIDD: Pima Indian Diabetes Database

CHD: Coronary Heart Disease

CXR: Chest X-Ray

MLP: Multilayer Perceptrons

IoT: Internet of Things

DRNN: Dilated RNN

NN: Neural Networks

SVR: Support Vector Regression

PCA: Principal Component Analysis

DNN: Deep Neural Network

PH: Prediction Horizons

RMSE: Root Mean Square Error

IMF: Intrinsic Modal Functions

VMD: Variational Modal Decomposition

SMBG: Self-Monitored Blood Glucose

References

Latha MH, Ramakrishna A, Reddy BSC, Venkateswarlu C, Saraswathi SY (2022) Disease prediction by stacking algorithms over big data from healthcare communities. Intell Manuf Energy Sustain: Proc ICIMES 2021(265):355


Van Calster B, Wynants L, Timmerman D, Steyerberg EW, Collins GS (2019) Predictive analytics in health care: how can we know it works? J Am Med Inform Assoc 26(12):1651–1654

Sahoo PK, Mohapatra SK, Wu SL (2018) SLA based healthcare big data analysis and computing in cloud network. J Parallel Distrib Comput 119:121–135

Thanigaivasan V, Narayanan SJ, Iyengar SN, Ch N (2018) Analysis of parallel SVM based classification technique on healthcare using big data management in cloud storage. Recent Patents Comput Sci 11(3):169–178

Elmahdy HN (2014) Medical diagnosis enhancements through artificial intelligence

Xiong X, Cao X, Luo L (2021) The ecology of medical care in Shanghai. BMC Health Serv Res 21:1–9

Donev D, Kovacic L, Laaser U (2013) The role and organization of health care systems. Health: systems, lifestyles, policies, 2nd edn. Jacobs Verlag, Lage, pp 3–144

Murphy G F, Hanken M A, & Waters K A (1999) Electronic health records: changing the vision

Qayyum A, Qadir J, Bilal M, Al-Fuqaha A (2020) Secure and robust machine learning for healthcare: a survey. IEEE Rev Biomed Eng 14:156–180

El Seddawy AB, Moawad R, Hana MA (2018) Applying data mining techniques in CRM

Wang Y, Kung L, Wang WYC, Cegielski CG (2018) An integrated big data analytics-enabled transformation model: application to health care. Inform Manag 55(1):64–79

Mirbabaie M, Stieglitz S, Frick NR (2021) Artificial intelligence in disease diagnostics: a critical review and classification on the current state of research guiding future direction. Heal Technol 11(4):693–731

Tang R, De Donato L, Besinović N, Flammini F, Goverde RM, Lin Z, Wang Z (2022) A literature review of artificial intelligence applications in railway systems. Transp Res Part C: Emerg Technol 140:103679

Singh G, Al’Aref SJ, Van Assen M, Kim TS, van Rosendael A, Kolli KK, Dwivedi A, Maliakal G, Pandey M, Wang J, Do V (2018) Machine learning in cardiac CT: basic concepts and contemporary data. J Cardiovasc Comput Tomograph 12(3):192–201

Kim KJ, Tagkopoulos I (2019) Application of machine learning in rheumatic disease research. Korean J Intern Med 34(4):708

Liu B (2011) Web data mining: exploring hyperlinks, contents, and usage data. Springer, Berlin


Haykin S, Lippmann R (1994) Neural networks, a comprehensive foundation. Int J Neural Syst 5(4):363–364

Gupta M, Pandya SD (2022) A comparative study on supervised machine learning algorithm. Int J Res Appl Sci Eng Technol (IJRASET) 10(1):1023–1028

Ray S (2019) A quick review of machine learning algorithms. In: 2019 international conference on machine learning, big data, cloud and parallel computing (COMITCon) (pp 35–39). IEEE

Srivastava A, Saini S, & Gupta D (2019) Comparison of various machine learning techniques and its uses in different fields. In: 2019 3rd international conference on electronics, communication and aerospace technology (ICECA) (pp 81–86). IEEE

Park HA (2013) An introduction to logistic regression: from basic concepts to interpretation with particular attention to nursing domain. J Korean Acad Nurs 43(2):154–164

Obulesu O, Mahendra M, & Thrilok Reddy M (2018) Machine learning techniques and tools: a survey. In: 2018 international conference on inventive research in computing applications (ICIRCA) (pp 605–611). IEEE

Dhall D, Kaur R, & Juneja M (2020) Machine learning: a review of the algorithms and its applications. Proceedings of ICRIC 2019: recent innovations in computing 47–63

Yang F J (2019) An extended idea about Decision Trees. In: 2019 international conference on computational science and computational intelligence (CSCI) (pp 349–354). IEEE

Eesa AS, Orman Z, Brifcani AMA (2015) A novel feature-selection approach based on the cuttlefish optimization algorithm for intrusion detection systems. Expert Syst Appl 42(5):2670–2679

Shamim A, Hussain H, & Shaikh M U (2010) A framework for generation of rules from Decision Tree and decision table. In: 2010 international conference on information and emerging technologies (pp 1–6). IEEE

Eesa AS, Abdulazeez AM, Orman Z (2017) A dids based on the combination of cuttlefish algorithm and Decision Tree. Sci J Univ Zakho 5(4):313–318

Bakyarani ES, Srimathi H, Bagavandas M (2019) A survey of machine learning algorithms in health care. Int J Sci Technol Res 8(11):223

Resende PAA, Drummond AC (2018) A survey of random forest based methods for intrusion detection systems. ACM Comput Surv (CSUR) 51(3):1–36

Breiman L (2001) Random forests. Mach learn 45:5–32

Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8):832–844

Hofmann M, & Klinkenberg R (2016) RapidMiner: data mining use cases and business analytics applications. CRC Press

Chow CKCN, Liu C (1968) Approximating discrete probability distributions with dependence trees. IEEE Trans Inf Theory 14(3):462–467

Burges CJ (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Disc 2(2):121–167

Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques, 3rd edn. Morgan Kaufmann

Cortes C, Vapnik V (1995) Support-vector networks. Mach learn 20:273–297

Aldahiri A, Alrashed B, Hussain W (2021) Trends in using IoT with machine learning in health prediction system. Forecasting 3(1):181–206

Sarker IH (2021) Machine learning: Algorithms, real-world applications and research directions. SN Comput Sci 2(3):160

Ting K M, & Zheng Z (1999) Improving the performance of boosting for naive Bayesian classification. In: Methodologies for knowledge discovery and data mining: third Pacific-Asia conference, PAKDD-99 Beijing, China, Apr 26–28, 1999 proceedings 3 (pp 296–305). Springer Berlin Heidelberg

Oladipo ID, AbdulRaheem M, Awotunde JB, Bhoi AK, Adeniyi EA, Abiodun MK (2022) Machine learning and deep learning algorithms for smart cities: a start-of-the-art review. In: IoT and IoE driven smart cities, pp 143–162

Shailaja K, Seetharamulu B, & Jabbar M A Machine learning in healthcare: a review. In: 2018 second international conference on electronics, communication and aerospace technology (ICECA) 2018 Mar 29 (pp 910–914)

Mahesh B (2020) Machine learning algorithms-a review. Int J Sci Res (IJSR) 9:381–386

Greene D, Cunningham P, & Mayer R (2008) Unsupervised learning and clustering. Mach learn Techn Multimed: Case Stud Organ Retriev 51–90

Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Inc, USA

Kodinariya TM, Makwana PR (2013) Review on determining number of cluster in K-means clustering. Int J 1(6):90–95

Smith LI (2002) A tutorial on principal components analysis

Mishra SP, Sarkar U, Taraphder S, Datta S, Swain D, Saikhom R, Laishram M (2017) Multivariate statistical data analysis-principal component analysis (PCA). Int J Livestock Res 7(5):60–78

Kamani M, Farzin Haddadpour M, Forsati R, and Mahdavi M (2019) "Efficient Fair Principal Component Analysis." arXiv e-prints: arXiv-1911.

Dey A (2016) Machine learning algorithms: a review. Int J Comput Sci Inf Technol 7(3):1174–1179

Agrawal R, Imieliński T, & Swami A (1993) Mining association rules between sets of items in large databases. In: proceedings of the 1993 ACM SIGMOD international conference on Management of data (pp 207–216)

Agrawal R, & Srikant R (1994) Fast algorithms for mining association rules. In: Proceeding of 20th international conference very large data bases, VLDB (Vol 1215, pp 487-499)

Singh J, Ram H, Sodhi DJ (2013) Improving efficiency of apriori algorithm using transaction reduction. Int J Sci Res Publ 3(1):1–4

Al-Maolegi M, & Arkok B (2014) An improved Apriori algorithm for association rules. arXiv preprint arXiv:1403.3948

Abaya SA (2012) Association rule mining based on Apriori algorithm in minimizing candidate generation. Int J Sci Eng Res 3(7):1–4

Coronato A, Naeem M, De Pietro G, Paragliola G (2020) Reinforcement learning for intelligent healthcare applications: a survey. Artif Intell Med 109:101964

Watkins CJ, Dayan P (1992) Q-learning. Mach Learn 8:279–292

Jang B, Kim M, Harerimana G, Kim JW (2019) Q-learning algorithms: a comprehensive classification and applications. IEEE access 7:133653–133667

Achille A, Soatto S (2018) Information dropout: Learning optimal representations through noisy computation. IEEE Trans Pattern Anal Mach Intell 40(12):2897–2905

Williams G, Wagener N, Goldfain B, Drews P, Rehg J M, Boots B, & Theodorou E A (2017) Information theoretic MPC for model-based reinforcement learning. In: 2017 IEEE international conference on robotics and automation (ICRA) (pp 1714–1721). IEEE

Wilkes JT, Gallistel CR (2017) Information theory, memory, prediction, and timing in associative learning. Comput Models Brain Behav 29:481–492

Ning Y, Jia J, Wu Z, Li R, An Y, Wang Y, Meng H (2017) Multi-task deep learning for user intention understanding in speech interaction systems. In: Proceedings of the AAAI conference on artificial intelligence (Vol 31, No. 1)

Shi X, Gao Z, Lausen L, Wang H, Yeung DY, Wong WK, Woo WC (2017) Deep learning for precipitation nowcasting: a benchmark and a new model. In: Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (Eds) Advances in neural information processing systems, vol 30. Curran Associates, Inc.,. https://proceedings.neurips.cc/paper_files/paper/2017/file/a6db4ed04f1621a119799fd3d7545d3d-Paper.pdf

Juang CF, Lu CM (2009) Ant colony optimization incorporated with fuzzy Q-learning for reinforcement fuzzy control. IEEE Trans Syst, Man, Cybernet-Part A: Syst Humans 39(3):597–608

Świechowski M, Godlewski K, Sawicki B, Mańdziuk J (2022) Monte Carlo tree search: a review of recent modifications and applications. Artif Intell Rev 56:1–66

Lizotte DJ, Laber EB (2016) Multi-objective Markov decision processes for data-driven decision support. J Mach Learn Res 17(1):7378–7405


Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Hassabis D (2016) Mastering the game of go with deep neural networks and tree search. Nature 529(7587):484–489

Browne CB, Powley E, Whitehouse D, Lucas SM, Cowling PI, Rohlfshagen P, Colton S (2012) A survey of monte carlo tree search methods. IEEE Trans Comput Intell AI Games 4(1):1–43

Ling ZH, Kang SY, Zen H, Senior A, Schuster M, Qian XJ, Deng L (2015) Deep learning for acoustic modeling in parametric speech generation: a systematic review of existing techniques and future trends. IEEE Signal Process Magaz 32(3):35–52

Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117

Yu D, Deng L (2010) Deep learning and its applications to signal and information processing [exploratory dsp]. IEEE Signal Process Mag 28(1):145–154

Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554

Goyal P, Pandey S, Jain K, Goyal P, Pandey S, Jain K (2018) Introduction to natural language processing and deep learning. Deep Learn Nat Language Process: Creat Neural Netw Python 1–74. https://doi.org/10.1007/978-1-4842-3685-7

Mathew A, Amudha P, Sivakumari S (2021) Deep learning techniques: an overview. Adv Mach Learn Technol Appl: Proc AMLTA 2020:599–608

Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT press, USA

Gomes L (2014) Machine-learning maestro Michael Jordan on the delusions of big data and other huge engineering efforts. IEEE Spectrum 20. https://spectrum.ieee.org/machinelearning-maestro-michael-jordan-on-the-delusions-of-big-data-and-other-huge-engineering-efforts

Huang G, Liu Z, Van Der Maaten L, & Weinberger K Q (2017) Densely connected convolutional networks. In: proceedings of the IEEE conference on computer vision and pattern recognition (pp 4700–4708)

Yap MH, Pons G, Marti J, Ganau S, Sentis M, Zwiggelaar R, Marti R (2017) Automated breast ultrasound lesions detection using convolutional neural networks. IEEE J Biomed Health Inform 22(4):1218–1226

Hayashi Y (2019) The right direction needed to develop white-box deep learning in radiology, pathology, and ophthalmology: a short review. Front Robot AI 6:24

Alom MZ, Taha TM, Yakopcic C, Westberg S, Sidike P, Nasrin MS, Asari VK (2019) A state-of-the-art survey on deep learning theory and architectures. Electronics 8(3):292

Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

Smagulova K, James AP (2019) A survey on LSTM memristive neural network architectures and applications. Eur Phys J Spec Topics 228(10):2313–2324

Setyanto A, Laksito A, Alarfaj F, Alreshoodi M, Oyong I, Hayaty M, Kurniasari L (2022) Arabic language opinion mining based on long short-term memory (LSTM). Appl Sci 12(9):4140

Lindemann B, Müller T, Vietz H, Jazdi N, Weyrich M (2021) A survey on long short-term memory networks for time series prediction. Procedia CIRP 99:650–655

Cui Z, Ke R, Pu Z, & Wang Y (2018) Deep bidirectional and unidirectional LSTM recurrent neural network for network-wide traffic speed prediction. arXiv preprint arXiv:1801.02143

Villegas R, Yang J, Zou Y, Sohn S, Lin X, & Lee H (2017) Learning to generate long-term future via hierarchical prediction. In: international conference on machine learning (pp 3560–3569). PMLR

Gensler A, Henze J, Sick B, & Raabe N (2016) Deep learning for solar power forecasting—an approach using autoencoder and LSTM neural networks. In: 2016 IEEE international conference on systems, man, and cybernetics (SMC) (pp 002858–002865). IEEE

Lindemann B, Fesenmayr F, Jazdi N, Weyrich M (2019) Anomaly detection in discrete manufacturing using self-learning approaches. Procedia CIRP 79:313–318

Kalchbrenner N, Danihelka I, & Graves A (2015) Grid long short-term memory. arXiv preprint arXiv:1507.01526

Cheng B, Xu X, Zeng Y, Ren J, Jung S (2018) Pedestrian trajectory prediction via the social-grid LSTM model. J Eng 2018(16):1468–1474

Veličković P, Karazija L, Lane N D, Bhattacharya S, Liberis E, Liò P & Vegreville M (2018) Cross-modal recurrent models for weight objective prediction from multimodal time-series data. In: proceedings of the 12th EAI international conference on pervasive computing technologies for healthcare (pp 178–186)

Wang J, Hu X (2021) Convolutional neural networks with gated recurrent connections. IEEE Trans Pattern Anal Mach Intell 44(7):3421–3435

Liang M, & Hu X (2015) Recurrent convolutional neural network for object recognition. In: proceedings of the IEEE conference on computer vision and pattern recognition (pp 3367–3375)

Liang M, Hu X, Zhang B (2015) Convolutional neural networks with intra-layer recurrent connections for scene labeling. In: Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R (Eds) Advances in Neural Information Processing Systems, vol 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2015/file/9cf81d8026a9018052c429cc4e56739b-Paper.pdf

Fernandez B, Parlos A G, & Tsai W K (1990) Nonlinear dynamic system identification using artificial neural networks (ANNs). In: 1990 IJCNN international joint conference on neural networks (pp 133–141). IEEE

Puskorius GV, Feldkamp LA (1994) Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Trans Neural Netw 5(2):279–297

Rumelhart DE (1986) Learning representations by error propagation. In: DE Rumelhart and JL McClelland & PDP Research Group, eds, Parallel distributed processing: explorations in the microstructure of cognition. Bradford Books MITPress, Cambridge, Mass

Krishnamoorthi R, Joshi S, Almarzouki H Z, Shukla P K, Rizwan A, Kalpana C, & Tiwari B (2022) A novel diabetes healthcare disease prediction framework using machine learning techniques. J Healthcare Eng. https://doi.org/10.1155/2022/1684017

Edeh MO, Khalaf OI, Tavera CA, Tayeb S, Ghouali S, Abdulsahib GM, Louni A (2022) A classification algorithm-based hybrid diabetes prediction model. Front Publ Health 10:829510

Iwendi C, Huescas C G Y, Chakraborty C, & Mohan S (2022) COVID-19 health analysis and prediction using machine learning algorithms for Mexico and Brazil patients. J Experiment Theor Artif Intell 1–21. https://doi.org/10.1080/0952813X.2022.2058097

Lu H, Uddin S, Hajati F, Moni MA, Khushi M (2022) A patient network-based machine learning model for disease prediction: the case of type 2 diabetes mellitus. Appl Intell 52(3):2411–2422

Chugh M, Johari R, & Goel A (2022) MATHS: machine learning techniques in healthcare system. In: international conference on innovative computing and communications: proceedings of ICICC 2021, Volume 3 (pp 693–702). Springer Singapore

Deberneh HM, Kim I (2021) Prediction of type 2 diabetes based on machine learning algorithm. Int J Environ Res Public Health 18(6):3317

Gupta S, Verma H K, & Bhardwaj D (2021) Classification of diabetes using Naive Bayes and support vector machine as a technique. In: operations management and systems engineering: select proceedings of CPIE 2019 (pp 365–376). Springer Singapore

Islam M T, Rafa S R, & Kibria M G (2020) Early prediction of heart disease using PCA and hybrid genetic algorithm with k-means. In: 2020 23rd international conference on computer and information technology (ICCIT) (pp 1–6). IEEE

Qawqzeh Y K, Bajahzar A S, Jemmali M, Otoom M M, Thaljaoui A (2020) Classification of diabetes using photoplethysmogram (PPG) waveform analysis: logistic regression modeling. BioMed Res Int. https://doi.org/10.1155/2020/3764653

Grampurohit S, Sagarnal C (2020) Disease prediction using machine learning algorithms. In: 2020 international conference for emerging technology (INCET) (pp 1–7). IEEE

Moturi S, Srikanth Vemuru DS (2020) Classification model for prediction of heart disease using correlation coefficient technique. Int J 9(2). https://doi.org/10.30534/ijatcse/2020/185922020

Barik S, Mohanty S, Rout D, Mohanty S, Patra A K, & Mishra A K (2020) Heart disease prediction using machine learning techniques. In: advances in electrical control and signal systems: select proceedings of AECSS 2019 (pp 879–888). Springer, Singapore

Princy R J P, Parthasarathy S, Jose P S H, Lakshminarayanan A R, & Jeganathan S (2020) Prediction of cardiac disease using supervised machine learning algorithms. In: 2020 4th international conference on intelligent computing and control systems (ICICCS) (pp 570–575). IEEE

Saw M, Saxena T, Kaithwas S, Yadav R, & Lal N (2020) Estimation of prediction for getting heart disease using logistic regression model of machine learning. In: 2020 international conference on computer communication and informatics (ICCCI) (pp 1–6). IEEE

Soni VD (2020) Chronic disease detection model using machine learning techniques. Int J Sci Technol Res 9(9):262–266

Indrakumari R, Poongodi T, Jena SR (2020) Heart disease prediction using exploratory data analysis. Procedia Comput Sci 173:130–139

Wu C S M, Badshah M, & Bhagwat V (2019) Heart disease prediction using data mining techniques. In: proceedings of the 2019 2nd international conference on data science and information technology (pp 7–11)

Tarawneh M, & Embarak O (2019) Hybrid approach for heart disease prediction using data mining techniques. In: advances in internet, data and web technologies: the 7th international conference on emerging internet, data and web technologies (EIDWT-2019) (pp 447–454). Springer International Publishing

Rahman AS, Shamrat FJM, Tasnim Z, Roy J, Hossain SA (2019) A comparative study on liver disease prediction using supervised machine learning algorithms. Int J Sci Technol Res 8(11):419–422

Gonsalves A H, Thabtah F, Mohammad R M A, & Singh G (2019) Prediction of coronary heart disease using machine learning: an experimental analysis. In: proceedings of the 2019 3rd international conference on deep learning technologies (pp 51–56)

Khan A, Uddin S, Srinivasan U (2019) Chronic disease prediction using administrative data and graph theory: the case of type 2 diabetes. Expert Syst Appl 136:230–241

Alanazi R (2022) Identification and prediction of chronic diseases using machine learning approach. J Healthcare Eng. https://doi.org/10.1155/2022/2826127

Gouda W, Almurafeh M, Humayun M, Jhanjhi NZ (2022) Detection of COVID-19 based on chest X-rays using deep learning. Healthcare 10(2):343

Kumar A, Satyanarayana Reddy S S, Mahommad G B, Khan B, & Sharma R (2022) Smart healthcare: disease prediction using the cuckoo-enabled deep classifier in IoT framework. Sci Progr. https://doi.org/10.1155/2022/2090681

Monday H N, Li J P, Nneji G U, James E C, Chikwendu I A, Ejiyi C J, & Mgbejime G T (2021) The capability of multi resolution analysis: a case study of COVID-19 diagnosis. In: 2021 4th international conference on pattern recognition and artificial intelligence (PRAI) (pp 236–242). IEEE

Al Rahhal MM, Bazi Y, Jomaa RM, Zuair M, Al Ajlan N (2021) Deep learning approach for COVID-19 detection in computed tomography images. Cmc-Comput Mater Continua 67(2):2093–2110

Men L, Ilk N, Tang X, Liu Y (2021) Multi-disease prediction using LSTM recurrent neural networks. Expert Syst Appl 177:114905

Ahmad U, Song H, Bilal A, Mahmood S, Alazab M, Jolfaei A & Saeed U (2021) A novel deep learning model to secure internet of things in healthcare. Mach Intell Big Data Anal Cybersec Appl 341–353

Mansour RF, El Amraoui A, Nouaouri I, Díaz VG, Gupta D, Kumar S (2021) Artificial intelligence and internet of things enabled disease diagnosis model for smart healthcare systems. IEEE Access 9:45137–45146

Sevi M, & Aydin İ (2020) COVID-19 detection using deep learning methods. In: 2020 international conference on data analytics for business and industry: way towards a sustainable economy (ICDABI) (pp 1–6). IEEE

Martinsson J, Schliep A, Eliasson B, Mogren O (2020) Blood glucose prediction with variance estimation using recurrent neural networks. J Healthc Inform Res 4:1–18

Zhang J, Xie Y, Pang G, Liao Z, Verjans J, Li W, Xia Y (2020) Viral pneumonia screening on chest X-rays using confidence-aware anomaly detection. IEEE Trans Med Imaging 40(3):879–890

Hemdan E E D, Shouman M A, & Karar M E (2020) Covidx-net: a framework of deep learning classifiers to diagnose covid-19 in x-ray images. arXiv preprint arXiv:2003.11055

Zhu T, Li K, Chen J, Herrero P, Georgiou P (2020) Dilated recurrent neural networks for glucose forecasting in type 1 diabetes. J Healthc Inform Res 4:308–324

Cheon S, Kim J, Lim J (2019) The use of deep learning to predict stroke patient mortality. Int J Environ Res Public Health 16(11):1876

Li K, Liu C, Zhu T, Herrero P, Georgiou P (2019) GluNet: a deep learning framework for accurate glucose forecasting. IEEE J Biomed Health Inform 24(2):414–423

Wang W, Tong M, Yu M (2020) Blood glucose prediction with VMD and LSTM optimized by improved particle swarm optimization. IEEE Access 8:217908–217916

Rashid N, Hossain M A F, Ali M, Sukanya M I, Mahmud T, & Fattah S A (2020) Transfer learning based method for COVID-19 detection from chest X-ray images. In: 2020 IEEE region 10 conference (TENCON) (pp 585–590). IEEE

Arora P, Kumar H, Panigrahi BK (2020) Prediction and analysis of COVID-19 positive cases using deep learning models: a descriptive case study of India. Chaos, Solitons Fractals 139:110017


Zaitcev A, Eissa MR, Hui Z, Good T, Elliott J, Benaissa M (2020) A deep neural network application for improved prediction of HbA1c in type 1 diabetes. IEEE J Biomed Health Inform 24(10):2932–2941

Naz H, Ahuja S (2020) Deep learning approach for diabetes prediction using PIMA Indian dataset. J Diabetes Metab Disord 19:391–403


Acknowledgements

Author information

Department of Information Systems and Technology, Faculty of Graduate Studies for Statistical Research, Cairo University, Giza, Egypt

Mohammed Badawy & Nagy Ramadan

Department of Computer Sciences, Faculty of Graduate Studies for Statistical Research, Cairo University, Giza, Egypt

Hesham Ahmed Hefny


Contributions

MB wrote the main text of the manuscript; NR and HAH revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Mohammed Badawy.

Ethics declarations

Competing interests

The authors declare that they have no competing interests. All authors approved the final manuscript.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Badawy, M., Ramadan, N. & Hefny, H.A. Healthcare predictive analytics using machine learning and deep learning techniques: a survey. Journal of Electrical Systems and Information Technology 10, 40 (2023). https://doi.org/10.1186/s43067-023-00108-y


Received : 27 December 2022

Accepted : 31 July 2023

Published : 29 August 2023

DOI : https://doi.org/10.1186/s43067-023-00108-y


  • Healthcare prediction
  • Artificial intelligence (AI)
  • Machine learning (ML)
  • Deep learning (DL)
  • Medical diagnosis

At least four-in-ten U.S. adults have faced high levels of psychological distress during COVID-19 pandemic

At least four-in-ten U.S. adults (41%) have experienced high levels of psychological distress at least once since the early stages of the coronavirus outbreak, according to a new Pew Research Center analysis that examines survey responses from the same Americans over time.

Experiences of high psychological distress are especially widespread among young adults. A 58% majority of those ages 18 to 29 have experienced high levels of psychological distress at least once across four Center surveys conducted between March 2020 and September 2022.

This assessment of the public’s psychological reaction to the COVID-19 outbreak is based on surveys of members of Pew Research Center’s American Trends Panel (ATP) conducted online several times since March 2020. The mental health questions were included on four surveys. The first survey was conducted among 11,537 U.S. adults March 19-24, 2020; a second survey with the question series was conducted April 20-26, 2020, with a sample of 10,139 adults; a third survey was conducted February 16-21, 2021, among 10,121 adults; and the most recent survey was conducted September 13-18, 2022, among 10,588 adults. Additionally, researchers analyzed a subsample of 5,007 respondents who participated in each of the four surveys to examine psychological distress over time.

The ATP is an online survey panel that is recruited through national random sampling of residential addresses. This way nearly all U.S. adults have a chance of selection. The surveys are weighted to be representative of the U.S. adult population by gender, race, ethnicity, partisan affiliation, education and other categories. The group of respondents who participated in each of the four surveys was similarly weighted to be representative of the U.S. adult population. Here is more information about the ATP.
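The weighting described above adjusts the sample across several demographic variables at once. As a minimal sketch of the underlying idea for a single variable (the function name, toy sample and population shares below are illustrative assumptions, not Pew's actual procedure), a post-stratification weight is simply the ratio of a cell's population share to its sample share:

```python
from collections import Counter

def poststratification_weights(sample, population_shares):
    """One weight per respondent so that the weighted sample matches
    known population shares for a single demographic variable."""
    counts = Counter(sample)
    n = len(sample)
    # weight for a cell = (population share) / (sample share)
    cell_weight = {
        cell: population_shares[cell] / (counts[cell] / n)
        for cell in counts
    }
    return [cell_weight[respondent] for respondent in sample]

# Toy sample over-represents group "A" (60% vs. a 50/50 population),
# so members of "A" are weighted down and members of "B" up.
sample = ["A"] * 6 + ["B"] * 4
weights = poststratification_weights(sample, {"A": 0.5, "B": 0.5})
```

After weighting, group "A" contributes exactly half of the total weight, matching its population share; real panel weighting rakes over many such variables iteratively rather than one at a time.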

The psychological distress index used here measures the total amount of mental distress that individuals reported experiencing in the past seven days, as captured by questions measuring sleeplessness, anxiety, depression, loneliness, and physical reactions experienced when thinking about the outbreak. The low distress category includes about half of the sample; very few in that group said they were experiencing any of the types of distress most or all of the time. The middle category includes roughly a quarter of the sample, while the high distress category includes 21%. A large majority of those in the high distress group reported experiencing at least one type of distress most or all of the time in the past seven days.
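The index construction described above can be sketched as a toy scoring routine. The 0-3 response coding, the summation into a single score, and the category cutoffs below are illustrative assumptions, not Pew's published scoring:

```python
# Five distress items, each answered on a 0-3 frequency scale
# (0 = "none of the time" ... 3 = "most or all of the time").
# Item names and coding are assumptions for illustration only.
ITEMS = ["sleeplessness", "anxiety", "depression",
         "loneliness", "physical_reactions"]

def distress_score(responses):
    """Sum the five 0-3 frequency responses into a 0-15 raw index."""
    return sum(responses[item] for item in ITEMS)

def distress_category(score, low_cut=3, high_cut=8):
    """Bucket a raw score into low / middle / high distress.
    The cutoffs are hypothetical, chosen only to show the mechanics."""
    if score <= low_cut:
        return "low"
    if score < high_cut:
        return "middle"
    return "high"

respondent = {"sleeplessness": 3, "anxiety": 2, "depression": 1,
              "loneliness": 0, "physical_reactions": 0}
category = distress_category(distress_score(respondent))  # score 6 -> "middle"
```

With five items each coded 0-3, the raw index runs from 0 to 15, and the choice of cutoffs determines how the sample splits into the three groups described above.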

This research benefited from the advice and counsel of the COVID-19 and mental health measurement group from Johns Hopkins Bloomberg School of Public Health (JHSPH): Catherine K. Ettman (JHSPH); M. Daniele Fallin (JHSPH, now at Emory University); Calliope Holingue (Kennedy Krieger Institute, JHSPH); Renee Johnson (JHSPH); Luke Kalb (Kennedy Krieger Institute, JHSPH); Frauke Kreuter (University of Maryland, Ludwig-Maximilians University of Munich); Elizabeth Stuart (JHSPH); Johannes Thrul (JHSPH); and Cindy Veldhuis (Columbia University, now at Northwestern University).

Here are the mental health questions used for this analysis, along with responses, and the detailed survey methodology statements for surveys conducted in March 2020, late April 2020, February 2021 and September 2022.

A bar chart showing that young adults are especially likely to have experienced high psychological distress since March 2020

The analysis highlights the fluid nature of psychological distress among Americans, as measured by a five-item index that asks about experiences such as loneliness, anxiety and trouble sleeping.

In the September 2022 survey, 21% of U.S. adults fell into the high psychological distress category; in each of four surveys, no more than 24% of adults have fallen into this category. But because individuals experience varying levels of distress at different points in time, a significantly larger share of Americans (41%) have experienced high psychological distress at least once across the four surveys conducted over the past two and a half years.

In addition to age, experiences of high psychological distress are strongly tied to disability status and income. About two-thirds (66%) of adults who have a disability or health condition that keeps them from participating fully in work, school, housework or other activities reported a high level of distress at least once across the four surveys. And those with lower family incomes (53%) are more likely than those from middle- (38%) and high-income households (30%) to have experienced high psychological distress at least once since March 2020.

See also: In CDC survey, 37% of U.S. high school students report regular mental health struggles during COVID-19 pandemic

While many Americans faced challenges with mental health before the coronavirus pandemic, public health officials warned in early 2020 that the pandemic could exacerbate psychological distress. The negative effects of the outbreak have hit some people harder than others, with women, lower-income adults, and Black and Hispanic adults among the groups who have faced disparate health or financial impacts.

Americans’ personal levels of concern about getting or spreading the coronavirus have continued to decline over the course of 2022. The coronavirus is one of many potential sources of stress, including the economy and worries about the future of the nation.

Psychological distress levels have shifted for most Americans during the pandemic

A pie chart showing that levels of psychological distress have fluctuated for a 60% majority of U.S. adults since COVID-19 pandemic began

Amid the shifting landscape of COVID-19 in the United States, just 35% of Americans have registered the same level of psychological distress – whether high, medium or low – across all four surveys conducted by the Center since March 2020.

Instead, a majority of respondents (60%) moved in and out of levels of psychological distress. Psychological distress increased for some but decreased for others. One illustration of the fluid nature of these experiences is that while 41% of U.S. adults faced high psychological distress at least once across four surveys, just 6% experienced high distress in all four surveys. Nearly five times as many (28%) experienced low distress in all of the surveys.

The index of psychological distress is based on measures of five types of possible distress experienced in the past week, such as anxiety or sleeplessness, that are adapted from standard psychological measures. As used in the current survey, the questions are not a clinical measure nor a diagnostic tool; they describe people’s emotional experiences during the week prior to the interview.

A bar chart showing that having trouble sleeping (64%) and feeling anxious (61%) were the most commonly reported feelings of psychological distress in September 2022

Only one question refers specifically to the coronavirus outbreak. It asks how often in the past week Americans have “had physical reactions, such as sweating, trouble breathing, nausea, or a pounding heart” when thinking about their experience with the coronavirus outbreak. In the most recent September survey, 14% of Americans answered this question affirmatively. In March 2020, in the early stages of the outbreak, 18% said they had experienced this.  

Trouble sleeping is one of the most common forms of distress measured in the surveys. In the latest survey, about two-thirds of adults (64%) reported trouble sleeping at least some or a little of the time during the past week. A similar share (61%) said they had felt nervous, anxious or on edge.

Experiences with depression and loneliness also register with sizable shares of Americans. In the most recent survey, 46% of adults said they had felt depressed at least one or two days during the past week, and 42% said they had felt lonely.

All four surveys have included a question about positive feelings, though it is not part of the psychological distress index. Overall, 78% of U.S. adults said they had felt hopeful about the future at least one or two days in the past week, according to the latest survey from September. However, 22% of adults said they had felt hopeful about the future rarely or none of the time during the past week.

  • Coronavirus (COVID-19)
  • Happiness & Life Satisfaction
  • Health Care
  • Health Policy
  • Medicine & Health

Giancarlo Pasquini is a research associate focusing on science and society research at Pew Research Center.

Scott Keeter is a senior survey advisor at Pew Research Center.



COMMENTS

  1. Top 10 Essential Data Science Topics to Real-World Application From the

Feature selection: the above steps could serve as feature selection, an input to predictive analytics. Prescriptive analytics, the top level of analytics in Figure 1, tends to be under-focused in statistics and data science programs.

  2. What is Predictive Analytics?

    What is predictive analytics? Predictive analytics is a branch of advanced analytics that makes predictions about future outcomes using historical data combined with statistical modeling, data mining techniques and machine learning. Companies employ predictive analytics to find patterns in this data to identify risks and opportunities ...

  3. Predictive Analytics: A Review of Trends and Techniques

Predictive analytics, a branch in the domain of advanced analytics, is used in predicting future events. It analyzes the current and historical data in order to make predictions about the ...

  4. What Is Predictive Analytics? 5 Examples

    5 Examples of Predictive Analytics in Action. 1. Finance: Forecasting Future Cash Flow. Every business needs to keep periodic financial records, and predictive analytics can play a big role in forecasting your organization's future health. Using historical data from previous financial statements, as well as data from the broader industry, you ...

  5. PDF Analytics of the Future Predictive Analytics

    Appendix: Predictive Analytics Methods. Predictive analytics fits into a spectrum of analytic methods that help convert data into: an understanding about the present (descriptive analytics), insights into the future (predictive analytics), and recommendations about actions (prescriptive analytics). Dr.

  6. A Beginner's Guide to Predictive Analytics

    The primary purpose of predictive analytics is to make predictions about outcomes, trends, or events based on patterns and insights from historical data. Predictive analytics is the second of four stages of analytical capability in an organization. The four stages of analytics, in order, are: Descriptive analytics - identifying what happened in ...

  7. Predictive analytics in the era of big data: opportunities and

    Three steps are typically involved in the big data analytics ( Table 1 ). The first step is the formulation of clinical questions ( 4 ), which can be categorized into three types: (I) epidemiological question on prevalence and incidence and risk factors; (II) effectiveness and/or safety of an intervention; and (III) predictive analytics.

  8. Recent advances in Predictive Learning Analytics: A decade ...

    The last few years have witnessed an upsurge in the number of studies using Machine and Deep learning models to predict vital academic outcomes based on different kinds and sources of student-related data, with the goal of improving the learning process from all perspectives. This has led to the emergence of predictive modelling as a core practice in Learning Analytics and Educational Data ...

  9. Predictive Analytics

    Predictive analytics is a branch of data science that applies various techniques including statistical inference, machine learning, data mining, and information visualization toward the ultimate goal of forecasting, modeling, and understanding the future behavior of a system based on historical and/or real-time data.

  10. Introduction to Predictive Analytics

Step 1 in the model is to determine the business problem. The business problem in this example is to develop a predictive model to detect and mitigate potentially fraudulent automobile insurance claims. Step 2 in the model is to narrow down the business problem and develop the hypotheses.

  11. Precision Health Analytics With Predictive Analytics and ...

    11 Predictive Analytics and Comparative Effectiveness (PACE) Center, Sackler School of Graduate Biomedical Sciences, Tufts University, Tufts Medical Center, Boston, Massachusetts. 12 Department of Medicine Health Services and Care Research Program, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin.

  12. (PDF) Predictive analysis using machine learning: Review of trends and

Abstract: Artificial Intelligence (AI) has been growing considerably over the last ten years. Machine Learning (ML) is probably the most popular branch of AI to date. Most systems that use ...

  13. Predictive Analytics

    Predictive analytics with Big Data in education will improve educational programs for students and fund-raising campaigns for donors (Siegel, 2013). Research in both educational data mining (EDM) and data analytics (LA) continues to increase (Siemens, 2013; Baker and Siemens, 2014). The key elements of recruitment, learning, and retention can ...

  14. Big Data and Predictive Analytics for Business Intelligence: A ...

The major topics of COVID-19, big data, predictive analytics, and BI research fall on the challenges and BI solutions for firms due to the epidemic. Another topic is the application of BI in research on the influential effects of COVID-19 on business industries.

  15. Data Science & Analytics Research Topics (Includes Free Webinar)

    Data Science-Related Research Topics. Developing machine learning models for real-time fraud detection in online transactions. The use of big data analytics in predicting and managing urban traffic flow. Investigating the effectiveness of data mining techniques in identifying early signs of mental health issues from social media usage.

  16. predictive analytics Latest Research Papers

An empirical quantitative research method is used to verify the model with a sample from the UK insurance sector. This research concludes with some practical insights for insurance companies using AI, ML, big data processing and cloud computing for better client satisfaction, predictive analysis, and trending.

  17. Applying Predictive Analytics on Research Information to ...

The historical information can also tell universities where they have been successful before. As the next step, we are going to develop Predictive Analytics models that can: (i) find research topics of regional or national importance, (ii) recommend strong-fit funding opportunities, and (iii) predict the success of proposal applications.

  18. Predictive Analytics: Definition, Model Types, and Uses

    Predictive Analytics: The use of statistics and modeling to determine future performance based on current and historical data. Predictive analytics look at patterns in data to determine if those ...

  19. 18 Great Articles About Predictive Analytics

This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, Hadoop, decision trees, ensembles, correlation, outliers, Python, R, Tensorflow, SVM, data reduction, feature selection, experimental design, time series, cross-validation, model fitting, dataviz, AI and many more.

  20. Predictive big data analytics for supply chain demand forecasting

    Big data analytics (BDA) in supply chain management (SCM) is receiving a growing attention. This is due to the fact that BDA has a wide range of applications in SCM, including customer behavior analysis, trend analysis, and demand prediction. In this survey, we investigate the predictive BDA applications in supply chain demand forecasting to propose a classification of these applications ...

  21. Predictive Analytics Research

Examples of SOA experience studies and research reports that have made use of predictive analytic techniques.

  22. The use of predictive analytics in finance

The economy is, however, fundamentally uncertain. As such, there is a lack of observed accuracy in the predictions in economics such as inflation and GDP growth. Despite the power of computational models, there isn't a great deal of successful research utilizing predictive analytics in the field of economics. This is because the primary ...

  23. Healthcare predictive analytics using machine learning and deep

    Healthcare prediction has been a significant factor in saving lives in recent years. In the domain of health care, there is a rapid development of intelligent systems for analyzing complicated data relationships and transforming them into real information for use in the prediction process. Consequently, artificial intelligence is rapidly transforming the healthcare industry, and thus comes the ...

  24. During the pandemic, 41% of US adults faced high ...

At least four-in-ten U.S. adults (41%) have experienced high levels of psychological distress at least once since the early stages of the coronavirus outbreak, according to a new Pew Research Center analysis that examines survey responses from the same Americans over time. Experiences of high psychological distress are especially widespread among young adults.