Top 10 Must-Read Data Science Research Papers in 2022

Data science plays a vital role in many sectors, from small businesses to software companies and beyond. It helps organizations understand customer preferences and demographics, automate processes, manage risk, and extract many other valuable insights. Data science can analyze and aggregate industry data, which is often collected frequently and in real time.

Many data science enthusiasts find it hard to keep up with the latest research papers in the field. Here, Analytics Insight brings you the latest data science research papers. They cover a range of topics, including fast-paced technologies such as AI, ML, and coding, where data science plays a major role and helps improve applications across sectors. Here are the data science research papers to read in 2022.

The Research Papers Include:

Documentation matters: human-centered ai system to assist data science code documentation in computational notebooks.

The research paper is written by April Yi Wang, Dakuo Wang, Jaimie Drozdal, Michael Muller, Soya Park, Justin D. Weisz, Xuye Liu, Lingfei Wu, and Casey Dugan.

This paper, published in ACM Transactions on Computer-Human Interaction, looks at computational notebooks as a combination of code and documentation. The researchers present Themisto, an automated documentation generation system, and explore how human-centered AI systems can support data scientists in documenting machine learning code.

Assessing the effects of fuel energy consumption, foreign direct investment and GDP on CO2 emission: New data science evidence from Europe & Central Asia

The research paper is written by Muhammad Mohsin, Sobia Naseem, Muddassar Sarfraz, and Tamoor Azam.

This research paper examines the harmful effects of fuel energy consumption and shows how data science plays a vital role in extracting insights from such large volumes of data.

Impact on Stock Market across Covid-19 Outbreak

The research paper is written by Charmi Gotecha.

This paper uses data science tools to analyse the impact of the Covid-19 pandemic from 2019 to 2022 and how it has affected the world. It also discusses the major role data science played in recovering from pandemic-related losses.

Exploring the political pulse of a country using data science tools

The research paper is written by Miguel G. Folgado, Veronica Sanz

This paper deals with how data science tools and techniques are used to analyse complex human communication. It is an example of how Twitter data and various data science tools can be used for political analysis.

Situating Data Science

The research paper is written by Michelle Hoda Wilkerson and Joseph L. Polman.

This research paper gives detailed information about regulating procurement, examining the ends and means of public procurement regulation.

VeridicalFlow: a Python package for building trustworthy data science pipelines with PCS

The research paper is written by James Duncan, Rush Kapoor, Abhineet Agarwal, Chandan Singh, and Bin Yu.

This research paper, published as an open-source software article, is more of a software paper than a study. It describes VeridicalFlow, an open-source Python package for building trustworthy data science pipelines based on the PCS framework.

From AI ethics principles to data science practice: a reflection and a gap analysis based on recent frameworks and practical experience

The research paper is written by Ilina Georgieva, Claudio Lazo, Tjerk Timan, and Anne Fleur van Veenstra.

This paper deals with the field of AI ethics, its frameworks, and their evaluation. It contributes to ethical AI by mapping AI ethics principles onto the lifecycle of artificial-intelligence-based digital services or products to investigate their applicability to the practice of data science.

Building an Effective Data Science Practice

The research paper is written by Vineet Raina, Srinath Krishnamurthy

This paper is a complete guide to building an effective data science practice. It explains how a data science team can add value and how productive it can be.

Detection of Road Traffic Anomalies Based on Computational Data Science

The research paper is written by Jamal Raiyn

This research paper discusses how autonomous vehicles will control every driving function and how data science will be part of taking over those functions. To manage the large amounts of traffic data collected in various formats, the researchers propose a computational data science approach.

Data Science Data Governance [AI Ethics]

The research paper is written by Joshua A. Kroll

This paper analyses and gives brief yet complete information about the best practices organizations adopt to manage their data, encompassing the full range of responsibilities that come with using data in automated decision making, including data security, privacy, avoidance of undue discrimination, accountability, and transparency.

Data Science: Recently Published Documents

Assessing the effects of fuel energy consumption, foreign direct investment and GDP on CO2 emission: New data science evidence from Europe & Central Asia

Documentation matters: human-centered ai system to assist data science code documentation in computational notebooks.

Computational notebooks allow data scientists to express their ideas through a combination of code and documentation. However, data scientists often pay attention only to the code, and neglect creating or updating their documentation during quick iterations. Inspired by human documentation practices learned from 80 highly-voted Kaggle notebooks, we design and implement Themisto, an automated documentation generation system to explore how human-centered AI systems can support human data scientists in the machine learning code documentation scenario. Themisto facilitates the creation of documentation via three approaches: a deep-learning-based approach to generate documentation for source code, a query-based approach to retrieve online API documentation for source code, and a user prompt approach to nudge users to write documentation. We evaluated Themisto in a within-subjects experiment with 24 data science practitioners, and found that automated documentation generation techniques reduced the time for writing documentation, reminded participants to document code they would have ignored, and improved participants’ satisfaction with their computational notebook.

Data science in the business environment: Insight management for an Executive MBA

Adventures in Financial Data Science

GeCoAgent: A Conversational Agent for Empowering Genomic Data Extraction and Analysis

With the availability of reliable and low-cost DNA sequencing, human genomics is relevant to a growing number of end-users, including biologists and clinicians. Typical interactions require applying comparative data analysis to huge repositories of genomic information for building new knowledge, taking advantage of the latest findings in applied genomics for healthcare. Powerful technology for data extraction and analysis is available, but broad use of the technology is hampered by the complexity of accessing such methods and tools. This work presents GeCoAgent, a big-data service for clinicians and biologists. GeCoAgent uses a dialogic interface, animated by a chatbot, for supporting the end-users’ interaction with computational tools accompanied by multi-modal support. While the dialogue progresses, the user is accompanied in extracting the relevant data from repositories and then performing data analysis, which often requires the use of statistical methods or machine learning. Results are returned using simple representations (spreadsheets and graphics), while at the end of a session the dialogue is summarized in textual format. The innovation presented in this article is concerned with not only the delivery of a new tool but also our novel approach to conversational technologies, potentially extensible to other healthcare domains or to general data science.

Differentially Private Medical Texts Generation Using Generative Neural Networks

Technological advancements in data science have offered us affordable storage and efficient algorithms to query a large volume of data. Our health records are a significant part of this data, which is pivotal for healthcare providers and can be utilized in our well-being. The clinical note in electronic health records is one such category that collects a patient’s complete medical information during different timesteps of patient care available in the form of free-texts. Thus, these unstructured textual notes contain events from a patient’s admission to discharge, which can prove to be significant for future medical decisions. However, since these texts also contain sensitive information about the patient and the attending medical professionals, such notes cannot be shared publicly. This privacy issue has thwarted timely discoveries on this plethora of untapped information. Therefore, in this work, we intend to generate synthetic medical texts from a private or sanitized (de-identified) clinical text corpus and analyze their utility rigorously in different metrics and levels. Experimental results promote the applicability of our generated data as it achieves more than 80% accuracy in different pragmatic classification problems and matches (or outperforms) the original text data.

Impact on Stock Market across Covid-19 Outbreak

Abstract: This paper analyses the impact of the pandemic on the global stock exchange. Stock listing values are determined by a variety of factors, including seasonal changes, catastrophic calamities, pandemics, fiscal year changes, and many more. This paper provides an analysis of the variation in listing prices over the worldwide outbreak of the novel coronavirus. The key reason for studying this outbreak was to provide a notion of the underlying regulation of stock exchanges. Daily closing prices of the stock indices from January 2017 to January 2022 have been utilized for the analysis. The predominant aim of the research is to analyse whether a global economic downturn impacts the financial stock exchange. Keywords: Stock Exchange, Matplotlib, Streamlit, Data Science, Web scraping.

Information Resilience: the nexus of responsible and agile approaches to information use

Abstract: The appetite for effective use of information assets has been steadily rising in both public and private sector organisations. However, whether the information is used for social good or commercial gain, there is a growing recognition of the complex socio-technical challenges associated with balancing the diverse demands of regulatory compliance and data privacy, social expectations and ethical use, business process agility and value creation, and scarcity of data science talent. In this vision paper, we present a series of case studies that highlight these interconnected challenges, across a range of application areas. We use the insights from the case studies to introduce Information Resilience, as a scaffold within which the competing requirements of responsible and agile approaches to information use can be positioned. The aim of this paper is to develop and present a manifesto for Information Resilience that can serve as a reference for future research and development in relevant areas of responsible data management.

qEEG Analysis in the Diagnosis of Alzheimer's Disease: a Comparison of Functional Connectivity and Spectral Analysis

Alzheimer's disease (AD) is a brain disorder that is mainly characterized by a progressive degeneration of neurons in the brain, causing a decline in cognitive abilities and difficulties in engaging in day-to-day activities. This study compares an FFT-based spectral analysis against a functional connectivity analysis based on phase synchronization, for finding known differences between AD patients and Healthy Control (HC) subjects. Both of these quantitative analysis methods were applied on a dataset comprising bipolar EEG montage values from 20 diagnosed AD patients and 20 age-matched HC subjects. Additionally, an attempt was made to localize the identified AD-induced brain activity effects in AD patients. The obtained results showed the advantage of the functional connectivity analysis method compared to a simple spectral analysis. Specifically, while spectral analysis could not find any significant differences between the AD and HC groups, the functional connectivity analysis showed statistically higher synchronization levels in the AD group in the lower frequency bands (delta and theta), suggesting that the AD patients' brains are in a phase-locked state. Further comparison of functional connectivity between the homotopic regions confirmed that the traits of AD were localized in the centro-parietal and centro-temporal areas in the theta frequency band (4-8 Hz). The contribution of this study is that it applies a neural metric for Alzheimer's detection from a data science perspective rather than from a neuroscience one. The study shows that the combination of bipolar derivations with phase synchronization yields similar results to comparable studies employing alternative analysis methods.

Big Data Analytics for Long-Term Meteorological Observations at Hanford Site

A growing number of physical objects with embedded sensors with typically high volume and frequently updated data sets has accentuated the need to develop methodologies to extract useful information from big data for supporting decision making. This study applies a suite of data analytics and core principles of data science to characterize near real-time meteorological data with a focus on extreme weather events. To highlight the applicability of this work and make it more accessible from a risk management perspective, a foundation for a software platform with an intuitive Graphical User Interface (GUI) was developed to access and analyze data from a decommissioned nuclear production complex operated by the U.S. Department of Energy (DOE, Richland, USA). Exploratory data analysis (EDA), involving classical non-parametric statistics, and machine learning (ML) techniques, were used to develop statistical summaries and learn characteristic features of key weather patterns and signatures. The new approach and GUI provide key insights into using big data and ML to assist site operation related to safety management strategies for extreme weather events. Specifically, this work offers a practical guide to analyzing long-term meteorological data and highlights the integration of ML and classical statistics to applied risk and decision science.

  • Survey paper
  • Open access
  • Published: 01 October 2015

Big data analytics: a survey

  • Chun-Wei Tsai,
  • Chin-Feng Lai,
  • Han-Chieh Chao &
  • Athanasios V. Vasilakos

Journal of Big Data, volume 2, Article number: 21 (2015)

The age of big data is upon us, but traditional data analytics may not be able to handle such large quantities of data. The questions that arise now are how to develop a high-performance platform to efficiently analyze big data and how to design appropriate mining algorithms to find useful things in big data. To discuss this issue in depth, this paper begins with a brief introduction to data analytics, followed by a discussion of big data analytics. Some important open issues and further research directions are also presented for the next steps of big data analytics.

Introduction

As information technology spreads rapidly, most data today are born digital and exchanged on the internet. According to the estimation of Lyman and Varian [1], more than 92% of new data were already stored on digital media devices in 2002, and the size of these new data exceeded five exabytes. In fact, the problems of analyzing large-scale data did not appear suddenly; they have existed for several years, because creating data is usually much easier than finding useful things in it. Even though computer systems today are much faster than those of the 1930s, large-scale data remain a strain to analyze with the computers we have today.

In response to the problems of analyzing large-scale data, quite a few efficient methods [2] have been presented, such as sampling, data condensation, density-based approaches, grid-based approaches, divide and conquer, incremental learning, and distributed computing. These methods are constantly used to improve the performance of the operators of the data analytics process. Their results illustrate that, with efficient methods at hand, we may be able to analyze large-scale data in a reasonable time. Dimensionality reduction (e.g., principal components analysis, PCA [3]) is a typical example aimed at reducing the input data volume to accelerate the process of data analytics. Sampling [4], which reduces the data computations of data clustering, is another reduction method that can also be used to speed up the computation time of data analytics.
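As a rough illustration of these two reduction ideas, the following sketch combines random sampling with PCA before a clustering step, using scikit-learn and NumPy on an assumed synthetic dataset; the dataset, sample size, and number of components are illustrative choices, not values from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Assumed synthetic dataset: 100,000 rows, 50 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 50))

# Reduction idea 1: sampling -- analyze a subset instead of all rows.
sample_idx = rng.choice(len(X), size=10_000, replace=False)
X_sample = X[sample_idx]

# Reduction idea 2: dimensionality reduction -- PCA keeps a few components.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_sample)

# Downstream analysis (here k-means) now runs on far less data.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels[:10])
```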

Although computing hardware has developed along Moore's law for several decades, the problems of handling large-scale data still exist as we enter the age of big data. That is why Fisher et al. [5] pointed out that big data means data that cannot be handled and processed by most current information systems or methods: data in the big data era not only become too big to be loaded into a single machine, but most traditional data mining methods and data analytics developed for a centralized analysis process may also not be directly applicable to big data. In addition to the issue of data size, Laney [6] presented a well-known definition (also called the 3Vs) to explain what makes data “big”: volume, velocity, and variety, meaning respectively that the data size is large, the data are created rapidly, and the data exist in multiple types and are captured from different sources. Later studies [7, 8] pointed out that the 3Vs definition is insufficient to explain the big data we face now; thus, veracity, validity, value, variability, venue, vocabulary, and vagueness were added to complement the explanation of big data [8].

Fig. 1 Expected trend of the big data market between 2012 and 2018. The yellow, red, and blue boxes represent the order of appearance of the references in this paper for a particular year

A report by IDC [9] indicates that the big data market was about $16.1 billion in 2014, and another IDC report [10] forecasts that it will grow to $32.4 billion by 2017. The reports of [11] and [12] further project that the big data market will reach $46.34 billion and $114 billion by 2018, respectively. As shown in Fig. 1, even though the market values of big data in these research and technology reports [9-15] differ, the forecasts consistently indicate that the scope of big data will grow rapidly in the near future.

Beyond the market, results from disease control and prevention [16], business intelligence [17], and smart cities [18] make it easy to see that big data is of vital importance everywhere. Numerous studies therefore focus on developing effective technologies to analyze big data. To discuss big data analytics in depth, this paper gives not only a systematic description of traditional large-scale data analytics but also a detailed discussion of the differences between data analytics and big data analytics frameworks, to help data scientists and researchers focus on big data analytics.

Moreover, although several data analytics methods and frameworks have been presented in recent years, with their pros and cons discussed in different studies, a complete discussion from the perspective of data mining and knowledge discovery in databases is still needed. This paper is therefore aimed at providing a brief review so that researchers in the data mining and distributed computing domains have a basic idea of how to use or develop data analytics for big data.

Roadmap of this paper

Figure 2 shows the roadmap of this paper, and the remainder of the paper is organized as follows. “Data analytics” begins with a brief introduction to data analytics, and “Big data analytics” then turns to the discussion of big data analytics as well as state-of-the-art data analytics algorithms and frameworks. The open issues are discussed in “The open issues”, while the conclusions and future trends are drawn in “Conclusions”.

Data analytics

To make the whole process of knowledge discovery in databases (KDD) clearer, Fayyad and his colleagues summarized the KDD process in [19] as a few operations: selection, preprocessing, transformation, data mining, and interpretation/evaluation. As shown in Fig. 3, with these operators at hand we can build a complete data analytics system that gathers data, finds information in the data, and displays the knowledge to the user. According to our observation, more research articles and technical reports focus on data mining than on the other operators, but this does not mean that the other KDD operators are unimportant; they also play vital roles because they strongly impact the final result of KDD. To keep the discussion of the main KDD operators concise, the following sections focus on those depicted in Fig. 3, simplified to three parts (input, data analytics, and output) and seven operators (gathering, selection, preprocessing, transformation, data mining, evaluation, and interpretation).

Fig. 3 The process of knowledge discovery in databases

As shown in Fig. 3, the gathering, selection, preprocessing, and transformation operators belong to the input part. The selection operator usually determines which kind of data is required for data analysis and selects the relevant information from the gathered data or databases; data gathered from different sources therefore need to be integrated into the target data. The preprocessing operator plays a different role: it detects, cleans, and filters unnecessary, inconsistent, and incomplete data to make the data useful. After the selection and preprocessing operators, the secondary data may still be in a number of different formats; the KDD process therefore needs to transform them into a data-mining-capable format, which is performed by the transformation operator. Methods that reduce the complexity and downsize the data scale, such as dimensionality reduction, sampling, and coding, are usually employed in this transformation step.

The data extraction, data cleaning, data integration, data transformation, and data reduction operators can be regarded as the preprocessing stage of data analysis [20], which attempts to extract useful data from the raw data (also called the primary data) and refine them so that they can be used by the subsequent analyses. If the data are duplicates, incomplete, inconsistent, noisy, or outliers, these operators have to clean them up. If the data are too complex or too large to handle, these operators also try to reduce them. If the raw data have errors or omissions, the role of these operators is to identify them and make the data consistent. These operators can be expected to affect the result of KDD, positively or negatively. In summary, systematic solutions usually aim to reduce the complexity of the data to accelerate the computation time of KDD and to improve the accuracy of the analytics result.
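A minimal sketch of these preprocessing operators (removing duplicates, dropping incomplete rows, filtering implausible values, and downsizing by sampling), using pandas on an assumed toy table; the column names and thresholds are made up for illustration.

```python
import pandas as pd
import numpy as np

# Assumed toy "raw" data with duplicates, a missing value, and an implausible reading.
raw = pd.DataFrame({
    "sensor_id": [1, 1, 2, 3, 3, 4],
    "value": [10.2, 10.2, np.nan, 9.8, 9.8, 950.0],
})

clean = (
    raw.drop_duplicates()             # cleaning: remove duplicate copies
       .dropna(subset=["value"])      # cleaning: drop incomplete rows
)
# cleaning: filter values outside an assumed plausible range
clean = clean[clean["value"].between(0, 100)]

# reduction: downsize the data scale by sampling a fraction of the rows
reduced = clean.sample(frac=0.5, random_state=0)
print(reduced)
```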

Data analysis

Since the data analysis part of KDD (as shown in Fig. 3) is responsible for finding the hidden patterns, rules, and information in the data, most researchers in this field use the term data mining to describe how they refine the “ground” (i.e., raw data) into “gold nuggets” (i.e., information or knowledge). The data mining methods [20] are not limited to problem-specific methods. In fact, other technologies (e.g., statistical or machine learning technologies) have also been used to analyze data for many years. In the early stages of data analysis, statistical methods were used to analyze the data and help us understand the situation we are facing, such as public opinion polls or TV programme ratings. Like statistical analysis, the problem-specific methods for data mining also attempt to understand the meaning of the collected data.

After data mining problems were formulated, domain-specific algorithms were also developed. An example is the apriori algorithm [21], one of the best-known algorithms designed for the association rules problem. Although most definitions of data mining problems are simple, the computation costs are quite high. To speed up the response time of a data mining operator, machine learning [22], metaheuristic algorithms [23], and distributed computing [24] have been used alone or combined with traditional data mining algorithms to provide more efficient ways of solving data mining problems. One well-known combination can be found in [25], where Krishna and Murty combined a genetic algorithm with k-means to obtain better clustering results than k-means alone.

Data mining algorithm

As Fig. 4 shows, most data mining algorithms contain initialization, data input and output, data scan, rules construction, and rules update operators [26]. In Fig. 4, D represents the raw data, d the data from the scan operator, r the rules, o the predefined measurement, and v the candidate rules. The scan, construct, and update operators are performed repeatedly until the termination criterion is met. The timing of the scan operator depends on the design of the data mining algorithm, so it can be considered an optional operator. Most data mining algorithms can be described by Fig. 4, which also shows that the representative algorithms—clustering, classification, association rules, and sequential patterns—apply these operators to find the hidden information in the raw data. Modifying these operators is therefore one possible way to enhance the performance of the data analysis.
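To make the loop in Fig. 4 concrete, the sketch below instantiates the scan/construct/update cycle, purely for illustration, as a small k-means procedure in which r plays the role of the current rules (centroids), d the scanned data, v the candidate rules, and o the predefined measurement (a convergence tolerance); this is an assumed example, not the algorithm template from [26].

```python
import numpy as np

def mine(D, k=3, o=1e-4, max_iterations=100):
    """Generic mining loop from Fig. 4, illustratively instantiated as k-means."""
    rng = np.random.default_rng(0)
    r = D[rng.choice(len(D), size=k, replace=False)]        # initialization operator
    for _ in range(max_iterations):
        d = D                                                # scan operator: here the full dataset
        assign = np.argmin(((d[:, None] - r[None]) ** 2).sum(-1), axis=1)
        # construct candidate rules v (new centroids); keep old centroid if a cluster is empty
        v = np.array([d[assign == j].mean(axis=0) if np.any(assign == j) else r[j]
                      for j in range(k)])
        if np.max(np.abs(v - r)) < o:                        # termination criterion via measurement o
            return v
        r = v                                                # update operator
    return r

centroids = mine(np.random.default_rng(1).normal(size=(500, 2)))
print(centroids)
```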

Clustering is one of the well-known data mining problems because it can be used to understand the “new” input data. The basic idea of this problem [27] is to separate a set of unlabeled input data into k different groups, e.g., by k-means [28]. Classification [20] is the opposite of clustering because it relies on a set of labeled input data to construct a set of classifiers (i.e., groups) which are then used to classify the unlabeled input data into the groups to which they belong. To solve the classification problem, decision-tree-based algorithms [29], naïve Bayesian classification [30], and support vector machines (SVM) [31] have been widely used in recent years.
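The contrast between the two settings can be sketched with scikit-learn on a small labeled dataset (the dataset and model choices are illustrative): clustering ignores the labels, while classification learns from them.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Clustering: groups unlabeled data into k clusters (labels are never used).
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_train)
print(clusters[:10])

# Classification: learns from labeled data, then labels unseen data.
for model in (DecisionTreeClassifier(random_state=0), SVC()):
    print(type(model).__name__, model.fit(X_train, y_train).score(X_test, y_test))
```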

Unlike clustering and classification, which attempt to assign the input data to k groups, association rules and sequential patterns focus on finding the “relationships” between the input data. The basic idea of association rules [21] is to find all the co-occurrence relationships between the input data. For the association rules problem, the apriori algorithm [21] is one of the most popular methods. Nevertheless, because it is computationally very expensive, later studies [32] have attempted to use different approaches to reduce the cost of the apriori algorithm, such as applying the genetic algorithm to this problem [33]. If, in addition to the relationships between the input data, we also consider their sequence or time series, the problem is referred to as sequential pattern mining [34]. Several apriori-like algorithms have been presented for solving it, such as the generalized sequential pattern algorithm [34] and sequential pattern discovery using equivalence classes [35].
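A minimal sketch of the apriori idea (frequent itemset generation only, without rule extraction); the transactions and the support threshold are made-up examples.

```python
from itertools import combinations

transactions = [{"milk", "bread"}, {"milk", "diapers"},
                {"milk", "bread", "diapers"}, {"bread", "diapers"}]
min_support = 2  # an itemset is "frequent" if it appears in >= 2 transactions

def support(itemset):
    return sum(itemset <= t for t in transactions)

# Level 1: frequent single items.
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Level k: join frequent (k-1)-itemsets, keep only those that are still frequent
# (the apriori property: every subset of a frequent itemset is frequent).
while frequent[-1]:
    candidates = {a | b for a, b in combinations(frequent[-1], 2) if len(a | b) == len(a) + 1}
    frequent.append({c for c in candidates if support(c) >= min_support})

for level in frequent:
    for itemset in level:
        print(sorted(itemset), support(itemset))
```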

Output the result

Evaluation and interpretation are two vital operators of the output. Evaluation typically plays the role of measuring the results. It can also be one of the operators of the data mining algorithm itself, such as the sum of squared errors, which was used by the selection operator of the genetic algorithm for the clustering problem [25].

To solve the data mining problems that attempt to classify the input data, two of the major goals are: (1) cohesion—the distance between each datum and the centroid (mean) of its cluster should be as small as possible, and (2) coupling—the distance between data belonging to different clusters should be as large as possible. In most studies of data clustering or classification problems, the sum of squared errors (SSE), which is used to measure the cohesion of the data mining results, can be defined as

$$\text{SSE} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \big\Vert x_{ij} - c_i \big\Vert^2,$$

where k is the number of clusters, which is typically given by the user; \(n_i\) is the number of data in the i-th cluster; \(x_{ij}\) is the j-th datum in the i-th cluster; \(c_i\) is the mean of the i-th cluster; and \(n= \sum ^k_{i=1} n_i\) is the total number of data. The most commonly used distance measure for the data mining problem is the Euclidean distance, defined as

$$D(p_i, p_j) = \Vert p_i - p_j \Vert_2,$$

where \(p_i\) and \(p_j\) are the positions of two different data. For different data mining problems, the distance measurement \(D(p_i, p_j)\) can also be the Manhattan distance, the Minkowski distance, or even the cosine similarity [36] between two different documents.
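A small NumPy check of these two definitions (the array values below are arbitrary examples):

```python
import numpy as np

def euclidean(p_i, p_j):
    """Euclidean distance D(p_i, p_j) between two points."""
    return np.linalg.norm(p_i - p_j)

def sse(X, labels, centroids):
    """Sum of squared errors: squared distances of each datum to its cluster mean."""
    return sum(euclidean(x, centroids[c]) ** 2 for x, c in zip(X, labels))

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
centroids = np.array([X[labels == j].mean(axis=0) for j in (0, 1)])
print(sse(X, labels, centroids))  # 2.0: the first two points are each 1 away from (0, 1)
```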

Accuracy (ACC) is another well-known measurement [37], defined as the proportion of the data that are correctly classified:

$$\text{ACC} = \frac{\text{number of correctly classified data}}{\text{total number of data}}.$$

To evaluate the classification results, precision (p), recall (r), and the F-measure can be used to measure how many data that do not belong to group A are incorrectly classified into group A, and how many data that belong to group A are not classified into group A. A simple confusion matrix of a classifier [37], as given in Table 1, can be used to cover all the situations of the classification results.

In Table 1, TP and TN indicate the numbers of positive and negative examples that are correctly classified, respectively; FN and FP indicate the numbers of positive and negative examples that are incorrectly classified, respectively. With the confusion matrix at hand, it is much easier to describe the meaning of precision (p), which is defined as

$$p = \frac{TP}{TP + FP},$$

and the meaning of recall (r), which is defined as

$$r = \frac{TP}{TP + FN}.$$

The F-measure can then be computed as

$$F = \frac{2\,p\,r}{p + r}.$$
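A tiny helper reflecting these three formulas (the confusion-matrix counts are made up):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F-measure from confusion-matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)
    return p, r, f

print(precision_recall_f1(tp=80, fp=20, fn=10))  # (0.8, 0.888..., 0.842...)
```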

In addition to the above-mentioned measurements for evaluating the data mining results, the computation cost and response time are another two well-known measurements. When two different mining algorithms can find the same or similar results, of course, how fast they can get the final mining results will become the most important research topic.

After something (e.g., classification rules) is found by data mining methods, two essential research topics follow: (1) interpretation, the work of navigating and exploring the meaning of the analysis results to help the user make applicable decisions, which in most cases is supported by a useful interface for displaying the information [38, 39]; and (2) summarization, producing a meaningful summary of the mining results [40] so that the user can understand the information from the data analysis more easily. Data summarization is generally expected to be a simple way to provide a concise piece of information to the user, because humans have trouble understanding vast amounts of complicated information. A simple example of data summarization can be found in clustering search engines: when the query “oasis” is sent to Carrot2 ( http://search.carrot2.org/stable/search ), it returns keywords that represent each group of the clustered web links, helping the user recognize which category they need, as shown on the left side of Fig. 5.

Fig. 5 Screenshot of the results of a clustering search engine

A useful graphical user interface is another way to provide meaningful information to a user. As explained by Shneiderman in [39], we need an “overview first, zoom and filter, then retrieve the details on demand”. A useful graphical user interface [38, 41] also makes it easier for the user to comprehend the meaning of the results when the number of dimensions is higher than three. How the results of data mining are displayed will affect the decisions the user makes. For instance, data mining can help us detect “type A influenza” in a particular region, but without time series and infection information about the patients, the government cannot recognize whether the situation is a pandemic or under control, and so cannot respond appropriately. For this reason, a better solution for merging information from different sources and from mining algorithm results would help the user make the right decision.

Since the problems of handling and analyzing large-scale and complex input data always exist in data analytics, several efficient analysis methods have been presented to accelerate the computation time or to reduce the memory cost of the KDD process, as shown in Table 2. The study of [42] shows that a basic mathematical concept (the triangle inequality) can be used to reduce the computation cost of a clustering algorithm, while another study [43] shows that new technologies (distributed computing on GPUs) can also be used to reduce the computation time of a data analysis method. In addition to these well-known improvements (triangle inequality or distributed computing), a large proportion of studies design their efficient methods around the characteristics of the mining algorithm or of the problem itself, as can be found in [32, 44, 45] and elsewhere. Such improved methods are typically designed to address a drawback of a mining algorithm or to solve the mining problem in a different way. These situations arise in most association rules and sequential pattern problems because the original assumption of these problems is the analysis of large-scale datasets: earlier frequent pattern algorithms (e.g., the apriori algorithm) need to scan the whole dataset many times, which is computationally very expensive, so reducing the number of scans of the whole dataset to save computation cost is one of the most important goals in frequent pattern studies. A similar situation exists in data clustering and classification studies, where design concepts such as mining the patterns on the fly [46], mining partial patterns at different stages [47], and reducing the number of times the whole dataset is scanned [32] have been presented to enhance the performance of these mining algorithms. Since some data mining problems are NP-hard [48] or have a very large solution space, several recent studies [23, 49] have attempted to use metaheuristic algorithms as the mining algorithm to obtain an approximate solution within a reasonable time.

Abundant research results on data analysis [20, 27, 63] show possible solutions for dealing with the dilemmas of data mining algorithms, meaning that the open issues of data analysis identified in the literature [2, 64] can usually help us find possible solutions. For instance, the clustering result is extremely sensitive to the initial means, which can be mitigated by using multiple sets of initial means [65]. According to our observation, however, most data analysis methods have limitations with respect to big data, which can be described as follows:

Unscalability and centralization Most data analysis methods are not designed for large-scale and complex datasets. Traditional data analysis methods cannot be scaled up because their design does not take large or complex datasets into account; they typically assume that the analysis is performed on a single machine, with all the data in memory. For this reason, traditional data analytics is limited in addressing the volume problem of big data.

Non-dynamic Most traditional data analysis methods cannot be dynamically adjusted to different situations, meaning that they do not analyze the input data on the fly. For example, the classifiers are usually fixed and cannot be changed automatically. Incremental learning [66] is a promising research trend because it can dynamically adjust the classifiers during the training process with limited resources (a minimal sketch of this idea is given after this list of limitations). As a result, traditional data analytics may not be able to address the velocity problem of big data.

Uniform data structure Most data mining problems assume that the input data all have the same format. Traditional data mining algorithms therefore may not be able to cope when the formats of different inputs differ or when some of the data are incomplete. Converting input data from different sources into a common format is one possible way to address the variety problem of big data.
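As referenced in the Non-dynamic point above, here is a minimal sketch of incremental learning using scikit-learn's partial_fit; the streaming batches are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])

# Data arrive in batches (velocity); the classifier is updated on the fly
# instead of being retrained from scratch on everything seen so far.
for batch in range(20):
    X = rng.normal(size=(100, 5))
    y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)
    clf.partial_fit(X, y, classes=classes)

X_test = rng.normal(size=(5, 5))
print(clf.predict(X_test))
```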

Because traditional data analysis methods are not designed for large-scale and complex data, it is almost impossible for them to analyze big data. Redesigning these methods and changing the way data analysis methods are designed are two critical trends for big data analysis. Several important concepts in the design of big data analysis methods are given in the following sections.

Big data analytics

Nowadays, the data that need to be analyzed are not just large; they are composed of various data types and even include streaming data [67]. Big data has the unique features of being “massive, high dimensional, heterogeneous, complex, unstructured, incomplete, noisy, and erroneous,” which may change statistical and data analysis approaches [68]. Although it seems that big data makes it possible to collect more data and thus find more useful information, the truth is that more data do not necessarily mean more useful information; they may contain more ambiguous or abnormal data. For instance, a user may have multiple accounts, or an account may be used by multiple users, which may degrade the accuracy of the mining results [69]. Therefore, several new issues for data analytics arise, such as privacy, security, storage, fault tolerance, and data quality [70].

Fig. 6 The comparison between traditional data analysis and big data analysis on a wireless sensor network

Big data may be created by handheld devices, social networks, the internet of things, multimedia, and many other new applications that all have the characteristics of volume, velocity, and variety. As a result, the whole of data analytics has to be re-examined from the following perspectives:

From the volume perspective, the deluge of input data is the very first thing we need to face because it may paralyze the data analytics. In contrast to traditional data analytics, Baraniuk [71] pointed out for wireless sensor network data analysis that the bottleneck of big data analytics shifts from the sensors to the processing, communication, and storage of sensing data, as shown in Fig. 6. This is because sensors can gather much more data, but uploading such large volumes of data to upper-layer systems may create bottlenecks everywhere.

In addition, from the velocity perspective, real-time or streaming data bring up the problem of a large quantity of data arriving at the data analytics within a short duration, while the device and system may not be able to handle these input data. This situation is similar to network flow analysis, for which we typically cannot mirror and analyze everything we can gather.

From the variety perspective, because the incoming data may be of different types or incomplete, how to handle them brings up another issue for the input operators of data analytics.

In this section, we will turn the discussion to the big data analytics process.

Big data input

The problem of handling a vast quantity of data that the system is unable to process is not a brand-new research issue; in fact, it appeared in several early applications [2, 21, 72], e.g., marketing analysis, network flow monitoring, gene expression analysis, weather forecasting, and even astronomy analysis. This problem still exists in big data analytics today; thus, preprocessing is an important task that makes the computer, platform, and analysis algorithm able to handle the input data. Traditional data preprocessing methods [73] (e.g., compression, sampling, feature selection, and so on) are expected to be able to operate effectively in the big data age. However, a portion of the studies still focus on how to reduce the complexity of the input data because even the most advanced computer technology cannot, in most cases, efficiently process the whole input data on a single machine. Using domain knowledge to design the preprocessing operator is one possible solution for big data. In [74], Ham and Lee used domain knowledge, a B-tree, and divide-and-conquer to filter out unrelated log information for mobile web log analysis. A later study [75] considered that the computation cost of preprocessing would be quite high for massive log, sensor, or marketing data analysis; thus, Dawelbeit and McCrindle employed a bin packing partitioning method to divide the input data between the computing processors to handle the high computation cost of preprocessing on a cloud system. The cloud system is employed to preprocess the raw data and then output the refined data (e.g., data with a uniform format) to make it easier for the data analysis method or system to perform the further analysis work.
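The partitioning idea attributed to [75] can be sketched, in a highly simplified form, as first-fit-decreasing bin packing of input chunks across processors; the chunk sizes and the capacity are made-up numbers, and this is not the authors' actual implementation.

```python
def first_fit_partition(chunk_sizes, capacity):
    """Assign data chunks to bins (processors) so no bin exceeds `capacity`."""
    bins = []  # each bin is a list of chunk sizes
    for size in sorted(chunk_sizes, reverse=True):  # first-fit decreasing
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            bins.append([size])  # open a new bin (processor)
    return bins

# Made-up chunk sizes (e.g., MB of log data) and per-processor capacity.
print(first_fit_partition([70, 20, 40, 90, 30, 60], capacity=100))
```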

Sampling and compression are two representative data reduction methods for big data analytics because reducing the size of the data makes the analytics computationally less expensive, and thus faster, especially for data coming into the system rapidly. In addition to making the sampled data represent the original data effectively [76], how many instances need to be selected for the data mining method is another research issue [77] because it affects the performance of the sampling method in most cases.
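For data arriving rapidly, one classical sampling technique that never stores the whole stream is reservoir sampling; the sketch below, over an assumed integer stream, keeps a uniform random sample of fixed size.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # replace an element with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```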

To avoid the application-level slowdown caused by the compression process, Jun et al. [78] attempted to use an FPGA to accelerate compression. I/O performance optimization is another issue for compression methods; for this reason, Zou et al. [79] employed tentative selection and predictive dynamic selection to switch between two different compression strategies and improve the performance of the compression process. To make it possible for the compression method to compress the data efficiently, a promising solution is to apply a clustering method to the input data to divide them into several groups and then compress the data according to the clustering information. The compression method described in [80] is one such solution: it first clusters the input data and then compresses them based on the clustering results, while the study in [81] also used a clustering method to improve the performance of the compression process.

In summary, in addition to handling large and fast data input, the research issues of heterogeneous data sources, incomplete data, and noisy data may also affect the performance of the data analysis. The input operators have a stronger impact on data analytics in the big data age than they did in the past. As a result, the design of big data analytics needs to consider how to make these tasks (e.g., data cleaning, data sampling, and data compression) work well.

Big data analysis frameworks and platforms

Various solutions have been presented for big data analytics which can be divided [82] into (1) processing/compute: Hadoop [83], Nvidia CUDA [84], or Twitter Storm [85]; (2) storage: Titan or HDFS; and (3) analytics: MLPACK [86] or Mahout [87]. Although commercial products exist for data analysis [83-86], most studies on traditional data analysis focus on the design and development of efficient and/or effective “ways” to find useful things in the data. But as we enter the age of big data, most current computer systems are unable to handle the whole dataset all at once; thus, how to design a good data analytics framework or platform and how to design analysis methods are both important for the data analysis process. In this section, we start with a brief introduction to data analysis frameworks and platforms, followed by a comparison of them.

Fig. 7 The basic idea of big data analytics on a cloud system

Research on frameworks and platforms

To date, we can easily find tools and platforms presented by well-known organizations. Cloud computing technologies are widely used on these platforms and frameworks to satisfy the large demands for computing power and storage. As shown in Fig. 7, most of the work of KDD for big data can be moved to a cloud system to speed up response time or to increase memory space. With these advances, handling and analyzing big data within a reasonable time is no longer so far away. Because the foundational functions for handling and managing big data have gradually been developed, data scientists today no longer have to take care of everything themselves, from raw data gathering to data analysis, if they use existing platforms or technologies to handle and manage the data. They can instead pay more attention to finding useful information in the data, even though this task is typically like looking for a needle in a haystack. That is why several recent studies have tried to present efficient and effective frameworks for analyzing big data, especially for finding the useful things.

Performance-oriented From the perspective of platform performance, Huai et al. [88] pointed out that most traditional parallel processing models improve system performance by replacing the old computer system with a new, larger one, usually referred to as “scale up”, as shown in Fig. 8a. For big data analytics, however, most research improves system performance by adding more similar computer systems so that the system can handle all the tasks that cannot be loaded or computed on a single machine (called “scale out”), as shown in Fig. 8b, where M1, M2, and M3 represent computer systems with different computing power. For a scale-up-based solution, the computing power of the three systems is in the order M3 > M2 > M1; for a scale-out-based system, all we have to do is keep adding more similar computer systems to increase its ability. To build a scalable and fault-tolerant manager for big data analysis, Huai et al. [88] presented a matrix model, called DOT, which consists of three matrices: the data set (D), concurrent data processing operations (O), and data transformations (T). The big data is divided into n subsets, each of which is processed by a computer node (worker) in such a way that all the subsets are processed concurrently, and the results from these n computer nodes are then collected and transformed on a computer node. With this framework, the whole data analysis framework is composed of several DOT blocks, and system performance can easily be enhanced by adding more DOT blocks.
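A very rough sketch of this divide-process-collect pattern, using Python's multiprocessing; the data, the per-subset work, and the final merge are illustrative stand-ins rather than the DOT model itself.

```python
from multiprocessing import Pool

def process_subset(subset):
    """Work done by one worker on its subset (here: a simple partial sum)."""
    return sum(subset)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    # Divide the data into n subsets, one per worker.
    subsets = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partial_results = pool.map(process_subset, subsets)  # processed concurrently
    # Collect and transform the partial results into the final answer.
    print(sum(partial_results))  # 499999500000
```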

Fig. 8 The comparison between scale up and scale out

Another efficient big data analytics system was presented in [89], called the generalized linear aggregates distributed engine (GLADE). GLADE is a multi-level tree-based data analytics system consisting of two types of computer nodes: a coordinator and workers. The simulation results in [90] show that GLADE can provide better performance than Hadoop in terms of execution time. Because Hadoop requires large memory and storage for data replication and has a single master, Essa et al. [91] presented a mobile-agent-based framework, called map reduce agent mobility (MRAM), to solve these two problems. The main reason is that each mobile agent can send its code and data to any other machine; therefore, the whole system will not go down if the master fails. Compared to Hadoop, the architecture of MRAM was changed from client/server to a distributed agent. The load time of MRAM is less than that of Hadoop even though both use the map-reduce approach and the Java language. In [92], Herodotou et al. considered the issues of user needs and system workloads and presented a self-tuning analytics system built on Hadoop for big data analysis. Since one of the major goals of their system is to adjust itself to user needs and system workloads so as to provide good performance automatically, the user usually does not need to understand and manipulate the Hadoop system. The study in [93] took the perspectives of data-centric architecture and operational models to present a big data architecture framework (BDAF) which includes big data infrastructure, big data analytics, data structures and models, big data lifecycle management, and big data security. According to the observations of Demchenko et al. [93], cluster services, Hadoop-related services, data analytics tools, databases, servers, and massively parallel processing databases are typically the required applications and services in a big data analytics infrastructure.

Result-oriented Fisher et al. [5] presented a big data pipeline to show the workflow of big data analytics for extracting valuable knowledge from big data, which consists of acquiring data, choosing an architecture, shaping the data into the architecture, coding/debugging, and reflection. From the perspectives of statistical computation and data mining, Ye et al. [94] presented an architecture for a services platform which integrates R to provide better data analysis services, called the cloud-based big data mining and analyzing services platform (CBDMASP). The design of this platform is composed of four layers: the infrastructure services layer, the virtualization layer, the dataset processing layer, and the services layer. Several large-scale clustering problems (with datasets from 0.1 GB up to 25.6 GB) were also used to evaluate the performance of the CBDMASP. The simulation results show that using map-reduce is much faster than using a single machine when the input data become very large. Although the test datasets cannot be regarded as big datasets, this kind of testing shows how big data analytics using map-reduce can be sped up: in this study, map-reduce is the better solution when the dataset is larger than 0.2 GB, and a single machine is unable to handle a dataset larger than 1.6 GB.

Another study [95] presented a theorem to explain the characteristics of big data, called HACE: big data are usually of large volume and come from Heterogeneous, Autonomous sources with distributed and decentralized control, and we usually try to find useful and interesting things in the Complex and Evolving relationships of the data. Based on these concerns and data mining issues, Wu and his colleagues [95] also presented a big data processing framework which includes a data accessing and computing tier, a data privacy and domain knowledge tier, and a big data mining algorithm tier. This work explains that data mining algorithms will become much more important and much more difficult to design; thus, challenges will also occur in the design and implementation of big data analytics platforms. In addition to platform performance and data mining issues, the privacy issue for big data analytics has been a prominent research topic in recent years. In [96], Laurila et al. explained that privacy is an essential problem when we try to find something from data gathered from mobile devices; thus, data security and data anonymization should also be considered when analyzing this kind of data. Demirkan and Delen [97] presented a service-oriented decision support system (SODSS) for big data analytics which includes information source, data management, information management, and operations management components.

Comparison between the frameworks/platforms of big data

In [98], Talia pointed out that cloud-based data analytics services can be divided into data analytics software as a service, data analytics platform as a service, and data analytics infrastructure as a service. A later study [99] presented a general architecture of big data analytics which contains multi-source big data collection, distributed big data storage, and intra/inter big data processing. Since many kinds of data analytics frameworks and platforms have been presented, some studies have attempted to compare them to give guidance on choosing the applicable frameworks or platforms for relevant work. To give a brief introduction to big data analytics, especially its platforms and frameworks, Cuzzocrea et al. [100] first discuss how recent studies have responded to the “computational emergency” issue of big data analytics; some open issues, such as data source heterogeneity and uncorrelated data filtering, and possible research directions are also given in the same study. In [101], Zhang and Huang used the 5Ws model to explain what kind of framework and method we need for different big data approaches; the 5Ws model represents what kind of data, why we have these data, where the data come from, when the data occur, who receives the data, and how the data are transferred. A later study [102] used a set of features (owner, workload, source code, low latency, and complexity) to compare the Hadoop [83], Storm [85], and Drill [103] frameworks, from which it can easily be seen that Apache Hadoop has high latency compared with the other two. To better understand the strong and weak points of big data solutions, Chalmers et al. [82] then employed volume, variety, variability, velocity, user skill/experience, and infrastructure to evaluate eight big data analytics solutions.

In [104], in addition to defining that a big data system should include data generation, data acquisition, data storage, and data analytics modules, Hu et al. also mentioned that a big data system can be decomposed into infrastructure, computing, and application layers. Moreover, promising research on NoSQL storage systems was also discussed in this study, which can be divided into key-value, column, document, and row databases. Since big data analysis is generally regarded as a high-computation-cost task, the high performance computing cluster (HPCC) system is also a possible solution in the early stage of big data analytics. Sagiroglu and Sinanc [105] therefore compare the characteristics of HPCC and Hadoop; they emphasize that the HPCC system uses multikey and multivariate indexes on a distributed file system while Hadoop uses a column-oriented database. In [17], Chen et al. give a brief introduction to the big data analytics of business intelligence (BI) from the perspective of evolution, applications, and emerging research topics. In their survey, Chen et al. explained that the evolution of business intelligence and analytics (BI&A) went from BI&A 1.0 and BI&A 2.0 to BI&A 3.0, which are DBMS-based and structured content, web-based and unstructured content, and mobile and sensor-based content, respectively.

Big data analysis algorithms

Mining algorithms for specific problems.

Because big data issues have been around for nearly ten years, Fan and Bifet [ 106 ] pointed out that the terms “big data” [ 107 ] and “big data mining” [ 108 ] were both first presented in 1998. That big data and big data mining appeared at almost the same time suggests that finding something from big data will be one of the major tasks in this research domain. Data mining algorithms for data analysis also play a vital role in big data analysis, in terms of the computation cost, memory requirement, and accuracy of the end results. In this section, we give a brief discussion from the perspective of analysis and search algorithms to explain their importance for big data analytics.

Clustering algorithms In the big data age, traditional clustering algorithms will become even more limited than before because they typically require that all the data be in the same format and be loaded into the same machine in order to find something useful from the whole of the data. Although the problem [ 64 ] of analyzing large-scale and high-dimensional datasets has attracted many researchers from various disciplines since the last century, and several solutions [ 2 , 109 ] have been presented in recent years, the characteristics of big data still bring several new challenges to data clustering. Among them, how to reduce the data complexity is one of the important issues for big data clustering. In [ 110 ], Shirkhorshidi et al. divided big data clustering into two categories: single-machine clustering (i.e., sampling and dimension reduction solutions) and multiple-machine clustering (parallel and MapReduce solutions). This means that traditional reduction solutions can also be used in the big data age because the complexity and memory space needed for the data analysis process will be decreased by sampling and dimension reduction methods. More precisely, sampling can be regarded as reducing the “amount of data” entered into the data analysis process while dimension reduction can be regarded as “downsizing the whole dataset” because irrelevant dimensions are discarded before the data analysis process is carried out.
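
To make the contrast between these two single-machine strategies concrete, the following sketch pairs random sampling with PCA-based dimension reduction before running k-means. It is only an illustration of the general idea under assumed tooling (NumPy and scikit-learn, synthetic data); it is not a method proposed in the cited studies.

```python
# A minimal sketch of single-machine big data clustering via sampling and
# dimension reduction (illustrative only; library and parameter choices are assumptions).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(100_000, 50))    # stand-in for a large, high-dimensional dataset

# Sampling: reduce the "amount of data" fed into the analysis step.
sample_idx = rng.choice(len(data), size=10_000, replace=False)
sample = data[sample_idx]

# Dimension reduction: discard (nearly) irrelevant dimensions before clustering.
reducer = PCA(n_components=10).fit(sample)
reduced_sample = reducer.transform(sample)

# Cluster the reduced sample, then assign the full dataset to the learned centroids.
model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(reduced_sample)
labels_full = model.predict(reducer.transform(data))
print(labels_full[:10])
```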

CloudVista [ 111 ] is a representative solution for clustering big data which uses cloud computing to perform the clustering process in parallel. BIRCH [ 44 ] and a sampling method were used in CloudVista to show that it is able to handle large-scale data, e.g., 25 million census records. Using GPUs to enhance the performance of a clustering algorithm is another promising direction for big data mining. The multiple species flocking (MSF) algorithm [ 112 ] was ported to NVIDIA’s CUDA platform in [ 113 ] to reduce the computation time of clustering; the simulation results show that the speedup factor can be increased from 30 up to 60 by using the GPU for data clustering. Since most traditional clustering algorithms (e.g., k-means) require centralized computation, how to make them capable of handling big data clustering problems is the major concern of Feldman et al. [ 114 ], who used a tree construction for generating coresets in parallel, called the “merge-and-reduce” approach. Moreover, Feldman et al. pointed out that by using this solution for clustering, the update time per datum and the memory of traditional clustering algorithms can be significantly reduced.
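
The divide-and-merge idea behind such parallel clustering can be illustrated with a toy sketch: each data chunk is summarized by weighted centroids, and the weighted summaries are then clustered again. This is not Feldman et al.’s coreset construction; the chunk sizes, weighting scheme, and use of scikit-learn are assumptions made purely for illustration.

```python
# Toy sketch of the divide-and-merge idea behind parallel clustering:
# cluster each chunk locally, then cluster the weighted local centroids.
# Illustration only, not the coreset algorithm of the cited study.
import numpy as np
from sklearn.cluster import KMeans

def local_summary(chunk, k=20):
    """Summarize a data chunk by k weighted centroids (a crude local summary)."""
    km = KMeans(n_clusters=k, n_init=5, random_state=0).fit(chunk)
    weights = np.bincount(km.labels_, minlength=k)
    return km.cluster_centers_, weights

def merge_and_cluster(summaries, k_final=5):
    """Merge weighted centroids from all chunks and cluster them into k_final groups."""
    centers = np.vstack([c for c, _ in summaries])
    weights = np.concatenate([w for _, w in summaries])
    km = KMeans(n_clusters=k_final, n_init=5, random_state=0)
    km.fit(centers, sample_weight=weights)
    return km.cluster_centers_

rng = np.random.default_rng(1)
chunks = [rng.normal(size=(50_000, 8)) for _ in range(4)]   # data split across "nodes"
summaries = [local_summary(c) for c in chunks]              # could run in parallel
final_centers = merge_and_cluster(summaries)
print(final_centers.shape)                                  # (5, 8)
```

The design choice to pass only weighted centroids to the merge step is what keeps the memory and update cost per node small, at the price of some approximation error.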

Classification algorithms Similar to the clustering algorithms for big data mining, several studies have attempted either to modify traditional classification algorithms so that they work in a parallel computing environment or to develop new classification algorithms that work naturally in such an environment. In [ 115 ], the design of the classification algorithm takes into account input data that are gathered from distributed data sources and processed by a heterogeneous set of learners (Footnote 5). In this study, Tekin et al. presented a novel classification algorithm called “classify or send for classification” (CoS). They assumed that, in a distributed data classification system, each learner can process the input data in two different ways: one is to perform the classification by itself, while the other is to forward the input data to another learner to have them labeled; information is exchanged between the different learners. In brief, this kind of solution can be regarded as cooperative learning to improve the accuracy of big data classification. Another interesting solution uses quantum computing to reduce the memory space and computing cost of a classification algorithm. For example, in [ 116 ], Rebentrost et al. presented a quantum-based support vector machine for big data classification and argued that their classification algorithm can be implemented with a time complexity of \(O(\log NM)\), where N is the number of dimensions and M is the number of training data. The prospects for big data mining with quantum-based search algorithms are bright once quantum computing hardware becomes mature.
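
A much simplified sketch of the cooperative “classify locally or forward” idea is given below. The confidence threshold, the forwarding rule, and the use of logistic regression learners are assumptions made for illustration only; the sketch is not the CoS algorithm of Tekin et al.

```python
# Simplified sketch of cooperative distributed classification: each learner
# classifies locally when it is confident, otherwise forwards the instance to
# another learner. Illustration of the general idea only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=6000, n_features=20, random_state=0)
parts = np.array_split(np.arange(len(X)), 3)         # three distributed data sources

# Each learner is trained only on its local share of the data.
learners = [LogisticRegression(max_iter=1000).fit(X[idx], y[idx]) for idx in parts]

def classify_or_send(x, local_id, threshold=0.8):
    """Classify locally if confident enough, otherwise forward to the next learner."""
    proba = learners[local_id].predict_proba(x.reshape(1, -1))[0]
    if proba.max() >= threshold:
        return int(proba.argmax()), local_id          # label, and who answered
    other = (local_id + 1) % len(learners)            # naive forwarding rule (an assumption)
    proba = learners[other].predict_proba(x.reshape(1, -1))[0]
    return int(proba.argmax()), other

label, answered_by = classify_or_send(X[0], local_id=0)
print(label, answered_by)
```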

Frequent pattern mining algorithms Most research on frequent pattern mining (i.e., association rule and sequential pattern mining) was focused on handling large-scale datasets from the very beginning, because some of the early approaches attempted to analyze transaction data from large shopping malls. Because the number of transactions is usually in the “tens of thousands” or more, the issue of how to handle large-scale data was studied for several years; for example, FP-tree [ 32 ] uses a tree structure to store the frequent patterns and thereby reduce the computation time of association rule mining. In addition to the traditional frequent pattern mining algorithms, parallel computing and cloud computing technologies have of course also attracted researchers in this domain. Among them, the map-reduce solution was used in [ 117 – 119 ] to enhance the performance of frequent pattern mining algorithms. Given the use of the map-reduce model for frequent pattern mining, it can be expected that its application to “cloud platforms” [ 120 , 121 ] will become a popular trend in the near future. The study of [ 119 ] not only used the map-reduce model, it also allowed users to express their specific interest constraints during the frequent pattern mining process. The performance of these map-reduce-based methods for big data analysis is, no doubt, better than that of traditional frequent pattern mining algorithms running on a single machine.
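
The map-reduce decomposition used by such miners can be illustrated with a toy, single-machine sketch that counts frequent 2-itemsets: the map step emits (itemset, 1) pairs per partition and the reduce step sums them and applies a minimum support. The transactions, partitioning, and support threshold are made-up examples; no cited system is being reproduced here.

```python
# Toy map-reduce-style counting of frequent pairs in plain Python.
# Real systems distribute the map and reduce steps across nodes;
# here both run locally for illustration only.
from collections import Counter
from itertools import combinations

transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "butter", "beer"},
]

def map_phase(partition):
    """Emit (itemset, 1) pairs for every 2-itemset in each transaction."""
    for t in partition:
        for pair in combinations(sorted(t), 2):
            yield pair, 1

def reduce_phase(emitted, min_support=2):
    """Sum the counts and keep only itemsets meeting the minimum support."""
    counts = Counter()
    for itemset, c in emitted:
        counts[itemset] += c
    return {k: v for k, v in counts.items() if v >= min_support}

# The partitions would live on different workers in a real deployment.
partitions = [transactions[:2], transactions[2:]]
emitted = (kv for p in partitions for kv in map_phase(p))
print(reduce_phase(emitted))   # e.g., {('bread', 'butter'): 3, ('bread', 'milk'): 3, ...}
```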

Machine learning for big data mining

The potential of machine learning for data analytics can easily be found in the early literature [ 22 , 49 ]. Different from data mining algorithms designed for specific problems, machine learning algorithms can be used for many different mining and analysis problems because they are typically employed as the “search” algorithm of the required solution. Since most machine learning algorithms can find an approximate solution to an optimization problem, they can be employed for most data analysis problems as long as those problems can be formulated as optimization problems. For example, the genetic algorithm, one of the machine learning algorithms, can be used not only to solve the clustering problem [ 25 ] but also to solve the frequent pattern mining problem [ 33 ]. The potential of machine learning lies not merely in solving different mining problems in the data analysis operator of KDD; it can also enhance the performance of the other parts of KDD, such as feature reduction for the input operators [ 72 ].

A recent study [ 68 ] shows that some traditional mining algorithms, statistical methods, preprocessing solutions, and even GUIs have been applied in several representative tools and platforms for big data analytics. The results clearly show that machine learning algorithms will be one of the essential parts of big data analytics. One problem in using current machine learning methods for big data analytics is the same one faced by most traditional data mining algorithms: they are designed for sequential or centralized computing. One of the most feasible solutions, however, is to make them work for parallel computing. Fortunately, some machine learning algorithms (e.g., population-based algorithms) are inherently amenable to parallel computing, as has been demonstrated for several years by, for example, the parallel computing version of the genetic algorithm [ 122 ]. Different from the traditional GA shown in Fig. 9 a, the population of the island model genetic algorithm, one of the parallel GAs, can be divided into several sub-populations, as shown in Fig. 9 b. This means that the sub-populations can be assigned to different threads or computer nodes for parallel computing by a simple modification of the GA, as sketched below.

The comparison between basic idea of traditional GA (TGA) and parallel genetic algorithm (PGA)
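
A minimal sketch of the island model follows: several sub-populations evolve independently (and could be placed on separate threads or nodes), with occasional ring migration of the best individuals. The toy objective, tournament selection, and migration policy are illustrative assumptions, not the parallel GA of the cited work.

```python
# Minimal sketch of an island-model GA: sub-populations evolve independently
# with occasional migration. Objective and parameters are toy choices.
import random

def fitness(x):
    # Toy objective: maximize -(x - 3)^2, whose optimum is x = 3.
    return -(x - 3.0) ** 2

def evolve(pop, generations=20):
    """Evolve one island by tournament selection and Gaussian mutation."""
    for _ in range(generations):
        new_pop = []
        for _ in range(len(pop)):
            a, b = random.sample(pop, 2)
            parent = a if fitness(a) > fitness(b) else b
            new_pop.append(parent + random.gauss(0.0, 0.1))
        pop = new_pop
    return pop

def migrate(islands, n_migrants=2):
    """Ring migration: each island replaces its worst individuals with the previous island's best."""
    bests = [sorted(isl, key=fitness, reverse=True)[:n_migrants] for isl in islands]
    for i, isl in enumerate(islands):
        incoming = bests[(i - 1) % len(islands)]
        worst = sorted(isl, key=fitness)[:n_migrants]
        for w, m in zip(worst, incoming):
            isl[isl.index(w)] = m
    return islands

random.seed(0)
islands = [[random.uniform(-10, 10) for _ in range(30)] for _ in range(4)]
for _ in range(5):                                    # epochs; islands could run in parallel
    islands = [evolve(isl) for isl in islands]
    islands = migrate(islands)
best = max((x for isl in islands for x in isl), key=fitness)
print(round(best, 3))                                 # should be close to 3.0
```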

Such parallelism is not free, however. In [ 123 ], Kiran and Babu explained that a framework for distributed data mining algorithms still needs to aggregate the information from the different computer nodes. As shown in Fig. 10 , the common design of a distributed data mining algorithm is as follows: each mining algorithm is performed on a computer node (worker) which has its locally coherent data, but not the whole data. To construct globally meaningful knowledge after each mining algorithm finds its local model, the local models from all computer nodes have to be aggregated and integrated into a final model that represents the complete knowledge. Kiran and Babu [ 123 ] also pointed out that communication will be the bottleneck when using this kind of distributed computing framework.

A simple example of distributed data mining framework [ 86 ]
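
The aggregate-local-models pattern in the figure can be sketched as follows: each worker trains a model on its own shard, and a coordinator combines the local models, here by majority vote. The use of decision trees, a process pool, and voting is an assumption for illustration; this is not the framework of [ 86 ] or [ 123 ].

```python
# Toy sketch of the distributed-mining pattern: workers fit local models,
# a coordinator aggregates them into a global model (majority vote here).
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_local_model(local_data):
    """Runs on a worker node: train on the locally available data only."""
    X_local, y_local = local_data
    return DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_local, y_local)

def global_predict(models, X):
    """Coordinator: aggregate the local models by majority vote."""
    votes = np.stack([m.predict(X) for m in models])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

if __name__ == "__main__":
    X, y = make_classification(n_samples=9000, n_features=15, random_state=0)
    shards = [(X[i::3], y[i::3]) for i in range(3)]     # data split across workers
    with ProcessPoolExecutor(max_workers=3) as pool:    # local mining in parallel
        models = list(pool.map(fit_local_model, shards))
    print(global_predict(models, X[:5]))
```

Note that only the fitted models travel back to the coordinator; it is exactly this model exchange (rather than raw data exchange) that the cited works identify as the communication to be minimized.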

Bu et al. [ 124 ] found several research issues when trying to apply machine learning algorithms to parallel computing platforms. For instance, the early version of the map-reduce framework does not support “iteration” (i.e., repeatedly applying the same job until convergence), which many learning algorithms require. The good news is that some recent works [ 87 , 125 ] have paid close attention to this problem and tried to fix it. Similar to the solutions for enhancing the performance of traditional data mining algorithms, one possible way to enhance the performance of a machine learning algorithm is to use CUDA, i.e., a GPU, to reduce the computing time of data analysis. Hasan et al. [ 126 ] used CUDA to implement the self-organizing map (SOM) and multiple back-propagation (MBP) for the classification problem. The simulation results show that using a GPU is faster than using a CPU; more precisely, SOM running on a GPU is three times faster than SOM running on a CPU, and MBP running on a GPU is twenty-seven times faster than MBP running on a CPU. Another study [ 127 ] attempted to apply an ant-based algorithm to a grid computing platform. Since the proposed mining algorithm is extended from the ant clustering algorithm of Deneubourg et al. [ 128 ] (Footnote 6), Ku-Mahamud modified the ant behavior of this clustering algorithm for big data clustering; that is, each ant is randomly placed on the grid, which means the ant clustering algorithm can then be used in a parallel computing environment.

The trends of machine learning studies for big data analytics are twofold: one is to make existing machine learning algorithms run on parallel platforms, such as Radoop [ 129 ], Mahout [ 87 ], and PIMRU [ 124 ]; the other is to redesign machine learning algorithms to make them suitable for parallel computing or for a parallel computing environment, such as neural network algorithms for GPUs [ 126 ] and ant-based algorithms for grids [ 127 ]. In summary, both directions make it possible to apply machine learning algorithms to big data analytics, although many research issues still need to be solved, such as the communication cost between different computer nodes [ 86 ] and the large computation cost most machine learning algorithms require [ 126 ].

Output the result of big data analysis

The benchmarks PigMix [ 130 ], GridMix [ 131 ], TeraSort and GraySort [ 132 ], TPC-C, TPC-H, TPC-DS [ 133 ], and the Yahoo cloud serving benchmark (YCSB) [ 134 ] have been presented for evaluating the performance of cloud computing and big data analytics systems. Ghazal et al. [ 135 ] presented another benchmark (called BigBench) intended as an end-to-end big data benchmark which covers the 3V characteristics of big data and uses the loading time, time for queries, time for procedural processing queries, and time for the remaining queries as its metrics. In these benchmarks, the computation time is one of the most intuitive metrics for evaluating the performance of different big data analytics platforms or algorithms. That is why Cheptsov [ 136 ] compared high performance computing (HPC) and cloud systems by measuring computation time, to understand their scalability for text file analysis. In addition to the computation time, the throughput (e.g., the number of operations per second) and the read/write latency of operations are other common measurements for big data analytics [ 137 ]. In [ 138 ], Zhao et al. argue that the maximum size of data and the maximum number of jobs are two important metrics for understanding the performance of a big data analytics platform. Another study [ 139 ] presented a systematic evaluation method which covers the data throughput, concurrency during the map and reduce phases, response times, and the execution time of map and reduce. Moreover, most benchmarks for evaluating the performance of big data analytics typically provide only the response time or the computation cost; however, several factors need to be considered at the same time when building a big data analytics system, such as the hardware, the bandwidth for data transmission, fault tolerance, cost, and power consumption [ 70 , 104 ]. Since several solutions available today install the big data analytics on a cloud computing system or a cluster system, the measurements of fault tolerance, task execution, and cost of cloud computing systems can then be used to evaluate the corresponding aspects of big data analytics.
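
As a minimal illustration of these metrics, the following sketch measures total computation time, throughput, and per-operation latency around a placeholder analytics operation; the workload and the percentile choice are assumptions, not part of any cited benchmark.

```python
# Minimal sketch of collecting computation time, throughput, and per-operation
# latency for an arbitrary analytics operation (placeholder workload).
import time
import statistics

def analytics_operation(record):
    # Placeholder for a query / map task / read-write operation.
    return sum(record) / len(record)

records = [[float(i + j) for j in range(50)] for i in range(20_000)]

latencies = []
start = time.perf_counter()
for r in records:
    t0 = time.perf_counter()
    analytics_operation(r)
    latencies.append(time.perf_counter() - t0)
elapsed = time.perf_counter() - start

print(f"total time     : {elapsed:.3f} s")
print(f"throughput     : {len(records) / elapsed:,.0f} ops/s")
print(f"median latency : {statistics.median(latencies) * 1e6:.1f} us")
print(f"p99 latency    : {sorted(latencies)[int(0.99 * len(latencies))] * 1e6:.1f} us")
```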

How to present the analysis results to the user is another important task in the output part of big data analytics because if the user cannot easily understand the meaning of the results, the results will be entirely useless. Business intelligence and network monitoring are two common application areas in which the user interface plays a vital role in making the system workable. Zhang et al. [ 140 ] pointed out that the tasks of visual analytics for commercial systems can be divided into four categories: exploration, dashboards, reporting, and alerting. The study in [ 141 ] showed that the interface for electroencephalography (EEG) interpretation is another noticeable research issue in big data analytics, and the user interface for cloud systems [ 142 , 143 ] is a recent trend in big data analytics. The user interface plays two vital roles in a big data analytics system: one is to simplify the explanation of the discovered knowledge to the users, while the other is to make it easier for the users to steer the data analytics system according to their needs. According to our observations, a flexible user interface is needed because although big data analytics can help us find hidden information, the information found is usually not yet knowledge. This situation is just like the example we mentioned in “ Output the result ”: mining or statistical techniques can be employed to learn the flu situation of each region, but data scientists sometimes need additional ways to display the information in order to find the knowledge they need or to prove their assumptions. Thus, the user interface should be adjustable by the user so that it displays the knowledge that is urgently needed for big data analytics.

Summary of process of big data analytics

The discussion of big data analytics in this section was divided into input, analysis, and output to mirror the data analysis process of KDD. For the input (see also “ Big data input ”) and output (see also “ Output the result of big data analysis ”) of big data, several methods and solutions proposed before the big data age (see also “ Data input ”) can still be employed for big data analytics in most cases.

However, there still exist some new input and output issues that data scientists need to confront. A representative example mentioned in “ Big data input ” is that the bottleneck will not only be at the sensors or input devices; it may also appear at other places in the data analytics process [ 71 ]. Although we can employ traditional compression and sampling technologies to deal with this problem, they can only mitigate it rather than solve it completely. Similar situations also exist in the output part. Although several measurements can be used to evaluate the performance of the frameworks, platforms, and even data mining algorithms, there still exist several new issues in the big data age, such as information fusion from different information sources or information accumulation over time.

Several studies have attempted to present an efficient or effective solution at the system level (e.g., framework and platform) or the algorithm level. A simple comparison of these big data analysis technologies from different perspectives is given in Table 3 to provide a brief introduction to the current studies and trends of data analysis technologies for big data. The “Perspective” column of this table indicates whether a study is focused on the framework or the algorithm level; the “Description” column gives the further goal of the study; and the “Name” column gives the abbreviated name of the method or platform/framework. From the analysis framework perspective, this table shows that big data framework , platform , and machine learning are the current research trends in big data analytics systems. From the mining algorithm perspective, the clustering , classification , and frequent pattern mining issues play the vital roles in this research because several data analysis problems can be mapped to these essential issues.

A promising trend that can easily be found in these successful examples is to use machine learning as the search algorithm (i.e., mining algorithm) for the data mining problems of a big data analytics system. Machine learning-based methods are able to make the mining algorithms and relevant platforms smarter or to reduce redundant computation costs. The strong impact of parallel computing and cloud computing technologies on big data analytics can also be recognized as follows: (1) most of the big data analytics frameworks and platforms use Hadoop and Hadoop-related technologies to design their solutions; and (2) most of the mining algorithms for big data analysis have been designed for parallel computing via software or hardware, or designed for map-reduce-based platforms.

Judging from recent studies of big data analytics, the field is still at an early stage of Nolan’s stages of growth model [ 146 ], which is similar to the situation for the research topics of cloud computing, internet of things, and smart grid. This is because several studies have merely attempted to apply traditional solutions to the new problems/platforms/environments. For example, several studies [ 114 , 145 ] used k -means as an example to analyze big data, but not many studies have applied state-of-the-art data mining algorithms and machine learning algorithms to the analysis of big data. This suggests that the performance of big data analytics can be further improved by the data mining algorithms and metaheuristic algorithms presented in recent years [ 147 ]. The technologies for compression, sampling, or even the platforms presented in recent years may also be used to enhance the performance of big data analytics systems. As a result, although these research topics still have several open issues that need to be solved, this situation, on the contrary, also illustrates that everything is possible in these studies.

The open issues

Data analytics today may be inefficient for big data because the environment, devices, systems, and even the problems themselves are quite different from those of traditional mining problems, even though several characteristics of big data also exist in traditional data analytics. In this section, several open issues caused by big data are addressed from the platform/framework and data mining perspectives to explain the dilemmas we may confront. Here are some of the open issues:

Platform and framework perspective

Input and output ratio of platform.

A large number of reports and studies have claimed that we will enter the big data age in the near future. Some of them suggest that the fruitful results of big data will lead us to a whole new world where “everything” is possible, as if big data analytics will be an omniscient and omnipotent system. From a pragmatic perspective, big data analytics is indeed useful and has many possibilities which can help us understand the so-called “things” more accurately. However, the situation in most studies of big data analytics is that they argue that the results of big data are valuable while the business models of most big data analytics remain unclear. Since assuming that we have infinite computing resources for big data analytics is a thoroughly impracticable plan, the input and output ratio (e.g., return on investment) will need to be taken into account before an organization constructs a big data analytics center.

Communication between systems

Since most big data analytics systems will be designed for parallel computing, and they typically will run on other systems (e.g., a cloud platform) or work with other systems (e.g., a search engine or knowledge base), the communication between the big data analytics and the other systems will strongly impact the performance of the whole KDD process. The first research issue is the communication cost incurred between the systems of data analytics; how to reduce this cost is the very first thing that data scientists need to care about. Another research issue is how the big data analytics communicates with other systems; the consistency of data between different systems, modules, and operators is also an important open issue here. Because communication will occur ever more frequently between the systems of big data analytics, how to reduce its cost and how to make it as reliable as possible are two important open issues for big data analytics.

Bottlenecks on data analytics system

Bottlenecks will appear in different places of the data analytics for big data because the environments, systems, and input data have changed and are different from those of traditional data analytics. The data deluge of big data will fill up the “input” system of data analytics, and it will also increase the computation load of the data “analysis” system. This situation is just like a torrent of water (i.e., the data deluge) rushing down a mountain (i.e., the data analytics): how to split it and how to keep it from flowing into a narrow place (e.g., an operator that is not able to handle the input data) are the most important things to do to avoid bottlenecks in the data analytics system. One current solution to avoiding bottlenecks in a data analytics system is to add more computation resources; the other is to split the analysis work across different computation nodes. A complete consideration of the whole data analytics pipeline to avoid such bottlenecks is still needed for big data.

Security issues

Since much more environmental data and human behavior will be gathered by big data analytics, how to protect these data will also be an open issue, because without a secure way to handle the collected data, big data analytics cannot be a reliable system. Although security must be tightened for big data analytics before it can gather even more data from everywhere, the fact is that, until now, not many studies have focused on the security issues of big data analytics. According to our observation, the security issues of big data analytics can be divided into four categories: input, data analysis, output, and communication with other systems. For the input, security can be regarded as a data gathering issue which is relevant to the sensors, handheld devices, and even the devices of the internet of things; one important security issue on the input part of big data analytics is to make sure that the sensors are not compromised by attacks. For the analysis and output, security can be regarded as a problem of the analytics system itself. For communication with other systems, the security problem lies in the communications between the big data analytics and external systems. Because of these latent problems, security has become one of the open issues of big data analytics.

Data mining perspective

Data mining algorithms for map-reduce solutions.

As mentioned in the previous sections, most traditional data mining algorithms are not designed for parallel computing; therefore, they are not particularly useful for big data mining. Several recent studies have attempted to modify traditional data mining algorithms to make them applicable to Hadoop-based platforms. As long as porting the data mining algorithms to Hadoop is inevitable, making the data mining algorithms work on a map-reduce architecture is the very first thing to do to apply traditional data mining methods to big data analytics. Unfortunately, not many studies have attempted to make data mining and soft computing algorithms work on Hadoop, because several different backgrounds are needed to develop and design such algorithms; for instance, the researchers and their research group need backgrounds in both data mining and Hadoop. Another open issue is that most data mining algorithms are designed for centralized computing, that is, they can only work on all the data at the same time, so how to make them work on a parallel computing system is also difficult. The good news is that some studies [ 145 ] have successfully applied traditional data mining algorithms to the map-reduce architecture, which implies that it is possible to do so (a sketch of the general idea is given below). According to our observation, although traditional mining or soft computing algorithms can be used to help us analyze data in big data analytics, until now not many studies have focused on this; as a consequence, it is an important open issue in big data analytics.
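
As a sketch of what such a port involves, the following toy code expresses one k-means iteration as a map step (assign points and emit partial sums per centroid) and a reduce step (combine partial sums and recompute centroids). Plain NumPy stands in for a real Hadoop or Spark job, and the data, cluster count, and iteration count are illustrative assumptions, not any cited implementation.

```python
# One k-means iteration decomposed into map and reduce steps, as a sketch of
# how a centralized mining algorithm can be re-expressed for map-reduce.
import numpy as np

def map_partition(points, centroids):
    """Map: assign each point to its nearest centroid, emit {cid: (sum, count)}."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    out = {}
    for cid in range(len(centroids)):
        mask = assign == cid
        if mask.any():
            out[cid] = (points[mask].sum(axis=0), int(mask.sum()))
    return out

def reduce_centroids(mapped, centroids):
    """Reduce: combine partial sums per centroid id and recompute the centroids."""
    new = centroids.copy()
    for cid in range(len(centroids)):
        partials = [m[cid] for m in mapped if cid in m]
        if partials:
            total = np.sum([s for s, _ in partials], axis=0)
            count = sum(c for _, c in partials)
            new[cid] = total / count
    return new

rng = np.random.default_rng(0)
data = rng.normal(size=(30_000, 4))
partitions = np.array_split(data, 6)                     # data blocks on different nodes
centroids = data[rng.choice(len(data), 3, replace=False)]
for _ in range(10):                                      # driver loop
    mapped = [map_partition(p, centroids) for p in partitions]   # parallelizable step
    centroids = reduce_centroids(mapped, centroids)
print(centroids)
```

Only the small per-centroid partial sums cross the (hypothetical) network, which is precisely why this decomposition scales better than shipping all the raw data to one node.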

Noise, outliers, incomplete and inconsistent data

Although big data analytics opens a new age for data analysis, because several of its solutions adopt classical ways to analyze the data, the open issues of traditional data mining algorithms also exist in these new systems. The issues of noise, outliers, and incomplete and inconsistent data in traditional data mining algorithms will also appear in big data mining algorithms. Incomplete and inconsistent data will appear even more easily because the data are captured by or generated from many different sensors and systems, so the impact of noise, outliers, and incomplete and inconsistent data will be amplified in big data analytics. How to mitigate this impact is therefore an open issue for big data analytics.

Bottlenecks on data mining algorithm

Most data mining algorithms in big data analytics will be designed for parallel computing. However, once data mining algorithms are designed or modified for parallel computing, it is the information exchange between different data mining procedures that may incur bottlenecks. One of them is the synchronization issue: different mining procedures will finish their jobs at different times even when they use the same mining algorithm on the same amount of data, so some procedures have to wait until the others have finished. This can happen because the load of different computer nodes may differ during the data mining process, or because the convergence speeds differ for the same data mining algorithm. The bottlenecks of data mining algorithms are thus an open issue for big data analytics, which means we need to take this issue into account when developing and designing new data mining algorithms for big data analytics.

Privacy issues

Privacy concerns typically make most people uncomfortable, especially if systems cannot guarantee that their personal information will not be accessed by other people and organizations. Different from the security concern, the privacy issue is about whether it is possible for the system to restore or infer personal information from the results of big data analytics, even though the input data are anonymous. Privacy has become a very important issue because, as data mining and other analysis technologies are widely used in big data analytics, private information may be exposed to other people after the analysis process. For example, although all the gathered data about shopping behavior are anonymous (e.g., a pistol was bought), because related data can easily be collected by different devices and systems (e.g., the location of the shop and the age of the buyer), a data mining algorithm can easily infer who bought the pistol; more precisely, the data analytics is able to narrow the scope of candidates because the location of the shop and the age of the buyer help the system find out the possible persons. For this reason, any sensitive information needs to be carefully protected and used. Anonymization, temporary identification, and encryption are the representative technologies for the privacy of data analytics, but the critical questions are how, what, and why the collected data are used in big data analytics.
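
The re-identification risk in this example can be made concrete with a tiny linkage-attack sketch: joining “anonymous” purchase records with another dataset on the quasi-identifiers (shop location and buyer age) can narrow the candidates down to a single person. All records and field names below are made up purely for illustration.

```python
# Tiny sketch of the re-identification risk described above: "anonymous"
# purchase records are linked to people via quasi-identifiers.
# All data below are fabricated solely for illustration.
anonymous_purchases = [
    {"item": "pistol", "shop_location": "Elm St", "buyer_age": 34},
    {"item": "book",   "shop_location": "Oak Ave", "buyer_age": 27},
]

# A second, seemingly harmless dataset (e.g., a loyalty-card extract).
residents = [
    {"name": "Alice", "age": 34, "often_shops_at": "Elm St"},
    {"name": "Bob",   "age": 34, "often_shops_at": "Oak Ave"},
    {"name": "Carol", "age": 27, "often_shops_at": "Oak Ave"},
]

def candidate_buyers(purchase, people):
    """Linkage attack: keep only the people matching both quasi-identifiers."""
    return [p["name"] for p in people
            if p["age"] == purchase["buyer_age"]
            and p["often_shops_at"] == purchase["shop_location"]]

print(candidate_buyers(anonymous_purchases[0], residents))   # ['Alice'] - re-identified
```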

Conclusions

In this paper, we reviewed studies on data analytics, from traditional data analysis to recent big data analysis. From the system perspective, the KDD process is used as the framework for these studies and is summarized into three parts: input, analysis, and output. From the perspective of big data analytics frameworks and platforms, the discussions are focused on performance-oriented and results-oriented issues. From the perspective of the data mining problem, this paper gives a brief introduction to data and big data mining algorithms, which consist of clustering, classification, and frequent pattern mining technologies. To better understand the changes brought about by big data, this paper focuses on the data analysis of KDD from the platform/framework to data mining. The open issues on computation, quality of end results, security, and privacy are then discussed to explain which open issues we may face. Last but not least, to help the readers of this paper find solutions to welcome the new age of big data, the possible high-impact research trends are given below:

For the computation time, there is no doubt at all that parallel computing is one of the important future trends to make the data analytics work for big data, and consequently the technologies of cloud computing, Hadoop, and map-reduce will play the important roles for the big data analytics. To handle the computation resources of the cloud-based platform and to finish the task of data analysis as fast as possible, the scheduling method is another future trend.

Efficient methods for reducing the computation time of the input, such as comparison, sampling, and a variety of reduction methods, will play an important role in big data analytics. Because these methods typically do not consider a parallel computing environment, how to make them work in such an environment will be a future research trend. The data mining algorithms face the same situation mentioned in the previous section: how to make them work in a parallel computing environment will be a very important research trend because there are abundant research results on traditional data mining algorithms.

How to model the mining problem so as to find something from big data and how to display the knowledge obtained from big data analytics will also be two vital future trends, because the results of these two lines of research will decide whether the data analytics can practically work for real-world applications, rather than remaining a theoretical exercise.

Methods for extracting information from external and related knowledge resources to further reinforce big data analytics are, until now, not very popular in big data analytics. However, combining information from different resources to add value to the output knowledge is a common solution in the area of information retrieval, such as clustering search engines or document summarization. For this reason, information fusion will also be a future trend for improving the end results of big data analytics.

Because metaheuristic algorithms are capable of finding an approximate solution within a reasonable time, they have been widely used for solving data mining problems in recent years. Until now, many state-of-the-art metaheuristic algorithms have still not been applied to big data analytics. In addition, compared to some early data mining algorithms, the performance of metaheuristics is no doubt superior in terms of computation time and the quality of the end result. From these observations, the application of metaheuristic algorithms to big data analytics will also be an important research topic.

Because social networks are part of the daily life of most people and their data are also a kind of big data, how to analyze the data of a social network has become a promising research issue. Obviously, such analysis can be used to predict the behavior of a user, after which we can devise applicable strategies for that user. For instance, a business intelligence system can use the analysis results to encourage particular customers to buy the goods they are interested in.

The security and privacy issues that accompany the work of data analysis are intuitive research topics which include how to safely store the data, how to make sure the data communication is protected, and how to prevent someone from finding out private information about us. Many problems of data security and privacy are essentially the same as those of traditional data analysis even as we enter the big data age. Thus, how to protect the data will also remain an important topic in the research of big data analytics.

In this paper, by the data analytics, we mean the whole KDD process, while by the data analysis, we mean the part of data analytics that is aimed at finding the hidden information in the data, such as data mining.

In this paper, by an unlabeled input data, we mean that it is unknown to which group the input data belongs. If all the input data are unlabeled, it means that the distribution of the input data is unknown.

In this paper, the analysis framework refers to the whole system, from raw data gathering, data reformat, data analysis, all the way to knowledge representation.

The whole system may go down when the master machine crashes in a system that has only one master.

The learner typically represents the classification function, which creates the classifier that helps us classify unlabeled input data.

The basic idea of [ 128 ] is that each ant will pick up and drop data items in terms of the similarity of its local neighbors.

Abbreviations

PCA: principal components analysis
3Vs: volume, velocity, and variety
IDC: International Data Corporation
KDD: knowledge discovery in databases
SVM: support vector machine
SSE: sum of squared errors
GLADE: generalized linear aggregates distributed engine
BDAF: big data architecture framework
CBDMASP: cloud-based big data mining & analyzing services platform
SODSS: service-oriented decision support system
HPCC: high performance computing cluster system
BI&A: business intelligence and analytics
DBMS: database management system
MSF: multiple species flocking
GA: genetic algorithm
SOM: self-organizing map
MBP: multiple back-propagation
YCSB: Yahoo cloud serving benchmark
HPC: high performance computing
EEG: electroencephalography

Lyman P, Varian H. How much information 2003? Tech. Rep, 2004. [Online]. Available: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/printable_report.pdf .

Xu R, Wunsch D. Clustering. Hoboken: Wiley-IEEE Press; 2009.


Ding C, He X. K-means clustering via principal component analysis. In: Proceedings of the Twenty-first International Conference on Machine Learning, 2004, pp 1–9.

Kollios G, Gunopulos D, Koudas N, Berchtold S. Efficient biased sampling for approximate clustering and outlier detection in large data sets. IEEE Trans Knowl Data Eng. 2003;15(5):1170–87.


Fisher D, DeLine R, Czerwinski M, Drucker S. Interactions with big data analytics. Interactions. 2012;19(3):50–9.

Laney D. 3D data management: controlling data volume, velocity, and variety, META Group, Tech. Rep. 2001. [Online]. Available: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf .

van Rijmenam M. Why the 3v’s are not sufficient to describe big data, BigData Startups, Tech. Rep. 2013. [Online]. Available: http://www.bigdata-startups.com/3vs-sufficient-describe-big-data/ .

Borne K. Top 10 big data challenges a serious look at 10 big data v’s, Tech. Rep. 2014. [Online]. Available: https://www.mapr.com/blog/top-10-big-data-challenges-look-10-big-data-v .

Press G. $16.1 billion big data market: 2014 predictions from IDC and IIA, Forbes, Tech. Rep. 2013. [Online]. Available: http://www.forbes.com/sites/gilpress/2013/12/12/16-1-billion-big-data-market-2014-predictions-from-idc-and-iia/ .

Big data and analytics—an IDC four pillar research area, IDC, Tech. Rep. 2013. [Online]. Available: http://www.idc.com/prodserv/FourPillars/bigData/index.jsp .

Taft DK. Big data market to reach $46.34 billion by 2018, EWEEK, Tech. Rep. 2013. [Online]. Available: http://www.eweek.com/database/big-data-market-to-reach-46.34-billion-by-2018.html .

Research A. Big data spending to reach $114 billion in 2018; look for machine learning to drive analytics, ABI Research, Tech. Rep. 2013. [Online]. Available: https://www.abiresearch.com/press/big-data-spending-to-reach-114-billion-in-2018-loo .

Furrier J. Big data market $50 billion by 2017—HP vertica comes out #1—according to wikibon research, SiliconANGLE, Tech. Rep. 2012. [Online]. Available: http://siliconangle.com/blog/2012/02/15/big-data-market-15-billion-by-2017-hp-vertica-comes-out-1-according-to-wikibon-research/ .

Kelly J, Vellante D, Floyer D. Big data market size and vendor revenues, Wikibon, Tech. Rep. 2014. [Online]. Available: http://wikibon.org/wiki/v/Big_Data_Market_Size_and_Vendor_Revenues .

Kelly J, Floyer D, Vellante D, Miniman S. Big data vendor revenue and market forecast 2012-2017, Wikibon, Tech. Rep. 2014. [Online]. Available: http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2012-2017 .

Mayer-Schonberger V, Cukier K. Big data: a revolution that will transform how we live, work, and think. Boston: Houghton Mifflin Harcourt; 2013.

Chen H, Chiang RHL, Storey VC. Business intelligence and analytics: from big data to big impact. MIS Quart. 2012;36(4):1165–88.

Kitchin R. The real-time city? big data and smart urbanism. Geo J. 2014;79(1):1–14.

Fayyad UM, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. AI Mag. 1996;17(3):37–54.

Han J. Data mining: concepts and techniques. San Francisco: Morgan Kaufmann Publishers Inc.; 2005.

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. Proc ACM SIGMOD Int Conf Manag Data. 1993;22(2):207–16.

Witten IH, Frank E. Data mining: practical machine learning tools and techniques. San Francisco: Morgan Kaufmann Publishers Inc.; 2005.

Abbass H, Newton C, Sarker R. Data mining: a heuristic approach. Hershey: IGI Global; 2002.


Cannataro M, Congiusta A, Pugliese A, Talia D, Trunfio P. Distributed data mining on grids: services, tools, and applications. IEEE Trans Syst Man Cyber Part B Cyber. 2004;34(6):2451–65.

Krishna K, Murty MN. Genetic \(k\) -means algorithm. IEEE Trans Syst Man Cyber Part B Cyber. 1999;29(3):433–9.

Tsai C-W, Lai C-F, Chiang M-C, Yang L. Data mining for internet of things: a survey. IEEE Commun Surveys Tutor. 2014;16(1):77–97.

Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comp Surveys. 1999;31(3):264–323.

McQueen JB. Some methods of classification and analysis of multivariate observations. In: Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, 1967. pp 281–297.

Safavian S, Landgrebe D. A survey of decision tree classifier methodology. IEEE Trans Syst Man Cyber. 1991;21(3):660–74.


McCallum A, Nigam K. A comparison of event models for naive bayes text classification. In: Proceedings of the National Conference on Artificial Intelligence, 1998. pp. 41–48.

Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. In: Proceedings of the annual workshop on Computational learning theory, 1992. pp. 144–152.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In : Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000. pp. 1–12.

Kaya M, Alhajj R. Genetic algorithm based framework for mining fuzzy association rules. Fuzzy Sets Syst. 2005;152(3):587–601.


Srikant R, Agrawal R. Mining sequential patterns: generalizations and performance improvements. In: Proceedings of the International Conference on Extending Database Technology: Advances in Database Technology, 1996. pp 3–17.

Zaki MJ. Spade: an efficient algorithm for mining frequent sequences. Mach Learn. 2001;42(1–2):31–60.


Baeza-Yates RA, Ribeiro-Neto B. Modern Information Retrieval. Boston: Addison-Wesley Longman Publishing Co., Inc; 1999.

Liu B. Web data mining: exploring hyperlinks, contents, and usage data. Berlin, Heidelberg: Springer-Verlag; 2007.

d’Aquin M, Jay N. Interpreting data mining results with linked data for learning analytics: motivation, case study and directions. In: Proceedings of the International Conference on Learning Analytics and Knowledge, pp 155–164.

Shneiderman B. The eyes have it: a task by data type taxonomy for information visualizations. In: Proceedings of the IEEE Symposium on Visual Languages, 1996, pp 336–343.

Mani I, Bloedorn E. Multi-document summarization by graph search and matching. In: Proceedings of the National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence, 1997, pp 622–628.

Kopanakis I, Pelekis N, Karanikas H, Mavroudkis T. Visual techniques for the interpretation of data mining outcomes. In: Proceedings of the Panhellenic Conference on Advances in Informatics, 2005. pp 25–35.

Elkan C. Using the triangle inequality to accelerate k-means. In: Proceedings of the International Conference on Machine Learning, 2003, pp 147–153.

Catanzaro B, Sundaram N, Keutzer K. Fast support vector machine training and classification on graphics processors. In: Proceedings of the International Conference on Machine Learning, 2008. pp 104–111.

Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1996. pp 103–114.

Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996. pp 226–231.

Ester M, Kriegel HP, Sander J, Wimmer M, Xu X. Incremental clustering for mining in a data warehousing environment. In: Proceedings of the International Conference on Very Large Data Bases, 1998. pp 323–333.

Ordonez C, Omiecinski E. Efficient disk-based k-means clustering for relational databases. IEEE Trans Knowl Data Eng. 2004;16(8):909–21.

Kogan J. Introduction to clustering large and high-dimensional data. Cambridge: Cambridge Univ Press; 2007.


Mitra S, Pal S, Mitra P. Data mining in soft computing framework: a survey. IEEE Trans Neural Netw. 2002;13(1):3–14.

Mehta M, Agrawal R, Rissanen J. SLIQ: a fast scalable classifier for data mining. In: Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology. 1996. pp 18–32.

Micó L, Oncina J, Carrasco RC. A fast branch and bound nearest neighbour classifier in metric spaces. Pattern Recogn Lett. 1996;17(7):731–9.

Djouadi A, Bouktache E. A fast algorithm for the nearest-neighbor classifier. IEEE Trans Pattern Anal Mach Intel. 1997;19(3):277–82.

Ververidis D, Kotropoulos C. Fast and accurate sequential floating forward feature selection with the bayes classifier applied to speech emotion recognition. Signal Process. 2008;88(12):2956–70.

Pei J, Han J, Mao R. CLOSET: an efficient algorithm for mining frequent closed itemsets. In: Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2000. pp 21–30.

Zaki MJ, Hsiao C-J. Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Trans Knowl Data Eng. 2005;17(4):462–78.

Burdick D, Calimlim M, Gehrke J. MAFIA: a maximal frequent itemset algorithm for transactional databases. In: Proceedings of the International Conference on Data Engineering, 2001. pp 443–452.

Chen B, Haas P, Scheuermann P. A new two-phase sampling based algorithm for discovering association rules. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. pp 462–468.

Zaki MJ. SPADE: an efficient algorithm for mining frequent sequences. Mach Learn. 2001;42(1–2):31–60.

Yan X, Han J, Afshar R. CloSpan: mining closed sequential patterns in large datasets. In: Proceedings of the SIAM International Conference on Data Mining, 2003. pp 166–177.

Pei J, Han J, Asl MB, Pinto H, Chen Q, Dayal U, Hsu MC. PrefixSpan mining sequential patterns efficiently by prefix projected pattern growth. In: Proceedings of the International Conference on Data Engineering, 2001. pp 215–226.

Ayres J, Flannick J, Gehrke J, Yiu T. Sequential PAttern Mining using a bitmap representation. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. pp 429–435.

Masseglia F, Poncelet P, Teisseire M. Incremental mining of sequential patterns in large databases. Data Knowl Eng. 2003;46(1):97–121.

Xu R, Wunsch-II DC. Survey of clustering algorithms. IEEE Trans Neural Netw. 2005;16(3):645–78.

Chiang M-C, Tsai C-W, Yang C-S. A time-efficient pattern reduction algorithm for k-means clustering. Inform Sci. 2011;181(4):716–31.

Bradley PS, Fayyad UM. Refining initial points for k-means clustering. In: Proceedings of the International Conference on Machine Learning, 1998. pp 91–99.

Laskov P, Gehl C, Krüger S, Müller K-R. Incremental support vector learning: analysis, implementation and applications. J Mach Learn Res. 2006;7:1909–36.


Russom P. Big data analytics. TDWI: Tech. Rep ; 2011.

Ma C, Zhang HH, Wang X. Machine learning for big data analytics in plants. Trends Plant Sci. 2014;19(12):798–808.

Boyd D, Crawford K. Critical questions for big data. Inform Commun Soc. 2012;15(5):662–79.

Katal A, Wazid M, Goudar R. Big data: issues, challenges, tools and good practices. In: Proceedings of the International Conference on Contemporary Computing, 2013. pp 404–409.

Baraniuk RG. More is less: signal processing and the data deluge. Science. 2011;331(6018):717–9.

Lee J, Hong S, Lee JH. An efficient prediction for heavy rain from big weather data using genetic algorithm. In: Proceedings of the International Conference on Ubiquitous Information Management and Communication, 2014. pp 25:1–25:7.

Famili A, Shen W-M, Weber R, Simoudis E. Data preprocessing and intelligent data analysis. Intel Data Anal. 1997;1(1–4):3–23.

Zhang H. A novel data preprocessing solution for large scale digital forensics investigation on big data, Master’s thesis, Norway, 2013.

Ham YJ, Lee H-W. International journal of advances in soft computing and its applications. Calc Paralleles Reseaux et Syst Repar. 2014;6(1):1–18.

Cormode G, Duffield N. Sampling for big data: a tutorial. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014. pp 1975–1975.

Satyanarayana A. Intelligent sampling for big data using bootstrap sampling and chebyshev inequality. In: Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering, 2014. pp 1–6.

Jun SW, Fleming K, Adler M, Emer JS. Zip-io: architecture for application-specific compression of big data. In: Proceedings of the International Conference on Field-Programmable Technology, 2012, pp 343–351.

Zou H, Yu Y, Tang W, Chen HM. Improving I/O performance with adaptive data compression for big data applications. In: Proceedings of the International Parallel and Distributed Processing Symposium Workshops, 2014. pp 1228–1237.

Yang C, Zhang X, Zhong C, Liu C, Pei J, Ramamohanarao K, Chen J. A spatiotemporal compression based approach for efficient big data processing on cloud. J Comp Syst Sci. 2014;80(8):1563–83.

Xue Z, Shen G, Li J, Xu Q, Zhang Y, Shao J. Compression-aware I/O performance analysis for big data clustering. In: Proceedings of the International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, 2012. pp 45–52.

Pospiech M, Felden C. Big data—a state-of-the-art. In: Proceedings of the Americas Conference on Information Systems, 2012, pp 1–23. [Online]. Available: http://aisel.aisnet.org/amcis2012/proceedings/DecisionSupport/22 .

Apache Hadoop, February 2, 2015. [Online]. Available: http://hadoop.apache.org .

Cuda, February 2, 2015. [Online]. Available: URL: http://www.nvidia.com/object/cuda_home_new.html .

Apache Storm, February 2, 2015. [Online]. Available: URL: http://storm.apache.org/ .

Curtin RR, Cline JR, Slagle NP, March WB, Ram P, Mehta NA, Gray AG. MLPACK: a scalable C++ machine learning library. J Mach Learn Res. 2013;14:801–5.

Apache Mahout, February 2, 2015. [Online]. Available: http://mahout.apache.org/ .

Huai Y, Lee R, Zhang S, Xia CH, Zhang X. DOT: a matrix model for analyzing, optimizing and deploying software for big data analytics in distributed systems. In: Proceedings of the ACM Symposium on Cloud Computing, 2011. pp 4:1–4:14.

Rusu F, Dobra A. GLADE: a scalable framework for efficient analytics. In: Proceedings of LADIS Workshop held in conjunction with VLDB, 2012. pp 1–6.

Cheng Y, Qin C, Rusu F. GLADE: big data analytics made easy. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2012. pp 697–700.

Essa YM, Attiya G, El-Sayed A. Mobile agent based new framework for improving big data analysis. In: Proceedings of the International Conference on Cloud Computing and Big Data. 2013, pp 381–386.

Wonner J, Grosjean J, Capobianco A, Bechmann D. Starfish: a selection technique for dense virtual environments. In: Proceedings of the ACM Symposium on Virtual Reality Software and Technology, 2012. pp 101–104.

Demchenko Y, de Laat C, Membrey P. Defining architecture components of the big data ecosystem. In: Proceedings of the International Conference on Collaboration Technologies and Systems, 2014. pp 104–112.

Ye F, Wang ZJ, Zhou FC, Wang YP, Zhou YC. Cloud-based big data mining and analyzing services platform integrating r. In: Proceedings of the International Conference on Advanced Cloud and Big Data, 2013. pp 147–151.

Wu X, Zhu X, Wu G-Q, Ding W. Data mining with big data. IEEE Trans Knowl Data Eng. 2014;26(1):97–107.

Laurila JK, Gatica-Perez D, Aad I, Blom J, Bornet O, Do T, Dousse O, Eberle J, Miettinen M. The mobile data challenge: big data for mobile computing research. In: Proceedings of the Mobile Data Challenge by Nokia Workshop, 2012. pp 1–8.

Demirkan H, Delen D. Leveraging the capabilities of service-oriented decision support systems: putting analytics and big data in cloud. Decision Support Syst. 2013;55(1):412–21.

Talia D. Clouds for scalable big data analytics. Computer. 2013;46(5):98–101.

Lu R, Zhu H, Liu X, Liu JK, Shao J. Toward efficient and privacy-preserving computing in big data era. IEEE Netw. 2014;28(4):46–50.

Cuzzocrea A, Song IY, Davis KC. Analytics over large-scale multidimensional data: The big data revolution!. In: Proceedings of the ACM International Workshop on Data Warehousing and OLAP, 2011. pp 101–104.

Zhang J, Huang ML. 5Ws model for big data analysis and visualization. In: Proceedings of the International Conference on Computational Science and Engineering, 2013. pp 1021–1028.

Chandarana P, Vijayalakshmi M. Big data analytics frameworks. In: Proceedings of the International Conference on Circuits, Systems, Communication and Information Technology Applications, 2014. pp 430–434.

Apache Drill February 2, 2015. [Online]. Available: URL: http://drill.apache.org/ .

Hu H, Wen Y, Chua T-S, Li X. Toward scalable systems for big data analytics: a technology tutorial. IEEE Access. 2014;2:652–87.

Sagiroglu S, Sinanc D. Big data: a review. In: Proceedings of the International Conference on Collaboration Technologies and Systems, 2013. pp 42–47.

Fan W, Bifet A. Mining big data: current status, and forecast to the future. ACM SIGKDD Explor Newslett. 2013;14(2):1–5.

Diebold FX. On the origin(s) and development of the term “big data”, Penn Institute for Economic Research, Department of Economics, University of Pennsylvania, Tech. Rep. 2012. [Online]. Available: http://economics.sas.upenn.edu/sites/economics.sas.upenn.edu/files/12-037.pdf .

Weiss SM, Indurkhya N. Predictive data mining: a practical guide. San Francisco: Morgan Kaufmann Publishers Inc.; 1998.

Fahad A, Alshatri N, Tari Z, Alamri A, Khalil I, Zomaya A, Foufou S, Bouras A. A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Topics Comp. 2014;2(3):267–79.

Shirkhorshidi AS, Aghabozorgi SR, Teh YW, Herawan T. Big data clustering: a review. In: Proceedings of the International Conference on Computational Science and Its Applications, 2014. pp 707–720.

Xu H, Li Z, Guo S, Chen K. Cloudvista: interactive and economical visual cluster analysis for big data in the cloud. Proc VLDB Endowment. 2012;5(12):1886–9.

Cui X, Gao J, Potok TE. A flocking based algorithm for document clustering analysis. J Syst Archit. 2006;52(89):505–15.

Cui X, Charles JS, Potok T. GPU enhanced parallel computing for large scale data clustering. Future Gener Comp Syst. 2013;29(7):1736–41.

Feldman D, Schmidt M, Sohler C. Turning big data into tiny data: Constant-size coresets for k-means, pca and projective clustering. In: Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2013. pp 1434–1453.

Tekin C, van der Schaar M. Distributed online big data classification using context information. In: Proceedings of the Allerton Conference on Communication, Control, and Computing, 2013. pp 1435–1442.

Rebentrost P, Mohseni M, Lloyd S. Quantum support vector machine for big feature and big data classification. CoRR , vol. abs/1307.0471, 2014. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1307.html#RebentrostML13 .

Lin MY, Lee PY, Hsueh SC. Apriori-based frequent itemset mining algorithms on mapreduce. In: Proceedings of the International Conference on Ubiquitous Information Management and Communication, 2012. pp 76:1–76:8.

Riondato M, DeBrabant JA, Fonseca R, Upfal E. PARMA: a parallel randomized algorithm for approximate association rules mining in mapreduce. In: Proceedings of the ACM International Conference on Information and Knowledge Management, 2012. pp 85–94.

Leung CS, MacKinnon R, Jiang F. Reducing the search space for big data mining for interesting patterns from uncertain data. In: Proceedings of the International Congress on Big Data, 2014. pp 315–322.

Yang L, Shi Z, Xu L, Liang F, Kirsh I. DH-TRIE frequent pattern mining on hadoop using JPA. In: Proceedings of the International Conference on Granular Computing, 2011. pp 875–878.

Huang JW, Lin SC, Chen MS. DPSP: Distributed progressive sequential pattern mining on the cloud. In: Proceedings of the Advances in Knowledge Discovery and Data Mining, vol. 6119, 2010, pp 27–34.

Paz CE. A survey of parallel genetic algorithms. Calc Paralleles Reseaux et Syst Repar. 1998;10(2):141–71.

kranthi Kiran B, Babu AV. A comparative study of issues in big data clustering algorithm with constraint based genetic algorithm for associative clustering. Int J Innov Res Comp Commun Eng 2014; 2(8): 5423–5432.

Bu Y, Borkar VR, Carey MJ, Rosen J, Polyzotis N, Condie T, Weimer M, Ramakrishnan R. Scaling datalog for machine learning on big data, CoRR , vol. abs/1203.0160, 2012. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1203.html#abs-1203-0160 .

Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G. Pregel: A system for large-scale graph processing. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2010. pp 135–146.

Hasan S, Shamsuddin S,  Lopes N. Soft computing methods for big data problems. In: Proceedings of the Symposium on GPU Computing and Applications, 2013. pp 235–247.

Ku-Mahamud KR. Big data clustering using grid computing and ant-based algorithm. In: Proceedings of the International Conference on Computing and Informatics, 2013. pp 6–14.

Deneubourg JL, Goss S, Franks N, Sendova-Franks A, Detrain C, Chrétien L. The dynamics of collective sorting robot-like ants and ant-like robots. In: Proceedings of the International Conference on Simulation of Adaptive Behavior on From Animals to Animats, 1990. pp 356–363.

Radoop [Online]. https://rapidminer.com/products/radoop/ . Accessed 2 Feb 2015.

PigMix [Online]. https://cwiki.apache.org/confluence/display/PIG/PigMix . Accessed 2 Feb 2015.

GridMix [Online]. http://hadoop.apache.org/docs/r1.2.1/gridmix.html . Accessed 2 Feb 2015.

TeraSoft [Online]. http://sortbenchmark.org/ . Accessed 2 Feb 2015.

TPC, transaction processing performance council [Online]. http://www.tpc.org/ . Accessed 2 Feb 2015.

Cooper BF, Silberstein A, Tam E, Ramakrishnan R, Sears R. Benchmarking cloud serving systems with ycsb. In: Proceedings of the ACM Symposium on Cloud Computing, 2010. pp 143–154.

Ghazal A, Rabl T, Hu M, Raab F, Poess M, Crolotte A, Jacobsen HA. BigBench: Towards an industry standard benchmark for big data analytics. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2013. pp 1197–1208.

Cheptsov A. Hpc in big data age: An evaluation report for java-based data-intensive applications implemented with hadoop and openmpi. In: Proceedings of the European MPI Users’ Group Meeting, 2014. pp 175:175–175:180.

Yuan LY, Wu L, You JH, Chi Y. Rubato db: A highly scalable staged grid database system for oltp and big data applications. In: Proceedings of the ACM International Conference on Conference on Information and Knowledge Management, 2014. pp 1–10.

Zhao JM, Wang WS, Liu X, Chen YF. Big data benchmark - big DS. In: Proceedings of the Advancing Big Data Benchmarks, 2014, pp. 49–57.

 Saletore V, Krishnan K, Viswanathan V, Tolentino M. HcBench: Methodology, development, and full-system characterization of a customer usage representative big data/hadoop benchmark. In: Advancing Big Data Benchmarks, 2014. pp 73–93.

Zhang L, Stoffel A, Behrisch M,  Mittelstadt S, Schreck T, Pompl R, Weber S, Last H, Keim D. Visual analytics for the big data era—a comparative review of state-of-the-art commercial systems. In: Proceedings of the IEEE Conference on Visual Analytics Science and Technology, 2012. pp 173–182.

Harati A, Lopez S, Obeid I, Picone J, Jacobson M, Tobochnik S. The TUH EEG CORPUS: A big data resource for automated eeg interpretation. In: Proceeding of the IEEE Signal Processing in Medicine and Biology Symposium, 2014. pp 1–5.

Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R. Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment. 2009;2(2):1626–9.

Beckmann M, Ebecken NFF, de Lima BSLP, Costa MA. A user interface for big data with rapidminer. RapidMiner World, Boston, MA, Tech. Rep., 2014. [Online]. Available: http://www.slideshare.net/RapidMiner/a-user-interface-for-big-data-with-rapidminer-marcelo-beckmann .

Januzaj E, Kriegel HP, Pfeifle M. DBDC: Density based distributed clustering. In: Proceedings of the Advances in Database Technology, 2004; vol. 2992, 2004, pp 88–105.

Zhao W, Ma H, He Q. Parallel k-means clustering based on mapreduce. Proceedings Cloud Comp. 2009;5931:674–9.

Nolan RL. Managing the crises in data processing. Harvard Bus Rev. 1979;57(1):115–26.

Tsai CW, Huang WC, Chiang MC. Recent development of metaheuristics for clustering. In: Proceedings of the Mobile, Ubiquitous, and Intelligent Computing, 2014; vol. 274, pp. 629–636.

Download references

Authors’ contributions

CWT contributed to the paper review and drafted the first version of the manuscript. CFL contributed to the paper collection and manuscript organization. HCC and AVV double checked the manuscript and provided several advanced ideas for this manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions on the paper. This work was supported in part by the Ministry of Science and Technology of Taiwan, R.O.C., under Contracts MOST103-2221-E-197-034, MOST104-2221-E-197-005, and MOST104-2221-E-197-014.

Compliance with ethical guidelines

Competing interests: The authors declare that they have no competing interests.

Author information

Authors and Affiliations

Department of Computer Science and Information Engineering, National Ilan University, Yilan, Taiwan

Chun-Wei Tsai & Han-Chieh Chao

Institute of Computer Science and Information Engineering, National Chung Cheng University, Chia-Yi, Taiwan

Chin-Feng Lai

Information Engineering College, Yangzhou University, Yangzhou, Jiangsu, China

Han-Chieh Chao

School of Information Science and Engineering, Fujian University of Technology, Fuzhou, Fujian, China

Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, SE-931 87, Skellefteå, Sweden

Athanasios V. Vasilakos


Corresponding author

Correspondence to Athanasios V. Vasilakos.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article.

Tsai, CW., Lai, CF., Chao, HC. et al. Big data analytics: a survey. Journal of Big Data 2 , 21 (2015). https://doi.org/10.1186/s40537-015-0030-3


Received : 14 May 2015

Accepted : 02 September 2015

Published : 01 October 2015

DOI : https://doi.org/10.1186/s40537-015-0030-3


Keywords: data analytics, data mining



Data Analytics in Healthcare: A Tertiary Study

Toni Taipalus

Faculty of Information Technology, University of Jyväskylä, P.O. Box 35, FI-40014 Jyvaskyla, Finland

Ville Isomöttönen

Hanna Erkkilä, Sami Äyrämö

Associated Data

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

The field of healthcare has seen a rapid increase in the applications of data analytics during the last decades. By utilizing different data analytic solutions, healthcare areas such as medical image analysis, disease recognition, outbreak monitoring, and clinical decision support have been automated to various degrees. Consequently, the intersection of healthcare and data analytics has received scientific attention to the point of numerous secondary studies. We analyze studies on healthcare data analytics, and provide a wide overview of the subject. This is a tertiary study, i.e., a systematic review of systematic reviews. We identified 45 systematic secondary studies on data analytics applications in different healthcare sectors, including diagnosis and disease profiling, diabetes, Alzheimer’s disease, and sepsis. Machine learning and data mining were the most widely used data analytics techniques in healthcare applications, with a rising trend in popularity. Healthcare data analytics studies often utilize four popular databases in their primary study search, typically select 25–100 primary studies, and the use of research guidelines such as PRISMA is growing. The results may help both data analytics and healthcare researchers towards relevant and timely literature reviews and systematic mappings, and consequently, towards respective empirical studies. In addition, the meta-analysis presents a high-level perspective on prominent data analytics applications in healthcare, indicating the most popular topics in the intersection of data analytics and healthcare, and provides a big picture on a topic that has seen dozens of secondary studies in the last 2 decades.

Introduction

The purpose of data analytics in healthcare is to find new insights in data, at least partially automate tasks such as diagnosing, and to facilitate clinical decision-making [ 1 , 2 ]. Higher hardware cost-efficiency and the popularization and advancement of data analysis techniques have led to data analytics gaining increasing scholarly and practical footing in the healthcare sector in recent decades [ 3 ]. Some data analytics solutions have also been demonstrated to surpass human effort [ 4 ]. As healthcare data is often characterized as diverse and plentiful, especially big data analysis techniques, prospects, and challenges have been discussed in scientific literature [ 5 ]. Other related concepts such as data mining, machine learning, and artificial intelligence have also been used either as buzzwords to promote data analytics applications or as genuine novel innovations or combinations of previously tested solutions.

The terms big data, big data analytics, and data analytics are often used interchangeably, which makes the search for related scientific works difficult. In particular, big data is often used as a synonym for analytics [6], a view contested in multiple sources [7–9]. In addition, the term data analytics is wide and usually at least partly subsumes concepts such as statistical analyses, machine learning, data mining, and artificial intelligence, many of which overlap with each other as well in terms of, e.g., using similar algorithms for different purposes. Finally, it is not uncommon that scientific works that are not focused on technical details discuss concepts such as machine learning at different levels of specificity. For example, some studies consider merely high-level paradigms such as supervised or unsupervised learning, while some discuss different tasks such as classification or clustering, and others focus on specific modeling techniques such as decision trees, kernel methods, or different types of artificial neural networks. These concerns of nomenclature and terminology apply to healthcare as well, and we adopt a broad view of both healthcare and data analytics in this study. In other words, with data analytics we refer to general data analytics encompassing terms such as data mining, machine learning, and big data analytics, and with healthcare we refer to different fields of medicine such as oncology and cardiology, some closely related concepts such as diagnosis and disease profiling, and diseases in the broad sense of the word, including but not limited to symptoms, injuries, and infections.

Naturally, because of growing interest in the intersection of data analytics and healthcare, the scientific field has seen numerous secondary studies on the applications of different data analysis techniques to different healthcare subfields such as disease profiling, epidemiology, oncology, and mental health. As the purpose of systematic reviews and mapping studies is to summarize and synthesize literature for easier conceptualization and a higher level view [ 10 , 11 ], when the number of secondary studies renders the subjective point of understanding a phenomenon on a high level arduous, a tertiary study is arguably warranted. In fact, we deemed the number of secondary studies high enough to conduct a tertiary study. In this study, we review systematic secondary studies on healthcare data analytics during 2000–2021, with the research goals to map publication fora, publication years, numbers of primary studies utilized, scientific databases utilized, healthcare subfields, data analytics subfields, and the intersection of healthcare and data analytics. The results indicate that the number of secondary studies is rising steadily, that data analytics is widely applied in a myriad of healthcare subfields, and that machine learning techniques are the most widely utilized data analytics subfield in healthcare. The relatively high number of secondary studies appears to be the consequence of over 6800 primary studies utilized by the secondary studies included in our review. Our results present a high-level overview of healthcare data analytics: specific and general data analytics and healthcare subfields and the intersection thereof, publication trends, as well as synthesis on the challenges and opportunities of healthcare data analytics presented by the secondary studies.

The rest of the study is structured as follows. In the next section, we describe the systematic method behind the secondary study search and selection. In Section “Results” we present the results of this tertiary study, and in Section “Discussion” we discuss the practical implications of the results as well as threats to validity. Section “Conclusion” concludes the study.

Search Strategy

We searched for eligible secondary studies using five databases: ACM Digital Library (ACM DL), IEEExplore, ScienceDirect, Scopus, and PubMed. In addition, we utilized Google Scholar, but the search returned too many results to be considered in a feasible timeframe. The search strings and publications returned from the respective databases are detailed in Table  1 . Because the relevant terms healthcare , big data and data analytics have been used in an ambiguous manner in the literature, we performed two rounds of backward snowballing, i.e., followed the reference lists of included articles to capture works not found by the database searches. The search and selection processes are detailed in Fig.  1 .

Table 1. Search strings—the Scopus database search returned 16,135 results, which were sorted by relevance, and the first 2,000 papers were selected for further inspection

Database | Search string | # of results
ACM DL | [Publication Title: data analy] AND [Publication Title: healthcare] AND [[Publication Title: review] OR [Publication Title: map] OR [Publication Title: systematic]] AND [Publication Date: (01/01/2000 TO 04/30/2021)] | 34
IEEExplore | data analy* healthcare review | 325
ScienceDirect | data analytics healthcare review | 118
Scopus | ( TITLE-ABS-KEY ( data AND analy* AND healthcare AND review ) AND PUBYEAR > 1999 ) AND ( review ) | 2,000*
PubMed | (healthcare[Title/Abstract]) AND (data analy*[Title/Abstract]) | 107
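For readers who want to rerun or update such a search programmatically, the PubMed query above can be submitted to the NCBI E-utilities esearch endpoint. The sketch below is illustrative only (the authors do not state how their searches were executed), and the retmax value and date handling shown here are assumptions:

```python
import requests

# Hypothetical reproduction of the PubMed search string from Table 1,
# issued against the public NCBI E-utilities esearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
QUERY = "(healthcare[Title/Abstract]) AND (data analy*[Title/Abstract])"

params = {
    "db": "pubmed",
    "term": QUERY,
    "retmode": "json",
    "retmax": 200,              # assumed page size, adjust as needed
    "datetype": "pdat",         # filter on publication date
    "mindate": "2000/01/01",    # mirrors the 2000 to April 2021 time frame
    "maxdate": "2021/04/30",
}

response = requests.get(ESEARCH_URL, params=params, timeout=30)
response.raise_for_status()
result = response.json()["esearchresult"]

print("Total hits:", result["count"])
print("First PMIDs:", result["idlist"][:10])
```

The other databases in Table 1 each have their own query syntax, which is why the search strings differ per database.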

Fig. 1 Study selection process showing the process step by step as well as the numbers of secondary studies in each step—A1, A2 and A3 refer to the authors responsible for each step, E refers to an exclusion criterion described in Table 2, and n indicates the number of included papers after a step was completed

Study Selection

After the secondary studies were searched for closer eligibility inspection, the first author applied the exclusion criteria listed in Table  2 . In case the first author was unsure about a study, the second author was consulted. In case a consensus was not reached, the third author was consulted with the final decision on whether to include or exclude the study. Regarding exclusion criterion E5, we only considered secondary studies, i.e., mapping studies and different types of literature reviews. Furthermore, due to different levels of systematic approaches, we deemed a study systematic if (i) the utilized databases were explicitly stated (i.e., stated with more detail than “we used databases such as...” or “we mainly used Scopus”), (ii) search terms were explicitly stated, and (iii) inclusion or exclusion criteria or both were explicitly stated. Regarding exclusion criteria E6, E7 and E8, several studies considered healthcare in related fields such as healthcare from administrative perspectives [ 12 ], healthcare data privacy [ 13 , 14 ], data quality [ 15 ], and comparing human performance with data analytics solutions [ 4 ]. Such studies were excluded. Similarly, studies returned by the database searches on data analytics related fields such as big data and its challenges [ 16 ], Internet-of-Things [ 17 ], and studies with a focus on software or hardware architectures behind analytics platforms [ 18 , 19 ] rather than on the process of analysis were also excluded.

Table 2. Exclusion (E) criteria

ID | Criterion for exclusion | Example studies excluded
E1 | Published online outside the time frame 2000 to April 2021 |
E2 | Published in a non-peer-reviewed forum |
E3 | Is not written in English |
E4 | Full text we could not find or download | [–]
E5 | Is not a systematic secondary study | [–]
E6 | Focus on data analytics but not on healthcare | [, ,]
E7 | Focus on healthcare but not on data analytics | [–]
E8 | Focus on healthcare-related field but not on healthcare | [–]
E9 | Automatic or semi-automatic mapping | []
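The mechanical parts of these criteria (publication window, venue type, language) can be pre-screened in code before the manual full-text review. The snippet below is a hypothetical illustration, not the authors' actual tooling; the Candidate fields are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    title: str
    year: int            # year of online publication
    peer_reviewed: bool
    language: str

def passes_screening(c: Candidate) -> bool:
    """Apply the mechanical exclusion criteria E1-E3; E4-E9 need manual review."""
    if not 2000 <= c.year <= 2021:        # E1: outside 2000 to April 2021
        return False
    if not c.peer_reviewed:               # E2: non-peer-reviewed forum
        return False
    if c.language.lower() != "english":   # E3: not written in English
        return False
    return True

candidates = [
    Candidate("Big data in health: a review", 2016, True, "English"),
    Candidate("Ein Überblick über Datenanalyse im Gesundheitswesen", 2018, True, "German"),
]
remaining = [c for c in candidates if passes_screening(c)]
print(f"{len(remaining)} of {len(candidates)} candidates proceed to manual review")
```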

It is worth noting that we followed the respective secondary study authors’ classification of techniques, e.g., whether a technique is considered machine learning or deep learning. In the case a study considered more than one data analytics or healthcare subfield, we categorized the study according to what was to our understanding the primary focus. This is the reason we have refrained from defining terms such as deep learning in this study—the definitions are numerous and by defining the terms, we might give the reader the impression that we have judged whether a secondary study is concerned with, e.g., machine learning or deep learning.

Publication Fora and Years

We included 45 secondary studies (abbreviated SE in the figures; cf. Appendix B for full bibliographic details). A total of 34 (76%) of the selected secondary studies were published in academic journals, nine (20%) in conference proceedings, and two (4%) were book chapters. Most of the studies were published in distinct fora (cf. Table 3), and fora with more than one selected secondary study consisted of Journal of Medical Systems, International Journal of Medical Informatics, Journal of Biomedical Informatics, and IEEE Access. As expected, the publication fora were aimed at either computer science, healthcare, or both. Finally, as can be observed in Fig. 2, the trend of systematic secondary studies in the intersection of data analytics and healthcare is growing.

Table 3. Publication fora

Forum | Number of studies
Journal of Medical Systems | 4
International Journal of Medical Informatics | 3
Journal of Biomedical Informatics | 3
IEEE Access | 2
Americas Conference on Information Systems (AMCIS) | 1
Annals of Operations Research | 1
Applied Clinical Informatics | 1
Archives of Computational Methods in Engineering | 1
Artificial Intelligence in Medicine | 1
Australasian Conference on Information Systems | 1
Biomedical Informatics Insights | 1
BMC Family Practice | 1
BMC Medical Informatics and Decision Making | 1
Clinical Microbiology and Infection | 1
Communications in Computer and Information Science | 1
Computational and Structural Biotechnology Journal | 1
Enterprise Information Systems | 1
Healthcare | 1
IEEE Computers, Software, and Applications Conference (COMPSAC) | 1
IEEE International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM) | 1
IEEE International Conference on Information Communication and Management (ICICM) | 1
IEEE Reviews in Biomedical Engineering | 1
IEEE Symposium on Industrial Electronics & Applications (ISIEA) | 1
International Conference on Emerging Technologies in Computer Engineering | 1
International Joint Conference on Biomedical Engineering Systems and Technologies | 1
International Journal of Healthcare Management | 1
Intensive Care Medicine | 1
JAMIA Open | 1
JMIR Medical Informatics | 1
Journal of Diabetes Science and Technology | 1
Journal of the Operational Research Society | 1
Management Decision | 1
NPJ Digital Medicine | 1
Procedia Computer Science | 1
Scientific Programming | 1
Studies in Health Technology and Informatics | 1
Yearbook of Medical Informatics | 1
Total | 45

Fig. 2 Number of included secondary studies by publication year (bars, left y-axis), and the number of included primary studies by publication year (dots, right y-axis)—the year 2021 was only considered from January to April; the figure shows that the number of secondary studies is rising

Secondary Study Qualities

The selected secondary studies utilized a total of 37 different databases. The most frequently used databases were PubMed, Scopus, IEEExplore, and Web of Science, respectively. Other relatively frequently used databases were ACM Digital Library, Google Scholar, and Springer Link. Most of the secondary studies (33, or 73%) utilized four or fewer databases ( M = 3.6, Mdn = 3). However, many bibliographic databases subsume others, and the number of utilized databases should not be taken as a metric for a systematic review quality. For example, a PubMed search implicitly searches MEDLINE records, and Google Scholar indexes works from most other scientific databases. The extended coverage of a wider range of academic works naturally results in numerous studies to further inspect, posing a challenge in the amount of work required. The most popular databases used in the secondary studies are visualized in Fig.  3 .

Fig. 3 Four most popular databases used by the secondary studies were PubMed, IEEExplore, Scopus and Web of Science—4 studies did not use any of these four databases, and other databases are not considered, e.g., the secondary study SE14, in addition to IEEExplore, might have also utilized other databases not visualized here

The secondary studies reported an average of 155 selected primary studies ( Mdn = 63, SD = 379.2), with a minimum of 6 (SE44) and a maximum of 2,421 primary studies (SE31). Five secondary studies selected more than 200 primary studies (cf. Fig  5 ). In total, the secondary studies utilized 6,838 primary studies. The number of secondary and primary studies categorized by the data analytics approach is summarized in Fig.  4 .

Fig. 4 Number of secondary studies included in this tertiary study, and the number of primary studies utilized by the secondary studies, categorized by data analytics approach; DA general data analytics, TA text analytics, INF informatics, NA network analytics, DL deep learning, PM process mining, BDA big data analytics, DM data mining, ML machine learning; the figure shows that the general term data analytics was the most popular in the secondary studies

Fig. 5 Number of primary studies (x-axis) selected for final inclusion in the secondary studies (y-axis), e.g., the chart shows that six secondary studies included 0–24 primary studies—one study (SE6) did not disclose the number of primary studies, and one study (SE15) reported two numbers, 24 primary studies for a quantitative analysis and 28 for a qualitative analysis, and we reported that study using the latter number

Some secondary studies reported similar details on their respective primary studies, such as visualizations of publication years (22 studies), research approach summaries such as the number of qualitative and quantitative studies (8 studies), research field summaries (4 studies), and details on the geographic distribution of the primary study authors (5 studies). The use of PRISMA (preferred reporting items for systematic reviews and meta-analyses) [ 41 ] guidelines was reported in 15 studies.

Subject Areas Identified

Some selected studies considered the relationship between healthcare in general and a specific data analysis technique, while other studies considered the relationship between data analytics in general and a specific healthcare subfield. Most of the studies, however, considered the relationship between a specific data analysis technique and a specific healthcare subfield. These considerations are summarized in Fig. 6. Readers interested mainly in general healthcare in the context of a specific analysis topic should refer to the secondary studies on the left-hand side, and readers interested in general data analytics in the context of a specific healthcare topic should refer to those on the right-hand side. Readers interested in a specific analysis topic applied to a specific healthcare topic should consider the studies in the middle, and readers interested in the applicability of analytics techniques in general to healthcare in general should consider the studies in the top row. Additional information on the secondary studies is presented in Appendix A.

Fig. 6 Selected secondary studies and whether they consider only specific data analytics techniques (left side), only specific healthcare subfields (right side), both (center), or neither (top); the figure may be utilized in finding relevant secondary studies on desired subfields

Implications

Considering the number of primary studies utilized, only 12 studies (27%) used more than a hundred primary studies. Figure  5 seems to indicate that the threshold for conducting a literature review or a mapping study in healthcare data analytics is typically between 25 and 100 studies. Furthermore, and on the basis of the evidence currently available, it seems reasonable to argue that at least 25 primary studies (84% of the secondary studies) warrant a systematic review, and the results of systematic reviews can be seen as valuable synthesizing contributions to the field. This observation arguably also supports the relevance of this study, although this study covers a relatively large intersection of the two research areas.

The earliest included secondary study was published in 2009, which might be explained by the relative novelty of data analysis in healthcare, at least with computerized automation rather than merely applying statistical analyses. In addition, although systematic reviews are relatively common in medicine, they have only recently gained popularity and visibility in information technology [ 10 ]. As may be observed in Fig.  2 , the trend of secondary studies is growing, which consequently indicates that the number of primary studies in the intersection of data analytics and healthcare is gaining research interest. The rising popularity of machine learning algorithms may be explained by the rising popularity of unstructured data, the growing utilization of graphics processing units, and the development of different machine learning tools and software libraries. Indeed, many of the techniques behind modern machine learning implementations have been around since the 1980s, but only the combination of large amounts of data, and developments in methods and computer hardware in recent years have made such implementations more cost-effective. The development of trends illustrated in, e.g., Fig.  2 propounds the view that machine learning algorithms will gain more and more practical applications in healthcare and related fields, such as molecular biology [ 42 ]. Finally, some studies have argued [ 43 ] as well as demonstrated [ 44 , 45 ] that the evolution of machine learning is changing the way research hypotheses are formulated. Instead of theory-driven hypothesis formulation, machine learning can be used to facilitate the formulation of data-driven hypotheses, also in the field of medicine.

Secondary study publication fora were numerous and focused either on information technology, healthcare, or both, without obvious anomalies. The secondary studies utilized dozens of different databases in their primary study searches. It seems that the coverage of these databases is not always understood, or that it is disregarded, even though utilizing non-overlapping databases results in less work in duplicate publication removal. For example, Scopus indexes some of ACM DL, some of Web of Science, and all of IEEExplore, effectively rendering an IEEExplore search redundant if Scopus is utilized—a fact we ourselves realized only after conducting our searches. In addition, Google Scholar appears to index the bibliographic details of effectively all published research, yet the number of search results returned may be overwhelming for a systematic review. In practice, the selection of databases balances the amount of work needed to examine the results on one end of the scales against coverage on the other. Backward or forward snowballing may be utilized to limit the amount of work and to extend coverage.
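When overlapping databases are searched anyway, duplicates are typically removed by matching DOIs or normalized titles. A minimal sketch with pandas, assuming an exported CSV with doi, title, and source columns (the file name and column names are illustrative, not from this study):

```python
import pandas as pd

# Assumed export of merged search results; columns doi, title, source are illustrative.
records = pd.read_csv("search_results.csv")

# Normalize the matching keys before deduplication.
records["doi_norm"] = records["doi"].str.strip().str.lower()
records["title_norm"] = (
    records["title"]
    .str.lower()
    .str.replace(r"[^a-z0-9 ]", "", regex=True)
    .str.strip()
)

# Prefer DOI matches; fall back to normalized titles for records without a DOI.
with_doi = records.dropna(subset=["doi_norm"]).drop_duplicates("doi_norm")
without_doi = records[records["doi_norm"].isna()].drop_duplicates("title_norm")

deduplicated = pd.concat([with_doi, without_doi], ignore_index=True)
print(f"{len(records) - len(deduplicated)} duplicate records removed")
```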

Secondary study topics summarized in Fig. 6 give some implications for subject areas of healthcare data analytics that are mature enough to warrant a secondary study. As the figure shows, these areas are aplenty, and the most frequently applied data analytics techniques seem to be machine learning (13 secondary studies) and data mining (7 secondary studies). It is worth noting that the nomenclature we applied in this study reflects that of the secondary study authors. As explained earlier in this study, attempts at defining, e.g., machine learning and data mining in this study would inevitably contradict the definitions given in some of the included secondary studies. For further reading, Cabatuan and Manguerra [46] provide a high-level overview of machine learning and deep learning, and Shukla, Patel and Sen [47] on data mining. For more technical approaches, both Ahmad, Qamar and Rizvi [30] and Harper [48] review data mining techniques and algorithms in healthcare.

Opportunities and Challenges in Healthcare Data Analytics

Many of the selected secondary studies provided syntheses on the current challenges and opportunities in healthcare data analytics. As the secondary studies inspected over 6800 studies of healthcare data analytics, we have summarized recurring insights here.

It was a generally accepted view in the secondary studies that healthcare data analytics is an opportunity that has already been partly realized, yet needs to be studied further and applied in more diverse contexts and in-depth scenarios [49–51]. For example, it has been noted that while big data applications are relatively mature in bio-informatics, this is not necessarily the case in other biomedical fields [52]. In general, healthcare data analytics is rather uniformly perceived as an opportunity for more cost-efficient healthcare [52, 53] through many applications, such as automating a specialist's routine tasks so that they may focus on tasks more crucial to a patient's treatment. These cost-efficiency gains are likely to become more concrete with novel deep learning techniques such as large language models [54], which are also offered through implementations that perform tasks faster while consuming fewer resources [55]. In addition to faster diagnoses, data analytics solutions may also offer more objective diagnoses in, e.g., pathology, if the models are trained with data from multiple pathologists.

Challenges regarding healthcare data analytics are more diverse. Perhaps the most discussed challenge was the nature of the data and how it can be treated. Many secondary studies highlighted problems with missing data [ 56 , 57 ], low-quality data [ 54 ], and datasets stored in various formats which are not interoperable with each other [ 52 , 55 , 56 ]. Furthermore, some studies raised the concern of missing techniques to visualize the outputs given by different data analyses [ 56 , 58 ]. Rather intuitively, many new implementations and the increases in the amount of data require new computational infrastructure for feasible use [ 54 , 58 – 60 ]. Some studies raised ethical concerns regarding data collection, merging, and sharing, as data privacy is a multifaceted concept [ 52 , 54 , 58 , 59 ], especially when the datasets cover multiple countries with different legislations. Many studies also called for multidisciplinary collaboration between medical and computing experts, stating that it is crucial that the analytics implementations are based on the same vocabulary and rules as medical experts use [ 49 , 57 , 61 – 64 ], and that the technical experts understand, e.g., how feasible it is to collect training data for a model to find patterns in medical images. Closely related, many of the more complex analytics solutions operate on a black box principle, meaning that it is not obvious how the implementation reaches the conclusion it reaches [ 56 , 59 , 65 – 67 ]. Open solutions, on the other hand, are typically understandable only for technical experts and may be outperformed by the more complex black box solutions. Finally, it has been observed that the already existing analytics solutions implemented in different environments, e.g., different hospitals [ 56 , 59 , 64 ], are not portable into other environments. In addition, it may be that the existing solutions are not fully integrated into actual day-to-day work [ 57 ]. Fleuren et al. [ 68 ] summarize the issue aptly, urging “ to bridge the gap between bytes and bedside. ”

Threats to Validity

As is typical for studies involving human judgment, it is possible for another group of researchers to select at least a slightly different group of studies. Furthermore, the categorization of studies into specific healthcare and data analytics topics is a likely candidate for the subject of change. We tried to mitigate the effect of human judgment by following the systematic mapping study guidelines, such as utilizing and reporting explicit exclusion criteria and search terms [ 11 ], following the PRISMA flow of information guidelines [ 41 ], and discussing discrepancies and disagreements among the authors until consensus was reached. Regarding the challenges related to the wide and rather ambiguous subject areas of data analytics and healthcare, we utilized two rounds of backward snowballing to mitigate the threat of missing relevant studies.

Conclusion

In this study, we systematically mapped systematic secondary studies on healthcare data analytics. The results indicate that the number of secondary—and naturally primary—studies is rising, and the scientific publication fora around the topics are numerous. We also discovered that the number of primary studies included in the secondary studies varies greatly, as do the scientific databases used in primary study search. The results also show that while machine learning and data mining seem to be the most popular data analytics subfields in healthcare, specific healthcare topics are more diverse. This meta-analysis provides researchers with a high-level overview of the intersection of data analytics and healthcare, and an accessible starting point towards specific studies. What was not considered in this study is whether and how much the selected secondary studies overlap in their primary study selection, which could indicate the level of either deliberate or unaware overlap of similar work.

Appendix A. Secondary Study Qualities

See Table 4.

Detailed information on the secondary studies— PS = number of primary studies initially considered and finally included, PRISMA = whether the guidelines were used, Fields = whether the study reports the fields of primary studies, e.g., information systems, computer science, Years = whether the study reports and visualizes the distribution of publication years, Approach = whether the study reports primary study approaches, e.g., case study, qualitative study, philosophical, Geographic = whether the study reports the geographic distribution of primary study authors

Study# of PS initial# of PS finalPRISMAFieldsYearsApproachGeographic
SE113058
SE273,54281
SE321172
SE411032
SE516,63132
SE6 69
SE7
SE812,64834
SE9293288
SE10249635
SE1169653
SE1259337
SE13972446
SE14About 20035
SE15528124 (28)
SE1610412
SE176,817804
SE1876120
SE19 84
SE206084117
SE2165291
SE221287103
SE2339341
SE24270118
SE2575837
SE26 321
SE2775568
SE28106522
SE293117
SE3012,39058
SE31202,1922,421
SE3220,726127
SE3321,230190
SE3412675
SE35835534
SE36209374
SE378529576
SE385069288
SE39702169
SE4014831
SE413695110
SE4292735
SE432,160101
SE4423926
SE4514138

Appendix B. Secondary Studies

  • Albahri AS, Hamid RA, Alwan Jk, Al-qays ZT, Zaidan AA, Zaidan BB, Albahri AOS, AlAmoodi AH, Khlaf JM, Almahdi EM, Thabet E, Hadi SM, Mohammed KI, Alsalem MA, Al-Obaidi JR, Madhloom HT. Role of biological data mining and machine learning techniques in detecting and diagnosing the novel coronavirus (COVID-19): a systematic review. J Med Syst. 2020;44(7).
  • Alkhatib M, Talaei-Khoei A, Ghapanchi A. Analysis of research in healthcare data analytics. In: Australasian Conference on Information Systems, 2016.
  • Alonso SG, de la Torre-Díez I, Hamrioui S, López-Coronado M, Barreno DC, Nozaleda LM, Franco M. Data mining algorithms and techniques in mental health: a systematic review. J Med Syst. 2018;42(9):1–15.
  • Alonso SG, de la Torre Diez I, Rodrigues JJPC, Hamrioui S, Lopez-Coronado M. A systematic review of techniques and sources of big data in the healthcare sector. J Med Syst. 2017;41(11):1–9.
  • Behera RK, Bala PK, Dhir A. The emerging role of cognitive computing in healthcare: a systematic literature review. Int J Med Inf. 2019;129:154–166.
  • Buettner R, Bilo M, Bay N, Zubac T. A systematic literature review of medical image analysis using deep learning. In: 2020 IEEE Symposium on Industrial Electronics & Applications (ISIEA). IEEE, 2020.
  • Buettner R, Klenk F, Ebert M. A systematic literature review of machine learning-based disease profiling and personalized treatment. In: 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 2020.
  • Cabatuan M, Manguerra M. Machine learning for disease surveillance or outbreak monitoring: a review. In: 2020 IEEE 12th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM). IEEE, 2020.
  • Carroll LN, Au AP, Detwiler LT, Fu Tc, Painter IS, Abernethy NF. Visualization and analytics tools for infectious disease epidemiology: a systematic review. J Biomed Inf. 2014;51:287–298.
  • Choudhury A, Asan O. Role of artificial intelligence in patient safety outcomes: systematic literature review. JMIR Med Inf. 2020;8(7):e18599.
  • Choudhury A, Renjilian E, Asan O. Use of machine learning in geriatric clinical care for chronic diseases: a systematic literature review. JAMIA Open. 2020;3(3):459–471.
  • Dallora AL, Eivazzadeh S, Mendes E, Berglund J, Anderberg P. Prognosis of dementia employing machine learning and microsimulation techniques: a systematic literature review. Proc Comput Sci. 2016;100:480–8.
  • de la Torre Díez I, Cosgaya HM, Garcia-Zapirain B, López-Coronado M. Big data in health: a literature review from the year 2005. J Med Syst. 2016;40(9).
  • Elbattah M, Arnaud E, Gignon M, Dequen G. The role of text analytics in healthcare: a review of recent developments and applications. In: Proceedings of the 14th International Joint Conference on Biomedical Engineering Systems and Technologies. SCITEPRESS—Science and Technology Publications, 2021.
  • Fleuren LM, Klausch TLT, Zwager CL, Schoonmade LJ, Guo T, Roggeveen LF, Swart EL, Girbes ARJ, Thoral P, Ercole A, et al. Machine learning for the prediction of sepsis: a systematic review and meta-analysis of diagnostic test accuracy. Intensive Care Med. 2020;46(3):383–400.
  • Gaitanou P, Garoufallou E, Balatsoukas P. The effectiveness of big data in health care: A systematic review. In: Communications in Computer and Information Science, pp. 141–153. Springer International Publishing; 2014.
  • Galetsi P, Katsaliaki K. A review of the literature on big data analytics in healthcare. J Oper Res Soc. 2020;71(10):1511–1529.
  • Gesicho MB, Babic A. Analysis of usage of indicators by leveraging health data warehouses: A literature review. In: Studies in Health Technology and Informatics, pages 184–187. IOS Press; 2019.
  • Iavindrasana J, Cohen G, Depeursinge A, Müller H, Meyer R, Geissbuhler A. Clinical data mining: a review. Yearb Med Inf. 2009;18(01):121–133.
  • Islam Md, Hasan Md, Wang X, Germack H, Noor-E-Alam Md. A systematic review on healthcare analytics: Application and theoretical perspective of data mining. Healthcare. 2018;6(2):54.
  • Kamble SS, Gunasekaran A, Goswami M, Manda J. A systematic perspective on the applications of big data analytics in healthcare management. Int J Healthc Manag. 2018;12(3):226–240.
  • Kavakiotis I, Tsave O, Salifoglou A, Maglaveras N, Vlahavas I, Chouvarda I. Machine learning and data mining methods in diabetes research. Comput Struct Biotechnol J. 2017;15:104–116.
  • Khanra S, Dhir A, Najmul Islam AKM, Mäntymäki M. Big data analytics in healthcare: a systematic literature review. Enterp Inf Syst. 2020;14(7):878–912.
  • Sudheer Kumar E, Shoba Bindu C. Medical image analysis using deep learning: a systematic literature review. In: International Conference on Emerging Technologies in Computer Engineering, pages 81–97. Springer; 2019.
  • Kurniati AP, Johnson O, Hogg D, Hall G. Process mining in oncology: a literature review. In: 2016 6th International Conference on Information Communication and Management (ICICM). IEEE, 2016.
  • Li J, Ding W, Cheng H, Chen P, Di D, Huang W. A comprehensive literature review on big data in healthcare. In: Twenty-second Americas Conference on Information Systems (AMCIS), 2016.
  • Luo J, Wu M, Gopukumar D, Zhao Y. Big data application in biomedical research and health care: A literature review. Biomed Inf Insights. 2016;8:BII.S31559.
  • Malik MM, Abdallah S, Ala'raj M. Data mining and predictive analytics applications for the delivery of healthcare services: a systematic literature review. Ann Oper Res. 2016;270(1-2):287–312.
  • Marinov M, Mohammad Mosa AS, Yoo I, Boren SA. Data-mining technologies for diabetes: A systematic review. J Diabetes Sci Technol. 2011;5(6):1549–1556.
  • Mehta N, Pandit A. Concurrence of big data analytics and healthcare: a systematic review. Int J Med Inf. 2018;114:57–65.
  • Mehta N, Pandit A, Shukla S. Transforming healthcare with big data analytics and artificial intelligence: a systematic mapping study. J Biomed Inf. 2019;100:103311.
  • Nazir S, Khan S, Khan HU, Ali S, Garcia-Magarino I, Atan RB, Nawaz M. A comprehensive analysis of healthcare big data management, analytics and scientific programming. IEEE Access. 2020;8:95714–95733.
  • Nazir S, Nawaz M, Adnan A, Shahzad S, Asadi S. Big data features, applications, and analytics in cardiology—a systematic literature review. IEEE Access. 2019;7:143742–143771.
  • Peiffer-Smadja N, Rawson TM, Ahmad R, Buchard A, Georgiou P, Lescure F-X, Birgand G, Holmes AH. Machine learning for clinical decision support in infectious diseases: a narrative review of current applications. Clin Microbiol Infect. 2020;26(5):584–595.
  • Raja R, Mukherjee I, Sarkar BK. A systematic review of healthcare big data. Sci Programm. 2020;2020.
  • Rojas E, Munoz-Gama J, Sepúlveda M, Capurro D. Process mining in healthcare: a literature review. J Biomed Inform. 2016;61:224–236.
  • Salazar-Reyna R, Gonzalez-Aleu F, Granda-Gutierrez EMA, Diaz-Ramirez J, Garza-Reyes JA, Kumar A. A systematic literature review of data science, data analytics and machine learning applied to healthcare engineering systems. Manag Decis. 2020.
  • Secinaro S, Calandra D, Secinaro A, Muthurangu V, Biancone P. The role of artificial intelligence in healthcare: a structured literature review. BMC Med Inf Decis Making. 2021;21(1).
  • Stafford IS, Kellermann M, Mossotto E, Beattie RM, MacArthur BD, Ennis S. A systematic review of the applications of artificial intelligence and machine learning in autoimmune diseases. NPJ Digit Med. 2020;3(1):1–11.
  • Teng AK, Wilcox AB. A review of predictive analytics solutions for sepsis patients. Appl Clin Inf. 2020;11(03):387–398.
  • Toor R, Chana I. Network analysis as a computational technique and its benefaction for predictive analysis of healthcare data: a systematic review. Arch Comput Methods Eng. 2020;28(3):1689–1711.
  • Tsang G, Xie X, Zhou S-M. Harnessing the power of machine learning in dementia informatics research: Issues, opportunities, and challenges. Rev Biomed Eng. 2020;13:113–129.
  • Waring J, Lindvall C, Umeton R. Automated machine learning: review of the state-of-the-art and opportunities for healthcare. Artif Intell Med. 2020;104:101822.
  • Waschkau A, Wilfling D, Steinhäuser J. Are big data analytics helpful in caring for multimorbid patients in general practice?—a scoping review. Family Pract. 2016;20(1).
  • Zhang R, Simon G, Yu F. Advancing Alzheimer’s research: a review of big data promises. Int J Med Inf. 2017;106:48–56.

Open Access funding provided by University of Jyväskylä (JYU). This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.


Declarations.

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Different Types of Data Analysis; Data Analysis Methods and Techniques in Research Projects

International Journal of Academic Research in Management, 9(1):1-9, 2022 http://elvedit.com/journals/IJARM/wp-content/uploads/Different-Types-of-Data-Analysis-Data-Analysis-Methods-and-Tec


Hamed Taherdoost

Hamta Group

Date Written: August 1, 2022

This article concentrates on defining data analysis and the concept of data preparation. The data analysis methods are then discussed: the six main categories are first described briefly, after which the statistical tools of the most commonly used methods, including descriptive, explanatory, and inferential analyses, are investigated in detail. Finally, the article focuses on qualitative data analysis, covering data preparation and analysis strategies in that context.

Keywords: Data Analysis, Data Preparation, Data Analysis Methods, Data Analysis Types, Descriptive Analysis, Explanatory Analysis, Inferential Analysis, Predictive Analysis, Explanatory Analysis, Causal Analysis and Mechanistic Analysis, Statistical Analysis.
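As a toy contrast between the descriptive and inferential analyses mentioned in the abstract (a sketch only, with made-up data, not code from the article):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Made-up measurements for two groups, e.g., a control and a treatment group.
group_a = rng.normal(loc=50.0, scale=10.0, size=30)
group_b = rng.normal(loc=55.0, scale=10.0, size=30)

# Descriptive analysis: summarize what the collected data look like.
for name, group in (("A", group_a), ("B", group_b)):
    print(f"Group {name}: mean={group.mean():.1f}, "
          f"median={np.median(group):.1f}, sd={group.std(ddof=1):.1f}")

# Inferential analysis: ask whether the observed difference is likely to
# generalize beyond this sample, here with a two-sample t-test.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```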






4 high-value use cases for synthetic data in healthcare

Synthetic data generation and use can bolster clinical research, application development and data privacy protection efforts in the healthcare sector.

Shania Kennedy, Assistant Editor

The hype around emerging technologies -- like generative AI -- in healthcare has brought significant attention to the potential value of analytics for stakeholders pursuing improved care quality, revenue cycle management and risk stratification.

But strategies to advance big data analytics hinge on the availability, quality and accessibility of data, which can create barriers for healthcare organizations.

Synthetic data -- artificially generated information not taken from real-world sources -- has been proposed as a potential solution to many of healthcare's data woes, but the approach comes with a host of pros and cons.

To successfully navigate these hurdles, healthcare stakeholders must identify relevant applications for synthetic data generation and use within the enterprise. Here, in alphabetical order, TechTarget Editorial's Healthtech Analytics will explore four use cases for synthetic healthcare data.

Application development

Proponents of synthetic data emphasize its potential to replicate the correlations and statistical characteristics of real-world data without the associated risks and costs. In doing so, these data sets lend themselves to the development of data-driven healthcare applications.

Much of the real-world data that would be used to build these tools is stored in a tabular format, meaning that the ability to generate tabular synthetic data could help streamline application development.

In a March 2023 study published in MultiMedia Modeling, researchers examined the potential of deep learning-based approaches to generate complex tabular data sets. They found that generative adversarial networks (GANs) tasked with creating synthetic tabular healthcare data were viable across a host of applications, even with the added complexity of differing numbers of variables and feature distributions.
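The studies above rely on GAN architectures, which are too involved to reproduce here. As a much simpler stand-in for the same idea, the sketch below preserves the marginal spread and pairwise correlations of a small tabular data set by fitting a multivariate normal distribution and sampling synthetic rows from it; the columns and values are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Invented "real" table; any resemblance to clinical data is illustrative only.
age = rng.normal(60, 12, 500)
systolic_bp = 90 + 0.7 * age + rng.normal(0, 8, 500)   # correlated with age
hba1c = rng.normal(6.5, 1.2, 500)
real = pd.DataFrame({"age": age, "systolic_bp": systolic_bp, "hba1c": hba1c})

# Fit per-column means plus the full covariance matrix, which captures the
# pairwise correlations that synthetic data should reproduce.
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

# Sample synthetic rows that mimic the marginal spread and correlations.
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=500),
    columns=real.columns,
)

print("Real correlations:\n", real.corr().round(2))
print("Synthetic correlations:\n", synthetic.corr().round(2))
```

Real clinical tables mix numeric, categorical, and heavily skewed variables, which is exactly why the cited work turns to GANs and related generators rather than a plain multivariate normal.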

A research team writing for the 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) demonstrated how GAN architecture can be used to develop wearable sensor data for remote patient monitoring systems.

In 2021, winners of ONC's Synthetic Health Data Challenge highlighted novel uses of the open source Synthea model, a synthetic health data generator. Among the winning proposals were tools to improve medication diversification, spatiotemporal analytics of the opioid epidemic and comorbidity modeling.

Synthetic data is also valuable in developing and testing healthcare-driven AI and machine learning (ML) technologies.

A 2019 study published in Sensors detailed the importance of behavior-based sensor data for testing ML applications in healthcare. However, existing approaches for generating synthetic data can be limited in terms of their realism and complexity.

To overcome this, the research team developed an ML-driven synthetic data generation approach for creating sensor data. The analysis revealed that this method generated high-quality data, even when constrained by a small amount of ground truth data. Further, the approach outperformed existing methods, including random data generation.

A research team writing in NPJ Digital Medicine in 2020 explored how a framework combining outlier analysis, graphical modeling, resampling and latent variable identification could be used to produce realistic synthetic healthcare data for assessing ML applications.

This approach is designed to help tackle issues like complex interactions between variables and data missingness, which can arise during the synthetic data generation process. Using primary care data, the researchers were able to use their method to generate synthetic data sets "that are not significantly different from original ground truth data in terms of feature distributions, feature dependencies, and sensitivity analysis statistics when inferring machine learning classifiers."

Further, the study found that this method had a low risk of generating synthetic data that was very similar or identical to real patients.
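For teams evaluating their own synthetic data, a per-feature two-sample Kolmogorov-Smirnov test is one common way to check the "feature distributions" part of such a claim. The function below is a generic sketch, not the evaluation pipeline from the cited paper; real and synthetic are assumed to be numeric DataFrames with matching columns:

```python
import pandas as pd
from scipy.stats import ks_2samp

def compare_feature_distributions(real: pd.DataFrame,
                                  synthetic: pd.DataFrame,
                                  alpha: float = 0.05) -> pd.DataFrame:
    """Run a two-sample KS test for every column shared by both tables.

    A p-value above alpha means the test found no significant difference
    between the real and synthetic distributions for that feature.
    """
    rows = []
    for column in real.columns.intersection(synthetic.columns):
        stat, p_value = ks_2samp(real[column].dropna(), synthetic[column].dropna())
        rows.append({
            "feature": column,
            "ks_statistic": round(stat, 3),
            "p_value": round(p_value, 3),
            "similar": p_value > alpha,
        })
    return pd.DataFrame(rows)

# Usage with DataFrames of your own: compare_feature_distributions(real_df, synthetic_df)
```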

Synthetic data's utility for healthcare-related application development is closely tied to its value for clinical research.

Clinical research

Clinical research -- particularly clinical trials -- is key to advancing innovations that improve patient outcomes and quality of life. But conducting this research is challenging due to issues like a lack of data standards and EHR missingness.

Researchers can overcome some of these obstacles by turning to synthetic data.

EHRs are valuable data sources for investigating diagnoses and treatments, but concerns about data quality and patient privacy create hurdles to their use. A research team looking to tackle these issues investigated the plausibility of synthetic EHR data generation in a 2021 study in the Computational Intelligence journal.

The study emphasizes that synthetic EHRs are needed to complement existing real-world data, as these could promote access to data, cost-efficiency, test efficiency, privacy protection, data completeness and benchmarking.

However, synthetic data generation methods for this purpose must preserve key ground truth characteristics of real-world EHR data -- including biological relationships between variables and privacy protections.

The research team proposed a framework to generate and evaluate synthetic healthcare data with these ground truth considerations in mind and found that the approach could successfully be applied to two distinct research use cases that rely on EHR-sourced cross-sectional data sets.

Similar methods are also useful for generating synthetic scans and test results, such as electrocardiograms (ECGs). Experts writing in the February 2021 issue of Electronics found that using GANs to create synthetic ECGs for research can potentially address data anonymization and data leakage.

In June, a team from Johns Hopkins University successfully developed a method to generate synthetic liver tumor computed tomography (CT) scans, which could help tackle the ongoing scarcity of high-quality tumor images.

The lack of real-world, annotated tumor CTs makes it difficult to curate the large-scale medical imaging data sets necessary to advance research into cancer detection algorithms.

Synthetic data is also helpful in bolstering infectious disease research. In 2020, Washington University researchers turned to synthetic data to accelerate COVID-19 research, allowing stakeholders to produce relevant data and share it among collaborators more efficiently.

The value of synthetic data in healthcare research is further underscored by efforts from government agencies and academic institutions to promote its use.

The United States Veterans Health Administration's Arches platform is designed to facilitate research collaboration by providing access to both real-world and synthetic veteran data, while the Agency for Healthcare Research and Quality offers its Synthetic Healthcare Database for Research to researchers who need access to high-quality medical claims data.

Alongside clinical research applications, synthetic health data also shows promise in emerging use cases like digital twin technology.

Digital twins

Digital twins serve as virtual representations of real-world processes or entities. The approach has garnered attention in healthcare for its ability to help represent individual patients and populations across various data-driven use cases.

Synthetic data has shown promise in bolstering the data that underpins a digital twin, and some health systems are already pursuing projects that combine synthetic data generation with digital twin modeling.

One such project, spearheaded by Cleveland Clinic and MetroHealth, aims to use digital twins to gain insights into neighborhood-level health disparities and their impact on patient outcomes.

Addressing the social determinants of health (SDOH) -- non-medical factors, such as housing and food security, that impact health -- is a major priority across the healthcare industry. To date, healthcare organizations have found success in building care teams to tackle SDOH and developing SDOH screening processes, but other approaches are needed to meaningfully advance health equity.

In an interview, leadership from Cleveland Clinic discussed how the Digital Twin Neighborhoods project hopes to utilize de-identified EHR data to generate synthetic populations that are closely matched to those of the real-world neighborhoods that Cleveland Clinic and MetroHealth serve.

By incorporating SDOH alongside geographic, biological and social information, the researchers hope to understand existing disparities and their drivers better. Using digital twins, the research team can explore the health profile of a community by simulating how various interventions might impact health status and outcomes over time.

The synthetic and real-world data used to run these simulations will help demonstrate how chronic disease risk, environmental exposures and other factors contribute to increased mortality and lower life expectancy through the lens of place-based health.

Using the digital twin models, the researchers will pursue initial projects assessing regional mental health and modifiable cardiovascular risk factor reduction.

This approach allows Cleveland Clinic and MetroHealth to safely use existing EHR data to inform health equity initiatives without unnecessarily risking patient privacy, one of the most promising applications for synthetic data.

Patient privacy preservation

Protecting patient privacy is paramount when health systems consider using data to improve care and reduce costs. Healthcare data de-identification helps ensure that the sharing and use of patient information is HIPAA-compliant, but the process cannot totally remove the risk of patient re-identification.

Removing or obscuring protected health information (PHI), as required by HIPAA, is only one aspect of de-identification. Another involves obscuring potential relationships between de-identified variables that could lead to re-identification.

Synthetic data can help create another layer of privacy preservation by replicating the statistical characteristics and correlations in the real-world data, enabling stakeholders to create a data set that doesn't contain PHI.

In doing so, both the privacy and value of the original data are protected, and that information can be used to inform many analytics projects. While no approach to patient privacy protection is completely foolproof, combining data de-identification, synthetic data use and the application of privacy-enhancing technologies strengthens patient privacy preservation efforts.
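A toy example of the idea of replicating statistical characteristics without copying records: the sketch below fits a multivariate normal to the numeric columns of a made-up "real" table and samples entirely artificial rows with similar correlations. Production synthetic-data generators are far more sophisticated, and every column name here is invented.

```python
# Minimal sketch: correlation-preserving synthetic records via a fitted multivariate normal.
# Real synthetic-data tools are far more sophisticated; column names are illustrative.
import numpy as np
import pandas as pd

def naive_synthetic(real_df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Fit mean/covariance to numeric columns and sample new, fully artificial rows."""
    numeric = real_df.select_dtypes(include=np.number)
    mean = numeric.mean().to_numpy()
    cov = numeric.cov().to_numpy()
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mean, cov, size=n_rows)
    return pd.DataFrame(samples, columns=numeric.columns)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    age = rng.normal(55, 10, 1000)
    systolic_bp = 90 + 0.6 * age + rng.normal(0, 8, 1000)   # correlated with age
    real = pd.DataFrame({"age": age, "systolic_bp": systolic_bp})
    synth = naive_synthetic(real, n_rows=1000)
    print(real.corr().round(2), synth.corr().round(2), sep="\n")  # similar correlations
```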

In 2021, a team from the Institute for Informatics at Washington University School of Medicine in St. Louis demonstrated synthetic data's potential to protect privacy while conducting clinical studies.

The researchers showed that, using software known as MDClone, users can build effective synthetic data sets for medical research that are statistically similar to real data while simultaneously preserving privacy more effectively than traditional de-identification.

The study authors noted that these capabilities have the potential to significantly speed up critical research.

These four use cases represent an array of opportunities for using synthetic data to transform healthcare and clinical research. While not without pitfalls, synthetic data is likely to see continued interest across the industry as stakeholders continue to explore advanced technologies like digital twins and AI.

Shania Kennedy has been covering news related to health IT and analytics since 2022.


Title: The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Abstract: One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aids to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems. Our code is open-sourced at this https URL
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
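The workflow the abstract describes can be pictured as a loop of idea generation, experimentation, writing, and automated review. The sketch below is only a conceptual outline of such a loop, with hypothetical stub functions standing in for the LLM and tooling calls; it is not the paper's released code.

```python
# Conceptual outline of an automated-research loop like the one described in the abstract.
# Every function here is a hypothetical stub standing in for an LLM or tooling call;
# this is not the paper's released implementation.
from dataclasses import dataclass, field

@dataclass
class Paper:
    idea: str
    code: str = ""
    results: dict = field(default_factory=dict)
    manuscript: str = ""
    review_score: float = 0.0

def generate_idea(topic: str, prior_ideas: list) -> str:
    return f"Idea about {topic} (building on {len(prior_ideas)} earlier ideas)"

def write_code(idea: str) -> str:
    return f"# experiment code for: {idea}"

def run_experiments(code: str) -> dict:
    return {"metric": 0.0}          # placeholder result

def write_manuscript(idea: str, results: dict) -> str:
    return f"Paper on '{idea}' reporting {results}"

def automated_review(manuscript: str) -> float:
    return 5.0                      # placeholder reviewer score (e.g., on a 1-10 scale)

def research_loop(topic: str, iterations: int, accept_threshold: float) -> list:
    accepted, prior_ideas = [], []
    for _ in range(iterations):
        paper = Paper(idea=generate_idea(topic, prior_ideas))
        paper.code = write_code(paper.idea)
        paper.results = run_experiments(paper.code)
        paper.manuscript = write_manuscript(paper.idea, paper.results)
        paper.review_score = automated_review(paper.manuscript)
        prior_ideas.append(paper.idea)          # open-ended: ideas feed later iterations
        if paper.review_score >= accept_threshold:
            accepted.append(paper)
    return accepted

papers = research_loop("diffusion modeling", iterations=3, accept_threshold=6.0)
```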


Data science and big data analytics: a systematic review of methodologies used in the supply chain and logistics research

  • Original Research
  • Open access
  • Published: 11 July 2023


  • Hamed Jahani (ORCID: orcid.org/0000-0002-7091-6060),
  • Richa Jain (ORCID: orcid.org/0000-0002-8307-2442) &
  • Dmitry Ivanov (ORCID: orcid.org/0000-0003-4932-9627)


Data science and big data analytics (DS &BDA) methodologies and tools are used extensively in supply chains and logistics (SC &L). However, the existing insights are scattered over different literature sources and there is a lack of a structured and unbiased review methodology to systematise DS &BDA application areas in the SC &L comprehensively covering efficiency, resilience and sustainability paradigms. In this study, we first propose a unique systematic review methodology for the field of DS &BDA in SC &L. Second, we use the methodology proposed for a systematic literature review on DS &BDA techniques in the SC &L fields aiming at classifying the existing DS &BDA models/techniques employed, structuring their practical application areas, and identifying the research gaps and potential future research directions. We analyse 364 publications which use a variety of DS &BDA-driven modelling methods for SC &L processes across different decision-making levels. Our analysis is triangulated across efficiency, resilience, and sustainability perspectives. The developed review methodology and proposed novel classifications and categorisations can be used by researchers and practitioners alike for a structured analysis and applications of DS &BDA in SC &L.


1 Introduction and background

In supply chains (SCs), large data sets are available through multiple sources such as enterprise resource planning (ERP) systems, logistics service providers, sales, supplier collaboration platforms, digital manufacturing, Blockchain, sensors, and customer buying patterns (Li et al., 2020b ; Rai et al., 2021 ; Li et al., 2022a ). Such data can be structured, semi-structured, and unstructured. Big data analytics (BDA) can be used to create knowledge from data to improve SC performance and decision-making capabilities. While BDA offers substantial opportunities for value creation, it also presents significant challenges for organisations (Chen et al., 2014 ; Choi et al., 2018 ).

Compared to BDA, which deals with collecting, storing, and analysing data, data science (DS) focuses on more complex data analytics, in particular predictive analytics such as machine learning and deep learning algorithms. From the methodological perspective, DS &BDA contribute to decision-making at the strategic, tactical, and operational levels of SC management. Organisations can use DS &BDA capabilities to achieve competitive advantage in the markets (Kamley et al., 2016). DS &BDA techniques also help organisations improve their SC design and management by reducing costs, increasing sustainability, mitigating risk and improving resilience (Baryannis et al., 2019b), understanding customer demands, and predicting market trends (Potočnik et al., 2019).
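To make the distinction concrete, the short sketch below illustrates the kind of predictive analytics referred to here: a random forest forecasting product demand. The features, coefficients, and data are invented purely for this example and do not come from any study cited in this review.

```python
# Minimal sketch of predictive analytics in an SC setting: forecasting demand
# from made-up historical features with a random forest (all data are invented).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
n = 1000
price = rng.uniform(5, 20, n)
promotion = rng.integers(0, 2, n)
week_of_year = rng.integers(1, 53, n)
demand = (200 - 6 * price + 40 * promotion
          + 10 * np.sin(2 * np.pi * week_of_year / 52)
          + rng.normal(0, 8, n))

X = np.column_stack([price, promotion, week_of_year])
X_train, X_test, y_train, y_test = train_test_split(X, demand, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```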

Along with methodological advancements, progress in DS &BDA tools can be observed. SC analytics software helps researchers and practitioners alike to develop better forecasting, optimization, and simulation models (Analytics, 2020). These tools can also extract data and produce advanced visualizations. Along with large corporations such as SAP®, IBM, and Oracle, there are also specific SC software packages, such as anyLogistix™ and LLamasoft™, that make it possible to integrate simulation and network design with SC operations data to build digital SC twins (Ivanov, 2021b; Burgos & Ivanov, 2021). These advanced methodological and software developments result in growing opportunities for SC researchers and practitioners. However, the existing insights are scattered over different literature sources, and there is a lack of a structured review of the DS &BDA application areas in SC and logistics (SC &L) comprehensively covering the efficiency, resilience and sustainability paradigms. This gap encouraged us to conduct this systematic and comprehensive literature review. In the next section, we elaborate in detail on our motivation for this study.

1.1 Motivation of the study

Google Trends data for "Data Science" and "Big Data" show continuously increasing interest in the DS &BDA in SC &L field over the last 19 years, while interest in SCs has remained steadily high (see Fig. 1). Interest in BDA, however, started increasing earlier than interest in DS, and the trends for "Big Data" and "Data Science" have recently converged.

Figure 1. Trends of interest in the topics of this research (2004–2022)

From an academic point-of-view, various literature review studies have recently indicated the benefits of using DS &BDA in SC &L management (Pournader et al., 2021 ; Riahi et al., 2021 ; Novais et al., 2019 ; Neilson et al., 2019 ; Ameri Sianaki et al., 2019 ; Baryannis et al., 2019b ; Choi et al., 2018 ; Govindan et al., 2018 ; Mishra et al., 2018 ; Arunachalam et al., 2018 ; Tiwari et al., 2018 ). Table  3 demonstrates the latest literature review publications in line with DS &BDA and affirms that although several review papers can be found around this topic, the reviews only explore SC &L from the specific viewpoint of BDA. Kotu and Deshpande ( 2018 ) concede that although the concept of big data is worthy of being explored separately, a holistic view on all aspects of data science with consideration of big data is of utmost importance and still needs to be researched in several areas such as SCs. Our investigation also shows that studies including BDA mostly discuss architecture and tools for BDA, but lack a contextualisation in the general data science methodologies. Waller and Fawcett ( 2013 ) affirm that along with the importance of data analysis in SCs, other issues related to data science are important in the SC, such as “data generation”, “data acquisition”, “data storage methods”, “fundamental technologies”, and “data-driven applications”, which are not necessarily connected to BDA.

The growing number of studies in DS &BDA and SC &L substantiates the need to adopt systematic approaches to aggregating and assessing research results to provide an objective and unbiased summary of research evidence. A systematic literature review is a procedural aggregation of precise outcomes of research. We explored several survey studies around our topic, shown in Table 3, to understand how researchers employ a systematic approach in their review process. Our general observation from the analysis of the literature is that the existing surveys mostly focus on BDA while missing a detailed analysis of DS and of the intersections of BDA and DS - a distinct and substantial contribution made by our study (Grover & Kar, 2017; Brinch, 2018; Nguyen et al., 2018; Kamble & Gunasekaran, 2020; Neilson et al., 2019; Talwar et al., 2021; Maheshwari et al., 2021). For instance, Maheshwari et al. (2021) conduct a systematic review of the role of BDA in SCM, but only select the keywords "Big data analytics" with "Supply chain management" or "Logistics management" or "Inventory management", which misses many relevant studies using DS applications with big data.

1.2 Basic terminologies

Since several terms are used in the area of DS &BDA, we introduce here some of the main terminologies in the domain of our research.

Data science is a knowledge-based field of study that provides not only predictive and statistical tools for decision-makers, but also an effective solution that can help manage organisations from a data-driven perspective. DS requires integration of different skills such as statistics, machine learning, predictive analyses, data-driven techniques, and computer sciences (Kotu & Deshpande, 2018 ; Waller & Fawcett, 2013 ).

Big data includes the mass of structured or unstructured data and has been commonly characterised in the literature by 6Vs, i.e., “volume” (high-volume data), “variety” (a great variety of formats and sources), “velocity” (rapid growth in generation), “veracity” (quality, trust, and importance of data), “variability” (statistical variation in the data parameters) and “value” (huge economic benefits from low-data density) (Mishra et al., 2018 ; Chen et al., 2014 ).

Predictive analytics project the future of a SC by investigating its data and employing mathematical and forecasting models (Kotu & Deshpande, 2018 ).

Prescriptive analytics employs optimisation, simulation, and decision-making mechanisms to enhance business performance (Kotu & Deshpande, 2018 ; Chen et al., 2022 ).

Diagnostic analytics is a financial-analytical approach that aims to discover the causes of events and behaviours (Xu & Li, 2016; Windt & Hütt, 2011).

Descriptive analytics aims to analyse problems and provide historical analytics regarding the organisation’s processes by applying some techniques such as data mining, data aggregation, online analytical processing (OLAP), or business intelligence (BI) (Kotu & Deshpande, 2018 ).
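As a concrete illustration of descriptive analytics, the following sketch aggregates a small, invented order table into an OLAP-style summary with pandas; the column names and figures are made up for the example.

```python
# Minimal sketch of descriptive analytics: OLAP-style aggregation of (invented) order data.
import pandas as pd

orders = pd.DataFrame({
    "region":   ["North", "North", "South", "South", "South"],
    "product":  ["A", "B", "A", "A", "B"],
    "quantity": [10, 4, 7, 3, 9],
    "revenue":  [100.0, 60.0, 70.0, 30.0, 135.0],
})

# Roll up historical performance by region and product (a simple cube slice).
summary = orders.pivot_table(index="region", columns="product",
                             values="revenue", aggfunc="sum", fill_value=0)
print(summary)
```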

The remainder of this paper is organised as follows: Sect. 2 describes our systematic research methodology, introducing the research questions, objectives, and conceptual framework, and identifying potential related studies. Section 3 presents the content analysis of the recent review studies, the material collection process, and the conceptual framework used for reviewing and coding the selected studies. Section 4 presents and describes our content analysis results of the selected studies. Section 5 identifies gaps in the literature of DS &BDA within the context of SC &L. Finally, Sect. 6 concludes our study by summarising the significant features of our detailed framework and by providing several future research avenues.

2 Research methodology

2.1 Research questions and objectives

To develop a conceptual framework for our research, the following research questions (RQs) have been framed:

What strategies are required (in line with the systematic review protocol) to identify studies related to our research topic? (RQ1)

What can be inferred about the research process and guidelines from the previous survey studies related to DS &BDA in SC &L? (RQ2)

What research topics and methodologies have been investigated in DS &BDA in the context of SC &L? (RQ3)

What are the existing gaps in the literature for using DS &BDA techniques in SC &L? (RQ4)

Consequently, the research objectives are defined as follows:

Developing a comprehensive and unbiased systematic process to identify a methodological taxonomy of DS &BDA in SC &L.

Proposing a conceptual framework to categorize application areas of DS &BDA in SC &L.

Identifying the gaps and future research areas in development and application of DS &BDA techniques in SC &L.

2.2 Research process

Figure 2 depicts the nine main steps of our research process derived from Kitchenham (2004). The process includes three major phases: "research planning", "conducting review", and "reporting results". We initially prepared the research plan by clarifying the research questions, defining the research objectives, and developing a review protocol for our study. A review protocol is an essential element in undertaking a systematic review and determines how primary studies are chosen and analysed. It also involves choosing beneficial resources/databases, study selection procedures or criteria (inclusion and exclusion criteria), and the proposed data synthesis method.

According to our defined review protocol, the second phase of the proposed research process (conducting the review) involves:

Conducting the analysis of recent review studies.

Material collection and identification of the available studies concerning the domain.

Developing a conceptual framework for reviewing and coding the collected studies.

Finally, we analyse the results of the content analysis and the coding/classification of the selected studies. This phase also involves exploring the potential gaps and concluding with significant insights.

Figure 2. Outline of the research process

2.3 Review protocol

Our review protocol is a systematic process of searching, demarcating, appraising, and selecting articles. A similar protocol has been adopted by a number of highly cited review papers in the literature (Nguyen et al., 2018; Wang et al., 2018b; Brinch, 2018). Material was collected from standard academic databases such as Web of Knowledge, Science Direct, Scopus, and Google Scholar, and only included "articles", "research papers" or "reviews". Results were limited to articles written in English and published between 2005 and 2021. The rationale behind this year range is twofold: it allows us to overview the latest studies and identify the research gaps in the area of DS &BDA in SC &L, and it enables us to develop a coding strategy to formulate a conceptual framework for classifying the literature.

Initially, a broad set of keywords was chosen to select potentially relevant studies. These keywords were "supply chain" OR "logistics", along with at least one of the following keywords: "data science", "data driven", "data mining", "text mining", "data analytics", "big data", "predictive analytics", and "machine learning". Additional search terms were later identified from the relevant articles to formulate more sophisticated search strings. We limited our search to articles that include the search keywords in their "title", "abstract", or "keywords". The entire contents of the articles were not studied at this stage. If any database returned a huge number of articles during the search, we followed a strategy to exclude or make selections from that database.
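As an illustration of how such a boolean search string can be assembled, the snippet below reconstructs the general pattern described above. It is only an approximation; the exact query syntax differs across databases and this is not the authors' verbatim search string.

```python
# Illustrative reconstruction of the kind of boolean search string described above
# (not the authors' exact query; syntax varies by database).
domain_terms = ['"supply chain"', '"logistics"']
method_terms = ['"data science"', '"data driven"', '"data mining"', '"text mining"',
                '"data analytics"', '"big data"', '"predictive analytics"', '"machine learning"']

query = f'({" OR ".join(domain_terms)}) AND ({" OR ".join(method_terms)})'
print(query)
# ("supply chain" OR "logistics") AND ("data science" OR "data driven" OR ...)
```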

Table 1 shows the number of extracted papers from each database, further subdivided as per the keywords in Table 2. Since we followed a comprehensive approach and selected a broad range of keywords, the search resulted in a large number of studies in comparison with the related review papers. Investigating the search results from Google Scholar demonstrated that most of those articles were irrelevant. Therefore, we deemed the Google Scholar results unreliable and did not consider the associated articles in the selection process. Moreover, after a thorough content analysis of the review articles and an examination of their search keywords against our proposed keyword set (listed in Table 2), we recognised that the terms "SC analytics" and "big data analytics" had been commonly used in most of these articles; thus, to provide a more comprehensive search process, we also added these two keywords to our previous set of search keywords. According to the above-mentioned selection process, the number of preliminary papers extracted from the three search databases was reduced to 6064. The last search process was applied in January 2023.

In the next stages, duplicates from the databases were removed and, in order to ensure quality, only papers published in A*- or A-ranked journals, or in Q1-ranked journals in the SJR Report, were selected. This process was repeated twice: once for identifying a database of only literature review studies, and once for all other studies (see Figs. 3 and 4). Regarding the selection of review studies, we also looked at the papers with the most citations in the Google Scholar database. We selected these studies by sorting the search list collected by each keyword set (see the last column of Table 2). This stage could not be applied to the process of selecting all studies, as the most cited non-review studies listed in the Google Scholar search were books, chapters, and other non-relevant articles. At the last stage, a content analysis was done to exclude those studies that were not closely associated with our field of interest.

2.4 Analysis of recent relevant review studies

As noted in the research process, we initially aimed to deliberate recent relevant review studies. The purpose of this approach is twofold. First, we are able to overview the latest review approaches and identify the interest and research gaps in the area of techniques involving DS &BDA in SC &L. Second, it helps us summarise coding strategies to develop a conceptual framework for classifying the literature.

Following our review protocol, the word "review" was added to the previous keywords, and the search process was repeated. This was done to extract only literature review studies from the shortlisted databases; no thorough analysis of the content of these studies was done at this time. This reduced the number of potential studies to 459, amongst which we found 317 duplicates. Furthermore, the focus on A*- and A-ranked journals reduced the number of papers to 18. We then investigated the relevance of the remaining papers to our field of interest: after carefully reading the abstract and introduction sections, papers not strongly associated with the subject or field of this research were removed. Finally, 16 potential review studies were selected at this stage for a full-text analysis. Figure 3 illustrates the selection procedure for these articles. To ensure the comprehensiveness of the analysis, we also selected the most cited review studies listed in the Google Scholar database. The keyword sets, as well as the search and filtering process, were applied as noted in the review protocol, and we found three more relevant review papers at this stage (see the second stage of the process in Fig. 3). It is worth mentioning that some survey studies that focus only on BDA and SCs, without relevance to DS and logistics (e.g. Xu et al. (2021a)), were removed from our list. Finally, 23 papers were selected from our two stages after full-text filtering and content analysis.

Figure 3. Research selection process regarding the literature review papers

3 Content analysis and framework development

3.1 Lessons from the review studies

To answer the second research question (RQ2) of our study, outlined in Sect.  2.1 , we analysed the content of the 23 selected review articles. Table  3 summarises these latest review articles. We categorise the lessons gained from this content analysis step in the following subsections.

3.1.1 Review methodologies

Investigating the publication venues of the selected survey studies identified the top journals listed in Table 3. These journals are mostly among A*/A/Q1-ranked journals, which confirmed that restricting our selection to highly ranked journals was a sound strategy for limiting the selected documents. Moreover, by looking over the search engines used by the survey studies (see the column Search Engines in Table 3), we confirmed the suitability of the main databases used in our material selection process.

In Table 3, the column Type of Review refers to the research methodology employed for reviewing the selected studies. Of the 23 review papers, eight did not utilise any type of "systematic" (SR) or "bibliometric" (BIB) method; after investigating their research methodology in more detail, we categorised these articles as "others" (ORS), meaning that they did not utilise an organised research methodology for their review process. Three articles reviewed the literature bibliometrically (Pournader et al., 2021; Mishra et al., 2018; Iftikhar et al., 2022b), and one article (Arunachalam et al., 2018) chose both methods (SR and BIB). A systematic approach is also claimed for some BIB-based reviews (see Pournader et al. (2021)).

3.1.2 Gaps identification in research topics

Although the authors asserted a holistic view in their research process, they mostly investigated the SC from an operations viewpoint, i.e., production, logistics, inventory management, transportation, and demand planning (see the coding and classification in these studies: Nguyen et al. ( 2018 ); Tiwari et al. ( 2018 ); Choi et al. ( 2018 ); Maheshwari et al. ( 2021 )). Moreover, it seems that some aspects of SC operations were overlooked by the researchers (see the column Perspective and Special Features of Neilson et al. ( 2019 ) and Novais et al. ( 2019 )). For instance, some of the production and transportation aspects, such as the “network design”, “facilities capacity”, and “vehicle routing” were not reviewed by Nguyen et al. ( 2018 ) or Maheshwari et al. ( 2021 ), although they follow a comprehensive approach.

Any decision around a SC can be classified at three planning levels, i.e., “strategic”, “tactical”, and “operational” (Stadtler & Kilger, 2002 ; Ivanov et al., 2021b ). DS &BDA can provide useful solutions at each of these planning levels (Nguyen et al., 2018 ). Wang et al. ( 2016a ) focused on the value of SC analytics and the applications of BDA on strategic and operational levels. They acknowledge the importance of BDA for the SC strategies, which in turn, affects the “SC network”, “product design and development”, and “strategic sourcing”. They also note that at the operational level, BDA plays a critical role in the effective performance of analysing and measuring “demand”, “production”, “inventory”, “transportation” and “logistics”. The authors do not utilise the basic definitions and categorisations of the decisions levels that existed in the literature (Stadtler & Kilger, 2002 ).

3.1.3 Gaps identification in DS &BDA techniques

Grover and Kar ( 2017 ) classify BDA into “predictive”, “prescriptive”, “diagnostic”, and “descriptive” categories. Kotu and Deshpande ( 2018 ) note that these classifications can be considered for any research using DS &BDA tools. However, some review studies around BDA only refer to three of these classifications (predictive, prescriptive, and descriptive categories) (Wang et al., 2016a ; Arunachalam et al., 2018 ; Nguyen et al., 2018 ). Nguyen et al. ( 2018 ) classify the applications of BDA in SC &L based on the main three categories and conclude that prescriptive analytics is more controversial than the other two, since the results of this type of analytics are strongly influenced by the descriptive and predictive types.

Considering a broader exploration of logistics for any company, DS &BDA has significant importance in transportation systems for enhancing safety and sustainability. Neilson et al. ( 2019 ) review the applications of BDA from only the logistics perspective. They concede that the applications of BDA in the transportation system can be categorised as sharing traffic information (avoiding traffic congestion), urban planning (developing transportation infrastructures), and analysing accidents (improving traffic safety). Since the authors focus only on transportation systems, they explore the data collection process from urban facilities only such as smartphones, traffic lights, roadside sensors, global positioning systems (GPSs), and vehicles. The authors focus on special data types and formats that are mostly used in urban applications. They also classify the application of BDA in transportation into several categories, including predictive, real-time, historical, visual, video, and image analytics.

The special characteristics of big data, as noted in our research terminologies (see Sect. 1.2), have been researched in certain survey studies. For instance, Addo-Tenkorang and Helo (2016) propose a framework based on the Internet of things (IoT), referred to as "IoT-value adding", and extend five traits for big data: variety, velocity, volume, veracity, and value-adding. IoT is defined as the connectivity and sharing of data between physical things or technical equipment via the Internet (Addo-Tenkorang & Helo, 2016). Studies around BDA also list recent technologies and tools employed for dealing with large data sets. These technologies include but are not limited to cloud computing, IoT, and master database management systems (MDMS), which are associated with the veracity characteristic of big data; additionally, the tools include Apache Hadoop, Apache Spark, and Map-Reduce. In the case of the SC, Chen et al. (2014) concede that big data can be acquired from elements of the SC network, such as suppliers, manufacturers, warehouses, retailers, and customers, which are related to the variety characteristic of big data.

Some researchers, such as Brinch (2018) and Addo-Tenkorang and Helo (2016), consider the value of big data in SC &L. Brinch (2018) introduces a conceptual model for discovering, creating, and capturing value in SC management. Arunachalam et al. (2018) note that assessing the current state of an organisation on BDA will help its managers enhance the company's capabilities. The authors suggest five BDA capabilities dimensions: "data generation", "data integration and management", "data advanced analytics", "data visualisation", and "data-driven culture". The first two capabilities represent the level of data resources, whereas the second two demonstrate the level of analytical resources. The last is the foundational capability and, unlike the others, needs to be institutionalised in any organisation.

Kamble and Gunasekaran (2020) also affirm that the performance measures used in a data-driven SC must be different from those used in a traditional SC. For this purpose, the authors identify two categories of measures for data-driven SC performance monitoring: BDA capability and evaluating processes. BDA tools and platforms are also categorised into five groups according to the type of provided service: "Hadoop", "Grid Gain", "Map Reduce", "High-performance computing cluster (HPCC) systems", and "Storm" (Grover & Kar, 2017; Addo-Tenkorang & Helo, 2016). Each of these tools has different applications in SC &L.

Regarding the statistical techniques indexed in the review studies and their categories, we found the following classifications:

Techniques to measure data correlation (such as statistical regression (Zhang et al., 2019 ) and multivariate statistical analysis (Wesonga & Nabugoomu, 2016 )).

Simulation techniques (Wojtusiak et al., 2012a; Antomarioni et al., 2021).

Optimisation techniques (including heuristic algorithms such as the genetic algorithm (Chi et al., 2007 ) and particle filters (Wang et al., 2018c )).

Machine learning methods (e.g. neural networks (Tsai & Huang, 2017 ), support vector machines (Weiss et al., 2016 )).

Data mining methods (e.g., classification (Merchán & Winkenbach, 2019 ), clustering (Windt & Hütt, 2011 ), regression (Benzidia et al., 2021 )).

These studies note that every technique has its strengths and weaknesses. For instance, statistical methods are fast but not adaptable enough to all problems. These methods cannot be applied to an unstructured and heterogeneous data set, while machine learning techniques are flexible, adaptable, yet time-consuming (Wang et al., 2016a ; Choi et al., 2018 ; Pournader et al., 2021 ). Some studies such as Ameri Sianaki et al. ( 2019 ) investigate the applications of DS &BDA in a specific industry such as healthcare or smart cities. The authors find applications of DS &BDA in healthcare SCs and classify them as “patients monitoring”, “diagnosing disorders”, and “remote surgery”. Each of the mentioned techniques can also be applied in different types/levels of analytics. For example, optimisation is a prescriptive analytic and cannot be predictive. However, simulation techniques can be used in predictive, diagnostic, and prescriptive analytics (Baryannis et al., 2019a ). Therefore, one perspective that can help us define the conceptual analysis of articles is the categorisation of DS &BDA techniques based on different analytical types/levels. This categorisation proposes guidelines for practitioners as well.

These review studies also investigate their selected articles in certain specific domains in SC &L such as SC risk management, in which decision-making is required to be fast, and the data is acquired from multi-dimensional sources. Baryannis et al. ( 2019a ) explore risk and uncertainty in the SC by reviewing the applications of AI in BDA. The authors categorise the methods proposed for SC risk management in two main classes: mathematical programming and network-based models, and find that mathematical programming approaches have received more attention in the literature. Data-driven optimisation (DDO) approaches are other recent and effective approaches used in this area of research (Jiao et al., 2018 ; Gao et al., 2019 ; Zhao & You, 2019 ; Ning & You, 2018 ). The related methods are recognised as a combination of machine learning and mathematical programming methods for making decisions under uncertainty. DDO approaches can be further subdivided into four categories: “stochastic programming”, “robust optimisation”, “chance-constrained programming”, and “scenario-based optimisation” (Ning & You, 2019 ; Nguyen et al., 2021 ). In the DDO approaches, uncertainty is not predetermined, and decisions are made based on real data. Therefore, these are the main differences between data-driven approaches and traditional mathematical approaches. The results of the DDO methods are also less conservative, and consequently, closer to reality. The selection of techniques and tools is very critical because they strongly influence the outputs of analytics.
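To give a flavour of scenario-based, data-driven optimisation, the sketch below chooses an order quantity directly from a sample of demand scenarios (a simple newsvendor-style calculation). The costs, prices, and demand data are invented, and the example is far simpler than the DDO methods cited above.

```python
# Minimal sketch of scenario-based, data-driven optimisation: choosing an order quantity
# directly from observed demand scenarios rather than an assumed distribution.
# Costs, prices, and the demand sample are invented for illustration.
import numpy as np

rng = np.random.default_rng(7)
demand_scenarios = rng.normal(120, 25, size=500).clip(min=0)   # stand-in historical demand

unit_cost, unit_price, salvage = 6.0, 10.0, 2.0

def expected_profit(order_qty: float) -> float:
    sold = np.minimum(order_qty, demand_scenarios)
    leftover = order_qty - sold
    profits = unit_price * sold + salvage * leftover - unit_cost * order_qty
    return profits.mean()           # average over all demand scenarios

candidates = np.arange(0, 251)
best_qty = candidates[np.argmax([expected_profit(q) for q in candidates])]
print("best order quantity:", best_qty)
```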

One of the most widely used tools for managing and integrating data in SC &L is cloud computing (Mourtzis et al., 2021 ; Sokolov et al., 2020 ). With this tool, the data is stored in cyberspace and serviced according to user needs. This technology can play an important role in SC &L. Novais et al. ( 2019 ) explore the role of cloud computing on the chain’s integrity. Some studies (Jiang, 2019 ; Zhu, 2018 ; Zhong et al., 2016 ) show that the impact of cloud computing on the integration of the SC (financially or commercially) is positive. This technology helps improve the integration of information, financial and physical flows in the SC via information sharing between the SC members, optimising payment and cash processes among partners, and controlling inventory levels and costs. In the case of information sharing, we also found the fuzzy model developed by Ming et al. ( 2021 ) as a valuable method considering BDA concerns.

3.2 Material collection

By looking at the keywords, we found that two combinations, i.e., "data mining" AND "logistics" and "machine learning" AND "logistics", yielded the most search results. On closer review of some of the articles, many matches were due not to the word "logistics" itself but to the phrase "logistics regression", a common methodology in data mining rather than the transportation field. Therefore, the keyword "logistics regression" was excluded from the list via "AND NOT". The selection process resulted in 6681 potential papers. In the next step, we excluded irrelevant papers by overviewing the abstracts and keywords. Articles related to "conceptual analysis", "resource dependency theory", "the importance of BDA", "management capabilities", and "the role or application of the BDA in the SC" were identified as unrelated articles. We also removed review papers, as we had already explored them separately in the previous step. These filtering criteria reduced the number of papers to 2583 (see Table 4). Figure 4 illustrates the selection procedure in detail. After removing duplicates and filtering for highly ranked journals, 1167 studies remained. In the next step, we scanned the papers' abstracts (and in some cases, the full text) to examine the relevance of the paper to our domain. Since we aimed to limit our selection to research employing any DS &BDA models/techniques, we removed several papers that were using only conceptual models. A total of 364 articles were finally selected to go through the full text analysis and coding step.

figure 4

Research selection process regarding all relevant papers excluding review papers

3.3 Conceptual framework of DS &BDA in SC &L

Considering all the insights gained from the previous review studies, we propose a conceptual framework of our study encompassing two perspectives: (1) SC &L research problems/topics and (2) DS &BDA main approaches. This structure can help practitioners apply DS &BDA approaches for creating a competitive advantage. According to our research process outlined in Fig.  2 , we revised the list of each category with the help of a recursive process and gathered feedback from the full content analysis of the selected studies.

Figure  5 illustrates our proposed categorisation from the SC &L research problems/topics viewpoint in a framework. We classify SC operational processes (i.e., procurement, production, distribution, logistics, and sales) into three hierarchical levels of decision-making used in SC &L companies (Stadtler & Kilger, 2002 ). In the first operational process, we highlight decisions about procurement planning, which includes concerns about raw materials and suppliers (Cui et al., 2022 ). Production planning organises the products’ design and development (Ialongo et al., 2022 ). These issues coordinate suppliers’ and customers’ requirements. Distribution planning influences production and transportation decisions. Logistics or transportation planning deals with methodologies related to delivering products to end-customers or retailers. Sales planning is related to trades in business markets. We also consider SC design as a strategic decision and classify the studies in resilient, sustainable, and closed loop and reverse logistics categories.

Figure 5. Conceptual framework of reviewing the selected studies from the SC &L research problems/topics viewpoint

Figure  6 demonstrates our conceptual framework proposed for the classifications of the DS &BDA main approaches. DS &BDA algorithms/techniques for SC &L are categorised based on this proposed framework.

Figure 6. Conceptual framework of reviewing the selected studies from the viewpoint of the employed DS &BDA main approaches

All DS &BDA approaches, shown in Fig.  6 , are applied to each topic listed in Fig.  5 . In the next section, we explore our 364 selected articles in more detail with respect to each of these categories.

4 Context analysis of results

Responding to the third research question (RQ3), we initially visualise the research sources in the scope of a yearly distribution, publication venues, and analytics types. In the coding process, we review the context of the selected papers precisely and classify them based on the proposed framework. In this step, we explore the main contributions of the selected papers. It is worth mentioning that with the recursive process, we complete the proposed framework so as to cover all topics (the final version of the framework is delineated in Figs.  5 and 6 ). Consequently, we evaluate and synthesise the selected studies at the end of this section.

4.1 Data visualisation and descriptive analysis

4.1.1 Distribution of papers per year and publications

To identify the journals with the highest number of contributions, and to provide an overview of the research trends, we classify all selected papers based on publication venue and year (see Fig. 7). Figure 7a depicts the distribution of published papers between 2005 and August 2022. It can be observed that before 2005, the domain of DS &BDA was not investigated, and the contribution remained insignificant until 2012. In fact, before 2012, the concept of DS &BDA was considered as data mining or BI (Arunachalam et al., 2018). The Google trend of interest regarding DS &BDA topics, depicted in Fig. 1, also confirms this pattern and the rise of DS &BDA after 2012. The publication trend also shows that the applications of DS &BDA in SC &L have attracted the attention of many researchers in the past four years. As the chart shows, the number of papers published in the last five years is approximately double the total of those published in all previous years. The number of studies appears to have declined since 2020, which is expected given the specifics of the COVID-19 period.

Figure 7. Distribution of the selected papers per year and for the top ten journal venues

The number of publications in the top ten journals is illustrated in Fig. 7b. Overall, we found 157 different journal venues for our 364 selected studies in the domain of DS &BDA, with most of them in the "information system management" and "transportation" 2020 SJR subject classifications. It is noticeable that a significant proportion of the studies (over 45%) have been published in high-impact journals such as CIE, ESA, IJPR, JCP, and IJPE. It is also worth mentioning that the ESA journal has recently had the most publications in the field of DS &BDA applications in SC &L; ESA is an open access journal whose focus is on intelligent systems applied in industry. The CIE and IJPR journals are both in second place and concentrate mostly on SC &L rather than information systems. The other journals in Fig. 7b are among the most popular journals in the field of SC &L.

4.1.2 Types of analytics approaches

The analytics type for each selected study needs to be further investigated. According to the classification introduced by Grover and Kar (2017) and Arunachalam et al. (2018), four types of analytics can be defined: descriptive, diagnostic, predictive, and prescriptive. Due to the extremely limited number of studies classified as diagnostic analytics (7 out of 364 publications in our data set), this area was excluded from our classification, similar to the survey study of Nguyen et al. (2018). A classification in each field of analytics is conducted based on the applied models and common techniques of analysis, as outlined in Table 5 (see also Wang et al. (2016a); Grover and Kar (2017); Nguyen et al. (2018) for a description of these analytics types). The simulation approach is listed under both predictive and prescriptive analytics (Viet et al., 2020; Wojtusiak et al., 2012b; Wang et al., 2018c).

Figure  8 shows the annual distribution of analytics types over time. Predictive analytics methodologies have become more popular in 2019–2022. 45% of the articles have followed a predictive approach in their proposed solution, which is the highest proportion compared to the other types of analytics. This is justified by the development of analytical tools and the ability to access dynamic data in addition to historical data (Arunachalam et al., 2018 ).

Figure 8. Annual distribution of the selected studies with respect to analytics type

Figure 9. Distribution of the articles by DS &BDA approaches

Additionally, we analyse the distribution of approaches used in the articles (see Table  5 ). Figure  9 a–c show the distribution of the main approaches employed in the selected studies regarding each type of analytics. Among all predictive approaches, we found that neural network is the most favourable technique, employed in 19% of the selected papers in the various main approaches of DS &BDA, such as forecasting, classification, and clustering. Moreover, among the main approaches and algorithms, the graph visualisation techniques are the most employed methods in the field of this survey (29% of the selected papers used this technique).

Ensemble learning is the process by which several algorithms/techniques (including forecasting or classification techniques) are strategically combined to solve a particular DS &BDA problem. Regarding the selection of appropriate techniques, ensemble learning can be employed to help reduce the probability of an unlucky selection of a poor technique and can improve performance of the whole model (Zhu et al., 2019b ; Hosseini & Al Khaled, 2019 ). Deep learning is an evolution of machine learning that uses a programmable neural network technique and can be employed for forecasting, classification, or any predictive model (Bao et al., 2019 ; Pournader et al., 2021 ; Rolf et al., 2022 ).
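The sketch below shows one very simple form of ensemble learning, averaging the predictions of two different regressors on invented data; the cited studies use more elaborate combination schemes, and nothing here corresponds to a specific paper.

```python
# Minimal sketch of ensemble learning: averaging two different regressors
# on invented data (simple model averaging, one of many ensemble schemes).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(800, 3))
y = 5 * X[:, 0] + 10 * np.sin(X[:, 1]) + rng.normal(0, 2, 800)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

gbr = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
ridge = Ridge().fit(X_train, y_train)

ensemble_pred = (gbr.predict(X_test) + ridge.predict(X_test)) / 2   # average the two models
for name, pred in [("gbr", gbr.predict(X_test)), ("ridge", ridge.predict(X_test)),
                   ("ensemble", ensemble_pred)]:
    print(name, round(mean_absolute_error(y_test, pred), 3))
```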

4.1.3 Methodological perspectives

Descriptive analysis is adopted in approximately 33% of the examined literature. These articles have commonly used clustering, association, visualisation, and descriptive approaches in DS &BDA (see Fig. 9a). The use of these approaches has been generally increasing, especially visualisation, which has received much attention in the last four years. Data visualisation is a beneficial tool for SC &L in different areas. Graph and OLAP techniques have been the methods mainly used in data visualisation, because visualisation approaches are able to depict a portion of the research problem and are applicable to all areas of SC &L. In the clustering approach, there are a variety of techniques and algorithms. K-means clustering is the most discussed technique, used in analysing energy logistics vehicles (Mao et al., 2020), traffic accidents (Kuvvetli & Firuzan, 2019), traffic flows (Bhattacharya et al., 2014), pricing (Hogenboom et al., 2015), and routing (Ehmke et al., 2012). The third most commonly used approach in descriptive analytics is the association approach, which measures the association between two variables. The Apriori algorithm is the most popular association algorithm and has been applied to various issues, including transportation risk (Yang, 2020), demand forecasting (Kilimci et al., 2019), quality management (Wang & Yue, 2017), vehicle routing (Ting et al., 2014), research and development (R &D) (Liao et al., 2008b), and customer feedback (Singh et al., 2018a).
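As a small illustration of the clustering approaches mentioned above, the sketch below applies k-means to invented delivery coordinates, the kind of grouping used in routing and logistics zoning studies; the locations and cluster count are made up for the example.

```python
# Minimal sketch of descriptive clustering: grouping (invented) delivery coordinates
# with k-means, the kind of analysis used for routing and logistics zoning.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(11)
# Three artificial delivery hotspots around different city areas.
locations = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(50, 2)),
    rng.normal([5.0, 1.0], 0.4, size=(50, 2)),
    rng.normal([2.5, 4.0], 0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(locations)
print("cluster centres:\n", kmeans.cluster_centers_.round(2))
print("first 10 labels:", kmeans.labels_[:10])
```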

In the predictive analytics type, the classification approach is very popular (see Fig.  9 b). The most common algorithms used in the classification approach are SVM (20%), decision trees (19%), logistic regression (19%), and neural networks (11%). This approach is usually applied in decisions corresponding to demand forecasting (Nikolopoulos et al., 2021 ; Yu et al., 2019b ; Gružauskas et al., 2019 ; Zhu et al., 2019a ), quality management (Bahaghighat et al., 2019 ), customer churn (Coussement et al., 2017 ), delivery planning (Proto et al., 2020 ; Wang et al., 2020 ; Praet & Martens, 2020 ), and routing (Spoel et al., 2017 ). Next, regression techniques have received high attention. Both linear regression models (37%) and SVR (24%) are the most commonly used techniques in regression models. These regression models are mainly applicable to logistics decisions such as traffic accidents (Farid et al., 2019 ; Wang et al., 2016b ), vehicle delays (Eltoukhy et al., 2019 ), delivery planning (Ghasri et al., 2016 ; Merchán & Winkenbach, 2019 ), and sales decisions such as demand forecasting (Nikolopoulos et al., 2021 , 2016 ) and sales forecasting (Lau et al., 2018 ).

The neural network is an important and common technique for forecasting and can be applied to a wide variety of problems such as supplier selection (Pramanik et al., 2020) and demand or sales forecasting (Verstraete et al., 2019). Time series modelling is the fourth predictive approach. ARIMA (34%), exponential smoothing (17%), and moving averages (18%) are the most popular techniques for DS &BDA time series modelling. These techniques are usually applied for demand forecasting (Kilimci et al., 2019; Huber et al., 2017). In this survey, we find that ARENA and AnyLogic simulation software are used more than others for shop floor control simulations (Yang et al., 2013), machine scheduling (Heger et al., 2016), and routing (Ehmke et al., 2016). Text mining is a useful approach for understanding the feelings and opinions of customers or people. In the examined papers, this method has been used in only 8 articles, in the fields of customer feedback (Hao et al., 2021), sales forecasting (Cui et al., 2018), and SC mapping (Wichmann et al., 2020).
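For readers unfamiliar with the simpler time-series techniques named above, the following sketch computes a moving-average and a simple exponential smoothing forecast on an invented monthly demand series; ARIMA models add autoregressive and differencing terms on top of such building blocks.

```python
# Minimal sketch of two simple time-series forecasts on an invented monthly demand series:
# a moving average and simple exponential smoothing.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
demand = pd.Series(100 + np.arange(36) * 1.5 + rng.normal(0, 6, 36))  # invented series

moving_avg_forecast = demand.rolling(window=3).mean().iloc[-1]   # average of last 3 months

def simple_exponential_smoothing(series: pd.Series, alpha: float = 0.3) -> float:
    """Return the one-step-ahead level after smoothing the whole series."""
    level = series.iloc[0]
    for value in series.iloc[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

print("moving average forecast:", round(moving_avg_forecast, 1))
print("exponential smoothing forecast:", round(simple_exponential_smoothing(demand), 1))
```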

Prescriptive analytics has the lowest number of contributions, compared to the other types of analytics. The optimisation models, simulations, and multi-criteria decision-making are the main approaches of the prescriptive analytics type. Among them, optimisation has the most contributions (78% out of the prescriptive analytics studies). The optimisation techniques are most often used to optimise the facility location (Doolun et al., 2018 ), location of distribution centres (Wang et al., 2018a ), type of technology (shen How & Lam, 2018 ), capacity planning (Ning & You, 2018 ), number of facilities (Tayal & Singh, 2018 ), inventory management (Çimen & Kirkbride, 2017 ), and vehicle routing (Mokhtarinejad et al., 2015 ). In addition to optimisation, MCDM approaches are also used in decision-making. This approach is classified into two main technique categories (MADM and MODM) and applied to customer credit risk (Lyu & Zhao, 2019 ), supplier selection (Maghsoodi et al., 2018 ), inventory management (Kartal et al., 2016 ), and SC resilience (Belhadi et al., 2022 ).
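The sketch below illustrates a basic MADM-style weighted-sum ranking for supplier selection; the suppliers, criteria, scores, and weights are all invented, and the cited studies typically use richer methods.

```python
# Minimal sketch of a MADM-style weighted-sum ranking for supplier selection.
# Suppliers, criteria, scores, and weights are all invented for illustration.
import numpy as np

suppliers = ["Supplier A", "Supplier B", "Supplier C"]
criteria = ["cost", "quality", "delivery reliability"]   # cost is a 'lower is better' criterion
scores = np.array([
    [8.0, 7.0, 9.0],
    [6.0, 9.0, 7.0],
    [7.0, 8.0, 8.0],
])
weights = np.array([0.4, 0.35, 0.25])

# Normalise each criterion column to [0, 1]; invert the cost column so higher is better.
norm = (scores - scores.min(axis=0)) / (scores.max(axis=0) - scores.min(axis=0))
norm[:, 0] = 1 - norm[:, 0]

ranking = norm @ weights
for name, value in sorted(zip(suppliers, ranking), key=lambda p: -p[1]):
    print(f"{name}: {value:.2f}")
```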

4.1.4 Technique verification strategies

In order to solve an SC &L problem, a suitable algorithm/technique must be selected and then evaluated on a proper data set. Figure 10 shows the percentages of the applied verification strategies. In the examined literature, researchers mainly employ a case study strategy with real data to verify their selected approaches and models (Antomarioni et al., 2021; Nuss et al., 2019). However, a few others have used a data generation strategy (i.e., synthetic data), which is mainly seen in simulation techniques (Kang et al., 2019). Hence, almost all algorithms/techniques require real data to be verified (Choi et al., 2018).

Figure 10. Distribution of research verification strategies

4.1.5 Comparison with previous survey studies

We compare our results with the recent survey studies listed in Table  3 . The comparison of top journals demonstrates that our unbiased approach in finding studies includes more relevant journals focusing on information systems (e.g., ESA and IEEEA journals). For instance, in the survey by Nguyen et al. ( 2021 ), all top journals are listed among SC &L-focused journals (IJPE, TRC, IJPR, and ICE). This survey has employed only “data-driven” or “data-based” keywords that do not cover all aspects of data science or data analytics applications (e.g. machine learning, deep learning, big data, etc). The authors only use previous survey studies to find all keywords related to SC &L.

Comparison of the "Search Engines" column in Table 3 demonstrates that most of the previous surveys use a single database (mostly Scopus) for their search process and do not cross-check the process against other databases. Our systematic process yielded many duplicates (see Tables 1 and 2), but by handling these duplicates we arrived at a cleaner and more accurate data set.

4.2 Classification of studies based on the conceptual framework

As depicted in Fig. 5, SC &L comprises five internal processes: procurement, production, distribution, logistics, and sales. In each process, a hierarchical triple planning structure is required: (1) long-term planning or strategic decision-making over a multi-year scheduling horizon, (2) mid-term planning or tactical decision-making over a seasonal or maximum one-year scheduling horizon, and (3) short-term planning or operational decision-making, which has a planning period ranging from a few days up to one season (Sugrue & Adriaens, 2021).

An overview of the processes shows that the logistics process has received more interest, especially during the last two years (128 papers, 31% of the corpus). Sales is another frequently studied field in applying DS &BDA (83 papers, 20% of the examined literature). Figure  11 illustrates the distribution of studies by each decision level. In the procurement process, supplier selection (27 papers) and order allocation (17 papers) are the most discussed. The results of our investigations indicate that long-term decisions such as the plant location (Doolun et al., 2018 ), type of technologies (Vondra et al., 2019 ), and R &D (Liao et al., 2009 ) have made considerable contributions to improve production decisions. For example, an inappropriate network design incurs high costs (Song & Kusiak, 2009 ). The two key aspects at the mid-term production decision level in DS &BDA are master production scheduling (determining production quantities at each period) and quality management (Masna et al., 2019 ). Shop floor control has been of interest to researchers in short-term production planning (Yang et al., 2013 ). The results further show that in distribution tactical decisions, most of the papers discuss inventory management (Sachs, 2015 ), while in this process at the long- and short-term decision levels, the issues of distribution centre location (Wang et al., 2018c ), warehouse replenishment (Priore et al., 2019 ), and order picking decisions (Mirzaei et al., 2021 ) have been attractive to researchers.

Fig. 11 Distribution of the selected studies with respect to SC &L processes and planning levels

Logistics decisions have mainly been studied at the short-term level, i.e., vehicle routing (Tsolakis et al., 2021) and delivery planning (Vieira et al., 2019a). Unlike in the other processes, most articles address operational decisions rather than the other planning levels (59% of logistics planning studies). Next, mid-term transportation planning decisions, including material flow rate issues (Wu et al., 2019), have been investigated most frequently. In the sales process, decisions and issues are mainly planned at the mid-term level: customer demand forecasting (Yu et al., 2019b), pricing (Liu, 2019), and sales forecasting (Villegas & Pedregal, 2019) are the three most commonly studied issues in this process.

Overall, at the long-term planning level, most articles contribute to production (35 papers) and procurement decisions (30 papers). At the mid-term decision level, owing to attractive issues such as demand forecasting, the sales process has been the most investigated area (70 papers), followed by the logistics process (40 papers) with contributions involving transportation planning issues. At the short-term level, the logistics process is at the forefront (76 papers); vehicle routing (Yao et al., 2019), delivery planning (Vieira et al., 2019a), financing risk (Ying et al., 2021), and transportation risk (Zhao et al., 2017) have been addressed most frequently. The distribution process ranks second, with far fewer contributions (20 papers). Material ordering (Vieira et al., 2019b) and customer feedback (Singh et al., 2018a) have the lowest contributions in applying DS &BDA among the processes at the short-term decision level.

4.2.1 Long-term decisions in SC &L

Long-term procurement decisions deal with supplier selection (Hosseini & Al Khaled, 2019) and supplier performance (Chen et al., 2012). In the production and distribution processes, these decisions concern the network design of factories and distribution centres, such as the location, number, and types of facilities and centre capacity (Mishra & Singh, 2020; Flores & Villalobos, 2020; Mohseni & Pishvaee, 2020), whereas strategic decisions in the logistics process comprise planning with respect to transportation system infrastructure, carrier selection, and capacity design (Lamba & Singh, 2019; Lamba et al., 2019). Long-term sales decisions include customer service level determination, strategic sales planning, and customer targeting by sales category.

We also consider the SC design decisions in this category, including resilient SC (Brintrup et al., 2020 ; Belhadi et al., 2022 ; Mungo et al., 2023 ; Mishra & Singh, 2022 ; Hägele et al., 2023 ), sustainable SC (Bag et al., 2022b ), closed-loop (Govindan & Gholizadeh, 2021 ) and reverse logistics (Shokouhyar et al., 2022 ). A more complete categorisation of the related articles is summarised in Table  6 .

4.2.2 Efficiency, sustainability, and resilience paradigms

The COVID-19 pandemic has clearly shown the importance of resilient SC designs (Rozhkov et al., 2022). SC resilience refers to the capability to absorb or even avoid disruptions (Ivanov, 2021a; Kosasih & Brintrup, 2021; Yang & Peng, 2023). Belhadi et al. (2022) contend that AI techniques provide capable solutions for designing and upgrading more resilient SCs. Zhao and You (2019) develop a resilient SC design by employing a data-driven robust optimisation approach and demonstrate how DS &BDA concepts should be considered in SC models.

SC sustainability refers to the consideration of environmental, societal, and human-centric aspects in SC decisions (Li et al., 2021b; Homayouni et al., 2021; Sun et al., 2020; Li et al., 2020a). Mishra and Singh (2020) develop a sustainable reverse logistics model based on realistic parameters and affirm that all three aspects of sustainability can be covered by BDA. Tsolakis et al. (2021) conduct a comprehensive literature review of AI-driven sustainability and conclude that AI and optimisation are the most essential techniques for modelling SCs.

A closed-loop SC employs reverse logistics to feed re-manufactured products back into the forward logistics process. Jiao et al. (2018) develop a data-driven optimisation model that integrates sustainability features in a closed-loop SC. Shokouhyar et al. (2022) employ social media data to model customer-centric reverse logistics, with an emphasis on BDA approaches for designing reverse logistics SCs.

4.2.3 Mid-term decisions in SC &L

The categorisation of the selected papers at the tactical decision level is outlined in Table 7. Decisions regarding the allocation of orders to suppliers, such as order quantity planning and lot sizing (Lamba & Singh, 2019), together with supply risk management (Baryannis et al., 2019a), raw material quality management (Bouzembrak & Marvin, 2019), material requirement planning (Zhao & You, 2019), material cost management (Ou et al., 2016), and demand forecasting (Stip & Van Houtum, 2020), are all considered mid-term procurement decisions.

The main tasks in mid-term production planning are master production scheduling (Flores & Villalobos, 2020 ), capacity planning (Sugrue & Adriaens, 2021 ), quality management (Ou et al., 2016 ), and demand forecasting (Dombi et al., 2018 ), while inventory management decisions (Ning & You, 2018 ), capacity planning (Oh & Jeong, 2019 ), in-stock product quality management issues (Ou et al., 2016 ), and warehouse demand forecasting (Zhou et al., 2022b ) are among the tactical distribution decisions.

Some of the main mid-term logistics decisions are transportation planning (Wu et al., 2020; Gao et al., 2019), service quality management (Gürbüz et al., 2019; Molka-Danielsen et al., 2018), transportation mode selection (Jula & Leachman, 2011), and demand forecasting (Potočnik et al., 2019; Boutselis & McNaught, 2019). Demand forecasting (Lee et al., 2011; Shukla et al., 2012), demand shaping (e.g., marketing) (Aguilar-Palacios et al., 2019; Liao et al., 2009), sales forecasting (Wong & Guo, 2010), pricing (Hogenboom et al., 2015), consumer behaviour (e.g., purchasing patterns) (Bodendorf et al., 2022b; Garcia et al., 2019), and customer churn (Coussement et al., 2017) are addressed among the tactical sales decisions.

4.2.4 Short-term decisions in SC &L

Short-term procurement planning includes material ordering (Vieira et al., 2019b). Operational production decisions include machine scheduling (Yue et al., 2021), shop floor control (such as preventive maintenance scheduling (Celik & Son, 2012) and material flows (Zhong et al., 2015)), and decisions regarding production batch size (Sadic et al., 2018). In the area of distribution, planning associated with packaging (Kim, 2018), warehouse replenishment (Taube & Minner, 2018), order picking (Mirzaei et al., 2021), and inventory turnover (Zhang et al., 2019) falls under short-term decisions. A variety of operational decisions can be made at the logistics stage, including delivery planning (Praet & Martens, 2020), vehicle delay management (Kim et al., 2017), route planning (Liu et al., 2019), and transportation risk management (Wu et al., 2017).

At this level of decision-making, owing to the wide variety of decisions, we consider more categories than at the other levels. For example, we treat vehicle delivery planning (Vieira et al., 2019b) and vehicle routing (Yao et al., 2019) as two separate categories. Conversely, to reduce the number of categories, we aggregate crash risk (Bao et al., 2019), traffic safety (Arbabzadeh & Jafari, 2017), and fraud detection decisions (Triepels et al., 2018) into the transport risk management category. Table 8 shows the results of reviewing the short-term decisions.

5 Identification of research gaps

To answer the fourth research question (RQ4), we evaluate the selected studies in detail to identify existing gaps in the literature on using DS &BDA approaches in SC &L. We categorise our findings in the following sub-sections.

5.1 Data-driven optimisation

DDO has received considerable attention. In our study, we aimed to identify related techniques by adding the term “data-driven” to our keyword set (see the preliminary search results for DDO in Table 2). DDO is a mathematical programming method that combines uncertainty approaches for optimisation with machine learning algorithms. The objective functions are often cost-related (Alhameli et al., 2021; Baryannis et al., 2019a). Ning and You (2019) divided DDO into four modelling methods: stochastic programming, chance-constrained programming, robust optimisation, and scenario-based optimisation. In the SC &L area, some of the problem parameters may be considered uncertain, such as customer demand (Medina-González et al., 2018; Taube & Minner, 2018), production capacity (Jiao et al., 2018), and delivery time (Lee & Chien, 2014). In comparison with traditional optimisation models under uncertainty, which assume perfect information about the parameters, DDO approaches employ information about the random variables as direct inputs to the proposed programming problems.

In the examined material, 21 papers studied optimisation under uncertainty. Stochastic programming methods (e.g., MILP and MINLP formulations) were the most frequently applied (e.g., Flores and Villalobos (2020); Taube and Minner (2018)). Chance-constrained programming is an optimisation method in which the constraints must be satisfied with a specified probability; it has practical applicability in SC &L (Jiao et al., 2018). In robust optimisation, the uncertainty sets (the sets within which the uncertain parameters may lie) must be specified, and in a data-driven setting these sets can be constructed from the data. Since SC &L problems mostly involve uncertain data (Gao et al., 2019), this method seems particularly well suited to addressing uncertainty in the SC &L area. In scenario-based optimisation, uncertainty scenarios are used to find an optimal solution. None of our selected studies used this method. It appears to hold research potential, provided that the scenarios are created from a data set; scenario-based DDO methods are applied especially in risk management (Baryannis et al., 2019a).
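
To illustrate the scenario-based idea in a data-driven setting, the sketch below applies sample average approximation to a single-product ordering (newsvendor-style) decision, treating each historical demand observation as an equally likely scenario. The cost parameters, the Poisson demand history, and the problem itself are illustrative assumptions of ours rather than a model taken from the reviewed literature.

```python
import numpy as np

# Illustrative (assumed) economics for a single-product ordering decision.
UNIT_COST, UNIT_PRICE, SALVAGE = 6.0, 10.0, 2.0

def scenario_based_order_quantity(demand_samples):
    """Sample average approximation: treat each historical demand observation as an
    equally likely scenario and pick the order quantity maximising average profit."""
    candidates = np.unique(demand_samples)  # an optimal SAA quantity lies on a sample point
    def avg_profit(q):
        sold = np.minimum(demand_samples, q)
        return np.mean(UNIT_PRICE * sold + SALVAGE * (q - sold) - UNIT_COST * q)
    return max(candidates, key=avg_profit)

# Synthetic demand history standing in for real SC data.
rng = np.random.default_rng(0)
demand_history = rng.poisson(lam=120, size=500)
print("Scenario-based (SAA) order quantity:", scenario_based_order_quantity(demand_history))
```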

Considering that BDA applications in SC &L are still under development, employing BDA techniques (e.g., cloud computing or parallel computing) or tools (e.g., Hadoop, Spark, or MapReduce solutions) can be considered an important future direction for using DDO methods in decision-making (Ning & You, 2019). Big data-driven optimisation (BDDO) methods, which combine methods for dealing with big data with techniques employing DDO, could be of interest for solving several problems in SC &L.

5.2 SC &L processes and decision levels

The framework used in our study revealed the contributions of DS &BDA across SC &L processes. The material evaluation from the SC &L process point of view shows that the distribution and procurement processes are discussed less often at all three hierarchical levels of decision-making in SC &L. Since the SC is a set of hierarchical processes in which decisions at each level are influenced by those at other levels and processes (Stadtler & Kilger, 2002), more attention should be given to distribution and procurement decisions, especially at the strategic level.

In the process of procurement, most of the studies focused on mid-term decisions (such as order allocation decisions (Kaur & Singh, 2018 ), supplier risk management (Brintrup et al., 2018 ), MRP (Zhao & You, 2019 ), and so forth), while short-term decisions in this process (e.g., ordering materials (Vieira et al., 2019b )) have received the least amount of attention.

Short-term decisions in the production process, such as lot sizing (Gao et al., 2019) and machine scheduling (Simkoff & Baldea, 2019), have received less attention compared with strategic decisions. In the distribution process, warehouse capacity planning (Oh & Jeong, 2019) and inventory turnover (Zhang et al., 2019) decisions have been largely overlooked, representing a visible research gap. In the logistics process, capacity design (Gao et al., 2019) requires more attention, while shipment size planning (Piendl et al., 2019) has been identified as one of the most important decisions. In the sales process, customer feedback (Hao et al., 2021) is crucial for determining organisational strategies; however, this field has not received enough attention so far.

5.3 DS &BDA approaches, techniques, and tools

Our results demonstrate that a wide range of models and techniques can be used in the SC &L area. Nevertheless, some techniques are employed less frequently. For instance, OLAP is a powerful technique behind many BI software solutions, but it is rarely employed in the reviewed models, even though OLAP is designed for processing multidimensional data or data collected from different databases, which are routine issues in the SC &L area. As another example, in data clustering, techniques such as fuzzy k-modes, k-medoids, and fuzzy c-means are used less often in the reviewed articles. For instance, Kuvvetli and Firuzan (2019) apply k-means clustering to classify the number of traffic accidents in urban public transportation; however, the model is not benchmarked against other clustering techniques such as k-medoids or fuzzy c-means to confirm that the selected clustering technique is more accurate or efficient than the alternatives.
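
A minimal benchmarking sketch of the kind we suggest is shown below: two readily available clustering techniques are compared on the same data via the silhouette score. The synthetic data, the feature count, the cluster number, and the substitution of agglomerative clustering for k-medoids or fuzzy c-means (which would require additional packages) are all our own assumptions.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for, e.g., accident or shipment records with four numeric features.
X, _ = make_blobs(n_samples=300, centers=4, n_features=4, random_state=42)
X = StandardScaler().fit_transform(X)

# Benchmark two clustering techniques on the same data rather than trusting a single one.
for name, model in [
    ("k-means", KMeans(n_clusters=4, n_init=10, random_state=0)),
    ("agglomerative", AgglomerativeClustering(n_clusters=4)),
]:
    labels = model.fit_predict(X)
    print(f"{name}: silhouette = {silhouette_score(X, labels):.3f}")
```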

Our study of the types of analytics indicates that the predictive analytics approach has attracted the most attention. Nevertheless, this approach has its own challenges: executing predictive analytics techniques is time-consuming and requires iterative stages of testing, adaptation, and evaluation of results (Arunachalam et al., 2018). The majority of the studies have not discussed these challenges. Machine learning is one of the most efficient AI methods for analysing and learning from data. Some articles use machine learning methods but refer to them only under the umbrella term “AI”, which covers a much wider range of techniques (Li et al., 2021a).

Among the machine learning techniques used in the examined literature, deep learning (Punia et al., 2020; Kilimci et al., 2019) and ensemble learning (Zhu et al., 2019b) have received very limited attention, even though these techniques can substantially improve prediction accuracy (Baryannis et al., 2019a). Moreover, transfer learning and reinforcement learning have not been employed in the examined literature; these methods complement and enhance neural network and deep learning techniques.
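
As a concrete, non-authoritative illustration of the ensemble-learning direction, the sketch below compares a linear baseline with two ensemble learners on a synthetic weekly-demand series using lagged features. The data-generating process, the number of lags, the train/test split, and the model choices are illustrative assumptions, not a method drawn from the reviewed studies.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic weekly demand with trend and yearly seasonality, standing in for real SC data.
rng = np.random.default_rng(1)
t = np.arange(260)
demand = 200 + 0.5 * t + 30 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 10, t.size)

# Build lag features so each model predicts next-week demand from the previous four weeks.
lags = 4
X = np.column_stack([demand[i:len(demand) - lags + i] for i in range(lags)])
y = demand[lags:]
X_train, X_test, y_train, y_test = X[:-52], X[-52:], y[:-52], y[-52:]

# Compare a simple baseline with two ensemble learners on a held-out final year.
for name, model in [
    ("linear baseline", LinearRegression()),
    ("random forest", RandomForestRegressor(n_estimators=200, random_state=0)),
    ("gradient boosting", GradientBoostingRegressor(random_state=0)),
]:
    pred = model.fit(X_train, y_train).predict(X_test)
    print(f"{name}: MAE = {mean_absolute_error(y_test, pred):.1f}")
```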

5.4 Big data analytics (BDA)

Although some scholars have argued in favour of BDA approaches, they have not fully addressed BDA challenges such as data generation, data integration, and BDA techniques (Arunachalam et al., 2018; Novais et al., 2019). Among the 227 examined articles, 107 used the buzzword “Big Data” in their publications, but only a few (we found 13 articles) focused on big data characteristics, techniques, and architectures. We therefore suppose that the rest probably used large data sets, but not necessarily big data. Considering the special characteristics of big data, studies on BDA should unequivocally and practically refer to big data techniques (Chen et al., 2014; Grover & Kar, 2017; Arunachalam et al., 2018; Brinch, 2018).

Since big data in SC &L can be generated from various SC processes and from different data collection resources (such as GPS, sensors, and multimedia tools), extracting knowledge from various types of data is another concern in BDA. The diversity of the data is anticipated to increase in the future (Baryannis et al., 2019b); thus, integration in data analysis is an important debate in BDA, and it is expected that researchers will increasingly focus on data integration.

BDA implementation, like that of other analytical tools and types of process monitoring, is time-consuming and requires management commitment. Executive BDA challenges, such as strategic management, business process management, knowledge management, and performance measurement, need to be reviewed and analysed (Brinch, 2018; Choi et al., 2018; Kamble & Gunasekaran, 2020). Moreover, instead of focusing on a few limited performance metrics, the key performance indicators of an SC &L company, such as financial or profitability indicators, must be monitored for proper BDA implementation. In the future, with the development of BDA techniques such as the proposed BDDO methods, prescriptive analytics approaches are likely to become more attractive (Arunachalam et al., 2018).

5.5 Data collection and generation

Unstructured data, such as data extracted from social media and websites, are a rich source of data acquisition that appears to be ignored in the SC &L literature; this type of data should receive more consideration in the future. Besides, in order to extract more value from DS &BDA approaches, real-time data is much more reliable than historical data because it can better describe SC behaviour (Nguyen et al., 2018). Therefore, SC &L companies should rapidly adopt analytics with real-time processing tools. IoT, RFID, and sensor devices are technologies that facilitate real-time data capture (Zhu, 2018; Zhong et al., 2016), and it is suggested that these tools be used in any real-time SC &L process. A special role in this area will be played by digital twins and associated technologies for real-time data collection such as 5G (Ivanov et al., 2021a; Choi et al., 2022; Ivanov & Dolgui, 2021a; Dolgui & Ivanov, 2022; Ivanov et al., 2022).

5.6 SC design

Our analysis of DS &BDA models indicated that only a few papers consider not only efficient but also sustainable and resilient network designs. Table 6 illustrates that there is a large gap in the literature regarding DS &BDA concepts in resilient SCs. Belhadi et al. (2022) confirm that the COVID-19 pandemic forced SCs to focus on resilience principles and affirm that DS &BDA techniques strongly support SC resilience strategies.

Our observation of how DS &BDA techniques are employed in sustainable SCs reveals another future direction for SC &L research: only 5% of the studies consider sustainability concepts in their models. Although considering environmental and human impacts in SC design is a contemporary subject, Tsolakis et al. (2021) acknowledge that Industry 4.0 and the Internet of Things necessitate the application of DS &BDA techniques while deliberately addressing social and environmental aspects in line with SCs’ progress. The authors confirm that the recent literature has not adequately covered the sustainability implications of DS &BDA innovations.

The closed-loop SC and reverse logistics are also among the rarer design configurations for DS &BDA models. Govindan and Gholizadeh (2021) contend that the analysis of the processes in a closed-loop SC requires big data, and that once sustainability and resilience features are incorporated into the model, a BDA model is capable of addressing the problems arising in such SCs. This means that the volume, velocity, and variety of the input data should be considered in the models.

5.7 COVID-19 and pandemic

Since 2020, the COVID-19 pandemic has posed significant challenges for SCs, with different SC echelons collaborating under deep uncertainty. Academic research has introduced new models and frameworks in response (Ivanov & Dolgui, 2021b; Ivanov, 2021c; Ardolino et al., 2022). We identified several studies within this research stream in our selected data set. For instance, Barnes et al. (2021) study consumer behaviour during the pandemic, termed “panic buying”. Using social media big data, the authors apply text mining together with compensatory control theory to provide early warning of potential demand problems. Nikolopoulos et al. (2021) study forecasting and planning during a pandemic using a nearest-neighbours clustering method; they use Google Trends data to predict COVID-19 growth rates and model the excess demand for products.
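
The sketch below is not the method of Nikolopoulos et al.; it is a generic nearest-neighbour (analogue) forecaster on synthetic data that conveys the underlying idea of predicting a series’ continuation from its most similar historical segments. The window length, horizon, neighbour count, and data-generating process are all our own assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_forecast(series, window=8, horizon=4, k=5):
    """Forecast the next `horizon` points of `series` by finding the k historical
    windows most similar to the most recent one and averaging what followed them."""
    n_segments = len(series) - window - horizon + 1
    segments = np.array([series[i:i + window] for i in range(n_segments)])
    futures = np.array([series[i + window:i + window + horizon] for i in range(n_segments)])
    nn = NearestNeighbors(n_neighbors=k).fit(segments)
    _, idx = nn.kneighbors(series[-window:].reshape(1, -1))
    return futures[idx[0]].mean(axis=0)

# Synthetic demand trajectory standing in for search-trend or sales data.
rng = np.random.default_rng(7)
t = np.arange(200)
series = 100 + 20 * np.sin(2 * np.pi * t / 25) + rng.normal(0, 3, t.size)
print("Next 4 periods (k-NN analogue forecast):", np.round(knn_forecast(series), 1))
```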

One of the central questions regarding the pandemic is how to design a pandemic-resilient SC (Ivanov & Dolgui, 2020; Nikolopoulos et al., 2021; Ivanov & Dolgui, 2021a; Ivanov, 2021a; Choi et al., 2022) and how to adapt to the “new normal” (Bag et al., 2022a; Ivanov, 2021b). Emphasising the role of BDA in SC &L, Belhadi et al. (2022) examine the effect of the COVID-19 outbreak on manufacturing and service SC resilience. Kar et al. (2022) investigate the effect of fake news on consumer buying behaviour during the pandemic, focusing on how the resulting fear drives hoarding of necessary products. SC performance in the COVID-19 era has also been investigated through BDA (Li et al., 2022b; Rozhkov et al., 2022). Although several studies have contributed to the area of DS &BDA approaches, the literature needs a dedicated survey study similar to those of Ardolino et al. (2021) and Queiroz et al. (2022). Novel contributions in this area can be made with BDA and DS applications in the context of SC viability and Industry 5.0 (Ivanov, 2023; Ivanov & Keskin, 2023).

6 Conclusion and research directions

In this study, we proposed a literature review methodology and a holistic conceptual framework for classifying the applications of DS &BDA in SC &L. An investigation of the relevant review studies revealed several gaps in former studies, which motivated us to base our reviewing process on a conceptual framework. Our broad keyword search initially found a large variety of papers published from 2005 to 2022. Employing a detailed review protocol and process, we selected 364 publications from highly ranked journals and focused on studies using DS &BDA modelling methods for solving SC &L problems. We revealed the contributions of DS &BDA in SC &L processes and highlighted the potential for future studies in each SC &L process. We also indicated the effective role of DS &BDA applications and techniques across the three hierarchical decision levels. Three main types of analytics were used to categorise DS &BDA techniques and tools. The overall results indicated that the predictive approach was the most popular. However, with the future development of BDA techniques and DDO approaches, the prescriptive approach is likely to become more attractive. We also emphasised the deployment of effective deep learning, ensemble learning, and machine learning techniques in SCs. In the area of SC design, we provided a structured and unbiased review of the DS &BDA application areas in SC &L, comprehensively covering the efficiency, resilience, and sustainability paradigms.

As with any study, limitations exist. Although we conducted a systematic literature review, the selected papers were restricted by our inclusion and exclusion criteria. We tried to include all relevant papers and selected highly ranked journals to increase the quality of the research. Nevertheless, a larger data set including computer science/engineering conferences and journals may result in a better exploration of the literature and would reduce the echo-chamber effect of citations, in which a specific subset of journals keeps citing one another. The proposed conceptual framework may also need to be extended, especially with respect to prescriptive analytics approaches, and our interpretation of the literature may be subject to bias. The material collection process showed that studies on the topic of DS &BDA in SC &L are growing substantially. Therefore, annual survey studies on this topic (with a broad range of keywords) are suggested for future research. Furthermore, any of the main approaches in DS &BDA applications (such as clustering, classification, simulation, text mining, or time series analysis) can be investigated separately in SC &L.

2019 Australian Business Deans Council (ABDC) journal rank: https://abdc.edu.au/research/abdc-journal-quality-list/.

2020 Scimago Journal & Country Rank (SJR): https://www.scimagojr.com/journalrank.php.

Abbasi, B., Babaei, T., Hosseinifard, Z., Smith-Miles, K., & Dehghani, M. (2020). Predicting solutions of large-scale optimization problems via machine learning: A case study in blood supply chain management. Computers and Operations Research, 119 , 104941.

Addo-Tenkorang, R., & Helo, P. T. (2016). Big data applications in operations/supply-chain management: A literature review. Computers and Industrial Engineering, 101 , 528–543.

Aguilar-Palacios, C., Muñoz-Romero, S., & Rojo-Álvarez, J. L. (2019). Forecasting promotional sales within the neighbourhood. IEEE Access, 7 , 74759–74775.

Akinade, O. O., & Oyedele, L. O. (2019). Integrating construction supply chains within a circular economy: An ANFIS-based waste analytics system (A-WAS). Journal of Cleaner Production, 229 , 863–873.

Alahmadi, D., & Jamjoom, A. (2022). Decision support system for handling control decisions and decision-maker related to supply chain. Journal of Big Data, 9 (1).

Alhameli, F., Ahmadian, A., & Elkamel, A. (2021). Multiscale decision-making for enterprise-wide operations incorporating clustering of high-dimensional attributes and big data analytics: Applications to energy hub. Energies, 14 (20).

Aloini, D., Benevento, E., Stefanini, A., & Zerbino, P. (2019). Process fragmentation and port performance: Merging SNA and text mining. International Journal of Information Management, 51 , 101925.

Altintas, N., & Trick, M. (2014). A data mining approach to forecast behavior. Annals of Operations Research, 216 (1), 3–22.

Ameri Sianaki, O., Yousefi, A., Tabesh, A. R., & Mahdavi, M. (2019). Machine learning applications: The past and current research trend in diverse industries. Inventions, 4 (1), 8.

Amoozad Mahdiraji, H., Yaftiyan, F., Abbasi-Kamardi, A., & Garza-Reyes, J. (2022). Investigating potential interventions on disruptive impacts of Industry 4.0 technologies in circular supply chains: Evidence from SMEs of an emerging economy. Computers and Industrial Engineering, 174 .

Analytics, T. S. C. (2020). Top supply chain analytics: 50 useful software solutions and data analysis tools to gain valuable supply chain insights. Visited on 2020-01-31. www.camcode.com/asset-tags/top-supply-chain-analytics/

Anparasan, A. A., & Lejeune, M. A. (2018). Data laboratory for supply chain response models during epidemic outbreaks. Annals of Operations Research, 270 (1–2), 53–64.

Antomarioni, S., Lucantoni, L., Ciarapica, F. E., & Bevilacqua, M. (2021). Data-driven decision support system for managing item allocation in an ASRS: A framework development and a case study. Expert Systems with Applications, 185 , 115622.

Arbabzadeh, N., & Jafari, M. (2017). A data-driven approach for driving safety risk prediction using driver behavior and roadway information data. IEEE Transactions on Intelligent Transportation Systems, 19 (2), 446–460.

Ardolino, M., Bacchetti, A., Dolgui, A., Franchini, G., Ivanov, D., & Nair, A. (2022). The Impacts of digital technologies on coping with the COVID-19 pandemic in the manufacturing industry: A systematic literature review. International Journal of Production Research , 1–24.

Ardolino, M., Bacchetti, A., & Ivanov, D. (2021). Analysis of the COVID-19 pandemic’s impacts on manufacturing: A systematic literature review and future research agenda. Operations Management Research .

Arunachalam, D., Kumar, N., & Kawalek, J. P. (2018). Understanding big data analytics capabilities in supply chain management: Unravelling the issues, challenges and implications for practice. Transportation Research Part E: Logistics and Transportation Review, 114 , 416–436.

Bag, S., Choi, T.-M., Rahman, M., Srivastava, G., & Singh, R. (2022a). Examining collaborative buyer-supplier relationships and social sustainability in the “new normal” era: The moderating effects of justice and big data analytical intelligence. Annals of Operations Research , 1–46.

Bag, S., Gupta, S., & Wood, L. (2022). Big data analytics in sustainable humanitarian supply chain: Barriers and their interactions. Annals of Operations Research, 319 (1), 721–760.

Bag, S., Luthra, S., Mangla, S., & Kazancoglu, Y. (2021). Leveraging big data analytics capabilities in making reverse logistics decisions and improving remanufacturing performance. International Journal of Logistics Management, 32 (3), 742–765.

Bahaghighat, M., Akbari, L., & Xin, Q. (2019). A machine learning-based approach for counting blister cards within drug packages. IEEE Access, 7 , 83785–83796.

Baker, T., Jayaraman, V., & Ashley, N. (2013). A data-driven inventory control policy for cash logistics operations: An exploratory case study application at a financial institution. Decision Sciences, 44 (1), 205–226.

Ballings, M., & Van den Poel, D. (2012). Customer event history for churn prediction: How long is long enough? Expert Systems with Applications, 39 (18), 13517–13522.

Bányai, T., Illés, B., & Bányai, Á. (2018). Smart scheduling: An integrated first mile and last mile supply approach. Complexity, 2018 .

Bao, J., Liu, P., & Ukkusuri, S. V. (2019). A spatiotemporal deep learning approach for citywide short-term crash risk prediction with multi-source data. Accident Analysis and Prevention, 122 , 239–254.

Barnes, S. J., Diaz, M., & Arnaboldi, M. (2021). Understanding panic buying during COVID-19: A text analytics approach. Expert Systems with Applications, 169 , 114360.

Barraza, N., Moro, S., Ferreyra, M., & de la Peña, A. (2019). Mutual information and sensitivity analysis for feature selection in customer targeting: A comparative study. Journal of Information Science, 45 (1), 53–67.

Baryannis, G., Dani, S., & Antoniou, G. (2019). Predicting supply chain risks using machine learning: The trade-off between performance and interpretability. Future Generation Computer Systems, 101 , 993–1004.

Baryannis, G., Validi, S., Dani, S., & Antoniou, G. (2019). Supply chain risk management and artificial intelligence: State of the art and future research directions. International Journal of Production Research, 57 (7), 2179–2202.

Belhadi, A., Kamble, S., Fosso Wamba, S., & Queiroz, M. (2022). Building supply-chain resilience: An artificial intelligence-based technique and decision-making framework. International Journal of Production Research, 60 (14), 4487–4507.

Benzidia, S., Makaoui, N., & Bentahar, O. (2021). The impact of big data analytics and artificial intelligence on green supply chain process integration and hospital environmental performance. Technological Forecasting and Social Change, 165 , 120557.

Bhattacharya, A., Kumar, S. A., Tiwari, M., & Talluri, S. (2014). An intermodal freight transport system for optimal supply chain logistics. Transportation Research Part C: Emerging Technologies, 38 , 73–84.

Blackburn, R., Lurz, K., Priese, B., Göb, R., & Darkow, I.-L. (2015). A predictive analytics approach for demand forecasting in the process industry. International Transactions in Operational Research, 22 (3), 407–428.

Bodendorf, F., Dimitrov, G., & Franke, J. (2022a). Analyzing and evaluating supplier carbon footprints in supply networks. Journal of Cleaner Production, 372 .

Bodendorf, F., Merkl, P., & Franke, J. (2022). Artificial neural networks for intelligent cost estimation-a contribution to strategic cost management in the manufacturing supply chain. International Journal of Production Research, 60 (21), 6637–6658.

Boutselis, P., & McNaught, K. (2019). Using Bayesian networks to forecast spares demand from equipment failures in a changing service logistics context. International Journal of Production Economics, 209 , 325–333.

Bouzembrak, Y., & Marvin, H. J. (2019). Impact of drivers of change, including climatic factors, on the occurrence of chemical food safety hazards in fruits and vegetables: A Bayesian Network approach. Food Control, 97 , 67–76.

Brinch, M. (2018). Understanding the value of big data in supply chain management and its business processes. International Journal of Operations and Production Management .

Brintrup, A., Pak, J., Ratiney, D., Pearce, T., Wichmann, P., Woodall, P., & McFarlane, D. (2020). Supply chain data analytics for predicting supplier disruptions: A case study in complex asset manufacturing. International Journal of Production Research, 58 (11), 3330–3341.

Brintrup, A., Wichmann, P., Woodall, P., McFarlane, D., Nicks, E., & Krechel, W. (2018). Predicting hidden links in Supply Networks. Complexity, 2018 .

Bucur, P. A., Hungerländer, P., & Frick, K. (2019). Quality classification methods for ball nut assemblies in a multi-view setting. Mechanical Systems and Signal Processing, 132 , 72–83.

Burgos, D., & Ivanov, D. (2021). Food retail supply chain resilience and the COVID-19 pandemic: A digital twin-based impact analysis and improvement directions. Transportation Research Part E: Logistics and Transportation Review, 152 , 102412.

Carbonneau, R., Laframboise, K., & Vahidov, R. (2008). Application of machine learning techniques for supply chain demand forecasting. European Journal of Operational Research, 184 (3), 1140–1154.

Cavalcante, I. M., Frazzon, E. M., Forcellini, F. A., & Ivanov, D. (2019). A supervised machine learning approach to data-driven simulation of resilient supplier selection in digital manufacturing. International Journal of Information Management, 49 , 86–97.

Cavallo, D. P., Cefola, M., Pace, B., Logrieco, A. F., & Attolico, G. (2019). Non-destructive and contactless quality evaluation of table grapes by a computer vision system. Computers and Electronics in Agriculture, 156 , 558–564.

Celik, N., Lee, S., Vasudevan, K., & Son, Y.-J. (2010). DDDAS-based multi-fidelity simulation framework for supply chain systems. IIE Transactions, 42 (5), 325–341.

Celik, N., & Son, Y.-J. (2012). Sequential Monte Carlo-based fidelity selection in dynamic-data-driven adaptive multi-scale simulations. International Journal of Production Research, 50 (3), 843–865.

Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19 (2), 171–209.

Chen, M.-C., Huang, C.-L., Chen, K.-Y., & Wu, H.-P. (2005). Aggregation of orders in distribution centers using data mining. Expert Systems with Applications, 28 (3), 453–460.

Chen, M.-C., & Wu, H.-P. (2005). An association-based clustering approach to order batching considering customer demand patterns. Omega, 33 (4), 333–343.

Chen, R., Wang, Z., Yang, L., Ng, C., & Cheng, T. (2022). A study on operational risk and credit portfolio risk estimation using data analytics. Decision Sciences, 53 (1), 84–123.

Chen, W., Song, J., Shi, L., Pi, L., & Sun, P. (2013). Data mining-based dispatching system for solving the local pickup and delivery problem. Annals of Operations Research, 203 (1), 351–370.

Chen, X., Liu, L., & Guo, X. (2021). Analysing repeat blood donation behavior via big data. Industrial Management and Data Systems, 121 (2), 192–208.

Chen, Y.-S., Cheng, C.-H., & Lai, C.-J. (2012). Extracting performance rules of suppliers in the manufacturing industry: An empirical study. Journal of Intelligent Manufacturing, 23 (5), 2037–2045.

Chen, Y.-T., Sun, E., Chang, M.-F., & Lin, Y.-B. (2021b). Pragmatic real-time logistics management with traffic IoT infrastructure: Big data predictive analytics of freight travel time for Logistics 4.0. International Journal of Production Economics, 238 .

Chi, H.-M., Ersoy, O. K., Moskowitz, H., & Ward, J. (2007). Modeling and optimizing a vendor managed replenishment system using machine learning and genetic algorithms. European Journal of Operational Research, 180 (1), 174–193.

Choi, T.-M., Dolgui, A., Ivanov, D., & Pesch, E. (2022). OR and analytics for digital, resilient, and sustainable manufacturing 4.0. Annals of Operations Research, 310 (1), 1–6.

Choi, T.-M., Wallace, S. W., & Wang, Y. (2018). Big data analytics in operations management. Production and Operations Management, 27 (10), 1868–1883.

Choy, K., Tan, K., & Chan, F. (2007). Design of an intelligent supplier knowledge management system: An integrative approach. Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture, 221 (2), 195–211.

Chuang, Y.-F., Chia, S.-H., & Yih Wong, J. (2013). Customer value assessment of pharmaceutical marketing in Taiwan. Industrial Management and Data Systems, 113 (9), 1315–1333.

Çimen, M., & Kirkbride, C. (2017). Approximate dynamic programming algorithms for multidimensional flexible production-inventory problems. International Journal of Production Research, 55 (7), 2034–2050.

Coussement, K., Lessmann, S., & Verstraeten, G. (2017). A comparative analysis of data preparation algorithms for customer churn prediction: A case study in the telecommunication industry. Decision Support Systems, 95 , 27–36.

Cui, R., Gallino, S., Moreno, A., & Zhang, D. J. (2018). The operational value of social media information. Production and Operations Management, 27 (10), 1749–1769.

Cui, R., Li, M., & Zhang, S. (2022). AI and procurement. Manufacturing and Service Operations Management, 24 (2), 691–706.

Dai, J., Xie, Y., Xu, J., & Lv, C. (2020). Environmentally friendly equilibrium strategy for coal distribution center site selection. Journal of Cleaner Production, 246 , 119017.

Dai, Y., Dou, L., Song, H., Zhou, L., & Li, H. (2022). Two-way information sharing of uncertain demand forecasts in a dual-channel supply chain. Computers and Industrial Engineering, 169 .

De Caigny, A., Coussement, K., & De Bock, K. W. (2018). A new hybrid classification algorithm for customer churn prediction based on logistic regression and decision trees. European Journal of Operational Research, 269 (2), 760–772.

De Clercq, D., Jalota, D., Shang, R., Ni, K., Zhang, Z., Khan, A., Wen, Z., Caicedo, L., & Yuan, K. (2019). Machine learning powered software for accurate prediction of biogas production: A case study on industrial-scale Chinese production data. Journal of Cleaner Production, 218 , 390–399.

De Giovanni, P., Belvedere, V., & Grando, A. (2022). The selection of industry 4.0 technologies through Bayesian networks: An operational perspective. IEEE Transactions on Engineering Management , 1–16.

Dev, N. K., Shankar, R., Gunasekaran, A., & Thakur, L. S. (2016). A hybrid adaptive decision system for supply chain reconfiguration. International Journal of Production Research, 54 (23), 7100–7114.

Di Ciccio, C., Van der Aa, H., Cabanillas, C., Mendling, J., & Prescher, J. (2016). Detecting flight trajectory anomalies and predicting diversions in freight transportation. Decision Support Systems, 88 , 1–17.

Dolgui, A., & Ivanov, D. (2022). 5G in digital supply chain and operations management: Fostering flexibility, end-to-end connectivity and real-time visibility through internet-of-everything. International Journal of Production Research, 60 (2), 442–451.

Dombi, J., Jónás, T., & Tóth, Z. E. (2018). Modeling and long-term forecasting demand in spare parts logistics businesses. International Journal of Production Economics, 201 , 1–17.

Doolun, I. S., Ponnambalam, S., Subramanian, N., & Kanagaraj, G. (2018). Data driven hybrid evolutionary analytical approach for multi objective location allocation decisions: Automotive green supply chain empirical evidence. Computers and Operations Research, 98 , 265–283.

Ehmke, J. F., Campbell, A. M., & Thomas, B. W. (2016). Data-driven approaches for emissions-minimized paths in urban areas. Computers and Operations Research, 67 , 34–47.

Ehmke, J. F., Meisel, S., & Mattfeld, D. C. (2012). Floating car based travel times for city logistics. Transportation Research Part C: Emerging Technologies, 21 (1), 338–352.

Eltoukhy, A. E., Wang, Z., Chan, F. T., & Fu, X. (2019). Data analytics in managing aircraft routing and maintenance staffing with price competition by a Stackelberg–Nash game model. Transportation Research Part E: Logistics and Transportation Review, 122 , 143–168.

Farid, A., Abdel-Aty, M., & Lee, J. (2019). Comparative analysis of multiple techniques for developing and transferring safety performance functions. Accident Analysis and Prevention, 122 , 85–98.

Figueiras, P., Gonçalves, D., Costa, R., Guerreiro, G., Georgakis, P., & Jardim-Gonçalves, R. (2019). Novel Big Data-supported dynamic toll charging system: Impact assessment on Portugal’s shadow-toll highways. Computers and Industrial Engineering, 135 , 476–491.

Flores, H., & Villalobos, J. R. (2020). A stochastic planning framework for the discovery of complementary, agricultural systems. European Journal of Operational Research, 280 (2), 707–729.

Fu, W., & Chien, C.-F. (2019). UNISON data-driven intermittent demand forecast framework to empower supply chain resilience and an empirical study in electronics distribution. Computers and Industrial Engineering, 135 , 940–949.

Fukuda, S., Yasunaga, E., Nagle, M., Yuge, K., Sardsud, V., Spreer, W., & Müller, J. (2014). Modelling the relationship between peel colour and the quality of fresh mango fruit using Random Forests. Journal of Food Engineering, 131 , 7–17.

Gan, M., Yang, S., Li, D., Wang, M., Chen, S., Xie, R., & Liu, J. (2018). A novel intensive distribution logistics network design and profit allocation problem considering sharing economy. Complexity, 2018 .

Gao, J., Ning, C., & You, F. (2019). Data-driven distributionally robust optimization of shale gas supply chains under uncertainty. AIChE Journal, 65 (3), 947–963.

Garcia, S., Cordeiro, A., de Alencar Nääs, I., & Neto, P. L. (2019). The sustainability awareness of Brazilian consumers of cotton clothing. Journal of Cleaner Production, 215 , 1490–1502.

Ghasri, M., Maghrebi, M., Rashidi, T. H., & Waller, S. T. (2016). Hazard-based model for concrete pouring duration using construction site and supply chain parameters. Automation in Construction, 71 , 283–293.

Göçmen, E., & Erol, R. (2019). Transportation problems for intermodal networks: Mathematical models, exact and heuristic algorithms, and machine learning. Expert Systems with Applications, 135 , 374–387.

Gopal, P., Rana, N., Krishna, T., & Ramkumar, M. (2022). Impact of big data analytics on supply chain performance: An analysis of influencing factors. Annals of Operations Research , 1–29.

Gordini, N., & Veglio, V. (2017). Customers churn prediction and marketing retention strategies. An application of support vector machines based on the AUC parameter-selection technique in B2B e-commerce industry. Industrial Marketing Management, 62 , 100–107.

Govindan, K., Cheng, T., Mishra, N., & Shukla, N. (2018). Big data analytics and application for logistics and supply chain management.

Govindan, K., & Gholizadeh, H. (2021). Robust network design for sustainable-resilient reverse logistics network using big data: A case study of end-of-life vehicles. Transportation Research Part E: Logistics and Transportation Review, 149 , 102279.

Grover, P., & Kar, A. K. (2017). Big data analytics: A review on theoretical contributions and tools used in literature. Global Journal of Flexible Systems Management, 18 (3), 203–229.

Gružauskas, V., Gimžauskienė, E., & Navickas, V. (2019). Forecasting accuracy influence on logistics clusters activities: The case of the food industry. Journal of Cleaner Production, 240 , 118225.

Grzybowska, H., Kerferd, B., Gretton, C., & Waller, S. T. (2020). A simulation-optimisation genetic algorithm approach to product allocation in vending machine systems. Expert Systems with Applications, 145 , 113110.

Gumus, A. T., Guneri, A. F., & Keles, S. (2009). Supply chain network design using an integrated neuro-fuzzy and MILP approach: A comparative design study. Expert Systems with Applications, 36 (10), 12570–12577.

Gunduz, M., Demir, S., & Paksoy, T. (2021). Matching functions of supply chain management with smart and sustainable Tools: A novel hybrid BWM-QFD based method. Computers and Industrial Engineering, 162 .

GuoHua, Z., Wei, W., et al. (2021). Study of the game model of E-commerce information sharing in an agricultural product supply chain based on fuzzy big data and LSGDM. Technological Forecasting and Social Change, 172 , 121017.

Gürbüz, F., Eski, İ, Denizhan, B., & Dağlı, C. (2019). Prediction of damage parameters of a 3PL company via data mining and neural networks. Journal of Intelligent Manufacturing, 30 (3), 1437–1449.

Ha, S. H., & Krishnan, R. (2008). A hybrid approach to supplier selection for the maintenance of a competitive supply chain. Expert Systems with Applications, 34 (2), 1303–1311.

Hägele, S., Grosse, E. H., & Ivanov, D. (2023). Supply chain resilience: A tertiary study. International Journal of Integrated Supply Management, 16 (1), 52–81.

Han, S., Cao, B., Fu, Y., & Luo, Z. (2018). A liner shipping competitive model with consideration of service quality management. Annals of Operations Research, 270 (1–2), 155–177.

Han, S., Fu, Y., Cao, B., & Luo, Z. (2018). Pricing and bargaining strategy of e-retail under hybrid operational patterns. Annals of Operations Research, 270 (1–2), 179–200.

Hao, H., Guo, J., Xin, Z., & Qiao, J. (2021). Research on e-commerce distribution optimization of rice agricultural products based on consumer satisfaction. IEEE Access, 9 , 135304–135315.

Heger, J., Branke, J., Hildebrandt, T., & Scholz-Reiter, B. (2016). Dynamic adjustment of dispatching rule parameters in flow shops with sequence-dependent set-up times. International Journal of Production Research, 54 (22), 6812–6824.

Ho, C.-T.B., Koh, S. L., Mahamaneerat, W. K., Shyu, C.-R., Ho, S.-C., & Chang, C. A. (2007). Domain-concept association rules mining for large-scale and complex cellular manufacturing tasks. Journal of Manufacturing Technology Management, 18 (7), 787–806.

Ho, G. T., Lau, H. C., Kwok, S., Lee, C. K., & Ho, W. (2009). Development of a co-operative distributed process mining system for quality assurance. International Journal of Production Research, 47 (4), 883–918.

Hogenboom, A., Ketter, W., Van Dalen, J., Kaymak, U., Collins, J., & Gupta, A. (2015). Adaptive tactical pricing in multi-agent supply chain markets using economic regimes. Decision Sciences, 46 (4), 791–818.

Hojati, A. T., Ferreira, L., Washington, S., & Charles, P. (2013). Hazard based models for freeway traffic incident duration. Accident Analysis and Prevention, 52 , 171–181.

Homayouni, Z., Pishvaee, M. S., Jahani, H., & Ivanov, D. (2021). A robust-heuristic optimization approach to a green supply chain design with consideration of assorted vehicle types and carbon policies under uncertainty. Annals of Operations Research , 1–41.

Hong, G.-H., & Ha, S. H. (2008). Evaluating supply partner’s capability for seasonal products using machine learning techniques. Computers and Industrial Engineering, 54 (4), 721–736.

Hosseini, S., & Al Khaled, A. (2019). A hybrid ensemble and AHP approach for resilient supplier selection. Journal of Intelligent Manufacturing, 30 (1), 207–228.

Hou, F., Li, B., Chong, A.Y.-L., Yannopoulou, N., & Liu, M. J. (2017). Understanding and predicting what influence online product sales? A neural network approach. Production Planning and Control, 28 (11–12), 964–975.

Hsiao, Y.-C., Wu, M.-H., & Li, S. C. (2019). Elevated performance of the smart city: A case study of the IoT by innovation mode. IEEE Transactions on Engineering Management, 68 (5), 1461–1475.

Huber, J., Gossmann, A., & Stuckenschmidt, H. (2017). Cluster-based hierarchical demand forecasting for perishable goods. Expert Systems with Applications, 76 , 140–151.

Ialongo, L. N., de Valk, C., Marchese, E., Jansen, F., Zmarrou, H., Squartini, T., & Garlaschelli, D. (2022). Reconstructing firm-level interactions in the Dutch input-output network from production constraints. Scientific Reports, 12 (1), 1–12.

Iftikhar, A., Ali, I., Arslan, A., & Tarba, S. (2022a). Digital innovation, data analytics, and supply chain resiliency: A bibliometric-based systematic literature review. Annals of Operations Research , 1–24.

Iftikhar, A., Purvis, L., Giannoccaro, I., & Wang, Y. (2022b). The impact of supply chain complexities on supply chain resilience: The mediating effect of big data analytics. Production Planning and Control , 1–21.

Iranitalab, A., & Khattak, A. (2017). Comparison of four statistical and machine learning methods for crash severity prediction. Accident Analysis and Prevention, 108 , 27–36.

Islam, S., & Amin, S. H. (2020). Prediction of probable backorder scenarios in the supply chain using Distributed Random Forest and Gradient Boosting Machine learning techniques. Journal of Big Data, 7 (1), 1–22.

Ivanov, D. (2021a). Digital supply chain management and technology to enhance resilience by building and using end-to-end visibility during the COVID-19 pandemic. IEEE Transactions on Engineering Management , 1–11.

Ivanov, D. (2021b). Exiting the COVID-19 pandemic: After-shock risks and avoidance of disruption tails in supply chains. Annals of Operations Research , 1–18.

Ivanov, D. (2021). Supply Chain Viability and the COVID-19 pandemic: A conceptual and formal generalisation of four major adaptation strategies. International Journal of Production Research, 59 (12), 3535–3552.

Ivanov, D. (2023). The industry 5.0 framework: Viability-based integration of the resilience, sustainability, and human-centricity perspectives. International Journal of Production Research , 61 (5), 1683–1695.

Ivanov, D., & Dolgui, A. (2020). Viability of intertwined supply networks: extending the supply chain resilience angles towards survivability. A position paper motivated by COVID-19 outbreak. International Journal of Production Research, 58 (10), 2904–2915.

Ivanov, D., & Dolgui, A. (2021). A digital supply chain twin for managing the disruption risks and resilience in the era of Industry 4.0. Production Planning and Control, 32 (9), 775–788.

Ivanov, D., & Dolgui, A. (2021). OR-methods for coping with the ripple effect in supply chains during COVID-19 pandemic: Managerial insights and research implications. International Journal of Production Economics, 232 , 107921.

Ivanov, D., Dolgui, A., & Sokolov, B. (2022). Cloud supply chain: Integrating industry 4.0 and digital platforms in the “supply chain-as-a-service’’. Transportation Research Part E: Logistics and Transportation Review, 160 , 102676.

Ivanov, D., & Keskin, B. B. (2023). Post-pandemic adaptation and development of supply chain viability theory. Omega, 116 , 102806.

Ivanov, D., Tang, C. S., Dolgui, A., Battini, D., & Das, A. (2021). Researchers’ perspectives on Industry 4.0: Multi-disciplinary analysis and opportunities for operations management. International Journal of Production Research, 59 (7), 2055–2078.

Ivanov, D., Tsipoulanidis, A., Schönberger, J., et al. (2021). Global supply chain and operations management . Springer.

Jain, R., Singh, A., Yadav, H., & Mishra, P. (2014). Using data mining synergies for evaluating criteria at pre-qualification stage of supplier selection. Journal of Intelligent Manufacturing, 25 (1), 165–175.

Jain, V., Wadhwa, S., & Deshmukh, S. (2007). Supplier selection using fuzzy association rules mining approach. International Journal of Production Research, 45 (6), 1323–1353.

Jeong, H., Jang, Y., Bowman, P. J., & Masoud, N. (2018). Classification of motor vehicle crash injury severity: A hybrid approach for imbalanced data. Accident Analysis and Prevention, 120 , 250–261.

Ji, G., Yu, M., Tan, K., Kumar, A., & Gupta, S. (2022). Decision optimization in cooperation innovation: the impact of big data analytics capability and cooperative modes. Annals of Operations Research , 1–24.

Jiang, C., & Sheng, Z. (2009). Case-based reinforcement learning for dynamic inventory control in a multi-agent supply-chain system. Expert Systems with Applications, 36 (3), 6520–6526.

Jiang, W. (2019). An intelligent supply chain information collaboration model based on Internet of things and big data. IEEE Access, 7 , 58324–58335.

Jiao, Z., Ran, L., Zhang, Y., Li, Z., & Zhang, W. (2018). Data-driven approaches to integrated closed-loop sustainable supply chain design under multi-uncertainties. Journal of Cleaner Production, 185 , 105–127.

Jula, P., & Leachman, R. C. (2011). Long-and short-run supply-chain optimization models for the allocation and congestion management of containerized imports from Asia to the United States. Transportation Research Part E: Logistics and Transportation Review, 47 (5), 593–608.

Jung, S., Hong, S., & Lee, K. (2018). A data-driven air traffic sequencing model based on pairwise preference learning. IEEE Transactions on Intelligent Transportation Systems, 20 (3), 803–816.

Kamble, S., Belhadi, A., Gunasekaran, A., Ganapathy, L., & Verma, S. (2021a). A large multi-group decision-making technique for prioritizing the big data-driven circular economy practices in the automobile component manufacturing industry. Technological Forecasting and Social Change, 165 .

Kamble, S. S., & Gunasekaran, A. (2020). Big data-driven supply chain performance measurement system: A review and framework for implementation. International Journal of Production Research, 58 (1), 65–86.

Kamble, S. S., Gunasekaran, A., Kumar, V., Belhadi, A., & Foropon, C. (2021). A machine learning based approach for predicting blockchain adoption in supply chain. Technological Forecasting and Social Change, 163 , 120465.

Kamley, S., Jaloree, S., & Thakur, R. (2016). Performance forecasting of share market using machine learning techniques: A review. International Journal of Electrical and Computer Engineering (2088-8708), 6 (6).

Kang, Y., Lee, S., & Do Chung, B. (2019). Learning-based logistics planning and scheduling for crowdsourced parcel delivery. Computers and Industrial Engineering, 132 , 271–279.

Kappelman, A. C., & Sinha, A. K. (2021). Optimal control in dynamic food supply chains using big data. Computers and Operations Research, 126 , 105117.

Kar, A., Tripathi, S., Malik, N., Gupta, S., & Sivarajah, U. (2022). How does misinformation and capricious opinions impact the supply chain: A study on the impacts during the pandemic. Annals of Operations Research , 1–22.

Kartal, H., Oztekin, A., Gunasekaran, A., & Cebi, F. (2016). An integrated decision analytic framework of machine learning with multi-criteria decision making for multi-attribute inventory classification. Computers and Industrial Engineering, 101 , 599–613.

Kaur, H., & Singh, S. P. (2018). Heuristic modeling for sustainable procurement and logistics in a supply chain using big data. Computers and Operations Research, 98 , 301–321.

Kazancoglu, Y., Ozkan-Ozen, Y., Sagnak, M., Kazancoglu, I., & Dora, M. (2021a). Framework for a sustainable supply chain to overcome risks in transition to a circular economy through Industry 4.0. Production Planning and Control , 1–16.

Kazancoglu, Y., Sagnak, M., Mangla, S., Sezer, M., & Pala, M. (2021b). A fuzzy based hybrid decision framework to circularity in dairy supply chains through big data solutions. Technological Forecasting and Social Change, 170 .

Keller, T., Thiesse, F., & Fleisch, E. (2014). Classification models for RFID-based real-time detection of process events in the supply chain: An empirical study. ACM Transactions on Management Information Systems (TMIS), 5 (4), 1–30.

Ketter, W., Collins, J., Gini, M., Gupta, A., & Schrater, P. (2009). Detecting and forecasting economic regimes in multi-agent automated exchanges. Decision Support Systems, 47 (4), 307–318.

Kiekintveld, C., Miller, J., Jordan, P. R., Callender, L. F., & Wellman, M. P. (2009). Forecasting market prices in a supply chain game. Electronic Commerce Research and Applications, 8 (2), 63–77.

Kilimci, Z. H., Akyuz, A. O., Uysal, M., Akyokus, S., Uysal, M. O., Atak Bulbul, B., & Ekmis, M. A. (2019). An improved demand forecasting model using deep learning approach and proposed decision integration strategy for supply chain. Complexity, 2019 .

Kim, S., Kim, H., & Park, Y. (2017). Early detection of vessel delays using combined historical and real-time information. Journal of the Operational Research Society, 68 (2), 182–191.

Kim, S., Sohn, W., Lim, D., & Lee, J. (2021). A multi-stage data mining approach for liquid bulk cargo volume analysis based on bill of lading data. Expert Systems with Applications , 115304.

Kim, T. Y. (2018). Improving warehouse responsiveness by job priority management: A European distribution centre field study. Computers and Industrial Engineering, 139 , 105564.

Kitchenham, B. (2004). Procedures for performing systematic reviews. Keele, UK, Keele University, 33 (2004), 1–26.

Kosasih, E. E., & Brintrup, A. (2021). A machine learning approach for predicting hidden links in supply chain with graph neural networks. International Journal of Production Research , 1–14.

Kotu, V., & Deshpande, B. (2018). Data science: Concepts and practice . New York: Morgan Kaufmann.

Kumar, S., Nottestad, D. A., & Murphy, E. E. (2009). Effects of product postponement on the distribution network: A case study. Journal of the Operational Research Society, 60 (4), 471–480.

Kuo, R. J., Wang, Y. C., & Tien, F. C. (2010). Integration of artificial neural network and MADA methods for green supplier selection. Journal of Cleaner Production, 18 (12), 1161–1170.

Kusi-Sarpong, S., Orji, I., Gupta, H., & Kunc, M. (2021). Risks associated with the implementation of big data analytics in sustainable supply chains. Omega (United Kingdom), 105 .

Kuvvetli, Ü., & Firuzan, A. R. (2019). Applying Six Sigma in urban public transportation to reduce traffic accidents involving municipality buses. Total Quality Management and Business Excellence, 30 (1–2), 82–107.

Lamba, K., & Singh, S. P. (2019). Dynamic supplier selection and lot-sizing problem considering carbon emissions in a big data environment. Technological Forecasting and Social Change, 144 , 573–584.

Lamba, K., Singh, S. P., & Mishra, N. (2019). Integrated decisions for supplier selection and lot-sizing considering different carbon emission regulations in Big Data environment. Computers and Industrial Engineering, 128 , 1052–1062.

Lau, R. Y. K., Zhang, W., & Xu, W. (2018). Parallel aspect-oriented sentiment analysis for sales forecasting with big data. Production and Operations Management, 27 (10), 1775–1794.

Lázaro, J. L., Jiménez, Á. B., & Takeda, A. (2018). Improving cash logistics in bank branches by coupling machine learning and robust optimization. Expert Systems with Applications, 92 , 236–255.

Le Thi, H. A. (2020). DC programming and DCA for supply chain and production management: State-of-the-art models and methods. International Journal of Production Research, 58 (20), 6078–6114.

Lee, C. (2017). A GA-based optimisation model for big data analytics supporting anticipatory shipping in Retail 4.0. International Journal of Production Research, 55 (2), 593–605.

Lee, C. K., Ho, W., Ho, G. T., & Lau, H. C. (2011). Design and development of logistics workflow systems for demand management with RFID. Expert Systems with Applications, 38 (5), 5428–5437.

Lee, C.-Y., & Chien, C.-F. (2014). Stochastic programming for vendor portfolio selection and order allocation under delivery uncertainty. OR Spectrum, 36 (3), 761–797.

Lee, H., Aydin, N., Choi, Y., Lekhavat, S., & Irani, Z. (2018). A decision support system for vessel speed decision in maritime logistics using weather archive big data. Computers and Operations Research, 98 , 330–342.

Lee, Y.-C., Hsiao, Y.-C., Peng, C.-F., Tsai, S.-B., Wu, C.-H., & Chen, Q. (2015). Using Mahalanobis-Taguchi system, logistic regression, and neural network method to evaluate purchasing audit quality. Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture, 229 (1), 3–12.

Leung, K. H., Mo, D. Y., Ho, G. T., Wu, C.-H., & Huang, G. Q. (2020). Modelling near-real-time order arrival demand in e-commerce context: A machine learning predictive methodology. Industrial Management and Data Systems, 120 (6), 1149–1174.

Li, G., Li, L., Choi, T.-M., & Sethi, S. P. (2020). Green supply chain management in Chinese firms: Innovative measures and the moderating role of quick response technology. Journal of Operations Management, 66 (7–8), 958–988.

Li, G., Li, N., & Sethi, S. P. (2021). Does CSR reduce idiosyncratic risk? Roles of operational efficiency and AI innovation. Production and Operations Management, 30 (7), 2027–2045.

Li, G., Lim, M. K., & Wang, Z. (2020). Stakeholders, green manufacturing, and practice performance: Empirical evidence from Chinese fashion businesses. Annals of Operations Research, 290 (1), 961–982.

Li, G., Wu, H., Sethi, S. P., & Zhang, X. (2021). Contracting green product supply chains considering marketing efforts in the circular economy era. International Journal of Production Economics, 234 , 108041.

Li, G., Xue, J., Li, N., & Ivanov, D. (2022). Blockchain-supported business model design, supply chain resilience, and firm performance. Transportation Research Part E: Logistics and Transportation Review, 163 , 102773.

Li, G.-D., Yamaguchi, D., & Nagai, M. (2008). A grey-based rough decision-making approach to supplier selection. The International Journal of Advanced Manufacturing Technology, 36 (9–10), 1032.

Li, J., Zeng, X., Liu, C., & Zhou, X. (2018). A parallel Lagrange algorithm for order acceptance and scheduling in cluster supply chains. Knowledge-Based Systems, 143 , 271–283.

Li, L., Chi, T., Hao, T., & Yu, T. (2018). Customer demand analysis of the electronic commerce supply chain using Big Data. Annals of Operations Research, 268 (1–2), 113–128.

Li, L., Dai, Y., & Sun, Y. (2021). Impact of data-driven online financial consumption on supply chain services. Industrial Management and Data Systems, 121 (4), 856–878.

Li, L., Gong, Y., Wang, Z., & Liu, S. (2022b). Big data and big disaster: A mechanism of supply chain risk management in global logistics industry. International Journal of Operations and Production Management .

Li, R., Pereira, F. C., & Ben-Akiva, M. E. (2015). Competing risk mixture model and text analysis for sequential incident duration prediction. Transportation Research Part C: Emerging Technologies, 54 , 74–85.

Li, S., & Kuo, X. (2008). The inventory management system for automobile spare parts in a central warehouse. Expert Systems with Applications, 34 (2), 1144–1153.

Liao, S.-H., Chen, C.-M., & Wu, C.-H. (2008). Mining customer knowledge for product line and brand extension in retailing. Expert Systems with Applications, 34 (3), 1763–1776.

Liao, S.-H., Chen, Y.-N., & Tseng, Y.-Y. (2009). Mining demand chain knowledge of life insurance market for new product development. Expert Systems with Applications, 36 (5), 9422–9437.

Liao, S.-H., Hsieh, C.-L., & Huang, S.-P. (2008). Mining product maps for new product development. Expert Systems with Applications, 34 (1), 50–62.

Lim, M., Li, Y., & Song, X. (2021). Exploring customer satisfaction in cold chain logistics using a text mining approach. Industrial Management and Data Systems, 121 (12), 2426–2449.

Lin, R.-H., Chuang, C.-L., Liou, J. J., & Wu, G.-D. (2009). An integrated method for finding key suppliers in SCM. Expert Systems with Applications, 36 (3), 6461–6465.

Lin, W., Wu, Z., Lin, L., Wen, A., & Li, J. (2017). An ensemble random forest algorithm for insurance big data analysis. IEEE Access, 5 , 16568–16575.

Liu, C., Feng, Y., Lin, D., Wu, L., & Guo, M. (2020). IoT based laundry services: an application of big data analytics, intelligent logistics management, and machine learning techniques. International Journal of Production Research, 58 (17), 5113–5131.

Liu, C., Li, H., Tang, Y., Lin, D., & Liu, J. (2019). Next generation integrated smart manufacturing based on big data analytics, reinforced learning, and optimal routes planning methods. International Journal of Computer Integrated Manufacturing, 32 (9), 820–831.

Liu, P. (2019). Pricing policies and coordination of low-carbon supply chain considering targeted advertisement and carbon emission reduction costs in the big data environment. Journal of Cleaner Production, 210 , 343–357.

Liu, P., & Yi, S.-P. (2017). Pricing policies of green supply chain considering targeted advertising and product green degree in the big data environment. Journal of Cleaner Production, 164 , 1614–1622.

Liu, W., Long, S., Xie, D., Liang, Y., & Wang, J. (2021). How to govern the big data discriminatory pricing behavior in the platform service supply chain? An examination with a three-party evolutionary game model. International Journal of Production Economics, 231 , 107910.

Lyu, X., & Zhao, J. (2019). Compressed sensing and its applications in risk assessment for internet supply chain finance under big data. IEEE Access, 7 , 53182–53187.

Ma, D., Hu, J., & Yao, F. (2021). Big data empowering low-carbon smart tourism study on low-carbon tourism O2O supply chain considering consumer behaviors and corporate altruistic preferences. Computers and Industrial Engineering, 153 .

Maghsoodi, A. I., Kavian, A., Khalilzadeh, M., & Brauers, W. K. (2018). CLUS-MCDA: A novel framework based on cluster analysis and multiple criteria decision theory in a supplier selection problem. Computers and Industrial Engineering, 118 , 409–422.

Maheshwari, S., Gautam, P., & Jaggi, C. K. (2021). Role of Big Data Analytics in supply chain management: Current trends and future perspectives. International Journal of Production Research, 59 (6), 1875–1900.

Maldonado, S., González-Ramírez, R. G., Quijada, F., & Ramírez-Nafarrate, A. (2019). Analytics meets port logistics: A decision support system for container stacking operations. Decision Support Systems, 121 , 84–93.

Mancini, M., Mircoli, A., Potena, D., Diamantini, C., Duca, D., & Toscano, G. (2020). Prediction of pellet quality through machine learning techniques and near-infrared spectroscopy. Computers and Industrial Engineering, 147 , 106566.

Mao, J., Hong, D., Ren, R., Li, X., Wang, J., & Nasr, E. S. A. (2020). Driving conditions of new energy logistics vehicles using big data technology. IEEE Access, 8 , 123891–123903.

Masna, N. V. R., Chen, C., Mandal, S., & Bhunia, S. (2019). Robust authentication of consumables with extrinsic tags and chemical fingerprinting. IEEE Access, 7 , 14396–14409.

Matusiak, M., de Koster, R., & Saarinen, J. (2017). Utilizing individual picker skills to improve order batching in a warehouse. European Journal of Operational Research, 263 (3), 888–899.

Medina-González, S., Shokry, A., Silvente, J., Lupera, G., & Espuña, A. (2018). Optimal management of bio-based energy supply chains under parametric uncertainty through a data-driven decision-support framework. Computers and Industrial Engineering, 139 , 105561.

Merchán, D., & Winkenbach, M. (2019). An empirical validation and data-driven extension of continuum approximation approaches for urban route distances. Networks, 73 (4), 418–433.

Metzger, A., Leitner, P., Ivanović, D., Schmieders, E., Franklin, R., Carro, M., Dustdar, S., & Pohl, K. (2014). Comparing and combining predictive business process monitoring techniques. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45 (2), 276–290.

Miguéis, V. L., Van den Poel, D., Camanho, A. S., & e Cunha, J. F. (2012). Modeling partial customer churn: On the value of first product-category purchase sequences. Expert Systems with Applications, 39 (12), 11250–11256.

Ming, L., GuoHua, Z., & Wei, W. (2021). Study of the Game Model of E-Commerce Information Sharing in an Agricultural Product Supply Chain based on fuzzy big data and LSGDM. Technological Forecasting and Social Change, 172 .

Mirzaei, M., Zaerpour, N., & de Koster, R. (2021). The impact of integrated cluster-based storage allocation on parts-to-picker warehouse performance. Transportation Research Part E: Logistics and Transportation Review, 146 , 102207.

Mishra, D., Gunasekaran, A., Papadopoulos, T., & Childe, S. J. (2018). Big Data and supply chain management: A review and bibliometric analysis. Annals of Operations Research, 270 (1–2), 313–336.

Mishra, S., & Singh, S. (2022). A stochastic disaster-resilient and sustainable reverse logistics model in big data environment. Annals of Operations Research, 319 (1), 853–884.

Mishra, S., & Singh, S. P. (2020). A stochastic disaster-resilient and sustainable reverse logistics model in big data environment. Annals of Operations Research , 1–32.

Mishra, S., & Singh, S. P. (2021). A clean global production network model considering hybrid facilities. Journal of Cleaner Production, 281 , 124463.

Mocanu, E., Nguyen, P. H., Gibescu, M., & Kling, W. L. (2016). Deep learning for estimating building energy consumption. Sustainable Energy, Grids and Networks, 6 , 91–99.

Mohseni, S., & Pishvaee, M. S. (2020). Data-driven robust optimization for wastewater sludge-to-biodiesel supply chain design. Computers and Industrial Engineering, 139 , 105944.

Mokhtarinejad, M., Ahmadi, A., Karimi, B., & Rahmati, S. H. A. (2015). A novel learning based approach for a new integrated location-routing and scheduling problem within cross-docking considering direct shipment. Applied Soft Computing, 34 , 274–285.

Molka-Danielsen, J., Engelseth, P., & Wang, H. (2018). Large scale integration of wireless sensor network technologies for air quality monitoring at a logistics shipping base. Journal of Industrial Information Integration, 10 , 20–28.

Mourtzis, D., Dolgui, A., Ivanov, D., Peron, M., & Sgarbossa, F. (2021). Design and operation of production networks for mass personalization in the era of cloud technology . Elsevier.

Mungo, L., Lafond, F., Astudillo-Estévez, P., & Farmer, J. D. (2023). Reconstructing production networks using machine learning. Journal of Economic Dynamics and Control , 104607.

Murray, P. W., Agard, B., & Barajas, M. A. (2018). Forecast of individual customer’s demand from a large and noisy dataset. Computers and Industrial Engineering, 118 , 33–43.

Muteki, K., & MacGregor, J. F. (2008). Optimal purchasing of raw materials: A data-driven approach. AIChE Journal, 54 (6), 1554–1559.

Neilson, A., Daniel, B., Tjandra, S., et al. (2019). Systematic review of the literature on big data in the transportation domain: Concepts and applications. Big Data Research, 17 , 35–44.

Newman, W. R., & Krehbiel, T. C. (2007). Linear performance pricing: A collaborative tool for focused supply cost reduction. Journal of Purchasing and Supply Management, 13 (2), 152–165.

Nguyen, A., Lamouri, S., Pellerin, R., Tamayo, S., & Lekens, B. (2022). Data analytics in pharmaceutical supply chains: State of the art, opportunities, and challenges. International Journal of Production Research, 60 (22), 6888–6907.

Nguyen, A., Pellerin, R., Lamouri, S., & Lekens, B. (2022b). Managing demand volatility of pharmaceutical products in times of disruption through news sentiment analysis. International Journal of Production Research , 1–12.

Nguyen, D. T., Adulyasak, Y., Cordeau, J.-F., & Ponce, S. I. (2021). Data-driven operations and supply chain management: Established research clusters from 2000 to early 2020. International Journal of Production Research , 1–25.

Nguyen, T., Li, Z., Spiegler, V., Ieromonachou, P., & Lin, Y. (2018). Big data analytics in supply chain management: A state-of-the-art literature review. Computers and Operations Research, 98 , 254–264.

Ni, M., Xu, X., & Deng, S. (2007). Extended QFD and data-mining-based methods for supplier selection in mass customization. International Journal of Computer Integrated Manufacturing, 20 (2–3), 280–291.

Nikolopoulos, K., Punia, S., Schäfers, A., Tsinopoulos, C., & Vasilakis, C. (2021). Forecasting and planning during a pandemic: COVID-19 growth rates, supply chain disruptions, and governmental decisions. European Journal of Operational Research, 290 (1), 99–115.

Nikolopoulos, K. I., Babai, M. Z., & Bozos, K. (2016). Forecasting supply chain sporadic demand with nearest neighbor approaches. International Journal of Production Economics, 177 , 139–148.

Ning, C., & You, F. (2018). Data-driven stochastic robust optimization: General computational framework and algorithm leveraging machine learning for optimization under uncertainty in the big data era. Computers and Chemical Engineering, 111 , 115–133.

Ning, C., & You, F. (2019). Optimization under uncertainty in the era of big data and deep learning: When machine learning meets mathematical programming. Computers and Chemical Engineering, 125 , 434–448.

Niu, B., Dai, Z., & Chen, L. (2022). Information leakage in a cross-border logistics supply chain considering demand uncertainty and signal inference. Annals of Operations Research, 309 (2), 785–816.

Noroozi, A., Mokhtari, H., & Abadi, I. N. K. (2013). Research on computational intelligence algorithms with adaptive learning approach for scheduling problems with batch processing machines. Neurocomputing, 101 , 190–203.

Novais, L., Maqueira, J. M., & Ortiz-Bas, Á. (2019). A systematic literature review of cloud computing use in supply chain integration. Computers and Industrial Engineering, 129 , 296–314.

Nuss, P., Ohno, H., Chen, W.-Q., & Graedel, T. (2019). Comparative analysis of metals use in the United States economy. Resources, Conservation and Recycling, 145 , 448–456.

Oh, J., & Jeong, B. (2019). Tactical supply planning in smart manufacturing supply chain. Robotics and Computer-Integrated Manufacturing, 55 , 217–233.

Opasanon, S., & Kitthamkesorn, S. (2016). Border crossing design in light of the ASEAN Economic Community: Simulation based approach. Transport Policy, 48 , 1–12.

Ou, T.-Y., Cheng, C.-Y., Chen, P.-J., & Perng, C. (2016). Dynamic cost forecasting model based on extreme learning machine: A case study in steel plant. Computers and Industrial Engineering, 101 , 544–553.

Ozgormus, E., & Smith, A. E. (2020). A data-driven approach to grocery store block layout. Computers and Industrial Engineering, 139 , 105562.

Pal Singh, S., Adhikari, A., Majumdar, A., & Bisi, A. (2022). Does service quality influence operational and financial performance of third party logistics service providers? A mixed multi criteria decision making -text mining-based investigation. Transportation Research Part E: Logistics and Transportation Review, 157 .

Pan, S., Giannikas, V., Han, Y., Grover-Silva, E., & Qiao, B. (2017). Using customer-related data to enhance e-grocery home delivery. Industrial Management and Data Systems, 117 (9), 1917–1933.

Papanagnou, C. I., & Matthews-Amune, O. (2018). Coping with demand volatility in retail pharmacies with the aid of big data exploration. Computers and Operations Research, 98 , 343–354.

Parmar, D., Wu, T., Callarman, T., Fowler, J., & Wolfe, P. (2010). A clustering algorithm for supplier base management. International Journal of Production Research, 48 (13), 3803–3821.

Piendl, R., Matteis, T., & Liedtke, G. (2019). A machine learning approach for the operationalization of latent classes in a discrete shipment size choice model. Transportation Research Part E: Logistics and Transportation Review, 121 , 149–161.

Potočnik, P., Šilc, J., Papa, G., et al. (2019). A comparison of models for forecasting the residential natural gas demand of an urban area. Energy, 167 , 511–522.

Pournader, M., Ghaderi, H., Hassanzadegan, A., & Fahimnia, B. (2021). Artificial intelligence applications in supply chain management. International Journal of Production Economics , 108250.

Praet, S., & Martens, D. (2020). Efficient parcel delivery by predicting customers’ locations. Decision Sciences, 51 (5), 1202–1231.

Prakash, A., & Deshmukh, S. (2011). A multi-criteria customer allocation problem in supply chain environment: An artificial immune system with fuzzy logic controller based approach. Expert Systems with Applications, 38 (4), 3199–3208.

Pramanik, D., Mondal, S. C., & Haldar, A. (2020). Resilient supplier selection to mitigate uncertainty: Soft-computing approach. Journal of Modelling in Management .

Priore, P., Ponte, B., Rosillo, R., & de la Fuente, D. (2019). Applying machine learning to the dynamic selection of replenishment policies in fast-changing supply chain environments. International Journal of Production Research, 57 (11), 3663–3677.

Proto, S., Di Corso, E., Apiletti, D., Cagliero, L., Cerquitelli, T., Malnati, G., & Mazzucchi, D. (2020). REDTag: A predictive maintenance framework for parcel delivery services. IEEE Access, 8 , 14953–14964.

Punia, S., Singh, S. P., & Madaan, J. K. (2020). A cross-temporal hierarchical framework and deep learning for supply chain forecasting. Computers and Industrial Engineering, 149 , 106796.

Putra, P., Mahendra, R., & Budi, I. (2022). Traffic and road conditions monitoring system using extracted information from Twitter. Journal of Big Data, 9 (1).

Quariguasi Frota Neto, J., & Dutordoir, M. (2020). Mapping the market for remanufacturing: An application of “Big Data” analytics. International Journal of Production Economics, 230 .

Queiroz, M. M., Ivanov, D., Dolgui, A., & Wamba, S. F. (2022). Impacts of epidemic outbreaks on supply chains: Mapping a research agenda amid the COVID-19 pandemic through a structured literature review. Annals of Operations Research, 319 (1), 1159–1196.

Rahmanzadeh, S., Pishvaee, M., & Govindan, K. (2022). Emergence of open supply chain management: the role of open innovation in the future smart industry using digital twin network. Annals of Operations Research , 1–29.

Rai, R., Tiwari, M. K., Ivanov, D., & Dolgui, A. (2021). Machine learning in manufacturing and Industry 4.0 applications.

Riahi, Y., Saikouk, T., Gunasekaran, A., & Badraoui, I. (2021). Artificial intelligence applications in supply chain: A descriptive bibliometric analysis and future research directions. Expert Systems with Applications, 173 , 114702.

Rolf, B., Jackson, I., Müller, M., Lang, S., Reggelin, T., & Ivanov, D. (2022). A review on reinforcement learning algorithms and applications in supply chain management. International Journal of Production Research , 1–29.

Roy, V., Mitra, S., Chattopadhyay, M., & Sahay, B. (2018). Facilitating the extraction of extended insights on logistics performance from the logistics performance index dataset: A two-stage methodological framework and its application. Research in Transportation Business and Management, 28 , 23–32.

Rozhkov, M., Ivanov, D., Blackhurst, J., & Nair, A. (2022). Adapting supply chain operations in anticipation of and during the COVID-19 pandemic. Omega, 110 , 102635.

Sachs, A.-L. (2015). The data-driven newsvendor with censored demand observations. In Retail analytics (pp. 35–56). Springer.

Sadic, S., de Sousa, J. P., & Crispim, J. A. (2018). A two-phase MILP approach to integrate order, customer and manufacturer characteristics into Dynamic Manufacturing Network formation and operational planning. Expert Systems with Applications, 96 , 462–478.

See-To, E. W., & Ngai, E. W. (2018). Customer reviews for demand distribution and sales nowcasting: A big data approach. Annals of Operations Research, 270 (1–2), 415–431.

Segev, D., Levi, R., Dunn, P. F., & Sandberg, W. S. (2012). Modeling the impact of changing patient transportation systems on peri-operative process performance in a large hospital: Insights from a computer simulation study. Health Care Management Science, 15 (2), 155–169.

Seitz, A., Grunow, M., & Akkerman, R. (2020). Data driven supply allocation to individual customers considering forecast bias. International Journal of Production Economics, 227 , 107683.

Sener, A., Barut, M., Dag, A., & Yildirim, M. B. (2019). Impact of commitment, information sharing, and information usage on supplier performance: A Bayesian belief network approach. Annals of Operations Research , 1–34.

Shajalal, M., Hajek, P., & Abedin, M. Z. (2021). Product backorder prediction using deep neural network on imbalanced data. International Journal of Production Research , 1–18.

Shang, Y., Dunson, D., & Song, J.-S. (2017). Exploiting big data in logistics risk assessment via Bayesian nonparametrics. Operations Research, 65 (6), 1574–1588.

Sharma, R., Kamble, S. S., Gunasekaran, A., Kumar, V., & Kumar, A. (2020). A systematic literature review on machine learning applications for sustainable agriculture supply chain performance. Computers and Operations Research, 119 , 104926.

Shen, B., Choi, T.-M., & Chan, H.-L. (2019). Selling green first or not? A Bayesian analysis with service levels and environmental impact considerations in the Big Data Era. Technological Forecasting and Social Change, 144 , 412–420.

Shen How, B., & Lam, H. L. (2018). Sustainability evaluation for biomass supply chain synthesis: Novel principal component analysis (PCA) aided optimisation approach. Journal of Cleaner Production, 189 , 941–961.

Shokouhyar, S., Dehkhodaei, A., & Amiri, B. (2022). A mixed-method approach for modelling customer-centric mobile phone reverse logistics: Application of social media data. Journal of Modelling in Management, 17 (2), 655–696.

Shukla, V., Naim, M. M., & Thornhill, N. F. (2012). Rogue seasonality detection in supply chains. International Journal of Production Economics, 138 (2), 254–272.

Simkoff, J. M., & Baldea, M. (2019). Parameterizations of data-driven nonlinear dynamic process models for fast scheduling calculations. Computers and Chemical Engineering, 129 , 106498.

Singh, A., Shukla, N., & Mishra, N. (2018). Social media data analytics to improve supply chain management in food industries. Transportation Research Part E: Logistics and Transportation Review, 114 , 398–415.

Singh, A. K., Subramanian, N., Pawar, K. S., & Bai, R. (2018). Cold chain configuration design: Location-allocation decision-making using coordination, value deterioration, and big data approximation. Annals of Operations Research, 270 (1–2), 433–457.

Sodero, A. C., & Rabinovich, E. (2017). Demand and revenue management of deteriorating inventory on the Internet: An empirical study of flash sales markets. Journal of Business Logistics, 38 (3), 170–183.

Sokolov, B., Ivanov, D., & Dolgui, A. (2020). Scheduling in industry 4.0 and cloud manufacturing (Vol. 289). Springer.

Song, Z., & Kusiak, A. (2009). Optimising product configurations with a data-mining approach. International Journal of Production Research, 47 (7), 1733–1751.

Spoel, V., Chintan, A., & Hillegersberg, V. (2017). Predictive analytics for truck arrival time estimation: A field study at a European Distribution Center. International Journal of Production Research, 55 (17), 5062–5078.

Srinivasan, R., Giannikas, V., Kumar, M., Guyot, R., & McFarlane, D. (2019). Modelling food sourcing decisions under climate change: A data-driven approach. Computers and Industrial Engineering, 128 , 911–919.

Stadtler, H., & Kilger, C. (2002). Supply chain management and advanced planning (Vol. 4). New York: Springer.

Stip, J., & Van Houtum, G.-J. (2019). On a method to improve your service BOMs within spare parts management. International Journal of Production Economics , 107466.

Stip, J., & Van Houtum, G.-J. (2020). On a method to improve your service BOMs within spare parts management. International Journal of Production Economics, 221 , 107466.

Sugrue, D., & Adriaens, P. (2021). A data fusion approach to predict shipping efficiency for bulk carriers. Transportation Research Part E: Logistics and Transportation Review, 149 , 102326.

Sun, J., Li, G., & Lim, M. K. (2020). China’s power supply chain sustainability: An analysis of performance and technology gap. Annals of Operations Research , 1–29.

Susanty, A., Puspitasari, N., Prastawa, H., & Renaldi, S. (2021). Exploring the best policy scenario plan for the dairy supply chain: A DEMATEL approach. Journal of Modelling in Management, 16 (1), 240–266.

Talwar, S., Kaur, P., Fosso Wamba, S., & Dhir, A. (2021). Big Data in operations and supply chain management: a systematic literature review and future research agenda. International Journal of Production Research, 1–26.

Tan, K. H., Zhan, Y., Ji, G., Ye, F., & Chang, C. (2015). Harvesting big data to enhance supply chain innovation capabilities: An analytic infrastructure based on deduction graph. International Journal of Production Economics, 165 , 223–233.

Tao, Q., Gu, C., Wang, Z., Rocchio, J., Hu, W., & Yu, X. (2018). Big data driven agricultural products supply chain management: A trustworthy scheduling optimization approach. IEEE Access, 6 , 49990–50002.

Taube, F., & Minner, S. (2018). Data-driven assignment of delivery patterns with handling effort considerations in retail. Computers and Operations Research, 100 , 379–393.

Tavana, M., Fallahpour, A., Di Caprio, D., & Santos-Arteaga, F. J. (2016). A hybrid intelligent fuzzy predictive model with simulation for supplier evaluation and selection. Expert Systems with Applications, 61 , 129–144.

Tayal, A., & Singh, S. P. (2018). Integrating big data analytic and hybrid firefly-chaotic simulated annealing approach for facility layout problem. Annals of Operations Research, 270 (1–2), 489–514.

Thomassey, S. (2010). Sales forecasts in clothing industry: The key success factor of the supply chain management. International Journal of Production Economics, 128 (2), 470–483.

Ting, S., Tse, Y., Ho, G., Chung, S., & Pang, G. (2014). Mining logistics data to assure the quality in a sustainable food supply chain: A case in the red wine industry. International Journal of Production Economics, 152 , 200–209.

Tirkel, I. (2013). Forecasting flow time in semiconductor manufacturing using knowledge discovery in databases. International Journal of Production Research, 51 (18), 5536–5548.

Tiwari, S., Wee, H. M., & Daryanto, Y. (2018). Big data analytics in supply chain management between 2010 and 2016: Insights to industries. Computers and Industrial Engineering, 115 , 319–330.

Tomičić-Pupek, K., Srpak, I., Havaš, L., & Srpak, D. (2020). Algorithm for customizing the material selection process for application in power engineering. Energies, 13 (23), 6458.

Triepels, R., Daniels, H., & Feelders, A. (2018). Data-driven fraud detection in international shipping. Expert Systems with Applications, 99 , 193–202.

Tsai, F.-M., & Huang, L. J. (2017). Using artificial neural networks to predict container flows between the major ports of Asia. International Journal of Production Research, 55 (17), 5001–5010.

Tsao, Y.-C. (2017). Managing default risk under trade credit: Who should implement Big-Data analytics in supply chains? Transportation Research Part E: Logistics and Transportation Review, 106 , 276–293.

Tsolakis, N., Zissis, D., Papaefthimiou, S., & Korfiatis, N. (2021). Towards AI driven environmental sustainability: An application of automated logistics in container port terminals. International Journal of Production Research , 1–21.

Tsolakis, N., Zissis, D., Papaefthimiou, S., & Korfiatis, N. (2022). Towards AI driven environmental sustainability: An application of automated logistics in container port terminals. International Journal of Production Research, 60 (14), 4508–4528.

Tsou, C.-M. (2013). On the strategy of supply chain collaboration based on dynamic inventory target level management: A theory of constraint perspective. Applied Mathematical Modelling, 37 (7), 5204–5214.

Tucnik, P., Nachazel, T., Cech, P., & Bures, V. (2018). Comparative analysis of selected path-planning approaches in large-scale multi-agent-based environments. Expert Systems with Applications, 113 , 415–427.

Vahdani, B., Iranmanesh, S., Mousavi, S. M., & Abdollahzade, M. (2012). A locally linear neuro-fuzzy model for supplier selection in cosmetics industry. Applied Mathematical Modelling, 36 (10), 4714–4727.

Verstraete, G., Aghezzaf, E.-H., & Desmet, B. (2019). A data-driven framework for predicting weather impact on high-volume low-margin retail products. Journal of Retailing and Consumer Services, 48 , 169–177.

Vieira, A. A., Dias, L. M., Santos, M. Y., Pereira, G. A., & Oliveira, J. A. (2019). Simulation of an automotive supply chain using big data. Computers and Industrial Engineering, 137 , 106033.

Vieira, A. A., Dias, L. M., Santos, M. Y., Pereira, G. A., & Oliveira, J. A. (2019). Supply chain hybrid simulation: From Big Data to distributions and approaches comparison. Simulation Modelling Practice and Theory, 97 , 101956.

Viet, N. Q., Behdani, B., & Bloemhof, J. (2020). Data-driven process redesign: anticipatory shipping in agro-food supply chains. International Journal of Production Research, 58 (5), 1302–1318.

Villegas, M. A., & Pedregal, D. J. (2019). Automatic selection of unobserved components models for supply chain forecasting. International Journal of Forecasting, 35 (1), 157–169.

Vondra, M., Touš, M., & Teng, S. Y. (2019). Digestate evaporation treatment in biogas plants: A techno-economic assessment by Monte Carlo, neural networks and decision trees. Journal of Cleaner Production, 238 , 117870.

Waller, M. A., & Fawcett, S. E. (2013). Data science, predictive analytics, and big data: A revolution that will transform supply chain design and management. Journal of Business Logistics, 34 (2), 77–84.

Wang, F., Zhu, Y., Wang, F., Liu, J., Ma, X., & Fan, X. (2020). Car4Pac: Last mile parcel delivery through intelligent car trip sharing. IEEE Transactions on Intelligent Transportation Systems, 21 (10), 4410–4424.

Wang, G., Gunasekaran, A., & Ngai, E. W. (2018). Distribution network design with big data: Model and analysis. Annals of Operations Research, 270 (1–2), 539–551.

Wang, G., Gunasekaran, A., Ngai, E. W., & Papadopoulos, T. (2016). Big data analytics in logistics and supply chain management: Certain investigations for research and applications. International Journal of Production Economics, 176 , 98–110.

Wang, J., & Yue, H. (2017). Food safety pre-warning system based on data mining for a sustainable food supply chain. Food Control, 73 , 223–229.

Wang, K., Simandl, J. K., Porter, M. D., Graettinger, A. J., & Smith, R. K. (2016). How the choice of safety performance function affects the identification of important crash prediction variables. Accident Analysis and Prevention, 88 , 1–8.

Wang, L., Guo, S., Li, X., Du, B., & Xu, W. (2018). Distributed manufacturing resource selection strategy in cloud manufacturing. The International Journal of Advanced Manufacturing Technology, 94 (9–12), 3375–3388.

Wang, Y., Assogba, K., Liu, Y., Ma, X., Xu, M., & Wang, Y. (2018). Two-echelon location-routing optimization with time windows based on customer clustering. Expert Systems with Applications, 104 , 244–260.

Weiss, S. M., Dhurandhar, A., Baseman, R. J., White, B. F., Logan, R., Winslow, J. K., & Poindexter, D. (2016). Continuous prediction of manufacturing performance throughout the production lifecycle. Journal of Intelligent Manufacturing, 27 (4), 751–763.

Weng, T., Liu, W., & Xiao, J. (2019). Supply chain sales forecasting based on lightGBM and LSTM combination model. Industrial Management and Data Systems, 120 (2), 265–279.

Wesonga, R., & Nabugoomu, F. (2016). Framework for determining airport daily departure and arrival delay thresholds: Statistical modelling approach. SpringerPlus, 5 (1), 1026.

Wey, W.-M., & Huang, J.-Y. (2018). Urban sustainable transportation planning strategies for livable City’s quality of life. Habitat International, 82 , 9–27.

Wichmann, P., Brintrup, A., Baker, S., Woodall, P., & McFarlane, D. (2020). Extracting supply chain maps from news articles using deep neural networks. International Journal of Production Research, 58 (17), 5320–5336.

Windt, K., & Hütt, M.-T. (2011). Exploring due date reliability in production systems using data mining methods adapted from gene expression analysis. CIRP Annals, 60 (1), 473–476.

Wojtusiak, J., Warden, T., & Herzog, O. (2012). Machine learning in agent-based stochastic simulation: Inferential theory and evaluation in transportation logistics. Computers and Mathematics with Applications, 64 (12), 3658–3665.

Wojtusiak, J., Warden, T., & Herzog, O. (2012). The learnable evolution model in agent-based delivery optimization. Memetic Computing, 4 (3), 165–181.

Wong, W., & Guo, Z. (2010). A hybrid intelligent model for medium-term sales forecasting in fashion retail supply chains using extreme learning machine and harmony search algorithm. International Journal of Production Economics, 128 (2), 614–624.

Wu, P.-J., Chen, M.-C., & Tsau, C.-K. (2017). The data-driven analytics for investigating cargo loss in logistics systems. International Journal of Physical Distribution and Logistics Management, 47 (1), 68–83.

Wu, T., Xiao, F., Zhang, C., Zhang, D., & Liang, Z. (2019). Regression and extrapolation guided optimization for production-distribution with ship-buy-exchange options. Transportation Research Part E: Logistics and Transportation Review, 129 , 15–37.

Wu, X., Cao, Y., Xiao, Y., & Guo, J. (2020). Finding of urban rainstorm and waterlogging disasters based on microblogging data and the location-routing problem model of urban emergency logistics. Annals of Operations Research, 290 (1), 865–896.

Wu, Z., Li, Y., Wang, X., Su, J., Yang, L., Nie, Y., & Wang, Y. (2022). Mining factors affecting taxi detour behavior from GPS traces at directional road segment level. IEEE Transactions on Intelligent Transportation Systems, 23 (7), 8013–8023.

Wy, J., Jeong, S., Kim, B.-I., Park, J., Shin, J., Yoon, H., & Lee, S. (2011). A data-driven generic simulation model for logistics-embedded assembly manufacturing lines. Computers and Industrial Engineering, 60 (1), 138–147.

Xiang, Z., & Xu, M. (2019). Dynamic cooperation strategies of the closed-loop supply chain involving the Internet service platform. Journal of Cleaner Production, 220 , 1180–1193.

Xiang, Z., & Xu, M. (2020). Dynamic game strategies of a two-stage remanufacturing closed-loop supply chain considering Big Data marketing, technological innovation and overconfidence. Computers and Industrial Engineering, 145 .

Xu, F., Li, Y., & Feng, L. (2019). The influence of big data system for used product management on manufacturing-remanufacturing operations. Journal of Cleaner Production, 209 , 782–794.

Xu, G., Qiu, X., Fang, M., Kou, X., & Yu, Y. (2019). Data-driven operational risk analysis in E-Commerce Logistics. Advanced Engineering Informatics, 40 , 29–35.

Xu, J., Pero, M. E. P., Ciccullo, F., & Sianesi, A. (2021). On relating big data analytics to supply chain planning: Towards a research agenda. International Journal of Physical Distribution and Logistics Management, 51 (6), 656–682.

Xu, X., Guo, W. G., & Rodgers, M. D. (2020). A real-time decision support framework to mitigate degradation in perishable supply chains. Computers and Industrial Engineering, 150 , 106905.

Xu, X., & Li, Y. (2016). The antecedents of customer satisfaction and dissatisfaction toward various types of hotels: A text mining approach. International Journal of Hospitality Management, 55 , 57–69.

Xu, X., Shen, Y., Chen, W. A., Gong, Y., & Wang, H. (2021). Data-driven decision and analytics of collection and delivery point location problems for online retailers. Omega, 100 , 102280.

Yan, P., Pei, J., Zhou, Y., & Pardalos, P. (2021). When platform exploits data analysis advantage: change of OEM-led supply chain structure. Annals of Operations Research , 1–27.

Yang, B. (2020). Construction of logistics financial security risk ontology model based on risk association and machine learning. Safety Science, 123 .

Yang, H., Bukkapatnam, S. T., & Barajas, L. G. (2013). Continuous flow modelling of multistage assembly line system dynamics. International Journal of Computer Integrated Manufacturing, 26 (5), 401–411.

Yang, L., Jiang, A., & Zhang, J. (2021). Optimal timing of big data application in a two-period decision model with new product sales. Computers and Industrial Engineering, 160 , 107550.

Yang, Y., & Peng, C. (2023). A prediction-based supply chain recovery strategy under disruption risks. International Journal of Production Research , 1–15.

Yao, Y., Zhu, X., Dong, H., Wu, S., Wu, H., Tong, L. C., & Zhou, X. (2019). ADMM-based problem decomposition scheme for vehicle routing problem with time windows. Transportation Research Part B: Methodological, 129 , 156–174.

Yin, S., Jiang, Y., Tian, Y., & Kaynak, O. (2016). A data-driven fuzzy information granulation approach for freight volume forecasting. IEEE Transactions on Industrial Electronics, 64 (2), 1447–1456.

Yin, W., He, S., Zhang, Y., & Hou, J. (2018). A product-focused, cloud-based approach to door-to-door railway freight design. IEEE Access, 6 , 20822–20836.

Ying, H., Chen, L., & Zhao, X. (2021). Application of text mining in identifying the factors of supply chain financing risk management. Industrial Management and Data Systems, 121 (2), 498–518.

Yu, B., Guo, Z., Asian, S., Wang, H., & Chen, G. (2019). Flight delay prediction for commercial air transport: A deep learning approach. Transportation Research Part E: Logistics and Transportation Review, 125 , 203–221.

Yu, C.-C., & Wang, C.-S. (2008). A hybrid mining approach for optimizing returns policies in e-retailing. Expert Systems with Applications, 35 (4), 1575–1582.

Yu, L., Zhao, Y., Tang, L., & Yang, Z. (2019). Online big data-driven oil consumption forecasting with Google trends. International Journal of Forecasting, 35 (1), 213–223.

Yu, Y., He, Y., & Zhao, X. (2021). Impact of demand information sharing on organic farming adoption: An evolutionary game approach. Technological Forecasting and Social Change, 172 .

Yue, G., Tailai, G., & Dan, W. (2021). Multi-layered coding-based study on optimization algorithms for automobile production logistics scheduling. Technological Forecasting and Social Change, 170 , 120889.

Zakeri, A., Saberi, M., Hussain, O. K., & Chang, E. (2018). An early detection system for proactive management of raw milk quality: An Australian case study. IEEE Access, 6 , 64333–64349.

Zamani, E. D., Smyth, C., Gupta, S., & Dennehy, D. (2022). Artificial intelligence and big data analytics for supply chain resilience: A systematic literature review. Annals of Operations Research , 1–28.

Zhang, G., Shang, J., & Li, W. (2012). An information granulation entropy-based model for third-party logistics providers evaluation. International Journal of Production Research, 50 (1), 177–190.

Zhang, K., Qu, T., Zhang, Y., Zhong, R., & Huang, G. (2022). Big data-enabled intelligent synchronisation for the complex production logistics system under the opti-state control strategy. International Journal of Production Research, 60 (13), 4159–4175.

Zhang, R., Li, J., Wu, S., & Meng, D. (2016). Learning to select supplier portfolios for service supply chain. PLoS ONE, 11 (5), e0155672.

Zhang, T., Zhang, C. Y., & Pei, Q. (2019). Misconception of providing supply chain finance: Its stabilising role. International Journal of Production Economics, 213 , 175–184.

Zhao, J., Wang, J., & Deng, W. (2015). Exploring bikesharing travel time and trip chain by gender and day of the week. Transportation Research Part C: Emerging Technologies, 58 , 251–264.

Zhao, K., & Yu, X. (2011). A case based reasoning approach on supplier selection in petroleum enterprises. Expert Systems with Applications, 38 (6), 6839–6847.

Zhao, N., & Wang, Q. (2021). Analysis of two financing modes in green supply chains when considering the role of data collection. Industrial Management and Data Systems, 121 (4), 921–939.

Zhao, R., Liu, Y., Zhang, N., & Huang, T. (2017). An optimization model for green supply chain management by using a big data analytic approach. Journal of Cleaner Production, 142 , 1085–1097.

Zhao, S., & You, F. (2019). Resilient supply chain design and operations with decision-dependent uncertainty using a data-driven robust optimization approach. AIChE Journal, 65 (3), 1006–1021.

Zhao, X., Yeung, K., Huang, Q., & Song, X. (2015). Improving the predictability of business failure of supply chain finance clients by using external big dataset. Industrial Management and Data Systems, 115 (9), 1683–1703.

Zheng, M., Wu, K., Sun, C., & Pan, E. (2019). Optimal decisions for a two-echelon supply chain with capacity and demand information. Advanced Engineering Informatics, 39 , 248–258.

Zhong, R. Y., Huang, G. Q., Lan, S., Dai, Q., Chen, X., & Zhang, T. (2015). A big data approach for logistics trajectory discovery from RFID-enabled production data. International Journal of Production Economics, 165 , 260–272.

Zhong, R. Y., Lan, S., Xu, C., Dai, Q., & Huang, G. Q. (2016). Visualization of RFID-enabled shopfloor logistics Big Data in Cloud Manufacturing. The International Journal of Advanced Manufacturing Technology, 84 (1–4), 5–16.

Zhou, J., Li, X., Zhao, X., & Wang, L. (2021). Driving performance grading and analytics: Learning internal indicators and external factors from multi-source data. Industrial Management and Data Systems, 121 (12), 2530–2570.

Zhou, Y., & Guo, Z. (2021a). Research on intelligent solution of service industry supply chain network optimization based on genetic algorithm. Journal of Healthcare Engineering, 2021 .

Zhou, Y., & Guo, Z. (2021b). Research on intelligent solution of service industry supply chain network optimization based on genetic algorithm. Journal of Healthcare Engineering, 2021 .

Zhou, Y., Yu, L., Chi, G., Ding, S., & Liu, X. (2022a). An enterprise default discriminant model based on optimal misjudgment loss ratio. Expert Systems with Applications, 205 .

Zhou, Z., Wang, M., Huang, J., Lin, S., & Lv, Z. (2022). Blockchain in big data security for intelligent transportation with 6G. IEEE Transactions on Intelligent Transportation Systems, 23 (7), 9736–9746.

Zhu, D. (2018). IOT and big data based cooperative logistical delivery scheduling method and cloud robot system. Future Generation Computer Systems, 86 , 709–715.

Zhu, J. (2022). DEA under big data: Data enabled analytics and network data envelopment analysis. Annals of Operations Research, 309 (2), 761–783.

Zhu, Y., Zhao, Y., Zhang, J., Geng, N., & Huang, D. (2019a). Spring onion seed demand forecasting using a hybrid Holt-Winters and support vector machine model. PLoS ONE, 14 (7).

Zhu, Y., Zhou, L., Xie, C., Wang, G.-J., & Nguyen, T. V. (2019). Forecasting SMEs’ credit risk in supply chain finance with an enhanced hybrid ensemble machine learning approach. International Journal of Production Economics, 211 , 22–33.

Acknowledgements

The authors would like to express their sincere gratitude to the editors and anonymous reviewers for their important comments and suggestions that helped to improve this paper.

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

School of Accounting, Information Systems and Supply Chain, RMIT University, Melbourne, Australia

Hamed Jahani

Department of Management, Tilburg School of Economics and Management, Tilburg University, Tilburg, The Netherlands

Berlin School of Economics and Law, Global Supply Chain & Operations Management, Berlin, Germany

Dmitry Ivanov

Corresponding author

Correspondence to Dmitry Ivanov.

Acronyms of journal names

See Table 9.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Jahani, H., Jain, R. & Ivanov, D. Data science and big data analytics: a systematic review of methodologies used in the supply chain and logistics research. Ann Oper Res (2023). https://doi.org/10.1007/s10479-023-05390-7

Accepted: 08 May 2023

Published: 11 July 2023

DOI: https://doi.org/10.1007/s10479-023-05390-7

Keywords

  • Data analytics
  • Predictive analytics
  • Data science
  • Data mining
  • Machine learning
  • Supply chain
  • Data-driven optimisation

What are the Benefits of Data Analytics in the Business World?

By UOnline News 08-08-2024

In today's fast-paced, data-driven world, businesses increasingly rely on data analytics to gain a competitive edge. In highly competitive environments, data analytics has emerged as a powerful tool for organizations of every size, and its benefits have become vast and far-reaching.

This blog post explores the various benefits of data analytics in business, illustrating how it can transform operations and drive growth.

How Does Data Analytics Help Businesses?

Data analytics, the process of examining large datasets to uncover hidden patterns, correlations, and insights, has become a cornerstone of modern business strategy. From enhancing decision-making to improving operational efficiency, the benefits of data analytics in business are multifaceted and profound.

By leveraging data analytics, businesses can make more informed decisions, improve operational efficiency, enhance customer experiences, and drive innovation; it has the potential to transform every aspect of a business.

For sustainable growth and success, data analytics is no longer just an advantage but a necessity. By investing in analytics capabilities, businesses can unlock new opportunities, mitigate risks, and achieve their strategic objectives more effectively. The sections below delve into these benefits and how they can transform a business.

Understanding Data Analytics

Data analytics refers to the process of examining data sets to draw conclusions about the information they contain. It involves applying algorithms, statistical techniques, and machine learning models to uncover patterns, correlations, and insights that can inform decision-making. Data analytics can be categorized into four main types: descriptive, diagnostic, predictive, and prescriptive analytics.

Descriptive Analytics: This involves summarizing historical data to understand what has happened in the past. It includes data aggregation and data mining to provide insights into past performance.

Diagnostic Analytics: This focuses on understanding why something happened. It involves data discovery, drill-down, and correlations to identify the root causes of past outcomes.

Predictive Analytics: This type of analytics uses historical data to forecast future events. Techniques such as regression analysis, time series analysis, and machine learning are employed to predict future trends and behaviors.

Prescriptive Analytics: This goes a step further by recommending actions to achieve desired outcomes. It involves optimization and simulation algorithms to suggest the best course of action based on the predictive analysis. (A short code sketch after this list illustrates the descriptive and predictive types.)
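To ground the descriptive and predictive types in something runnable, here is a minimal, hypothetical Python sketch: pandas aggregation summarizes past sales (descriptive), and a simple linear-trend fit stands in for a predictive model. The table, product names, and figures are invented for illustration, and real predictive work would use richer time-series or machine learning models.

```python
# A minimal, hypothetical sketch: descriptive and predictive analytics on a toy
# monthly sales table. Column names and figures are invented for illustration.
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "month":   [1, 2, 3, 4, 5, 6],
    "product": ["A", "B", "A", "B", "A", "B"],
    "revenue": [120.0, 80.0, 135.0, 82.0, 150.0, 85.0],
})

# Descriptive analytics: summarize what has already happened.
summary = sales.groupby("product")["revenue"].agg(["sum", "mean"])
print(summary)

# Predictive analytics: fit a linear trend to product A's history and
# extrapolate it forward (a stand-in for richer time-series or ML models).
hist = sales[sales["product"] == "A"]
slope, intercept = np.polyfit(hist["month"], hist["revenue"], deg=1)
next_month = 7
forecast = slope * next_month + intercept
print(f"Trend forecast for product A in month {next_month}: {forecast:.1f}")
```

Diagnostic and prescriptive analytics build on the same data: drilling into why a trend changed, and recommending what to do about it.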

Data Analytics Benefits for Business

Enhanced Decision-Making

One of the most significant benefits of data analytics is its ability to enhance decision-making. By analyzing historical data, businesses can identify trends and patterns that inform strategic decisions. This data-driven approach minimizes the reliance on gut feelings and intuition, leading to more accurate and reliable outcomes.

For example, a retail company can analyze sales data to determine which products are performing well and adjust their inventory accordingly. This not only reduces the risk of stockouts but also ensures that capital is not tied up in slow-moving products.

Improved Operational Efficiency

In terms of operational efficiency, the importance of data analytics in business cannot be overstated. Data analytics can streamline operations by identifying bottlenecks and inefficiencies within business processes. By examining process data, businesses can pinpoint areas where resources are being wasted and implement changes to enhance efficiency.

For instance, a manufacturing company can use data analytics to monitor machine performance and predict maintenance needs, reducing downtime and increasing productivity. Similarly, logistics companies can optimize delivery routes based on traffic patterns and delivery schedules, resulting in faster deliveries and lower fuel costs.

Personalized Customer Experiences

In an era where customer experience is paramount, data analytics provides businesses with the tools to deliver personalized experiences. By analyzing customer data, businesses can gain insights into individual preferences, behaviors, and purchase histories. This enables the creation of targeted marketing campaigns and personalized product recommendations, enhancing customer satisfaction and loyalty.

For example, an e-commerce platform can analyze browsing and purchase data to recommend products that align with a customer's interests, increasing the likelihood of repeat purchases.
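As one simplified illustration of how purchase histories can drive recommendations, the sketch below counts which items tend to be bought by the same customer and suggests the most frequent co-purchases. The order table and item names are made up, and production recommenders use far richer models (collaborative filtering, embeddings, and so on).

```python
# A minimal sketch (toy order table, not any specific platform's recommender):
# "customers who bought X also bought Y" via simple item co-occurrence counts.
from collections import Counter
from itertools import combinations

import pandas as pd

orders = pd.DataFrame({
    "customer": ["c1", "c1", "c2", "c2", "c2", "c3", "c3"],
    "item":     ["shoes", "socks", "shoes", "socks", "hat", "shoes", "hat"],
})

# Count how often each pair of items appears in the same customer's history.
pair_counts = Counter()
for _, items in orders.groupby("customer")["item"]:
    for a, b in combinations(sorted(set(items)), 2):
        pair_counts[(a, b)] += 1

def recommend(item, top_n=2):
    """Return the items most frequently co-purchased with `item`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common(top_n)]

print(recommend("shoes"))  # e.g. ['socks', 'hat']
```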

Risk Management and Mitigation

Every business faces risks, whether they are financial, operational, or market-related. The benefits of big data analytics include helping businesses identify and mitigate these risks by providing a comprehensive understanding of potential threats. For instance, financial institutions can use data analytics to detect fraudulent activities by analyzing transaction patterns and flagging anomalies. Similarly, businesses can analyze market data to anticipate changes in demand and adjust their strategies accordingly, reducing the impact of market volatility.
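A toy version of "analyzing transaction patterns and flagging anomalies" is sketched below: each transaction is compared with the customer's own history using a z-score, and large deviations are flagged for review. The data and threshold are illustrative assumptions; real fraud systems combine many signals and learned models.

```python
# A minimal sketch (not a production fraud system): flag transactions whose
# amount deviates strongly from the customer's own history via a z-score.
# The data and the 2-standard-deviation threshold are illustrative assumptions.
import pandas as pd

tx = pd.DataFrame({
    "customer": ["c1"] * 6 + ["c2"] * 5,
    "amount":   [20, 25, 18, 22, 24, 480,   60, 55, 65, 58, 62],
})

# Per-customer mean and standard deviation of past transaction amounts.
stats = tx.groupby("customer")["amount"].agg(mean="mean", std="std")
tx = tx.join(stats, on="customer")
tx["zscore"] = (tx["amount"] - tx["mean"]) / tx["std"]

# Anything more than 2 standard deviations from the customer's mean is flagged.
flagged = tx[tx["zscore"].abs() > 2]
print(flagged[["customer", "amount", "zscore"]])
```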

Optimized Marketing Campaigns

Marketing is another area where data analytics delivers significant value. By analyzing data from various sources such as social media, website traffic, and customer interactions, businesses can gain insights into the effectiveness of their marketing campaigns. This allows for the optimization of marketing strategies, ensuring that resources are allocated to the most effective channels.

For example, a company can analyze the performance of different ad campaigns to determine which ones are generating the highest return on investment (ROI) and adjust their budget allocation accordingly.

Cost Reduction

Data analytics can lead to significant cost reductions by identifying areas where expenses can be minimized without compromising quality. For instance, by analyzing procurement data, businesses can identify suppliers that offer the best value for money and negotiate better contracts.

Additionally, data analytics can help optimize inventory management, reducing carrying costs and minimizing waste. For example, a company can use data analytics to forecast demand accurately, ensuring that they order the right amount of stock and avoid overproduction.
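To show how a demand forecast translates into an order decision, here is a minimal sketch under stated assumptions (normally distributed demand, an invented forecast, and a 95% target service level): the order quantity is the forecast plus a safety stock sized to the forecast error.

```python
# A minimal sketch (illustrative numbers, not a specific company's method):
# turn a demand forecast into an order quantity with a safety-stock buffer.
from statistics import NormalDist

forecast_mean = 500.0   # expected units sold next period (assumed)
forecast_std = 80.0     # uncertainty of the forecast (assumed)
service_level = 0.95    # target probability of not running out of stock

# Safety stock covers forecast error up to the chosen service level.
z = NormalDist().inv_cdf(service_level)
safety_stock = z * forecast_std
order_quantity = forecast_mean + safety_stock

print(f"z = {z:.2f}, safety stock = {safety_stock:.0f}, order = {order_quantity:.0f}")
```

Raising the service level increases the safety stock, which is exactly the trade-off between stockouts and carrying costs described above.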

Competitive Advantage

In a competitive business environment, gaining an edge over rivals is crucial. Data analytics provides businesses with insights that can drive innovation and differentiate them from the competition. By analyzing market trends and customer feedback, businesses can identify unmet needs and develop new products or services to address them. Additionally, data analytics can help businesses benchmark their performance against competitors, identifying areas where they can improve and capitalize on opportunities.

Enhanced Employee Performance and Satisfaction

Data analytics is not only beneficial for external operations but also for internal processes. By analyzing employee performance data, businesses can identify top performers and areas where additional training may be needed. This enables the implementation of targeted development programs that enhance employee skills and productivity.

Additionally, data analytics can help improve employee satisfaction by analyzing feedback and identifying factors that contribute to a positive work environment. For example, by analyzing survey data, a company can identify common employee concerns and implement changes to address them, leading to higher retention rates.

Innovation and Product Development

Data analytics can drive innovation by providing insights into customer needs and market trends. By analyzing data from various sources, businesses can identify gaps in the market and develop new products or services to meet those needs.

This data-driven approach to innovation ensures that new offerings are aligned with customer preferences and have a higher likelihood of success. For instance, a tech company can analyze user feedback and usage data to develop new features for their software products, enhancing user satisfaction and driving adoption.

Supply Chain Optimization

Data analytics can significantly enhance supply chain management by providing visibility into every stage of the supply chain. Businesses can analyze data related to suppliers, inventory levels, transportation, and demand forecasts to optimize their supply chain operations.

For example, a logistics company can use data analytics to optimize delivery routes, reduce fuel consumption, and improve delivery times. This not only lowers operational costs but also enhances customer satisfaction by ensuring timely deliveries.

Human Resources Management

Data analytics can transform human resources management by providing insights into employee performance, engagement, and retention. By analyzing HR data, businesses can identify trends and patterns that inform hiring decisions, employee development programs, and retention strategies. For example, an organization can use data analytics to identify the factors that contribute to high employee turnover. By addressing these factors, the organization can improve employee satisfaction and retention rates.
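
One lightweight way to begin surfacing such turnover drivers is to compare attrition rates across employee attributes, as in the sketch below; the HR records and attribute names are fabricated for illustration, and a production analysis would control for many more variables.

```python
# Minimal turnover-driver sketch: compare attrition rates with and without
# a given attribute. The HR data below is made up for illustration only.
import pandas as pd

hr = pd.DataFrame({
    "left_company":          [1, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    "works_overtime":        [1, 1, 0, 0, 1, 0, 1, 1, 0, 0],
    "had_promotion_last_2y": [0, 0, 1, 1, 0, 1, 0, 0, 1, 0],
})

for factor in ["works_overtime", "had_promotion_last_2y"]:
    rates = hr.groupby(factor)["left_company"].mean()
    print(f"{factor}: attrition {rates.loc[0]:.0%} without vs {rates.loc[1]:.0%} with")
```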

Regulatory Compliance

In many industries, regulatory compliance is a critical concern. Data analytics can help businesses ensure compliance with various regulations by providing a comprehensive view of their operations and identifying areas where they may be falling short.

For example, in the healthcare industry, data analytics can be used to monitor patient records and ensure that they are being handled in accordance with privacy regulations. Similarly, in the financial sector, data analytics can help monitor transactions for compliance with anti-money laundering (AML) regulations.

Real-World Examples of How Data Analytics Helps Businesses

Walmart

Walmart, the world's largest retailer, uses data analytics extensively to optimize its operations and enhance customer experience. The company collects vast amounts of data from its stores and online channels, which it analyzes to make data-driven decisions. For example, Walmart uses predictive analytics to forecast demand and manage inventory levels, ensuring that products are available when customers need them. This has helped Walmart reduce stockouts and improve customer satisfaction.

Kaiser Permanente

Kaiser Permanente, a leading healthcare provider, leverages data analytics to improve patient care and operational efficiency. The organization uses data analytics to monitor patient outcomes, identify trends, and develop personalized treatment plans. By analyzing patient data, Kaiser Permanente can identify high-risk patients and intervene early to prevent complications. This has led to improved patient outcomes and reduced healthcare costs.

Capital One

Capital One, a major financial institution, uses data analytics to enhance its risk management and customer experience. The company analyzes transaction data to detect fraudulent activities and prevent financial losses. Additionally, Capital One uses data analytics to create personalized offers and recommendations for its customers, improving customer satisfaction and loyalty.

Final Thoughts on Data Analytics in Business

Data analytics offers many benefits for businesses across various industries. From enhanced decision-making and operational efficiency to improved customer experiences and risk management, data analytics has the potential to transform business operations and drive growth. By leveraging the power of data, businesses can uncover new opportunities, innovate, and gain a competitive edge in the market.

As the volume of data continues to grow, the importance of data analytics in business will only increase, making it an essential tool for success. Businesses that embrace it will be well-positioned to navigate the complexities of the modern business landscape and thrive in an ever-evolving market. Whether you are a small startup or a large enterprise, the insights gained from data analytics can provide the foundation for informed decision-making, strategic planning, and long-term success.

Learn more about the University of Miami UOnline Master of Science in Data Analytics and Program Evaluation.

Most Recently Published Working Papers

 

FDIC Center for Financial Research Working Paper No. 2024-03
Haelim Anderson, Jaewon Choi and Jennifer Rhee

FDIC Center for Financial Research Working Paper No. 2024-02
Ajay Palvia, George Shoukry and Anna-Leigh Stone

FDIC Center for Financial Research Working Paper No. 2024-01
Stefan Jacewitz, Jonathan Pogach, Haluk Unal and Chengjun Wu

FDIC Center for Financial Research Working Paper No. 2023-03
Leonard Kiefer, Hua Kiefer and Tom Mayock

FDIC Center for Financial Research Working Paper No. 2023-02
Alireza Ebrahim, Ajay Palvia, Emilia Vähämaa and Sami Vähämaa

The Center for Financial Research (CFR) Working Paper Series allows CFR staff and their coauthors to circulate preliminary research findings to stimulate discussion and critical comment. Views and opinions expressed in CFR Working Papers reflect those of the authors and do not necessarily reflect those of the FDIC or the United States. Comments and suggestions are welcome and should be directed to the authors. References should cite this research as a “FDIC CFR Working Paper” and should note that findings and conclusions in working papers may be preliminary and subject to revision.


MDMA assisted therapy: Three papers are retracted as FDA rejects PTSD application

Elisabeth Mahase

Three research papers on MDMA assisted psychotherapy have been retracted by the journal Psychopharmacology because of “protocol violations amounting to unethical conduct” by researchers at a study site.

In the retraction notices the journal said that the authors were aware of the violations when they submitted the articles but failed to disclose them or remove the affected data from their analysis. 1 2 3 All three papers related to phase 2 randomised controlled trials for MDMA assisted therapy, with similar authors listed for all three papers.

Although the notices did not detail what the …


Capital One: The Ongoing Story of How One Firm Has Been Pioneering Data, Analytics, & AI Innovation for Over Three Decades


Pioneers are trailblazers. They operate as agents of transformation and change. They think differently. They innovate and introduce a new order. In the late 1980s, a pair of trailblazing strategy consultants came together with a new idea. Their idea? To use data to expand access to credit cards and do so more efficiently. This idea became the foundation for Capital One. In an industry historically dominated by century-old banks, Capital One has been a trailblazer.

From startup to top-ten bank, Capital One has revolutionized the credit card industry with data and technology. Capital One is approaching its 30th anniversary under the leadership of Founder, CEO, and Chairman, Rich Fairbank. Today, Capital One serves more than 100 million customers across a diverse set of businesses and has established one of the most widely recognized brands in banking.

I’ve had the good fortune to write about Capital One on multiple occasions over the years. In a 2019 Forbes article, From Analytics First to AI First at Capital One, co-authored with my industry colleague Tom Davenport, we wrote, “Capital One has long been known as a north star for financial services firms that aspire to be data-driven. Established in 1994 after spinning off of Signet Bank, the core idea behind the company’s formation was the “information-based strategy”—that important operational and financial decisions should be made on the basis of data and analytics”.

Data Leadership and the Role of Chief Data Officer (CDO) at Capital One

This past month, I had the privilege of interviewing Capital One’s current Chief Data Officer (CDO) and Executive Vice President, Amy Lenander, as part of a CDO panel that I have been organizing and moderating for the annual CDOIQ Symposium, which was launched 18 years ago at the Massachusetts Institute of Technology (MIT). Lenander was appointed CDO of Capital One in early 2023. She had previously led business functions for the company during a 20-year tenure, including a term as the CEO of Capital One U.K.

I asked Lenander about the evolution of the CDO role at Capital One. Lenander explained, “My role as Chief Data Officer is primarily about setting our company up to make the most of our data to power the future of Capital One.” She continued, “Part of the reason that the CDO role was created more than two decades ago was our recognition that establishing a rigorous, well-managed data ecosystem across the enterprise requires dedicated leadership from the top”.


Lenander understands the heritage of Capital One, noting, “In a company like Capital One, we need and expect data to enable associates across the company to discover insights, make decisions, and drive innovation and value for the business and our customers. This was true back in the early 2000s and remained true as we began our technology transformation and moved to the public cloud, and as the volume of data available continued to grow”.

Lenander appreciates that being data-driven is a way of life at Capital One and has been since its founding. Lenander comments, “What’s changed over time is the methods we use to find insights and how we use them to improve our business”. She continued, “Our data-driven culture means that there’s an incredibly strong pull from business leaders to continually learn about and use better and better techniques to drive more powerful insights. Along the way, every major division of our company has had leaders accountable for the data in their line of business, and data has remained central to our culture and to how we operate”.

Change has been a constant in the world of data over the past few decades. Lenander observes, “In a world where analytic techniques and even data management approaches are rapidly evolving, I look to build a team of talented people that are curious, great problem solvers, and continuous learners. In this environment, it’s also important to have a strong culture of experimentation and sharing lessons learned along the way”.

I asked Lenander about the importance of having a guiding data strategy at Capital One. Lenander explained, “We have an enterprise data strategy because one of the ways we can achieve our goals is by using a common set of data platforms across the company, and to achieve that, we need data owners across the company to work back from a common strategy”. She elaborated, “One of the benefits of that strategy is to make data available to use across the company, for example, so that we can use information about customers’ experiences with their bank account to inform how we might better service their credit card account”.

Lenander recognizes that what often differentiates one company from another is the ability to execute. Lenander comments, “This is one of those things where the goals and the strategy are relatively easy to say, but the execution is very difficult”. She continues, “Our data strategy is focused on ensuring we can make the most of our data as we continue to evolve our business”.

Ultimately, the value of data comes from the ability to serve the business. As a longtime business leader at Capital One, Lenander appreciates that, “Our data strategy is all about enabling the business. My prior experience leading businesses across Capital One has given me empathy and a deep understanding of where the value is in the business and where data can make the most difference”. She notes, “The goal of our data strategy is to make data well-managed and easy to find, understand, use and govern”.

The State of Data & AI at Capital One Today

In our 2019 Forbes article on AI at Capital One, we noted that Capital One was investing in how to apply AI in its business well before most of its peers. Lenander comments, “AI raises the stakes because AI is incredibly data-hungry, uses many more types of data, including unstructured data, and can be less explainable in the relationships that models are finding within data”. She understands that great AI requires great data, noting, “That means that there’s an even bigger need with AI to ensure data is of high quality”.

Lenander observes that, “Organizations that have plentiful, well-managed data have a huge advantage in their ability to leverage AI”. She continues, “Our investments in technology and data infrastructure and talent — along with our deep experience in machine learning — have set us up to be at the forefront of enterprises leveraging AI today”.

To continue to capitalize on momentum in data and AI, Capital One has established dedicated leadership roles to drive and evolve its enterprise strategies in these domains, which include the early 2023 appointment of Prem Natarajan as Executive Vice President, Chief Scientist, and Head of Enterprise AI for Capital One. Lenander and Natarajan serve as peers who partner closely on Capital One’s data journey. Natarajan leads Capital One’s AI initiatives, and both are part of a central Enterprise organization that has leaders accountable for core domains like AI and Data, as well as for core functions such as Engineering, Science, Design, and Product. The senior leaders of the Enterprise organization collectively drive Data, AI, and the innovation agenda for Capital One in close partnership with senior leaders across the company.

I asked Natarajan about his vision for AI at Capital One. Natarajan commented, “From its inception, Capital One has had a reverence for insights derived from data. Culturally, we subscribe to data-driven methodologies for decision making”. He explains, “That has set us up very, very well for the current age of machine learning and AI”, adding, “When we think about future waves of AI and their impact, we take a cautiously optimistic view that is informed by our vision of an AI-powered enterprise in which all our associates and customers benefit from the real-time intelligent experiences and services that such an enterprise will deliver”.

Lenander adds her perspective, noting, “Data, machine learning and AI are central components of how we operate and serve our customers. Today we are embedding AI throughout our business with proprietary solutions built on our modern tech stack, and we have hundreds of use cases in production that use machine learning or AI to deliver value for our customers, associates, and the company”.

Developing Safeguards and Charting an AI Future

Natarajan is highly cognizant of the need for safeguards to ensure responsible AI use, recognizing that, as with any new and rapidly evolving technology, the potential benefits have to be balanced with a thoughtful, responsible approach that appropriately manages risk right from the start. This is especially the case due to the more powerful capabilities of modern Generative AI technologies. He comments, “We are anchoring on a test-and-learn approach to both identify the highest leverage areas for AI as well as the safest ways to deploy them while delivering benefits to our customers”.

Continuing on the topic of responsible AI, Natarajan states that across AI initiatives, Capital One is guided by a mission to build and deploy AI in a responsible, well-managed way that puts people first. Natarajan notes, “When developed responsibly, AI will continue to democratize access to a whole suite of insights, resources, and everyday conveniences across the entire social spectrum – in areas from finance to healthcare to education and more”. He continues, “Perhaps the most important safeguard is a cultural one, because it is such a strong determinant of practices and outcomes”.

Natarajan underscores the importance of thoughtful collaboration, commenting, “To maximize the benefits of AI, it is important to adopt an inclusive approach from the outset”. He adds, “A spirit of responsibility and thoughtfulness needs to pervade the entire development process, from research and experimentation to design, building, testing, and refining, through the whole development and production lifecycles”.

Capital One recognizes the need for extensive testing and implementation of human-centered guardrails before introducing AI systems into any customer or business setting. Natarajan comments, “For Capital One, this includes a Model Risk Management framework to evaluate, assess, validate, and govern models to effectively manage and mitigate risks associated with AI”. He notes that banks like Capital One have robust risk management infrastructure, oversight mechanisms, and governance capabilities that are required to manage risk and to deploy and scale AI appropriately.

Earlier this year, Capital One established the Center for AI and Responsible Financial Innovation with Columbia University and the Center for Responsible AI and Decision Making in Finance with the University of Southern California to advance state-of-the-art research in responsible AI and its application to finance. Capital One is also a partner in multisector consortiums like the Partnership on AI, where Natarajan is on the Board, and works with institutions like the National Science Foundation.

Natarajan comments, “We have a strong belief in the value of multi-sector partnerships between industry, academia, and government to ensure diverse perspectives and equities when developing, testing, and deploying AI”. He adds, “We are helping to advance research and strengthen our national capabilities in AI”.

Examples of Capital One initiatives that have been designed to measure the value of new data, analytics, and AI techniques include:

1. A proprietary generative AI agent servicing tool which is helping agents access information to resolve customer questions more quickly and efficiently. For example, if a customer calls in about a lost or misplaced credit card, agents can get them a virtual card number immediately and have a new card delivered, so their ability to spend is uninterrupted and the issue is resolved more efficiently than ever. Lenander notes, “This tool has been used thousands of times by hundreds of agents, with over 95% of search results found highly relevant by our agents”.

2. An AI model that is used to customize the user experience across digital and mobile channels to help put the most relevant information in customers’ hands. Lenander comments, “The model is driving double-digit improvement in relevance of personalization over the prior machine learning model and allows for rapid experimentation and iteration as we continue to enhance the customer experience”.

3. A proprietary fraud platform that leverages AI and machine learning to proactively surface and mitigate fraud in the time it takes a customer to swipe their card. Lenander notes, “We are continuing to experiment with new AI capabilities in fraud to stay at the leading edge of this space”.

Reflecting on the Capital One data and AI journey and how the company has evolved over the past three decades, Lenander concludes, “We have a long history of using data to drive our business strategies, and in a cloud and AI world, there is both massive opportunity and the potential for massive complexity”. She adds, “Cultivating a modern data ecosystem and being AI-ready is an ongoing journey, and we continually evolve our approach to data so that we can be best-prepared”.

Natarajan, reflecting on his mandate to advance the development and application of AI within Capital One, concludes, “Ultimately, AI is at its best when it empowers individuals and societies to achieve things that weren’t possible before – and we all need to come together to actively work towards that future”.

I look forward to seeing what comes next from Capital One as they further pioneer data, analytics, and AI leadership on the frontier of business innovation. What’s in your wallet?

Randy Bean

COMMENTS

  1. Data Science and Analytics: An Overview from Data-Driven Smart

    Introduction. We are living in the age of "data science and advanced analytics", where almost everything in our daily lives is digitally recorded as data [].Thus the current electronic world is a wealth of various kinds of data, such as business data, financial data, healthcare data, multimedia data, internet of things (IoT) data, cybersecurity data, social media data, etc [].

  2. (PDF) Data Analytics and Techniques: A Review

    The lifecycle for data analysis will help to manage and organize the tasks connected to big data research and analysis. Data Analytics evolution with big data analytics, SQL analytics, and ...

  3. Home

    Overview. The International Journal of Data Science and Analytics is a pioneering journal in data science and analytics, publishing original and applied research outcomes. Focuses on fundamental and applied research outcomes in data and analytics theories, technologies and applications. Promotes new scientific and technological approaches for ...

  4. (PDF) Data Analytics: A Literature Review Paper

    This paper aims to analyze some of the different analytics methods and tools which can be applied to big data, as well as the opportunities provided by the application of big data analytics in ...

  5. Ten Research Challenge Areas in Data Science

    Abstract. To drive progress in the field of data science, we propose 10 challenge areas for the research community to pursue. Since data science is broad, with methods drawing from computer science, statistics, and other disciplines, and with applications appearing in all sectors, these challenge areas speak to the breadth of issues spanning ...

  6. Data science: a game changer for science and innovation

    This paper shows data science's potential for disruptive innovation in science, industry, policy, and people's lives. We present how data science impacts science and society at large in the coming years, including ethical problems in managing human behavior data and considering the quantitative expectations of data science economic impact. We introduce concepts such as open science and e ...

  7. Home page

    The Journal of Big Data publishes open-access original research on data science and data analytics. Deep learning algorithms and all applications of big data are welcomed. Survey papers and case studies are also considered. The journal examines the challenges facing big data today and going forward including, but not limited to: data capture ...

  8. Big Data and Big Data Analytics: Concepts, Types and Technologies

    Big Data Analytics refers to the process of collecting, organizing, and analyzing large data sets to discover different patterns and other useful information. Big data analytics is a set of ...

  9. Big data analytics in healthcare: a systematic literature review

    2.1. Characteristics of big data. The concept of BDA overarches several data-intensive approaches to the analysis and synthesis of large-scale data (Galetsi, Katsaliaki, and Kumar 2020; Mergel, Rethemeyer, and Isett 2016). Such large-scale data derived from information exchange among different systems is often termed 'big data' (Bahri et al. 2018; Khanra, Dhir ...

  10. Top 10 Must-Read Data Science Research Papers in 2022

    Here, Analytics Insight brings you the latest Data Science Research Papers. These research papers consist of different data science topics including the present fast passed technologies such as AI, ML, Coding, and many others. Data Science plays a very major role in applying AI, ML, and Coding. With the help of data science, we can improve our ...

  11. Big Data Research

    About the journal. The journal aims to promote and communicate advances in big data research by providing a fast and high quality forum for researchers, practitioners and policy makers from the very many different communities working on, and with, this topic. The journal will accept papers on foundational aspects in …

  12. data science Latest Research Papers

    Data Science. Information Use. Regulatory Compliance. Future Research. Public And Private. Social Good. Public And Private Sector. Effective Use. Abstract: The appetite for effective use of information assets has been steadily rising in both public and private sector organisations.

  13. Big data analytics and firm performance: Findings from a mixed-method

    Several research papers demonstrate that big data analytics, when applied to problems of specific domains such as healthcare, service provision, supply chain management, ... An emerging theme in big data analytics and business value research is that companies differ in the way they operate, and thus require attention in different sets of ...

  14. A practical guide to data analysis in general literature reviews

    This article is a practical guide to conducting data analysis in general literature reviews. The general literature review is a synthesis and analysis of published research on a relevant clinical issue, and is a common format for academic theses at the bachelor's and master's levels in nursing, physiotherapy, occupational therapy, public health and other related fields.

  15. Big data analytics: a survey

    Abundant research results of data analysis [20, 27, 63] show possible solutions for dealing with the dilemmas of data mining algorithms. It means that the open issues of data analysis from the literature ... In this paper, by the data analytics, we mean the whole KDD process, while by the data analysis, we mean the part of data analytics that ...

  16. Learning to Do Qualitative Data Analysis: A Starting Point

    Yonjoo Cho is an associate professor of Instructional Systems Technology focusing on human resource development (HRD) at Indiana University. Her research interests include action learning in organizations, international HRD, and women in leadership. She serves as an associate editor of Human Resource Development Review and served as a board member of the Academy of Human Resource Development ...

  17. Data Analytics in Healthcare: A Tertiary Study

    Introduction. The purpose of data analytics in healthcare is to find new insights in data, at least partially automate tasks such as diagnosing, and to facilitate clinical decision-making [1, 2]. Higher hardware cost-efficiency and the popularization and advancement of data analysis techniques have led to data analytics gaining increasing scholarly and practical footing in the healthcare sector ...

  18. Research on Data Science, Data Analytics and Big Data

    Data Analytics has shown such tremendous growth across the globe that soon the Big Data market revenue is expected to grow by 50 percent. Impact on various sectors like Traveling and transportation, Financial analysis, Retail, Research, Energy management, Healthcare. Nadikattu, Rahul Reddy and Nadikattu, Rahul Reddy, Research on Data Science ...

  19. Data Science and Analytics: An Overview from Data-Driven Smart

    The digital world has a wealth of data, such as internet of things (IoT) data, business data, health data, mobile data, urban data, security data, and many more, in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). Extracting knowledge or useful insights from these data can be used for smart decision-making in various applications domains. In the area of data science ...

  20. Different Types of Data Analysis; Data Analysis Methods and ...

    This article is concentrated to define data analysis and the concept of data preparation. Then, the data analysis methods will be discussed. For doing so, the f ... Hamed, Different Types of Data Analysis; Data Analysis Methods and Techniques in Research Projects (August 1, 2022). ... Research Paper Series; Conference Papers; Partners in ...

  21. InfoGuides: Quantative Analysis & Statistics: Write a Paper

    Reporting Quantitative Research in Psychology: How to meet APA Style Journal Article Reporting Standards by Harris Cooper. This updated edition offers practical guidance for understanding and implementing APA Style Journal Article Reporting Standards (JARS) and Meta-Analysis Reporting Standards (MARS) for quantitative research. These standards provide the essential information researchers ...

  22. 4 high-value use cases for synthetic data in healthcare

    But strategies to advance big data analytics hinge on the availability, quality and accessibility of data, which can create barriers for healthcare organizations. Synthetic data -- artificially generated information not taken from real-world sources -- has been proposed as a potential solution to many of healthcare's data woes, but the approach comes with a host of pros and cons.

  23. (PDF) Different Types of Data Analysis; Data Analysis Methods and

    Data analysis is simply the process of converting the gathered data to meaningful information. Different techniques such as modeling to reach trends, relationships, and therefore conclusions to ...

  24. [2408.06292] The AI Scientist: Towards Fully Automated Open-Ended

    One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aids to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first ...

  25. Data science and big data analytics: a systematic review of ...

    Data science and big data analytics (DS&BDA) methodologies and tools are used extensively in supply chains and logistics (SC&L). However, the existing insights are scattered over different literature sources and there is a lack of a structured and unbiased review methodology to systematise DS&BDA application areas in the SC&L comprehensively covering efficiency, resilience and ...

  26. What are the Benefits of Data Analytics in the Business World?

    Data analytics has the potential to transform every aspect of a business. Leveraging data analytics is no longer just an advantage but a necessity for sustainable growth and success. By investing in data analytics capabilities, businesses can unlock new opportunities, mitigate risks, and achieve their strategic objectives more effectively.

  27. Working Papers

    Analysis. The FDIC is proud to be a pre-eminent source of U.S. banking industry research, including quarterly banking profiles, working papers, and state banking performance data. Browse our extensive research tools and reports. More FDIC Analysis

  28. MDMA assisted therapy: Three papers are retracted as FDA rejects PTSD

    Three research papers on MDMA assisted psychotherapy have been retracted by the journal Psychopharmacology because of "protocol violations amounting to unethical conduct" by researchers at a study site. In the retraction notices the journal said that the authors were aware of the violations when they submitted the articles but failed to disclose them or remove the affected data from their ...

  29. Decentralized food safety and authentication on cellulose paper‐based

    Food safety and authenticity analysis play a pivotal role in guaranteeing food quality, safeguarding public health, and upholding consumer trust. In recent years, significant social progress has presented fresh challenges in the realm of food analysis, underscoring the imperative requirement to devise innovative and expedient approaches for ...

  30. Capital One: The Ongoing Story Of How One Firm Has Been Pioneering Data

    Reflecting on the Capital One data and AI journey and how the company has evolved over the past three decades, Lenander concludes, "We have a long history of using data to drive our business ...