Feature selection techniques for machine learning: a survey of more than two decades of research

Title: Feature Selection Tutorial with Python Examples

Abstract: In Machine Learning, feature selection entails selecting a subset of the available features in a dataset to use for model development. There are many motivations for feature selection: it may result in better models, it may provide insight into the data, and it may deliver economies in data gathering or data processing. For these reasons, feature selection has received a lot of attention in data analytics research. In this paper, we provide an overview of the main methods and present practical examples with Python implementations. While the main focus is on supervised feature selection techniques, we also cover some feature transformation methods.
Comments: 20 pages, 19 figures
Subjects: Machine Learning (cs.LG)

REVIEW article

A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction

Nicholas Pudjihartono

  • 1 Liggins Institute, University of Auckland, Auckland, New Zealand
  • 2 Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
  • 3 Department of Engineering Science, The University of Auckland, Auckland, New Zealand
  • 4 MRC Lifecourse Epidemiology Unit, University of Southampton, Southampton, United Kingdom
  • 5 Singapore Institute for Clinical Sciences, Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
  • 6 Australian Parkinson’s Mission, Garvan Institute of Medical Research, Sydney, NSW, Australia

Machine learning has shown utility in detecting patterns within large, unstructured, and complex datasets. One of the promising applications of machine learning is in precision medicine, where disease risk is predicted using patient genetic data. However, creating an accurate prediction model based on genotype data remains challenging due to the so-called “curse of dimensionality” (i.e., extensively larger number of features compared to the number of samples). Therefore, the generalizability of machine learning models benefits from feature selection, which aims to extract only the most “informative” features and remove noisy “non-informative,” irrelevant and redundant features. In this article, we provide a general overview of the different feature selection methods, their advantages, disadvantages, and use cases, focusing on the detection of relevant features (i.e., SNPs) for disease risk prediction.

1 Introduction

1.1 Precision Medicine and Complex Disease Risk Prediction

The advancement of genetic sequencing technology over the last decade has re-ignited interest in precision medicine and the goal of providing healthcare based on a patient’s individual genetic features ( Spiegel and Hawkins, 2012 ). Prediction of complex disease risk (e.g., type 2 diabetes, obesity, cardiovascular diseases, etc…) is emerging as an early success story. Successful prediction of individual disease risk has the potential to aid in disease prevention, screening, and early treatment for high-risk individuals ( Wray et al., 2007 ; Ashley et al., 2010 ; Manolio, 2013 ).

Genome-wide association studies (GWAS) have identified single nucleotide polymorphisms (SNPs) within the human genome that are associated with complex diseases at the population level ( Altshuler et al., 2008 ; Donnelly, 2008 ; Hindorff et al., 2009 ). However, most of the SNPs that have been associated with phenotypes have small effect sizes ( Visscher et al., 2017 ), and collectively they only explain a fraction of the estimated heritability for each phenotype ( Makowsky et al., 2011 ). This is known as the missing heritability problem. One possible explanation for the missing heritability is that GWAS typically utilize univariate filter techniques (such as the χ2 test) to evaluate each SNP’s association with the phenotype separately ( Han et al., 2012 ). While univariate filter techniques are popular because of their simplicity and scalability, they do not account for the complex interactions between SNPs (i.e., epistasis effects). Ignoring interactions amongst genetic features might explain a significant portion of the missing heritability of complex diseases ( Maher, 2008 ; König et al., 2016 ). Furthermore, being population-based, GWAS do not provide a model for predicting individual genetic risk. Thus, translation of GWAS associations to individualized risk prediction requires quantification of the predictive utility of the SNPs that are identified. Typically, genetic risk prediction models are built by: 1) polygenic risk scoring; or 2) machine learning (ML) ( Abraham and Inouye, 2015 ).

1.2 Machine Learning for Individualized Complex Disease Risk Prediction

ML-based approaches are a potentially effective way of predicting individualized disease risk ( Figure 1 ). Unlike other popular predictive models (e.g., Polygenic Risk Scores, which use a fixed additive model), ML has the potential to account for complex interactions between features (i.e. SNP-SNP interaction) ( Ho et al., 2019 ). ML algorithms utilize a set of advanced function-approximation algorithms (e.g., support-vector machine, random forests, K-nearest neighbor, artificial neural network, etc…) to create a model that maps the association between a set of risk SNPs and a particular phenotype ( Kruppa et al., 2012 ; Mohri et al., 2018 ; Uddin et al., 2019 ). Thus, a patient’s genotype data can be used as an input to the predictive ML algorithm to predict their risk for developing a disease ( Figure 1B ).


FIGURE 1 . (A) Generalized workflow for creating a predictive ML model from a genotype dataset. (B) The final model can then be used for disease risk prediction.

The prediction of disease risk using SNP genotype data can be considered as a binary classification problem within supervised learning. There is a generalized workflow for creating a predictive ML model from a case-control genotype dataset ( Figure 1A ). The first step is data pre-processing, which includes quality control and feature selection ( Figure 1A , step 1). Quality control includes, but is not limited to, removing low-quality SNPs (e.g., those with low call rates or that deviate from the Hardy-Weinberg Equilibrium) and samples (e.g., individuals with missing genotypes). SNPs with low minor allele frequency (e.g., less than 0.01) can also be removed. Feature selection reduces the training dataset’s dimensionality by choosing only features that are relevant to the phenotype. Feature selection is crucial in order to produce a model that generalizes well to unseen cohorts (see Section 1.3). The goal of data pre-processing is to produce a high-quality dataset with which to train the prediction model.

The second step in a generalized predictive ML modelling workflow is the selection of the specific learning algorithm and setting the learning parameters (i.e. the “hyperparameters”) ( Figure 1A , step 2). Hyperparameters are algorithm-specific parameters whose values are set before training. Examples include the number of trees in a random forest, the type of kernel in an SVM, or the number of hidden layers in an artificial neural network. Different learning algorithms use different hyperparameters, and their values affect the complexity and learning behaviour of the model.

Once the hyperparameters have been set, the pre-processed dataset is used to train the chosen algorithm ( Figure 1A , step 3). This training step allows the algorithm to “learn” the association between the features (i.e., SNPs) and the class labels (i.e., phenotype status). Once learnt, the trained model’s predictive performance (e.g. accuracy, precision, AUC) is validated ( Figure 1A , step 4). This is typically performed by K-fold cross-validation to estimate the model’s performance on unseen data. Cross-validation on unseen data ensures that the trained model does not overfit the training data. During cross-validation, the training dataset is equally split into K parts, and each part will be used as a validation/testing set. For example, in 5-fold (K = 5) cross-validation, the dataset is divided into 5 equal parts. The model is then trained on four of these parts and the performance is tested on the one remaining part. This process is repeated five times until all sections have been used as the testing set. The average performance of the model across all testing sets is then calculated.
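
The cross-validation loop described in this paragraph is straightforward to express in code. The following is a minimal sketch, assuming scikit-learn is available; the genotype matrix and phenotype labels are randomly generated toy values (not data from any study cited here), so the resulting AUC will hover around 0.5.

# Minimal 5-fold cross-validation sketch (scikit-learn assumed; toy data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 500))   # toy genotype matrix: 200 samples x 500 SNPs, coded 0/1/2
y = rng.integers(0, 2, size=200)          # binary phenotype label (case/control)

clf = RandomForestClassifier(n_estimators=300, random_state=0)      # hyperparameters set before training
auc_per_fold = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")  # 5-fold cross-validated AUC
print(auc_per_fold.mean())                # average performance across the five held-out folds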

The estimated model performance from cross-validation can be used as a guide for iterative refinement. During iterative refinement, different aspects of the model building process (steps 1–4) are repeated and refined. For example, different hyperparameters (hyperparameter tuning), learning algorithms, feature selection methods, or quality control thresholds can all be tried. The combination that produces the best average performance (in cross-validation) is chosen to build the final classification model. The process of selecting the best model development pipeline is known as model selection. The final classification model can then be tested against an independent (external) dataset to confirm the model’s predictive performance, and finally be used for disease risk prediction ( Figure 1B ).
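
As a hedged illustration of the hyperparameter tuning step within model selection, the sketch below runs a small grid search with an inner 5-fold cross-validation (scikit-learn assumed); the grid values and the toy data are illustrative assumptions, not settings recommended by the authors.

# Hyperparameter tuning by grid search with inner cross-validation (toy sketch).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 500))   # toy genotype matrix
y = rng.integers(0, 2, size=200)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5]}   # illustrative grid only
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)                          # every combination is scored by 5-fold cross-validation
print(search.best_params_, search.best_score_)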

1.3 Feature Selection to Reduce SNP Data Dimensionality

Overcoming the curse of dimensionality is one of the biggest challenges in building an accurate predictive ML model from high dimensional data (e.g. genotype or GWAS data). For example, a typical case-control genotype dataset used in a GWAS can contain up to a million SNPs and only a few thousand samples ( Szymczak et al., 2009 ). Using such data directly to train the ML classification algorithms is likely to generate an overfitted model, which performs well on the training data but poorly on unseen data. Overfitting happens when the model picks up the noise and random fluctuations in the training data as a learned concept. Furthermore, the excessive number of features increases the learning and computational time significantly because the irrelevant and redundant features clutter the learning algorithm ( Yu and Liu, 2004 ).

Feature selection is a common way to minimize the problem of excessive and irrelevant features ( Figure 2 ). Generally, feature selection methods reduce the dimensionality of the training data by excluding SNPs that: 1) have low or negligible predictive power for the phenotype class; and 2) are redundant to each other ( Okser et al., 2014 ). Effective feature selection can increase learning efficiency, predictive accuracy, and reduce the complexity of the learned results ( Koller and Sahami, 1996 ; Kohavi and John, 1997 ; Hall, 2000 ). Furthermore, the SNPs that are incorporated into the predictive model (following feature selection) are typically assumed to be associated with loci that are mechanistically or functionally related to the underlying disease etiology ( Pal and Foody, 2010 ; López et al., 2018 ). Therefore, extracting a subset of the most relevant features (through feature selection) could help researchers to understand the biological process(es) that underlie the disease ( Cueto-López et al., 2019 ). In this context, feature selection can be said to be analogous to the identification of SNPs that are associated with phenotypes in GWAS.


FIGURE 2 . Illustration of the feature selection process. (A) The original dataset may contain an excessive number of features and a lot of irrelevant SNPs. (B) Feature selection reduces the dimensionality of the dataset by excluding irrelevant features and including only those features that are relevant for prediction. The reduced dataset contains relevant SNPs (rSNPs) which can be used to train the learning algorithm. Nₒ: original number of features; Nᵣ: number of remaining relevant SNPs.

1.4 The Problem of Feature Redundancy and Feature Interaction in SNP Genotype Dataset

GWAS typically identify multiple SNPs close to each other within a genetic window to be associated with a disease ( Broekema et al., 2020 ). This occurs because of linkage disequilibrium (LD), which is the correlation between nearby variants such that they are inherited together within a population more often than by random chance ( Figure 3 ). In ML and prediction contexts, these highly correlated SNPs can be considered redundant because they carry similar information and can substitute for each other. The inclusion of redundant features has been shown to degrade ML performance and increase computation time ( Kubus, 2019 ; Danasingh et al., 2020 ). Therefore, ideally, feature selection techniques should select one SNP (e.g., the SNP with the highest association score) to represent the entire LD cluster as a feature for prediction. However, since the SNP with the highest association signal is not necessarily the causal variant of that locus ( Onengut-Gumuscu et al., 2015 ), geneticists often link an association signal to the locus they belong to rather than the SNP itself ( Brzyski et al., 2017 ). If a researcher aims to identify the true causal variant within an association locus then fine-mapping techniques must be employed (see ( Spain and Barrett, 2015 ; Broekema et al., 2020 ))


FIGURE 3 . Lead SNPs in GWAS studies need not be the causal variant due to linkage disequilibrium. Illustration of GWAS result where SNPs (circles) are colored according to linkage disequilibrium (LD) strength with the true causal variant within the locus (indicated with a black star). Due to LD, several SNPs near the true causal variant may show a statistically significant association with the phenotype. In ML, these highly correlated SNPs can be considered redundant to each other, therefore only one representative SNP for this LD cluster is required as a selected feature. In this example, the causal variant is not the variant with the strongest GWAS association signal.

Relevant features may appear irrelevant (or weakly relevant) on their own but are highly correlated to the class in the presence of other features. This situation arises because these features are only relevant to the phenotype when they interact with other features (i.e., they are epistatic). Figure 4 shows a simplified example of a feature interaction that arises because of epistasis. In this example, there is an equal number of SNP1 = AA, Aa, or aa in cases and controls, which means that SNP1 does not affect the distribution of the phenotype class. The same is true for SNP2. However, the allele combinations between SNP1 and SNP2 do affect phenotype distribution. For example, there are more combinations of SNP1 = AA and SNP2 = AA in cases than controls, consistent with this allele combination conferring increased risk ( Figure 4B ).


FIGURE 4 . The functional impacts of SNPs can interact and may be epistatic. (A) Individually, neither SNP1 nor SNP2 affects phenotype distribution. (B) Taken together, allele combinations between SNP1 and SNP2 can affect phenotype distribution (marked with a yellow star).

It is generally advisable to consider both feature redundancy and feature interaction during feature selection. This is especially true when dealing with genotype data, where linkage disequilibrium (LD) and the non-random association of alleles create redundant SNPs within loci. Moreover, complex epistatic interactions between SNPs can account for some of the missing heritability of complex diseases and should be considered when undertaking feature selection. Indeed, studies have demonstrated the benefits to predictive power of ML approaches that consider feature interactions when compared to those that only consider simple additive risk contributions ( Couronné et al., 2018 ; Ooka et al., 2021 ). However, searching for relevant feature interactions undoubtedly comes with additional computational costs. As such, deciding whether to search for relevant interactions is a problem-specific question that depends upon the nature of the input data and the a priori assumptions about the underlying mechanisms of the disease. For example, if the genetic data originates from whole-genome sequencing (WGS), or a genotyping array, and the target phenotype is a complex disease (i.e., best explained by non-linear interactions between loci), then using a feature selection approach that considers interactions will be beneficial. By contrast, if the input genetic data does not uniformly cover the genome (i.e., the density of the SNPs is much higher in known disease-associated loci; e.g., the Immunochip genotyping array), then interactions may not aid the selection, as the lack of data means potentially important interactions with SNPs outside known disease-associated loci will be missed. Furthermore, not all diseases are recognized as involving complex epistatic effects. In such cases, searching for feature interactions might lead to additional computational complexity without obvious predictive benefits. For example, Romagnoni et al. ( Romagnoni et al., 2019 ) reported that searching for possible epistatic interactions did not yield a significant increase in predictive accuracy for Crohn’s disease. Notably, the authors concluded that epistatic effects might make limited contributions to the genetic architecture of Crohn’s disease, and the use of the Immunochip genotyping array might have caused interaction effects with SNPs outside of the known autoimmune risk loci to have been missed.

The goal of feature selection is to select a minimum subset of features (which includes individually relevant and interacting features) that can be used to explain the different classes with as little information loss as possible ( Yu and Liu, 2004 ). It is possible that there are multiple possible minimum feature subsets due to redundancies. Thus, this is “a minimum subset” and not “the minimum set.”

In the remainder of this manuscript we discuss the advantages and disadvantages of representative filter, wrapper, and embedded methods of feature selection (Section 2). We then assess expansions of these feature selection methods (e.g. hybrid, ensemble, and integrative methods; Sections 3.1–3.2) and exhaustive search methods for higher-order (≥3) SNP-SNP interaction/epistasis effects (Section 4).

2 Feature Selection Techniques

The feature selection methods that are routinely used in classification can be split into three methodological categories ( Guyon et al., 2008 ; Bolón-Canedo et al., 2013 ): 1) filters; 2) wrappers; and 3) embedded methods ( Table 1 ). These methods differ in terms of 1) the feature selection aspect being separate or integrated as a part of the learning algorithm; 2) evaluation metrics; 3) computational complexities; 4) the potential to detect redundancies and interactions between features. The particular strengths and weaknesses of each methodological category mean they are more suitable for particular use cases ( Saeys et al., 2007 ; Okser et al., 2013 ; De et al., 2014 ; Remeseiro and Bolon-Canedo, 2019 ) ( Table 1 ).


TABLE 1 . Strengths, weaknesses, and examples of the three main feature selection categories.

2.1 Filter Methods for Feature Selection

Filter methods use feature ranking as the evaluation metric for feature selection. Generally, features are ranked based on their scores in various statistical tests for their correlation with the class. Features that score below a certain threshold are removed, while features that score above it are selected. Once a subset of features is selected, it can then be presented as an input to the chosen classifier algorithm. Unlike the other feature selection methods (wrapper and embedded), filter methods are independent/separate from the classifier algorithm ( Figure 5A ). This separation means that filter methods are free from the classifier’s bias, which reduces overfitting. However, this independence also means that interaction with the classifier is not considered during feature selection ( John et al., 1994 ). Thus, the selected feature set is more general and not fine-tuned to any specific classifier ( Zhang et al., 2013 ). This lack of tuning means that filter methods tend to produce models that have reduced predictive performance compared to those produced by wrapper or embedded methods. The main advantage of filter methods over other feature selection methods is that they are generally less computationally demanding, and thus can easily be scaled to very high dimensional data (e.g. SNP genotype datasets).


FIGURE 5 . Generalized illustrations of methods. (A) Schematic of filter method, where feature selection is independent of the classifier. (B) The wrapper method. Feature selection relies on the performance of the classifier algorithm on the various generated feature subsets. (C) The embedded method. In embedded methods, feature selection is integrated as a part of the classifier algorithm. (D) Hybrid methods. In hybrid methods, features are reduced through the application of a filter method before the reduced feature set is passed through a wrapper or embedded method to obtain the final feature subset. (E) Integrative methods. In integrative methods, external information is used as a filter to reduce feature search space before the reduced feature set is passed through a wrapper or embedded method to obtain the final feature subset.

Existing filter methods can be broadly categorized as either univariate or multivariate. Univariate methods test each feature individually, while multivariate methods consider a subset of features simultaneously. Due to their speed and simplicity, univariate methods (e.g., the χ2 test, Fisher’s exact test, information gain, Euclidean distance, Pearson correlation, the Mann-Whitney U test, the t-test, etc.) have attracted the most attention in fields that work with high dimensional datasets ( Saeys et al., 2007 ; Bolón-Canedo et al., 2014 ). However, since each feature is considered separately, univariate methods only focus on feature relevance and cannot detect feature redundancy or interactions. This decreases model predictive performance because: 1) the inclusion of redundant features makes the feature subset larger than necessary; and 2) ignoring feature interactions can lead to the loss of important information.
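
A univariate filter of the kind described above can be sketched in a few lines, assuming scikit-learn: the χ2 statistic scores each SNP independently of the others and the top k are retained. The data and the choice of k are toy assumptions.

# Univariate filter: score each SNP with a chi-squared test and keep the top k.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 500))   # toy genotype matrix (0/1/2 allele counts)
y = rng.integers(0, 2, size=200)

selector = SelectKBest(score_func=chi2, k=50)    # k is a user-chosen threshold (see Section 2.1.1)
X_reduced = selector.fit_transform(X, y)         # each SNP is scored independently; no redundancy check
print(X_reduced.shape)                           # (200, 50)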

More advanced multivariate filter techniques, including mutual information feature selection (MIFS) ( Battiti, 1994 ), minimal-redundancy-maximal-relevance (mRMR) ( Peng et al., 2005 ), conditional mutual information maximization (CMIM) ( Schlittgen, 2011 ), and fast correlation-based filter (FCBF) ( Yu and Liu, 2004 ), have been developed to detect relevant features and eliminate redundancies between features without information loss. Other algorithms like BOOST ( Wan et al., 2010 ), FastEpistasis ( Schüpbach et al., 2010 ), and TEAM ( Zhang et al., 2010 ) have been designed to exhaustively search for all possible feature interactions. However, they are restricted to two-way (pairwise) interactions and they cannot eliminate redundancy. More recent algorithms (e.g., the feature selection based on relevance, redundancy and complementarity [FS-RRC] ( Li et al., 2020 ), Conditional Mutual Information-based Feature Selection considering Interaction [CMIFSI] ( Liang et al., 2019 )) have been demonstrated to be able to detect feature interactions and eliminate redundancies. However, again, they are mostly constrained to pair-wise feature interactions. Another popular family of filter algorithms is the Relief-based algorithm (RBA) family (e.g., Relief ( Kira and Rendell, 1992 ), ReliefF ( Kononenko, 1994 ), TURF ( Moore and White, 2007 ), SURF ( Greene et al., 2009 ), SURF* ( Greene et al., 2010 ), MultiSURF ( Urbanowicz et al., 2018a ), MultiSURF* ( Granizo-Mackenzie and Moore, 2013 ), etc…). Relief does not exhaustively search for feature interactions. Instead, it scores the importance of a feature according to how well the feature’s value distinguishes samples that are similar to each other (e.g., similar genotype) but belong to different classes (e.g., case and control). Notably, RBAs can detect pair-wise feature interactions; some RBAs (e.g., ReliefF, MultiSURF) can even detect higher-order (>2-way) interactions ( Urbanowicz et al., 2018a ). However, RBAs cannot eliminate redundant features. Different RBAs have been reviewed and compared previously ( Urbanowicz et al., 2018a ; Urbanowicz et al., 2018b ).
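
To make the minimal-redundancy-maximal-relevance idea concrete, the following is a deliberately simplified, hedged sketch of greedy mRMR-style selection (not the reference mRMR implementation): at each step it keeps the SNP whose mutual information with the class, minus its mean mutual information with the SNPs already selected, is largest. scikit-learn is assumed and the data are toy values.

# Simplified greedy mRMR-style selection using mutual information (toy sketch).
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 100))          # toy genotype matrix (discrete 0/1/2)
y = rng.integers(0, 2, size=200)

relevance = mutual_info_classif(X, y, discrete_features=True, random_state=0)
selected = [int(np.argmax(relevance))]           # start with the single most relevant SNP
while len(selected) < 10:
    scores = []
    for j in range(X.shape[1]):
        if j in selected:
            scores.append(-np.inf)
            continue
        redundancy = np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
        scores.append(relevance[j] - redundancy)  # mRMR criterion (difference form)
    selected.append(int(np.argmax(scores)))
print(selected)                                   # indices of the 10 selected SNPs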

Despite their advantages, multivariate methods are more computationally demanding than univariate methods and thus cannot be scaled as effectively to very high dimensional data. Furthermore, multivariate filters still suffer from some of the same limitations as univariate filters due to their independence from the classifier algorithm (i.e., they ignore interaction with the classifier). In this context, wrapper and embedded methods represent an alternative way to perform multivariate feature selection while allowing for interactions with the classifier, although again there is a computational cost (see Sections 2.2, 2.3).

2.1.1 The Multiple Comparison Correction Problem and Choosing the Appropriate Filter Threshold

Filter methods often return a ranked list of features rather than an explicit best subset of features (as occurs in wrapper methods). For example, univariate statistical approaches like the χ2 test and Fisher’s exact test rank features based on p value. Due to the large number of hypothesis tests made, relying on the usual statistical significance threshold of p < 0.05 will result in a preponderance of type 1 errors (false positives). As an illustration, if we perform hypothesis tests on 1 million SNPs at a p value threshold <0.05, we can expect around 50,000 false positives, which is a considerable number. Therefore, choosing an appropriate threshold for relevant features adds a layer of complexity to predictive modelling when using feature selection methods that return ranked feature lists.

For methods that return a p value, the p value threshold is commonly adjusted by controlling for FWER (family-wise error rate) or FDR (false discovery rate). FWER is the probability of making at least one type 1 error across all tests performed (i.e., 5% FWER means there is 5% chance of making at least one type 1 error across all hypothesis tests). FWER can be controlled below a certain threshold (most commonly <5%) by applying a Bonferroni correction ( Dunn, 1961 ). The Bonferroni correction works by dividing the desired probability of type 1 error p (e.g., p < 0.05) by the total number of independent hypotheses tested. This is a relatively conservative test that assumes that all the hypotheses being tested are independent of each other. However, this assumption is likely to be violated in genetic analyses where SNPs that are close to each other in the linear DNA sequence tend to be highly correlated due to LD ( Figure 3 ). Thus, the effective number of independent hypothesis tests is likely to be smaller than the number of SNPs examined. Not taking LD into account will lead to overcorrection for the number of tests performed. For example, the most commonly accepted p value threshold used in GWAS ( p < 5 × 10 −8 ) is based on a Bonferroni correction on all independent common SNPs after taking account of the LD structure of the genome ( Dudbridge and Gusnanto, 2008 ; Xu et al., 2014 ). Despite its widespread use in GWAS, this threshold has been criticized for being too conservative, leading to excessive false negatives ( Panagiotou and Ioannidis, 2012 ). Panagiotou et al. ( Panagiotou and Ioannidis, 2012 ) noted that a considerable number of legitimate and replicable associations can have p values just above this threshold; therefore, a possible relaxation of this commonly accepted threshold has been suggested.

Alternatively, one can apply p value adjustment to control for FDR instead of FWER. Controlling for FDR is a less stringent metric than controlling for FWER because it is the allowed proportion of false positives among all positive findings (i.e., 5% FDR means that approximately 5% of all positive findings are false). Despite potentially including more false positives in the selected features, FDR has been shown to be more attractive if prediction (rather than inference) is the end goal ( Abramovich et al., 2006 ).

FDR can be controlled by applying the Benjamini-Hochberg (B-H) procedure ( Benjamini and Hochberg, 1995 ). However, like the Bonferroni correction, the B-H procedure assumes independent hypothesis tests. To satisfy this assumption, for example, Brzyski et al. (2017) proposed a strategy that clusters tested SNPs based on LD before applying B-H. Alternatively, there also exist procedures that control FDR without making any independence assumptions, such as the Benjamini-Yekutieli (B-Y) procedure ( Benjamini and Yekutieli, 2001 ). However, the B-Y procedure is more stringent, leading to less power compared to procedures that assume independence like B-H ( Farcomeni, 2008 ).
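
The corrections discussed in this subsection can be applied to a vector of per-SNP p values in a few lines. The sketch below assumes the statsmodels package is available and uses randomly generated toy p values.

# Multiple-testing correction of per-SNP p values (statsmodels assumed; toy data).
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
pvals = rng.uniform(size=100_000)   # toy p values, one per SNP

reject_fwer, p_bonf, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")  # controls FWER
reject_bh, p_bh, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")          # B-H, controls FDR
reject_by, p_by, _, _ = multipletests(pvals, alpha=0.05, method="fdr_by")          # B-Y, no independence assumption
print(reject_fwer.sum(), reject_bh.sum(), reject_by.sum())   # SNPs retained under each correction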

The question remains, when applying a Bonferroni, B-H or B-Y correction, which FWER/FDR threshold is optimum (e.g., 5, 7, or 10%)? In a ML context, this threshold can be viewed as a hyperparameter. Thus, the optimum threshold that produces the best performance can be approximated by cross-validation as a part of the model selection process ( Figure 1A , step 5). The threshold for feature selection methods that do not directly produce a p value (e.g., multivariate algorithms like mRMR ( Peng et al., 2005 )) can also be chosen using cross validation (e.g. by taking the top n SNPs as the selected features).
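
Treating the filter threshold as a hyperparameter can be sketched as follows, assuming scikit-learn: the number of retained SNPs, k, is tuned by cross-validation inside a pipeline so that selection is re-run within each training fold. The candidate values of k and the toy data are illustrative only.

# Choosing the number of selected SNPs (k) by cross-validation (toy sketch).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 500))   # toy genotype matrix
y = rng.integers(0, 2, size=200)

pipe = Pipeline([("filter", SelectKBest(chi2)),                    # selection happens inside each CV fold
                 ("clf", LogisticRegression(max_iter=1000))])
search = GridSearchCV(pipe, {"filter__k": [10, 50, 100]}, cv=5, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_)                # the k that gave the best cross-validated AUC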

2.2 Wrapper Methods for Feature Selection

In contrast to filter methods, wrapper methods use the performance of the chosen classifier algorithm as a metric to aid the selection of the best feature subset ( Figure 5B ). Thus, wrapper methods identify the best-performing set of features for the chosen classifier algorithm ( Guyon and Elisseeff, 2003 ; Remeseiro and Bolon-Canedo, 2019 ). This is the main advantage of wrapper methods, and has been shown to result in higher predictive performance than can be obtained with filter methods ( Inza et al., 2004 ; Wah et al., 2018 ; Ghosh et al., 2020 ). However, exhaustive searches of the total possible feature combination space are computationally infeasible ( Bins and Draper, 2001 ). Therefore, heuristic search strategies across the space of possible feature subsets must be defined (e.g., randomized ( Mao and Yang, 2019 ), sequential search ( Xiong et al., 2001 ), genetic algorithm ( Yang and Honavar, 1998 ; Li et al., 2004 ), ant colony optimization ( Forsati et al., 2014 ), etc…) to generate a subset of features. A specific classification algorithm is then trained and evaluated using the generated feature subsets. The classification performances of the generated subsets are compared, and the subset that results in the best performance [typically estimated using AUC (area under the receiver operating characteristic curve)] is chosen as the optimum subset. Practically, any search strategy and classifier algorithm can be combined to produce a wrapper method.
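
A minimal wrapper sketch is given below, using scikit-learn's greedy forward sequential search as the search strategy; other strategies (genetic algorithms, particle swarm optimization, etc.) follow the same pattern of generating candidate subsets and scoring them with the chosen classifier. The classifier, subset size, and toy data are illustrative assumptions.

# Wrapper feature selection: forward search scored by classifier AUC under cross-validation.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 30))    # wrappers are usually only practical on small feature sets
y = rng.integers(0, 2, size=200)

wrapper = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=5,
                                    direction="forward", scoring="roc_auc", cv=5)
wrapper.fit(X, y)
print(wrapper.get_support(indices=True))  # indices of the selected feature subset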

Wrapper methods implicitly take into consideration feature dependencies, including interactions and redundancies, during the selection of the best subset. However, due to the high number of computations required to generate the feature subsets and evaluate them, wrapper methods are computationally heavy (relative to filter and embedded methods) ( Chandrashekar and Sahin, 2014 ). As such, applying wrapper methods to SNP genotype data is usually not favored, due to the very high dimensionality of SNP data sets ( Kotzyba-Hibert et al., 1995 ; Bolón-Canedo et al., 2014 ).

Wrapper methods are dependent on the classifier used. Therefore, there is no guarantee that the selected features will remain optimum if another classifier is used. In some cases, using classifier performance as a guide for feature selection might produce a feature subset with good accuracy within the training dataset but poor generalizability to external datasets (i.e., more prone to overfitting) ( Kohavi and John, 1997 ).

Unlike filter methods, which produce a ranked list of features, wrapper methods produce a “best” feature subset as the output. This has both advantages and disadvantages. One advantage is that the user does not need to determine an optimal threshold or number of features to select (because the output is already a feature subset). The disadvantage is that it is not immediately obvious which features are relatively more important within the set. Overall, this means that although wrapper methods can produce better classification performance, they are less useful in exposing the relationship between the features and the class.

2.3 Embedded Methods for Feature Selection

In an embedded method, feature selection is integrated or built into the classifier algorithm. During the training step, the classifier adjusts its internal parameters and determines the appropriate weights/importance given for each feature to produce the best classification accuracy. Therefore, the search for the optimum feature subset and model construction in an embedded method is combined in a single step ( Guyon and Elisseeff, 2003 ) ( Figure 5C ). Some examples of embedded methods include decision tree-based algorithms (e.g., decision tree, random forest, gradient boosting), and feature selection using regularization models (e.g., LASSO or elastic net). Regularization methods usually work with linear classifiers (e.g., SVM, logistic regression) by penalizing/shrinking the coefficient of features that do not contribute to the model in a meaningful way ( Okser et al., 2013 ). It should be noted that like many filter methods, decision tree-based and regularization methods mentioned above also return a ranked list of features. Decision tree-based algorithms rank feature importance based on metrics like the Mean Decrease Impurity (MDI) ( Louppe et al., 2013 ). For regularization methods, the ranking of features is provided by the magnitude of the feature coefficients.
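
Two embedded selectors of the kind listed above can be sketched as follows, assuming scikit-learn: an L1-penalized (LASSO-style) logistic regression, whose surviving non-zero coefficients define the selected set, and a random forest, whose impurity-based importances provide a feature ranking. The regularization strength and the toy data are illustrative assumptions.

# Embedded selection: L1-penalized logistic regression and random-forest importances (toy sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 100)).astype(float)
y = rng.integers(0, 2, size=200)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(lasso.coef_[0])          # SNPs with non-zero coefficients survive the penalty

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]   # Mean Decrease Impurity ranking
print(len(kept), ranking[:10])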

Embedded methods are an intermediate solution between filter and wrapper methods in the sense that the embedded methods combine the qualities of both methods ( Guo et al., 2019 ). Specifically, like filter methods, embedded methods are computationally lighter than wrapper methods (albeit still more demanding than filter methods). This reduced computational load occurs even though the embedded method allows for interactions with the classifier (i.e., it incorporates classifier’s bias into feature selection, which tends to produce better classifier performance) as is done for wrapper methods.

Some embedded methods (e.g., random forest and other decision tree-based algorithms) do allow for feature interactions. Notably, unlike most multivariate filters, tree-based approaches can consider higher-order interactions (i.e., more than two). Historically, random forest has rarely been applied directly to whole-genome datasets due to computational and memory constraints ( Szymczak et al., 2009 ; Schwarz et al., 2010 ). For example, it has been shown that the original Random Forest algorithm (developed by Breiman and Cutler, 2004) can be applied to analyze no more than 10,000 SNPs ( Schwarz et al., 2010 ). Indeed, many applications of random forest have focused on low-dimensional datasets. For example, Bureau et al. ( Bureau et al., 2005 ) identified relevant SNPs from a dataset of just 42 SNPs. Lopez et al. ( López et al., 2018 ) implemented a random forest algorithm to identify relevant SNPs from a dataset that contains a total of 101 SNPs that have been previously associated with type 2 diabetes.

Nevertheless, recent advances in computational power, together with optimizations and modifications of the random forest algorithm (e.g., the Random Jungle ( Schwarz et al., 2010 )) have resulted in efficiency gains that enable it to be applied to whole-genome datasets. However, studies have indicated that the effectiveness of random forest to detect feature interactions declines as the number of features increases, thus limiting the useful application of random forest approaches to highly dimensional datasets ( Lunetta et al., 2004 ; Winham et al., 2012 ). Furthermore, the ability of standard random forest to detect feature interactions is somewhat dependent on strong individual effects, potentially losing epistatic SNPs with a weak individual effect. Several modified random forest algorithms have been developed to better account for epistatic interactions between SNPs with weak individual effect (e.g., T-tree ( Botta et al., 2014 ), GWGGI ( Wei and Lu, 2014 )). These modified algorithms are still less sensitive than exhaustive search methods (Section 4).

Unlike some multivariate filters (Section 2.1), random forest does not automatically eliminate redundant features. Indeed, Mariusz Kubus ( Kubus, 2019 ) showed that the presence of redundant features decreases the performance of the random forest algorithm. A potential solution to this problem includes filtering out the redundant features before applying random forest [see hybrid method (Section 3.1)]. Another possible solution might be aggregating the information carried by these redundant features (e.g., using haplotypes instead of SNPs to build the model). Some software packages like T-tree ( Botta et al., 2014 ) have a built-in capability to account for redundancy by transforming the input SNPs into groups of SNPs in high-LD with each other.

In contrast to decision tree-based algorithms, penalized methods (e.g., LASSO) can discard redundant features, but they have no built-in ability to detect feature interactions ( Barrera-Gómez et al., 2017 ). Instead, interaction terms must be explicitly included in the analysis ( Signorino and Kirchner, 2018 ). This is commonly achieved by exhaustively including all (usually pairwise) interaction terms for the features. While this approach can be effective for data with low dimensionality, it can be inaccurate and computationally prohibitive in highly dimensional data settings. Two-stage or hybrid strategies that result in reduced search spaces are potential solutions to this problem (Section 3.1).
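
Explicitly adding pairwise interaction terms before a penalized fit, as described above, can be sketched as follows (scikit-learn assumed); it is only practical here because the toy feature count is small.

# Pairwise interaction terms followed by an L1-penalized fit (toy sketch).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 20)).astype(float)   # 20 SNPs -> 190 pairwise interaction terms
y = rng.integers(0, 2, size=200)

X_int = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit_transform(X)
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_int, y)
print(X_int.shape, np.count_nonzero(model.coef_))      # expanded feature count and terms kept by the penalty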

2.4 Which Feature Selection Method Is Optimal?

The “no free lunch” theorem states that in searching for a solution, no single algorithm can be specialized to be optimal for all problem settings ( Wolpert and Macready, 1997 ). This is true for feature selection methods, each of which has its own strengths and weaknesses ( Table 1 ), relying on different metrics and underlying assumptions. Several studies have compared the predictive performance of the different feature selection methods ( Forman, 2003 ; Bolón-Canedo et al., 2013 ; Aphinyanaphongs et al., 2014 ; Wah et al., 2018 ; Bommert et al., 2020 ). These comparative studies have resulted in the widely held opinion that there is no such thing as the “best method” that is fit for all problem settings.

Which feature selection method is best is a problem-specific question that depends on the dataset being analyzed and the specific goals that the researcher aims to accomplish. For example, suppose the aim is to identify which features are relatively the most important (which can be useful to help uncover the biological mechanism behind the disease). In that case, filter methods are better because they produce a ranked list of features and are the most computationally efficient. If the dataset contains a relatively low number of features (e.g., tens to hundreds), applying wrapper methods likely results in the best predictive performance. Indeed, in this case, model selection algorithms can be applied to identify which wrapper algorithm is the best. By contrast, for the typical SNP genotype dataset with up to a million features, computational limitations mean that directly applying wrapper or embedded methods might not be computationally practical even though they model feature dependencies and tend to produce better classifier accuracy than filter methods.

New feature selection strategies are emerging that either: 1), use a two-step strategy with a combination of different feature selection methods (hybrid methods); or 2), combine the output of multiple feature selection methods (ensemble methods). These strategies take advantage of the strengths of the different feature selection methods that they include.

3 Hybrid Methods—Combining Different Feature Selection Approaches

Hybrid methods combine different feature selection methods in a multi-step process to take advantage of the strengths of the component methods ( Figure 5D ). For example, univariate filter-wrapper hybrid methods incorporate a univariate filter method as the first step to reduce the initial feature set size, thus limiting the search space and computational load for the subsequent wrapper step. In this instance, the filter method is used because of its simplicity and speed. By contrast, the wrapper method is used because it can model feature dependencies and allow interactions with the classifier, thus producing better performance. Typically, a relaxed scoring threshold is used for the filtering step because the main goal is to prioritize a subset of SNPs for further selection by the wrapper method. For example, when using the univariate χ2 test in the initial feature selection step, instead of the genome-wide significance threshold commonly used in GWAS ( p < 5 × 10−8 ), one might choose a less stringent threshold (e.g., p < 5 × 10−4 ), or adjust by FDR instead. While this might result in more false positives, those can be eliminated in the subsequent step, and SNPs with weak individual effects but strong interaction effects will survive the filtering step and can then be detected by the wrapper method. Practically, any filter, wrapper, or embedded method can be combined to create a hybrid method.
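
A hedged sketch of such a two-step hybrid is given below: a cheap univariate χ2 filter first shrinks a toy "genome-wide" matrix, and recursive feature elimination with a linear SVM (in the spirit of the RFE-SVM wrapper discussed later in this section) then refines the reduced set. All sizes, thresholds, and data are illustrative assumptions rather than settings from the cited studies.

# Hybrid selection: univariate filter followed by RFE with a linear SVM (toy sketch).
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(300, 5000))            # toy "genome-wide" matrix
y = rng.integers(0, 2, size=300)

# Step 1: relaxed univariate filter keeps the top 200 SNPs.
filt = SelectKBest(chi2, k=200).fit(X, y)
X_small = filt.transform(X)

# Step 2: recursive feature elimination with a linear SVM on the reduced set.
rfe = RFE(LinearSVC(C=0.1, dual=False, max_iter=5000), n_features_to_select=20).fit(X_small, y)
selected = filt.get_support(indices=True)[rfe.get_support()]   # map back to original SNP indices
print(selected)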

In a hybrid method, implementing the filter step reduces the feature search space thus allowing for the subsequent use of computationally expensive wrapper or embedded methods for high-dimensional datasets (which might otherwise be computationally unfeasible). For example, Yoshida and Koike ( Yoshida and Koike, 2011 ) presented a novel embedded method to detect interacting SNPs associated with rheumatoid arthritis called SNPInterForest (a modification of random forest algorithm). To accommodate the computational load of the proposed algorithm, the authors first narrowed the feature size from 500,000 SNPs to 10,000 SNPs using a univariate filter before further selection using SNPInterForest.

Wei et al. (2013) built a Crohn’s disease prediction model that employed a single SNP association test (a univariate filter method), followed by logistic regression with L1 (LASSO) regularization (an embedded method). The first filtering step reduced the original feature size from 178,822 SNPs to 10,000 SNPs for further selection with LASSO. The final predictive model achieved a respectable AUC of 0.86 in the testing set.

There is always a trade-off between computational complexity and performance in feature selection. In this context, hybrid methods can be considered a “middle ground” solution between the simple filter method and the more computationally complex but performant wrapper and embedded methods. Indeed, many examples in the literature have shown that a hybrid method tends to produce better performance than a simple filter while also being less computationally expensive than a purely wrapper method. For example, Alzubi et al. (2017) proposed a feature selection strategy using a hybrid of the CMIM filter and RFE-SVM wrapper method to classify healthy and diseased patients. They used SNP datasets for five conditions (thyroid cancer, autism, colorectal cancer, intellectual disability, and breast cancer). The authors showed that generally, the SNPs selected by the hybrid CMIM + RFE-SVM produce better classification performance than using any single filter method like mRMR ( Peng et al., 2005 ), CMIM ( Schlittgen, 2011 ), FCBF ( Yu and Liu, 2004 ), and ReliefF ( Urbanowicz et al., 2018b ), thus showing the superiority of the hybrid method.

Ghosh et al. (2020) demonstrated that a hybrid filter-wrapper feature selection technique, based on ant colony optimization, performs better than those based solely on filter techniques. The proposed hybrid method was less computationally complex than those based on the wrapper technique while preserving its relatively higher accuracy than the filter technique. Similarly, Butler-Yeoman et al. (2015) proposed a novel filter-wrapper hybrid feature selection algorithm that was based on particle swarm optimisation (FastPSO and RapidPSO). The authors further showed that the proposed hybrid method performs better than a pure filter algorithm (FilterPSO), while being less computationally complex than a pure wrapper algorithm (WrapperPSO).

Hybrid methods still have limitations despite their advantages when compared to purely filter, embedded, and wrapper methods. For example, relevant interacting SNPs with no significant individual effects (i.e., exclusively epistatic) can potentially be lost during the filtering step. This is because most filter methods cannot model feature-feature interactions. This can be mitigated by using filter algorithms that can model feature interactions (Section 2.1).

3.1 Integrative Method—Incorporating External Knowledge to Limit Feature Search Space

Integrative methods incorporate biological knowledge as an a priori filter for SNP pre-selection ( Figure 5E ). This enables the researcher to narrow the search space to “interesting” SNPs that are recognized as being relevant to the phenotype of interest. Limiting the search space means limiting the computational complexity for downstream analysis.

To integrate external knowledge, one can obtain information from public protein-protein interaction databases (e.g., IntAct, ChEMBL, BioGRID) or pathway databases (KEGG, Reactome). Software (e.g., INTERSNP ( Herold et al., 2009 )) has also been developed to help select a combination of “interesting” SNPs based on a priori knowledge (e.g., genomic location, pathway information, and statistical evidence). This information enables a reduction in the search space to only those SNPs that are mapped to genes that researchers contend are involved in relevant protein interactions or pathways of interest. For example, Ma et al. (2015) successfully identified SNP-SNP interactions that are associated with high-density lipoprotein cholesterol (HDL-C) levels. The search space was reduced by limiting the search to SNPs that have previously been associated with lipid levels, SNPs mapped to genes in known lipid-related pathways, and those that are involved in relevant protein-protein interactions. In other examples, the SNP search space has been limited to SNPs that are located within known risk loci. For example, D’Angelo et al. ( D’Angelo et al., 2009 ) identified significant gene-gene interactions that are associated with rheumatoid arthritis (RA) by restricting their search to chromosome 6 (a known risk locus for RA ( Newton et al., 2004 )) and using a combined LASSO-PCA approach.

An obvious limitation with these types of integrative approaches is the fact that online databases and our current biological knowledge are incomplete. Therefore, relying on external a priori knowledge will hinder the identification of novel variants outside our current biological understanding.

3.2 Ensemble Method—Combining the Output of Different Feature Selections

Ensemble feature selection methods are based on the assumption that combining the output of multiple algorithms is better than using the output of a single algorithm ( Figure 6 ) ( Bolón-Canedo et al., 2014 ). In theory, an ensemble of multiple feature selection methods allows the user to combine the strengths of the different methods while overcoming their weaknesses ( Pes, 2020 ). This is possible because different feature selection algorithms can retain complementary but different information. Several studies have shown that ensemble feature selection methods tend to produce better classification accuracy than is achieved using single feature selection methods ( Seijo-Pardo et al., 2015 ; Hoque et al., 2017 ; Wang et al., 2019 ; Tsai and Sung, 2020 ). Furthermore, ensemble feature selection can improve the stability of the selected feature set (i.e., it is more robust to small changes in the input data) ( Yang and Mao, 2011 ). Stability and reproducibility of results are important because they increase the confidence of users when inferring knowledge from the selected features ( Saeys et al., 2008 ).


FIGURE 6 . (A) Generalized illustration of ensemble methods. In ensemble methods, the outputs of several feature selection methods are aggregated to obtain the final selected features. FS = feature selection. (B) Generalized illustration of majority voting system where the different generated feature subsets are used to train and test a specific classifier. The final output is the class predicted by the majority of the classifiers.

When designing an ensemble approach, the first thing to consider is the choice of individual feature selection algorithms to be included. Using more than one feature selection method will increase the computation time, therefore filter and (to a lesser extent) embedded methods are usually preferred. By contrast, wrappers are generally avoided. Researchers must also make sure that the included algorithms will output diverse feature sets because there is no point in building an ensemble of algorithms that all produce the same results. Several metrics can be used to measure diversity (e.g. pair-wise Q statistics ( Kuncheva et al., 2002 )).

It is also important to consider how to combine the partial outputs generated by each algorithm into one final output; this is known as the aggregation method. Several aggregation methods have been proposed, the simplest works by taking the union or intersection of the top-ranked outputs of the different algorithms. While taking the intersection seems logical (i.e., if all algorithms select a feature, it might be highly relevant), this approach results in a restrictive set of features and tends to produce worse results than selecting the union ( Álvarez-Estévez et al., 2011 ). To overcome this, other popular aggregation methods assign each feature the mean or median position it has achieved among the outputs of all algorithms and use these positions to produce a final ranked feature subset. The final fusion rank of each feature can also be calculated as a weighted sum of the ranks assigned by the individual algorithms, where the weight of each algorithm is determined based on metrics such as the classification performance of the algorithm ( Long et al., 2001 ). Alternatively, majority voting systems ( Bolón-Canedo et al., 2012 ) ( Figure 6B ) can be used to determine the final class prediction. In majority voting systems, the different feature subsets generated by each algorithm are used to train and test a specific classifier. The final predicted output is the class that is predicted by the majority of the classifiers (see ( Guan et al., 2014 ; Bolón-Canedo and Alonso-Betanzos, 2019 ) for reviews about ensemble methods).
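
A simple mean-rank aggregation of the kind described above can be sketched as follows, assuming scikit-learn: three selectors score the SNPs, each score vector is converted to ranks, and the per-SNP ranks are averaged to produce the final ensemble ranking. The choice of selectors and the aggregation rule are illustrative assumptions.

# Ensemble feature selection by mean-rank aggregation across three selectors (toy sketch).
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 100))
y = rng.integers(0, 2, size=200)

scores = [
    chi2(X, y)[0],                                                  # chi-squared statistic per SNP
    mutual_info_classif(X, y, discrete_features=True, random_state=0),
    RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y).feature_importances_,
]
ranks = np.array([np.argsort(np.argsort(-s)) for s in scores])      # rank 0 = best under each selector
mean_rank = ranks.mean(axis=0)
top25 = np.argsort(mean_rank)[:25]                                  # final ensemble-selected subset
print(top25)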

Verma et al. (2018) proposed a collective feature selection approach that combined the union of the top-ranked outputs of several feature selection methods (MDR, random forest, MultiSURF, and TuRF). They applied this approach to identify SNPs associated with body mass index (BMI) and showed that the ensemble approach could detect epistatic effects that were otherwise missed by any single feature selection method.

Bolón-Canedo et al. (2012) applied an ensemble of five filter methods (CFS, Consistency-based, INTERACT, Information Gain, and ReliefF) to ten high-dimensional microarray datasets. The authors demonstrated that the ensemble of five filters achieved the lowest average error for every classifier tested (C4.5, IB1, and naïve Bayes) across all datasets, confirming the advantage of the ensemble method over individual filters.

4 Exhaustive Searches for Higher-Order SNP-SNP Interactions

There are instances where scientists are mainly interested in inference rather than prediction (e.g., the research interest lies in interpreting the biology of the selected SNPs). Recently, researchers within the GWAS field have recognized the importance of identifying significant SNP-SNP interactions, especially for complex diseases. The wrapper and embedded methods (e.g., decision tree-based algorithms) that can detect feature interactions (see Sections 2.2–2.3) have some limitations: 1) despite modifications that enable epistasis detection (Section 2.3), random forest-based algorithms are not exhaustive and are still prone to missing epistatic SNPs with low individual effects; 2) wrapper methods return a subset of features but do not identify which features are relatively more important than others.

In theory, the most reliable (albeit naïve) way to detect relevant SNP-SNP interactions is to exhaustively test each possible SNP combination and how it might relate to the phenotype class. Indeed, several exhaustive filter methods have been proposed (see ( Cordell, 2009 ; Niel et al., 2015 )). Some examples include BOolean Operation-based Screening and Testing (BOOST) ( Wan et al., 2010 ), FastEpistasis ( Schüpbach et al., 2010 ), and Tree-based Epistasis Association Mapping (TEAM) ( Zhang et al., 2010 ). However, these methods are restricted to testing and identifying pair-wise SNP interactions; therefore, any epistatic effects of order ≥3 will be missed. This contrasts with random forest (and many of its modifications), which, despite its lower sensitivity compared to exhaustive filters, can identify higher-order interactions.
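To make the pair-wise case concrete, the sketch below performs a naïve exhaustive scan of all SNP pairs, testing the joint 3 × 3 genotype distribution against case/control status with a chi-square test. It only illustrates the combinatorial structure and is not an implementation of BOOST, FastEpistasis, or TEAM; the function names and simulated data are assumptions, and SciPy is assumed to be available.

```python
import itertools
import numpy as np
from scipy.stats import chi2_contingency

def pairwise_interaction_scan(genotypes, phenotype):
    """Naively test every SNP pair for association with case/control status.

    genotypes: (n_samples, n_snps) array of minor-allele counts (0/1/2).
    phenotype: (n_samples,) array of 0 (control) / 1 (case) labels.
    Returns (snp_i, snp_j, p_value) tuples sorted by p-value.
    """
    n_snps = genotypes.shape[1]
    results = []
    for i, j in itertools.combinations(range(n_snps), 2):
        table = np.zeros((9, 2))                 # 9 joint genotype cells x 2 classes
        cells = genotypes[:, i] * 3 + genotypes[:, j]
        for cell, label in zip(cells, phenotype):
            table[cell, label] += 1
        table = table[table.sum(axis=1) > 0]     # drop empty genotype cells
        if table.shape[0] < 2:
            continue
        _, p_value, _, _ = chi2_contingency(table)
        results.append((i, j, p_value))
    return sorted(results, key=lambda r: r[2])

# Toy data: 200 individuals, 6 SNPs, random case/control labels
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(200, 6))
y = rng.integers(0, 2, size=200)
print(pairwise_interaction_scan(G, y)[:3])       # three smallest p-values
```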

For higher-order interactions, exhaustive filter methods have been developed (e.g., Multifactor Dimensionality Reduction (MDR) ( Ritchie et al., 2001 ) and the Combinatorial Partitioning Method (CPM) ( Nelson et al., 2001 )) and shown to be able to detect SNP-SNP interactions of order ≥3. However, due to the computational complexity of these analyses, these methods are effectively constrained to a maximum of several hundred features and cannot be applied to genome-wide datasets ( Lou et al., 2007 ). Goudey et al. (2015) estimated that evaluating all three-way interactions in a GWAS dataset of 1.1 million SNPs could take up to 5 years, even on a parallelized computing server with approximately 262,000 cores.
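The combinatorial explosion behind this estimate is easy to verify: the number of unordered SNP triples is n(n-1)(n-2)/6, so 1.1 million SNPs yield roughly 2.2 × 10^17 candidate three-way interactions. A quick check:

```python
import math

# Number of distinct three-SNP combinations among 1.1 million SNPs
n_triples = math.comb(1_100_000, 3)
print(f"{n_triples:.3e}")   # ≈ 2.218e+17 candidate interactions to evaluate
```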

The application of exhaustive methods to genome-wide data can be achieved using an extended hybrid approach (i.e., applying a filter method as a first step, followed by an exhaustive search) or an integrative approach (incorporating external knowledge) that reduces the search space for the exhaustive methods ( Pattin and Moore, 2008 ). For example, Greene et al. (2009) recommended the use of SURF (a Relief-based filter algorithm) as a filter before using MDR to exhaustively search for relevant SNP interactions. Collins et al. (2013) used MDR to identify significant three-way SNP interactions associated with tuberculosis from a dataset of 19 SNPs mapped to candidate tuberculosis genes. Similarly, algorithms that incorporate two-stage strategies to detect high-order interactions have been developed (e.g., dynamic clustering for detecting high-order genome-wide epistatic interactions (DCHE) ( Guo et al., 2014 ) and the epistasis detector based on the clustering of relatively frequent items (EDCF) ( Xie et al., 2012 )). DCHE and EDCF work by first identifying significant pair-wise interactions and using them as candidates to search for high-order interactions. More recently, swarm intelligence search algorithms have been proposed as an alternative way to identify candidate higher-order feature interactions prior to the application of an exhaustive search strategy. For example, Tuo et al. (2020) proposed the use of a multipopulation harmony search algorithm to identify candidate k-order SNP interactions and reduce the computational load before applying MDR to verify the interactions. Notably, the multi-stage algorithm (MP-HS-DHSI) that Tuo et al. developed is scalable to high-dimensional datasets (>100,000 SNPs), is much less computationally demanding than purely exhaustive searches, and is sensitive enough to detect interactions in which the individual SNPs have no marginal effects ( Tuo et al., 2020 ).
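The two-stage idea can be sketched in a few lines: a cheap univariate filter keeps a small set of candidate SNPs, and an exhaustive MDR-style search is then run only over combinations of the survivors. Everything below is a simplified illustration under assumed conventions (a difference-in-means filter score, training-set balanced accuracy as the MDR score, hypothetical function names); real pipelines would use a Relief-based or statistical filter and cross-validation, as in the studies cited above.

```python
import itertools
import numpy as np

def mdr_score(geno_subset, y):
    """MDR-style score (training balanced accuracy) for one SNP combination.

    Each joint-genotype cell is labelled high-risk if its case:control ratio
    exceeds the overall ratio; the resulting high/low-risk rule is scored
    against the true labels y.
    """
    k = geno_subset.shape[1]
    cells = np.ravel_multi_index(geno_subset.T, dims=(3,) * k)  # joint genotype per sample
    overall_ratio = y.mean() / (1 - y.mean())
    pred = np.zeros_like(y)
    for c in np.unique(cells):
        in_cell = cells == c
        cases = y[in_cell].sum()
        controls = in_cell.sum() - cases
        ratio = cases / controls if controls else np.inf
        pred[in_cell] = 1 if ratio > overall_ratio else 0
    tpr = pred[y == 1].mean()        # sensitivity of the high/low-risk rule
    tnr = 1 - pred[y == 0].mean()    # specificity
    return (tpr + tnr) / 2

def two_stage_search(G, y, n_filtered=15, order=3):
    """Stage 1: univariate filter. Stage 2: exhaustive search over survivors."""
    score = np.abs(G[y == 1].mean(axis=0) - G[y == 0].mean(axis=0))
    keep = np.argsort(score)[::-1][:n_filtered]
    best = max(itertools.combinations(keep, order),
               key=lambda combo: mdr_score(G[:, combo], y))
    return best, mdr_score(G[:, best], y)

# Toy data: 300 individuals, 200 SNPs
rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(300, 200))
y = rng.integers(0, 2, size=300)
print(two_stage_search(G, y))
```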

Although time demanding, an exhaustive search for pair-wise SNP interactions is possible ( Marchini et al., 2005 ). However, exhaustive searches for higher-order interactions are not yet feasible at genome-wide scale, and researchers must resort to hybrid, integrative, or two-stage approaches to reduce the feature space prior to the exhaustive search ( Table 2 ). Several (non-exhaustive) embedded methods (e.g., approaches based on decision tree algorithms) have been proposed as viable options to identify SNP interactions and increase the predictive power of the resulting models. However, the need for an efficient and scalable algorithm to detect SNP-SNP interactions remains, especially for higher-order interactions.


TABLE 2 . Summary of the algorithms reviewed for detecting epistasis, along with dataset applications, computational time, and memory requirements. Data are taken from three comparative studies, each of which is colour coded differently. N/A, not available.

5 Conclusion

Supervised ML algorithms can be applied to genome-wide SNP datasets. However, this is often not ideal because the curse of dimensionality leads to long training times and the production of an overfitted predictive model. Therefore, reducing the total number of features to a more manageable level by selecting the most informative SNPs is essential before training the model.

Currently, no single feature selection method stands above the rest. Each method has its strengths and weaknesses ( Table 1 , Table 3 , discussed in Section 2.4). Indeed, it is becoming rarer for researchers to depend on just a single feature selection method. Therefore, we contend that the use of a two-stage or hybrid approach should be considered "best practice." In a typical hybrid approach, a filter method is used in the first stage to reduce the number of candidate SNPs to a more manageable level, so that more complex and computationally heavy wrapper, embedded, or exhaustive search methods can be applied. Depending on the available resources, the filter used should be multivariate and able to detect feature interactions. Alternatively, biological knowledge can be used as an a priori filter for SNP pre-selection. Multiple feature selection methods can also be combined in a parallel scheme (ensemble method). By exploiting the strengths of the different methods, ensemble methods can achieve better accuracy and stability than reliance on any single feature selection method.


TABLE 3 . Advantages, limitations, and references for the feature selection algorithms reviewed in this paper.

Author Contributions

NP conceived and wrote the review. TF, AK, and JOS conceived and commented on the review.

Funding

NP received a University of Auckland PhD Scholarship. TF and JOS were funded by a grant from the Dines Family Foundation.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Abraham, G., and Inouye, M. (2015). Genomic Risk Prediction of Complex Human Disease and its Clinical Application. Curr. Opin. Genet. Dev. 33, 10–16. doi:10.1016/j.gde.2015.06.005


Abramovich, F., Benjamini, Y., Donoho, D. L., and Johnstone, I. M. (2006). Adapting to Unknown Sparsity by Controlling the False Discovery Rate. Ann. Stat. 34, 584–653. doi:10.1214/009053606000000074


Altshuler, D., Daly, M. J., and Lander, E. S. (2008). Genetic Mapping in Human Disease. Science 322, 881–888. doi:10.1126/science.1156409

Álvarez-Estévez, D., Sánchez-Maroño, N., Alonso-Betanzos, A., and Moret-Bonillo, V. (2011). Reducing Dimensionality in a Database of Sleep EEG Arousals. Expert Syst. Appl. 38, 7746–7754.


Alzubi, R., Ramzan, N., Alzoubi, H., and Amira, A. (2017). A Hybrid Feature Selection Method for Complex Diseases SNPs. IEEE Access 6, 1292–1301. doi:10.1109/ACCESS.2017.2778268

Aphinyanaphongs, Y., Fu, L. D., Li, Z., Peskin, E. R., Efstathiadis, E., Aliferis, C. F., et al. (2014). A Comprehensive Empirical Comparison of Modern Supervised Classification and Feature Selection Methods for Text Categorization. J. Assn Inf. Sci. Tec. 65, 1964–1987. doi:10.1002/asi.23110

Ashley, E. A., Butte, A. J., Wheeler, M. T., Chen, R., Klein, T. E., Dewey, F. E., et al. (2010). Clinical Assessment Incorporating a Personal Genome. Lancet 375, 1525–1535. doi:10.1016/S0140-6736(10)60452-7

Barrera-Gómez, J., Agier, L., Portengen, L., Chadeau-Hyam, M., Giorgis-Allemand, L., Siroux, V., et al. (2017). A Systematic Comparison of Statistical Methods to Detect Interactions in Exposome-Health Associations. Environ. Heal. A Glob. Access Sci. Source 16, 74. doi:10.1186/s12940-017-0277-6

Battiti, R. (1994). Using Mutual Information for Selecting Features in Supervised Neural Net Learning. IEEE Trans. Neural Netw. 5, 537–550. doi:10.1109/72.298224

Benjamini, Y., and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B Methodol. 57, 289–300. doi:10.1111/j.2517-6161.1995.tb02031.x

Benjamini, Y., and Yekutieli, D. (2001). The Control of the False Discovery Rate in Multiple Testing under Dependency. Ann. Stat. 29, 1165–1188. doi:10.1214/aos/1013699998

Bins, J., and Draper, B. A. (2001). Feature Selection from Huge Feature Sets. Proc. IEEE Int. Conf. Comput. Vis. 2, 159–165. doi:10.1109/ICCV.2001.937619

Bolón-Canedo, V., and Alonso-Betanzos, A. (2019). Ensembles for Feature Selection: A Review and Future Trends. Inf. Fusion 52, 1–12. doi:10.1016/j.inffus.2018.11.008

Bolón-Canedo, V., Sánchez-Maroño, N., and Alonso-Betanzos, A. (2013). A Review of Feature Selection Methods on Synthetic Data. Knowl. Inf. Syst. 34, 483–519.

Bolón-Canedo, V., Sánchez-Maroño, N., and Alonso-Betanzos, A. (2012). An Ensemble of Filters and Classifiers for Microarray Data Classification. Pattern Recognit. 45, 531–539.

Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A., Benítez, J. M., and Herrera, F. (2014). A Review of Microarray Datasets and Applied Feature Selection Methods. Inf. Sci. (Ny) 282, 111–135.

Bommert, A., Sun, X., Bischl, B., Rahnenführer, J., and Lang, M. (2020). Benchmark for Filter Methods for Feature Selection in High-Dimensional Classification Data. Comput. Statistics Data Analysis 143, 106839. doi:10.1016/j.csda.2019.106839

Botta, V., Louppe, G., Geurts, P., and Wehenkel, L. (2014). Exploiting SNP Correlations within Random Forest for Genome-wide Association Studies. PLoS One 9, e93379. doi:10.1371/journal.pone.0093379

Breiman, L. (2001). Random Forests. Mach. Learn. 45, 5–32. doi:10.1023/a:1010933404324

Broekema, R. V., Bakker, O. B., and Jonkers, I. H. (2020). A Practical View of Fine-Mapping and Gene Prioritization in the Post-genome-wide Association Era. Open Biol. 10, 190221. doi:10.1098/rsob.190221

Brzyski, D., Peterson, C. B., Sobczyk, P., Candès, E. J., Bogdan, M., and Sabatti, C. (2017). Controlling the Rate of GWAS False Discoveries. Genetics 205, 61–75. doi:10.1534/genetics.116.193987

Bureau, A., Dupuis, J., Falls, K., Lunetta, K. L., Hayward, B., Keith, T. P., et al. (2005). Identifying SNPs Predictive of Phenotype Using Random Forests. Genet. Epidemiol. 28, 171–182. doi:10.1002/gepi.20041

Butler-Yeoman, T., Xue, B., and Zhang, M. (2015). “Particle Swarm Optimisation for Feature Selection: A Hybrid Filter-Wrapper Approach,” in 2015 IEEE Congress on Evolutionary Computation (CEC) , Sendai, Japan , 25-28 May 2015 , 2428–2435. doi:10.1109/CEC.2015.7257186

Chandrashekar, G., and Sahin, F. (2014). A Survey on Feature Selection Methods. Comput. Electr. Eng. 40, 16–28. doi:10.1016/j.compeleceng.2013.11.024

Collins, R. L., Hu, T., Wejse, C., Sirugo, G., Williams, S. M., and Moore, J. H. (2013). Multifactor Dimensionality Reduction Reveals a Three-Locus Epistatic Interaction Associated with Susceptibility to Pulmonary Tuberculosis. BioData Min. 6, 4–5. doi:10.1186/1756-0381-6-4

Cordell, H. J. (2009). Detecting Gene-Gene Interactions that Underlie Human Diseases. Nat. Rev. Genet. 10, 392–404. doi:10.1038/nrg2579

Couronné, R., Probst, P., and Boulesteix, A.-L. (2018). Random Forest versus Logistic Regression: a Large-Scale Benchmark Experiment. BMC Bioinforma. 19, 270. doi:10.1186/s12859-018-2264-5

Cueto-López, N., García-Ordás, M. T., Dávila-Batista, V., Moreno, V., Aragonés, N., and Alaiz-Rodríguez, R. (2019). A Comparative Study on Feature Selection for a Risk Prediction Model for Colorectal Cancer. Comput. Methods Programs Biomed. 177, 219–229. doi:10.1016/j.cmpb.2019.06.001

Danasingh, A. A. G. S., Subramanian, A. a. B., and Epiphany, J. L. (2020). Identifying Redundant Features Using Unsupervised Learning for High-Dimensional Data. SN Appl. Sci. 2, 1367. doi:10.1007/s42452-020-3157-6

D’Angelo, G. M., Rao, D., and Gu, C. C. (2009). Combining Least Absolute Shrinkage and Selection Operator (LASSO) and Principal-Components Analysis for Detection of Gene-Gene Interactions in Genome-wide Association Studies. BMC Proc. 3, S62. doi:10.1186/1753-6561-3-S7-S62

De, R., Bush, W. S., and Moore, J. H. (2014). Bioinformatics Challenges in Genome-wide Association Studies (Gwas). Methods Mol. Biol. 1168, 63–81. doi:10.1007/978-1-4939-0847-9_5

Donnelly, P. (2008). Progress and Challenges in Genome-wide Association Studies in Humans. Nature 456, 728–731. doi:10.1038/nature07631

Dudbridge, F., and Gusnanto, A. (2008). Estimation of Significance Thresholds for Genomewide Association Scans. Genet. Epidemiol. 32, 227–234. doi:10.1002/gepi.20297

Dunn, O. J. (1961). Multiple Comparisons Among Means. J. Am. Stat. Assoc. 56, 52–64. doi:10.1080/01621459.1961.10482090

Farcomeni, A. (2008). A Review of Modern Multiple Hypothesis Testing, with Particular Attention to the False Discovery Proportion. Stat. Methods Med. Res. 17, 347–388. doi:10.1177/0962280206079046

Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. J. Mach. Learn. Res. 3, 1289–1305. doi:10.5555/944919.944974

Forsati, R., Moayedikia, A., Jensen, R., Shamsfard, M., and Meybodi, M. R. (2014). Enriched Ant Colony Optimization and its Application in Feature Selection. Neurocomputing 142, 354–371. doi:10.1016/j.neucom.2014.03.053

Ghosh, M., Guha, R., Sarkar, R., and Abraham, A. (2020). A Wrapper-Filter Feature Selection Technique Based on Ant Colony Optimization. Neural Comput. Applic 32, 7839–7857. doi:10.1007/s00521-019-04171-3

Goudey, B., Abedini, M., Hopper, J. L., Inouye, M., Makalic, E., Schmidt, D. F., et al. (2015). High Performance Computing Enabling Exhaustive Analysis of Higher Order Single Nucleotide Polymorphism Interaction in Genome Wide Association Studies. Health Inf. Sci. Syst. 3, S3. doi:10.1186/2047-2501-3-S1-S3

Granizo-Mackenzie, D., and Moore, J. H. (2013). “Multiple Threshold Spatially Uniform ReliefF for the Genetic Analysis of Complex Human Diseases,” in Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Berlin, Heidelberg: Springer ), 7833, 1–10. doi:10.1007/978-3-642-37189-9_1

Greene, C. S., Penrod, N. M., Kiralis, J., and Moore, J. H. (2009). Spatially Uniform ReliefF (SURF) for Computationally-Efficient Filtering of Gene-Gene Interactions. BioData Min. 2, 5–9. doi:10.1186/1756-0381-2-5

Greene, C. S., Himmelstein, D. S., Kiralis, J., and Moore, J. H. (2010). “The Informative Extremes: Using Both Nearest and Farthest Individuals Can Improve Relief Algorithms in the Domain of Human Genetics,” in Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Berlin, Heidelberg: Springer ), 6023, 182–193. doi:10.1007/978-3-642-12211-8_16

Guan, D., Yuan, W., Lee, Y. K., Najeebullah, K., and Rasel, M. K. (2014). A Review of Ensemble Learning Based Feature Selection. IETE Tech. Rev. 31, 190–198. doi:10.1080/02564602.2014.906859

Guo, X., Meng, Y., Yu, N., and Pan, Y. (2014). Cloud Computing for Detecting High-Order Genome-wide Epistatic Interaction via Dynamic Clustering. BMC Bioinforma. 15, 102–116. doi:10.1186/1471-2105-15-102

Guo, Y., Chung, F.-L., Li, G., and Zhang, L. (2019). Multi-Label Bioinformatics Data Classification with Ensemble Embedded Feature Selection. IEEE Access 7, 103863–103875. doi:10.1109/access.2019.2931035

Guyon, I., and Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 3, 1157–1182. doi:10.5555/944919.944968

Guyon, I., Gunn, S., Nikravesh, M., and Zadeh, L. (2008). Feature Extraction: Foundations and Applications , 207. Berlin: Springer .

Hall, M. (2000). “Correlation-based Feature Selection of Discrete and Numeric Class Machine Learning,” in Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000) , Stanford University, Stanford, CA, USA , June 29 - July 2, 2000 .

Han, B., Chen, X. W., Talebizadeh, Z., and Xu, H. (2012). Genetic Studies of Complex Human Diseases: Characterizing SNP-Disease Associations Using Bayesian Networks. BMC Syst. Biol. 6 Suppl 3, S14. doi:10.1186/1752-0509-6-S3-S14

Hayes-Roth, F. (1975). Review of "Adaptation in Natural and Artificial Systems by John H. Holland", the U. Of Michigan Press, 1975. SIGART Bull. 53, 15. doi:10.1145/1216504.1216510

Herold, C., Steffens, M., Brockschmidt, F. F., Baur, M. P., and Becker, T. (2009). INTERSNP: Genome-wide Interaction Analysis Guided by A Priori Information. Bioinformatics 25, 3275–3281. doi:10.1093/bioinformatics/btp596

Hindorff, L. A., Sethupathy, P., Junkins, H. A., Ramos, E. M., Mehta, J. P., Collins, F. S., et al. (2009). Potential Etiologic and Functional Implications of Genome-wide Association Loci for Human Diseases and Traits. Proc. Natl. Acad. Sci. U. S. A. 106, 9362–9367. doi:10.1073/pnas.0903103106

Ho, D. S. W., Schierding, W., Wake, M., Saffery, R., and O'Sullivan, J. (2019). Machine Learning SNP Based Prediction for Precision Medicine. Front. Genet. 10, 267. doi:10.3389/fgene.2019.00267

Hoque, N., Singh, M., and Bhattacharyya, D. K. (2017). EFS-MI: an Ensemble Feature Selection Method for Classification. Complex Intell. Syst. 4, 105–118. doi:10.1007/s40747-017-0060-x

Inza, I., Larrañaga, P., Blanco, R., and Cerrolaza, A. J. (2004). Filter versus Wrapper Gene Selection Approaches in DNA Microarray Domains. Artif. Intell. Med. 31, 91–103. doi:10.1016/j.artmed.2004.01.007

John, G. H., Kohavi, R., and Pfleger, K. (1994). “Irrelevant Features and the Subset Selection Problem,” in Machine Learning Proceedings 1994 (Burlington, MA: Morgan Kaufmann Publishers), 121–129. doi:10.1016/b978-1-55860-335-6.50023-4

Kafaie, S., Xu, L., and Hu, T. (2021). Statistical Methods with Exhaustive Search in the Identification of Gene-Gene Interactions for Colorectal Cancer. Genet. Epidemiol. 45, 222–234. doi:10.1002/gepi.22372

Kira, K., and Rendell, L. A. (1992). “Feature Selection Problem: Traditional Methods and a New Algorithm,” in Proceedings Tenth National Conference on Artificial Intelligence 2, 129–134.

Kittler, J. (1978). “Feature Set Search Algorithms,” in Pattern Recognition and Signal Processing (Dordrecht, Netherlands: Springer Dordrecht), 41–60. doi:10.1007/978-94-009-9941-1_3

Kohavi, R., and John, G. H. (1997). Wrappers for Feature Subset Selection. Artif. Intell. 97, 273–324. doi:10.1016/s0004-3702(97)00043-x

Koller, D., and Sahami, M. (1996). “Toward Optimal Feature Selection,” in International Conference on Machine Learning . Stanford, CA: Stanford InfoLab , 284–292.

König, I. R., Auerbach, J., Gola, D., Held, E., Holzinger, E. R., Legault, M. A., et al. (2016). Machine Learning and Data Mining in Complex Genomic Data - a Review on the Lessons Learned in Genetic Analysis Workshop 19. BMC Genet. 17, 1. doi:10.1186/s12863-015-0315-8

Kononenko, I. (1994). “Estimating Attributes: Analysis and Extensions of RELIEF,” in Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Berlin, Heidelberg: Springer), 784, 171–182. doi:10.1007/3-540-57868-4_57

Kotzyba‐Hibert, F., Kapfer, I., and Goeldner, M. (1995). Recent Trends in Photoaffinity Labeling. Angewandte Chemie Int. Ed. Engl. 34, 1296–1312.

Kruppa, J., Ziegler, A., and König, I. R. (2012). Risk Estimation and Risk Prediction Using Machine-Learning Methods. Hum. Genet. 131, 1639–1654. doi:10.1007/s00439-012-1194-y

Kubus, M. (2019). The Problem of Redundant Variables in Random Forests. Folia Oeconomica 6, 7–16. doi:10.18778/0208-6018.339.01

Kuncheva, L. I., Skurichina, M., and Duin, R. P. W. (2002). An Experimental Study on Diversity for Bagging and Boosting with Linear Classifiers. Inf. Fusion 3, 245–258. doi:10.1016/s1566-2535(02)00093-3

Li, C., Luo, X., Qi, Y., Gao, Z., and Lin, X. (2020). A New Feature Selection Algorithm Based on Relevance, Redundancy and Complementarity. Comput. Biol. Med. 119, 103667. Elsevier. doi:10.1016/j.compbiomed.2020.103667

Li, L., Umbach, D. M., Terry, P., and Taylor, J. A. (2004). Application of the GA/KNN Method to SELDI Proteomics Data. Bioinformatics 20, 1638–1640. doi:10.1093/bioinformatics/bth098

Liang, J., Hou, L., Luan, Z., and Huang, W. (2019). Feature Selection with Conditional Mutual Information Considering Feature Interaction. Symmetry 11, 858. doi:10.3390/sym11070858

Long, A. D., Mangalam, H. J., Chan, B. Y., Tolleri, L., Hatfield, G. W., and Baldi, P. (2001). Improved Statistical Inference from DNA Microarray Data Using Analysis of Variance and A Bayesian Statistical Framework. Analysis of Global Gene Expression in Escherichia coli K12. J. Biol. Chem. 276, 19937–19944. doi:10.1074/jbc.M010192200

López, B., Torrent-Fontbona, F., Viñas, R., and Fernández-Real, J. M. (2018). Single Nucleotide Polymorphism Relevance Learning with Random Forests for Type 2 Diabetes Risk Prediction. Artif. Intell. Med. 85, 43–49. doi:10.1016/j.artmed.2017.09.005

Lou, X. Y., Chen, G. B., Yan, L., Ma, J. Z., Zhu, J., Elston, R. C., et al. (2007). A Generalized Combinatorial Approach for Detecting Gene-By-Gene and Gene-By-Environment Interactions with Application to Nicotine Dependence. Am. J. Hum. Genet. 80 (6), 1125–1137. Elsevier. doi:10.1086/518312

Louppe, G., Wehenkel, L., Sutera, A., and Geurts, P. (2013). “Understanding Variable Importances in Forests of Randomized Trees,” in Advances in Neural Information Processing Systems 26.

Lunetta, K. L., Hayward, L. B., Segal, J., and van Eerdewegh, P. (2004). Screening Large-Scale Association Study Data: Exploiting Interactions Using Random Forests. BMC Genet. 5, 32. doi:10.1186/1471-2156-5-32

Ma, L., Keinan, A., and Clark, A. G. (2015). Biological Knowledge-Driven Analysis of Epistasis in Human GWAS with Application to Lipid Traits. Methods Mol. Biol. 1253, 35–45. doi:10.1007/978-1-4939-2155-3_3

Maher, B. (2008). Personal Genomes: The Case of the Missing Heritability. Nature 456, 18–21. doi:10.1038/456018a

Makowsky, R., Pajewski, N. M., Klimentidis, Y. C., Vazquez, A. I., Duarte, C. W., Allison, D. B., et al. (2011). Beyond Missing Heritability: Prediction of Complex Traits. PLoS Genet. 7, e1002051. doi:10.1371/journal.pgen.1002051

Manolio, T. A. (2013). Bringing Genome-wide Association Findings into Clinical Use. Nat. Rev. Genet. 14, 549–558. doi:10.1038/nrg3523

Mao, Y., and Yang, Y. (2019). A Wrapper Feature Subset Selection Method Based on Randomized Search and Multilayer Structure. Biomed. Res. Int. 2019, 9864213. doi:10.1155/2019/9864213

Marchini, J., Donnelly, P., and Cardon, L. R. (2005). Genome-wide Strategies for Detecting Multiple Loci that Influence Complex Diseases. Nat. Genet. 37, 413–417. doi:10.1038/ng1537

Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of Machine Learning . Cambridge, MA: MIT Press.

Moore, J. H., and White, B. C. (2007). “Tuning ReliefF for Genome-wide Genetic Analysis,” in Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Berlin, Heidelberg: Springer ), 4447, 166–175.

Nelson, M. R., Kardia, S. L., Ferrell, R. E., and Sing, C. F. (2001). A Combinatorial Partitioning Method to Identify Multilocus Genotypic Partitions that Predict Quantitative Trait Variation. Genome Res. 11, 458–470. doi:10.1101/gr.172901

Newton, J. L., Harney, S. M., Wordsworth, B. P., and Brown, M. A. (2004). A Review of the MHC Genetics of Rheumatoid Arthritis. Genes. Immun. 5, 151–157. doi:10.1038/sj.gene.6364045

Niel, C., Sinoquet, C., Dina, C., and Rocheleau, G. (2015). A Survey about Methods Dedicated to Epistasis Detection. Front. Genet. 6, 285. doi:10.3389/fgene.2015.00285

Okser, S., Pahikkala, T., Airola, A., Salakoski, T., Ripatti, S., and Aittokallio, T. (2014). Regularized Machine Learning in the Genetic Prediction of Complex Traits. PLoS Genet. 10, e1004754. doi:10.1371/journal.pgen.1004754

Okser, S., Pahikkala, T., and Aittokallio, T. (2013). Genetic Variants and Their Interactions in Disease Risk Prediction - Machine Learning and Network Perspectives. BioData Min. 6, 5. doi:10.1186/1756-0381-6-5

Onengut-Gumuscu, S., Chen, W. M., Burren, O., Cooper, N. J., Quinlan, A. R., Mychaleckyj, J. C., et al. (2015). Fine Mapping of Type 1 Diabetes Susceptibility Loci and Evidence for Colocalization of Causal Variants with Lymphoid Gene Enhancers. Nat. Genet. 47, 381–386. doi:10.1038/ng.3245

Ooka, T., Johno, H., Nakamoto, K., Yoda, Y., Yokomichi, H., and Yamagata, Z. (2021). Random Forest Approach for Determining Risk Prediction and Predictive Factors of Type 2 Diabetes: Large-Scale Health Check-Up Data in Japan. Bmjnph 4, 140–148. doi:10.1136/bmjnph-2020-000200

Pal, M., and Foody, G. M. (2010). Feature Selection for Classification of Hyperspectral Data by SVM. IEEE Trans. Geosci. Remote Sens. 48, 2297–2307. doi:10.1109/tgrs.2009.2039484

Panagiotou, O. A., and Ioannidis, J. P. (2012). What Should the Genome-wide Significance Threshold Be? Empirical Replication of Borderline Genetic Associations. Int. J. Epidemiol. 41, 273–286. doi:10.1093/ije/dyr178

Pattin, K. A., and Moore, J. H. (2008). Exploiting the Proteome to Improve the Genome-wide Genetic Analysis of Epistasis in Common Human Diseases. Hum. Genet. 124, 19–29. doi:10.1007/s00439-008-0522-8

Peng, H., Long, F., and Ding, C. (2005). Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1226–1238. doi:10.1109/TPAMI.2005.159

Pes, B. (2020). Ensemble Feature Selection for High-Dimensional Data: a Stability Analysis across Multiple Domains. Neural Comput. Applic 32, 5951–5973. doi:10.1007/s00521-019-04082-3

Remeseiro, B., and Bolon-Canedo, V. (2019). A Review of Feature Selection Methods in Medical Applications. Comput. Biol. Med. 112, 103375. doi:10.1016/j.compbiomed.2019.103375

Ritchie, M. D., Hahn, L. W., Roodi, N., Bailey, L. R., Dupont, W. D., Parl, F. F., et al. (2001). Multifactor-dimensionality Reduction Reveals High-Order Interactions Among Estrogen-Metabolism Genes in Sporadic Breast Cancer. Am. J. Hum. Genet. 69, 138–147. doi:10.1086/321276

Romagnoni, A., Jégou, S., Van Steen, K., Wainrib, G., and Hugot, J. P. (2019). Comparative Performances of Machine Learning Methods for Classifying Crohn Disease Patients Using Genome-wide Genotyping Data. Sci. Rep. 9, 10351. doi:10.1038/s41598-019-46649-z

Saeys, Y., Inza, I., and Larrañaga, P. (2007). A Review of Feature Selection Techniques in Bioinformatics. Bioinformatics 23, 2507–2517. doi:10.1093/bioinformatics/btm344

Saeys, Y., Abeel, T., and Van De Peer, Y. (2008). “Robust Feature Selection Using Ensemble Feature Selection Techniques,” in Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Berlin, Heidelberg: Springer ), 5212, 313–325. doi:10.1007/978-3-540-87481-2_21

Schlittgen, R. (2011). A Weighted Least-Squares Approach to Clusterwise Regression. AStA Adv. Stat. Anal. 95, 205–217. doi:10.1007/s10182-011-0155-4

Schüpbach, T., Xenarios, I., Bergmann, S., and Kapur, K. (2010). FastEpistasis: a High Performance Computing Solution for Quantitative Trait Epistasis. Bioinformatics 26, 1468–1469. doi:10.1093/bioinformatics/btq147

Schwarz, D. F., König, I. R., and Ziegler, A. (2010). On Safari to Random Jungle: a Fast Implementation of Random Forests for High-Dimensional Data. Bioinformatics 26, 1752–1758. doi:10.1093/bioinformatics/btq257

Seijo-Pardo, B., Bolón-Canedo, V., Porto-Díaz, I., and Alonso-Betanzos, A. (2015). “Ensemble Feature Selection for Rankings of Features,” in Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Cham: Springer-Verlag ), 9095, 29–42. doi:10.1007/978-3-319-19222-2_3

Signorino, C. S., and Kirchner, A. (2018). Using LASSO to Model Interactions and Nonlinearities in Survey Data. Surv. Pract. 11, 1–10. doi:10.29115/sp-2018-0005

Skalak, D. B. (1994). “Prototype and Feature Selection by Sampling and Random Mutation Hill Climbing Algorithms,” in Machine Learning Proceedings 1994 . Burlington, MA: Morgan Kauffmann , 293–301. doi:10.1016/b978-1-55860-335-6.50043-x

Spain, S. L., and Barrett, J. C. (2015). Strategies for Fine-Mapping Complex Traits. Hum. Mol. Genet. 24, R111–R119. doi:10.1093/hmg/ddv260

Spiegel, A. M., and Hawkins, M. (2012). 'Personalized Medicine' to Identify Genetic Risks for Type 2 Diabetes and Focus Prevention: Can it Fulfill its Promise? Health Aff. (Millwood) 31, 43–49. doi:10.1377/hlthaff.2011.1054

Szymczak, S., Biernacka, J. M., Cordell, H. J., González-Recio, O., König, I. R., Zhang, H., et al. (2009). Machine Learning in Genome-wide Association Studies. Genet. Epidemiol. 33 Suppl 1, S51–S57. doi:10.1002/gepi.20473

Tsai, C.-F., and Sung, Y.-T. (2020). Ensemble Feature Selection in High Dimension, Low Sample Size Datasets: Parallel and Serial Combination Approaches. Knowledge-Based Syst. 203, 106097. doi:10.1016/j.knosys.2020.106097

Tuo, S., Liu, H., and Chen, H. (2020). Multipopulation Harmony Search Algorithm for the Detection of High-Order SNP Interactions. Bioinformatics 36, 4389–4398. doi:10.1093/bioinformatics/btaa215

Uddin, S., Khan, A., Hossain, M. E., and Moni, M. A. (2019). Comparing Different Supervised Machine Learning Algorithms for Disease Prediction. BMC Med. Inf. Decis. Mak. 19, 281. doi:10.1186/s12911-019-1004-8

Urbanowicz, R. J., Meeker, M., La Cava, W., Olson, R. S., and Moore, J. H. (2018b). Relief-based Feature Selection: Introduction and Review. J. Biomed. Inf. 85, 189–203. doi:10.1016/j.jbi.2018.07.014

Urbanowicz, R. J., Olson, R. S., Schmitt, P., Meeker, M., and Moore, J. H. (2018a). Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining. J. Biomed. Inf. 85, 168–188. doi:10.1016/j.jbi.2018.07.015

Verma, S. S., Lucas, A., Zhang, X., Veturi, Y., Dudek, S., Li, B., et al. (2018). Collective Feature Selection to Identify Crucial Epistatic Variants. BioData Min. 11, 5. doi:10.1186/s13040-018-0168-6

Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., et al. (2017). 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet. 101, 5–22. doi:10.1016/j.ajhg.2017.06.005

Wah, Y. B., Ibrahim, N., Hamid, H. A., Abdul-Rahman, S., and Fong, S. (2018). Feature Selection Methods: Case of Filter and Wrapper Approaches for Maximising Classification Accuracy. Pertanika J. Sci. Technol. 26, 329–340.

Wan, X., Yang, C., Yang, Q., Xue, H., Fan, X., Tang, N. L. S., et al. (2010). BOOST: A Fast Approach to Detecting Gene-Gene Interactions in Genome-wide Case-Control Studies. Am. J. Hum. Genet. 87 (3), 325–340. Elsevier. doi:10.1016/j.ajhg.2010.07.021

Wang, J., Xu, J., Zhao, C., Peng, Y., and Wang, H. (2019). An Ensemble Feature Selection Method for High-Dimensional Data Based on Sort Aggregation. Syst. Sci. Control Eng. 7, 32–39. doi:10.1080/21642583.2019.1620658

Wei, C., and Lu, Q. (2014). GWGGI: Software for Genome-wide Gene-Gene Interaction Analysis. BMC Genet. 15, 101. doi:10.1186/s12863-014-0101-z

Wei, Z., Wang, W., Bradfield, J., Li, J., Cardinale, C., Frackelton, E., et al. (2013). Large Sample Size, Wide Variant Spectrum, and Advanced Machine-Learning Technique Boost Risk Prediction for Inflammatory Bowel Disease. Am. J. Hum. Genet. 92, 1008–1012. doi:10.1016/j.ajhg.2013.05.002

Winham, S. J., Colby, C. L., Freimuth, R. R., Wang, X., de Andrade, M., Huebner, M., et al. (2012). SNP Interaction Detection with Random Forests in High-Dimensional Genetic Data. BMC Bioinforma. 13, 164. doi:10.1186/1471-2105-13-164

Wolpert, D. H., and Macready, W. G. (1997). No Free Lunch Theorems for Optimization. IEEE Trans. Evol. Comput. 1, 67–82. doi:10.1109/4235.585893

Wray, N. R., Goddard, M. E., and Visscher, P. M. (2007). Prediction of Individual Genetic Risk to Disease from Genome-wide Association Studies. Genome Res. 17, 1520–1528. doi:10.1101/gr.6665407

Xie, M., Li, J., and Jiang, T. (2012). Detecting Genome-wide Epistases Based on the Clustering of Relatively Frequent Items. Bioinformatics 28, 5–12. doi:10.1093/bioinformatics/btr603

Xiong, M., Fang, X., and Zhao, J. (2001). Biomarker Identification by Feature Wrappers. Genome Res. 11, 1878–1887. doi:10.1101/gr.190001

Xu, C., Tachmazidou, I., Walter, K., Ciampi, A., Zeggini, E., and Greenwood, C. M. T. (2014). Estimating Genome‐Wide Significance for Whole‐Genome Sequencing Studies. Genet. Epidemiol. 38, 281–290. Wiley Online Libr. doi:10.1002/gepi.21797

Yang, F., and Mao, K. Z. (2011). Robust Feature Selection for Microarray Data Based on Multicriterion Fusion. IEEE/ACM Trans. Comput. Biol. Bioinform 8, 1080–1092. doi:10.1109/TCBB.2010.103

Yang, J., and Honavar, V. (1998). Feature Subset Selection Using a Genetic Algorithm. IEEE Intell. Syst. 13, 44–49. doi:10.1109/5254.671091

Yoshida, M., and Koike, A. (2011). SNPInterForest: a New Method for Detecting Epistatic Interactions. BMC Bioinforma. 12, 469. doi:10.1186/1471-2105-12-469

Yu, L., and Liu, H. (2004). Efficient Feature Selection via Analysis of Relevance and Redundancy. J. Mach. Learn. Res. 5, 1205–1224. doi:10.5555/1005332.1044700

Zhang, X., Huang, S., Zou, F., and Wang, W. (2010). TEAM: Efficient Two-Locus Epistasis Tests in Human Genome-wide Association Study. Bioinformatics 26, i217–27. doi:10.1093/bioinformatics/btq186

Zhang, Y., Li, S., Wang, T., and Zhang, Z. (2013). Divergence-based Feature Selection for Separate Classes. Neurocomputing 101, 32–42. doi:10.1016/j.neucom.2012.06.036

Keywords: machine learning, feature selection (FS), risk prediction, disease risk prediction, statistical approaches

Citation: Pudjihartono N, Fadason T, Kempa-Liehr AW and O'Sullivan JM (2022) A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. Front. Bioinform. 2:927312. doi: 10.3389/fbinf.2022.927312

Received: 24 April 2022; Accepted: 03 June 2022; Published: 27 June 2022.


Copyright © 2022 Pudjihartono, Fadason, Kempa-Liehr and O'Sullivan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Andreas W. Kempa-Liehr, [email protected] ; Justin M. O'Sullivan, [email protected]



Analyzing the impact of feature selection methods on machine learning algorithms for heart disease prediction

Zeinab Noroozi

1 Department of Artificial Intelligence, Islamic Azad University of Kazeroon, Kazeroon, Iran

Azam Orooji

2 Department of Advanced Technologies, School of Medicine, North Khorasan University of Medical Sciences (NKUMS), Bojnurd, North Khorasan, Iran

Leila Erfannia

3 Health Human Resources Research Center, Clinical Education Research Center, Shiraz University of Medical Sciences, Shiraz, Iran

4 Health Information Management Department, School of Health Management and Information Sciences, Shiraz University of Medical Sciences, Shiraz, Iran

Associated Data

The Cleveland Heart Disease dataset from the public UCI Machine Learning Repository was used for the analysis; it is publicly available and can be retrieved from https://archive.ics.uci.edu/ml/datasets/Heart+Disease .

The present study examines the role of feature selection methods in optimizing machine learning algorithms for predicting heart disease. The Cleveland Heart Disease dataset was analyzed with sixteen feature selection techniques from three categories: filter, wrapper, and evolutionary. Seven algorithms, Bayes Net, Naïve Bayes (NB), Multilayer Perceptron (MLP), Support Vector Machine (SVM), LogitBoost, J48, and Random Forest, were then applied to identify the best models for heart disease prediction. Precision, F-measure, Specificity, Accuracy, Sensitivity, ROC area, and PRC were measured to compare the effect of the feature selection methods on the prediction algorithms. The results demonstrate that feature selection led to significant improvements in model performance for some methods (e.g., J48), whereas it decreased model performance for others (e.g., MLP, RF). SVM combined with filter methods achieved the best accuracy of 85.5; in the best-case scenario, filter methods improved model accuracy by +2.3, with the SVM-CFS, information gain, and symmetrical uncertainty methods showing the highest improvement in this index. The filter feature selection methods, despite selecting the highest number of features, outperformed the other methods in terms of the models' accuracy, precision, and F-measure. However, wrapper-based and evolutionary algorithms improved model performance from the sensitivity and specificity points of view.

Introduction

The prevalence of cardiovascular disease is rising worldwide; the World Health Organization (WHO) estimates that 17 million people die annually from cardiovascular diseases, particularly stroke and heart attack. These diseases are responsible for 31% of global mortality and are considered the primary cause of death worldwide. It is estimated that the number of deaths from cardiovascular disease will rise to 22 million people by 2030. According to American Heart Association statistics, 50% of adults in the United States suffer from cardiovascular disease 1 – 3 . Risk factors such as lifestyle behaviors, age, gender, smoking, family history, obesity, high blood lipids, blood sugar level, poor diet, alcohol consumption, and body weight can all contribute to these disorders, which are brought on by the heart's abnormal functioning. It is crucial to recognize the behaviors and warning symptoms of cardiovascular disorders 1 . Several tests, including auscultation, ECG, blood pressure, blood lipids, and blood sugar, are required to diagnose CVD. Prioritizing these tests is crucial, since they can take a long time to complete while the patient needs to start taking his/her medication right away. It is also critical to recognize the numerous behaviors that contribute to CVD 4 . On the other hand, this condition is challenging to identify because of the numerous risk factors that contribute to its onset. The survival rate of patients can, however, be increased by timely and accurate diagnosis of these disorders 2 .

A proper diagnosis is essential to the functioning of the health system. In the US, a serious medical illness is misdiagnosed in 5% of outpatients. This problem not only puts the patient in danger, but also leads to ineffective diagnostic procedures and other inefficiencies in the healthcare system. Diagnostic mistakes raise the cost of the healthcare system and erode public confidence in it. In addition, many healthcare professionals are dissatisfied with the amount of time clinicians spend entering data into computers, which reduces the effectiveness of doctor-patient contact 5 . The diagnosis of a heart attack is a highly complex and important procedure that must be conducted with care and precision. It is typically based on the knowledge and experience of the physician and, if not done properly, can result in significant financial and life-altering costs for the patient 6 . However, not all physicians possess the same expertise in subspecialties, and the geographical distribution of qualified specialists is uneven. Because multiple factors are used to evaluate the diagnosis of a heart attack, physicians typically make the diagnosis based on the patient's present test results 7 . Additionally, doctors review prior diagnoses made for other patients with similar test results. These intricate procedures are, however, of limited value 8 .

To accurately diagnose heart attack patients, a physician must possess expertise and experience. Consequently, leveraging the knowledge and expertise of various professionals, together with the clinical screening data collected in databases, to facilitate the analysis process is seen as a beneficial framework that integrates clinical decision aids and computer-aided patient records. Furthermore, it can reduce treatment errors, enhance patient safety, eliminate unnecessary conflicts, and improve patient outcomes. Machine learning has been extensively discussed in the medical field, particularly for the diagnosis and treatment of diseases 7 . Recent research has highlighted the potential of machine learning to improve accuracy and diagnostic time. AI-based tools constructed with machine learning have become increasingly effective diagnostic tools in recent years 9 , 10 . Machine learning algorithms are highly effective at making predictions from large amounts of data. Data mining is the process of transforming large amounts of raw data into information that is highly useful for decision-making and forecasting 11 . By producing more precise and timely diagnoses, machine learning technology has the potential to transform the healthcare system and provide access to quality healthcare for underprivileged communities worldwide. Machine learning has the potential to shorten the time it takes for patients to meet with their physicians, as well as to reduce the need for unnecessary diagnostic tests and enhance the precision of diagnoses. Preventive interventions can significantly reduce the rate of complex diseases 1 , 2 . As a result, many clinicians have proposed increasing the identification of at-risk patients through the use of machine learning and predictive models to reduce mortality and enhance clinical decision-making. Machine learning can be used to detect the risk of cardiovascular disease and provide clinicians with useful treatments and advice for their patients 12 .

In addition to the various cardiovascular disorders, there are pathological alterations that take place in the heart and the blood vessels. Data classification can enable the development of tailored models and interventions that reduce the risk of cardiovascular disease. These analyses assist medical professionals in re-evaluating the underlying risks and, even if a prior vascular disease has occurred, can provide more efficient solutions and treatments to improve quality of life, extend life expectancy 13 , and reduce mortality. An expert can use supervised learning to answer questions such as the following: does a medical image contain a malignant or a benign tumor? Is a patient with heart disease likely to survive? Is there a risk of disease progression? Is a person likely to develop heart disease given the existing risk factors? These and other questions can be answered using supervised learning techniques and classification modeling 14 , 15 . Classification, one of the most common supervised learning methods in data mining, divides data into classes and allows one to organize different kinds of data, from complex to simple. The main goal of classification is to connect the input variables with the target variables and make predictions based on this relationship. The classification techniques used in this study ranged from decision trees to support vector machines (SVM) and random forests 16 . In a study conducted by Melillo and colleagues, the CART algorithm was found to have the highest accuracy (93.3%) among the algorithms compared. This algorithm was used to determine which patients had congestive heart disease and which patients were at lower risk 17 .

Although Machine Learning (ML) is essential for the diagnosis of a wide range of diseases, the production of large-scale datasets and the presence of numerous non-essential and redundant features in these datasets pose a significant challenge for ML algorithms 8 . In many cases, only a small number of features are essential and pertinent to the objective; if the remaining trivial and redundant features are not discarded, the performance and accuracy of the classification are adversely affected. Therefore, it is essential to select a compact and appropriate subset of the major features to enhance the classification performance and overcome the "curse of dimensionality". The purpose of feature selection techniques is to assess the significance of features, with the aim of reducing the inputs to those that are most pertinent to the model. In addition to reducing the number of inputs, feature selection also significantly reduces the processing time. Even though several feature selection techniques have been employed in decision support systems for medical datasets, there is always room for improvement 18 .

Previous research on predicting heart disease falls into two broad categories: studies that optimize algorithms based on various machine learning techniques, and studies that attempt to optimize algorithms by utilizing various feature selection techniques. However, comparing the impact of different feature selection techniques on model performance has received less attention. This study aims to compare the performance of three different categories of feature selection techniques (filter, wrapper, and evolutionary) in machine learning models for predicting heart disease.
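As a rough illustration of the kind of comparison the study performs, the sketch below evaluates one classifier with no feature selection, with a simple filter method, and with a wrapper method on synthetic tabular data. It is a minimal scikit-learn example under assumed settings (synthetic data, six retained features, Random Forest as the downstream model), not the study's actual pipeline or its sixteen techniques.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a tabular clinical dataset
# (the real study used the 13-feature Cleveland Heart Disease dataset).
X, y = make_classification(n_samples=300, n_features=13, n_informative=5,
                           random_state=0)

candidates = {
    "no selection": RandomForestClassifier(random_state=0),
    "filter (mutual information)": make_pipeline(
        SelectKBest(mutual_info_classif, k=6),
        RandomForestClassifier(random_state=0)),
    "wrapper (RFE + logistic regression)": make_pipeline(
        RFE(LogisticRegression(max_iter=1000), n_features_to_select=6),
        RandomForestClassifier(random_state=0)),
}

for name, model in candidates.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name:<38s} accuracy = {acc:.3f}")
```

An evolutionary selector (for example, a genetic-algorithm wrapper) could be added as a third pipeline entry in the same comparison.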

This paper contains the following significant points:

  • The present study examines the effect of sixteen different feature selection techniques, spanning filter, wrapper, and evolutionary methods, on machine-learning algorithms for heart disease prediction.
  • In the subsequent phase, all sixteen feature selection techniques were employed with Bayes Net, Naïve Bayes (NB), Multilayer Perceptron (MLP), Support Vector Machine (SVM), LogitBoost, J48, and Random Forest.
  • The results were then compared according to the assessment criteria of Precision, F-measure, Specificity, Accuracy, Sensitivity, ROC area, and PRC.
  • The most significant contribution of the present study is a comprehensive comparison of a wide variety of feature selection techniques applied to machine learning algorithms for the prediction of heart disease. A key finding was that, despite selecting more features, the filter methods were still able to improve the accuracy, precision, and F-measure of the machine learning algorithms.
  • The largest improvements were a +2.3 increase in accuracy and a +2.2 increase in F-measure, both obtained after applying the SVM with the CFS, information gain, and symmetrical uncertainty feature selection methods.
  • The results showed that although feature selection in some algorithms leads to improved performance, in others it reduces the performance of the algorithm.

This paper is structured as follows: following the introduction in section "Introduction", the related literature is reviewed in section "Related literature". The research methods are described in section "Methodology", the results are presented in section "Results", and they are discussed in section "Discussion". The conclusions of the study are presented in section "Conclusion". Lastly, the limitations and future scope are discussed in section "Limitation and future scope".

Related literature

A number of related studies on the prediction of heart disease have used the Cleveland UCI dataset. These studies fall into two broad categories: the first compares algorithms based on classic machine learning or deep learning, and the second compares the performance of algorithms based on feature selection.

Premsmith et al. presented a model to detect heart disease through Logistic Regression and Neural Network models using data mining techniques. The results showed that logistic regression, with an accuracy of 91.65%, a precision of 95.45%, a recall of 84%, and an F-measure of 89.36%, outperformed the neural network 3 . In a study to enhance heart attack prediction accuracy through ensemble classification techniques, Latha et al. concluded that an accuracy improvement of up to 7% can be expected from ensemble classification for weak classifiers, and that techniques such as bagging and boosting are effective in increasing the prediction accuracy of weak classifiers 16 . Chaurasia et al. conducted a study to evaluate the accuracy of heart disease detection using Naïve Bayes, J48, and bagging. The results indicated that Naïve Bayes provided an accuracy of 82.31%, J48 an accuracy of 84.35%, and bagging an accuracy of 85.03%; bagging therefore had greater predictive power than Naïve Bayes 19 .

Mienye et al. presented a deep learning strategy for predicting heart disease utilizing a particle swarm optimization (PSO) stacked sparse autoencoder (SSAE). This approach predicts heart disease using a stacked sparse autoencoder with a softmax layer, in which the last hidden layer of the sparse autoencoder is connected to a softmax classifier, forming the SSAE network. The network is then refined with the PSO algorithm, improving feature learning and classification capability. The application of this algorithm to the Cleveland dataset yielded the following results: 0.961 accuracy, 0.930 precision, 0.988 sensitivity, and 0.958 F-measure 2 .

In a research project assessing the predictive power of MLP and PSO algorithms for the prediction of cardiac disease, Batainh et al. proposed an algorithm with an accuracy of 0.846, an AUC of 0.848, a precision of 0.808, a recall of 0.883, and an F1 score of 0.844. This algorithm outperformed other algorithms, such as Gaussian NB, logistic regression, decision tree, random forest, gradient boosting, K-nearest neighbors, XGB, extra trees, and support vector classifiers, and can be used to provide clinicians with improved accuracy and speed in the prediction of heart disease 5 .

To enhance the accuracy of heart disease prediction, Thiyagaraj employed SVM, PSO, and a rough set algorithm. To reduce data redundancy and enhance data integrity, the data were normalized using the Z-score. The optimal feature set was then selected using PSO and the rough set, and a radial basis function transductive support vector machine classifier was employed for the prediction. The proposed algorithm was found to have superior performance compared to other algorithms 7 .

A number of papers have focused on the use of classification techniques in the field of cardiovascular disease. These studies employed classification methods to predict the onset of disease, to classify patients, and to model cardiovascular data. The classification and regression tree (CART) algorithm, a supervised algorithm, was employed in the studies conducted by Ozcan and Peker to predict the onset of heart disease and classify its determinants. The tree rules extracted from this study offer cardiologists a valuable resource for making informed decisions without the need for additional expertise in this area. The outcomes of this research will not only enable cardiologists to make faster and more accurate diagnoses but will also assist patients in reducing costs and treatment duration. In this study, based on data from 1190 cardiac patients, ST slope and oldpeak were found to be significant predictors of heart disease 15 .

Bhatt et al., in a study based on Kaggle datasets, predicted heart disease using Random Forest, Decision Tree, Multilayer Perceptron, and XGBoost classifiers. The MLP algorithm demonstrated the highest accuracy (87.28%) among the algorithms evaluated 14 . In a study conducted by Khan et al., heart disease was predicted for 518 patients enrolled in two care facilities in Pakistan using decision tree (DT), random forest (RF), logistic regression (LR), Naïve Bayes (NB), and support vector machine algorithms. The most accurate algorithm for classifying heart disease was Random Forest, with an accuracy of 85.01% 20 . Random Forest also performed best in a study by Kadhim and colleagues, who applied a variety of algorithms to a dataset from the IEEE DataPort; it was the most accurate, with an accuracy of 95.4% 21 . In addition to these papers, a further set of studies have explored the application of machine learning to image and signal analysis.

Medical images are a critical tool in diagnosing a variety of conditions, including tumors. Because radiological images can be highly similar to one another, timely diagnosis may be delayed, and machine learning techniques can therefore increase both the speed and the precision of image-based diagnosis. Moreover, as the number and volume of medical images grow, retrieving similar images and patients with similar complications can further accelerate diagnosis. The weakly supervised similarity assessment network (WSSENET) was used to evaluate the similarity of pulmonary radiology images and retrieved similar images more accurately than prior methods 22 . In another paper 23 , a low-dose CT reconstruction method based on prior sparse-transform images is proposed: texture structure features are learned from CT images in various datasets, noisy CT image sets are generated to identify noise artifact features, and low-dose CT images processed with the enhanced algorithm serve as prior images for a novel iterative reconstruction approach. DPRS is a method for expediting the retrieval of medical images within telemedicine systems, improving both response time and precision.

Classification and feature selection are also applied to medical image classification. Deep learning was employed to classify medical images in one study 24 , where an adaptive guided bilateral filter was used to filter the images and Black Widow Optimization was used to select the optimal features; an accuracy of 98.8% was achieved when Red Deer Optimization was applied to a gated deep relevance learning network for classification. Metaheuristic approaches have gained recognition in the scientific community for their reduced processing time, robustness, and adaptability 25 ; one such study presented a methodology based on a multi-objective symbiotic organism search for multidimensional problems, and the results of a feasibility test and Friedman's rank test showed the method to be effective on complex multidimensional problems. A triangular matching algorithm was used in another study 26 , which presented a method for tracking soft tissue surface features; compared with a convolutional neural network, the soft tissue feature tracking method achieved higher accuracy. Dang et al. presented a matching method to overcome the shortcomings of conventional feature matching: feature points in endoscopic video frames were treated as categories, and the corresponding feature points in subsequent frames were compared with a network classifier. The experiments showed that the convolutional-network-based feature-matching algorithm is efficient under feature matching without rotation or scaling displacement, reaching 90% matching accuracy over the first 200 frames of a video 27 . Ganesh et al. used a wrapper method based on the k-nearest neighbors (KNN) algorithm to select the best features and compared the WSA algorithm with seven metaheuristic algorithms.
The results showed that this algorithm reduced 99% of the features in very large datasets without loss of accuracy and performed 18% better than classical algorithms and 9% better than ensemble algorithms 28 . Priyadarshini et al. investigated feature selection with physics-inspired metaheuristic algorithms, comparing their accuracy, processing cost, suitability, average number of selected features, and convergence; the Equilibrium Optimizer (EO) performed best and was suggested for feature selection problems 29 .

The following table summarizes the findings of studies comparing feature selection techniques and algorithms applied to the Cleveland dataset for heart disease prediction (Table 1).

Related studies focusing on the effect of feature selection on heart disease prediction.

Feature selection method | Classification algorithm | Evaluation factor | Year | References
Chi-squared and analysis of variance (ANOVA) | Logistic regression, k-nearest neighbor, decision tree, random forest, Gaussian naive Bayes, extra gradient boosting, support vector classifier, multilayer perceptron, stochastic gradient descent, and extra trees classifier | Accuracy | 2023 |
Meta-heuristic algorithms (CS, FPA, WOA, and HHO) | SVM, KNN, random forest, naive Bayes, logistic regression | F-score and AUC | 2023 |
Relief, info gain, chi-squared, filtered subset, one-attribute-based, consistency-based, gain ratio, filtered attribute, CFS, genetic algorithm | Multilayer perceptron, KNN, SVM, J48 | Accuracy | 2019 |
Fast correlation-based feature selection (FCBF), PSO, and ACO | KNN, SVM, RF, NB, MLP | Accuracy | 2018 |

As Table 1 indicates, this group of studies considered only a few feature selection techniques, mostly filter methods, and generally reported only accuracy. In the present study, by contrast, sixteen feature selection methods in three groups (filter, wrapper, and evolutionary) were studied, and their impact on all evaluation factors, including precision, F-measure, specificity, accuracy, sensitivity, ROC area, and PRC area, was measured.

Methodology

The present study was divided into four general phases, as illustrated in Fig.  1 .

Figure 1. Study phases.

Once the data had been acquired and preprocessed, sixteen feature selection techniques were applied in three categories: filter, wrapper, and evolutionary methods. The best subset was then selected, seven machine learning techniques were applied, and the performance of the algorithms and feature selection methods was evaluated using several evaluation factors. Since a public dataset was used, informed consent was not required, and no human subjects were involved in the present research. All stages of the research complied with research ethics standards and guidelines, and the study was conducted after obtaining approval from the ethics board of Shiraz University of Medical Sciences.

The dataset used for the heart disease analysis is the Cleveland heart disease dataset, extracted from the UCI Machine Learning Repository, which consists of 303 records: 165 individuals with cardiovascular disease and 138 individuals with no cardiovascular history. The dataset contains 13 attributes for predicting heart disease, with one additional attribute serving as the final endpoint. Table 2 describes this dataset.

Details of the Cleveland dataset.

Attribute | Explanation | Type | Value
Age | Age in years | Numeric | 29–77
Sex | Gender | Binary | Male = 1, Female = 0
Cp | Chest pain type | Nominal | 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
Trestbps | Resting blood pressure in mmHg | Numeric | 94–200
Chol | Serum cholesterol in mg/dl | Numeric | 126–564
FBS | Fasting blood sugar > 120 mg/dl | Binary | True = 1, False = 0
Restecg | Resting electrocardiographic results | Binary | Normal = 0, abnormality = 1
Thalach | Maximum heart rate achieved | Numeric | 71–202
Exang | Exercise-induced angina | Binary | Yes = 1, No = 0
Oldpeak | ST depression induced by exercise relative to rest | Numeric | 0–6.2
Slope | Slope of the peak exercise ST segment | Nominal | 1 = upsloping, 2 = flat, 3 = downsloping
Ca | Number of major vessels colored by fluoroscopy | Numeric | 0–4
Thal | Defect type | Nominal | 3 = normal, 6 = fixed defect, 7 = reversible defect
Target | Healthy or patient | Binary | 1 = healthy, 0 = patient

Data preprocessing is one of the most critical steps after obtaining the data. Owing to the uniform and standardized nature of this dataset, only missing value analysis was applied as a preprocessing technique, and records with blank fields were eliminated. At this stage, 6 records with missing data were removed, leaving 297 records for processing.
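A minimal sketch of this preprocessing step, assuming the UCI distribution file processed.cleveland.data (where "?" marks missing entries) and the standard column order; the binary target convention used here (1 = disease present) is a choice made for the sketch, not taken from the paper.

```python
import pandas as pd

# Standard UCI Cleveland column names (matching Table 2).
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Load the raw file, treating "?" as a missing value.
df = pd.read_csv("processed.cleveland.data", names=columns, na_values="?")

print("Records before cleaning:", len(df))   # 303 in the original Cleveland data
df = df.dropna()                             # remove records with any blank field
print("Records after cleaning: ", len(df))   # 297 after removing the incomplete rows

# Collapse the 0-4 diagnosis code into a binary target (1 = disease) for this sketch.
df["target"] = (df["target"] > 0).astype(int)
```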

Feature selection

Feature selection is the process of removing irrelevant and redundant features from a dataset based on an evaluation index in order to improve accuracy. There are three main types of feature selection methods: filter, wrapper, and embedded 31 . Filter methods use the general properties of the training data to perform the selection as a preprocessing step independent of any induction algorithm. They have lower computational complexity and generalize better; because they evaluate a feature or a group of features only through the intrinsic properties of the training samples, they can be used with a wide range of classifiers 32 .

In a wrapper-based method, the selection process is driven by the optimization of a predictor. Unlike a filter method, a wrapper method is tailored to a particular classifier and uses it to evaluate the quality of each candidate subset, and as a result it usually achieves better classification performance than a filter method. In embedded methods, feature selection is performed during the training phase itself; these methods are characterized by a closer coupling between feature selection and classifier construction 33 , with the feature subsets being formed as the classifier is built 32 , 33 .

In the present study, filter methods were employed alongside wrapper and evolutionary methods (Fitness function: precision + SVM), which are briefly outlined below.

Filter method

Correlation-based feature selection (CFS): This multivariate filter algorithm ranks feature subsets using a heuristic, correlation-based evaluation function. The function favors subsets whose features are highly correlated with the class yet uncorrelated with each other: irrelevant features are discarded because they show low correlation with the class, and redundant features are screened out because they are highly correlated with one or more of the remaining features. A feature is accepted according to its ability to predict classes in regions of the sample space not already predicted by other features 32 .
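For reference, the heuristic merit that CFS assigns to a candidate subset S of k features is commonly written as follows (Hall's standard formulation, included here for clarity rather than quoted from this paper), where r̄_cf is the mean feature-class correlation and r̄_ff the mean feature-feature correlation:

```latex
\mathrm{Merit}_S \;=\; \frac{k\,\overline{r}_{cf}}{\sqrt{k + k(k-1)\,\overline{r}_{ff}}}
```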

Information gain : This univariate filter is a widely used feature evaluation measure. It ranks all features by importance and then applies a threshold; here, the threshold is set by retaining the features that receive positive information gain 32 .
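A sketch of this thresholding rule, using scikit-learn's mutual information estimator as a stand-in for information gain; it reuses the cleaned df from the earlier snippet, and the estimator settings are illustrative rather than the study's exact configuration.

```python
from sklearn.feature_selection import mutual_info_classif

X, y = df.drop(columns="target"), df["target"]

# Estimate the information gain (mutual information) of each feature with the class.
gains = mutual_info_classif(X, y, discrete_features="auto", random_state=0)

# Keep only features whose estimated gain is strictly positive, as described above.
selected = [col for col, g in zip(X.columns, gains) if g > 0]
print(selected)
```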

Gain ratio : This modification of information gain aims to mitigate its bias. The algorithm takes the number and size of branches into account when selecting a feature, adjusting the information gain by the intrinsic (split) information of the partition 34 .
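The standard definition behind this adjustment (Quinlan's form, added for clarity rather than taken from the paper): the information gain of attribute A is divided by the split information of the partition it induces over the data D.

```latex
\mathrm{GainRatio}(A) \;=\; \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)},
\qquad
\mathrm{SplitInfo}_A(D) \;=\; -\sum_{i=1}^{v} \frac{|D_i|}{|D|}\,\log_2\!\frac{|D_i|}{|D|}
```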

Relief : This method repeatedly draws a random sample from the data and finds its closest neighbor of the same class (nearest hit) and of the opposite class (nearest miss). The attribute values of these neighbors are compared with the sample, and the score of each attribute is updated: an attribute is rewarded for distinguishing samples from different classes and for taking similar values on samples of the same class 32 .
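The weight update that the standard Relief formulation applies for each sampled instance R_i, with nearest hit H, nearest miss M, and m sampling iterations (included for clarity; not reproduced from this paper):

```latex
W[A] \;\leftarrow\; W[A] \;-\; \frac{\mathrm{diff}(A, R_i, H)}{m} \;+\; \frac{\mathrm{diff}(A, R_i, M)}{m}
```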

Symmetrical uncertainty : Symmetrical uncertainty is used to quantify the relationship between a feature and the class label. The average normalized mutual information between a feature f, each other feature in the feature set F, and the class label reflects how strongly f is related to the class and to the other features 35 .
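The normalization usually applied is the following standard definition, where I(X; C) is the mutual information (information gain) between feature X and class C, and H denotes entropy:

```latex
\mathrm{SU}(X, C) \;=\; \frac{2\, I(X;\, C)}{H(X) + H(C)}
```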

Wrapper method

Forward and backward selection : In a backward elimination model, the search starts with all features and the least important features are removed sequentially. In a forward selection model, the search starts with no features and the most important features are added sequentially 36 .
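A hedged sketch of sequential forward and backward selection with scikit-learn's SequentialFeatureSelector, reusing X and y from the earlier snippet; the KNN estimator, the accuracy scoring, and the target subset sizes (which only echo the sizes reported later in Table 5) are illustrative choices, not the study's exact configuration.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

estimator = KNeighborsClassifier(n_neighbors=5)

# Forward selection: start from an empty set and add the most useful feature at each step.
forward = SequentialFeatureSelector(estimator, direction="forward",
                                    n_features_to_select=5, scoring="accuracy", cv=10)
forward.fit(X, y)
print("Forward selection:   ", list(X.columns[forward.get_support()]))

# Backward elimination: start from all features and drop the least useful one at each step.
backward = SequentialFeatureSelector(estimator, direction="backward",
                                     n_features_to_select=4, scoring="accuracy", cv=10)
backward.fit(X, y)
print("Backward elimination:", list(X.columns[backward.get_support()]))
```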

Naïve Bayes : This algorithm is derived from probability theory and identifies the most likely class. It uses the Bayes formula (Eq. 1) to determine the likelihood that a data record Y has the class label c j 11 .
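Equation (1) referenced above did not survive extraction; the standard Bayes rule it denotes is:

```latex
P(c_j \mid Y) \;=\; \frac{P(Y \mid c_j)\, P(c_j)}{P(Y)} \qquad (1)
```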

Decision tree : In this tree-based technique, each path starts at the root of the tree with a sequence of data splits and continues until a leaf node is reached. The tree is, in effect, a hierarchy of knowledge consisting of nodes and connections; when used for classification, the leaf nodes represent the targets 37 .

K-Nearest-Neighbor (KNN) : This is a classifier and regression model used for classification. Because KNN is a sample-based (memory-based) learning scheme, all computation is deferred until classification, and no explicit training step is required to construct a classifier 33 .

NN : A neural network is a computational model composed of a vast number of interconnected nodes, each representing a particular output function known as an activation function. Each connection between two nodes carries a weight, which corresponds to the memory of the network; the output of the neural network varies with how the nodes are connected, the values of the weights, and the activation function 38 .

SVM : Support vector machines (SVM) extend statistical learning theory models and are designed to draw inferences consistent with the data; they address the question of estimating model performance on unseen data, given the model's properties and its performance on the training set. SVMs solve a constrained quadratic optimization problem to find the optimal separating boundary between classes, and different kernel functions can be employed to provide varying degrees of nonlinearity and flexibility 39 .

Logistic regression : Logistic regression (or logistic regression analysis) is a statistical technique for predicting the outcome of a class-dependent variable from a set of predictor variables. It uses a binary dependent variable (class) with two categories and is primarily employed to predict, and to calculate the probability of, a given outcome 40 .

Evolutionary algorithms

Evolutionary algorithms are population-based metaheuristics that maintain a set of candidate solutions at each step of the search. Operators combine and modify these solutions to incrementally improve (evolve) the population according to a fitness function. This category includes algorithms such as PSO, ABC, and genetic algorithms 41 .

Artificial Bee Colony (ABC) : ABC is a population-based optimization algorithm in which artificial bees act as variation operators to refine the candidate solutions, which represent food sources. The bees aim to locate the food sources richest in nectar: an artificial bee navigates a multidimensional space and selects nectar sources based on its own experience and that of its hive companions, while scout bees explore and select food sources randomly, without relying on experience. When a bee locates a richer source, it memorizes the new position. ABC thus combines local and global search to balance exploration and exploitation of the search space 42 .

Genetic algorithm : A genetic algorithm is a programming technique that borrows mechanisms from evolutionary biology, including heredity, mutation, and Darwinian selection, to find the formula that best predicts or matches a pattern; in many cases it is a suitable substitute for regression-based prediction methods. Candidate solutions are evolved from the inputs through a process modeled on genetic evolution and are evaluated with a fitness function, and the algorithm terminates when the stopping condition of the problem is met. In general, a genetic algorithm is an iterative algorithm whose steps are largely stochastic, consisting of a fitness function, a representation (encoding), selection, and variation operators 43 .

Particle swarm optimization (PSO): In particle swarm optimization, each member of the population, or candidate solution, is referred to as a particle. Each particle moves through the search space with a position and velocity in search of the optimum, storing the best position it has found as its personal experience. This information is shared with neighboring particles, so that each particle also knows the best position found within its neighborhood or the entire swarm; the best group experience guides the search toward the solution 1 .
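A compact, illustrative sketch of the evolutionary wrapper idea described in this subsection (and analogous for GA and ABC): binary PSO over feature masks with an SVM accuracy fitness. It reuses X and y from the earlier snippets; the swarm size, iteration count, inertia, and acceleration constants are arbitrary choices for the sketch, not the study's settings.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_arr, y_arr = X.to_numpy(), y.to_numpy()
n_features = X_arr.shape[1]

def fitness(mask):
    """Fitness of a binary feature mask: 10-fold CV accuracy of an SVM on the selected columns."""
    if mask.sum() == 0:
        return 0.0
    clf = make_pipeline(StandardScaler(), SVC())
    return cross_val_score(clf, X_arr[:, mask.astype(bool)], y_arr, cv=10).mean()

# Binary PSO: velocities are continuous, positions are thresholded into 0/1 masks.
n_particles, n_iter, w, c1, c2 = 20, 30, 0.7, 1.5, 1.5
pos = rng.integers(0, 2, size=(n_particles, n_features))
vel = rng.uniform(-1, 1, size=(n_particles, n_features))
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    prob = 1.0 / (1.0 + np.exp(-vel))                 # sigmoid transfer function
    pos = (rng.random(pos.shape) < prob).astype(int)  # re-sample binary positions
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print("Selected features:", list(X.columns[gbest.astype(bool)]))
print("Best CV accuracy: ", round(pbest_fit.max(), 3))
```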

Machine learning algorithms:

This study employed a variety of machine learning models: Bayes net, naïve Bayes (NB), multilayer perceptron (MLP), support vector machine (SVM), LogitBoost, J48, and random forest. A Bayes net is a mathematical model that represents relationships among random variables through conditional probabilities; as a classifier, it evaluates the probability P(c | x) of a discrete class c given the observed characteristics x 44 . Random forests are tree-based ensembles in which each tree predictor is grown independently from a random vector sampled from the same distribution for all trees in the forest; the generalization error of a random forest depends on the correlation between individual trees and on their strength. J48 is an extension of the C4.5 classification decision tree algorithm that generates binary trees; the constructed tree represents the classification procedure and, once built, is applied to each tuple in the database to classify it 45 .

An MLP is a supervised learning approach that uses back-propagation. Because an MLP contains multiple layers of neurons, it can be considered a deep learning approach and is commonly employed for supervised learning problems; it has also been used in computational neuroscience and parallel distributed processing research 46 . LogitBoost is a boosting classification algorithm based on additive logistic regression that seeks to minimize the logistic loss.

Evaluation and analysis tools:

For data analysis and the identification of significant risk factors, the Waikato Environment for Knowledge Analysis (Weka) version 3.3.4 was used; evolutionary algorithms were implemented in Matlab 2019b, and the machine learning models in R 3.4.0. The models were validated with tenfold cross-validation against several criteria: accuracy, sensitivity, specificity, precision, F-measure, ROC area, and PRC area (Table 3). These indices are based on the confusion matrix, a two-dimensional matrix that compares predicted class values with actual class values. True positives (TP) are patients with heart disease who are correctly classified, false positives (FP) are patients without heart disease who are incorrectly classified as having it, false negatives (FN) are patients with heart disease who are not classified correctly, and true negatives (TN) are patients without heart disease who are classified correctly 12 . The F-measure, ROC area, and PRC area are aggregate indices that provide an overall assessment of the model; the formulas for the evaluation indices (Eqs. 2–6) are given in Table 3.

Study performance indices.

Performance criteria | Calculation
Accuracy | Eq. (2)
Precision | Eq. (3)
Sensitivity/Recall | Eq. (4)
Specificity | Eq. (5)
F-score | Eq. (6)
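The formulas themselves did not survive extraction; Eqs. (2)–(6) correspond to the standard confusion-matrix definitions described in the text above:

```latex
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} && (2)\\[2pt]
\text{Precision} &= \frac{TP}{TP + FP} && (3)\\[2pt]
\text{Sensitivity (Recall)} &= \frac{TP}{TP + FN} && (4)\\[2pt]
\text{Specificity} &= \frac{TN}{TN + FP} && (5)\\[2pt]
\text{F-score} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} && (6)
\end{aligned}
```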

Ethical statement

The study protocol was approved by the Shiraz University of Medical Sciences (SUMS) Ethics Board. Approval Date: 2022-11-19; Approval ID: IR.SUMS.NUMIMG.REC.1401.097.

The heart disease dataset consisted of 297 records (after removal of the 6 records with missing values), of which 160 subjects (53.9%) had no heart disease and 137 subjects (46.1%) had heart disease. To determine the risk factors associated with heart disease diagnosis, sixteen feature selection methods were applied in three categories: filter, wrapper, and evolutionary. All feature selection techniques were applied to the features, and the output of each operation, i.e., the features chosen by each technique, is presented in Table 4.

Feature selection results.

Type of algorithm | Feature selection technique | Number of selected features
Filter | CFS | 10
Filter | Information gain | 10
Filter | Gain ratio | 11
Filter | Relief | 12
Filter | Symmetrical uncertainty | 10
Wrapper | Forward selection | 5
Wrapper | Backward selection | 4
Wrapper | SVM | 9
Wrapper | NB | 6
Wrapper | Logistic regression | 9
Wrapper | NN | 11
Wrapper | KNN | 7
Wrapper | Decision tree | 7
Evolutionary (fitness function: accuracy + SVM) | Genetic algorithm | 9
Evolutionary (fitness function: accuracy + SVM) | PSO | 7
Evolutionary (fitness function: accuracy + SVM) | Artificial Bee Colony | 7

Cp: chest pain; FBS: fasting blood sugar; Restecg: resting electrocardiographic results; Exang: exercise-induced angina; Slope: slope of the peak exercise ST segment; Ca: number of major vessels colored by fluoroscopy; Thal: defect type; Trestbps: resting blood pressure measured in mm Hg on admission to the hospital; Chol: serum cholesterol; Thalach: maximum heart rate; Oldpeak: ST depression induced by exercise relative to rest.

Table 4 shows that the forward and backward selection methods chose the fewest features, while the Relief method chose the most (n = 12). In the subsequent step, seven machine learning methods were applied and their performance evaluated with tenfold cross-validation; all models were first built on the complete feature set and then on the features chosen by each feature selection method.
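An illustrative sketch of this evaluation protocol: tenfold cross-validation of several classifiers on one selected feature subset, reporting most of the study's metrics. The subset below is the one reported for the filter methods in Table 5; the scikit-learn models stand in for the Weka/R implementations actually used, so the exact figures will differ, and specificity is omitted because it has no built-in scorer.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Feature subset reported for CFS / information gain / symmetrical uncertainty (Table 5).
subset = ["age", "sex", "cp", "restecg", "thalach", "exang",
          "oldpeak", "ca", "thal", "slope"]

models = {
    "Naive Bayes": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random forest": RandomForestClassifier(random_state=0),
}

# Study metrics expressed as scikit-learn scorers.
scoring = {"accuracy": "accuracy", "precision": "precision", "sensitivity": "recall",
           "f_measure": "f1", "roc_area": "roc_auc", "prc_area": "average_precision"}

for name, model in models.items():
    scores = cross_validate(model, X[subset], y, cv=10, scoring=scoring)
    print(name, {k: round(scores["test_" + k].mean(), 3) for k in scoring})
```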

After all feature selection methods had been applied and the number of selected features determined, Table 5 lists the methods that selected the fewest features in each category. Wrapper algorithms selected the fewest features, while filter methods selected the most. In addition, the filter algorithms all selected the same features, whereas the evolutionary algorithms, despite selecting the same number of features, chose different ones.

Minimum feature subsets chosen by the different methods.

Methods | Feature selection algorithm | Number of selected features | Selected features
Filter | CFS | 10 | age, sex, cp, restecg, thalach, exang, oldpeak, ca, thal, slope
Filter | Information gain | 10 | age, sex, cp, restecg, thalach, exang, oldpeak, ca, thal, slope
Filter | Symmetrical uncertainty | 10 | age, sex, cp, restecg, thalach, exang, oldpeak, ca, thal, slope
Wrapper | Backward selection | 4 | cp, oldpeak, ca, thal
Wrapper | Forward selection | 5 | exang, cp, oldpeak, ca, thal
Evolutionary | PSO | 7 | cp, restecg, thalach, exang, oldpeak, ca, thal
Evolutionary | ABC | 7 | age, chol, fbs, restecg, thalach, slope, ca

The results of running the machine learning algorithms before feature selection are presented in Fig.  2 .

Figure 2. Before FS (based on the original data set).

The SVM algorithm achieved good performance, with ACC = 83.165%, specificity = 89.4%, and precision = 86%. When the aggregate criteria are considered, however, the Bayesian network performed better, with ACC = 81.3%, F-measure = 81.3%, AUC = 90.3%, and PRC area = 90%. The highest sensitivity, 81%, was achieved by the MLP.

Figures 3, 4 and 5 compare the performance of the machine learning algorithms after feature selection in terms of accuracy, F-measure, and ROC area.

Figure 3. Accuracy results of the algorithms after feature selection.

Figure 4. F-measure results after feature selection.

Figure 5. AUC results after feature selection.

The accuracy of all algorithms after feature selection is shown in Fig. 3. The SVM algorithm implemented with the CFS/information gain/symmetrical uncertainty feature selection methods displays the highest performance in comparison with the other algorithms, with the Bayes net algorithm also performing well after feature selection.

Figure 4 presents the F-measure values after the algorithms were run with the feature selection methods. The highest performance was achieved by SVM combined with CFS/information gain/symmetrical uncertainty.

Figure 5 displays the AUC values after feature selection. The Bayesian network combined with the wrapper (logistic regression) method performed best; as the figure shows, the AUC improved after feature selection for most algorithms.

The results show that feature selection produced significant improvements in model performance for some methods (e.g., J48), while it decreased performance for others (e.g., MLP and RF). Table 6 compares the best results achieved before and after feature selection.

Performance result comparison before and after feature selection.

Performance metric | Best value (before FS) | ML technique(s) (before FS) | Best value (after FS) | ML technique + FS algorithm (after FS) | Difference
Accuracy | 83.2 | SVM, Bayesian network | 85.5 | SVM + CFS/information gain/symmetrical uncertainty | +2.3
Sensitivity | 81 | MLP | 82.5 | MLP + GA | +1.5
Specificity | 89.4 | SVM | 91.2 | SVM + wrapper-NB | +1.8
Precision | 86 | SVM | 87.9 | SVM + CFS/information gain/symmetrical uncertainty | +1.9
F-measure | 81.3 | Bayesian network | 83.5 | SVM + CFS/information gain/symmetrical uncertainty | +2.2
ROC area | 90.3 | Bayesian network | 90.7 | Bayesian network + wrapper-logistic regression | +0.4
PRC area | 90 | Bayesian network | 90.2 | Bayesian network + wrapper-NN | +0.2

Table 6 shows that filter feature selection techniques improved model performance in terms of accuracy, precision, and F-measure, whereas wrapper-based and evolutionary algorithms enhanced sensitivity and specificity. The best accuracy, 85.5, was obtained by SVM combined with filter methods: in the best case, filter-based selection increased model accuracy by +2.3, the largest improvement of any index, while the PRC area improved the least, by +0.2.

Figure 6 shows the ML model run times before and after feature selection; all models were run on a Core i3 machine (RAM = 4 GB). With the original dataset, the ML models reached an average model building time of 0.59 ± 0.34 s, with MLP (1.64 s) and NB (0.01 s) having the longest and shortest times, respectively. After feature selection, the models built on the features chosen by the Relief and gain ratio methods achieved average building times of 0.44 ± 0.19 s and 0.42 ± 0.18 s, respectively, while the backward selection and wrapper + NB methods yielded average building times of 0.14 ± 0.06 s and 0.13 ± 0.06 s, respectively.

Figure 6. ML models' execution time before and after feature selection.

Table 7 summarizes the findings of this study alongside those of related papers.

Based on the data presented in Table 7, the accuracy achieved in this paper, 85.5% with the SVM algorithm and the CFS/information gain/symmetrical uncertainty feature selection methods, was higher than that of similar papers.

Comparative accuracy results of similar studies and the present study.

Feature selection methods | ML algorithms | Best algorithm performance | Best accuracy (%) | Year | References
Feature reduction (11 features) | Naive Bayes, J48, and bagging | Bagging | 85.03 | 2014 |
Chi-squared feature evaluator | Random forest | RF | 83.7 | 2015 |
Feature reduction | Naive Bayes, KNN, decision tree, and bagging | KNN | 79.2 | 2017 |
Brute force feature selection | Bayes net, naive Bayes, random forest, C4.5, multilayer perceptron, PART, majority voting | Majority voting | 85.48 | 2019 |
Linear discriminant analysis (LDA), hybrid feature selection algorithm, and medical doctors' recommendation-based feature selection | Naive Bayes (NB), random forest (RF), k-nearest neighbor (KNN), support vector machine (SVM), extreme gradient boosting (XGBoost) | SVM | 81.84 | 2019 |
PC features, chi-squared, Relief-F, symmetrical uncertainty | Bayes net, logistic regression, stochastic gradient descent (SGD), KNN, random forest | Chi-squared feature selection with the Bayes net algorithm | 85.00 | 2020 |
Filter methods (CFS, information gain, gain ratio, Relief, symmetrical uncertainty), wrapper methods (forward and backward selection, naive Bayes, decision tree, KNN, NN, SVM, logistic regression), and evolutionary algorithms (PSO, ABC, and genetic algorithm) | Bayes net, naive Bayes (NB), multilayer perceptron (MLP), support vector machine (SVM), LogitBoost, J48, and random forest | SVM + CFS/information gain/symmetrical uncertainty | 85.5 | 2023 | Our study

This study evaluated the influence of feature selection methods on the performance of various algorithms. First, the algorithms were applied to the dataset without feature selection. The SVM and Bayesian network algorithms showed the most robust performance, with accuracies of 83.2 and 83.0, respectively; however, on the combined criteria (F-measure = 81.3, AUC = 90.3, and PRC area = 90), the Bayesian network was more efficient. Sixteen feature selection methods were then applied in three categories: filter, wrapper, and evolutionary. The wrapper methods selected the fewest features (backward selection = 4, forward selection = 5), and a filter method selected the most (Relief = 12); the evolutionary methods PSO and ABC each selected 7 features, but although the counts were the same, the features chosen differed between the two algorithms. In his analysis of correlation-based feature selection methods for predicting heart disease, Reddy concluded that the highest accuracy could be achieved with 8 selected features; reducing the number to 6 brought no improvement, whereas changing the selection method and selecting only 3 features increased accuracy. He concluded that the selection method and the type of features selected can have a significant impact on algorithm performance 50 . In the present study, after applying the three filter methods CFS, information gain, and symmetrical uncertainty, which all selected the same ten features, the highest accuracy was associated with SVM = 85.5. Although the filter methods selected more features than the wrapper or evolutionary methods, they also produced greater accuracy. In Gokulnath's study, the filtering methods were likewise found to increase the model's accuracy, with filter methods yielding the most significant improvement in the F-measure index (2.2) 30 . In his study of cardiovascular disease, Şevket Ay concluded that feature selection with metaheuristics such as cuckoo search (CS), the flower pollination algorithm (FPA), the whale optimization algorithm (WOA), and Harris hawks optimization (HHO) improved the F-score and AUC indices 8 . In the present study, the genetic algorithm was only able to increase the sensitivity index, by 1.5, relative to other methods, while the wrapper-based feature selection methods improved the ROC area, PRC area, and specificity indicators. Furthermore, the results indicate that feature selection does not always improve a model: two algorithms (MLP and RF) showed decreased performance after feature selection.

The present study achieved a higher accuracy, 85.5%, than other similar papers (Table 7), obtained after applying the CFS/information gain/symmetrical uncertainty feature selection methods. The results show that feature selection can improve most of the evaluation indices of heart disease prediction algorithms, with the largest improvement observed in accuracy. All 16 feature selection methods were implemented with all algorithms, yielding new insights into feature selection: one key finding was the differing influence of the various feature selection groups on algorithm performance. The results also show that the feature selection methods that select the fewest features do not necessarily achieve the best improvement in model performance; the best accuracy was obtained with filter methods that selected more features than the other methods. This suggests that the type of the selected variables may matter more for model building than their number, and that not every feature selection method improves model performance. Comparing different feature selection methods and measuring their impact therefore plays a significant role in building the best prediction model.

Artificial intelligence (AI) technologies have advanced to the point where they offer deep, efficient, and non-intrusive analytical capabilities that support the decision-making of physicians and health policy-makers better than conventional methods 9 , 10 . The use of machine learning (ML) models to support medical diagnosis, screening, and clinical prognosis is also on the rise, owing to their high capacity to identify and categorize patients 51 . Clinical professionals are currently confronted with vast amounts of complex and imprecise health data, which makes informed decision-making difficult 52 . Faster decision-making in heart disease can reduce complications and improve the patient's condition, and machine learning algorithms, with their high accuracy, have been instrumental in predicting, diagnosing, and treating various diseases.

The main findings of this study are as follows:

  • This study examines the role of feature selection methods in optimizing machine learning algorithms in heart disease prediction.
  • Based on the findings, the filter feature selection method with the highest number of features selected outperformed other methods in terms of models' ACC, Precision, and F-measures.
  • Wrapper-based and evolutionary algorithms improved models' performance from a sensitivity and specificity point of view.
  • To the best of our knowledge, this study is among the few to compare the performance of different feature selection methods against each other in the field of heart disease prediction.
  • Previous research has mainly focused on enhancing algorithms, whereas studies examining the impact of feature selection on cardiac prediction have considered only a limited number of methods, such as filter or metaheuristic techniques.
  • As a result, the findings of this study may be of value to health decision-makers, clinical specialists, and researchers. The findings of this study will enable clinical professionals to utilize artificial intelligence more effectively in the prediction of heart disease. Policymakers will be able to plan and allocate resources for the utilization of AI in the area of health promotion and prevention of cardiovascular disease, and researchers can draw on the findings of this study to inform further research on the function of feature selection methods across various fields of disease.

Limitation and future scope

The limitations of this study include the use of a single dataset and of only seven algorithms; better results might be obtained with multiple datasets and additional algorithms. Another limitation is that socio-economic characteristics and other clinical characteristics related to lifestyle (e.g., smoking, physical activity) were not considered; future studies could provide better results by including a broader range of clinical and socio-economic characteristics. Other information (e.g., patient medical images and ECG signals) was also not included; the combined use of structured and unstructured data, signals, and medical images could give researchers more comprehensive insights and serve as a foundation for future exploration. Furthermore, the limited size of the dataset may restrict the generalizability of the findings to the broader population, so larger datasets and sample sizes are needed to strengthen future results. Based on these findings, the present research team will focus on using larger datasets with a wider range of features, will examine the impact of different feature selection techniques on other disease domains, and will employ additional algorithms, including deep learning techniques.

Author contributions

Z.N., A.O., and L.E. have read and approved the manuscript. Z.N. and L.E. contributed equally to the study design. Z.N., A.O. and L.E. prepared the manuscript and revised it critically. Z.N., A.O., and L.E. were the initiator of the literature search and review, reading, categorizing and analyzing. L.E. and Z.N. developed the proposed model and A.O. additionally performed supervisory tasks.

This study was supported by Shiraz University of Medical Sciences (Grant No. 26917).

Data availability

Competing interests.

The authors declare no competing interests.

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
