Variable/predictor/feature selection methods are notorious for instability and for poor overlap between the "selected" features and the "real" features. They also require very large sample sizes before the selected set becomes stable and mostly correct. As shown in this presentation, the *lasso* has too low a probability of selecting the "right" features and too high a probability of selecting the "wrong" features even in the ideal case where the predictors are uncorrelated and the true unknown regression coefficients follow a Laplace (double exponential) distribution centered at zero:

A key problem with feature selection methods is that they do not identify the set of features about which we don't know enough to either select or reject them. The following approximate Bayesian procedure may help in this regard, and may also expose whether the sample size is adequate for the task. It is a simplified univariable screening approach that is most appropriate when the predictors are uncorrelated with each other.

- Assume the candidate features are all scaled similarly and have linear effects on the outcome
- Assume independent prior distributions for the candidate feature regression coefficients. The priors can be skeptical of large effects and may even place a spike of probability mass at zero effect.
- Assume for the moment that we are only interested in selecting features that have a positive association with Y (if not, replace the posterior probability calculation with, for example, the probability that |\beta_j| > 0.2)
- Fit a model with each feature as the only predictor, and repeat for all candidate predictors
- For each predictor j compute the posterior probability (PP) that \beta_j > 0
- Consider a predictor to be selected when PP > 0.9 and rejected when PP < 0.1
- Count the number of candidate features for which PP is between 0.1 and 0.9
- If the fraction of candidates falling into the uncertainty interval exceeds some number such as 0.25, conclude that the sample size is inadequate for the task
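The steps above can be sketched in code. This is only a minimal illustration, not the author's implementation: it assumes a conjugate normal prior \beta_j ~ N(0, \tau^2) (no spike at zero), plugs in the residual variance from each univariable fit in place of a full posterior over the noise variance, and uses simulated uncorrelated predictors. All function names and parameter values are hypothetical choices for the example.

```python
import numpy as np
from scipy.stats import norm

def univariable_screen(X, y, tau=1.0, lo=0.1, hi=0.9):
    """Approximate Bayesian univariable screening.

    For each feature (scaled to unit variance), fit a one-predictor linear
    model and compute the posterior probability (PP) that its coefficient
    is positive, using a conjugate normal prior beta_j ~ N(0, tau^2) and a
    plug-in estimate of the residual variance (an approximation).
    Returns the PPs and a three-way selected/rejected/uncertain status.
    """
    n, p = X.shape
    pp = np.empty(p)
    yc = y - y.mean()
    for j in range(p):
        x = X[:, j]
        x = (x - x.mean()) / x.std()          # scale features similarly
        xtx = x @ x
        bhat = (x @ yc) / xtx                 # univariable OLS slope
        resid = yc - bhat * x
        sigma2 = resid @ resid / (n - 2)      # plug-in residual variance
        # Conjugate normal update: precisions add, mean is shrunk toward 0
        post_var = 1.0 / (1.0 / tau**2 + xtx / sigma2)
        post_mean = post_var * (x @ yc) / sigma2
        pp[j] = 1.0 - norm.cdf(0.0, loc=post_mean, scale=np.sqrt(post_var))
    status = np.where(pp > hi, "selected",
             np.where(pp < lo, "rejected", "uncertain"))
    return pp, status

# Simulated example: uncorrelated predictors, three real positive effects
rng = np.random.default_rng(1)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 0.5
y = X @ beta + rng.standard_normal(n)

pp, status = univariable_screen(X, y)
frac_uncertain = np.mean(status == "uncertain")
print(f"selected={np.sum(status == 'selected')}, "
      f"rejected={np.sum(status == 'rejected')}, "
      f"uncertain fraction={frac_uncertain:.2f}")
if frac_uncertain > 0.25:
    print("Sample size is likely inadequate for the selection task")
```

Note that "rejected" here means PP < 0.1 for a positive effect; a feature with a strong *negative* association would also land in this bin, which is consistent with only seeking positive associations.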

I’m interested in comments about the usefulness/reasonableness of this procedure. In my mind this is more useful than computing the false discovery probability. It quantifies what we don’t know.