Retrospective Selective P-Value Based on Summary Statistics

This function supports valid retrospective F-screening as defined in the 2025 paper "Valid F-screening in linear regression" by McGough, Witten, and Kessler (arXiv preprint: https://arxiv.org/abs/2505.23113). Suppose that we have access to the output of an "overall" least squares linear regression model, such as summary(lm(y~X)), and we want to test the significance of a single regression coefficient (beta_j) in a way that accounts for the rejection of the "overall" F-test. This function provides a selective p-value for beta_j based on only a few summary statistics. Its arguments include the R-squared and residual standard error (RSE) from the overall model (e.g. from summary(lm(y~X))), and a t-statistic for the test of H_0: beta_j=0. It is especially useful in settings where the raw data are unavailable, such as published studies.
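The selection event here is the rejection of the overall F-test, and that can be checked from the same summary statistics. As a minimal sketch in base R, the overall F-statistic for a model with an intercept follows from R-squared, n, and p through the standard identity F = (R^2 / p) / ((1 - R^2) / (n - p - 1)). The helper name overall_F_from_R2 is hypothetical and is not part of the package.

```r
# Hypothetical helper (not part of the package): recover the overall
# F-statistic from R-squared for a model with an intercept and p predictors.
overall_F_from_R2 <- function(n, p, R_squared) {
  (R_squared / p) / ((1 - R_squared) / (n - p - 1))
}

mod <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(mod)
n <- nrow(mtcars)
p <- 2

F_stat <- overall_F_from_R2(n, p, s$r.squared)

# Critical value of the overall F-test at level alpha_ov = 0.05
F_crit <- qf(1 - 0.05, df1 = p, df2 = n - p - 1)

# TRUE when the overall F-test rejects, i.e. the selection event occurred
overall_rejected <- F_stat >= F_crit
```

The recovered value should match the F-statistic reported by summary(mod), which is why R-squared, n, and p are enough to tell whether the overall test was rejected.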

Usage

psel_retro(
  n,
  p,
  R_squared,
  RSE,
  tstat,
  sigma_sq = NULL,
  alpha_ov = 0.05,
  B = 1e+06,
  min_select = 1000,
  max_attempts = 100
)

Arguments

n: Sample size (number of observations).
p: Number of predictors used in the "overall" least squares linear model (excluding the intercept).
R_squared: R-squared from the "overall" fitted least squares linear model (e.g. from summary(lm(y~X))).
RSE: Residual standard error from the "overall" fitted least squares linear model (e.g. from summary(lm(y~X))).
tstat: Observed t-statistic for the post hoc hypothesis test of beta_j.
sigma_sq: Optional estimate of the noise variance. If NULL, a debiased estimate that accounts for selection is used.
alpha_ov: Significance level for the overall F-test. Default is 0.05.
B: Number of Monte Carlo samples per iteration. Default is 1,000,000.
min_select: Minimum number of Monte Carlo samples that must satisfy the selection condition. Default is 1,000.
max_attempts: Maximum number of iterations to attempt to satisfy the selection criterion before giving up. Default is 100.
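The roles of B, min_select, and max_attempts can be pictured with a generic conditional Monte Carlo loop. The sketch below is an assumption about the general pattern, not the package's implementation: the function cond_tail_mc and its toy selection event are hypothetical. It draws batches of size B, keeps the draws that satisfy a selection condition, stops once at least min_select draws are kept or max_attempts batches have been tried, and returns NA with a warning if no draws were kept.

```r
# Toy illustration only: estimates P(|Z| >= t_obs | Z^2 + W >= cutoff),
# where Z ~ N(0, 1) and W ~ chi-squared(1) are independent.
cond_tail_mc <- function(t_obs, cutoff, B = 1e4, min_select = 1000,
                         max_attempts = 100) {
  kept <- numeric(0)
  for (attempt in seq_len(max_attempts)) {
    z <- rnorm(B)
    w <- rchisq(B, df = 1)
    selected <- (z^2 + w) >= cutoff   # toy selection condition
    kept <- c(kept, z[selected])      # accumulate selected draws
    if (length(kept) >= min_select) break
  }
  if (length(kept) == 0) {
    warning("No samples satisfied the selection condition.")
    return(NA_real_)
  }
  mean(abs(kept) >= t_obs)            # conditional tail probability
}

set.seed(1)
p_toy <- cond_tail_mc(t_obs = 2, cutoff = 6)

# A selection event that essentially never occurs yields NA
p_na <- suppressWarnings(
  cond_tail_mc(t_obs = 2, cutoff = 1e6, B = 100, max_attempts = 2)
)
```

In a loop of this kind, a rarer selection event needs more batches to reach min_select, so a larger B or max_attempts trades run time for a more stable estimate.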

Value

A numeric value representing the estimated selective p-value. If no selected samples are obtained after max_attempts iterations, the function returns NA and issues a warning.

Examples

data(mtcars)
mod <- lm(mpg ~ wt + hp, data = mtcars)
# summary statistics from the overall model
rse <- summary(mod)$sigma
r2 <- summary(mod)$r.squared
t_hp <- summary(mod)$coefficients["hp", "t value"]
# retrospective selective p-value for hp, from summary statistics only
psel_retro(n=nrow(mtcars), p=2, R_squared=r2, RSE=rse, tstat=t_hp)
#> [1] 0.001473
# prospective selective p-value for hp, from the raw data
result <- lmFScreen(mpg ~ wt + hp, data = mtcars)
result[["selective pvalues"]][2]
#>       hp 
#> 0.001377 
# the retrospective and prospective p-values agree up to Monte Carlo error
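
When only a published regression table is available, the t-statistic for beta_j can be computed as the coefficient estimate divided by its standard error. The numbers below are made up for illustration and do not come from any study. The sketch also assumes that psel_retro() is provided by an installed package named lmFScreen; the call is skipped otherwise.

```r
# Hypothetical values, as they might be read off a published table
n_pub   <- 150    # sample size
p_pub   <- 4      # number of predictors, excluding the intercept
r2_pub  <- 0.21   # R-squared of the overall model
rse_pub <- 1.8    # residual standard error
est_j   <- 0.42   # estimate of beta_j
se_j    <- 0.15   # standard error of the estimate

# t-statistic for H_0: beta_j = 0
t_j <- est_j / se_j

# Assumption: psel_retro() lives in a package named lmFScreen
if (requireNamespace("lmFScreen", quietly = TRUE)) {
  p_sel <- lmFScreen::psel_retro(n = n_pub, p = p_pub, R_squared = r2_pub,
                                 RSE = rse_pub, tstat = t_j)
  if (is.na(p_sel)) {
    message("No selected samples were obtained; p-value is NA.")
  }
}
```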