Characterizing the vapor intrusion (VI) exposure pathway can be extremely challenging because of its complexity and the numerous factors that influence it. One approach is the “bottom-up investigation approach” (Ma et al. 2021), in which investigation progresses from the subsurface volatile organic compound (VOC) sources toward receptors. This approach may include measuring VOC concentrations in subslab soil gas and comparing them with regulatory screening levels. USEPA has recommended a generic attenuation factor (AF) of 0.03, which is frequently used to derive subslab vapor screening levels from risk-based indoor air screening levels (IASLs). This AF is based on statistical analysis of USEPA’s VI database, which contains more than 2,000 paired measurements from VI sites across the country (primarily residential buildings). This presentation describes an innovative approach to screening-level assessment of the VI pathway that analyzes USEPA’s VI database using statistical and machine learning algorithms.
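As a rough illustration of how a generic AF turns an indoor air screening level into a subslab screening level, the sketch below divides the IASL by the AF, which follows from defining the AF as the ratio of indoor air concentration to subslab concentration. The function names and the IASL of 1.0 µg/m³ are hypothetical placeholders chosen for the arithmetic, not regulatory values.

```python
# Minimal sketch of AF-based screening. The IASL used in the example is a
# placeholder for illustration only, not an actual USEPA screening level.

GENERIC_AF = 0.03  # USEPA generic subslab-to-indoor-air attenuation factor


def subslab_screening_level(iasl_ug_m3, af=GENERIC_AF):
    """Return the subslab vapor screening level (ug/m3) as IASL / AF."""
    if not 0 < af <= 1:
        raise ValueError("attenuation factor must be in (0, 1]")
    return iasl_ug_m3 / af


def screens_in(subslab_conc_ug_m3, iasl_ug_m3, af=GENERIC_AF):
    """True if the AF-predicted indoor air concentration exceeds the IASL."""
    predicted_indoor_air = subslab_conc_ug_m3 * af
    return predicted_indoor_air > iasl_ug_m3


example_iasl = 1.0  # hypothetical IASL, ug/m3
example_ssl = subslab_screening_level(example_iasl)
high_result = screens_in(50.0, example_iasl)  # 50 * 0.03 = 1.5 > 1.0
low_result = screens_in(20.0, example_iasl)   # 20 * 0.03 = 0.6 < 1.0
```

With these placeholder inputs, a subslab concentration of 50 µg/m³ screens in and 20 µg/m³ screens out, because the subslab screening level works out to roughly 33 µg/m³.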
Statistical and machine learning classification models based on logistic regression, standard decision trees, gradient boosting, and XGBoost were built using USEPA’s VI database. After appropriate filters were applied, the database was divided into training and test datasets, and the performance of each model was evaluated on the test datasets using metrics such as false negative (FN) and false positive (FP) error rates, recall, and precision. The overall purpose was to develop models that use subslab trichloroethene (TCE) concentrations to predict potential exceedances of USEPA’s IASLs for various VOCs. Hyperparameter tuning using grid search and random search was conducted to improve model performance. The performance of these models was then compared with that of USEPA’s generic AF for VI risk screening using test datasets collected in residential and commercial buildings.
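The evaluation metrics named above can be sketched as follows for a binary screening decision (1 = IASL exceedance, 0 = no exceedance). The definitions are the conventional confusion-matrix ones and are an assumption here, since the abstract does not state how the FN and FP error rates were computed. The function name and the eight-sample labels are made up for illustration.

```python
# Minimal sketch of confusion-matrix metrics for a screening classifier.
# Assumed definitions: FN rate = FN / (FN + TP), FP rate = FP / (FP + TN).


def screening_metrics(actual, predicted):
    """Return FN rate, FP rate, recall, and precision for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "fn_rate": ratio(fn, fn + tp),    # exceedances that were missed
        "fp_rate": ratio(fp, fp + tn),    # non-exceedances screened in
        "recall": ratio(tp, tp + fn),     # exceedances correctly flagged
        "precision": ratio(tp, tp + fp),  # flagged cases that truly exceed
    }


# Made-up labels for illustration only (not from USEPA's VI database).
actual_labels = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_labels = [1, 1, 0, 1, 0, 0, 1, 0]
metrics = screening_metrics(actual_labels, predicted_labels)
```

Under these definitions the FN rate equals one minus recall, so a low FN rate and a high recall describe the same property of a screening model.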
The tuned XGBoost model performed well and was consistent with the USEPA AF in predicting exceedances of USEPA IASLs for the residential test dataset, based on FN (≤5%) and FP (≤21%) error rates, recall (>90%), and precision (>70%). The logistic regression and decision tree models performed worse than the USEPA AF on the residential test dataset. These FN and FP rates indicate that the XGBoost model is sufficiently conservative to identify residential buildings/sites potentially posing unacceptable VI risk based on subslab TCE vapor concentrations, while avoiding potentially costly remediation due to “screening-in” of sites/buildings not posing a significant risk. A limited evaluation using data collected from commercial buildings indicated that both the USEPA AF and the tuned XGBoost model erroneously screened in commercial buildings not posing significant risk, at elevated FP rates (>30%). This is likely because both the USEPA AF and the tuned XGBoost model were developed from the residential building dataset. Developing separate VI screening models based on VI data for commercial buildings would likely avoid potentially costly remediation due to screening-in of sites/buildings not posing a significant risk.