"Regression modeling of control points for conceptual constraints on interpolated iodine-129 groundwater contamination distribution plume extents."
1. Introduction
The radiological contaminant iodine-129 (I-129) is a serious concern in groundwater beneath the Hanford Site in south-central Washington State. Its source is past discharges to disposal infiltration structures and leaking waste storage tanks on the Hanford Site during the operational period between 1945 and 1996. I-129 poses considerable risks to human health and the environment: downstream receptors could be exposed to I-129-contaminated groundwater through various pathways (e.g., ingestion). Cleanup has therefore been mandated for I-129-contaminated groundwater at the Hanford Site. To understand the extent of I-129 groundwater contamination (known in the industry as a plume) and its migration, groundwater concentrations are modeled at the mandated cleanup level of 1.00 picocurie per liter (pCi/L). Accurate flow and transport model forecasts of I-129 time-to-cleanup depend on the most accurate possible representation of the plume area. Plume interpolation relies on control points (CPs) in addition to measured concentrations. The CPs were derived from input by project scientists and do not represent measured data but rather values that should honor a site conceptual model of the groundwater system. This study uses data science techniques to estimate better CP values and to compare how the CPs used in plume interpolation affect the interpolants' ability to match actual field-measured I-129 concentration data. The outcomes of this comparison will inform how best to employ control points in flow and transport modeling.
2. Dataset
The dataset used for I-129 plume interpolation includes observed data from the Hanford Environmental Information System (HEIS) database and the derived CPs. The data used in this study were previously selected for creation of the original I-129 plume rasters. The observed data comprise I-129 groundwater sampling analytical results collected between 2014 and 2019. The CPs are examined yearly, and the values used in this study represent updates from 2019. Sampled data used in the regression were derived directly from the raster interpolations performed in this study.
Data proportions are tilted heavily towards the observed data over the CPs (86% observed versus 14% CPs, Figure 1). Sampled and observed data correlate well (Figure 2) except at locations where observed data did not lie within the boundary of the raster interpolation. Those locations returned sampled values of zero.
The original dataset comprises 290 total data points (250 observed and 40 CPs). Observed data range in value from 0.2 to 12.0 pCi/L, with a mean of 2.31 pCi/L, a median of 1.24 pCi/L, a variance of 4.89 (pCi/L)², and a standard deviation of 2.21 pCi/L. CP values range from 0.00 to 1.10 pCi/L, with a mean of 0.20 pCi/L, a median of 0.00 pCi/L, a variance of 0.15 (pCi/L)², and a standard deviation of 0.38 pCi/L.
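As a quick consistency check on the figures above, the short Python sketch below recomputes the data proportions from the reported counts and confirms that each reported standard deviation is close to the square root of the reported variance. It uses only the numbers quoted in this section; the variable names are illustrative, and the study itself performed its calculations in R.

```python
import math

# Counts reported in this section
n_observed, n_cp = 250, 40
n_total = n_observed + n_cp

# Percentage shares, rounded to whole percent as quoted in the text
share_observed = round(100 * n_observed / n_total)
share_cp = round(100 * n_cp / n_total)

# Standard deviation implied by each reported variance
sd_from_var_observed = math.sqrt(4.89)
sd_from_var_cp = math.sqrt(0.15)

print(n_total, share_observed, share_cp)
print(round(sd_from_var_observed, 2), round(sd_from_var_cp, 2))
```

Small differences in the last digit are expected because the reported variances are themselves rounded.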
3. Variables
Sampled raster ("SAMPLED") values were used in the regression modeling in addition to the measured I-129 concentrations and professionally derived CPs. Measured and CP values were compiled into the variable "MAPVAL", which holds the original values intended for raster interpolation. Rasters of I-129 groundwater concentration distributions were created from datasets both with and without CPs in order to gauge the effect of the CPs on the interpolation residuals. It is expected that better CP estimates will reduce residual values and thus provide more accurate I-129 plume extents.
The ordinary kriging algorithm used for interpolating the rasters predicts the value of a function at a given point by computing a weighted average of the known values of the function in the neighborhood of the point. The algorithm assumes a constant but unknown mean only over the search neighborhood of the random variable. Ordinary kriging was implemented using the FORTRAN executable Quantile [1], which outputs the raster in ASCII format. The ASCII rasters were then imported into the geographic information system (GIS) QGIS for sampling. Raster sampling locations are shown in Figure 3.
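The weighted-average idea behind ordinary kriging can be illustrated with a minimal Python sketch. This is not the Quantile implementation: the exponential variogram, its sill and range, the coordinates, and the concentrations below are all hypothetical values chosen for illustration, and no search neighborhood is applied (all points are used). The weights come from solving the ordinary kriging system, in which a Lagrange multiplier forces the weights to sum to one.

```python
import math

def exponential_variogram(h, sill=1.0, practical_range=500.0):
    """Hypothetical exponential semivariogram with no nugget (assumed for illustration)."""
    return sill * (1.0 - math.exp(-3.0 * h / practical_range))

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def ordinary_kriging(points, values, target, variogram=exponential_variogram):
    """Return (estimate, weights) at `target` from known `points` and `values`."""
    n = len(points)
    # Semivariances between known points, bordered by ones for the
    # unbiasedness constraint (weights sum to one)
    lhs = [[variogram(math.dist(p, q)) for q in points] + [1.0] for p in points]
    lhs.append([1.0] * n + [0.0])
    # Semivariances between each known point and the target
    rhs = [variogram(math.dist(p, target)) for p in points] + [1.0]
    solution = solve_linear_system(lhs, rhs)
    weights = solution[:n]  # the last entry is the Lagrange multiplier
    estimate = sum(w * v for w, v in zip(weights, values))
    return estimate, weights

# Made-up well locations (m) and concentrations (pCi/L)
points = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0), (100.0, 100.0)]
values = [0.5, 1.2, 2.0, 3.1]

center_estimate, center_weights = ordinary_kriging(points, values, (50.0, 50.0))
corner_estimate, _ = ordinary_kriging(points, values, (0.0, 0.0))
print(center_estimate, center_weights)
```

With a zero-nugget variogram, the estimate at a data location reproduces the data value, which is why the choice of CP values directly shapes the interpolated surface near the plume edges.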
4. Research Question
The objective of this project is to predict CP values through regression that provide better estimates for controlling the I-129 plume extents, as indicated by interpolated values that more closely match measured values. However, domain knowledge plays a large role in these decisions because of the multitude of environmental factors that may affect contaminant concentrations, especially along the leading edges of the contaminant concentration distribution. Thus, before committing to a method of CP value estimation, it is pertinent to ask:
Is there potential for enhanced estimates of groundwater contaminant plume control points using regression methods, beyond estimates based on professional judgment alone?
5. Data Processing and Partitioning
The data processing workflow follows the raster sampling explained in Section 3. In addition, random noise derived from CP statistics was added during model iterations to capture the uncertainty present in the original interpolated data ("SAMPLED"). Because "SAMPLED" values are derived from a raster interpolated from "MAPVAL", they carry inherent uncertainty from the interpolation process, and the true values may differ slightly from the interpolated values. Adding noise based on the CPs (which are presumed to be more accurate, although they are expert-derived values rather than measurements) accounts for potential errors introduced by the interpolation.
Post-sampling data processing steps were done in R. The data were subset for modeling with and without CPs, and a subset containing the original measured data points was set aside for calculating model metrics. The data were then subjected to simple linear and Bayesian regression.
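The noise step can be sketched as follows. The text does not state the exact form of the noise, so this Python example assumes zero-mean Gaussian noise whose standard deviation equals the sample standard deviation of the CP values. The function name `add_cp_noise` and all data values are hypothetical, and the study's own processing was done in R.

```python
import random
import statistics

def add_cp_noise(sampled, cp_values, seed=None):
    """Perturb sampled raster values with noise scaled to the CP spread.

    Assumption: noise ~ Normal(0, sd of CP values). The source only says the
    noise is "derived from CP statistics".
    """
    rng = random.Random(seed)
    noise_sd = statistics.stdev(cp_values)
    return [s + rng.gauss(0.0, noise_sd) for s in sampled]

# Made-up values (pCi/L) for illustration only
cp_values = [0.0, 0.0, 0.0, 0.5, 1.1]
sampled = [0.8, 1.3, 2.4, 5.0]

noisy = add_cp_noise(sampled, cp_values, seed=42)
print(noisy)
```

Whether perturbed values that fall below zero should be truncated is a modeling decision the text does not address, so the sketch leaves them unchanged.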
6. Model Implementation and Selection
Simple and Bayesian linear regression analyses were conducted on subsets with and without control points (CPs). For each method, 10,000 iterations were executed, and the best models were selected based on their performance. The top simple linear regression models were identified by their lowest mean squared errors (MSE). For Bayesian models, leave-one-out (LOO) cross-validation, which evaluates predictive accuracy for individual data points, was used. This approach aligns with the Bayesian emphasis on uncertainty and individual predictions.
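The iterate-and-select step for the simple linear regression can be sketched in Python as below. Several details are assumptions rather than the study's method: "MAPVAL" is regressed on noise-perturbed "SAMPLED" values, the noise is zero-mean Gaussian with a hypothetical standard deviation, and the data are made up. The Bayesian models and LOO cross-validation are not reproduced here.

```python
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return statistics.fmean([(a - p) ** 2 for a, p in zip(actual, predicted)])

def best_of_iterations(sampled, mapval, noise_sd, n_iter=1000, seed=0):
    """Refit on freshly perturbed inputs each iteration; keep the lowest-MSE fit."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iter):
        noisy = [s + rng.gauss(0.0, noise_sd) for s in sampled]
        slope, intercept = fit_line(noisy, mapval)
        predicted = [slope * s + intercept for s in noisy]
        score = mse(mapval, predicted)
        if best is None or score < best[0]:
            best = (score, slope, intercept)
    return best

# Made-up SAMPLED and MAPVAL pairs (pCi/L); noise_sd is a hypothetical value
sampled = [0.1, 0.9, 1.4, 2.2, 3.8, 5.1]
mapval = [0.0, 1.0, 1.2, 2.5, 4.0, 5.0]

single_fit = best_of_iterations(sampled, mapval, noise_sd=0.38, n_iter=1)
best_fit = best_of_iterations(sampled, mapval, noise_sd=0.38, n_iter=200)
print(best_fit)
```

Because both calls share a seed, the first iteration is identical, so the best of 200 iterations can never score worse than the single fit. Selecting on in-sample MSE favors lucky noise draws, which is one reason to score candidate models on held-out measured points.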
All analyses and metric computations were conducted using R. Both regression types were applied to datasets with and without CPs. Notably, the Bayesian model including CPs demonstrated superior performance (Table 1).
Table 1. Minimum and maximum mean squared error (MSE) for each regression model.

Model | MSE Minimum | MSE Maximum |
---|---|---|
Simple Linear Regression | 0.1279729 | 0.2354571 |
Simple Linear Regression, No CP | 0.6246118 | 1.326805 |
Bayesian Regression | 0.002790325 | 0.09096921 |
Bayesian Regression, No CP | 0.7220395 | 1.018008 |
7. Conclusions
I-129 plumes were re-interpolated using Quantile with CP values predicted from the various regressions. The plumes were sampled at the original data point locations, and the sampled values were compared with the originals using mean squared error (MSE) as the comparison metric. Interpolation using the original dataset with CPs resulted in an MSE of 0.03247458, while interpolation without CPs resulted in an MSE of 0.04985682. Both simple linear and Bayesian regression techniques, with and without CPs, consistently yielded an MSE of 0.03145218, about 3% lower than the original dataset with CPs. Figure 4 shows the I-129 plume rendered for the Bayesian regression with CPs retained. These findings answer the research question affirmatively, demonstrating that regression techniques can improve on CPs initially determined by professional judgment.
8. References