"Data engineering, machine learning, and geochemical modeling of analytical data for source partitioning and elucidation of endmember contribution to groundwater samples from multiple supra-basalt and basalt aquifer system samples."

1. Dataset

The data originates from two sampling events of wells and surface waters conducted in 2008 and 2010 by the Columbia Basin Ground Water Management Area (GWMA) of Adams, Franklin, Grant, and Lincoln counties spanning the Columbia Basin of central and eatern Washington State (Figure 1). The dataset consists of 443 sampling locations accounting for 443 observations. All parameters were numeric with the exception of well open interval representing the aquifer zone(s) sampled.

The target (or independant) variable is the primary hydrostratigraphic unit (HSU; primary unit) representing the aquifer zone(s) that each sampled well's intake is open to. Primary unit is at the heart of water resources concerns for GWMA because those areas are where available basalt aquifer groundwater resources are stored. These aquifers were recharged during the pleistocene epoch from continental glacial meltwater and from catastrohic flooding events that occurred in the region close to the end of the pleistocene. Because of their layered, confined nature, deeper basalt aquifers receive minimal to no modern recharge and thus puming in the Columbia Basin has reduced their stored resources significantly over the past 60 years. Thus it is of great importance to those dependent upon the resource to understand how its current and future availibilty is affected by withdrawals impacts.

The single catogrical feature of the dataset is type. Type represents if the sample comes from an aquifer (type = groundwater) or other waterbody such as a lake or stream (type = surfacewater). Tjese factors have significant bearing on water chemistry and thus recharge endmember differentiation.

Numeric features of the GWMA dataset include a range of groundwater chemistry and geospatial variables. Chemistry features include field parameters (pH, specific conductivity (SC), dissolved oxygen (DO), oxidation reduction potential (ORP)), major ions, stable isotopes of oxygen an hydrogen, radiocarbon dating parameters, dissolved gases (He, H2, Ne, N2, Ar, O₂, CO₂) and tritium counts. Horizontal geospatial coordinates are in universal transverse mercator format and include easting and northing that are projected in the UTM Zone 11N coordinate system. The vertical geospatial coordinate elevation was not used as a feature because it was not available in the dataset. Not all features were retined as model variables based on their suitability for the analysis. Variable descriptions and selection are discussed in the next section.

2. Variables

Input variables after exploratory data analysis and cleaning include the following:

SC – specific conductivity in micro-siemens/cm. This parameter is a measure of the amount of dissolved ionic solids in the sample. SC correlates directly with the samples ability to conduct electricity due to the concentration of ions in solution.

DO – dissolved oxygen in milligrams/liter. Measurement of free, useable oxygen in solution. DO is a good proxy for interconnection of the aquifer with the atmosphere and thus being an open system.

Nitrate – nitrate + nitrate as N in milligrams/liter. Primary groundwater contaminant from decomposition of organic substances and fertilizers.

ORP – oxidation reduction potential in millivolts. Measure of the capacity of the sample to accept hydrogen cations based on the strength of the negative charge resultant of the amount of free oxygen anions.

Temperature in degrees Celsius – sample quality indicator for representativeness of actual aquifer conditions. If the sample has a temperature out of range of what is expected for general groundwater conditions at the time of sampling, the sample could be compromised.

pH – measure of free hydrogen ions in solution. Acid conditions are indicated by low pH and basic conditions by high pH.

Latitude – northing geospatial coordinate [1].

Longitude – easting geospatial coordinate[1].

Primary Unit – category of sampling location open interval hydrostratgraphic unit(s). These are the names of the individual basalt flow units containing the stratified aquifer water-producing zone sampled at given locations.

Type – category of the sample representing whether it is groundwater or surface water.

Major Ions – numerical concentration values for Ca, Cl, K, Na, HCO₃, Mg, and SO₄. Major ions are indicators of general water quality and are reflective of water chemistry with respect to water/rock interaction.

SiO₂ – Numerical concentration. SiO₂ is a parameter useful for for determining water/rock interactions.

Age Dating Parameters – Numerical values for carbon-14, C-14 age, and tritium age. These parameters are indicative of time of recharge and thus important in groundwater source partitioning.

Stable Isotopes – Per-mil numerical ratios of δ¹⁸O, δD, and δ¹³C. Stable isotopes are are powerful indicators of climate-driven recharge conditions and recharge sources.

The output (target) variable (Primary Unit) has five classes corresponding to four basalt aquifer subunits and one indeterminant mixed endmember class. The mixed class was not used to train the model because it does not represent a recharge type endmember. Surface water and sediment samples were indistinguishable from the mixed claas so they were included therein. The features were plotted on boxplots and a correlation heatmap to look for outliers and redundancy (Figures 1 and 2).

All outliers seen Figure 1 were retained because there were no anomalous water quality indicators (SC, DO, ORP, pH, temperature) suggesting that any samples were compromised. In Figure 2, Mg and Ca are highly correlable but both features were retained because conceptually they both are major water type paramters which change somewhat proportionally dependant upon aqueous geochemical reactions.

3. Research Question

The purpose of this study is to answer the question of whether machine learning (ML) and neural networks (NN) techniques can be effective tools for differentiating groundwater sources of recharge for water supply wells and estimating the percent contributions of individual recharge endmembers constituting water supply well samples. Specifically:

Can groundwater recharge sources and source partitioning be accurately estimated using ensemble methods such as voting classifiers employing ML and NN models?

Basalt aquifers in the Columbia Basin in Washington State are at risk of being exhausted to the point where static levels will not be able to support water supply demands. Addressing this issue is important to water resources manageers in the region because many municipalities depend on groundwater and the region's multi-billion dollar agricultural economy is unsustainable without those groundwater resources. Water rights and regulatory constraints would also be much better served with a better understanding of basalt aquifer recharge partitioning. For example, knowing that wells open to specific basalt aquifer zones containing higher percentages of endmembers that receive substantial recharge will likely be more reliable producers.

4. Data Processing and Partitioning

Missing data in the GWMA dataset presents considerable challenges to analysis. Being that the dataset is relative small, imputaion was employed to fill in data gaps before variable selection (see Section 2 for an explanation of variables). Many imputation techniques exist but Multiple Imputation by Chained Equations (MICE; Figure 3) [2] was chosen because MICE factors correlations between variables well and it is quite accurate when compared to other methods.

This process is repeated for the desired number of datasets using either a Light Gradient Boosting Machine (LGBM) [3] model or mean matching [4]. Mean matching is used to produce imputations that behave more like the original data. Comparison of LGBM and mean matching results suggested a final featurest consisting of all chemical parameters except SiO₂ and C-14 Age from the LGBM imputation. Features SiO₂ and C-14 Age from mean mactching were retained in the final feature set. Comparison between the imputed and non-imputed original dataset versions indicated that dissolved gases be dropped from the final feature set because of the poor correlations of their kernel density estimations with that of the original dataset's for those variables.

The raw dataset was originally associated with open interval HSU information gleaned from borehole geologic logs. These data were labeled with 7 original HSU's corresponding with 5 basalt aquifer subunits, a sedimentary suprabasalt aquifer, and a mixed unit of multiple HSU's. These labeling designations were tested using various clustering techniques including agglomerative, K-means, gaussian mixture, bayesian gaussian mixture, and hierarchical density based (HDBSCAN) [5] clustering. Five classes (GRB = Grande Ronde Basalt, WB = Wanapum Basalt, WB_GRB = GRB/WB mix, SW = surface water, and Mixed) were determined to be most appropriate for the data via the clustering. Soft clustering using the HDBSCAN method was used to find the degree to which each point belongs to each cluster. This method was deemed highly appropriate for label determination by clustering because < 10% of points had top-two cluster membership probabilities within 0.2 of one another (Figure 4). This is a quite good metric and arguement for five classification categories. A 95% cluster membershp probability was used as a threshold for selecting instances for the training dataset. Subsequently each feature instance was assigned a label derived from the clusterting by counting the number of the 7 original primary units for each label class. For example, cluster 2 contained by far mostly samples from primary unit GRB so therefore the feature instances in that cluster were assiged the label GRB (Figure 5). The mixed samples were separated from the rest of the data which were then used to create the synthetic training dataset. Using the 95% cluster association threshold, the original data were separated into a training set of 95 samples and 383 mixed samples.

**Figure 4. Histogram and empirical cumulative distribution function of the differences between membership probabilities of each point’s top two clusters for the dataset.**

featureset_label_assignment — **Figure 5. Featureset Label Assignment.**

The cleaned training dataset determined via clustering consisted of original samples having a 95% probability of being a pure endmember and not mixed. This original training dataset had only 95 rows and thus potentially lacked enough statistical power in order to increase the accuracy of the classification models. Therefore, a synthetic training dataset was created based on the actual measured values. Cognitive tabular generative adversarial networks (CTGAN) [6] modeling was used to generate a synthetic dataset closley resembling the original dataset but with 10,000 rows. Similarty bewteen the original and sythetic datasets was observed by comparing the kernel density estimates of both (Figure 6). Feature densities between the synthetic and original datasets were similarly distributed with an overall quality score of 90.48% and column pair trends matching at 94.37%. These metrics indicate that the synthetic dataset is well suited for classification model training.

**Figure 6. Original Versus Synthetic Datasets Comparisons.**

The synthetic feature and target sets were split into training, testing, and validation sets using 81% of the data for training, 10% for testing, and 9% for validation. This split was chosen because of the substantial size of the dataset. Subsequently the numerical training and test feature set variables were scaled using the scikit-learn Python package’s StandardScaler module [7]. Standardization is appropriate for this study’s dataset because of the ML methods being used. The split sets where transformed into tensors only for use in the NN models.

5. Model Implementation and Selection

Support vector machine (SVM), random forest, extra trees, multi-layer perceptron (MLP), convolutional neural networks, and XGBoost models were implemented by fitting with the training data and finding the best estimators using various hyperparameter tuning methods (Table 1). The best models after hyperparameter tuning produced validation accuracies between 84.4 and 100%. In an effort to increase overall accuracy, ensemble voting classifiers composed of the best estimators were constructed.

Table 1. Model Performance Comparison

Model Hyperparameter Tuning Method Validation Accuracy

Support Vector Classifier grid search cross validation 0.888

Random Forest grid search cross validation 0.893

Extra Trees grid search cross validation 0.898

MLP grid search cross validation 0.934

CNN with Randomized Search randomized search cross validation 0.844

CNN with Early Stopping early stopping 0.857

XGBoost randomized search cross validation 1.00

Three voting classifiers were constructed using ensembles of the best estimators. The Voting classifier estimator is a stronger meta-classifier that balances out the individual classifiers’ weaknesses on a particular dataset. One voting classifier contained SVC, random forest, and extra trees best estimators. A second voting classifier was composed of both CNN best estimators and the MLP best estimator and a third voting classifier was composed of the MLP and XGBoost best estimators. The voting classifiers' accuracy scores are listed in Table 2. Soft voting was employed in the MLP/XGB voting classifier which gave it the highest accuracy of the three. Soft voting classifiers classify input data based on the probabilities of all the predictions made by different classifiers. Thus, its output can be used as a proxy of the estimated recharge endmemeber composition of a groundwater sample.

Table 2. Voting Classifier Performance Comparison

Model Validation Accuracy

RF/ET/SVC 0.908

MLP/CNNRS/CNNES 0.854

MLP/XGB with soft voting 0.918

Table 1. Model Performance Comparison
Model	Hyperparameter Tuning Method	Validation Accuracy
Support Vector Classifier	grid search cross validation	0.888
Random Forest	grid search cross validation	0.893
Extra Trees	grid search cross validation	0.898
MLP	grid search cross validation	0.934
CNN with Randomized Search	randomized search cross validation	0.844
CNN with Early Stopping	early stopping	0.857
XGBoost	randomized search cross validation	1.00

Table 2. Voting Classifier Performance Comparison
Model	Validation Accuracy
RF/ET/SVC	0.908
MLP/CNNRS/CNNES	0.854
MLP/XGB with soft voting	0.918

For endmember composition estimation of a random groundwater sample, the MLP/XGB voting classifier with soft voting (MLP/XGB-VC-SV; Figure 7) is the best choice because it has the highest accuracy. It is also notable that stratified sampling was tried for all base models because of a heavy imbalance of the SW (surface water) class value counts being an order of magnitude lower than those of the other classes. The XGB classifier was the only model to benefit from stratified sampling. Synthetic minority oversampling technique (SMOTE) [8] was employed in the XGB classifier. The SMOTE algorithm performs data augmentation by creating synthetic data points based on the original data points. The advantage of SMOTE is that it does not generate duplicates, but rather creates synthetic data points that are slightly different from the original data points.

**Figure 7. MLP/XGB-VC-SV Composition.**

Using MLP/XGB-VC-SV on an unlabeled random sample exemplifies application of MLP/XGB-VC-SV. Figure 8 shows the flow routine when the model is used to predict groundwater sample endmember mixing percentages for a sample of the GWMA raw mixed dataset. From the lower table graphic we can see that the first sample is predomiantanly composed of groundwater from the Wanapum/Grande Ronde Basalts (WB_GRB) at about 56% of the sample, 44% Wanapum (WB) only with traces of Grande Ronde and surface water. We know that these predictions are 91% accurate based on the model validation accuracy scores.

6. Conclusions

Accurate groundwater sample endmemeber source partitioning was achieved using ML methods, and thus positively answering the research question posted in Section 3. The ensemble MLP/XGB-VC-SV model proved highly capable of indentifying the predominate basalt aquifer zone and mixing endmembers in previously unclassified mixed samples. Implications for performing such modeling include having a better understanding of aquifer recharge sources which to better inform decision making by water resources managers. Chemical fingerprints for individual samples uusing concentration can now be developed with confidence because of the source partioning. For example, the median concetrations of the training set class WB_GRB can be directly compared to that of other samples and the predominate source endmember can be quickly assessed without extensive analysis (Figure 9).

**Figure 9. Example Chemical Fingerprint Comparison.**

Ensemble Methods for Chemical Fingerprinting for Source Identification and Partitioning of a Layered Aquifer System: Columbia River Basin, Washington State