Home
"Data engineering, machine learning, and geochemical modeling of analytical data for source partitioning and elucidation of endmember contribution to groundwater samples from multiple supra-basalt and basalt aquifer system samples."
1. Dataset
The data originates from two sampling events of wells and surface waters conducted in 2008 and 2010 by the Columbia Basin Ground Water Management Area (GWMA) of Adams, Franklin, Grant, and Lincoln counties spanning the Columbia Basin of central and eatern Washington State (Figure 1). The dataset consists of 443 sampling locations accounting for 443 observations. All parameters were numeric with the exception of well open interval representing the aquifer zone(s) sampled.
The target (or independant) variable is the primary hydrostratigraphic unit (HSU; primary unit) representing the aquifer zone(s) that each sampled well's intake is open to. Primary unit is at the heart of water resources concerns for GWMA because those areas are where available basalt aquifer groundwater resources are stored. These aquifers were recharged during the pleistocene epoch from continental glacial meltwater and from catastrohic flooding events that occurred in the region close to the end of the pleistocene. Because of their layered, confined nature, deeper basalt aquifers receive minimal to no modern recharge and thus puming in the Columbia Basin has reduced their stored resources significantly over the past 60 years. Thus it is of great importance to those dependent upon the resource to understand how its current and future availibilty is affected by withdrawals impacts.
The single catogrical feature of the dataset is type. Type represents if the sample comes from an aquifer (type = groundwater) or other waterbody such as a lake or stream (type = surfacewater). Tjese factors have significant bearing on water chemistry and thus recharge endmember differentiation.
Numeric features of the GWMA dataset include a range of groundwater chemistry and geospatial variables. Chemistry features include field parameters (pH, specific conductivity (SC), dissolved oxygen (DO), oxidation reduction potential (ORP)), major ions, stable isotopes of oxygen an hydrogen, radiocarbon dating parameters, dissolved gases (He, H2, Ne, N2, Ar, O2, CO2) and tritium counts. Horizontal geospatial coordinates are in universal transverse mercator format and include easting and northing that are projected in the UTM Zone 11N coordinate system. The vertical geospatial coordinate elevation was not used as a feature because it was not available in the dataset. Not all features were retined as model variables based on their suitability for the analysis. Variable descriptions and selection are discussed in the next section.
2. Variables
Input variables after exploratory data analysis and cleaning include the following:
The output (target) variable (Primary Unit) has five classes corresponding to four basalt aquifer subunits and one indeterminant mixed endmember class. The mixed class was not used to train the model because it does not represent a recharge type endmember. Surface water and sediment samples were indistinguishable from the mixed claas so they were included therein. The features were plotted on boxplots and a correlation heatmap to look for outliers and redundancy (Figures 1 and 2).
All outliers seen Figure 1 were retained because there were no anomalous water quality indicators (SC, DO, ORP, pH, temperature) suggesting that any samples were compromised. In Figure 2, Mg and Ca are highly correlable but both features were retained because conceptually they both are major water type paramters which change somewhat proportionally dependant upon aqueous geochemical reactions.
3. Research Question
The purpose of this study is to answer the question of whether machine learning (ML) and neural networks (NN) techniques can be effective tools for differentiating groundwater sources of recharge for water supply wells and estimating the percent contributions of individual recharge endmembers constituting water supply well samples. Specifically:
Can groundwater recharge sources and source partitioning be accurately estimated using ensemble methods such as voting classifiers employing ML and NN models?
Basalt aquifers in the Columbia Basin in Washington State are at risk of being exhausted to the point where static levels will not be able to support water supply demands. Addressing this issue is important to water resources manageers in the region because many municipalities depend on groundwater and the region's multi-billion dollar agricultural economy is unsustainable without those groundwater resources. Water rights and regulatory constraints would also be much better served with a better understanding of basalt aquifer recharge partitioning. For example, knowing that wells open to specific basalt aquifer zones containing higher percentages of endmembers that receive substantial recharge will likely be more reliable producers.
4. Data Processing and Partitioning
Missing data in the GWMA dataset presents considerable challenges to analysis. Being that the dataset is relative small, imputaion was employed to fill in data gaps before variable selection (see Section 2 for an explanation of variables). Many imputation techniques exist but Multiple Imputation by Chained Equations (MICE; Figure 3) [2] was chosen because MICE factors correlations between variables well and it is quite accurate when compared to other methods.
This process is repeated for the desired number of datasets using either a Light Gradient Boosting Machine (LGBM) [3] model or mean matching [4]. Mean matching is used to produce imputations that behave more like the original data. Comparison of LGBM and mean matching results suggested a final featurest consisting of all chemical parameters except SiO2 and C-14 Age from the LGBM imputation. Features SiO2 and C-14 Age from mean mactching were retained in the final feature set. Comparison between the imputed and non-imputed original dataset versions indicated that dissolved gases be dropped from the final feature set because of the poor correlations of their kernel density estimations with that of the original dataset's for those variables.
The raw dataset was originally associated with open interval HSU information gleaned from borehole geologic logs. These data were labeled with 7 original HSU's corresponding with 5 basalt aquifer subunits, a sedimentary suprabasalt aquifer, and a mixed unit of multiple HSU's. These labeling designations were tested using various clustering techniques including agglomerative, K-means, gaussian mixture, bayesian gaussian mixture, and hierarchical density based (HDBSCAN) [5] clustering. Five classes (GRB = Grande Ronde Basalt, WB = Wanapum Basalt, WB_GRB = GRB/WB mix, SW = surface water, and Mixed) were determined to be most appropriate for the data via the clustering. Soft clustering using the HDBSCAN method was used to find the degree to which each point belongs to each cluster. This method was deemed highly appropriate for label determination by clustering because < 10% of points had top-two cluster membership probabilities within 0.2 of one another (Figure 4). This is a quite good metric and arguement for five classification categories. A 95% cluster membershp probability was used as a threshold for selecting instances for the training dataset. Subsequently each feature instance was assigned a label derived from the clusterting by counting the number of the 7 original primary units for each label class. For example, cluster 2 contained by far mostly samples from primary unit GRB so therefore the feature instances in that cluster were assiged the label GRB (Figure 5). The mixed samples were separated from the rest of the data which were then used to create the synthetic training dataset. Using the 95% cluster association threshold, the original data were separated into a training set of 95 samples and 383 mixed samples.
The cleaned training dataset determined via clustering consisted of original samples having a 95% probability of being a pure endmember and not mixed. This original training dataset had only 95 rows and thus potentially lacked enough statistical power in order to increase the accuracy of the classification models. Therefore, a synthetic training dataset was created based on the actual measured values. Cognitive tabular generative adversarial networks (CTGAN) [6] modeling was used to generate a synthetic dataset closley resembling the original dataset but with 10,000 rows. Similarty bewteen the original and sythetic datasets was observed by comparing the kernel density estimates of both (Figure 6). Feature densities between the synthetic and original datasets were similarly distributed with an overall quality score of 90.48% and column pair trends matching at 94.37%. These metrics indicate that the synthetic dataset is well suited for classification model training.
The synthetic feature and target sets were split into training, testing, and validation sets using 81% of the data for training, 10% for testing, and 9% for validation. This split was chosen because of the substantial size of the dataset. Subsequently the numerical training and test feature set variables were scaled using the scikit-learn Python package’s StandardScaler module [7]. Standardization is appropriate for this study’s dataset because of the ML methods being used. The split sets where transformed into tensors only for use in the NN models.
5. Model Implementation and Selection
Support vector machine (SVM), random forest, extra trees, multi-layer perceptron (MLP), convolutional neural networks, and XGBoost models were implemented by fitting with the training data and finding the best estimators using various hyperparameter tuning methods (Table 1). The best models after hyperparameter tuning produced validation accuracies between 84.4 and 100%. In an effort to increase overall accuracy, ensemble voting classifiers composed of the best estimators were constructed.
Model | Hyperparameter Tuning Method | Validation Accuracy |
---|---|---|
Support Vector Classifier | grid search cross validation | 0.888 |
Random Forest | grid search cross validation | 0.893 |
Extra Trees | grid search cross validation | 0.898 |
MLP | grid search cross validation | 0.934 |
CNN with Randomized Search | randomized search cross validation | 0.844 |
CNN with Early Stopping | early stopping | 0.857 |
XGBoost | randomized search cross validation | 1.00 |
Model | Validation Accuracy |
---|---|
RF/ET/SVC | 0.908 |
MLP/CNNRS/CNNES | 0.854 |
MLP/XGB with soft voting | 0.918 |
For endmember composition estimation of a random groundwater sample, the MLP/XGB voting classifier with soft voting (MLP/XGB-VC-SV; Figure 7) is the best choice because it has the highest accuracy. It is also notable that stratified sampling was tried for all base models because of a heavy imbalance of the SW (surface water) class value counts being an order of magnitude lower than those of the other classes. The XGB classifier was the only model to benefit from stratified sampling. Synthetic minority oversampling technique (SMOTE) [8] was employed in the XGB classifier. The SMOTE algorithm performs data augmentation by creating synthetic data points based on the original data points. The advantage of SMOTE is that it does not generate duplicates, but rather creates synthetic data points that are slightly different from the original data points.
Using MLP/XGB-VC-SV on an unlabeled random sample exemplifies application of MLP/XGB-VC-SV. Figure 8 shows the flow routine when the model is used to predict groundwater sample endmember mixing percentages for a sample of the GWMA raw mixed dataset. From the lower table graphic we can see that the first sample is predomiantanly composed of groundwater from the Wanapum/Grande Ronde Basalts (WB_GRB) at about 56% of the sample, 44% Wanapum (WB) only with traces of Grande Ronde and surface water. We know that these predictions are 91% accurate based on the model validation accuracy scores.
6. Conclusions
Accurate groundwater sample endmemeber source partitioning was achieved using ML methods, and thus positively answering the research question posted in Section 3. The ensemble MLP/XGB-VC-SV model proved highly capable of indentifying the predominate basalt aquifer zone and mixing endmembers in previously unclassified mixed samples. Implications for performing such modeling include having a better understanding of aquifer recharge sources which to better inform decision making by water resources managers. Chemical fingerprints for individual samples uusing concentration can now be developed with confidence because of the source partioning. For example, the median concetrations of the training set class WB_GRB can be directly compared to that of other samples and the predominate source endmember can be quickly assessed without extensive analysis (Figure 9).
7. References