"Data science and geostatistical modeling of nitrate groundwater pollution as an estimation tool for water supply well placement"

1. Dataset

The data (https://apps.ecology.wa.gov/eim/search/default.aspx) originate from the Washington Department of Ecology’s Lower Yakima Valley Groundwater Management Area Ambient Groundwater Monitoring Network study (https://apps.ecology.wa.gov/publications/documents/2103106.pdf), conducted between July and December of 2021. The dataset consists of 172 sampling locations, each sampled twice, for a total of 344 observations. Both categorical and numeric features are represented in the dataset.

Two categorical features, “Well_Type” (type) and “Well_Completion_Type” (completion), are included in the locations dataset. Type indicates the purpose for which the sampled well was constructed (i.e., irrigation, monitoring, or supply). Completion indicates whether the well interval open to the aquifer is screened, to inhibit aquifer materials from entering the well, or open-ended (unscreened). Type and completion can affect sample results because groundwater may be sampled over a range of the saturated aquifer thickness rather than from a discrete zone.

Numeric features include groundwater chemistry and geospatial variables. Chemistry variables include pH, specific conductivity (SC), dissolved oxygen (DO), oxidation reduction potential (ORP), and nitrate + nitrite as N (nitrate). Horizontal geospatial coordinates are longitude and latitude in decimal degrees, referenced to the NAD83 HARN (National Geodetic Survey, 2018) coordinate system. The vertical geospatial coordinate, elevation, was not used as a feature because it was missing in many instances and varied little among the instances where it was recorded.

2. Variables

Input variables after exploratory data analysis and cleaning include the following:

SC – specific conductivity in micro-siemens/cm. This parameter is a measure of the amount of dissolved ionic solids in the sample. SC correlates directly with the sample's ability to conduct electricity due to the concentration of ions in solution.

DO – dissolved oxygen in milligrams/liter. Measurement of free, usable oxygen in solution. DO is a good proxy for interconnection of the aquifer with the atmosphere, and thus for whether the aquifer behaves as an open system.

Nitrate – nitrate + nitrite as N in milligrams/liter. Primary groundwater contaminant from decomposition of organic substances and fertilizers.

ORP – oxidation reduction potential in millivolts. Measure of the tendency of the sample to gain or lose electrons. Positive values indicate oxidizing conditions and negative values indicate reducing conditions.

Temperature in degrees Celsius – sample quality indicator for representativeness of actual aquifer conditions. If the sample has a temperature out of range of what is expected for general groundwater conditions at the time of sampling, the sample could be compromised.

pH – measure of free hydrogen ions in solution. Acid conditions are indicated by low pH and basic conditions by high pH.

Latitude – northing geospatial coordinate.

Longitude – easting geospatial coordinate.

Well type – category of sampling location construction. Can be irrigation well, monitoring well specifically designed for groundwater sampling, or domestic supply well.

Well completion – category of sampling location open interval. Can be screened or open-ended (not screened).

Average open interval depth in feet below ground surface – indicates the depth at which the well is open to the aquifer. This is an important factor in whether the sample is taken from an aquifer zone that is contaminated or not contaminated.

The output (target) variable “Safe” is binary, with two classes indicating whether nitrate at the sampling location is below the 10 mg/L water quality limit. If the sampling location has nitrate < 10 mg/L, Safe = “Yes”; otherwise Safe = “No”.
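The labeling rule above can be sketched in a few lines of pandas. This is only an illustration: the column name `nitrate`, the helper `label_safe`, and the sample values are hypothetical and are not taken from the study's code or data.

```python
import pandas as pd

NITRATE_LIMIT_MG_L = 10.0  # water quality limit cited in the text


def label_safe(nitrate_mg_l: pd.Series, limit: float = NITRATE_LIMIT_MG_L) -> pd.Series:
    """Return "Yes" where nitrate is strictly below the limit, otherwise "No"."""
    return (nitrate_mg_l < limit).map({True: "Yes", False: "No"})


# Hypothetical example values, not from the study dataset.
samples = pd.DataFrame({"nitrate": [0.5, 9.9, 10.0, 23.4]})
samples["Safe"] = label_safe(samples["nitrate"])
```

Under the strict inequality stated above, a sample at exactly 10 mg/L falls in the “No” class.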

The imported datasets for well locations and chemistry were combined, cleaned, and screened to result in the inputs and target described above. The features were plotted on boxplots and a correlation heatmap to look for outliers and redundancy (Figures 1 and 2).

All outliers seen in Figure 1 were retained because there were no anomalous water quality indicators (SC, DO, ORP, pH, temperature) suggesting that any samples were compromised. In Figure 2 it was apparent that “Well_Completion_Depth” and “avg_open_interval” are redundant, and therefore the former was dropped from the feature set.
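A redundancy check of this kind can also be done numerically rather than by reading the heatmap. The sketch below is one possible approach, not the procedure used in the study: the 0.95 threshold, the helper `redundant_pairs`, and the synthetic values are assumptions chosen for illustration. Only the two column names come from the text.

```python
import numpy as np
import pandas as pd


def redundant_pairs(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """List column pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs


# Synthetic stand-in data (assumption): two depth columns that track each other
# closely, plus an unrelated pH column.
rng = np.random.default_rng(0)
depth = rng.uniform(50, 300, size=100)
demo = pd.DataFrame({
    "Well_Completion_Depth": depth,
    "avg_open_interval": depth - rng.uniform(5, 15, size=100),
    "pH": rng.normal(7.5, 0.3, size=100),
})

pairs = redundant_pairs(demo)
# Drop one member of the redundant pair, as described in the text.
reduced = demo.drop(columns=["Well_Completion_Depth"])
```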

3. Research Question

The purpose of this study is to answer the question of whether machine learning (ML) techniques can be effective tools for selecting locations for water supply wells in areas of known groundwater contamination. Specifically:

Can safe water supply well locations be accurately selected using machine learning techniques such as Support Vector Machine, Logistic Regression, Random Forest, and XGBoost?

Water supply wells drilled in the lower Yakima River Valley in Washington State are at risk of producing water unfit for consumption because nitrate concentrations can exceed the water quality standard of 10 mg/L (National Primary Drinking Water Standards, 2022). Addressing this issue is important because the Lower Yakima Valley is experiencing high rates of residential housing development. These new developments will be reliant on groundwater to meet their domestic water supply needs. The region has long been an agricultural area in which heavy use of crop fertilizers and concentrated animal feeding operations (CAFO) have been large contributors to groundwater nitrate contamination.

4. Data Processing and Partitioning

The combined feature set and target set were split into training and test sets, using 70% of the data for training and 30% for testing. The 70/30 split was chosen over an 80/20 split because of the relatively small size of the dataset. Subsequently, the numerical variables in the training and test feature sets were scaled using the scikit-learn Python package’s StandardScaler class (scikit-learn, 2022). After scaling the numerical variables, dummies were created for the categorical variables. Standardization is more appropriate than normalization when features have different units, which is the case for this study’s dataset.
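The following is a minimal sketch of this preprocessing sequence, assuming scikit-learn's `train_test_split` and `StandardScaler` together with pandas' `get_dummies`. The synthetic data, the `random_state`, the stratified split, and fitting the scaler on the training set only are assumptions of the sketch. The text does not state how these details were handled in the study.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); only the column names echo the text.
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "SC": rng.normal(600, 150, n),
    "pH": rng.normal(7.5, 0.3, n),
    "Well_Type": rng.choice(["irrigation", "monitoring", "supply"], n),
})
y = pd.Series(rng.choice(["Yes", "No"], n), name="Safe")

# 70/30 split; stratifying on the target is an assumption of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)
X_train, X_test = X_train.copy(), X_test.copy()

# Standardize numeric columns using statistics from the training set only.
num_cols = ["SC", "pH"]
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

# Create dummies after scaling, then align test columns to the training columns.
X_train = pd.get_dummies(X_train, columns=["Well_Type"])
X_test = pd.get_dummies(X_test, columns=["Well_Type"]).reindex(
    columns=X_train.columns, fill_value=0
)
```

Reindexing the test dummies against the training columns guards against a category that happens to be absent from one of the two partitions.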

5. Model Implementation and Selection

Support vector machine (SVM), logistic regression, random forest, and XGBoost models were implemented by fitting each to the training data and finding the best estimators using a grid search. Figure 3 is a code example of the method. All models performed well, with testing accuracies > 98%. Logistic regression (LR) was selected as the model for interpolation of the classification probabilities because of its high training and testing accuracies and its simplicity. Random forest (RF) has higher computational costs while offering no performance advantage, and the XGBoost classifier produced only two probability values for its entire classification, rendering it unsuitable for mapping the probabilities. SVM also proved to be a good choice, with only a slightly lower testing accuracy than LR.

**Figure 3. Model Training and Grid Search for Logistic Regression.**
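For readers without access to the figure, a grid search over a logistic regression classifier might look like the sketch below. It assumes scikit-learn's `GridSearchCV` and `LogisticRegression`. The parameter grid, the five-fold cross-validation, and the synthetic data from `make_classification` are assumptions chosen for illustration and are not the settings or data used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the scaled feature matrix and binary target (assumption).
X, y = make_classification(
    n_samples=344, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

# The refit best estimator is used for scoring and for class probabilities.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
positive_class_probability = best_model.predict_proba(X_test)[:, 1]
```

The per-sample probabilities returned by `predict_proba` are the kind of continuous output that can then be mapped and interpolated.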

6. Conclusions

The predicted probabilities of the nitrate values being safe (<10 mg/L) from the logistic regression were mapped in QGIS, and an interpolated probability raster was produced using thin-plate spline (TPS) interpolation (Figure 4). TPS was chosen over other interpolation techniques because it is well suited to examining the combined effect of continuous predictors on a single outcome (in this case, the probability of encountering groundwater with a nitrate concentration >= 10 mg/L). The mapping clearly indicates areas where construction of a water supply well would be risky in terms of nitrate contamination and areas of lower risk. Because groundwater risk probabilities can be predicted with high accuracy, it is reasonable to conclude that ML methods can be used in the selection of water supply well locations.

**Figure 4. Probability Map from Logistic Regression of Results Showing Nitrate Contamination Risk Areas.**
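The interpolation in this study was performed in QGIS. As a rough illustration of the same idea outside QGIS, the sketch below swaps in SciPy's `RBFInterpolator` with its `thin_plate_spline` kernel. The well coordinates, probabilities, grid resolution, and the clipping of the result to [0, 1] are all assumptions of the sketch and do not reproduce the study's workflow.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Hypothetical well locations (decimal degrees) and predicted probabilities.
rng = np.random.default_rng(1)
lon = rng.uniform(-120.6, -119.8, 50)
lat = rng.uniform(46.2, 46.5, 50)
points = np.column_stack([lon, lat])
prob_safe = rng.uniform(0.0, 1.0, 50)

# smoothing=0.0 makes the spline pass through the data points.
tps = RBFInterpolator(points, prob_safe, kernel="thin_plate_spline", smoothing=0.0)

# Evaluate the spline on a regular grid to form a raster of probabilities.
grid_lon, grid_lat = np.meshgrid(
    np.linspace(lon.min(), lon.max(), 40),
    np.linspace(lat.min(), lat.max(), 30),
)
grid_points = np.column_stack([grid_lon.ravel(), grid_lat.ravel()])

# A thin-plate spline can overshoot, so values are clipped to valid probabilities.
raster = np.clip(tps(grid_points), 0.0, 1.0).reshape(grid_lat.shape)
```

Interpolating directly in decimal degrees is a simplification. For distance-sensitive work, the coordinates would normally be projected to a planar system first.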

Probability Mapping of Groundwater Contamination by Nitrate from Agricultural Practices in the Lower Yakima River Basin, Washington State