Data Release for Evaluation of Six Methods for Correcting Bias in Estimates from Ensemble Tree Machine Learning Regression Models

Dates

Publication Date: 2021-03-01
Time Period: 2020

Citation

Belitz, K., Stackelberg, P.E., and Sharpe, J.B., 2021, Data Release for Evaluation of Six Methods for Correcting Bias in Estimates from Ensemble Tree Machine Learning Regression Models: U.S. Geological Survey data release, https://doi.org/10.5066/P9LCTYI2.

Summary

Ensemble-tree machine learning (ML) regression models can be prone to systematic bias: small values are overestimated and large values are underestimated. Additional bias can be introduced if the dependent variable is a transform of the original data. Six methods were evaluated for their ability to correct systematic and introduced bias: (1) empirical distribution matching (EDM); (2) regression of observed on estimated values (ROE); (3) linear transfer function (LTF); (4) linear equation based on Z-score transform (ZZ); (5) second machine learning model used to estimate residuals (ML2-RES); and (6) Duan smearing estimate applied after ROE is implemented (ROE-Duan). The performance of the methods was evaluated using four previously published ML case studies of groundwater quality: (1) pH in the glacial aquifer system; (2) pH in the North Atlantic Coastal Plain; (3) nitrate in the Central Valley of California; and (4) iron in the Mississippi Embayment.

This data release includes nine tables. Each of the four case studies has a training-data table and a holdout-data table, for a total of eight data tables. Each data table includes observed values and ML estimates, which were obtained from previously published reports (Ransom and others, 2017; DeSimone and others, 2020; Knierim and others, 2020; Stackelberg and others, 2020), and bias-corrected values for each data point. The methods for obtaining the bias-corrected values are described in the primary related publication (Belitz and Stackelberg, 2021). The ninth table includes coefficients of equations associated with selected bias-correction methods for each of the case studies. Not all of the methods were applied to all of the case studies.
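As a rough illustration of two of the linear corrections named above, the sketch below fits ROE by ordinary least squares and a ZZ-style line by equating Z-scores of observed and estimated values. The ZZ formula used here is an assumption based only on the method's name; the authoritative definitions are in Belitz and Stackelberg (2021). The function names and the sample numbers are hypothetical and are not taken from the data release.

```python
from statistics import mean, pstdev


def fit_roe(observed, estimated):
    """Least-squares fit of observed = a + b * estimated; returns (a, b)."""
    mx, my = mean(estimated), mean(observed)
    sxx = sum((x - mx) ** 2 for x in estimated)
    sxy = sum((x - mx) * (y - my) for x, y in zip(estimated, observed))
    b = sxy / sxx
    return my - b * mx, b


def fit_zz(observed, estimated):
    """Assumed ZZ form: (corrected - mean_obs) / sd_obs = (est - mean_est) / sd_est."""
    b = pstdev(observed) / pstdev(estimated)
    return mean(observed) - b * mean(estimated), b


def apply_linear(coeffs, estimated):
    """Apply corrected = a + b * estimated to each estimate."""
    a, b = coeffs
    return [a + b * x for x in estimated]


# Made-up values for illustration only: the estimates are compressed toward
# the mean, so small values are overestimated and large ones underestimated.
observed = [4.0, 5.0, 6.0, 7.0, 8.0]
estimated = [5.0, 5.5, 6.0, 6.5, 7.0]

roe = fit_roe(observed, estimated)
zz = fit_zz(observed, estimated)
corrected_roe = apply_linear(roe, estimated)
corrected_zz = apply_linear(zz, estimated)
print(roe, zz)
```

In this toy case both fits give the same line because the observed and estimated values are perfectly correlated. With real data the ROE slope is scaled by the correlation coefficient, so the two corrections generally differ.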

Contacts

Point of Contact :: Kenneth Belitz
Originator :: Kenneth Belitz, Paul E Stackelberg, Jennifer B Sharpe
Metadata Contact :: Jennifer B Sharpe
Publisher :: U.S. Geological Survey
Distributor :: U.S. Geological Survey - ScienceBase
SDC Data Owner :: Earth System Processes Division
USGS Mission Area :: Water Resources

Attached Files

File	Size	Type
Bias_In_ML_Regression_Models_Metadata.xml (original FGDC metadata)	48.13 KB	application/fgdc+xml
Table_1_Glacial_Training.csv	1.05 MB	text/csv
Table_2_Glacial_Holdout.csv	267.83 KB	text/csv
Table_3_NACP_Training.csv	254.89 KB	text/csv
Table_4_NACP_Holdout.csv	63.76 KB	text/csv
Table_5_CV_Training.csv	310.38 KB	text/csv
Table_6_CV_Holdout.csv	146.73 KB	text/csv
Table_7_MISE_Training.csv	106.59 KB	text/csv
Table_8_MISE_Holdout.csv	27.34 KB	text/csv
Table_9_Coefficients.csv	290 Bytes	text/csv
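
The column names of the tables are not listed on this page (they are documented in the FGDC metadata file), so a reasonable first step after downloading is to inspect the header row. The sketch below does that with the Python standard library. It assumes the CSV file sits in the current working directory; `read_table` is a hypothetical helper, not part of the data release.

```python
import csv
from pathlib import Path


def read_table(path):
    """Return (column_names, rows) for one CSV table; each row is a dict of strings."""
    # utf-8-sig strips a byte-order mark if the file happens to have one.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return list(reader.fieldnames or []), rows


# Assumption: the file was downloaded into the current directory.
table = Path("Table_9_Coefficients.csv")
if table.exists():
    columns, rows = read_table(table)
    print(columns)
    print(len(rows), "rows")
```

Values are returned as strings, so numeric columns need an explicit `float(...)` conversion before any comparison of observed, estimated, and bias-corrected values.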

Related External Resources

Type: Related Primary Publication

Belitz, K., and Stackelberg, P.E., 2021, Evaluation of Six Methods for Correcting Bias in Estimates from Ensemble Tree Machine Learning Regression Models: Environmental Modelling & Software, p. 105006, https://doi.org/10.1016/j.envsoft.2021.105006.

Purpose

The purpose of this paper is to evaluate methods for correcting two types of bias that can be present in values produced by ensemble-tree machine learning (ML) models. One type of bias is systematic: ML models tend to overestimate small values and underestimate large values. The other type of bias is introduced if the dependent variable is a transform of the original data.
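
The second type of bias can be seen in a small numerical example. The sketch below assumes a natural-log transform and a deliberately crude model that always returns the mean of the logged values. Back-transforming with `exp` alone underestimates the mean of the original data, and the Duan smearing factor (the mean of the exponentiated residuals) restores it. The numbers are made up and the function name is hypothetical; the ROE-Duan procedure actually used is described in Belitz and Stackelberg (2021).

```python
import math
from statistics import mean


def duan_smearing_factor(log_observed, log_estimated):
    """Mean of exp(residual), with residuals computed in log space."""
    residuals = [o - e for o, e in zip(log_observed, log_estimated)]
    return mean(math.exp(r) for r in residuals)


# Made-up values for illustration only.
observed = [1.0, 2.0, 4.0, 8.0]
log_observed = [math.log(v) for v in observed]

# Crude stand-in for a model fit in log space: always predict the mean log.
log_estimated = [mean(log_observed)] * len(observed)

naive = [math.exp(v) for v in log_estimated]   # plain back-transform
factor = duan_smearing_factor(log_observed, log_estimated)
smeared = [factor * v for v in naive]          # smearing-corrected values

print(mean(observed), naive[0], smeared[0])
```

Here the plain back-transform returns the geometric mean of the observations, which is lower than their arithmetic mean, while the smeared value matches the arithmetic mean.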
