Novel two-stage hybrid paradigm combining data pre-processing approaches to predict biochemical oxygen demand concentration

Sungwon Kim; Youngmin Seo; Mousaab Zakhrouf; Anurag Malik

doi:10.3741/JKWRA.2021.54.S-1.1037


Research Article

Journal of Korea Water Resources Association. 31 December 2021. 1037-1051
https://doi.org/10.3741/JKWRA.2021.54.S-1.1037


Sungwon Kim^a^*

Youngmin Seo^b

Mousaab Zakhrouf^c

Anurag Malik^d


^aFull Professor, Department of Railroad Construction and Safety Engineering, Dongyang University, Yeongju, Korea

^bLecturer, Department of Constructional and Environmental Engineering, Kyungpook National University, Sangju, Korea

^cResearch Associate, URMER Laboratory, Department of Hydraulics, Faculty of Technology, University of Tlemcen, Tlemcen, Algeria

^dScientist, Punjab Agricultural University, Regional Research Station, Bathinda, Punjab, India


^{*Corresponding Author}

ABSTRACT

Biochemical oxygen demand (BOD) concentration, one of the important water quality indicators, is treated as a monitoring item from an ecological perspective in lakes and rivers. This investigation employed a novel two-stage hybrid paradigm (i.e., wavelet-based gated recurrent unit, wavelet-based generalized regression neural networks, and wavelet-based random forests) to predict BOD concentration at the Dosan and Hwangji stations, South Korea. These models were assessed against the corresponding independent models (i.e., gated recurrent unit, generalized regression neural networks, and random forests). Diverse water quality and quantity indicators were used to develop the independent and two-stage hybrid models based on several input combinations (i.e., Divisions 1-5). The models were evaluated using three statistical indices: the root mean square error (RMSE), Nash-Sutcliffe efficiency (NSE), and correlation coefficient (CC). The results show that the two-stage hybrid models do not always improve the predictive accuracy of the corresponding independent models. The DWT-RF5 model (RMSE = 0.108 mg/L) predicted BOD concentration more accurately than the other optimal models at Dosan station, and the DWT-GRNN4 model (RMSE = 0.132 mg/L) was the best for predicting BOD concentration at Hwangji station.

Keywords

Biochemical oxygen demand

Gated recurrent unit

Generalized regression neural networks

Random forests

Discrete wavelet transform

Water quality indicator


MAIN

1. Introduction
2. Models and Data Pre-Processing Approach
2.1 Gated recurrent unit (GRU)
2.2 Generalized regression neural networks (GRNN)
2.3 Random forests (RF)
2.4 Discrete wavelet transform (DWT)
2.5 Evaluation of independent and two-stage hybrid models’ performance
3. Study Boundary and Data Information
4. Results and Discussion
4.1 Dosan station
4.2 Hwangji station
4.3 Discussion
5. Conclusions

1. Introduction

Water quality can be specified as the biological, chemical, and physical characteristics of the water in rivers, reservoirs, and oceans (Ahmed and Shah, 2017; Kim et al., 2020). Water quality indicators include water temperature (WT), electrical conductivity (EC), biochemical oxygen demand (BOD), dissolved oxygen (DO), the potential of hydrogen (pH), turbidity (TU), chemical oxygen demand (COD), suspended solids (SS), total organic carbon (TOC), total nitrogen (T-N), total phosphorus (T-P), and chlorophyll-a (CHA), among others. Their assessment is very important for the management of different water resources systems (Khaled et al., 2017; Kim et al., 2020).

BOD concentration has been accepted as an indicator of river water pollution since 1908, when it was adopted by the U.K. Royal Commission on River Pollution (Royal Commission on Sewage Disposal, 1908). The commission defined a five-day period at 20 ℃ for estimating the BOD5 concentration. In the USA, the American Public Health Association Standard Methods Committee (APHASMC) has used BOD concentration as a reference for evaluating the organic pollution of water since 1936 (Jouanneau et al., 2014). In addition, BOD concentration can be defined as the amount of DO required to degrade the organic matter in water at a given temperature (Raheli et al., 2017; Tao et al., 2019).

Measurements of water quality indicators fall into three classes: on-site measurement (e.g., WT, DO, pH, and TU), laboratory-based analysis (e.g., COD, SS, TOC, T-N, T-P, and CHA), and incubation-based analysis (e.g., BOD). BOD concentration is computed from the amount of oxygen consumed per liter of sample over a five-day period at 20 ℃. Zou et al. (2007) explained that traditional approaches to determining water quality indicators require much time and effort. Also, Ay and Kisi (2012) demonstrated that BOD concentration is one of the important water quality indicators for the conservation and maintenance of river ecosystems.

To overcome the drawbacks of traditional approaches to prediction problems in water quality and hydrology, deep learning and machine learning approaches have been investigated in numerous publications since 2000 (Zhang et al., 2002; Diamantopoulou et al., 2007; Dogan et al., 2009; Kim, 2000, 2011; Kim and Kim, 2007; Kim et al., 2009, 2012; Li et al., 2019; Zounemat-Kermani et al., 2019; Kim et al., 2021). Among these approaches, a few have been employed to predict BOD concentration (Emamgholizadeh et al., 2014; Noori et al., 2015; Ahmed and Shah, 2017; Khaled et al., 2017; Raheli et al., 2017; Ahmadi et al., 2018; Tao et al., 2019).

Granata et al. (2017) predicted water quality indicators including BOD, COD, total dissolved solids (TDS), and total suspended solids (TSS) concentrations using support vector regression (SVR) and regression tree (RT) models in the USA. Solgi et al. (2017) predicted BOD concentration using hybrid SVR and adaptive neuro-fuzzy inference system (ANFIS) models in the Karun River, Iran.

Although diverse deep learning and machine learning approaches have been employed to predict specific water quality indicators in rivers, novel approaches are required to improve the prediction accuracy of BOD concentration. In this investigation, a novel two-stage hybrid paradigm (i.e., wavelet-based gated recurrent unit, wavelet-based generalized regression neural networks, and wavelet-based random forests) is employed. It combines data pre-processing (i.e., discrete wavelet transform) with deep learning and machine learning approaches (i.e., gated recurrent unit, generalized regression neural networks, and random forests), and offers capability and efficiency for solving complicated and highly nonlinear problems. To the best of our knowledge, this two-stage hybrid paradigm has not previously been applied to this problem.

This investigation demonstrates the accuracy and capability of the DWT-GRU, DWT-GRNN, and DWT-RF models for predicting BOD concentration in South Korea. The performance of these models is evaluated and compared with that of the independent models based on three statistical indices (i.e., RMSE, NSE, and CC) and graphical support (i.e., scatter diagrams and the Taylor diagram). The remainder of this paper is organized as follows: Section 2 describes the models and the data pre-processing approach. Section 3 presents the study boundary and data information, and Section 4 presents the results and discussion. Finally, Section 5 summarizes the conclusions.

2. Models and Data Pre-Processing Approach

The models employed in this investigation are a deep learning model (i.e., GRU) and two machine learning models (i.e., GRNN and RF). The data pre-processing approach is the discrete wavelet transform (DWT). The following subsections describe the models and the data pre-processing approach. Fig. 1 shows the overall modeling and prediction process of this investigation.

https://cdn.apub.kr/journalsite/sites/kwra/2021-054-12S/N020054S-102/images/kwra_54_S-1_02_F1.jpg

Fig. 1.

Flowchart of investigation steps

2.1 Gated recurrent unit (GRU)

The gated recurrent unit (Fig. 2) is an advanced variant of the long short-term memory (LSTM) model, introduced by Cho et al. (2014). The GRU model builds on the gating mechanism of the LSTM model and simplifies the handling of memory cells compared to the LSTM model (Yang et al., 2020). The GRU model controls the flow of information with two gates. The update gate decides how much the unit updates its state or information. The reset gate determines how the current input is combined with the historical memory. The parameters of the gates are adjusted during training. The computations for the update gate, reset gate, input gate, and standard GRU output are given by Eqs. (1)~(4).

https://cdn.apub.kr/journalsite/sites/kwra/2021-054-12S/N020054S-102/images/kwra_54_S-1_02_F2.jpg

Fig. 2.

A schematic diagram of gated recurrent unit (GRU) model

(1)

Z_{j}^{(t)} = \sigma_{Z} \left( b_{j}^{Z} + \sum_{i = 1}^{N} x_{i}^{(t)} W_{i j}^{Z} + \sum_{i = 1}^{N} h_{i}^{(t - 1)} U_{i j}^{Z} \right)

(2)

R_{j}^{(t)} = \sigma_{R} \left( b_{j}^{R} + \sum_{i = 1}^{N} x_{i}^{(t)} W_{i j}^{R} + \sum_{i = 1}^{N} h_{i}^{(t - 1)} U_{i j}^{R} \right)

(3)

I_{j}^{(t)} = \sigma_{I} \left( b_{j}^{I} + \sum_{i = 1}^{N} x_{i}^{(t)} W_{i j}^{I} + \sum_{i = 1}^{N} h_{i}^{(t - 1)} \left[ U_{i j}^{I} * R_{j}^{(t)} \right] \right)

(4)

h_{j}^{(t)} = Z_{j}^{(t)} * \sum_{i = 1}^{N} h_{i}^{(t - 1)} + (1 - Z_{j}^{(t)}) * I_{j}^{(t)}

where $x_{i}^{(t)}$ is the input at time step t, ($W^{Z}$, $W^{R}$, $W^{I}$) and ($U^{Z}$, $U^{R}$, $U^{I}$) are the weight matrices of the input and hidden layers for the different gates, ($b^{Z}$, $b^{R}$, $b^{I}$) are the corresponding bias terms, $Z_{j}^{(t)}$ is the output of the update gate at time step t, $R_{j}^{(t)}$ is the output of the reset gate at time step t, $I_{j}^{(t)}$ is the output of the input gate at time step t, $h_{j}^{(t)}$ is the output of the GRU cell at time step t, $h_{i}^{(t - 1)}$ is the output of the short-term state cell at time step t-1, $\sigma_{I}$ is the hyperbolic tangent function, $\sigma_{Z}$ and $\sigma_{R}$ are sigmoid functions, and * is the Hadamard product.
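A minimal NumPy sketch of one GRU time step, in the spirit of Eqs. (1)~(4), may help fix the notation. It assumes the conventional formulation in which the reset gate scales the previous hidden state element-wise before it enters the input-gate activation, and in which the new state mixes the unit's own previous state with that activation. The names `init_params` and `gru_step`, the dictionary keys, and the layer sizes are illustrative and not taken from the paper.

```python
import numpy as np


def sigmoid(a):
    """Logistic sigmoid, used for the update and reset gates."""
    return 1.0 / (1.0 + np.exp(-a))


def init_params(n_in, n_hidden, seed=0):
    """Create small random weights (W, U) and zero biases (b) for the three gates."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("Z", "R", "I"):
        params["W" + gate] = rng.normal(scale=0.1, size=(n_in, n_hidden))
        params["U" + gate] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
        params["b" + gate] = np.zeros(n_hidden)
    return params


def gru_step(x, h_prev, p):
    """Advance the hidden state by one time step.

    x      : input vector at time t, shape (n_in,)
    h_prev : hidden state at time t-1, shape (n_hidden,)
    """
    z = sigmoid(p["bZ"] + x @ p["WZ"] + h_prev @ p["UZ"])          # update gate
    r = sigmoid(p["bR"] + x @ p["WR"] + h_prev @ p["UR"])          # reset gate
    cand = np.tanh(p["bI"] + x @ p["WI"] + (r * h_prev) @ p["UI"])  # input gate
    return z * h_prev + (1.0 - z) * cand                            # new state


if __name__ == "__main__":
    params = init_params(n_in=3, n_hidden=4)
    h = np.zeros(4)
    for x_t in np.random.default_rng(1).normal(size=(5, 3)):  # 5 time steps
        h = gru_step(x_t, h, params)
    print(h)
```

In practice a deep learning library would be used for training; the sketch only illustrates how the two gates control the mixture of old state and new information.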

2.2 Generalized regression neural networks (GRNN)

Generalized regression neural networks (GRNN) (Fig. 3) are a variant of the radial basis function (RBF) network (Specht, 1991). The GRNN model consists of input, hidden, summation, and output layers. The neurons of the input, hidden, and summation layers are fully connected, whereas each neuron of the output layer is connected to only some of the neurons in the summation layer. The summation layer is composed of two types of neurons: several summation neurons and a single division neuron. The number of summation neurons equals the number of output neurons. The division neuron sums the weighted signals from the hidden-layer neurons without applying a transfer function. Each output neuron is connected to its corresponding summation neuron and to the division neuron of the summation layer. No connection weights are defined between the summation and output layers. Each output neuron computes its value by dividing the signal from the summation neuron by the signal from the division neuron (Kişi, 2006; Kim and Kim, 2008; Ladlani et al., 2012; Li et al., 2014; Ahmadi et al., 2019).

https://cdn.apub.kr/journalsite/sites/kwra/2021-054-12S/N020054S-102/images/kwra_54_S-1_02_F3.jpg

Fig. 3.

A schematic diagram of generalized regression neural networks (GRNN) model
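The summation/division structure described above can be sketched in a few lines of NumPy. The sketch assumes Gaussian hidden units with a single smoothing parameter `sigma`, as in the common formulation following Specht (1991); the function name `grnn_predict`, the value of `sigma`, and the toy data are illustrative only.

```python
import numpy as np


def grnn_predict(X_train, y_train, X_query, sigma=0.5):
    """Predict targets for X_query with a generalized regression neural network.

    Hidden layer : one Gaussian unit per training sample.
    Summation    : weighted sum of training targets (summation neuron).
    Division     : plain sum of hidden activations (division neuron).
    Output       : summation neuron divided by division neuron.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)

    # Squared Euclidean distance between every query and every training sample.
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    hidden = np.exp(-d2 / (2.0 * sigma ** 2))   # hidden-layer activations

    numerator = hidden @ y_train                # summation neuron
    denominator = hidden.sum(axis=1)            # division neuron
    return numerator / denominator


if __name__ == "__main__":
    X = np.array([[0.0], [1.0], [2.0]])
    y = np.array([1.0, 2.0, 3.0])
    print(grnn_predict(X, y, np.array([[0.5], [1.5]])))
```

Because there are no trainable connection weights, the only quantity to tune in this formulation is the smoothing parameter.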

2.3 Random forests (RF)

Random forests (RF) (Fig. 4), an ensemble of decision trees grown in parallel, were introduced by Breiman (2001). The predictions of the individual trees are combined to produce the overall prediction of the forest. An RF model resembles a Treeboost model (Friedman, 2002) in that both employ many trees. The core difference is that the trees in the Treeboost model are grown in sequence, with the output of one tree supplied to the next, whereas an RF model consists of independent trees grown in parallel (Simard et al., 2000; Zounemat-Kermani et al., 2017; Alizamir et al., 2021). The RF model uses randomization to produce many different unpruned decision trees. Aggregating the results of these trees therefore gives a more stable and flexible architecture for accurate and efficient prediction. The predictions are aggregated by majority vote or by arithmetic average (Breiman, 2001).

https://cdn.apub.kr/journalsite/sites/kwra/2021-054-12S/N020054S-102/images/kwra_54_S-1_02_F4.jpg

Fig. 4.

A schematic diagram of random forests (RF) model (Alizamir et al., 2021)
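The idea of growing independent trees on randomized data and averaging their predictions can be illustrated with a deliberately simplified sketch. It replaces the unpruned trees of a real random forest with one-split regression trees (stumps), each fitted on a bootstrap sample and a randomly chosen input; all function names and the toy data are hypothetical and serve only to show the parallel, averaged structure.

```python
import numpy as np


def fit_stump(X, y, rng):
    """Fit a one-split regression tree on one randomly chosen input feature."""
    f = int(rng.integers(X.shape[1]))
    best_sse, best_thr = np.inf, None
    left_val = right_val = y.mean()
    for thr in np.unique(X[:, f])[:-1]:          # candidate split points
        left, right = y[X[:, f] <= thr], y[X[:, f] > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_thr = sse, thr
            left_val, right_val = left.mean(), right.mean()
    return f, best_thr, left_val, right_val


def predict_stump(stump, X):
    """Evaluate one stump; falls back to the mean if no split was possible."""
    f, thr, left_val, right_val = stump
    if thr is None:
        return np.full(X.shape[0], left_val)
    return np.where(X[:, f] <= thr, left_val, right_val)


def fit_forest(X, y, n_trees=50, seed=0):
    """Grow independent stumps, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))   # sample with replacement
        forest.append(fit_stump(X[idx], y[idx], rng))
    return forest


def predict_forest(forest, X):
    """Aggregate the trees by arithmetic average (regression case)."""
    return np.mean([predict_stump(s, X) for s in forest], axis=0)


if __name__ == "__main__":
    X = np.arange(20, dtype=float).reshape(-1, 1)
    y = np.where(X[:, 0] < 10, 0.0, 1.0)
    print(predict_forest(fit_forest(X, y), X))
```

Because each tree depends only on its own bootstrap sample, the loop in `fit_forest` could run in parallel, which is the contrast with sequential boosting drawn in the text.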

2.4 Discrete wavelet transform (DWT)

The discrete wavelet transform (DWT) has been adopted as one of the multi-resolution signal processing approaches (Kim et al., 2016; Seo and Kim, 2016; Seo et al., 2015, 2016, 2018). Using the DWT approach, an original time series can be divided into different frequency components, namely an approximation and multiple details. If X = $\{X_{t} : t = 0, 1, \dots, N - 1\}$ is a time series, the J₀-level DWT of X provides W (i.e., the DWT coefficients) through an orthonormal transform (Percival and Walden, 2000).

The DWT approach can be performed using the Mallat algorithm (Mallat, 1989). The core of the Mallat algorithm is a two-channel filter bank consisting of a high-pass (wavelet) filter $\{h_{l} : l = 0, 1, \dots, L - 1\}$ and a low-pass (scaling) filter $\{g_{l} : l = 0, 1, \dots, L - 1\}$. The procedure consists of circular filtering and downsampling. Percival and Walden (2000) explained that the wavelet and scaling coefficients for the jth decomposition level can be defined as in Eq. (5).

(5)

W_{j, t} \equiv \sum_{l = 0}^{L - 1} h_{l} V_{j - 1, \, 2 t + 1 - l \bmod N_{j - 1}}, \quad V_{j, t} \equiv \sum_{l = 0}^{L - 1} g_{l} V_{j - 1, \, 2 t + 1 - l \bmod N_{j - 1}}, \quad t = 0, 1, \dots, N_{j} - 1

where $W_{j, t}$ and $V_{j, t}$ are the elements of $W_{j}$ and $V_{j}$, respectively, with $V_{0} \equiv X$ and $N_{j} = N / 2^{j}$. A schematic diagram of the two-level DWT approach can be found in Fig. 5. In this investigation, two details (D₁ and D₂) and an approximation (A₂) are generated from an original time series. Fig. 6 shows the flowchart for developing the DWT-GRU, DWT-GRNN, and DWT-RF models.
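The pyramid recursion of Eq. (5) can be written directly as circular filtering followed by downsampling. The sketch below assumes the Haar filter pair (the text does not state which wavelet was used) and returns wavelet and scaling coefficients rather than the reconstructed D₁, D₂, and A₂ subseries; the function names `dwt_step` and `dwt` are illustrative.

```python
import numpy as np

# Haar filters: an assumed choice, since the wavelet family is not named in the text.
HAAR_H = np.array([1.0, -1.0]) / np.sqrt(2.0)   # high-pass (wavelet) filter h_l
HAAR_G = np.array([1.0, 1.0]) / np.sqrt(2.0)    # low-pass (scaling) filter g_l


def dwt_step(v_prev, h=HAAR_H, g=HAAR_G):
    """One level of the pyramid algorithm: V_{j-1} -> (W_j, V_j)."""
    n_prev = len(v_prev)
    n_next = n_prev // 2
    w = np.zeros(n_next)
    v = np.zeros(n_next)
    for t in range(n_next):
        for l in range(len(h)):
            idx = (2 * t + 1 - l) % n_prev      # circular index, as in Eq. (5)
            w[t] += h[l] * v_prev[idx]
            v[t] += g[l] * v_prev[idx]
    return w, v


def dwt(x, levels=2):
    """Return ([W_1, ..., W_J], V_J) for a series whose length is divisible by 2**levels."""
    v = np.asarray(x, dtype=float)
    if len(v) % (2 ** levels) != 0:
        raise ValueError("series length must be a multiple of 2**levels")
    details = []
    for _ in range(levels):
        w, v = dwt_step(v)
        details.append(w)
    return details, v


if __name__ == "__main__":
    series = np.array([2.0, 4.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0])
    (w1, w2), v2 = dwt(series, levels=2)
    print(w1, w2, v2)
```

Since the transform is orthonormal, the sum of squared coefficients equals the sum of squares of the original series, which gives a quick sanity check on any implementation.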

https://cdn.apub.kr/journalsite/sites/kwra/2021-054-12S/N020054S-102/images/kwra_54_S-1_02_F5.jpg

Fig. 5.

Two-level DWT approach employing Mallat algorithm (Kim et al., 2017)

https://cdn.apub.kr/journalsite/sites/kwra/2021-054-12S/N020054S-102/images/kwra_54_S-1_02_F6.jpg

Fig. 6.

Flowchart for developing two-stage hybrid models

2.5 Evaluation of independent and two-stage hybrid models’ performance

To evaluate the performance of the independent and two-stage hybrid models, three statistical indices were used. The root mean square error (RMSE) (Willmott and Matsuura, 2005) measures the difference between the observed and predicted BOD concentration; RMSE = 0 indicates a perfect prediction. The RMSE index (Eq. (6)) should be applied for model evaluation (Deo et al., 2019). The Nash-Sutcliffe efficiency (NSE) (Nash and Sutcliffe, 1970) judges the models' efficiency by comparing the error variance with the variance of the observed BOD concentration. A perfect model (i.e., error variance = 0) gives NSE = 1. When the error variance is larger than the observed variance, NSE < 0. Garrick et al. (1978) reported that the NSE index (Eq. (7)) can take high values (e.g., over 0.8) even for poorly matched models, while the best-matched models do not necessarily yield much higher values. The correlation coefficient (CC) describes the correlation between the observed and predicted BOD concentration. A CC of zero indicates that BOD concentration cannot be predicted, whereas a CC of one indicates perfect linear agreement between observed and predicted values (Zounemat-Kermani et al., 2019; Kim et al., 2020). The CC index is computed using Eq. (8).

(6)

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} \left[ BOD_{obs} - BOD_{pre} \right]^{2}}

(7)

NSE = 1 - \frac{\sum_{i = 1}^{n} \left[ BOD_{obs} - BOD_{pre} \right]^{2}}{\sum_{i = 1}^{n} \left[ BOD_{obs} - \overline{BOD}_{obs} \right]^{2}}

(8)

CC = \frac{\frac{1}{n} \sum_{i = 1}^{n} \left( BOD_{obs} - \overline{BOD}_{obs} \right) \left( BOD_{pre} - \overline{BOD}_{pre} \right)}{\sqrt{\frac{1}{n} \sum_{i = 1}^{n} \left( BOD_{obs} - \overline{BOD}_{obs} \right)^{2}} \sqrt{\frac{1}{n} \sum_{i = 1}^{n} \left( BOD_{pre} - \overline{BOD}_{pre} \right)^{2}}}

where $BOD_{obs}$ and $BOD_{pre}$ are the observed and predicted BOD concentrations, $\overline{BOD}_{obs}$ and $\overline{BOD}_{pre}$ are the means of the observed and predicted BOD concentrations, and n is the total number of available data.
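The three indices of Eqs. (6)~(8) translate directly into NumPy. The following is a minimal sketch; the function names and the sample values are illustrative and are not data from the study.

```python
import numpy as np


def rmse(obs, pre):
    """Root mean square error, Eq. (6)."""
    obs, pre = np.asarray(obs, dtype=float), np.asarray(pre, dtype=float)
    return np.sqrt(np.mean((obs - pre) ** 2))


def nse(obs, pre):
    """Nash-Sutcliffe efficiency, Eq. (7): 1 - error variance / observed variance."""
    obs, pre = np.asarray(obs, dtype=float), np.asarray(pre, dtype=float)
    return 1.0 - np.sum((obs - pre) ** 2) / np.sum((obs - obs.mean()) ** 2)


def cc(obs, pre):
    """Correlation coefficient between observed and predicted values, Eq. (8)."""
    obs, pre = np.asarray(obs, dtype=float), np.asarray(pre, dtype=float)
    cov = np.mean((obs - obs.mean()) * (pre - pre.mean()))
    return cov / (np.sqrt(np.mean((obs - obs.mean()) ** 2))
                  * np.sqrt(np.mean((pre - pre.mean()) ** 2)))


if __name__ == "__main__":
    observed = [0.8, 1.1, 0.9, 1.4, 0.7]       # made-up BOD values in mg/L
    predicted = [0.9, 1.0, 0.9, 1.2, 0.8]
    print(rmse(observed, predicted), nse(observed, predicted), cc(observed, predicted))
```

A model that always predicts the observed mean gives NSE = 0, which is why negative NSE values indicate a model that performs worse than the mean of the observations.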

3. Study Boundary and Data Information

In this investigation, the Dosan (latitude 36°77' 24'' N; longitude 128°89' 04'' E) and Hwangji (latitude 37°06' 74'' N; longitude 129°05' 07'' E) stations in the upper Nakdong River basin, South Korea, were chosen to predict BOD concentration. The analysis of water quality and quantity indicators was based on eleven physical and chemical characteristics: BOD, TOC, T-P, T-N, COD, SS, WT, DO, EC, pH, and water discharge (DIS). Only DIS is a water quantity indicator. Fig. 7 shows the schematic maps of the Dosan and Hwangji stations.

https://cdn.apub.kr/journalsite/sites/kwra/2021-054-12S/N020054S-102/images/kwra_54_S-1_02_F7.jpg

Fig. 7.

The schematic maps of Dosan and Hwangji stations

The historical data (January 2008 - December 2020) for the water quality and quantity indicators, measured at irregular intervals, can be downloaded directly from the web-based information system of the National Institute of Environmental Research (NIER) (http://water.nier.go.kr), which is managed and operated by the Ministry of Environment (ME), South Korea. The complete dataset was separated into training and testing datasets. The training data comprised 80% of the complete data (N = 407 for Dosan and N = 398 for Hwangji), and the testing data comprised the remaining 20% (N = 101 for Dosan and N = 99 for Hwangji). Table 1 provides the basic statistical properties of the water quality and quantity indicators.

On July 15, 2020, the Ministry of Environment (ME) announced that it would set water quality targets for the border areas of the water pollution control system, which each provincial government in the Han River and Nakdong River water systems must achieve by 2030. The target for the Nakbon-A station (i.e., the border area of the local water pollution control system that includes the Dosan and Hwangji stations in the Nakdong River), to be achieved by 2030, is to reduce the average BOD concentration to 1.40 mg/L (ME, 2020). Table 1 therefore shows that the average BOD concentrations at the Dosan (0.91 mg/L) and Hwangji (1.18 mg/L) stations are within a stable and acceptable range compared with the target for the Nakbon-A station.

To better understand how the individual water quality and quantity indicators affect BOD concentration, the correlations between each input indicator and BOD concentration were computed and are listed in Table 2. The COD and TOC indicators showed the strongest correlations with BOD concentration at both stations. In this investigation, Category Ⅰ (i.e., on-site measuring classification) includes the pH, EC, DO, and WT water quality indicators, which are measured directly on samples in the field. Category Ⅱ comprises SS, COD, T-N, T-P, and TOC (i.e., laboratory-based analysis classification) and BOD (i.e., incubated-based analysis classification), which are determined indirectly on samples in the laboratory. Finally, Category Ⅲ contains the water quantity indicator, DIS.

Table 1.

Statistical properties of water quality and quantity indicators

Station	Water Quality Variables	Unit	Average	Min.	Max.	St. Dev.
Dosan	pH	-	7.92	6.30	9.40	0.45
	EC	µS/cm	275.84	80.00	749.00	96.62
	DO	mg/L	11.63	7.30	17.60	2.50
	WT	℃	14.09	0.10	33.70	8.78
	COD	mg/L	3.47	1.50	50.50	2.50
	SS	mg/L	10.34	0.20	620.00	44.72
	T-N	mg/L	2.29	0.79	5.01	0.66
	T-P	mg/L	0.02	0.00	0.39	0.03
	TOC	mg/L	2.16	0.90	49.30	2.37
	DIS	m³/sec	24.78	0.34	853.16	74.07
	BOD	mg/L	0.91	0.20	4.90	0.52
Hwangji	pH	-	8.23	7.00	9.10	0.35
	EC	µS/cm	458.27	191.00	823.00	123.39
	DO	mg/L	11.52	7.30	17.50	2.12
	WT	℃	12.68	-1.00	26.30	6.91
	COD	mg/L	3.24	1.50	10.00	0.95
	SS	mg/L	4.75	0.20	266.00	15.61
	T-N	mg/L	3.40	1.73	7.56	0.90
	T-P	mg/L	0.04	0.00	0.22	0.03
	TOC	mg/L	1.96	0.80	9.40	0.75
	DIS	m³/sec	5.75	0.49	330.65	17.68
	BOD	mg/L	1.18	0.30	9.00	0.82

Table 2.

Computation of correlations between corresponding input indicator and BOD concentration

Category	Input water quality indicator	Output indicator (BOD concentration)	
		Dosan	Hwangji
Ⅰ	pH	0.176	-0.003
	EC	-0.072	0.088
	DO	-0.312	0.036
	WT	0.370	-0.074
Ⅱ	SS	0.485	0.120
	COD	0.617	0.721
	T-N	-0.253	0.163
	T-P	0.513	0.349
	TOC	0.527	0.721
Ⅲ	DIS	0.152	-0.042

4. Results and Discussion

This investigation used various water quality and quantity indicators to predict BOD concentration at the Dosan and Hwangji stations, South Korea. As stated previously, evaluating the performance of the independent and two-stage hybrid models for predicting BOD concentration is the core aim of this investigation.

Some water quality indicators (e.g., pH, EC, DO, and WT) can be measured directly using a specific monitoring instrument. Others (e.g., SS, COD, T-N, T-P, and TOC) are measured indirectly by laboratory-based analysis, which requires a certain amount of time. BOD, one of the water quality indicators in the total water pollution management of the Four Rivers, South Korea, is measured indirectly by incubating a sample at 20 ℃ for five days (Jouanneau et al., 2014). Because this investigation aims to predict BOD concentration directly using deep learning and machine learning approaches, it can save the time and effort needed to incubate BOD samples in the laboratory.

Different combinations of water quality and quantity indicators were tested as inputs to select the best input combination at both stations. The independent and two-stage hybrid models were therefore developed for predicting BOD concentration using these input combinations. Since the COD and TOC indicators were selected as the fundamental water quality indicators at both stations, the combination of these two indicators was specified as Division 1. Table 3 lists the input combinations of water quality and quantity indicators used to predict BOD concentration, where all developed models are classified into five divisions (i.e., Divisions 1-5).

Table 3.

Diverse input combinations of independent and two-stage hybrid models

Division	Input combination	Output	Models
			Independent			Two-stage hybrid
			GRU	GRNN	RF	DWT-GRU	DWT-GRNN	DWT-RF
1	COD, TOC	BOD	GRU1	GRNN1	RF1	DWT-GRU1	DWT-GRNN1	DWT-RF1
2	COD, TOC, T-P, SS	BOD	GRU2	GRNN2	RF2	DWT-GRU2	DWT-GRNN2	DWT-RF2
3	COD, TOC, WT, pH	BOD	GRU3	GRNN3	RF3	DWT-GRU3	DWT-GRNN3	DWT-RF3
4	COD, TOC, T-P, SS, WT, pH	BOD	GRU4	GRNN4	RF4	DWT-GRU4	DWT-GRNN4	DWT-RF4
5	COD, TOC, T-P, SS, WT, pH, DIS	BOD	GRU5	GRNN5	RF5	DWT-GRU5	DWT-GRNN5	DWT-RF5
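The input combinations of Table 3 lend themselves to a small configuration structure. The sketch below encodes the five divisions and derives the model labels used in the tables; the helper names `model_names` and `select_inputs` and the record layout are hypothetical, and the sample record simply reuses the Dosan averages from Table 1.

```python
# Input indicators for each division, copied from Table 3.
DIVISIONS = {
    1: ["COD", "TOC"],
    2: ["COD", "TOC", "T-P", "SS"],
    3: ["COD", "TOC", "WT", "pH"],
    4: ["COD", "TOC", "T-P", "SS", "WT", "pH"],
    5: ["COD", "TOC", "T-P", "SS", "WT", "pH", "DIS"],
}

BASE_MODELS = ["GRU", "GRNN", "RF"]


def model_names(division):
    """Return the six model labels (independent, then two-stage hybrid) for a division."""
    independent = [f"{m}{division}" for m in BASE_MODELS]
    hybrid = [f"DWT-{m}{division}" for m in BASE_MODELS]
    return independent + hybrid


def select_inputs(record, division):
    """Pick the input values for one sample, in the order given by the division."""
    return [record[name] for name in DIVISIONS[division]]


if __name__ == "__main__":
    # Dosan averages from Table 1, used here only as a placeholder sample.
    sample = {"pH": 7.92, "EC": 275.84, "DO": 11.63, "WT": 14.09, "COD": 3.47,
              "SS": 10.34, "T-N": 2.29, "T-P": 0.02, "TOC": 2.16, "DIS": 24.78}
    for d in DIVISIONS:
        print(d, model_names(d), select_inputs(sample, d))
```

Keeping the divisions in one place makes it straightforward to loop over all 30 model variants (5 divisions × 6 model types) per station.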

4.1 Dosan station

4.1.1 Independent models

The three statistical indices for the independent models at Dosan station are listed in Table 4. In Division 1, the RF1 model (CC = 0.777, NSE = 0.603, and RMSE = 0.212 mg/L) performed better than the GRU1 and GRNN1 models during the testing step. In Division 2, the RF2 model (CC = 0.931, NSE = 0.857, and RMSE = 0.127 mg/L) was superior to the GRU2 and GRNN2 models. The RF3 model (CC = 0.901, NSE = 0.809, and RMSE = 0.147 mg/L) also outperformed the GRU3 and GRNN3 models in Division 3. In Division 4, the RF4 model (CC = 0.937, NSE = 0.870, and RMSE = 0.122 mg/L) clearly outperformed the GRU4 and GRNN4 models. Finally, the RF5 model (CC = 0.938, NSE = 0.875, and RMSE = 0.119 mg/L) was more accurate than the GRU5 and GRNN5 models in Division 5.

Considering all Divisions 1-5, the best-performing independent model of each type (i.e., GRU (Division 1), GRNN (Division 4), and RF (Division 5)) can be identified among the input combinations during the testing step. Table 4 shows that the optimal RF5 model provided more accurate results than the GRU1 and GRNN4 models. Therefore, among the optimal independent models, the RF5 model was the most reliable for predicting BOD concentration at Dosan station.

Table 4.

RMSE, NSE, and CC values for the independent and two-stage hybrid models at Dosan station

Type	Model	Testing step		
		RMSE (mg/L)	NSE	CC
Independent	GRU1	0.316	0.119	0.445
	GRU2	0.325	0.066	0.258
	GRU3	0.336	0.006	0.371
	GRU4	0.323	0.081	0.446
	GRU5	0.354	-0.107	0.617
	GRNN1	0.286	0.278	0.538
	GRNN2	0.264	0.384	0.623
	GRNN3	0.292	0.247	0.523
	GRNN4	0.244	0.476	0.691
	GRNN5	0.269	0.360	0.600
	RF1	0.212	0.603	0.777
	RF2	0.127	0.857	0.931
	RF3	0.147	0.809	0.901
	RF4	0.122	0.870	0.937
	RF5	0.119	0.875	0.938
Two-stage hybrid	DWT-GRU1	0.339	-0.016	0.379
	DWT-GRU2	0.317	0.114	0.557
	DWT-GRU3	0.469	-0.938	0.263
	DWT-GRU4	0.360	-0.142	0.314
	DWT-GRU5	0.332	0.028	0.487
	DWT-GRNN1	0.337	-0.004	0.099
	DWT-GRNN2	0.329	0.044	0.242
	DWT-GRNN3	0.337	-0.001	0.110
	DWT-GRNN4	0.325	0.069	0.321
	DWT-GRNN5	0.332	0.027	0.187
	DWT-RF1	0.126	0.860	0.936
	DWT-RF2	0.114	0.885	0.952
	DWT-RF3	0.126	0.859	0.937
	DWT-RF4	0.122	0.869	0.945
	DWT-RF5	0.108	0.897	0.960
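The comparison in Table 4 can be checked mechanically. The sketch below stores the testing-step RMSE values from Table 4 (Dosan station) and reports the overall best model and the number of cases in which the DWT-based hybrid has a strictly lower RMSE than its independent counterpart; the helper names `best_model` and `count_improved` are illustrative.

```python
# Testing-step RMSE (mg/L) at Dosan station, Divisions 1-5, copied from Table 4.
RMSE_DOSAN = {
    "GRU":      [0.316, 0.325, 0.336, 0.323, 0.354],
    "GRNN":     [0.286, 0.264, 0.292, 0.244, 0.269],
    "RF":       [0.212, 0.127, 0.147, 0.122, 0.119],
    "DWT-GRU":  [0.339, 0.317, 0.469, 0.360, 0.332],
    "DWT-GRNN": [0.337, 0.329, 0.337, 0.325, 0.332],
    "DWT-RF":   [0.126, 0.114, 0.126, 0.122, 0.108],
}


def best_model(rmse_table):
    """Return (label, rmse) of the model with the lowest RMSE."""
    value, label = min(
        (v, f"{name}{division}")
        for name, values in rmse_table.items()
        for division, v in enumerate(values, start=1)
    )
    return label, value


def count_improved(rmse_table, bases=("GRU", "GRNN", "RF")):
    """Count model/division pairs where the hybrid RMSE is strictly lower."""
    return sum(
        hybrid < single
        for base in bases
        for single, hybrid in zip(rmse_table[base], rmse_table[f"DWT-{base}"])
    )


if __name__ == "__main__":
    print(best_model(RMSE_DOSAN))
    print(count_improved(RMSE_DOSAN), "of 15 hybrid models improve on RMSE")
```

By this RMSE criterion, only some of the 15 hybrid variants improve on their independent counterparts, consistent with the observation in the abstract that the two-stage hybrid models do not always enhance predictive accuracy.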

To validate the accuracy of the optimal models graphically, Figs. 8(a)~8(c) give the scatter diagrams of observed and predicted BOD concentration for the optimal independent models at Dosan station. The RMSE value and the linear equation are shown in each scatter diagram. The RMSE values reveal a clear difference among the GRU1, GRNN4, and RF5 models. Accordingly, the RF5 model clearly performed better than the GRU1 and GRNN4 models, whereas the GRU1 model yielded the lowest accuracy at Dosan station.

https://cdn.apub.kr/journalsite/sites/kwra/2021-054-12S/N020054S-102/images/kwra_54_S-1_02_F8.jpg

Fig. 8.

Comparison of observed and predicted BOD concentration values for the optimal independent models during testing step (Dosan station)

4.1.2 Two-stage hybrid models

The three statistical indices for the two-stage hybrid models at Dosan station are also listed in Table 4. In Division 1, the DWT-RF1 model (CC = 0.936, NSE = 0.860, and RMSE = 0.126 mg/L) performed better than the DWT-GRU1 and DWT-GRNN1 models during the testing step. In Division 2, the DWT-RF2 model (CC = 0.952, NSE = 0.885, and RMSE = 0.114 mg/L) was superior to the DWT-GRU2 and DWT-GRNN2 models. In addition, the DWT-RF3 model (CC = 0.937, NSE = 0.859, and RMSE = 0.126 mg/L) surpassed the DWT-GRU3 and DWT-GRNN3 models in Division 3. In Division 4, the DWT-RF4 model (CC = 0.945, NSE = 0.869, and RMSE = 0.122 mg/L) clearly outperformed the DWT-GRU4 and DWT-GRNN4 models. Finally, the DWT-RF5 model (CC = 0.960, NSE = 0.897, and RMSE = 0.108 mg/L) was more accurate than the DWT-GRU5 and DWT-GRNN5 models in Division 5.

Considering all Divisions 1-5, the best-performing two-stage hybrid model of each type (i.e., DWT-GRU (Division 5), DWT-GRNN (Division 4), and DWT-RF (Division 5)) can be identified among the input combinations during the testing step. Table 4 shows that the optimum DWT-RF5 model yielded more accurate results than the DWT-GRU5 and DWT-GRNN4 models. Therefore, among the optimum two-stage hybrid models, the DWT-RF5 model was the most reliable for predicting BOD concentration at Dosan station.

To confirm the accuracy of the optimum models graphically, Figs. 9(a)~9(c) give the scatter diagrams of observed and predicted BOD concentration for the optimum two-stage hybrid models at Dosan station. The RMSE value and the linear equation are shown in each scatter diagram. The RMSE values reveal a clear difference among the DWT-GRU5, DWT-GRNN4, and DWT-RF5 models. As a result, the DWT-RF5 model performed distinctly better than the DWT-GRU5 and DWT-GRNN4 models, while the DWT-GRNN4 model provided the lowest accuracy at Dosan station.

https://cdn.apub.kr/journalsite/sites/kwra/2021-054-12S/N020054S-102/images/kwra_54_S-1_02_F9.jpg

Fig. 9.

Comparison of observed and predicted BOD concentration values for the optimum two-stage hybrid models during testing step (Dosan station)

4.2 Hwangji station

4.2.1 Independent models

The results of three statistical indices for different independent models are provided in Table 5 at Hwangji station. It can be judged from Table 5 that results of RF1 model (CC = 0.959, NSE = 0.911, and RMSE = 0.269 mg/L) were better than those of the GRU1 and GRNN1 models during testing step based on Division 1. In Division 2, the RF2 model (CC = 0.990, NSE = 0.955, and RMSE = 0.191 mg/L) was superior to the GRU2 and GRNN2 models. Also, the RF3 model (CC = 0.990, NSE = 0.968, and RMSE = 0.163 mg/L) exceeded the GRU3 and GRNN3 models in Division 3 during testing step. In addition, comparison of independent models in Division 4 submitted that the RF4 model (CC = 0.994, NSE = 0.965, and RMSE = 0.170 mg/L) controlled the GRU4 and GRNN4 models clearly during testing step. Finally, the RF5 model (CC = 0.993, NSE = 0.966, and RMSE = 0.168 mg/L) was more correct than the GRU5 and GRNN5 models during testing step in Division 5.

Taking the best models across Divisions 1-5, the optimal independent models during the testing step were GRU4, GRNN5, and RF3. Table 5 shows that the optimal RF3 model produced more accurate results than the GRU4 and GRNN5 models during the testing step. Therefore, among the optimal independent models, the RF3 model was the most reliable for predicting BOD concentration at Hwangji station.

Table 5.

RMSE, NSE, and CC values for the independent and two-stage hybrid models at Hwangji station

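Type	Model	RMSE (mg/L), testing step	NSE, testing step	CC, testing step
Independent	GRU1	0.546	0.639	0.852
Independent	GRU2	0.537	0.651	0.824
Independent	GRU3	0.434	0.773	0.948
Independent	GRU4	0.366	0.838	0.923
Independent	GRU5	0.618	0.537	0.767
Independent	GRNN1	0.289	0.898	0.948
Independent	GRNN2	0.280	0.904	0.951
Independent	GRNN3	0.303	0.888	0.942
Independent	GRNN4	0.248	0.925	0.962
Independent	GRNN5	0.187	0.957	0.979
Independent	RF1	0.269	0.911	0.959
Independent	RF2	0.191	0.955	0.990
Independent	RF3	0.163	0.968	0.990
Independent	RF4	0.170	0.965	0.994
Independent	RF5	0.168	0.966	0.993
Two-stage hybrid	DWT-GRU1	0.575	0.600	0.790
Two-stage hybrid	DWT-GRU2	0.573	0.603	0.802
Two-stage hybrid	DWT-GRU3	0.508	0.687	0.873
Two-stage hybrid	DWT-GRU4	0.420	0.786	0.925
Two-stage hybrid	DWT-GRU5	0.670	0.457	0.682
Two-stage hybrid	DWT-GRNN1	0.330	0.867	0.931
Two-stage hybrid	DWT-GRNN2	0.205	0.949	0.974
Two-stage hybrid	DWT-GRNN3	0.258	0.919	0.959
Two-stage hybrid	DWT-GRNN4	0.132	0.979	0.990
Two-stage hybrid	DWT-GRNN5	0.157	0.970	0.985
Two-stage hybrid	DWT-RF1	0.187	0.957	0.986
Two-stage hybrid	DWT-RF2	0.206	0.948	0.985
Two-stage hybrid	DWT-RF3	0.225	0.938	0.981
Two-stage hybrid	DWT-RF4	0.208	0.947	0.986
Two-stage hybrid	DWT-RF5	0.196	0.953	0.986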
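
For readers who wish to recompute the three indices reported in Table 5, the following is a minimal sketch using their standard definitions. The function names and the sample values are hypothetical and serve only as illustration; they are not data or code from this study.

```python
import numpy as np

def rmse(obs, pred):
    """Root mean square error, in the units of the data (mg/L for BOD)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 minus error variance over observed variance."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2))

def cc(obs, pred):
    """Pearson correlation coefficient between observations and predictions."""
    return float(np.corrcoef(obs, pred)[0, 1])

# Hypothetical BOD values (mg/L), for illustration only.
observed = [1.0, 1.4, 0.8, 2.1, 1.7]
predicted = [1.1, 1.3, 0.9, 1.9, 1.8]

print(rmse(observed, predicted), nse(observed, predicted), cc(observed, predicted))
```

A lower RMSE and values of NSE and CC closer to 1 indicate a better fit. Because NSE penalizes bias while CC does not, the two indices can rank models differently.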

To validate the optimal models visually, Figs. 10(a)~10(c) show scatter diagrams of observed and predicted BOD concentration for the optimal independent models at Hwangji station. The RMSE value and the fitted linear equation are indicated in each scatter diagram. The RMSE values show a clear difference among the GRU4, GRNN5, and RF3 models. Accordingly, the RF3 model performed better than the GRU4 and GRNN5 models, whereas the GRU4 model was the least accurate at Hwangji station.

https://cdn.apub.kr/journalsite/sites/kwra/2021-054-12S/N020054S-102/images/kwra_54_S-1_02_F10.jpg

Fig. 10.

Comparison of observed and predicted BOD concentration values for the optimal independent models during testing step (Hwangji station)

4.2.2 Two-stage hybrid models

Table 5 also lists the three statistical indices for the two-stage hybrid models at Hwangji station. In Division 1, the DWT-RF1 model (CC = 0.986, NSE = 0.957, and RMSE = 0.187 mg/L) was superior to the DWT-GRU1 and DWT-GRNN1 models during the testing step. In Division 2, the DWT-RF2 model (CC = 0.985, NSE = 0.948, and RMSE = 0.206 mg/L) clearly outperformed the DWT-GRU2 model and had a higher CC than the DWT-GRNN2 model, although DWT-GRNN2 was marginally better in RMSE (0.205 mg/L) and NSE (0.949). In Division 3, the DWT-RF3 model (CC = 0.981, NSE = 0.938, and RMSE = 0.225 mg/L) outperformed the DWT-GRU3 and DWT-GRNN3 models. In Division 4, the DWT-GRNN4 model (CC = 0.990, NSE = 0.979, and RMSE = 0.132 mg/L) clearly outperformed the DWT-GRU4 and DWT-RF4 models. Finally, in Division 5, the DWT-RF5 model (CC = 0.986, NSE = 0.953, and RMSE = 0.196 mg/L) clearly outperformed the DWT-GRU5 model and had a marginally higher CC than the DWT-GRNN5 model (0.985), although DWT-GRNN5 gave a lower RMSE (0.157 mg/L) and a higher NSE (0.970).

Taking the best models across Divisions 1-5, the optimum two-stage hybrid models during the testing step were DWT-GRU4, DWT-GRNN4, and DWT-RF1. Table 5 shows that the optimum DWT-GRNN4 model (RMSE = 0.132 mg/L) produced more accurate results than the DWT-GRU4 (RMSE = 0.420 mg/L) and DWT-RF1 (RMSE = 0.187 mg/L) models during the testing step. Therefore, among the optimum two-stage hybrid models, the DWT-GRNN4 model was the most reliable for predicting BOD concentration at Hwangji station.
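
The choice of one optimum model per model family can be checked against Table 5 with a short script. This is a minimal sketch that assumes a lowest-testing-RMSE selection rule; the rule and the names `rmse_table` and `best_per_family` are illustrative assumptions, not the authors' code.

```python
# Testing-step RMSE (mg/L) from Table 5, Hwangji station, Divisions 1-5.
rmse_table = {
    "GRU": [0.546, 0.537, 0.434, 0.366, 0.618],
    "GRNN": [0.289, 0.280, 0.303, 0.248, 0.187],
    "RF": [0.269, 0.191, 0.163, 0.170, 0.168],
    "DWT-GRU": [0.575, 0.573, 0.508, 0.420, 0.670],
    "DWT-GRNN": [0.330, 0.205, 0.258, 0.132, 0.157],
    "DWT-RF": [0.187, 0.206, 0.225, 0.208, 0.196],
}

def best_per_family(table):
    """Return {model name: RMSE} for the lowest-RMSE Division of each family."""
    best = {}
    for family, values in table.items():
        idx = min(range(len(values)), key=values.__getitem__)
        best[f"{family}{idx + 1}"] = values[idx]  # Divisions are numbered from 1
    return best

optimal = best_per_family(rmse_table)
overall = min(optimal, key=optimal.get)

print(optimal)  # keys: GRU4, GRNN5, RF3, DWT-GRU4, DWT-GRNN4, DWT-RF1
print(overall)  # DWT-GRNN4
```

Under this rule the selected models are GRU4, GRNN5, RF3, DWT-GRU4, DWT-GRNN4, and DWT-RF1, and DWT-GRNN4 has the lowest RMSE overall (0.132 mg/L).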

To confirm the performance of the optimum models visually, Figs. 11(a)~11(c) show scatter diagrams of observed and predicted BOD concentration for the optimum two-stage hybrid models at Hwangji station. The RMSE value and the fitted linear equation are displayed in each scatter diagram. The RMSE values show a clear difference among the DWT-GRU4, DWT-GRNN4, and DWT-RF1 models. The DWT-GRNN4 model performed better than the DWT-GRU4 and DWT-RF1 models, while the DWT-GRU4 model was the least accurate at Hwangji station.

https://cdn.apub.kr/journalsite/sites/kwra/2021-054-12S/N020054S-102/images/kwra_54_S-1_02_F11.jpg

Fig. 11.

Comparison of observed and predicted BOD concentration values for the optimum two-stage hybrid models during testing step (Hwangji station)

4.3 Discussion

Overall, this study investigated the nonlinear and nonstationary behavior of BOD concentration using independent and two-stage hybrid models at Dosan and Hwangji stations, South Korea. Because the best independent models used different input combinations at Dosan (GRU1, GRNN4, and RF5) and Hwangji (GRU4, GRNN5, and RF3), no single input combination can be identified as the most accurate for the independent models. In addition, the statistical results suggest that, within each Division, the GRU and GRNN models predicted BOD concentration less precisely than the RF model at both stations. The predictive accuracy of the independent models thus varied with the input combination, mainly because the models rely on different inference mechanisms and architectures.

The core purpose of a two-stage hybrid model, which couples the discrete wavelet transform (DWT) with an independent model, is to improve the accuracy of BOD prediction relative to the corresponding independent model. In terms of RMSE at Dosan station, the DWT-GRU2 (2.5% relative to GRU2) and DWT-GRU5 (6.6% relative to GRU5) models improved the predictive accuracy among the DWT-GRU models. Likewise, the DWT-RF1 (68.3% relative to RF1), DWT-RF2 (11.4% relative to RF2), DWT-RF3 (16.7% relative to RF3), and DWT-RF5 (10.2% relative to RF5) models improved the accuracy among the DWT-RF models. However, none of the DWT-GRNN models improved on the corresponding GRNN models. Among the optimal independent and two-stage hybrid models, the DWT-RF5 model provided the best accuracy, improving on DWT-GRNN4, DWT-GRU5, RF5, GRNN4, and GRU1 by 200.9%, 207.4%, 10.2%, 125.9%, and 192.6%, respectively.
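
The first stage of the hybrid idea, decomposing an input series with a DWT before passing the sub-series to a predictive model, can be illustrated as follows. This is a minimal sketch only: the Haar wavelet is an assumption made for simplicity (the mother wavelet used in the study is not restated here), the two decomposition levels follow the caption of Fig. 5, and all function names and sample values are hypothetical. The series length must be a multiple of four.

```python
import numpy as np

def haar_step(x):
    """One Haar analysis step: (approximation, detail), each half the input length."""
    x = np.asarray(x, float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return a, d

def haar_inverse(a, d):
    """One Haar synthesis step: rebuilds the series from (approximation, detail)."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

def two_level_components(x):
    """Split x into full-length sub-series D1, D2, A2 with D1 + D2 + A2 == x."""
    a1, d1 = haar_step(x)
    a2, d2 = haar_step(a1)
    z1, z2 = np.zeros_like(a1), np.zeros_like(a2)
    D1 = haar_inverse(z1, d1)                    # level-1 detail only
    D2 = haar_inverse(haar_inverse(z2, d2), z1)  # level-2 detail only
    A2 = haar_inverse(haar_inverse(a2, z2), z1)  # level-2 approximation only
    return D1, D2, A2

# Hypothetical input indicator series, for illustration only.
series = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0])
D1, D2, A2 = two_level_components(series)
print(A2)  # with Haar, A2 is the mean of each block of four values
```

In a second stage, sub-series such as D1, D2, and A2 would replace or supplement the raw indicator as inputs to the GRU, GRNN, or RF model. How the authors assembled those inputs is described in the methodology, not in this sketch.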

In terms of RMSE at Hwangji station, the DWT-GRNN2 (36.6% relative to GRNN2), DWT-GRNN3 (17.4% relative to GRNN3), DWT-GRNN4 (87.9% relative to GRNN4), and DWT-GRNN5 (19.1% relative to GRNN5) models improved the predictive accuracy among the DWT-GRNN models. Each percentage is the RMSE difference relative to the RMSE of the better model, e.g., (0.280 - 0.205)/0.205 = 36.6% for DWT-GRNN2. Among the DWT-RF models, only DWT-RF1 (43.9% relative to RF1) improved the accuracy. However, none of the DWT-GRU models improved on the corresponding GRU models. Among the optimal independent and two-stage hybrid models, the DWT-GRNN4 model yielded the best accuracy, improving on DWT-RF1, DWT-GRU4, RF3, GRNN5, and GRU4 by 41.7%, 218.2%, 23.5%, 41.7%, and 177.3%, respectively.

In this investigation, the two-stage hybrid models did not always enhance the predictive accuracy of the independent models at either station. This finding is consistent with Zounemat-Kermani et al. (2019), who developed multilayer perceptron (MLP) and cascade correlation neural networks (CCNN) models to predict DO concentration in Florida, USA. They combined two data pre-processing approaches, DWT and variational mode decomposition (VMD), with the MLP and CCNN models to enhance the accuracy of DO prediction. Their results revealed that the DWT-CCNN and VMD-CCNN models could not improve on the accuracy of the CCNN model. Kim et al. (2020) applied the deep echo state network (Deep ESN), extreme learning machine (ELM), gradient boosting regression tree (GBRT), and RF models to predict BOD concentration at Gongreung and Gyeongan stations, South Korea. They found that the Deep ESN model gave the most accurate prediction among the developed standalone models.
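
The improvement percentages quoted for Hwangji station can be checked against the testing-step RMSE values in Table 5. The sketch below assumes that each percentage is the RMSE difference divided by the RMSE of the better model, an inference that reproduces the quoted figures. The function name `improvement_pct` is hypothetical.

```python
def improvement_pct(rmse_reference, rmse_better):
    """Relative RMSE gain (%) of the better model over a reference model."""
    return 100.0 * (rmse_reference - rmse_better) / rmse_better

# Testing-step RMSE (mg/L) from Table 5, Hwangji station.
best = 0.132  # DWT-GRNN4
others = {"DWT-RF1": 0.187, "DWT-GRU4": 0.420, "RF3": 0.163,
          "GRNN5": 0.187, "GRU4": 0.366}

gains = {name: round(improvement_pct(value, best), 1)
         for name, value in others.items()}
print(gains)

# Hybrid versus its own independent counterpart, e.g. DWT-GRNN5 versus GRNN5.
print(round(improvement_pct(0.187, 0.157), 1))  # 19.1
```

Because the denominator is the smaller RMSE, values above 100% are possible. A reduction expressed relative to the reference model's RMSE would stay below 100%.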

Taylor (2001) introduced a polar diagram (referred to here as the Taylor scheme) that summarizes model performance visually. The diagram displays the relationship among three statistical indices, namely CC, normalized standard deviation (NSD), and RMSE, in a single plot. Fig. 12(a) provides the Taylor scheme for the optimal independent and two-stage hybrid models at Dosan station. In Fig. 12(a), the points for the RF5 and DWT-RF5 models (the latter having the smallest RMSE) lie closest to the observed point, whereas the points for the GRU1 and DWT-GRNN4 models lie farthest from it. Fig. 12(b) illustrates the Taylor scheme for the optimum independent and two-stage hybrid models at Hwangji station. In Fig. 12(b), the points for the RF3 and DWT-GRNN4 models (the latter having the smallest RMSE) lie nearest to the observed point, while the points for the GRU4 and DWT-GRU4 models lie farthest from it.
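
The geometry behind such a diagram can be sketched numerically. In Taylor (2001), a model is plotted at radius NSD and angle arccos(CC), and its distance from the reference (observed) point equals the centred RMS difference, which coincides with RMSE only when the predictions carry no mean bias. The code below is a minimal illustration with hypothetical names and values, not the plotting routine used for Fig. 12.

```python
import numpy as np

def taylor_stats(obs, pred):
    """Return (CC, NSD, centred RMS difference normalised by the observed std)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    sigma_o, sigma_p = obs.std(), pred.std()
    cc = float(np.corrcoef(obs, pred)[0, 1])
    nsd = float(sigma_p / sigma_o)
    anomalies = (pred - pred.mean()) - (obs - obs.mean())
    crmsd = float(np.sqrt(np.mean(anomalies ** 2)) / sigma_o)
    return cc, nsd, crmsd

# Hypothetical BOD values (mg/L), for illustration only.
observed = [1.0, 1.4, 0.8, 2.1, 1.7]
predicted = [1.1, 1.3, 0.9, 1.9, 1.8]
cc, nsd, crmsd = taylor_stats(observed, predicted)

# Cartesian position of the model point: radius NSD, angle arccos(CC).
x, y = nsd * cc, nsd * np.sqrt(1.0 - cc ** 2)
distance_to_reference = float(np.hypot(x - 1.0, y))  # reference point is (1, 0)

print(cc, nsd, crmsd, distance_to_reference)
```

The last two printed values agree because of the law-of-cosines relation crmsd^2 = 1 + NSD^2 - 2 NSD CC, which is what allows three indices to be read from a single two-dimensional plot.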

Because the approach used in this investigation is not a conventional method for predicting BOD concentration, its limitations should be examined in future work. In particular, combining different evolutionary algorithms (Sahay and Srivastava, 2014; Kalteh, 2015; Yaseen et al., 2018; Zakhrouf et al., 2018, 2020; Fallah et al., 2019; Rezaie-Balf et al., 2019) with the two-stage hybrid models is recommended to enhance the accuracy of BOD prediction.

https://cdn.apub.kr/journalsite/sites/kwra/2021-054-12S/N020054S-102/images/kwra_54_S-1_02_F12.jpg

Fig. 12.

Taylor schemes based on the optimal independent and two-stage hybrid models at both stations

5. Conclusions

This investigation examined the accuracy and efficiency of BOD concentration predictions made with independent and two-stage hybrid models at Dosan and Hwangji stations, South Korea. Among the eleven water quality and quantity indicators available at both stations, eight (pH, WT, SS, COD, T-P, TOC, BOD, and DIS) were selected to form the input combinations (i.e., Divisions 1-5). For the training and testing steps of the independent and two-stage hybrid models, the collected data (January 2008 - December 2020) were separated into 80% (training) and 20% (testing). Statistical criteria and graphical support (i.e., scatter diagrams and the Taylor scheme) were employed to compare the models across the input combinations.

Considering the best models across Divisions 1-5, the DWT-RF5 model (RMSE = 0.108 mg/L, NSE = 0.897, and CC = 0.960) provided the best results among the optimal independent and two-stage hybrid models (i.e., GRU1, GRNN4, RF5, DWT-GRU5, and DWT-GRNN4) during the testing step at Dosan station. In addition, the DWT-GRNN4 model (RMSE = 0.132 mg/L, NSE = 0.979, and CC = 0.990) gave the most accurate and credible results among the optimum models (i.e., GRU4, GRNN5, RF3, DWT-GRU4, and DWT-RF1) for predicting BOD concentration during the testing step at Hwangji station. However, this investigation demonstrated that two-stage hybrid models did not always improve the accuracy and efficiency of the BOD predictions made by the independent models at either station. To confirm these results, reliable water quality and quantity indicators should be obtained from additional datasets, and BOD concentration in rivers should be predicted using a wider range of two-stage hybrid paradigms.

References

Ahmadi, A., Fatemi, Z., and Nazari, S. (2018). "Assessment of input data selection methods for BOD simulation using data-driven models: A case study." Environmental Monitoring and Assessment, Vol. 190, No. 4, p. 239. 10.1007/s10661-018-6608-4, PMID 29564564

Ahmadi, A., Nasseri, M., and Solomatine, D.P. (2019). "Parametric uncertainty assessment of hydrological models: coupling UNEEC-P and a fuzzy general regression neural network." Hydrological Sciences Journal, Vol. 64, No. 9, pp. 1080-1094. 10.1080/02626667.2019.1610565

Ahmed, A.A.M., and Shah, S.M.A. (2017). "Application of adaptive neuro-fuzzy inference system (ANFIS) to estimate the biochemical oxygen demand (BOD) of Surma River." Journal of King Saud University-Engineering Sciences, Vol. 29, No. 3, pp. 237-243. 10.1016/j.jksues.2015.02.001

Alizamir, M., Kim, S., Zounemat-Kermani, M., Heddam, S., Shahrabadi, A.H., and Gharabaghi, B. (2021). "Modelling daily soil temperature by hydro-meteorological data at different depths using a novel data-intelligence model: Deep echo state network model." Artificial Intelligence Review, Vol. 54, No. 4, pp. 2863-2890. 10.1007/s10462-020-09915-5

Ay, M., and Kisi, O. (2012). "Modeling of dissolved oxygen concentration using different neural network techniques in Foundation Creek, El Paso County, Colorado." Journal of Environmental Engineering, Vol. 138, No. 6, pp. 654-662. 10.1061/(ASCE)EE.1943-7870.0000511

Breiman, L. (2001). "Random forests." Machine Learning, Vol. 45, No. 1, pp. 5-32. 10.1023/A:1010933404324

Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014). "On the properties of neural machine translation: Encoder-decoder approaches." arXiv preprint arXiv:1409.1259. 10.3115/v1/W14-4012

Deo, R.C., Şahin, M., Adamowski, J.F., and Mi, J. (2019). "Universally deployable extreme learning machines integrated with remotely sensed MODIS satellite predictors over Australia to forecast global solar radiation: A new approach." Renewable and Sustainable Energy Reviews, Vol. 104, pp. 235-261. 10.1016/j.rser.2019.01.009

Diamantopoulou, M.J., Antonopoulos, V.Z., and Papamichail, D.M. (2007). "Cascade correlation artificial neural networks for estimating missing monthly values of water quality parameters in rivers." Water Resources Management, Vol. 21, No. 3, pp. 649-662. 10.1007/s11269-006-9036-0

Dogan, E., Sengorur, B., and Koklu, R. (2009). "Modeling biological oxygen demand of the Melen River in Turkey using an artificial neural network technique." Journal of Environmental Management, Vol. 90, No. 2, pp. 1229-1235. 10.1016/j.jenvman.2008.06.004, PMID 18691805

Emamgholizadeh, S., Kashi, H., Marofpoor, I., and Zalaghi, E. (2014). "Prediction of water quality parameters of Karoon River (Iran) by artificial intelligence-based models." International Journal of Environmental Science and Technology, Vol. 11, No. 3, pp. 645-656. 10.1007/s13762-013-0378-x

Fallah, H., Kisi, O., Kim, S., and Rezaie-Balf, M. (2019). "A new optimization approach for the least-cost design of water distribution networks: Improved crow search algorithm." Water Resources Management, Vol. 33, No. 10, pp. 3595-3613. 10.1007/s11269-019-02322-8

Friedman, J.H. (2002). "Stochastic gradient boosting." Computational Statistics and Data Analysis, Vol. 38, No. 4, pp. 367-378. 10.1016/S0167-9473(01)00065-2

Garrick, M., Cunnane, C., and Nash, J.E. (1978). "A criterion of efficiency for rainfall-runoff models." Journal of Hydrology, Vol. 36, No. 3-4, pp. 375-381. 10.1016/0022-1694(78)90155-5

Granata, F., Papirio, S., Esposito, G., Gargano, R., and de Marinis, G. (2017). "Machine learning algorithms for the forecasting of wastewater quality indicators." Water, Vol. 9, No. 2, p. 105. 10.3390/w9020105

Jouanneau, S., Recoules, L., Durand, M.J., Boukabache, A., Picot, V., Primault, Y., Lakel, A., Sengelin, M., Barillon, B., and Thouand, G. (2014). "Methods for assessing biochemical oxygen demand (BOD): A review." Water Research, Vol. 49, pp. 62-82. 10.1016/j.watres.2013.10.066, PMID 24316182

Kalteh, A.M. (2015). "Wavelet genetic algorithm-support vector regression (wavelet GA-SVR) for monthly flow forecasting." Water Resources Management, Vol. 29, No. 4, pp. 1283-1293. 10.1007/s11269-014-0873-y

Khaled, B., Abdellah, A., Noureddine, D., Salim, H., and Sabeha, A. (2017). "Modelling of biochemical oxygen demand from limited water quality variable by ANFIS using two partition methods." Water Quality Research Journal of Canada, Vol. 53, No. 1, pp. 24-40. 10.2166/wqrj.2017.015

Kim, S. (2000). "The application of neural networks method for the flood discharge forecasting in the river basin." Journal of Korean Society of Civil Engineers, Vol. 20, No. 6-B, pp. 801-811 (in Korean).

Kim, S. (2011). "Nonlinear hydrologic modeling using the stochastic and neural networks approach." Disaster Advances, Vol. 4, No. 1, pp. 53-63.

Kim, S., Alizamir, M., Zounemat-Kermani, M., Kisi, O., and Singh, V.P. (2020). "Assessing the biochemical oxygen demand using neural networks and ensemble tree approaches in South Korea." Journal of Environmental Management, Vol. 270, p. 110834. 10.1016/j.jenvman.2020.110834, PMID 32507742

Kim, S., and Kim, H.S. (2007). "Neural networks-genetic algorithm model for modeling of nonlinear evaporation and evapotranspiration time series 1. Theory and application of the model." Journal of Korean Water Resources Association, Vol. 40, No. 1, pp. 73-88. (in Korean) 10.3741/JKWRA.2007.40.1.073

Kim, S., and Kim, H.S. (2008). "Neural networks and genetic algorithm approach for nonlinear evaporation and evapotranspiration modeling." Journal of Hydrology, Vol. 351, No. 3-4, pp. 299-317. 10.1016/j.jhydrol.2007.12.014

Kim, S., Kim, J.H., and Park, K.B. (2009). "Statistical learning theory for the disaggregation of the climatic data." Proceedings of the 33rd IAHR Congress, Vancouver, Canada, pp. 1154-1162.

Kim, S., Kisi, O., Seo, Y., Singh, V.P., and Lee, C.J. (2017). "Assessment of rainfall aggregation and disaggregation using data-driven models and wavelet decomposition." Hydrology Research, Vol. 48, No. 1, pp. 99-116. 10.2166/nh.2016.314

Kim, S., Maleki, N., Rezaie-Balf, M., Singh, V.P., Alizamir, M., Kim, N.W., Lee, J.T., and Kisi, O. (2021). "Assessment of the total organic carbon employing the different nature-inspired approaches in the Nakdong River, South Korea." Environmental Monitoring and Assessment, Vol. 193, No. 7, pp. 1-22. 10.1007/s10661-021-08907-4, PMID 34173069

Kim, S., Park, K.B., and Seo, Y.M. (2012). "Estimation of pan evaporation using neural networks and climate-based models." Disaster Advances, Vol. 5, No. 3, pp. 34-43.

Kim, S., Seo, Y., and Lee, C.J. (2016). "Modeling of rainfall by combining neural computation and wavelet technique." Procedia Engineering, Vol. 154, pp. 1231-1236. 10.1016/j.proeng.2016.07.442

Kişi, Ö. (2006). "Generalized regression neural networks for evapotranspiration modelling." Hydrological Sciences Journal, Vol. 51, No. 6, pp. 1092-1105. 10.1623/hysj.51.6.1092

Ladlani, I., Houichi, L., Djemili, L., Heddam, S., and Belouz, K. (2012). "Modeling daily reference evapotranspiration (ET_o) in the north of Algeria using generalized regression neural networks (GRNN) and radial basis function neural networks (RBFNN): A comparative study." Meteorology and Atmospheric Physics, Vol. 118, No. 3, pp. 163-178. 10.1007/s00703-012-0205-9

Li, J., Abdulmohsin, H.A., Hasan, S.S., Kaiming, L., Al-Khateeb, B., Ghareb, M.I., and Mohammed, M.N. (2019). "Hybrid soft computing approach for determining water quality indicator: Euphrates River." Neural Computing and Applications, Vol. 31, No. 3, pp. 827-837. 10.1007/s00521-017-3112-7

Li, X., Zecchin, A.C., and Maier, H.R. (2014). "Selection of smoothing parameter estimators for general regression neural networks - applications to hydrological and water resources modelling." Environmental Modelling and Software, Vol. 59, pp. 162-186. 10.1016/j.envsoft.2014.05.010

Mallat, S.G. (1989). "A theory of multiresolution signal decomposition: the wavelet representation." IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, No. 3, pp. 674-693. 10.1109/34.192463

Ministry of Environment (ME) (2020). Full-scale implementation of the total water pollution control system in the 2030 phase of the Four Major Rivers (7.15). Press release.

Nash, J.E., and Sutcliffe, J.V. (1970). "River flow forecasting through conceptual models, Part 1 - A discussion of principles." Journal of Hydrology, Vol. 10, No. 3, pp. 282-290. 10.1016/0022-1694(70)90255-6

Noori, R., Yeh, H.D., Abbasi, M., Kachoosangi, F.T., and Moazami, S. (2015). "Uncertainty analysis of support vector machine for online prediction of five-day biochemical oxygen demand." Journal of Hydrology, Vol. 527, pp. 833-843. 10.1016/j.jhydrol.2015.05.046

Percival, D.B., and Walden, A.T. (2000). Wavelet methods for time series analysis. Cambridge University Press, New York, NY, U.S. 10.1017/CBO9780511841040

Raheli, B., Aalami, M.T., El-Shafie, A., Ghorbani, M.A., and Deo, R.C. (2017). "Uncertainty assessment of the multilayer perceptron (MLP) neural network model with implementation of the novel hybrid MLP-FFA method for prediction of biochemical oxygen demand and dissolved oxygen: A case study of Langat River." Environmental Earth Sciences, Vol. 76, No. 14, p. 503. 10.1007/s12665-017-6842-z

Rezaie-Balf, M., Maleki, N., Kim, S., Ashrafian, A., Babaie-Miri, F., Kim, N.W., Chung, I.M., and Alaghmand, S. (2019). "Forecasting daily solar radiation using CEEMDAN decomposition-based MARS model trained by crow search algorithm." Energies, Vol. 12, No. 8, p. 1416. 10.3390/en12081416

Royal Commission on Sewage Disposal (1908). Fifth report on methods of treating and disposing of sewage. UK.

Sahay, R.R., and Srivastava, A. (2014). "Predicting monsoon floods in rivers embedding wavelet transform, genetic algorithm and neural network." Water Resources Management, Vol. 28, No. 2, pp. 301-317. 10.1007/s11269-013-0446-5

Seo, Y., and Kim, S. (2016). "Hydrological forecasting using hybrid data-driven approach." American Journal of Applied Sciences, Vol. 13, No. 8, pp. 891-899. 10.3844/ajassp.2016.891.899

Seo, Y., Kim, S., and Singh, V.P. (2018). "Comparison of different heuristic and decomposition techniques for river stage modeling." Environmental Monitoring and Assessment, Vol. 190, No. 7, pp. 1-22. 10.1007/s10661-018-6768-2, PMID 29892912

Seo, Y., Kim, S., Kisi, O., and Singh, V.P. (2015). "Daily water level forecasting using wavelet decomposition and artificial intelligence techniques." Journal of Hydrology, Vol. 520, pp. 224-243. 10.1016/j.jhydrol.2014.11.050

Seo, Y., Kim, S., Kisi, O., Singh, V.P., and Parasuraman, K. (2016). "River stage forecasting using wavelet packet decomposition and machine learning models." Water Resources Management, Vol. 30, No. 11, pp. 4011-4035. 10.1007/s11269-016-1409-4

Simard, M., Saatchi, S.S., and De Grandi, G. (2000). "The use of decision tree and multiscale texture for classification of JERS-1 SAR data over tropical forest." IEEE Transactions on Geoscience and Remote Sensing, Vol. 38, No. 5, pp. 2310-2321. 10.1109/36.868888

Solgi, A., Pourhaghi, A., Bahmani, R., and Zarei, H. (2017). "Improving SVR and ANFIS performance using wavelet transform and PCA algorithm for modeling and predicting biochemical oxygen demand (BOD)." Ecohydrology and Hydrobiology, Vol. 17, No. 2, pp. 164-175. 10.1016/j.ecohyd.2017.02.002

Specht, D.F. (1991). "A general regression neural network." IEEE Transactions on Neural Networks, Vol. 2, No. 6, pp. 568-576. 10.1109/72.97934, PMID 18282872

Tao, H., Bobaker, A.M., Ramal, M.M., Yaseen, Z.M., Hossain, M.S., and Shahid, S. (2019). "Determination of biochemical oxygen demand and dissolved oxygen for semi-arid river environment: application of soft computing models." Environmental Science and Pollution Research, Vol. 26, No. 1, pp. 923-937. 10.1007/s11356-018-3663-x, PMID 30421367

Taylor, K.E. (2001). "Summarizing multiple aspects of model performance in a single diagram." Journal of Geophysical Research: Atmospheres, Vol. 106, No. D7, pp. 7183-7192. 10.1029/2000JD900719

Willmott, C.J., and Matsuura, K. (2005). "Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance." Climate Research, Vol. 30, No. 1, pp. 79-82. 10.3354/cr030079

Yang, G., Lee, H., and Lee, G. (2020). "A hybrid deep learning model to forecast particulate matter concentration levels in Seoul, South Korea." Atmosphere, Vol. 11, No. 4, p. 348. 10.3390/atmos11040348

Yaseen, Z.M., Karami, H., Ehteram, M., Mohd, N.S., Mousavi, S.F., Hin, L.S., Kisi, O., Farzin, S., Kim, S., and El-Shafie, A. (2018). "Optimization of reservoir operation using new hybrid algorithm." KSCE Journal of Civil Engineering, Vol. 22, No. 11, pp. 4668-4680. 10.1007/s12205-018-2095-y

Zakhrouf, M., Bouchelkia, H., Stamboul, M., and Kim, S. (2020). "Novel hybrid approaches based on evolutionary strategy for streamflow forecasting in the Chellif River, Algeria." Acta Geophysica, Vol. 68, No. 1, pp. 167-180. 10.1007/s11600-019-00380-5

Zakhrouf, M., Bouchelkia, H., Stamboul, M., Kim, S., and Heddam, S. (2018). "Time series forecasting of river flow using an integrated approach of wavelet multi-resolution analysis and evolutionary data-driven models. A case study: Sebaou River (Algeria)." Physical Geography, Vol. 39, No. 6, pp. 506-522. 10.1080/02723646.2018.1429245

Zhang, Y., Pulliainen, J., Koponen, S., and Hallikainen, M. (2002). "Application of an empirical neural network to surface water quality estimation in the Gulf of Finland using combined optical data and microwave data." Remote Sensing of Environment, Vol. 81, No. 2-3, pp. 327-336. 10.1016/S0034-4257(02)00009-3

Zou, R., Lung, W.S., and Wu, J. (2007). "An adaptive neural network embedded genetic algorithm approach for inverse water quality modeling." Water Resources Research, Vol. 43, No. 8, W08427. 10.1029/2006WR005158

Zounemat-Kermani, M., Rajaee, T., Ramezani-Charmahineh, A., and Adamowski, J.F. (2017). "Estimating the aeration coefficient and air demand in bottom outlet conduits of dams using GEP and decision tree methods." Flow Measurement and Instrumentation, Vol. 54, pp. 9-19. 10.1016/j.flowmeasinst.2016.11.004

Zounemat-Kermani, M., Seo, Y., Kim, S., Ghorbani, M.A., Samadianfard, S., Naghshara, S., Kim, N.W., and Singh, V.P. (2019). "Can decomposition approaches always enhance soft computing models? Predicting the dissolved oxygen concentration in the St. Johns River, Florida." Applied Sciences, Vol. 9, No. 12, p. 2534. 10.3390/app9122534

Journal of Korea Water Resources Association ISSN:2799-8746(Print) 2799-8754(Online) 한국수자원학회 논문집
