Li, Jinshu
Jinshu Li, PhD, is a Data Scientist at Stantec with five years of research experience in water resources management and operations. With a composite background...

Titles from this speaker

A novel modeling framework for predicting water quality results based on storm parameters in northeastern Los Angeles County
Abstract
Introduction: The East San Gabriel Valley Watershed Management Group (ESGV Group), a client of Stantec, comprises the Cities of Claremont, La Verne, Pomona, and San Dimas (Group Members). The ESGV Group has performed water quality and storm parameter monitoring for five wet seasons. The collected data were analyzed to determine whether simple relationships exist between descriptive storm parameters and the measured water quality. In other words, the group is interested in finding a linear relationship between storm parameters (e.g., storm duration, precipitation, intensity) and specific water pollutant concentrations (e.g., E. coli). Traditionally, multiple linear regression (MLR) is employed for this work, where all the available storm parameters are used as predictors (i.e., regressors) and the water pollutant concentration serves as the output variable (i.e., response). This traditional approach can perform well when the number of predictors (i.e., available storm parameters) is small (e.g., 2-3). However, when the number of predictors becomes large (e.g., 5+), MLR may easily overfit and produce suboptimal regression results, mainly because MLR has no predictor selection process and therefore uses all available predictors regardless of their usefulness. As a result, we propose a novel linear regression framework based on the Least Absolute Shrinkage and Selection Operator (LASSO), aimed at improving performance on regression problems with many predictors. We test our framework on three different water quality pollutants: E. coli, dissolved zinc, and total nitrogen. Each pollutant concentration is regressed against six available storm parameters (event precipitation, duration, peak intensity, average intensity, days since last storm, and days since last storm > 0.25 inches), under both the traditional MLR method and the proposed LASSO regression.
Although other storm parameters might also be related to water quality, these six storm parameters are the available data and are therefore selected as regressors in this study. The results show that for all three pollutants, the proposed LASSO regression outperforms the traditional MLR, using R squared (R2) as the evaluation metric. Methodology: Our proposed regression framework begins with pre-processing the collected data records, where outlier detection and influential data detection are conducted. Outliers are data values far from the other data in the collected dataset, which have adverse impacts on regression. Figure 1 shows an example of how outliers can negatively influence the fit of the regression line. In our study, outliers are identified and filtered out using a pre-set z-score threshold. An influential data record is another type of 'outlier', one that impacts the slope of the regression line. Influential data records are identified and filtered using a Cook's distance threshold. The processed dataset is obtained after removing both identified outliers and influential data records from the total available data. LASSO regression and the traditional MLR (for comparison) are then implemented on the processed data. LASSO regression is a variant of MLR that encourages simple and sparse models (i.e., models with fewer storm parameters as predictors). It is usually preferred since it can automatically select useful predictors, improving regression performance by avoiding overfitting. The difference between LASSO and MLR lies in how the estimated coefficient for each predictor, β̂, is computed: LASSO adds a penalty term (the l1 norm of the coefficients) to the residual minimization. This penalty term shrinks the predictor coefficient values, so that automatic predictor selection is achieved. The magnitude of this shrinkage is controlled by a hyperparameter α, known as the LASSO parameter.
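The pre-processing steps described above (a z-score screen for outliers, then a Cook's distance screen for influential records) can be sketched in plain NumPy. This is a minimal illustration, not the paper's code: the cutoff of 3 standard deviations and the common 4/n Cook's distance threshold are assumptions, and the toy data are made up.

```python
import numpy as np

def filter_records(X, y, z_cut=3.0):
    """Drop outliers (|z| > z_cut on the response) and influential
    records (Cook's distance > 4/n). Both thresholds are illustrative
    assumptions, not values from the study."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)

    # 1) z-score outlier screen on the response
    z = (y - y.mean()) / y.std(ddof=1)
    keep = np.abs(z) <= z_cut
    X, y = X[keep], y[keep]
    n = len(y)

    # 2) Cook's distance from an ordinary least-squares fit
    A = np.column_stack([np.ones(n), X])      # design matrix with intercept
    H = A @ np.linalg.pinv(A.T @ A) @ A.T     # hat matrix
    h = np.diag(H)                            # leverages
    e = y - H @ y                             # residuals
    k = A.shape[1]                            # parameters incl. intercept
    s2 = e @ e / (n - k)                      # residual variance
    cooks = e**2 / (k * s2) * h / (1 - h) ** 2
    keep = cooks <= 4 / n
    return X[keep], y[keep]

# Demo on toy data with one gross outlier (illustrative only)
X = np.arange(20.0).reshape(-1, 1)
y = 2 * np.arange(20.0) + 0.5 * np.sin(np.arange(20))
y[0] = 1000.0                                 # obvious outlier
Xc, yc = filter_records(X, y)
print(len(y), "->", len(yc))                  # the outlier row is dropped
```

The two screens are applied sequentially here (z-score first, then Cook's distance on the cleaned fit); the abstract does not specify the order, so this is one reasonable reading.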
The α value for each pollutant regression model is determined by fine-tuning. The established regression models are evaluated on the processed dataset through cross-validation, using R2 as the metric. R2 measures how much of the variation observed in the response is explained by the predictors in a regression model. A larger R2 is preferred, and the maximum is 1. Model Results: Three common pollutants are selected to compare the regression results of the proposed LASSO with the traditional MLR approach: E. coli (36 data points), dissolved zinc (36 data points), and total nitrogen (13 data points). Table 1 shows the R2 for both LASSO (red) and traditional MLR (blue). For all three selected pollutants, the proposed LASSO regression achieves a higher R2 than MLR. Total nitrogen shows the largest performance difference between the two methods due to its limited amount of data; a small dataset usually favors a simple model, such as the LASSO model. To show how LASSO provides a simpler model than MLR for each pollutant, Table 2 presents the estimated regression coefficients for each predictor. For all pollutants, LASSO includes only a subset of predictors in the model (those with nonzero coefficients), while MLR uses all the available predictors. This demonstrates how LASSO can 'automatically' select the influential predictors and produce a simpler regression model. Figures 2, 3, and 4 show the regression results vs. actual data for all three pollutants. Consistent with the R2 comparisons, the LASSO regression results are in general slightly closer to the actual data than MLR for all three test pollutants. Conclusions: This abstract presents a novel linear regression framework that has the potential to be widely applied in stormwater management. Compared with the traditional MLR approach, the proposed LASSO regression encourages simpler and sparser models by automatically selecting the influential predictors.
The regression relationship derived from LASSO outperforms the traditional MLR, as demonstrated in all three test cases: E. coli, dissolved zinc, and total nitrogen. Further discussion of the framework will be provided in the full paper, and the model will be further tested as more data become available.
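The overall comparison described in the abstract (LASSO with α tuned by cross-validation versus plain MLR, scored by R2) can be sketched with scikit-learn. The storm monitoring data are not reproduced here, so this uses synthetic data in which only two of six "storm parameters" actually drive the response; the α search grid is LassoCV's default and is an assumption, not the paper's setting.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Six synthetic "storm parameters"; only the first two truly drive the
# response, mimicking a setting where MLR over-uses irrelevant predictors.
n = 36                                   # e.g., the E. coli sample size
X = rng.normal(size=(n, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=n)

mlr = LinearRegression()
lasso = LassoCV(cv=5, random_state=0)    # picks alpha by cross-validation

r2_mlr = cross_val_score(mlr, X, y, cv=5, scoring="r2").mean()
r2_lasso = cross_val_score(lasso, X, y, cv=5, scoring="r2").mean()

lasso.fit(X, y)
print("chosen alpha:", lasso.alpha_)
print("nonzero coefficients:", int((np.abs(lasso.coef_) > 1e-8).sum()))
print("CV R2  MLR:", round(r2_mlr, 3), " LASSO:", round(r2_lasso, 3))
```

The l1 penalty drives the coefficients of uninformative predictors toward exactly zero, which is the "automatic predictor selection" the abstract refers to; MLR keeps all six coefficients nonzero regardless of usefulness.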
This paper was presented at the WEF Stormwater Summit in Minneapolis, Minnesota, June 27-29, 2022.
Speaker: Li, Jinshu
Presentation time: 08:30–09:00
Session time: 08:30–10:00
Session number: 09
Session location: Hyatt Regency Minneapolis
Topic: Data Analytics, Research, Stormwater
Author(s): J. Li1; G. Kohli2; D. Son3; J. Carver4; J. Abelson5
Author affiliation(s): Stantec1; Stantec2; Stantec3; City of Pomona4; Stantec5
Source: Proceedings of the Water Environment Federation
Document type: Conference Paper
Publisher: Water Environment Federation
Print publication date: Jun 2022
DOI: 10.2175/193864718825158452
Content source: Stormwater Summit
Copyright: 2022
An automated open-channel deficiency rating classification model based on machine learning in Los Angeles County
Abstract
Introduction: An open channel is a concrete waterway that is widely used in many regions. Like other concrete structures, open channels suffer from many deficiencies, such as scour, cracking, spalling, and tilting. Inspections need to be performed on a regular basis to identify and rate deficiencies. Ratings represent the severity of a deficiency and are often associated with descriptions and suggested actions. Table 1 shows an example of spall rating classification. Manually classifying deficiency ratings is often difficult, since it requires both expertise and extensive physical inspection. Therefore, automating deficiency rating work has been an industry challenge. Machine learning (ML) is a type of artificial intelligence (AI) that can learn from various data, identify patterns, and make decisions. In this study, we are interested in evaluating the ability of ML to provide accurate rating classifications of channel deficiencies without human intervention. The developed ML model is described in Methodology, followed by preliminary Model Results and a Conclusion. Methodology: Since the goal is to rate defects or deficiencies, a supervised ML model is selected. Specifically, we train a convolutional neural network (CNN) on 80% of the total available data, then validate and test the model on the remaining 20%. For each deficiency, the total available data consist of thousands of pre-labeled deficiency photos. Table 2 lists the total available photos for each test deficiency. Figure 1 provides the proposed modeling flowchart. The convolutional neural network (CNN) is one of the most popular neural network architectures in computer vision. Compared with traditional neural networks, a CNN consists not only of fully connected layers, but also of convolutional layers and pooling layers. This structure allows a CNN to exploit spatial pixel interaction information while reducing model complexity, which makes CNNs especially suitable for image classification.
The convolutional layer is used to parse the input photo. Three brightness intensity values (red, green, and blue, or 'RGB' channels) are assigned to each pixel of a color photo. Figure 2 gives an illustrative example using a tree photo. After parsing a photo into its RGB channels, the convolutional layer employs several 'filters' that contain trainable weights to apply convolution operations on these channels. These filters move over the input layer until all pixels have been covered, and the dimensions of the input layer are reduced after the convolution. A pooling layer typically follows a convolutional layer and further decreases the input layer dimensions. The pooling layer also summarizes regional pixel information, so that the model becomes robust against variations in object position. Figure 3 demonstrates a CNN example used to predict whether a photo shows a tree, a streetlamp, or a stop sign. In this study, we further adopt a 'transfer learning' technique: a pre-trained CNN, ResNet-50, is used and followed by a trainable fully connected layer to yield deficiency rating predictions. Our model is developed with open-source software libraries: the TensorFlow and Keras frameworks. Model Results: Results are briefly discussed in this abstract. Figure 4 shows both the loss and accuracy curves over 70 epochs for cracking as an example, with class weights considered in training. As the training proceeds, in general, the validation loss decreases and the validation accuracy increases before reaching a relatively stable condition, which indicates convergence of the model. Figure 5 plots the confusion matrix of cracking based on the cracking test data. There are three ratings for cracking (R1, R2, and R3). Each row represents the model prediction, while each column is the true label. For example, there are 333 photos with the true label R2 that are also predicted as R2.
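The dimension reduction performed by convolutional and pooling layers can be illustrated with a minimal single-channel pass in plain NumPy. This is a teaching sketch only (the study itself uses TensorFlow/Keras and ResNet-50); the 6x6 input, 3x3 averaging filter, and 2x2 pool are arbitrary choices, and "convolution" here means the cross-correlation used in CNN practice (no kernel flip).

```python
import numpy as np

def conv2d(img, kernel):
    """Valid (no-padding) 2-D convolution with stride 1,
    as used in CNNs (cross-correlation, no kernel flip)."""
    kh, kw = kernel.shape
    oh = img.shape[0] - kh + 1
    ow = img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(img, size=2):
    """Non-overlapping max pooling: keeps the regional maximum,
    which summarizes local pixels and shrinks the layer."""
    oh, ow = img.shape[0] // size, img.shape[1] // size
    return img[:oh * size, :ow * size].reshape(oh, size, ow, size).max(axis=(1, 3))

img = np.arange(36.0).reshape(6, 6)     # one 6x6 "channel"
kernel = np.ones((3, 3)) / 9.0          # a simple averaging filter
feat = conv2d(img, kernel)              # 6x6 -> 4x4
pooled = max_pool(feat)                 # 4x4 -> 2x2
print(feat.shape, pooled.shape)
```

A real convolutional layer learns the filter weights during training and applies many filters per layer across all input channels; the shrinking shapes (6x6 to 4x4 to 2x2) are the point of this sketch.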
Based on the confusion matrix, the model accuracy, precision, and recall are calculated. In this study, we select accuracy and the macro-averaged F1 score as the two evaluation metrics. Accuracy is defined as the number of correct predictions divided by the total number of predictions, and it assumes that all classes are equally important. The macro-averaged F1 score is a harmonic mean (tradeoff) between precision and recall, which is useful when false negatives and false positives both matter. For some deficiencies, rating classes can be imbalanced, where one rating class has only a few photos while another has thousands. In this case, the F1 score is preferred over accuracy since it accounts for the model's ability to predict the minority class. One way to increase the F1 score is to incorporate class weights into the loss function, where the class weight is the inverse data ratio among the different rating classes. Table 3 shows the class weight information for each deficiency. Based on the class weights, Table 4 shows the test results for all four deficiencies, where 60%-70% accuracy can generally be achieved. Accuracy is generally lowered by about 2%, in exchange for a 0.02 increase in F1 score, when moving from the no-class-weight model to the class-weight model. Due to the limited data (Table 2), tilting has the lowest accuracy. Also, because of its relatively balanced classes (Table 3), the F1 score for tilting does not increase when transitioning from the no-class-weight to the class-weight model. Conclusion: This abstract presents the method used to develop a machine learning model based on a CNN, with the objective of assigning open-channel deficiency ratings. Four deficiencies were selected (cracking, spalling, tilting, and vegetation), and the preliminary results indicate a general 60%-70% accuracy and a 0.4-0.5 F1 score. In general, the achieved accuracy is positively related to the amount of training data.
Adding class weights to the model can increase the F1 score if the deficiency rating classes are imbalanced. Further discussion will be provided in the full paper.
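The evaluation quantities discussed above (accuracy, macro-averaged F1 from per-class precision and recall, and inverse-data-ratio class weights) can be computed directly from a confusion matrix. The matrix and photo counts below are made up for illustration; only the 333 correctly predicted R2 photos echo the abstract's example, and the row-prediction/column-truth convention matches the one stated there.

```python
import numpy as np

def metrics(cm):
    """cm[i, j] = count predicted as class i with true class j
    (rows = predictions, columns = true labels).
    Assumes every class has at least one prediction and one true label."""
    accuracy = np.trace(cm) / cm.sum()
    precision = np.diag(cm) / cm.sum(axis=1)   # per predicted row
    recall = np.diag(cm) / cm.sum(axis=0)      # per true column
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1.mean()                 # macro-average: plain mean over classes

def class_weights(counts):
    """Inverse data-ratio weights, scaled so the largest class gets 1."""
    counts = np.asarray(counts, float)
    return counts.max() / counts

# Hypothetical 3-rating confusion matrix (R1, R2, R3) for "cracking"
cm = np.array([[50,  10,  2],
               [ 8, 333, 20],
               [ 2,   7, 40]])
acc, macro_f1 = metrics(cm)
print(round(acc, 3), round(macro_f1, 3))   # accuracy ~0.896, macro-F1 ~0.826

w = class_weights([60, 350, 62])           # made-up photo counts per rating
print(w)                                    # minority ratings get larger weights
```

Because the macro average weights each class equally, a collapse on the smallest rating class drags macro-F1 down even when overall accuracy stays high, which is why the abstract prefers it for imbalanced deficiencies.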
This paper was presented at the WEF Stormwater Summit in Minneapolis, Minnesota, June 27-29, 2022.
Speaker: Li, Jinshu
Presentation time: 10:45–11:15
Session time: 08:30–12:15
Session number: 11
Session location: Hyatt Regency Minneapolis
Topic: Information Technology, Machine Learning, Open Channel
Author(s): J. Li1; D. Son2; G. Kohli3; J. Abelson4; A. Haikal5
Author affiliation(s): Stantec1; Stantec2; Stantec3; Stantec4; Los Angeles County Public Works5
Source: Proceedings of the Water Environment Federation
Document type: Conference Paper
Publisher: Water Environment Federation
Print publication date: Jun 2022
DOI: 10.2175/193864718825158463
Content source: Stormwater Summit
Copyright: 2022

Copyright © 2025 by the Water Environment Federation