Asset management analytics for urban water mains: a literature review


This study reviews the state-of-the-art literature on water pipe failure prediction, assessment of water loss risk, optimal pipe maintenance plans, and maintenance coordination strategies. In addition, it provides a categorization of water main (WM) failures as well as a taxonomy of WM maintenance strategies. In particular, predictive and prescriptive analytics are highlighted, and their contributions and drawbacks are investigated from methodological and application perspectives. The review covers failure analytics recently developed in the water mains domain, considering failure prediction and the identification of optimal maintenance strategies conjointly. Future research directions and challenges are elaborated to advance the understanding of the mechanisms leading to failures. The existing gaps between theory and practice in managing assets across water distribution networks while ensuring cost-effectiveness and reliability are discussed. Because knowledge about the state of the water mains and related areas is crucial, this review provides a state-of-the-art update from recent studies and, accordingly, presents and discusses avenues for future research.


A water distribution network (WDN) carries freshwater from one or more sources to municipalities for essential human consumption, economic development, and social activities. The WDN is a complex system. It consists of main and booster pumps, water mains typically buried underground, branching pipes, elevated water towers, and interconnected sub-networks for individual neighborhoods. The system or part of it will fail when one or more of its key components, in particular the water mains (WMs), break. There have been numerous cases of water main failures globally, with severe consequences. Examples of consequences include high replacement costs, revenue losses, water damages and contamination, traffic disruptions, and consumer service interruptions (Fares and Zayed 2010; Besner et al. 2011; Malm et al. 2015; Kakoudakis et al. 2017; Liang et al. 2018; Vishwakarma and Sinha 2020; Dawood et al. 2020a).

Proactive interventions for reducing failure risks are necessary and cost-effective, particularly in the context of aging WMs. Take major urban centers in Canada as an example. Statistics Canada (Trudeau 2020) reported fair to very poor conditions for a significant portion of the WDNs. One simple reason is that these WMs have reached or are reaching the end of their expected service life. Another reason is that water mains are buried underground, which makes them complicated and costly to maintain and replace. As a result, there has been an increasing rate of failures over time (Asnaashari et al. 2013; Sattar et al. 2016; Folkman 2018; Snider and McBean 2018, 2021). Over the past decade, studies of WM problems have made significant progress, taking advantage of constantly advancing data-driven techniques. The studies aimed to detect water main breaks, analyze failure risks, and optimize maintenance.

This study aims to scrutinize the recent literature of water main failure prediction models, failure consequences, failure risk, optimal WM maintenance strategies, and optimal coordinated maintenance strategies. An awareness of the state of the water mains and related areas is crucial. The existing review articles have mostly focused on water main prediction models (St. Clair and Sinha 2012; Dawood et al. 2020b), the effect of availability and quantity of the database on WM failure models (Snider and McBean 2020a; Chen et al. 2022), the effect of the limited, uncertain dataset on WM failure models (Jenkins et al. 2014), and the effect of combined datasets from different utilities on the performance of machine learning models for predicting future breaks (Chen et al. 2022). Other reviews have discussed different approaches to optimizing rehabilitation and maintenance strategies for WMs and integrated infrastructures (Abusamra 2018; Ghobadi et al. 2021; Ramos-Salgado et al. 2022; Shahata et al. 2022; Barton et al. 2022). This review aims to address issues of the failure prediction models developed recently and optimal maintenance strategies conjointly. This review also addresses future research directions in predictive and prescriptive analytics considering cold-region climatic variables.

It is important to consider water main maintenance in coordination with other infrastructures. It is also important to pay close attention to indirect costs and consequences, but the challenge lies in quantifying the indirect consequences in a monetary value (Muhlbauer 2004); to the best of the authors’ knowledge, the literature in this domain is scarce (Atef 2010; Yerri et al. 2017). Water main failure consequences are categorized into direct and indirect costs. Loss of production, repair and return to service and pipeline replacement are direct costs. Travel delay, supply outage and substitution, health risk, property damage, customer dissatisfaction and environmental damages are examples of indirect costs (Fares and Zayed 2010; Besner et al. 2011; Malm et al. 2015; Kakoudakis et al. 2017; Vishwakarma and Sinha 2020; Dawood et al. 2020a). Considering indirect costs would make a major difference to WM maintenance plans (Yerri et al. 2017).

Issues exist in either the models or water main datasets themselves or both. They need to be discussed in detail. Thus, the purpose of this review is to provide an update of the knowledge from the recent studies and, more importantly, to explore ways to address the issues in future studies. This would help generate new ideas to improve failure prediction and risk analysis, and thus to reduce the costs in WM asset management planning, rehabilitation and renewal.

In the forthcoming review, the selection of literature is guided by the quest for answers to key questions pertinent to WMs. Some examples of such questions are given below:

  • Different methods and techniques have been proposed for locating and managing leaks in WMs (Misiunas 2005; Hamilton and Charalambous 2013; Zyoud and Fuchs-Hanusch 2019, 2020; Karimian et al. 2021). What are the requirements of these approaches? What are the pros and cons?

  • Failure models of various types have been used to analyze water main datasets (Economou et al. 2012; American Water Works Association 2019; Snider and McBean 2020a; Snider 2021; Barton et al. 2022). To what extent have the models met the expectation to predict the probability of future failures, time to next failure, and failure rate of pipe, or to predict whether or not a break will happen?

  • Failures reportedly could result from a large variety of factors: physical factors (e.g., pipe age, diameter, material, length, and wall thickness), environmental factors (e.g., soil type, climate, freeze/thaw properties, pipe bedding, trench backfill, traffic, and groundwater), and operational factors (e.g., number of previous failures, water quality, internal water pressure, transient pressure, and leakage) (Stamou et al. 2000; Wang et al. 2009; Arsénio et al. 2015; Lin and Yuan 2019; Karimian et al. 2021). What are the main issues in data acquisition and quality? Are there factors with dominant influence on failures? Will these dominant factors change over time?

The availability of sensor and cloud systems has led to the production of vast amounts of digital data from WMs. It is crucial to take advantage of the available data to support short-term and long-term asset management plans. The use of analytics has shown a rising trend. This review provides a timely update of the existing models for predicting failures and for management planning. Critical pipes are identified using the predictive models and are then included in maintenance plans, which are optimized to save time, costs and resources.

Predictive analytics

Predictive models of water main failures and pipeline deteriorations (Kleiner and Rajani 2001; Rajani and Kleiner 2001; El-Abbasy et al. 2019; Robles-Velasco et al. 2020; Dawood et al. 2020a) may be classified into two main types: a physical law-based model and a data-driven model (Rajani and Kleiner 2001; Snider and McBean 2020a). The first type of model requires significant amounts of input data to analyze physical behaviors leading to a failure. The analysis involves comparing the resistance capacity of a pipeline to expected loads. The data includes an extensive list of parameters and needs to be collected from the field. Therefore, it is costly and time consuming to use physical law-based models (Rajani and Kleiner 2001). The implementation of the models should be limited to critical pipelines (Wilson et al. 2017). The second type of model uses historical data to discern patterns between historical values of some relevant parameters and breakage rates of pipelines. This type of model is much less expensive to use, compared to a physical law-based model. Thus, it is suitable to implement a data-driven model to all pipelines, as long as historical data exists (El-Abbasy et al. 2019; Snider and McBean 2020a).

The data-driven models may be subdivided into a deterministic model, a probabilistic model, and an artificial intelligence model:

  • The deterministic model relies on regression techniques to predict time to next break of pipe or break rate and often assumes uniform breaks in water main groups. This assumption rules out uncertainties within a dataset.

  • The probabilistic model (e.g., a survival analysis model) uses historical data to predict the probability of water main failure. The model deals with inherent randomness that is expected to be within a dataset of pipe breaks.

  • The artificial intelligence model adopts a learning approach to recognizing complicated relationships between input and output data, without calculating the covariate relationships like the deterministic and probabilistic models. Using the artificial intelligence model has the potential to significantly reduce the number of field inspections needed, provide timely warning of break risks and thus avoid a large number of breaks as well as their consequences (Fu et al. 2013; Marzouk and Osama 2017; Kakoudakis et al. 2017; Snider and McBean 2018; Ghobadi et al. 2021).

The classification of data-driven models and their sub-categories is shown in Figure 1.

Fig. 1 Classification of water main predictive models

WDNs are a complicated system consisting of interconnected pipes and hydraulic control elements in order to transport potable water to urban populations (Ostfeld 2015). The water infrastructures are aging and deteriorating drastically throughout the major urban centers, which leads to WM failures. They are major problems for municipalities due to high costs for replacement/repair and consequences such as the disruption of services, health issues resulting from contaminated water, and revenue losses (Snider and McBean 2020a). Models should be used to predict breaks ahead of their occurrences and to plan rehabilitation and replacement. This would promote sustainable infrastructures and save costs.

This review focuses on the data-driven models, considering practicality, data requirements, and reliability. It provides a comprehensive overview of the past 15 years of literature to investigate the relationship and synergy between predictive analytics and prescriptive analytics for water mains. A structured survey of the literature was performed using such keywords as “water main deterioration”, “deterioration models”, “prediction models”, “probabilistic prediction models”, “asset management”, “water infrastructure”, “failure consequence”, “water main risk analysis”, “integrated municipal infrastructure” and “infrastructure optimization”. In total, over 60 articles were reviewed in their entirety.

The strengths and limitations of identified publications were analyzed. Several researchers have used regression, probabilistic and machine learning models for water main failure prediction, as explained above in detail. The application of these models highly depends on the availability of the dataset and desired output. The output of water main failure predictive models can be break rate, number of breaks, probability of future breaks, and time to next break. One of the great advantages of putting these studies together is that it will help municipalities to select and use the most appropriate model, depending on the availability of dataset. Therefore, by using these models, the key question of when and where the break will happen would be answered.

Wang et al. (2009) developed deterioration models, using data of water main breaks to predict annual break rates. These were multiple regression models involving pipe diameter, length, age and material, and identifying the length as having the greatest impact. This is possibly because their model output is breaks per kilometer length per year. They claimed that the models helped analyze break trends. Bruaset and Sægrov (2018) developed a linear regression model that correlated the failure rate of water main to frost heave of the ground due to the air temperature in a cold region. They found the failure rate increasing during the winter months and gray cast iron pipes (usually laid in trenches) being more vulnerable to fail. This implies that the failure rate would decrease under climate warming.

Xu and Sinha (2020) discussed some challenges and gaps in the use of survival analysis models. The use may give failure rate, number of failures, and time to next break (which can be interpreted as either the useful life span of a pipe or estimated remaining useful life). One challenge is the treatment of left truncation. In the literature, left truncation has not been addressed in most survival analyses of water pipeline failure. This would cause a bias in the results. The issue of left truncation needs attention and solutions.

An analysis of pipeline networks is costly and time consuming due to the complexity and large scale of the networks. Therefore, a failure analysis of pipelines is crucial for the efficient management of networks. There is a trend of increasing use of machine learning algorithms to predict the failure rate. Zakikhani et al. (2021) provided a review of failure prediction models, including machine learning models for oil and gas pipelines. Malek Mohammadi et al. (2021) used K-Nearest Neighbor (KNN) to predict the condition of sewer pipes. Also, machine learning algorithms have been recently used in prediction models of infrastructure failure. For example, Marcelino et al. (2021) used general machine learning to predict pavement performance.

Karimian et al. (2021) used an Evolutionary Polynomial Regression model to predict pipeline breaks. They clustered pipelines based on pipe age, diameter, length and material, and showed that pipelines of smaller diameter were more prone to failure. The occurrence of breaks was most sensitive to pipe diameter. For predicting the time to next break of ductile iron pipes, Snider and McBean (2018) made a comparison among a gradient-boosting algorithm model, an Artificial Neural Network (ANN) model and a Random Forest algorithm, suggesting the first one outperformed the other two. This is because gradient-boosting is an ensemble algorithm or a combination of multiple learning algorithms (usually decision trees) that form a stronger predictive model with better performance.

Al-Ali et al. (2019) reported a Logistic Regression (LR) model, aiming to find the most proper parameters for predicting the probability of water main failure, and leading to prioritizing pipes and planning an annual renewal. Dawood et al. (2020a) suggested considering soil type, traffic loads, trenchless method of construction, contractor experience and other influential factors, for improved results of pipe deterioration model. They recommended fuzzy-based assessments to reduce the risks of failure incidents.

In the study of pipeline failures in Colombia’s WDN, Giraldo-González and Rodríguez (2020) assessed three regression models and four machine learning models. The regression models were Linear Regression, Poisson Regression (PR) and Evolutionary Polynomial Regression, and the machine learning models were ANN, Bayes, Support Vector Machine and Gradient-Boosted Tree (GBT). The machine learning models used physical factors (age, diameter and length), environmental factors (moisture content, soil contraction, expansion potential, precipitation and land use), and operational factors (valve, hydrant, and previous failure) as predictors. The study used confusion matrices, accuracy and Receiver Operating Characteristic (ROC) curves as evaluation criteria. The study concluded that PR outperformed the other regression models and GBT outperformed the other machine learning models.

Rahbaralam et al. (2020) employed two machine learning algorithms (LR and extreme gradient boosting) and one survival analysis model (Cox proportional hazard model) to predict Barcelona’s water main failures. The algorithms were fed with data that had undergone feature selection, feature engineering and resampling for class balancing. The algorithms were evaluated using accuracy, F1 score, recall, precision, Area Under the ROC Curve (AUC) and Matthews Correlation Coefficient (MCC). The extreme gradient boosting technique performed best.

Water main breaks interrupt services and cause revenue losses (Snider and McBean 2020a). Models that predict future breaks help sustain WDNs and reduce costs. In Snider and McBean (2020b), a gradient-boosting decision tree machine learning model (xgboost) was compared with Weibull proportional hazard survival analysis, in terms of the effect of censored events on the predicted time to next break of cast iron pipes. The xgboost model combines multiple decision trees, which strengthens the performance.

Snider and McBean (2020b) reported that the xgboost model underpredicted time to next break because of the inability to include censored events. Removing censored events from a training dataset is not desirable for long-term planning of asset management. For this reason, they concluded that the model was adequate only for short-term planning of asset management. The Weibull proportional hazard survival analysis could learn from longer censored events in a training dataset; it frequently over-predicts break times (i.e., longer time to break) and therefore is appropriate for use for long-term planning. The analysis can give insights about pipe conditions by using historical data of pipe breaks. Note that unlike inspection data, historical data can easily be found in many water utilities (Xu and Sinha 2020, 2021).
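For reference, a Weibull proportional hazard model is commonly written in the form below. This is the generic textbook form and not necessarily the exact parameterization used by Snider and McBean (2020b); the symbols $k$ (shape), $\lambda$ (scale), $\boldsymbol{\beta}$ (regression coefficients) and $\mathbf{x}$ (pipe covariates such as diameter or soil type) are notation introduced here for illustration.

```latex
h(t \mid \mathbf{x}) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1}
                       \exp\!\left(\boldsymbol{\beta}^{\top}\mathbf{x}\right),
\qquad
S(t \mid \mathbf{x}) = \exp\!\left[-\left(\frac{t}{\lambda}\right)^{k}
                       \exp\!\left(\boldsymbol{\beta}^{\top}\mathbf{x}\right)\right]
```

In this form, a pipe that is still in service at age $c$ without a recorded break contributes $S(c \mid \mathbf{x})$ to the likelihood, which is how censored pipes can inform the fit instead of being discarded.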

Aslani et al. (2021) used machine learning models to predict water pipeline breaks, with input of spatiotemporal data. Vulnerable locations were identified by conducting a spatial clustering. They converted the results of the clustering analysis to an independent feature called hotspot level for subsequent use in the modeling process. They suggested that the results were useful for municipalities to locate hotspots and mitigate the vulnerability by pipe component renovations.

Robles-Velasco et al. (2020) used LR and Support Vector Classification (SVC) to predict whether a pipe will break or not. LR performed slightly better than SVC. The model output was between 0 and 1. This can be interpreted as the probability of failure, which is highly desirable nowadays. The probability of failure could be used by municipalities to optimally manage their annual rehabilitation plans. Many studies apply machine learning models to pipes that have had breaks (Harvey et al. 2013; Shirzad et al. 2014; Sattar et al. 2016; Kutyłowska 2017; Kerwin and Adey 2018). Robles-Velasco et al. (2020) considered all pipes rather than just those which had experienced breaks. They used three homogenized models with respect to the types of material and then a global model. A correlation analysis identified the covariance between standardized variables. They reported that replacing only 3% of pipelines could prevent around 30% of failures.
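To illustrate how a predicted probability of failure can support renewal planning, the sketch below ranks pipes by probability and computes the share of subsequent breaks that falls on the top-ranked fraction. The function name, the probabilities and the break counts are all hypothetical; this is not the procedure or the data of Robles-Velasco et al. (2020).

```python
import math

def share_of_breaks_covered(prob, future_breaks, fraction):
    """Share of all future breaks that occur on the top `fraction` of pipes,
    where pipes are ranked by predicted failure probability (highest first)."""
    n_replace = math.ceil(fraction * len(prob))
    ranked = sorted(range(len(prob)), key=lambda i: prob[i], reverse=True)
    chosen = ranked[:n_replace]
    return sum(future_breaks[i] for i in chosen) / sum(future_breaks)

# Hypothetical model output (probability of failure) for ten pipes
prob = [0.02, 0.90, 0.10, 0.05, 0.70, 0.03, 0.01, 0.20, 0.04, 0.06]
# Hypothetical number of breaks later observed on the same pipes
future_breaks = [0, 3, 1, 0, 2, 0, 0, 1, 0, 0]

covered = share_of_breaks_covered(prob, future_breaks, 0.2)
print(f"Replacing the top 20% of pipes covers {covered:.0%} of breaks")
```

A curve of this share against the replaced fraction is one simple way for a utility to compare models on the quantity that matters for an annual rehabilitation budget.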

Chen et al. (2022) investigated the effect of combined datasets from different utilities on the performance of machine learning models for predicting future breaks. They combined datasets belonging to six utilities in three ways: using the dataset of only one utility, using a stratified sampling of all utilities and using a combined data of all utilities. The results showed that having a large quantity of data does not result in a better prediction model, but instead a sufficient amount of high-quality data such as historical breaks gives a better prediction model.

The examination of the above studies shows that where only a limited amount of input data is available and where the purpose is to interpret break trends, regression models could be the best choice. Although survival analysis models are more suitable for long-term management plans, they over-predict break time and cannot handle the complexity that exists in water main datasets. On the other hand, machine learning models are more appropriate for water mains with a good amount of data, as the models can treat complex relationships between input and output variables. However, these models are suitable only for short-term management planning. Also, it seems that physical parameters, which are more accessible in water main datasets and widely used throughout the literature, have more impact on the output of the models. However, the effect of other parameters has yet to be discovered. In the following subsections, the problems existing in the models, in the water main datasets, or in both are explained in detail.

Data preparation for modelling

Most machine learning algorithms require data preparation: standardization, encoding, and feature transformation. Standardization rescales all factors to a common scale. Some machine learning algorithms do not need standardization; however, it improves model convergence and can also improve model accuracy (Buntine et al. 2009; Shen et al. 2016). Consider a support vector classifier (SVC). This algorithm works by maximizing the distance between the separating plane (hyperplane) and the support vectors (the data points closest to the hyperplane). When the algorithm calculates the distances without standardization, features with larger values will dominate features with smaller values. Therefore, standardization is required to reduce the dominance of some features over others and to improve model convergence (Lokman et al. 2019). Consequently, depending on the type of machine learning model selected, standardization might help improve accuracy.
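The dominance effect described above can be seen in a minimal sketch using z-score standardization and Euclidean distance. The pipe attributes and values are hypothetical and chosen only to make the effect visible; the helper names are introduced here for illustration.

```python
import math
import statistics

def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical pipes: length in metres and diameter in metres
lengths = [120.0, 450.0, 130.0, 900.0]
diameters = [0.15, 0.15, 0.60, 0.30]

raw = list(zip(lengths, diameters))
scaled = list(zip(standardize(lengths), standardize(diameters)))

# Pipes 0 and 1 differ only in length; pipes 0 and 2 differ mainly in diameter.
raw_d01, raw_d02 = euclidean(raw[0], raw[1]), euclidean(raw[0], raw[2])
scaled_d01, scaled_d02 = euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2])

print(f"raw:    d(0,1)={raw_d01:.2f}  d(0,2)={raw_d02:.2f}")
print(f"scaled: d(0,1)={scaled_d01:.2f}  d(0,2)={scaled_d02:.2f}")
```

On the raw values the length column decides which pipes look similar, simply because it is expressed in larger numbers; after standardization the large diameter difference between pipes 0 and 2 is no longer hidden.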

Encoding categorical attributes yields numerical values for use in SVC and LR. Two widely used coding systems, one-hot encoding and dummy coding, convert categorical data into binary values (Cohen et al. 2014; Rahbaralam et al. 2020; Aslani et al. 2021). Integer encoding assigns an integer to each category of an attribute based on its failure rate per unit length (Robles-Velasco et al. 2020). The first two coding systems have the limitation that the number of predictors increases significantly when a categorical attribute has a large number of categories. Therefore, a suitable coding system should be selected depending on the size of the dataset.
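The three coding systems can be compared on a small, hypothetical material attribute. In the sketch below, the integer code is assumed to be the rank of each category by breaks per kilometre; this is one plausible reading of failure-rate-based integer encoding and not necessarily the exact scheme of Robles-Velasco et al. (2020).

```python
def one_hot(values):
    """One binary column per category (k columns for k categories)."""
    cats = sorted(set(values))
    return [[1 if v == c else 0 for c in cats] for v in values], cats

def dummy(values):
    """k-1 binary columns; the first category in sorted order is the reference level."""
    cats = sorted(set(values))[1:]
    return [[1 if v == c else 0 for c in cats] for v in values], cats

def integer_by_failure_rate(categories, breaks, lengths_km):
    """Assumed scheme: rank categories by breaks per km and use the rank as the code."""
    totals = {}
    for cat, n_breaks, km in zip(categories, breaks, lengths_km):
        b, l = totals.get(cat, (0, 0.0))
        totals[cat] = (b + n_breaks, l + km)
    rate = {cat: b / l for cat, (b, l) in totals.items()}
    code = {cat: i for i, cat in enumerate(sorted(rate, key=rate.get))}
    return [code[c] for c in categories], code

# Hypothetical pipe records: material, recorded breaks, length in km
materials = ["CI", "DI", "PVC", "CI", "PVC"]
breaks = [4, 1, 0, 2, 1]
lengths_km = [1.0, 2.0, 3.0, 0.5, 1.0]

oh_rows, oh_cols = one_hot(materials)
dm_rows, dm_cols = dummy(materials)
int_codes, mapping = integer_by_failure_rate(materials, breaks, lengths_km)

print(oh_cols, oh_rows[0])   # three columns for three materials
print(dm_cols, dm_rows[0])   # two columns, "CI" is the reference level
print(mapping, int_codes)    # a single integer column
```

The binary codings add a column per category, while the integer coding keeps a single column at the cost of imposing an order on the categories.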

In many WMs prediction models, some attributes are difficult to model because of their disparity. Consider pipe length for instance. Disparate lengths of pipes exist in a dataset. Therefore, despite the fact that length is an important predictor, it is problematic. Some authors re-cut the length by street (Winkler et al. 2018), some used feature transformation and logarithms of length rather than the actual length and improved the accuracy noticeably (Robles-Velasco et al. 2020), and others used mean values for all variables related to length (Berardi et al. 2008). Therefore, the length of water mains needs attention and preparation before being fed into the model for better accuracy.
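The effect of a logarithmic transformation on disparate pipe lengths can be sketched as follows. The lengths are hypothetical, and the asymmetry index (mean minus median, divided by the standard deviation) is a simple diagnostic chosen here for illustration, not a measure taken from the cited studies.

```python
import math
import statistics

def skew_index(values):
    """Simple asymmetry index: (mean - median) / population standard deviation."""
    return (statistics.fmean(values) - statistics.median(values)) / statistics.pstdev(values)

# Hypothetical pipe segment lengths in metres, spanning three orders of magnitude
lengths_m = [3.0, 12.0, 45.0, 150.0, 2400.0]
log_lengths = [math.log(v) for v in lengths_m]

raw_skew = skew_index(lengths_m)
log_skew = skew_index(log_lengths)
print(f"asymmetry of raw lengths: {raw_skew:.2f}")
print(f"asymmetry of log lengths: {log_skew:.2f}")
```

After the transformation, the single very long segment no longer pulls the mean far away from the median, so it has less leverage on a model fitted to the data.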

In some machine learning algorithms, tuning hyperparameters is an important issue that is difficult to address properly (Liu and Zio 2019; Fujiwara et al. 2020). This is because only a few hyperparameters need to be calibrated, which is not enough to capture all the variations in the model. When there are extensive variations in a model but insufficient parameters to capture the variations, overfitting may occur (Ahmadi et al. 2015). Thus, overfitting should be checked for routinely and avoided.

Missing data

In WM datasets, missing data is a common issue (Osman et al. 2018), and handling it during preprocessing is crucial. Missing data leads to the loss of valuable information and to data insufficiency (Wu and Liu 2017; Winkler et al. 2018). Simply removing records with missing values from a dataset can harm data-driven models through unreliable parameter estimates, loss of valuable information, bias, and poor model performance (Tang et al. 2019). Therefore, it is necessary to keep as much information as possible (Barton et al. 2022).

Alternatively, there are several imputation techniques to handle this issue, e.g., traditional methods such as simple ways of substituting missing data with mean, median and constant values, or more advanced methods such as imputation using machine learning algorithms (for example, substituting missing data with the mean values from KNN in the training dataset) (Levinas et al. 2021; Xu and Sinha 2021). Advanced imputation methods are often better than simple imputation methods (Osman and Bainbridge 2011; Kabir et al. 2019). It is concluded that prior to developing a prediction model, one must have clean data and ensure minimal missing data.
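The difference between a simple and a neighbor-based imputation can be sketched on a small, hypothetical table. The column layout, the values and the choice of k = 2 are assumptions made for illustration; in practice the features would usually be standardized before distances are computed.

```python
import math
import statistics

def mean_impute(rows, col):
    """Replace None in column `col` with the mean of the observed values."""
    fill = statistics.fmean(r[col] for r in rows if r[col] is not None)
    return [r[:col] + [fill if r[col] is None else r[col]] + r[col + 1:] for r in rows]

def knn_impute(rows, col, k=2):
    """Replace None in column `col` with the mean of that column over the k nearest
    complete rows; distances are computed on the other columns only."""
    complete = [r for r in rows if r[col] is not None]

    def dist(a, b):
        return math.sqrt(sum((a[j] - b[j]) ** 2 for j in range(len(a)) if j != col))

    out = []
    for r in rows:
        if r[col] is None:
            nearest = sorted(complete, key=lambda c: dist(r, c))[:k]
            fill = statistics.fmean(c[col] for c in nearest)
            out.append(r[:col] + [fill] + r[col + 1:])
        else:
            out.append(list(r))
    return out

# Hypothetical records: [age in years, diameter in mm, wall thickness in mm]
rows = [
    [60, 150, 10.0],
    [62, 150, 10.4],
    [15, 300, 6.0],
    [12, 300, 6.2],
    [61, 150, None],   # wall thickness missing
]

print(mean_impute(rows, 2)[4])   # filled with the overall mean
print(knn_impute(rows, 2)[4])    # filled from the two most similar pipes
```

The overall mean mixes old and new pipes, whereas the neighbor-based value is taken from pipes of similar age and diameter, which illustrates why the advanced methods are often preferred.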

Imbalanced dataset

Imbalanced data, censoring, and left truncation are three important issues associated with predictions of water main failures (Scheidegger et al. 2015; Xu and Sinha 2020). In water supply networks, the majority of pipelines have never suffered a failure. If the majority of pipelines in a dataset have not experienced a break (one class) and a minority have experienced at least one break (another class), the dataset is considered imbalanced, also known as unbalanced (Robles-Velasco et al. 2020), and censored (Li et al. 2016; Snider and McBean 2020a). Figure 2 depicts imbalanced data from the City of Kitchener water main break dataset.

Fig. 2 Frequency of number of breaks in a WDN (imbalanced dataset) (Data source: and

Dealing with imbalanced datasets is a challenging topic in data mining, receiving extensive research attention (Zhang and Wang 2013; Ribeiro and Reynoso-Meza 2020). Resampling may be implemented to an imbalanced dataset through random under-sampling, random over-sampling, and Synthetic Minority Over-sampling Technique (SMOTE) in classification models (He and Garcia 2009). Random under-sampling is a well-known method that removes examples of the majority class. Although this method decreases the computational time, it is at the expense of losing some valuable information (Japkowicz 2000; Seiffert et al. 2009).

Random over-sampling, on the other hand, randomly replicates the existing minority examples to make the dataset balanced. This technique also has its own limitations such as increasing the size of the dataset and causing the model to be overfitted. Thus, it is not applicable in the case of having an extensive dataset (García-Pedrajas et al. 2012). SMOTE randomly generates synthetic minority examples based on nearest neighbors and therefore it is a better way for balancing the dataset; it improves model performance (Fujiwara et al. 2020; Rahbaralam et al. 2020). Nevertheless, depending on the nature of dataset, one of the techniques might work better than the others.
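The three resampling strategies can be sketched in a few lines. This is a simplified illustration with hypothetical two-feature points, not a reference implementation of SMOTE; the function names and parameters are introduced here.

```python
import random

def random_undersample(majority, minority, rng):
    """Keep a random subset of the majority class, equal in size to the minority class."""
    return rng.sample(majority, len(minority)), list(minority)

def random_oversample(majority, minority, rng):
    """Replicate random minority examples until both classes have equal size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

def smote_sketch(minority, n_new, k, rng):
    """Create n_new synthetic points, each on the line segment between a minority
    point and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

rng = random.Random(0)
# Hypothetical feature vectors: 20 pipes without breaks, 4 pipes with breaks
majority = [(float(i), 0.0) for i in range(20)]
minority = [(1.0, 5.0), (2.0, 6.0), (3.0, 5.5), (2.5, 7.0)]

under_major, under_minor = random_undersample(majority, minority, rng)
over_major, over_minor = random_oversample(majority, minority, rng)
synthetic = smote_sketch(minority, n_new=16, k=2, rng=rng)

print(len(under_major), len(under_minor))        # 4 and 4: majority information is discarded
print(len(over_major), len(over_minor))          # 20 and 20: minority points are duplicated
print(len(majority), len(minority) + len(synthetic))  # 20 and 20: new interpolated points
```

Under-sampling shrinks the dataset, over-sampling repeats identical records, and the interpolation approach adds new points that lie between existing minority examples.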

An imbalanced dataset is also an issue in other fields, e.g., medical diagnostic and credit card fraud detection problems (Verhein and Chawla 2007). In such cases, the classification problem becomes very difficult since the main goal in imbalanced datasets is to predict the minority class (Huang et al. 2006). The models in question cannot be properly trained in the training phase and thus cannot correctly predict the minority class (Liu and Zio 2019). A naïve model could predict all data as the majority class and still achieve a very high accuracy (e.g., 99% when only 1% of the pipes belong to the minority class). However, such models are useless in many cases. Accuracy is a common metric for evaluating the goodness of a model, but accuracy alone is not a suitable evaluation measure for imbalanced data and might cause misinterpretation. Therefore, other metrics (e.g., F-measure) are needed (Huang et al. 2006; Harvey and McBean 2014).

The confusion matrix is a good basis for evaluation in the case of an imbalanced dataset. Accuracy and Recall are two metrics derived from the matrix. Accuracy gives the percentage of correctly predicted pipes, while Recall measures the share of actual failures that the model correctly predicts. However, higher Recall usually comes at the expense of more false alarms (non-failing pipes misclassified as failures). AUC is another metric that shows the capability of the model to separate the two classes; it is computed as the area under the ROC curve.
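The following sketch computes the metrics mentioned in this section from a confusion matrix and shows why accuracy alone can mislead on imbalanced data. The pipe counts, predictions and scores are hypothetical, and AUC is computed with the standard pairwise-ranking formulation rather than by tracing the ROC curve.

```python
import math

def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 = break and 0 = no break."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "f1": f1, "mcc": mcc}

def auc(y_true, scores):
    """Probability that a random positive scores higher than a random negative
    (ties count one half); this equals the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical network: 1000 pipes, of which 10 break
y_true = [1] * 10 + [0] * 990
naive = [0] * 1000                                   # always predicts "no break"
model = [1] * 8 + [0] * 2 + [1] * 30 + [0] * 960     # finds 8 breaks, 30 false alarms

naive_m = metrics(y_true, naive)
model_m = metrics(y_true, model)
print("naive:", naive_m)
print("model:", model_m)
print("AUC example:", auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6]))
```

The naïve predictor has the higher accuracy of the two but a Recall, F1 score and MCC of zero, while the second model gives up some accuracy to detect most of the breaks.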

Censored events

Censoring happens when no pipe breaks are observed within the limited observation period, and this is the case in most water utility datasets. Figure 3 illustrates an example of data censorship in a water main break dataset. There are a large number of pipes in service that have never experienced a break. Censored events can be handled by a traditional survival analysis (e.g., Cox proportional hazard models). In contrast, many machine learning models are not capable of handling censored events. Although machine learning models are more capable of interpreting the complex relationships that exist in a water main dataset, censoring remains a concern when they are used.

Fig. 3 Censored data (modified from Snider and McBean (2020a))

Censoring is present in almost all WM datasets. Although survival models (e.g., Weibull proportional hazard survival analysis) can cope with censored data (Wang et al. 2019; Almheiri et al. 2021) and are good for long-term management planning (Snider and McBean 2020b), they are not suitable for modeling complex relationships between variables. Machine learning algorithms, on the other hand, model complex relationships between variables very efficiently, but they are good only for short-term management planning (Snider and McBean 2020b). For example, xgboost has been found to surpass other single machine learning models such as Random Forest and ANN (Zhang et al. 2017; Snider and McBean 2018). The problem with xgboost is that it is not programmatically structured to deal with censored data. In fact, censored data are removed at the training stage, so the model cannot learn from them and therefore consistently underpredicts the time to failure (Snider and McBean 2020b).
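Why discarding censored pipes leads to underprediction can be illustrated with a Kaplan-Meier survival estimate, a standard non-parametric survival method used here only for illustration (it is not the model used in the cited studies). The durations below are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct break time.
    durations: years in service until break or end of observation;
    observed: 1 if a break was recorded, 0 if the pipe is censored."""
    times = sorted({t for t, o in zip(durations, observed) if o})
    curve, s = [], 1.0
    for t in times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        s *= 1 - events / at_risk
        curve.append((t, s))
    return curve

# Hypothetical data: 3 pipes broke, 5 pipes were still intact after 20 years
durations = [5, 8, 12, 20, 20, 20, 20, 20]
observed = [1, 1, 1, 0, 0, 0, 0, 0]

km_all = kaplan_meier(durations, observed)

# Same data after dropping the censored pipes
broke_only = [d for d, o in zip(durations, observed) if o]
km_dropped = kaplan_meier(broke_only, [1] * len(broke_only))

print("with censored pipes:   ", km_all)
print("censored pipes dropped:", km_dropped)
```

With the censored pipes included, the estimated probability of surviving beyond year 12 is 0.625; without them it drops to zero, because the training data then suggest that every pipe breaks within 12 years.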

Machine learning models are more desirable for predicting WM failures. To cope with the problem of censoring, a survival machine learning model (a combination of machine learning with survival statistics) can be used, one example being Random Survival Forest, which is relatively new. These models not only incorporate censored data but also utilize data-driven approaches to model complex relationships between input and output variables (Snider and McBean 2021).

Left truncation

Left truncation occurs when the records of pipe failures that happened before data collection began are missing. Like censoring, this is nearly always the case in a water main dataset, and it is widely acknowledged (Barton et al. 2022). The effects of left truncation have been overlooked in many studies (Snider and McBean 2020b), even though this issue causes a systematic bias, especially for survival analysis models (Scheidegger et al. 2015; Xu and Sinha 2019). Instead, these studies assume that the first recorded failure is the first real failure. This assumption leads to bias and inaccurate predictions (Xu and Sinha 2020; Hawari et al. 2020). The scale and shape of the survival curve can be severely biased due to left truncation, which results in a change in estimates of the mean time to failure (MTTF) (Xu and Sinha 2021).

There are several ways to tackle the left truncation issue. One way is to revise the probability function (Mailhot et al. 2000; Scheidegger et al. 2013). Xu and Sinha (2021) proposed integrating an ANN imputation method with Weibull proportional hazard survival analysis to calibrate the survival curve and reduce the MTTF estimation bias caused by left truncation. They reported a drop in bias from 14.3% to 2.1% after applying the method.
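One textbook way of revising the probability function is to condition each pipe's likelihood contribution on its having survived until observation began. The sketch below does this for a plain two-parameter Weibull model. It is a generic illustration under that assumption, not the formulation of any of the studies cited above; the function names and the record layout (entry_age, exit_age, failed) are hypothetical.

```python
import math


def weibull_log_survival(t, shape, scale):
    # log S(t) for a Weibull distribution: S(t) = exp(-(t / scale) ** shape)
    return -((t / scale) ** shape)


def weibull_log_density(t, shape, scale):
    # log f(t), with f(t) = (shape / scale) * (t / scale) ** (shape - 1) * S(t)
    return (
        math.log(shape / scale)
        + (shape - 1.0) * math.log(t / scale)
        - (t / scale) ** shape
    )


def log_likelihood(records, shape, scale):
    """Weibull log-likelihood with right censoring and left truncation.

    records: iterable of (entry_age, exit_age, failed), where entry_age is
    the pipe age when record keeping started (0 if observed from installation).
    """
    total = 0.0
    for entry_age, exit_age, failed in records:
        if failed:
            total += weibull_log_density(exit_age, shape, scale)
        else:
            total += weibull_log_survival(exit_age, shape, scale)
        # Condition on survival up to entry_age: divide by S(entry_age).
        total -= weibull_log_survival(entry_age, shape, scale)
    return total


# Hypothetical pipe that broke at age 30, first observed at age 20.
naive = log_likelihood([(0.0, 30.0, True)], shape=2.0, scale=50.0)
adjusted = log_likelihood([(20.0, 30.0, True)], shape=2.0, scale=50.0)
print(naive, adjusted)
```

When entry_age is zero the correction term vanishes and the expression reduces to the usual censored likelihood, so the adjustment only changes the fit for pipes that entered the record late.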

Correlation analysis

A correlation between predictors reduces the accuracy and increases the computing time of most machine learning algorithms (Hall 1999; Kumar and Chong 2018), except for tree-based algorithms, which can handle correlations (Eisler and Holmes 2021). Correlation between independent attributes has serious impacts, but it was not addressed in a number of studies (Snider and McBean 2018; Roccetti et al. 2019; Giraldo-González and Rodríguez 2020; Weeraddana et al. 2020; Rahbaralam et al. 2020; Dawood et al. 2020a; Amini and Dziedzic 2021). There are different methods to investigate the correlation, e.g., the t-test, ANOVA, MANOVA, chi-squared and Pearson’s correlation analysis. The appropriate method depends on the nature of the dataset (i.e., whether it is numerical or categorical). Pearson’s correlation analysis is one of the most widely used methods, but it is useful only for identifying the correlation between numerical variables (Zhang et al. 2014).
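As a minimal sketch of how Pearson's correlation can be used to screen numerical predictors, the example below computes the coefficient directly from its definition and drops any feature that is strongly correlated with one already kept. The 0.9 threshold, the feature names and the values are assumptions chosen for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two numerical sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Keep features in order; skip any whose |r| with a kept one exceeds threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical pipe attributes; age and installation year carry the same information.
pipes = {
    "age_years": [12, 35, 48, 60, 75],
    "install_year": [2010, 1987, 1974, 1962, 1947],
    "diameter_mm": [150, 300, 150, 200, 150],
}
selected = drop_correlated(pipes)
print(selected)
```

Which member of a correlated pair to keep is a modelling decision; this sketch simply keeps the first one encountered.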

Prescriptive analytics

The literature in the domain of water distribution networks can be divided into different categories in many ways. In this review, the literature is divided into two main categories: predictive analytics and prescriptive analytics. In the following subsections, the literature on prescriptive analytics is explained in more detail.

Failure consequences assessment and risk analysis

This paper reviewed the existing literature related to identifying risk, criticality index and failure consequences for WMs. Fares and Zayed (2010) utilized a hierarchical fuzzy expert system to evaluate the risk of WM failure. They considered 16 risk factors. According to their study, risk factors can be divided into factors that lead to failure (deterioration factors) and factors that result from failure (consequence factors). They demonstrated that the most significant influences on failure risk are pipe age, pipe material, and pipe breakage rate, in that order. Kabir et al. (2015) proposed a Bayesian Belief Network model to prioritize metallic WMs and evaluate the risk of WM failure. They used structural integrity, hydraulic capacity, water quality and consequence factors in their model, and they claimed that any other factors could also be included. They showed that the model can visualize the most vulnerable, the most sensitive and the highest-risk pipes within a WDN.

Mugume et al. (2015) simulated a simplified synthetic water distribution system in EPANET and a synthetic urban drainage system in the Storm Water Management Model. They investigated the system performance under the condition of pipe failure. They focused on minimizing failure consequences to improve resilience in urban water systems. They also investigated the effect of rehabilitation strategies including pipe replacement on resilience. They showed that if failure scenarios are considered during urban water systems design, the loss of system functionality could be minimized.

Al-Zahrani et al. (2016) identified the vulnerable locations in a WDN using a fuzzy-based decision support system. These vulnerable locations experience more structural failures as well as failures in supplying water at the target quality. Their model was applied to a case where a risk index was developed to show both the probability of failures and their impacts. They showed that the model helped utilities to prioritize pipes within the system based on the overall failure risk.

Vishwakarma and Sinha (2020) used the fuzzy inference method to develop the consequence of failure. They proposed a quantitative risk matrix for risk visualization that, compared to semi-quantitative and qualitative risk matrices, reduces subjectivity in the design process. Their model framework covers different types of failure consequences, such as economic, environmental and social impacts, as well as operational intelligence and the complexity of renewal activities. They improved on previously developed techniques of assessing failure consequences by using a quantitative risk matrix. Utilizing risk assessment has multiple advantages for management programs, such as supporting pipe renewal prioritization decisions and moving from reactive maintenance plans to proactive plans.

Phan et al. (2019) used a risk assessment framework in a case study of WMs in a WDN. The probability of failure was calculated using a Weibull distribution. They used a fuzzy inference system to aggregate failure consequences because unifying different types of consequences into one outcome is difficult. The consequences consist of impacts on the redundancy/vulnerability of the network, water loss and rehabilitation costs, and impacts on public health. They used the pipe diameter to quantify the volume of water loss and algebraic connectivity to capture the topological consequence. The topological consequence is useful for redundancy reduction. In order to prioritize water mains, a risk map was developed for use by decision-makers.

Balekelayi and Tesfamariam (2021) performed a hydrodynamic assessment for the wastewater system of Calgary using ordered weighted averaging technique to identify the criticality index of the wastewater pipes. A dynamic deterioration model was combined with the proposed criticality index to determine the operational risk of the wastewater pipes. This technique helps municipalities to prioritize the inspection and replacement of sewer pipes. They showed that the technique can successfully identify the criticality index of wastewater pipes when hydrodynamic data are not available. Using the information, hydraulic models can be regularly updated and thus wastewater pipe inspection plans can be prioritized. The results of the study can also be used for water mains.

Risk is the product of the probability of failure (POF) and the consequences of failure (COF). The probability of WM failures can be derived from the prediction models explained earlier in the predictive analytics section. However, in order to achieve a good maintenance plan, POF is not the only factor that matters; COF is equally important. This is because some pipes might have the lowest POF but the highest COF in the network, and these might be overlooked in the prioritization plan. Therefore, assessing COF is also of relevance. The failure consequences can be economic, environmental and social impacts. The indirect costs of failure should also be taken into consideration. Thus, the determination of failure consequences, and hence of risk, is difficult because of the uncertainties and the many factors involved. Often, fuzzy techniques are used to deal with uncertainties and to quantify failure consequences and risk factors.
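The point that ranking by POF alone can overlook high-consequence pipes follows directly from the risk definition (risk = POF × COF). A minimal sketch with three hypothetical pipes is given below; the identifiers, probabilities and consequence scores are invented for illustration, and COF is treated as a single unitless score.

```python
# Hypothetical pipes: annual probability of failure and a unitless consequence score.
pipes = [
    {"id": "P1", "pof": 0.30, "cof": 2.0},
    {"id": "P2", "pof": 0.05, "cof": 15.0},  # rarely fails, but feeds a hospital
    {"id": "P3", "pof": 0.20, "cof": 1.0},
]


def risk(pipe):
    """Risk as the product of probability and consequence of failure."""
    return pipe["pof"] * pipe["cof"]


by_pof = [p["id"] for p in sorted(pipes, key=lambda p: p["pof"], reverse=True)]
by_risk = [p["id"] for p in sorted(pipes, key=risk, reverse=True)]

print(by_pof)   # priority order if only POF is considered
print(by_risk)  # priority order once consequences are included
```

In this example the pipe with the lowest POF moves from last place to first once its consequence score is taken into account.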

Maintenance planning, scheduling and prioritization

In this section, papers on water loss minimization and asset management plans have been collected. The deterioration of assets is inevitable due to aging. Thus, efficient asset management becomes crucial for assets to continue delivering an adequate level of service. There are efficient asset management plans in various infrastructure sectors such as road networks, urban railways and metro systems, buildings, wastewater and drainage systems (Mohammadi et al. 2018, 2019, 2020; Dziedzic et al. 2021). However, there has been less progress in water systems asset management.

Kleiner et al. (2010) developed a non-homogeneous Poisson model for the analysis and forecast of breakage patterns in individual water mains, considering both static and dynamic factors. Their case study was for a water utility in Eastern Ontario. Different costs associated with each pipe were considered, including the costs of pipe replacement and repair, the costs of water loss due to failure, and cost-saving due to roadwork coordination. They used the results of pipe break predictions for the water main renewal schedule plan, utilizing a multi-objective genetic algorithm.

Malm et al. (2015) developed a Cost-Benefit Analysis for leakage reduction. They compared the costs and benefits of each alternative over time. They also considered uncertainty analysis. The results show that considering uncertainty analysis improved the results of the Cost-Benefit Analysis. They considered four different alternatives to reduce leakage in their case study of Gothenburg. It was found that reactive repair, despite a high leakage rate, is more cost-effective than proactive pipe replacement.

Zyoud and Fuchs-Hanusch (2019, 2020) applied different techniques to a real water supply system in Palestine. They compared the traditional Multi Criteria Decision Making approach and the Analytic Hierarchy Process (AHP) method for a water loss management problem. Although AHP is easy to implement and has strong potential in structuring and decomposing complex decision problems, it cannot handle uncertainties. Therefore, Fuzzy AHP has been used to deal with uncertainty and incomplete information.
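For readers unfamiliar with AHP, the sketch below shows how criterion weights can be derived from a pairwise comparison matrix. It uses the row geometric mean, a common approximation to the principal-eigenvector method, and does not cover the fuzzy extension. The three criteria and the comparison values are hypothetical and are not taken from the cited studies.

```python
import math


def ahp_weights(matrix):
    """Approximate AHP priority weights using the row geometric mean method."""
    geo_means = [math.prod(row) ** (1.0 / len(row)) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical criteria for a water loss management decision.
criteria = ["cost", "leakage reduction", "service disruption"]

# comparison[i][j] states how much more important criterion i is than criterion j.
# The matrix is reciprocal: comparison[j][i] == 1 / comparison[i][j].
comparison = [
    [1.0, 2.0, 6.0],
    [1.0 / 2.0, 1.0, 3.0],
    [1.0 / 6.0, 1.0 / 3.0, 1.0],
]

weights = ahp_weights(comparison)
for name, w in zip(criteria, weights):
    print(f"{name}: {w:.2f}")
```

The matrix above is perfectly consistent, so the weights are recovered exactly. Judgements elicited from experts are rarely this consistent, which is one motivation for consistency checks and for the fuzzy variants mentioned above.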

Barton et al. (2022) revealed that the quantity and quality of data have an important impact on the accuracy of WM failure models, and that poor data result in low model accuracy. They suggested that there should be an increased focus on data collection, since poor-quality data make it hard for utilities to manage WM rehabilitation plans. They showed that long-term management plans for water mains remain a challenging issue and require further attention.

Ghobadi et al. (2021) proposed a pipe replacement scheduling method based on a life cycle cost assessment. In order to obtain an optimal replacement plan, a multi-objective nondominated sorting genetic algorithm (NSGA-II) is used. The proposed replacement plan avoids investment peaks and smooths the investment time series based on life cycle cost. Unlike many other studies, they included annual budget limitations in their model. They show that by using online monitoring and recording failure data, the accuracy of the pipe failure rate is improved, and the annual replacement plans can be updated. The scheduling plan thus becomes near optimal.
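Multi-objective methods such as NSGA-II rank candidate plans by Pareto dominance rather than by a single score. The sketch below shows only that dominance test and the resulting non-dominated set; it is not an implementation of NSGA-II (there is no crowding distance, selection, crossover or mutation). The plan names and the two objectives, both to be minimized, are hypothetical.

```python
def dominates(a, b):
    """True if plan a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(plans):
    """Return the names of plans that no other plan dominates."""
    return [
        name
        for name, objectives in plans.items()
        if not any(dominates(other, objectives) for other in plans.values())
    ]


# Hypothetical replacement plans: (life cycle cost, peak annual investment).
plans = {
    "A": (100.0, 40.0),
    "B": (120.0, 25.0),
    "C": (130.0, 45.0),  # worse than A on both objectives
    "D": (150.0, 20.0),
}

front = pareto_front(plans)
print(front)
```

The plans left on the front represent different trade-offs between total cost and investment peaks; choosing among them remains a decision for the utility.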

Decision-making software tools and methodologies would help municipalities to carry out their water infrastructure maintenance plans more efficiently. These plans usually consider a set of predefined alternatives. However, more practical replacement plans, which affect several pipes simultaneously rather than replacing individual pipes, have not been considered in these methodologies. Ramos-Salgado et al. (2022) scheduled a sustainable water supply replacement plan with a five-step infrastructure asset management framework. (1) As the first step, a replacement priority index was obtained for every network asset. (2) Unlike previous maintenance strategies, which considered individual pipelines, they used street sections as the operational replacement unit to reduce the social consequences of each intervention. (3) They planned the replacement of two adjacent pipes at the same time, even when the pipes had different replacement priorities, for operational and convenience reasons, since this is more acceptable to utilities and more aligned with their policies. A fair budget allocation was also performed based on social and geographic criteria to ensure a decent distribution of investment between districts and towns. (4) After the replacement priority of the network assets is specified, short-term, mid-term and long-term replacement plans are required. This calls for a set of network performance indicators that specify the investment level and certain courses of action. A combination of four indicators was used, namely the infrastructure value index (the ratio between the value of the infrastructure in its current state and its replacement cost), average network age, average risk index, and average probability of failure. These indicators are easy to calculate and interpret, and they convey various information on the performance of the network. (5) Lastly, a mathematical technique is used to calculate the required budget more efficiently.
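The four indicators in step (4) can be computed with a few lines of code. In the sketch below, the network-level infrastructure value index is taken as total current value divided by total replacement cost, and the three averages are unweighted; both choices are assumptions, since the source gives only the definition of the index. The asset records and field names are hypothetical.

```python
def network_indicators(assets):
    """Compute four network-level performance indicators from asset records."""
    n = len(assets)
    current_value = sum(a["current_value"] for a in assets)
    replacement_cost = sum(a["replacement_cost"] for a in assets)
    return {
        # Ratio of current infrastructure value to its replacement cost.
        "infrastructure_value_index": current_value / replacement_cost,
        # Unweighted means (an assumption; length weighting is another option).
        "average_age": sum(a["age"] for a in assets) / n,
        "average_risk_index": sum(a["risk_index"] for a in assets) / n,
        "average_pof": sum(a["pof"] for a in assets) / n,
    }


# Hypothetical street sections with invented values.
assets = [
    {"current_value": 40.0, "replacement_cost": 100.0, "age": 50, "risk_index": 0.6, "pof": 0.20},
    {"current_value": 60.0, "replacement_cost": 100.0, "age": 30, "risk_index": 0.2, "pof": 0.10},
]

indicators = network_indicators(assets)
print(indicators)
```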

Maintenance coordination and prioritization

Tremendous effort has gone into maintenance plans for individual assets. A coordinated asset management plan is much needed to better manage existing infrastructure assets, but coordination has been neglected by many municipalities. This section gives particular attention to the coordination of interrelated infrastructures (e.g., roads, water and sewers), their optimum replacement time, and the prioritization of their budget allocation. Integrated rehabilitation actions among co-located infrastructure assets are necessary when developing a renewal plan. This could decrease or avoid unnecessary rework, rehabilitation costs, service disruptions and risks (Halfawy 2008; Abusamra 2018).

Marzouk and Osama (2015) proposed a decision support tool to determine the optimal time of maintenance and replacement of mixed infrastructures simultaneously (i.e., pavement, water pipes, sewer pipes, gas pipes, and electrical cables). This approach could avoid the costs of destroying the surface layer of pavements multiple times (for example, once for sewer pipe replacement and once for water pipe replacement). The useful life of the different infrastructures was first identified by simulation, and then, depending on the replacement time and costs, a decision was made on the optimal maintenance and replacement time. To account for model uncertainties, a fuzzy approach was applied. The key goal of their study was the minimization of the total costs of infrastructure replacement.

Marzouk and Osama (2017) presented a method for the coordinated maintenance of road, water distribution and wastewater distribution networks. First, a deterioration model is developed using a hierarchical fuzzy expert system technique to assess the condition of each infrastructure asset. Then, a risk model is developed using a fuzzy Monte Carlo simulation to calculate POF and AHP to calculate COF. Lastly, a multi-objective optimization using genetic algorithm (GA) is developed, with four objective functions: (1) minimizing the overall risk, (2) maximizing level of service (LOS), (3) maximizing the overall conditions of the assets, and (4) minimizing life cycle cost (LCC). The optimization model considers seven scenarios of actions for: (1) road segment only; (2) water only; (3) sewer only; (4) road and water; (5) road and sewer; (6) sewer and water; (7) road, sewer and water. The optimization constraints were set to meet the minimum requirements of the condition, performance and risk for all infrastructures within the annual budget. The results showed an average integrated risk index of 5.45 over a planning horizon of 20 years. Over 86% of the projects were recommended under integrated scenarios as follows: road, water and sewer at 38%; road and sewer at 24%; road and water at 24%. These maximize cost saving.

Abusamra (2018) pointed out numerous attempts to improve infrastructure maintenance and intervention plans within a limited budget. However, most of them were successful only in developing a plan for short-term planning and a single asset. The author proposed optimization models to help decision makers to identify a coordinated maintenance plan for the co-located infrastructure assets (i.e., roads, water, and sewer). Two multi-objective models were discussed: (1) evolutionary GAs optimization, which used a set of meta-heuristic rules to find a near-optimum solution; (2) linear programming optimization to find an exact solution. The objective function was to maximize an overall improvement and to maximize the network health index. The results showed an overall enhancement (time, cost, efficiency, risk, etc.) of 29% over a planning horizon of 25 years, achieved from coordinating the interventions. Compared to the conventional approach, coordination reduced disruptions and interventions by 67%.

Amador-Jimenez and Mohammadi (2020) considered different budgeting scenarios such as worst-first, silos, and trade-off optimization, to assess the pros and cons of proposed scenarios. They aimed to investigate the prioritization of budget allocation and management plans for different infrastructure assets (i.e., pavements, sanitary sewers, storm sewers and water mains), based on the proposed scenarios, and to select the superior management plan among all. They show that a trade-off optimization analysis improves results, giving the highest priority to water mains and lower priority to pavements and storm pipes in terms of investment management planning.

Very recently, Shahata et al. (2022) proposed a multi-stage integer programming model that is capable of selecting the most suitable, cost-effective renewal action (if any) for road, sewer and water infrastructure assets. The objective function was to maximize risk reduction in a cost-effective manner. Their decision-making approach used risk assessment and a performance rating model. The model also used rehabilitation alternatives, giving priority to integrated renewal actions. They showed that the approach is capable of reducing risk costs by using integrated actions (e.g., road, water and sewer by 36%; road and sewer by 23%; road and water by 25%). They also showed that their integrated model can enhance budget saving compared to the conventional silos approach (a separate renewal plan for each infrastructure). In order to improve the model's practicality, the consequences of each intervention alternative, such as the impact on travel delay, noise pollution costs and lost business revenue, should also be considered.
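The structure of such a selection problem (one renewal action per street segment, a shared budget, and integrated actions that cost less than the sum of separate ones) can be shown on a toy instance. The sketch below enumerates all combinations by brute force in place of an integer programming solver, which is only workable for very small cases. It is not the model of Shahata et al. (2022); the segments, costs and risk-reduction scores are invented.

```python
from itertools import product

# Hypothetical options per street segment: (action, cost, risk reduction).
# The integrated "road+water" action costs less than the two separate actions.
options = {
    "S1": [("none", 0, 0), ("road", 50, 3), ("water", 60, 5), ("road+water", 90, 8)],
    "S2": [("none", 0, 0), ("road", 40, 2), ("water", 70, 6), ("road+water", 95, 8)],
}


def best_plan(options, budget):
    """Pick one action per segment to maximize total risk reduction within budget."""
    segments = list(options)
    best = None
    for choice in product(*(options[s] for s in segments)):
        cost = sum(c for _, c, _ in choice)
        gain = sum(g for _, _, g in choice)
        if cost <= budget and (best is None or gain > best[1]):
            plan = {s: action for s, (action, _, _) in zip(segments, choice)}
            best = (plan, gain, cost)
    return best


plan, gain, cost = best_plan(options, budget=160)
print(plan, gain, cost)
```

With more segments, more asset types and multiple planning stages, the number of combinations grows exponentially, which is why formulations of this kind are normally handed to a dedicated solver.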


This review of water asset management analytics has revealed: a) a need to explore the influence of environmental factors on WM failures; b) a need to consider both direct and indirect costs in optimal mitigation analysis and replacement prioritization. The environmental factors indirectly contribute to failures. The contribution is particularly significant for WMs in cold regions. Failure models should be coupled with costs (direct and indirect) as a constraint in optimal scheduling plans. The coupling renders failure predictions meaningful as the ultimate goals are to update asset management plans and prioritize rehabilitation or replacement.

Further research efforts are needed to reveal new insights about contributing mechanisms of WM failures, to create novel ideas for reliable predictions of failures, and to invent ways for putting theoretical predictions into practical use in managing and maintaining WMs in a cost-effective manner. The mechanisms are more complex in cold regions. More details about potential avenues for future research are discussed below under each category of analytics:

Directions of future research in predictive analytics

In spite of extensive studies of WM failures over the past decades, significant knowledge gaps exist in predictive analytics. Environmental factors (e.g., weather conditions, climate factors and so on) are reportedly less influential than physical factors (e.g., pipe diameter, pipe length and so on). However, the influence of environmental factors such as climatic variations and freezing in cold regions has received little attention (Kleiner and Rajani 2002; Farmani et al. 2017; Demissie et al. 2017; Almheiri et al. 2020). In cold regions, pipes are more susceptible to breaking due to temperature fluctuations. Frozen water inside a pipe expands. Even if the pipe does not break, it can significantly degrade. Freezing temperature fluctuations result in extra stresses on pipes. Moisture in the ground can cause frost at freezing temperatures and lead to ground movement and hence stresses on the pipes. Cast-iron pipes are more prone to failures at freezing temperatures because of the erosion of soils around them. If they are not lined with protective materials, they begin to corrode from the inside and ultimately break. In future research, it would be meaningful to create homogeneous groups of pipes based on environmental factors such as soil type, freezing index, temperature, precipitation and frost depth in order to investigate their influence on WM failures.

The past studies have overlooked issues related to the apparent age of pipes based on their conditions. An application of rehabilitation techniques such as lining and cathodic retrofit to existing pipes causes a change in the conditions of the pipes and thus redefines their ages. Therefore, the influence of applied rehabilitation techniques and the resulting change need to be investigated.

One important step before developing any prediction model is data preprocessing and preparation. Missing data need to be handled properly. If the amount of missing data is negligible, the affected records can simply be excluded. Otherwise, the gaps should be filled by predictions using advanced imputation methods. Correlated attributes should be removed, as they decrease modelling efficiency significantly. Adequate attention must also be paid to imbalanced datasets and to general data preparation before applying any prediction method, because these steps affect modelling reliability significantly. Resampling the dataset is a good way to cope with imbalance.
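Resampling can take several forms. The sketch below shows the simplest one, random oversampling of the minority class with replacement, on hypothetical pipe records. The function name, the "failed" label and the data are assumptions made for illustration; synthetic methods such as SMOTE generate new points instead of duplicating existing ones.

```python
import random


def oversample_minority(rows, label_key="failed", seed=0):
    """Duplicate randomly chosen minority-class rows until both classes are equal in size.

    Assumes a binary label and at least one row in each class.
    """
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    positives = [r for r in rows if r[label_key]]
    negatives = [r for r in rows if not r[label_key]]
    minority, majority = sorted([positives, negatives], key=len)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra


# Hypothetical dataset: 2 pipes with a recorded break, 8 without.
records = [{"pipe": i, "failed": i < 2} for i in range(10)]
balanced = oversample_minority(records)

n_failed = sum(1 for r in balanced if r["failed"])
print(len(balanced), n_failed)
```

Resampling is normally applied to the training split only, so that duplicated records do not leak into the data used to evaluate the model.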

Unresolved issues of censoring and left truncation are common in WM datasets. One way to deal with these issues is to use survival machine learning models (a combination of machine learning with survival statistics). The models can handle both censoring and complex relationships between input and output variables. Table 1 presents a summary of predictive analytics and the applied techniques published in the past 13 years.

Table 1 A summary of predictive analytics applications for water pipes failure

Directions of future research in prescriptive analytics

Previous studies using prescriptive analytics have been limited to considering economic costs as the maintenance objective to optimize. The social and environmental costs (indirect costs) associated with a failure are commonly ignored in WM maintenance planning and rehabilitation scheduling. Future research should aim to maximize system reliability and at the same time minimize the risk index and failure consequences (costs). Besides economic costs, the social and environmental costs can have a significant influence on maintenance planning and scheduling, and they should be considered.

The product of POF and COF determines the risk factor; through this link, a risk map can be developed and utilized to develop a maintenance prioritization plan. The coupling of a WM prediction model, the probability of WM failure and COF allows us to develop a precise, practical maintenance prioritization plan. This goal can be achieved using an optimization model, together with decision-making methods. The goal should be set in a way that reduces leakage, which in turn decreases expenses (direct and indirect) and increases the life expectancy of assets. However, long-term management plans remain challenging and further attention is needed.

Studies on optimizing the maintenance/replacement time of infrastructure networks with coordination and prioritization of maintenance activities are rare. In reality, WM infrastructure is often maintained in association with other infrastructure such as pavement, and thus the asset management impact of one infrastructure on the other is inevitable. This mutual impact remains essentially an underexplored area. A prescriptive analysis of interdependent infrastructures would help to prioritize budget allocations and to identify the optimal replacement/maintenance time in a realistic setting.

In conclusion, the need to adopt a coordinated maintenance plan for integrated infrastructure assets is extensively acknowledged in industry and academia. When the assets reach an unacceptable LOS and require action or intervention, the optimum decision on how to repair all overlapping assets within the pre-existing, limited budget and without overspending remains challenging. Therefore, priorities should be set in a way that answers these questions: Which asset is more critical and needs immediate action? What are the actions/interventions (repair, rehabilitate, replace or do nothing)? When is the best time to do the work? One important requirement for all coordinated maintenance plans is the ability to support long-term planning. In this regard, the life cycles of different infrastructure assets should be considered in these models. Table 2 presents a summary of prescriptive analytics and the applied techniques published in the past 12 years.

Table 2 A summary of prescriptive analytics applications for water pipes failure


In light of the above literature review, and after considering the knowledge gaps related to the existing analytics methods and the issues associated with water main datasets, an integrated approach to smart water main asset management is advocated (Figure 4), incorporating the synergy between failure models (predictive analytics) and maintenance strategies (prescriptive analytics). Most WM datasets mainly consist of physical factors of pipes such as age, diameter, length and material. Usually, they do not include operational factors such as annual average daily traffic (AADT), number of breaks and water pressure, or environmental factors such as freezing and thawing index, temperature, precipitation, frost depth, and rain deficit. Therefore, in order to investigate the effects of the environmental factors, this study suggests merging them with WM datasets. After cleansing and careful data pre-processing, dimensionality reduction is useful to reduce dataset dimensions and computing time. To aggregate the efforts for regions with similar characteristics, one may perform clustering, which is relatively new in this domain. The next step is to select the features that contribute the most to failure prediction models. With sufficient data, a prediction model can then be developed as the final step in predictive analytics.

Fig. 4 Proposed integrated predictive and prescriptive analytics for smart water main asset management

With regard to prescriptive analytics, depending on the type of prediction model used in the previous stage, either the time to failure or POF/COF could be mapped across the water main network. To obtain COF, the indirect costs of failure, such as proximity to environmental/external factors (e.g., rail tracks and transmission gas mains) and the impact on the LOS and the customer class (e.g., hospital, emergency services, residential), could be considered. By minimizing the risk of failure, the infrastructure maintenance plan can be prioritized accordingly. Using the prediction models resulting from the predictive analytics step, a multi-objective maintenance plan could be developed in coordination with other infrastructures such as roads and sewer pipes. The other optimization objectives could be maximizing LOS, maximizing asset condition, and minimizing LCC. Lastly, the rehabilitation/replacement plan is scheduled. It is expected that after the rehabilitation/replacement plan is implemented in WMs, the condition of the networks will change. The predictive models should therefore be updated based on the new information after each prediction-prescription-implementation cycle.


Water pipe failures have increased drastically due to a slow rate of replacement and thus the aging of WMs. This issue is difficult to resolve because such networks are complex and are typically buried underground. In many municipalities, most parts of the networks have reached the end of their service life, which will lead to even more failures in the near future. Given that failures incur revenue losses and cause interruptions to service and economic activities, it becomes increasingly urgent to find better solutions. Various financial, societal, and technical constraints make it infeasible to think of replacing aging WMs, which typically serve many residential, commercial, industrial and institutional consumers, and which form part of a vast network of interconnected pipelines, pumps, valves, regulators and tanks. Thus, predicting near-future failures is of economic, social and environmental relevance.

This review provided a comprehensive overview of the methods proposed for predicting and minimizing failures and their consequences. It has provided new insights into the knowledge gaps identified in the existing studies related to the applications of predictive and prescriptive analytics in water systems asset management. In spite of extensive research efforts over the past decades, the treatment of imbalanced data, censoring and left truncation remain key research gaps. The other gaps concern how to increase the sustainability, reliability and resilience of WM systems through the use of predictive models and efficient rehabilitation planning.

Considering the literature and the identified gaps, this study proposed a failure analytics framework for WMs and discussed a number of avenues for future research. It is worth highlighting that the quality of the dataset can have a significant impact on the performance of prediction models. To improve data quality, this review recommends that municipalities use advanced inspection technologies, which help establish more accurate prediction models, leading in turn to more precise data-driven prescriptive analytics that improve the reliability of WMs and create cost efficiency gains.



Pipe age (year)
Area Under the ROC Curve
Artificial Neural Network
Break density (breaks/km²)
Analytic Hierarchy Process
Break year (year)
Consequences Of Failure
Pipe diameter (mm)
Freezing Index (degree days)
Genetic Algorithm
Gradient-Boosted Tree
K-Nearest Neighbor
Pipe length (m)
Level Of Service
Life Cycle Cost
Pipe location
Logistic Regression
Land use
Matthews Correlation Coefficient
Moisture content (%)
Number of previous breaks
Number of connections
Network type
Poisson Regression
Precipitation (mm)
Probability Of Failure
Receiver Operating Characteristic
Road type
Soil pH
Soil resistivity
Soil type
Support Vector Classification
Synthetic Minority Over-sampling Technique
Pipe thickness
Water Distribution Network
Water Main
Water pressure (kPa)
Water pH
Installation year (year)




The authors are very much thankful to four anonymous reviewers whose comments and suggestions were very helpful in improving the quality of this manuscript.


Not applicable.

Author information


The work has been done under the supervision of FN and SI. The first draft of the manuscript was written by AD. Reviews, edits, and further analysis were performed by FN and SI. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Fuzhan Nasiri.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The corresponding author, Fuzhan Nasiri, is an unpaid member of the editorial board of Environmental Systems Research. The authors declare that they have no other competing interests.


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit


About this article


Cite this article

Delnaz, A., Nasiri, F. & Li, S.S. Asset management analytics for urban water mains: a literature review. Environ Syst Res 12, 12 (2023).
