UN Data for Climate Action – Predicting and Alleviating Road Flooding in Senegal

After ten busy weeks, we have completed our work for the UN Data for Climate Action Challenge. Here is a wrap-up of the research questions we focused on and the solutions we developed.

The background is that climate change has the potential to raise the risk of flooding for coastal countries like Senegal. Given the large proportion of unpaved roads in Senegal, flooding could damage the road network and reduce residents' accessibility. Because many African countries face a funding deficit for infrastructure development, it is critical to identify which roads should be prioritized in preparing for the possible damage brought by climate change. We propose two steps to identify the roads that should be prioritized.

First, we need to evaluate the probability of flooding, under climate change, for the areas that roads pass through. To achieve this, we build a flood risk model based on topographic features and historical weather data for the study area.

The second step is to analyze the contribution of each road segment to regional connectivity. Roads that are critical to accessibility and under flood risk should be prioritized for weatherproofing.

Applying optimization techniques, we can then determine explicit plans for allocating road maintenance funds. Multiple sustainable development objectives can be explored within this framework, such as maximizing rural connectivity or minimizing the expected number of people isolated due to flooding. This approach has the potential to minimize the long-term cost of establishing a reliable road network while helping to buffer vulnerable populations from extreme weather events.
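As a toy illustration of how such a fund-allocation step could work, here is a minimal greedy sketch; the segment names, costs, and scores are invented, and our actual framework would use a proper optimization formulation rather than this simple heuristic:

```python
# Hypothetical sketch: choosing which road segments to fortify under a fixed
# budget. A segment's "score" could combine its flood risk and its
# connectivity importance from the two models described above.

def allocate_budget(segments, budget):
    """Greedy heuristic: fortify segments with the best score-per-cost ratio."""
    ranked = sorted(segments, key=lambda s: s["score"] / s["cost"], reverse=True)
    chosen, spent = [], 0.0
    for seg in ranked:
        if spent + seg["cost"] <= budget:
            chosen.append(seg["id"])
            spent += seg["cost"]
    return chosen, spent

segments = [
    {"id": "N4",  "cost": 3.0, "score": 9.0},  # high risk, high importance
    {"id": "D21", "cost": 1.0, "score": 2.5},
    {"id": "R12", "cost": 2.0, "score": 2.0},
]
chosen, spent = allocate_budget(segments, budget=4.0)
print(chosen, spent)  # ['N4', 'D21'] 4.0
```

A real plan would replace the greedy rule with an integer program over the network, but the budget-constrained selection structure is the same.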

Flood Risk Prediction Model

For flood risk prediction, we collected data from multiple sources: flooding maps of Senegal from NASA, daily weather data from NOAA, land cover data from the Food and Agriculture Organization of the UN, and several types of maps from OpenStreetMap. With this rich information on topography, hydrology, weather, etc., we are able to build machine learning models that evaluate flood risk at a 1 km × 1 km analysis unit. The following framework shows the features we use, the targets, the algorithms we use to build models, and the evaluation methods.

A critical step is joining the target flooding areas and all the features so that they are at the same spatial scale. For raster files, we mainly use the zonal statistics method to get values for each grid cell. For land cover and water area data, we calculate the intersection area of the feature polygons with each grid cell. For daily weather, we use a weighted average, where the weights are determined by the distance from the grid cell to each of the two weather stations.
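A minimal sketch of the zonal-statistics idea for raster features, using plain NumPy blocks in place of real grid-cell geometries (the toy raster and cell sizes are illustrative):

```python
import numpy as np

# Minimal zonal-statistics sketch: summarize a raster (e.g. elevation) inside
# each grid cell. Here the "zones" are axis-aligned blocks of the raster
# array; real grid cells would be joined to the raster via their geometries.

def zonal_stats(raster, cell_rows, cell_cols):
    """Return mean/max/min/std of the raster per grid cell."""
    n_r, n_c = raster.shape[0] // cell_rows, raster.shape[1] // cell_cols
    stats = {}
    for i in range(n_r):
        for j in range(n_c):
            block = raster[i*cell_rows:(i+1)*cell_rows,
                           j*cell_cols:(j+1)*cell_cols]
            stats[(i, j)] = {"mean": block.mean(), "max": block.max(),
                             "min": block.min(), "std": block.std()}
    return stats

elevation = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 raster
stats = zonal_stats(elevation, cell_rows=2, cell_cols=2)
print(stats[(0, 0)]["mean"])  # 2.5  (mean of [[0, 1], [4, 5]])
```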

First, we train regression models, using the proportion of flooded area in each grid cell during each biweekly time period as the target. We choose three machine-learning models to train on the data: Support Vector Machines (SVM), Random Forest (RF), and XGBoost. The best RF model achieves promising performance, with an R-squared (how closely the data fit the regression line) of about 0.7056 on the test set and a root mean square error (RMSE) of about 0.1041. The top 10 most important features of the model show that the dynamic historical weather features, especially historical temperature and precipitation, drive the change in flooded area.
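A sketch of this regression setup on synthetic data (the real inputs are the topographic and weather features above; the generated features, target, and split here are purely illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in for the grid-cell feature table: 5 features, with the
# target (flooded fraction) driven mostly by features 2 and 3.
rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = 0.6 * X[:, 2] + 0.3 * X[:, 3] + 0.05 * rng.random(300)

X_train, X_test, y_train, y_test = X[:240], X[240:], y[:240], y[240:]
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5
print(round(r2, 3), round(rmse, 3))

# Feature importances recover which inputs drive the prediction, which is how
# the "top 10 features" analysis above is obtained.
top = np.argsort(model.feature_importances_)[::-1][:2]
```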

However, the regression results do not reflect how adversely a road passing through an area may be affected by flooding. This is challenging to quantify, as the change in a grid cell's flooded area is not directly related to the probability of a road becoming flooded. Therefore, we set a threshold to determine whether a grid cell is flooded during a particular biweekly time period, turning the task into a classification problem. Each sample is labeled as flooded or not based on the percentage of flooded area in the grid cell. To be conservative, the threshold is set to 0.5: if at least 50% of a grid cell's area is flooded during a biweekly time period, the sample is labeled as flooded, and otherwise as not flooded. The table below shows the model evaluation and performance on the test dataset.
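The labeling rule itself is a one-liner; as a sketch (sample fractions are made up):

```python
# Turn the regression target into binary labels: a grid-cell/period sample
# counts as "flooded" when at least half of its area was flooded.
FLOOD_THRESHOLD = 0.5

def label_sample(flooded_fraction, threshold=FLOOD_THRESHOLD):
    return 1 if flooded_fraction >= threshold else 0

fractions = [0.02, 0.5, 0.73, 0.49]
labels = [label_sample(f) for f in fractions]
print(labels)  # [0, 1, 1, 0]
```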

A visualization of the historical flood risk map and the predicted map shows that we can precisely capture areas with high flood risk, such as #1, #2, and #3. Meanwhile, for some areas with historically low flood risk (#4), our model can overestimate the risk. Such areas may not have flooded frequently in the past but, according to our model, probably have a risk of flooding in the future. The predicted results thus offer suggestive information for future preparation.

Road Network Optimization

We use telecommunication data from Orange to estimate traffic flow on road segments. We first generated a Voronoi diagram of the cellular network towers by computing the Delaunay triangulation of the tower locations and assigning road intersections to each Voronoi region. We then assigned population flow to the edges by checking whether a user was in transition. We say a user is in transition if the tower corresponding to their cell phone use changed from one time stamp to the next. If a user is in transition, we calculate the shortest path between two randomly chosen roads corresponding to the origin and destination regions. After the path is calculated, we increment the population of the edges on the path by one for the date of the destination's time stamp.
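The flow-assignment step can be sketched on a toy example; the road graph, the tower-to-region mapping, and the user trace below are all hypothetical, and unweighted BFS stands in for a road-distance shortest path:

```python
from collections import deque, defaultdict

# Toy road graph (adjacency list of intersections) and a mapping from each
# cell tower to a representative road node in its Voronoi region.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
region_of_tower = {"T1": "A", "T2": "D"}

def shortest_path(graph, src, dst):
    """BFS shortest path; a real version would weight edges by road length."""
    prev, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in graph[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

# One user's (tower, timestamp) trace; a fuller version would also key the
# counts by the destination's date, as described above.
trace = [("T1", "2016-01-01"), ("T1", "2016-01-02"), ("T2", "2016-01-03")]
edge_flow = defaultdict(int)
for (tower_prev, _), (tower_next, _) in zip(trace, trace[1:]):
    if tower_prev != tower_next:  # user transitioned between towers
        path = shortest_path(graph, region_of_tower[tower_prev],
                             region_of_tower[tower_next])
        for u, v in zip(path, path[1:]):
            edge_flow[(min(u, v), max(u, v))] += 1  # undirected edge count

print(dict(edge_flow))  # {('A', 'B'): 1, ('B', 'C'): 1, ('C', 'D'): 1}
```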

The second task was to determine which edges in our graph were most at risk of being flooded. Using the 14-day composite flood map from NASA, we calculate the amount of flooding on a road during a particular time period: the sum of the flooded areas along one road segment in that period, divided by the length of the entire segment. The assumption is that if a road segment is frequently flooded, or a large proportion of the road is flooded, then the segment has a higher risk of failure. Therefore, we define the flood risk of a road as the sum of its flooded proportions over all time periods.

The third task was to determine the overall importance of each road segment, so that repairs or preemptive fortifications can be prioritized based on the value of the road. We define road importance as how much impact a segment's removal would have on accessibility to the surrounding regions. This is computed by finding the distance traveled by all inhabitants on two separate paths and taking their difference. The first path is the original intact path; the second is the alternate route taken if one of the roads on the original path is damaged. The bigger the difference between the two, the worse the new route is, and thus the greater the impact on accessibility when the chosen road floods. We calculated the importance of the top 20 riskiest roads.
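The detour-based importance computation can be sketched on a toy weighted graph (road names and distances are illustrative):

```python
import heapq

# Toy weighted road graph: edge weights are distances in km.
graph = {
    "A": {"B": 2.0, "C": 5.0},
    "B": {"A": 2.0, "C": 1.0},
    "C": {"A": 5.0, "B": 1.0},
}

def shortest_distance(graph, src, dst, blocked=None):
    """Dijkstra; `blocked` is an undirected edge (u, v) removed from the graph."""
    dist, heap = {src: 0.0}, [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            return d
        for nxt, w in graph[node].items():
            if blocked and {node, nxt} == set(blocked):
                continue  # simulate the flooded road being impassable
            if d + w < dist.get(nxt, float("inf")):
                dist[nxt] = d + w
                heapq.heappush(heap, (d + w, nxt))
    return float("inf")

# Importance of road B-C for travel from A to C: detour cost minus original cost.
base = shortest_distance(graph, "A", "C")                        # via B: 3.0
detour = shortest_distance(graph, "A", "C", blocked=("B", "C"))  # direct: 5.0
importance_BC = detour - base
print(importance_BC)  # 2.0
```

Summing this detour penalty over all traveler origin-destination pairs gives a segment's overall importance.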

In conclusion, we approach the road optimization problem by building a flood risk model, evaluating road traffic based on mobility behaviors extracted from cell phone records, and combining the two to assess road importance. We hope these models can help decision makers craft more efficient climate-mitigation strategies for transportation.

We thank our mentors Bistra Dilkina, Caleb Robinson, and Amrita Gupta for their useful advice.

UN Project: Features Processed and Models for Flood Risk Prediction

After exploring the data sources, this week we focus on the geographic processing of the data and start to explore some preliminary questions in flood risk prediction.

Geo-processing of data: for the 1 km × 1 km grid cells created as learning units in the Ziguinchor region, we extract topographic and weather features for each cell.

1) Weather: in the NOAA weather data, there are two weather stations located in our study region. The weather features of each grid cell are therefore calculated as a weighted average of the two stations, based on the distance to each. Because the flood images are 14-day composites, the weather data are aggregated at the same time scale. As the figure shows, the total flooded area in this region is highly consistent with precipitation and dew point across all the time points from 2015 to 2017.
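Assuming inverse-distance weighting (the post only specifies a distance-based weighted average), the interpolation for one grid cell looks like this; the readings and distances are made-up numbers:

```python
# Inverse-distance weighting of the two stations' readings for one grid cell.
def idw(readings, distances):
    """Weight each station's reading by 1/distance, then normalize."""
    weights = [1.0 / d for d in distances]
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, readings)) / total

precip_mm = [10.0, 30.0]  # the two stations' readings (illustrative)
dist_km = [5.0, 15.0]     # grid-cell centroid to each station (illustrative)
print(idw(precip_mm, dist_km))  # 15.0 - closer station dominates
```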

 

2) Water area and waterways: both are spatial data, with polygons or lines on the map representing water at certain locations. For each grid cell, we calculate its intersection area with the water-area polygons, and the distance from the cell centroid to the nearest waterway. These two features capture the geographic relation between our target area and water.
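The nearest-waterway distance reduces to point-to-segment geometry; a sketch with waterways simplified to straight line segments and made-up coordinates:

```python
import math

def point_segment_dist(p, a, b):
    """Distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == dy == 0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection of p onto the segment to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

centroid = (2.0, 1.0)  # toy grid-cell centroid
waterways = [((0.0, 0.0), (4.0, 0.0)), ((5.0, 5.0), (6.0, 5.0))]
nearest = min(point_segment_dist(centroid, a, b) for a, b in waterways)
print(nearest)  # 1.0
```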

3) Elevation and slope: elevation is raster data at 3 arc-second resolution, from which slope values are also derived. We apply a zonal statistics method to the raster file and grid shapefiles, obtaining the average, max, min, and standard deviation of elevation and slope in each grid cell.
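Deriving slope from the elevation raster can be sketched with finite differences; a real computation would use the raster's true 3 arc-second cell size for the horizontal spacing (the values below are toy numbers):

```python
import numpy as np

# Toy elevation raster that rises 1 m per cell to the east.
elev = np.array([[0., 1., 2.],
                 [0., 1., 2.],
                 [0., 1., 2.]])
cell_size = 1.0  # horizontal spacing in metres (illustrative)

# Finite-difference gradients, then slope angle from the gradient magnitude.
dzdy, dzdx = np.gradient(elev, cell_size)
slope_deg = np.degrees(np.arctan(np.hypot(dzdx, dzdy)))
print(slope_deg[1, 1])  # 45.0
```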

4) Land cover: water storage capacity has a great effect on flood formation, so land cover type could be a very predictive feature for flood risk. We downloaded a land cover map from the Food and Agriculture Organization of the United Nations and calculate the percentage of each surface type in each grid cell.
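Computing per-cell land-cover fractions from a categorical raster is a counting exercise; the class codes below are illustrative, not the FAO legend:

```python
import numpy as np

# One grid cell's patch of a categorical land-cover raster.
# Illustrative class codes: 0 = cropland, 1 = wetland, 2 = urban.
cell = np.array([[0, 0, 1],
                 [0, 1, 1],
                 [2, 0, 0]])
counts = np.bincount(cell.ravel(), minlength=3)
fractions = counts / cell.size  # share of each class in the cell
print(fractions)
```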

After data wrangling, we explore several questions to build models. Some have decent results that are worth continued work.

Deep Learning Models:

In this method, the dataset is composed of image patches randomly sampled from color-coded map of Senegal, where each color represents a different feature of the land. We currently have a convolutional neural network that can classify whether a patch of land will be flooded within the next year at 85% accuracy. The dataset was labeled using a simple algorithm that counts occurrences of RGB values within a specified range, and if the number of occurrences is above a certain threshold then the image is labeled as ‘flooded’.
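The labeling algorithm can be sketched as follows; the RGB range and pixel threshold here are illustrative stand-ins for the actual values used:

```python
import numpy as np

# Label a patch "flooded" when enough pixels fall in the flood color range.
# Illustrative assumptions: floods are rendered in a bluish range, and four
# such pixels suffice - the real map legend and threshold would differ.
FLOOD_LO, FLOOD_HI = np.array([0, 0, 150]), np.array([80, 80, 255])
PIXEL_THRESHOLD = 4

def label_patch(patch):
    in_range = np.all((patch >= FLOOD_LO) & (patch <= FLOOD_HI), axis=-1)
    return "flooded" if in_range.sum() >= PIXEL_THRESHOLD else "not flooded"

patch = np.zeros((3, 3, 3), dtype=int)  # toy 3x3 RGB patch
patch[:2, :2] = [10, 10, 200]           # four bluish pixels
print(label_patch(patch))  # flooded
```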

The classification accuracy is used to gauge how well the current network architecture encodes information, so that an autoencoder model capable of generating flood patterns within specific images can be built on top of it. Another model in progress predicts flood risk in a specified area over time; an RNN is being used for this purpose. We hope to represent risk in different areas using a choropleth map.

Machine Learning Models:

  1. Regression: Can we model the average flooded area of each grid cell over all study days using the static topographic features and an average of the weather features?

Random forest model: R-squared = 0.806

  2. Classification: Can we model whether a grid cell is flooded or not over the study dates, where "flooded" is determined by a variable threshold on the percentage of flooded area in the cell?

Stochastic gradient boosting: Accuracy = 0.895 when the threshold is 0.25

  3. Regression: Extending question 2, can we model the number of floods over the study dates under a given threshold?

Random forest model: R-squared = 0.734

  4. Regression: Extending question 3, can we train on 2015 data and test on 2016 data to evaluate model performance?

Random forest model: R-squared = 0.730

  5. Regression & Classification: Can we predict the amount of flooding per cell per date, or whether it flooded or not by setting a threshold on the amount of flooding?

Generalized linear model: R-squared = 0.155
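The temporal evaluation in question 4 can be sketched with a year-based split on synthetic rows (features, target, and scores below are illustrative, not our actual results):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic feature table with a year column: train on 2015, test on 2016.
rng = np.random.default_rng(1)
years = np.repeat([2015, 2016], 150)
X = rng.random((300, 4))
y = 0.7 * X[:, 0] + 0.1 * rng.random(300)  # stand-in flood target

train, test = years == 2015, years == 2016
model = RandomForestRegressor(n_estimators=50, random_state=1)
model.fit(X[train], y[train])
r2 = r2_score(y[test], model.predict(X[test]))
print(round(r2, 3))
```

The point of the split is that the test year is entirely unseen, which better reflects how the model would be used for future predictions.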

In the next week, we will keep exploring models for the above questions.