Seeing like a bike

WEEK 6: WHAT WE HAVE DONE:

Gas sensor calibration

Materials used:

  • Wine preserver gas + vacuum bag
    • Baseline calibration

As we mentioned in our previous blog post, our first attempt to calibrate the gas sensors used helium balloons, and it did not work. It took a while to get the materials we needed for the calibration, but we are now done with this process and ready to move on.

Gas sensor calibration process: N2O, CO, O3
Gas sensor baseline values
After filling the vacuum bag with the wine preserver gas, which consists of nitrogen (N2), argon (Ar), and carbon dioxide (CO2), the gas sensors gradually responded; these readings give us the baseline values. The following graph illustrates them.

Gas sensor baseline values.

A new 3D-printed case

We made a stronger 3D-printed case. The previous version was made of ABS plastic, which is too low-quality a material for our needs. The new case is made of UV-cured resin, which is much better than the previous one: it is stronger, softer, and more manageable.

3D-printed case.

Our system broke

On the way to our midterm presentation, we accidentally dropped the box that mounts on the back of the bike. This was the only box we had ready to collect data, and some of the components inside were damaged. We were sad that we could not show our system working during the presentation, but the drop actually turned out to be useful, since it exposed some weaknesses in our setup. We are now repairing the damage and figuring out how to make the system sturdier.

NEXT STEPS  

Our goals for week 7 are:

  1. Fix our broken box and build at least three more
  2. Co-locate our sensors at the Atlanta sensing station
  3. Have at least three equipped bikes
  4. Collect pilot data

 

Food for Thought: ACFB Tour and…?

Our team with Lauren at ACFB. John, Mizzani, Miriam, Lauren, and Dorris (left to right)

This week, along with the mid-program presentation, the team was invited for a tour of the Atlanta Community Food Bank. Beyond the advanced operations of the warehouse facilities, what caught my attention was how passionate everyone working there was about what they were doing. Lauren Waits, one of our main stakeholders and the director of governmental affairs at ACFB, gave us a tour of the facility; not only did she know almost all the people working in the warehouse and the office, she would stop to talk with them and ask how they were doing. Some of the staff explained what they do as they worked, and even though the tour was informal, they spoke with passion and taught us a great deal. Since the food bank, and the Atlanta Community Food Bank in particular, was described in a previous blog post, I will just say that the tour made clear to me that food insecurity is an issue that shouldn't be overlooked, and it got me thinking about how our tool can help our stakeholders with their media agenda.

 

This is the freezer and cooler on the grocery floor, where the partner organizations that distribute food come to "shop" for 10 minutes

 

 

This is the grocery floor entrance, where partner organizations come to load the food into their vehicles

Inside the Warehouse. The turnover of the warehouse is 2 to 3 times a month.

This is where small donations are sorted and packaged

Types of packages prepared here, including hygiene materials as well as cleaning supplies

General flow of what we are working on

To touch on our work, our team has been working on three major pieces: a sentiment analysis and text mining visual, census data and map overlays, and a visual of legislators' voting records. Because we had our mid-program presentation this week, we focused more on making the presentation slides as well as a poster. This process helped us determine what to focus on moving forward. With less than four weeks left in the program, our team hopes to finalize the tool before adding anything else to it.

To talk about my personal part of the project, sentiment analysis and topic modeling: beyond the general idea of weighting and aggregating the sentiment scores of the articles, it has proven very difficult to come up with an equation or formula whose output even roughly matches our gut estimate of public sentiment. The approach is top-down, where the factors guessed to affect public opinion are a site's traffic, readability, and media bias. Because of these roadblocks, I plan to review more of the social science literature on estimating public sentiment in the coming week. This will be especially helpful because the paper for the Bloomberg Data for Good Exchange is due on July 8th.
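
As a rough illustration of the kind of weighting we have been experimenting with, the sketch below aggregates per-article sentiment into a single score. The traffic, readability, and bias factors and the way they are combined are hypothetical placeholders, not a formula we have settled on.

```python
# Sketch: weighted aggregation of per-article sentiment scores.
# The weighting factors and their combination are illustrative placeholders.

def aggregate_sentiment(articles):
    """articles: list of dicts with 'sentiment', 'traffic', 'readability', 'bias_penalty'."""
    weighted_sum, total_weight = 0.0, 0.0
    for a in articles:
        # Assumption: more traffic -> more influence; hard-to-read or heavily
        # biased outlets are discounted.
        weight = a["traffic"] * a["readability"] * (1.0 - a["bias_penalty"])
        weighted_sum += weight * a["sentiment"]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0

articles = [
    {"sentiment": 0.4, "traffic": 1200, "readability": 0.8, "bias_penalty": 0.2},
    {"sentiment": -0.6, "traffic": 300, "readability": 0.6, "bias_penalty": 0.5},
]
print(aggregate_sentiment(articles))
```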

On the topic modeling side, a clever aggregation method is also needed to make sense of the data. LDA, tf-idf, n-grams, and NER (all mentioned in a previous blog post) give a lot of information about which words are (supposedly) more relevant to the text. However, because of the sheer volume of words and text, it is difficult to weight the words from each article. Using an aggregation technique similar to the one for sentiments, these words will also be weighted so that the keywords used within the documents can be visualized quickly.
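
For the LDA piece, a minimal sketch using scikit-learn is shown below; the three toy documents stand in for our scraped articles, and the number of topics is arbitrary.

```python
# Sketch: LDA topic extraction with scikit-learn on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "food stamp benefits cut in state budget",
    "snap recipients face new work requirements",
    "food bank demand rises as benefits shrink",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)          # document-term matrix of word counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}: {top_words}")
```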

 

Unfortunately, chocolates, baby food, and bread don't make the cut because of how they would have to be transported to the people

To make this tool more sustainable for the user, our next step is to automate the article gathering, cleaning, and analysis so that we can see the trend of sentiment for individual news outlets as well as for locations. After the paper submission, I will focus on making the tool much more user-friendly so that it is ready to be deployed as the end of the program nears.

 

SEEING LIKE A BIKE: Calibration and system debugging

Sensor calibration was the group's main focus for the sixth week of the program. The gas sensors and the GPS proved tougher to tame than the others.

The gas sensors seemed to produce a wide range of values with no consistency at all. To calibrate them, we need to measure the voltage of each sensor while it sits in an atmosphere devoid of the gas whose concentration it measures. This voltage serves as the base value, which we feed into the code and treat as the "zero".
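
A minimal sketch of that baseline step is shown below, assuming a read_voltage() callable that stands in for whatever ADC read the Arduino actually performs.

```python
# Sketch: record a baseline ("zero") voltage while the sensor sits in the
# gas-free bag, then offset later readings against it.
import time
import statistics

def record_baseline(read_voltage, samples=60, interval_s=1.0):
    readings = []
    for _ in range(samples):
        readings.append(read_voltage())
        time.sleep(interval_s)
    # The median is a bit more robust to the occasional spike than the mean.
    return statistics.median(readings)

def to_calibrated(raw_voltage, baseline):
    # Anything at or below the baseline is treated as "zero" concentration.
    return max(0.0, raw_voltage - baseline)

if __name__ == "__main__":
    import random
    fake_read = lambda: 0.40 + random.uniform(-0.01, 0.01)  # stand-in for the real ADC read
    baseline = record_baseline(fake_read, samples=5, interval_s=0.1)
    print(baseline, to_calibrated(0.55, baseline))
```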

The first method we tried was to gently release helium over the sensors using helium balloons bought at a local store. This proved inefficient: the helium escaped faster than the sensors could respond, so there was never enough time for the sensors to record the voltages corresponding to the helium before the balloon ran out of gas. Another downside was that helium balloons usually still contain about 5% ordinary air, which would also affect the calibration of the sensors.

To avoid this, our next thought was to use a vacuum chamber that we could fill with helium, giving the sensors ample time to adjust to the environmental change while also getting rid of interference from external air. We went on a hunt for a vacuum chamber but turned up empty-handed.

The option that opened up next was a godsend. We are going to be given access to the nearby sensing station, where we will co-locate our sensors with the existing sensors, detect any variance in the values, and adjust our sensors accordingly.

Another issue that arose is that every gas, if present at a high enough concentration, also affects the sensors for the other gases.

CODE DEBUGGING:

The matrix at the front has an LED array that indicates the status of the different sensors.
Green means the sensor is working and receiving data.
Blue means the sensor is working but not receiving correct data.
Red means the sensor is not working at all.

During the final setup and mounting of the system on a bike, we realized that the LEDs for the sensors connected to the Arduino were red. This posed a huge problem, as the major portion of the system had now decided it just didn't want to work!

At first we thought the cause was that the Raspberry Pi has a certain delay on startup that the Arduino doesn't, so the two boards were out of sync. We turned out to be partially right.

It turned out that the Arduino first sends out garbage "nack" values, and these values follow a particular sequence before the actual data transmission begins. This was the cause of the lag between data transmission and collection. So we coded the Raspberry Pi to be a master device controlling the slave Arduino: it can now detect when the last character in the garbage sequence appears and then reset the Arduino for data transmission, so that the two boards stay in sync.
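
A rough sketch of the master-side logic on the Pi is below. The serial port, baud rate, the "nack" sentinel, and resetting the Arduino by toggling DTR are all assumptions for illustration, not our exact implementation.

```python
# Sketch of the master/slave idea: read the Arduino's serial stream, wait for
# the startup garbage to finish, then reset the Arduino so real data starts
# in sync with the Pi.
import time
import serial

PORT, BAUD = "/dev/ttyACM0", 115200   # assumed port and baud rate
END_OF_GARBAGE = b"nack"              # hypothetical last token of the startup noise

with serial.Serial(PORT, BAUD, timeout=1) as ser:
    # Read until the final garbage token has been seen.
    while True:
        line = ser.readline()
        if END_OF_GARBAGE in line:
            break
    # Toggle DTR to reset the Arduino (works on Uno-style boards).
    ser.dtr = False
    time.sleep(0.5)
    ser.dtr = True
    # From here on, lines should be real sensor data.
    for _ in range(10):
        print(ser.readline().decode(errors="ignore").strip())
```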

DEPLOYMENT AND PILOT DATA:

Yesterday saw the first 15-minute test run of the fully functional system, although certain sensors were not yet calibrated.
Running a preliminary visualization on the pilot data yielded the following charts.


In the first row we see the left sonar data and right sonar data, respectively.
The second row shows us the gas sensor data.
The left LIDAR shows fewer peaks than the right LIDAR, which is what we would expect: on the right, where the sidewalk is, there are many more obstacles than on the road side.

The gas sensor data was plotted only to check whether the sensors produce data under real-world conditions; the values themselves mean nothing, as the sensors have not yet been calibrated.

For some reason, three minutes into the test, the GPS stopped writing to the JSON file, so we were limited to three minutes of test data.
We aren't really sure why this problem arose, which brings me to what we plan to do in the coming week.

NEXT WEEK:

Next week will see the use of the sensing station for gas sensor calibration. We will also make adjustments for the cross-sensitivity of each gas to the other sensors.
De-linking the data collection frequency of the other sensors from the GPS time will be another task; this would allow us to use the data even when the GPS fails.
Speaking of failing GPS, we will also have to understand and troubleshoot why the GPS stopped writing to the JSON file.

 

Halfway to Justice

 

This week we put a lot of work into JUMA, the Justice Map, for the Atlanta Legal Aid Society. After we added various socioeconomic layers, a search box, and edit features, our contact at the Society was so impressed that she wants us to do much more! We're currently adding Zillow data along with additional data from the Legal Aid Society. The Society has also told us about various shady (and legal) ways residents can easily lose their properties; ask us about them.

For our other project, estimating the number of residents who could qualify for the Anti-Displacement Tax Fund, we have split the work into two subprojects: estimating residents' incomes, and forecasting property tax assessments. Income estimates are being generated from Census data, IRS data, and conversations with local residents. So far, we've determined that most residents in the affected areas should qualify if they own their homes, since their incomes are generally much lower than the requirement. Right now, we are trying to figure out home ownership rates by comparing owner addresses with parcel addresses.

For the tax assessment forecasts of homes in the neighborhoods with current beltline construction, we decided to use the known impacts of the completed beltline in the Old Fourth Ward neighborhood. A cluster analysis was performed on time series tax assessment data for 2200 homes in the Old Fourth Ward from 2005 to 2016. The time series were first differenced and scaled by the previous time-step tax assessment to create a new time series of the percent changes in value (to ensure comparative scales since there is a large range in home values). The cluster analysis resulted in 3 discrete clusters: (1) homes with a large increase in assessments after the beltline announcement in 2004, (2) homes with an increase after construction, and (3) homes that followed the more common recession and post-recession trend observed elsewhere. It’s interesting that the recession had less impact on the homes in this area compared with national trends; maybe the beltline provided some insulation. Random forests were then constructed to determine the most important home characteristics for each cluster. The distance to the beltline, the size of the land/home, and the nominal value were the most important features for classification, as we expected.
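
A minimal sketch of that preprocessing and clustering step is below; the tiny DataFrame is synthetic, standing in for the roughly 2200 Old Fourth Ward homes, and k-means stands in for whichever clustering algorithm was actually used.

```python
# Sketch: build year-over-year percent-change series and cluster them.
import pandas as pd
from sklearn.cluster import KMeans

assessments = pd.DataFrame(
    {2005: [100, 200, 150], 2006: [110, 205, 150], 2007: [130, 210, 148]},
    index=["home_a", "home_b", "home_c"],
)

# Difference and scale by the previous year's assessment (percent change) so
# homes with very different nominal values are on comparable scales.
pct_change = assessments.pct_change(axis=1).dropna(axis=1)

kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
labels = kmeans.fit_predict(pct_change.values)
print(dict(zip(pct_change.index, labels)))
```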

Now we need to group the homes in the West-side neighborhoods, where the beltline is currently being built, by the characteristics deemed important in the random forest results. These West-side groups will be matched with the Old Fourth Ward clusters, and the West-side tax assessment forecasts will be modeled from the corresponding Old Fourth Ward trends. We're also considering other models for comparison, such as random forests for clustering and recurrent neural networks for forecasting.

Still so much to do:)

Food for Thought: A Heap of Data Chores

Last Friday, we met with our partners from the Food Bank and they gave us further insights on the directions they would like us to take for our data collection. In addition, they gave us feedback on the SNAP app that we are building for them. They plan to host this app on their website and in the near future, they would like us to give a presentation about our app to relevant stakeholders. These regular meetings with our stakeholders are valuable because it is a way for us to make sure that the data collection, data analysis, and app creation are in alignment with their overall goals for this project. To better understand the internal operations of the Atlanta Community Food Bank, we will be taking a tour of the food bank next week.

Data Chores

Data Cleaning
This week, we spent a significant amount of time cleaning and organizing the data we collected. We have been working with data from ProPublica Congress and Open States to determine the voting records of Georgia politicians on issues related to food stamps. In addition, we have cleaned and organized the Twitter data based on whether the tweets are geo-tagged and contain relevant information about SNAP. Since only about one percent of tweets are geo-tagged, the number of geo-tagged tweets is quite small, and the number of geo-tagged tweets relevant to SNAP is even smaller. We also have Facebook data, which has been cleaned and organized as well. We plan to run sentiment analysis and social network analysis on this data to better understand the discourse about SNAP on social media.

Data Crunching
One main chore that we are dealing with this week is the sentiment analysis of the 1,600 articles we collected through Webhose, a site that allows one to scrape websites based on search terms. As previously mentioned, the two sentiment analysis tools we are using are Vader and AFINN. Each article is treated as an instance in the data, and each instance is tokenized into sentences and words to extract features from it. Sentiment analysis, in particular, uses sentence tokenization and gives a numerical score. The difficulty with the sentiment analysis lies in deciding what weight, or how many people, each instance should represent: it would not be fair to give every text the same weight when some show evidence of more views (likes, hits, retweets). To cope with this problem, features about each instance were gathered. Additionally, information about the arguments and topics that frequently appear within these articles would be very useful to the stakeholders. To get at this, preliminary topic modeling using Latent Dirichlet Allocation (LDA) has been performed to extract topical words from the set of texts, and other information such as n-grams, named entity recognition, and tf-idf (term frequency-inverse document frequency) has been used as well.
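
Scoring a single article with both tools (Vader via NLTK and the afinn package) might look roughly like the sketch below; the example article text is made up.

```python
# Sketch: sentence-level Vader and AFINN scores for one article.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from afinn import Afinn

nltk.download("vader_lexicon", quiet=True)
nltk.download("punkt", quiet=True)       # sentence tokenizer models
nltk.download("punkt_tab", quiet=True)   # needed by newer NLTK versions

article = "SNAP benefits were expanded this year. Critics argue the program is too costly."

sentences = nltk.sent_tokenize(article)
vader = SentimentIntensityAnalyzer()
afinn = Afinn()

vader_scores = [vader.polarity_scores(s)["compound"] for s in sentences]  # each in [-1, 1]
afinn_scores = [afinn.score(s) for s in sentences]                        # summed word scores per sentence

print(sum(vader_scores) / len(sentences), sum(afinn_scores) / len(sentences))
```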

Data Organization
The collected articles are currently being organized so that we can conduct social network analysis on them. The analysis will be run not only on all of the articles together, but also by news source and by news sources in Georgia. This will help us better understand how SNAP is being reported on at various levels.

Data Assembly
We are currently working on more add-ons for our SNAP App. This week, we added background information about the SNAP program to the app. In addition, we have been working on tying an automated social media analytics tool into the app so that the client can see, in real time, an update of the press surrounding the main SNAP topics. We have been looking into how the sentiment analysis could be automated, and into frameworks such as Hadoop or Spark for storing and analyzing the large amount of data we have. We are also exploring the possibility of integrating Google Trends and Google Keyword Planner into the app.

Next Week
Next week, we will continue with these data chores. We also hope to begin adding more features to the SNAP app, such as politician tracking and Google Trends.


Modeling Energy Usage at Georgia Tech

We started out with just two pieces of information to model energy usage at Georgia Tech: the date-time and the energy usage at every hour for the past three years. Below is a visualization of the raw data for one of the buildings. As you can see, we decided to train all of our models on the first two years of data and reserve the last year to assess the performance of the model.

Using just the date and time of day alone, it is difficult to predict energy usage accurately. However, with those variables, we were able to engineer several other features including the month, day of the week, day of the year, hour of the day, and an indicator variable for whether or not it is a holiday. Our best model using these variables as predictors was a Generalized Additive Model (GAM), which gave us an R-squared of 0.55 and an average error of about 24.6 kWh (8.6%). According to our model, the hour of day was the most important predictor of energy consumption at a given hour. This was not a bad baseline model, but there was plenty of room for improvement.
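
A sketch of that feature engineering on the hourly timestamps is below; using the `holidays` package for the holiday indicator is an assumption about tooling, not necessarily what we did.

```python
# Sketch: derive calendar features from hourly timestamps.
import pandas as pd
import holidays

us_holidays = holidays.UnitedStates()

def engineer_features(timestamps: pd.DatetimeIndex) -> pd.DataFrame:
    return pd.DataFrame({
        "month": timestamps.month,
        "day_of_week": timestamps.dayofweek,
        "day_of_year": timestamps.dayofyear,
        "hour": timestamps.hour,
        "is_holiday": [int(ts.date() in us_holidays) for ts in timestamps],
    }, index=timestamps)

hours = pd.date_range("2014-01-01", "2016-12-31 23:00", freq="H")
print(engineer_features(hours).head())
```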

Incorporating External Data

We have since incorporated two external datasets into our model. We scraped weather data from Weather Underground, which provides real-time weather data down to the minute. From that, we were able to get the temperature, humidity, etc. at any point over the last few years. The other data we extracted was class schedule information from OSCAR, Georgia Tech’s online student portal. From this, we were able to determine the number of classes that took place (as well as the number of students enrolled in those classes) at any given point, in any given building, over the last three years. Our hypothesis was that there would be a strong, positive correlation between the number of classes taking place and energy consumption.

Including all of the above information into our models improves our results significantly. Our best overall model of energy usage so far is a weighted average of a GAM and gradient boosted decision trees. Both methods on their own work fairly well, but when we average their predictions together, we get results that are superior to either method individually. This model gives us an R-squared of 0.75 and an average error of about 19.4 kWh (6.7%). Below is a graph of the predictions of our model superimposed over the actual data. Our model does reasonably well, but it tends to underpredict extreme values. Going forward, we will continue to try other methods and add more features in an effort to improve our predictions.
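
The blending step itself is simple: train two models and average their predictions. The sketch below uses synthetic data, a random forest standing in for the GAM, and an illustrative 50/50 weight rather than our tuned one.

```python
# Sketch: weighted average of two regressors' predictions.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
other = RandomForestRegressor(random_state=0).fit(X_train, y_train)  # stand-in for the GAM

w = 0.5  # blend weight; in practice this would be tuned on validation data
blend = w * gbm.predict(X_test) + (1 - w) * other.predict(X_test)
print(r2_score(y_test, blend))
```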

Interpreting Results

These models are not just useful for prediction. One useful output of tree-based algorithms like gradient boosting is the set of relative variable importances, which, roughly speaking, tell us how much of the variation in energy usage is explained by each variable. As we can see, the number of classes is by far the most important predictor of energy usage, followed by the hour of the day, the number of students, and the day of the year.

Using the GAM model, we can make inferences about what the precise nature of these relationships actually is. For example, for every additional class being held in the Clough Undergraduate Learning Commons (CULC), we estimate that the energy expenditure will increase by approximately 1.4 kWh, holding all other predictors constant. This kind of information can provide useful insights into how to improve energy efficiency on campus.

Going Forward

Currently, we are looking exclusively at the CULC. Later on, we will use what we learned from modeling the CULC to model other buildings on campus as well as the campus as a whole. That way, we can target buildings that are most inefficient and in need of an upgrade. Also, since there are many buildings on campus which don’t hold classes, the class schedule data will have limited utility. As a result, we plan on using information about the number of people connected to each building’s WiFi as a proxy for building occupancy.

 

Seeing Like a Bike: Towards Integrating the Sensor System

The Seeing Like a Bike team is now in the stage of wrapping up the sensor box and integrating the parts as a system. Each level of the sensor system design, from hardware to software, is going through an iterative refinement process to collect better data and provide a seamless experience to end users.

Box Design

The sensor box needs to withstand external pressure and shocks. It also needs to provide an easy-to-use interface for users. To provide shock, pressure, and vibration tolerance, we aim to make the boxes out of sturdy ABS. Before working on the actual ABS boxes, we first tested our designs using wooden plates, since that was more efficient in terms of time and cost. After several iterations, we arrived at a design that can house the complex arrangement of sensors, a battery, and an Arduino board. After finalizing the structure of the box, we tried to laser-cut an ABS box. However, using a laser cutter to make the required holes and ventilation slits was tricky because of a characteristic of ABS: it burns easily when exposed to the laser. After several experiments, we found a good way to cut ABS boxes: a fast laser pass at low power reduces the burning, and cutting the ABS multiple times with this weak laser makes it possible to produce relatively neat holes and slits.

The front case, i.e., the server case, is also being redesigned and constructed on a 3D printer. Since the GPS device has been moved to the front, we redesigned the case, and it is now slowly being printed. We are still refining the locations of the slits and holes, but we are almost there!

Data Quality and Computing Efficiency

Even though the data collection functionality and high-level board optimizations were completed last week, we have continued to optimize the code structure and data formats. This optimization is not only for data quality at the acquisition stage, but also for power efficiency (to extend the runtime of the device). It involves (1) running the code in command-line mode; (2) minimizing the use of REST APIs; (3) using GPS timestamps for all sensory data; and (4) setting an appropriate time interval for each sensor. We also redesigned the LED operations to keep the indicators simple.

The Raspberry Pi server was originally running in the windowed (desktop) mode because that was convenient for setting up auto-run. However, running the desktop mode on a battery-powered Raspberry Pi is inefficient from the longevity perspective. This led us to dig into the Linux settings, and we finally changed the boot mode to the command line. We also tried to minimize the use of REST APIs. While they provide powerful interfaces for other applications to communicate with the server, they also create another layer of software, i.e., a network layer, that each sensor reading has to go through. This is not critical for power consumption, but we still moved several sensor communications to simple socket connections to minimize script execution.

Timestamps were another challenge in the original system, since the Raspberry Pi's clock is not a universal one. Because of the inconsistency and inaccuracy of the times across devices, synchronization techniques were needed to keep the timestamps consistent across different datasets. To make this simple and accurate, we came up with a solution: the GPS timestamp, which comes from the satellites, is used for the entire dataset. Since GPS data is collected every 200 ms, we update a global time every 200 ms, and the other sensors use this global time when logging their readings.
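
A simplified sketch of that shared-clock idea is below; read_gps() and read_sonar() are placeholders for the actual device reads.

```python
# Sketch: the GPS reader updates a shared timestamp every 200 ms, and every
# other sensor stamps its records with that value.
import threading
import time

class GlobalClock:
    def __init__(self):
        self._lock = threading.Lock()
        self._timestamp = None

    def update(self, gps_timestamp):
        with self._lock:
            self._timestamp = gps_timestamp

    def now(self):
        with self._lock:
            return self._timestamp

clock = GlobalClock()

def gps_loop(read_gps):
    while True:
        fix = read_gps()        # assumed to return a dict with a 'time' field
        clock.update(fix["time"])
        time.sleep(0.2)         # GPS polled every 200 ms

def log_sonar(read_sonar, log):
    while True:
        log.append({"t": clock.now(), "sonar": read_sonar()})
        time.sleep(0.2)
```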

Finally, we set the data collection interval for each sensor. Air quality, for example, does not change very quickly, so it is fine to collect that data every second. Meanwhile, the acceleration of the bike can change very quickly depending on its movement, so the accelerometer needs to collect data more frequently than the air quality sensors. As part of this effort, the ultrasonic sonar sensors' interval has been set to 200 ms based on a simple physics model. We believe this series of parameter settings and software refinements will give us better raw data quality and energy efficiency.

Next Steps

The next steps are to connect everything with the new box and case, to calibrate some sensors (to make sure the sensory data is correct), to deploy the system on a bike, and to start collecting pilot data. Once the pilot data looks good, we will build more of the same boxes and deploy them on passionate riders' bikes. We cannot wait to head out and ride a bike with our system.

Housing Justice: From Maps To Models

This week, we made good progress on both of our projects: the Anti-Displacement Tax (ADT) model and the Harbour Portfolio mapping tool. We read a paper by Dan Immergluck related to the ADT, which gave us insight into what he has done so far and what we can do differently in our model. For the Harbour mapping tool, we added a search box, map overlays, and a pop-up to display the data. Let's look at both projects in detail.

We started the week off by reviewing the paper written by Dan Immergluck, which was about understanding the most important variables in home value appreciation. However, the model used in the paper wasn't quite what we wanted, so we hope to use it as a basis for creating our own model to predict assessed home values. Jeremy suggested using a neural network; specifically, we are looking at a recurrent neural network. We are also using the Consumer Expenditure data and the Tax Assessor data to build other features for our model. Estimating the assessed home values will help us determine the total cost of the ADT. We are also looking at tools to display our data in a way that is most useful to the community, and potentially creating something we could submit to the Bloomberg Data for Good Exchange.

The Harbour mapping tool is developing really well. Takeria has been incorporating an edit function and got the pop-up data up and running on the map; the data pops up only when a property is clicked, and she used the GeoJSON file's properties to populate the pop-up. Additionally, Bhavya and Takeria have been trying to integrate a database with the GeoJSON file. Bhavya added and styled the overlays of the economic and racial data in the app: right now, when you click either the "race" overlay or the "econ" overlay, you can hover over an area and the map outlines that area's zip code. Vishwamitra got the search box displayed, and he is working on having it query the database and use the results in the map. We are currently looking for database options that are compatible with GeoJSON files. Hayley and Jeremy also worked on creating a comprehensive list of Harbour properties to display on the map and merged it with the rest of the Tax Assessor data. We hope this information will be useful to our partner, Atlanta Legal Aid, in their predatory lending case.

Overall, we are happy that both projects are progressing well. For the ADT, we have the idea of using a neural network model to predict assessed home values, though we still need to discuss the details a bit more. For the mapping tool, we look forward to incorporating the database values into the map and deploying our work on the server.

Food Bank: initial sentiment and network analyses

We’ve made a lot of progress since last week. Most of our work has been devoted to sentiment analysis and network analysis.

Sentiment Analysis
To examine themes in SNAP/food stamp coverage, we initially scraped articles from the past month that included the words "food stamp" or "food stamps" in the title and calculated how often each stemmed word appeared. As a first visualization of this information, we made word clouds. The word cloud below shows the most common words in conservative articles about food stamps:

In order to further analyze the content of the news articles and social media posts that we’ve scraped, we’re doing sentiment analysis on the text. To do this, we examined various metrics about these texts such as the complexity of the words, the reading level, the punctuation, and whether the sentences in the articles are positive or negative.

To quantify the tone of the articles, we used Vader, a sentiment analysis tool from the Natural Language Toolkit, as well as AFINN, another sentiment metric. For Vader, sentence scores range from -1 (negative) to 1 (positive); for AFINN, they range from -5 to 5. For both metrics, a score above 0 indicates that the sentence has a positive sentiment. We placed each article in a category (e.g., Economy, Opinion, News) and found the average for each category. The graph below shows the average total AFINN score vs. the average total Vader score, with the size of each bubble reflecting the number of articles.

Interestingly, the Vader score suggests that all article categories had a positive sentiment (all > 0), while the AFINN score suggests that only the local, international, and opinion categories were positive, on average.

Georgia Representatives
When we talked with the food bank last week, they expressed interest in an analysis of how Georgia politicians speak about SNAP on Twitter. Our research suggests that Georgia politicians do not speak about the issue frequently enough for us to have sufficient data to analyze. Instead, we are considering creating a visualization that tracks how representatives have voted on legislation regarding SNAP. We are doing this using the Open States API, which has data on bills, legislators, and events in state governments, and the ProPublica Congress API, which has national data.

We meet with the Atlanta Community Food Bank again tomorrow and will consult with them to better understand how they currently follow food policy and how we could use these APIs to analyze and present this information for them.

Network Analysis

Another strategy we are using to analyze news articles is a term frequency-inverse document frequency (tf-idf) network analysis, done in Gephi on Washington Post articles about food stamps. In the graph above, the bigger, darker circles are more connected; words are connected if they appear in the same sentence. We were unsure why Perdue and Southerland were so connected. After researching the names, we learned that Steve Southerland is a Florida congressman who wants to impose work requirements on people who get SNAP, and Sonny Perdue is the Secretary of Agriculture.
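
A simplified sketch of how such a co-occurrence network can be built and exported for Gephi is below; it uses raw co-occurrence counts rather than tf-idf weights, and the two-sentence text is a stand-in for the Washington Post articles.

```python
# Sketch: word co-occurrence network (words linked if they share a sentence),
# exported to GEXF so it can be opened in Gephi.
from itertools import combinations
import networkx as nx
import nltk

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)   # needed by newer NLTK versions

text = "Southerland backed new work requirements. Perdue and Southerland discussed SNAP."

G = nx.Graph()
for sentence in nltk.sent_tokenize(text):
    words = {w.lower() for w in nltk.word_tokenize(sentence) if w.isalpha()}
    for a, b in combinations(sorted(words), 2):
        weight = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(a, b, weight=weight)

nx.write_gexf(G, "food_stamp_cooccurrence.gexf")
```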

 

Next Steps

Going forward, we hope to do a more granular sentiment analysis that can help us extract arguments from our text. We also need to clean the data we've scraped from Facebook, and we are starting to learn how Google Trends can be a useful tool for us going forward.

 

UN Project: Features Processed and Models for Flood Risk Prediction

After the exploration of data sources, this week we focused on the geographic processing of the data and started to explore some preliminary questions in flood risk prediction.

Geo-processing of data: for the 1 km x 1 km grid cells created as learning units in the Ziguinchor region, we extract the topographic and weather features for each cell.

1) Weather: in the NOAA weather data, there are two weather stations located in the region we study, so the weather feature for each grid cell is calculated as a weighted average of the two stations based on the distance to each. Since the flood images are 14-day composites, the weather data is aggregated on the same time scale. As the figure shows, the total flooded area in this region is highly consistent with the precipitation and dew point across all the time points from 2015 to 2017.
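
As a small illustration, a distance-weighted average for one grid cell and one variable might look like the following; inverse-distance weighting is our assumption about the exact scheme.

```python
# Sketch: inverse-distance weighted average of a weather variable measured at
# the two NOAA stations, for one grid cell.
def weighted_station_value(value_a, dist_a, value_b, dist_b):
    # The nearer station counts for more.
    w_a, w_b = 1.0 / dist_a, 1.0 / dist_b
    return (w_a * value_a + w_b * value_b) / (w_a + w_b)

# Example: 12 mm of rain at a station 3 km away, 8 mm at one 9 km away.
print(weighted_station_value(12.0, 3.0, 8.0, 9.0))
```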

 

2) Water area and waterways: both are spatial data, with polygons or lines on the map representing water information at particular locations. For each grid cell, we calculate its intersection area with the water-area polygons, and the distance from the cell centroid to the nearest waterway. These two features capture the geographic relationship between our target area and nearby water.
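
A toy sketch of those two computations with geopandas and shapely is below; the geometries are made up, and the real inputs are the grid shapefile and the water layers.

```python
# Sketch: per-cell overlap area with water polygons, and centroid distance to
# the nearest waterway line, on toy geometries.
import geopandas as gpd
from shapely.geometry import box, LineString, Polygon

grid = gpd.GeoDataFrame(geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)])
water_polys = gpd.GeoDataFrame(geometry=[Polygon([(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5)])])
waterways = gpd.GeoDataFrame(geometry=[LineString([(0, 2), (2, 2)])])

water_union = water_polys.unary_union
waterway_union = waterways.unary_union

grid["water_area"] = grid.geometry.intersection(water_union).area
grid["dist_to_waterway"] = grid.geometry.centroid.distance(waterway_union)
print(grid)
```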

3) Elevation and slope: elevation is raster data at 3 arc-second resolution, and it is also used to derive slope values. We apply zonal statistics to the raster file and the grid shapefile to obtain the average, max, min, and standard deviation of elevation and slope in each grid cell.
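
A sketch of that zonal-statistics step with the rasterstats package is below; the file paths are placeholders, not our actual file names.

```python
# Sketch: per-cell summary statistics of an elevation raster over the grid.
from rasterstats import zonal_stats

stats = zonal_stats(
    "ziguinchor_grid.shp",     # placeholder path to the 1 km grid shapefile
    "elevation_3arcsec.tif",   # placeholder path to the elevation raster
    stats=["mean", "max", "min", "std"],
)
print(stats[0])  # e.g. {'mean': ..., 'max': ..., 'min': ..., 'std': ...}
```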

4) Land cover: water storage capacity has a great effect on the formation of floods, so land cover type could be a very predictive feature for flood risk. We downloaded a land cover map from the Food and Agriculture Organization of the United Nations and calculated the percentage of each surface type for each grid cell.

After this data wrangling, we explored several questions and built models for them. Some have decent enough results for us to keep working on.

Deep Learning Models:

In this method, the dataset is composed of image patches randomly sampled from a color-coded map of Senegal, where each color represents a different feature of the land. We currently have a convolutional neural network that classifies whether a patch of land will be flooded within the next year with 85% accuracy. The dataset was labeled using a simple algorithm that counts occurrences of RGB values within a specified range; if the number of occurrences is above a certain threshold, the image is labeled as 'flooded'.
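
A sketch of that labeling rule is below; the RGB bounds and the pixel-count threshold are placeholders rather than the values we actually used.

```python
# Sketch: label a map patch 'flooded' if enough pixels fall in the RGB range
# used for flooded areas on the color-coded map.
import numpy as np

FLOOD_LOW = np.array([0, 0, 120])        # placeholder lower RGB bound
FLOOD_HIGH = np.array([100, 100, 255])   # placeholder upper RGB bound
THRESHOLD = 50                           # placeholder pixel-count threshold

def label_patch(patch: np.ndarray) -> int:
    """patch: H x W x 3 uint8 array. Returns 1 for 'flooded', else 0."""
    in_range = np.all((patch >= FLOOD_LOW) & (patch <= FLOOD_HIGH), axis=-1)
    return int(in_range.sum() > THRESHOLD)

patch = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
print(label_patch(patch))
```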

The classification accuracy is used to gauge how well the current network architecture can encode information, so that an autoencoder model that generates flood patterns within specific images can be built on top of it. Another model in progress predicts flood risk in a specified area over time; an RNN is being used for this purpose. We hope to represent risk in different areas using a choropleth map.

Machine Learning Models:

  1. Regression: Can we model the average flooded area of each grid cell over all study dates using the static topographic features and an average of the weather features? (The basic train/evaluate setup behind these models is sketched after this list.)

Random forest model: R-squared = 0.806

  2. Classification: Can we model whether a grid cell is flooded or not on each study date, where "flooded" is determined by a variable threshold on the percentage of flooded area in the cell?

Stochastic gradient boosting: Accuracy = 0.895 when the threshold is 0.25

  3. Regression: Extending question 2, can we model the number of floods over the study dates given a threshold?

Random forest model: R-squared = 0.734

  4. Regression: Extending question 3, can we train on the 2015 data and test on the 2016 data to evaluate model performance?

Random forest model: R-squared = 0.730

  5. Regression & Classification: Can we predict the amount of flooding per cell per date, or whether it flooded or not, by setting a threshold on the amount of flooding?

Generalized linear model: R-squared = 0.155
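
For reference, the basic train/evaluate loop behind the random forest numbers above looks roughly like the sketch below, with synthetic data standing in for the per-grid feature table (topography, land cover, weather) and the flooded-area target.

```python
# Sketch: fit a random forest regressor and report held-out R-squared.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=12, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("R-squared:", r2_score(y_test, rf.predict(X_test)))
```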

In the next week, we will keep exploring models for the above questions.