Built Infrastructure Input Data and Processing

Hi there,

I am trying to run the urban flood risk mitigation model on a watershed in Ontario, Canada. I have found data sources for all the inputs; however, I am having trouble retaining the vector resolution after processing the building footprints programmatically in Python.

My data source is the GitHub repository microsoft/CanadianBuildingFootprints (computer-generated building footprints for Canada), which is distributed in GeoJSON format. Ideally, I would like to crop the province-level GeoJSON down to the watershed area before converting it to the format the model requires (.gpkg).

I was wondering how the building footprints in the urban flood risk mitigation model's sample data were processed to cover such a small region.

Thanks for the help,


Hi @rxchelzhxng , that’s a cool data source!

I’m not quite sure what you mean here by “retaining resolution”. What sort of processing did you do in Python?

I can’t speak about the sample data, but I just looked at the model’s requirements for the built infrastructure input, and I see that the model will do the following for you:

  • the model will project the data to match the coordinate system of the watersheds input, so you don’t need to do that ahead of time
  • the model will clip the built infrastructure input to the watershed vector

The only processing you probably have to do is create a “type” column in the attribute table and populate it with integer codes that will be matched against the infrastructure CSV input. Because of that, it might make sense to first clip the data down to just the watersheds area.

There are lots of approaches for that, in a GIS or with Python. I see that the Ontario dataset has 3+ million features, so the processing will not be trivial. Would you like to share the Python code you have?

Hi Dave,

The part where I am losing vector resolution is when I convert the vector data (from a shapefile) to a raster (.tif). The reason for the conversion is that I have another script that clips this raster down to the smaller area of interest (the watershed) using a GeoJSON of watershed boundaries.

Currently my code converts from shapefile to raster using rasterio’s rasterize() and rasterio.transform.from_bounds() functions.
I do have to hardcode the shape of the output raster, which I believe is where I am losing definition in my output. I can set it higher than 25000 by 250000 pixels, but then the processing takes up a lot of memory.

Are there any approaches for bypassing this vector-to-raster conversion and clipping the dataset (GeoJSON) directly to the watersheds?

Yes! It’s very common to clip one vector by another vector. In Python, you could try geopandas, which offers an “intersection” operation (see “Set-Operations with Overlay” in the GeoPandas 0.9.0 documentation).

Geopandas will need to load the entire dataset into memory, so that may or may not work on your system. If it’s a problem, you could use fiona and shapely to iterate through the building features and intersect them one by one. Here’s an example you could adapt: https://gis.stackexchange.com/a/178787/112596

Both of these approaches should handle a GeoJSON just fine, but both require that the buildings layer and the watersheds layer share a common projection/coordinate system.

Thanks for the resources! I actually found a clipping function in geopandas (see Earth Lab’s “Clip a spatial vector layer in Python using Shapely & GeoPandas” lesson in their GIS in Python series) and am trying to implement this first.

As for the other processing needed to populate the attribute table with integer codes that match the infrastructure CSV table: are there any approaches for doing this in Python? I have a solution for rasters (the lulc and biophysical table), but since this is a vector file, I am not sure how it would work.

I also noticed that the sample data for the urban flood risk mitigation model had a damage CSV table with a type of 0 and damage of 100. I opened the attribute table of the infrastructure.gpkg file in QGIS and didn’t see a value/integer code of 0 corresponding to the CSV table.
Was this intentional, or am I missing something in my interpretation of the integer codes in a gpkg file?
Could you clarify how the sample data was processed so that a type of 0 corresponds with the infrastructure gpkg file? Ideally I would like to be able to modify this in Python.

It looks like the model takes the numeric codes it finds in the vector’s attribute table and looks them up in the CSV. So it is harmless to have a code in the CSV that does not appear in the vector; the reverse would not be okay, though.

Well, the hard part here is knowing which vector features are which building type. I looked at the data source you linked, and it does not appear to come with any attribute data along these lines. So coding the buildings by type would either be a manual process based on local knowledge (in which case I would do it by editing the attribute table in QGIS/ArcGIS), or would require another data source that carries that information, such as OpenStreetMap (see the Key:building page on the OpenStreetMap Wiki). Then you could use Python (geopandas/pandas) to populate a new column in the vector’s table.

Data collection/pre-processing is almost always the hardest part!

That makes sense. I did happen to see that there’s actually a type 0 at the very bottom of the attribute table for all the buildings in the infrastructure vector in the sample data. For our current testing, I will assign a new column/row in the attribute table to do the same.

As for the OpenStreetMap data source, I’m not too familiar with it. Is there an API we can call to access the building footprints with that key, or is it a manual download of data like the Microsoft source I sent previously?

I haven’t done this in a few years, but the Overpass API is a popular way to extract OSM data (see the “Overpass API” page on the OpenStreetMap Wiki).
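A sketch of an Overpass request from Python (the bounding box coordinates are placeholders; this uses the public overpass-api.de endpoint via the requests library):

```python
import requests

def build_query(south, west, north, east):
    """Overpass QL: all ways tagged as buildings inside a bounding box."""
    return f"""
    [out:json][timeout:60];
    way["building"]({south},{west},{north},{east});
    out body geom;
    """

# Placeholder bounding box somewhere in southern Ontario
query = build_query(43.0, -80.0, 43.5, -79.5)
resp = requests.post("https://overpass-api.de/api/interpreter", data=query)
resp.raise_for_status()
elements = resp.json()["elements"]
```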

Thanks, I will take a look into it!
An update: we were able to successfully run the urban flood risk mitigation model after pre-processing all our data.
