Habitat Risk Assessment - problems with risk calculation and outputs?

Hello. I am running my own HRAs within R (rather than using the InVEST tool). However, as a first step, I am repeating some very simple analyses in both R and the InVEST tool to make sure that these generate comparable outputs (to verify that my R code is successfully replicating the InVEST HRA). In going through this process, I’m getting risk estimates from the InVEST HRA tool that don’t make sense.

First, I have a 200 x 100 grid. In R, I just work with this grid directly. There is a single habitat and a single stressor, so each grid cell i has a habitat score (1,2,3) and a stressor score (1,2,3). There is one additional exposure criterion and one additional consequence criterion (these additional criteria are not spatially explicit), which are both fixed with values of 2. Data quality (DQ) and weights are also fixed at 2 for all criteria. There is no distance decay function or buffer.

I am using the Euclidean risk function:
R.i = sqrt( (E.i-1)^2 + (C.i-1)^2) )
where
E.i = sum( e.i/(de.i * we.i) ) / sum(1/(de.iwe.i)), where de.i and we.i are DQ and weights for the e.i
C.i = sum( c.i/(dc.i * wc.i) ) / sum(1/(dc.i
wci)), where dc.i and wc.i are DQ and weights for the c.i

So, in my specific test case, the calculation of E and C for cell i are based one one spatially-explicit Exposure (or Consequence) value that can have a value 1,2,3, and one spatially non-explicit Exposure (or Consequence) value that has a value of 2. Again, W and DQ are all 2. Thus, the max value for E.i is:
E.i = (3/(22) + 2/(22)) / ((1/22) + 1/(22)) = 5/4 / 2/4 = 5/2 = 2.5

And the max value for C.i is similarly 2.5

Thus, the max value for any R.i should be sqrt( (2.5-1)^2 + (2.5-1)^2 ) = 2.12, assuming there are some cells where the spatially-explicit e.i and c.i both have the max of 3 (and there are many such cells).

But, when I run this through the InVEST HRA, the output (summary_statistics.csv) correctly says that the max E and max C are both 2.5, but then also says that the “R_MAX” is 2.25, which should be impossible.

FYI, for running this through the InVEST tool, the 200x100 grids (for habitat and stressor) were converted into polygon layers, projected near the equator, so that each cell is approximately 2770m on a side. Then at the InVEST GUI, the resolution is specified as 2770m. The max criteria score is specified as 3. Risk equation = Euclidean. Decay function = NONE. For the criteria scores, I’m specifying the spatially explicit species/habitat layer as a Consequence (C) criteria within the “Habitat Resilience” table. Within the “Habitat Stressor Overlap Properties” table, I’m specifiying the other non-spatial C criteria and the two E criteria (one being non-spatial, and the other being the spatially-explicit stressor layer)

Questions:

  1. I’m seeking explanation for how the InVEST tool can calculate a Risk.i value of 2.25, given max E.i and C.i values of 2.5.

  2. I’m also seeking clarification on the importance of listing different E and C criteria within different portions of the criteria table. Based on the formulae described in the manual, it shouldn’t matter where I list the different criteria. But I seem to get different results depending on which part of the table I include the criteria, irrespective of whether I label them E or C. For example, are all criteria listed under the “Habitat Resilience Attributes” treated as Consequence criteria, irrespective of whether they are input as E or C? The documention sort of suggests this in describing how the criteria table .csv should be set up, but it’s not entirely clear.

  3. How are “subregions” defined and outputs obtained for these?

Thank you!
Jeff

I’m using InVEST version: 3.9.0

1 Like

Hi @JeffM , thanks for sharing your case here. The lead developer of this model is out on vacation right now, so might be another week or so to get a detailed response, especially on the risk score discrepancies.

But for now I can try to answer your questions.

  1. I’m seeking explanation for how the InVEST tool can calculate a Risk.i value of 2.25, given max E.i and C.i values of 2.5.

We recently fixed some inaccuracies in the risk calculations so it would be most helpful if you could try the latest version 3.12.0 and let us know what you find. InVEST | Natural Capital Project

I’m also seeking clarification on the importance of listing different E and C criteria within different portions of the criteria table

  1. Each “block” of the criteria table represents a different habitat-stressor pair. And I believe it is a requirement that each habitat-stressor pair include the same list of criteria (though some can be marked with a RATING of 0 to exclude that criteria for that particular habitat-stressor pair).
    However, the first block of “Resilience Attributes” are treated as Consequence criteria that are specific to each Habitat, but apply the same to all Stressors. So you are correct, these criteria could instead be listed under each habitat-stressor pair, but since it does not make ecological sense for their scores to vary by stressor, the values are assigned only once for each habitat.

They are all treated as Consequence, except if they are labeled “E”, they are excluded altogether.

They are defined by polygons in the Area of Interest input dataset. Generally, these polygons are used to calculate zonal statistics from the raster outputs. So the subregion results are aggregated information from the pixel-scale raster results.

2 Likes

Dave – huge thank you for your replies. I’ve installed 3.12 (workbench and classic), but I now get errors trying to run the same analysis, even though I’m using all the same setup files and inputs as when I successfully ran it with version 3.9. Are there revised specifications for file setup with the newest version?

I’ve attached my log file.

Thank you
Jeff

InVEST-Habitat-Risk-Assessment-log-2022-10-12–12_04_14.txt (2.2 KB)

Hi Jeff, I think the new version maintained compatibility with respect to the input files.

The error indicates you tried to use a spatially explicit criteria rating, but the path provided to the spatial dataset seems incomplete:

ValueError: Criterion could not be opened as a spatial file C:/jeff/NOAA/ByRA project/SimulationAnalysis/R_jeff/ByRAruns/Rating

This should be a path to a vector or raster dataset, but appears to be a path to a directory?

Yeah, but that path listed in the error message is not a path that I specified. I don’t have a subfolder called “Rating”, nor is the software creating such a subfolder. But the software is generating that path as part of the error message for some reason. The files/path I am specifying for the spatially explicit criteria (within the Rating column of the cristeria.csv) are different shapefiles. I know these are specified correctly in the sense that the same setup runs fine with versions 3.9 and 3.11.

Hmm that does sound strange! Would you be able to upload your criteria CSV here?

Thanks for keeping at it with me. Here are the .csv’s that I’m pointing to in the setup…

habitat_stressor_info-2c.csv (285 Bytes)
swfsc_criteria_scores_2c.csv (471 Bytes)

Got it, thanks. It was just a minor formatting issue. The HABITAT RESILIENCE ATTRIBUTES row name needed to be on the same row as the Rating, DQ, Weight labels. Otherwise it was interpreting “Rating” as a value instead of as a label/header.

And since it is acceptable to list spatially explicit criteria files using relative paths, the model tried creating the absolute the path C:/jeff/NOAA/ByRA project/SimulationAnalysis/R_jeff/ByRAruns/Rating by joining the value “Rating” with the location of the criteria CSV.

this one should work, I hope
swfsc_criteria_scores_2c.csv (471 Bytes)

1 Like

Thanks, Dave! That helped. I modified the criteria.csv format per what you provided, and then I got a different error, so then I noted another format change that had to be made, pertaining to blank rows within the criteria.csv. Then it ran. Suffice to say, the format setup for the criteria.csv file is different for version 3.12 vs earlier versions!

Ok, so let’s return to the original question of my thread! Now that I have 3.12 running for me, I can see that Risk is being calculated more accurately now (consistent with what I’m getting in my own R code). So that’s great.

But now I’ve noted a different issue with 3.12 (that does not exist with 3.9 or 3.11), concerning the spatial resolution of the analysis. The grid cells of our projected data are about 2770m x 2770m in size. However, at the setup box, if I input this value (2770) as the resolution, the InVEST tool generates grid-shaped polygons that are much larger in size (and does the analysis at this coarser resolution, giving an incorrect risk calculation). To get cells of the correct size in the analysis, I had to input a very fine resolution at the setup box (I ended up going all the way down to 100m). This generated a correct-looking map with grid-shaped polygons of the correct size (you can see they are correct because the map includes a scale and if you zoom in enough, so can see that the cells are about 3km per side). Again, this is not an issue in the earlier version. Inputting resolution = 100m in 3.12 and 2770m in 3.11 produce approximately the same maps.

Hi @JeffM , I’m not able to reproduce this issue with some sample data; the model resolution seems to be working as expected.

If you would like to share all of your input data I am happy to take a closer look.

Hi Dave. Interesting. So, having taken a closer, I actually need to revise my description of the issue. With 3.12 Workbench, if I input 2770m as the resolution, the resulting map does indeed seem to generate polygons/cells that are about 2.8km on a side, so all seems well in that sense. But, the risk calculation for each cell seems to involve some kind of buffer (even though i’m not using a decay equation or buffer) or moving window-averaging, so that adjacent cells are more likely to have the same risk score and are getting aggregated into large polygons than I expect. See the attached image file to see what I mean.

Cool, it’s interesting to see that side-by-side comparison. If you examine some of the raster outputs I wonder if you notice the same?

Version 3.12 could have had some minor changes to the method for rasterizing the input polygons. Maybe @jdouglass has a better explanation for this?

@JeffM I agree with @dave this change sounds like it might be a consequence of the updates to the alignment step, but I’m not yet exactly sure why that is given what’s already been discussed here. Would you be willing to share your input layers with me so I can take a closer look? My email is jdouglass@stanford.edu if that’s easiest.

About your inputs, you mentioned above that

The grid cells of our projected data are about 2770m x 2770m in size.

When I look at your input tables from earlier in this thread, I see that all of the spatial inputs are vectors. Is it possible that your shapefiles contain polygons that are intended to represent pixels? And if so, do you have the original rasters from which these vectors are derived?

If you have the rasters available, you can just provide them directly as layers. Rasters can be used everywhere vectors are, both as spatial criteria and as habitat and stressor layers, and it will mean that you won’t have to worry about how the model rasterizes the vectors. When working with polygons that perfectly overlap pixels, GDAL has some unexpected behavior that can lead to the sort of buffering effect that you mention. Using raster inputs directly will get around the need to rasterize.

James

1 Like

Thanks for sharing your inputs @JeffM !

Raster inputs
For the raster inputs, there’s something funky going on about how these are created. Pixels have dimensions of 2.5 centimeters by 5 centimeters (!) in the projected units and have an incorrect origin relative to the vectors. Fortunately, gdal_edit.py is a very useful script and part of the GDAL toolchain, so I was able to edit the raster dimensions and make them align with the vector’s coverage area. Here’s an example of the gdal_edit.py command I used for each of the 4 rasters:

$ gdal_edit.py -a_ullr 277404.560324317 276708.1672510036 833978.5569194595 0.0 mmd.int_raster_proj.tif

The coordinate inputs to -a_urllr were extracted using our geoprocessing library pygeoprocessing via (pygeoprocessing.get_vector_info()), but you could just as easily extract them in a GIS tool.

Using these output rasters (which I have emailed back to you), the model runs through without errors.

Vector inputs
Based on manual inspection of the vector in QGIS 3.22, the “pixels” in the vector appear to in fact be 2774.05m on a side. Since the model is being run at a resolution of 2770, you’re guaranteed to have a single “pixel” intersecting with 4 or more pixels during the rasterization step, but it could be as many as 9 pixels! So this is most likely the cause of the buffering effect you’re seeing. If you’d like to keep using vectors, you’ll want to run the model with a resolution of approximately 2775m or greater to avoid the buffering you’re seeing.

1 Like

James – I’m sorry for such a delayed reply. I got busy with some other stuff and then took a week vacay, and then… you know how it goes.

So, first, a big thank you for identifying some issues with the rasters and sending me good ones. I did not create the rasters (I’m not the GIS person on our project), so some of your explanations (e.g., re the GDAL toolchain) are Greek to me, but I’m passing on this info to my GIS collaborator. I’m sure she’ll greatly appreciate all this!

Regarding you comments on the vector inputs, I guess I’m a little confused, because, while I think I see what you’re saying in theory, it seems to me that specifying a smaller pixel resolution would be of little consequence since ultimately the polygons generated won’t be any more granular than the resolution of the input data, whereas if I specify a larger resolution, the information will be aggregated (a loss of resolution from the input data). One thing I know is that the issues I’m having with the new version of HRA did not exist with the earlier versions. In the older versions, as long as I specified a resolution that was finer than the input data, I would get the same outputs (ie., whether i specified a resolution of 2500m or 25m), which made sense to me, because you can’t get output information that is more resolved than what you input.

cheers
Jeff

Hi Jeff,

I finally have a satisfying answer to why this buffering is occurring in the 3.12 version of HRA but not in 3.9: the 3.9 version of the model has a special case for spatial criteria where vectors are rasterized with the option ALL_TOUCHED=FALSE, while habitat and stressor vectors are rasterized with ALL_TOUCHED=TRUE. InVEST 3.12 rasterizes all vectors with ALL_TOUCHED=TRUE, so this is why you’re seeing this difference in outputs at larger model resolutions. Practically speaking, this is an accidental regression in the model’s behavior and the development build linked below should correct it. When you have a chance, could you try it out and make sure the model is now doing what you would expect?

https://storage.googleapis.com/natcap-dev-build-artifacts/invest/phargogh/3.12.0.post58%2Bg130e965cc/InVEST_phargogh3.12.0.post58%2Bg130e965cc_x64_Setup.exe

Related to this is the question of what the model should be doing when rasterizing, and so you may see a change in the model’s behavior in a future release of InVEST depending on the outcome of those conversations.

OK! On to your good theoretical question.

You’re right that specifying a coarser resolution would cause a loss of resolution from the input data, but it turns out that there is, in fact, a benefit to using a finer resolution, specifically at half the signal’s resolution or finer. This is known as the Nyquist Rate and, when used, allows us to discretize (rasterize, in our case) a continuous function (a geometry) without distortion.

So for your case, if your vectors represent pixels that are about 2770m on a side and you want to use the Nyquist Rate, you could run the model at a resolution of 1385m. When the polygons are constructed at the end of the model, they should very accurately represent the input signals.

And one other note about your vectors - while they are intended to represent pixels, the size of these pixels vary over space, perhaps as the result of a reprojection. Here are 2 examples from f2.int_rating.shp where you can see that 2 squares have different areas:

So, when the model converts these geometries to square pixels at the target resolution, we have to expect some error. Running the model at a resolution at or below the Nyquist Rate can help avoid some of this error propagating up to the results.

James

1 Like

Thanks, James!! super helpful!