How to Handle Climate NetCDF Files

I’ve started working with my climate NetCDF’s in CDO because there’s a pretty easy function that does exactly what I need for my 1st step: it averages a daily time step file into a monthly averaged (or whatever metric you choose) file. So this time, now that I have the *.nc files, I got a list of the base file names as such:

cat climate_structure | cut -d . -f1 > climate_models

Then  I wrote this simple bash script to loop over them all and write out the monthly averages:

while read climate_model; do
 echo "Now averaging $climate_model"
 cdo monmean ${climate_model}.nc ${climate_model}
done < climate_models

Some of my closest colleagues, Brooke Bateman and Andy Allstadt, had some good advice for how to work with the netCDF files in R once I get them:

  • ncdf4 package
  • raster package: can open netCDF either as…
    • single layer (with the raster function)
    • the entire thing (brick function)
  • SDMTools
    • sample x,y points from raster using extract data function
    • convert the raster to ASCII

NetCDF: Wading through the Wreckage

There were apparently a few people last week (and maybe this week) who were downloading tons of data, so it was “clogging the pipe” on our connection as our IT people put it. However, a few issues still remain, but they may not be of high importance to us…

Now downloading cnrm-a1b-tmin-NAm-grid
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <!DOCTYPE^ HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"><html><head><title>500 Proxy Error</title></head><body><h1>Proxy Error</h1>The proxy server could not handle the request <em><a href="/thredds/dodsC/dcp/conus_t.dods">GET&nbsp;/thredds/dodsC/dcp/conus_t.dods</a></em>.<p>Reason: <strong>Error during SSL Handshake with remote server</strong></p><p /></body></html>
NetCDF: DAP server error
Location: file /build/netcdf-StLR0y/netcdf-4.4.0/ncdump/nccopy.c; line 1044
Now downloading cnrm-a2-tmax-NAm-grid
Now downloading cnrm-a2-tmin-NAm-grid
NetCDF: DAP server error
Location: file /build/netcdf-StLR0y/netcdf-4.4.0/ncdump/nccopy.c; line 1044
Now downloading cnrm-b1-tmax-NAm-grid
syntax error, unexpected $end, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: ^
NetCDF: DAP server error
Location: file /build/netcdf-StLR0y/netcdf-4.4.0/ncdump/nccopy.c; line 1355
Now downloading cnrm-b1-tmin-NAm-grid
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <!DOCTYPE^ HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"><html><head><title>500 Proxy Error</title></head><body><h1>Proxy Error</h1>The proxy server could not handle the request <em><a href="/thredds/dodsC/dcp/">GET&nbsp;/thredds/dodsC/dcp/</a></em>.<p>Reason: <strong>Error during SSL Handshake with remote server</strong></p><p /></body></html>
NetCDF: DAP server error
Location: file /build/netcdf-StLR0y/netcdf-4.4.0/ncdump/nccopy.c; line 1355
Now downloading echam5-a1b-tmin-NAm-grid
CURL Error: Couldn't connect to server
curl error details: 
NetCDF: I/O failure
Location: file /build/netcdf-StLR0y/netcdf-4.4.0/ncdump/nccopy.c; line 1044

How Best to Download Climate Data?

I’ll be working on gathering this data for the Dakotas (bounding box: 49 N, 42 S, -96 E, -105 W), and then summarizing it to the level of sample blocks they created earlier in the project. I tried using OPeNDAP but I’m not quite sure how to use it. So, I tried NetcdfSubset instead, which seems like it would do what I want, but the request is too big for their on-line retrieval system. Instead, they suggested I use nccopy.

This is the giant URI of the climate data I need, broken up for readability in this blog post. In other words, I put the new lines there, and they shouldn’t be in the real URI you pass to nccopy.[163:1:230],time[0:1:51099],lat[137:1:192],

To get the file(s) you pass…

nccopy -u <URI>

My request is so big that perhaps it comes across a missing file somewhere in the list, if I try to just pass the whole URI? I get this error from nccopy (running v. 4.4.0 on Xenial Xerus):

NetCDF: file not found
Location: file /build/netcdf-StLR0y/netcdf-4.4.0/ncdump/nccopy.c; line 1355

I’m thinking that probably the best thing to do will be to script this request, so it handles each URI separately. The friendly data portal manager has been much help, in that it looks like from his example to write a standalone URI for each model, it would be something of the form (an example of the first dataset):

So here’s what I did: I copied the big URI broken up and pasted it into a text file called “climate_structure.” Then, I deleted the first line, and got the names of the climate models into a new file, “climate_models” like so:
cat climate_structure | cut -d [ -f1 > climate_models

Then, I setup a simple bash script to loop over the related models and variables now in “climate_models” and download each into a separate *.nc file, and give each the name of the model and variable combination.

#loop over list of climate model names

while read climate_model; do
 echo "Now downloading $climate_model"
 nccopy -u "[163:1:230],time[0:1:51099],lat[137:1:192],${climate_model}[0:1:51099][137:1:192][163:1:230]" ${climate_model}.nc
done < climate_models

Next I’ll need to get the precipitation.

My friend suggested instead I try GrADS to handle the data request, and had many friends that used it successfully and often. It can be scripted so I’m looking into that now! In the meantime, please comment if you’ve had any experiences with big climate data and gathering lots of NetCDF files!

Breeding Dispersal, Migratory Nomadism

Though the following terms aren’t all the same, they describe whether or not breeding birds occupy the same territories as they did in the previous year.

  • breeding dispersal – this is a commonly used term, especially in ornithology
  • migratory nomadism – comes up in grassland bird literature
  • fidelity – more broadly used in ecology
    • site
    • territory
  • return rate – a phrase used with other meanings in other disciplines, but also used in ecology
  • philopatry – can add “breeding” before it to specify

So, in my case, to find as much information as I can whether grassland birds return to the same place or not in subsequent years (and why/when/what modulates it), I’d need to search through all these terms. The 1st 2 are my favorites because of their prevalence within my sub-disciplines, and are what I’d use in my own papers to increase the chances my target audience will find my paper in a search.

My to-date only 1st authored paper starts to explore this phenomenon, and it’s a running theme throughout my dissertation. What do we know about breeding dispersal, and how do we know? One of the most common and rigorous methods is to look at return rates of banded, marked or otherwise identifiable individuals.

  • return/recovery of a banded individual: ideal! We know exactly “who” that individual is, where we caught it, how old it is/was, gender, etc.
  • failure of a banded individual to return: we don’t know what happened to it. It could have died or gone elsewhere

We can map territories, but without being able to identify who’s holding it, we don’t necessarily know when a territory is taken over nor by whom. I read an interesting paper (Siegel et al. 2015) that used age classes to determine whether dispersal of juvenile or adult birds was the predominant vector of colonization for a disturbance-dependent species. At a coarser scale, we can look at how occupancy of areas changes year-to-year; in other words, we can see if as many individuals of a species return to an area as the previous year. Scaling up that concept, we can look at distributions with broad-scale data sets and how distributions of individuals change between years (which is the scale at which I’ve gotten involved with my research :)).

Master Post: BBS 10-Stop Data

This one may not be as step-by-step, because the GIS layer of BBS routes I have may not have been published yet. Nonetheless, here’s what I did.

  1. Cleaning up the spatial data and splitting routes
  2. Assigning segment order once the routes are split
  3. Summarizing spatial covariates by segment
  4. Determining which segments can be considered the same
  5. Importing BBS and spatial data into R at the segment level

The final step looks similar to the route-level tutorial, but with pieceID being the variable to match on. Set “bird” to the AOU code of your species of interest.

dataperspecies <- subset(BBS[which(BBS$Aou == bird),]) 
vars$abundance <- dataperspecies$count[match(vars$pieceID, dataperspecies$pieceID)]
vars[] <- 0

assess <- data.table(vars)
whattodrop <- assess[,list(abundance=sum(abundance)),by='rteno']
droprtes <- whattodrop$rteno[whattodrop$abundance==0]
perspp <- vars[!(vars$rteno %in% droprtes),]
perspp$Year <- as.integer(perspp$Year)
perspp$yearfix <- perspp$Year
perspp <- perspp[perspp$RunType > 0,]
#vars <- vars[vars$RPID < 102]
perspp <- perspp[c("Year","MN","firstyear","rteno","ObsN","abundance")]

Here is the full script: bbs_10stop_species

A Thought on Downscaled BBS

The BBS was designed to capture avian populations in natural areas, so if the land surrounding a portion of the route was developed, historically the route path was altered and given a new number. The problem is, that new numerical designation was not standardized (though often, it’s 100 + the original route number).

In wanting to maintain spatio-temporal continuity, I thought that the portions of the routes that were still spatially congruent should be considered the same segment. This serves my random effect of location, such that an observation isn’t considered to come from a different location just because the route number is different (if the “old” and “new” segments spatially coincide). I use “coincide” loosely, in that I considered original and +100 route segments to be the same if they intersected. You can measure the degree of overlap and incorporate that if you choose. I made a shape file of intersecting duplicated segments, and imported it into R to adjust the location designations.

repeats <- read.dbf("intersecting_duplicated_segment.dbf")
repeats$new_rteno <- repeats$rteno - 100
repeats$new_route <- paste(repeats$new_rteno,repeats$segment,sep="_")

Now, the newer segments have the same numbering as the original routes. I match the new segments with the original route numbers to the BBS data, and then replace the new route segment numbers with the original numbering.

BBS$new_route<-repeats[match(BBS$route_piece, repeats$route),"new_route"]
BBS$new_rteno<-repeats[match(BBS$route_piece, repeats$route),"new_rteno"]
BBS[!]$route_piece <- BBS[!]$new_route
BBS[!]$rteno <- as.integer(BBS[!]$new_rteno)

The new route segments that intersect the original routes now have the old route numbering, and thus counts from these segments appear in the code to to have come from the same routes as the original.

Here’s the full script: bbs_10stop

Down-scaling BBS

As my several recent posts have mentioned, I’m working with BBS spatial data (i.e. route paths and start points). I wanted to “downscale” by splitting routes into smaller segments and looking at data sub-route level. Specifically, I needed to split routes into fifths. I needed to know what order the segments were in from the start of the route to match the finer-scale data.

Here’s how to do it…

  1. Break up the accounted for routes (or maybe even raw data set?) into “good” and “bad” routes depending on if the start is nearest a segment for a matching route
  2. Work with “good starts” and “good routes” (the easiest!)
    1. identify the segments closest to the start point as the first segment of the route
      1. run near on the start points to find out which segment in the master file is closest
      2. use the near_FID to join with the FID of the route segment layer
      3. export the resultant start segments into their own layer (number them 1)
      4. remove join
    2. delete the start segments from the master file of all the broken up segments of the route (select by location: identical to)
    3. use select by location (touches the boundary of) to find the next closest line segment to the start in the master file, and number them 2
    4. repeat 2-4 for the remaining segments, and number them accordingly
  3. Work with bad starts: once the good routes are out of the data set, is the bad point now closest to a remaining matching segment?
    1. break up those newly matched lines
    2. put leftover points into yet another file
  4. Look at remaining points to see if any more lines can be put to use

Given the numbering scheme recommended in the steps above, your segments should be numbered 1-5 on each route. I buffered those segments and averaged tree cover within  each segment buffer. In order to match my segment numbering with the rest of my R code, I renamed the segments.

vcf_subroute <- read.dbf("landcover/bbs_segment_buffers.dbf")
vcf_subroute$interval <- paste("count",vcf_subroute$segment,"0",sep="")

The remainder of the code is similar to that described in the route-level MODIS post, just with the added level of the interval.

vcf_subroute <- vcf_subroute[,c("rteno","interval",grep("[[:digit:]]", names(vcf_subroute), perl=TRUE, value=TRUE))]
vcf_subroute <- melt(vcf_subroute,id=c("rteno","interval"))
vcf_subroute$variable <- as.character(vcf_subroute$variable)
vcf_subroute$Year <- str_extract(vcf_subroute$variable,"[[:digit:]]+")
vcf_subroute$var <- sapply(strsplit(vcf_subroute$variable, "[[:digit:]]+"),"[[",2)
vcf_subroute$variable <- NULL
vcf_subroute <- dcast(vcf_subroute,rteno + interval ~ var,fun.aggregate=mean)
vcf_subroute$rteno <- as.character(vcf_subroute$rteno)

The full script, to this point: bbs10vcf

Cleaning Up BBS Routes

When I downloaded the BBS route shape file (apologies for not linking, I’m not sure if it’s publicly available) I had some work to do. I wrote down the steps I used in cleaning up the layer, and then splitting into fifths.

  1. One line per route. Great! set aside. (Still a note of caution, some BBS routes are too long. When I asked about this, I was told they were digitization errors but weren’t immediately fixable.)
  2. More than one line per route: I needed to only be sure to dissolve segments with matching route identifiers that touched. This is where iterative spatial joins and dissolves came in…
    1. spatial join: what is touching each segment?
    2. do the route identifiers match? if so, keep those…get rid of the ones that don’t
    3. join field to get the target ID’s, which end up making “connectivity groups,” to dissolve by
  3. repeat steps in #2 until there’s only one poly-line per connectable (zero distance) route parts
  4. this leaves disconnected fragments of things that don’t touch (decide what to do with those on a case-by-case basis)

Now, you might want to look at the lengths of the resulting routes and examine ones that are suspiciously long, and decide what to do with them in your analysis.

  1. depending on the geometry of the lines, the nice “split line” tool may not work as expected. I made a topology to figure out where splits had gaps. ones that had gaps were removed from the set
  2. once split, was there mysteriously more or less than 5 segments per route? if so, toss
  3. for the good splits, I matched available good starts to determine order of the segments
  4. I salvaged the remaining “split-able” lines
Adventures in fixing the rest: I tried to be ambitious and work with the problem lines, to little avail. I used “trim lines” to get rid of the short spurs, and that seemed to have worked. Yet, it just didn’t fix all of them.