
Transcription
Introduction to visualising spatial data in RRobin Lovelace ([email protected]), James Cheshire, Rachel Oldroyd and others2017-03-23.See github.com/Robinlovelace/Creating-maps-in-R for latest versionContentsPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1Part I: IntroductionPrerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .R Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .223Part II: Spatial data in RStarting the tutorial and downloading the dataThe structure of spatial data in R . . . . . . .Basic plotting . . . . . . . . . . . . . . . . . . .Selecting quadrants . . . . . . . . . . . . . . . .44567Part III: Creating and manipulating spatial dataCreating new spatial data . . . . . . . . . . . . . . .Projections: setting and transforming CRS in R . . .Attribute joins . . . . . . . . . . . . . . . . . . . . .Clipping and spatial joins . . . . . . . . . . . . . . .99101012Part IV: Making maps with tmap, ggplot2 and leaflettmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . .ggmap . . . . . . . . . . . . . . . . . . . . . . . . . . . .Creating interactive maps with leaflet . . . . . . . . . . .Advanced Task: Faceting for Maps . . . . . . . . . . . . .1414151618.Part V: Taking spatial data analysis in R further19Acknowledgements20References20PrefaceThis tutorial is an introduction to analysing spatial data in R, specifically through map-making with R’s‘base’ graphics and various dedicated map-making packages for R including tmap and leaflet. It teachesthe basics of using R as a fast, user-friendly and extremely powerful command-line Geographic InformationSystem (GIS).The tutorial is practical in nature: you will load-in, visualise and manipulate spatial data. We assume noprior knowledge of spatial data analysis but some experience with R will help. If you have not used R before,it may be worth following an introductory tutorial, such as Efficient R Programming (Gillespie and Lovelace,2016), the official Introduction to R or tutorials suggested on rstudio.com and cran.r-project.org.Now you know some R, it’s time to turn your attention towards spatial data with R. To that end, this tutorialis organised as follows:1. Introduction: provides a guide to R’s syntax and preparing for the tutorial1
2. Spatial data in R: describes basic spatial functions in R3. Creating and manipulating spatial data: includes changing projection, clipping and spatial joins4. Map making with tmap, ggplot2 and leaflet: this section demonstrates map making with moreadvanced visualisation tools5. Taking spatial analysis in R further: a compilation of resources for furthering your skillsTo distinguish between prose and code, please be aware of the following typographic conventions used inthis document: R code (e.g. plot(x, y)) is written in a monospace font and package names (e.g. rgdal)are written in bold. A double hash (##) at the start of a line of code indicates that this is output from R.Lengthy outputs have been omitted from the document to save space, so do not be alarmed if R producesadditional messages: you can always look up them up on-line.As with any programming language, there are often many ways to produce the same output in R. The codepresented in this document is not the only way to do things. We encourage you to play with the code togain a deeper understanding of R. Do not worry, you cannot ‘break’ anything using R and all the input datacan be re-loaded if things do go wrong. As with learning to skateboard, you learn by falling and getting anError: message in R is much less painful than falling onto concrete! We encourage Error:s — it means youare trying new things.Part I: IntroductionPrerequisitesFor this tutorial you need a copy of R. The latest version can be downloaded from http://cran.r-project.org/.We also suggest that you use an R editor, such as RStudio, as this will improve the user-experience and helpwith the learning process. This can be downloaded from http://www.rstudio.com. The R Studio interface iscomprised of a number of windows, the most important being the console window and the script window.Anything you type directly into the console window will not be saved, so use the script window to createscripts which you can save for later use. There is also a Data Environment window which lists the dataframesand objects being used. Familiarise yourself with the R Studio interface before getting started on the tutorial.When writing code in any language, it is good practice to use consistent and clear conventions, and R isno exception. Adding comments to your code is also useful; make these meaningful so you remember whatthe code is doing when you revisit it at a later date. You can add a comment by using the # symbol beforeor after a line of code, as illustrated in the block of code below. This code should create Figure 1 if typedcorrectly into the Console window:Figure 1: Basic plot of x and y (right) and code used to generate the plot (right).This first line in this block of code creates a new object called x and assigns it to a range of integers between1 and 400. The second line creates another object called y which is assigned to a mathematical formula, andthe third line plots the two together to create the plot shown.2
Note -, the directional “arrow” assignment symbol which creates a new object and assigns it to the valueyou have given.1If you require help on any function, use the help command, e.g. help(plot). Because R users love beingconcise, this can also be written as ?plot. Feel free to use it at any point you would like more detail on aspecific function (although R’s help files are famously cryptic for the un-initiated). Help on more generalterms can be found using the ? symbol. To test this, try typing ?regression. For the most part, learningby doing is a good motto, so let’s crack on and download some packages and data.R PackagesR has a huge and growing number of spatial data packages. We recommend taking a quick browse on R’smain website to see the spatial packages available: n this tutorial we will use the following packages: ggmap: extends the plotting package ggplot2 for mapsrgdal: R’s interface to the popular C/C spatial data processing library gdalrgeos: R’s interface to the powerful vector processing library geosmaptools: provides various mapping functionsdplyr and tidyr: fast and concise data manipulation packagestmap: a new packages for rapidly creating beautiful mapsSome packages may already be installed on your computer. To test if a package is installed, try to load itusing the library function; for example, to test if ggplot2 is installed, type library(ggplot2) into theconsole window. If there is no output from R, this is good news: it means that the library has already beeninstalled on your computer.If you get an error message,you will need to install the package using install.packages("ggplot2"). Thepackage will download from the Comprehensive R Archive Network (CRAN); if you are prompted to select a‘mirror’, select one that is close to current location. If you have not done so already, install these packages onyour computer now. A quick way to do this in one go is to enter the following lines of code:x - c("ggmap", "rgdal", "rgeos", "maptools", "dplyr", "tidyr", "tmap")# install.packages(x) # warning: uncommenting this may take a number of minuteslapply(x, library, character.only TRUE) # load the required packages1 Tip:typing Alt - on the keyboard will create the arrow in RStudio. The equals sign also works.3
Part II: Spatial data in RStarting the tutorial and downloading the dataNow that we have looked at R’s basic syntax and installed the necessary packages,let’s load some real spatialdata. The next part of the tutorial will focus on plotting and interrogating spatial objects.The data used for this tutorial can be downloaded s-in-R. Click on the “Download ZIP” button on the right hand side of the screenand once downloaded, unzip this to a new folder on your computer.Open the existing ‘Creating-maps-in-R’ project using File - Open File. on the top menu.Alternatively, use the project menu to open the project or create a new one. It is highly recommended thatyou use RStudio’s projects to organise your R work and that you organise your files into sub-folders (e.g.code, input-data, figures) to avoid digital clutter (Figure 2). The RStudio website contains an overviewof the software: rstudio.com/products/rstudio/.Figure 2: The RStudio environment with the project tab poised to open the Creating-maps-in-R project.Opening a project sets the current working directory to the project’s parent folder, the Creating-maps-in-Rfolder in this case. If you ever need to change your working directory, you can use the ‘Session’ menu at thetop of the page or use the setwd command.The first file we are going to load into R Studio is the “london sport” shapefile located in the ‘data’ folder ofthe project. It is worth looking at this input dataset in your file browser before opening it in R. You willnotice that there are several files named “london sport”, all with different file extensions. This is because ashapefile is actually made up of a number of different files, such as .prj, .dbf and .shp.You could also try opening the file “london sport.shp” file in a conventional GIS such as QGIS to see what ashapefile contains.You should also open “london sport.dbf” in a spreadsheet program such as LibreOffice Calc. to see what thisfile contains. Once you think you understand the input data, it’s time to open it in R. There are a number ofways to do this, the most commonly used and versatile of which is readOGR. This function, from the rgdalpackage, automatically extracts the information regarding the data.rgdal is R’s interface to the “Geospatial Abstraction Library (GDAL)” which is used by other open sourceGIS packages such as QGIS and enables R to handle a broader range of spatial data formats. If you’ve notalready installed and loaded the rgdal package (see the ‘prerequisites and packages’ section) do so now:4
library(rgdal)lnd - readOGR(dsn "data", layer "london sport")In the second line of code above the readOGR function is used to load a shapefile and assign it to a new spatialobject called “lnd”; short for London. readOGR is a function which accepts two arguments: dsn which standsfor “data source name” and specifies the directory in which the file is stored, and layer which specifies thefile name (note that there is no need to include the file extention .shp). The arguments are separated bya comma and the order in which they are specified is important. You do not have to explicitly type dsn or layer as R knows which order they appear, so readOGR("data", "london sport") would work justas well. For clarity, it is good practice to include argument names when learning new functions so we willcontinue to do so.The file we assigned to the lnd object contains the population of London Boroughs in 2001 and the percentageof the population participating in sporting activities. This data originates from the Active People Survey.The boundary data is from the Ordnance Survey.For information about how to load different types of spatial data, see the help documentation for readOGR.This can be accessed by typing ?readOGR. For another worked example, in which a GPS trace is loaded,please see Cheshire and Lovelace (2014).The structure of spatial data in RSpatial objects like the lnd object are made up of a number of different slots, the key slots being @data (nongeographic attribute data) and @polygons (or @lines for line data). The data slot can be thought of as anattribute table and the geometry slot is the polygons that make up the physcial boundaries. Specific slots areaccessed using the @ symbol. Let’s now analyse the sport object with some basic commands:head(lnd@data, n 2)##ons labelname Partic Per Pop 2001## 000AFBromley21.7295535## 100BD Richmond upon Thames26.6172330mean(lnd Partic Per) # short for mean(lnd@data Partic Per)## [1] 20.05455Take a look at the output created (note the table format of the data and the column names). There are twoimportant symbols at work in the above block of code: the @ symbol in the first line of code is used to referto the data slot of the lnd object. The symbol refers to the Partic Per column (a variable within thetable) in the data slot, which was identified from the result of running the first line of code.The head function in the first line of the code above simply means “show the first few lines of data” (tryentering head(lnd@data), see ?head for more details). The second line calculates finds the mean sportsparticipation per 100 people for zones in London. The results works because we are dealing with numericdata. To check the classes of all the variables in a spatial dataset, you can use the following command:sapply(lnd@data, class)####ons label"factor"name Partic Per"factor" "numeric"Pop 2001"factor"This shows that, unexpectedly, Pop 2001 is a factor. We can coerce the variable into the correct, numeric,format with the following command:lnd Pop 2001 - as.numeric(as.character(lnd Pop 2001))Type the function again but this time hit tab before completing the command. RStudio has auto-completefunctionality which can save you a lot of time in the long run (see Figure 3).5
Figure 3: Tab-autocompletion in action: display from RStudio after typing lnd@ then tab to see which slotsare in lndTo explore lnd object further, try typing nrow(lnd) (display number of rows) and record how many zonesthe dataset contains. You can also try ncol(lnd).Basic plottingNow we have seen something of the structure of spatial objects in R, let us look at plotting them. Note, thatplots use the geometry data, contained primarily in the @polygons slot.plot(lnd) # not shown in tutorial - try it on your computerplot is one of the most useful functions in R, as it changes its behaviour depending on the input data (this iscalled polymorphism by computer scientists). Inputting another object such as plot(lnd@data) will generatean entirely different type of plot. Thus R is intelligent at guessing what you want to do with the data youprovide it with.R has powerful subsetting capabilities that can be accessed very concisely using square brackets,as shown inthe following example:# select rows of lnd@data where sports participation is less than 15lnd@data[lnd Partic Per 15, ]##ons labelname Partic Per Pop 2001## 1700AQHarrow14.8206822## 2100BBNewham13.1243884## 3200AA City of London9.17181The above line of code asked R to select only the rows from the lnd object, where sports participation islower than 15, in this case rows 17, 21 and 32, which are Harrow, Newham and the city centre respectively.The square brackets work as follows: anything before the comma refers to the rows that will be selected,anything after the comma refers to the number of columns that should be returned. For example if the dataframe had 1000 columns and you were only interested in the first two columns you could specify 1:2 after thecomma. The “:” symbol simply means “to”, i.e. columns 1 to 2. Try experimenting with the square bracketsnotation (e.g. guess the result of lnd@data[1:2, 1:3] and test it).So far we have been interrogating only the attribute data slot (@data) of the lnd object, but the squarebrackets can also be used to subset spatial objects, i.e. the geometry slot. Using the same logic as before tryto plot a subset of zones with high sports participation.# Select zones where sports participation is between 20 and 25%sel - lnd Partic Per 20 & lnd Partic Per 25plot(lnd[sel, ]) # output not shown herehead(sel) # test output of previous selection (not shown)This plot is quite useful, but it only displays the areas which meet the criteria. To see the sporty areas incontext with the other areas of the map simply use the add TRUE argument after the initial plot. (add 6
T would also work, but we like to spell things out in this tutorial for clarity). What do you think the colargument refers to in the below block? (see Figure 5).If you wish to experiment with multiple criteria queries, use &.plot(lnd, col "lightgrey") # plot the london sport objectsel - lnd Partic Per 25plot(lnd[ sel, ], col "turquoise", add TRUE) # add selected zones to mapFigure 4: Simple plot of London with areas of high sports participation highlighted in blueCongratulations! You have just interrogated and visualised a spatial object: where are areas with high levelsof sports participation in London? The map tells us. Do not worry for now about the intricacies of how thiswas achieved: you have learned vital basics of how R works as a language; we will cover this in more detail insubsequent sections.As a bonus stage, select and plot only zones that are close to the centre of London (see Fig. 6). Programmingencourages rigorous thinking and it helps to define the problem more specifically:Challenge: Select all zones whose geographic centroid lies within 10 km of the geographiccentroid of inner London.2CentralLondonFigure 5: Zones in London whose centroid lie within 10 km of the geographic centroid of the City of London.Note the distinction between zones which only touch or ‘intersect’ with the buffer (light blue) and zoneswhose centroid is within the buffer (darker blue).Selecting quadrantsThe code below should help understand the way spatial data work in R.# Find the centre of the london arealat - coordinates(gCentroid(lnd))[[1]]2 To see how this map was created, see the code in intro-spatial.Rmd.This may be loaded by typingfile.edit("intro-spatial.Rmd") or online at aster/intro-spatial.Rmd.7
lng - coordinates(gCentroid(lnd))[[2]]# arguments to test whether or not a coordinate is east or north of the centreeast - sapply(coordinates(lnd)[,1], function(x) x lat)north - sapply(coordinates(lnd)[,2], function(x) x lng)# test if the coordinate is east and north of the centrelnd@data quadrant[east & north] - "northeast"Challenge: Based on the the above code as refrence try and find the remaining 3 quadrants andcolour them as per Figure 6 below. For bonus points try to desolve the quadrants so the map isleft with only 4 polygons.Figure 6: The 4 quadrants of London8
Part III: Creating and manipulating spatial dataAlongside visualisation and interrogation, a GIS must also be able to create and modify spatial data. R’sspatial packages provide a very wide and powerful suite of functionality for processing and creating spatialdata.Reprojecting and joining/clipping are fundamental GIS operations, so in this section we will explore howthese operations can be undertaken in R. Firstly We will join non-spatial data to spatial data so it can bemapped. Finally we will cover spatial joins, whereby information from two spatial objects is combined basedon spatial location.Creating new spatial dataR objects can be created by entering the name of the class we want to make. vector and data.frame objectsfor example, can be created as follows:vec - vector(mode "numeric", length 3)df - data.frame(x 1:3, y c(1/2, 2/3, 3/4))We can check the class of these new objects using class():class(vec)## [1] "numeric"class(df)## [1] "data.frame"The same logic applies to spatial data. The input must be a numeric matrix or data.frame:sp1 - SpatialPoints(coords df)We have just created a spatial points object, one of the fundamental data types for spatial data. (The othersare lines, polygons and pixels, which can be created by SpatialLines, SpatialPolygons and SpatialPixels,respectively.) Each type of spatial data has a corollary that can accepts non-spatial data, created by addingDataFrame. SpatialPointsDataFrame(), for example, creates points with an associated data.frame. Thenumber of rows in this dataset must equal the number of features in the spatial object, which in the case ofsp1 is 3.class(sp1)## [1] "SpatialPoints"## attr(,"package")## [1] "sp"spdf - SpatialPointsDataFrame(sp1, data df)class(spdf)## [1] "SpatialPointsDataFrame"## attr(,"package")## [1] "sp"The above code extends the pre-existing object sp1 by adding data from df. To see how strict spatial classesare, try replacing df with mat in the above code: it causes an error. All spatial data classes can be createdin a similar way, although SpatialLines and SpatialPolygons are much more complicated (Bivand et al.2013). More frequently your spatial data will be read-in from an externally-created file, e.g. using readOGR().Unlike the spatial objects we created above, most spatial data comes with an associate ‘CRS’.9
Projections: setting and transforming CRS in RThe Coordinate Reference System (CRS) of spatial objects defines where they are placed on the Earth’ssurface. You may have noticed ’proj4string ’in the summary of lnd above: the information that followsrepresents its CRS. Spatial data should always have a CRS. If no CRS information is provided, and thecorrect CRS is known, it can be set as follow:proj4string(lnd) - NA character # remove CRS information from lndproj4string(lnd) - CRS(" init epsg:27700") # assign a new CRSR issues a warning when the CRS is changed. This is so the user knows that they are simply changing theCRS, not reprojecting the data. An easy way to refer to different projections is via EPSG codes.Under this system 27700 represents the British National Grid. ‘WGS84’ (epsg:4326) is a very commonlyused CRS worldwide. The following code shows how to search the list of available EPSG codes and create anew version of lnd in WGS84:3EPSG - make EPSG() # create data frame of available EPSG codesEPSG[grepl("WGS 84 ", EPSG note), ] # search for WGS 84 code##codenoteprj4## 249 4326 # WGS 84 proj longlat datum WGS84 no defs## 4890 4978 # WGS 84 proj geocent datum WGS84 units m no defslnd84 - spTransform(lnd, CRS(" init epsg:4326")) # reprojectAbove, spTransform converts the coordinates of lnd into the widely used WGS84 CRS. Now we’ve transformedlnd into a more widely used CRS, it is worth saving it. R stores data efficiently in .RData or .Rds formats.The former is more restrictive and maintains the object’s name, so we use the latter.# Save lnd84 object (we will use it in Part IV)saveRDS(object lnd84, file "data/lnd84.Rds")Now we can remove the lnd84 object with the rm command. It will be useful later. (In RStudio, notice italso disappears from the Environment in the top right panel.)rm(lnd84) # remove the lnd object# we will load it back in later with readRDS(file "data/lnd84.Rds")Attribute joinsAttribute joins are used to link additional pieces of information to our polygons. In the lnd object, forexample, we have 4 attribute variables — that can be found by typing names(lnd). But what happens whenwe want to add more variables from an external source? We will use the example of recorded crimes byLondon boroughs to demonstrate this.To reaffirm our starting point, let’s re-load the “london sport” shapefile as a new object and plot it:library(rgdal) # ensure rgdal is loaded# Create new object called "lnd" from "london sport" shapefilelnd - readOGR(dsn "data", "london sport")plot(lnd) # plot the lnd object (not shown)nrow(lnd) # return the number of rows (not shown)The non-spatial data we are going to join to the lnd object contains records of crimes in London. This isstored in a comma separated values (.csv) file called “mps-recordedcrime-borough”. If you open the file in aseparate spreadsheet application first, we can see each row represents a single reported crime. We are going3 Note: entering projInfo() provides additional CRS options. spatialreference.org provides more information about EPSGcodes.10
to use a function called aggregate to aggregate the crimes at the borough level, ready to join to our spatiallnd dataset. A new object called crime data is created to store this data.# Create and look at new crime data objectcrime data - ngsAsFactors FALSE)head(crime data CrimeType) # information about crime type# Extract "Theft & Handling" crimes and savecrime theft - crime data[crime data CrimeType "Theft & Handling", ]head(crime theft, 2) # take a look at the result (replace 2 with 10 to see more rows)# Calculate the sum of the crime count for each district, save resultcrime ag - aggregate(CrimeCount Borough, FUN sum, data crime theft)# Show the first two rows of the aggregated crime datahead(crime ag, 2)You should not expect to understand all of this upon first try: simply typing the commands and thinkingbriefly about the outputs is all that is needed at this stage. Here are a few things that you may not have seenbefore that will likely be useful in the future: In the first line of code when we read in the file we specify its location (check in your file browser to besure). The function is used to select only those observations that meet a specific condition i.e. where it isequal to, in this case all crimes involving “Theft and Handling”. The symbol means “by”: we aggregated the CrimeCount variable by the district name.Now that we have crime data at the borough level, the challenge is to join it to the lnd object. We will baseour join on the Borough variable from the crime ag object and the name variable from the lnd object. It isnot always straight-forward to join objects based on names as the names do not always match. Let’s seewhich names in the crime ag object match the spatial data object, lnd:# Compare the name column in lnd to Borough column in crime ag to see which rows match.lnd name %in% crime ag Borough## [1]## [12]## TRUETRUETRUETRUE TRUETRUE TRUETRUE FALSE# Return rows which do not matchlnd name[!lnd name %in% crime ag Borough]## [1] City of London## 33 Levels: Barking and Dagenham Barnet Bexley Brent Bromley . WestminsterThe first line of code above uses the %in% command to identify which values in lnd name are also containedin the Borough names of the aggregated crime data. The results indicate that all but one of the boroughnames matches. The second line of code tells us that it is ‘City of London’. This does not exist in the crimedata. This may be because the City of London has it’s own Police Force.4 (The borough name in the crimedata does not match lnd name is ‘NULL’. Check this by typing crime ag Borough[!crime ag Borough%in% lnd name].)Challenge: identify the number of crimes taking place in borough ‘NULL’, less than 4,000.Having checked the data found that one borough does not match, we are now ready to join the spatial andnon-spatial datasets. It is recommended to use the left join function from the dplyr package but the4 Seewww.cityoflondon.police.uk/.11
Figure 7: Number of thefts per borough.merge function could equally be used. Note that when we ask for help for a function that is not loaded,nothing happens, indicating we need to load it:library(dplyr) # load dplyrWe use left join because we want the length of the data frame to remain unchanged, with variables fromnew data appended in new columns (see ?left join). The *join commands (including inner join andanti join) assume, by default, that matching variables have the same name. Here we will specify theassociation between variables in the two data sets:head(lnd name) # dataset to add to (results not shown)head(crime ag Borough) # the variables to join# head(left join(lnd@data, crime ag)) # test it workslnd@data - left join(lnd@data, crime ag, by c('name' 'Borough'))## Warning in left join impl(x, y, by x, by y, suffix x, suffix y): joining## character vector and factor, coercing into character vectorTake a look at the new lnd@data object. You should see new variables added, meaning the attribute joinwas successful. Congratulations! You can now plot the rate of theft crimes in London by borough (see Fig 8).library(tmap) # load tmap package (see Section IV)qtm(lnd, "CrimeCount") # plot the basic mapOptional challenge: create a map of additional variables in LondonWith the attribute joining skills you have learned in this section, you should now be able to take datasetsfrom many sources, e.g. data.london.gov.uk, and join them to your geographical data.Clipping and spatial joinsIn addition to joining by attribute (e.g. Borough name), it is also possible to do spatial joins in R. We usetransport infrastructure points as the spatial data to join, with the aim of finding out about how many arefound in each London borough.library(rgdal)# create new stations object using the "lnd-stns" shapefile.stations - readOGR(dsn "data", layer "lnd-stns")# stations read shape("data/lnd-stns.shp") # from tmapproj4string(stations) # this is the full geographical detail.proj4string(lnd) # what's the coordinate reference system (CRS)12
bbox(stations) # the extent, 'bounding box' of stationsbbox(lnd) # return the bounding box of the lnd objectThe proj4string() function shows that the Coordinate Reference System (CRS) of stations differs fromthat of our lnd object. OSGB 1936 (or EPSG 27700) is the official CRS for the UK, so we will convert the‘stations’ object to this:# Create reprojected stations objectstations - spTransform(stations, CRSobj CRS(proj4string(lnd)))plot(lnd) # plot Londonpoints(stations) # overlay the statio
T ialforclarity). Whatdoyouthinkthe col argumentreferstointhebelowblock? (seeFigure5 .