Visualization of data with maps

As a child I spent hours looking at atlases and desk globes. I was fascinated and dreamt myself away to all corners of the world….and I have today travelled and lived in most of the places I was dreaming of. Today, as an analyst, I enjoy combining data about different countries and I have found that visualizing it on maps put the world into perspective. Whether it is to compare population density or economic differences between different parts of the world, a picture often says much more than tables and figures.

MAPOFTHEWORLD.jpg

In this blog, we will present a couple of R-packages that can be used to visualize data on either flat 2D maps or on globes. These are easy to understand, easy to implement and do not require advanced knowledge in topography or modern geo-positioning tool. In fact, the only prerequisites are to know the difference between longitude and latitude….and some imagination. For those that are confused about which is latitude and longitude, keep this picture in mind:

LatLong

 

Population and economic data visualization

The surface area of Earth is approximately 510 million km² out of which 361 million km² and 149 million km² is land and it has been estimated that 70% of that land is habitable (how welcoming it is up for debate), which leaves means approximatetly 104 million km². There are about 7,5 billion people on earth and this means that every individual has 0.014 km² (14 000 m²) to enjoy . Seen in another way, the average person takes up 0,75 m² and earth’s population seen as a crowd would take up a little over 5 625 km² which is the combined areas of Gotland and Öland or the area of Brunei.  So, overpopulation is not really a big issue if only areals are considered. The issue actually lies in how the populations are scattered or clustered around natural resources. It is however quite an excercise to imagine the differences in population and population densities between countries and to visualize these will be the first of our examples on how maps can be used in a efficient manner.

One of the easiest and practical map/data R-packages are raster and rworldmap  which contains both geographical, demographic and some economic such as GDP (Gross Domestic Product) data for 243 countries. The limitation of what can be shown given that data is evident but there are plenty of other sources of data om economy, health-care and demography online that can be used. we will show two examples using the data in the package aswell as an example with data downloaded from the World Bank.

Let us start with a combination usage of these two packages that shows the content of information. Let’s say I want to give a map of Sweden containing the boundaries of all administrative regions. This can easily be done by:

library(raster)
library(rworldmap)

SwedenLevel1 = raster::getData("GADM", country = "Sweden", level = 1)
rworldmap::mapPolys(SwedenLevel1, nameColumnToPlot="NAME_1", addLegend=FALSE)
text(SwedenLevel1, label="NAME_1", cex=0.7)

For a finer granulation, on Commune level, the procedure is the same:

SwedenLevel2<- raster::getData("GADM", country = "Sweden", level = 2)
rworldmap::mapPolys(SwedenLevel2, nameColumnToPlot="NAME_2", addLegend=FALSE)
text(SwedenLevel2, label="NAME_2", cex=0.7)

which gives the following maps:

 

And this can be done for any of the 243 countries available

 

When looking at global data, we will not look at regional data for each country for the simple fact that we cannot hope to obtain it for every country. Even for a country like Sweden, blessed with statistics on any subject, this would require quite some work (either order the data from Statistiska Centralbyrån or do what I did….download historical data five commune at a time).

As mentioned above, earth’s population is today just under 7.5 billion and half of it lives in two countries: China and India. But are these coutries overcrowded? Apart from big cities (which are many in China), travelling through it’s rural area gives you an incredible sense of space. Let us first see at a map picturing the total population of each country. We first install and load some packages that may be usefull, espacially if other data is to be added apart from the ones available in raster adn rworldmap.

library(ggplot2)
library(grid)
library(data.table)
library(xlsx)
library(rworldmap)
library(classInt)
library(RColorBrewer)
library(maptools)
library(raster)
library(data.table)
library(xlsx)
library(dplyr)
library(maps)
Sys.setenv(JAVA_HOME='C:/Program Files/Java/jre1.8.0_181') # for 32-bit version
library(rJava)

Since we are looking at a global map, we need to get the map from the package and associate all the available country names to an index. When this is done, all information about country borders can be obtained. In fact, all countries have been created using polygons.

worldMap      = getMap()
EARTH         = as.vector(worldMap$NAME)
indexEARTH    = which(worldMap$NAME%in%EARTH)
CountryCoords = lapply(indexEARTH, function(i){
           df = data.frame(worldMap@polygons[[i]]@Polygons[[1]]@coords)
    df$region = as.character(worldMap$NAME[i])
 colnames(df) = list("long", "lat", "region")
                return(df)
                })
CountryCoords = do.call("rbind", CountryCoords)

which gives the longitude and latitude of every corner of the created polygons:

cooord.jpg

The next step is to obtain the population data for each country and to bind it to the map data. Let’s start with the total population and create a table containing all the necessary information:

WORLPOP                = as.data.frame(worldMap$POP_EST)
valuePOP               =    worldMap$POP_EST
CountryTablePOP        =    data.frame(country = EARTH, value = valuePOP)
CountryCoords$valuePOP =    CountryTablePOP$value[match(CountryCoords$region,CountryTablePOP$country)]

which gives:

coutryPOP

We now have all the information needed to plot a global map. Just to give a visual difference of the populations we choose to use a palette of 9 different shades of blue. This is why I chose to install and load the package RColorBrewer.

colourPalette <- brewer.pal(9,'RdPu')

Ppop  = ggplot() + geom_polygon(data = CountryCoords, aes(x = long, y = lat, group = region, fill = valuePOP),
        colour = "black", size = 0.1) +
coord_map(xlim = c(-180, 180), ylim = c(-60, 120))

Ppop = Ppop + scale_fill_gradient(name = "GDP",colourPalette = colourPalette, na.value="grey")
Ppop = Ppop + theme(axis.text.x = element_blank(),axis.text.y = element_blank(), axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(), axis.title = element_blank(),
plot.margin  = unit(0 * c(-3, -3, -1.5, -1.5), "lines"))

Which gives the following map:

Population

As you might have noticed, we chose to create a vector EARTH containing all countries on the planet. We could have created a vector EUROPE and concentrated our mapping on this region. The vector EARTH simply needs to be replaced by a vector of the form and leav it as an excercice for the reader.

EUROPE = c("Austria" ,"Italy","Belgium","Latvia","Bulgaria" ,"Lithuania",
"Croatia", "Luxembourg","Cyprus", "Malta","Czech Republic", "Netherlands",
"Denmark" ,"Poland","Estonia" ,"Portugal","Finland", "Romania","France", "Slovakia",
"Germany" ,"Slovenia","Greece", "Spain","Hungary" ,"Sweden","Ireland" ,"United Kingdom")

Now, we clearly see on the above map that India and China stick out as the two most populated countries in the world followed by the US and Brazil. But are they the most densly populated countries? Population density is calculated as the ratio between the area and the total population, or simply put as the number of available square meters (or km) per capita. The packages we use do not contain this information so we need to turn to another source for that. The World Bank is a gold mine of data on a wide range of subjects and it just happens to have a value for square meters per capita in it’s database, at least for 240 of earth’s countries.

The procedure is exactly the same with the exeption that one needs to merge this downloaded data with the mapping data from rworldmap:

CountryDen = merge(x = WorldPopDen, y =CountryTablePOP, by = "country",all.y =TRUE)

and one gets the following map showing population densities per capita:

Density_SQKM_PerCapita

The darker the color on the map the more square km are available per capita. China for instance comes only at the 97th most densely country! Note that the Netherlands is a very light blue…it comes at place 32 of the world’s most densely populated countries. Driving through the Netherlands will certainly convince you of this fact. For a complete list of countries ranked by their population densities, see Wikipedia.

Let us now turn to economic factors. One of the values most often used to define a country’s economy and wealth is the GDP (Gross Domestic Product). We remind the reader that the GDP is defined as:  the monetary value of all the finished goods and services produced within a country’s borders in a specific time period. GDP includes all private and public consumption, government outlays, investments, private inventories, paid-in construction costs and the foreign balance of trade (exports are added, imports are subtracted).

As previously in this blog we will look at two measures, namely the GDP and the GDP per Capita (which simply is the GDP distributed to every single individual in the country. OBS! there are some diverging defintions of the GDP per Capita but for simplicity we will just attribute every individual a share of the GDP). rworldmap contains information on both the GDP and population, so we simply need to create a new value as the ration between these two measures.

value                 = worldMap$GDP_MD_EST/worldMap$POP_EST
valueGDP              =    worldMap$GDP_MD_EST

CountryTableGDP = data.frame(country = EARTH, value = valueGDP)
CountryCoords$valueGDP = CountryTableGDP$value[match(CountryCoords$region,CountryTableGDP$country)]


Pgdp        = ggplot() + geom_polygon(data = CountryCoords, aes(x = long, y = lat, group =                          region, fill = valueGDP),
                    colour = "black", size = 0.1) +
coord_map(xlim = c(-180, 180), ylim = c(-60, 120))


Pgdp = Pgdp + scale_fill_gradient(name = "GDP",colourPalette = colourPalette)
Pgdp = Pgdp + theme(axis.text.x = element_blank(),axis.text.y = element_blank(),                                                          axis.ticks.x = element_blank(),
                                        axis.ticks.y = element_blank(), axis.title = element_blank(),
plot.margin = unit(0 * c(-3, -3, -1.5, -1.5), "lines"))


P = ggplot() + geom_polygon(data = CountryCoords, aes(x = long, y = lat, group = region,  fill = value),
colour = "black", size = 0.1) +
coord_map(xlim = c(-180, 180), ylim = c(-60, 120))
P = P + scale_fill_gradient(name = "GDP",colourPalette = colourPalette)
P = P + theme(axis.text.x = element_blank(),
axis.text.y = element_blank(), axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(), axis.title = element_blank(),
plot.margin = unit(0 * c(-3, -3, -1.5, -1.5), "lines"))

where Pgdp is the plot of country-GDPs and P is the one for country-GDP per Capita.

GDP
GDP

where the lighter the blue shade is the higher the country’s GDP is. In contrast, one has the map below which shows the GDP per Capita, where again, the lighter the blue shade is the higher the GDP….there are some noteworthy difference.

gdpcap
GDP per Capita

One fact not shown by the map above is that the country with the highest GDP per Capita is…..Luxemburg, as can be seen in the picture below:

gdpcapEurope

As you can see, these R packages are easy to use and the amount of data available online is the limit for what can be done. We now turn to 3D-mapping.

Flight paths with threejs and revgeocode

As fun as 2D-maps are they have their limitations and, as 2D-maps come they have their flaws. One of the obvious is the distortion of countries. we will not go into the details of all projections and issues (both practical and mathematical) with projecting a three dimension surface onto a two dimensional space. Instead, we are going to concentrate on the practical use of 3D interactive tools available in R.

In our previous post, The customer is always right , we discussed airlines, so why not make a spin off on that? What we are going to do now is trace all the fights between a given airport and all the destinations that can be reached from this given city (or airport).  What we want to do is trace paths between two points on a sphere. To do so we need a source point and a destination point. Luckily, there is a package, MUCflights, that has as a dataset most airport locations in the world.

airports

As you can see, it contains the name of the airport, which city and country it lies in, it’s geographic position, its ICAO and IATA codes and some other information that we won’t be using. Having this information is however not enough because not all airports have connections with all other airports in the world. We therefore need information on flight routes. We’re in luck! Again! There is an amazing site, OpenFlight.org, which shares a wide range of datasets related to airlines and airways.

routes

The most important pieces of information in this dataset are the “Source airport” and “Destination airport”. These are given as the IATA airport code. Given the airport-dataset obtained from the MUCflights-package and the routes given by OpenFlight one can easily construct a dataset containing both airport designations and geographical data for flight departure and arrival cities. It is a bit tricky because the variable names do not match and the Airport dataset has to be merged twice….but hey, it’s worth it (and if anyone has a better idea about how to perform this, please do leave a comment).

data("airports", package = "MUCflights")

Airports = as.data.frame(airports)
names(Airports) = c("AirportID","Name","City","Country","Source_IATA",
"ICAO","Latitude","Longitude","Altitude","Timezone","DST")
routes = fread("Route.csv")
airlines = fread("IATA.csv")


Departures = table(routes$`Source airport`,routes$`Destination airport`)
Departures = as.data.frame(Departures)
Departures = Departures[which(Departures$Freq !=0),]

names(Departures) = c("Source_IATA","Destination","Frequency")

Departures = aggregate(Departures$Frequency, by=list(Source_IATA=Departures$Source_IATA), FUN=sum)

Total = merge(Airports, Departures, by=c("Source_IATA"))
Total$Value = Total$x

routes$Source_IATA = routes$`Source airport` 
routes$Dest_IATA = routes$`Destination airport`

routes1 = merge(x=routes, y=Airports, by ="Source_IATA", all.x =TRUE)

routes1redux = routes1[,c("Source_IATA","Source airport",
"Destination airport","Dest_IATA","Name","City","Country","Latitude","Longitude")]

names(routes1redux) = c("Origin","Origin airport",
"Destination airport","Source_IATA","Source_Name","Source_City","Source_Country","origin_Latitude","origin_Longitude")

routes2 = merge(x=routes1redux, y=Airports, by ="Source_IATA", all.x =TRUE)
routes2redux = routes2[,c("Origin","Origin airport","Destination airport","Source_Name",
"Source_City","Source_Country","origin_Latitude","origin_Longitude", 
"Name","City","Country" ,"Latitude","Longitude")]
names(routes2redux) = c("Origin","Origin airport","Destination airport","Source_Name","Source_City",
"Source_Country","origin_Latitude","origin_Longitude","Destination_Name","Destination_City","Destination_Country",
"Destination_Latitude","Destination_Longitude")

Now come the real fun! As you might imagine, there are many routes from most airports and rendering a picture of all flights from every single airport would not be very appealing to see. So, we’ll reduce our excercice to all flights leaving Paris, France. There are still over 3000 destinations reached from there (and more going to Paris. France is after all the most visited country in the world!).

Another thing is that we will asume that planes ALWAYS choose a goedesic path between two points on the planet. These are the shortest path between two points on a sphere and they always follow great circles (in other words, the shortest path between two points on a sphere is a segment of the great circle containing these points).

We’ll need a few packages in order to calculate the arcs connecting the different geographical positions to that of Paris’ airports (Beauvais, Charles de Gaulle and Orly). We’ll also need to be able to have interactivity with the resulting plot. Apart from the ones that we already used in the previous part of this blog, we wrote a code to batch-install and attach the required packages. We also add WebGL to be able to publish the resulting plot in a web-browser.

library(rworldmap)
library(countrycode)
library(revgeocode)
Sys.setenv(JAVA_HOME='C:/Program Files/Java/jre1.8.0_181')
library(rJava)
list.of.packages = c("geosphere","threejs", "rworldmap","leaflet","rgeos","raster","DT", 
"ggplot2","sp","ggmap","knitr","rglwidget", "rgl")

new.packages = list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) suppressMessages(suppressWarnings(install.packages(new.packages)))
lapply(list.of.packages,function(x){suppressMessages(suppressWarnings(library(x,character.only=TRUE)))})

knit_hooks$set(webgl = hook_webgl)

So, let’s calcultate arcs. Since we only choose the routes from Paris, we only need to determine the arcs from that location (longitude = 2.55, Latitude = 49.01278). We called the set of ALL routes routesAll

arcs = routesAll %>%
cbind(
      origin_lat = 49.01278,
     origin_long = 2.550000
) %>%
transmute(
origin_lat = origin_lat,
origin_long = origin_long,
dest_lat = Destination_Latitude,
dest_long = Destination_Longitude,
value = value
)

arcs = unique(arcs)
arcs = arcs[complete.cases(arcs), ]

Note one very important property: There are two arcs from points A and B on a sphere, and unless they are diametrically opposed, one must be shorter than the other. Our function always chooses the shorter of these arcs.

The geographical positions given in all our dataframes are longitudes and latitudes, but we need to be able to give them as spatial points. Enters the package sp and its functions SpatialPoints and SpatialPointsDataFrame. We transform the given information in routesAll into spatial Points:

source     = data.frame(SourceLong=arcs$origin_lat,SourceLat=arcs$origin_long)
source_sp  = SpatialPoints(source, proj4string=CRS("+proj=longlat"))
source_sp  = SpatialPointsDataFrame(source_sp, data = source)

dest       = data.frame(DestLong=arcs$dest_long,DestLat=arcs$dest_lat)
dest_sp    = SpatialPoints(dest, proj4string=CRS("+proj=longlat"))
dest_sp    = SpatialPointsDataFrame(dest_sp, data = dest)

Just for the fun of it, one can create a dataset containing the distances of a arcs from Paris to destinations reached by airlines. Now, let’s plot this on a map. So far, we are still doing this on a 2D-map, and a pretty ugly one, one must say, but it is a good excercie

(worldMap              =     getMap() )
world.points           =     fortify(worldMap)  # Convert data into dataframe using fortify from ggplot2
world.points$region    =     world.points$id
world.df               =     world.points[,c("long","lat","group", "region")]
worldmap               =     ggplot() + 
  geom_polygon(data = world.df, aes(x = long, y = lat, group = group)) +
  geom_point(aes(x=combination[,2], y=comb_df[,1]),color="yellow", size=1) +  # Plot Source Locations
  geom_point(aes(x=combination[,4], y=comb_df[,3]),color="green", size=1) +  # Plot Dest Location
  scale_y_continuous(breaks = (-2:2) * 30) +
  scale_x_continuous(breaks = (-4:4) * 45) +
  theme_bw() +
  theme(axis.line = element_line(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border     = element_blank(),
        panel.background = element_blank())

which gives

fuglymap

We still get a pretty good view of the multitude of flights from Paris to all “corners” of the world. To get a simple 3D map we simply need to add one line of code:

worldmap +coord_map("ortho", orientation=c(30, -10, 10))+theme(panel.background = element_rect(fill = 'gray', colour = 'white'), panel.grid.major = element_line(color = "white"),
panel.grid.minor = element_line(color = "white"))

fuglyround

Now, for the Grand Finale! let’s make an interactive map of flights from Paris. To not clutter the visualization we restrict ourselves to 500 destinations, but it works just as fine with all of them. One needs to use the threejs package to retrieve the map from the package and the function glogejs to plot all arcs onto the resultiong sphere. Neat and easy. You will however not be able to view the result unless you can publish it in a web-browser. if you are using Rstudio, you have a function in the viewer window allowing you to publish you graohs on Rpubs (for instance).

sourceParis = factor(sprintf("%.2f:%.2f",combination[,2], combination[,1]))

frequent = sort(table(sourceParis), decreasing=TRUE)
destinations = names(frequent)[1:500]
idx = sourceParis %in% destinations
LongLat = unique(combination[idx,1:2])
Flights = combination[idx,]

library(ggmap)
(earth = system.file("images/world.jpg", package="threejs")) 
DF = data.frame(origin_lat = Flights[,1], origin_long = Flights[,2], dest_lat = Flights[,3], dest_long = Flights[,4])
#
globejs(img=earth, lat=LongLat[,1], long=LongLat[,2], arcs=DF,
arcsHeight=0.4, arcsLwd=2, arcsColor="red", arcsOpacity=0.25,
atmosphere=TRUE, height=400, width = 400)

Which gives (Click on the image or here to reach the interactive web-browser version)

interactive

 

Until next time, have fun with map!

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Powered by WordPress.com.

Up ↑

%d bloggers like this: