Using a custom dataset to plot maps

Innovation Lab Galileo
8 min readJun 30, 2021

This is the last article in a series focused on building our own datasets to plot maps in R. You can find all the articles in the following links.

In this article, we will…

The datasets that we generate are as easy to use as the ones included in R
  • Analyze the code that is needed to process our CSV files.
  • Plot a map using the coordinates we obtained from the map.
  • Compare some differences between GADM and Natural Earth.

The long-awaited moment is finally here. Now that we’ve used QGIS to generate a dataset with the coordinates of some regions, it is time to plot this map on R. We will use the following code.

# importing some packages
library(data.table)
library(dplyr)
library(ggplot2)
library(tibble)
# importing the CSV file generated in QGIS
qgis_export <- read.csv("vertices.csv", encoding="UTF-8", stringsAsFactors=FALSE)
# generate an unique numerical identifier for each combination of name, admin, vertex_part and vertex_part_ring
# save this idenetifier in a new column called group
setDT(qgis_export)[ , group := .GRP, .(name, admin, vertex_part, vertex_part_ring)]
# if we have extra columns, get rid of them
# name the columns with the names that ggplot2 requires
dataset <- qgis_export %>% select(X, Y, name, admin, group, vertex_part_index)
colnames(dataset) <- c('long', 'lat', 'region', 'pais', 'group', 'order')
# visualize the structure of our dataset
str(dataset)
# generating some random data to color the regions of our map, we can ommit this step if we already have a dataset with information we want to plot
# if we use our own dataset, we need to double check that we have columns named 'region' and 'value'
rand_data <- tibble(region=unique(dataset$region),
value=sample(100, length(region)))
# plotting our map just as in the first article
gg <- ggplot()
gg <- gg + geom_map(data=dataset, map=dataset,
aes(x=long, y=lat, map_id=region),
color="#b2b2b2", size=0.1, fill=NA)
gg <- gg + geom_map(data=rand_data, map=dataset,
aes(fill=value, map_id=region),
color="#b2b2b2", size=0.1)
gg <- gg + coord_fixed(ratio = 1)
gg

The result of running the previous code is a map like the following:

Custom map with the administrative level 1 regions from Central America

In the previous code, the most important line is the following setDT(qgis_export)[ , group := .GRP, .(name, admin, vertex_part, vertex_part_ring)]

Let’s break down this line to understand how it works.

  • setDT(qgis_export): This function converts qgis_export from a data frame to a data table. Data tables are more efficient in memory.
  • [ , group := content ]: Create a new column in the data table, named group with the specified content.
  • .GRP, grouping: GRP is a special symbol from the data table package, it tells R that we want to assign a correlative integer to each element from the following grouping.
  • .(name, admin, vertex_part, vertex_part_ring): We want to group first by region name (name) and then by country name (admin). Some of these regions will have islands or exclaves that still belong to the same region, vertex_part will allow to distinguish them from one another. Finally, if there is some enclave inside our region, vertex_part_ring will allow us to tell it apart.

This grouping and correlative assignment is needed, otherwise, ggplot2 could misunderstand and interpret as if several different regions were connected and would try to draw them continuously, and result in some unexpected lines appearing in our graph crossing from one end of a region to the beginning of the next one.

If we forget to include one of the four needed elements (name, admin, vertex_part, vertex_part_ring) in the grouping, we could end up with some problematic graphs like the following ones.

admin was not included in the grouping; now Colón (Honduras) is considered the same region as Colón (Panama) so they are joined together by a line that crossed some countries and the ocean
vertex_part was not included in the grouping; some countries have islands that ggplot2 now believes are connected to the mainland

Using GADM

If we want to show administrative level 2 or lower divisions (such as municipalities, cities, counties, and others) we will use the geographical information that GADM provides us.

GADM has a great amount of detail in its information, so it is advisable to work only with the relevant countries, instead of importing the whole world data.

  • As a comparison, the shapefile with administrative level 1 divisions from Natural Earth is 20 MB in size, while this same information from GADM is 604 MB in size.
  • If you try to export the vertices for the whole world from GADM maps in a single operation, QGIS may need up to 20 GB of RAM to successfully complete the task.

To work with GADM, the first steps are the same, the only changes happen after exporting our CSV and opening it in R.

To start, use Layer > Add Layer > Add Vector Layer… to add all the shapefiles corresponding to the countries you want to plot. If you add several countries and you want to generate a single CSV file containing them all, use the tool found in Vector > Data Management Tools > Merge Vector Layers… to join them in a single layer.

Multiple shapefiles can be added
Merge Vector Layers… is used to join them in a single layer
Input layers is used to select the layers to join

In the Merge Vector Layers window, at the Input layers section, click on the button to the right. After clicking, the window will change to show a list of layers, where each one corresponds to each shapefile we have added. Choose all the desired layers and click on Run, QGIS will generate a new layer.

Selecting all the layers to join
QGIS will generate a new layer with all the information combined

Once we have the new layer ready, the steps are the same as when we used Natural Earth. Press F6 to open the attribute table, and, at the top of the window click on Select features using an expression. Write an expression in case you need to select only some regions.

Then, go to Vector > Geometry Tools > Extract Vertices… and generate the vertices layer. When doing this, notice that the task takes longer to complete compared to how much it did take when using Natural Earth, this is caused by the high amount of details in GADM. We are ready to generate the CSV file by right-clicking the newly created layer, and selecting Export > Save Features As…

From this point onwards, we need some additional information about the region we are working with. In the case of Guatemala, for example, I know that there are some municipalities (administrative level 2) with the same name but in different departments (administrative level 1). It is extremely important to keep all this information, to correctly process the vertices in R.

La Democracia (Huehuetenango and Escuintla) and La Libertad (Petén and Huehuetenango) can cause some problems if I wouldn’t keep the NAME_1 column

In this case, we need to keep the columns NAME_0, NAME_1, NAME_2, vertex_part, vertex_part_ring and vertex_part_index. Remember to pick AS_XY in the GEOMETRY selection.

Different columns from the Natural Earth’s ones

After exporting the CSV, the processing needed to be done in R is similar but with some small changes.

# importing some packages
library(data.table)
library(dplyr)
library(ggplot2)
library(tibble)
# importing the CSV file generated in QGIS
qgis_export <- read.csv("dosPaises.csv", encoding="UTF-8", stringsAsFactors=FALSE)
# ----- SOME CHANGED FROM NATURAL EARTH -----# generate an unique numerical identifier for each combination of NAME_0, NAME_1, NAME_2, vertex_part and vertex_part_ring
# save this idenetifier in a new column called groupsetDT(qgis_export)[ , group := .GRP, .(NAME_0, NAME_1, NAME_2, vertex_part, vertex_part_ring)]
# if we have extra columns, get rid of them
# name the columns with the names that ggplot2 requires
# now we are naming the 'NAME_2' column as 'region' since we want to use it to plot
dataset <- qgis_export %>% select(X, Y, NAME_0, NAME_1, NAME_2, group, vertex_part_index)
colnames(dataset) <- c('long', 'lat', 'pais', 'departamento', 'region', 'group', 'order')
# ----- END OF THE MOST IMPORTANT CHANGES -----# visualize the structure of our dataset
str(dataset)
# generating some random data to color the regions of our map, we can ommit this step if we already have a dataset with information we want to plot
# if we use our own dataset, we need to double check that we have columns named 'region' and 'value'
rand_data <- tibble(region=unique(dataset$region),
value=sample(700, length(region)))
# plotting our map just as in the first article
gg <- ggplot()
gg <- gg + geom_map(data=dataset, map=dataset,
aes(x=long, y=lat, map_id=region),
color="#b2b2b2", size=0.1, fill=NA)
gg <- gg + geom_map(data=rand_data, map=dataset,
aes(fill=value, map_id=region),
color="#b2b2b2", size=0.1)
gg <- gg + coord_fixed(ratio = 1)
gg

Done! After running this code, we will have a custom map like this. Now we only need to substitute rand_data with the information that we want to show.

Graph for our custom region

Some pre-generated maps

QGIS can be very demanding on some computers, especially if we use the GADM maps. In the following link, you can download some CSV files ready to be used in R, generated with the instructions that we followed in these articles.

Closing words

The ggplot2 package in R is a powerful tool to create graphs of any kind, including maps. Due to the amount of detail that a map can contain, and the different administrative divisions there are, ggplot2 does not include maps for all the countries in the world. However, projects as Natural Earth and GADM have the objective of making political maps accessible for everyone. Tools such as QGIS can be intimidating at first, because of the huge variety of functions they have, but once we find where some of the layers and vectors options are, we can easily obtain any information we want from a shapefile.

--

--