I’m a big fan of the book “Albemarle County In Virginia” by Edgar Woods for two reasons: 1) it’s a thorough history of the County packed full of names and places, and 2) the Internet Archive has a full copy digitized and OCR’d. This makes it relatively easy to search through the text and plot all those place names on a map. In Appendix 8, Woods provides a list of Albemarle County residents who emigrated to other states along with their destinations.
Movement was not uncommon in 18th- and 19th-century America. Woods remarks on this on page 55 of his book:
The migratory spirit which characterized the early settlers, was rapidly developed at this period. Removals to other parts of the country had begun some years before the Revolution. The direction taken at first was towards the South. A numerous body of emigrants from Albemarle settled in North Carolina. After the war many emigrated to Georgia, but a far greater number hastened to fix their abodes on the fertile lands of the West, especially the blue grass region of Kentucky. For a time the practice was prevalent on the part of those expecting to change their domicile, of applying to the County Court for a formal recommendation of character, and certificates were given, declaring them to be honest men and good citizens.
The list Woods provides stretches for nine pages. I thought it would be interesting to put all those names on a map and then build an Esri Web App to help explore this data. The rest of this article will cover how I went from the raw OCR’d text to a finished web app.
The app is available here. If you’re less interested in the gritty technical details, feel free to jump to my next post about using the app to draw conclusions about Albemarle emigration.
Note: Although I’ll explain the steps I took, my focus is on the general process rather than documenting every minor detail. I may skip over some sub-steps for the sake of brevity. Feel free to email or post a comment if you have questions.
Preparing the Data
The first step was to format and prepare the list of names/places so they could be geocoded (assigned geospatial coordinates for plotting on a map). The Internet Archive has already done the hard work of scanning and OCRing (converting the words in an image to text). Here’s a sample of what the data looks like:
James and Mary (Woods) Garth
William and Elizabeth (Davis) Irvin, L^ancaster
Thomas Irvin, Lancaster
Martin and Mildred Dawson, Gallia Co.
Andrew J. Humphreys, Logan Co.
John Wiant, Champaign Co.
John and Sarah Garrison, Preble Co.
The text is semi-structured, which makes it easier to work with. The list is organized into sections by state. Each line below the state name gives the name of an individual or couple, followed by a comma and their destination. Most of the destinations are given at the county level; however, some are at the city level and others are blank.
For this type of work I like using Excel for data cleaning. I copied the entire list, about 700 rows, into a spreadsheet. Initially the spreadsheet looked like this:
Notice all the white space? The OCR process can be messy, and in this case it left some sections of the list with blank lines between names. The OCR process also captures the name of the book and the page number printed at the top of every page. On row 679 you’ll also notice an OCR artifact, where “ILLINOIS” is spelled “IIvI^INOIS”. Artifacts like this are sprinkled throughout the text.
I started by removing all the blank rows using a little Excel magic. Then for each row I split the destination name into a new column called “Place Name”. This was easy because Woods consistently placed a comma between the person’s name and their destination. Excel’s Text to Columns tool, using a comma delimiter, automatically creates a new column from all the text after the comma.
We go from a single column of raw text to separate name and place columns.
From here it was just a matter of some brute-force data cleaning. I added a column called “State” and manually copied in the state name for each row. Next I deleted all the extraneous rows containing the state names, page numbers, and book titles. I also used search-and-replace to fix some of the OCR artifacts like “^”.
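If you’d rather script this step, the same cleanup takes only a few lines of Python. This is a minimal sketch, not my actual workflow; it assumes the raw OCR text was saved one line per row to a placeholder file called raw_list.txt:

import pandas as pd

# Read the raw OCR'd text, one line per row ("raw_list.txt" is a placeholder)
with open('raw_list.txt') as f:
    lines = [line.strip() for line in f]

# Drop the blank rows left behind by the OCR process
lines = [line for line in lines if line]

# Fix a common OCR artifact: stray "^" characters inside words
lines = [line.replace('^', '') for line in lines]

# Split each line at the first comma into a name and a place name,
# mirroring Excel's Text to Columns with a comma delimiter
records = []
for line in lines:
    name, _, place = line.partition(',')
    records.append({'Name': name.strip(), 'Place Name': place.strip()})

df = pd.DataFrame(records)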
The next step was to add a column called “Place Type” to capture the provided location level (county, city, or state) for each row. Again, the structure of the text helped by making it possible to formulate a few rules:
- Emigrants located at the county level will have “Co.” at the end of their place name
- Emigrants located at the city level will have a “Place Name” value that doesn’t end in “Co.”
- Emigrants who can only be located at the state level will have a blank “Place Name” field, since no destination was given beyond the state
Following these rules, I filled out the “Place Type” column with values for “County”, “City”, and “State”.
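For the scripting-inclined, those rules translate directly into a small Python function. This is just a sketch of the logic, not the method I actually used (I filled the column in Excel):

def classify_place(place_name):
    # Apply the three rules above to assign a "Place Type" value
    place_name = place_name.strip()
    if not place_name:
        return 'State'   # blank place name: only locatable at the state level
    if place_name.endswith('Co.'):
        return 'County'  # e.g. "Gallia Co."
    return 'City'        # e.g. "Lancaster"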
The last step was to add a column called “Total” to store the number of people described in each line. In most cases it was either one person or a married couple, however a few rows referenced several people or a family.
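A simple heuristic could pre-fill most of the “Total” column, leaving the rows that mention several people or a family to be fixed by hand. A sketch:

def count_people(name):
    # A couple shares one line joined by "and", e.g. "John and Sarah Garrison";
    # rows naming several people or a family still need a manual count
    return 2 if ' and ' in name else 1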
Here’s what the final result looked like:
Geocoding the Data
This new spreadsheet was an improvement over the original list in terms of usability, but it still needed geospatial coordinates in order to put these locations on a map. There are multiple options for geocoding textual data, including building a geocoder in ArcGIS or using a publicly available geocoding service from Google, Bing, or the US Census Bureau. However, most of these services are geared towards processing standard street addresses. Since I only had county, city, and state names to worry about, I opted for a more manual process in ArcGIS.
I located the counties and states in my list using shapefiles from the US Census Bureau. This was done by calculating the centroid of each county and state, and then using a table join to attach those coordinates to my list. There were only 10 unique city names, so I located those manually using Google Maps (a low tech solution, but it works).
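For anyone working outside ArcGIS, here is roughly what that centroid-and-join step looks like in GeoPandas. This is a sketch under assumptions, not my actual workflow: the file names are placeholders, and field names like NAME and STATE_NAME vary with the shapefile vintage:

import geopandas as gpd
import pandas as pd

# Load a US Census Bureau county shapefile (filename is a placeholder)
counties = gpd.read_file('us_counties.shp')

# Compute each county's centroid in an equal-area projection,
# then convert back to lat/lon
centroids = counties.geometry.to_crs(epsg=5070).centroid.to_crs(epsg=4326)
counties['Lon'] = centroids.x
counties['Lat'] = centroids.y

# Strip the "Co." suffix so the list's place names match the shapefile's
emigrants = pd.read_csv('emigrants.csv')
emigrants['County'] = emigrants['Place Name'].str.replace(' Co.', '', regex=False).str.strip()

# Table join: attach each county's centroid coordinates to the emigrant list
emigrants = emigrants.merge(
    counties[['NAME', 'STATE_NAME', 'Lat', 'Lon']],
    left_on=['County', 'State'], right_on=['NAME', 'STATE_NAME'],
    how='left',
)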
In the end I had 413 emigrant destinations and was able to plot them on the map for a quick and dirty visualization.
Working with Python
This was definite progress, but there was a problem. Because of how the data was structured, if multiple people went to the same location there would be multiple overlapping points at that spot. From looking at the map you’d have no idea if there was one point or 100 at a particular spot. It would also make using the web map more difficult, since you would have to navigate through numerous popups to see all the names at a particular location.
The ideal solution would be one point for each location, with an attribute for the number of people at that location (allowing me to symbolize with proportional symbols). Each point would also store all the emigrant names at that location in a single field, allowing for easy viewing in a popup window.
To reformat the data this way I decided to repurpose an existing Python script I had handy. I used a similar process for the Albemarle Historic Web Map, where I geocoded place names in the text of “Albemarle County In Virginia”. In that project I used a custom gazetteer (list of place names) to search for places in the text and associate them with spatial coordinates. This case is a bit different because the geocoding was already done, but the process could be adapted to take all those stacked points and combine them into a new, single point. Here is the workflow I came up with:
- In ArcGIS, develop a list of unique place names and give each a unique ID.
- I did this by dissolving my original shapefile on the Latitude, Longitude, and place name fields. I also used a statistic field to sum all of the people at that location.
- Each location was assigned a unique ID number. The exact number didn’t matter as long as it was unique to that spatial location.
- Example: Garrard County, Kentucky has an ID of 52
- In ArcGIS, associate each of the original emigrant points with the unique ID number for that location
- This was done using a spatial join with the place name shapefile to transfer the unique ID attribute to all the emigrant points that shared the same location
- Example: The four points that fall on Garrard County each now have an ID of 52
- Convert the attributes of each shapefile to a CSV table
- One table called “Places” includes the unique place names, coordinates, and ID numbers
- Another table called “People” contains emigrants and the unique ID of their location
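For reference, the first two steps translate roughly into arcpy like this. I actually ran them through the ArcGIS interface, and the layer and field names below are hypothetical:

import arcpy

# Step 1: dissolve the stacked emigrant points into unique places, summing
# the people at each location; the dissolved features' IDs can then serve
# as the unique place IDs
arcpy.management.Dissolve(
    'emigrant_points', 'unique_places',
    dissolve_field=['Latitude', 'Longitude', 'PlaceName'],
    statistics_fields=[['Total', 'SUM']],
)

# Step 2: spatial join to copy each place's unique ID back onto all of the
# original emigrant points that share its location
arcpy.analysis.SpatialJoin(
    'emigrant_points', 'unique_places', 'emigrant_points_with_id',
    match_option='INTERSECT',
)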
Finally, I could take these tables and use them as the input to my Python script:
import csv

#Open the input and output CSVs
peopleCSV = open('People.csv', 'rt', newline='')
placesCSV = open('Places.csv', 'rt', newline='')
outputCSV = open('Output.csv', 'wt', newline='')

#Create objects to read inputs and write the output
peopleReader = csv.reader(peopleCSV)
placeReader = csv.reader(placesCSV)
writer = csv.writer(outputCSV)

#Write a header row in the output CSV
writer.writerow( ("ID", "State", "CountyCity", "Type", "Lat", "Lon", "Total", "People") )

#Skip the header rows in both input CSVs (assumes each file has one)
next(placeReader)
next(peopleReader)

#Create an empty list to hold the names of people at each unique location
peopleNames = []

#Loop through the CSV with place names
for row in placeReader:
    #Create variables to store each of the relevant place name attributes
    placeID = row[0]
    state = row[1]
    placeName = row[2]
    placeType = row[3]
    lat = row[4]
    lon = row[5]
    total = row[6]
    #Loop through the CSV with people names
    for row in peopleReader:
        #Create variables to capture each person's place ID and name
        #(assumes the ID is in the first column and the name in the second)
        peopleID = row[0]
        personName = row[1]
        #Check to see if the ID from the Places list matches this row of the People list. If they match, the person is located at that place
        if placeID == peopleID:
            #Add an HTML break tag to the person's name - this is necessary for displaying properly in the webmap popup box
            entry = personName + "<br>"
            #Add the person's name to the peopleNames list
            peopleNames.append(entry)
    #peopleNames is a list and will be formatted weird if we write it directly to a CSV. Loop through the list and add each name to a string
    printNames = ""
    for person in peopleNames:
        printNames += person
    #Have the reader return to the beginning of the People list (and skip its header again)
    peopleCSV.seek(0)
    next(peopleReader)
    #Write out a new row to the output CSV. All the emigrants at this location will now be stored in a single field
    writer.writerow( (placeID, state, placeName, placeType, lat, lon, total, printNames) )
    #Reset the list to empty for the next iteration
    peopleNames = []

#Close the input and output CSVs
peopleCSV.close()
placesCSV.close()
outputCSV.close()
This output CSV could then be brought back into ArcGIS and converted to a Feature Class.
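In arcpy terms the conversion is essentially a one-liner, using the XY Table To Point tool. This is a sketch; the geodatabase path below is hypothetical:

import arcpy

# Convert the script's output CSV back into a point Feature Class
arcpy.management.XYTableToPoint(
    'Output.csv', r'C:\data\albemarle.gdb\emigrant_destinations',
    x_field='Lon', y_field='Lat',
    coordinate_system=arcpy.SpatialReference(4326),  # WGS84 lat/lon
)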
Here’s what the final results looked like using proportional symbols to show the number of emigrants at each location:

As a final step I calculated a few additional layers for the webmap:
- “Flow lines” connecting each destination point with its origin in Albemarle County.
- The Linear Directional Mean of all those lines to highlight the average direction, length, and geographic center of the flow lines.
- A Standard Deviational Ellipse of all the destination points to better understand their central tendency, dispersion, and direction.
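All three are standard ArcGIS tools. Sketched in arcpy, with hypothetical layer and field names (the flow-line step assumes the table carries origin coordinate columns):

import arcpy

# Flow lines from a constant Albemarle origin to each destination (assumes
# OriginLon/OriginLat columns holding the Albemarle County coordinates)
arcpy.management.XYToLine(
    'Output.csv', 'flow_lines',
    startx_field='OriginLon', starty_field='OriginLat',
    endx_field='Lon', endy_field='Lat',
    line_type='GEODESIC',
)

# Linear Directional Mean: average direction, length, and center of the lines
arcpy.stats.DirectionalMean('flow_lines', 'flow_ldm', 'DIRECTION')

# Standard Deviational Ellipse of the destination points, weighted by the
# number of emigrants at each one
arcpy.stats.DirectionalDistribution(
    'emigrant_destinations', 'dest_ellipse', '1_STANDARD_DEVIATION', 'Total',
)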
Getting Everything Online
Building an Esri Web App is a two step process. First you build a web map, then you pull that map into an app. I uploaded all the data to ArcGIS Online (AGOL) and went to work on the map. AGOL makes it easy to create proportional symbols. I also created a custom popup to highlight the number of emigrants at each location and show their names. Another cool feature of AGOL is that you can easily embed a chart or graph into your popups. I added a pie chart showing the number of emigrants at each location compared to the overall total number of emigrants.
Adding a Basemap
At this point I was happy with how the map was shaping up, but it was missing something. I wanted to give some historic context for these locations beyond what the standard Esri basemaps provided. I visited the David Rumsey map collection and found a nice-looking map by John Melish from 1822. The map was gorgeous, but its color and shading detracted from my focus on the emigrant locations. To fix this, I downloaded the map, pulled it into Photoshop, and converted it from color to grayscale. Then I brought the map into ArcGIS and georeferenced it to give it spatial coordinates. Finally, I took the new georeferenced, grayscale version of Melish’s map and used Python’s GDAL library to build raster map tiles. Tiling is a great choice for building raster web map layers that load quickly.
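The tiling step itself boils down to a single call to GDAL’s gdal2tiles.py utility. The zoom levels and filenames here are illustrative, not the exact values I used:

import subprocess

# Build Web Mercator XYZ tiles from the georeferenced, grayscale GeoTIFF
subprocess.run(
    ['gdal2tiles.py', '-p', 'mercator', '-z', '3-9',
     'melish_1822_grayscale.tif', 'melish_tiles/'],
    check=True,
)

I uploaded the tiles and added them as a layer to my map: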
Building an App
The last step was to incorporate the map into an app. This adds the ability to create a custom, streamlined interface and include widgets with additional functionality. Esri’s WebApp Builder is a handy way to quickly make good-looking apps. I wanted to keep the final app simple, so I chose a minimal theme called “Billboard.” I also wanted to keep the app focused on my content, so I turned off some of the default widgets that weren’t necessary here, like “Locate Me”, “Search”, and “Overview Map”. I added a few new widgets to help users get the most out of the app:
- An information panel with details about the map and sources
- A layer list widget for toggling layers on and off
- A chart widget for building interactive bar graphs
- An attribute table widget for exploring the emigrant attribute data
- A sharing widget so users can easily share the app with their friends and colleagues
These widgets are what separate an “app” from just a “map”. They allow the user to easily interact with the map by turning layers on and off, reordering them, and adjusting their transparency. The chart widget is a really handy way to explore the data: it displays a bar chart, pulled directly from the live data, of the total number of emigrants by state. What’s cool is that the chart is interactive and linked with the map, so selecting a state on the chart will highlight the associated points on the map.
I hope this post has been helpful for understanding how semi-structured historical text can be cleaned and plotted on an interactive web map. Please keep reading, visit the app and explore for yourself, or ask questions in the comments!