Data Cleanup with OpenRefine

Collecting metadata while digitizing archival resources is not a straightforward task. Names, places and events listed in the documents often diverge from current spelling; there are variants, aliases, spelling errors and so on. Modern search engines like Google know about synonyms and common spelling errors and can correct them:

Showing results for Kowalski
Search instead for Kowakski

But this works best for common names and typical spelling errors. In a project in which we want to present the metadata as Linked Open Data, we need a clean list of entries, free of spelling errors and with variants identified, if any.

Let us take as an example the names of people (we also collect places, historical events and more). The name alone does not typically identify the person - obviously there can be many people with the same name. And even once the person is identified, we find that very often his or her name exists in many variants. There are versions in various languages; the person could use a pseudonym (or several) at some period of life, change names (before or after marriage), add titles to the name and so on. Subjects and citizens call their leaders by their monikers. How to find your data in this mess?

For persons mentioned in the archival documents, we have adopted several rules. The rules are somewhat arbitrary, but we had to start somewhere:

  1. We use one standard name for one person. The alternative names or names in different languages are also collected to help in search. We are typically guided by the name used in Polish, if possible, and use the Wikipedia spelling (in Polish or other language) when appropriate.

  2. We list the name as Last Name, First name(s), in this order. This sometimes causes difficulties, as it is not always easy to figure out the given names. The rule has an exception for people known only by their full name, such as kings, popes, etc.; in this case we list the full name as it is commonly used or officially known.

  3. We assign each person a unique identifier which we create from scratch. If possible, we correlate this identifier with two of the most common registries: WikiData and VIAF. There are people who do not have their articles in Wikipedia in any language, and hence no WikiData ID. There are people who never wrote a book and are not in the library index represented by VIAF. For those we create a short description, add references and assign our own identifier.
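
To make these rules concrete, here is a minimal sketch (in Python) of what a cleaned-up record for one person could look like. The field names and the local identifier format are illustrative only, not our actual schema; the WikiData and VIAF identifiers are the ones for Getúlio Vargas, whose name appears again in the clustering example below.

    # Illustrative record shape only; field names and the local ID format are hypothetical.
    person = {
        "name": "Vargas, Getúlio",                           # one standard form: Last Name, First name(s)
        "altNames": ["Vargas, Getulio", "Getulio, Vargas"],  # variants collected to help in search
        "wikidataID": "Q156844",                             # WikiData identifier, when one exists
        "viafID": "24623952",                                # VIAF identifier, when one exists
        "localID": "pia-person-0001",                        # our own identifier (made-up format)
    }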

The next step is to review the roughly 80 thousand name records that we have collected so far and bring the names to this standard. We work in chunks, typically one archival fonds at a time, but it is still tens of thousands of records. One can work with a spreadsheet, which is a universal tool, and with functions such as sort, filter and global find-and-replace one can do a lot of work. However, we found that a specialized tool called OpenRefine can be much more useful for this task. OpenRefine (an open source program) grew out of a Google project and was originally called Google Refine. It was strongly connected with a project called Freebase [1], now defunct, which collected data from several different databases and also allowed users to add their own. OpenRefine was created expressly for the task of cleaning up, or refining, mixed-quality data.

OpenRefine

OpenRefine is a sophisticated piece of software with very powerful tools. It can be installed on a personal computer and runs as a web server, which means that you interact with it via a browser. The data is stored locally and does not leave your computer, so you can work on sensitive data without worrying about sharing it. It also means that you can only work on one computer at a time, but it is rather easy to export the project and carry it on a memory stick.

OpenRefine is rich in options, functions and capabilities. I will not attempt to describe all of them here. My goal is to introduce OpenRefine and show some of its features using the example of names of people collected while digitizing the archival fonds of the Pilsudski Institute of America.

Import

OpenRefine claims to open various file formats, but I have only been successful with delimited text files (CSV or TSV). The leftmost column should be the key to the table (if there is no key, a simple sequential number will do). When creating a new project you import some data (or lots… we found we can comfortably work with tables of 30 thousand rows or more), select the encoding (use UTF-8 unless your data really have a different encoding) and create the project. From now on, you work with the project, and your original data is left aside.
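
If your data does not yet have a key column, it is easy to add a sequential one while producing the delimited file. A short Python sketch (with made-up column names and values) of preparing such a UTF-8, tab-delimited file:

    import csv

    # Toy data; in practice this comes from the source database.
    rows = [("Chodacki, Marian", "folder 12"), ("Vargas, Getúlio", "folder 7")]

    with open("names.tsv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["id", "name", "folder"])          # first row holds the column names
        for i, (name, folder) in enumerate(rows, start=1):
            writer.writerow([i, name, folder])             # sequential number as the key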

OpenRefine vs. spreadsheet

Like a typical spreadsheet, OpenRefine works on a flat, two-dimensional data set. The first row should contain the column names. You can create additional columns, and move and rename them in OpenRefine.

Unlike a spreadsheet, the cells contain only data, not formulas. This looks like a limitation, but it is not really: the data can be transformed in many ways using a rich set of functions, described in detail in the documentation. You can transform values in a column’s cells, create a new column based on an existing column’s data, and more.

Facets

One of the most powerful features of OpenRefine is facets [2]. Selecting a text facet on the ‘name’ column creates a list of names. If a name exists in many rows of the table, it is shown here only once (with an indicator showing the number of occurrences in the table). Clicking the name in the list displays all the rows of the data set with this name.

For example, adding the text facet to our sample dataset of 13,772 rows gives us 6,674 choices. Those numbers are prominently displayed on the screen, helping with a bird’s-eye view of the data. We also get a histogram of the counts, and not only for viewing: using sliders you can limit the result to only those names that appear in, for example, more than 50 rows.

Similar graphs are shown whenever some grouping of data is performed in OpenRefine, for example clustering (see below). You can add other names to the selection as well, using the “include” and “exclude” markers, and then work on the group of records selected by the facet. You can add other facets too: for example, creating a facet on the ‘folder’ column and selecting one folder limits the display of data records to this folder only.
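
For readers who think in code, a text facet is essentially a count of every distinct value in a column. A rough Python sketch of the idea (an illustration only, not how OpenRefine is implemented):

    from collections import Counter

    # Toy sample of a 'name' column; in practice this is the whole dataset.
    names = ["Chodacki, Marian", "Chodacki, M.", "Chodacki, Marian", "Vargas, Getulio"]

    facet = Counter(names)            # one choice per distinct name, with its count
    print(facet.most_common())        # [('Chodacki, Marian', 2), ('Chodacki, M.', 1), ...]

    # The histogram sliders amount to filtering the choices by their counts:
    frequent = {name: count for name, count in facet.items() if count > 50}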

Search

This function works similarly to the “filter” functionality of a spreadsheet. By adding a “text filter” to the facet list, one can limit the list to a single name, or any other searchable criterion (with a case-sensitivity option and, for advanced users, regular expressions). For example, entering ‘chod’ into the person filter, we obtain:

Chodacki 35
Chodacki, M 1
Chodacki, M. 7
Chodacki, M. S. 3
Chodacki, M., mjr 8
Chodacki, Marian 3
Chodacki, mjr 8
Chodecki, mjr dypl. 1

The number following each name is the count of data records with this name. We can now “include” all of them and review the data records to verify whether they refer to the same person; if they do, we will probably modify the name to some standard form, in our case ‘Chodacki, Marian’.
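
In code terms, the text filter is a substring or regular-expression match over the names. A small Python sketch of the ‘chod’ example above, using the counts from our dataset:

    import re

    # Facet choices with their row counts, as listed above.
    choices = {
        "Chodacki": 35, "Chodacki, M": 1, "Chodacki, M.": 7, "Chodacki, M. S.": 3,
        "Chodacki, M., mjr": 8, "Chodacki, Marian": 3, "Chodacki, mjr": 8,
        "Chodecki, mjr dypl.": 1,
    }

    # Case-insensitive match, the equivalent of typing 'chod' into the text filter.
    pattern = re.compile("chod", re.IGNORECASE)
    matches = {name: count for name, count in choices.items() if pattern.search(name)}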

Clustering

Here we move beyond the capabilities of a spreadsheet. Using one of a number of algorithms, the program groups values (in our case names), trying to find those that might be alternative representations of the same thing. For example, using the fastest clustering algorithm (key collision - fingerprint) we obtain, among many others:

  • Vargas, Getulio (30 rows)
  • Vargas, Getúlio (3 rows)
  • Getulio, Vargas (1 row)

with the option to replace the variants with any of the three (or edit the replacement manually). In this case we would select the second version and merge all the records to read ‘Vargas, Getúlio’ - from the document context the Brazilian president. Later we will assign it the WikiData ID of Q156844 and VIAF ID of 24623952.

This particular algorithm (key collision - fingerprint) internally removes leading and trailing blank spaces, punctuation and control characters, changes all to lowercase, removes duplicate tokens and replaces accented characters with closest ASCII alternatives. It then finds entries with the same set of tokens, irrespective of their order.
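
For the curious, here is a rough Python re-implementation of that fingerprinting idea; OpenRefine’s own code differs in details, but the three Vargas variants above all collapse to the same key:

    import string
    import unicodedata

    def fingerprint(value: str) -> str:
        """A sketch of the 'key collision - fingerprint' method described above."""
        value = value.strip().lower()                       # trim and change to lowercase
        # replace accented characters with close ASCII alternatives
        value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
        # remove punctuation and control characters
        value = "".join(c for c in value if c not in string.punctuation and c.isprintable())
        tokens = sorted(set(value.split()))                 # remove duplicate tokens, ignore order
        return " ".join(tokens)

    # All three variants from the clustering example produce the key 'getulio vargas':
    assert fingerprint("Vargas, Getulio") == fingerprint("Vargas, Getúlio") == fingerprint("Getulio, Vargas")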

You do not need to know the details of the algorithms to use them - just try different ones and select the meaningful groupings. The documentation makes for fascinating reading nevertheless, with examples such as this one for the “N-gram fingerprint” method: “For example "Krzysztof", "Kryzysztof" and "Krzystof" have different lengths and different regular fingerprints, but share the same 1-gram fingerprint because they use the same letters.”

Transform

You can transform the data in any column, or create a new column with transformed values. The operation may involve the whole data set, or can be limited by facets and/or filters to only some rows. The functions work like spreadsheet functions, but the syntax is closer to Java. The best way is to use the built-in GREL functions, described in the GREL functions manual (although the program claims to accept other languages, like Python, as well). Most of the common functions (boolean, string, array, date etc.) are represented in GREL, together with a number of special functions like “cross”, which does a lookup in a different project. Commonly used transforms are listed in the menu (accessible for each column), like trim (remove leading and trailing whitespace) or collapse consecutive whitespace. I would recommend using those first, as it is otherwise difficult to figure out why “Kowalski, Jan”, “ Kowalski, Jan” and “Kowalski, Jan ” appear as three separate entries in the facet list, and not as one, as expected.
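
To see why those whitespace transforms matter, here is the equivalent clean-up written as a small Python sketch (in OpenRefine itself you would simply use the two menu items mentioned above, or a GREL expression such as value.trim()):

    variants = ["Kowalski, Jan", " Kowalski, Jan", "Kowalski, Jan "]

    def normalize(value: str) -> str:
        # trim leading/trailing whitespace and collapse consecutive whitespace
        return " ".join(value.split())

    # After the transform all three variants are identical,
    # so the facet shows one entry instead of three.
    assert len({normalize(v) for v in variants}) == 1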

Column operations

The common operation on a whole column is “Create a column based on this column”. Without entering any transformation, you can simply duplicate the column with its values. For our own use, since we need to know which person's name was updated in the process, the first order of business is to create a new column from Names, called originalNames. This will not be edited in any way, and will later help flag those records that need to be updated in the original database where the data came from.

When creating a new column you can use any of the transform functions, just as in a spreadsheet (drawing on the powerful set of GREL functions).
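
The same bookkeeping can be expressed outside OpenRefine, for example in a few lines of pandas; a sketch with toy data and the column names used above:

    import pandas as pd

    # Toy stand-in for the project data.
    df = pd.DataFrame({"Names": ["Chodacki, mjr", "Vargas, Getulio"]})

    # Keep an untouched copy of the names before any editing.
    df["originalNames"] = df["Names"]

    # ... cleanup happens here (clustering, transforms, cell edits) ...
    df.loc[df["Names"] == "Vargas, Getulio", "Names"] = "Vargas, Getúlio"

    # The records that need updating in the source database are simply
    # those where the standardized name no longer matches the original.
    changed = df[df["Names"] != df["originalNames"]]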

Fetching URLs

You can formulate a URL (for example "https://pl.wikipedia.org/wiki/" + value) and run it to get the content of the page for each value in the column. The fetch operation has a ‘throttle’ parameter to slow down the requests (some services put limits on frequent calls). Once fetched, the data can be parsed to obtain the specific information you need (an ID, a description, etc.). This method helps enrich the already collected data with information available elsewhere on the web.
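
Outside OpenRefine, the same fetch-and-throttle pattern looks roughly like this in Python (a sketch only: the example names and the one-second pause are arbitrary, and the parsing step depends on what you are looking for):

    import time
    import urllib.parse
    import urllib.request

    values = ["Getúlio Vargas", "Marian Chodacki"]     # example values from a name column

    for value in values:
        url = "https://pl.wikipedia.org/wiki/" + urllib.parse.quote(value.replace(" ", "_"))
        request = urllib.request.Request(url, headers={"User-Agent": "data-cleanup-example/0.1"})
        with urllib.request.urlopen(request) as response:
            html = response.read().decode("utf-8")
        # ... parse `html` here for the specific information you need ...
        time.sleep(1)                                  # the equivalent of the 'throttle' parameter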

Reconciling

Reconciling is even more sophisticated than fetching, but to use it you need a reconciliation service. There are several services available (we used a service for VIAF identifiers), or you can run your own from an open source program. After configuring and running reconciliation (it makes a web call for each row in your dataset, so it may take a while), a new column will show the results, with a matching score showing how close each match is (1 is a perfect match). You can review the data, look at the original records, and accept (or reject) the match. In our case, reconciliation greatly sped up the assignment of VIAF identifiers to names.
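
Under the hood, a reconciliation service of the kind OpenRefine talks to accepts batches of queries and returns scored candidates. A rough Python sketch of one such call (the endpoint URL is a placeholder, to be replaced with the service you actually use):

    import json
    import urllib.parse
    import urllib.request

    SERVICE_URL = "https://example.org/reconcile"        # placeholder endpoint

    # One query per value to reconcile, keyed so the answers can be matched back.
    queries = {"q0": {"query": "Vargas, Getúlio"}}

    data = urllib.parse.urlencode({"queries": json.dumps(queries)}).encode("utf-8")
    with urllib.request.urlopen(SERVICE_URL, data=data) as response:
        results = json.load(response)

    # Each candidate comes back with an identifier, a name and a matching score.
    for candidate in results["q0"]["result"]:
        print(candidate["id"], candidate["name"], candidate["score"])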

Cell operations

Whether you select a facet (limiting the display to the rows matching the facet value) or work on the whole set, you can edit any cell. For example, in a column duplicating the name, you can edit a cell and replace the name with a WikiData ID. You then have the option to “Apply” or “Apply to All Identical Cells”; the latter will put the new value into every cell of the column that contained the same original value. Similarly, you can change a name and apply the change to identical cells in the column.

The last operation can also be done on the facet - it is editable, and changing the “Kowakski” facet choice to “Kowalski” will update all the matching rows.
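
In code terms this is a simple bulk replacement; a short pandas sketch (with toy data) of what “Apply to All Identical Cells” or editing the facet choice accomplishes:

    import pandas as pd

    df = pd.DataFrame({"name": ["Kowalski, Jan", "Kowakski", "Kowakski"]})   # toy data

    # Replace every occurrence of the misspelled value in the column.
    df.loc[df["name"] == "Kowakski", "name"] = "Kowalski"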

Undo/Redo

As in any modern program, you can undo any operation, and the project keeps a list of all of them (with short descriptions). It is a rather simple task to back out of any error, even one made several steps earlier.

Export

When done with the data refinement, you can export the table as comma- or tab-delimited text, which is most useful for further work. Or you can create a custom export format, export as HTML to display in a browser or on a website, export directly to Google Sheets, and more.

Conclusion

We have found OpenRefine very useful in working with manually entered data, and I hope this short and limited introduction will encourage you to try it.

___________________

References

[1] Freebase data have been transferred to Google and can now be accessed via a Google API.
[2] See Faceted search in Wikipedia.

Marek Zieliński, June 16, 2017
