art historian, environmentalist, educator

Dealing with bias and big, messy data

Last week we talked about understanding the field of digital humanities, and part of the struggle within the field is gaining recognition for the work being done. Christof Schöch discusses this in the Journal of Digital Humanities and questions why this is so. He cites researcher and practitioner Johanna Drucker, who argues that the term “data” is inadequate for work in the humanities, so much so that she coins the word “capta” to better convey the active work of “capturing” information that practitioners do. This resonates with Dr. Otis’s comment that some colleagues don’t find her research to be real “work.” I think Drucker is right that data and its “work” tend to connote independent observers just “observing,” rather than the more active work of synthesizing, analyzing, and drawing conclusions, which is the work of digital humanities.

This data capture work of digital humanities is significant, and it takes time and tinkering. Schöch discusses these types of data: points of information that have been digitized and captured from many different sources, and that need to be organized constructively before anyone can make sense of them or draw conclusions from them. I understood this more fully when we were given our own datasets of biographical data on people from 16th-century Wales. This dataset was definitely messy; you could see different conventions for recording the same information throughout the columns. For example, a given record of birth/death could be written 1533-1600, b. 1533, d. 1600, or b. 1533- d. 1600. So you can quickly see that arranging this messy data, particularly when there are over 13,000 records, would take a long time to do by hand. With the use of technology, in this case the application OpenRefine, we can manipulate large amounts of data so that it can then be analyzed. For our exercise we had to clean the messy data, then find the people who were born in the same year as Queen Elizabeth and count how many more were born before the start of the first plague. I found that easier than trying to parse out the difficult Welsh names with their varying ways of entry, prefixes, titles, and whatnot. In the end, another challenge of working with big data is the uncertainty about whether you’ve in fact done it correctly, since you can’t easily “check your work” on 13,000 entries.
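To make the date problem concrete, here is a rough sketch in Python of how those different conventions could be folded into one consistent form. This is not what OpenRefine does under the hood and not the method we used in class; the function name `parse_life_dates` and the sample records are made up for illustration, and the sketch assumes every year is written with four digits.

```python
import re
from typing import Optional, Tuple

# Patterns for the recording conventions mentioned above (assumed, not exhaustive).
BIRTH = re.compile(r"\bb\.\s*(\d{4})")                    # "b. 1533"
DEATH = re.compile(r"\bd\.\s*(\d{4})")                    # "d. 1600"
RANGE = re.compile(r"^\s*(\d{4})\s*-\s*(\d{4})\s*$")      # "1533-1600"


def parse_life_dates(text: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (birth_year, death_year); a year that is not found is None."""
    plain_range = RANGE.match(text)
    if plain_range:
        return int(plain_range.group(1)), int(plain_range.group(2))
    born = BIRTH.search(text)
    died = DEATH.search(text)
    return (
        int(born.group(1)) if born else None,
        int(died.group(1)) if died else None,
    )


# Hypothetical sample records, not taken from the real dataset.
records = ["1533-1600", "b. 1533", "d. 1600", "b. 1533- d. 1600", "b. 1540, d. 1601"]

# Count the people born in 1533, the year Elizabeth I was born.
born_1533 = sum(1 for r in records if parse_life_dates(r)[0] == 1533)
print(born_1533)
```

A record that matches none of the patterns comes back as `(None, None)`, so those rows could be pulled out and reviewed by hand, which is one partial answer to the “check your work” worry.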

Beyond bias against the work of digital humanities practitioners, Blaney and Siefring (in Digital Humanities Quarterly) have noted a bias against citing digitized sources in humanities research. They question the reasoning behind this, as people cling to much older print sources despite the authors’ opinion that newer digital sources are more robust. To illustrate this, they criticize the dependence on the Oxford English Dictionary versus the more verbose Wikipedia. To me it underscored the differences between print and digital media and why those concerned about the instability of language rely on print sources. The impetus behind the brevity of the OED lies in the need to create a handheld dictionary of English words. Wikipedia can devote a whole web page to the definition of the word hubris, while the OED cannot afford to do that with every single word without losing practicality and money. The OED is solidified and stable; Wikipedia may expound on meaning, but it is more slippery, as anyone may edit Wikipedia at any time.

