Facilitating and supporting large-scale text mining in the field of digital humanities
Globalization has had a significant impact on culture and literature. However, works of fiction still face stumbling blocks when translated for readers from another culture. In many cases, reading a book from another culture requires at least some knowledge of persons, places, or events from that culture. Some names will be universally known, such as New York or Angela Merkel, but others will be mostly unknown and perhaps even mysterious, such as Monnikendam or Meneer Beerta. We could therefore ask how “international” a work of fiction actually is.
If we want to know how “international” a work of fiction is, we first need to understand which textual features affect how readable it is for readers from a different culture. Which of these features can be identified and measured in a digital text, and which cannot? And how can the identifiable and measurable features be turned into a usable indicator of readability for readers from other cultures?
“Beyond the Book” will apply a set of text-analysis tools that focus on recognizing named entities, on linking proper names in fiction to entries in Wikipedia (named entity resolution), and on the statistical analysis of a set of other, more general features in the digital text of works of fiction. Another aim of the project is to visualize the level of international readability of works of fiction.
The project will make use of existing software for named entity recognition and named entity resolution developed in the project Namescape (www.namescape.nl), and will build on the visualization tools also developed in that context. The next steps will be to develop ways of analyzing the output of these tools and converting it into measures of “global readability”. For this, the existence of a Wikipedia entry for a name is important, as is, for instance, the number of languages in which the entry is available. Another factor could be the specific languages or language groups in which entries are available, which could be used to predict readability for different parts of the globe.
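As a rough illustration of how such a measure might be computed, the following minimal sketch scores each resolved name by whether it has a Wikipedia entry and by how many language editions carry that entry, then averages the scores over a text. It is not the project's method: the names `ResolvedName`, `name_score`, `global_readability`, and the `max_languages` cap are all hypothetical, and the language counts in the sample are placeholder values, not actual Wikipedia figures.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResolvedName:
    """One proper name after named entity resolution (hypothetical structure)."""
    surface: str                    # the name as it appears in the novel
    wikipedia_title: Optional[str]  # None if no Wikipedia entry was found
    language_count: int             # number of language editions with the entry


def name_score(name: ResolvedName, max_languages: int = 200) -> float:
    """Score in [0, 1]: 0 for an unresolved name, 1 for a very widely covered one.

    The linear scaling and the cap are assumptions made for illustration.
    """
    if name.wikipedia_title is None:
        return 0.0
    return min(name.language_count, max_languages) / max_languages


def global_readability(names: List[ResolvedName], max_languages: int = 200) -> float:
    """Average name score over all names in a text.

    A text without proper names is treated as fully readable (an assumption).
    """
    if not names:
        return 1.0
    return sum(name_score(n, max_languages) for n in names) / len(names)


# Placeholder language counts, chosen only to show the calculation.
sample = [
    ResolvedName("New York", "New York City", 200),
    ResolvedName("Angela Merkel", "Angela Merkel", 100),
    ResolvedName("Monnikendam", "Monnikendam", 20),
    ResolvedName("Meneer Beerta", None, 0),
]

print(round(global_readability(sample), 3))
```

A refinement along the lines suggested above would replace the single language count with the set of language editions per entry, so that a separate score could be computed for each language group or region.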
Visualization of the measurement results plays an important role, as do the underlying data, which support a quantitative analysis of the usage and functions of proper names in fiction. The result will be a demonstrator based on a small set of publicly available novels in digital form. Another deliverable will be a document describing possible business models that build on the tools developed for the demonstrator. Intermediate and end users of these tools may include translators of fiction (annotation of proper names), producers of e-books, and readers of e-books.