This is an article-length version of the authors' presentation given at BiB V, which is also available as a video below.
The advent of EPUB 3 has made e-books a full-fledged portable web format. However, the content inside a digital publication remains separated from the rest of the Web. This means that the information in a digital book is closed off from the reader until after it has been read. In this paper, we propose a way to make the content of digital publications machine-readable by connecting it with the Semantic Web. This enables automatic processing of the content of a publication, which in turn improves the discoverability and ingestion of that content.
However, the way the content of a digital publication is processed is still the same as it has been for the last 3000 years: by reading it. Despite the advancements in technology concerning (re-)presentation of data and text, the actual information of a book is sealed off from its environment.
The Semantic Web is an effort to make concepts machine-understandable on the Web (Berners-Lee). Mentions of a concept can be linked across the Web, which means that all information concerning a certain concept can be linked together, making the concept machine-understandable. By connecting the concepts mentioned in a digital publication with their semantic counterparts, we can break down the barriers of digital books and link their contents throughout the Web.
In Section 2, we review the current additional possibilities of digital publications to ingest information, as compared to their printed counterparts. In Section 3 and Section 4, we outline how digital publications can provide more conceptual and technical possibilities. In Section 5, we demonstrate some of these possibilities, and we finally conclude in Section 6.
2 Links and Searches
To review the advantages of digital publications over printed publications concerning content ingestion, we will compare the book Les Misérables by Victor Hugo with its digitized version on Project Gutenberg. Comparing the possibilities of the two versions, there are mainly two ways a digital publication can provide more information at a faster pace, namely, (1) following links and (2) searching for words in the digital text.
Links such as a table of contents provide a quick way to go from one part of a digital book to another and enable users to jump directly to a part of the book that interests them. However, being able to go to an interesting part of a digital book implies that the user already knows which parts of the book are interesting. The same is true for word search in the digital text. Embedded search functions allow for the same ease of jumping from one part of a publication to another as links do. Sadly, the same problem holds as with links: a user first needs to know what to look for before she can search for it. For example, Figure 1 shows how searching for the string “Jean Valjean”, one of the main characters of Les Misérables, can provide some interesting insights into the story. Among other things, it shows how important he is for the ending of the story, and how rarely he is mentioned in the middle. This example also illustrates that the user needs to know the story beforehand (in this case, who the main characters are) before she can do interesting searches.
Figure 1: Searching for a word can provide some insights into the content of a digital publication, but implies that the user knows what to search for.
This is the main problem for current book discoverability: most book recommendation systems are targeted at social recommendation (e.g., “your friends also liked this book”) instead of content recommendation (e.g., “this book is similar to that book”). This results in the current situation where very popular books are read very often, and most other books are left in the shadows (the so-called long tail (Anderson)). Current techniques do not allow users to find out about a story before reading it, and can only recommend books based on their social presence. As a result, unpopular books, although possibly very relevant, remain undiscoverable.
3 From Text to Information
The information discussed in a digital publication can be made machine-understandable by using the Semantic Web. The entities inside a digital publication can be recognized by using, for example, publicly available web services such as DBpedia Spotlight (Mendes), and can be linked to their resources on the Semantic Web (e.g., on dbpedia.org). By recognizing these entities and putting the recognized links inside the digital publication (e.g., by using RDFa), we go from a publication with text to an enhanced publication with machine-understandable information (see Figure 2).
By using the Semantic Web, this machine-understandable information is not limited to the entities inside the publication, but can be interlinked with other relevant information. For example, by providing a link to the DBpedia page of Paris inside a digital publication, we know not only that the word Paris denotes a place, but also that it is the capital of France; we know an estimate of its number of inhabitants, we can look up its climate, etc. This additional information lets us draw conclusions that we could not draw from the entity alone. If Paris is recognized as a place in a certain book but is not linked to its semantic resource, we cannot conclude that the book takes place in France, as that information is not explicitly mentioned in the book. It is, however, mentioned in its semantic resource on the Web.
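As a rough illustration of this annotation step, the following Python sketch wraps already-recognized mentions in spans that carry RDFa-style attributes. It assumes the recognition has been done elsewhere (for instance by an entity-recognition service); the entity list, the resource and type URIs, and the function name annotate are illustrative assumptions, not part of the described system.

```python
import re
from html import escape

# Hypothetical recognizer output: (surface form, resource URI, type URI).
# These values are illustrative; a real system would obtain them from an
# entity-recognition service.
ENTITIES = [
    ("Jean Valjean", "http://dbpedia.org/resource/Jean_Valjean",
     "http://dbpedia.org/ontology/Person"),
    ("Paris", "http://dbpedia.org/resource/Paris",
     "http://dbpedia.org/ontology/Place"),
]


def annotate(text, entities):
    """Return HTML in which each known mention is wrapped in an RDFa span."""
    lookup = {surface: (uri, rdf_type) for surface, uri, rdf_type in entities}
    # Try longer surface forms first so that they win over their substrings.
    pattern = re.compile("|".join(
        re.escape(s) for s in sorted(lookup, key=len, reverse=True)))
    parts, last = [], 0
    for match in pattern.finditer(text):
        uri, rdf_type = lookup[match.group(0)]
        # Escape the plain text between two mentions.
        parts.append(escape(text[last:match.start()]))
        parts.append('<span about="{}" typeof="{}">{}</span>'.format(
            escape(uri), escape(rdf_type), escape(match.group(0))))
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


print(annotate("Jean Valjean arrives in Paris.", ENTITIES))
```

Plain string matching is used here only to keep the sketch short; it does not disambiguate between different entities that share the same name, which is the part a dedicated recognition service takes care of.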
4 Digital Books as an API to Information
Once we have machine-understandable information inside digital publications, we can consider these books as a database of knowledge and process them accordingly. This means that we can use a querying API to retrieve the information we need from inside digital publications. For example, once all entities inside a book are annotated with their semantic resource, we can query that book to retrieve all people mentioned inside. By sorting book characters by their number of mentions, we get a good estimate of the most important characters of a publication, without any manual intervention.
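The character ranking described above can be sketched in a few lines of Python. The sketch assumes the annotations have already been extracted from the publication as one (resource, type) pair per mention; the data, the type URIs, and the function name rank_people are hypothetical.

```python
from collections import Counter

PERSON = "http://dbpedia.org/ontology/Person"
PLACE = "http://dbpedia.org/ontology/Place"

# Hypothetical annotations taken from an enhanced publication:
# one (resource URI, type URI) pair per mention in the text.
MENTIONS = [
    ("http://dbpedia.org/resource/Jean_Valjean", PERSON),
    ("http://dbpedia.org/resource/Paris", PLACE),
    ("http://dbpedia.org/resource/Javert", PERSON),
    ("http://dbpedia.org/resource/Jean_Valjean", PERSON),
    ("http://dbpedia.org/resource/Jean_Valjean", PERSON),
    ("http://dbpedia.org/resource/Javert", PERSON),
]


def rank_people(mentions):
    """Return (resource, mention count) pairs for people, most mentioned first."""
    counts = Counter(uri for uri, rdf_type in mentions if rdf_type == PERSON)
    return counts.most_common()


for uri, count in rank_people(MENTIONS):
    print(count, uri)
```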
We envision this querying API to be fully compatible with the current (Semantic) Web, by processing books as if they were (Semantic) database endpoints. The SPARQL query language is used for semantic endpoints in Semantic Web applications. By allowing SPARQL queries directly on digital publications (Figure 3), the information inside a publication can be retrieved automatically, and the knowledge inside a publication becomes more easily accessible.
As can be seen in Figure 3, the link to a digital publication would serve a double role: on the one hand, it remains the link to download a digital publication, and on the other hand, by adding a query parameter, digital publications can be queried–manually or automatically–to retrieve information that otherwise could only be retrieved by reading the entire publication. In this case, the query will request all people that are mentioned in the book, sorted by mention count. The answer to this query is the list of the main characters of a certain story, in descending order.
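To make the double role of the link concrete, here is a minimal Python sketch that attaches a SPARQL query to a download link and reads it back. The book URL, the parameter name query, the ex:refersTo property, and both function names are assumptions made for illustration; the actual vocabulary and endpoint behavior are not specified here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical download link of an enhanced publication.
BOOK_URL = "http://example.org/books/les-miserables.epub"

# SPARQL query asking for all people mentioned, most mentioned first.
# ex:refersTo is an assumed property linking a mention to its resource.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ex: <http://example.org/vocab#>
SELECT ?person (COUNT(?mention) AS ?count)
WHERE {
  ?mention ex:refersTo ?person .
  ?person a dbo:Person .
}
GROUP BY ?person
ORDER BY DESC(?count)
"""


def query_url(book_url, sparql):
    """Turn a plain download link into a link that carries a query."""
    return book_url + "?" + urlencode({"query": sparql.strip()})


def extract_query(url):
    """Return the query carried by a link, or None for a plain download."""
    values = parse_qs(urlsplit(url).query).get("query")
    return values[0] if values else None


print(query_url(BOOK_URL, QUERY))
```

A server handling such links could use the result of extract_query to decide between serving the file and answering the query.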
5 Interlink Books
This querying API makes it possible to interlink information inside books, and returns information such as the main characters of a certain publication. That information can then be used for automatic analysis, such as analyzing the occurrences of characters throughout a book. Whereas previous efforts have always been targeted to a single publication, this querying API is a generic solution, allowing for automatic analysis of a large diversity of books.
This API also makes it possible to interlink books with each other, for example by querying for books that are all set in the same city. By using the Semantic Web, we can also query for books that are set in the same country, without the need for the country to be mentioned in any book, as this can be derived from information available on the Semantic Web. More advanced queries are also possible, such as querying the travel path of a certain character in a digital book.
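The country-level interlinking can be sketched as follows, under the assumption that each book has been annotated with the city in which it is set and that the city-to-country relation comes from an external knowledge source. The titles “Book B” and “Book C”, the lookup tables, and the function name group_by_country are placeholders; in practice the second table would be replaced by a lookup on the Semantic Web.

```python
from collections import defaultdict

# Hypothetical result of querying each book for the city it is set in.
BOOK_CITIES = {
    "Les Misérables": "Paris",
    "Book B": "Marseille",
    "Book C": "London",
}

# Background knowledge that is not mentioned in any of the books.
# This table stands in for a lookup against a semantic resource.
CITY_COUNTRY = {
    "Paris": "France",
    "Marseille": "France",
    "London": "United Kingdom",
}


def group_by_country(book_cities, city_country):
    """Group books by the country derived from their annotated city."""
    groups = defaultdict(list)
    for book, city in book_cities.items():
        country = city_country.get(city)
        if country is not None:  # skip cities without known background data
            groups[country].append(book)
    return dict(groups)


print(group_by_country(BOOK_CITIES, CITY_COUNTRY))
```

The first two books share no city, yet end up in the same group because the country is derived from the background knowledge and not from the text of either book.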
All this allows the information available in a digital book to be interlinked with worldwide information, which in the end enables a better machine-understandability of the content of a digital publication. This in turn can lead to a better discoverability for publications enhanced with these machine-understandable concepts and querying API.
In this paper, we propose a methodology to unlock the information inside a digital publication and link it with the remainder of the (digital) world. By linking the entities inside digital publications with the Semantic Web, these entities get a machine-understandable meaning. By allowing a querying API on top of a published digital publication, these entities can then be queried and processed automatically. This allows for an easier analysis of digital books and a better interlinking between books, and ultimately makes books easier to discover, because they can be compared on a content level.
This article was originally published in the Journal of Electronic Publishing (http://dx.doi.org/10.3998/3336451.0018.114).
The research activities described in this paper were funded by Ghent University, iMinds, the IWT Flanders, the FWO-Flanders, and the European Union, in the context of the project “Uitgeverij van de Toekomst” (Publisher of the Future).
This paper was also written by Ben De Meester, Tom De Nies, Wesley De Neve, Erik Mannens, and Rik Van de Walle.
There used to be a similar effort dubbed Dracula Dissected; however, the link is no longer online.
- Berners-Lee, Tim, James Hendler, Ora Lassila, et al. "The Semantic Web." Scientific American 284, no. 5 (2001): 28-37.
- Anderson, Chris. The Long Tail: Why the Future of Business is Selling Less of More. New York, NY: Hachette, 2006.
- Mendes, Pablo N., Max Jakob, Andrés García-Silva, and Christian Bizer. "DBpedia Spotlight: Shedding Light on the Web of Documents." Proceedings of the 7th International Conference on Semantic Systems (2011): 1-8.