|Target Users||Publishers, principally in the German-speaking markets (Germany, Austria, Switzerland); large-scale consumers of book product data (e.g. book wholesalers, data aggregator services), Web-shops|
|Dimension||National, Language group-wide|
|Nature of the initiative||Private|
Semantic Search is a project of the German books-in-print service VLB, initiated in the context of a concerted effort by the German publishing and bookselling industry (the VLB+ Metadatenbank) towards finer control and quality assurance of book metadata, and towards value-added enhancements and enrichments based on multimedia and textual marketing materials.
Potential partners, collaborators and contributors have been consulted for a project to apply statistical text analysis, ontology design and maintenance, and targeted editorial data enhancement to make books and other media products more findable for end customers.
This project seeks to build on the strong fit between semantic search and Semantic Web technologies, which are now maturing in the research and start-up worlds, and the needs of publishers and booksellers to communicate with ever more specific audience groups in a world of rapidly growing volumes of content and rapidly changing business models.
First investigations have revealed the need to adopt a combined strategy of applied semantic search to create valuable raw data sets, sustainable data representation using appropriate standards, and making the best use of the expert knowledge of editorial teams in concert with technological tools.
Potential partners and collaborators are invited to discuss these opportunities and present possible solutions and approaches.
Discoverability of commercial media products by end customers is a central problem of the search engine age. With rapidly increasing volumes of content produced daily, how can a publisher or other media producer ensure that potential customers become aware of content that may interest them, in the format required and presented in an attractive and informative enough way to encourage them to buy, in preference to other, apparently similar offerings?
Current efforts in the book industry to vastly increase the use of an open, standard, localised subject category scheme, the new Thema classification, go a long way to meeting this need, by providing agreed, extensible categories for international markets which can be reliably translated into each market’s main languages for maximum reach.
However, this solution is limited by the vast proliferation of interest groups made possible by new content formats and the Internet: customers expect to find products that are interesting to them, not just to people like them. This level of personalisation is only possible through a highly specific description of each book’s content centred on the 'things' – personal names, companies, local places, specific everyday problems, needs, activities, ideas and even emotions – that real customers will first search for.
Customers are not, and cannot be expected to become, 'expert searchers'. Professional 'intermediaries' such as traditional booksellers, Web-shop managers and wholesalers also face the challenge of placing their products in a complex and rapidly changing marketplace of ideas. Instead, the need arises for expert tools and services at the point of production.
Semantic Search promises to offer highly targeted search results to end customers, along with suggested products and areas of interest, links to related products and content, and more attractive browsing experiences. Publishers, booksellers and wholesalers can use the tools and platforms created to enhance their services and make their internal systems more efficient.
The Semantic Search project proposes to combine the existing strengths of the book industry’s marketing and subject-specific expertise with the new possibilities offered by statistical text analysis, entity extraction and Semantic Web knowledge representation.
New solutions will include:
- Creation and management of new, more specific ontologies of 'things' (people, places, facts and opinions) to enhance and 'flesh out' existing classifications in specific areas of interest, e.g. sports, medicine, literary fiction etc. These will be based on analysis of the actual texts of the books, descriptive texts sourced from the book covers by the publishers, reviews etc.
- This is the centrepiece of the project, combining:
- Computerised textual analysis and entity recognition to produce new ontologies and data sources describing the detailed contents of books and media products, and giving links to their wider context. For example, a book (e.g. a biography of Sartre) might be classified not only by its general topics (e.g. philosophers, existentialism) but also by connection with particular historical persons (e.g. Jean-Paul Sartre), the events and streams of history they were caught up in (the Second World War) and even personal friendships (Albert Camus)
- Editorial quality control by a team of experts to ensure consistent results in each specialised field of knowledge
- Further technical and editorial work to integrate the new databases of ontological results into existing book trade and subject-specific ontologies, classifications and concordances
- Services to add automatic suggestions of specific products, areas of interest, and related products based on previous searches and the ontologies created from book contents
- Improved browsing functions, including graphical network browsing in addition to the usual hierarchical 'drill-down' functions
- New search types, such as time-based searches and suggestions created from linked data on current and historical persons and events
- Services for publishers to enhance the catalogue data describing e.g. backlist products to revive sales of 'classic' products, or to market highly specialised products in a more targeted way and reach more of the potentially interested customers, thus maximizing return on investment.
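To make the Sartre example above concrete, the sketch below shows what an entity-enriched catalogue record might look like next to a broad subject code, and how a search tool could match queries against both. All field names, codes and identifiers here are illustrative assumptions, not a defined schema of the project.

```python
# Illustrative sketch only: an enriched catalogue record linking a book
# to specific entities alongside a general subject code.
# Field names, the Thema-style code and entity labels are hypothetical.
record = {
    "title": "A Biography of Sartre",
    "subject_codes": ["QD"],  # placeholder broad philosophy code
    "entities": {
        "persons": ["Jean-Paul Sartre", "Albert Camus"],
        "events": ["Second World War"],
        "topics": ["existentialism"],
    },
}

def matches(record, query):
    """Match a query against broad codes and specific entities alike."""
    haystack = [record["title"]] + record["subject_codes"]
    for values in record["entities"].values():
        haystack.extend(values)
    return any(query.lower() in item.lower() for item in haystack)

print(matches(record, "camus"))    # True: found via the entity list
print(matches(record, "physics"))  # False
```

The point of the sketch is that a query for a specific person ("camus") succeeds even though no broad classification mentions it, which is precisely the gap the entity layer is meant to fill.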
The role of technology
The Semantic Search project will apply new techniques in text analysis and entity extraction to create new data sources which can be attached to book industry classifications to 'fill out' the details, or used as a standalone tagging and search tool to enhance internal processing or display within a Web-shop.
Semantic text analysis identifies the 'things' of importance to the author and readers of a text, through the frequency of word occurrence, patterns in the contexts where names appear and the overall statistical 'landscape' of a text. The 'entities' thus discovered can then be checked, either against existing, trusted data sources such as those described above, or by a team of experts relying on knowledge and experience of the relevant fields. In the ideal scenario both approaches are combined, enabling the editorial team to maximise the results of their efforts using semi-automated 'best guesses' which are manually checked and improved.
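The frequency-based part of this analysis can be sketched with a toy heuristic: recurring capitalised word sequences are proposed as candidate 'entities' for editorial review. This is a deliberately simplified stand-in for real entity-recognition systems, which use trained statistical models; everything below, including the threshold, is an illustrative assumption.

```python
import re
from collections import Counter

def candidate_entities(text, min_count=2):
    """Propose recurring capitalised name sequences as candidate entities.

    A toy frequency heuristic, not a real NER system: candidates would
    still be checked against trusted data sources or by editorial experts.
    """
    # Match capitalised word runs, e.g. "Jean-Paul Sartre" or "Paris"
    pattern = r"[A-Z][a-z]+(?:[-\s][A-Z][a-z]+)*"
    counts = Counter(re.findall(pattern, text))
    # Keep only names that recur often enough to be worth reviewing
    return [name for name, n in counts.items() if n >= min_count]

text = ("Sartre met Camus in Paris. Sartre and Camus debated often; "
        "Paris shaped both. Sartre wrote there.")
print(candidate_entities(text))  # ['Sartre', 'Camus', 'Paris']
```

The 'best guesses' produced this way would then go to the editorial team, mirroring the semi-automated workflow described above.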
These efforts will generate valuable datasets and structures of knowledge ('ontologies') that must be curated and managed to ensure quality control and compatibility with the wider classification systems of the media industries.
Semantic Web standards enable such raw data to be structured and linked together to create useful information, in a simple, intuitive way. Data can be linked across many data sources internally in an organisation, or even across many 'data owners' who agree to share their data, if they use the open standards of the W3C consortium.
Semantic Web data is expressed in a simple format consisting of 'sentences', each made up of a subject, a predicate (the 'verb') and an object. The object of one 'sentence' can be made the subject of another, thus forming a complex web of statements about real-world things, or even about other 'sentences'. This means that Semantic Web formats are highly modular and, together with the right definitions, can effectively extend existing classifications like Thema by simply attaching new statements to the existing ones. Existing 'ontologies' available to describe topics of interest to book buyers already use the Semantic Web formats:
- Thema: used to classify the subject(s) of a book, with clarifying notes, cross-references and concepts in many languages; expressed as W3C SKOS
- OntoMedia: describes the characters, plots and themes of the narrative contents of stories, novels, plays, poems and histories; expressed as W3C RDF
- BBC Sports Ontology: describes sporting events, athletes and their achievements; uses W3C RDF
- MeSH (Medical Subject Headings): internationally used classification of medical conditions, treatments and medications; already available as XML format and has been experimentally converted to Semantic Web standards OWL and SKOS
- The Onto-Med research group’s General Formal Ontology for integrating medical data; uses W3C OWL
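The 'sentence' model described above can be sketched with plain triples: the object of one statement becomes the subject of another, and a new statement can simply be attached next to an existing classification. The identifiers below are invented for illustration and are not real Thema codes or RDF vocabulary terms.

```python
# Each statement is a (subject, predicate, object) triple.
# All identifiers are illustrative only, not real vocabulary URIs.
triples = [
    ("book:sartre-biography", "hasSubject", "thema:QD"),       # existing classification
    ("book:sartre-biography", "isAbout", "person:JP-Sartre"),  # newly attached statement
    ("person:JP-Sartre", "friendOf", "person:A-Camus"),        # object reused as subject
    ("person:JP-Sartre", "livedThrough", "event:WW2"),
]

def objects_of(subject, predicate):
    """Return all objects of statements with this subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Traverse the web of statements: from the book to a person, then onward
# to that person's connections.
for person in objects_of("book:sartre-biography", "isAbout"):
    print(person, "->", objects_of(person, "friendOf"))
```

In real Semantic Web data the same structure is expressed in RDF with full URIs; the sketch only shows how modular statements link up and how a new triple can extend an existing classification without altering it.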
In addition, a wide range of 'general' data sources, such as DBpedia, are published as Semantic Web data for general use. Most of these require quality control and tailoring by the editorial team, as they are largely created by volunteers without strong coordination, standardisation or (sometimes) detailed expertise.
Ontologies and data sets created by the project would be curated as RDF, SKOS or OWL data, to ensure future-proofing and compatibility with other projects, and potentially to adapt and improve the existing Semantic Web resources.
The project is in the investigation and planning phase, drawing on the results obtained by experienced researchers in this field and evaluating them for their potential adoption by the commercial book marketing sector. These results include the identification of specific thematic areas of interest and commercial promise, such as medicine, sport and various areas of literary fiction and the humanities.
A central consideration will be finding technical partners able to adapt to the needs of the publishing industry, perhaps with existing relevant experience.
Initial investigations have also identified key needs of the book industry in transitioning this type of research into commercial use, such as licensing models, versioning and updates (together with synchronisation with the industry-wide keywording schemes), and the necessity of maintenance and support by a central team of specialists.
The project will also draw on current highly-focused analysis of the existing book data state-of-the-art in Germany to identify particularly promising thematic areas and markets. Next steps will include ascertaining the relative levels of automated and human editing required and the optimal technical platforms and editorial processes to adopt.
The project could in principle be highly scalable to other countries, or, more specifically, language-based areas and markets, through the use of quality controlled and managed translations along the lines of the existing Thema model. Appropriate licensing arrangements and business models would be sought to enable wider use of the systems and IP generated by the project.