Wednesday, March 14, 2007

Adding Value to Open Scholarly Content

One of the ways in which the Perseus Digital Library increases accessibility to and interest in the humanities is by making its content freely available. Perseus can give away its content and still keep its users coming back because it provides semantically precise associations between content. The Canonical Text Services (CTS) protocol and the CTS-URN syntax it defines exemplify the types of services Perseus uses to intra- and inter-connect content. However, text services are just one aspect of Perseus' service layer, a layer which represents just one-quarter of Perseus' overall logical architecture. Each of the layers of Perseus' logical architecture increases the value of the humanities and thereby the value of Perseus in the eyes of its users.

To understand how Perseus can give away its texts without losing users, it is helpful to distinguish between Perseus' static and dynamic content. Perseus' TEI-XML texts, artifact and image metadata, and named-entity and morphological datasets are static data that Perseus currently distributes or will distribute freely under a Creative Commons license. Perseus can afford to do this because its value lies in the associations it can create by making its data dynamically accessible. These semantically precise associations within Perseus' own content (intra-connecting) and between its own content and external data and/or services (inter-connecting) give users a wealth of context for understanding and interpreting Classical data.

The CTS protocol will make Perseus texts dynamically accessible, allowing them to be connected in ways that simply downloading a collection of TEI-XML texts does not. The Canonical Text Services protocol uses URNs to reference texts in terms of their hierarchical structure. CTS URNs are a cousin of the Functional Requirements for Bibliographic Records (FRBR): they may identify the work of an author or an edition or translation of a work, but they extend FRBR by referencing texts by their logical citation scheme. Rather than navigating by page number, CTS interprets the semantics of these URNs to retrieve logical sections of a text organized by chapter, section, or some other scheme. Furthermore, the protocol specifies extensions to the CTS URN for referencing an arbitrary sequence of characters within a passage, providing a syntax for textual alignment across editions without losing context.
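To make the hierarchy concrete, here is a minimal Python sketch of how such a URN might be split into its parts. It assumes the colon-delimited form urn:cts:namespace:work:passage with an optional @subreference, which is how later published versions of the CTS URN syntax look; the syntax in use at the time of writing may differ. The CtsUrn class, the parse_cts_urn function, and the identifiers in the usage note are hypothetical and are not part of any Perseus API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CtsUrn:
    namespace: str                # naming authority for the work identifiers
    work: Tuple[str, ...]         # hierarchy: (textgroup, work, version, ...)
    passage: Optional[str]        # logical citation, e.g. "1.1" (book.chapter)
    subreference: Optional[str]   # character sequence within the passage


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a URN of the assumed form urn:cts:ns:work[:passage[@subref]]."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work_part = parts[2], parts[3]
    if not namespace or not work_part:
        raise ValueError(f"missing namespace or work: {urn!r}")

    passage = subreference = None
    if len(parts) == 5 and parts[4]:
        # Everything after "@" names a string inside the cited passage.
        before, _, after = parts[4].partition("@")
        passage = before or None
        subreference = after or None

    return CtsUrn(namespace, tuple(work_part.split(".")), passage, subreference)
```

For example, parse_cts_urn("urn:cts:demo:author1.work1.eng1:1.1@horse") yields the work hierarchy ("author1", "work1", "eng1"), the passage "1.1", and the subreference "horse". A shorter URN such as "urn:cts:demo:author1.work1" identifies the work as a whole and carries no passage.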

CTS URNs enable Perseus to create associations which increase the value of its data. Just as Google's PageRank takes HTML links into consideration to measure the value of a page to a user, Perseus can use CTS URNs to increase the value of its content to its users, with far greater semantic clarity and precision. Some services that will eventually be exposed via index services include Perseus' named-entity disambiguation, citations, and morphological information. For each of these examples, CTS-URNs and collection IDs will be combined with Index Services to make connections within Perseus' own data. However, URNs also allow Perseus to connect its highly structured texts with external data and services; one such service is searching using Google Base.

Connecting highly structured data with less structured data through search is not unfamiliar. When searching in Google Earth, the display window represents a range of geographic coordinates currently visible to the user. When a user performs a search, hits are displayed as markers at their geographic coordinates. A CTS-aware search would be similar: results would be limited to the range of textual coordinates associated with the current passage and displayed in terms of their textual coordinates. Just as Google Earth parses KML documents and provides services consistent with the semantics of longitude and latitude, so could a CTS-enabled search parse CTS URNs, interpreting the semantics of this coordinate system for textual reference. Currently, Perseus is experimenting with using CTS-URNs within queries in Google Base. After creating an item for each logical section of text and specifying the corresponding URN in the page's metadata, this less-structured data is uploaded to Google Base. Each of these items links to Perseus' text page, which provides rich context for the item. Using this approach, one can search for a text from the author level down to the passage level (it would be very difficult to generate an item for each character on a page) and get relevant hits that point directly to Perseus. Furthermore, a URN specified in combination with a term, such as 'horse', provides a mechanism for limiting search results to the text(s) represented by that URN. (Experimental Examples: Caesar's The Gallic War, a Perseus edition English translation of Caesar's The Gallic War, Book 1, Chapter 1 of the aforementioned edition, occurrences of 'horse' within the aforementioned edition)
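The idea of treating a URN as a search scope can be sketched in a few lines of Python. This is not the Google Base query interface; it is an in-memory illustration that assumes the colon-delimited urn:cts:namespace:work:passage form and well-formed input, and the items, URNs, and text snippets are invented sample data. The function names urn_contains and search are hypothetical.

```python
def _split(urn):
    """Break a URN into (prefix, work hierarchy, passage hierarchy)."""
    parts = urn.split(":")
    work = parts[3].split(".")
    passage = parts[4].split(".") if len(parts) > 4 and parts[4] else []
    return parts[:3], work, passage


def urn_contains(scope, urn):
    """True if `urn` lies at or below `scope` in the text hierarchy."""
    s_head, s_work, s_pass = _split(scope)
    u_head, u_work, u_pass = _split(urn)
    if s_head != u_head:
        return False
    if s_pass:
        # A scope that cites a passage only covers deeper citations
        # within exactly the same edition or translation.
        return s_work == u_work and u_pass[:len(s_pass)] == s_pass
    # Otherwise the scope is an author, work, or edition prefix.
    return u_work[:len(s_work)] == s_work


def search(items, scope, term):
    """Return URNs of items inside `scope` whose text mentions `term`."""
    term = term.lower()
    return [item["urn"] for item in items
            if urn_contains(scope, item["urn"]) and term in item["text"].lower()]


# Invented sample items: one per logical section, as described above.
ITEMS = [
    {"urn": "urn:cts:demo:author1.work1.eng1:1.1", "text": "The land is divided into three parts."},
    {"urn": "urn:cts:demo:author1.work1.eng1:1.2", "text": "A horse was led across the river."},
    {"urn": "urn:cts:demo:author1.work2.eng1:1.1", "text": "The horse refused the bridge."},
]
```

With this sample data, search(ITEMS, "urn:cts:demo:author1.work1", "horse") returns only the hit in work1, while widening the scope to "urn:cts:demo:author1" returns the hits from both works. The scope plays the role of the visible map window in the Google Earth comparison.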

CTS provides numerous benefits that are not inherent in the raw data but emerge from the behaviors it defines. The CTS protocol defines a standard mechanism for referencing and retrieving texts. Because it is an open protocol with a functioning implementation exposed via an API, the semantics and behaviors of CTS are specified independently of any one implementation and are demonstrably possible to implement. The specification allows one to quantitatively measure how effective a given implementation is by unit testing its conditions. The implementation provides a well-defined API that encodes the domain knowledge gained from working with texts and from the desire to reference them independently of representation, whether a manuscript, book, PDF, or web page. Since all requests to CTS require a CTS-URN, tracking users in a semantically meaningful way becomes possible, as the URN will appear in HTTP access logs. Questions such as how users are navigating the text, what services are being invoked on a given reference, and what requests were performed on that reference can all be answered in terms of the underlying logical structure of the text. Furthermore, since CTS was designed to work with Ancient Greek and Latin texts, it can handle multilingual content, and it provides a syntax for datasets of aligned texts. Logical referencing of text independent of physical representation, a specification with clear meaning, an API whose functionality can be quantitatively verified, and a notation for semantically precise associations illustrate how dynamic services add value to static content and keep users of Perseus coming back.
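As a rough illustration of the access-log point, the Python sketch below tallies which requests were made against which URNs. It assumes log lines in the common web-server format and a query string carrying "request" and "urn" parameters; the /cts path, the parameter names, and any log lines fed to it are assumptions made for the example, not a description of Perseus' actual deployment.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Matches the quoted request portion of a common-log-format line.
REQUEST_LINE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')


def count_requests(log_lines):
    """Count (urn, request name) pairs found in HTTP access log lines."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_LINE.search(line)
        if not match:
            continue
        query = parse_qs(urlsplit(match.group(1)).query)
        urn = query.get("urn", [None])[0]
        if urn is None or not urn.startswith("urn:cts:"):
            continue  # not a CTS request under our assumptions
        request = query.get("request", ["unknown"])[0]
        counts[(urn, request)] += 1
    return counts
```

Because every counted entry is keyed by a URN, the tallies can be rolled up by passage, edition, work, or author simply by truncating the key, which is what makes the resulting statistics meaningful in terms of the text's logical structure rather than page addresses.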

Although text services are central to Perseus' mission, they are just one of the services Perseus offers, and the entire service layer accounts for only one-quarter of Perseus' logical architecture. That architecture can be used to classify Perseus' other sources of value, which increase both the value of the humanities and the value of Perseus in the eyes of its users. First, Perseus' data layer, composed of TEI-XML texts, databases, and other raw data, is freely distributed under a Creative Commons license. Not only does this establish Perseus as a data source for the community, but distributing multiple copies of each text increases its chances of surviving into the future. Rather than having to copy a text by hand, digitization gives the humanities the ability to make and distribute an arbitrary number of copies, increasing the accessibility and survivability of each text. Furthermore, Perseus' expertise in digitizing these texts serves as a source of value for those who wish to create their own digital editions. Second, the domain layer, where behaviors are associated with the raw data, encodes the knowledge and experience gained while working with the content. Working in the domain of Classical texts provides Perseus with a unique perspective on the nature of text that others may find useful, and so gives Perseus the opportunity to help others make sense of their content. Third, Perseus' service layer provides a series of APIs implementing a set of protocols for each of the types of data Perseus serves. Many of these services rely upon the protocols specified in the TICI stack. Through the service layer, others are free to repurpose Perseus' content through an API that encodes domain knowledge, and since the API is freely available under an open source license, the community using the API becomes a source of information and value as well. Finally, the display layer, whether widgets, HTML pages, or PDFs, gives all users a convenient and easy way to access Perseus' data. The user interface reflects the knowledge gained when building the other layers, and so helps the general public see the relations between Perseus' data.

Perseus can give away its static data because it adds value by providing semantically rich associations that add context to its content. The CTS protocol offers a new way to conceive of, reference, and deliver texts, and CTS-URNs provide a syntax for specifying relations among Perseus' own content and external content and services. These relations increase the value of Perseus in a way that is not inherent in the raw data, but comes from creating associations among the data. In giving away its raw data, Perseus encourages others to develop their own associations, increasing its value as a data provider and as a service developer while increasing access to, and therefore innovation within, the humanities.

Thanks to John Blossom, whose "Shoreviews. Content Industry Outlook 2007: Reality Checks" gave me criteria for evaluating CTS as a value-added service in the context of the publishing community. Thanks to Gregory Crane for his ideas on interconnecting primary and secondary sources within the Perseus Digital Library and his initial recommendation to look at Google Base. Thanks to Neel Smith for the comparison between searching within Google Earth and a CTS-aware search. Thanks to my brother, Michael Weaver, for his work on the logical layers of an application and their relation to business processes.

Based upon slides presented on the panel Getting Search Right for Premium Content during the Spring 2007 ASIDIC meeting.
