Thursday, November 1, 2007

Semiotic Domain Models

Introduction to Semiotic Domain Models


Traditionally, when one wants to build a system to organize information and process it in a manner consistent with a given discipline, one must first develop a domain model. The domain model consists of a data model, a set of data structures used to organize the information in a meaningful way, and a set of behaviors associated with the data structures within that data model. Object-oriented programming provides a familiar paradigm for associating behaviors with structured data. For example, a circle object may consist of a Segment denoting its radius, and a Point denoting its center. Behaviors associated with the circle may include calculating circumference, area, or even changing the radius or point. Another way of thinking of these data structures and associated behaviors is as a set of terms and rules, where terms correspond with the properties of the object and rules with the behaviors describing how to operate on that object.
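The circle example can be made concrete with a small sketch. The class and method names below (Point, Segment, Circle, change_radius) are illustrative assumptions rather than part of any existing system; the properties play the role of terms and the methods play the role of rules.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: float
    y: float


@dataclass(frozen=True)
class Segment:
    start: Point
    end: Point

    def length(self) -> float:
        # Euclidean distance between the two endpoints.
        return math.hypot(self.end.x - self.start.x, self.end.y - self.start.y)


@dataclass
class Circle:
    center: Point    # term: the point denoting the center
    radius: Segment  # term: the segment denoting the radius

    # Rules: behaviors describing how to operate on the object.
    def circumference(self) -> float:
        return 2 * math.pi * self.radius.length()

    def area(self) -> float:
        return math.pi * self.radius.length() ** 2

    def change_radius(self, new_radius: Segment) -> None:
        self.radius = new_radius


origin = Point(0.0, 0.0)
circle = Circle(center=origin, radius=Segment(origin, Point(3.0, 4.0)))
print(circle.circumference())
print(circle.area())
```

Note that nothing in this model says what a Point or a Segment means; that interpretation lives only in the names and comments, which is the limitation discussed next.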

Each of the data structures (which may be an object in its own right) often corresponds to a term from a given discipline; in fact, a recommended best practice is to develop a glossary of such terms before creating the domain objects. The behaviors for processing and manipulating these objects often encode a set of business rules consistent with the operations performed upon these domain objects. What is interesting is that the semantic interpretation of these terms and behaviors is left to the people using the system. The meaning of objects like Segment and Point is encoded for the programmer in the API documentation, and for the manager in the glossary of terms, but is left underspecified for another type of user of these systems: machines. In such models, called 'formal models', each term has a fixed and unique meaning. Because meaning is not encoded in a machine-actionable way, formal models of a domain cannot automatically re-evaluate objects with respect to new, changing meanings that may arise during processing. However, depending upon the application, it may be desirable to have a machine re-interpret an object with respect to a new meaning. Semiotic models extend formal models by adding an additional 'semantic layer', mapping the terms and symbols used in formal models to their meaning. "Differently from the logic-linguistic models developed in the West, terms and rules were not just ungrounded symbols building purely syntactical systems. The formalization of SSC (semiotic situational control) took into account sophistications like the grounding of linguistic terms and rules (its semantics)." (Towards an Introduction to Computational Semiotics)

When trying to model phenomena in which the meaning of these domain objects changes and impacts the behaviors of those objects, semiotic systems become necessary. A very simple example occurs in geometric construction, a reasoning process in which geometric primitives are drawn upon the page and constantly reinterpreted with respect to newly inferred knowledge. For example, the first construction of the Planisphere logically references the presentational point e as a north pole, and the presentational circle abgd as a sphere. However, later on in the same construction Ptolemy associates the presentational point e with a Euclidean point, and abgd with a Euclidean circle, changing the logical reference for those presentational primitives at that point in the construction. The power of semiotic systems over formal systems is that they allow one to explore the consequences of interpreting a given symbol (or presentational primitive) as a certain logical concept (meaning). Furthermore, the user (including machines) can operate upon these symbols in a manner consistent with the logical constraints associated with the interpretation of that symbol. For example, if an intersection point is logically referenced within a construction, then whenever that construction is redrawn (by man or machine), that point must result from an intersection of those two lines. Without this logical constraint (arising from mapping meaning in the form of a logical model to the symbols on the page), there would not be enough information for the machine to automatically generate a different presentation of the same logical construction process. (Step 13, edition 1), (Step 13, edition 2) (Step 13, multiple editions) The ability to generate different presentations of the same process is seen in transmission of such diagrams such as those by Heiberg and Drecker for the second construction of the Planisphere. (Heiberg's Diagram) (Drecker's Diagram)
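The intersection constraint can be sketched in a few lines. This is a minimal illustration under assumed names (Line, intersect, draw, and the label k are hypothetical, and the coordinates are invented for the example): the labelled point is never stored as raw coordinates but is always derived from the two lines, so any redrawing of the construction yields a consistent presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: float
    y: float


@dataclass(frozen=True)
class Line:
    p: Point
    q: Point


def intersect(a: Line, b: Line) -> Point:
    """Return the intersection of two non-parallel infinite lines."""
    d1x, d1y = a.q.x - a.p.x, a.q.y - a.p.y
    d2x, d2y = b.q.x - b.p.x, b.q.y - b.p.y
    denom = d1x * d2y - d1y * d2x
    if denom == 0:
        raise ValueError("lines are parallel; no unique intersection")
    # Solve a.p + t * d1 == b.p + s * d2 for t.
    t = ((b.p.x - a.p.x) * d2y - (b.p.y - a.p.y) * d2x) / denom
    return Point(a.p.x + t * d1x, a.p.y + t * d1y)


def draw(line_1: Line, line_2: Line) -> dict:
    """Redraw the construction; point k is derived from the logical constraint."""
    return {"line_1": line_1, "line_2": line_2, "k": intersect(line_1, line_2)}


# Two presentations of the same logical construction.
edition_1 = draw(Line(Point(0, 0), Point(4, 4)), Line(Point(0, 4), Point(4, 0)))
edition_2 = draw(Line(Point(0, 0), Point(6, 2)), Line(Point(0, 2), Point(6, 0)))
print(edition_1["k"])
print(edition_2["k"])
```

The coordinates of k differ between the two editions, but in both it is the intersection of the two lines, which is the only thing the logical model requires.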

Relevance of Semiotics for Intelligent Systems


Intelligent systems like humans associate meaning with signs, they create symbols. Therefore a reasonable feature of any intelligent system modeled after humans should have some mechanism for operating upon some unit of information with respect to a particular semantic 'interpretation'. When looking at a sign on the page, the meaning of that sign, and even the proper way to recognize that sign is ambiguous without some logical context or information. An open contest problem at GREC 2007 required developing an algorithm to segment a binary image of arcs. Assuming that such a difficult problem can be reasonably solved, the utility of such a solution could be further increased (and perhaps difficulty decreased) if the problem was augmented with a glossary of logical terms and their corresponding presentational conventions representing them within the diagram. In this manner, segmentations could be pruned according to their consistency with respect to diagramming conventions. The resultant segmentations could also be understood by man and machine with respect to their underlying logical semantics. This is just one example in which logical context is needed to resolve presentational ambiguity. If one draws an arrangement of lines on a piece of paper, there is no way of knowing whether this arrangement represents a work of art or the construction for an astrolabe or both. There needs to be a way to associate domain objects from art or astronomy with the domain objects from the presentational geometric model seen on the page. In order for machines to produce, navigate, and alter these diagrams on a symbolic level, there needs to be a mechanism for encoding symbols. A symbol is an association between a logical domain object and a presentational domain object. More specifically, a symbol can be encoded as a mapping from a logical domain object to a presentational domain object, where the mapping encodes the act of interpreting the symbol. 
This definition of symbol directly corresponds to its Greek root συμβάλλω, meaning to throw together, to reckon, compute, to interpret, to agree upon. Through agreeing upon a mapping that throws together two domain objects, meaning is created. I would argue that this symbol-making process is central to reasoning in which new meaning is repeatedly inferred from previous information.
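One way to encode such a symbol is sketched below, assuming hypothetical class names (LogicalObject, PresentationalObject, Symbol, Interpretation). For convenience the lookup is keyed by the sign on the page, so that the same presentational primitive can be reinterpreted over time, as happens to the point e in the Planisphere.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LogicalObject:
    domain: str   # e.g. "astronomy" or "euclidean geometry"
    concept: str  # e.g. "north pole"


@dataclass(frozen=True)
class PresentationalObject:
    kind: str   # e.g. "point" or "circle"
    label: str  # the label used on the page


@dataclass(frozen=True)
class Symbol:
    meaning: LogicalObject
    sign: PresentationalObject


class Interpretation:
    """Tracks which meaning is currently 'thrown together' with each sign."""

    def __init__(self) -> None:
        self._current: Dict[PresentationalObject, LogicalObject] = {}
        self.history: List[Symbol] = []

    def interpret(self, sign: PresentationalObject, meaning: LogicalObject) -> Symbol:
        symbol = Symbol(meaning, sign)
        self._current[sign] = meaning
        self.history.append(symbol)
        return symbol

    def meaning_of(self, sign: PresentationalObject) -> LogicalObject:
        return self._current[sign]


point_e = PresentationalObject("point", "e")
reading = Interpretation()
reading.interpret(point_e, LogicalObject("astronomy", "north pole"))
reading.interpret(point_e, LogicalObject("euclidean geometry", "point"))
print(reading.meaning_of(point_e).concept)
print(len(reading.history))
```

Keeping the history of symbols, rather than only the latest mapping, preserves the sequence of reinterpretations that makes up the reasoning.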

It is worth noting that two different symbols can serve the same purpose. Using another example from Euclidean geometric diagrams, labels are often used to encode the association from logical concept to presentational primitive within the diagram. Examples of these can be seen in the diagrams by Heiberg and Drecker for the second construction of the Planisphere. However, other symbols such as colored shapes can serve the same purpose as these alphabetic labels. Byrne's edition of Euclid assigns meaning to such presentational primitives within the diagram by literally putting them within the logical context of the text. Although diagrammatic information like that seen in Ptolemy and Euclid has traditionally been visualized using a two-dimensional geometric model, there is no reason why this same information may not be able to be visualized using a three-dimensional model, a haptic visualization model, or any type of visualization rendered into a continuous space with a notion of Euclidean distance. The properties of the presentation or representation space can be used to inform the construction of a logical argument if the representation space captures the logical properties one wishes to explore. One of the things which makes diagrams so valuable as a reasoning tool is their variation in presentation. Diagrams allow one to literally look at a geometric model with stated properties from a new perspective. With each variation spatial intuition is further developed. Seeing how the diagram's referenced primitives relate to each other spatially informs one's understanding of how they relate mathematically, a property intensified when studying geometry using Euclidean distance.

Diagrams allow one to recruit one's spatial specialists, parts of the brain that process spatial information, to inform the reasoning process. The notion of different specialists interpreting information in different ways and providing different outputs may be a useful presentational model for understanding knowledge creation during the reasoning process. In Beal's model, the brain consists of different 'specialists' that learn to communicate with each other by finding similarities between their 'interpretations' of a common signal (Learning By Learning to Communicate). The properties of a signal that survive translation between these specialists are seen as likely to represent something 'real' about the world, rather than a fluke of one specialist's processing. If one thinks of these specialists as domain specialists, each processing a signal relative to a specific domain model, then those concepts which can be mapped from the domain model of one specialist into the domain model of another specialist are more likely to represent real knowledge. Beal argues that this translation process, the struggle of specialists to communicate, teases out new information. This argument is consistent with translation in general. The process of mapping the logical concepts presented by the Ancient Greek text into English clarifies the nature of the logical model. A similar process is used when mapping the logical concepts of that same Greek text into a diagrammatic presentation. Both presentational media provide information about the relations between objects within the logical model.

The specialist model also accommodates the previous definition of symbol, in which objects from two different domain models are 'thrown together' through an association. Beal argues that when two specialists agree upon a signal, they may each interpret it differently, so that the signal captures a relationship between two concepts. In other words, the signal encodes the relation between concepts in two different specialist domains. Traditionally, the signals input into such specialist models consist of sensory information such as visual and audio cues. Specialists process each of these signals and, through communicating their outputs to each other, decide upon an encoding relating the processed versions of the original phenomenon. In other words, these specialists dynamically develop a mapping between two separate interpretations, defined in terms of a common signal.

Intuitively, if the brain passes such signals between specialists, relating concepts and thereby inferring new knowledge to understand sensory information, could more of the same neurological circuitry pass other, higher-level signals between other specialists to infer other types of knowledge? For example, could such circuits pass signals encoding logical information obtained from text and diagram to glean more information about the nature of the subject being discussed? Does it hold that adding more circuits of this type to the brain increases one's ability to process symbolic information? Even more interesting, does serializing these higher-level signals into a language enable specialists in other people's brains to process information and thereby relate to each other? If the brain does pass signals encoding higher-level thoughts between specialists, and if it processes a common signal in two different ways and thereby generates a mapping between two specialist domain objects, then Beal's specialist model could be a mechanism for symbol-generation at the neurological level. A symbol is an association between a domain object representing meaning and another domain object or sign representing that meaning.

What could one do with a system that could generate symbolic associations using a specialist model? The first task would be to decide upon what types of symbols one wants to reason upon, as the meaning and presentation of those symbols would determine the type of signal and specialist used. Let's say that one wants to reason upon symbols mapping logical astronomical and geometric entities to a presentational space of two-dimensional geometric objects. The signal used would have to encode the logical and geometric concepts and serve as input for a specialist that could generate an appropriate two-dimensional geometric representation of these concepts. Rather than encoding the signal for a logical entity in a format useful for neural networks, to start, this signal could be encoded in a string format that could be passed across another network, the internet. The string representation of the logical meaning of the symbol would then serve as the signal to a specialist that would process this signal and produce a two-dimensional visualization of the logical concept. To obtain one interpretation of this visualization, the same signal string can be passed to another specialist which outputs the text describing the logical entity being visualized. Through encoding the signal in a manner that can be parsed by two specialists, machine-actionable symbol resolution becomes possible.

Towards a Symbolic Reasoning Engine


The Planisphere Reader represents a first step towards a symbolic reasoning engine. The architecture of the application consists of two specialists, a CTS 2.0 implementation for retrieving text, and a suite of diagram services for generating diagrams. Both of these specialists take a CTS-URN encoding a step of the reasoning process represented by text and diagram in Ptolemy's Planisphere. These signals, encoded as URNs, are then sent off to both specialists, generating a diagram and retrieving the corresponding text. The act of retrieving diagram and text using the same signal associates logical information with presentational information, thereby creating a symbol, two objects from different domains that are 'thrown together' by an association. When reading the text associated with a given geometric presentation, the reader then relies upon specialists for vision, and language to resolve the sensory input signals corresponding to the page. In other words, whether specialists operate within one's brain, the brain of a friend, or a computer, one can utilize all of them by deciding upon a communication protocol, be it neurological circuits, natural language, or HTTP protocols. Central to this process is determining how best to encode information so that it can be used by all. Much can be learned from the history of textual transmission, which details the technologies developed to encode such information and translate it into forms useful for specialists using different encodings (languages) of the same underlying signal (concept).
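The two-specialist architecture can be sketched as follows. This is a toy illustration, not the Planisphere Reader's code: the identifier string, the in-memory tables, and the function names are all hypothetical stand-ins, and the text and diagram contents are placeholders. The point is only that one signal, sent to two specialists, yields a pair that can be treated as a symbol.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical identifier for one step of a construction; a real system would
# use a CTS-URN, whose exact form is not reproduced here.
STEP_URN = "urn:example:planisphere:construction-1:step-13"

# Stand-ins for the two services. In practice these would be remote calls.
TEXTS: Dict[str, str] = {
    STEP_URN: "(placeholder for the text of this step)",
}
DIAGRAMS: Dict[str, str] = {
    STEP_URN: '<svg xmlns="http://www.w3.org/2000/svg"><circle cx="50" cy="50" r="40"/></svg>',
}


def text_specialist(urn: str) -> str:
    """Interpret the signal as a request for text."""
    return TEXTS[urn]


def diagram_specialist(urn: str) -> str:
    """Interpret the same signal as a request for a diagram."""
    return DIAGRAMS[urn]


@dataclass(frozen=True)
class ResolvedSymbol:
    signal: str   # the shared identifier
    text: str     # logical information
    diagram: str  # presentational information


def resolve(urn: str) -> ResolvedSymbol:
    # Sending one signal to both specialists associates their outputs.
    return ResolvedSymbol(urn, text_specialist(urn), diagram_specialist(urn))


symbol = resolve(STEP_URN)
print(symbol.signal)
print(symbol.diagram.startswith("<svg"))
```

Because the association is carried entirely by the shared identifier, either specialist could be replaced (by a different diagram renderer, say) without changing the other.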

One extremely useful logical structure that has been used throughout this paper is the reasoning process. Through the reasoning process, prior knowledge is used to infer new knowledge until a conclusion has been reached. What better way to explore the structure of the reasoning process than to look at how the Greeks, considered by many to be the originators of rational thought, encoded their mathematical proofs? These Euclidean-style proofs actually have a formal structure as discovered by Proclus. Each proof has an 'enunciation' in which the general problem is described, a 'setting-out' in which the logical elements of the problem are labeled, a 'construction' in which the machinery necessary for the proof is developed, a 'proof' which "draws the required inference by reasoning scientifically from acknowledged facts", and a 'conclusion' in which that which is to be shown has been demonstrated. Currently, the Planisphere reader encodes the construction process, which may be thought of as an ordered sequence of 'steps'. Each step consists of prior knowledge which has been given, and an action which is to be performed, in this case drawing upon the piece of paper. Each step is justified by a property of the knowledge domain, in this case geometry. It may be possible to generalize the logical models for construction and proof to reasoning processes beyond Euclidean geometry. In general the rational thought process consists of a sequence of reasoning elements which are combined and semantically reinterpreted to infer new knowledge. In the construction, these reasoning elements are steps; in the proof, they are assertions.
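The structure described above could be modeled roughly as follows. The class and field names are assumptions made for illustration, and the sample content is a paraphrase of Euclid's Elements I.1 (constructing an equilateral triangle on a given segment), chosen only because it is short; it is not taken from the Planisphere Reader.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    given: List[str]    # prior knowledge the step relies on
    action: str         # what is drawn on the page
    justification: str  # property of the knowledge domain


@dataclass(frozen=True)
class Assertion:
    claim: str
    justification: str


@dataclass(frozen=True)
class Proposition:
    enunciation: str
    setting_out: str
    construction: List[Step]  # ordered sequence of steps
    proof: List[Assertion]    # ordered sequence of assertions
    conclusion: str


prop = Proposition(
    enunciation="On a given segment, construct an equilateral triangle.",
    setting_out="Let AB be the given segment.",
    construction=[
        Step(["segment AB"], "draw the circle with center A and radius AB", "Postulate 3"),
        Step(["segment AB"], "draw the circle with center B and radius BA", "Postulate 3"),
        Step(["intersection point C"], "join CA and CB", "Postulate 1"),
    ],
    proof=[
        Assertion("AC equals AB", "Definition 15"),
        Assertion("BC equals BA", "Definition 15"),
        Assertion("AC equals BC", "Common Notion 1"),
    ],
    conclusion="Triangle ABC is equilateral and is constructed on AB.",
)

for number, step in enumerate(prop.construction, start=1):
    print(number, step.action, "(" + step.justification + ")")
```

The third step depends on the point C, which exists only as the intersection of the two circles drawn earlier; such emergent primitives are exactly the ones whose meaning must be assigned as the construction proceeds.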

The Planisphere reader allows for the navigation and production of geometric diagrams. Unlike traditional diagrams, however, the meaning underlying a particular geometric shape can be resolved by man or machine. Rather than requiring a human to associate a diagram with its text, such an association, the symbolism of the diagram, is explicitly encoded in a machine-actionable format. The explicit encoding of semantic information is necessary for representing the construction process used to generate the image of the diagram. As a construction progresses, new logical meaning is gained from manipulating the presentational symbols on the page. In order for the reasoning to progress, this new meaning must be associated with the appropriate presentational primitives; that is, the meaning of the symbol must be changed. For example, the first construction of the Planisphere logically references the presentational point e as a north pole, and the presentational circle abgd as a sphere, but later these primitives are interpreted as a Euclidean point and circle respectively. The reasoning process used in Euclidean proof has a further requirement: logical meaning must be assigned to emergent primitives, that is, any primitive which results from manipulating one or more previously defined primitives and whose coordinates are completely determined by those primitives. Needless to say, encoding such a reasoning process depends upon the ability to encode and process the changing meanings of diagrammatic symbols.


Thursday, April 5, 2007

Developing a Humanities Business...

Although graduate school will help me develop the academic skills I need to teach, conduct research, and preserve world cultural heritage objects, I will need to supplement this experience to realize an additional professional goal, to develop sustainable business models that increase accessibility and awareness of our world's cultural heritage. It is my hope that such models will increase the general public's interest in these areas, attract more funding and talent to the humanities, and help to increase communication and therefore innovation in the field as a whole. During the course of this essay I will describe the motivation for developing such models, the role of graduate school, and specific areas where I will need to supplement my graduate school experience, both in terms of information and financial support, to achieve this professional goal.

Sustainable business models for digitizing, archiving, and disseminating humanities data are relevant and necessary, and they present some interesting research opportunities in computer science. The destruction of ancient Buddha statues by Afghanistan's Taliban government, and the burning, looting, and plundering of "eight thousand years of human history," in Iraq, both due to sanctions and the war, underscore the necessity of finding creative ways of preserving these objects for future generations ("Crisis in Iraq"). The importance of these issues, and the role of technology in addressing them, is recognized by organizations like the United Nations Educational, Scientific and Cultural Organization (UNESCO), and the European Digital Library i2010 project. However, funding for the humanities in the United States is a fraction of that for the sciences. The budget request of the National Endowment for the Humanities (NEH) for fiscal year 2007 seeks $140.955 million, as opposed to the $6.02 billion requested by the National Science Foundation (NSF), and the $28.4 billion requested by the National Institutes of Health in fiscal year 2007 ("2007 Budget Request", "Fiscal Year 2007", "US NSF - About"). Compound that with the influx of funding from corporations in the sciences and health care industries and it is clear that the humanities must do something to increase their value in the eyes of our society and thereby attract the talent and resources that our world's cultural heritage deserves. One approach is to develop sustainable business models for digitizing, archiving, and disseminating humanities data.

In addition to being necessary, this goal presents some interesting challenges, especially in terms of the requirements of a business model for the humanities. Since cultural objects belong to everybody, no one business should hold exclusive rights to data once it has been digitized, for this would directly contradict the principle of increasing accessibility to humanities data. Furthermore, since such an endeavor should archive these objects, data should be represented in open formats, not trapped in proprietary formats. Finally, people should be free to create derivative works from the data produced, facilitating innovation. One example of a corporation that successfully profits from freely available data is Google, which provides a "value-added service" to freely available data by indexing the web for searching (Crane). The relevance of businesses like Google, Yahoo, and Microsoft to the availability of ancient texts was recently mentioned in a CNN article on Google Books and the Open Content Alliance ("Google library: Open culture?"). Another requirement of such a business model would be the availability of the source code implementing these services. Since I hope to increase humanities research, mechanisms for arguing theories on datasets, such as services, should be able to be analyzed, adapted, and improved. Open sourcing the code for these services accommodates this requirement (Smith). Businesses using an open source business model include Red Hat and Mandriva Linux. The base operating system for Apple's OS X, Darwin, is also open source. These companies can still make a profit by seamlessly integrating their services in an easy-to-use, manageable product, and through selling their domain expertise (M. Weaver).

The technological questions I hope to explore during my Ph.D. work represent the next logical step in my formal education and research experience. I have already been exposed to digitizing and archiving texts and images. Now I hope to expand this knowledge to diagrams and physical artifacts such as paintings, sculptures, or even buildings. Through applying computational geometry to developing a diagram markup language, ultimately I will develop a generic way to represent (and thus generate) diagrams using a domain-specific language. Researching ways to create, disseminate, and visualize 3D datasets while useful, will also provide the general public with an exciting mechanism to learn about and interact with their heritage. While my computer science training will prove very useful in solving problems outside of the humanities, it will give me a unique insight into how such technologies could be scaled, further developed, and used within the context of a business.

My background, combined with my proposed research will give me the technological skills necessary to accomplish these academic goals. However, in order to apply these concepts within the context of a business, I need a business mentor to provide constructive feedback to my research and suggestions as to how these ideas might be used in business, both in general and specifically to businesses serving the humanities. I need a business mentor to help me to develop metrics to measure my research performance in terms of its utility towards accomplishing this goal. I need a business mentor to help me to develop a business model to preserve and increase accessibility to cultural heritage objects. In return, I represent a mechanism for making deeper connections into research involving computer science (US News and World Reports 2007). Furthermore, I would act as a vehicle for getting technology and ideas of interest to the business mentor into the program (bidirectional exchange) ("MIT Media Lab Graduate Fellows"). Finally, as a Computer Science research professor, I would grow the mentor's network of well-educated professionals through professional contacts and graduate students.

While my academic background is solid, I need a business mentor to teach me how to frame such research with respect to industry. In return I offer extensive technological experience, a solid work ethic, and an undying curiosity that would serve a mentor or sponsor well, especially in the context of a graduate program with ties to industry. Throughout my academic career, I have exhibited curiosity, drive, and an ability to work with others. If given the opportunity, I will use these strengths to develop a plan that is realistic, profitable, and compelling, a plan that will increase the value of the humanities in the eyes of our current society and preserve our world's cultural heritage for generations to come.

Thursday, March 22, 2007

The Importance of Archiving and Preservation

Archiving and preservation are central to the long-term success of Perseus as well as the vitality of the discipline of the humanities. A brief examination of the Perseus Digital Library will underscore the importance of preserving both the data and behaviors implicit in some of its user interfaces. The paper will then focus on three of Perseus' current and future options for archiving both data and associated behaviors: being open, distributing for redundancy, and leveraging institutions. In applying each of these principles, not only are issues of preservation addressed, but accessibility and innovation within the humanities as a whole are increased.

One of the newest tools Perseus has developed is the tiled image viewer. Using this interface, built upon the Google Maps API, users are free to explore Perseus' high-resolution images by zooming in up to five levels and/or panning back and forth to view the image in full. Currently, the tiled image viewer is in beta and exposes Perseus' coin images from the Dewing collection (numbers 1990.26.0001 - 1990.26.0499). In addition, images of the Comparetti, a Homeric manuscript photographed by Harvard's Center for Hellenic Studies, are also exposed. (Example 1, Example 2).

Looking closely at Example 1 from the Comparetti, one can see the main text of the Iliad, illustrations, and a wealth of notes. The text's visual richness reflects a semantic richness that makes it extremely valuable to this day. Just as the Comparetti manuscript uses the size and position of text to describe numerous associations between primary and secondary texts, Perseus' HTML text reader makes use of similar visual conventions for illustrating these relations. Like the Comparetti, Perseus' reader displays passages from alternative editions alongside the main text. However, Perseus uses technology to go a step further by providing a list of all passages within the digital library which reference the text, and extracted named entities, like people, places, and organizations, occurring within the passage. Furthermore, for Ancient Greek and Latin passages, the user may click on any word and retrieve possible meanings.

The Comparetti illustrates the importance of archiving and preservation. In the past, to view the Comparetti, one would have to travel to Italy and gain access. However, technology now makes it possible to simultaneously increase both the likelihood of survival of such manuscripts and their accessibility. Access to the Comparetti is literally a click away. Now people can do more than just view the Comparetti: they can zoom in to read a character which isn't clear or zoom out to view the overall structure of the page. The Perseus reader further increases accessibility by lowering the language barrier. Using the reader, the definition of an Ancient Greek, Latin, and soon Arabic word is just a click away. These interfaces do more than just present data; they define mechanisms for interacting with it. Such mechanisms are not limited to the user interface, but can be defined for the service layer as well. This is precisely what CTS does for texts. For Perseus, it is no longer adequate to preserve the data; the behaviors defining interactions with the data also must be preserved.

Perseus' three current options for helping its data and behaviors to survive include being open, leveraging distribution, and relying upon institutions. These options are not mutually exclusive and each of these principles can be applied to data and behaviors. The first preservation principle, being open, lies at the center of Perseus' strategy for long-term preservation. For data, being open means using formats which are application independent. A proprietary file format depends upon an institution or a company which can go out of existence. To increase the probability of survival of the data, it must be formatted in a manner that can be read by many different types of programs and platforms. Furthermore, the data should be stored in a manner so that it is easily transformable. Currently, Perseus uses XML, allowing data to easily be transformed into any format desired. Perseus' history illustrates the importance of easily transformable, open data formats. The XML texts in use today originally were SGML documents. The open nature of SGML enabled Perseus to view and transform the data into XML as technology evolved. Finally, open data means that other people are free to use it. By licensing texts under the Creative Commons license, not only does Perseus increase their survivability, but it increases access to the data. Since people who access the data are free to create derivative works, Perseus increases the opportunities for innovation within the field of the humanities.

Open behaviors also play a crucial role in long-term preservation. The Canonical Text Services exemplify this principle because they are a protocol specification, but also have a working implementation expressed as an open source API. A protocol specification details both the semantics and syntax for interacting with data, independent of or only loosely dependent upon technology. Ideally, behaviors defined by the specification should transcend technology while providing a consistent interface for interacting with the data. For example, CTS requests might be implemented using a variety of mechanisms: perhaps a series of XSLT transformations used in a Cocoon pipeline, or perhaps a WAR deployed in a Java servlet container. Regardless of the technology used, one can interact with the data in the same way, and one knows the meaning of these interactions through the semantics detailed in the protocol specification. Protocol specifications are essential to scholarly work since they formally express the meaning of the behaviors used. When actually using CTS, however, people are free to download an implementation, modify it, and explore how it works. In the best case, implementations define an API which others can use. APIs encode domain knowledge, and by allowing others to download them, more people think in terms of that model of domain knowledge. APIs have the added benefit of lending themselves to quantitative measurement of the degree to which they comply with a specification through unit testing. Furthermore, people can see exactly how a CTS request works, making it easier to implement using a future technology and increasing the opportunity for that behavior's survival into the future. Just as open data had the added benefit of increasing innovation within the humanities, so do open behaviors have the additional benefit of enabling rigorous scholarship to occur.
If someone uses a behavior to support a thesis, another researcher can examine exactly how that behavior works and decide if the thesis still holds. Not everyone will choose to look this closely to determine the merit of an argument, but to not be able to look at the code, when desired, is analogous to asking a mathematician to believe a theorem without seeing a proof. Every step of the rational argument must be exposed for true rigorous scholarship to occur. Open behaviors enable the use of technology in the rational thought process by formally defining their meaning while allowing others to examine their inner workings. For a scholarly institution like Perseus, behaviors must be open if the rational processes they represent are to be preserved.
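The idea that a specification can be checked independently of any one implementation can be sketched in a few lines of Python. Everything here is hypothetical and invented for illustration: the `PassageService` interface, the two rules in `check_conformance`, and the placeholder references are not taken from the CTS specification. The point is only that the same checks run unchanged against two different implementations.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

# Placeholder data invented for this sketch (not real passages).
FIXTURE: Dict[str, str] = {"1.1": "first passage", "1.2": "second passage"}


class PassageService(ABC):
    """Hypothetical stand-in for a behavior defined by a specification."""

    @abstractmethod
    def get_passage(self, reference: str) -> str:
        """Return the text for a reference, or raise LookupError if unknown."""


class DictBackedService(PassageService):
    """One implementation: look references up in a dictionary."""

    def __init__(self, passages: Dict[str, str]) -> None:
        self._passages = dict(passages)

    def get_passage(self, reference: str) -> str:
        return self._passages[reference]  # KeyError is a LookupError


class ListBackedService(PassageService):
    """A different implementation: scan a sorted list of pairs."""

    def __init__(self, passages: Dict[str, str]) -> None:
        self._pairs: List[Tuple[str, str]] = sorted(passages.items())

    def get_passage(self, reference: str) -> str:
        for ref, text in self._pairs:
            if ref == reference:
                return text
        raise LookupError(reference)


def check_conformance(service: PassageService) -> List[str]:
    """Return the names of the (hypothetical) rules the service violates."""
    failures = []
    if service.get_passage("1.1") != FIXTURE["1.1"]:
        failures.append("known reference returns its text")
    try:
        service.get_passage("9.9")
        failures.append("unknown reference raises LookupError")
    except LookupError:
        pass
    return failures


if __name__ == "__main__":
    for impl in (DictBackedService(FIXTURE), ListBackedService(FIXTURE)):
        print(type(impl).__name__, check_conformance(impl) or "conforms")
```

Because the checks refer only to the interface, swapping in an implementation built on some future technology would not require changing them.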

In using open data and behaviors, Perseus is free to fully leverage distribution as a preservation mechanism. Currently, Perseus is pursuing two strategies for preserving its data through distribution. The first strategy is to give away as many copies of its raw data as possible. Perseus can do this because it has licensed its data through the Creative Commons license. The second strategy involves using technologies such as SRB and iRODS to transparently distribute data on a grid while appearing to be on one logical filesystem. Distribution of data greatly increases the chances of its survival, as demonstrated by the Library of Alexandria. The Library of Alexandria had a wealth of information which was lost due to fires. This archiving failure was not due to the incompetence of the librarians of Alexandria, but a reflection of the technology of that time. To create a copy of a text, hand copying was the only option. Because of this, the likelihood of survival of data was tied to a few physical artifacts. Today, technology enables us to make one or one billion copies by typing a single command, rather than writing billions of words by hand. With such technology available, it is the duty of content providers like Perseus to make and distribute copies of their data so that future generations might also marvel at works like the Comparetti and enhance them using the technologies of their generation.

Perseus applies geographic distribution to its behaviors as well. One of the most basic ways this is accomplished is through mirror sites. Over its history, Perseus has been hosted in Berlin and at the University of Chicago, in addition to Tufts. If anything happens to one of these sites, users are still able to access Perseus' services from another server. A side effect of using mirrors is that Perseus' codebase becomes geographically distributed, producing more copies of this resource and increasing the chances of its survival into the future. A second technology that may enable the geographic distribution of Perseus' behaviors is grid technology. Not only would the grid lead to a gain in performance, but it would transparently distribute these behaviors geographically under one logical interface. Perseus is in a position to distribute its code because it is open source, underscoring yet again the importance of being open with behaviors for long term preservation and increased short term accessibility.

Finally, Perseus relies upon institutional repositories as the third component of its preservation strategy. Institutions provide a natural complement to the distribution strategy. The Library of Alexandria had immense amounts of quality content. Similarly, in the context of data, institutions have ingest policies that ensure a certain level of quality of content, along with expertise in archiving and preserving data. When combined with distribution, institutional repositories allow Perseus to leverage this expertise and quality without having to worry about putting all of its eggs in one basket.

Institutional repositories using frameworks such as Fedora also enable Perseus to preserve behaviors. The process of moving Perseus' services into an institutional repository has forced more documentation of the behaviors which they provide. Documentation of both the specification and the implementation has been improved as a result of working with Tufts' Digital Collections and Archives (DCA). If the DCA chooses to use the same implementation of a behavior as Perseus, then the documentation of the implementation, as well as the implementation itself, may be improved. If the DCA decides to use a different implementation, then the extent to which the specification is truly independent of the implementation is tested. In going through this process, Perseus ensures that its behaviors are well defined in a manner that transcends the current technology and can be maintained over time, whatever technology is used.

Being open, distributing for redundancy, and leveraging institutions form the three components of Perseus' long term preservation strategy. Through applying these principles, Perseus greatly increases the chances of long term survival of data and associated behavior while fostering accessibility and innovation within the field. In a time when technology makes it easier than ever before to preserve this data for generations to come, it is the duty of content providers like Perseus to apply technology to this end. In doing so, Perseus does its part in ensuring that future generations might also enjoy editions like the Comparetti.

Wednesday, March 14, 2007

Adding Value to Open Scholarly Content

One of the ways in which the Perseus Digital Library increases accessibility to and interest in the humanities is through making its content freely available. Perseus can give away its content and still keep its users coming back because it provides semantically precise associations between content. The Canonical Text Services (CTS) protocol and the CTS-URN syntax it defines exemplify the types of services Perseus uses to intra- and inter-connect content. However, text services are just one aspect of Perseus' service layer, a layer which represents just one-quarter of Perseus' overall logical architecture. Each of the layers of Perseus' logical architecture increases the value of the humanities and thereby the value of Perseus in the eyes of its users.

To understand how Perseus can give away its texts without losing users, it is helpful to distinguish between Perseus' static and dynamic content. Perseus' TEI-XML texts, artifact and image metadata, and named-entity and morphological datasets are static data that Perseus currently distributes, or will distribute, freely under the Creative Commons license. Perseus can afford to do this because its value lies in the associations it can create by making its data dynamically accessible. These semantically precise associations within Perseus' own content (intra-connecting) and between its own content and external data and/or services (inter-connecting) give users a wealth of context for understanding and interpreting Classical data.

The CTS protocol will make Perseus texts dynamically accessible, allowing them to be connected in ways that downloading a bunch of TEI-XML texts simply does not provide. The Canonical Text Services protocol uses URNs to reference texts in terms of their hierarchical structure. A cousin of the Functional Requirements for Bibliographic Records (FRBR), CTS URNs may identify the work of an author or an edition or translation of a work, but extend FRBR by referencing texts by their logical citation scheme. Rather than navigating by page number, CTS interprets the semantics of these URNs to retrieve logical sections of a text organized by chapter, section, or some other scheme. Furthermore, the protocol specifies extensions to the CTS URN for referencing an arbitrary sequence of characters within a passage, providing a syntax for textual alignment across editions without losing context.
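To make the hierarchy concrete, here is a minimal Python sketch that splits a URN into author-level, work-level, edition-level, and passage-level parts. It assumes the colon-delimited form `urn:cts:namespace:textgroup.work.version:passage` used in later published versions of the CTS URN syntax, which may differ from the draft current when this was written. The `CtsUrn` class, the `parse_cts_urn` function, and the identifiers in the example are illustrative only, and the character-level extensions mentioned above are not handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CtsUrn:
    namespace: str           # collection the identifiers belong to
    textgroup: str           # author-level identifier
    work: Optional[str]      # a work within the text group
    version: Optional[str]   # a specific edition or translation
    passage: Optional[str]   # logical citation, e.g. "1.1" for book 1, chapter 1


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN (assumed colon-delimited form) into its components."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace = parts[2]
    work_fields = parts[3].split(".")
    if not namespace or not 1 <= len(work_fields) <= 3 or not all(work_fields):
        raise ValueError(f"malformed work component in {urn!r}")
    return CtsUrn(
        namespace=namespace,
        textgroup=work_fields[0],
        work=work_fields[1] if len(work_fields) > 1 else None,
        version=work_fields[2] if len(work_fields) > 2 else None,
        passage=parts[4] if len(parts) == 5 and parts[4] else None,
    )


if __name__ == "__main__":
    # Illustrative identifiers, not guaranteed to match Perseus' catalog.
    print(parse_cts_urn("urn:cts:latinLit:phi0448.phi001.perseus-eng1:1.1"))
    print(parse_cts_urn("urn:cts:latinLit:phi0448"))
```

The same URN can therefore be read at any level of the hierarchy: dropping trailing components moves from a passage to its edition, then to the work, then to the author.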

CTS URNs enable Perseus to create associations which increase the value of its data. Just as Google's PageRank takes HTML links into consideration to measure the value of a page to a user, Perseus can increase the value of its content to its users with far greater semantic clarity and precision by using CTS URNs. Some services that will eventually be exposed via index services include Perseus' named-entity disambiguation, citations, and morphological information. For each of these examples, CTS-URNs and collection IDs will be combined with Index Services to make connections within Perseus' own data. However, URNs also allow Perseus to connect its highly-structured texts with external data and services; one such service is searching using Google Base.

Connecting highly-structured data with less structured data through search is not unfamiliar. When searching in Google Earth, the display window represents a range of geographic coordinates currently visible to the user. When a user performs a search, hits are displayed in terms of their geographic coordinates as markers. A CTS-aware search would be similar: results are limited to the range of textual coordinates associated with the current passage and displayed in terms of their textual coordinates. Just as Google Earth parses KML documents and provides services consistent with the semantics of longitude and latitude, so could a CTS-enabled search parse CTS URNs, interpreting the semantics of this coordinate system for textual reference. Currently, Perseus is experimenting with using CTS-URNs within queries in Google Base. After creating an item for each logical section of text and specifying the corresponding URN in the page's metadata, this less-structured data is uploaded to Google Base. Each of these items links to Perseus' text page, which provides rich context for the item. Using this approach, one can search for a text from the author to passage level (it would be very difficult to generate an item for each character on a page) and get relevant hits that point directly to Perseus. Furthermore, a URN specified in combination with a term, such as 'horse', provides a mechanism for limiting search results to the text(s) represented by that URN. (Experimental Examples: Caesar's The Gallic War, a Perseus edition English translation of Caesar's The Gallic War, Book 1, Chapter 1 of the aforementioned edition, occurrences of 'horse' within the aforementioned edition)
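The scoping idea can be sketched without any search engine at all. The Python below treats a URN as a textual coordinate and keeps only the hits that fall inside a given scope. It is not how Google Base works; it assumes the colon-delimited URN form from later CTS documentation, and the function names, URNs, and item texts are all invented placeholders rather than real passages.

```python
from typing import List, Tuple

# Invented placeholder items: (URN, text). The texts are not quotations.
SAMPLE_ITEMS: List[Tuple[str, str]] = [
    ("urn:cts:latinLit:phi0448.phi001.perseus-eng1:1.1", "placeholder mentioning a horse"),
    ("urn:cts:latinLit:phi0448.phi001.perseus-eng1:1.2", "placeholder about a river"),
    ("urn:cts:latinLit:phi0448.phi001.perseus-eng1:2.1", "another placeholder with a Horse"),
    ("urn:cts:latinLit:phi0448.phi002.perseus-eng1:1.1", "a horse in a different work"),
]


def split_urn(urn: str) -> Tuple[List[str], List[str]]:
    """Return (work-level fields, passage-level fields) for a CTS URN."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN: {urn!r}")
    work = [parts[2]] + parts[3].split(".")
    passage = parts[4].split(".") if len(parts) > 4 and parts[4] else []
    return work, passage


def in_scope(item_urn: str, scope_urn: str) -> bool:
    """True if item_urn lies at or below scope_urn in the text hierarchy."""
    item_work, item_passage = split_urn(item_urn)
    scope_work, scope_passage = split_urn(scope_urn)
    if item_work[: len(scope_work)] != scope_work:
        return False
    return item_passage[: len(scope_passage)] == scope_passage


def search(items: List[Tuple[str, str]], term: str, scope_urn: str) -> List[str]:
    """Return URNs of items inside the scope whose text contains the term."""
    needle = term.lower()
    return [urn for urn, text in items
            if in_scope(urn, scope_urn) and needle in text.lower()]


if __name__ == "__main__":
    edition = "urn:cts:latinLit:phi0448.phi001.perseus-eng1"
    print(search(SAMPLE_ITEMS, "horse", edition))         # whole edition
    print(search(SAMPLE_ITEMS, "horse", edition + ":1"))  # book 1 only
```

Widening or narrowing the scope is just a matter of dropping or adding trailing components of the URN, much as zooming in Google Earth changes the visible range of coordinates.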

CTS provides numerous benefits not inherent in the raw data, but which are emergent through the behaviors it defines. The CTS protocol defines a standard mechanism for referencing and retrieving texts. Since it is an open protocol with a functioning implementation exposed via an API, the semantics and behaviors of CTS are specified independently of implementation and are demonstrably possible to implement. The specification allows one to quantitatively measure how effective a given implementation is by unit testing its conditions. The implementation provides a well-defined API that encodes the domain knowledge resulting from working with texts and from a desire to reference them independently of representation, whether a manuscript, book, PDF, or web page. Since all requests to CTS require a CTS-URN, tracking users in a semantically meaningful way becomes possible, as the URN will appear in HTTP access logs. Questions such as how users are navigating the text, what services are being invoked on a given reference, and what requests were performed on that reference can all be answered in terms of the underlying logical structure of the text. Furthermore, since CTS was designed to work with Ancient Greek and Latin texts, it can handle multi-lingual content, and provides a syntax for datasets of aligned texts. Logical referencing of text independent of physical representation, a specification with clear meaning, an API whose functionality can be quantitatively verified, and a notation for semantically precise associations illustrate how dynamic services add value to static content and keep users of Perseus coming back.
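As a rough sketch of what answering those questions from access logs might look like, the Python below pulls the URN out of each logged request and counts requests per edition. The log lines are invented, the Common Log Format layout and the `/cts` path are assumptions, and the `request` and `urn` query parameter names are assumed for illustration rather than quoted from the specification.

```python
import re
from collections import Counter
from typing import Iterable, Iterator
from urllib.parse import parse_qs, urlsplit

# Matches the quoted request line in a Common Log Format entry (assumed layout).
REQUEST_LINE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')

# Invented log lines for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [14/Mar/2007:10:00:00 -0500] "GET /cts?request=GetPassage'
    '&urn=urn:cts:latinLit:phi0448.phi001.perseus-eng1:1.1 HTTP/1.1" 200 512',
    '192.0.2.2 - - [14/Mar/2007:10:00:05 -0500] "GET /cts?request=GetPassage'
    '&urn=urn:cts:latinLit:phi0448.phi001.perseus-eng1:1.2 HTTP/1.1" 200 498',
    '192.0.2.1 - - [14/Mar/2007:10:00:09 -0500] "GET /cts?request=GetPassage'
    '&urn=urn:cts:latinLit:phi0448.phi002.perseus-eng1:1.1 HTTP/1.1" 200 530',
    '192.0.2.3 - - [14/Mar/2007:10:00:12 -0500] "GET /index.html HTTP/1.1" 200 1024',
]


def urns_in_log(lines: Iterable[str]) -> Iterator[str]:
    """Yield the value of every `urn` query parameter found in the log."""
    for line in lines:
        match = REQUEST_LINE.search(line)
        if not match:
            continue
        query = parse_qs(urlsplit(match.group(1)).query)
        yield from query.get("urn", [])


def requests_per_edition(lines: Iterable[str]) -> Counter:
    """Count requests grouped by URN with the passage component removed."""
    counts: Counter = Counter()
    for urn in urns_in_log(lines):
        parts = urn.split(":")
        if len(parts) >= 4 and parts[:2] == ["urn", "cts"]:
            counts[":".join(parts[:4])] += 1
    return counts


if __name__ == "__main__":
    for edition, hits in requests_per_edition(SAMPLE_LOG).most_common():
        print(hits, edition)
```

Grouping by a shorter or longer prefix of the URN would answer the same question at the level of an author, a work, or a single book or chapter.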

Although text services are central to Perseus' mission, they are just one of the services Perseus offers, and the entire service layer only accounts for one-quarter of Perseus' logical architecture. Perseus' logical architecture can be used to classify Perseus' other sources of value, which increase both the value of the humanities and the value of Perseus in the eyes of its users. First, Perseus' data layer, composed of TEI-XML texts, databases, and other raw data, is freely distributed under the Creative Commons license. Not only does this establish Perseus as a data source to the community, but distributing multiple copies of each text increases its chances of surviving into the future. Rather than having to copy a text by hand, digitization provides the humanities with the ability to make and distribute an arbitrary number of copies, increasing the accessibility and survivability of each text. Furthermore, Perseus' expertise in digitizing these texts serves as a source of value for those who wish to create their own digital editions. Second, the domain layer, where behaviors are associated with the raw data, encodes the knowledge and experience gained while working with the content. Working in the domain of Classical texts provides Perseus with a unique perspective on the nature of text that others may find useful, and so gives Perseus the opportunity to help others make sense of their content. Third, Perseus' service layer provides a series of APIs implementing a set of protocols for each of the types of data Perseus serves. Many of these services rely upon the protocols specified in the TICI stack. Through the service layer, others are free to repurpose Perseus' content through an API that encodes domain knowledge, and since the API is freely available under an open source license, the community using the API becomes a source of information and value as well.
Finally, the display layer, whether widgets, HTML pages, or PDFs, gives all users a convenient and easy way to access Perseus' data. The user interface reflects the knowledge gained when building the other layers, and so helps the general public see the relations among Perseus' data.

Perseus can give away its static data because it adds value by providing semantically rich associations that add context to its content. The CTS protocol offers a new way to conceive of, reference, and deliver texts, and CTS-URNs provide a syntax for specifying relations among Perseus' own and external content and services. These relations increase the value of Perseus in a way that is not inherent in the raw data, but comes from creating associations among the data. In giving away its raw data, Perseus encourages others to develop their own associations, increasing its value as a data provider and as a service developer while increasing access to, and therefore innovation within, the humanities.

Credits:
Thanks to John Blossom whose "Shoreviews. Content Industry Outlook 2007: Reality Checks" gave me criteria for evaluating CTS as a value-added service in the context of the publishing community. Thanks to Gregory Crane for his ideas on interconnecting primary and secondary sources within the Perseus Digital Library and his initial recommendation to look at Google Base. Thanks to Neel Smith for the comparison between searching within Google Earth and a CTS-aware search. Thanks to my brother, Michael Weaver, for his work on the logical layers of an application and their relation to business processes.

Notes:
Based upon slides presented on the panel Getting Search Right for Premium Content during the Spring 2007 ASIDIC meeting.
