The Importance of Archiving and Preservation
Archiving and preservation are central to the long-term success of Perseus as well as to the vitality of the humanities as a discipline. A brief examination of the Perseus Digital Library will underscore the importance of preserving both the data and the behaviors implicit in some of its user interfaces. The paper will then focus on three of Perseus' current and future options for archiving both data and associated behaviors: being open, distributing for redundancy, and leveraging institutions. Applying each of these principles not only addresses issues of preservation, but also increases accessibility and innovation within the humanities as a whole.
One of the newest tools Perseus has developed is the tiled image viewer. Using this interface, built upon the Google Maps API, users are free to explore Perseus' high-resolution images by zooming in up to five levels and panning back and forth to view the image in full. Currently, the tiled image viewer is in beta and exposes Perseus' coin images from the Dewing collection (numbers 1990.26.0001 - 1990.26.0499). Images of the Comparetti, a Homeric manuscript photographed by Harvard's Center for Hellenic Studies, are also exposed (Example 1, Example 2).
Looking closely at Example 1 from the Comparetti, one can see the main text of the Iliad, illustrations, and a wealth of notes. The text's visual richness reflects a semantic richness that makes it extremely valuable to this day. Just as the Comparetti manuscript uses the size and position of text to describe numerous associations between primary and secondary texts, Perseus' HTML text reader uses similar visual conventions to illustrate these relations. Like the Comparetti, Perseus' reader displays passages from alternative editions alongside the main text. However, Perseus uses technology to go a step further by providing a list of all passages within the digital library that reference the text, along with extracted named entities, such as people, places, and organizations, occurring within the passage. Furthermore, for Ancient Greek and Latin passages, the user may click on any word and retrieve possible meanings.
The Comparetti illustrates the importance of archiving and preservation. In the past, to view the Comparetti, one would have to travel to Italy and gain access. However, technology now makes it possible to increase both the likelihood of survival of such manuscripts and their accessibility at the same time. Access to the Comparetti is literally a click away. People can now do more than just view the Comparetti: they can zoom in to read a character that isn't clear or zoom out to view the overall structure of the page. The Perseus reader further increases accessibility by lowering the language barrier. With the reader, the definition of an Ancient Greek, Latin, and soon Arabic word is just a click away. These interfaces do more than just present data; they define mechanisms for interacting with it. Such mechanisms are not limited to the user interface, but can be defined for the service layer as well. This is precisely what CTS does for texts. For Perseus, it is no longer adequate to preserve the data alone; the behaviors defining interactions with the data must also be preserved.
Perseus' three current options for helping its data and behaviors to survive are being open, leveraging distribution, and relying upon institutions. These options are not mutually exclusive, and each of these principles can be applied to both data and behaviors. The first preservation principle, being open, lies at the center of Perseus' strategy for long-term preservation. For data, being open means using formats that are application independent. A proprietary file format depends upon an institution or a company that can go out of existence. To increase the probability that the data survives, it must be formatted in a manner that can be read by many different types of programs and platforms. Furthermore, the data should be stored so that it is easily transformable. Currently, Perseus uses XML, which allows data to be transformed easily into other formats as needed. Perseus' history illustrates the importance of easily transformable, open data formats. The XML texts in use today were originally SGML documents. The open nature of SGML enabled Perseus to view and transform the data into XML as technology evolved. Finally, open data means that other people are free to use it. By licensing texts under a Creative Commons license, Perseus not only increases their survivability, but also increases access to the data. Since people who access the data are free to create derivative works, Perseus increases the opportunities for innovation within the field of the humanities.
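As a minimal illustration of why an application-independent format is easy to transform, the following Python sketch renders one small XML fragment in two different output formats using only the standard library. The fragment, its element names, and the function names are hypothetical and chosen for illustration; they are not Perseus' actual markup or tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment; element names and text are illustrative only.
SAMPLE_XML = """<text>
  <div type="book" n="1">
    <l n="1">first line of verse</l>
    <l n="2">second line of verse</l>
  </div>
</text>"""


def to_plain_text(xml_source: str) -> str:
    """Render each <l> element as 'book.line<TAB>text', one per row."""
    root = ET.fromstring(xml_source)
    rows = []
    for div in root.iter("div"):
        book = div.get("n", "?")
        # findall("l") matches only direct children of this <div>.
        for line in div.findall("l"):
            text = (line.text or "").strip()
            rows.append(f"{book}.{line.get('n', '?')}\t{text}")
    return "\n".join(rows)


def to_html(xml_source: str) -> str:
    """Render the same lines as a minimal HTML ordered list."""
    root = ET.fromstring(xml_source)
    ol = ET.Element("ol")
    for line in root.iter("l"):
        li = ET.SubElement(ol, "li")
        li.text = (line.text or "").strip()
    return ET.tostring(ol, encoding="unicode")


if __name__ == "__main__":
    print(to_plain_text(SAMPLE_XML))
    print(to_html(SAMPLE_XML))
```

Because both functions read the same source, adding a further output format would require only another small function rather than a change to the stored data.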
Open behaviors also play a crucial role in long-term preservation. The Canonical Text Services (CTS) exemplify this principle because they are a protocol specification that also has a working implementation expressed as an open source API. A protocol specification details both the semantics and the syntax for interacting with data, independent of, or only loosely dependent upon, technology. Ideally, behaviors defined by the specification should transcend technology while providing a consistent interface for interacting with the data. For example, CTS requests might be implemented using a variety of mechanisms, perhaps as a series of XSLT transformations in a Cocoon pipeline, or perhaps as a WAR deployed in a Java servlet container. Regardless of the technology used, one can interact with the data in the same way, and one knows the meaning of these interactions through the semantics detailed in the protocol specification. Protocol specifications are essential to scholarly work since they formally express the meaning of the behaviors used. When actually using CTS, however, people are free to download an implementation, modify it, and explore how it works. In the best case, implementations define an API that others can use. APIs encode domain knowledge, and when others are allowed to download them, more people come to think in terms of that model of domain knowledge. APIs have the added benefit that the degree to which they comply with a specification can be measured quantitatively through unit testing. Furthermore, people can see exactly how a CTS request works, making it easier to implement using a future technology and increasing the opportunity for that behavior's survival into the future. Just as open data has the added benefit of increasing innovation within the humanities, so open behaviors have the additional benefit of enabling rigorous scholarship to occur.
If someone uses a behavior to support a thesis, another researcher can examine exactly how that behavior works and decide whether the thesis still holds. Not everyone will choose to look this closely to determine the merit of an argument, but being unable to look at the code, when desired, is analogous to asking a mathematician to believe a theorem without seeing a proof. Every step of the rational argument must be exposed for truly rigorous scholarship to occur. Open behaviors enable the use of technology in the rational thought process by formally defining their meaning while allowing others to examine their inner workings. For a scholarly institution like Perseus, behaviors must be open if the rational processes they represent are to be preserved.
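The relationship between a specification, its interchangeable implementations, and unit testing for compliance can be sketched in Python as follows. This is an illustrative sketch only: `PassageService`, `get_passage`, the sample identifiers, and the two backends are hypothetical stand-ins, and the code does not reproduce the actual CTS requests or any Perseus code.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple
import xml.etree.ElementTree as ET


class PassageService(ABC):
    """Hypothetical specification: return the text of a passage.

    Semantics: raise KeyError when the work or reference is unknown.
    """

    @abstractmethod
    def get_passage(self, work: str, ref: str) -> str:
        ...


class DictPassageService(PassageService):
    """Implementation backed by an in-memory dictionary."""

    def __init__(self, passages: Dict[Tuple[str, str], str]) -> None:
        self._passages = passages

    def get_passage(self, work: str, ref: str) -> str:
        return self._passages[(work, ref)]


class XmlPassageService(PassageService):
    """Implementation backed by an XML document."""

    def __init__(self, xml_source: str) -> None:
        self._root = ET.fromstring(xml_source)

    def get_passage(self, work: str, ref: str) -> str:
        for w in self._root.findall("work"):
            if w.get("id") != work:
                continue
            for p in w.findall("passage"):
                if p.get("ref") == ref:
                    return (p.text or "").strip()
        raise KeyError((work, ref))


def check_compliance(service: PassageService) -> List[str]:
    """Run identical checks against any implementation; return failed checks."""
    failures = []
    if service.get_passage("demo-work", "1.1") != "sample passage text":
        failures.append("known passage returns its text")
    try:
        service.get_passage("demo-work", "9.9")
        failures.append("unknown passage raises KeyError")
    except KeyError:
        pass
    return failures


# Two backends holding the same hypothetical content.
dict_service = DictPassageService({("demo-work", "1.1"): "sample passage text"})
xml_service = XmlPassageService(
    '<library><work id="demo-work">'
    '<passage ref="1.1">sample passage text</passage>'
    "</work></library>"
)

if __name__ == "__main__":
    for service in (dict_service, xml_service):
        print(type(service).__name__, check_compliance(service))
```

The same `check_compliance` function runs unchanged against both backends, which is the sense in which compliance with a specification can be measured independently of the technology behind an implementation.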
In using open data and behaviors, Perseus is free to fully leverage distribution as a preservation mechanism. Currently, Perseus is pursuing two strategies for preserving its data through distribution. The first strategy is to give away as many copies of its raw data as possible. Perseus can do this because it has licensed its data under a Creative Commons license. The second strategy involves using technologies such as SRB and iRODS to distribute data transparently across a grid while making it appear to reside on one logical filesystem. Distribution of data greatly increases the chances of its survival, as the fate of the Library of Alexandria demonstrates. The Library of Alexandria held a wealth of information that was lost to fire. This archiving failure was not due to the incompetence of the librarians of Alexandria, but was a reflection of the technology of that time. Hand copying was the only way to create a copy of a text. Because of this, the likelihood that data would survive was tied to a few physical artifacts. Today, technology enables us to make one or one billion copies by typing a single command, rather than writing billions of words by hand. With such technology available, it is the duty of content providers like Perseus to make and distribute copies of their data so that future generations might also marvel at works like the Comparetti and enhance them using the technologies of their generation.
Perseus applies geographic distribution to its behaviors as well. One of the most basic ways this is accomplished is through mirror sites. Over its history, Perseus has been hosted in Berlin and at the University of Chicago, in addition to Tufts. If anything happens to one of these sites, users are still able to access Perseus' services from another server. A side effect of using mirrors is that Perseus' codebase becomes geographically distributed, producing more copies of this resource and increasing the chances of its survival into the future. A second technology that may enable the geographic distribution of Perseus' behaviors is grid technology. Not only would the grid improve performance, but it would transparently distribute these behaviors geographically under one logical interface. Perseus is in a position to distribute its code because it is open source, underscoring yet again the importance of being open with behaviors for long-term preservation and increased short-term accessibility.
Finally, Perseus relies upon institutional repositories as the third component of its preservation strategy. Institutions provide a natural complement to the distribution strategy. The Library of Alexandria held immense amounts of quality content. Similarly, in the context of data, institutions have ingest policies that ensure a certain level of quality of content, as well as expertise in archiving and preserving data. When combined with distribution, institutional repositories allow Perseus to leverage this expertise and quality without having to worry about putting all of its eggs in one basket.
Institutional repositories using frameworks such as Fedora also enable Perseus to preserve behaviors. The process of moving Perseus' services into an institutional repository has forced more thorough documentation of the behaviors they provide. Documentation of both the specification and the implementation has been improved as a result of working with Tufts' Digital Collections and Archives (DCA). If the DCA chooses to use the same implementation of a behavior as Perseus, then the documentation of the implementation, as well as the implementation itself, may be improved. If the DCA decides to use a different implementation, then the extent to which the specification is truly independent of the implementation is tested. In going through this process, Perseus ensures that its behaviors are well defined in a manner that transcends the current technology and can be maintained over time, whatever technology is used.
Being open, distributing for redundancy, and leveraging institutions form the three components of Perseus' long-term preservation strategy. By applying these principles, Perseus greatly increases the chances of long-term survival of data and associated behaviors while fostering accessibility and innovation within the field. In a time when technology makes it easier than ever before to preserve this data for generations to come, it is the duty of content providers like Perseus to apply technology to this end. In doing so, Perseus does its part in ensuring that future generations might also enjoy editions like the Comparetti.
