The goal of the DSpace installation at plazi.org is to demonstrate how the corpus of texts covering the descriptions of the world's species can be assembled into a digital repository for stable, long-term access. In this presentation we will focus on our deployment of DSpace working in combination with a community-based text mark-up tool (which produces an XML-encoded version of the original scanned or electronically published document), as well as a web service that allows individual descriptions to be extracted from within the body of publications.
The published record of biological systematics, including the descriptions of the world's 1.8 million species, has some unique characteristics. The scientific naming of species is regulated by Codes, and thus the publications are quasi-legal documents. Descriptions remain relevant for a very long time, even if they are later complemented by more comprehensive ones. Additionally, access to existing descriptions is vital for understanding not only the 1.8 million known species but also the 20+ million yet to be described. Valid treatments for animals, for example, date back to 1758 and include perhaps more than 10 million pages, almost all of which are available only in hard copy. Taxonomic treatments are also highly structured documents and very rich in data. A wealth of important morphological descriptions and data, geographic distribution data, bibliographical references, and more resides latent in the taxonomic literature.
Items in the repository are made up of several files. A PDF is usually available, but in many cases (and, given enough time and resources, in all) another representation of the publication, encoded in the XML schema TaxonX, is provided. The encoding opens up the treatments, exposing the data contained within to extraction, data mining, analysis, and a variety of other purposes. Since the mark-up process is slow and expensive and requires knowledge of the systematics domain, a community mark-up server has been added, so that interested parties can not only upload new PDF documents but also download and enhance the documents in discrete, well-defined steps towards valid TaxonX documents. Similarly, other applications can build upon the foundation provided by the DSpace repository, such as a search/retrieval interface oriented towards the needs of the systematics domain, and integrate into the wider and growing systematics, conservation, and biodiversity cyberinfrastructure.
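As a rough illustration of how XML encoding exposes individual treatments to extraction, the following Python sketch pulls treatments and their typed sections out of a small marked-up document. The element and attribute names used here (publication, treatment, name, section, type) and the function extract_treatments are simplified, hypothetical stand-ins chosen for this example; they are not taken from the TaxonX schema, and the sample text is invented placeholder content.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a marked-up publication.
# Element and attribute names are illustrative assumptions, NOT the TaxonX schema.
SAMPLE = """\
<publication>
  <treatment>
    <name>Aus bus</name>
    <section type="description">Body slender, dark brown.</section>
    <section type="distribution">Known from one locality.</section>
  </treatment>
  <treatment>
    <name>Aus cus</name>
    <section type="description">Body stout, pale yellow.</section>
  </treatment>
</publication>
"""


def extract_treatments(xml_text):
    """Return one dict per treatment: the taxon name plus its typed sections."""
    root = ET.fromstring(xml_text)
    treatments = []
    for treatment in root.iter("treatment"):
        name = (treatment.findtext("name") or "").strip()
        # Map each section's type attribute to its text content.
        sections = {
            section.get("type", "unknown"): (section.text or "").strip()
            for section in treatment.findall("section")
        }
        treatments.append({"name": name, "sections": sections})
    return treatments


if __name__ == "__main__":
    for item in extract_treatments(SAMPLE):
        print(item["name"], sorted(item["sections"]))
```

Once each treatment is available as a separate record in this way, a service could return a single description on request or feed the typed sections into data mining; a real implementation would need to follow the actual schema and its namespaces.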