CIF Best Practice Guide - Digitization Projects and the Preservation of Digital Content
1.0 Digitization
More and more cultural institutions are putting their content online. As part of the process, they are digitizing their collections and/or creating new digital cultural content.
Pictures of artefacts, books and maps can be scanned. A scanned image of an artefact then becomes a new digital resource in itself that must be managed and preserved like the original artefact.
Managing a digitization project
The professional literature offers extensive guidance on managing digitization projects. Before digitizing, a number of issues must be considered and decisions made. For instance:
- What content or documents are to be digitized and for what purposes?
- How large is the collection of items to be digitized?
- What expertise is needed to digitize and describe each digitized object?
- What digitization and description standards are appropriate?
- Who will digitize?
- Who will describe the digitized objects?
- What equipment is needed?
- How much time will it take to digitize and describe all items of the collection?
- Where will digitized items be stored? On what media?
- How will digitized objects be preserved?
- For how long do digitized objects need to be preserved?
- What is the budget for the digitization project?
- What results are realistic to expect?
- etc.
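Several of the questions above (collection size, storage media, budget) can be roughed out with simple arithmetic before any scanning begins. The sketch below estimates the space needed for uncompressed master images. The page dimensions, scanning resolution, colour depth, item count, and number of copies are illustrative assumptions, not values recommended by this guide, and the function names are hypothetical.

```python
def uncompressed_image_bytes(width_in, height_in, dpi, channels=3, bits_per_channel=8):
    """Approximate size of one uncompressed raster image, ignoring file headers."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels * bits_per_channel // 8


def collection_storage_gb(item_count, bytes_per_item, copies=1):
    """Total storage in gigabytes (10**9 bytes) for all items and all copies."""
    return item_count * bytes_per_item * copies / 10**9


# Hypothetical project: 5,000 pages of 8 x 10 inches scanned at 600 dpi
# in 24-bit colour, kept in two copies (a working copy and a backup).
per_item = uncompressed_image_bytes(8, 10, 600)
total_gb = collection_storage_gb(5000, per_item, copies=2)
print(f"{per_item / 10**6:.1f} MB per image, {total_gb:.0f} GB in total")
```

Under these assumptions each image is about 86.4 MB and the collection needs roughly 864 GB. Actual files may be smaller if a lossless compression option is used, so the figure is best treated as an upper bound for planning.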
Digitization Standards
When you begin digitizing, follow the digitization standards that apply to your type of content. Doing so helps ensure quality and interoperability.
The Canadian Museum of Civilization (CMC) authored a document, "Digitization Standards for the Canadian Museum of Civilization Corporation,"1 to assist its staff in digitization projects. The document is available in English and French.
It is common practice to digitize and preserve in TIFF format, then make a JPEG copy for distribution on the Web. The TIFF format, although proprietary, is the de facto standard for the long-term preservation of digital assets, as it can store a high-resolution, detailed image of the item without lossy compression. However, the format produces large files that are not appropriate for wide distribution. On the Web, JPEG files are more appropriate as they are much smaller. Because JPEG compression discards some image data, the quality of JPEGs is not as good as that of TIFFs, but it is usually sufficient for viewing.
Books are often digitized into Adobe PDF documents. In such cases, the PDF document is simply a collection of images of each page of the book. Although this proprietary format has become a de facto standard, it presents a number of accessibility issues. When digitizing, make sure you can also produce other output formats that both meet the needs of your Web site's target audience and are accessible to users with disabilities or technological barriers.
Also keep in mind all Canada Interactive Fund (CIF) Technical Requirements and Recommendations when digitizing collections for publication on the Web. For instance, on a CIF-funded Web site, large images will need to be available as thumbnails and be described in (X)HTML using the alt attribute. PDF must only be used as a secondary or alternate version to a (X)HTML text-based and accessible version. Metadata will have to be provided for the main sections of the Web site, but they can also be provided for each picture stored in a database. More information on CIF Technical Requirements is available from the CIF Web site2.
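The thumbnail and alt attribute requirement can be illustrated with a short sketch. It generates the markup for a thumbnail that links to the full-size image and refuses to produce output when the description is empty. The function name, file paths, and description are hypothetical, and the exact markup expected on a CIF-funded site should be checked against the CIF Technical Requirements themselves.

```python
from html import escape


def thumbnail_link(full_image_url, thumbnail_url, alt_text):
    """Return XHTML markup for a thumbnail that links to the full-size image."""
    if not alt_text.strip():
        raise ValueError("alt text is required for accessibility")
    return (
        f'<a href="{escape(full_image_url, quote=True)}">'
        f'<img src="{escape(thumbnail_url, quote=True)}" '
        f'alt="{escape(alt_text, quote=True)}" /></a>'
    )


# Hypothetical file names and description, for illustration only.
print(thumbnail_link(
    "images/map-1850.jpg",
    "thumbs/map-1850.jpg",
    'Hand-drawn map titled "Upper Canada", 1850',
))
```

Escaping the description keeps quotation marks and ampersands in catalogue text from breaking the attribute, which matters when descriptions come straight from a collections database.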
2.0 Preservation
Preservation comprises all activities related to the long-term availability, integrity, understandability, and usability of a resource (e.g. an artefact, a document, a picture). This includes activities such as conservation, storage, description, physical security, and access to the resource. Typically, a resource is preserved for its business, legal, or historical value to the organization.
Digital preservation is specifically concerned with the preservation of electronic (digital) content. It is defined as "a broad range of activities designed to extend the usable life of machine-readable computer files and protect them from media failure, physical loss, and obsolescence."3 Digital preservation activities are divided "into those that promote the long-term maintenance of a bitstream (the zeros and ones) and those that provide continued access to its content."4
Preservation of digitized and born-digital contents
It is widely believed that it is easier to preserve digital information than physical resources, such as books and photographs. In practice, the opposite is true. It is very easy to permanently erase a digital document or image, and even a large content database or Web site. In comparison, old documents, such as medieval manuscripts and even older documents on various media, are still preserved under the right environmental conditions in museums and archives. It takes more than just clicking on the "delete" button to destroy a book. In addition, the multiplicity of printed copies makes it even more difficult to entirely erase all traces of the content across time and space.
Once an artefact is digitized, the resulting digital resource must be preserved for the long term just like the original artefact. In digital preservation, there are a number of challenges to address:
- Unlike a book, which yields direct access to the content, a digital resource is not directly accessible. It requires software, middleware and hardware, all of which are subject to technological obsolescence. This is particularly true with ever-evolving file formats.
- Digital content storage media, such as compact disks, optical disks, diskettes, servers, and hard drives, are unstable over time and subject to media decay.
- Context (also known as metadata), which is any relevant information about the digital resource, must also be preserved. Context helps maintain the resource's meaning over time.
- The look and feel (presentation, layout, format) provides the user experience with the digital resource. Preserving the look and feel, as well as the original functionalities, such as hypertext and interactivity, is another challenge.
- As digital resources are easily modified, steps must be taken to preserve the authenticity (the genuine characteristics) of a resource and to prevent unauthorized modifications. Authentication is the process that determines the authenticity of a resource. However, it relies on encryption, certificates, and similar elements, which must themselves be preserved so that the content remains accessible.
- Copyright issues may impede access to digital content.
- Preservation requires time, expertise, and financial commitments from an organization.
- Finally, lack of planning for digital preservation may prevent this important activity from taking place. Planning for policies, procedures, equipment, infrastructure, budget, and training is too often overlooked.
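Media decay and unauthorized modification, two of the challenges listed above, both show up as changes to a file's bits. One common way to detect such changes is to record a checksum for every file and recompute it later. The following is a minimal sketch, assuming the digitized files are kept under a single directory. The function names are hypothetical, and SHA-256 is an illustrative choice rather than something this guide prescribes. The check detects change; it neither prevents nor repairs it.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (by relative path) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(old, new):
    """Report files that are missing, changed, or added since the old manifest."""
    return {
        "missing": sorted(set(old) - set(new)),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
        "added": sorted(set(new) - set(old)),
    }
```

In practice the first manifest would be saved alongside the collection (and ideally in a second location), then compared against a freshly built one on a regular schedule or after every copy to new media.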
As there is no magical solution, institutions must use a mix of different strategies to preserve digital resources, in the short, medium, and long terms. Preservation of digital resources includes the following strategies. Each has pros and cons. Choose the ones that are the most appropriate for your project.
- Migration: Migration is the conversion from an older file format to a newer one. Performing conversions on a regular basis keeps a digital resource up to date with the current technological environment; however, bits of information are sometimes lost in the process. Conversion is also time-consuming. Another viable strategy is to save digital content in standardized or open file formats that should better endure technological changes than would proprietary ones.
- Emulation: Emulation uses software, middleware and/or hardware to imitate the functionalities of obsolete technologies. While emulators can reproduce the look and feel of an original digital resource, emulation is a costly solution. Emulators themselves are also subject to obsolescence.
- Migration/emulation combination: Combining the features of both migration and emulation is a sound strategy as it allows an organization to count on more than one solution. However, the limitations of both must be taken into account.
- Preserving original technologies: Holding on to old technologies can certainly preserve the look and feel and genuine functionalities of a resource; however, it is costly (in terms of finances, space, and expertise) and certainly not a viable solution for the future, unless you plan to open a museum of past technologies!
- Creating hard copies: The low-tech solution of copying digital content to media that are easily preserved (paper or microfiche, for example) presents many limitations, such as the loss of the digital resources' functionalities. It is definitely not suited to a Web environment. It is also a tedious task, and it requires significant physical storage space.
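Planning a migration starts with knowing which file formats a collection actually contains, and file extensions are not always reliable. The sketch below guesses a format from the first bytes of each file and counts files per format. It covers only a handful of well-known signatures and uses hypothetical function names; a real project would more likely rely on a dedicated format-identification tool.

```python
# Leading byte signatures ("magic numbers") for a few common formats.
SIGNATURES = [
    (b"II*\x00", "TIFF"),            # little-endian TIFF
    (b"MM\x00*", "TIFF"),            # big-endian TIFF
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"%PDF-", "PDF"),
]


def identify_format(header):
    """Guess a format name from the first bytes of a file, or return 'unknown'."""
    for signature, name in SIGNATURES:
        if header.startswith(signature):
            return name
    return "unknown"


def inventory(paths):
    """Count files per detected format so migration candidates can be listed."""
    counts = {}
    for path in paths:
        with open(path, "rb") as handle:
            name = identify_format(handle.read(16))
        counts[name] = counts.get(name, 0) + 1
    return counts
```

Files reported as "unknown" deserve the closest attention, since they are the ones most likely to depend on software that is already hard to obtain.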
Preserving your website
A Web site is just another digital resource to be preserved over time. Contrary to popular belief, putting content online does not guarantee that it will be preserved for the long term or remain accessible over time. According to studies,5 the average life of a Web page is 100 days. In fact, an increasing number of Web sites simply vanish from cyberspace. Reasons6 include:
- Change of URL: The main entry point of the Web site was changed without notice.
- Costs of maintaining a Web site: There are time and financial implications to the maintenance of a Web site that organizations have not carefully planned for. This is especially true when a Web site features interactive resources or is dynamically generated through the use of a content management system. Managing broken links is also costly in terms of time. In addition to technology and time, skilled people are required to maintain a Web site over time.
- Legal issues: When copyrights have not been cleared, online content may have to be removed indefinitely.
- Misuse or non-compliance with Web standards: Misusing or ignoring Web standards jeopardizes the viability of a Web site over time, and thus access to its content.
- Obsolescence of technology and media decay: A Web site faces the same digital preservation challenges as any digital content in terms of technological obsolescence and the decay of storage media.
- Internet service providers (ISPs): When ISPs go out of business, hosted Web sites are closed. If no backup files exist, a Web site's content may be lost forever. Likewise, moving a site to another host or server often changes the URL that users had come to know.
To preserve your Web site in the long term, a number of measures7 can be taken.
- Contractual agreements: Formal contracts can be established between stakeholders involved in the creation and maintenance of a Web site. For example, CIF fund recipients are required to maintain their Web sites for a number of years, depending on the contribution agreement.
- Technological solutions:
  - Do not change URLs if you reorganize the structure of your Web site.
  - Use persistent URLs (known as PURLs) for your pages, and adopt consistent file naming conventions.
  - Link to directories or ".html" files rather than proprietary or technology-specific file formats, e.g. PDF or ASP.
  - Mirror your Web site in other locations.
  - Have your content harvested, either by joining harvesting initiatives relevant to the content of your Web site or by depositing your important site documents in trusted repositories.
- Awareness: When a Web site disappears or is not properly maintained for long-term viability, there is a risk that valuable cultural content will be unavailable to future generations. All stakeholders involved in the development and management of a Web site must be made aware of the importance of preserving the content of the site, as well as the site itself, according to preservation standards and best practices.
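The advice to link to directories or ".html" files rather than to PDF or ASP files can be checked automatically. The following is a minimal sketch that scans a page's markup and lists the links pointing at those formats. The class and function names and the sample page are hypothetical, and the list of flagged extensions is an assumption that a site would adjust to its own situation.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Formats the guide advises against linking to directly (adjust as needed).
FLAGGED_EXTENSIONS = (".pdf", ".asp")


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def flagged_links(html_text):
    """Return links whose path ends in one of the flagged extensions."""
    collector = LinkCollector()
    collector.feed(html_text)
    return [
        link for link in collector.links
        if urlparse(link).path.lower().endswith(FLAGGED_EXTENSIONS)
    ]


# Hypothetical page fragment, for illustration only.
page = (
    '<p><a href="reports/">Reports</a> '
    '<a href="docs/guide.PDF">Guide</a> '
    '<a href="search.asp?q=1">Search</a></p>'
)
print(flagged_links(page))
```

The comparison is made on the path only, so query strings such as "?q=1" do not hide an ".asp" link, and it ignores letter case so that ".PDF" is caught as well.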
Conclusion
By applying digitization and preservation best practices, and conforming to CIF Technical Standards, fund recipients maximize their chances of keeping their Web sites, and most importantly their contents, retrievable and accessible in the long term, despite organizational and technological changes.
Resources
Brosseau, Kathleen and Mylène Choquette and Louise Renaud. Canadian Museum of Civilization and Canadian War Museum. Digitization Standards for the Canadian Museum of Civilization Corporation. March 2006. http://www.chin.gc.ca/ATutor/bounce.php?course=29.
Canadian Heritage Information Network. Creating and Managing Digital Content [Web resources related to the digitization and preservation of digital content]. June 2007. http://www.chin.gc.ca/English/Digital_Content/index.html.
Cornell University. Digital Preservation Management: Implementing Short-term Strategies for Long-term Problems. October 2005. http://www.library.cornell.edu/iris/tutorial/dpm/index.html.
Côté, Marie-Claude. Canadian International Development Agency. CIDA's Long-term Accessibility Framework. November 2006. [Not available online.]
Kelly, Brian. Approaches to the Preservation of Web Sites. Online Information 2002 Conference. December 2002. http://www.ukoln.ac.uk/web-focus/events/conferences/online-information-2002/abiword-html/
1 http://www.chin.gc.ca/ATutor/bounce.php?course=29
2 http://www.pch.gc.ca/pc-ch/org/sectr/ac-ca/bdc/tech-eng.cfm
3 Cornell University. Digital Preservation Management Tutorial. October 2005. http://www.library.cornell.edu/iris/tutorial/dpm/index.html.
4 Ibid.
5 Kelly, Brian. Approaches to the Preservation of Web Sites. Online Information 2002 Proceedings. http://www.ukoln.ac.uk/web-focus/events/conferences/online-information-2002/paper.pdf.
6 Ibid.
7 Ibid.