Panos Constantopoulos is Professor in the Department of Informatics at Athens University of Economics and Business & Digital Curation Unit in Greece

An answer to the question “what should be preserved?” would typically include digital content of various forms and related metadata. But what counts as content, and where is the boundary between content and metadata? The changing notion of ‘document’ and the evolution and spread of computationally supported business and research processes entail a greater need for keeping process and contextual knowledge.

Digital content traditionally includes material that is the product or record of some activity and comes in several forms - text, graphics, images, audio, video, data sets. Recent years have witnessed a shift in content granularity and structure. This is related to the ability to manage both the identification of entities that possess independent information value and the associations between those entities. Reference to self-contained parts of documents is an old practice (chapters, sections, tables, images, columns, etc.). Yet it is in the universe of the Web and linked data that this practice becomes fully operational on a large scale: here the parts are independently identified through their URIs following some naming scheme, and the conceptual bond that makes them parts of a whole is explicitly represented in the form of appropriate relations between them. The parts can then be reused in different ways. From this perspective, it is useful to maintain knowledge of the successive uses as well as of the contexts of use.

Nanopublications perhaps manifest the extreme of the granularity shift in the domain of scientific publication, driving the size of the unit of publishable information down to “an assertion about anything that can be uniquely identified and attributed to its author”. At a minimum, a nanopublication comprises an assertion, the provenance of the assertion, and the provenance of the nanopublication, each represented as an RDF graph. The graph-structured nature is not unique to the highly granular nanopublications; in fact it is common to all linked data collections, whether user- or machine-generated. A new kind of text thus arises: written in RDF (or, for that matter, any future alternative), the graph seamlessly blends information on the way data is generated, interpreted and used with the data itself, enabling access and reuse on a larger scale.
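
The minimal structure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: triples are plain tuples of URI strings, the `build_nanopub` and `is_minimal_nanopub` helpers are hypothetical names, the `example.org` URIs are placeholders, and the property names are taken from the nanopublication schema and PROV vocabularies as I understand them. A real system would use an RDF library and a serialization such as TriG.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), all URI strings

# Vocabulary namespaces (assumed; check against the current specifications).
NP = "http://www.nanopub.org/nschema#"
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_nanopub(base: str, assertion: List[Triple],
                  author: str, source: str) -> Dict[str, List[Triple]]:
    """Return a dict of named graphs: head, assertion, provenance, pubinfo."""
    a, p, i = base + "#assertion", base + "#provenance", base + "#pubinfo"
    return {
        # The head graph ties the three parts together.
        base + "#head": [
            (base, RDF_TYPE, NP + "Nanopublication"),
            (base, NP + "hasAssertion", a),
            (base, NP + "hasProvenance", p),
            (base, NP + "hasPublicationInfo", i),
        ],
        # The assertion itself.
        a: list(assertion),
        # Provenance of the assertion: where it came from.
        p: [(a, PROV + "wasDerivedFrom", source)],
        # Provenance of the nanopublication: who published it.
        i: [(base, PROV + "wasAttributedTo", author)],
    }


def is_minimal_nanopub(graphs: Dict[str, List[Triple]], base: str) -> bool:
    """True if all four named graphs are present and non-empty."""
    required = ("#head", "#assertion", "#provenance", "#pubinfo")
    return all(graphs.get(base + suffix) for suffix in required)


if __name__ == "__main__":
    base = "http://example.org/np1"  # placeholder identifier
    claim = [("http://example.org/geneX",
              "http://example.org/isAssociatedWith",
              "http://example.org/diseaseY")]
    np1 = build_nanopub(base, claim,
                        author="http://example.org/people/alice",
                        source="http://example.org/dataset/42")
    print(is_minimal_nanopub(np1, base))
```

The point of the sketch is that the assertion and both kinds of provenance travel together as one identifiable unit, which is what makes preserving the whole graph, rather than the bare assertion, meaningful.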

Reuse and interoperability of digital resources rely upon metadata (here meant to encompass structured metadata as well as unstructured documentation). I quote from the DP Handbook online: “From the outset of creating a digital collection, it is important to document as much as possible about a collection's assets, the tools and workflows. This documentation is an important component of technical and descriptive Metadata.” Metadata is “often the only material providing insight into how a digital resource was created, manipulated, managed and used by its creator and it is often the key to others to make informed use of the resource.” Several kinds of digital assets, such as digital surrogates of material cultural assets, have metadata attached that is in part automatically generated and, usually in large part, the outcome of long and intense intellectual effort, expressed in structured form (e.g. in RDF) according to a particular information model (e.g. the Europeana Data Model or the Monument Inventory Data Standard). This metadata not only describes the corresponding digital item, but also carries information about the original cultural asset of which the digital item is a surrogate.

Preservation then extends over three layers of information: the data, the metadata and the model that specifies the information structure and semantics. Without the model it is impossible to interpret the metadata, or to transform it to other forms, which is a recurrent need. Models, as representations of conceptual structures, provide the intellectual basis for interpreting and reusing digital content. Increasingly encoded in machine-processable form (e.g. in RDFS, OWL), but also in graphical notations, models are themselves digital content worth preserving.
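
A toy sketch may make the dependency between the layers concrete. Everything here is hypothetical: the `ex:` property names, the `MODEL` and `METADATA` dictionaries and the `interpret` function are invented for illustration and do not correspond to any particular standard such as the Europeana Data Model.

```python
from typing import Dict, Tuple

# The model layer: what each property means and what kind of thing it points to.
MODEL: Dict[str, Dict[str, str]] = {
    "ex:creator": {"label": "Creator", "range": "agent"},
    "ex:depicts": {"label": "Depicts", "range": "cultural asset"},
}

# The metadata layer: statements about one digital item (the data layer),
# expressed using the properties the model defines.
METADATA: Dict[str, str] = {
    "ex:creator": "ex:agent/42",
    "ex:depicts": "ex:monument/7",
}


def interpret(metadata: Dict[str, str],
              model: Dict[str, Dict[str, str]]) -> Dict[str, Tuple[str, str]]:
    """Resolve each metadata property against the model.

    Raises KeyError when a property is not defined in the model, which is
    the situation a preserved record ends up in if its model is lost.
    """
    readable = {}
    for prop, value in metadata.items():
        if prop not in model:
            raise KeyError(f"no model definition for property {prop!r}")
        definition = model[prop]
        readable[definition["label"]] = (value, definition["range"])
    return readable


if __name__ == "__main__":
    print(interpret(METADATA, MODEL))
    try:
        interpret(METADATA, {})  # model not preserved
    except KeyError as err:
        print("cannot interpret:", err)
```

With the model at hand the record resolves into labelled, typed statements; with an empty model the same record is just opaque keys and values.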

Models may represent state or process information describing real entities, as well as events and actions. Business processes, including digital preservation, and research processes in almost all domains are increasingly supported by computational systems for planning, controlling and, to varying degrees, automating workflows. Business process models are meant to capture and communicate to both people and machines why and how processes run, thus forming the basis for this computational support, and also for understanding, reasoning about, reusing and changing processes. Whether explicitly specified or, more and more often, extracted from textual sources, processes encoded as graph structures capture vital aspects of the context of use of digital resources; they are therefore worth preserving in their own right. Do models display some ‘granularity effect’? Content granularity aims at maximizing reuse potential. In the same vein, model granularity is manifested by the emergence of reusable patterns and ontologies, certainly resources of continuing value.

In conclusion, the various kinds of knowledge in the form of structured data are a class of digital content of increasing size and complexity, deserving attention and preservation both in their own right and in connection with the data they explain.
