What can we learn from the Semantic Web to ‘revamp’ our Historical Research?
Hervin Fernandes-Aceves | Lancaster University
For us dilettantes in Digital Humanities, learning how to use new approaches and techniques can be as exciting as it is challenging. Despite the growing accessibility of these new tools, keeping up with the most recent fads is still a very demanding task for those of us whose areas of study are not the Digital Humanities themselves. Have you ever played Shadow of the Colossus? It feels like that: one stands as a tiny player before a massive, imposing giant, and although one cannot aspire to match its size or reach, the trick resides in finding your own path to climb it. The emergence in our fields of previously unknown representation languages – which usually come hand in hand with new editing software – contributes to the confusion and, in many cases, widens the gap between the community of Digital Humanists and the rest of the students and scholars in the humanities looking for the practical applications and tangible benefits of the digital turn. One of those trends, which has attracted more and more attention over the last decade, is the semantic web model and its Linked Data.
I remember the first time I heard about Linked Data; it came in the form of a question: ‘have you thought of using Linked Data instead for your prosopographical databases?’ I did not understand what Linked Data actually meant, so I automatically assumed it had to be a different name for ‘connected’ data, as it can already be expressed in a relational database or in an adjacency matrix. I must confess, I did not give any importance to the notion; I was already feeling rather comfortable with the conceptualisations and tools I had. Little did I know what I was talking about. In my defence, in none of those engagements was I told how it differs from other types of connected data – perhaps people were too kind to correct me publicly. In any case, this is how this voyage of discovery started.
From utter unawareness, things started to change when I finally understood the key concept: in Linked Data, all relationships between data are represented in an explicit way! (As opposed to relational databases, which represent relationships between data only implicitly.) The distinction between explicit and implicit was the kernel from which the right questions began to emerge. If the relationships in Linked Data are explicit, then how can I practically structure data in that manner, so that I no longer need queries or schemas to determine relations between individual entries? What would a database of Linked Data look like? In my attempt to answer these questions, I have prepared the introductory reflection that I wish I had had when I first heard of Linked Data. With it, I will briefly cover the applications we historians can obtain from the ‘semantic turn’, without necessarily being computer or Web experts.
Linked Data as the building blocks of graph databases
What is the difference between relational data and Linked Data? Relational data concerns the interactions, connections, group affiliations and event attendances that relate one actor to another, and so cannot be reduced to the properties of the individual actors themselves. In other words, relations are not attributes of individuals, but properties of a connected system. Relational data is, in this way, a crucial concern not only for sociological research – in the sociological tradition, the emphasis is upon the study of the structure that arises from intertwined social action – but also for historical investigations that focus on persons, communities and places. The database technology with which relational data is gathered is normally hierarchical or relational. Examples of the former are XML documents, which may use specific mark-up standards of textual representation, such as TEI and CEI. An XML document typically contains nodes of information, each with a parent node; a digitised medieval charter would, for instance, have a parent node for the entire document, containing a node for each diplomatic section (arenga, narratio, eschatocol, etc.), which would in turn contain more specific nodes representing subsections or recorded persons and actions.
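As a rough sketch (the section names follow the example above; the recorded person ‘Iohannes’ is invented for illustration), such a hierarchical document can be modelled as nested nodes, and any reference can only be reached by descending the tree from its single parent node:

```python
# A charter as a tree: every node hangs from exactly one parent node.
charter = {
    "node": "charter",
    "children": [
        {"node": "arenga", "children": []},
        {"node": "narratio", "children": [
            {"node": "person", "name": "Iohannes", "children": []},
        ]},
        {"node": "eschatocol", "children": []},
    ],
}

def find(node, kind):
    """Collect all nodes of a given kind by descending the hierarchy;
    relations exist only as subordination within the tree."""
    found = [node] if node["node"] == kind else []
    for child in node["children"]:
        found += find(child, kind)
    return found

print(find(charter, "person")[0]["name"])  # Iohannes
```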
Conversely, a relational database can be made without a marked-up transcription. Continuing with the medieval charter example, in a prosopographical relational database the relationships between recorded historical persons and documents (stored in different content-tables, one for people and another for charters) are given in additional link-tables, made by tying together unique identifiers (a.k.a. primary keys).
This means that in either model relations between elements need to be ‘deduced’: from a tree-like structure in which data is classified by subordination, or from a relational model that organises data into one or more tables of columns and rows, with a unique key identifying each row. This is why relations are implicit in these types of database. So, what is the alternative model that would allow relations to be explicitly asserted?
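A minimal sketch of the link-table pattern just described (with invented table contents and keys) shows why the relation stays implicit: it only materialises when a query, written with exact knowledge of the schema, joins the tables together:

```python
# Two content-tables and one link-table, keyed by invented primary keys.
people = {1: {"name": "Bareson"}}
charters = {10: {"title": "Donation to Montecassino, c. 1066"}}
appears_in = [(1, 10)]  # link-table: (person_id, charter_id)

def charters_of(person_id):
    """The relation person->charter only comes into existence here,
    in a query that must know which table links to which."""
    return [charters[c]["title"] for p, c in appears_in if p == person_id]

print(charters_of(1))  # ['Donation to Montecassino, c. 1066']
```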
The alternative technology is based on a network model that ties data entities together directly, rather than through tables or subordination; the databases built on this model are called graph databases. The limitations of relational databases stem from the implicit nature of the relations between data: connections are expressed as links between tables, and specific relations only come into existence in a query statement – which, in the case of relational databases, is written in SQL (a query language, i.e. code used for managing data). These queries must be written in accordance with the structure of the database (its schema), as the parsed relation is limited by how the tables are connected to each other. In short, in a relational database, relations between entities can only be asserted by writing queries, a process that demands exact knowledge of the specific data structure. The graph model, by contrast, explicitly lays out the relations between individual entities, treating each entity as a node of data. Relationships in a graph database are thus, in programming jargon, first-class citizens, and hence can be labelled, directed, and given properties. Without any limiting tree-like structure or additional link-tables, all the entities, including the relations themselves, can be related to each other, no single one having any intrinsic importance over another. And this type of data, where all relationships between data are asserted in an explicit way, is precisely what is called Linked Data. There is thus no need to write queries in order to ‘extract’ basic relationships, as these can be identified automatically by ‘walking’ the network of information created by the graph database.
The first lesson from the Semantic Web: The Resource Description Framework
Now that we know what Linked Data is, and how it is stored in a graph database, the next step is to learn how to ‘express’ such a database. For this, we turn to one of the foundational technologies that make the Semantic Web possible: the Resource Description Framework (RDF). The graph data model is how the semantic web stores data, and RDF is the format in which it is written; the RDF specifications offer a sort of language that allows us to define and express a graph database. Let us start with a simple example of a graph database, looking at three statements as read from an actual medieval charter (the donation made c. 1066 by Baresone, ruler of the Sardinian kingdom of ‘Torres’, to the abbey of Montecassino – and the earliest extant example of a Sardinian charter): 1) Bareson is a lord; 2) Ore is a kingdom; 3) Bareson was reigning in Ore. As a data graph, these statements become three labelled, directed connections between entities.
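Sketched roughly in code (the edge labels are invented for illustration), the three statements are three labelled, directed edges, and everything asserted about an entity can be read off by simply walking them – no schema or join query required:

```python
# The three charter statements as explicit, labelled, directed edges.
edges = [
    ("Bareson", "bearsTitle", "lord"),
    ("Ore", "isA", "kingdom"),
    ("Bareson", "reigningIn", "Ore"),
]

def outgoing(node):
    """Everything asserted about a node, found by walking the edges."""
    return [(label, target) for source, label, target in edges if source == node]

print(outgoing("Bareson"))  # [('bearsTitle', 'lord'), ('reigningIn', 'Ore')]
```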
In RDF terminology, statements like these are called semantic triples. A semantic triple is made of three elements: subject, predicate and object. The triple is the atomic data unit of the RDF data model and, as such, the foundation for defining data structures in the semantic web. From the previous example, statement 1 can be expressed as a triple where the subject is ‘Bareson’, the predicate is the action of bearing a title, and the object is ‘lord’. The links between entities are called ‘properties’, and they are the same as the triples’ predicates.
Now, what does an RDF statement actually look like? Much as with an HTML or XML document, RDF is written using strings of mark-up tags and content. Following our Sardinian charter example, the statements describing our historical person Bareson can be expressed as a set of RDF triples.
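As a rough sketch – using the project URIs discussed below, and the simple line-based N-Triples serialisation, one of several concrete RDF syntaxes alongside the XML form just mentioned – the triples describing Bareson might be written and printed like this:

```python
PEOPLE = "http://www.medievalsardinia.com/people#"
FEATURE = "http://www.medievalsardinia.com/prosopographic-feature#"
PLACE = "http://www.medievalsardinia.com/place#"

# Objects are either literal values or resources (URIs of other entities).
triples = [
    (PEOPLE + "Bareson", FEATURE + "name", ("literal", "Bareson")),
    (PEOPLE + "Bareson", FEATURE + "reigningPlace", ("resource", PLACE + "Ore")),
]

def to_ntriples(subject, predicate, obj):
    """Render one triple in N-Triples syntax: literals are quoted,
    resources are wrapped in angle brackets."""
    kind, value = obj
    rendered = f'"{value}"' if kind == "literal" else f"<{value}>"
    return f"<{subject}> <{predicate}> {rendered} ."

for t in triples:
    print(to_ntriples(*t))
```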
Do not worry about the details for now; as with any other representation language, it takes some time to learn and get used to – luckily, there are numerous tutorials online that will help you familiarise yourself with writing RDF: for instance, w3schools.com’s introduction to RDF, or the European Commission Open Data Support’s training module on RDF & SPARQL. For the moment, let us focus on two key aspects of this way of expressing triples: 1) the use of specific WWW links as unique identifiers for our data entities; and 2) the fact that objects in RDF triples can be either ‘literal’ values (meaning data you input directly as you write the statement) or ‘resources’, i.e. the IDs of other entities described in RDF. Consequently, the subject of one RDF triple may also be referenced as the object of a predicate (i.e. property) in another RDF triple – this may be a confusing concept at the beginning, but things start to make sense once one tries to write some examples. Let us take a look at ours: the name ‘Bareson’, the object of the predicate ‘http://www.medievalsardinia.com/prosopographic-feature#name’ (has a name), is given as a literal value, as ‘Bareson’ is spelled out in the statement. Conversely, the object of the predicate ‘http://www.medievalsardinia.com/prosopographic-feature#reigningPlace’ (is reigning in a place) is a resource identified by the link ‘http://www.medievalsardinia.com/place#Ore’, which means the object is itself an RDF subject, one whose name has in turn been identified as an object with the literal value ‘Ore’.
Regarding the unique identifiers mentioned before (e.g. http://www.medievalsardinia.com/people#Bareson), these are called Uniform Resource Identifiers – URIs for short. In RDF, URIs are used to give a unique ID to the subjects, predicates or objects of statements. They are given as links on the World Wide Web because they are not limited to providing unique identification within a single document; they serve the foundational, driving purpose behind the Semantic Web: to make data exchangeable globally.
By using these URIs, we can represent our RDF example as a graph in which every entity and property is identified globally.
As a graph-based model, RDF offers a standardised yet flexible platform to record data that can be shared and linked globally. Nonetheless, on its own it does not offer a model for recording ‘meaning’ – what in informatics and Linked Data contexts is called semantics! For that, we need something else to complement RDF: an additional language to give our data a semantic dimension.
Semantics: Giving meaning to our data with ontologies and OWL
In order to ‘imbue’ data with meaning, the semantic model uses vocabularies and ontologies. A vocabulary is a consistent collection of defined terms with contextually distinct meanings – a sort of raw list of terms. An ontology, on the other hand, defines the contextual relationships of a given vocabulary: in an ontology, classes and interrelationships are recognised behind the terms provided by a vocabulary. As you can imagine, the two terms overlap, and in many instances ‘vocabulary’ may be used to refer to an ontology which does not claim a rigidly formal backing, so beware. By defining terms and determining and implying the relations expected between those terms, ontologies are the tools with which a specific information domain can be defined. This is one of the central features of an ontology: providing an explicit description of a domain of knowledge by fixing relations between terms. The syntax that allows us to express an ontology is the Web Ontology Language, or OWL. This language, however, did not start from nothing; before OWL, a set of classes with specific properties had been designed using the RDF data model, known as the RDF Schema (RDFS) – published back in 1998. This is the starting vocabulary that provides the basic elements for the production of vocabularies in RDF and, as such, it is also the basis for the description of ontologies. OWL is in this way an extension of RDFS.
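As a toy illustration of these ideas (the ‘ms:’ prefix and terms are invented for the example; the rdf:type and rdfs:subClassOf identifiers are the real W3C namespace URIs), a micro-ontology can itself be written as triples, and the relations it fixes between terms allow class membership to be inferred:

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# An invented micro-ontology: vocabulary terms, plus the relations
# that give them context ('ms:' is a made-up project prefix).
ontology = [
    ("ms:Ruler", RDFS_SUBCLASS, "ms:HistoricalPerson"),
    ("ms:HistoricalPerson", RDFS_SUBCLASS, "ms:Agent"),
]
facts = [("ms:Bareson", RDF_TYPE, "ms:Ruler")]

def classes_of(individual):
    """Infer every class an individual belongs to by following the
    subClassOf relations upwards through the ontology."""
    result = []
    queue = [o for s, p, o in facts if s == individual and p == RDF_TYPE]
    while queue:
        cls = queue.pop(0)
        if cls not in result:
            result.append(cls)
            queue += [o for s, p, o in ontology
                      if s == cls and p == RDFS_SUBCLASS]
    return result

print(classes_of("ms:Bareson"))  # ['ms:Ruler', 'ms:HistoricalPerson', 'ms:Agent']
```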
If you are interested in making and using your own ontologies, then you will have to familiarise yourself with these two vocabularies first! So, for example, if we would like to define a knowledge domain in which data from a medieval archaeology project and a medieval charters project merge, in order to link their respective datasets, a common vocabulary should first be defined. For this to work, the vocabulary should be consistent; e.g. the terms ‘historical person’ and ‘artefact’ should mean the same for both fields. This can be accomplished by using the same ontology to express the meaning and the connections of the data that both fields attempt to cover in a common domain. By using the same ontology, data can then be parsed and published using the same query language – SPARQL, in the case of data expressed in RDF – so that both projects can communicate with each other. And there is even more: once a common vocabulary from a knowledge domain has been formalised as an ontology, it can be shared and linked with other ontologies. Likewise, new ontologies can be written on the basis of existing ones, taking advantage of already defined terms and interconnections. As mentioned before, ontologies are written by annotating RDF statements expressed in two basic syntaxes: RDFS and OWL. Just as with RDF, there are a handful of online tutorials that will help you familiarise yourself with OWL and its specifications – such as the W3C OWL Web Ontology Language Guide or the University of Manchester’s Protégé OWL Tutorial – and I would not like to burden this space with a half-cooked tutorial of my own. However, it is crucial at this point to understand the basic operating principles that allow OWL to give ‘meaning’ to data. This is accomplished by using instances called individuals, classes and subclasses. In OWL, individuals that share common characteristics are grouped under a class.
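SPARQL itself works by matching triple patterns in which variables (written with a leading ‘?’) bind to whatever fits; the toy matcher below (not real SPARQL, and with invented data standing in for the two hypothetical projects) illustrates how a shared vocabulary lets one query span both datasets at once:

```python
# A merged dataset from two hypothetical projects that share the
# invented 'ms:mentions' term of a common vocabulary.
data = [
    ("charter:montecassino1066", "ms:mentions", "people:Bareson"),
    ("artefact:seal01", "ms:mentions", "people:Bareson"),
]

def match(pattern):
    """Match one triple pattern against the data; terms starting with
    '?' are variables, as in a SPARQL basic graph pattern."""
    results = []
    for triple in data:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                bindings[term] = value
            elif term != value:
                break
        else:
            results.append(bindings)
    return results

# 'Which sources mention Bareson?' -- answered across both projects.
print(match(("?source", "ms:mentions", "people:Bareson")))
```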
As a member of an OWL class, an individual is implied to fall under a specific semantic classification. Likewise, a class can be placed as a subclass under another class, creating in this way a hierarchy of semantic terms. This hierarchy of OWL classes is called a taxonomy. A taxonomy is not the only way in which semantic terms can be linked to each other in OWL. Individuals can also be related to other individuals by means of properties; remember what properties are in RDF? Exactly: the predicates of a semantic triple. In OWL, properties are divided into two types, depending on the object they take: if the object is a literal value, the predicate is an OWL datatype property; if instead the object is another OWL instance, the predicate is called an object property. In short, OWL adds ‘semantics’ to data that has first been expressed in RDF, and it does so by creating a taxonomy (i.e. a hierarchy of classes and subclasses) and a network of data entities (connecting individuals to other individuals by means of object properties, or connecting individuals to literal values by means of datatype properties). In essence, the notion of ontology refers to an explicit and formal conceptualisation, and one exercises this conceptualisation by defining a structure of relations. Therefore, in the language of web ontologies, meaning is given by explicitly defining hierarchy and interconnectivity. But, beyond the world of the Semantic Web, how can OWL be applied to our specific research needs?
Landing the semantic model on our own research
Until this point, I have intentionally omitted one of the key purposes of the Semantic Web, and I have done so in order to postpone a discussion that should now be easier to understand. The goal of making data globally accessible and imbuing it with explicit meaning is to make it machine-readable.
Although I cannot and will not discount the importance and potential of making our data machine-readable, I must admit this was not an attractive feature for me initially, and perhaps I am not the only one out there. I was more interested in how to gather and store my own historical data, focusing on the social and spatial relations attested in the sources.
It was not the automation potential of an ontology, but its capacity for moving beyond XML files and relational databases that allowed me to understand the benefits of the semantic model. An ontology is not just another formal representation for machine reading, but an integral and flexible platform for structuring the contents of documents. This is the result of speaking of data conceptualisation and semantic relationships, rather than of sequences, nested concepts, and relationships implied by assigning key references. The mark-up language approach (e.g. TEI and CEI) leans towards the analysis of the text itself, be it a medieval manuscript or a charter, forcing the digital editor into reproducing the formal structure of the textual artefact. As one tags places or historical persons, the assertions are limited to the position and shape of the reference in the text, leaving almost no space for an interpretative representation of the function that the place or the person played in their respective historical context – and the same can be said about dates! With its ontologies, the semantic model provides a structured-data approach to the digitisation of historical sources that, contrary to the traditional textual reproductions in XML, incentivises the assimilation of conceptual relations arising from the historical sense of the cultural and social context from which the source emerges. A recent article on the semantics of medieval charters has already identified this useful lesson from the Linked Data approach, arguing that, through the use of the ontological formal structure, there is ‘a way towards a perspective on charters that moves the focus from a charter's actual text to an historical interpretation of what that charter was doing in its society’. That is spot on! And this could apply not only to charters, but to any other textual artefact used as a historical source.
The semantic model offers us a larger yet nevertheless transparent data-gathering method for historical interpretation. The conceptual dimension added by an ontology turns a graph database into a wider, interconnected expression of domains of knowledge – a knowledge network, as it were. This advantage should not be underestimated; as formal vocabularies shared by a multi-, trans- and interdisciplinary scientific community, ontologies are a path towards collective and cooperative knowledge. There are plenty of already written ontologies out there ready for historical research – just to mention some: the ‘Friend of a Friend (FOAF)’ project, ‘RELATIONSHIP’, a vocabulary for describing relationships between people, ‘BIO’, a vocabulary for biographical information, and J. Bradley’s Factoid Prosopography Ontology – and perhaps we should explore them on some other occasion. The semantic web has become the fertile ground on which new initiatives have grown towards a new understanding of scientific partnership and public collaboration. The Linking Open Data community project, for example, provides a platform for the publication on the web of open and interlinked RDF datasets, owned and managed by the entire community of users. We should then speak not only of Linked Data, but of linked open data: the basis for making content freely available to everyone and for vastly improving the ability of researchers around the globe to identify and use connections across domains of knowledge in a way that the standard publication of data does not allow. Even if you would like to design an ontology and build a graph database just for yourself, they can be built on and linked to other agreed formal vocabularies and other datasets – and you never know, your own ontology might serve as the basis for other vocabularies and thus further expand common domains of knowledge. This is just the beginning, and it seems there is no research too small to join the semantic turn.
----------------------
Edited in E. Blasco Ferrer, Crestomazia sarda dei primi secoli, Officina linguistica, 4.4, 2 vols (Nuoro, 2003), i, no. 1, pp. 27–32.
For a discussion of the mathematical and philosophical origins of the notion of ontology, what conceptualisation means, and the implications of speaking of ‘formal’ and ‘explicit’, see N. Guarino, D. Oberle, and S. Staab, ‘What Is an Ontology?’, in Handbook on Ontologies, ed. by S. Staab and R. Studer, 2nd edn (Berlin: Springer, 2009), pp. 1–17. See also an important precedent reading in T.R. Gruber, ‘Toward Principles for the Design of Ontologies Used for Knowledge Sharing?’, International Journal of Human-Computer Studies, 43.5 (1995), 907–28 <https://doi.org/10.1006/ijhc.1995.1081>.
J. Bradley and others, ‘Exploring a Model for the Semantics of Medieval Legal Charters’, International Journal of Humanities and Arts Computing, 13 (2017), 136–54 (p. 160).
Hervin Fernández-Aceves (@HervinFA) is a Postdoctoral Research Associate at Lancaster University, working on Mediterranean societies, Italian historiography and medieval charters. He was educated at the National Autonomous University of Mexico (UNAM), the Central European University (CEU) and the University of Leeds. He has recently been an Overseas Fellow of the Mexican National Council of Science and Technology (CONACYT) and a Rome Awardee at the British School at Rome.