Article of the Month - December 2020
Land Governance Lost in Translation - Exploring Semantic Technologies to Increase Discoverability of New Technologies & Data
Lisette Mey, Netherlands, and Laura Meggiolaro, Italy
1. INTRODUCTION
Language and technology barriers are a
serious constraint on effectively exchanging and learning from land data,
information and technologies across the globe. We would like to explore
whether we can gain inspiration from how semantic web technologies have
overcome knowledge-sharing challenges in other sectors, such as the
agriculture sector. With emerging technologies, new tools and
ever-growing amounts of land data, we face a very real risk of losing
the overview. Without this overview, data is much less likely to be used
and thus be useful. We will particularly look at the use and value of
controlled vocabularies for the land sector.
Land is a topic that is
debated in many languages, across different (academic) disciplines and
in all parts of the world. Furthering our collective agenda, sharing and
learning from knowledge and perspectives from other contexts, or
transitioning technological innovations from one country to the other is
complicated by - among many other aspects - language and terminology
barriers. Many attempts have been made in the past to find common
definitions and terminologies for issues related to land, but a wide
consensus or adoption has never been reached. Understandably so: one can
only imagine the heated and controversial discussion to reach agreement
on what we mean exactly when we use the word ‘property’. It simply does
not have the same meaning in each country or context. It is a daunting
and arguably impossible task to reach this global consensus. In this
paper, we will present our experience with controlled vocabularies and
the opportunities and challenges they can bring.
2. THE POTENTIAL OF THE SEMANTIC WEB
2.1 What is the semantic web?
Tim Berners-Lee, the inventor of the world wide
web, once described the semantic web as follows:
I have a dream for
the Web [in which computers] become capable of analyzing all the data on
the Web – the content, links, and transactions between people and
computers. A "Semantic Web", which makes this possible, has yet to
emerge, but when it does, the day-to-day mechanisms of trade,
bureaucracy and our daily lives will be handled by machines talking to
machines. The "intelligent agents" people have touted for ages will
finally materialize.
There is a wealth of data and information available on the web, more
being added every day from every part of the world. It has become
impossible for humans to digest this all and be aware of all elements
online. It is sometimes said, ironically, that the answer to the world’s
problems lies in a PDF somewhere online. But someone needs to find,
access and digest this information before being able to actually solve
the world’s problems. We would not go so far as to say that “all
the world’s problems” can be solved with already existing information,
but it is certainly true that we can benefit more from
existing knowledge and tools to address issues that occur globally.
Generally, new technologies (for example, on data capture or
innovative surveying methods) or newly generated knowledge are shared
among personal networks, such as the FIG network. But what about people
who do not have access to such networks? Knowledge remains confined
within certain silos, whether they are thematic (land administration
vs. gender experts, for example), sectoral (surveyors vs. grassroots
activists, for example) or geographical. Not accessing all potentially
beneficial knowledge and tools is therefore partially an issue of
breaking out of old habits, but even if the will were there - where do
you possibly begin? If a simple Google search for ‘surveying techniques’
returns over 34 million records, even the best intentions are not going
to be enough. It is simply too much for a human to digest this wealth of
information.
The semantic web aims to address just this. The goal of the ‘semantic
web’ is to make information that is available online machine-readable. Humans
cannot digest all this data and information, meaning that important
knowledge will never reach its full potential or, in the worst case
scenario, will remain unused. Machines can help us read and digest this
information at an unprecedented speed and scale. In order to effectively
share knowledge and technologies across the globe and increase our
collective efficiency, we need to embrace a tool like the semantic web.
2.2 What is machine readability?
To understand how we can embrace the semantic web as a tool for
effective knowledge sharing globally, we need to understand what machine
readability is. The common perception that anything put on the web can
be read by machines is woefully incorrect. It is true that many
applications or software instances have been developed to digest more
and diverse types of information, such as pictures, PDFs or even
satellite images. But such applications are often very expensive to
develop and perfect and, as such, are hardly ever affordable for
non-commercial organizations, particularly for
people and organizations working in less developed countries. The idea
of the semantic web does not envision ‘machine readability’ through
such applications or software, but rather non-proprietary machine
readability.
Important to remember is that machines read in 0s and 1s, and
therefore structure, standards and formats are incredibly important for
a machine to fully understand the meaning of data or information. The
semantic web is based on the ‘Resource Description Framework’ (RDF), which is
a machine-readable technology based on triples: subject, predicate and
object.[1] Structuring information, particularly
metadata, in such a way allows machines to understand what it is about
and helps retrieve information for an end user. This may sound convoluted,
but it is something anyone who has uploaded any information to a
repository has dealt with.
Think of a simple example of uploading a paper to an online library
or journal. You will be required to fill in certain fields describing
your paper. The ‘subject’ (first of the triples) you are describing is
your paper. The ‘predicates’ (second of the triples) are the different
fields that you are required to fill in. A title field, for example,
will have “hastitle” as predicate in the backend of the online library.
The ‘object’ (third of the triples) is the actual title of your
publication. A machine will read: “your paper” >> hastitle >> “title”.
Three elements are of crucial importance in the back end to make this
information machine-readable: format, uniqueness and standards.
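The paper example can be sketched in a few lines of Python. In RDF's own terminology the resource being described is the subject and the value attached to it is the object. This is only an illustration: the paper identifier is a made-up example, the Dublin Core "title" property stands in for the “hastitle” field, and the output line follows the N-Triples notation for a triple with a text value.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # URI of the thing being described (the paper)
    predicate: str  # URI of the property, taken from a shared standard
    obj: str        # the value; here a plain text literal (the title)


# Dublin Core "title" property, used here in place of "hastitle".
DCTERMS_TITLE = "http://purl.org/dc/terms/title"
# Made-up identifier for the uploaded paper (assumption for this sketch).
PAPER_URI = "https://example.org/papers/123"


def to_ntriples(t: Triple) -> str:
    """Write one triple with a literal object as a single N-Triples line."""
    escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{t.subject}> <{t.predicate}> "{escaped}" .'


triple = Triple(PAPER_URI, DCTERMS_TITLE, "New Surveying Methods")
line = to_ntriples(triple)
print(line)
```

Each additional field in the upload form would simply become one more triple with the same subject and a different predicate.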
2.2.1 Format
Firstly, the format needs to be open. As mentioned before, for a machine
to read a PDF or an Excel file, it will need programs such as Adobe Acrobat or
Microsoft Excel. The principle of machine readability is that such
proprietary software is not needed. This RDF-based metadata
should therefore be in an open format such as CSV or JSON.
We will not go much deeper into the topic of formats, because much has
already been written on it.
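As a small illustration of what "open" means in practice, the sketch below writes one metadata record as JSON and as CSV and reads it back using nothing but Python's standard library. The record and its identifier are invented for the example.

```python
import csv
import io
import json

# Hypothetical metadata record for one uploaded paper.
record = {
    "id": "https://example.org/papers/123",
    "title": "New Surveying Methods",
    "language": "en",
}

# JSON: one text document, readable by any JSON parser.
as_json = json.dumps(record, ensure_ascii=False, sort_keys=True)

# CSV: a header row followed by one row per record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "title", "language"])
writer.writeheader()
writer.writerow(record)
as_csv = buffer.getvalue()

# Reading both back needs no proprietary software.
from_json = json.loads(as_json)
from_csv = dict(next(csv.DictReader(io.StringIO(as_csv))))

print(as_json)
print(as_csv)
```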
2.2.2 Uniqueness
Secondly, uniqueness is very important. Remember that machines read in
0s and 1s; the title of a paper such as “New Surveying
Methods” is therefore read as a particular combination of 0s and 1s. Another paper
with exactly the same title will have the same combination of 0s and 1s.
The name of a tool, for example, may also
change over time. How will the machine be able to understand that two papers
with the same name are in fact different papers (and how will it
attribute the right RDF information to the right paper)? Or how will a
machine know that the two names a tool has had over time refer to
the same tool?
A machine will need to be able to differentiate. This is why in the
semantic web, the use of unique IDs is of crucial importance. Think of
how papers in journals often have a DOI, or published books have
an ISBN. The same should go for resources published on the
(semantic) web: resources should have a unique ID to ensure that
machines will always be able to attribute meta-information about this
content to the correct and unique resource.
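Both situations described above can be sketched as follows. All identifiers, tool names and years are invented for illustration: two papers share a title but carry different IDs, and one tool has had two names that resolve to a single ID.

```python
from typing import List

# Two different papers with exactly the same title (made-up data).
papers = {
    "https://example.org/papers/123": {"title": "New Surveying Methods", "year": 2018},
    "https://example.org/papers/456": {"title": "New Surveying Methods", "year": 2020},
}

# One tool that was renamed: both names point at the same identifier.
tool_names = {
    "FieldMapper": "https://example.org/tools/7",
    "FieldMapper Pro": "https://example.org/tools/7",
}


def ids_with_title(title: str) -> List[str]:
    """Return every paper ID carrying this title, in sorted order."""
    return sorted(uri for uri, meta in papers.items() if meta["title"] == title)


def same_tool(name_a: str, name_b: str) -> bool:
    """Two names denote the same tool when they resolve to the same ID."""
    id_a = tool_names.get(name_a)
    return id_a is not None and id_a == tool_names.get(name_b)


matches = ids_with_title("New Surveying Methods")
print(matches)                                      # two distinct IDs
print(same_tool("FieldMapper", "FieldMapper Pro"))  # same ID behind both names
```

Because metadata is attached to the ID rather than to the title or name, each paper keeps its own record and the renamed tool keeps a single one.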
2.2.3 Standards
A third crucial element of machine readability is standards. Take the
example we mentioned above: how does a machine know that the
“hastitle” predicate actually denotes the title of a resource? Because the
predicate is based on a standard. Standards have been developed for
metadata, formats and data structures -- all in a way that machines are
able to understand them. Hundreds of papers could be written, and
several PhD studies conducted, digging into the different
standards, how they work and how they were developed. In this paper we
want to focus on one type of standard in particular: controlled
vocabularies.
2.3 What are controlled vocabularies?
A controlled vocabulary, in short, provides a way to search and discover
data and information. Controlled vocabularies are used in libraries,
repositories and any other knowledge storage system for indexing
information.[2] The concepts in such a controlled
vocabulary are used to tag data and information. Using a controlled list
of concepts, issues such as synonyms, homographs or translations are
circumvented. It is, in other words, a standard for keywords.
This is another critical element for the effectiveness of the semantic
web. If a user queries a database, for a machine to be able to retrieve
relevant information, it is important that the computer also understands
what the topic is. If anyone can fill in anything when they upload
content to this database, the machine has no way of knowing the
relationships between terms, or how a resource tagged with a synonym
might also be of interest to this user.
Controlled vocabularies work with unique IDs for each concept, with the
possibility of attaching several pieces of information to that ID: the preferred term,
synonyms, translations in any number of languages, and relationships between
terms (A is related to B, or X influences Y, etc.). This way the machine
can understand the languages and the nuances we use in languages, and
help retrieve the most relevant and to-the-point information for a user’s
query. We will dive deeper into the potential of controlled vocabularies
by highlighting the case of AGROVOC, the agriculture thesaurus.
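A minimal sketch of such a vocabulary is given below. The concept IDs, labels and relations are invented for illustration and are not taken from any published thesaurus. Each concept has one ID, a preferred term, synonyms and related concepts, so that whichever label a user types resolves to the same ID, and a query can be widened to related concepts.

```python
from typing import Dict, List, Optional

# Hypothetical concepts; nothing here is copied from a real vocabulary.
CONCEPTS: Dict[str, dict] = {
    "https://example.org/concept/1": {
        "pref": "land tenure",
        "alt": ["tenure system"],
        "related": ["https://example.org/concept/2"],
    },
    "https://example.org/concept/2": {
        "pref": "land rights",
        "alt": ["rights to land"],
        "related": ["https://example.org/concept/1"],
    },
}


def build_index(concepts: Dict[str, dict]) -> Dict[str, str]:
    """Map every label (preferred or synonym), case-folded, to its concept ID."""
    index: Dict[str, str] = {}
    for concept_id, concept in concepts.items():
        for label in [concept["pref"], *concept["alt"]]:
            index[label.casefold()] = concept_id
    return index


LABEL_INDEX = build_index(CONCEPTS)


def resolve(term: str) -> Optional[str]:
    """Return the concept ID for a label, or None for an uncontrolled term."""
    return LABEL_INDEX.get(term.strip().casefold())


def expand(term: str) -> List[str]:
    """Return the matching concept followed by its related concepts."""
    concept_id = resolve(term)
    if concept_id is None:
        return []
    return [concept_id, *CONCEPTS[concept_id]["related"]]


print(resolve("Tenure System"))
print(expand("land rights"))
```

A free-text tag that is not in the vocabulary resolves to nothing, which is exactly the situation a controlled list is meant to avoid.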
3. THE CASE OF AGROVOC
AGROVOC is a controlled vocabulary established and facilitated by the
Food and Agriculture Organization (FAO) of the United Nations. It covers
“all areas of interest to the FAO, including food, nutrition,
agriculture, fisheries, forestry, environment etc.”.[3]
The AGROVOC thesaurus was first published (in English, Spanish and
French) in the early 1980s. In 2000, AGROVOC went digital. It has
evolved and grown over the years, with a vibrant and international
community of editors behind it, contributing new concepts and new
translations every month. Today, AGROVOC consists of over 36,000
concepts and over 750,000 terms (synonyms or translations to those
concepts, etc.) related to agriculture and is translated to over 35
languages.
AGROVOC is widely used in specialized libraries as well as digital
libraries and repositories to index content and for the purpose of text
mining. It is also used as a specialized tagging resource for content
organization by FAO and third-party stakeholders. FAO statistics show
that the vocabulary is used by 1.8 million users every month to classify
agriculture data and bibliographic resources. AGROVOC has thus greatly increased
the visibility and discoverability of agriculture data and information.
A controlled vocabulary such as AGROVOC has helped no less than 10
million users a year in overcoming the language barriers we just
described. Through AGROVOC’s technical infrastructure, computers can
read concepts beyond 0s and 1s and understand how ‘maize’ as a concept
is the same as ‘Maïs’ in French or ‘ذرة صفراء’ in Arabic. Translations,
synonyms and relationships of this one concept are linked to one
unique code, a ‘Uniform Resource Identifier’ (URI), that computers,
including search engines, can read and understand.
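The effect on retrieval can be sketched with the maize example. The three labels come from the text above; the URI is a placeholder rather than the actual AGROVOC identifier, and the small catalogue of documents is invented. Because every label points to the same URI, a search in one language also finds resources tagged in another.

```python
import unicodedata
from typing import Dict, List

# Placeholder URI (assumption); not the real AGROVOC identifier for maize.
MAIZE = "https://example.org/concept/maize"

# Labels per language for the one concept, as given in the text.
LABELS: Dict[str, Dict[str, str]] = {
    MAIZE: {"en": "maize", "fr": "Maïs", "ar": "ذرة صفراء"},
}


def norm(term: str) -> str:
    """Normalize a label so that case and Unicode form do not matter."""
    return unicodedata.normalize("NFC", term.strip()).casefold()


TERM_TO_URI = {
    norm(label): uri
    for uri, by_language in LABELS.items()
    for label in by_language.values()
}

# Hypothetical catalogue; each record was tagged in its own language.
DOCUMENTS = [
    {"title": "Report A", "tag": "maize"},
    {"title": "Rapport B", "tag": "maïs"},
    {"title": "Report C", "tag": "ذرة صفراء"},
    {"title": "Report D", "tag": "wheat"},
]


def search(term: str) -> List[str]:
    """Return titles of all documents tagged with the same concept as `term`."""
    uri = TERM_TO_URI.get(norm(term))
    if uri is None:
        return []  # the term is not in the vocabulary
    return [d["title"] for d in DOCUMENTS if TERM_TO_URI.get(norm(d["tag"])) == uri]


print(search("Maïs"))
```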
4. WHERE IS THE LAND SECTOR?
With a tool as powerful as AGROVOC, and a user base as large,
one quickly starts thinking: what about land? If AGROVOC
covers all areas of interest to the FAO, surely land governance
must be one of the topics it covers. Yet when the Land Portal Foundation
first discovered AGROVOC and engaged with the team, only 20 concepts
related to land governance were included in the AGROVOC vocabulary.
4.1 Gap exploration research on the use of controlled vocabularies in the land sector
As a part of the GODAN Action consortium, in 2016 the Land Portal
Foundation did a scoping study of land information providers online and
the way they classified their information. Or in very simple words: what
kind of tags do they use? The main conclusion about the use of standard
vocabularies within the land governance community is that there is no
structured or uniform approach to using them to publish information. We
saw a range of sophistication in the way organizations classify the
materials they publish, from no classification at all to a
standard set of keywords.
Roughly, five types of classification were identified. The first is
no classification at all for content, or merely categorizing content by
resource type (see for example the Asian Farmers Association’s website).
Secondly, many organizations use a ‘free tagging’ system, allowing users
to create new tags as they add new resources (see for example the AgEcon
website, maintained at the University of Minnesota by the Department of
Applied Economics and University Libraries, and the Agricultural and
Applied Economics Association), leading to an unstructured list of
thousands of overlapping keywords. The third situation is where
organizations have a standard set of keywords that can be used to
classify content, but there is no real structure to these keyword lists.
For example, organizations do not differentiate between resource type,
geographical keywords or topical keywords within these lists (see for
example the Asian NGO Coalition or the Focus on Land in Africa (FOLA)
website, a joint initiative of the World Resources Institute (WRI) and
Landesa). Fourthly, some organizations do have a standard set of keywords
or topics, but that standard is only applicable to their own organization
and not meant to be re-used or accepted by other organizations. See for
example the International Land Coalition website, which structures its
publications under its own strategic commitments. Not even its partners,
who as members of the Coalition have committed themselves to the same
goals, have adopted these on their own websites.
Finally, there have been attempts to standardize a set of topical
keywords (a glossary) within the land sector and to gain acceptance
of these initiatives across the entire sector, such as Focus on Land
in Africa (FOLA) and, more recently, the Global Land Indicators
Initiative (GLII). However, these glossaries are stand-alone lists in
HTML or PDF format, and are not used or applied in any way. Focus on
Land in Africa (FOLA), as mentioned above, does not
use its own glossary to classify its content; it is meant merely to
guide users through the documents they can read on the website and to
create an understanding of the meaning of the different keywords.
The Global Land Indicators Initiative has created a glossary of key
land-related terms through a collaborative process involving several
prominent organizations working on land. However, this list has not been
published yet, nor are there any concrete plans to use this glossary
other than as a reference for generally accepted key
concepts and definitions for land governance issues.
The conclusion from the different classifications identified during
the scoping research is that there is
very limited awareness of standards to classify data within the land
sector. Some organizations do not use topical keywords at all, and those
that do have not designed these lists to be seen or used by other
organizations. There is therefore a clear gap both in the use of
standards in the land sector and in the existence of standards for the
land sector specifically.
5. INTRODUCING LANDVOC - THE LINKED LAND
GOVERNANCE THESAURUS
The Land Portal Foundation has responded to this gap, not by creating
yet another new standard, but by taking a widely accepted and used
standard such as AGROVOC and enriching the concepts related to land
within this vocabulary. Building on existing land glossaries, such as
the FAO’s Land Tenure Thesaurus (developed as a reference point for FAO
staff), the Land Administration Domain Model and the Global Land
Indicators Initiative, new concepts were added and translated into several
languages. This particular set of land governance-related concepts in
AGROVOC is now called “LandVoc - the linked land governance thesaurus”.
LandVoc can be an extremely powerful tool in making data and information
more discoverable. It can connect knowledge and experiences from across
the world, bridging both language and culture barriers. LandVoc is
intended to be an unbranded linking tool between the different
classification and tagging systems information providers in the land
sector use.
5.1 Challenges
There is no doubt that the land community experiences the same struggles
with language differences as the agriculture community -- however, arguably,
these are much more nuanced and complex. With a topic such as land,
classifications are controversial and immediately become political.
Furthermore, in a sector where multiple tenure systems coexist within
one country (all with their own associated terminologies) and that
harbors immense power imbalances between global and local, and between
government, private sector and local communities, uttering the word
‘standardizing’ is often considered either naive or some sort of utopia
we will never reach. In such discussions, we hear that land experts feel
that acknowledging the differences in the way we choose to name or
describe the issues we face, however evident or subtle these differences
may be, has to be more important than increasing discoverability of
information.
Enriching the land concepts in AGROVOC to try and capture the nuances of
land governance in the LandVoc vocabulary is not merely a technical
matter, people tend to argue, but something more fundamental: it is
scientific, psychological and political in nature. We could not agree
more. As a team whose everyday business involves managing an information
technology platform, we cannot help but see the technological benefits
of such a tool. But we also see that in global thesauri, English remains
the dominant language and the starting point that other languages build
on, rather than entering from their own perspective. We see that, when
it comes to definitions or preferred terms to use, Western perspectives
and interpretations of concepts are much more dominant than those of
stakeholders in the global South.
In facilitating a standard vocabulary for land, our intention is not to
counteract such differences or ‘impose’ a standard for a particular
concept -- but rather, to build a tool that embraces and highlights our
differences, thus providing a basis to gain a deeper understanding of
the issues we deal with and how they vary from stakeholder to
stakeholder and context to context. We are aware of the fact that we
will never be able to capture all languages, nuances and differences,
but, in our opinion, this isn’t a reason not to begin trying! We would
argue it is actually quite important to realize and acknowledge that
when a researcher who holds a PhD on a certain topic uses a
certain term, it means something different than when a practitioner
working at an intergovernmental organization uses the same term. Currently,
there is no way for a layman to realize this, other than by speaking to
such stakeholders individually.
We have a choice: we can carry on conversations with those select few
that understand and acknowledge our particular conceptualization of land
governance and limit the outreach and impact of our work, or we can
choose to be more inclusive and decide to embrace and convey these
important differences to a wider public. If tools such as the Google
search engine are already used by millions of people, LandVoc can help
to ensure that others can also begin to gain an understanding of the
rich complexity and controversy of a topic such as land governance.
5.2 Opportunities
The Land Portal Foundation is not only active in the land sector,
promoting standardization and working constructively on making land data and
information more discoverable, however daunting that task may be. The
Land Portal is also a major advocate within the open data community for
not duplicating efforts or standards, while still making universal standards
useful for smaller expert communities.
Of course AGROVOC largely overlaps with possible land concepts, but
using solely the agriculture standard will not be relevant enough to
meet the land sector’s needs, because it also contains thousands of
concepts that are not relevant to land. Recognizing that the overlap
between the two standards would be significant and not wishing to
duplicate efforts, the Land Portal and FAO explored options on how the
AGROVOC thesaurus could be made useful to specific expert communities.
The solution brought forward, and currently implemented, is that of the
multi-hierarchy scheme. Land concepts are in AGROVOC, within the
AGROVOC hierarchy, but there is also a separate scheme within
AGROVOC that only contains concepts related to land governance:
“LandVoc”. This LandVoc scheme can have its own hierarchy, independent
from AGROVOC. This solution avoids duplication of effort while
still making the thesaurus relevant to specific expert communities.
AGROVOC is now exploring these options for other expert communities as
well, such as fisheries and soil.
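The multi-hierarchy idea can be sketched as a data structure in which a concept records, per scheme, which broader concept it sits under. The concepts and hierarchies below are invented for illustration and do not reproduce the actual AGROVOC or LandVoc structure.

```python
from typing import Dict, List, Optional

# For each concept: the schemes it belongs to and its parent in each scheme.
# A parent of None means the concept is at the top of that scheme.
# All entries are hypothetical.
CONCEPTS: Dict[str, Dict[str, Optional[str]]] = {
    "land tenure": {"AGROVOC": "tenure", "LandVoc": "land governance"},
    "land governance": {"AGROVOC": "governance", "LandVoc": None},
    "tenure": {"AGROVOC": None},
    "governance": {"AGROVOC": None},
    "maize": {"AGROVOC": "cereals"},
    "cereals": {"AGROVOC": None},
}


def concepts_in(scheme: str) -> List[str]:
    """List the concepts that belong to a scheme, in alphabetical order."""
    return sorted(name for name, parents in CONCEPTS.items() if scheme in parents)


def path_to_top(concept: str, scheme: str) -> List[str]:
    """Walk from a concept up to the top of the given scheme's hierarchy."""
    path: List[str] = []
    current: Optional[str] = concept
    while current is not None:
        path.append(current)
        current = CONCEPTS[current].get(scheme)
    return path


print(concepts_in("LandVoc"))
print(path_to_top("land tenure", "LandVoc"))
print(path_to_top("land tenure", "AGROVOC"))
```

The same concept is stored once but can be reached through two different hierarchies, which is what allows a land-specific view without duplicating the vocabulary.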
With such an infrastructure in place for LandVoc, the Land
Portal Foundation has conducted a year-long consultation with experts
to build the independent hierarchy for LandVoc. This will make it an
even more useful tool for the land sector.
6. CONCLUSION
We have seen how semantic technologies, and particularly the use of
controlled vocabularies, can increase the discoverability of data and
information considerably. AGROVOC has increased the visibility of
agriculture data and information and serves an audience of over 1.8
million users per month. Land Portal’s research has shown that the land
sector is far from reaching such a potential, since no shared standards are
being used to classify land data and information online.
The Land Portal saw this gap and worked with the AGROVOC team at FAO to
increase the 20 land-related concepts in AGROVOC to 300 unique concepts,
excluding the added translations and synonyms. This set of land-related
concepts within AGROVOC is called “LandVoc”. LandVoc could similarly
increase the visibility of land data and information and help the way we
exchange land data across the world. More than that, it can also serve
as a reference document for translations and to capture and understand
the richness and complexity of land governance terms.
REFERENCES
AIMS (2019), “AGROVOC | Agricultural Information Management Standards”.
Berners-Lee, Tim; Fischetti, Mark (1999). Weaving the Web. HarperSanFrancisco. Chapter 12.
World Wide Web Consortium (2004), "RDF/XML Syntax Specification (Revised)".
CONTACTS
Lisette Mey
Land Portal Foundation
Bakboord 35 3823TB
Amersfoort
THE NETHERLANDS
+31657710841
lisette.mey@landportal.org
www.landportal.org
[1] Berners-Lee, Tim; Fischetti, Mark (1999). Weaving the Web. HarperSanFrancisco. Chapter 12.
[2] World Wide Web Consortium (W3C), "RDF/XML Syntax Specification (Revised)", 10 Feb. 2004.
[3] AIMS (2019), “AGROVOC | Agricultural Information Management Standards”.