An Argument for a Semantic Web Based FRBR Union Catalogue
UCLA - Dept. of Information Studies
IS 277: User Centered Design
Prof. Phil Agre
June 14th, 2004
Abstract. IFLA's FRBR is a semantic expression of the relationships between items in the library catalog. The web technologies currently being developed by the W3C could be used to implement these expressions. A new layer would need to be developed on top of the MARC XML layer, to aggregate all of the holdings and descriptive data into a new union catalogue. Thus, the FRBR data could then live in this layer and give the library catalog the new functionality required by FRBR.
Introduction. The IFLA (International Federation of Library Associations) final report on Functional Requirements for Bibliographic Records (FRBR) has changed the way the library world perceives the library catalogue and the interaction of records with one another. FRBR describes relations between catalogue items using the concept of bibliographic families, showing how closely items are related and what the precise relationships between them are. Bibliographic families can be mapped only if all catalogue records contain the FRBR metadata and can be united into a single catalogue of holdings. The current union catalogues rely on MARC records. At this point, the MARC standard is too entrenched to accept these new FRBR specifications without serious renovation and record conversion. A better solution would be to layer the FRBR metadata on top of the existing MARC metadata.
The semantic web is slowly growing in scope and maturity, and its promise lies in this capacity to layer. This ability could be harnessed by the library community to make the relationships described in FRBR explicit. As a step in the right direction, the XML expression of MARC records has already been embraced. This could serve as an extensible layer on which it would be possible to add the more semantic FRBR layer using RDF or one of the other truly semantic XML-based languages. Using harvesting tools to extract holdings information from the MARC XML records, this FRBR layer could then form a union catalogue that contains all of the FRBR relationships down to the holdings information for each item in the catalogue.
The Semantic Web. At the moment, the internet is ruled by HTML (HyperText Markup Language), a presentation standard that allows text to be "marked up" with instructions for display within a browser window. HTML is platform independent, but some browsers are more liberal with HTML tags than others. For example, a page that has been designed for Internet Explorer may not look the same when displayed in Mozilla. HTML allowed the internet revolution to occur because of this independence and the ease with which new pages can be added to the web and hyperlinked to other pages and items.
As the web grows larger, just having a presentation standard is no longer sufficient. The sheer volume of information needs some measure of control. This control can come in the form of attached metadata that sufficiently describes web resources, using author, title, subject, and other information, similar to the way books are described. While HTML includes provision for metadata in the header tags, this is not descriptive enough to satisfy most information communities.
In order to fill this need, the eXtensible Markup Language (XML) has been developed by the W3C (World Wide Web Consortium), which is headed by Tim Berners-Lee, the so-called father of the World Wide Web. XML makes it possible for a community to define its own set of tags that can be used to mark up text, as well as to set up relations between tags using a Schema or DTD (Document Type Definition) to form an ontology that is defined by the community's domain. Thus XML is a context standard, instead of a presentation standard, although it can still control the presentation of text through the use of style-sheets. The community-designed ontology becomes a content standard, as it helps to define what items should be described by each tag. If this is compared to the world of cataloguing, the ontology serves a function similar to that of the Anglo-American Cataloguing Rules 2 (AACR2), and the tags as defined by XML would be similar to the MAchine-Readable Cataloging (MARC) fields.
Tim Berners-Lee's ultimate vision for XML is the Semantic Web. By placing a logic layer on top of this new metadata layer, it is possible for the information on the web to have logical or semantic relationships. With logical relationships enumerated, it would be possible to build logic engines as well as search engines that could find precisely what the user was looking for, thus making the Web accessible once more (Berners-Lee). In this vein, the Resource Description Framework (RDF) was designed to set up this logical framework for semantic relationships. Using the RDF model, it is possible to make assertions using "triples." A triple consists of two nodes, one subject and one object, which are connected by a predicate relationship. In this way it is possible to start forming relationships between concrete items in the world that can be located using URIs (Uniform Resource Identifiers). As these concrete relationships grow in number, more abstract relationships begin to emerge (Fitch).
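To make the triple model concrete, the following is a minimal sketch in Python that stores subject-predicate-object assertions as plain tuples and answers two simple queries. It is only an illustration of the idea, not an RDF implementation: the `example.org` URIs, the property names, and the helper functions `objects` and `subjects` are all assumptions introduced here.

```python
# Minimal sketch of the RDF triple model using plain Python tuples.
# All URIs below are hypothetical placeholders, not real identifiers.
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str    # the node being described
    predicate: str  # the relationship
    obj: str        # the node (or literal value) it is related to


EX = "http://example.org/"  # assumed namespace for this illustration

triples = [
    Triple(EX + "book/hamlet", EX + "terms/author", EX + "person/shakespeare"),
    Triple(EX + "book/hamlet", EX + "terms/title", "Hamlet"),
    Triple(EX + "book/macbeth", EX + "terms/author", EX + "person/shakespeare"),
]


def objects(store: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object asserted for a given subject and predicate."""
    return [t.obj for t in store if t.subject == subject and t.predicate == predicate]


def subjects(store: List[Triple], predicate: str, obj: str) -> List[str]:
    """Return every subject that has the given predicate pointing at obj."""
    return [t.subject for t in store if t.predicate == predicate and t.obj == obj]


# Concrete assertions accumulate into a more abstract grouping:
# everything authored by the same person.
by_shakespeare = subjects(triples, EX + "terms/author", EX + "person/shakespeare")
```

Here `by_shakespeare` collects both books, which is a small instance of abstract relationships emerging from concrete ones.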
In order to form richer and stricter relationships, other tools that work with RDF and XML are being developed. DAML+OIL is an extension of RDF that is used to define ontologies. Ontologies defined with the DAML+OIL namespace are able to use more predefined triples, which are called "primitives." For example, setting up a catalogue entry with an author in RDF would not imply that the author has a birth date, but in DAML+OIL, if the author is defined as a person, then the relationship between an author and their birth date is already made (Fitch). This allows for richer relationship construction with less syntax being defined in the ontology.
Another way to define richer relationships is the use of Topic Maps. Topic Maps are also defined in triples; in this case the triples consist of topics, associations, and scopes. Topics themselves are defined by three characteristics: names, occurrences, and roles played in associations. These associations are always reified, that is, they stand in for real-world associations, and can be used in other triples (Coverpages). Hence, Topic Maps tend to address abstract relationships and work down to real-world instances, which is the opposite of the RDF model (Fitch).
The ability to set up community-defined ontologies that are able to create semantic meaning for web information is a vision, and one that is slowly coming into focus. In order to make the vision reality, semantic tools are being developed to assist in the creation and maintenance of XML documents. There are validators to essentially debug XML, and there are over-arching ontologies that serve as compilers in different namespaces. XML search tools are slowly coming to fruition, as with Swoogle. Logic engines such as ALE and PALE have been developed for other programming languages, but not necessarily for these XML semantic description schemes.
MARC XML. For over 30 years, the MARC standard has been used by the library community to hold bibliographic and authority record information. Cataloguing itself is a highly codified practice, with rules to guide the cataloguer through any decision in the process, such as AACR2 and LCSH (Library of Congress Subject Headings). The fact that the process was already so codified made it ripe for automation, at least of the record creation process. Thus, the output of the cataloguing process is a MARC record in one of its formats, depending on the type of record being created.
As a standard, MARC has slowly evolved to accommodate new fields for the description of not only books, but electronic files, music, movies, and other formats (McCallum). The highly uniform use of the standard has been key to exchanging record data between institutions and to the creation of union catalogues, as well as tools for copy-cataloguing. The standard has been implemented worldwide, with only minor differences, if any, between each country's implementation.
When XML technology began to emerge, task forces were created to determine whether this was a direction in which MARC should move. The extensibility of XML was an attractive feature, as was the ability to create scripts that could automate the conversion into a new format. Conversion in both directions is lossless, so the integrity of the MARC record format would not be compromised. This conversion would also extend to other XML standards, such as Dublin Core (DC). Additionally, this new web-based version of MARC would be open to web harvesters, such as those of the Open Archives Initiative (OAI), which could increase visibility for these deep web objects. And so it came to be that MARC 21 was translated into an XML Schema named MARC XML, which is maintained by the Library of Congress (LoC).
The MARC XML Schema essentially marks up the traditional MARC data fields, in order to compose a database of MARC authority and bibliographic records. Unfortunately, more advanced semantic capabilities are unavailable in this layer of the semantic web. It is therefore not possible to set up abstract relationships or to logically deduce anything from the MARC XML records without another layer of RDF (or DAML+OIL or Topic Maps) pulling information from the MARC XML databases.
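As an illustration of how a higher layer might pull information out of MARC XML, the sketch below reads one record with Python's standard `xml.etree.ElementTree` module. The record content is invented sample data, the ISBN is a placeholder, and the helper name `subfield` is an assumption made for this example; the choice of field 852 for the holding location is likewise an assumption about how a given library records holdings.

```python
# Hedged sketch: extracting a few fields from a MARC XML record.
# The sample record is invented for illustration only.
import xml.etree.ElementTree as ET

# Namespace used by the MARC 21 XML schema.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="020" ind1=" " ind2=" ">
    <subfield code="a">0000000000</subfield>
  </datafield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Shakespeare, William</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Hamlet</subfield>
  </datafield>
  <datafield tag="852" ind1=" " ind2=" ">
    <subfield code="a">Example Library</subfield>
  </datafield>
</record>"""


def subfield(record, tag, code):
    """Return the text of the first matching subfield, or None if absent."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    node = record.find(path, MARC_NS)
    return node.text if node is not None else None


record = ET.fromstring(SAMPLE)
summary = {
    "isbn": subfield(record, "020", "a"),      # placeholder ISBN
    "author": subfield(record, "100", "a"),    # main entry, personal name
    "title": subfield(record, "245", "a"),     # title statement
    "location": subfield(record, "852", "a"),  # assumed holdings location
}
```

The resulting `summary` dictionary holds flat field values only; any relationship between this record and others would still have to be asserted in a separate semantic layer.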
The process of adding new fields to the standard can take years and involves the appointment of task forces to review the possible additions before making a decision. While this slow rate of change has made the standard very stable, it has also made it very resistant to change, especially change of a radical nature. The movement to MARC XML is the largest change to occur in the world of MARC records for some time, and it should be noted that the move to XML has not affected the underlying structure of the MARC record. Unfortunately, if FRBR were implemented as a part of the record, the underlying structure would need to change to accommodate the new functionality. As such, the FRBR specifications could not simply be added to the existing MARC formats or to MARC XML.
Union Catalogues. In order to understand how union catalogues function, their reasons for existing must first be examined. There is a certain librarian mode that dictates the actions of librarians and the principles of library science. "[T]he librarian way of organizing communication is very much oriented towards aggregation of information" (Gradmann). Librarians are concerned with providing access to information; as such, there is much emphasis on collecting information for library users and imposing a structure that helps users find what they are looking for. The ultimate expression of this need to aggregate information is the creation of union catalogues.
Library union catalogues, such as OCLC or RLIN, bring together holdings from many different libraries to create a list of all items available in libraries, as well as their locations. This is accomplished by having a copy of the local catalogue exist both in each library and at the union catalogue server. The centralized nature of the union catalogue means that even if a library's local server is down, the information can still be found at the union catalogue. Centralization also allows for quicker searching, because the search does not need to be broadcast to a distributed database network (Coyle). This type of catalogue sharing allows copy-cataloguing and inter-library loan or reciprocal borrowing systems to exist, because all of the holdings information is available. Alas, these are subscription systems, which can sometimes be out of the price range of smaller libraries.
Union catalogues would not have been able to exist without MARC and a transmission standard that would facilitate the movement of MARC records. When the ARPANET was implemented, data transmission protocols and standards were devised to talk over the network and share data. It was at this point that the library community started work on what would eventually become the "Information Retrieval (Z39.50); Application Service Definition and Protocol Specification, ANSI/NISO Z39.50-1995" (Z39.50) standard. As can be seen in the full title, Z39.50 is recognized by both the National Information Standards Organization (NISO) and the American National Standards Institute (ANSI).
As it was developed, the Z39.50 standard "is a protocol which specifies data structures and interchange rules that allow a client machine (called an "origin" in the standard) to search databases on a server machine (called a "target" in the standard) and retrieve records that are identified as a result of such a search" (Lynch). Because this standard did not try to determine the specifics of how the data was stored and only focussed on how information was exchanged as a part of the query process, Z39.50 existed as a standard layer between different types of databases with different specifications.
More recently, a number of libraries have attempted to set up distributed databases using the Z39.50 standard. The entire state of Iowa set up a virtual union catalogue in 1997, in order to avoid the overwhelming cost of subscribing to OCLC or RLIN. The participating libraries each had different OPAC vendors, but searching still worked because of the standard's ability to work with different database structures. They found that the standard was able to satisfy their users and maintain all of their normal services (Stark). The UC system also implemented a distributed union catalogue that utilized Z39.50, in order to compare the performance of a distributed and a centralized union catalogue. In the case of the UC libraries, the centralized catalogue was faster and more reliable, because of its built-in redundancy (Coyle).
Z39.50 is similar to XML in a number of respects. As was mentioned above, both can form distributed databases. Z39.50 also has semantic capabilities, or would if there were not a number of problems in implementation. There is no community consensus on the structure or attributes of information content classes, and without this agreement on an ontology there is no interoperability. There is also some belief that semantic capabilities were out of the standard's scope (Lynch). This is all the more reason to move toward the semantic web to fully realize IFLA's FRBR.
FRBR. The objective of cataloguing is to make resources available to the user. This objective can be manifested in a number of different ways, but the end result is the same. If searching by title, author, or subject were all that were needed to make resources available, the cataloguer's job would be simple. Unfortunately, there are many other means of access to catalogue records, and not all of these avenues have yet been explored. The composite of all of the MARC record fields tries to capture all of the information about an item, but the "network of potential related editions and translations of works" (Leazer & Smiraglia) is not made explicit by this framework.
Bibliographic families are this network of related editions and translations, as well as revisions, abridged or illustrated editions, parodies, etc. Bringing together items of the same family, say everything pertaining to Shakespeare's Hamlet, allows the user to look for any one of a number of printings of the main work, as well as annotated versions, or analyses and critiques of the work. This method of collocating materials had not been explored in library catalogues until IFLA's final report on the Functional Requirements for Bibliographic Records outlined it as one of the functional requirements. Since then, researchers in the information retrieval arena have been trying to implement the following scheme, affectionately known as FRBR (fûr•bûr).
FRBR is broken into three groups of entity relationships. The first group consists of work, expression, manifestation, and item. These function as the four levels of detail in actually showing relationships, with the work being the overall bibliographic family and the item being a specific holding. The second group consists of those responsible for the work, expression, manifestation, or item. These can be either a person or a corporate body, and they must have a role that defines their responsibility. The third entity group can include the entities from the previous two groups, as well as concept, object, event, and place. As such, this third group is what the work is about (IFLA). These three groups of entities reflect the traditional descriptive elements that are used to catalogue a work. The group I entities are analogous to the title, group II entities are analogous to the statement of responsibility, and group III is the subject. The resemblance to traditional descriptive cataloguing ends there.
The group I entities form a hierarchical relationship. Starting at the top, a bibliographic family or "work" is "a distinct intellectual or artistic creation." This work normally lends a title to the bibliographic family that follows. The expression is "the intellectual or artistic realization of a work." An expression could be the original text of a work; a new expression could be a translation or revision. For each expression a relationship is defined with the work above it, and those responsible for the expression are made explicit. The manifestation is "the physical embodiment of an expression." In other words, the manifestation is the actual print run of a book, or edition, and as such carries the publisher and other edition information. Finally, the item is "a single exemplar of a manifestation" (O'Neill). The item is a single copy of a book, and is the level at which holdings are recorded.
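The four-level hierarchy described above can be sketched as nested data structures. This is one possible modelling under stated assumptions, not a prescribed FRBR encoding: the class and attribute names, the `holdings` helper, and the sample publishers, translator, and libraries are all hypothetical.

```python
# Hedged sketch of the FRBR group I hierarchy as nested dataclasses.
# Names and sample data are hypothetical, chosen only for illustration.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    location: str  # where this single exemplar is held


@dataclass
class Manifestation:
    publisher: str  # edition-level information lives here
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    form: str         # e.g. original text, translation, revision
    responsible: str  # who is responsible for this realization
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    title: str
    creator: str
    expressions: List[Expression] = field(default_factory=list)


def holdings(work: Work) -> List[str]:
    """Walk work -> expression -> manifestation -> item and list locations."""
    return [
        item.location
        for expression in work.expressions
        for manifestation in expression.manifestations
        for item in manifestation.items
    ]


hamlet = Work("Hamlet", "Shakespeare, William", [
    Expression("English text", "Shakespeare, William", [
        Manifestation("Publisher A", [Item("Library 1"), Item("Library 2")]),
    ]),
    Expression("German translation", "Translator B", [
        Manifestation("Publisher C", [Item("Library 3")]),
    ]),
])
```

Note that the title and creator are stored once, on the work, and every expression, manifestation, and item beneath it inherits them implicitly.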
There are many different types of relationships that can be expressed using the FRBR model. There are work-to-work relationships, such as successor, adaptation, or criticism. There are expression-to-expression relationships between expressions of the same work, such as translation, or between expressions of different works, such as supplement. There is also the expression's relationship to the work. There are manifestation-to-manifestation relationships, such as reproduction or a whole/part relationship. There is the manifestation-to-item relationship, and finally item-to-item relationships, such as reconfiguration (Beacom).
FRBR sets up semantic relationships between different items within a bibliographic family. The use of a hierarchy also reduces the redundancy of information a user may be presented with. For example, if the user is looking at different manifestations of Hamlet, the fact that all of the items are titled Hamlet, and are by Shakespeare, is implicit. This semantic relationship between members of the same family can also be exploited graphically to generate maps of bibliographic families. These families can then be studied using visual pattern recognition.
A FRBR Union Catalogue. Even from this quick description of the FRBR hierarchy and relationships, it is possible to start forming triples of two nodes joined by an association. Because of the inherent semantic structure of FRBR, RDF, DAML+OIL, or Topic Maps could all be used to set up a FRBR ontology with different classes that describe the possible relationships between items, as well as the hierarchy between the work, expression, manifestation, and item.
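A rough sketch of what such an ontology constrains is given below: each relationship is declared with the entity level it may start from and the level it may point to, and a candidate triple is checked against that declaration. The property names (`realizationOf`, `translationOf`, and so on), the identifiers, and the `check` function are assumptions made for illustration and are not taken from any published FRBR schema.

```python
# Hedged sketch: a tiny FRBR-style ontology as a table of allowed
# relationships, with a check that a triple respects it.
# All property names and identifiers are hypothetical.

# predicate -> (level of the subject, level of the object)
FRBR_PROPERTIES = {
    "realizationOf": ("Expression", "Work"),
    "embodimentOf": ("Manifestation", "Expression"),
    "exemplarOf": ("Item", "Manifestation"),
    "adaptationOf": ("Work", "Work"),
    "translationOf": ("Expression", "Expression"),
    "reproductionOf": ("Manifestation", "Manifestation"),
}


def check(triple, entity_types):
    """Return True if the triple uses a known predicate between the right levels."""
    subject, predicate, obj = triple
    if predicate not in FRBR_PROPERTIES:
        return False
    subject_level, object_level = FRBR_PROPERTIES[predicate]
    return (
        entity_types.get(subject) == subject_level
        and entity_types.get(obj) == object_level
    )


# Which FRBR level each (hypothetical) identifier belongs to.
entity_types = {
    "hamlet": "Work",
    "hamlet-english": "Expression",
    "hamlet-german": "Expression",
    "hamlet-edition-a": "Manifestation",
}

valid = check(("hamlet-german", "translationOf", "hamlet-english"), entity_types)
invalid = check(("hamlet-edition-a", "realizationOf", "hamlet"), entity_types)
```

In this sketch a manifestation cannot be attached directly to a work, which mirrors the requirement that the hierarchy be traversed one level at a time.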
Once all of the MARC records have been converted to MARC XML, holdings information would be easy to harvest. Unique identifiers for each item could then be created using a combination of the ISBN and the location. Given an RDF Schema for FRBR, and people willing to discover and document the possible relationships within and between bibliographic families, it would be possible to automatically combine the holdings information into a FRBR union catalogue. A union catalogue is necessary because only a large amount of data can produce a rich network of relationships.
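The identifier step can be sketched as follows. This is a minimal illustration, assuming harvested holdings arrive as (ISBN, location) pairs: the `urn:example:item:` prefix, the normalization rules, and the function names `normalize_isbn`, `item_id`, and `build_union` are all hypothetical, and the ISBNs are placeholders.

```python
# Hedged sketch: building item identifiers from ISBN plus location and
# grouping harvested holdings into a simple union listing.
# The identifier scheme and function names are assumptions.
from collections import defaultdict


def normalize_isbn(isbn):
    """Strip hyphens and spaces so the same ISBN always compares equal."""
    return isbn.replace("-", "").replace(" ", "")


def item_id(isbn, location):
    """Combine ISBN and holding location into one item-level identifier."""
    slug = "-".join(location.lower().split())
    return f"urn:example:item:{normalize_isbn(isbn)}:{slug}"


def build_union(holdings):
    """Group item identifiers from many libraries under their shared ISBN."""
    union = defaultdict(list)
    for isbn, location in holdings:
        union[normalize_isbn(isbn)].append(item_id(isbn, location))
    return dict(union)


# Placeholder holdings as they might be harvested from two libraries.
harvested = [
    ("0-000-00000-0", "Example Library"),
    ("0000000000", "Other Library"),
]
union = build_union(harvested)
```

Grouping by ISBN only brings together copies of the same manifestation; linking manifestations to expressions and works would still depend on the documented FRBR relationships.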
This is not the first paper to proclaim that at least RDF would be a possible means of implementing FRBR (Fitch, Powell). In fact there have already been some attempts at implementing FRBR as a part of a library catalog system. VTLS, an Integrated Library System (ILS) vendor, has examples of FRBR records. VTLS provides an open source ILS called Virtua, that is based in XML, their value added being the installation and support staff. VTLS has also created an RDF implementation of FRBR that is able to interact with the XML based catalog.
Other projects that are implementing FRBR include AusLIT: Australian Literature Gateway, which is using a combination of RDF, DAML+OIL, and Topic Maps to create new methods of resource description (Fitch), and VisualCat, a Danish cataloguing client that uses XML and RDF to manage traditional cataloguing structures as well as FRBR (Beacom).
One of the world's largest union catalogues, OCLC, has been performing research on "FRBR-izing" their holdings. They have managed to create a downloadable FRBR Work-Set Algorithm to convert MARC 21 record databases into FRBR catalogues. They have also developed a FRBR tool which can be used on their fiction collection, called FictionFinder.
Conclusion. A FRBR union catalogue would not be a replacement for the existing MARC based catalogues, union or otherwise. The FRBR model acts as an enhancement to the existing means of access to resources within the catalogue. FRBR creates new meaning from the same information by highlighting implicit semantic relationships. It would only be possible to create such a union catalogue with a set of semantic tools, and as such the semantic web presents the perfect opportunity for FRBR to finally be implemented. RDF could serve as a semantic layer that is expressive enough to make the FRBR relationships explicit and that can also be layered over MARC XML, which is a rich data resource.
Works Cited.