Matching and merging

From swissbib
Revision as of 08:43, 24 June 2014 by Tobias

Bibliographic data and authorities

Swissbib contains the bibliographic data of RERO, the IDS libraries, the Swiss National Library, and the IDS partner libraries. Because of the organizational structure of the Swiss library system, the data is stored in eleven different databases. The number of duplicates in this data is high because the libraries have similar collections. The data is cataloged according to three sets of rules (AACR2, Recaro and KIDS) and is structured in two formats, MARC21 and IDSMARC. A fourth factor is multilingualism, which mostly affects authorities and subject headings.

Whereas the formats can be mapped rather easily, the different cataloging rules and, to a greater extent, the different cataloging practices pose problems.

Study of the deduplication of bibliographic data and authorities

To get a picture of the difficulties of eliminating duplicates in catalog data, the swissbib project commissioned Pierre Gavin and Jean-Bertrand Gonin in 2007 to study the feasibility of deduplicating bibliographic and name-authority data. Their findings are summarized on the page deduplication study.

swissbib matching & merging

swissbib record matching is completely index-based. To increase the chance that corresponding elements match, the data is normalized and transformed.
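The idea of normalizing values before indexing them can be illustrated with a minimal Python sketch. The concrete normalization steps shown here (removing diacritics, case folding, dropping punctuation, collapsing whitespace) and the names `normalize` and `build_index` are assumptions made for illustration; this page does not specify which transformations swissbib actually applies.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(value: str) -> str:
    """Reduce a field value to a comparable key (illustrative rules only)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-insensitive comparison.
    lowered = stripped.casefold()
    # Replace punctuation with spaces, then collapse runs of whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", lowered)
    return " ".join(cleaned.split())


def build_index(records: dict, field: str) -> dict:
    """Map each normalized value of `field` to the ids of records sharing it."""
    index = defaultdict(list)
    for rec_id, record in records.items():
        value = record.get(field)
        if value:
            index[normalize(value)].append(rec_id)
    return dict(index)


# Hypothetical sample records, not real swissbib data.
records = {
    "a": {"title": "Der Prozess."},
    "b": {"title": "der  Prozess"},
    "c": {"title": "Das Schloss"},
}
title_index = build_index(records, "title")
```

In this sketch, records "a" and "b" end up under the same index key although their titles differ in case, spacing and punctuation, which is the effect that normalization is meant to achieve.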

match elements

The following elements of a record are used to generate match indexes:

Element                      Weight
material type                1
year                         2.5
decade                       2.5
century                      2.5
title (245$a,$b,$n,$p)       3
edition                      2
(corporate) authors          1.5
impressum: editor name       2.5
extent                       2.5
volume number                1
coordinates and scale        2.5

match threshold

The match threshold is 1, so all elements that can be compared must match.
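One way to read the weights and the threshold together is sketched below in Python. The weights and the threshold of 1 are taken from this page; the scoring formula (weight of matching elements divided by weight of comparable elements), the dictionary keys, and the function names are assumptions for illustration, since the page does not state how the score is computed.

```python
# Weights as listed in the table above; the key names are hypothetical.
WEIGHTS = {
    "material_type": 1.0,
    "year": 2.5,
    "decade": 2.5,
    "century": 2.5,
    "title": 3.0,
    "edition": 2.0,
    "authors": 1.5,
    "editor_name": 2.5,
    "extent": 2.5,
    "volume_number": 1.0,
    "coordinates_scale": 2.5,
}
THRESHOLD = 1.0


def match_score(a: dict, b: dict) -> float:
    """Assumed formula: matched weight divided by comparable weight."""
    comparable = 0.0
    matched = 0.0
    for element, weight in WEIGHTS.items():
        value_a, value_b = a.get(element), b.get(element)
        # An element is comparable only if both records carry it.
        if value_a is None or value_b is None:
            continue
        comparable += weight
        if value_a == value_b:
            matched += weight
    return matched / comparable if comparable else 0.0


def is_match(a: dict, b: dict) -> bool:
    """With a threshold of 1, every comparable element has to match."""
    return match_score(a, b) >= THRESHOLD


# Hypothetical, already normalized sample records.
rec_a = {"title": "der prozess", "year": "1925", "authors": "kafka franz"}
rec_b = dict(rec_a, edition="2")   # extra element, not comparable with rec_a
rec_c = dict(rec_a, year="1935")   # differs in a comparable element
```

Under this reading, `rec_a` and `rec_b` match because the edition is present in only one of them and is therefore ignored, whereas `rec_a` and `rec_c` do not match because the year can be compared and differs.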


Classical merging combines two or more matching records into a new one and is a destructive process. swissbib instead builds clusters of matching records, and a display record is then built from each cluster. This display record is temporary: it is rebuilt or updated each time a record in the cluster is updated or deleted.

However, some rules prevent merging. Documents are not merged if:

  • a library has marked the document with a "nomerge" code
  • two or more matching documents come from the same source
  • the document is still in process (and is therefore quite likely to be merged the wrong way)
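The clustering approach and the three rules above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the swissbib implementation: the `Record` fields, the single `match_key` standing in for a successful match, the greedy assignment strategy, and the source labels other than RERO are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    rec_id: str
    source: str               # originating database
    match_key: str            # stand-in for "these records match"
    nomerge: bool = False     # library has set a "nomerge" code
    in_process: bool = False  # cataloging is not finished yet


def may_merge(record: Record) -> bool:
    """Rules 1 and 3: flagged or unfinished records stay on their own."""
    return not (record.nomerge or record.in_process)


def build_clusters(records: list) -> list:
    """Greedily group matching records without destroying the originals."""
    clusters = []
    for record in records:
        target = None
        if may_merge(record):
            for cluster in clusters:
                head = cluster[0]
                same_key = head.match_key == record.match_key
                # Rule 2: never put two records of one source in a cluster.
                new_source = all(m.source != record.source for m in cluster)
                if may_merge(head) and same_key and new_source:
                    target = cluster
                    break
        if target is None:
            clusters.append([record])
        else:
            target.append(record)
    return clusters


# Hypothetical sample data.
sample = [
    Record("1", "RERO", "k1"),
    Record("2", "IDS", "k1"),
    Record("3", "RERO", "k1"),                # same source as record 1
    Record("4", "SNL", "k1", nomerge=True),
    Record("5", "SNL", "k2", in_process=True),
]
clusters = build_clusters(sample)
cluster_ids = [[r.rec_id for r in c] for c in clusters]
```

Because the original records are kept unchanged inside each cluster, a display record derived from a cluster can be rebuilt whenever one of its members is updated or deleted.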