Matching and merging
swissbib contains the bibliographic data of RERO, the IDS libraries, the Swiss National Library and the IDS partner libraries. Because of the organizational structure of the Swiss library system, the data is stored in eleven different databases. The number of duplicates in this data is high because the libraries hold similar collections. The data is catalogued according to four sets of rules: AACR2, Recaro, KIDS (IDS libraries until 2015) and RDA (IDS libraries from 2016). The data is structured in the formats MARC21 and IDSMARC. A fourth factor is multilingualism, which mostly affects authorities and subject headings.
Whereas the formats can be mapped rather easily, the different cataloguing rules and, to a greater extent, the different cataloguing practices pose problems.
In addition, swissbib contains the bibliographic data of the institutional repositories of the Swiss universities and of the collections of e-codices and e-periodica. This data is structured in Dublin Core or MODS.
To get a picture of the difficulties of eliminating duplicates in catalogue data, the swissbib project commissioned Pierre Gavin and Jean-Bertrand Gonin in 2007 to study the feasibility of deduplicating bibliographic and name-authority data. Their findings are summarized on the page deduplication study.
swissbib matching and merging
swissbib record matching is entirely index-based. To increase the chance that elements match, the data is normalized and transformed.
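The kind of normalization meant here can be illustrated with a minimal sketch. The concrete rules swissbib applies are not listed on this page, so the steps below (removing diacritics, lowercasing, dropping punctuation, collapsing whitespace) and the function name `normalize` are assumptions chosen for illustration only.

```python
import re
import unicodedata


def normalize(value: str) -> str:
    """Reduce a field value to a comparable form (illustrative rules only)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold, replace punctuation with spaces, then collapse whitespace.
    folded = stripped.casefold()
    no_punct = re.sub(r"[^\w\s]", " ", folded)
    return " ".join(no_punct.split())


print(normalize("Die Geschichte der Schweiz :  Ein Überblick."))
# die geschichte der schweiz ein uberblick
```

Two records whose titles differ only in punctuation, capitalization or diacritics then produce the same index entry.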
The following elements of a record are used to generate match indexes:
- ID (ISSN, ISBN)
- title (245$a, $b, $n, $p / 246$a)
- edition (250$a)
- (corporate) authors
- imprint: publisher name (260$b / 264$b)
- extent (300$a, +/- 1 page)
- volume number
- coordinates and scale
The match threshold is 1, so all elements that can be compared must match.
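The effect of a threshold of 1 can be sketched as a pairwise comparison. This is not the index-based implementation swissbib uses; it only illustrates the rule. The element names are hypothetical, values are assumed to be normalized already, and "can be compared" is interpreted here as "present in both records", which is an assumption.

```python
# Hypothetical element names; values are assumed to be already normalized.
ELEMENTS = ("identifier", "title", "edition", "author", "publisher",
            "extent", "volume", "coordinates_scale")


def element_matches(name, a, b):
    """Compare one element; extent tolerates a difference of one page."""
    if name == "extent":
        return abs(a - b) <= 1
    return a == b


def records_match(rec_a, rec_b):
    """Return True if every element present in both records matches.

    Assumption: an element "can be compared" only when both records have it.
    """
    comparable = [e for e in ELEMENTS
                  if rec_a.get(e) is not None and rec_b.get(e) is not None]
    if not comparable:
        return False  # nothing to compare, so no evidence of a match
    matched = sum(element_matches(e, rec_a[e], rec_b[e]) for e in comparable)
    # Threshold 1: the share of matching comparable elements must be 100 %.
    return matched / len(comparable) >= 1


a = {"title": "faust", "edition": "2", "extent": 312}
b = {"title": "faust", "extent": 313, "publisher": "reclam"}
print(records_match(a, b))  # True: title is equal, extent differs by one page
```

A single disagreeing element, for example an extent that differs by two pages, is enough to reject the pair.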
Classical merging combines two or more matching records into a new one and is a destructive process. Instead, swissbib builds a cluster of matching records. From each cluster a display record ("master record") is built, based on the "richest" record in the cluster and supplemented with additional information from the other records in the cluster. This record is temporary and is rebuilt or updated each time a record in the cluster is updated or deleted.
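A minimal sketch of building such a display record follows. How swissbib measures "richest" is not specified here, so counting populated fields is an assumption, as are the function name `build_master` and the flat dictionary layout of a record.

```python
def build_master(cluster: list[dict]) -> dict:
    """Build a temporary display record from a cluster of matching records.

    Assumption: the "richest" record is the one with the most populated fields.
    """
    if not cluster:
        raise ValueError("cluster must contain at least one record")
    richest = max(cluster, key=lambda rec: sum(1 for v in rec.values() if v))
    master = dict(richest)  # copy, so the source records stay untouched
    for record in cluster:
        for field, value in record.items():
            # Only fill gaps; values of the richest record take precedence.
            if value and not master.get(field):
                master[field] = value
    return master


cluster = [
    {"title": "Faust", "edition": ""},
    {"title": "Faust", "edition": "2nd", "publisher": "Reclam"},
    {"title": "Faust", "isbn": "123"},
]
print(build_master(cluster))
# {'title': 'Faust', 'edition': '2nd', 'publisher': 'Reclam', 'isbn': '123'}
```

Because the source records are never modified, the display record can simply be rebuilt by calling the function again whenever a record in the cluster changes or is removed.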
However, some rules prevent merging. The following are not merged:
- documents that a library has marked with a "nomerge" code
- two or more matching documents that come from the same source
- documents that are still in process (and are quite likely to cause wrong merges)
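These rules can be expressed as a check that runs before two matching records are placed in the same cluster. The field names `nomerge`, `source` and `in_process` are hypothetical and stand in for however the real data encodes this information.

```python
def may_merge(rec_a: dict, rec_b: dict) -> bool:
    """Return False if any rule forbids merging two matching records.

    The keys used here are hypothetical placeholders for illustration.
    """
    # Rule 1: a library has marked one of the documents with a "nomerge" code.
    if rec_a.get("nomerge") or rec_b.get("nomerge"):
        return False
    # Rule 2: matching documents from the same source are not merged.
    if rec_a["source"] == rec_b["source"]:
        return False
    # Rule 3: documents still in process are likely to cause wrong merges.
    if rec_a.get("in_process") or rec_b.get("in_process"):
        return False
    return True


print(may_merge({"source": "RERO"}, {"source": "IDS"}))   # True
print(may_merge({"source": "RERO"}, {"source": "RERO"}))  # False
```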