Matching and merging
Swissbib contains the bibliographic data of RERO, the IDS libraries, the Swiss National Library and the IDS partner libraries. Because of the organizational structure of the Swiss library system, the data is stored in eleven different databases. The number of duplicates in this data is high because the libraries have similar collections. The data is cataloged according to three sets of rules (AACR2, Recaro and KIDS) and structured in two formats, MARC21 and IDSMARC. A fourth factor is multilingualism, which mostly affects authorities and subject headings.
Whereas the formats can be mapped rather easily, the different cataloging rules and, to a greater extent, the different cataloging practices pose problems.
To get a picture of the difficulties related to eliminating duplicates in catalog data, the swissbib project commissioned Pierre Gavin and Jean-Bertrand Gonin in 2007 to study the feasibility of deduplicating bibliographic and name-authority data. Their findings are summarized on the page deduplication study.
Deduplication: matching and merging
A standard deduplication process consists of two steps:
- matching possible duplicates
- merging the files according to a predefined rule set, or clustering them by ID
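The two steps can be illustrated with a minimal sketch. It is not swissbib's implementation: the record layout, the `is_match` callback and the function names are assumptions made for illustration. Matching produces pairs of record IDs, and the clustering variant of the second step groups connected pairs under one cluster ID instead of merging the records.

```python
def find_candidate_pairs(records, is_match):
    """Step 1: compare every pair of records and keep the likely duplicates.

    `records` maps a record ID to a dict of fields (hypothetical layout).
    `is_match` is any function that decides whether two records match.
    """
    ids = sorted(records)
    return [
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if is_match(records[a], records[b])
    ]


def cluster_by_id(record_ids, pairs):
    """Step 2 (clustering variant): group matched records under one cluster ID.

    Uses a small union-find; the smallest record ID names each cluster.
    """
    parent = {r: r for r in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), []).append(r)
    return clusters


# Illustrative data only: three records, two of which share an ISBN.
records = {
    "r1": {"isbn": "123"},
    "r2": {"isbn": "123"},
    "r3": {"isbn": "999"},
}
pairs = find_candidate_pairs(records, lambda a, b: a["isbn"] == b["isbn"])
clusters = cluster_by_id(sorted(records), pairs)
```

Comparing all pairs is quadratic in the number of records, so a system of this size would presumably restrict the comparison to candidates that share a key; the sketch leaves that out.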
swissbib matching criteria
swissbib uses several elements of a record as matching criteria. Each criterion is weighted individually, and a match factor is calculated from the weighted results.
- control-numbers and IDs
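A weighted match factor of this kind could be computed as in the following sketch. The field names, the weights and the comparison functions are assumptions chosen for illustration; the actual swissbib criteria and weights are not given here, and only control numbers are named above.

```python
from difflib import SequenceMatcher


def exact(a, b):
    """Score 1.0 for identical values, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def fuzzy(a, b):
    """Score the similarity of two strings between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical criteria: field name -> (weight, comparison function).
CRITERIA = {
    "control_number": (0.5, exact),
    "title": (0.3, fuzzy),
    "year": (0.2, exact),
}


def match_factor(rec_a, rec_b, criteria=CRITERIA):
    """Return the weighted match factor of two records, between 0.0 and 1.0.

    A field that is missing or empty in either record contributes nothing.
    """
    total = sum(weight for weight, _ in criteria.values())
    score = 0.0
    for field, (weight, compare) in criteria.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a and b:
            score += weight * compare(a, b)
    return score / total


# Illustrative records only.
full = {"control_number": "ocm123", "title": "Faust", "year": "1999"}
stripped = {"title": "Faust", "year": "1999"}  # control number removed

same = match_factor(full, full)
without_number = match_factor(full, stripped)
```

With these assumed weights, two identical records reach the maximum factor, while a record whose control number was removed can reach at most half of it. A threshold on the factor would then decide whether two records count as duplicates.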
Specific data problems
Swiss libraries apply some cataloging practices that hinder matching. The reasons for these practices differ, but they boil down to "time saving" and "record beautification", which are mutually exclusive:
- years of series are not set if they are hard to determine
- control numbers and other tags are removed if they are deemed useless by the cataloger
- the format of a record is incorrectly set or not specific enough
- the practice of copying old records for acquisition leads to wrongly assigned OCLC numbers and the like
- old migration data wasn't cleaned up; instead, workarounds were put in place locally to make the data usable