Difference between revisions of "Matching and merging"

 
=Study of the deduplication of bibliographic data and authorities=  
 
In order to get a picture of the difficulties related to eliminating duplicates in catalog data, the swissbib project commissioned Pierre Gavin and Jean-Bertrand Gonin in 2007 to conduct a feasibility study on the deduplication of bibliographic and name-authority data. Their findings are summarized on the page [[deduplication_study|deduplication study]].
 
=Deduplication: matching and merging=
 
A standard deduplication process consists of two steps:
 
# matching possible duplicates
 
# merging the files according to a predefined rule-set, or clustering them by ID
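
The two steps can be illustrated with a minimal Python sketch. It is not swissbib's implementation: the record fields, the matching key (normalized title plus year) and the merge rule (first non-empty value wins) are assumptions chosen only to show how matching is separated from merging or clustering.

```python
from collections import defaultdict

def match_key(record):
    """Hypothetical matching key: normalized title plus year."""
    return (record["title"].strip().lower(), record.get("year"))

def match(records):
    """Step 1: group records that share a key into candidate duplicate sets."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return list(groups.values())

def cluster_by_id(group):
    """Step 2, variant A: leave the records untouched and link them by ID."""
    return sorted(record["id"] for record in group)

def merge(group):
    """Step 2, variant B: merge by a toy rule (first non-empty value wins)."""
    merged = {}
    for record in group:
        for field, value in record.items():
            if field != "id" and value and field not in merged:
                merged[field] = value
    merged["source_ids"] = cluster_by_id(group)
    return merged

# Invented sample records, for illustration only.
records = [
    {"id": "A1", "title": "Faust", "year": 1998, "format": ""},
    {"id": "B7", "title": " faust ", "year": 1998, "format": "book"},
    {"id": "C3", "title": "Faust", "year": 2005, "format": "book"},
]
groups = match(records)
merged = [merge(group) for group in groups]
```

Here the first two records fall into one group and the third stays on its own. Clustering keeps every source record and only records which IDs belong together, whereas merging produces a single new record per group.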
 
 
=swissbib matching criteria=
 
swissbib uses several elements of a record as matching criteria. Each criterion is weighted individually, and a match-factor is calculated from the weighted results.
 
* control-numbers and IDs
 
* year
 
* format
 
* title
 
* edition/impressum
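
As a rough illustration of how individually weighted criteria could yield a match-factor, here is a Python sketch. The weights, the field names and the exact-match comparison are all assumptions made for this example; the page does not state the values or comparison functions swissbib actually uses.

```python
# Hypothetical weights per criterion; the real values are not documented here.
WEIGHTS = {
    "control_number": 4,
    "year": 1,
    "format": 1,
    "title": 3,
    "edition": 1,
}

def field_score(a, b):
    """Return 1.0 if both values are present and equal after normalization, else 0.0."""
    if not a or not b:
        return 0.0
    return 1.0 if str(a).strip().lower() == str(b).strip().lower() else 0.0

def match_factor(r1, r2, weights=WEIGHTS):
    """Weighted share of agreeing criteria, between 0.0 and 1.0."""
    total = sum(weights.values())
    score = sum(w * field_score(r1.get(f), r2.get(f)) for f, w in weights.items())
    return score / total

# Invented sample records: the second one lacks a control number.
a = {"control_number": "12345", "year": 1998, "format": "book",
     "title": "Faust", "edition": "2nd ed."}
b = {"control_number": "", "year": 1998, "format": "book",
     "title": "faust ", "edition": "2nd ed."}
factor = match_factor(a, b)
```

In this sketch a missing value counts as a disagreement, so a removed control number lowers the factor even when every other criterion agrees. A real system would more likely use fuzzy comparisons for titles and a threshold on the factor to decide whether two records match.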
 
 
==specific data problems==
 
Swiss libraries apply some cataloging practices that hinder matching. The reasons for these practices vary, but they can be boiled down to "time saving" and "record beautification", which are mutually exclusive:
 
* years of series are not set if they are hard to determine
 
* control numbers and other tags are removed if they are deemed useless by the cataloger
 
* the format of a record is incorrectly set or not specific enough
 
* the practice of copying old records for acquisition leads to wrongly assigned OCLC numbers and the like
 
* old migration data wasn't cleaned up; instead, local workarounds were put in place to use the data
 

Revision as of 08:42, 24 June 2014

=Bibliographic data and authorities=

Swissbib contains the bibliographic data of RERO, the IDS libraries, the Swiss National Library and the IDS partner libraries. Due to the organizational structure of the Swiss library system, the data is stored in eleven different databases. The number of duplicates in this data is high because the libraries have similar collection structures. The data is cataloged according to three sets of rules: AACR2, Recaro and KIDS. It is structured in the formats MARC21 and IDSMARC. A fourth factor is multilingualism, which mostly affects authorities and subject headings.

Whereas the formats can be mapped rather easily, the different cataloging rules and, to a greater extent, the different cataloging practices pose problems.
