Talk:Matching and merging


swissbib matching & merging

swissbib record matching is completely index-based. To increase the chance that elements match, the data is normalized and transformed.
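
The exact normalization and transformation rules are not listed here. As a rough illustration only, here is a Python sketch of the kind of normalization that could precede the index keys; the function name and the individual steps are assumptions, not swissbib's actual rules:

    import re
    import unicodedata

    def normalize(value: str) -> str:
        """Illustrative normalization: strip accents, lowercase,
        drop punctuation and collapse whitespace."""
        value = unicodedata.normalize("NFKD", value)
        value = "".join(c for c in value if not unicodedata.combining(c))
        value = value.lower()
        value = re.sub(r"[^\w\s]", " ", value)
        return re.sub(r"\s+", " ", value).strip()

    # Two title variants end up as the same match key.
    print(normalize("Die Alpen: ein Überblick"))   # -> "die alpen ein uberblick"
    print(normalize("Die Alpen : ein Uberblick"))  # -> "die alpen ein uberblick"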

match elements

The following elements of a record are used to generate match indexes:

Element                       Weight
material type                 1
ID (ISSN, ISBN)               4
year                          2.5
decade                        2.5
century                       2.5
title (245$a,$b,$n,$p)        3
edition                       2
(corporate) authors           1.5
impressum: editor name        2.5
extent                        2.5
volume number                 1
coordinates and scale         2.5


match threshold

The match threshold is 1, so all elements that can be compared must match.
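
How the weights and the threshold are combined is not spelled out above. One plausible reading, sketched below in Python, is that only elements present in both records are compared and that the match factor is the weighted share of those elements that agree, so a threshold of 1 means every comparable element has to match. The field names and the formula are assumptions for illustration only.

    # Weights taken from the table above; keys are hypothetical field names.
    WEIGHTS = {
        "material_type": 1, "id": 4, "year": 2.5, "decade": 2.5, "century": 2.5,
        "title": 3, "edition": 2, "corporate_authors": 1.5,
        "impressum_editor": 2.5, "extent": 2.5, "volume_number": 1,
        "coordinates_scale": 2.5,
    }

    def match_factor(a: dict, b: dict) -> float:
        """Weighted share of comparable elements that agree
        (1.0 means every element present in both records matches)."""
        comparable = [k for k in WEIGHTS if a.get(k) and b.get(k)]
        if not comparable:
            return 0.0
        matched = sum(WEIGHTS[k] for k in comparable if a[k] == b[k])
        return matched / sum(WEIGHTS[k] for k in comparable)

    rec1 = {"title": "die alpen", "year": "1999", "id": "9783123456789"}
    rec2 = {"title": "die alpen", "year": "1999"}   # no ID, so the ID is not compared
    print(match_factor(rec1, rec2) >= 1)            # True: all comparable elements match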


merging

Classical merging combines two or more matching records into a new record and is a destructive process. Instead, swissbib builds clusters of matching records and derives a display record from each cluster. This display record is temporary: it is rebuilt or updated each time a record in the cluster is updated or deleted.
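
A minimal sketch of this non-destructive approach, assuming a simple in-memory cluster structure (the class and its merge policy are illustrative, not swissbib's actual implementation):

    class Cluster:
        """Holds the matching source records; the display record is
        derived from them and can therefore always be rebuilt."""

        def __init__(self):
            self.records = {}  # record id -> source record

        def upsert(self, rec_id, record):
            self.records[rec_id] = record
            return self.build_display_record()

        def delete(self, rec_id):
            self.records.pop(rec_id, None)
            return self.build_display_record()

        def build_display_record(self):
            # Illustrative policy: keep the longest value per field
            # across all records still in the cluster.
            display = {}
            for record in self.records.values():
                for field, value in record.items():
                    if len(str(value)) > len(str(display.get(field, ""))):
                        display[field] = value
            return display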

However, there are some rules that prevent merging (see the sketch after this list):

  • a document is marked with a "nomerge" code by a library
  • two or more matching documents from the same source are not merged
  • documents that are still in process (and are therefore quite likely to be merged the wrong way)
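
These guards could be expressed roughly as follows; the field names ("nomerge", "in_process", "source") are assumptions chosen for illustration:

    def may_merge(candidate: dict, cluster_records: list[dict]) -> bool:
        """Apply the three merge-prevention rules listed above."""
        if candidate.get("nomerge"):        # library marked the document as "nomerge"
            return False
        if candidate.get("in_process"):     # document is still being processed
            return False
        if any(r.get("source") == candidate.get("source") for r in cluster_records):
            return False                    # documents from the same source are not merged
        return True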

Old text from 2010 or so...

Deduplication: matching and merging

A standard deduplication process consists of two steps:

  1. matching possible duplicates
  2. merging the records according to a predefined rule set, or clustering them by ID (a minimal sketch of the clustering variant follows this list)
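
A minimal sketch of the clustering variant of step 2, with a placeholder for step 1 (both function names are hypothetical):

    def is_duplicate(a: dict, b: dict) -> bool:
        """Placeholder for step 1; in practice a weighted comparison
        of several elements would be used here."""
        return bool(a.get("id")) and a.get("id") == b.get("id")

    def deduplicate(records: list[dict]) -> list[list[dict]]:
        """Step 2 (clustering): add each record to the first cluster
        containing a duplicate, otherwise start a new cluster."""
        clusters: list[list[dict]] = []
        for rec in records:
            for cluster in clusters:
                if any(is_duplicate(rec, member) for member in cluster):
                    cluster.append(rec)
                    break
            else:
                clusters.append([rec])
        return clusters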

swissbib matching criteria

swissbib uses different elements of a record as matching criteria. These criteria are each weighted individually, and from them a match factor is calculated.

  • control-numbers and IDs
  • year
  • format
  • title
  • edition/impressum

specific data problems

The Swiss libraries apply some cataloging practices that do not help matching. The reasons for these practices differ but can be boiled down to "time saving" and "record beautification", which are mutually exclusive:

  • years of series are not recorded if they are hard to determine
  • control numbers and other tags are removed if the cataloger deems them useless
  • the format of a record is set incorrectly or is not specific enough
  • the practice of copying old records for acquisitions leads to wrongly assigned OCLC numbers or similar identifiers
  • old migration data was not cleaned up; instead, some local workarounds were put in place to use the data