Deduplication study

From Swissbib


Deduplication matters in two areas: catalogue data and authority data.

Pierre Gavin and Jean-Bertrand Gonin conducted a study covering both areas. They examined the feasibility of deduplicating multilingual content encoded in different MARC flavours (IDSMARC and MARC21) and catalogued according to different rule sets (AACR2, KIDS).

This page is now mainly of historical value, because matching and merging was successfully put in place during the first phase of the project.



Deduplication of bibliographic records

Procedure

A sample of 80'000 records was taken from seven databases belonging to two networks (RERO and IDS) and the Swiss National Library. To ensure a high density of duplicates, the records were extracted by keyword search via Z39.50 rather than sequentially by record number. To evaluate the accuracy of the deduplication, a group of 700 records was first checked by the algorithm and then rechecked manually.

Limitations

So as not to distort the findings, media types with special cataloguing requirements were excluded:

  • serials
  • analytics
  • historic books
  • dummies (should not be taken into account at all)


The algorithm

The algorithm takes into account the content of the following fields:

  • ISBN
  • title
  • author
  • publisher
  • pagination
  • media type


For each field, the algorithm assigns a number that expresses how similar the field's content is in the two records. Because the fields differ in importance, the numbers assigned to them differ in magnitude.

  • A number between 0 and 10 signifies a duplicate.
  • 11 and 12 are still strong indicators of a duplicate. Whether the record is finally classified as a duplicate depends on which fields these values were assigned to.
  • Numbers over 20 indicate that the record cannot be a duplicate.
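
The scoring idea can be illustrated with a minimal Python sketch. The study does not publish its actual per-field values, so the penalties in `PENALTY`, the rule that the per-field numbers are summed, the handling of missing values, and the "undetermined" label for scores from 13 to 20 (a range the list above does not describe) are all assumptions made for illustration only.

```python
# Hypothetical penalties applied when two records disagree on a field.
# These are NOT the values used in the study; they only show the principle
# that more important fields contribute larger numbers.
PENALTY = {
    "isbn": 12,
    "title": 12,
    "author": 6,
    "publisher": 3,
    "pagination": 2,
    "media_type": 21,
}


def field_penalty(name, a, b):
    """Return 0 if both values agree (or one is missing), else the field's penalty."""
    if not a or not b:
        return 0  # assumption: a missing value is not counted as a disagreement
    return 0 if a.strip().lower() == b.strip().lower() else PENALTY[name]


def classify(rec_a, rec_b):
    """Sum the per-field penalties (an assumption) and map the total to a verdict."""
    score = sum(field_penalty(f, rec_a.get(f), rec_b.get(f)) for f in PENALTY)
    if score <= 10:
        return score, "duplicate"
    if score <= 12:
        # The text says the final decision here depends on which fields differ.
        return score, "probable duplicate"
    if score > 20:
        return score, "not a duplicate"
    return score, "undetermined"  # 13 to 20 is not described in the text


record = {"isbn": "9780306406157", "title": "Faust", "author": "Goethe",
          "publisher": "Reclam", "pagination": "200 p.", "media_type": "book"}
print(classify(record, dict(record, pagination="201 p.")))
```

A real implementation would replace the exact string comparison with a graded similarity per field, which is what the refinements listed under "Problems" below call for.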


The algorithm can still be refined. The series statement (field 490) and the language code from field 008 or 040 could be included in the analysis.

Findings

The preliminary study shows that deduplication of multilingual content by algorithm is feasible.

Accuracy of algorithm

In its present state, the algorithm produces approximately:

  • 5.3% of false duplicates
  • 2.2% of false non-duplicates
  • 9.2% of probable duplicates
  • 3.9% of probable non-duplicates

As noted above, the algorithm can still be refined.


Positive factors concerning deduplication

  • ISBD descriptions are very similar, sometimes identical
  • an ISBN is present very frequently
  • MARC21 and IDSMARC are largely identical and applied in a standard-compliant manner
  • the catalogue records are generally of good quality
  • multilingualism poses practically no problems in fields 300 and 500
  • multilingualism poses practically no problems with personal names; popes and kings are exceptions


Problems (to be solved)

Not all of the points listed below are serious problems. The issues that are complicated to solve are marked with ***.

  • inconsistent cataloguing levels for the same title
  • converted records of varying quality
  • large differences in the number and order of 7xx fields
  • different approaches to pagination, mainly in old or recatalogued records
  • typos (not very frequent in the data set)
  • flawed cataloguing
  • differences in MARC coding
    • no author in field 1xx (problems for FRBRization)***
    • different grades of multilevel cataloguing***
  • field 020
    • the ISBN has to be validated
    • the ISBN has to be normalized, either from 10 to 13 digits or from 13 to 10
  • field 1xx and 7xx - author: unfortunately, the field to which an author is assigned can differ from one site to another***
  • field 245 - title: the algorithm must be refined to cope with minor differences
  • field 260 - publisher: the algorithm must be refined to cope with differences
  • field 300 - collation: because pagination is treated differently, this field carries rather little weight in the overall deduplication process
  • transliteration: RERO uses different transliteration methods than IDS and the National Library***
  • multilingualism:
    • names of corporate bodies pose problems
    • subject headings are difficult to handle
  • uniform titles***
    • it is not completely clear in which cases a uniform title was assigned (this differs from site to site)
    • for example, it is stored in different places in RERO (7xx$t) and IDS (240)
  • original titles of translations: field 509 in IDS and field 500 in RERO***
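
The two requirements listed for field 020, validation and normalization, can be sketched in Python as follows. The check-digit rules are the standard ISBN-10 (modulus 11) and ISBN-13 (modulus 10) rules; the function names, the choice of ISBN-13 as the target form, and the way qualifiers such as "(pbk.)" are stripped are assumptions for illustration, not the method used in the study.

```python
import re


def clean_isbn(raw):
    """Drop a trailing qualifier such as '(pbk.)' and all hyphens and spaces."""
    head = raw.split("(")[0]
    return re.sub(r"[^0-9Xx]", "", head).upper()


def is_valid_isbn10(s):
    """ISBN-10: weights 10 down to 1, total must be divisible by 11 (X = 10)."""
    if not re.fullmatch(r"\d{9}[\dX]", s):
        return False
    digits = [10 if c == "X" else int(c) for c in s]
    return sum((10 - i) * d for i, d in enumerate(digits)) % 11 == 0


def weighted_sum13(s):
    """Sum of digits with alternating weights 1 and 3, as used by ISBN-13."""
    return sum((1 if i % 2 == 0 else 3) * int(c) for i, c in enumerate(s))


def is_valid_isbn13(s):
    """ISBN-13: the alternating 1/3 weighted total must be divisible by 10."""
    return bool(re.fullmatch(r"\d{13}", s)) and weighted_sum13(s) % 10 == 0


def to_isbn13(raw):
    """Return the validated ISBN-13 form of a 020 value, or None if invalid."""
    s = clean_isbn(raw)
    if is_valid_isbn13(s):
        return s
    if is_valid_isbn10(s):
        body = "978" + s[:9]  # drop the old check digit, add the 978 prefix
        check = (10 - weighted_sum13(body) % 10) % 10
        return body + str(check)
    return None


print(to_isbn13("0-306-40615-2 (pbk.)"))
```

Normalizing both records to the same form before comparison means that an ISBN-10 at one site and the corresponding ISBN-13 at another are recognized as the same number.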


Data concerning the deduplication of 80'000 records

Numbers of duplicates found in the sample

Numbers of duplicates found

       *      N     Z      B     S     L      R     H
*  37904  13746  8761  17044  4016  5413  16732  9318
N   6679    480  1872   4008   827  1199   3346  1756
Z   3782   1859   191   2527   469   690   2070   802
B   9088   4124  2619   1040  1076  1483   4866  2127
S   1526    804   464   1038    48   261    897   375
L   2088   1180   684   1400   264   128   1069   534
R   9533   3514  2115   4873   935  1094   1613  2864
H   5208   1785   816   2158   397   558   2871   860


N = Nebis, Z = IDS Zürich, B = IDS Basel/Bern, S = IDS St. Gallen (UniSG), L = IDS Luzern, R = RERO, H = Helveticat


Numbers of retrieved sample data in proportion to total catalogue size

(January 2008)

Catalogue size in proportion to sample

Site   kRec  retrieved  % proportion (x100)
N      3800      14731  3.88
Z      1700       6076  3.57
B      4400      17371  3.95
S       500       2357  4.71
L       500       2973  5.95
R      4300      24871  5.78
H      1500      12756  8.5
TOTAL 16700      81135
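
The figures in the last column match the number of retrieved records divided by the catalogue size in thousands (kRec), that is, sampled records per thousand catalogue records. The short Python check below recomputes the column and the totals under that reading; the interpretation of the column is an inference from the numbers, not something the table states.

```python
# (kRec, retrieved, stated proportion) per site, copied from the table above.
ROWS = {
    "N": (3800, 14731, 3.88),
    "Z": (1700, 6076, 3.57),
    "B": (4400, 17371, 3.95),
    "S": (500, 2357, 4.71),
    "L": (500, 2973, 5.95),
    "R": (4300, 24871, 5.78),
    "H": (1500, 12756, 8.5),
}


def per_thousand(krec, retrieved):
    """Retrieved records per thousand catalogue records (assumed meaning of the column)."""
    return retrieved / krec


for site, (krec, retrieved, stated) in ROWS.items():
    print(site, round(per_thousand(krec, retrieved), 2), stated)

total_krec = sum(krec for krec, _, _ in ROWS.values())
total_retrieved = sum(retrieved for _, retrieved, _ in ROWS.values())
print("TOTAL", total_krec, total_retrieved)
```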