Deduplication study
From Swissbib
Deduplication matters in two areas: catalogue data and authority data.
Pierre Gavin and Jean-Bertrand Gonin conducted a study covering both areas. They checked the feasibility of deduplicating multilingual content encoded in different MARC flavours (IDSMARC and MARC21) and catalogued according to different rule sets (AACR2, KIDS).
This page is of historical value only, as matching and merging was successfully put in place during the first phase of the project.
Deduplication of bibliographic records
Procedure
A sample of 80'000 records was taken from seven databases belonging to two networks (RERO and IDS) and the Swiss National Library. To ensure a high density of duplicates, the records were extracted by keyword search via Z39.50 rather than sequentially by record number. To evaluate the accuracy of the deduplication, a group of 700 records was first checked by the algorithm and then rechecked manually.
Limitations
To avoid distorting the findings, media types with special cataloguing requirements were excluded:
- serials
- analytics
- historic books
- dummies (should not be taken into account at all)
The algorithm
The algorithm takes into account the content of the following fields:
- ISBN
- title
- author
- publisher
- pagination
- media type
For each field the algorithm assigns a number that signifies the similarity of its content. The fields mentioned above are of different importance and therefore the assigned numbers are of different values.
- A number between 0 and 10 signifies a duplicate.
- 11 and 12 are still strong indicators of a duplicate. Whether a record is finally classified as a duplicate depends on which fields these values are assigned to.
- Numbers over 20 indicate that the record cannot be a duplicate.
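The thresholds above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's implementation: it assumes that the per-field numbers are summed into one total, the function names and the example field values are hypothetical, and the range from 13 to 20, which this page does not describe, is reported as unspecified rather than guessed.

```python
from typing import Mapping


def total_score(field_scores: Mapping[str, int]) -> int:
    """Sum the per-field numbers (assumption: the total is a plain sum)."""
    return sum(field_scores.values())


def classify(score: int) -> str:
    """Map a total score onto the ranges described on this page."""
    if score < 0:
        raise ValueError("score must not be negative")
    if score <= 10:
        return "duplicate"
    if score <= 12:
        # The final decision depends on which fields carry the values.
        return "strong indicator"
    if score <= 20:
        # This range is not described in the source text.
        return "unspecified"
    return "non-duplicate"


# Illustrative values only; the real per-field weights are not documented here.
example = {"isbn": 0, "title": 3, "author": 2}
print(classify(total_score(example)))
```

A fuller version would also have to inspect which fields contributed to a score of 11 or 12 before deciding.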
The algorithm still has potential to be refined. The series statement (field 490) and the language code in field 008 or 040 could be included in the analytical framework.
Findings
The preliminary study shows that deduplication of multilingual content by algorithm is feasible.
Accuracy of algorithm
In its present state, the algorithm produces approximately:
- 5.3% of false duplicates
- 2.2% of false non-duplicates
- 9.2% of probable duplicates
- 3.9% of probable non-duplicates
As noted above, there is still potential to refine the algorithm.
Positive factors concerning deduplication
- ISBD is very similar, sometimes identical
- ISBN is very frequent
- MARC21 and IDSMARC are in large parts identical and applied in a standard-compliant manner
- the catalogue records are generally of good quality
- multilingualism poses practically no problems in fields 300 and 500
- multilingualism poses practically no problems with personal names - exceptions are popes or kings
Problems (to be solved)
Some of the points listed below are not very problematic. The issues that are complicated to solve are marked with ***.
- inconsistent cataloguing levels for the same title
- converted records in various grades of quality
- large differences in the number and order of 7xx fields
- different approaches to pagination, mainly in old or recatalogued records
- typos (not very frequent in the data set)
- flawed cataloguing
- differences in MARC coding
- no author in field 1xx (problems for FRBRization)***
- different grades of multilevel cataloguing***
- field 020 - ISBN:
  - the ISBN has to be validated
  - the ISBN has to be normalized, either 10 to 13 or 13 to 10
- fields 1xx and 7xx - author: unfortunately, the assignment of an author can differ from one site to another***
- field 245 - title: the algorithm must be refined to cope with minor differences
- field 260 - publisher: the algorithm must be refined to cope with differences
- field 300 - collation: different treatment of pagination results in rather weak significance for the overall deduplication process
- transliteration: RERO uses different transliteration methods than IDS and the National Library***
- multilingualism:
  - names of corporate bodies pose problems
  - subject headings are difficult to handle
- uniform titles***
  - it is not completely clear in which cases a uniform title was assigned (differs from site to site)
  - as an example: different location in RERO (7xx$t) and IDS (240)
- original titles in the case of translations: 509 IDS and 500 RERO***
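The two ISBN requirements listed under field 020 (validation, and normalization from 10 to 13 digits) can be sketched as follows. The check-digit rules are the standard ISBN-10 (modulo 11) and ISBN-13 (modulo 10) rules; the function names are hypothetical, and nothing here is taken from the study's own code.

```python
import re
from typing import Optional


def clean(raw: str) -> str:
    """Remove hyphens and whitespace; upper-case a trailing 'x'."""
    return re.sub(r"[\s-]", "", raw).upper()


def is_valid_isbn10(isbn: str) -> bool:
    """Weights 10..1, sum divisible by 11; 'X' stands for 10 as check digit."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(isbn))
    return total % 11 == 0


def weighted_sum13(digits: str) -> int:
    """Alternating weights 1, 3, 1, 3, ... as used by ISBN-13."""
    return sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(digits))


def is_valid_isbn13(isbn: str) -> bool:
    return bool(re.fullmatch(r"[0-9]{13}", isbn)) and weighted_sum13(isbn) % 10 == 0


def to_isbn13(raw: str) -> Optional[str]:
    """Return the normalized ISBN-13, or None if the input is not a valid ISBN."""
    isbn = clean(raw)
    if is_valid_isbn13(isbn):
        return isbn
    if is_valid_isbn10(isbn):
        body = "978" + isbn[:9]          # drop the old check digit
        check = (10 - weighted_sum13(body) % 10) % 10
        return body + str(check)
    return None


print(to_isbn13("0-306-40615-2"))  # 9780306406157
```

Normalizing towards 13 digits is used here because every valid ISBN-10 has an ISBN-13 form with the 978 prefix; comparing two records then reduces to comparing the normalized strings.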
Data concerning the deduplication of 80'000 records
Numbers of duplicates found in the sample
| | * | N | Z | B | S | L | R | H |
| * | 37904 | 13746 | 8761 | 17044 | 4016 | 5413 | 16732 | 9318 |
| N | 6679 | 480 | 1872 | 4008 | 827 | 1199 | 3346 | 1756 |
| Z | 3782 | 1859 | 191 | 2527 | 469 | 690 | 2070 | 802 |
| B | 9088 | 4124 | 2619 | 1040 | 1076 | 1483 | 4866 | 2127 |
| S | 1526 | 804 | 464 | 1038 | 48 | 261 | 897 | 375 |
| L | 2088 | 1180 | 684 | 1400 | 264 | 128 | 1069 | 534 |
| R | 9533 | 3514 | 2115 | 4873 | 935 | 1094 | 1613 | 2864 |
| H | 5208 | 1785 | 816 | 2158 | 397 | 558 | 2871 | 860 |
N = Nebis, Z = IDS Zürich, B = IDS Basel/Bern, S = IDS St. Gallen (UniSG), L = IDS Luzern, R = RERO, H = Helveticat
Numbers of retrieved sample data in proportion to total catalogue size
(January 2008)
| Site | catalogue size (kRec, thousands of records) | retrieved | retrieved per 1000 catalogue records (‰) |
| N | 3800 | 14731 | 3.88 |
| Z | 1700 | 6076 | 3.57 |
| B | 4400 | 17371 | 3.95 |
| S | 500 | 2357 | 4.71 |
| L | 500 | 2973 | 5.95 |
| R | 4300 | 24871 | 5.78 |
| H | 1500 | 12756 | 8.5 |
| TOTAL | 16700 | 81135 | |
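The last column of this table can be reproduced by dividing the number of retrieved records by the catalogue size in thousands, which gives retrieved records per 1000 catalogue records. The short check below uses only the figures from the table; the variable and function names are hypothetical.

```python
# site: (catalogue size in thousands of records, retrieved records)
samples = {
    "N": (3800, 14731),
    "Z": (1700, 6076),
    "B": (4400, 17371),
    "S": (500, 2357),
    "L": (500, 2973),
    "R": (4300, 24871),
    "H": (1500, 12756),
}


def per_thousand(krec: int, retrieved: int) -> float:
    """Retrieved records per 1000 catalogue records, rounded as in the table."""
    return round(retrieved / krec, 2)


for site, (krec, retrieved) in samples.items():
    print(site, per_thousand(krec, retrieved))

# Column totals, matching the TOTAL row of the table.
total_krec = sum(krec for krec, _ in samples.values())
total_retrieved = sum(retrieved for _, retrieved in samples.values())
print("TOTAL", total_krec, total_retrieved)
```

For Nebis, for example, 14731 / 3800 gives 3.88, which corresponds to roughly 0.39% of the catalogue.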
