Features Deduplication

The attributes described below are an extract of Swissbib's full MARC 21 data set used for deduplication with machine learning.

Feature | Origin in MARC 21 | Description / Range of Values | Degree of Filling [%] | Relevant for Training
035liste 035 $a List of identifiers for the bibliographic unit
  1. OCLC identifier from WorldCat
  2. Unique identifier of originating library

A record may have more than one unique identifier of the originating library.

100 no
century 008/07-10 Year of origin of the bibliographic unit
  • Numeric digits
  • 'u' : digit of date element is unknown
  • 100
  • 4 (fully unknown)
no
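The position notation 008/07-10 can be illustrated with a minimal sketch. It assumes the 008 control field is available as a plain string; the function names and the sample value are hypothetical and not part of Swissbib's code.

```python
def extract_century(field_008: str) -> str:
    """Return the four date characters at positions 07-10 of an 008 field."""
    # MARC character positions are zero-based, so 07-10 is the slice [7:11].
    return field_008[7:11]


def count_unknown_digits(century: str) -> int:
    """Count the 'u' placeholders that mark unknown date digits."""
    return century.count("u")


# Made-up 008 value: entry date "850101", date type "s", then the year "19uu".
sample_008 = "850101s19uu    sz            000 0 ger d"
century = extract_century(sample_008)
unknown = count_unknown_digits(century)
```

A value with four 'u' characters corresponds to the "fully unknown" case listed above.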
coordinate 034 $d, $f

Coded Cartographic Mathematical Data.

  • $d : westernmost longitude
  • $f : northernmost latitude

Relevant only for map formats.

0.4 yes
corporate 110 $a, $b, $c
710 $a, $b, $c
810 $a, $b, $c

Corporate institution as publisher of the bibliographic unit.

  • Main Entry-Corporate Name - with filter 'a', 'b', 'c'.
  • Added Entry-Corporate Name - with filter 'a', 'b', 'c'.
  • Series Added Entry-Corporate Name : no relevant data in this field - with filter 'a', 'b', 'c'.
  • 6.2
  • 12.6
  • 0.0
  • yes
  • yes
  • no
decade 008/07-10 Same MARC 21 basis as century. The decade in which the bibliographic unit was published can be derived from this field. - no
docid - Unique identifier of Swissbib's DataHub 100 no
doi 024 $a
  • Other Standard Identifier with filter for indicator = 7.
  • Digital Object Identifier (DOI), an identifier for scientific online articles
5.2 yes
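The indicator filter on field 024 can be sketched as follows. This is only an illustration: it assumes that the indicator meant is the first indicator of the field, and it uses a hypothetical in-memory layout of dictionaries (with keys "tag", "ind1" and "subfields") rather than any real library's record type. The identifier values are made-up samples.

```python
from typing import List


def standard_identifiers(fields: List[dict], ind1: str) -> List[str]:
    """Collect $a values of 024 fields whose first indicator equals ind1."""
    values = []
    for field in fields:
        if field["tag"] == "024" and field["ind1"] == ind1:
            value = field["subfields"].get("a")
            if value:
                values.append(value)
    return values


# Hypothetical record content, for illustration only.
sample_fields = [
    {"tag": "024", "ind1": "7", "subfields": {"a": "10.1000/182"}},
    {"tag": "024", "ind1": "2", "subfields": {"a": "sample-ismn-value"}},
    {"tag": "020", "ind1": " ", "subfields": {"a": "sample-isbn-value"}},
]

doi_values = standard_identifiers(sample_fields, "7")
```

The same function with a different indicator value would select other standard identifiers stored in 024.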
edition 250 $a Edition statement 13.8 yes
exactDate 008/07-14
  • The first 4 digits are identical to the attribute century; the last 4 digits give additional date information.
  • Instead of numeric digits, the letter 'u' or a blank ' ' may appear in the last 4 digits.
  • 100 (in digits [:4])
  • 12.6 (in digits [4:])
yes
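A minimal sketch of how the eight characters of 008/07-14 split into the two parts described for exactDate. It assumes the 008 field is available as a plain string; the function names and sample values are hypothetical.

```python
from typing import Tuple


def extract_exact_date(field_008: str) -> Tuple[str, str]:
    """Split positions 07-14 of an 008 field into its two 4-character parts."""
    # Zero-based positions 07-14 correspond to the slice [7:15].
    date = field_008[7:15]
    return date[:4], date[4:]


def is_filled(part: str) -> bool:
    """True only if the part consists purely of digits (no 'u', no blanks)."""
    return part.isdigit()


# Made-up 008 values: one with two dates, one with a blank second part.
two_dates = extract_exact_date("850101m19851990sz            000 0 ger d")
one_date = extract_exact_date("850101s1985    sz            000 0 ger d")
```

Here the first part matches the century attribute, and `is_filled` distinguishes a numeric second part from one holding 'u' or blanks.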
format 898 $a Describes the format of the bibliographic unit.
  • See Swissbib marc.
  • Some records may hold multiple format descriptions.
98.0 yes
isbn 020 $a
022 $a
International Standard Book Number
International Standard Serial Number
44.0 yes
ismn 024 $a Other Standard Identifier with filter for indicator = 2. 0.5 yes
musicid 028 $a Publisher or Distributor Number 7.4 yes
pages 300 $a Same MARC 21 basis as volumes. Number of physical pages, volumes, cassettes, total playing time, etc., of each type of unit. - no
part 830 $v
490 $v
440 $v
773 $g
245 $n

Composite of several part descriptions.

  • Series Added Entry-Uniform Title
  • Series Statement
  • Series Statement - Obsolete
  • Host Item Entry
  • Title Statement
23.5 yes
person 100
700
800
245 $c
  • Personal Name - with filter 'a', 'D', 'q'.
  • Added Entry-Personal Name - with filter 'a', 'D', 'q'.
  • Series Added Entry-Personal Name - with filter 'a', 'c', 'q'.
  • Title Statement of Responsibility
  • 62.9
  • 39.9
  • 0.6
  • 86.2
yes
pubinit 260 $b
264 $b
  • Publication, Distribution, etc. (Imprint)
  • Production, Publication, Distribution, Manufacture, and Copyright Notice
33.5 yes
pubword 260 $b
264 $b
Same MARC 21 basis as pubinit. - no
pubyear 008/07-14 Same MARC 21 basis as exactDate. The year of publication of the bibliographic unit can be derived from this field. - no
scale 034 $b
if not present: 255 $a
  • Coded Cartographic Mathematical Data
  • Cartographic Mathematical Data
0.4 yes
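The fallback rule for scale (034 $b, and 255 $a if 034 $b is not present) could look like the sketch below. The record layout, a mapping from tag to a subfield dictionary, is a hypothetical simplification, and the sample values are made up.

```python
from typing import Dict, Optional


def extract_scale(record: Dict[str, Dict[str, str]]) -> Optional[str]:
    """Return 034 $b if present, otherwise 255 $a, otherwise None."""
    coded = record.get("034", {}).get("b")
    if coded:
        return coded
    # Fall back to the textual scale statement.
    return record.get("255", {}).get("a")


# Hypothetical records, for illustration only.
with_coded = {"034": {"b": "25000"}, "255": {"a": "1:25 000"}}
text_only = {"255": {"a": "1:50 000"}}
no_scale = {"245": {"a": "Some title"}}
```

With these samples, the coded value wins when both fields exist, and a record without either field yields `None`.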
ttlfull 245 $a, $b, $p, $n (if $p not present)
246 $a, $b, $p, $n (if $p not present)
  • Title Statement
  • Varying Form of Title
  • 100
  • 8.7
yes
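One possible reading of the subfield rule "$a, $b, $p, $n (if $p not present)" is that $n is used only when $p is missing. The sketch below implements that reading as an assumption; the function name and the dictionary layout of subfields are hypothetical.

```python
from typing import Dict, List


def title_parts(subfields: Dict[str, str]) -> List[str]:
    """Collect $a, $b and $p; use $n in place of $p only when $p is absent."""
    parts = [subfields.get("a"), subfields.get("b")]
    if "p" in subfields:
        parts.append(subfields["p"])
    else:
        parts.append(subfields.get("n"))
    # Drop subfields that are missing or empty.
    return [part for part in parts if part]


# Made-up title fields, for illustration only.
with_p = title_parts({"a": "Main title", "b": "subtitle", "p": "Part name", "n": "Vol. 2"})
without_p = title_parts({"a": "Main title", "n": "Vol. 2"})
```

Under this reading, `with_p` ignores $n because $p is present, while `without_p` falls back to $n.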
ttlpart 245 $a, $b, $p, $n (if $p not present) Same as Title Statement of ttlfull 100 no
volumes 300 $a Physical Description, subfield Extent : Number of physical pages, volumes, cassettes, total playing time, etc., of each type of unit. 88.0 yes