Difference between revisions of "Features Deduplication"

From swissbib
(20 intermediate revisions by the same user not shown)
Line 3: Line 3:
  {| class="wikitable"
  |-
- ! Feature !! Origin in [https://www.loc.gov/marc/bibliographic/ MARC 21] !! Meaning / Description / Range of Values !! Degree of Filling [%] !! Relevant for Training !! Remarks / Requirements / Questions
+ ! Feature !! Origin in [https://www.loc.gov/marc/bibliographic/ MARC 21] !! Description / Range of Values !! Degree of Filling [%] !! Relevant for Training !! Observation / Requirement / Question
  |-
  | 035liste || [https://www.loc.gov/marc/bibliographic/bd035.html 035 $a] || List of identifiers for bibliographic unit
  # [https://www.oclc.org/en/worldcat.html OCLC] identifier of [https://en.wikipedia.org/wiki/WorldCat worldcat]
  # Unique identifier of originating library
- || 100 || No || A record may have more than one unique identifier of the originating library.
+ || 100 || no || A record may have more than one unique identifier of the originating library.
  |-
  | century || [https://www.loc.gov/marc/bibliographic/concise/bd008a.html 008/07-10] || Year of origin of the bibliographic unit
  * Number digits
- * 'u' : date element is totally or partially unknown
+ * 'u' : digit of date element is unknown
  ||
  * 100
  * 4 (fully unknown)
- || Yes || -
+ || yes || -
  |-
- | coordinate || 034 $d, $f || Geographical coordinate; relevant for maps, only || 0.4 || Yes || -
+ | coordinate || 034 $d, $f || Geographical coordinate; relevant for maps, only || 0.4 || yes || -
  |-
- | corporate || 110 $a, $b, $c<br>710 $a, $b, $c<br>810 $a, $b, $c || - || 18 || - || -
+ | corporate || [https://www.loc.gov/marc/bibliographic/bd110.html 110 $a, $b, $c]<br>[https://www.loc.gov/marc/bibliographic/bd710.html 710 $a, $b, $c]<br>[https://www.loc.gov/marc/bibliographic/bd810.html 810 $a, $b, $c] ||
+ * Main Entry-Corporate Name
+ * Added Entry-Corporate Name
+ * Series Added Entry-Corporate Name
+ ||
+ * 6.2
+ * 12.6
+ * 0.0
+ || yes || How to identify the most relevant field?
  |-
- | decade || 008/07-10 || - || Same as century || - || -
+ | decade || 008/07-10 || Same as century || - || - || -
  |-
- | docid || - || Unique identifier of Swissbib's DataHub || 100 || No || -
+ | docid || - || Unique identifier of Swissbib's DataHub || 100 || no || -
  |-
- | doi || 024 $a || [https://en.wikipedia.org/wiki/Digital_object_identifier Digital Object Identifier]
- Identifier for scientific online articles
- || 11 || - || -
+ | doi || [https://www.loc.gov/marc/bibliographic/bd024.html 024 $a] ||
+ * Other Standard Identifier
+ * [https://en.wikipedia.org/wiki/Digital_object_identifier Digital Object Identifier], Identifier for scientific online articles
+ || 5.5 || yes || -
  |-
- | edition || 250 $a || - || 14 || - || -
+ | edition || [https://www.loc.gov/marc/bibliographic/bd250.html 250 $a] || Edition statement || 14 || yes || -
  |-
- | exactDate || [https://www.loc.gov/marc/bibliographic/concise/bd008a.html 008/7-14] || First 4 digits are identical to attribute century, last 4 digits give additional date information. ||
+ | exactDate || [https://www.loc.gov/marc/bibliographic/concise/bd008a.html 008/7-14] ||
+ * First 4 digits are identical to attribute century., last 4 digits give additional date information.
+ * Instead of number digits, letter 'u' or ' ' may be given in last 4 digits.
+ ||
  * 100 (in digits [0:4])
  * 13 (in digits [4:])
  || - || -
  |-
- | format || 898 $a || - || 98 || - || [http://www.swissbib.org/wiki/index.php%3Ftitle%3DSwissbib_marc Swissbib marc]
+ | format || 898 $a || Describes format of bibliographical unit.
+ * See [http://www.swissbib.org/wiki/index.php%3Ftitle%3DSwissbib_marc Swissbib marc].
+ * Some records may hold multiple format descriptions.
+ || 98 || yes || -
  |-
  | isbn || 020 $a<br>022 $a || [https://en.wikipedia.org/wiki/International_Standard_Book_Number International Standard Book Number]<br>[https://en.wikipedia.org/wiki/International_Standard_Serial_Number International Standard Serial Number] || 44 || - || -
Line 47: Line 62:
  | musicid || 028 $a || - || 7 || - || -
  |-
- | pages || 300 $a || - || 24 || - || -
+ | pages || 300 $a || Same as volumes || - || - || -
  |-
  | part || 830 $v<br>490 $v<br>440 $v<br>773 $g<br>245 $n || - || Same as pages || - || -
  |-
- | person || 100<br>700<br>800<br>245 $c || - || 90 || - || How do I identify the most relevant field? : 245 $c
+ | person || [https://www.loc.gov/marc/bibliographic/bd100.html 100]<br>[https://www.loc.gov/marc/bibliographic/bd700.html 700]<br>[https://www.loc.gov/marc/bibliographic/bd800.html 800]<br>[https://www.loc.gov/marc/bibliographic/bd245.html 245 $c] ||
+ * Personal Name
+ * Added Entry-Personal Name
+ * Series Added Entry-Personal Name
+ * Title Statement of Responsibility
+ ||
+ * 62.9
+ * 39.9
+ * 0.6
+ * 86.2
+ || yes || How to identify the most relevant field? : 245 $c
  |-
  | pubinit || 260 $b<br>264 $b || - || 34 || - || -
Line 57: Line 82:
  | pubword || 260 $b<br>264 $b || - || Mostly same as pubinit || - || -
  |-
- | pubyear || 008/07-14 || - || Same as exactDate || - || -
+ | pubyear || 008/07-14 || Same as exactDate || - || - || -
  |-
  | scale || 034 $b<br>if not present: 255 $a || - || 0.4 || - || -
Line 63: Line 88:
  | source || 035 || - || - || - || Is missing
  |-
- | ttlfull || 245 $a, $b, $p, $n (if $p not present)<br>246 $a, $b, $p, $n (if $p not present) || - || 100 || - || -
+ | ttlfull || [https://www.loc.gov/marc/bibliographic/bd245.html 245 $a, $b, $p, $n] (if $p not present)<br>[https://www.loc.gov/marc/bibliographic/bd246.html 246 $a, $b, $p, $n] (if $p not present) ||
+ * Title Statement
+ * Varying Form of Title
+ ||
+ * 100
+ * 8.7
+ || yes || -
  |-
  | ttlpart || 245 $a, $b, $p, $n (if $p not present) || - || 100 || - || -
  |-
- | volumes || [https://www.loc.gov/marc/bibliographic/bd300.html 300 $a] || Physical Description, subfield Extent : Number of physical pages, volumes, cassettes, total playing time, etc., of each type of unit. || 88 || - || -
+ | volumes || [https://www.loc.gov/marc/bibliographic/bd300.html 300 $a] || Physical Description, subfield Extent : Number of physical pages, volumes, cassettes, total playing time, etc., of each type of unit. || 88 || yes || -
  |}

Revision as of 21:33, 12 January 2020

The attributes described below are an extract of Swissbib's full MARC 21 data set used for deduplication with machine learning.

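{| class="wikitable"
|-
! Feature !! Origin in [https://www.loc.gov/marc/bibliographic/ MARC 21] !! Description / Range of Values !! Degree of Filling [%] !! Relevant for Training !! Observation / Requirement / Question
|-
| 035liste || [https://www.loc.gov/marc/bibliographic/bd035.html 035 $a] || List of identifiers for the bibliographic unit
# [https://www.oclc.org/en/worldcat.html OCLC] identifier of [https://en.wikipedia.org/wiki/WorldCat WorldCat]
# Unique identifier of the originating library
|| 100 || no || A record may have more than one unique identifier of the originating library.
|-
| century || [https://www.loc.gov/marc/bibliographic/concise/bd008a.html 008/07-10] || Year of origin of the bibliographic unit
* Numeric digits
* 'u' : digit of the date element is unknown
||
* 100
* 4 (fully unknown)
|| yes || -
|-
| coordinate || 034 $d, $f || Geographical coordinate; relevant for maps only || 0.4 || yes || -
|-
| corporate || [https://www.loc.gov/marc/bibliographic/bd110.html 110 $a, $b, $c]<br>[https://www.loc.gov/marc/bibliographic/bd710.html 710 $a, $b, $c]<br>[https://www.loc.gov/marc/bibliographic/bd810.html 810 $a, $b, $c] ||
* Main Entry-Corporate Name
* Added Entry-Corporate Name
* Series Added Entry-Corporate Name
||
* 6.2
* 12.6
* 0.0
|| yes || How to identify the most relevant field?
|-
| decade || 008/07-10 || Same as century || - || - || -
|-
| docid || - || Unique identifier of Swissbib's DataHub || 100 || no || -
|-
| doi || [https://www.loc.gov/marc/bibliographic/bd024.html 024 $a] ||
* Other Standard Identifier
* [https://en.wikipedia.org/wiki/Digital_object_identifier Digital Object Identifier], identifier for scientific online articles
|| 5.5 || yes || -
|-
| edition || [https://www.loc.gov/marc/bibliographic/bd250.html 250 $a] || Edition statement || 14 || yes || -
|-
| exactDate || [https://www.loc.gov/marc/bibliographic/concise/bd008a.html 008/07-14] ||
* The first 4 digits are identical to the attribute century; the last 4 digits give additional date information.
* Instead of numeric digits, the letter 'u' or ' ' may be given in the last 4 digits.
||
* 100 (in digits [0:4])
* 13 (in digits [4:])
|| - || -
|-
| format || 898 $a || Describes the format of the bibliographic unit.
* See [http://www.swissbib.org/wiki/index.php%3Ftitle%3DSwissbib_marc Swissbib marc].
* Some records may hold multiple format descriptions.
|| 98 || yes || -
|-
| isbn || 020 $a<br>022 $a || [https://en.wikipedia.org/wiki/International_Standard_Book_Number International Standard Book Number]<br>[https://en.wikipedia.org/wiki/International_Standard_Serial_Number International Standard Serial Number] || 44 || - || -
|-
| ismn || 024 $a || - || 11 || - || -
|-
| language || ? || - || - || - || Additionally required
|-
| musicid || 028 $a || - || 7 || - || -
|-
| pages || 300 $a || Same as volumes || - || - || -
|-
| part || 830 $v<br>490 $v<br>440 $v<br>773 $g<br>245 $n || - || Same as pages || - || -
|-
| person || [https://www.loc.gov/marc/bibliographic/bd100.html 100]<br>[https://www.loc.gov/marc/bibliographic/bd700.html 700]<br>[https://www.loc.gov/marc/bibliographic/bd800.html 800]<br>[https://www.loc.gov/marc/bibliographic/bd245.html 245 $c] ||
* Personal Name
* Added Entry-Personal Name
* Series Added Entry-Personal Name
* Title Statement of Responsibility
||
* 62.9
* 39.9
* 0.6
* 86.2
|| yes || How to identify the most relevant field? : 245 $c
|-
| pubinit || 260 $b<br>264 $b || - || 34 || - || -
|-
| pubword || 260 $b<br>264 $b || - || Mostly same as pubinit || - || -
|-
| pubyear || 008/07-14 || Same as exactDate || - || - || -
|-
| scale || 034 $b<br>if not present: 255 $a || - || 0.4 || - || -
|-
| source || 035 || - || - || - || Is missing
|-
| ttlfull || [https://www.loc.gov/marc/bibliographic/bd245.html 245 $a, $b, $p, $n] (if $p not present)<br>[https://www.loc.gov/marc/bibliographic/bd246.html 246 $a, $b, $p, $n] (if $p not present) ||
* Title Statement
* Varying Form of Title
||
* 100
* 8.7
|| yes || -
|-
| ttlpart || 245 $a, $b, $p, $n (if $p not present) || - || 100 || - || -
|-
| volumes || [https://www.loc.gov/marc/bibliographic/bd300.html 300 $a] || Physical Description, subfield Extent : Number of physical pages, volumes, cassettes, total playing time, etc., of each type of unit. || 88 || yes || -
|}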
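
The date attributes above (century, exactDate) mix numeric digits with the placeholder 'u' and, in the last four positions, blanks. The following is a minimal Python sketch of how such a value might be split and checked before training. The function names and the 8-character input layout are assumptions made for illustration; they are not taken from Swissbib's code.

```python
from typing import Optional, Tuple

UNKNOWN = "u"  # placeholder for an unknown digit, as listed for 'century'


def split_exact_date(exact_date: str) -> Tuple[str, str]:
    """Split an 8-character exactDate value into (first four, last four).

    Assumption: the value holds the 8 characters taken from 008/07-14.
    """
    if len(exact_date) != 8:
        raise ValueError(f"expected 8 characters, got {len(exact_date)}")
    return exact_date[:4], exact_date[4:]


def year_or_none(four: str) -> Optional[int]:
    """Return the value as an int if all four positions are digits, else None."""
    if len(four) == 4 and four.isascii() and four.isdigit():
        return int(four)
    return None


def is_fully_unknown(four: str) -> bool:
    """True when every position carries the unknown marker, e.g. 'uuuu'."""
    return four == UNKNOWN * 4


# Hypothetical example value: known year, unknown additional date information.
first, last = split_exact_date("1984uuuu")
print(first, year_or_none(first), is_fully_unknown(last))
```

Returning None for partially unknown values such as '19uu' keeps the decision of how to compare them (exact match, wildcard match, or exclusion) outside this helper.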