Normalization

From swissbib

This article needs a refresh - it does not reflect the current state of swissbib.

Import and Normalization

The data imported from local sites into SwissBib is subjected to filtering and normalization rules to ensure data consistency. These rules are explained here.


Import

The import of the data is rule-based at the level of each library network's database. In general, certain types of records should be discarded on import:

  • IDS dummy records, identified by at least one of the following fields (with blank indicators): 920 --, 925 --, 999 --
  • IDS acquisition records, identified by the combination of leader position 17 (encoding level) value 5 or 8 and field 920 with indicators 1-
  • records missing field 245
    • records missing field 245 but containing a parallel title in field 246 (237 records) should be corrected
    • special treatment for records from ETHICS (password protected page) without field 245
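The discard rules above can be sketched as follows. The record model (a leader string plus a mapping from MARC tag to the indicator strings present) and the function name are illustrative assumptions, not the actual SwissBib import code:

```python
def should_discard(record):
    """Return True if a record matches one of the discard rules:
    IDS dummy record, IDS acquisition record, or missing field 245.
    (Illustrative sketch over a simplified record model.)"""
    leader = record["leader"]
    fields = record["fields"]  # MARC tag -> list of indicator strings

    # IDS dummy records: at least one of 920/925/999 with blank
    # indicators ("--").
    if any("--" in fields.get(tag, []) for tag in ("920", "925", "999")):
        return True

    # IDS acquisition records: encoding level (leader position 17)
    # '5' or '8' combined with field 920, indicators "1-".
    if leader[17] in ("5", "8") and "1-" in fields.get("920", []):
        return True

    # Records without a title field 245 (subject to the 246 / ETHICS
    # exceptions noted above, which are not modelled here).
    if "245" not in fields:
        return True

    return False
```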


Normalization

Normalization here means the process of transforming record fields to make them consistent where they otherwise would not be. It is needed to ensure that the record fields of the different sites conform to the standard that the SwissBib Discovery Engine expects.

There are several levels of transformation: for fixed fields, for codes, etc.

Another good example is the ISBN, which recently changed from 10 to 13 digits. In order to get a match, the old 10-digit numbers need to be converted to new 13-digit numbers.


Fields to normalize

  • ISBN: old 10 digit numbers need to be converted to new 13 digit numbers
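As a sketch, the conversion prefixes "978" to the first nine digits of the ISBN-10 and recomputes the check digit under EAN-13 rules (the function name is illustrative):

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a 10-digit ISBN to its 13-digit form:
    prefix '978', drop the old check digit, and append a new
    EAN-13 check digit (alternating weights 1 and 3)."""
    digits = isbn10.replace("-", "")
    core = "978" + digits[:9]  # drop the old ISBN-10 check digit
    total = sum(int(d) * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)
```

For example, `isbn10_to_isbn13("0-306-40615-2")` yields `"9780306406157"`.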


Data Analysis Findings

The September 2008 test export of seven SwissBib sites (all except SBT) was used to analyze the data for normalization and correction requirements.

The following table shows three common irregularities in data structure totaled over the whole SwissBib test data pool. System librarians should have a look at the log files under Import DataSources (members only).


                               numbers     percentage
  Total Records                17847245    100 %
  LDR irregularities           49831       0.27920836 %
  first subfield code missing  9767        0.05472553 %
  forbidden characters         25815       0.14464417 %

  Total Inconsistencies        85413       0.47857807 %


The irregularities have different implications for SwissBib as well as for the local systems. They will be corrected during import on the SwissBib side. Nevertheless, some of the cases should be solved at the local level, as they affect indexing or presentation in the local system's OPAC.

On request, SwissBib will be happy to deliver lists of the affected system numbers to the local networks.


[Figure: Irregularities 20081204.png]


LDR irregularities

LDR irregularities make up the biggest share by numbers. Very common are missing zeroes at positions 22 and 23 as well as missing numbers at positions 10 and 11. Very common for dummy records is a missing position 17 (encoding level), which implicitly declares them full-level records, i.e. records of the highest quality.
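A minimal check for the leader irregularities mentioned here might look as follows. In MARC 21 the leader is 24 characters long, positions 10-11 are fixed to "22" (indicator count and subfield code count), and positions 22-23 (part of the entry map) must be "00"; the exact rule set below is an illustrative assumption, not the actual SwissBib correction code:

```python
def check_leader(leader: str) -> list:
    """Return a list of LDR irregularities of the kinds
    described above (illustrative sketch)."""
    problems = []
    if len(leader) != 24:
        problems.append("leader is not 24 characters long")
        return problems
    # Positions 10-11 are fixed to '22' in MARC 21.
    if leader[10:12] != "22":
        problems.append("missing numbers at positions 10 and 11")
    # Positions 22-23 belong to the entry map and must be '00'.
    if leader[22:24] != "00":
        problems.append("missing zeroes at positions 22 and 23")
    # Position 17 (encoding level): blank means full level, so a
    # dummy record lacking a value passes as "highest quality".
    if leader[17] not in " 1234578uz":
        problems.append("invalid encoding level at position 17")
    return problems
```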


first subfield code missing

This is pressing for most Aleph local systems, as these codes determine inclusion in indexing as well as presentation in the OPAC. Certain fields are easy to fix, as their first subfield is always "a". The numbers are not that big in most networks.
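For those fields whose first subfield is known to always be "a", a repair can be sketched as follows, assuming subfields are modelled as (code, value) pairs with `None` for a missing code (an illustrative model, not Aleph's internal representation):

```python
def fix_first_subfield(subfields, default_code="a"):
    """If the first subfield of a field has no code, assign it the
    default code. Only safe for fields whose first subfield is
    known to always carry that code."""
    if subfields and subfields[0][0] is None:
        # Reattach the value under the default code; leave the
        # remaining subfields untouched.
        return [(default_code, subfields[0][1])] + list(subfields[1:])
    return subfields
```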


forbidden characters

This covers, to mention only the most common cases, control characters in indicator positions and umlauts used as subfield codes.
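A detection sketch for these two cases, assuming indicators may only be digits or blank and subfield codes only ASCII lowercase letters or digits (an illustrative rule set, not the actual SwissBib validation):

```python
import string

def find_forbidden(indicators: str, subfield_codes: str) -> list:
    """Flag the most common forbidden-character cases described
    above: control characters in indicator positions and
    non-ASCII characters (e.g. umlauts) as subfield codes."""
    problems = []
    # Indicators must be digits or blank.
    for ch in indicators:
        if ch not in string.digits + " ":
            problems.append("forbidden indicator character: %r" % ch)
    # Subfield codes must be ASCII lowercase letters or digits.
    for ch in subfield_codes:
        if ch not in string.ascii_lowercase + string.digits:
            problems.append("forbidden subfield code: %r" % ch)
    return problems
```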