Record clustering configuration details

Technical overview

We use Primo's FRBR vector feature to cluster records in search results. FRBR keys are defined locally to determine which records cluster; the data elements used to build those keys are detailed below.

Each key has two parts: a numeric identifier and a title. Keys are created by joining the title and identifier parts in all possible combinations. If either part is missing, no key is created. During clustering, the keys of different records are compared. A record that shares a key with another record is added to the same group. Once a match is found, the system stops searching for further matches, since a record can belong to only one group.
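The key-building and grouping steps described above can be sketched roughly as follows. This is an illustrative sketch only, not Primo's actual implementation: the function names `build_keys` and `assign_groups` are hypothetical, the `title~identifier` layout is taken from the example keys shown later on this page, and the choice to let a group accumulate the keys of every member is an assumption.

```python
from itertools import product


def build_keys(titles, identifiers):
    """Join every title part with every identifier part as 'title~identifier'."""
    # If either list is empty, product() yields nothing, so no key is created.
    return {f"{title}~{ident}" for title, ident in product(titles, identifiers)}


def assign_groups(records):
    """Put each record into the first group that shares at least one key with it.

    `records` is a list of (record_id, keys) pairs. Hypothetical structure.
    """
    groups = []  # each group is {"keys": set of keys, "members": list of ids}
    for record_id, keys in records:
        for group in groups:
            if keys & group["keys"]:
                group["members"].append(record_id)
                group["keys"] |= keys  # assumption: groups accumulate member keys
                break  # stop at the first match: one group per record
        else:
            # No shared key with any existing group: the record starts its own.
            groups.append({"keys": set(keys), "members": [record_id]})
    return groups


long_title = "jama the journal of the american medical association"
print_keys = build_keys([long_title, "jama chicago ill"],
                        ["0098-7484", "(OCoLC)36366429"])
e_keys = build_keys([long_title, "jama"], ["0098-7484"])
groups = assign_groups([("print", print_keys), ("electronic", e_keys)])
print([group["members"] for group in groups])
```

Because keys are plain strings, a single shared key is enough to place two records in the same group, and a record with no identifier or no title never produces a key at all.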

Journals

The key definitions below apply to serial records only (as determined by LDR/07). 

Numeric identifiers evaluated (at least one matching numeric identifier must be present in both records):

  • 022 $a (excluding content after the space if present)
  • 022 $l (excluding content after the space if present)
  • 035 $a if string starts with OCoLC
  • 775 $x (excluding content after the space if present)
  • 776 $x (excluding content after the space if present)
  • 776 $w (only the OCLC number is used)
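As a rough illustration of the identifier rules listed above, the sketch below collects identifiers from a record represented as a list of (tag, subfield code, value) tuples. That representation and the name `serial_identifiers` are hypothetical, chosen only for this example. The OCLC check assumes values look like `(OCoLC)36366429`, as in the example keys on this page.

```python
# Subfields whose value is truncated at the first space.
ISSN_SUBFIELDS = {("022", "a"), ("022", "l"), ("775", "x"), ("776", "x")}
# Subfields that are used only when they carry an OCLC number.
OCLC_SUBFIELDS = {("035", "a"), ("776", "w")}


def serial_identifiers(subfields):
    """Return the identifier parts of a serial record, in order, without duplicates."""
    found = []
    for tag, code, value in subfields:
        value = value.strip()
        if not value:
            continue
        if (tag, code) in ISSN_SUBFIELDS:
            # Drop anything after the first space, e.g. a trailing qualifier.
            found.append(value.split(" ", 1)[0])
        elif (tag, code) in OCLC_SUBFIELDS:
            # Assumption: OCLC numbers are prefixed with "(OCoLC)" or "OCoLC".
            if value.lstrip("(").startswith("OCoLC"):
                found.append(value)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(found))


sample = [
    ("022", "a", "0098-7484 (print)"),  # qualifier after the space is dropped
    ("022", "l", "0098-7484"),          # duplicate of the value above
    ("035", "a", "(OCoLC)1124917"),
    ("035", "a", "(DLC)example"),       # made-up non-OCLC value, ignored
    ("776", "x", "1538-3598"),
    ("776", "w", "(OCoLC)36366429"),
]
print(serial_identifiers(sample))
```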

Titles evaluated (titles are normalized for punctuation and capitalization, and initial articles are removed based on the 2nd indicator); at least one matching title must be present in both records:

  • 245 $a, $b, $n, $p
  • 222 $a
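The title normalization described above might look something like the following sketch. The exact rules Primo applies are not documented here, so the details are assumptions: punctuation is replaced by a space, whitespace is collapsed, and the 2nd indicator is treated as the number of leading characters to skip. The function name `normalize_title` and the raw input strings are illustrative only.

```python
import re


def normalize_title(title, nonfiling=0):
    """Lowercase a title, strip punctuation, and skip the initial article."""
    # Assumption: the 2nd indicator gives the number of leading characters
    # (the initial article plus its trailing space) to ignore.
    text = title[nonfiling:].lower()
    # Assumption: each punctuation mark becomes a space rather than vanishing.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse runs of whitespace left behind by the substitution.
    return " ".join(text.split())


print(normalize_title("JAMA (Chicago, Ill.)"))
print(normalize_title("The Lancet.", nonfiling=4))
```

Under these assumptions the first call yields `jama chicago ill`, which has the same shape as the title part of the example keys on this page.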

The example below compares the record keys for the print record for JAMA with those for the Alma Community Zone record for the e-version of JAMA. In this case, 3 of the keys match (only 1 is needed to cluster the records together).

Record 1 keys (print JAMA) | Record 2 keys (e-JAMA)
jama the journal of the american medical association~0221-7678 |
jama the journal of the american medical association~0211-4445 |
jama the journal of the american medical association~(OCoLC)36366429 | jama the journal of the american medical association~(OCoLC)36366429
jama the journal of the american medical association~(OCoLC)1124917 |
jama the journal of the american medical association~1538-3598 | jama the journal of the american medical association~1538-3598
jama the journal of the american medical association~0098-7484 | jama the journal of the american medical association~0098-7484
jama chicago ill~0098-7484 | jama~0098-7484
jama chicago ill~(OCoLC)36366429 | jama~(OCoLC)36366429
jama chicago ill~(OCoLC)1124917 |
jama chicago ill~0211-4445 |
jama chicago ill~1538-3598 | jama~1538-3598
jama chicago ill~0221-7678 |


To see a record's FRBR keys, open the full record in your browser and view its PNX by adding this to the end of the URL: &showPnx=true

Other formats

Records in any other format that have ISBNs are also clustered. The matching algorithm is the same, except that ISBN fields are used instead of ISSN fields:

  • 020 $a (excluding content after the space if present)
  • 035 $a if string starts with OCoLC
  • 775 $z (excluding content after the space if present)
  • 776 $z (excluding content after the space if present)
  • 776 $w (only the OCLC number is used)
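Since the two rule sets differ only in which fields supply the identifier, the choice between them can be expressed as a small lookup. The sketch below is illustrative: the names are hypothetical, and it assumes that a serial is identified by the value "s" in leader position 07, which this page does not state explicitly.

```python
# Identifier sources for serials (ISSN-based), as (tag, subfield code) pairs.
SERIAL_ID_SOURCES = [
    ("022", "a"), ("022", "l"), ("035", "a"),
    ("775", "x"), ("776", "x"), ("776", "w"),
]
# Identifier sources for other formats (ISBN-based).
OTHER_ID_SOURCES = [
    ("020", "a"), ("035", "a"),
    ("775", "z"), ("776", "z"), ("776", "w"),
]


def identifier_sources(leader):
    """Pick the identifier fields to evaluate, based on LDR/07."""
    # Assumption: LDR/07 == "s" marks a serial; everything else uses ISBN rules.
    return SERIAL_ID_SOURCES if leader[7] == "s" else OTHER_ID_SOURCES


# Hypothetical leader strings: position 07 is "s" (serial) and "m" (monograph).
print(identifier_sources("00000cas a2200000 a 4500")[0])
print(identifier_sources("00000cam a2200000 a 4500")[0])
```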