Not all, but a good portion of the records in the associated mrc file, when read, prod


MARCReader obscure warning: couldn't find 0xa0 in g0=66 g1=69 about pymarc HOT 11 OPEN

edsu commented on August 22, 2024

MARCReader obscure warning: couldn't find 0xa0 in g0=66 g1=69

from pymarc.

Comments (11)

josephalway commented on August 22, 2024

import pymarc as pym

with open('C:\\Users\\MY_USER\\Downloads\\IE001_MIL_EL_00017104\\IE001_MIL_EL_00017104.mrc', 'rb') as fh:
    # force_utf8=True makes pymarc decode as UTF-8 regardless of Leader/09
    reader = pym.MARCReader(fh, to_unicode=True, force_utf8=True)
    for record in reader:
        for field in record.get_fields('020'):
            # get_subfields returns a list, empty when the field has no $a
            isbns = field.get_subfields('a')
            print(isbns[0] if isbns else 'No ISBN')

I tested your data and found that setting "to_unicode=True, force_utf8=True" when reading the file removes all of the "couldn't find" warnings.

From the MARCReader class docstring:

If you find yourself in the unfortunate position of having data that
is utf-8 encoded without the leader set appropriately you can use
the force_utf8 parameter:

    reader = MARCReader(file('file.dat'), to_unicode=True,
        force_utf8=True)
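
Before reaching for force_utf8, it can help to check whether the raw bytes really are UTF-8. The sketch below uses only the standard library. `looks_like_utf8` is a hypothetical helper, not part of pymarc, and its heuristic (a strict decode succeeds and at least one non-ASCII byte is present) is an assumption rather than a guarantee.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data decodes strictly as UTF-8 and is not pure ASCII."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Pure ASCII is valid in both MARC-8 and UTF-8, so it tells us nothing.
    return any(byte >= 0x80 for byte in data)
```

It could be called on the whole file, for example `looks_like_utf8(open('file.dat', 'rb').read())`. A True result suggests force_utf8 is worth trying; a False result suggests the data is in some other encoding.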


nemobis commented on August 22, 2024

Ah, the input is UNIMARC. Does this make the report invalid?


edsu commented on August 22, 2024

I didn't think UNIMARC was a problem. Is that the only error you see? Perhaps a codepoint was added to a MARC-8 character set that pymarc doesn't know about yet? It would sadden me a great deal to learn MARC-8 was still being actively developed.


nemobis commented on August 22, 2024

I don't think our data is so advanced! It might just be some control character entered by mistake, because this data endured some funny travels between various platforms.


edsu commented on August 22, 2024

Actually it looks like pymarc doesn't have any character mappings for g0=66 or g1=69.

https://github.com/edsu/pymarc/blob/master/pymarc/marc8_mapping.py

Depending on what you are doing (a one-off, or part of a workflow) you might want to consider converting your data to UTF-8 with yaz-marcdump and then working with it from Python?


nemobis commented on August 22, 2024

Ed Summers, 13/03/2018 23:24:

Depending on what you are doing (a one off, or part of a workflow) you might want to consider converting your data to utf-8 with yaz-marcdump <https://software.indexdata.com/yaz/doc/yaz-marcdump.html> and then work with it from python?

Thank you a lot for the suggestion. It's been a while since I last used yaz so I had neglected to consider it. I'll let you know how it goes (if it's relevant for this report; feel free to close as invalid!).


nemobis commented on August 22, 2024

After yaz-marcdump -i marc -f marc8 -t utf8 -o marc I get

  11710 couldn't find 0xaf in g0=66 g1=69
   7844 couldn't find 0x80 in g0=66 g1=69
   3205 couldn't find 0xbf in g0=66 g1=69
   1335 couldn't find 0xca in g0=66 g1=69
   1175 couldn't find 0xa0 in g0=66 g1=69
   1042 couldn't find 0xcc in g0=66 g1=69
    299 couldn't find 0xbb in g0=66 g1=69
    122 couldn't find 0xbe in g0=66 g1=69
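
A tally like the one above can be produced by counting identical warning lines. This is a minimal sketch, assuming the warnings were captured as lines of text (for example by redirecting stderr to a file). `tally_warnings` is a hypothetical helper, not something pymarc provides.

```python
from collections import Counter


def tally_warnings(lines):
    """Count identical "couldn't find" warnings, most frequent first."""
    stripped = (line.strip() for line in lines)
    counts = Counter(s for s in stripped if s.startswith("couldn't find"))
    return counts.most_common()
```

Passing an open text file works, since iterating over a file yields its lines.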


edsu commented on August 22, 2024

That's weird. Why would it be processing MARC-8 if it had been converted to UTF-8?


nemobis commented on August 22, 2024

Smaller test case attached, from http://id.sbn.it/bid/BVE0764705

>>> from pymarc import MARCReader
>>> print(MARCReader(open('BVE0764705.marc21.mrc', 'rb')).next().get_fields('650')[0].subfields[3])
Attività professionale
>>> print(MARCReader(open('BVE0764705.unimarc.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xa0 in g0=66 g1=69
Attivit©  professionale

Note that the UNIMARC record has 0 in Leader/09, which in MARC 21 would be either a space (MARC-8) or "a" (UCS/Unicode) (cf. https://www.loc.gov/marc/bibliographic/bdleader.html ).
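
Leader/09 can be inspected without any MARC library, since the leader is the first 24 bytes of a record. A minimal sketch, assuming `record` holds the raw bytes of a single record; `leader_coding_scheme` is a hypothetical helper name.

```python
def leader_coding_scheme(record: bytes) -> str:
    """Return Leader/09 (the character coding scheme byte) of a raw MARC record."""
    if len(record) < 24:
        raise ValueError("a MARC leader is 24 bytes long")
    # Positions are zero-based, so Leader/09 is simply index 9.
    return chr(record[9])
```

For the first record in a file, `leader_coding_scheme(open('BVE0764705.unimarc.mrc', 'rb').read(24))` would be enough.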

None of the yaz-marcdump conversion options which do something seem to help:

$ for code in iso5426 iso8859-1 marc8; do yaz-marcdump -i marc -o marc -t utf8 -f $code BVE0764705.unimarc.mrc > BVE0764705.unimarc.$code.mrc.new ; done
$ python 
>>> print(MARCReader(open('BVE0764705.unimarc.iso5426.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xcc in g0=66 g1=69
Attivit  professionale
>>> print(MARCReader(open('BVE0764705.unimarc.marc8.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
Attivit℗♭ professionale
>>> print(MARCReader(open('BVE0764705.unimarc.iso8859-1.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xa0 in g0=66 g1=69
Attivit©℗  professionale

The yaz-marcdump default conversion to UTF-8 appears correct in itself (cf. https://lists.uni-bielefeld.de/mailman2/unibi/public/librecat-dev/2017-January/000175.html):

$ yaz-marcdump -o marcxml -t utf8 BVE0764705.unimarc.mrc | grep 606 -A 4
  <datafield tag="606" ind1=" " ind2=" ">
    <subfield code="a">Operatori turistici</subfield>
    <subfield code="x">Attività professionale</subfield>
    <subfield code="2">FN </subfield>
    <subfield code="3">IT\ICCU\MILC\267308</subfield>
$ yaz-marcdump -t utf8 BVE0764705.unimarc.mrc | grep 606
606    $a Operatori turistici $x Attività professionale $2 FN  $3 IT\ICCU\MILC\267308
$ yaz-marcdump BVE0764705.unimarc.mrc | grep 606
606    $a Operatori turistici $x Attività professionale $2 FN  $3 IT\ICCU\MILC\267308

Sorry if I'm missing something obvious...

BVE0764705.marc21.mrc.gz
BVE0764705.unimarc.mrc.gz


josephalway commented on August 22, 2024

The obscure warning is coming from lines 135-136 of the marc8.py file.

Generally, this section:

            try:
                if code_point > 0x80 and not mb_flag:
                    (uni, cflag) = marc8_mapping.CODESETS[self.g1][code_point]
                else:
                    (uni, cflag) = marc8_mapping.CODESETS[self.g0][code_point]
            except KeyError:
                try:
                    uni = marc8_mapping.ODD_MAP[code_point]
                    uni_list.append(unichr(uni))
                    # we can short circuit because we know these mappings
                    # won't be involved in combinings.  (i hope?)
                    continue
                except KeyError:
                    pass
                if not self.quiet:
                    sys.stderr.write("couldn't find 0x%x in g0=%s g1=%s\n" %
                        (code_point, self.g0, self.g1))

When the decoder can't map a byte, it writes "couldn't find 0x%x in g0=%s g1=%s\n" to stderr, with %x replaced by the code point and the two %s by the current G0 and G1 character sets. That is not very helpful if you don't already know what the code is doing.

A simple change on line 135 would make the error much more human friendly:
sys.stderr.write("Unable to read character, couldn't find 0x%x in g0=%s g1=%s\n" %

In my case, I was able to correct the single character that was giving me the problem: a math symbol that wasn't being read correctly or had been corrupted.


tfmorris commented on August 22, 2024

Actually it looks like pymarc doesn't have any character mappings for g0=66 or g1=69.

I think 66 (0x42) and 69 (0x45) are actually the default character sets:

42(hex) [ASCII graphic: B] = Basic Latin (ASCII)
21(hex)45(hex) [ASCII graphics: !E] = Extended Latin (ANSEL) (the 21(hex) technically is a second character of the Intermediate segment of this escape sequence.)

per: https://www.loc.gov/marc/specifications/speccharmarc8.html#field066
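
The g0/g1 numbers in the warning are decimal, which hides the fact that they are the escape-sequence final bytes listed above. As a sketch only, a friendlier message could spell them out. `CHARSET_NAMES`, `describe_charset` and `friendly_warning` are hypothetical names, and the table covers just the two sets named in this comment, not everything pymarc knows about.

```python
# Only the two default sets discussed above; any other value is reported as unknown.
CHARSET_NAMES = {
    0x42: "Basic Latin (ASCII)",
    0x45: "Extended Latin (ANSEL)",
}


def describe_charset(code: int) -> str:
    """Render a g0/g1 value as decimal, hex, ASCII graphic and name."""
    name = CHARSET_NAMES.get(code, "unknown character set")
    return f"{code} (0x{code:02x}, ASCII graphic {chr(code)!r}): {name}"


def friendly_warning(code_point: int, g0: int, g1: int) -> str:
    """Build a more descriptive variant of the "couldn't find" message."""
    return (
        f"Unable to read character: couldn't find 0x{code_point:x} in "
        f"g0={describe_charset(g0)}, g1={describe_charset(g1)}"
    )
```

For the warning in this report, `friendly_warning(0xA0, 66, 69)` would name both sets instead of printing bare numbers.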

Based on the comment above: #114 (comment)
it sounds like the MARC file contains UTF-8 encoded characters.
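
That reading is consistent with the garbled output earlier in the thread. The sketch below encodes a string as UTF-8 and then decodes each byte as if it were MARC-8. `ANSEL_SUBSET` is a hand-copied, three-entry illustration rather than pymarc's mapping table, `misread_as_marc8` is a hypothetical helper, and replacing unmapped bytes with a space is an assumption based on the transcripts above.

```python
# Three ANSEL (MARC-8 Extended Latin) code points, copied by hand for this demo.
ANSEL_SUBSET = {
    0xA9: "\u266d",  # musical flat sign
    0xC2: "\u2117",  # sound recording copyright
    0xC3: "\u00a9",  # copyright sign
}


def misread_as_marc8(text: str) -> str:
    """Encode text as UTF-8, then decode byte by byte as if it were MARC-8."""
    out = []
    for byte in text.encode("utf-8"):
        if byte < 0x80:
            out.append(chr(byte))  # the ASCII range is shared by both encodings
        elif byte in ANSEL_SUBSET:
            out.append(ANSEL_SUBSET[byte])
        else:
            out.append(" ")  # unmapped byte: this is where a warning would appear
    return "".join(out)
```

With this, "à" (UTF-8 bytes 0xc3 0xa0) comes out as "©" followed by a space, with 0xa0 being the byte that has no mapping. Feeding the resulting "©" (UTF-8 bytes 0xc2 0xa9) through a second time yields "℗♭", which matches the output seen after the marc8 conversion.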

