Comments (11)

josephalway commented on August 22, 2024
import pymarc as pym

with open('C:\\Users\\MY_USER\\Downloads\\IE001_MIL_EL_00017104\\IE001_MIL_EL_00017104.mrc', 'rb') as fh:
    # force_utf8=True makes pymarc decode as UTF-8 even if the leader does not say so
    reader = pym.MARCReader(fh, to_unicode=True, force_utf8=True)
    for record in reader:
        for field in record.get_fields('020'):
            # get_subfields() returns a (possibly empty) list, so a missing $a is easy to detect
            isbns = field.get_subfields('a')
            if isbns:
                print(isbns[0])
            else:
                print('No ISBN')

I tested your data and found that setting "to_unicode=True, force_utf8=True" when reading the file removes all of the "couldn't find" errors.

From the MARCReader class docstring:

If you find yourself in the unfortunate position of having data that
is utf-8 encoded without the leader set appropriately you can use
the force_utf8 parameter:

    reader = MARCReader(file('file.dat'), to_unicode=True,
        force_utf8=True)

from pymarc.

nemobis commented on August 22, 2024

Ah, the input is UNIMARC. Does this make the report invalid?

edsu commented on August 22, 2024

I didn't think UNIMARC was a problem. Is that the only error you see? Perhaps a codepoint was added to a MARC-8 character set that pymarc doesn't know about yet? It would sadden me a great deal to learn MARC-8 was still being actively developed.

nemobis commented on August 22, 2024

I don't think our data is so advanced! It might just be some control character entered by mistake, because this data endured some funny travels between various platforms.

edsu commented on August 22, 2024

Actually it looks like pymarc doesn't have any character mappings for g0=66 or g1=69.

https://github.com/edsu/pymarc/blob/master/pymarc/marc8_mapping.py

Depending on what you are doing (a one off, or part of a workflow) you might want to consider converting your data to utf-8 with yaz-marcdump and then work with it from python?
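
For a workflow rather than a one-off, the conversion step could be driven from Python. This is only a minimal sketch: the helper names and file paths are hypothetical, it assumes yaz-marcdump is installed and on the PATH, and it uses only the -i, -o, -f and -t options that appear elsewhere in this thread.

```python
import subprocess


def yaz_marcdump_command(source_path, from_charset="marc8"):
    """Build the yaz-marcdump command line (hypothetical helper)."""
    return [
        "yaz-marcdump",
        "-i", "marc",        # input is binary MARC
        "-o", "marc",        # output is binary MARC as well
        "-f", from_charset,  # character set to convert from
        "-t", "utf8",        # character set to convert to
        source_path,
    ]


def convert_to_utf8(source_path, target_path, from_charset="marc8"):
    """Run yaz-marcdump and write its standard output to target_path."""
    with open(target_path, "wb") as out:
        subprocess.run(
            yaz_marcdump_command(source_path, from_charset),
            stdout=out,
            check=True,  # raise CalledProcessError if the tool fails
        )
```

If the converted file still does not flag UTF-8 in its leader, reading it with to_unicode=True, force_utf8=True may still be needed.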

nemobis commented on August 22, 2024

After yaz-marcdump -i marc -f marc8 -t utf8 -o marc I get

  11710 couldn't find 0xaf in g0=66 g1=69
   7844 couldn't find 0x80 in g0=66 g1=69
   3205 couldn't find 0xbf in g0=66 g1=69
   1335 couldn't find 0xca in g0=66 g1=69
   1175 couldn't find 0xa0 in g0=66 g1=69
   1042 couldn't find 0xcc in g0=66 g1=69
    299 couldn't find 0xbb in g0=66 g1=69
    122 couldn't find 0xbe in g0=66 g1=69
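
A tally like this can also be produced in Python from captured stderr output. The sketch below is an illustration only; the function name is hypothetical and the regular expression simply mirrors the warning text shown above.

```python
import re
from collections import Counter

# Matches warnings such as: couldn't find 0xaf in g0=66 g1=69
WARNING = re.compile(r"couldn't find (0x[0-9a-f]+) in g0=(\d+) g1=(\d+)")


def tally_missing_code_points(lines):
    """Count how often each unmapped code point is reported."""
    counts = Counter()
    for line in lines:
        match = WARNING.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "couldn't find 0xaf in g0=66 g1=69",
        "couldn't find 0x80 in g0=66 g1=69",
        "couldn't find 0xaf in g0=66 g1=69",
    ]
    for code_point, count in tally_missing_code_points(sample).most_common():
        print(count, code_point)
```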

edsu commented on August 22, 2024

That's weird. Why would it be processing MARC-8 if it had been converted to UTF-8?

nemobis commented on August 22, 2024

Smaller test case attached, from http://id.sbn.it/bid/BVE0764705

>>> from pymarc import MARCReader
>>> print(MARCReader(open('BVE0764705.marc21.mrc', 'rb')).next().get_fields('650')[0].subfields[3])
Attività professionale
>>> print(MARCReader(open('BVE0764705.unimarc.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xa0 in g0=66 g1=69
Attivit©  professionale

Note that the UNIMARC record has 0 in Leader/09, neither a space nor "a" (cf. https://www.loc.gov/marc/bibliographic/bdleader.html ).
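
The Leader/09 check can be done on the raw bytes without any MARC library. This is a rough sketch under the MARC 21 definition linked above (blank means MARC-8, "a" means UCS/Unicode); the function name is hypothetical, and a value such as the 0 seen here is simply reported as undefined.

```python
def leader_coding_scheme(raw_record):
    """Interpret Leader/09 of a raw MARC record according to MARC 21."""
    if len(raw_record) < 24:
        raise ValueError("a MARC leader is 24 bytes long")
    # Leader positions are zero-based, so Leader/09 is the tenth byte.
    flag = raw_record[9:10].decode("ascii", errors="replace")
    if flag == " ":
        return "MARC-8"
    if flag == "a":
        return "UCS/Unicode"
    return "undefined (%r)" % flag


if __name__ == "__main__":
    leader = bytearray(b" " * 24)
    leader[9] = ord("0")  # the value reported for the UNIMARC file
    print(leader_coding_scheme(bytes(leader)))
```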

None of the yaz-marcdump conversion options which do something seem to help:

$ for code in iso5426 iso8859-1 marc8; do yaz-marcdump -i marc -o marc -t utf8 -f $code BVE0764705.unimarc.mrc > BVE0764705.unimarc.$code.mrc.new ; done
$ python 
>>> print(MARCReader(open('BVE0764705.unimarc.iso5426.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xcc in g0=66 g1=69
Attivit  professionale
>>> print(MARCReader(open('BVE0764705.unimarc.marc8.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
Attivit℗♭ professionale
>>> print(MARCReader(open('BVE0764705.unimarc.iso8859-1.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xa0 in g0=66 g1=69
Attivit©℗  professionale

The yaz-marcdump default conversion to UTF-8 appears correct in itself (cf. https://lists.uni-bielefeld.de/mailman2/unibi/public/librecat-dev/2017-January/000175.html):

$ yaz-marcdump -o marcxml -t utf8 BVE0764705.unimarc.mrc | grep 606 -A 4
  <datafield tag="606" ind1=" " ind2=" ">
    <subfield code="a">Operatori turistici</subfield>
    <subfield code="x">Attività professionale</subfield>
    <subfield code="2">FN </subfield>
    <subfield code="3">IT\ICCU\MILC\267308</subfield>
$ yaz-marcdump -t utf8 BVE0764705.unimarc.mrc | grep 606
606    $a Operatori turistici $x Attività professionale $2 FN  $3 IT\ICCU\MILC\267308
$ yaz-marcdump BVE0764705.unimarc.mrc | grep 606
606    $a Operatori turistici $x Attività professionale $2 FN  $3 IT\ICCU\MILC\267308

Sorry if I'm missing something obvious...

BVE0764705.marc21.mrc.gz
BVE0764705.unimarc.mrc.gz

josephalway commented on August 22, 2024

The obscure warning is coming from lines 135-136 of the marc8.py file.

Generally, this section:

            try:
                if code_point > 0x80 and not mb_flag:
                    (uni, cflag) = marc8_mapping.CODESETS[self.g1][code_point]
                else:
                    (uni, cflag) = marc8_mapping.CODESETS[self.g0][code_point]
            except KeyError:
                try:
                    uni = marc8_mapping.ODD_MAP[code_point]
                    uni_list.append(unichr(uni))
                    # we can short circuit because we know these mappings
                    # won't be involved in combinings.  (i hope?)
                    continue
                except KeyError:
                    pass
                if not self.quiet:
                    sys.stderr.write("couldn't find 0x%x in g0=%s g1=%s\n" %
                        (code_point, self.g0, self.g1))

It's unable to map the character and writes this message to stderr: "couldn't find 0x%x in g0=%s g1=%s\n", with %x replaced by the code point and the two %s by the active G0 and G1 character sets. That is really not helpful if you don't already know what it's doing.

A simple change on line 135 would make the error much more human friendly:
sys.stderr.write("Unable to read character, couldn't find 0x%x in g0=%s g1=%s\n" %

In my case, I was able to correct the single character that was giving me the problem: a math symbol that wasn't being read correctly or had been corrupted.

tfmorris commented on August 22, 2024

> Actually it looks like pymarc doesn't have any character mappings for g0=66 or g1=69.

I think 66 (0x42) and 69 (0x45) are actually the default character sets:

42(hex) [ASCII graphic: B] = Basic Latin (ASCII)
21(hex)45(hex) [ASCII graphics: !E] = Extended Latin (ANSEL) (the 21(hex) technically is a second character of the Intermediate segment of this escape sequence.)

per: https://www.loc.gov/marc/specifications/speccharmarc8.html#field066

Based on the comment above: #114 (comment)
it sounds like the MARC file contains UTF-8 encoded characters.
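
Both observations can be checked with a few lines of Python 3. This is just an illustrative sketch (the helper name is made up): it shows that 66 and 69 are the ASCII codes of "B" and "E", and that "à" is stored in UTF-8 as the bytes 0xc3 0xa0. The second byte matches the 0xa0 in the warning, which is consistent with UTF-8 data being decoded as MARC-8.

```python
def utf8_bytes(text):
    """Return the UTF-8 bytes of text as a readable hex string."""
    return " ".join("0x%02x" % byte for byte in text.encode("utf-8"))


G0, G1 = 66, 69  # the values printed in the warning

# The g0/g1 numbers are the decimal ASCII codes of the set designators.
print(chr(G0), hex(G0))  # B 0x42
print(chr(G1), hex(G1))  # E 0x45

# "à" occupies two bytes in UTF-8; a MARC-8 decoder sees them one by one.
print(utf8_bytes("à"))  # 0xc3 0xa0
```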
