Hi, Having problems with files with quotes and specials chars like:<

On my install, I solved this issue by changing the source from: <div class="snippe

Theoretically, using unicode types would work with Python 2 (<code class="notranslate"

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Encode problems on parse() function,about sbraz/pymediainfo

Comments (32)

apwelsh commented on May 25, 2024 1

On my install, I solved this issue by changing the source from:

lib.MediaInfoA_Open(handle, filename.encode("utf8"))

to:
lib.MediaInfoA_Open(handle, filename.encode("ascii"))

This is caused when the character is in the upper range (ASCII CODE) and being encoded to UTF-8 translates those into a two byte value. It also works if I drop the encode call completely.

from pymediainfo.

miigotu commented on May 25, 2024 1

Utf-8 is not ASCII compatible.

from pymediainfo.

sbraz commented on May 25, 2024

Theoretically, using unicode types would work with Python 2 (u"/home/dmartin/mounted/Multimedia/01-Incomming/2015-11-17_Instal·lació DUES TEXTIL/20151110_180916.mp4"), however it seems to cause the mediainfo library to return in an empty result. I'll look into it further to try and isolate the issue.
In the meantime, you could just use Python 3 :)

from pymediainfo.

miigotu commented on May 25, 2024

@DaWy I would check two things. Either that string is already bytes, or the file system encoding is not utf-8.

if isinstance(filename, unicode):
    filename = filename.encode(sys.getfilesystemencoding())
else:
    raise Exception("You shoul be using unicode throughout your module until it leaves for filesystem or network, etc")
lib.MediaInfoA_Open(handle, filename, 0)

If it is already bytes and you tell python to encode it to utf-8, it has to first decode it to unicode before encoding to utf-8, and ascii is the default encoding in python2.

Im this you already have a byte string there and that is the cause, look further back at where filename comes from and make sure it is converted to unicode on entry to your program.

If this string is coming via cli argument, then it is already in the proper byte encoding.

from pymediainfo.

sbraz commented on May 25, 2024

@miigotu you probably meant to ping me, not DaWy? The main issue here is that I was not able to make the ctypes code work with Python 2 and accentuated files although it worked with Python 3.

from pymediainfo.

miigotu commented on May 25, 2024

I would suggest using six in that case

I'm guessing the library still has some issues with py2 since parse causes a segfault for some of my users still? =P

from pymediainfo.

sbraz commented on May 25, 2024

IMO developers should be aware of str/unicode. I dropped the dependency on six a while back as there is very little specific code.
As for segfaults, they may occur when using Werkzeug's debugger but that's all I've heard of, please open another bug if you think it is relevant :)

from pymediainfo.

miigotu commented on May 25, 2024

I would, but can't reproduce it myself. It works ok when the user does mediainfo on the file on the command line, but crashes the entire app with pymediainfo.
SiCKRAGE/SiCKRAGE#1445

When I have more info I will let you know.

from pymediainfo.

sbraz commented on May 25, 2024

I don't really understand how this works for you: unicode strings can't be encoded as ASCII:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
I'll wait for feedback from the upstream bug.

from pymediainfo.

apwelsh commented on May 25, 2024

Ascii stings, stores in a wide char are not automatically unicode, at best, it marches up to utf16 version of unicode. The filesystem directory is returning ascii, (or perhaps utf16) but not utf 8. In utf 8 you need two bytes for any characters above 127. When certain ascii characyers are sent to the encode function (128-255) you will get an encode error if the source is not ascii, or utf16.

That being said, i just came across a file that dis not work, and removing encode completely from the argument solved all the encoding errors in my several thousand file scan. So i guess on some operating systems, just remove the encode feature.

Armand
--Sent from iPhone

On Apr 28, 2016, at 12:29 AM, Louis Sautier [email protected] wrote:

I don't really understand how this works for you: unicode strings can't be encoded as ASCII:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)

I'll wait for feedback from the upstream bug.

—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub

from pymediainfo.

DaWy commented on May 25, 2024

I ended doing the same as @apwelsh, changing encoding to UTF-8. Sorry for the late feedback.

from pymediainfo.

sbraz commented on May 25, 2024

I am confused: you changed the filename.encode line to lib.MediaInfoA_Open(handle, filename.encode("ascii"))? This is the same as lib.MediaInfoA_Open(handle, filename) for Python 2 and if the file name contains non-ASCII chars, it will fail.

from pymediainfo.

miigotu commented on May 25, 2024

Or use lib.MediaInfoW_Open and not encode at all...

from pymediainfo.

sbraz commented on May 25, 2024

You mean MediaInfo_Open?

from pymediainfo.

sbraz commented on May 25, 2024

@miigotu @DaWy @apwelsh if any of you managed to make the mediainfo library open a file containing accents (like in this test https://github.com/sbraz/pymediainfo/blob/master/tests/test.py#L79) with Python 2, I'm all ears.

from pymediainfo.

apwelsh commented on May 25, 2024

I will test with an accented filename and report back. I am on Mac though so my results may differ

Armand
--Sent from iPhone

On Apr 28, 2016, at 6:26 AM, Louis Sautier [email protected] wrote:

@miigotu @DaWy @apwelsh if any of you managed to make the mediainfo library open a file containing accents (like in this test https://github.com/sbraz/pymediainfo/blob/master/tests/test.py#L79) with Python 2, I'm all ears.

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

from pymediainfo.

apwelsh commented on May 25, 2024

So, I renamed a file to include à in the filename, and it worked fine after removing the .encode() statement from the open command.

Perhaps because the à is already in utf-8 format by format. Of course, this behavior is according to an OS X system in the US locale.

To ensure I was truly in Unicode, I renamed the same file to include 😇 as one of the filename characters, and it processed it correctly too.

I am stumped since the documentation on all these libraries is very poor. and I am too rusty on my C++ to spend the time working through the logic when it is working now.

On Apr 28, 2016, at 6:26 AM, Louis Sautier [email protected] wrote:

@miigotu https://github.com/miigotu @DaWy https://github.com/DaWy @apwelsh https://github.com/apwelsh if any of you managed to make the mediainfo library open a file containing accents (like in this test https://github.com/sbraz/pymediainfo/blob/master/tests/test.py#L79 https://github.com/sbraz/pymediainfo/blob/master/tests/test.py#L79) with Python 2, I'm all ears.

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub #22 (comment)

from pymediainfo.

miigotu commented on May 25, 2024

MediaInfoA_Open requires acsii.

from pymediainfo.

apwelsh commented on May 25, 2024

That much is understood. But utf-8 is ascii compatible, in that it will generate 8 bit characters w/o any nulls, as opposed to utf-16 which generates a lot of nulls, and thus breaks ascii systems. Without examining the source and the full data flow, we cannot know for sure why on a mac, unicode works fine when passed to the ansi version of open. My suspicion is that mediainfo library relies on the OS for openning the file, and the OS libraries are allowing utf-8 encoded filenames to be resolved correctly. This is why i was clear to point out that I am on a U.S. Locale Mac. Because, this configuration likely results in utf-8 which works fine as-is, but of course will break if you try to encode utf-8 to ascii but will work fine if you leave it alone.

Armand
--Sent from iPhone

On Apr 28, 2016, at 10:50 PM, miigotu [email protected] wrote:

MediaInfoA_Open requires acsii.

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

from pymediainfo.

miigotu commented on May 25, 2024

If you simply assure it is left as Unicode and use MediaInfo_Open() it will work on all systems as long as you cast it with types

from pymediainfo.

apwelsh commented on May 25, 2024

https://tools.ietf.org/html/rfc3629

A null terminated Char array will correctly store a UTF-8 encoded string, it cannot correctly store any other unicode string, because of the dependence on Null characters in the byte array. In this way, a function that receives char* can handle utf-8 encode unicode without prematurely terminating the text parsing methods. This is what UTF-8 was invented for. Because UTF-16 would result in a early termination when parsing a null terminated string if the data was not correctly encoded.
Surely, if the data is processed in mediainfo directly, it would be very likely to cause problems. In the case of a Mac, using utf-8, this is not the case, which is evidence that the open function merely passes the char* or wchar* off to the OS fopen() method accordingly, and is handled correctly.

This is not a statement about best practice, nor what the code ought to do, it is a statement of fact, based on observations on a system, using unicode... Including emoticons, on a filename and verifying the passing results.

Surely, the correct solution is to use the right method for each platform, the same way MediaInfo_DLL.py does. There is no, one solution works on all platforms.

But as i have proven already, UTF-8 is compatible with ASCII (within reason) but not the other way around.

Armand
--Sent from iPhone

On Apr 29, 2016, at 10:08 AM, miigotu [email protected] wrote:

Utf-8 is not ASCII compatible.

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

from pymediainfo.

miigotu commented on May 25, 2024

null termination has nothing to do with the fact that the mediainfo method MediaInfoA cannot handle mbcs, it expects 1 byte per character.

from pymediainfo.

apwelsh commented on May 25, 2024

That depends on what MediaInfoA_Open does with the byte array.

And ignore me as you will, on a Mac, it works just fine. And again, this is not about what should be done for cross platform use, it is a statement about observed effects on this improper use.

Yes, the code should use MediaInfo_Open for windows, dos, os2, ce. For darwin (mac) and linux, they should use MediaInfoA_Open
-- as per the actual MediaInfoDLL source.

This has to do with the source, which is C++, and receives the argument String for Open method.
As such they both are unicode based. The difference in which method to use is based on which OS you are in, as the windows OS represent the data differently than others.
Given your adamant insistence on the matter of Unicode, It is a good bet you are on a Windows system which implements Unicode only in wchar, even though char is unicode compatible. And thus the reason for the two methods in the API.

I have taken the time to look up the source code, and refresh my memory of C++ (which i hate with a passion) to put this matter to rest. When running on any DOS, OS2 or Windows variant, you must use wchar as argument to enable unicode, as these systems ignore unicode on char. When using every other operating system, use char, because this is a String which is of type char, and unicode is always in play.

Armand
--Sent from iPhone

On Apr 29, 2016, at 1:11 PM, miigotu [email protected] wrote:

null termination has nothing to do with the fact that the mediainfo method MediaInfoA cannot handle mbcs

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

from pymediainfo.

sbraz commented on May 25, 2024

@apwelsh MediaInfoDLL still does not work on Python 2 with accentuated file names on Linux, see the upstream issue which I opened. It does not matter if I use the wchar or the char variant, it fails.
I have cleaned up the code and decided to use the wchar_t functions everywhere. I don't understand why MediaInfoDLL does not use them for Linux as they are functional for me both on old and recent Linux versions, as well as Windows 7. Can you try it on Mac OS?

from pymediainfo.

miigotu commented on May 25, 2024

@apwelsh I think you dont realize how much I know what i'm talking about. And no, I use linux, and I program in languages numbering in the upper teens.

@sbraz thats my advice what you just said, it can work on linux with accents if you pass everything just right (unless there is a MAJOR bug upstream).

import ctypes
lib.MediaInfo_Open(handle, ctypes.c_wchar_p(unicode(filename))

lib.MediaInfoA_x is not unicode aware and unicode works fine directly in windows as well as linux
https://mediaarea.net/en-us/MediaInfo/Support/SDK/Quick_Start#Unicode

I wouldnt have an app with 70,000 users translated in like 20 languages if I didnt understand character encoding.

from pymediainfo.

apwelsh commented on May 25, 2024

When using char variant on open, did you pass the filename without using the encode method against the filename? To fix the issue on my i removed the encode statement. This mirrors the MediaInfoDLL.py behavior.

If you read my last comment, the wchar argumented calls are for Microsoft operating systems. On all other systems, it should be char. This too is documented in the MediaInfoDLL.py sample code.

I do have a Linux VM and can test but it will take a day to get my system prepped with all the access it needs for this since only use it for very specialized AWS development purposes (Java, JS, but no python)

Armand
--Sent from iPhone

On Apr 29, 2016, at 3:13 PM, Louis Sautier [email protected] wrote:

@apwelsh MediaInfoDLL still does not work on Python 2 with accentuated file names on Linux, see the upstream issue which I opened. It does not matter if I use the wchar or the char variant, it fails.
I have cleaned up the code and decided to use the wchar_t functions everywhere. I don't understand why MediaInfoDLL does not use them for Linux as they are functional for me both on old and recent Linux versions, as well as Windows 7. Can you try it on Mac OS?

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

from pymediainfo.

miigotu commented on May 25, 2024

All modern operating systems use multibyte encoding. Windows was the LAST one to adopt it.

from pymediainfo.

apwelsh commented on May 25, 2024

Yes. It does. But on wchar, for backward compatibility with the old style codepage from DOS, Windows treats char as codepage mapped and wchar as native unicode. All other operating systems support unicode with char.
The char vs wchar is not an ascii vs unicode issue.. Its a data alignment issue. Only on MS systems is it tied to unicode. The wchar vs char data alignment issue is important to understand, because the OS libraries dictate how data is stored and passed around.

If you take care to define the call (like you added to the functions recently) then python seems to convert the filename to work with wchar. Without it, a seg fault still occurs.

But still remove the .encode("utf8") conversion regardless of the function you use so that mac dont throw a sev fault, and so that the UnicodeDecodeError does not occur. The encode() call is what is causing this error. If you do both, it will work on the mac.

Armand
--Sent from iPhone

On Apr 29, 2016, at 4:57 PM, miigotu [email protected] wrote:

All modern operating systems use multibyte encoding. Windows was the LAST one to adopt it.

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

from pymediainfo.

sbraz commented on May 25, 2024

OK, so if the current code passes tests on Mac OS, Windows and Linux, I think it's time for me to make a new release. This issue will still remain open though as I still run into it on Linux with Python 2 (it works on Windows).

from pymediainfo.

apwelsh commented on May 25, 2024

If the final version uses:

    lib.MediaInfo_Open.argtypes = [c_void_p, c_wchar_p]
    lib.MediaInfo_Open.restype = c_size_t

and:
lib.MediaInfo_Open(handle, filename)

Then, yes. It will work. I think the definition was the missing piece in our first round of tests to make it work on Macs.

On Apr 29, 2016, at 6:13 PM, Louis Sautier [email protected] wrote:

OK, so if the current code passes tests on Mac OS, Windows and Linux, I think it's time for me to make a new release. This issue will still remain open though as I still run into it on Linux with Python 2 (it works on Windows).

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub #22 (comment)

from pymediainfo.

sbraz commented on May 25, 2024

It does, indeed :) https://github.com/sbraz/pymediainfo/blob/master/pymediainfo/__init__.py#L86
I just don't understand why accentuated file names fail on Linux and Python 2 but I'll leave it to upstream as this looks like either a bug in ctypes or in libmediainfo.

from pymediainfo.

apwelsh commented on May 25, 2024

I’m still setting up my linux VM to run some tests. I’ll see what I can find.

On Apr 29, 2016, at 6:19 PM, Louis Sautier [email protected] wrote:

It does, indeed :) https://github.com/sbraz/pymediainfo/blob/master/pymediainfo/__init__.py#L86 https://github.com/sbraz/pymediainfo/blob/master/pymediainfo/__init__.py#L86
I just don't understand why accentuated file names fail on Linux and Python 2 but I'll leave it to upstream as this looks like either a bug in ctypes or in libmediainfo.

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub #22 (comment)

from pymediainfo.

Encode problems on parse() function about pymediainfo HOT 32 CLOSED

Comments (32)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent