h2non / filetype.py

Small, dependency-free, fast Python package to infer binary file types checking the magic numbers signature

Home Page: https://h2non.github.io/filetype.py

License: MIT License

Makefile 1.67% Python 98.33%

magic-numbers filetype python mime extension type inference

filetype.py's Issues

1.0.7 release fixes

The 1.0.7 release tarball is missing the sample.tar file used in test_infer_zip_from_disk and test_infer_tar_from_disk.
Also, the History.md contents end at version 1.0.5.

Price-matching other repos

These are Python repos with lots of file signatures that might not have been covered by filetype.py

https://github.com/floyernick/fleep-py/blob/master/fleep/data.json (193 stars)
https://github.com/h2non/filetype.py/tree/master/filetype/types (this repo)
https://github.com/openpreserve/fido/blob/master/fido/conf/format_extensions.xml (79 stars)
https://github.com/cdgriffith/puremagic/blob/master/puremagic/magic_data.json (47 stars)
https://github.com/omriher/Whatype/blob/master/whatype/magics.csv (12 stars)
https://github.com/schlerp/pyfsig/blob/master/src/pyfsig/file_signatures.py (9 stars)
https://github.com/7h3rAm/cigma/blob/master/cigma/magicbytes.json (1 star)

AttributeError: 'function' object has no attribute 'archive'

FYI

import filetype
import gzip

>>> filetype.helpers.is_archive(gzip.compress(b'test'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/dist-packages/filetype/helpers.py", line 74, in is_archive
    return match.archive(obj) is not None
AttributeError: 'function' object has no attribute 'archive'

There seems to be a naming conflict when the helpers are used directly: inside helpers.py the name match resolves to the function filetype.match.match rather than to the module filetype.match, so match.archive fails.

Going around the helpers works though:

import gzip
from filetype.match import archive

>>> bool(archive(gzip.compress(b'test')))
True

Unable to detect plain text

filetype.guess on a plain text file always returns None.

None returned for plain text files

I've read #30, but I find myself having to do something like this (pardon the comments, it's a direct copy-paste):

    def is_mimetype_family(self, want_family):
        our_type = filetype.guess(self.path)

        if our_type is None:
            # sometimes, filetype fails horribly
            # # like with text files. works great for images though
            type, encoding = mimetypes.guess_type(self.path)
            if type is None:
                return False
            if re.match(want_family, type) is not None:
                return True
            return False
        if re.match(want_family, our_type.mime) is not None:
            return True
        return False

It's a little counter-intuitive. Perhaps, you could utilize mimetypes as I have done.

I expected filetype to be a "one-stop shop" for mimetypes. Nevertheless, it is an excellent library for detecting mimetypes.

pip install filetype doesn't work: ImportError: No module named filetype

Your code is great and it's helping me so much.
However, I have to flag a problem: pip install filetype doesn't work, and your test code keeps failing with:

Traceback (most recent call last):
  File "TestFiletype.py", line 4, in <module>
    import filetype
ImportError: No module named filetype

even though filetype is reported as correctly installed.

So I had to download your setup.py file but it wasn't installing because 'README.rst' was missing:

...really, do I need 'README.rst' in order to install your repository?
So I downloaded the complete .zip and filetype installed correctly and now everything works.

Being able to install filetype with pip install filetype would be amazing.

Thank you, and keep up the good code.

[Feature request] Accept os.PathLike

Currently filetype.guess does not accept PathLike objects (see https://docs.python.org/3/library/os.html#os.PathLike) as used by pathlib.Path. It would be nice if this was possible.
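
A minimal sketch of how a caller could work around this today, assuming Python 3.6+ where os.fspath and os.PathLike exist. The helper name normalize_input is hypothetical and not part of filetype; it only converts path-like objects to plain paths and passes everything else (bytes, file objects) through unchanged.

```python
import os
from pathlib import Path


def normalize_input(obj):
    """Return a plain str/bytes path for path-like objects; pass anything else through."""
    if isinstance(obj, os.PathLike):
        return os.fspath(obj)
    return obj


# A pathlib.Path becomes an ordinary string path.
print(normalize_input(Path("photos") / "sample.jpg"))
# Raw bytes are left alone, so buffers can still be passed on unchanged.
print(normalize_input(b"\xff\xd8\xff"))
```

The result could then be handed to filetype.guess, which the issue says accepts plain paths but not Path objects.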

Any way to feed .guess() with bytes instead of a file ?

This would make it possible to check the content of files in form data received from the front end.

File does not evaluate mp4 properly.

This signature returns "None" filetype:

>>> kind = filetype.utils.get_signature_bytes('HDVWM419.mp4')
>>> print(kind)
bytearray(b"\x00\x00\x00 ftypisom\x00\x00\x02\x00isomiso2avc1mp41\x00\x00\x00\x08free\x8b_\x04\xf6mdat\x00\x01\'xe\xb8\x04_\xdb\xb3`+@\x85a)>\x10\x88\xebv{\x1ec(\x96(\xc9W^\x8a\\\xa7\xaaB$\xfaz\xb1\x8c&\xca\t\xa9\x04\x95\xb7\x87P\xf2\xaew~,\x8f\xa2\xda>\xe6\xe4\n$&\x9dm\x19\xeb4\xbd0\x00\xc6\x91\xf0\xb0\x85\x0f\xab<\x04\xf5\xe00\xaa\rm\xdc\xa6<a\x08\xcf\x8c\\\x0f\x18)\xdd\xc7\x8e\n\xd6\xd7\xd7\x05\x0fdPj\x15\x1f\xc5H\xd4\x98\x0cx\xce\xb9\xa7\xa8\t\xea\x8d\xe1\xb7\xe2F\x8fQoD\xadKT{\xc9D\xcapZ\xb8\xa2\xeez\xbd\xab\x9e7\x9a\xf7G\xbe/\xbdQ>P\xf6\xa3f\xdc\x17\xfb\xcb\x9c\x9a\x14\x06\xd4J\xb2\xe2\x15\x05\xda\xc5oL\x0b\xbd!\xb7>-\xe2\xb6\xda\x8bi\xab\x8c\xe3\xc1\xa7\x82c\x83\x93\x17$\xd9\xa8zM\xe4@Q\xab\\\xc5\xb4<\x04")

file HDVWM419.mp4
HDVWM419.mp4: ISO Media, MP4 Base Media v1 [IS0 14496-12:2003]


    Stream #0:0(eng): Video: h264 (High) (avc1 / 0x31637661), yuv420p(tv, GBR), 1920x1080 [SAR 1:1 DAR 16:9], 12168 kb/s, 30 fps, 30 tbr, 30 tbn, 60

I guess the issue is with using the correct magic file or with the metadata evaluation. I looked at your source code, but I'm not sure how to get it to see this as an MP4.

Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    creation_time   : 2016-09-19T13:43:30.000000Z
    encoder         : Lavf51.12.1

Videos with metadata that matches:
  Metadata:
    major_brand     : mp42
    minor_version   : 1
    compatible_brands: mp42mp41

Is there a way to read more than the first 256 Bytes?

I am trying to pass a long array of bytes to the function but the characters that identify the file go beyond 256 bytes.

Fix pip package installation

Cleanup Makefile and use tox

The current makefile destroys my git repo refs on make clean.

I can do a PR

Add command line

Only the first 261 bytes, representing the maximum file header, are required, so you can just pass a list of bytes

Can you link to an example?

Support SVG images

Hi!

We have a use case to detect images including SVG. This library is perfect except for missing the SVG format. I'm happy to make a PR in the next few days if that's alright.

Edit with sample: https://upload.wikimedia.org/wikipedia/commons/0/02/SVG_logo.svg

Check file type from request data

I am trying to get the file type from request data and then save the file, but the saved file is unusable if I call filetype.guess( audio_file ) first.

I also tried saving the data first, reading the saved file back, and saving it again after checking the file type; the resulting file is unusable too.

How can I solve this problem?

Get the file type from request data and save it:

audio_file = request.files['data']
kind = filetype.guess( audio_file )

if kind is not None :
    file_type = kind.extension
    wav_path = os.path.join( current_app.config['UPLOAD_FOLDER'], 'audio.'+file_type )

Save the data first, read it, and save it again:

audio_file = request.files['data']
tmp_path = os.path.join( current_app.config['UPLOAD_FOLDER'], "upload_audio.tmp" )
audio_file.save( tmp_path )

tmp_file = open( tmp_path , 'rb' )
tmp_data = tmp_file.read()
kind = filetype.guess( tmp_data )
tmp_file.close()

if kind is not None :
    file_type = kind.extension
    wav_path = os.path.join( current_app.config['UPLOAD_FOLDER'], wav_id+'.'+file_type )
    wav_file = open( wav_path , 'wb' )
    wav_file.write( tmp_data )
    wav_file.close()

Incorrect handling of CR2 files

Hello. I have a problem when trying to process CR2 files: filetype recognizes them as both the tiff and the cr2 type. That is not surprising, since CR2 is based on TIFF.

Filetype version 1.0.7
Sample code:

from filetype.types.image import Tiff, Cr2
from filetype import match
match("Path to cr2 file", matchers=[Cr2()])
match("Path to cr2 file", matchers=[Tiff()])

Result is:

>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Cr2()])
<filetype.types.image.Cr2 object at 0x0000029F0EF2C9E8>
>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Tiff()])
<filetype.types.image.Tiff object at 0x0000029F0EF2CA20>

Should be:

>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Cr2()])
<filetype.types.image.Cr2 object at 0x0000029F0EF2C9E8>
>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Tiff()])

You can get a sample CR2 file here.
I think that to solve this problem we need to add something like and not (buf[8] == 0x43 and buf[9] == 0x52)
here, to make sure that there is no CR2 magic word in the buffer.
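
The proposed exclusion could look roughly like the sketch below. The functions is_cr2 and is_tiff are standalone illustrations, not the library's Tiff and Cr2 matcher classes; the only facts relied on are the two TIFF byte-order signatures and the CR marker at offsets 8 and 9 mentioned in this issue.

```python
TIFF_SIGNATURES = (b"II*\x00", b"MM\x00*")  # little-endian and big-endian TIFF


def is_cr2(buf):
    """CR2: a TIFF signature followed by the bytes 'CR' (0x43 0x52) at offsets 8-9."""
    return (
        len(buf) > 9
        and bytes(buf[:4]) in TIFF_SIGNATURES
        and bytes(buf[8:10]) == b"CR"
    )


def is_tiff(buf):
    """Plain TIFF: a TIFF signature that is not also a CR2 header."""
    return (
        len(buf) > 3
        and bytes(buf[:4]) in TIFF_SIGNATURES
        and not is_cr2(buf)
    )


# Synthetic headers for illustration only (not taken from real files).
cr2_header = b"II*\x00\x10\x00\x00\x00CR\x02\x00"
tiff_header = b"II*\x00\x08\x00\x00\x00\x00\x00\x00\x00"
print(is_cr2(cr2_header), is_tiff(cr2_header))
print(is_cr2(tiff_header), is_tiff(tiff_header))
```

With this ordering a CR2 file matches only the CR2 check, which is the behaviour requested under "Should be" above.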

Use py.test

Switch to pytest

I can do a PR.

Add coveralls support

WebM are not recognized

I have some WebM files and filetype always returns None for them, like this one:
http://video.webmfiles.org/big-buck-bunny_trailer.webm

Use a file signatures table to speed up the file type recognition

I think that pre-building a dict containing all the magic signatures for the file header lookup would be more time-efficient than calling each type object in turn to find the matching file header.
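
As a rough illustration of the idea, the sketch below keeps a small dict of leading signatures and tries the longest prefix first. The table holds only four well-known signatures and the names SIGNATURES and lookup are made up for this example; it is not a proposal for the library's actual data structure.

```python
# Leading magic bytes mapped to MIME types (a deliberately tiny sample).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF8": "image/gif",
    b"%PDF": "application/pdf",
}

# Distinct signature lengths, longest first, so longer matches win.
_LENGTHS = sorted({len(sig) for sig in SIGNATURES}, reverse=True)


def lookup(buf):
    """Return the MIME type whose signature prefixes buf, or None."""
    head = bytes(buf[:_LENGTHS[0]])
    for length in _LENGTHS:
        mime = SIGNATURES.get(head[:length])
        if mime is not None:
            return mime
    return None


print(lookup(b"%PDF-1.7\n"))
print(lookup(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
print(lookup(b"plain text"))
```

One caveat with a pure prefix table: formats whose marker sits at an offset (for example the ftyp tag at byte 4 of an MP4) or that need extra conditions would still require per-type matching logic.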

filetype should add filetype.BYTES_MINIMUM

From https://github.com/h2non/filetype.py/blob/v1.0.5/filetype/utils.py#L3-L18

_NUM_SIGNATURE_BYTES = 262

_NUM_SIGNATURE_BYTES is the number of bytes read from the passed data to determine the signature.
_NUM_SIGNATURE_BYTES could be considered the recommended number of bytes needed for the best signature matching. Some users of filetype may want to know the minimum number of bytes needed before calling any filetype API.
However, the variable is somewhat obscured: filetype.utils._NUM_SIGNATURE_BYTES is an awkward reference, and the leading _ suggests a "private" variable.

_NUM_SIGNATURE_BYTES should be exposed at the root of the filetype package as a Python "constant", e.g. filetype.BYTES_MINIMUM or filetype.BYTES_SUGGESTED.

please add a image type named dcm

There is an image type named dcm (DICOM) that is widely used in radiology. CT and other radiological data are usually recorded in it. I think filetype would be better if dcm were added, because many researchers are now working with this kind of image.

Update PyPI

It seems the latest version is still not updated on PyPI.
Maybe it's time to create a github workflow for automatic upload?

support svg

It doesn't support the SVG format.
What can I do?
Many images are in SVG format now.
You have a Go version, but how can I use it from Python?

pip could not install

λ pip install filetype
Collecting filetype
  Using cached filetype-0.1.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "D:\LOCAL_TEMP\pip-build-605g52_c\filetype\setup.py", line 16, in <module>
        with open(path.join(here, 'README.md'), encoding='utf-8') as f:
      File "C:\Anaconda3\lib\codecs.py", line 895, in open
        file = builtins.open(filename, mode, buffering)
    FileNotFoundError: [Errno 2] No such file or directory: 'D:\\LOCAL_TEMP\\pip-build-605g52_c\\filetype\\README.md'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in D:\LOCAL_TEMP\pip-build-605g52_c\filetype\

I could install from source, but pip did not work.

API design

Based on Go implementation with Python idioms

Full test coverage

Licensing for tests/fixtures/*

Hi @h2non,

I am trying to package filetype.py for Debian. However, the tests/fixtures/* files do not seem to have been originally created by you, and the Debian FTP Masters rejected the package. I can see this, for example, in sample.jpg:

Could you please clarify?

I suggest that you generate all the files yourself and add a specific notice about it.

Regards,

Eriberto

Add more file types: doc,docx,xls,xlsx and open office

Hello,

In the description you say "Pluggable: add new custom type matchers"; by the way, the links don't work.

Do I need to modify your code, or is there a function call or something similar? I don't see any example.

docx is recognized as zip (which it is); I suppose I need a second step to extract the archive and check again?
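
One possible shape for that second step, sketched with the standard library only: once a file is known to be a ZIP, look at its member names for the main part of each Office Open XML format. The helper refine_zip and the marker table are illustrative assumptions, not part of filetype, and the check needs the whole file because the ZIP directory lives at the end of the archive, beyond the first few hundred header bytes.

```python
import io
import zipfile

# Main document part of each Office Open XML format.
OOXML_MARKERS = {
    "word/document.xml": "docx",
    "xl/workbook.xml": "xlsx",
    "ppt/presentation.xml": "pptx",
}


def refine_zip(data):
    """Given the full bytes of a ZIP file, return 'docx', 'xlsx', 'pptx' or 'zip'.

    Returns None if the data is not a readable ZIP archive.
    """
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return None
    for member, extension in OOXML_MARKERS.items():
        if member in names:
            return extension
    return "zip"


# Build a tiny in-memory archive that mimics a .docx layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<document/>")
print(refine_zip(buffer.getvalue()))
```

Legacy .doc and .xls files are not ZIP-based, so they would need a different check than this one.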

Add support for Brotli compression

Brotli is now a widely supported compression type. We should include that too.

Is there any way to recognize whether it the file is a pdf file directly?

Add support for LX4 compression format

It would be great to have the support for the lx4 compression format.
Thanks a lot for making this nice package.

conda install doesn't recognize /opt/conda/bin/python3.6

> ls -lah /opt/conda/bin/python3.6
-rwxrwxr-x 1 root root 3.6M Jun  8  2018 /opt/conda/bin/python3.6

>conda install filetype
...

The following NEW packages will be INSTALLED:
...
    filetype:                   1.0.7-pyh9f0ad1d_0         conda-forge

...


conda install filetype
ERROR conda.core.link:_execute(502): An error occurred while installing package 'conda-forge::filetype-1.0.7-pyh9f0ad1d_0'.
FileNotFoundError(2, "No such file or directory: '/opt/conda/bin/python3.6'")
Attempting to roll back.

Rolling back transaction: done

FileNotFoundError(2, "No such file or directory: '/opt/conda/bin/python3.6'")

AttributeError while guessing image's kind

>>> import filetype
>>> import requests
>>> url = "https://45.img.avito.st/image/1/lCc6lra4OM5MM8rIHszqVLE1PsiYNTjI_1Y-woo1OM7K"
>>> r = requests.get(url)
>>> filetype.guess(r.content[:10])
<filetype.types.image.Jpeg object at 0x10b209f70>
>>> filetype.guess(r.content[:10]).kind
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Jpeg' object has no attribute 'kind'

From the initial response it would appear that the kind has been detected, but once we access the kind attribute we get an AttributeError.

Python 3.8.0

Why the original file has to be broken in get_bytes(obj)?

Here → obj = obj.read(_NUM_SIGNATURE_BYTES)

If I check my object's type with something like if filetype.guess_mime(file) != "image/jpeg":, the file itself is left broken and I can't use it later.

I wonder whether this is intentional.

Thank you!
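
A likely explanation is that reading the signature advances the file object's position, so later reads or saves start after the header. A caller-side workaround, sketched here under the assumption that the stream is seekable, is to read the header and then restore the position. The helper peek_header is hypothetical; the default size of 262 mirrors the _NUM_SIGNATURE_BYTES value quoted elsewhere in this list.

```python
import io

HEADER_SIZE = 262  # assumption: same value as the quoted _NUM_SIGNATURE_BYTES


def peek_header(stream, size=HEADER_SIZE):
    """Read up to `size` bytes from a seekable binary stream, then restore its position."""
    position = stream.tell()
    try:
        return stream.read(size)
    finally:
        stream.seek(position)


# Stand-in for an uploaded file: JPEG-like magic bytes plus padding.
upload = io.BytesIO(b"\xff\xd8\xff" + b"\x00" * 500)
header = peek_header(upload)
print(len(header), upload.tell())
```

The header bytes can then be passed to filetype.guess, while the original stream remains at its starting position for saving or further processing.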

Feature: Recognize lzo compressed files

It would be great if lzo compressed files were recognized.

i.e.

00000000  89 4c 5a 4f 00 0d 0a 1a  0a 10 30 20 80 09 40 03  |.LZO......0 ..@.|
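
Based only on the hex dump above, a check could compare the first nine bytes against the magic sequence, as in this sketch. The function is_lzo is a standalone illustration rather than an existing filetype matcher, and it assumes those nine bytes are the complete, fixed signature.

```python
# First nine bytes of the dump above: \x89 'L' 'Z' 'O' \x00 \r \n \x1a \n
LZO_MAGIC = bytes.fromhex("89 4c 5a 4f 00 0d 0a 1a 0a")


def is_lzo(buf):
    """Return True if buf starts with the LZO magic bytes."""
    return bytes(buf[:len(LZO_MAGIC)]) == LZO_MAGIC


# The full 16-byte line from the hex dump.
sample = bytes.fromhex("89 4c 5a 4f 00 0d 0a 1a 0a 10 30 20 80 09 40 03")
print(is_lzo(sample))
```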

add_type always fails

If you attempt to do an add_type with a subclass of Type, you always get the "instance must inherit from filetype.types.Type". This appears to be because isinstance only returns true for actual instances and not for subclasses. You need to use issubclass to check for subclasses. I am using Python 3.8, so this may be new behavior.

If I fix this error, I get a further buffer error. Attached is a zip file with my example code. You will need to change the file locations.
detect_file_type.zip
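
The difference between the two checks can be shown without the library. In the sketch below, Type is a stand-in class for filetype.types.Type and accepts is a hypothetical validator that tolerates both a subclass and an instance; it illustrates the isinstance/issubclass distinction and is not the library's add_type implementation.

```python
class Type:
    """Stand-in for filetype.types.Type, used only for this illustration."""

    def __init__(self, mime, extension):
        self.mime = mime
        self.extension = extension


class Foo(Type):
    """Example custom type with made-up MIME and extension values."""

    def __init__(self):
        super().__init__(mime="application/x-foo", extension="foo")


def accepts(candidate):
    """Accept either a Type subclass or an instance of one."""
    if isinstance(candidate, type):
        return issubclass(candidate, Type)
    return isinstance(candidate, Type)


print(isinstance(Foo, Type))    # the class object itself is not an instance
print(isinstance(Foo(), Type))  # an instantiated matcher is
print(accepts(Foo), accepts(Foo()))
```

So passing the class Foo fails an isinstance check, while passing Foo() satisfies it; which of the two the API intends to accept is for the maintainers to decide.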

Not able to identify "tar" package

I created a simple tar package "dummy.tar" using "tar -cvf" and filetype always returns None.

I can see that tar is supported, but it is not working for me.

XLS Support

Is it doable to also check for XLS?

Add Support for PathLike objects

Python 3.6 introduced the file system path protocol with PEP 519. All python (3.6+) builtins, and most (if not all) standard library modules accept a path-like object where only string or bytes were previously accepted. This is especially useful when using os.path alternatives like pathlib.

Any class can include this protocol by inheriting from os.PathLike and implementing a concrete definition of __fspath__ that returns either a str or bytes object.

The area most suitable for providing this support seems to be utils.get_bytes. While checking isinstance(obj, os.PathLike) makes use of the formally defined interface, it's only compatible with versions 3.6+. To reconcile this, the following idiom recommended by PEP 519 offers compatibility with previous versions:

obj = obj.__fspath__() if hasattr(obj, '__fspath__') else obj

If a more explicit implementation is desired, a Python 2+3 compatible PathLike interface matching the signature provided by os.PathLike can be used as well.

import abc

# provides the features of the abc.ABC helper class introduced in 3.4
ABC = abc.ABCMeta('ABC', (object,), {'__slots__': ()}) 

class PathLike(ABC):
    """Abstract base class for implementing the file system path protocol."""

    @abc.abstractmethod
    def __fspath__(self):
        """Return the file system path representation of the object."""
        raise NotImplementedError

    @classmethod
    def __subclasshook__(cls, subclass):
        if cls is not PathLike:        
            return NotImplemented
        for parent in subclass.__mro__:            
            attrs = parent.__dict__
            if '__fspath__' in attrs:
                return NotImplemented if attrs['__fspath__'] is None else True
        return NotImplemented

Regardless of the implementation, I feel including PathLike support offers value without adding dependencies or introducing compatibility problems with older versions of Python.

get_type uses string object identity instead of equality

Anything that isn't a literal will result in None.

from filetype import get_type
x = '.mp4'
get_type(ext=x.replace('.', ''))
>>> None
get_type(ext='mp4')
>>> <filetype.types.video.Mp4 object at 0x7f1f9c3bec90>
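
The behaviour reported here is what one would expect if the lookup compared strings with is: two equal strings built at run time are generally different objects. A minimal sketch of an equality-based lookup follows; Kind, KINDS and get_by_extension are stand-ins invented for this example, not the library's code.

```python
class Kind:
    """Stand-in for a matcher object that exposes an `extension` attribute."""

    def __init__(self, extension):
        self.extension = extension


KINDS = [Kind("mp4"), Kind("png")]


def get_by_extension(ext):
    """Return the first kind whose extension equals `ext`, or None."""
    for kind in KINDS:
        # `==` compares string contents; `is` would compare object identity.
        if kind.extension == ext:
            return kind
    return None


runtime_ext = ".mp4".replace(".", "")  # built at run time, not a literal
print(get_by_extension(runtime_ext).extension)
```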

get_type('text/plain')

I can understand that detecting text/plain is hard.

But it would be great if I could guess the file extension if the mime-type is known.

Please make get_type('text/plain') work.

Thank you

Avi.match does not check byte 12 of file header

The last check in Avi.match is buf[10] == 0x49.

As far as I understand, the first four bytes are the RIFF signature (\x52\x49\x46\x46), followed by four bytes giving the file size, followed by four bytes identifying the file type, which would be \x41\x56\x49\x20 in the case of an AVI.

Does the method lack a buf[11] == 0x20 check?
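
Following the layout described above, a stricter check could compare all four type bytes, including the trailing space at index 11. The function is_avi below is a standalone sketch, not the library's Avi.match, and the sample headers are synthetic.

```python
def is_avi(buf):
    """RIFF container whose four-byte form type (offsets 8-11) is 'AVI '."""
    return (
        len(buf) > 11
        and bytes(buf[:4]) == b"RIFF"       # \x52\x49\x46\x46
        and bytes(buf[8:12]) == b"AVI "     # \x41\x56\x49\x20
    )


# Synthetic headers: 'RIFF', a 4-byte size field, then the form type.
avi_header = b"RIFF" + b"\x24\x00\x00\x00" + b"AVI LIST"
wav_header = b"RIFF" + b"\x24\x00\x00\x00" + b"WAVEfmt "
print(is_avi(avi_header), is_avi(wav_header))
```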

Specific MP3 file not detected

test.zip
This MP3 is not detected as such.
It has a bit rate of 8 kbps and a sample rate of 24000 Hz. It is basically one second of silence.

PS. The Go version has the same problem: h2non/filetype#91

detecting mp4 video

I have an mp4 video that returns on a call to _get_ftyp() like so:

('isom', 1, ['isom', 'avc1', 'mp42'])

Should the matching be more lenient for 'compatible brands'? I'm asking because I don't know what isom is, and it's unclear what the intention is with parsing out compatible brands.
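
For context, isom is the brand of the ISO base media file format on which MP4 is built. One way a more lenient rule might be expressed is sketched below: accept the file if either the major brand or any compatible brand is in an allow-list. The set MP4_BRANDS and the function looks_like_mp4 are assumptions made for this illustration; which brands should really count as MP4 is exactly the open question in this issue.

```python
# Assumed allow-list for illustration; not the library's actual rule.
MP4_BRANDS = {"isom", "iso2", "avc1", "mp41", "mp42"}


def looks_like_mp4(major_brand, compatible_brands):
    """Return True if the major brand or any compatible brand is allow-listed."""
    return major_brand in MP4_BRANDS or any(
        brand in MP4_BRANDS for brand in compatible_brands
    )


# The tuple reported above: (major brand, minor version, compatible brands).
major, minor, compatible = ("isom", 1, ["isom", "avc1", "mp42"])
print(looks_like_mp4(major, compatible))
```

A rule this loose trades precision for recall, since other ISO-based formats can also list some of these brands as compatible.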

Ebook support

Not every ebook is Epub, so these might be good to note.

MOBI
DJVU
AZW and AZW3
FB2

Stabilise API

Release 1.0.7 broke the specialised matchers that are still documented here https://h2non.github.io/filetype.py/v1.0.0/match.m.html

One could make the argument that these functions are internal API since they're not officially documented in the examples, so it's ok to break them without even a minor version bump.

However, given the usefulness of these functions (e.g. for scenarios in which one only looks for images -- something often encountered in web development) please expose them officially in the examples, and keep them stable.
