h2non / filetype.py
Small, dependency-free, fast Python package to infer binary file types checking the magic numbers signature
Home Page: https://h2non.github.io/filetype.py
License: MIT License
The 1.0.7 release tarball is missing the sample.tar file used in test_infer_zip_from_disk and test_infer_tar_from_disk.
Also, the History.md contents end at version 1.0.5.
These are Python repos with lots of file signatures that might not have been covered by filetype.py
FYI
import filetype
import gzip
>>> filetype.helpers.is_archive(gzip.compress(b'test'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.7/dist-packages/filetype/helpers.py", line 74, in is_archive
return match.archive(obj) is not None
AttributeError: 'function' object has no attribute 'archive'
There seems to be a naming conflict when the helpers are used directly: the traceback shows that match inside helpers.py is the function filetype.match.match, not the filetype.match module that provides archive.
Bypassing the helpers works, though:
import gzip
from filetype.match import archive
>>> bool(archive(gzip.compress(b'test')))
True
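For reference, the byte-level check behind a gzip match is small. The sketch below is not filetype's code: looks_like_gzip is a hypothetical helper that tests for the gzip magic bytes 1f 8b followed by the deflate method byte 08, using only the standard library.

```python
import gzip


def looks_like_gzip(buf):
    """Return True if buf starts with the gzip magic bytes and the deflate method byte.

    Hypothetical helper for illustration; not part of filetype.
    """
    return bytes(buf[:3]) == b"\x1f\x8b\x08"


print(looks_like_gzip(gzip.compress(b"test")))
print(looks_like_gzip(b"test"))
```

A check like this can serve as a stopgap while the helper is broken, at the cost of covering only one archive format.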
filetype.guess on a plain text file always yields None
I've read #30, but I find myself having to do something like this (pardon the comments, direct copy-paste):
def is_mimetype_family(self, want_family):
    our_type = filetype.guess(self.path)
    if our_type is None:
        # sometimes, filetype fails horribly
        # like with text files. works great for images though
        type, encoding = mimetypes.guess_type(self.path)
        if type is None:
            return False
        if re.match(want_family, type) is not None:
            return True
        return False
    if re.match(want_family, our_type.mime) is not None:
        return True
    return False
It's a little counter-intuitive. Perhaps you could use mimetypes as a fallback, as I have done.
I expected filetype to be a "one-stop shop" for MIME types. Nevertheless, it is an excellent library for detecting them.
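The fallback described in this report can be written as a standalone function. This is a minimal sketch, not something filetype provides: mime_family_matches and its sniff parameter are hypothetical names, and sniff stands in for any content-based detector (for example a small wrapper that returns the MIME string from filetype.guess, or None).

```python
import mimetypes
import re


def mime_family_matches(path, want_family, sniff=None):
    """Return True if the MIME type for path matches the regex want_family.

    sniff is an optional callable (hypothetical hook) that takes the path and
    returns a MIME string or None. When it is missing or returns None, fall
    back to the extension-based guess from the standard library.
    """
    mime = sniff(path) if sniff is not None else None
    if mime is None:
        mime, _encoding = mimetypes.guess_type(path)
    return mime is not None and re.match(want_family, mime) is not None


print(mime_family_matches("notes.txt", r"text/"))
print(mime_family_matches("photo.bin", r"image/", sniff=lambda p: "image/png"))
```

Note that the fallback trusts the file extension, so it is weaker evidence than a signature match.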
Your code is great and it's helping me so much.
By the way, I have to flag a problem: pip install filetype doesn't work, and your test code keeps failing with:
Traceback (most recent call last):
File "TestFiletype.py", line 4, in <module>
import filetype
ImportError: No module named filetype
even though filetype is correctly installed.
So I had to download your setup.py file, but it wouldn't install because 'README.rst' was missing.
Do I really need 'README.rst' in order to install your package?
So I downloaded the complete .zip, and filetype installed correctly; now everything works.
Being able to install filetype through pip install filetype would be amazing.
Thank you, and keep up the good code.
Currently filetype.guess does not accept PathLike objects (see https://docs.python.org/3/library/os.html#os.PathLike) as used by pathlib.Path. It would be nice if this was possible.
To be able to check file content in form data received from the front end.
This signature returns "None" filetype:
>>> kind = filetype.utils.get_signature_bytes('HDVWM419.mp4')
>>> print(kind)
bytearray(b"\x00\x00\x00 ftypisom\x00\x00\x02\x00isomiso2avc1mp41\x00\x00\x00\x08free\x8b_\x04\xf6mdat\x00\x01\'xe\xb8\x04_\xdb\xb3`+@\x85a)>\x10\x88\xebv{\x1ec(\x96(\xc9W^\x8a\\\xa7\xaaB$\xfaz\xb1\x8c&\xca\t\xa9\x04\x95\xb7\x87P\xf2\xaew~,\x8f\xa2\xda>\xe6\xe4\n$&\x9dm\x19\xeb4\xbd0\x00\xc6\x91\xf0\xb0\x85\x0f\xab<\x04\xf5\xe00\xaa\rm\xdc\xa6<a\x08\xcf\x8c\\\x0f\x18)\xdd\xc7\x8e\n\xd6\xd7\xd7\x05\x0fdPj\x15\x1f\xc5H\xd4\x98\x0cx\xce\xb9\xa7\xa8\t\xea\x8d\xe1\xb7\xe2F\x8fQoD\xadKT{\xc9D\xcapZ\xb8\xa2\xeez\xbd\xab\x9e7\x9a\xf7G\xbe/\xbdQ>P\xf6\xa3f\xdc\x17\xfb\xcb\x9c\x9a\x14\x06\xd4J\xb2\xe2\x15\x05\xda\xc5oL\x0b\xbd!\xb7>-\xe2\xb6\xda\x8bi\xab\x8c\xe3\xc1\xa7\x82c\x83\x93\x17$\xd9\xa8zM\xe4@Q\xab\\\xc5\xb4<\x04")
file HDVWM419.mp4
HDVWM419.mp4: ISO Media, MP4 Base Media v1 [IS0 14496-12:2003]
Stream #0:0(eng): Video: h264 (High) (avc1 / 0x31637661), yuv420p(tv, GBR), 1920x1080 [SAR 1:1 DAR 16:9], 12168 kb/s, 30 fps, 30 tbr, 30 tbn, 60
I guess the issue is using the correct magic file or metadata evaluation. I looked at your source code, but I'm not sure how to get it to see this as an MP4.
Metadata:
major_brand : isom
minor_version : 512
compatible_brands: isomiso2avc1mp41
creation_time : 2016-09-19T13:43:30.000000Z
encoder : Lavf51.12.1
Videos with the following metadata do match:
Metadata:
major_brand : mp42
minor_version : 1
compatible_brands: mp42mp41
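The metadata shown by the other tools can be read straight from the signature bytes in this report. The sketch below decodes a leading ftyp box (4-byte big-endian size, the tag ftyp, a 4-byte major brand, a 4-byte minor version, then 4-byte compatible brands). parse_ftyp is a hypothetical helper written for this page, not filetype's API, and the input is the start of the bytearray quoted above.

```python
import struct


def parse_ftyp(buf):
    """Parse a leading ISO base media 'ftyp' box.

    Returns (major_brand, minor_version, compatible_brands), or None when buf
    does not start with an ftyp box. Hypothetical helper, not filetype's API.
    """
    buf = bytes(buf)
    if len(buf) < 16 or buf[4:8] != b"ftyp":
        return None
    box_size = struct.unpack(">I", buf[:4])[0]
    major = buf[8:12].decode("ascii", "replace")
    minor = struct.unpack(">I", buf[12:16])[0]
    # Compatible brands fill the rest of the box, four bytes each.
    end = min(box_size, len(buf))
    brands = [buf[i:i + 4].decode("ascii", "replace") for i in range(16, end - 3, 4)]
    return major, minor, brands


header = b"\x00\x00\x00 ftypisom\x00\x00\x02\x00isomiso2avc1mp41\x00\x00\x00\x08free"
print(parse_ftyp(header))  # ('isom', 512, ['isom', 'iso2', 'avc1', 'mp41'])
```

The result agrees with the major_brand, minor_version and compatible_brands lines reported for HDVWM419.mp4, so the information needed to classify the file is present in the first 32 bytes.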
I am trying to pass a long array of bytes to the function but the characters that identify the file go beyond 256 bytes.
The current makefile destroys my git repo refs on make clean.
I can do a PR
can you link to an example?
Hi!
We have a use case to detect images including SVG. This library is perfect except for missing the SVG format. I'm happy to make a PR in the next few days if that's alright.
Edit with sample: https://upload.wikimedia.org/wikipedia/commons/0/02/SVG_logo.svg
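SVG is XML text, so it has no fixed magic number, and any check has to be a textual heuristic. The sketch below is one possible approach under that assumption; looks_like_svg and its window parameter are hypothetical, and the heuristic misses files that begin with an XML comment and can be fooled by other XML that mentions an svg element early on.

```python
def looks_like_svg(data, window=1024):
    """Heuristic: does the start of data look like an SVG document?

    Hypothetical helper. Strips a UTF-8 BOM and leading whitespace, then
    requires an XML-style opening and an <svg tag within the window.
    """
    head = bytes(data[:window]).lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    return head.startswith((b"<?xml", b"<!doctype svg", b"<svg")) and b"<svg" in head


sample = b'<?xml version="1.0"?>\n<svg xmlns="http://www.w3.org/2000/svg"></svg>'
print(looks_like_svg(sample))
```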
I am trying to get the file type from request data and then save the file, but the saved file is unusable if I call filetype.guess( audio_file ) first.
I also tried saving the data first, reading the saved file back, and saving it again after checking the file type; that saved file is unusable too.
How can I solve this problem?
audio_file = request.files['data']
kind = filetype.guess( audio_file )
if kind is not None :
    file_type = kind.extension
    wav_path = os.path.join( current_app.config['UPLOAD_FOLDER'], 'audio.'+file_type )

audio_file = request.files['data']
tmp_path = os.path.join( current_app.config['UPLOAD_FOLDER'], "upload_audio.tmp" )
audio_file.save( tmp_path )
tmp_file = open( tmp_path , 'rb' )
tmp_data = tmp_file.read()
kind = filetype.guess( tmp_data )
tmp_file.close()
if kind is not None :
    file_type = kind.extension
    wav_path = os.path.join( current_app.config['UPLOAD_FOLDER'], wav_id+'.'+file_type )
    wav_file = open( wav_path , 'wb' )
    wav_file.write( tmp_data )
    wav_file.close()
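One likely cause for the first snippet is that type detection reads from the upload stream and leaves its position past the header, so a later save writes only the remainder. A common workaround is to read the header bytes and then rewind. This is a sketch under the assumption that the upload object supports read(), tell() and seek() like an ordinary binary file; peek_header is a hypothetical helper, and io.BytesIO stands in for the upload.

```python
import io


def peek_header(stream, size=262):
    """Read up to size bytes from a seekable binary stream, then rewind.

    Hypothetical helper. 262 mirrors the _NUM_SIGNATURE_BYTES value quoted
    elsewhere on this page.
    """
    start = stream.tell()
    try:
        return stream.read(size)
    finally:
        stream.seek(start)


upload = io.BytesIO(b"RIFF\x24\x00\x00\x00WAVEfmt " + b"\x00" * 300)
header = peek_header(upload)
print(len(header), upload.tell())
```

The returned bytes can then be passed to the detector, and the stream is still at its original position when it is saved.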
Hello. I have a problem when trying to process CR2 files: filetype recognizes them as both TIFF and CR2. That's no surprise, since CR2 is based on TIFF.
Filetype version 1.0.7
Sample code:
from filetype.types.image import Tiff, Cr2
from filetype import match
match("Path to cr2 file", matchers=[Cr2()])
match("Path to cr2 file", matchers=[Tiff()])
Result is:
>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Cr2()])
<filetype.types.image.Cr2 object at 0x0000029F0EF2C9E8>
>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Tiff()])
<filetype.types.image.Tiff object at 0x0000029F0EF2CA20>
Should be:
>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Cr2()])
<filetype.types.image.Cr2 object at 0x0000029F0EF2C9E8>
>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Tiff()])
You can get a sample CR2 file here.
I think that to solve this problem we need to add something like and not(buf[8] == 0x43 and buf[9] == 0x52) to the TIFF check, to make sure that there is no CR2 magic word in the buffer.
I can do a PR.
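The proposed exclusion can be sketched with two standalone predicates. These are hypothetical functions, not filetype's Tiff and Cr2 classes: both accept the little-endian and big-endian TIFF magics, and the CR2 check additionally looks for the bytes 0x43 0x52 ("CR") at offsets 8 and 9, as suggested in the report.

```python
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")


def is_cr2(buf):
    """True for a TIFF header that also carries 'CR' at offsets 8-9."""
    buf = bytes(buf[:10])
    return len(buf) >= 10 and buf[:4] in TIFF_MAGICS and buf[8:10] == b"CR"


def is_plain_tiff(buf):
    """True for a TIFF header that is not a CR2 file."""
    buf = bytes(buf[:10])
    return buf[:4] in TIFF_MAGICS and not is_cr2(buf)


cr2_header = b"II*\x00\x10\x00\x00\x00CR\x02\x00"
tiff_header = b"II*\x00\x08\x00\x00\x00\x0e\x00\x00\x01"
print(is_cr2(cr2_header), is_plain_tiff(cr2_header))
print(is_cr2(tiff_header), is_plain_tiff(tiff_header))
```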
I have some WebM files for which it always returns None, like this one:
http://video.webmfiles.org/big-buck-bunny_trailer.webm
I think that pre-building a dict of all the magic signatures for the file header lookup would be more time-efficient than calling each type object in turn to find the matching file header.
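A dict-based lookup of the kind proposed here might look like the sketch below. The table is a small illustrative sample, not filetype's signature list, and SIGNATURES, PREFIX_LENGTHS and lookup are hypothetical names. It only covers signatures that are a fixed prefix at offset 0; formats identified by bytes at another offset or by parsed fields would still need their own matchers.

```python
# Illustrative sample of prefix signatures; not an exhaustive table.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
    b"\x1f\x8b\x08": "gz",
    b"PK\x03\x04": "zip",
}
# Try longer prefixes first so the most specific signature wins.
PREFIX_LENGTHS = sorted({len(sig) for sig in SIGNATURES}, reverse=True)


def lookup(buf):
    """Return the extension for a known prefix signature, or None."""
    head = bytes(buf[:PREFIX_LENGTHS[0]])
    for n in PREFIX_LENGTHS:
        ext = SIGNATURES.get(head[:n])
        if ext is not None:
            return ext
    return None


print(lookup(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
print(lookup(b"hello world"))
```

Each call does one dict probe per distinct signature length rather than one call per registered type.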
From https://github.com/h2non/filetype.py/blob/v1.0.5/filetype/utils.py#L3-L18
_NUM_SIGNATURE_BYTES = 262
_NUM_SIGNATURE_BYTES is the number of bytes read from the passed data to determine the signature.
_NUM_SIGNATURE_BYTES could be considered the recommended number of bytes needed for best signature matching. Some users of filetype may want to know the minimum number of bytes needed before calling any filetype API.
However, the variable is somewhat obscured; filetype.utils._NUM_SIGNATURE_BYTES is an awkward reference, and the leading _ suggests a "private" variable.
_NUM_SIGNATURE_BYTES should be exposed at the root of the filetype package as a Python "constant", e.g. filetype.BYTES_MINIMUM or filetype.BYTES_SUGGESTED.
There is an image type named dcm (DICOM) that is widely used in radiology; CT and other scan data are usually recorded in it. I think filetype would be better if dcm were added, because many researchers now work with this kind of image.
It seems the latest version is still not updated on PyPI.
Maybe it's time to create a github workflow for automatic upload?
It doesn't support the SVG format.
What can I do?
Many images are SVG now.
You have a Go version, but how do I use it from Python?
λ pip install filetype
Collecting filetype
Using cached filetype-0.1.2.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "D:\LOCAL_TEMP\pip-build-605g52_c\filetype\setup.py", line 16, in <module>
with open(path.join(here, 'README.md'), encoding='utf-8') as f:
File "C:\Anaconda3\lib\codecs.py", line 895, in open
file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: 'D:\\LOCAL_TEMP\\pip-build-605g52_c\\filetype\\README.md'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in D:\LOCAL_TEMP\pip-build-605g52_c\filetype\
I could install from source, but pip did not work.
Based on Go implementation with Python idioms
Hi @h2non,
I am trying to package filetype.py for Debian. However, the tests/fixtures/* files do not seem to have been originally created by you, and the Debian FTP Masters rejected the package. I can see this, for example, in sample.jpg:
Copyright (c) 1998 Hewlett-Packard Company
Please, can you clarify?
I suggest you generate all the files yourself and add a specific notice about it.
Regards,
Eriberto
Hello,
In the description you say "Pluggable: add new custom type matchers". By the way, the links don't work.
Do I need to modify your code, or is there a function call or something? I don't see any example.
docx is recognized as zip (which it is). I suppose I need a second step to extract the archive and check again?
Brotli is now a widely supported compression type. We should include that too.
It would be great to have support for the LZ4 compression format.
Thanks a lot for making this nice package.
> ls -lah /opt/conda/bin/python3.6
-rwxrwxr-x 1 root root 3.6M Jun 8 2018 /opt/conda/bin/python3.6
>conda install filetype
...
The following NEW packages will be INSTALLED:
...
filetype: 1.0.7-pyh9f0ad1d_0 conda-forge
...
conda install filetype
ERROR conda.core.link:_execute(502): An error occurred while installing package 'conda-forge::filetype-1.0.7-pyh9f0ad1d_0'.
FileNotFoundError(2, "No such file or directory: '/opt/conda/bin/python3.6'")
Attempting to roll back.
Rolling back transaction: done
FileNotFoundError(2, "No such file or directory: '/opt/conda/bin/python3.6'")
>>> import filetype
>>> import requests
>>> url = "https://45.img.avito.st/image/1/lCc6lra4OM5MM8rIHszqVLE1PsiYNTjI_1Y-woo1OM7K"
>>> r = requests.get(url)
>>> filetype.guess(r.content[:10])
<filetype.types.image.Jpeg object at 0x10b209f70>
>>> filetype.guess(r.content[:10]).kind
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Jpeg' object has no attribute 'kind'
From the initial response it would appear that the kind has been detected, but accessing .kind on the result raises an AttributeError.
Python 3.8.0
Here → obj = obj.read(_NUM_SIGNATURE_BYTES)
If I check my object's type with something like if filetype.guess_mime(file) != "image/jpeg":, the file itself is left broken and I can't use it later.
I wonder whether this is intentional.
Thank you!
It would be great if LZO-compressed files were recognized, e.g.:
00000000 89 4c 5a 4f 00 0d 0a 1a 0a 10 30 20 80 09 40 03 |.LZO......0 ..@.|
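A matcher for this format could compare the first nine bytes against the lzop magic visible at the start of the dump. The sketch below is a hypothetical standalone check, not an existing filetype matcher; the sample is the 16 bytes from the hex dump above.

```python
LZOP_MAGIC = bytes.fromhex("894c5a4f000d0a1a0a")


def is_lzop(buf):
    """Return True if buf starts with the nine-byte lzop magic."""
    return bytes(buf[:len(LZOP_MAGIC)]) == LZOP_MAGIC


sample = bytes.fromhex("894c5a4f000d0a1a0a10302080094003")
print(is_lzop(sample))
```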
If you attempt to do an add_type with a subclass of Type, you always get the "instance must inherit from filetype.types.Type". This appears to be because isinstance only returns true for actual instances and not for subclasses. You need to use issubclass to check for subclasses. I am using Python 3.8, so this may be new behavior.
If I fix this error, I get a further buffer error. Attached is a zip file with my example code. You will need to change the file locations.
detect_file_type.zip
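The difference between the two checks can be shown without filetype installed. In the sketch below, Type is a stand-in for filetype.types.Type and Custom is a hypothetical subclass. isinstance is False for the class object itself and True for an instance, which is long-standing Python behavior rather than something new in 3.8. Since the error message speaks of an instance, passing Custom() rather than Custom may be what add_type expects, though that is an assumption here.

```python
class Type(object):
    """Stand-in for filetype.types.Type so the example runs on its own."""

    def __init__(self, mime, extension):
        self.mime = mime
        self.extension = extension


class Custom(Type):
    """Hypothetical custom type with made-up MIME and extension."""

    def __init__(self):
        super().__init__(mime="application/x-custom", extension="custom")


def accepts(candidate):
    """Accept either a Type instance or a Type subclass (hypothetical check)."""
    if isinstance(candidate, type):
        return issubclass(candidate, Type)
    return isinstance(candidate, Type)


print(isinstance(Custom, Type), issubclass(Custom, Type), isinstance(Custom(), Type))
```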
I created a simple tar package "dummy.tar" using "tar -cvf" and filetype always returns None.
I can see that tar is supported but not working for me
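One thing worth ruling out in a report like this is how many bytes reach the matcher. Tar archives written in the ustar, GNU or pax formats carry the string ustar at offset 257, so a check needs at least 262 bytes of the file. The sketch below builds a small archive with the standard tarfile module and tests that offset; looks_like_tar and make_tar are hypothetical helpers, and this is not a diagnosis of the reporter's file.

```python
import io
import tarfile


def looks_like_tar(buf):
    """Return True if the 'ustar' magic is present at offset 257."""
    return bytes(buf[257:262]) == b"ustar"


def make_tar():
    """Build a one-member tar archive in memory, for demonstration only."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as archive:
        payload = b"hello"
        info = tarfile.TarInfo("hello.txt")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return raw.getvalue()


data = make_tar()
print(looks_like_tar(data[:262]), looks_like_tar(data[:256]))
```

The second call shows that a buffer truncated to 256 bytes cannot match, even though the archive itself is valid.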
Is it doable to also check for XLS?
Python 3.6 introduced the file system path protocol with PEP 519. All Python (3.6+) builtins, and most (if not all) standard library modules, accept a path-like object where only string or bytes were previously accepted. This is especially useful when using os.path alternatives like pathlib.
Any class can support this protocol by inheriting from os.PathLike and implementing a concrete definition of __fspath__ that returns either a str or bytes object.
The area most suitable for providing this support seems to be utils.get_bytes. While checking isinstance(obj, os.PathLike) makes use of the formally defined interface, it's only compatible with versions 3.6+. To reconcile this, the following idiom recommended by PEP 519 offers compatibility with previous versions:
obj = obj.__fspath__() if hasattr(obj, '__fspath__') else obj
If a more explicit implementation is desired, a Python 2+3 compatible PathLike interface matching the signature provided by os.PathLike can be used as well.
import abc

# provides the features of the abc.ABC helper class introduced in 3.4
ABC = abc.ABCMeta('ABC', (object,), {'__slots__': ()})

class PathLike(ABC):
    """Abstract base class for implementing the file system path protocol."""

    @abc.abstractmethod
    def __fspath__(self):
        """Return the file system path representation of the object."""
        raise NotImplementedError

    @classmethod
    def __subclasshook__(cls, subclass):
        if cls is not PathLike:
            return NotImplemented
        for parent in subclass.__mro__:
            attrs = parent.__dict__
            if '__fspath__' in attrs:
                return NotImplemented if attrs['__fspath__'] is None else True
        return NotImplemented
Regardless of the implementation, I feel including PathLike support offers value without adding dependencies or introducing compatibility problems with older versions of Python.
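The one-line idiom quoted above can be exercised on its own. In this sketch to_path is a hypothetical wrapper around it, not a function in filetype; it converts path-like objects to their string form and passes everything else, including raw bytes, through unchanged.

```python
import pathlib


def to_path(obj):
    """Return obj.__fspath__() for path-like objects, else obj unchanged."""
    return obj.__fspath__() if hasattr(obj, "__fspath__") else obj


print(to_path(pathlib.PurePosixPath("fixtures/sample.jpg")))
print(to_path("fixtures/sample.jpg"))
print(to_path(b"\xff\xd8\xff"))
```

Because plain str and bytes values have no __fspath__ attribute, existing callers that pass file names or buffers are unaffected.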
Anything that isn't a literal will result in None.
from filetype import get_type
x = '.mp4'
get_type(ext=x.replace('.', ''))
>>> None
get_type(ext='mp4')
>>> <filetype.types.video.Mp4 object at 0x7f1f9c3bec90>
I can understand that detecting text/plain is hard.
But it would be great if I could guess the file extension if the mime-type is known.
Please make get_type('text/plain') work.
Thank you
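Both get_type reports above come down to how a registry of types is searched. The first symptom, where a computed extension returns None while an equal literal works, is consistent with a lookup that compares strings by identity (is) instead of equality (==); whether filetype does that is an assumption here, not something the report shows. The sketch below uses a hypothetical Kind class and registry, including a made-up text/plain entry, and compares by value.

```python
class Kind(object):
    """Hypothetical stand-in for a filetype type object."""

    def __init__(self, mime, extension):
        self.mime = mime
        self.extension = extension


# Illustrative registry; the text/plain entry is an assumption for this example.
KINDS = [Kind("video/mp4", "mp4"), Kind("image/png", "png"), Kind("text/plain", "txt")]


def get_kind(mime=None, ext=None):
    """Look up a kind by MIME type or extension, comparing by value."""
    for kind in KINDS:
        if ext is not None and kind.extension == ext:
            return kind
        if mime is not None and kind.mime == mime:
            return kind
    return None


x = ".mp4"
print(get_kind(ext=x.replace(".", "")).mime)
print(get_kind(mime="text/plain").extension)
```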
The last check in Avi.match is buf[10] == 0x49.
As far as I understand, the first four bytes are the RIFF signature (\x52\x49\x46\x46), followed by four bytes giving the file size, followed by four bytes identifying the file type, which would be \x41\x56\x49\x20 in the case of an AVI.
Does the method lack a buf[11] == 0x20 check?
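A check that covers all four bytes of the form type can be written with slices, following the layout described above. This is a hypothetical standalone function, not the Avi.match method; the sample headers are hand-built.

```python
def is_avi(buf):
    """Return True for a RIFF container whose form type is exactly 'AVI '."""
    buf = bytes(buf[:12])
    return buf[:4] == b"RIFF" and buf[8:12] == b"AVI "


avi_header = b"RIFF\x24\x00\x00\x00AVI LIST"
wav_header = b"RIFF\x24\x00\x00\x00WAVEfmt "
print(is_avi(avi_header), is_avi(wav_header))
```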
test.zip
This MP3 is not detected as such.
It has a bit rate of 8 kbps and a sample rate of 24000 Hz. It is basically one second of silence.
PS. The Go version has the same problem: h2non/filetype#91
I have an mp4 video that returns on a call to _get_ftyp() like so:
('isom', 1, ['isom', 'avc1', 'mp42'])
Should the matching be more lenient for 'compatible brands'? I'm asking because I don't know what isom is, and it's unclear what the intention is with parsing out compatible brands.
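For context, isom is the brand of the ISO base media file format that MP4 builds on. A lenient rule could accept a file when either the major brand or any compatible brand is in a known set. The sketch below takes a tuple shaped like the _get_ftyp() result quoted above; MP4_BRANDS is an illustrative, non-exhaustive set and is_mp4_ftyp is a hypothetical function, not filetype's matcher.

```python
# Illustrative set of brands treated as MP4; an assumption, not an official list.
MP4_BRANDS = {"isom", "iso2", "mp41", "mp42", "avc1"}


def is_mp4_ftyp(ftyp):
    """Lenient match: major brand or any compatible brand is a known MP4 brand."""
    major, _minor, compatible = ftyp
    return major in MP4_BRANDS or any(brand in MP4_BRANDS for brand in compatible)


print(is_mp4_ftyp(("isom", 1, ["isom", "avc1", "mp42"])))
print(is_mp4_ftyp(("qt  ", 0, ["qt  "])))
```

The trade-off is that a lenient rule also accepts files whose major brand names a different flavor of the container but which list an MP4 brand as compatible.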
Not every ebook is Epub, so these might be good to note.
Release 1.0.7 broke the specialised matchers that are still documented here https://h2non.github.io/filetype.py/v1.0.0/match.m.html
One could make the argument that these functions are internal API since they're not officially documented in the examples, so it's ok to break them without even a minor version bump.
However, given the usefulness of these functions (e.g. for scenarios in which one only looks for images -- something often encountered in web development) please expose them officially in the examples, and keep them stable.