psychoinformaticslab / pliers

Automated feature extraction in Python
Home Page: https://pliers.readthedocs.io/en/latest/
License: BSD 3-Clause "New" or "Revised" License
It would be useful to support text feature extraction from subtitle files.
Some tests currently fail because OpenCV was difficult to install on Python 3 until recently. There now appears to be a conda installer, so we should fix the travis config to properly install OpenCV on both Python 2 and 3 (and make sure the tests pass).
Now that we have added the functionality thanks to #69 we should implement batch processing for transformers that can gain from it. This includes a fair chunk of the API and Google transformers.
We now have working continuous integration testing via travis-ci; the coveralls report is here. We're not doing too badly, but we should be able to get to 95%+ coverage without too much work. Additionally, as a secondary priority, many of the earliest tests I wrote are overly broad, and could stand to be refactored.
Currently the quickstart doc only provides the bare minimum of information about what the package does and how it runs. Pretty much any doc contributions would be great at this point. The easiest place to start might be by adding example Jupyter notebooks illustrating usage for different stimuli. A more comprehensive tutorial would also be nice. Ultimately we want to have a comprehensive user guide, but that can probably wait on #4.
Pro: easy to use
Con: security liability
There's no centralized tracking of Extractors at the moment, which makes it difficult to search for specific extractors, properly attribute credit, etc. We should add some tools for annotating Extractors with information like author, purpose, description, citation, tags, etc.
There's some ambiguity over what a Stim name means. Right now it defaults to the filename, but it's probably a good idea to separately track the source file and name. This becomes an issue mainly in the context of graphs, where we might want to propagate the initial source file to a Stim as it flows through the graph (e.g., annotated text extracted from a VideoStim should retain some indication of the original video file).
Some extractors return string values. Users should have the option of automatically having these dummy-coded as binary columns when exporting or converting to pandas DFs.
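A minimal sketch of what this could look like on the pandas side (the DataFrame below is made up purely for illustration; pandas.get_dummies does the actual dummy-coding):

import pandas as pd

# Made-up results frame in which one extractor returned a string feature.
df = pd.DataFrame({
    'onset': [0.0, 1.0, 2.0],
    'object_label': ['dog', 'cat', 'dog'],
    'brightness': [0.6, 0.4, 0.7],
})

# get_dummies() turns the string column into binary indicator columns;
# numeric columns pass through untouched.
coded = pd.get_dummies(df, columns=['object_label'])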
A few of the APIs impose a request limit, with no penalty for including several stimuli in one request. It is therefore much more efficient to chunk stimuli into single API calls. Currently, each API converter/extractor is written to issue a request for one stimulus at a time. This may be resolved by improving the graph module to automatically handle collections of stimuli.
It's a bit annoying that there's no way to know what the columns are in the lookup dictionaries supported in datasets/dictionaries.json without fetching them. We should add a mandatory 'column_names' field to the JSON objects that lists all valid column names (even if all columns in the target file are valid for use). This way users can easily scan dictionaries.json (and eventually, we can dynamically generate a table inside the docs). We could even extend this eventually to include an optional 'column_descriptions' that describes each column.
It's getting hard to keep track of all the optional dependencies; we should add an optional_dependencies.txt file in the package root that users can pip install -r with if they want everything.
To really unlock the potential of the graph API, we need to support implicit conversion between Stim types that involve multiple steps--e.g., VideoStim to ComplexTextStim via an extracted AudioStim. There are (at least) two ways we could go about this:
1. Have get_converter search all possible paths from the input Stim to the output Stim, and stop as soon as one is found. E.g., suppose we pass a VideoStim to a TextExtractor. Then get_converter would search all possible paths from VideoStim to TextStim until it found VideoStim --> AudioStim --> ComplexTextStim.
2. Write Converter classes for all valid paths, which explicitly call the full chain internally. E.g., we would write a new VideoToComplexTextStimConverter with a _convert method that explicitly uses a VideoToAudioConverter class, then an AudioToTextConverter.
In principle, (1) is the cleaner and more extensible approach. But it introduces completely unnecessary computation when the number of valid paths between Stims is small (as it currently is). The main disadvantage of (2) is that if we add many more Stim types, we could end up with combinatorial explosion.
I guess for now I favor (2), and if it starts to get unwieldy, we can move to (1).
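For reference, a minimal sketch of what approach (2) might look like, assuming a Converter base class with a _convert method and existing VideoToAudioConverter/AudioToTextConverter classes; the import paths and exact signatures are guesses, not the package's actual API:

# Import paths below are assumptions about the featurex layout.
from featurex.stimuli import VideoStim, ComplexTextStim
from featurex.converters import (Converter, VideoToAudioConverter,
                                 AudioToTextConverter)


class VideoToComplexTextStimConverter(Converter):
    # Hypothetical multi-step converter that chains two existing
    # converters internally, per approach (2) above.
    _input_type = VideoStim
    _output_type = ComplexTextStim

    def _convert(self, stim):
        # First strip the audio track from the video...
        audio = VideoToAudioConverter()._convert(stim)
        # ...then transcribe the audio to a ComplexTextStim.
        return AudioToTextConverter()._convert(audio)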
This is a high-priority issue that we should try to get done before revamping the README, because it would be nice to be able to show a Graph example where the user only has to worry about the leaf nodes (all of which are Extractors), and doesn't have to explicitly think about the Converters.
A fairly common potential use case involves chaining multiple extractors--e.g., transcribing the audio track from a movie, and then feeding it into a DictionaryExtractor. Currently there's no automatic way to take the results returned from one extractor and convert them into a Stim to feed into another. We should add a scikit-learn-like pipeline module that allows easy chaining of extractors.
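Something along these lines might suffice as a first pass (a sketch only; it assumes each step exposes a transform() method that accepts and returns Stims, which the current API does not guarantee):

class ExtractorPipeline(object):

    # Sketch of a scikit-learn-style pipeline: every step but the last is
    # treated as a Stim -> Stim conversion, and the final step produces
    # the extraction result.
    def __init__(self, steps):
        self.steps = list(steps)

    def run(self, stim):
        for step in self.steps[:-1]:
            stim = step.transform(stim)
        return self.steps[-1].transform(stim)

Usage might then look like ExtractorPipeline([VideoToAudioConverter(), AudioToTextConverter(), DictionaryExtractor(...)]).run(video), though the step names here are purely illustrative.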
Some Extractors now create intermediate files en route to generating feature values. Since some of these are movies or images of the same dimensions as the original Stims, we could end up consuming a lot of memory. At some point we should add an economy config variable that determines how intermediate files are handled/stored. We'll then need to go over all existing Extractors and make sure they condition properly on that setting.
The Wit.ai API has stellar speech recognition, and has no strict rate limit. It would be great to add a feature extractor for it. It's supported by the SpeechRecognition package, so we could either wrap SR, or implement our own interface (see https://wit.ai/docs/http/20160330#get-intent-via-speech-link).
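If we go the SpeechRecognition route, the wrapper is almost trivial. The sketch below uses SpeechRecognition's existing recognize_wit call; the file name and API key are placeholders:

import speech_recognition as sr

WIT_AI_KEY = 'YOUR_WIT_AI_SERVER_ACCESS_TOKEN'   # placeholder

recognizer = sr.Recognizer()
with sr.AudioFile('speech.wav') as source:       # placeholder file
    audio = recognizer.record(source)

# recognize_wit() posts the audio to the Wit.ai HTTP API and returns the
# transcript (pass show_all=True to get the full JSON response instead).
text = recognizer.recognize_wit(audio, key=WIT_AI_KEY)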
from featurex.stims import VideoStim
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-2-fd7d648ab17b> in <module>()
----> 1 from featurex.stims import VideoStim
ImportError: No module named stims
The flattened output structure of the extractor does not contain an indicator that allows individual features to be bound to a particular face when multiple faces are detected. For every additional face, a set of new columns with identical names is added. It seems that column order cannot currently be used to infer where the set of features for an additional face begins.
It looks as if a per-face column name prefix could be a solution.
The current implementation only takes a single name of a model as input; we should be able to pass in, e.g., ['sentiment', 'emotion'], and have the extractor return features for all valid models.
A high priority (perhaps the highest?) for new extractors should be the Google Cloud Vision API and Cloud Speech API. These will probably deliver much better performance than most of the other APIs we currently interface with or are considering. We may want to consider creating a separate module just for Google APIs, since they share a common Python interface (the Google API Python client, which we can wrap).
At the moment, the graph API doesn't do anything to prevent a user from trying to run a full-length movie file through an image extractor, which could result in a very large number of queries (1 per frame) to an API extractor if users aren't careful. It might be a good idea to at minimum issue a warning when a large set of queries (e.g., > 100) to an API Extractor is detected, and possibly even require the user to set an explicit flag (e.g., large_jobs=True). Alternatively, we could disallow automatic VideoToImageStim conversion in cases where the resulting video frame set is very large.
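A rough sketch of the warning/flag logic (the threshold, the argument names, and where this check would live are all assumptions):

import warnings

def check_api_job_size(n_stims, large_jobs=False, threshold=100):
    # Warn (or require an explicit opt-in) before sending a large batch
    # of stimuli to a remote API Extractor.
    if n_stims > threshold and not large_jobs:
        warnings.warn(
            "About to send %d stimuli to an API Extractor; pass "
            "large_jobs=True to confirm this is intentional." % n_stims)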
While exploring the Google vision API I found that it makes a big difference if movie frames are cropped (freed of any horizontal bars) before labeling. Without cropping they get "Screenshot" labels, but after cropping more of the actual content is tagged.
At the moment the standard way to apply extractors to a Stim is via an .extract call to the Stim--e.g.,
stim = ImageStim('my_image.jpg')
extractors = [ExtractorA(), ExtractorB(), ExtractorC()]
stim.extract(extractors)
This allows multiple Extractors to be applied at once to a single Stim, but it would be useful to do multiple stims at once. Some kind of StimCollection container that implicitly loops over Stims might be worth adding. Thoughts?
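As a starting point, a StimCollection could be little more than a thin wrapper (a sketch only; it assumes the existing Stim.extract(extractors) interface shown above):

class StimCollection(object):

    # Minimal container that applies the same extractors to many Stims.
    def __init__(self, stims):
        self.stims = list(stims)

    def extract(self, extractors):
        # Delegate to each Stim's existing extract() call and return the
        # results in input order.
        return [stim.extract(extractors) for stim in self.stims]

E.g., StimCollection([ImageStim('a.jpg'), ImageStim('b.jpg')]).extract(extractors).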
Most of the Extractors haven't been updated yet to reflect the move from Value/Event/Timeline to a single ExtractorResult class. We should finish this ASAP.
Say we are passing an AudioStim through a LengthExtractor, which takes TextStim inputs. The implicit conversion will look for converters that go audio->text. However, most of the converters will instead have AudioStim->ComplexTextStim specified.
Should the implicit conversion also look for conversions to collection stimuli whose elements are of LengthExtractor's input type? Either way, it may be a good idea to put an element_type specification in all CollectionStimMixins.
Alternatively (and this is what we coincidentally have implemented now), we could just have converters specify AudioStim->TextStim (even though they actually output ComplexTextStim) and have the logic in transformers.py take over from there.
OpenCV (and/or its Python bindings) doesn't install properly on the travis env, so cv2-dependent tests fail.
FeatureX has a PredefinedDictionaryExtractor class that takes a block of text as input and returns values for each word. For example, via an affective norms database, one can get the valence and arousal of the words in one's text.
Adding new dictionaries is as simple as adding new JSON dictionaries to the dictionaries.json file bundled with the package. Any file added there can subsequently be used in the PredefinedDictionaryExtractor. Since there are potentially hundreds of usable and useful text feature dictionaries on the web, it would be great to expand the current list of supported resources.
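For orientation, usage looks roughly like the following; the import paths, the norms dictionary, and the variable names are illustrative guesses and may not match dictionaries.json exactly:

from featurex.stimuli import ComplexTextStim
from featurex.extractors import PredefinedDictionaryExtractor

# Load a block of text and look each word up in a bundled norms dictionary.
stim = ComplexTextStim('transcript.txt')          # placeholder file
ext = PredefinedDictionaryExtractor(['affect/V.Mean.Sum',
                                     'affect/A.Mean.Sum'])
result = ext.extract(stim)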
E.g., when converting AudioStim --> ComplexTextStim, there are several possible candidates. The get_converter method will get the first match, but we should have some way of specifying a default.
assert
Consider a situation where a user wants to take a VideoStim as input and apply the STFTExtractor (i.e., short-time Fourier transform) to the audio track. Currently, an exception will be raised, because the STFTExtractor only handles AudioStim inputs. However, since most movies have an audio track, featurex should be smart enough to attempt to automatically extract an AudioStim from a VideoStim and apply the audio extractor to the result (i.e., basically building an implicit graph) before it raises an exception. This isn't a high priority, but would be nice to have at some point.
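Roughly, the dispatch logic could fall back to a converter lookup before giving up. This is a sketch only; it assumes the get_converter helper discussed in the other issues, _input_type attributes on Extractors, and a convert() method on Converters:

def extract_with_implicit_conversion(extractor, stim):
    # If the Extractor already handles this Stim type, just run it.
    if isinstance(stim, extractor._input_type):
        return extractor.extract(stim)
    # Otherwise, look for a Converter from the Stim's type to the type
    # the Extractor expects (assumed to return None when none exists).
    converter = get_converter(type(stim), extractor._input_type)
    if converter is None:
        raise TypeError("Cannot convert %s to %s"
                        % (type(stim).__name__,
                           extractor._input_type.__name__))
    return extractor.extract(converter.convert(stim))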
I'm implementing A-weighting, which filters the audio timeseries, and I was thinking about differentiating filters and extractors. It seems almost wasteful to create an event for every frame in an audio stream, and filters seem like they'd be used to preprocess data rather than to generate timelines.
If filters are sufficiently different, they may merit another submodule along with extractors and stimuli.
Thoughts?
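If we do split them out, the contract could be as simple as "a Filter returns a modified Stim rather than a timeline of events." A sketch under that assumption (the stim.data attribute holding the raw audio samples is also assumed):

class Filter(object):

    # A Filter maps a Stim onto a modified Stim of the same type, rather
    # than producing a timeline of extracted events like an Extractor.
    def transform(self, stim):
        raise NotImplementedError


class AWeightingFilter(Filter):

    def transform(self, stim):
        # Placeholder body: a real implementation would apply the
        # A-weighting transfer function to the audio timeseries here.
        stim.data = self._weight(stim.data)
        return stim

    def _weight(self, samples):
        # Identity stand-in for the actual frequency weighting.
        return samples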
Wrap nltk's part-of-speech tagging and return a set of binary column features for, e.g., the universal part-of-speech tagset.
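A minimal sketch of the wrapping (these are real nltk calls; they require the punkt, averaged_perceptron_tagger, and universal_tagset NLTK data packages to be downloaded):

import nltk
import pandas as pd

text = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(text)

# Tag with the universal tagset (NOUN, VERB, ADJ, ...), then dummy-code
# the tags into one binary column per part of speech.
tags = nltk.pos_tag(tokens, tagset='universal')
df = pd.DataFrame(tags, columns=['word', 'pos'])
binary = pd.get_dummies(df, columns=['pos'])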
Currently ImageStims and VideoStims are loaded via opencv, which imposes an unnecessary (and difficult-to-install) dependency. OpenCV should only be imported when running extractors that depend on it; we should find an alternative solution for reading in stimuli. For images we could use scipy.misc.imread. Not sure about movies, but I think MoviePy might be the way to go.
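For example (scipy.misc.imread and MoviePy's VideoFileClip are real APIs; the file names are placeholders, and newer SciPy versions have since deprecated imread in favor of imageio):

from scipy.misc import imread
from moviepy.editor import VideoFileClip

# Read an image into a numpy array without touching OpenCV.
img_data = imread('my_image.jpg')

# MoviePy gives frame-level access to a video, again without OpenCV.
clip = VideoFileClip('my_movie.mp4')
first_frame = clip.get_frame(0)   # frame at t = 0 seconds, as a numpy array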
We need some testing code that iterates over all dictionaries listed in dictionaries.json and makes sure they're all still available and work properly.
This is a standing issue for tracking potential libraries and API services to wrap in pliers.
Multimodal/major APIs
Audio
Image
Language
For consistency and clarity, we should use _input_type and _output_type attributes to identify the expected types of all Stim inputs (and for Converters, the expected returned type).
When multiple Transformers are applied to a single Stim, the returned Value objects are nested, such that the keys in the top-level Value.data dict are Transformer names, and the values are other Value instances (whose data attribute is a normal dictionary of values). This is counter-intuitive and kind of horrendous. The returned top-level object should probably be either a plain dict, or some new container class (e.g., ValueList).
We can use pygraphviz or something.
For the Google extractors (and possibly other API extractors), we currently flatten the returned JSON object into a one-level dictionary. This makes life easy when working with pandas DFs, but users could potentially want direct access to the original result. This will require adding a new attribute to ExtractorResult, maybe called something like response, that can optionally be set when the instance is initialized.
Alternatively, we could have a generic metadata attribute on ExtractorResult that is itself a dictionary, which would allow different kinds of Extractors to set different kinds of metadata.
There will be a lot of overhead calling Converters repeatedly if implicit Stim conversion is required. We can address this by memoizing the conversion functions with joblib or something similar.
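A sketch of the joblib approach; the cache directory and the wrapped function are illustrative only, and the point is simply that deterministic, file-keyed conversions are easy to memoize to disk:

from joblib import Memory

memory = Memory('/tmp/featurex_cache', verbose=0)

@memory.cache
def extract_audio_track(video_filename):
    # Stand-in for a VideoToAudioConverter call: the result is computed
    # once per input file and read back from the cache afterwards.
    from moviepy.editor import VideoFileClip
    audio_path = video_filename + '.wav'
    VideoFileClip(video_filename).audio.write_audiofile(audio_path)
    return audio_path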
Durations in wide format data frames repeat if multiple values are extracted (e.g., from indico API). For srt file types, text is not provided.
Many of the APIs only work on images, but we want to process videos by passing in individual frames. To keep processing efficient (and costs low for paid services), we want to pass in as few frames as we can get away with. Rather than processing every Nth frame, we could take the diff between every two frames and identify frames where the scene changes to a significant degree. This could be a method implemented in VideoStim that could be called by any API-based extractor that loops over frames.
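Something like the following could serve as a first pass (the MoviePy calls are real; the sampling rate and difference threshold are arbitrary placeholders that would need tuning):

import numpy as np
from moviepy.editor import VideoFileClip


def key_frames(path, fps=1, threshold=30.0):
    # Sample the video at `fps` frames per second and keep a frame only
    # when it differs from the last kept frame by more than `threshold`
    # (mean absolute pixel difference).
    clip = VideoFileClip(path)
    kept, last = [], None
    for t, frame in clip.iter_frames(fps=fps, with_times=True):
        if last is None or np.mean(np.abs(frame.astype(float) - last)) > threshold:
            kept.append((t, frame))
            last = frame.astype(float)
    return kept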
Most of the existing feature extractors have no docstrings. Add them.
When I try to execute the nose tests, they fail with:
from .stimuli import VideoStim, AudioStim, TextStim, ImageStim
ImportError: No module named stimuli
This probably means that some modifications were not pushed to github yet.
Many data files useful for extraction/annotation can be repackaged under their current license. This is particularly true of word norms (e.g., frequency, emotional valence and intensity, etc.), which can be included in the package to make text feature extraction much more useful out of the box. Key data files should be bundled with the package (or maintained in a separate submodule).