alphalens's People

Contributors

a-campbell, dmichalowicz, eigenfoo, fawce, hereticsk, ivigamberdiev, jameschristopher, jimportico, jmccorriston, luca-s, michaeljmath, mmargenot, richafrank, timshawver, twiecki, vikram-narayan

alphalens's Issues

Documentation

  • Verify all arguments and types are correct in docstrings
  • README
    • Overview
    • link to your code & issue tracker
    • Frequently Asked Questions (FAQ)
    • How to get support
    • Information for people who want to contribute back
    • Installation instructions
    • explain input data types
    • Example tear sheet
    • project’s license
    • credits
  • Notebook
    • explain calculation of:
      • IC
      • alpha, alpha tstat, beta
      • IR
    • walk through various plots

Avoid grouping by day when calculating quantile mean returns

Because alphalens is data-frequency agnostic, the aggregation by day performed when calculating quantile mean returns does not always make sense. It may produce better plots in some scenarios, depending on the data frequency and time span, but not for every dataset passed to alphalens.

I suggest either removing the aggregation by day or making it optional. In the latter case the aggregation period should be configurable: minute, hour, day, week, and so on.

Also the feature should be applied to all the plots consistently (not only on some of them). Currently it is applied only to plotting.plot_quantile_returns_violin, plotting.plot_cumulative_returns_by_quantile and plotting.plot_mean_quantile_returns_spread_time_series.
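
A minimal sketch of what a configurable aggregation period might look like, assuming a DataFrame indexed by a DatetimeIndex named `date` with hypothetical `quantile` and `ret` columns. The function name and the `freq` parameter are illustrative only and are not part of alphalens:

```python
import pandas as pd


def mean_return_by_quantile_sketch(data, freq=None):
    """Mean return per quantile, optionally aggregated over a time period.

    freq=None keeps the native frequency of the data; otherwise freq is a
    pandas offset alias such as "D" (calendar day) or "W" (week).
    """
    if freq is None:
        keys = ["date", "quantile"]
    else:
        keys = [pd.Grouper(freq=freq), "quantile"]
    return data.groupby(keys)["ret"].mean()


# Hypothetical intraday data: two observations per day, a single quantile.
index = pd.DatetimeIndex(
    ["2016-01-04 10:00", "2016-01-04 11:00",
     "2016-01-05 10:00", "2016-01-05 11:00"],
    name="date",
)
data = pd.DataFrame(
    {"quantile": [1, 1, 1, 1], "ret": [0.01, 0.03, 0.02, 0.04]},
    index=index,
)

native = mean_return_by_quantile_sketch(data)           # one row per timestamp
daily = mean_return_by_quantile_sketch(data, freq="D")  # one row per day
```

Passing the same `freq` to every plotting helper would be one way to keep the behaviour consistent across plots.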

Return vs Signal distribution plot

While the "mean daily return by quantile" plot tells us how well the factor differentiates forward returns across the signal/factor values, the results shown depend strongly on the number of quantiles used. So it would be nice to add a distribution plot of returns vs. signal/factor, to see the relationship between returns and signal properly.

Just to give an idea of what I mean, here are some pictures.

[Attached images: kdist1, kdist3, regdist2]

Check that datetimes have the correct timezone

We spent way too long debugging an odd exception:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-59-34a94da3a0f6> in <module>()
      2 
      3 alphalens.tears.create_factor_tear_sheet(factor=signal['signal'],
----> 4                                          prices=pricing, quantiles=2, periods=(1,))
      5                                          #show_groupby_plots=True)

/usr/local/lib/python2.7/dist-packages/alphalens/plotting.pyc in call_w_context(*args, **kwargs)
     41                 # sns.set_style("whitegrid")
     42                 sns.despine(left=True)
---> 43                 return func(*args, **kwargs)
     44         else:
     45             return func(*args, **kwargs)

/usr/local/lib/python2.7/dist-packages/alphalens/tears.pyc in create_factor_tear_sheet(factor, prices, groupby, show_groupby_plots, periods, quantiles, filter_zscore, groupby_labels, long_short, avgretplot, turnover_for_all_periods)
    117     mean_monthly_ic = perf.mean_information_coefficient(factor,
    118                                                         forward_returns,
--> 119                                                         by_time="M")
    120 
    121     factor_returns = perf.factor_returns(factor, forward_returns, long_short)

/usr/local/lib/python2.7/dist-packages/alphalens/performance.pyc in mean_information_coefficient(factor, forward_returns, group_adjust, by_time, by_group)
    133 
    134     else:
--> 135         ic = (ic.reset_index().set_index('date').groupby(grouper).mean())
    136 
    137     ic.columns = pd.Int64Index(ic.columns)

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in set_index(self, keys, drop, append, inplace, verify_integrity)
   2835                 names.append(None)
   2836             else:
-> 2837                 level = frame[col]._values
   2838                 names.append(col)
   2839                 if drop:

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1995             return self._getitem_multilevel(key)
   1996         else:
-> 1997             return self._getitem_column(key)
   1998 
   1999     def _getitem_column(self, key):

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   2002         # get column
   2003         if self.columns.is_unique:
-> 2004             return self._get_item_cache(key)
   2005 
   2006         # duplicate columns & possible reduce dimensionality

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1348         res = cache.get(item)
   1349         if res is None:
-> 1350             values = self._data.get(item)
   1351             res = self._box_item_values(item, values)
   1352             cache[item] = res

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in get(self, item, fastpath)
   3288 
   3289             if not isnull(item):
-> 3290                 loc = self.items.get_loc(item)
   3291             else:
   3292                 indexer = np.arange(len(self.items))[isnull(self.items)]

/usr/local/lib/python2.7/dist-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
   1945                 return self._engine.get_loc(key)
   1946             except KeyError:
-> 1947                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   1948 
   1949         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)()

KeyError: 'date'

It turned out that the factor index we passed was not UTC-localized. Since we only support daily data, the timezone does not matter in the first place, so we should just convert the index to UTC ourselves.
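
A minimal sketch of the kind of normalization being proposed, assuming the dates live in a pandas DatetimeIndex. The helper name `ensure_utc` is hypothetical and not an alphalens function:

```python
import pandas as pd


def ensure_utc(index):
    """Return the DatetimeIndex expressed in UTC.

    Naive timestamps are labelled as UTC; timezone-aware timestamps are
    converted, which keeps the underlying instant unchanged.
    """
    if index.tz is None:
        return index.tz_localize("UTC")
    return index.tz_convert("UTC")


naive = pd.date_range("2016-01-04", periods=3, freq="D")
new_york = naive.tz_localize("America/New_York")

naive_as_utc = ensure_utc(naive)
new_york_as_utc = ensure_utc(new_york)
```

Raising a clear error instead of silently converting would be an equally reasonable design; either avoids the opaque KeyError.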

Top and Bottom Quantile Daily Turnover for each day/period

[Attached image: Top and Bottom Quantile Daily Turnover plot]

The plot shows only the daily turnover, but it would be more interesting to plot the turnover after X days, for each X in 'days' (the create_factor_tear_sheet argument). The suggested behaviour is a superset of the current one, because a user can add '1' to the 'days' argument to get the current behaviour.
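
To make the idea concrete, here is a rough sketch of turnover measured against the membership `period` observations earlier. It assumes a Series of quantile labels with a (`date`, `asset`) MultiIndex; the function name and data are hypothetical, and this is not necessarily how alphalens computes turnover:

```python
import pandas as pd


def quantile_turnover_sketch(quantile_factor, quantile, period=1):
    """Share of names in `quantile` that were not in it `period` dates earlier."""
    selected = quantile_factor[quantile_factor == quantile]
    names = {
        date: set(group.index.get_level_values("asset"))
        for date, group in selected.groupby(level="date")
    }
    dates = sorted(names)
    turnover = {
        dates[i]: len(names[dates[i]] - names[dates[i - period]]) / len(names[dates[i]])
        for i in range(period, len(dates))
    }
    return pd.Series(turnover, dtype=float)


dates = pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06"])
index = pd.MultiIndex.from_product([dates, list("ABCD")], names=["date", "asset"])
# Top quantile (2) holds {A, B}, then {A, C}, then {C, D}.
quantile_factor = pd.Series([2, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2], index=index)

one_day = quantile_turnover_sketch(quantile_factor, quantile=2, period=1)
two_day = quantile_turnover_sketch(quantile_factor, quantile=2, period=2)
```

Calling the function once per entry in 'days' would give one turnover series per period.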

Cannot install alphalens via pip

When I use pip to install alphalens, it shows

Could not find a version that satisfies the requirement alphalens (from versions: )
No matching distribution found for alphalens

Average cumulative quantile return plot

The choice of which forward days/periods to analyse with create_factor_tear_sheet is somewhat arbitrary, in my opinion. It would be useful to plot the average cumulative return for each quantile, with standard deviation, over a configurable period of time (this is very similar to the plot you get when running an event study). That would help in understanding the average performance of each quantile and in deciding which days/periods to investigate with the tear sheet.

Please see attached images for a better understanding of what I mean.

[Attached images: cumret1, cumret2]

fix namespace conflicts for `factor_returns`

We also need to fix the bug in the factor_alpha_beta function: it takes factor_returns as an argument, but that name is the same as the function factor_returns. So if factor_returns = None, it raises an error saying that None cannot be called.
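
The shadowing problem can be reproduced in isolation. The functions below are simplified stand-ins with hypothetical bodies, not the real alphalens code; renaming the parameter is just one possible fix:

```python
def factor_returns(factor):
    """Stand-in for the real computation: simply doubles each value."""
    return [2 * value for value in factor]


def alpha_beta_shadowed(factor, factor_returns=None):
    # The parameter hides the module-level function of the same name,
    # so this call fails with "'NoneType' object is not callable".
    if factor_returns is None:
        factor_returns = factor_returns(factor)
    return factor_returns


def alpha_beta_renamed(factor, returns=None):
    # With a distinct parameter name the module-level function stays reachable.
    if returns is None:
        returns = factor_returns(factor)
    return returns
```

Importing the module under a qualified name (for example `perf.factor_returns`) would avoid the clash as well, without changing the public parameter name.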

use gridspec to lay out plots?

pyfolio uses gridspec to pull tear sheets into a single fig. Makes tweaking the layout and spacing a little easier. Worth it?

WIP AlphaLens Review Notes

General Notes:

  • What is AlphaLens' convention for symbol identifiers going to be? I continue
    to be of the opinion that using ticker symbols is broken by design in our
    domain. If we just want to use arbitrary, guaranteed-unique identifiers, I
    think that's fine, but if we do that we need to be careful to not use
    string-specific methods when working with asset indices.
  • What is AlphaLens' convention for price adjustments going to be? All the
    algorithms, as far as I can tell, are assuming prices that are adjusted
    relative to the end date of the alpha period. Many users won't have an easy
    way to get data in that format (it's currently quite difficult to get such
    data on quantopian for example.) I'd expect, moreover, that it's often
    easier for users to provide adjusted forward returns than it is for them to
    provide the adjusted prices that yield those returns, but there's no way for
    me to run a tearsheet using my own forward returns. A fix for this would be
    to make create_factor_tear_sheet accept the same data format as all the
    plotting functions, and require that the user call format_input_data (see
    notes below for thoughts on that function's name) if they want to convert
    wide-form data into narrow-form data.
  • There are a bunch of notes scattered across various functions on the
    importance of correctly aligning your pricing data with your factor
    calculations, but nowhere that I can find is it clearly specified what the
    expectations are for how factor and pricing data should be represented.
  • There's at least one place where we're including vendor-specific data in
    Alphalens. Both for licensing reasons, and in the interest of ensuring
    generality to all users, I think we probably should avoid providing
    vendor-specific data.
  • There don't seem to be any tests for the top-level plotting functions that
    actually invoke the calculation functions. It would be nice if we had
    coverage of the functions we actually expect users to invoke. This isn't an API
    issue, but it makes me nervous about the reliability and correctness of the
    plotting functions as we make changes.
  • There are a bunch of computation functions
    (e.g. compute_mean_returns_spread and mean_return_by_quantile) that
    conditionally return different numbers of values based on boolean parameters.
    Almost all the call-sites for these are broken because they're
    unconditionally unpacking the result into a fixed number of values.
    Generally speaking, I'd recommend against functions that return different
    numbers of values, precisely because they make call sites much more
    complex. In most of these cases, we only ever test in the mode that computes
    the maximum number of values, so I'd recommend just making that always the
    behavior instead of having buggy optional behavior.
  • We have a few parameter names that refer to values of different types in different functions (e.g. std_err, which is sometimes a bool and sometimes a series.)
  • We're missing test coverage for several important, complex functions:
    • There's no test coverage whatsoever for tears.py, despite the fact that
      we do a substantial amount of meaningful, tricky computation there.
    • mean_information_coefficient is uncovered when len(grouper) == 0.
    • factor_returns is uncovered when long_short is False.
    • factor_alpha_beta is uncovered when factor_daily_returns is None. (Do
      we need that case at all?).
    • compute_forward_returns is uncovered when filter_zscore isn't passed
      (do we need that case? All the docstrings warn about it being unsafe, but
      it's the only thing we ever use apparently?).
    • There's no coverage at all for mean_return_by_quantile, or
      compute_mean_returns_spread, or format_input_data. (And there are
      currently bugs in the only call-sites of compute_mean_returns_spread and
      mean_return_by_quantile: See Below).
  • Most of the docstrings have typos of varying degrees of severity.

API Notes

tears.py

  • from plotting import *: Please don't do this. It immediately breaks any
    intelligent editor's ability to tell you where names come from, and it
    prevents any linting tool from telling you about serious errors like
    undefined variable names.

create_factor_tear_sheet

Most of the comments here apply to the parameters with the same names elsewhere:

  • The docstring descriptions here could be clearer about what the content of
    the expected data is. Should factor/prices/sectors be floats? Strings?
    Ints? This comment applies to most of the other API functions.
  • sector_plots: I would probably call this show_sector_plots to be clearer
    that this toggles whether or not to show them (as opposed to, say,
    controlling the number of sector plots).
  • Why can filter_zscore only be an int? It seems like this would be
    perfectly fine as a float. This is also noted as incorporating lookahead
    bias. When is it safe for a user to supply this parameter? If it's not
    safe, why does it exist? Also, filtering away outliers after computing
    zscore still has the effect of compressing the z-scores of all our other data
    observations. Are we okay with that distortion?
  • pedantic comment: days is annotated as list, but the default is a tuple
    (the tuple default is reasonable). The correct annotation is probably
    something like sequence[int]. More seriously, if I pass in my own list,
    this function mutates that list in place! We should make a copy if we
    need to change input data in place.
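
A small sketch of the defensive copy suggested above. The helper name and the specific adjustment (making sure the 1-day period is present) are hypothetical, chosen only to show a change that would otherwise leak into the caller's list:

```python
def normalize_periods(days=(1, 5, 10)):
    """Return a new sorted list of periods that always includes 1."""
    periods = list(days)  # copy, so the caller's object is never touched
    if 1 not in periods:
        periods.append(1)
    periods.sort()
    return periods


user_days = [10, 5]
periods = normalize_periods(user_days)
```

Because `list(days)` accepts any iterable, the same code works for the tuple default and for user-supplied lists.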

performance.py

Several functions in this file accept an argument named std_err. Sometimes
it's a bool indicating whether an optional value should be computed. Other
times, it's a Series of computed values. Most of the boolean invocations are
subtly broken at their only call-sites, so I'd argue we should just remove
them. If we decide they're really important, I'd rename the bools to something
like compute_std_err, to be clearer that it's a toggle and not the computed
values themselves.

factor_alpha_beta

  • I'm a little surprised this doesn't have a per-sector mode. Is sector
    exposure of factors not something we're generally interested in
    plotting/knowing? Is that already covered somewhere else?

quantize_factor

  • This currently returns floats. It should return ints or categoricals.
  • "A list of equities and their factor values indexed by date.": list means a
    specific datastructure in Python. Especially in documentation of a parameter
    to a function, you don't want to say list when it's not what you actually
    mean. This appears in a bunch of places:
    alphalens/performance.py:144: A list of equities and their factor values indexed by date.
    alphalens/performance.py:182: A list of equities and their factor values indexed by date.
    alphalens/performance.py:230: A list of equities and their factor values indexed by date.
    alphalens/performance.py:266: A list of equities and their N day forward returns where each column contains the N day forward returns.
    alphalens/utils.py:154: A list of equities and their factor values indexed by date.
    alphalens/utils.py:163: A list of equities and their sectors.
    alphalens/utils.py:178: A list of equities and their factor values indexed by date,
    docs/source/conf.py:99: # A list of ignored prefixes for module index sorting.
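
As an illustration of the first point, quantile labels can be produced as integers (or categoricals) with pandas. This is a sketch under the assumption that the input contains no NaN values; `quantize_sketch` is a hypothetical name, not the existing quantize_factor:

```python
import pandas as pd


def quantize_sketch(values, quantiles=5):
    """Assign each value to a quantile bucket labelled 1..quantiles as integers."""
    buckets = pd.qcut(values, quantiles, labels=False) + 1
    return buckets.astype("int64")


factor = pd.Series(
    [0.1, 0.4, 0.2, 0.9, 0.5, 0.3, 0.8, 0.7],
    index=list("ABCDEFGH"),
)
as_int = quantize_sketch(factor, quantiles=4)
as_category = as_int.astype("category")
```

Integer or categorical labels make grouping and equality checks exact, which float labels do not guarantee.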

factor_rank_autocorrelation

  • This is parameterized on a time rule, but the docstring explicitly talks
    about weeks.

compute_mean_returns_spread

  • This sometimes returns a single Series and sometimes a pair of Series, but
    the only call site is in tears.py, where we do:

        mean_ret_spread_quant, std_spread_quant = perf.compute_mean_returns_spread(
            mean_ret_quant_daily, quantiles, 1, std_err=std_quant_daily)

    This will crash if we ever don't pass std_err.
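
The failure mode, and the fixed-arity alternative recommended in the general notes, can be sketched with simplified stand-ins. Both function names and the pass-through handling of `std_err` are hypothetical, not the real implementation:

```python
import pandas as pd


def spread_variable(upper, lower, std_err=None):
    """Problematic shape: returns one Series or a pair, depending on the arguments."""
    spread = upper - lower
    if std_err is None:
        return spread
    return spread, std_err


def spread_fixed(upper, lower, std_err=None):
    """Always returns (spread, std_err); std_err is None when not supplied."""
    return upper - lower, std_err


upper = pd.Series([0.03, 0.02, 0.04])
lower = pd.Series([0.01, 0.01, 0.01])

# Safe regardless of whether std_err was passed:
spread, err = spread_fixed(upper, lower)

# By contrast, `a, b = spread_variable(upper, lower)` tries to unpack the
# three-element Series itself and raises ValueError.
```

Note that with a two-element Series the variable-arity version would unpack silently into two scalars, which is arguably worse than crashing.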

mean_return_by_quantile

  • This has the same bug as compute_mean_returns_spread: we're always
    unpacking at call-sites, but this conditionally returns a tuple.

utils.py

format_input_data

  • The name here is pretty generic. Is there a clearer description of what this
    does?

Example Notebook Line Notes:

  • typo: "insturments"
  • typo: "is not nessisarily important"
  • run-on sentence: "Trading algorithms cover execution and risk constraints,
    the business of turning predictions into profits." (I'd change the comma to a
    colon.)
  • reference to qfactor: "qfactor does not contain analyses of things like..."
  • ticker_sector = {...}: it'd be nice if there was some context/explanation
    for what this giant dict is (e.g. "This is an example of a constant sector
    mapping ...").
  • from pandas.io.data import DataReader: pandas.io.data is deprecated. We
    should use pandas_datareader.
  • for ticker in ticker_sector.keys(): we don't need .keys() here.
  • "The pricing data passed to alphalens should reflect the next available price
    after a factor value was observed at a given timestamp." What does it mean
    for "a factor value to be observed at a given timestamp"? This seems both
    important and tricky. We should try to be very clear, both in our
    documentation and in our internal usage, exactly what it means to label an
    observation with a given day. This will also get much harder if we ever
    start caring about non-NYSE calendars, or if we want to be able to ingest and
    analyze non-daily data: are either of those things in the near-term roadmap?
  • typo: "how our factor looks accross"
  • reference to qfactor: "you'll need to pass qfactor a sector mapping"
  • wording: "This mapping can come in the form of a MultiIndex Series" should be
    "MultiIndexed"
  • reference to qfactor: " you may also pass qfactor a dict of sector names"
  • wording: "Returns analysis gives us a very raw and real description of a
    factor's value." I'm not sure what this means.
  • typo: "we can get an idea about the consitency ..."
  • wording: "Turnover Analysis gives us an idea about the dynamic nature of a
    factor's make up." This also feels too vague to be useful. Is there
    something more precise we can say here?
  • wording (nitpick): "Factor turnover is super important ..." 'super' is
    suddenly very informal relative to the tone of the rest of the tutorial.

factor_alpha_beta forgot to return t-value

The documentation of factor_alpha_beta says that it should return the t-value, but the function does not include the t-value in the returned DataFrame, even though it actually computes it from the OLS fit. Was adding the new row forgotten, or is it not ready to commit?

wrong forward daily returns ?

I noticed that the daily forward returns are calculated as the forward returns divided by the forward period (see utils.compute_forward_returns).

I don't fully understand the reason, so I might be wrong, but shouldn't the daily returns be calculated as:

# let's pretend those are the variables
# 'days' is our period
# 'fwd_ret' is the return after the period

# the current implementation does something like the following (what does this value mean?)
daily_returns = fwd_ret / days

# while I am suggesting to calculate the daily returns as the value the returns would have
# had every single day for the whole period if they had grown at a steady rate
daily_returns = (fwd_ret+1)**(1./days) - 1
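
A runnable version of the comparison, using made-up numbers (a 10% return over 5 days). It shows that only the geometric rate compounds back to the original period return:

```python
fwd_ret = 0.10  # hypothetical return over the whole period
days = 5

arithmetic = fwd_ret / days                    # current approach
geometric = (1 + fwd_ret) ** (1.0 / days) - 1  # suggested approach

# Compound each daily rate back over the full period.
recovered_arithmetic = (1 + arithmetic) ** days - 1
recovered_geometric = (1 + geometric) ** days - 1

print("arithmetic daily rate:", arithmetic, "compounds to", recovered_arithmetic)
print("geometric daily rate: ", geometric, "compounds to", recovered_geometric)
```

The arithmetic rate overstates the period return once compounded, and the gap grows with the size of the return and the length of the period.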

set a fixed y axis for the IC hist plots

When looking at the IC histogram plots, it seems that each of them is about the same height and thus contains the same number of occurrences. On closer inspection, they all have differently scaled y axes, which could be misleading.

Make copy of passed data

I just noticed that we rename the levels in place. We should create a copy first so that we do not change the user's data.
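
One way to do this, sketched under the assumption that the input is a Series with a two-level MultiIndex. The helper name and the target level names are hypothetical:

```python
import pandas as pd


def with_standard_level_names(factor):
    """Return a renamed copy; the caller's Series keeps its own index names."""
    factor = factor.copy()
    factor.index = factor.index.set_names(["date", "asset"])
    return factor


index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2016-01-04", "2016-01-05"]), ["A", "B"]],
    names=["day", "ticker"],
)
user_factor = pd.Series([0.1, 0.2, 0.3, 0.4], index=index)

renamed = with_standard_level_names(user_factor)
```

`set_names` returns a new index by default, so the original index object is left untouched.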

Turnover plot for each quantile

It would be great to have an option to plot the turnover for all quantiles, similar to "Top and Bottom Quantile Daily Turnover" plot but where each quantile has a separate plot.

Making forward returns "demeaning" optional

The forward returns used for plotting are converted to returns relative to the mean. While this is a very clever choice (the code is a constant surprise of many smart ideas, well done!), it implies a long-short portfolio, and that is not always the case. One might be interested to see whether their signal can be profitable in a long-only (or short-only) portfolio over a particular time period. So why not make the demeaning optional?
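
A sketch of what an optional demeaning step could look like, assuming forward returns in a Series indexed by (`date`, `asset`). The function name and the `demeaned` flag are hypothetical:

```python
import pandas as pd


def relative_returns_sketch(forward_returns, demeaned=True):
    """Optionally subtract each date's cross-sectional mean from the returns."""
    if not demeaned:
        return forward_returns.copy()
    daily_mean = forward_returns.groupby(level="date").transform("mean")
    return forward_returns - daily_mean


index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2016-01-04", "2016-01-05"]), ["A", "B"]],
    names=["date", "asset"],
)
forward_returns = pd.Series([0.01, 0.03, -0.02, 0.02], index=index)

long_short_view = relative_returns_sketch(forward_returns)                 # demeaned
long_only_view = relative_returns_sketch(forward_returns, demeaned=False)  # raw
```
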

Alphalens is such a great Swiss Army knife, so why use only the blade? :D

Many thanks!

choose a color for the data, and another color for the rolling mean?

Right now, for the IC plots we use the bluish color for the data and green for the rolling mean. In the returns plots we use green for the data and orange for the rolling mean. This could be confusing (it confused me), but I understand that we need to differentiate the two plots and that colors are one way to do that. Perhaps we could just avoid overlapping colors.

BUG: change of quantile mean returns value when adding/removing periods

Quantile mean return values change when adding a period (or removing one). Adding/removing a period shouldn't influence the other ones.

Here is part of alphalens output. I ran it two times on the same input, but the second time I added a period of '20'. You can see that in the second output the periods that were already present in the first run change their statistics. It shouldn't happen.

Run 1: periods = (1, 5, 10)
Returns Analysis
                                                      1       5      10
Ann. alpha                                        0.112   0.153   0.232
beta                                             -0.076   0.271   0.260
Mean Period Wise Return Top Quantile (bps)        3.671   4.473   3.340
Mean Period Wise Return Bottom Quantile (bps)    -5.745  -6.145  -7.425
Mean Period Wise Spread (bps)                     9.416  10.618  10.764

Run 2: periods = (1, 5, 10, 20)
Returns Analysis
                                                      1       5      10      20
Ann. alpha                                        0.112   0.153   0.233   0.287
beta                                             -0.072   0.273   0.261  -0.935
Mean Period Wise Return Top Quantile (bps)        3.653   4.492   3.371   3.023
Mean Period Wise Return Bottom Quantile (bps)    -5.744  -6.133  -7.411  -8.493
Mean Period Wise Spread (bps)                     9.400  10.622  10.781  11.516

summary stat table

  • autocorrelation of factor ranks
  • avg monthly spread
  • avg turnover
  • avg daily factor performance (returns)
  • avg monthly factor performance (returns)
  • IC data

Wrong mean return by Quantile

Here are two graphs that I got running create_factor_tear_sheet with long_short=False (I am not yet sure if this parameter matters).

If you look at quantile 2, the "Mean return by Factor Quantile" graph shows negative mean returns for period 10 and 20, while the "Average cumulative returns by Quantile" graph shows positive mean returns at periods 10 and 20.

I am not sure where the bug is, but it seems there is one.

[Attached image: bug]

Factor Rank Autocorrelation for each day/period

[Attached image: Factor Rank Autocorrelation plot]

The plot shows only the daily autocorrelation, but it would be more interesting to plot the autocorrelation after X days, for each X in 'days' (the create_factor_tear_sheet argument). The suggested behaviour is a superset of the current one, because a user can add '1' to the 'days' argument to get the current behaviour.
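
A rough sketch of rank autocorrelation at an arbitrary lag, assuming a factor Series indexed by (`date`, `asset`) with the same assets on every date. The function name is hypothetical and this is not the existing factor_rank_autocorrelation:

```python
import pandas as pd


def rank_autocorrelation_sketch(factor, period=1):
    """Correlation between each date's factor ranks and the ranks `period` dates earlier."""
    ranks = factor.groupby(level="date").rank()
    wide = ranks.unstack("asset")  # rows: dates, columns: assets
    values = {
        wide.index[i]: wide.iloc[i].corr(wide.iloc[i - period])
        for i in range(period, len(wide))
    }
    return pd.Series(values, dtype=float)


dates = pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06"])
index = pd.MultiIndex.from_product([dates, ["A", "B", "C"]], names=["date", "asset"])
# Ranks are identical on the first two dates and fully reversed on the third.
factor = pd.Series([1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0], index=index)

lag_1 = rank_autocorrelation_sketch(factor, period=1)
lag_2 = rank_autocorrelation_sketch(factor, period=2)
```
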

Cumulative Return by Quantile for each day/period

[Attached image: Cumulative Return by Quantile plot]

Only the 1 day forward returns graph is plotted. I expected to see plots for each day/period forward return (the 'days' argument passed to create_factor_tear_sheet).
I might have a factor that shows alpha only after X or Z days and I might not be interested to see 1 day forward returns plots, so I would pass days = (X, Z) to create_factor_tear_sheet and I would expect to see the plots for X and Z forward returns. At least it would be nice to have this feature as an option if not the default behaviour.

ENH: quantile vs equal interval bins

While using quantiles is the most straightforward choice there are use cases where it would be much more reasonable to classify the factor values using equal interval bins.

An example is the common scenario where a factor produces positive and negative values (e.g. in the range -1 to 1) and alphalens should be able to calculate statistics on two distinct groups: the positive values and the negative values. Currently this is not possible unless the positive and negative values are equally balanced (50% positive and 50% negative).

So I am proposing to have two ways of classifying the factor values:

  • Quantiles (current implementation): this method classifies factor values into a certain number of categories with an equal number of units in each category. This produces bins with the same number of elements but with different value ranges.
  • Equal-interval bins: the range of factor values is divided equally into however many categories have been chosen (e.g. the range -1 to 1 and 2 categories would result in one category covering -1 to 0 and a second covering 0 to 1). This second option would produce bins with a varying number of elements, but with the same value span.
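
The difference between the two schemes can be seen with pandas' `qcut` and `cut`. The data below is made up, and the explicit bin edges assume the factor is known to lie in the range -1 to 1:

```python
import pandas as pd

factor = pd.Series([-0.9, -0.1, 0.1, 0.2, 0.3, 0.8])

# Quantiles: equal number of elements per bin, value ranges differ.
by_quantile = pd.qcut(factor, 2, labels=False) + 1

# Equal-interval bins: equal value spans, element counts differ.
by_interval = pd.cut(factor, bins=[-1.0, 0.0, 1.0], labels=False) + 1

print(by_quantile.value_counts().sort_index())
print(by_interval.value_counts().sort_index())
```

With unbalanced data like this, only the equal-interval version separates negative from positive factor values.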

Factor Weighted Cumulative Return plot for each day/period

[Attached image: Factor Weighted Cumulative Return plot]

Only the 1 day forward returns graph is plotted in 'Factor Weighted Long/Short Portfolio Cumulative Return'. I expected to see plots for each day/period forward return (the 'days' argument passed to create_factor_tear_sheet).
I might have a factor that shows alpha only after X or Y days and I might not be interested to see 1 day forward returns plots, so I would pass days = (X, Y) to create_factor_tear_sheet and I would expect to see the plots for X and Y forward returns. At least it would be nice to have this feature as an option if not the default behaviour.
