robertmartin8 / machinelearningstocks
Using python and scikit-learn to make stock predictions
License: MIT License
Although I have written some unit tests, right now they really only check the datasets and some of the helper functions.
I would much appreciate any help in testing this project's core functionality.
When I first built this project, I retrieved my stock data from Quandl. However, this was about 2-3 years ago, so the API has changed considerably. As such, I strongly suspect that this code is broken (though I haven't tested it lately).
In any case, Quandl may not be the best datasource: I am keen to instead use pandas-datareader with the yahoo-finance fix (https://github.com/ranaroussi/fix-yahoo-finance).
Hi Robert, I noticed the intraQuarter/_KeyStats/ data is used, but I cannot find it. Could you tell me where it is or
how I can get it? Thanks so much!
Hello, I luckily ended up on your project as I'm looking at scraping data from Yahoo Finance for a list of quotes (not only S&P500). I was wondering if there was a way to get a part of your script adapted to my needs?
i.e. I've got a list of quotes available in a .txt file. I currently use the YahooFinancials python api but I realised that some key figures are missing, such as "Cash, Debt, Levered free cash flow...etc".
So far, I'm collecting the data using that custom python script and then dump as a JSON file.
Would you be able to help me? Thanks :)
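A minimal sketch of the workflow described in this request, assuming the .txt file holds one ticker per line. `fetch_key_stats` is a hypothetical placeholder for whatever scraper or API client supplies the figures; it is not part of this project or of YahooFinancials.

```python
import json
from pathlib import Path


def read_tickers(path):
    """Read one ticker per line, skipping blank lines and '#' comments."""
    tickers = []
    for line in Path(path).read_text().splitlines():
        symbol = line.strip().upper()
        if symbol and not symbol.startswith("#"):
            tickers.append(symbol)
    return tickers


def collect_to_json(tickers, fetch_key_stats, out_path):
    """Call fetch_key_stats (hypothetical) per ticker and dump everything as JSON."""
    results = {}
    for ticker in tickers:
        try:
            results[ticker] = fetch_key_stats(ticker)
        except Exception as exc:
            # Record the failure instead of aborting the whole run.
            results[ticker] = {"error": str(exc)}
    Path(out_path).write_text(json.dumps(results, indent=2))
    return results
```

Recording failures per ticker keeps one missing quote from stopping the whole batch, which matters when the list goes beyond the S&P 500.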
Hi Robert,
This is another fine project together with your excellent portfolio optimisation work.
I hope you don't mind me asking a question as someone inexperienced in this area regarding the amount of fundamental data required to produce a viable model.
My local exchange is London (LSE/FTSE) and getting historic fundamentals is hard. I am able to extract these day by day but it will take some time to produce a significant amount.
So I was wondering, how many days would I need to have processed for a viable classification model? Say 6 months, 3 months etc. I have many fields but at present this goes back a week.
Thank you in advance
Fig
could not successfully import "from utils import data_string_to_float", seemed to work after replacing with:
import utils
Hello, thank you for all the great work done in this repo. I would suggest that you use the finance library that gets the data from Yahoo Finance fairly easily and is much faster and more accommodating than pandas_datareader. It also has a load of other functions that might make your life easier. I would love to contribute to this repo.
tests/test_datasets.py::test_forward_sample_dimensions PASSED [ 11%]
tests/test_datasets.py::test_forward_sample_data PASSED [ 22%]
tests/test_datasets.py::test_stock_prices_dataset PASSED [ 33%]
tests/test_datasets.py::test_stock_prediction_dataset PASSED [ 44%]
tests/test_utils.py::test_status_calc PASSED [ 55%]
tests/test_utils.py::test_data_string_to_float PASSED [ 66%]
tests/test_variables.py::test_statspath PASSED [ 77%]
tests/test_variables.py::test_features_same FAILED [ 88%]
tests/test_variables.py::test_outperformance PASSED [100%]
=================================================== FAILURES ====================================================
______________________________________________ test_features_same _______________________________________________
def test_features_same():
# There are only four differences (intentionally)
assert set(parsing_keystats.features) - set(current_data.features) == {'Net Income Avl to Common', 'Qtrly Earnings Growth',
'Qtrly Revenue Growth', 'Shares Short (as of',
'Shares Short (prior month)'}
E AssertionError: assert {'Net Income ...Short (as of'} == {'Net Income A...prior month)'}
E Extra items in the right set:
E 'Shares Short (prior month)'
E Full diff:
E {'Net Income Avl to Common',
E 'Qtrly Earnings Growth',
E 'Qtrly Revenue Growth',
E - 'Shares Short (as of'}...
E
E ...Full output truncated (5 lines hidden), use '-vv' to show
tests/test_variables.py:17: AssertionError
====================================== 1 failed, 8 passed in 11.06 seconds ======================================
Was really impressed with this file. Most just look at historical price, but love what you have done with including a deeper assessment of the company.
I live in Bangkok and follow the Thai market. I can already get Thai historical prices, but how do I change S&P500 to SET? (In Yahoo I thought it would be SET.bk, as stocks are .bk, but that does not work.)
Thanks
In the interests of stability and best practice, it would probably be a good idea to add some basic unit tests.
Only problem is I have to find a way to write meaningful tests without actually having to download all of the data etc each time I run a test.
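One possible way to test without downloading anything is to pass the download step in as a parameter and replace it with canned data in the test. The sketch below assumes this approach; `download_prices` and `latest_close` are hypothetical names, not functions from this project, and the prices are made-up sample values.

```python
from unittest import mock


def download_prices(ticker):
    """Hypothetical stand-in for a slow network download."""
    raise RuntimeError("network access is not available in tests")


def latest_close(ticker, downloader=None):
    """Return the most recent closing price from the downloaded rows."""
    downloader = downloader or download_prices
    rows = downloader(ticker)
    return rows[-1]["close"]


def test_latest_close_uses_canned_data():
    # Made-up sample rows; no real download takes place.
    canned = [
        {"date": "2017-01-03", "close": 100.0},
        {"date": "2017-01-04", "close": 101.5},
    ]
    fake = mock.Mock(return_value=canned)
    assert latest_close("SPY", downloader=fake) == 101.5
    fake.assert_called_once_with("SPY")
```

A small fixture file checked into the repo could play the same role as `canned` for tests that need realistic column layouts.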
Hello. Is it possible to integrate this with NASDAQ/IQ Option?
Hi! I just cloned your project and am messing around with it. Though I am an experienced software engineer, I am new to machine learning so feel free to tell me my insights are incorrect!
After reading the code I noticed that the prediction modeling relies heavily on the KeyStats, yet that data is extremely limited. Would it not be SUPER beneficial to backfill this data with a record per quarter? (The provided data is very erratic, yet most 'feature' data points are provided by the company every quarter.)
In addition to this, a cron job or a simple get_missing_quartly_keystats.py script that can be invoked on demand to fill in new stats would support the project's longevity, make its modeling more accurate (more data sets), and bring it closer to becoming a practical live-use tool.
Most of the historical quarterly feature data points can be found directly or through calculations on https://www.macrotrends.net/. Example: https://www.macrotrends.net/stocks/charts/GNW/genworth-financial/financial-statements
There are many categories with subcategories that can most likely be scraped and parsed. For example, the full historical market cap chart served here: https://www.macrotrends.net/stocks/charts/GNW/genworth-financial/market-cap can be parsed out, as the HTML contains a <script> tag that defines var chartData with all the values by date.
Between the balance sheets and financial records they provide, you may even find other influential data points to add to the ML portion of this script.
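A minimal sketch of the parsing idea above, assuming the page embeds the series as a JSON array assigned to `var chartData` inside a script tag. The sample HTML and its field names (`date`, `v1`) are made up for illustration; the real page layout may differ and would need checking.

```python
import json
import re

# Capture the array literal assigned to `var chartData`.
CHART_DATA_RE = re.compile(r"var\s+chartData\s*=\s*(\[.*?\])\s*;", re.DOTALL)


def extract_chart_data(html):
    """Return the chartData array as a list of dicts, or [] if it is absent."""
    match = CHART_DATA_RE.search(html)
    if match is None:
        return []
    return json.loads(match.group(1))


# Made-up sample page, only for demonstrating the function.
SAMPLE_HTML = """
<html><body>
<script>
var chartData = [{"date": "2019-01-02", "v1": 2.31}, {"date": "2019-01-03", "v1": 2.28}];
</script>
</body></html>
"""

rows = extract_chart_data(SAMPLE_HTML)
print(len(rows), rows[0]["date"])
```

The regular expression is deliberately simple: it would break if the embedded data were not valid JSON or contained the sequence `];` inside a string, so a real scraper would need more defensive handling.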
Let me know what you think, or if my logic is simply way off. If you think it is a good Idea I can help out with refactoring!
Based on some feedback received and subsequent experiments, it seems that the data download is missing out a lot of tickers (and if it's missing out the SPY, there will be an error in parsing_keystats.py).
This project downloads price data for free from Yahoo Finance, via pandas-datareader (and fix-yahoo-finance). However, I've noticed lately that the data is becoming a lot more inconsistent, and sometimes just fails completely. This is because Yahoo seems to be dropping their support for this API.
The data on Yahoo is still there, it's just a problem of accessing it. In the past I wrote a blog post about downloading data from the linked source, but 'deprecated it' once I realised that pandas-datareader with fix-yahoo-finance did the same thing but much better. My method still works, but it won't be trivial to integrate it with the project (and anyway it's a very clunky solution). I suppose that the easiest solution is to find another data source, so suggestions would be welcome.
As a temporary fix, I have added the csv files (containing all the data) to this repo.
For the past year, it seems that 'requests.get()' has stopped working with Yahoo Finance.
@robertmartin8 could you please guide us on how we can get around this error?
Thanks
There is an error, "zero-sized array to reduction operation maximum which has no identity", in download_historical_prices.py.
I entered this code and it doesn't return any data.
from pandas_datareader import data as pdr
import fix_yahoo_finance as yf
yf.pdr_override()
data = pdr.get_data_yahoo("SPY", start="2017-01-01", end="2017-04-30")
Hello,
I get the following error when I try to run stock_prediction.py. I have already tried on Linux CentOS 7 and Windows 10; my Python version is 3.6.5. I followed all the instructions step by step. The other files run fine.
[root@customiseta MachineLearningStocks]# python3.6 stock_prediction.py
Building dataset and predicting stocks...
Traceback (most recent call last):
File "stock_prediction.py", line 55, in <module>
predict_stocks()
File "stock_prediction.py", line 42, in predict_stocks
y_pred = clf.predict(X_test)
File "/usr/lib64/python3.6/site-packages/sklearn/ensemble/forest.py", line 538, in predict
proba = self.predict_proba(X)
File "/usr/lib64/python3.6/site-packages/sklearn/ensemble/forest.py", line 578, in predict_proba
X = self._validate_X_predict(X)
File "/usr/lib64/python3.6/site-packages/sklearn/ensemble/forest.py", line 357, in _validate_X_predict
return self.estimators_[0]._validate_X_predict(X, check_input=True)
File "/usr/lib64/python3.6/site-packages/sklearn/tree/tree.py", line 373, in _validate_X_predict
X = check_array(X, dtype=DTYPE, accept_sparse="csr")
File "/usr/lib64/python3.6/site-packages/sklearn/utils/validation.py", line 462, in check_array
context))
ValueError: Found array with 0 sample(s) (shape=(0, 41)) while a minimum of 1 is required.
How do I actually run this code? I installed all the libraries, but it still gives an error. Could you please help me run this code?
Hello,
First of all great work Robert.
I found one big mistake (everyone makes it) in backtesting.py, row 40: you are using shuffle=True (the default in train_test_split), so when you build the i+1 or i+x targets, that data has already been seen during training. Because of that you always get a different result when running backtesting.py. If you change to shuffle=False you will get 45-50% fewer trades and the accuracy score will drop to 0.6/0.65 at most.
Best
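The leakage described in this report can be avoided by splitting on time order rather than at random. With scikit-learn that means passing shuffle=False to train_test_split on date-sorted data; the sketch below shows the same idea in plain Python. The `date` field name and the sample rows are assumptions for illustration, not the project's actual dataset layout.

```python
def chronological_split(rows, test_fraction=0.2, key="date"):
    """Sort rows by date and hold out the most recent fraction as the test set."""
    ordered = sorted(rows, key=lambda r: r[key])
    n_test = int(round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Made-up rows; ISO-formatted dates sort correctly as plain strings.
sample = [
    {"date": "2012-06-30", "ticker": "AAA"},
    {"date": "2010-03-31", "ticker": "BBB"},
    {"date": "2013-09-30", "ticker": "AAA"},
    {"date": "2011-12-31", "ticker": "CCC"},
    {"date": "2009-06-30", "ticker": "BBB"},
]

train, test = chronological_split(sample, test_fraction=0.2)
# Every training row is dated before every test row.
print(max(r["date"] for r in train) < min(r["date"] for r in test))
```

Because the split is deterministic, repeated backtest runs on the same data would give the same result, unlike a shuffled split without a fixed seed.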
Within the project there are many inconsistent naming conventions (just look at the top level python files!)
Fix this according to the holy laws of PEP8 and human decency.
Hi Robert,
I was reviewing your most excellent work earlier and was wondering..
What index did you use to generate the sp500_index.csv data?
Was this S&P 500 (^GSPC) and did you preprocess or scale this data.
The reason I ask is that the data in the 200-207 range looks on the low side.
Thanks!
Fig
ubuntu 16.04
✘-1 ~/MachineLearningStocks [master|✚ 1…21968]
05:02 $ python download_historical_prices.py
File "download_historical_prices.py", line 35
print(f"{len(missing_tickers)} tickers are missing: \n {missing_tickers} ")
^
SyntaxError: invalid syntax
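f-strings were added in Python 3.6, so a SyntaxError pointing at the closing quote of an f-string is what an older interpreter would report. Assuming that is the cause here, the sketch below shows an equivalent message built with str.format, plus a version check; `format_missing` and `check_python_version` are hypothetical helper names.

```python
import sys


def format_missing(missing_tickers):
    """Build the same message as the f-string, but valid on older Pythons."""
    return "{} tickers are missing: \n {} ".format(
        len(missing_tickers), missing_tickers
    )


def check_python_version(minimum=(3, 6)):
    """Return True if the running interpreter supports f-strings."""
    return sys.version_info >= minimum


print(format_missing(["AAA", "BBB"]))
print("f-strings supported:", check_python_version())
```

Running the script with an explicit interpreter such as python3.6 (or newer) avoids the need to change the source at all.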
Add `if __name__ == "__main__":` to most of the files to improve command line access
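A minimal sketch of the pattern this issue asks for. The guard lets a file be imported (for example by tests) without running its work, while still running when invoked from the command line. `build_dataset` is a hypothetical placeholder for a script's real work.

```python
def build_dataset():
    """Hypothetical placeholder for the script's real work."""
    return ["placeholder row"]


def main():
    rows = build_dataset()
    print("Built {} row(s)".format(len(rows)))
    return 0


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    main()
```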
current_data.py extracts the current financials of a company by scraping yahoo finance.
However, if you look at the file you will see that it is a hard-coded mess, filled with code smell and repetition. In the spirit of python, this can and should be fixed. I do have a fix ready on one of my recent versions of this project, but I will have to backwards-integrate it.
My name is Luis, I'm a big-data machine-learning developer, I'm a fan of your work, and I usually check your updates.
I was afraid that my savings would be eaten by inflation, so I have created a powerful tool based on past technical patterns (volatility, moving averages, statistics, trends, candlesticks, support and resistance, stock index indicators).
It covers all the ones you know (RSI, MACD, STOCH, Bollinger Bands, SMA, DeMark, Japanese candlesticks, Ichimoku, Fibonacci, Williams %R, balance of power, Murrey math, etc.) and more than 200 others.
The tool creates prediction models of correct trading points (buy signal and sell signal, every stock is good traded in time and direction).
For this I have used big data tools like pandas python, stock market libraries like: tablib, TAcharts ,pandas_ta... For data collection and calculation.
And powerful machine-learning libraries such as: Sklearn.RandomForest , Sklearn.GradientBoosting, XGBoost, Google TensorFlow and Google TensorFlow LSTM.
With the models trained with the selection of the best technical indicators, the tool is able to predict trading points (where to buy, where to sell) and send real-time alerts to Telegram or Mail. The points are calculated based on the learning of the correct trading points of the last 2 years (including the change to bear market after the rate hike).
I think it could be useful to you, to improve, I would like to share it with you, and if you are interested in improving and collaborating I am also willing, and if not file it in the box.
Robert, just discovered your MachineLearningStocks. Not an issue but a suggestion on fundamental data sources: the American Association of Individual Investors has a product (Stock Investor Pro) with a reasonable subscription fee of US $198/year after a membership fee of $29/year. A subscriber has access to both current and weekly non-survivorship-biased historical data back to 2004 for ~2000 fundamental factors for ~6000 equities. It takes a significant effort to download and put the data into a usable format. I have been using this data source in a personal Python-based stock backtester and screener for personal investing for 14+ years. Interestingly, I too am wading through Eremenko Krill's Machine Learning and Deep Learning and have just purchased a GPU card with the long-term intent of adding ML stock selection to my current system.
Interested in this project and possibly working on it more. Just starting out with ML but I was curious to try and figure out the issue with the backtesting. From what I can tell it is that you are training the model on future data but then making predictions for stocks in the past...
It seems like the solution would be to first randomly select the year you'd like to predict and then ensure the split for both training and test is only run on years before that. Just wanted to check and see if I'm right about at least the issue. Feel free to drop me an email (on my profile) if you'd rather talk there, I know you said you want to let other people try and figure it out.
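One way to read this proposal is as a walk-forward evaluation: for each year to be predicted, train only on rows from earlier years. The sketch below assumes each row carries a `year` field; that field name and the sample rows are made up and do not reflect the project's actual columns.

```python
def walk_forward(rows, year_key="year"):
    """Yield (predict_year, train_rows, test_rows) with training strictly in the past."""
    years = sorted({r[year_key] for r in rows})
    # The earliest year has no history to train on, so start from the second.
    for predict_year in years[1:]:
        train = [r for r in rows if r[year_key] < predict_year]
        test = [r for r in rows if r[year_key] == predict_year]
        yield predict_year, train, test


# Made-up sample rows.
sample_rows = [
    {"year": 2014, "ticker": "AAA"},
    {"year": 2014, "ticker": "BBB"},
    {"year": 2015, "ticker": "AAA"},
    {"year": 2016, "ticker": "BBB"},
]

for year, train, test in walk_forward(sample_rows):
    print(year, len(train), len(test))
```

A model would be refit on `train` inside the loop, so no prediction ever uses information from the year being predicted or later.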
I would like to contribute to this project and have read through the readme in detail.
I have noticed you speak about a fatal flaw in the backtest, what is it? I can work on this and submit a PR.
Hello,
I tried using the script but the data for key statistics provided by Sentdex is a bit outdated. Does anyone know an API or a URL where we can get fresher data? I am willing to submit a PR with this solution if someone provides me with enough info to implement it.
Thanks,
Aleksandar Serafimoski
I do not see how to get the ticker list. There isn't really much documentation on it. I can get the prices for SPY, but the directory does not work. Is this something I have to get by myself?
Please let me know.
Hello, when I run the command python parsing_keystats.py, keystats.csv is created, but it's empty (except for Date, unix, ticker, etc.). I know for sure that stock prices are downloaded and updated as I change the date to the present. Please can you help me, as I have tried everything to solve this.
Let's have some clear commenting and a much improved README so new users can understand exactly what's going on.
I get the below error when doing the pytest. I'm not sure why this is occurring.
pytest -vv
================================================================================= test session starts ==================================================================================
platform linux -- Python 3.6.5, pytest-3.4.1, py-1.5.3, pluggy-0.6.0 -- /home/chris/anaconda3/bin/python
cachedir: .pytest_cache
rootdir: /home/chris/Documents/Stocks/MachineLearningStocks, inifile:
plugins: remotedata-0.2.1, openfiles-0.3.0, doctestplus-0.1.3, arraydiff-0.2
collected 9 items
tests/test_datasets.py::test_forward_sample_dimensions PASSED [ 11%]
tests/test_datasets.py::test_forward_sample_data PASSED [ 22%]
tests/test_datasets.py::test_stock_prices_dataset PASSED [ 33%]
tests/test_datasets.py::test_stock_prediction_dataset PASSED [ 44%]
tests/test_utils.py::test_status_calc PASSED [ 55%]
tests/test_utils.py::test_data_string_to_float PASSED [ 66%]
tests/test_variables.py::test_statspath PASSED [ 77%]
tests/test_variables.py::test_features_same FAILED [ 88%]
tests/test_variables.py::test_outperformance PASSED [100%]
======================================================================================= FAILURES =======================================================================================
__________________________________________________________________________________ test_features_same __________________________________________________________________________________
def test_features_same():
# There are only four differences (intentionally)
assert set(parsing_keystats.features) - set(current_data.features) == {'Qtrly Revenue Growth', 'Qtrly Earnings Growth',
'Shares Short (as of', 'Net Income Avl to Common'}
E AssertionError: assert {'Net Income ...prior month)'} == {'Net Income A...Short (as of'}
E Extra items in the left set:
E 'Shares Short (prior month)'
E Full diff:
E {'Net Income Avl to Common',
E 'Qtrly Earnings Growth',
E 'Qtrly Revenue Growth',
E - 'Shares Short (as of',
E ? ^
E + 'Shares Short (as of'}
E ? ^
E - 'Shares Short (prior month)'}
tests/test_variables.py:17: AssertionError
========================================================================= 1 failed, 8 passed in 15.02 seconds ==========================================================================
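The assertion in this failure compares a set difference against a hard-coded expected set. A small helper that reports both directions of the mismatch can make such failures easier to read; the sketch below uses the feature names shown in the output above, and `describe_mismatch` is a hypothetical helper, not part of the test suite.

```python
def describe_mismatch(actual, expected):
    """Report items present on only one side of a set comparison."""
    return {
        "unexpected": sorted(actual - expected),
        "missing": sorted(expected - actual),
    }


# Left-hand side of the failing assertion, as reported by pytest.
actual = {
    "Net Income Avl to Common",
    "Qtrly Earnings Growth",
    "Qtrly Revenue Growth",
    "Shares Short (as of",
    "Shares Short (prior month)",
}
# Right-hand side written in the test.
expected = {
    "Net Income Avl to Common",
    "Qtrly Earnings Growth",
    "Qtrly Revenue Growth",
    "Shares Short (as of",
}

print(describe_mismatch(actual, expected))
```

Here the only discrepancy is 'Shares Short (prior month)', which matches the "Extra items in the left set" line in the pytest output.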