numerai / example-scripts
A collection of scripts and notebooks to help you get started quickly.
Home Page: https://numer.ai/
License: MIT License
(venv) anson ~/numerai/example-scripts (master)
$ python3 example_model.py
Traceback (most recent call last):
File "example_model.py", line 2, in <module>
from lightgbm import LGBMRegressor
File "/Users/anson/numerai/example-scripts/venv/lib/python3.7/site-packages/lightgbm/__init__.py", line 8, in <module>
from .basic import Booster, Dataset, register_logger
File "/Users/anson/numerai/example-scripts/venv/lib/python3.7/site-packages/lightgbm/basic.py", line 95, in <module>
_LIB = _load_lib()
File "/Users/anson/numerai/example-scripts/venv/lib/python3.7/site-packages/lightgbm/basic.py", line 86, in _load_lib
lib = ctypes.cdll.LoadLibrary(lib_path[0])
File "/usr/local/Cellar/[email protected]/3.7.9_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ctypes/__init__.py", line 442, in LoadLibrary
return self._dlltype(name)
File "/usr/local/Cellar/[email protected]/3.7.9_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ctypes/__init__.py", line 364, in __init__
self._handle = _dlopen(self._name, mode)
OSError: dlopen(/Users/anson/numerai/example-scripts/venv/lib/python3.7/site-packages/lightgbm/lib_lightgbm.so, 6): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
Referenced from: /Users/anson/numerai/example-scripts/venv/lib/python3.7/site-packages/lightgbm/lib_lightgbm.so
Reason: image not found
The fix is to install OpenMP via Homebrew: brew install libomp (see microsoft/LightGBM#1369).
Dear developers,
I was wondering about the choice of using pandas.to_pickle to save models. In particular, I have two questions: why use pandas.to_pickle instead of the standard python pickle module? And does pandas offer some interesting features that the generic python module doesn't?
Thanks,
Luca
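For what it's worth, the two approaches produce interchangeable files for plain Python objects; pandas.to_pickle mainly adds convenience (path handling and compression inferred from the file extension). A minimal sketch, where the object and file names are made up for illustration:

```python
import pickle
import pandas as pd

# Hypothetical stand-in for a trained model object.
obj = {"weights": [0.1, 0.2, 0.3]}

# Standard library pickle:
with open("model_pickle.pkl", "wb") as f:
    pickle.dump(obj, f)

# pandas (works for arbitrary objects, not only DataFrames):
pd.to_pickle(obj, "model_pandas.pkl")

# Either file can be read back with either API:
loaded_a = pd.read_pickle("model_pickle.pkl")
with open("model_pandas.pkl", "rb") as f:
    loaded_b = pickle.load(f)
```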
The current analysis_and_tips.ipynb notebook says:
so even if they are good are ranking rows
but I believe you meant:
so even if they are good at ranking rows
Thanks!
I am referring to this line. I believe the median value of the features should be computed on a per-era basis to avoid introducing biases.
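To illustrate the concern, here is a toy sketch (hypothetical era and feature names) of computing the median per era with groupby/transform instead of globally:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two eras on very different scales, with gaps.
df = pd.DataFrame({
    "era": ["era1", "era1", "era1", "era2", "era2", "era2"],
    "feature_a": [1.0, 3.0, np.nan, 10.0, 30.0, np.nan],
})

# A global median mixes the eras together:
global_median = df["feature_a"].median()

# A per-era median keeps each era's scale intact:
per_era_median = df.groupby("era")["feature_a"].transform("median")
df["feature_a_filled"] = df["feature_a"].fillna(per_era_median)
```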
This is not strictly a bug, but with the 4.1 data being the current standard I was interested in running the advanced example on it. However, it errors on the neutralization step below with an error I can't get to the bottom of.
# do neutralization
print("doing neutralization to riskiest features")
training_data.loc[test_split_index, f"preds_{model_name}_neutral_riskiest_50"] = neutralize(
df=training_data.loc[test_split_index, :],
columns=[f"preds_{model_name}"],
neutralizers=riskiest_features_split,
proportion=1.0,
normalize=True,
era_col=ERA_COL)[f"preds_{model_name}"]
The following is the full error trace.
---------------------------------------------------------------------------
LinAlgError Traceback (most recent call last)
[<ipython-input-10-6a1221969f79>](https://localhost:8080/#) in <cell line: 2>()
65 # do neutralization
66 print("doing neutralization to riskiest features")
---> 67 training_data.loc[test_split_index, f"preds_{model_name}_neutral_riskiest_50"] = neutralize(
68 df=training_data.loc[test_split_index, :],
69 columns=[f"preds_{model_name}"],
5 frames
[<ipython-input-2-6459d8dbad0a>](https://localhost:8080/#) in neutralize(df, columns, neutralizers, proportion, normalize, era_col, verbose)
139
140 scores -= proportion * exposures.dot(
--> 141 np.linalg.pinv(exposures.astype(np.float32), rcond=1e-6).dot(
142 scores.astype(np.float32)
143 )
/usr/local/lib/python3.9/dist-packages/numpy/core/overrides.py in pinv(*args, **kwargs)
[/usr/local/lib/python3.9/dist-packages/numpy/linalg/linalg.py](https://localhost:8080/#) in pinv(a, rcond, hermitian)
1988 return wrap(res)
1989 a = a.conjugate()
-> 1990 u, s, vt = svd(a, full_matrices=False, hermitian=hermitian)
1991
1992 # discard small singular values
/usr/local/lib/python3.9/dist-packages/numpy/core/overrides.py in svd(*args, **kwargs)
[/usr/local/lib/python3.9/dist-packages/numpy/linalg/linalg.py](https://localhost:8080/#) in svd(a, full_matrices, compute_uv, hermitian)
1646
1647 signature = 'D->DdD' if isComplexType(t) else 'd->ddd'
-> 1648 u, s, vh = gufunc(a, signature=signature, extobj=extobj)
1649 u = u.astype(result_t, copy=False)
1650 s = s.astype(_realType(result_t), copy=False)
[/usr/local/lib/python3.9/dist-packages/numpy/linalg/linalg.py](https://localhost:8080/#) in _raise_linalgerror_svd_nonconvergence(err, flag)
95
96 def _raise_linalgerror_svd_nonconvergence(err, flag):
---> 97 raise LinAlgError("SVD did not converge")
98
99 def _raise_linalgerror_lstsq(err, flag):
LinAlgError: SVD did not converge
Any help or advice from anyone would be amazing.
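As a general note (not specific to this repo's code), "SVD did not converge" from np.linalg.pinv is often triggered by NaN or infinite values in the input, so checking the exposures matrix first can help narrow it down. A hypothetical sketch:

```python
import numpy as np

# Hypothetical exposure matrix containing a NaN, a common cause of
# "SVD did not converge" inside np.linalg.pinv.
exposures = np.array([[1.0, 2.0], [np.nan, 4.0]], dtype=np.float32)

if not np.isfinite(exposures).all():
    # Inspect or clean the offending entries before neutralizing;
    # zero-filling here is only a placeholder for a real fix.
    exposures = np.nan_to_num(exposures, nan=0.0)

# Computing in float64 is also somewhat more numerically robust.
pinv = np.linalg.pinv(exposures.astype(np.float64))
```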
I have a reasonably recent laptop.
The example_model script maxed out all 16GB of my RAM, used 9GB of swap, and made my whole laptop unusable for anything else.
Is there a way to run the program while limiting its RAM usage to 8GB or so, so that I can keep using my laptop for browsing or code editing while the model trains?
Or should the minimal system requirements be bumped to 32GB of RAM?
Could the parquet files be read and written via a key-value store like LMDB or RocksDB, to reduce the need to upgrade my laptop from 16GB of RAM to 32GB?
Alternatively, should we add instructions on how to SSH into an EC2 instance with 32GB of RAM for the purpose of running the example scripts?
Laptop overview: ( 6 core i7 @ 2.6ghz, 16gb ram, 256gb SSD )
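Short of changing the storage format, two levers that usually help are reading only the columns you need (pd.read_parquet accepts a columns= argument) and downcasting float64 features to float32. A small sketch of the downcasting effect; the frame here is synthetic:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Numerai training frame.
df = pd.DataFrame(np.random.rand(1000, 10),
                  columns=[f"feature_{i}" for i in range(10)])

before = df.memory_usage(deep=True).sum()

# float32 halves the footprint of float64 feature columns; similarly,
# pd.read_parquet(path, columns=needed_columns) avoids loading unused columns.
df32 = df.astype(np.float32)
after = df32.memory_usage(deep=True).sum()
```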
I would like to publish some of the code from this repository in a MIT licensed library.
Are you ok with that? Any attribution required?
Thanks for sharing such great stuff!
Trying to run example_model.py
on Windows 10, 8GB RAM notebook results in following crash:
(numer.ai) C:\Users\lsadmin\Documents\numer.ai\example-scripts>python example_model.py
Downloading dataset files...
2022-05-17 16:01:22,904 INFO numerapi.utils: target file already exists
2022-05-17 16:01:22,906 INFO numerapi.utils: download complete
2022-05-17 16:01:24,126 INFO numerapi.utils: target file already exists
2022-05-17 16:01:24,126 INFO numerapi.utils: download complete
2022-05-17 16:01:25,315 INFO numerapi.utils: target file already exists
2022-05-17 16:01:25,315 INFO numerapi.utils: download complete
2022-05-17 16:01:26,494 INFO numerapi.utils: target file already exists
2022-05-17 16:01:26,494 INFO numerapi.utils: download complete
2022-05-17 16:01:28,128 INFO numerapi.utils: target file already exists
2022-05-17 16:01:28,128 INFO numerapi.utils: download complete
Reading minimal training data
Traceback (most recent call last):
File "C:\Users\lsadmin\Documents\numer.ai\example-scripts\example_model.py", line 52, in
validation_data = pd.read_parquet('v4/validation.parquet',
File "C:\Users\lsadmin\anaconda3\envs\numer.ai\lib\site-packages\pandas\io\parquet.py", line 493, in read_parquet
return impl.read(
File "C:\Users\lsadmin\anaconda3\envs\numer.ai\lib\site-packages\pandas\io\parquet.py", line 240, in read
result = self.api.parquet.read_table(
File "pyarrow\array.pxi", line 767, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow\table.pxi", line 1996, in pyarrow.lib.Table._to_pandas
File "C:\Users\lsadmin\anaconda3\envs\numer.ai\lib\site-packages\pyarrow\pandas_compat.py", line 789, in table_to_blockmanager
blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
File "C:\Users\lsadmin\anaconda3\envs\numer.ai\lib\site-packages\pyarrow\pandas_compat.py", line 1135, in _table_to_blocks
result = pa.lib.table_to_blocks(options, block_table, categories,
File "pyarrow\table.pxi", line 1356, in pyarrow.lib.table_to_blocks
File "pyarrow\error.pxi", line 116, in pyarrow.lib.check_status
pyarrow.lib.ArrowMemoryError: malloc of size 4209552448 failed
Any idea why, or how to overcome this? The failed malloc is roughly 4GB; can the example run on a machine like the one above?
Hi, your project example-scripts requires "halo==0.0.31" as a dependency. After analyzing the source code, we found that other versions of halo are also suitable without affecting your project, i.e., halo 0.0.30. Therefore, we suggest loosening the dependency on halo from "halo==0.0.31" to "halo>=0.0.30,<=0.0.31" to avoid possible conflicts when importing more packages, or for downstream projects that may use example-scripts.
May I open a pull request to loosen the dependency on halo?
By the way, could you please tell us whether this kind of dependency analysis could be helpful for maintaining dependencies during your development?
For your reference, here are details in our analysis.
Your project example-scripts (commit id: c447775) directly uses 1 API from the halo package.
halo.halo.Halo.__init__
From this call, 14 functions are indirectly invoked, including 8 of halo's internal APIs and 6 external APIs, as follows (omitting some repeated function occurrences).
[/numerai/example-scripts]
+--halo.halo.Halo.__init__
| +--halo._utils.get_environment
| | +--IPython.get_ipython
| +--halo.halo.Halo.stop
| | +--halo.halo.Halo.clear
| | | +--halo.halo.Halo._write
| | | | +--halo.halo.Halo._check_stream
| | +--halo.halo.Halo._show_cursor
| | | +--halo.halo.Halo._check_stream
| | | +--halo.cursor.show
| | | | +--halo.cursor._CursorInfo.__init__
| | | | +--ctypes.windll.kernel32.GetStdHandle
| | | | +--ctypes.windll.kernel32.GetConsoleCursorInfo
| | | | +--ctypes.windll.kernel32.SetConsoleCursorInfo
| | | | +--ctypes.byref
| +--IPython.get_ipython
| +--atexit.register
We scanned halo's versions 0.0.30 and 0.0.31; the changed functions (diffs listed below) have no intersection with any function or API mentioned above (whether directly or indirectly called by this project).
diff: 0.0.31(original) 0.0.30
[](no clear difference between the source codes of two versions)
As for the other packages whose APIs halo calls in the graph above, the dependencies on them also stay the same across our suggested versions, thus avoiding any outside conflict.
Therefore, we believe it is quite safe to loosen your dependency on halo from "halo==0.0.31" to "halo>=0.0.30,<=0.0.31". This will improve the applicability of example-scripts and reduce the possibility of further dependency conflicts with other projects/packages.
feature_subset is initially set to medium/serenity for the intended neutralization:
feature_subset = list(subgroups["medium"]["serenity"])
But later, in the "neutralizing different groups" section, it is reassigned inside the for loop and ends up as medium/all, which is then pickled for the submission.
for group in groups:
feature_subset = list(subgroups["medium"][group])
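One way to avoid the leak is to keep each group's subset local to the loop and pick the subset to pickle explicitly afterwards. A hypothetical sketch; the group names mirror the notebook, but the data here is made up:

```python
# Made-up stand-in for the notebook's subgroup structure.
subgroups = {"medium": {"serenity": ["f1", "f2"], "all": ["f1", "f2", "f3"]}}
groups = ["serenity", "all"]

# Keep each group's subset in a dict rather than rebinding feature_subset:
group_subsets = {}
for group in groups:
    group_subsets[group] = list(subgroups["medium"][group])

# Choose the subset to pickle for the submission explicitly, not
# whatever the loop variable last held:
feature_subset = group_subsets["serenity"]
```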
Installing collected packages: numpy
Attempting uninstall: numpy
Found existing installation: numpy 1.21.0
Uninstalling numpy-1.21.0:
Successfully uninstalled numpy-1.21.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.55.1 requires numpy<1.22,>=1.18, but you have numpy 1.22.3 which is incompatible.
Successfully installed numpy-1.22.3
Currently, the default install process requires numpy < 1.22, due to the numba requirement.
However, pip install -r requirements.txt by default forces the lower-versioned numpy to be uninstalled and installs the newest version, causing errors.
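One possible workaround, assuming numba 0.55.x's numpy<1.22 pin, is to constrain numpy explicitly in requirements.txt so pip's resolver never upgrades past numba's bound:

```
# requirements.txt fragment (hypothetical pins, for illustration only)
numpy>=1.18,<1.22
numba==0.55.1
```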
When I try to upload validation predictions I get:
Your upload seems to be invalid:
invalid_submission_eras: Incomplete eras uploaded: ['0936']
The script fails with "Floating point is not supported" when running with xgboost 1.4.0 on Ubuntu 20.04.
$ python3 example_model.py
Loading data...
Loaded 310 features
Loading pre-trained model...
Generating predictions...
/home/andrewh/.local/lib/python3.8/site-packages/xgboost/data.py:112: UserWarning: Use subset (sliced data) of np.ndarray is not recommended because it will generate
extra copies and increase memory consumption
warnings.warn(
Traceback (most recent call last):
File "example_model.py", line 236, in
main()
File "example_model.py", line 80, in main
training_data[PREDICTION_NAME] = model.predict(training_data[feature_names])
File "/home/andrewh/.local/lib/python3.8/site-packages/xgboost/sklearn.py", line 820, in predict
predts = self.get_booster().inplace_predict(
File "/home/andrewh/.local/lib/python3.8/site-packages/xgboost/core.py", line 1846, in inplace_predict
_check_call(
File "/home/andrewh/.local/lib/python3.8/site-packages/xgboost/core.py", line 210, in _check_call
raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [21:36:57] ../src/c_api/../data/array_interface.h:352: Floating point is not supported.
Stack trace:
[bt] (0) /home/andrewh/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x912df) [0x7f2f483e62df]
[bt] (1) /home/andrewh/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x9af7f) [0x7f2f483eff7f]
[bt] (2) /home/andrewh/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterPredictFromDense+0xf8) [0x7f2f483d8e78]
[bt] (3) /lib/x86_64-linux-gnu/libffi.so.7(+0x6ff5) [0x7f2f83f6cff5]
[bt] (4) /lib/x86_64-linux-gnu/libffi.so.7(+0x640a) [0x7f2f83f6c40a]
[bt] (5) /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x58c) [0x7f2f8eb122ac]
[bt] (6) /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x137e0) [0x7f2f8eb127e0]
[bt] (7) python3(_PyObject_MakeTpCall+0x296) [0x5f3446]
[bt] (8) python3(_PyEval_EvalFrameDefault+0x5dc0) [0x56f600]
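A workaround that has helped with similar inplace_predict dtype errors is to hand xgboost a contiguous float32 ndarray instead of a DataFrame slice; whether it fixes this exact crash on xgboost 1.4.0 is an assumption. A sketch, where the frame and column names are stand-ins for the real data:

```python
import numpy as np
import pandas as pd

# Stand-in for the feature frame used in example_model.py.
training_data = pd.DataFrame({"feature_a": [1, 2, 3], "feature_b": [4, 5, 6]})
feature_names = ["feature_a", "feature_b"]

# Convert the slice to a contiguous float32 array before predicting:
X = np.ascontiguousarray(
    training_data[feature_names].to_numpy(dtype=np.float32)
)
# preds = model.predict(X)  # model as loaded in example_model.py (not built here)
```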
I ran the example script and it started downloading v4/validation.parquet.
My wifi was slow and my computer went to sleep. When I woke the computer, the program was hung due to the wifi disconnect, so I killed it and ran it again to "resume the download".
Instead I got:
OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.
I had to manually delete v4/validation.parquet since the numerapi SDK was not able to correctly resume the download.
Below is the output of the program that resumes the download.
2023-01-29 11:22:59,695 INFO numerapi.utils: resuming download
/home/raynos/.local/lib/python3.8/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'numerai-datasets-us-west-2.s3.amazonaws.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
warnings.warn(
v4/validation.parquet: 40%|█████▋ | 463M/1.15G [00:00<00:00, 3.82GB/s]
v4/validation.parquet: 1.15GB [01:05, 17.4MB/s]
2023-01-29 11:24:07,248 INFO numerapi.utils: starting download
v4/live_409.parquet: 3.42MB [00:01, 1.90MB/s]
Below is the output of the program that tries to use the data file from the resumed download.
2023-01-29 11:24:20,449 INFO numerapi.utils: starting download
v4/features.json: 562kB [00:00, 727kB/s]
Reading minimal training data
Traceback (most recent call last):
File "./example_model.py", line 52, in <module>
validation_data = pd.read_parquet('v4/validation.parquet',
File "/home/raynos/.local/lib/python3.8/site-packages/pandas/io/parquet.py", line 493, in read_parquet
return impl.read(
File "/home/raynos/.local/lib/python3.8/site-packages/pandas/io/parquet.py", line 240, in read
result = self.api.parquet.read_table(
File "/home/raynos/.local/lib/python3.8/site-packages/pyarrow/parquet.py", line 1996, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
File "/home/raynos/.local/lib/python3.8/site-packages/pyarrow/parquet.py", line 1831, in read
table = self._dataset.to_table(
File "pyarrow/_dataset.pyx", line 323, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 2311, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.
I don't know if it's possible to do an integrity check with a checksum in the resume-download branch, but doing so would let you verify whether the resumed download succeeded or was corrupted, and delete the corrupted file.
Leaving the corrupted file behind produces a thrift protocol error, since the parquet file is no longer valid.
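For reference, a streamed SHA-256 along these lines would be enough to verify a resumed file, assuming the server published a checksum to compare against (numerapi exposing one is an assumption here):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file's SHA-256 so large parquet files never need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demonstration on a small throwaway file:
with open("tmp_download.bin", "wb") as f:
    f.write(b"parquet bytes")
checksum = sha256_of("tmp_download.bin")
expected = hashlib.sha256(b"parquet bytes").hexdigest()
```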
Running model_upload does not work when uploading the pickle file to Numerai; it fails with an error about the lightgbm booster having no handle.
We should add numerox and numerapi to our example scripts.
I think the current train-test era split risks missing some eras (the last few).
Line 73 in 76c537a
I stumbled upon it by using cv=5 and fixed it by calculating the test splits as:
test_splits = [all_train_eras[i * len_split:(i + 1) * len_split] for i in range(cv - 1)] + \
[all_train_eras[(cv - 1) * len_split:]]
EDIT: addressed in #94