numerai / example-scripts
A collection of scripts and notebooks to help you get started quickly.
Home Page: https://numer.ai/
License: MIT License
(venv) anson ~/numerai/example-scripts (master)
$ python3 example_model.py
Traceback (most recent call last):
File "example_model.py", line 2, in <module>
from lightgbm import LGBMRegressor
File "/Users/anson/numerai/example-scripts/venv/lib/python3.7/site-packages/lightgbm/__init__.py", line 8, in <module>
from .basic import Booster, Dataset, register_logger
File "/Users/anson/numerai/example-scripts/venv/lib/python3.7/site-packages/lightgbm/basic.py", line 95, in <module>
_LIB = _load_lib()
File "/Users/anson/numerai/example-scripts/venv/lib/python3.7/site-packages/lightgbm/basic.py", line 86, in _load_lib
lib = ctypes.cdll.LoadLibrary(lib_path[0])
File "/usr/local/Cellar/[email protected]/3.7.9_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ctypes/__init__.py", line 442, in LoadLibrary
return self._dlltype(name)
File "/usr/local/Cellar/[email protected]/3.7.9_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ctypes/__init__.py", line 364, in __init__
self._handle = _dlopen(self._name, mode)
OSError: dlopen(/Users/anson/numerai/example-scripts/venv/lib/python3.7/site-packages/lightgbm/lib_lightgbm.so, 6): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
Referenced from: /Users/anson/numerai/example-scripts/venv/lib/python3.7/site-packages/lightgbm/lib_lightgbm.so
Reason: image not found
The fix is to install OpenMP via Homebrew: brew install libomp (see microsoft/LightGBM#1369).
Dear developers,
I was wondering about the choice of using pandas.to_pickle to save models. In particular, I have two questions: why use pandas.to_pickle instead of the standard python pickle module? And does pandas offer some interesting features that the generic python module doesn't?
Thanks,
Luca
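For what it's worth, the two approaches produce interchangeable files for plain Python objects; pandas.to_pickle mainly adds convenience (path handling and compression inferred from the file extension). A minimal sketch, where the object and file names are made up for illustration:

```python
import pickle
import pandas as pd

# Hypothetical stand-in for a trained model object.
obj = {"weights": [0.1, 0.2, 0.3]}

# Standard library pickle:
with open("model_pickle.pkl", "wb") as f:
    pickle.dump(obj, f)

# pandas (works for arbitrary objects, not only DataFrames):
pd.to_pickle(obj, "model_pandas.pkl")

# Either file can be read back with either API:
loaded_a = pd.read_pickle("model_pickle.pkl")
with open("model_pandas.pkl", "rb") as f:
    loaded_b = pickle.load(f)
```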
The current analysis_and_tips.ipynb notebook says:
so even if they are good are ranking rows
but I believe you meant:
so even if they are good at ranking rows
Thanks!
I am referring to this line. I believe the median value of the features should be computed on a per-era basis to avoid introducing biases.
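To illustrate the concern, here is a toy sketch (hypothetical era and feature names) of computing the median per era with groupby/transform instead of globally:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two eras on very different scales, with gaps.
df = pd.DataFrame({
    "era": ["era1", "era1", "era1", "era2", "era2", "era2"],
    "feature_a": [1.0, 3.0, np.nan, 10.0, 30.0, np.nan],
})

# A global median mixes the eras together:
global_median = df["feature_a"].median()

# A per-era median keeps each era's scale intact:
per_era_median = df.groupby("era")["feature_a"].transform("median")
df["feature_a_filled"] = df["feature_a"].fillna(per_era_median)
```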
This is not strictly a bug, but with the 4.1 data being the current standard I was interested in running the advanced example on it. However, it errors on the neutralization step below with an error I can't get to the bottom of.
# do neutralization
print("doing neutralization to riskiest features")
training_data.loc[test_split_index, f"preds_{model_name}_neutral_riskiest_50"] = neutralize(
df=training_data.loc[test_split_index, :],
columns=[f"preds_{model_name}"],
neutralizers=riskiest_features_split,
proportion=1.0,
normalize=True,
era_col=ERA_COL)[f"preds_{model_name}"]
The following is the full error trace.
---------------------------------------------------------------------------
LinAlgError Traceback (most recent call last)
[<ipython-input-10-6a1221969f79>](https://localhost:8080/#) in <cell line: 2>()
65 # do neutralization
66 print("doing neutralization to riskiest features")
---> 67 training_data.loc[test_split_index, f"preds_{model_name}_neutral_riskiest_50"] = neutralize(
68 df=training_data.loc[test_split_index, :],
69 columns=[f"preds_{model_name}"],
5 frames
[<ipython-input-2-6459d8dbad0a>](https://localhost:8080/#) in neutralize(df, columns, neutralizers, proportion, normalize, era_col, verbose)
139
140 scores -= proportion * exposures.dot(
--> 141 np.linalg.pinv(exposures.astype(np.float32), rcond=1e-6).dot(
142 scores.astype(np.float32)
143 )
/usr/local/lib/python3.9/dist-packages/numpy/core/overrides.py in pinv(*args, **kwargs)
[/usr/local/lib/python3.9/dist-packages/numpy/linalg/linalg.py](https://localhost:8080/#) in pinv(a, rcond, hermitian)
1988 return wrap(res)
1989 a = a.conjugate()
-> 1990 u, s, vt = svd(a, full_matrices=False, hermitian=hermitian)
1991
1992 # discard small singular values
/usr/local/lib/python3.9/dist-packages/numpy/core/overrides.py in svd(*args, **kwargs)
[/usr/local/lib/python3.9/dist-packages/numpy/linalg/linalg.py](https://localhost:8080/#) in svd(a, full_matrices, compute_uv, hermitian)
1646
1647 signature = 'D->DdD' if isComplexType(t) else 'd->ddd'
-> 1648 u, s, vh = gufunc(a, signature=signature, extobj=extobj)
1649 u = u.astype(result_t, copy=False)
1650 s = s.astype(_realType(result_t), copy=False)
[/usr/local/lib/python3.9/dist-packages/numpy/linalg/linalg.py](https://localhost:8080/#) in _raise_linalgerror_svd_nonconvergence(err, flag)
95
96 def _raise_linalgerror_svd_nonconvergence(err, flag):
---> 97 raise LinAlgError("SVD did not converge")
98
99 def _raise_linalgerror_lstsq(err, flag):
LinAlgError: SVD did not converge
Any help or advice from anyone would be amazing.
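As a general note (not specific to this repo's code), "SVD did not converge" from np.linalg.pinv is often triggered by NaN or infinite values in the input, so checking the exposures matrix first can help narrow it down. A hypothetical sketch:

```python
import numpy as np

# Hypothetical exposure matrix containing a NaN, a common cause of
# "SVD did not converge" inside np.linalg.pinv.
exposures = np.array([[1.0, 2.0], [np.nan, 4.0]], dtype=np.float32)

if not np.isfinite(exposures).all():
    # Inspect or clean the offending entries before neutralizing;
    # zero-filling here is only a placeholder for a real fix.
    exposures = np.nan_to_num(exposures, nan=0.0)

# Computing in float64 is also somewhat more numerically robust.
pinv = np.linalg.pinv(exposures.astype(np.float64))
```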
I have a reasonably recent laptop.
The example_model script maxed out all 16GB of my RAM, used 9GB of swap, and made my whole laptop unusable for anything else.
Is there a way to run the program while limiting its RAM usage to 8GB or so, so that I can keep using my laptop for browsing or code editing while the model trains?
Or should the minimal system requirements be bumped to 32GB of RAM?
Could the parquet files be read and written via a key-value store like LMDB or RocksDB, to reduce the need to upgrade my laptop from 16GB of RAM to 32GB?
Alternatively, should we add instructions on how to SSH into an EC2 instance with 32GB of RAM for the purpose of running the example scripts?
Laptop overview: ( 6 core i7 @ 2.6ghz, 16gb ram, 256gb SSD )
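Short of changing the storage format, two levers that usually help are reading only the columns you need (pd.read_parquet accepts a columns= argument) and downcasting float64 features to float32. A small sketch of the downcasting effect; the frame here is synthetic:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Numerai training frame.
df = pd.DataFrame(np.random.rand(1000, 10),
                  columns=[f"feature_{i}" for i in range(10)])

before = df.memory_usage(deep=True).sum()

# float32 halves the footprint of float64 feature columns; similarly,
# pd.read_parquet(path, columns=needed_columns) avoids loading unused columns.
df32 = df.astype(np.float32)
after = df32.memory_usage(deep=True).sum()
```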
I would like to publish some of the code from this repository in a MIT licensed library.
Are you ok with that? Any attribution required?
Thanks for sharing such great stuff!
Trying to run example_model.py
on Windows 10, 8GB RAM notebook results in following crash:
(numer.ai) C:\Users\lsadmin\Documents\numer.ai\example-scripts>python example_model.py
Downloading dataset files...
2022-05-17 16:01:22,904 INFO numerapi.utils: target file already exists
2022-05-17 16:01:22,906 INFO numerapi.utils: download complete
2022-05-17 16:01:24,126 INFO numerapi.utils: target file already exists
2022-05-17 16:01:24,126 INFO numerapi.utils: download complete
2022-05-17 16:01:25,315 INFO numerapi.utils: target file already exists
2022-05-17 16:01:25,315 INFO numerapi.utils: download complete
2022-05-17 16:01:26,494 INFO numerapi.utils: target file already exists
2022-05-17 16:01:26,494 INFO numerapi.utils: download complete
2022-05-17 16:01:28,128 INFO numerapi.utils: target file already exists
2022-05-17 16:01:28,128 INFO numerapi.utils: download complete
Reading minimal training data
Traceback (most recent call last):
File "C:\Users\lsadmin\Documents\numer.ai\example-scripts\example_model.py", line 52, in
validation_data = pd.read_parquet('v4/validation.parquet',
File "C:\Users\lsadmin\anaconda3\envs\numer.ai\lib\site-packages\pandas\io\parquet.py", line 493, in read_parquet
return impl.read(
File "C:\Users\lsadmin\anaconda3\envs\numer.ai\lib\site-packages\pandas\io\parquet.py", line 240, in read
result = self.api.parquet.read_table(
File "pyarrow\array.pxi", line 767, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow\table.pxi", line 1996, in pyarrow.lib.Table._to_pandas
File "C:\Users\lsadmin\anaconda3\envs\numer.ai\lib\site-packages\pyarrow\pandas_compat.py", line 789, in table_to_blockmanager
blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
File "C:\Users\lsadmin\anaconda3\envs\numer.ai\lib\site-packages\pyarrow\pandas_compat.py", line 1135, in _table_to_blocks
result = pa.lib.table_to_blocks(options, block_table, categories,
File "pyarrow\table.pxi", line 1356, in pyarrow.lib.table_to_blocks
File "pyarrow\error.pxi", line 116, in pyarrow.lib.check_status
pyarrow.lib.ArrowMemoryError: malloc of size 4209552448 failed
Any idea why, or how to overcome this? The failed malloc is roughly 4GB; can the example run on a machine like the one above?
Hi, your project example-scripts requires "halo==0.0.31" as a dependency. After analyzing the source code, we found that other versions of halo are also suitable without affecting your project, i.e., halo 0.0.30. Therefore, we suggest loosening the dependency on halo from "halo==0.0.31" to "halo>=0.0.30,<=0.0.31" to avoid possible conflicts when importing more packages, or for downstream projects that may use example-scripts.
May I open a pull request to loosen the dependency on halo?
By the way, could you please tell us whether this kind of dependency analysis could be helpful for maintaining dependencies during your development?
For your reference, here are details in our analysis.
Your project example-scripts (commit id: c447775) directly uses 1 API from the halo package.
halo.halo.Halo.__init__
From this call, 14 functions are indirectly invoked, including 8 of halo's internal APIs and 6 external APIs, as follows (omitting some repeated function occurrences).
[/numerai/example-scripts]
+--halo.halo.Halo.__init__
| +--halo._utils.get_environment
| | +--IPython.get_ipython
| +--halo.halo.Halo.stop
| | +--halo.halo.Halo.clear
| | | +--halo.halo.Halo._write
| | | | +--halo.halo.Halo._check_stream
| | +--halo.halo.Halo._show_cursor
| | | +--halo.halo.Halo._check_stream
| | | +--halo.cursor.show
| | | | +--halo.cursor._CursorInfo.__init__
| | | | +--ctypes.windll.kernel32.GetStdHandle
| | | | +--ctypes.windll.kernel32.GetConsoleCursorInfo
| | | | +--ctypes.windll.kernel32.SetConsoleCursorInfo
| | | | +--ctypes.byref
| +--IPython.get_ipython
| +--atexit.register
We scanned halo's versions 0.0.30 and 0.0.31; the changed functions (diffs listed below) have no intersection with any function or API mentioned above (whether directly or indirectly called by this project).
diff: 0.0.31(original) 0.0.30
[](no clear difference between the source codes of two versions)
As for the other packages whose APIs halo calls in the graph above, the dependencies on them also stay the same across our suggested versions, thus avoiding any outside conflict.
Therefore, we believe it is quite safe to loosen your dependency on halo from "halo==0.0.31" to "halo>=0.0.30,<=0.0.31". This will improve the applicability of example-scripts and reduce the possibility of further dependency conflicts with other projects/packages.
feature_subset is initially set to medium/serenity for the intended neutralization:
feature_subset = list(subgroups["medium"]["serenity"])
But later, in the "neutralizing different groups" section, it is reassigned inside the for loop and ends up as medium/all, which is then pickled for the submission.
for group in groups:
feature_subset = list(subgroups["medium"][group])
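One way to avoid the leak is to keep each group's subset local to the loop and pick the subset to pickle explicitly afterwards. A hypothetical sketch; the group names mirror the notebook, but the data here is made up:

```python
# Made-up stand-in for the notebook's subgroup structure.
subgroups = {"medium": {"serenity": ["f1", "f2"], "all": ["f1", "f2", "f3"]}}
groups = ["serenity", "all"]

# Keep each group's subset in a dict rather than rebinding feature_subset:
group_subsets = {}
for group in groups:
    group_subsets[group] = list(subgroups["medium"][group])

# Choose the subset to pickle for the submission explicitly, not
# whatever the loop variable last held:
feature_subset = group_subsets["serenity"]
```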
Installing collected packages: numpy
Attempting uninstall: numpy
Found existing installation: numpy 1.21.0
Uninstalling numpy-1.21.0:
Successfully uninstalled numpy-1.21.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.55.1 requires numpy<1.22,>=1.18, but you have numpy 1.22.3 which is incompatible.
Successfully installed numpy-1.22.3
Currently, the default install process requires numpy < 1.22, due to the numba requirement.
However, pip install -r requirements.txt by default forces the lower-versioned numpy to be uninstalled and installs the newest version, causing errors.
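One possible workaround, assuming numba 0.55.x's numpy<1.22 pin, is to constrain numpy explicitly in requirements.txt so pip's resolver never upgrades past numba's bound:

```
# requirements.txt fragment (hypothetical pins, for illustration only)
numpy>=1.18,<1.22
numba==0.55.1
```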
When I try to upload validation predictions I get:
Your upload seems to be invalid:
invalid_submission_eras: Incomplete eras uploaded: ['0936']
The script fails with "Floating point is not supported" when running with xgboost 1.4.0 on Ubuntu 20.04.
$ python3 example_model.py
Loading data...
Loaded 310 features
Loading pre-trained model...
Generating predictions...
/home/andrewh/.local/lib/python3.8/site-packages/xgboost/data.py:112: UserWarning: Use subset (sliced data) of np.ndarray is not recommended because it will generate
extra copies and increase memory consumption
warnings.warn(
Traceback (most recent call last):
File "example_model.py", line 236, in
main()
File "example_model.py", line 80, in main
training_data[PREDICTION_NAME] = model.predict(training_data[feature_names])
File "/home/andrewh/.local/lib/python3.8/site-packages/xgboost/sklearn.py", line 820, in predict
predts = self.get_booster().inplace_predict(
File "/home/andrewh/.local/lib/python3.8/site-packages/xgboost/core.py", line 1846, in inplace_predict
_check_call(
File "/home/andrewh/.local/lib/python3.8/site-packages/xgboost/core.py", line 210, in _check_call
raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [21:36:57] ../src/c_api/../data/array_interface.h:352: Floating point is not supported.
Stack trace:
[bt] (0) /home/andrewh/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x912df) [0x7f2f483e62df]
[bt] (1) /home/andrewh/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x9af7f) [0x7f2f483eff7f]
[bt] (2) /home/andrewh/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterPredictFromDense+0xf8) [0x7f2f483d8e78]
[bt] (3) /lib/x86_64-linux-gnu/libffi.so.7(+0x6ff5) [0x7f2f83f6cff5]
[bt] (4) /lib/x86_64-linux-gnu/libffi.so.7(+0x640a) [0x7f2f83f6c40a]
[bt] (5) /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x58c) [0x7f2f8eb122ac]
[bt] (6) /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x137e0) [0x7f2f8eb127e0]
[bt] (7) python3(_PyObject_MakeTpCall+0x296) [0x5f3446]
[bt] (8) python3(_PyEval_EvalFrameDefault+0x5dc0) [0x56f600]
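A workaround that has helped with similar inplace_predict dtype errors is to hand xgboost a contiguous float32 ndarray instead of a DataFrame slice; whether it fixes this exact crash on xgboost 1.4.0 is an assumption. A sketch, where the frame and column names are stand-ins for the real data:

```python
import numpy as np
import pandas as pd

# Stand-in for the feature frame used in example_model.py.
training_data = pd.DataFrame({"feature_a": [1, 2, 3], "feature_b": [4, 5, 6]})
feature_names = ["feature_a", "feature_b"]

# Convert the slice to a contiguous float32 array before predicting:
X = np.ascontiguousarray(
    training_data[feature_names].to_numpy(dtype=np.float32)
)
# preds = model.predict(X)  # model as loaded in example_model.py (not built here)
```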
I ran the example script and it started downloading v4/validation.parquet.
My wifi was slow and my computer went to sleep. When I woke the computer, the program was hung due to the wifi disconnect, so I killed it and ran it again to "resume the download".
Instead I got:
OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.
I had to manually delete v4/validation.parquet since the numerapi SDK was not able to correctly resume the download.
Below is the output of the program that resumes the download.
2023-01-29 11:22:59,695 INFO numerapi.utils: resuming download
/home/raynos/.local/lib/python3.8/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'numerai-datasets-us-west-2.s3.amazonaws.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
warnings.warn(
v4/validation.parquet: 40%|█████▋ | 463M/1.15G [00:00<00:00, 3.82GB/s]
v4/validation.parquet: 1.15GB [01:05, 17.4MB/s]
2023-01-29 11:24:07,248 INFO numerapi.utils: starting download
v4/live_409.parquet: 3.42MB [00:01, 1.90MB/s]
Below is the output of the program that tries to use the data file from the resumed download.
2023-01-29 11:24:20,449 INFO numerapi.utils: starting download
v4/features.json: 562kB [00:00, 727kB/s]
Reading minimal training data
Traceback (most recent call last):
File "./example_model.py", line 52, in <module>
validation_data = pd.read_parquet('v4/validation.parquet',
File "/home/raynos/.local/lib/python3.8/site-packages/pandas/io/parquet.py", line 493, in read_parquet
return impl.read(
File "/home/raynos/.local/lib/python3.8/site-packages/pandas/io/parquet.py", line 240, in read
result = self.api.parquet.read_table(
File "/home/raynos/.local/lib/python3.8/site-packages/pyarrow/parquet.py", line 1996, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
File "/home/raynos/.local/lib/python3.8/site-packages/pyarrow/parquet.py", line 1831, in read
table = self._dataset.to_table(
File "pyarrow/_dataset.pyx", line 323, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 2311, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.
I don't know if it's possible to do an integrity check with a checksum in the resume-download branch, but doing so would let you verify whether the resumed download succeeded or was corrupted, and delete the corrupted file.
Leaving the corrupted file behind produces a thrift protocol error, since the parquet file is no longer valid.
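For reference, a streamed SHA-256 along these lines would be enough to verify a resumed file, assuming the server published a checksum to compare against (numerapi exposing one is an assumption here):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file's SHA-256 so large parquet files never need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demonstration on a small throwaway file:
with open("tmp_download.bin", "wb") as f:
    f.write(b"parquet bytes")
checksum = sha256_of("tmp_download.bin")
expected = hashlib.sha256(b"parquet bytes").hexdigest()
```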
Running model_upload does not work when uploading the pickle file to Numerai; it fails with an error about the lightgbm booster having no handle.
We should add numerox and numerapi to our example scripts.
I think the current train-test era split risks missing some eras (the last few).
Line 73 in 76c537a
I stumbled upon it by using cv=5 and fixed it by calculating the test splits as:
test_splits = [all_train_eras[i * len_split:(i + 1) * len_split] for i in range(cv - 1)] + \
[all_train_eras[(cv - 1) * len_split:]]
EDIT: addressed in #94