
example-scripts's Introduction

Numerai Example Scripts

A collection of scripts and notebooks to help you get started quickly.

Need help? Find us on Discord.

Notebooks

Try running these notebooks on Google Colab's free tier!

Hello Numerai

Open In Colab

Start here if you are new! Explore the dataset and build your first model.

Feature Neutralization

Open In Colab

Learn how to measure feature risk and control it with feature neutralization.

Target Ensemble

Open In Colab

Learn how to create an ensemble trained on different targets.

Model Upload

Open In Colab

A barebones example of how to build and upload your model to Numerai.

example-scripts's People

Contributors

adamvvu, andrewpeterpei, cshanes, forstmeier, furmaniak, gosuto-inzasheru, harmchop, jeethu, jonathansidego, jonrtaylor, jparyani, kennethlj, kmontag42, kumikoda, kwgoodman, liamhz, michael-phillips-data, mpaepper, murenoha, ndharasz, oftfrfbf, parmarsuraj99, paulelvers, philipcmonk, rwarnung, the-moliver, uditgupta10, uuazed, xanderdunn, zoso95

example-scripts's Issues

example_model.py fails with xgboost 1.4.0

The script fails with "Floating point is not supported" when running with xgboost 1.4.0 on Ubuntu 20.04.

$ python3 example_model.py
Loading data...
Loaded 310 features
Loading pre-trained model...
Generating predictions...
/home/andrewh/.local/lib/python3.8/site-packages/xgboost/data.py:112: UserWarning: Use subset (sliced data) of np.ndarray is not recommended because it will generate
extra copies and increase memory consumption
warnings.warn(
Traceback (most recent call last):
  File "example_model.py", line 236, in <module>
    main()
  File "example_model.py", line 80, in main
    training_data[PREDICTION_NAME] = model.predict(training_data[feature_names])
  File "/home/andrewh/.local/lib/python3.8/site-packages/xgboost/sklearn.py", line 820, in predict
    predts = self.get_booster().inplace_predict(
  File "/home/andrewh/.local/lib/python3.8/site-packages/xgboost/core.py", line 1846, in inplace_predict
    _check_call(
  File "/home/andrewh/.local/lib/python3.8/site-packages/xgboost/core.py", line 210, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [21:36:57] ../src/c_api/../data/array_interface.h:352: Floating point is not supported.
Stack trace:
[bt] (0) /home/andrewh/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x912df) [0x7f2f483e62df]
[bt] (1) /home/andrewh/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x9af7f) [0x7f2f483eff7f]
[bt] (2) /home/andrewh/.local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterPredictFromDense+0xf8) [0x7f2f483d8e78]
[bt] (3) /lib/x86_64-linux-gnu/libffi.so.7(+0x6ff5) [0x7f2f83f6cff5]
[bt] (4) /lib/x86_64-linux-gnu/libffi.so.7(+0x640a) [0x7f2f83f6c40a]
[bt] (5) /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x58c) [0x7f2f8eb122ac]
[bt] (6) /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x137e0) [0x7f2f8eb127e0]
[bt] (7) python3(_PyObject_MakeTpCall+0x296) [0x5f3446]
[bt] (8) python3(_PyEval_EvalFrameDefault+0x5dc0) [0x56f600]
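
For anyone hitting the same error: xgboost 1.4.0's inplace_predict only accepts float32/float64 arrays, so if the features were loaded as float16 (as older versions of the example script did to save memory), predict fails with this message. A minimal workaround sketch, assuming that is the cause here:

# Workaround sketch: cast the feature matrix to float32 before predicting.
# (Assumes the features were loaded as float16, which xgboost 1.4.0's
# inplace_predict cannot consume.)
features_f32 = training_data[feature_names].astype("float32")
training_data[PREDICTION_NAME] = model.predict(features_f32)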

Install forces the newest numpy, but numba requires a lower version

Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.0
    Uninstalling numpy-1.21.0:
      Successfully uninstalled numpy-1.21.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.55.1 requires numpy<1.22,>=1.18, but you have numpy 1.22.3 which is incompatible.
Successfully installed numpy-1.22.3

Currently, the default install process requires numpy < 1.22 because of the numba requirement.
However, pip install -r requirements.txt forces the lower-versioned numpy to be uninstalled and installs the newest version, causing the errors above.
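
A possible stopgap, assuming nothing else in requirements.txt needs numpy >= 1.22, is to pin numpy explicitly so pip's resolver cannot upgrade past what numba 0.55.1 accepts:

# requirements.txt sketch: keep numpy inside the range numba 0.55.1 supports,
# so the resolver cannot jump to 1.22.x.
numpy>=1.18,<1.22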

Issue running example_model_advanced.py with v4.1 data

This is not strictly a bug, but with the 4.1 data being the current standard I was interested in running the advanced example on it. However, it errors on the neutralization step below with an error I can't get to the bottom of.

            # do neutralization
            print("doing neutralization to riskiest features")
            training_data.loc[test_split_index, f"preds_{model_name}_neutral_riskiest_50"] = neutralize(
                df=training_data.loc[test_split_index, :],
                columns=[f"preds_{model_name}"],
                neutralizers=riskiest_features_split,
                proportion=1.0,
                normalize=True,
                era_col=ERA_COL)[f"preds_{model_name}"]

The following is the full error trace.

---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-10-6a1221969f79> in <cell line: 2>()
     65             # do neutralization
     66             print("doing neutralization to riskiest features")
---> 67             training_data.loc[test_split_index, f"preds_{model_name}_neutral_riskiest_50"] = neutralize(
     68                 df=training_data.loc[test_split_index, :],
     69                 columns=[f"preds_{model_name}"],

5 frames
<ipython-input-2-6459d8dbad0a> in neutralize(df, columns, neutralizers, proportion, normalize, era_col, verbose)
    139 
    140         scores -= proportion * exposures.dot(
--> 141             np.linalg.pinv(exposures.astype(np.float32), rcond=1e-6).dot(
    142                 scores.astype(np.float32)
    143             )

/usr/local/lib/python3.9/dist-packages/numpy/core/overrides.py in pinv(*args, **kwargs)

/usr/local/lib/python3.9/dist-packages/numpy/linalg/linalg.py in pinv(a, rcond, hermitian)
   1988         return wrap(res)
   1989     a = a.conjugate()
-> 1990     u, s, vt = svd(a, full_matrices=False, hermitian=hermitian)
   1991 
   1992     # discard small singular values

/usr/local/lib/python3.9/dist-packages/numpy/core/overrides.py in svd(*args, **kwargs)

/usr/local/lib/python3.9/dist-packages/numpy/linalg/linalg.py in svd(a, full_matrices, compute_uv, hermitian)
   1646 
   1647         signature = 'D->DdD' if isComplexType(t) else 'd->ddd'
-> 1648         u, s, vh = gufunc(a, signature=signature, extobj=extobj)
   1649         u = u.astype(result_t, copy=False)
   1650         s = s.astype(_realType(result_t), copy=False)

/usr/local/lib/python3.9/dist-packages/numpy/linalg/linalg.py in _raise_linalgerror_svd_nonconvergence(err, flag)
     95 
     96 def _raise_linalgerror_svd_nonconvergence(err, flag):
---> 97     raise LinAlgError("SVD did not converge")
     98 
     99 def _raise_linalgerror_lstsq(err, flag):

LinAlgError: SVD did not converge

Any help or advice from anyone would be amazing.
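
In case it helps: "SVD did not converge" from np.linalg.pinv is usually triggered by NaNs or infs in the matrix, and several v4.1 features are NaN in the older training eras. A minimal sketch of a check and workaround, using the variables from the snippet above and assuming missing feature values are the culprit (the right fill value depends on how the features are scaled):

# Check whether the neutralizer columns contain NaNs, which break the SVD
# inside np.linalg.pinv.
exposures = training_data.loc[test_split_index, riskiest_features_split]
print("any NaN:", exposures.isna().any().any())

# If so, fill them before calling neutralize(). 2 is the midpoint of the 0-4
# integer feature scale; use 0.5 instead if your features are scaled to [0, 1].
training_data[riskiest_features_split] = training_data[riskiest_features_split].fillna(2)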

Crashing due to malloc failure

Trying to run example_model.py on a Windows 10 notebook with 8 GB RAM results in the following crash:

(numer.ai) C:\Users\lsadmin\Documents\numer.ai\example-scripts>python example_model.py
Downloading dataset files...
2022-05-17 16:01:22,904 INFO numerapi.utils: target file already exists
2022-05-17 16:01:22,906 INFO numerapi.utils: download complete
2022-05-17 16:01:24,126 INFO numerapi.utils: target file already exists
2022-05-17 16:01:24,126 INFO numerapi.utils: download complete
2022-05-17 16:01:25,315 INFO numerapi.utils: target file already exists
2022-05-17 16:01:25,315 INFO numerapi.utils: download complete
2022-05-17 16:01:26,494 INFO numerapi.utils: target file already exists
2022-05-17 16:01:26,494 INFO numerapi.utils: download complete
2022-05-17 16:01:28,128 INFO numerapi.utils: target file already exists
2022-05-17 16:01:28,128 INFO numerapi.utils: download complete
Reading minimal training data
Traceback (most recent call last):
  File "C:\Users\lsadmin\Documents\numer.ai\example-scripts\example_model.py", line 52, in <module>
    validation_data = pd.read_parquet('v4/validation.parquet',
  File "C:\Users\lsadmin\anaconda3\envs\numer.ai\lib\site-packages\pandas\io\parquet.py", line 493, in read_parquet
    return impl.read(
  File "C:\Users\lsadmin\anaconda3\envs\numer.ai\lib\site-packages\pandas\io\parquet.py", line 240, in read
    result = self.api.parquet.read_table(
  File "pyarrow\array.pxi", line 767, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow\table.pxi", line 1996, in pyarrow.lib.Table._to_pandas
  File "C:\Users\lsadmin\anaconda3\envs\numer.ai\lib\site-packages\pyarrow\pandas_compat.py", line 789, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "C:\Users\lsadmin\anaconda3\envs\numer.ai\lib\site-packages\pyarrow\pandas_compat.py", line 1135, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table, categories,
  File "pyarrow\table.pxi", line 1356, in pyarrow.lib.table_to_blocks
  File "pyarrow\error.pxi", line 116, in pyarrow.lib.check_status
pyarrow.lib.ArrowMemoryError: malloc of size 4209552448 failed

Any idea why this happens or how to overcome it? The failed malloc is roughly 4 GB; can the example run on a machine like the one above?
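
Reading the full v4 validation file needs several GB of contiguous memory, so an 8 GB machine is right at the edge. One way to shrink the footprint, sketched below under the assumption that you are using the v4 data and its features.json as in the example scripts: load only a feature subset via the columns argument of read_parquet.

import json
import pandas as pd

# Load the feature metadata and pick the "small" feature set instead of all columns.
with open("v4/features.json") as f:
    feature_metadata = json.load(f)
features = feature_metadata["feature_sets"]["small"]

# Read only the columns that are actually needed.
columns = ["era", "data_type", "target"] + features
validation_data = pd.read_parquet("v4/validation.parquet", columns=columns)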

Downloads are not resumable. Get thrift deserialization error

I ran the example script and it started downloading v4/validation.parquet.

My wifi was slow and my computer went to sleep. When I woke the computer, the program was hung because of the wifi disconnect, so I killed it and ran it again to "resume the download".

Instead I got

OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.

I had to manually delete v4/validation.parquet since the numerai SDK was not able to correctly resume the download.

Below is the output of the program that resumes the download.

2023-01-29 11:22:59,695 INFO numerapi.utils: resuming download
/home/raynos/.local/lib/python3.8/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'numerai-datasets-us-west-2.s3.amazonaws.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
v4/validation.parquet:  40%|█████▋        | 463M/1.15G [00:00<00:00, 3.82GB/s]



v4/validation.parquet: 1.15GB [01:05, 17.4MB/s]                               
2023-01-29 11:24:07,248 INFO numerapi.utils: starting download
v4/live_409.parquet: 3.42MB [00:01, 1.90MB/s]                  

Below is the output of the program that tries to use the data file from the resumed download.

2023-01-29 11:24:20,449 INFO numerapi.utils: starting download
v4/features.json: 562kB [00:00, 727kB/s]                                               
Reading minimal training data
Traceback (most recent call last):
  File "./example_model.py", line 52, in <module>
    validation_data = pd.read_parquet('v4/validation.parquet',
  File "/home/raynos/.local/lib/python3.8/site-packages/pandas/io/parquet.py", line 493, in read_parquet
    return impl.read(
  File "/home/raynos/.local/lib/python3.8/site-packages/pandas/io/parquet.py", line 240, in read
    result = self.api.parquet.read_table(
  File "/home/raynos/.local/lib/python3.8/site-packages/pyarrow/parquet.py", line 1996, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/raynos/.local/lib/python3.8/site-packages/pyarrow/parquet.py", line 1831, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 323, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2311, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.

I don't know if it's possible to do an integrity check with a checksum in the resume-download branch, but doing so would let you verify whether the resumed download succeeded or was corrupted, and delete the corrupted file if not.

Leaving the corrupted file behind gives me a thrift protocol error since the parquet is not valid anymore.
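
Until resuming is fixed upstream, a client-side guard is possible: validate the parquet file before using it and delete it if it is corrupted, so the next run starts a clean download. A best-effort sketch (reading every row group is needed to catch corrupted data pages, since the footer can still be intact):

import os
import pyarrow.parquet as pq

def is_valid_parquet(path):
    # Best-effort check: read the footer and every row group; a truncated or
    # corrupted file raises (e.g. the thrift deserialization error above).
    try:
        pf = pq.ParquetFile(path)
        for i in range(pf.num_row_groups):
            pf.read_row_group(i)
        return True
    except Exception:
        return False

path = "v4/validation.parquet"
if os.path.exists(path) and not is_valid_parquet(path):
    os.remove(path)  # force a clean re-download on the next run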

Typo in Analysis and Tips Notebook

The current analysis_and_tips.ipynb notebook says:

so even if they are good are ranking rows

but I believe you meant:

so even if they are good at ranking rows

Thanks!

Suggest to loosen the dependency on halo

Hi, your project example-scripts requires "halo==0.0.31" in its dependencies. After analyzing the source code, we found that some other versions of halo are also suitable without affecting your project, i.e., halo 0.0.30. Therefore, we suggest loosening the dependency on halo from "halo==0.0.31" to "halo>=0.0.30,<=0.0.31" to avoid possible conflicts when importing more packages or for downstream projects that use example-scripts.

May I open a pull request to loosen the dependency on halo?

By the way, could you tell us whether this kind of dependency analysis would be helpful for maintaining dependencies during your development?



For your reference, here are details in our analysis.

Your project example-scripts(commit id: c447775) directly uses 1 APIs from package halo.

halo.halo.Halo.__init__

From this call, 14 functions are then indirectly invoked, including 8 of halo's internal APIs and 6 outside APIs, as follows (omitting some repeated function occurrences).

[/numerai/example-scripts]
+--halo.halo.Halo.__init__
|      +--halo._utils.get_environment
|      |      +--IPython.get_ipython
|      +--halo.halo.Halo.stop
|      |      +--halo.halo.Halo.clear
|      |      |      +--halo.halo.Halo._write
|      |      |      |      +--halo.halo.Halo._check_stream
|      |      +--halo.halo.Halo._show_cursor
|      |      |      +--halo.halo.Halo._check_stream
|      |      |      +--halo.cursor.show
|      |      |      |      +--halo.cursor._CursorInfo.__init__
|      |      |      |      +--ctypes.windll.kernel32.GetStdHandle
|      |      |      |      +--ctypes.windll.kernel32.GetConsoleCursorInfo
|      |      |      |      +--ctypes.windll.kernel32.SetConsoleCursorInfo
|      |      |      |      +--ctypes.byref
|      +--IPython.get_ipython
|      +--atexit.register

We scanned halo versions 0.0.30 and 0.0.31; the changed functions (diff listed below) have no intersection with any function or API mentioned above (whether directly or indirectly called by this project).

diff: 0.0.31(original) 0.0.30
[](no clear difference between the source codes of two versions)

As for other packages, the APIs of @outside_package_name are called by halo in the call graph and the dependencies on these packages also stay the same in our suggested versions, thus avoiding any outside conflict.

Therefore, we believe that it is quite safe to loosen your dependency on halo from "halo==0.0.31" to "halo>=0.0.30,<=0.0.31". This will improve the applicability of example-scripts and reduce the possibility of further dependency conflicts with other projects/packages.

segmentation fault

Trying to run example_model.py, but I am getting this error. I even tried running it with sudo, but it didn't work. Any help would be appreciated.
(Screenshot attached in the original issue: "Screen Shot 2021-12-12 at 7 32 46 PM", showing the segmentation fault.)

Question: why pickle for saving models?

Dear developers,

I was wondering about the choice of using pandas.to_pickle to save models. In particular I have two questions:

  • Why pandas.to_pickle instead of the Python pickle module? Does pandas offer some interesting features that the generic Python module doesn't?
  • Knowing that pickle has security and maintainability limitations (e.g. upgrading to a new version of LightGBM), why not use the LightGBM save/load API?

thanks
Luca
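
For reference, a sketch of the native LightGBM save/load path the question alludes to (dummy data is used only to make the sketch self-contained; substitute the Numerai features and target):

import lightgbm as lgb
import numpy as np

# Dummy data so the sketch runs stand-alone; replace with the Numerai data.
X = np.random.rand(1000, 20)
y = np.random.rand(1000)

model = lgb.LGBMRegressor(n_estimators=50, learning_rate=0.01)
model.fit(X, y)

# Native serialization: a plain text file that newer LightGBM versions can load.
model.booster_.save_model("model.txt")

# Reload without pickle; note this returns a Booster, not the sklearn wrapper.
booster = lgb.Booster(model_file="model.txt")
preds = booster.predict(X)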

OS Error: dlopen ... Reason: image not found

System

  • 2017 Macbook Pro 15-inch
  • macOS Big Sur v11.5.2
  • python 3.7.9

Problem

(venv) anson ~/numerai/example-scripts (master)
$ python3 example_model.py
Traceback (most recent call last):
  File "example_model.py", line 2, in <module>
    from lightgbm import LGBMRegressor
  File "/Users/anson/numerai/example-scripts/venv/lib/python3.7/site-packages/lightgbm/__init__.py", line 8, in <module>
    from .basic import Booster, Dataset, register_logger
  File "/Users/anson/numerai/example-scripts/venv/lib/python3.7/site-packages/lightgbm/basic.py", line 95, in <module>
    _LIB = _load_lib()
  File "/Users/anson/numerai/example-scripts/venv/lib/python3.7/site-packages/lightgbm/basic.py", line 86, in _load_lib
    lib = ctypes.cdll.LoadLibrary(lib_path[0])
  File "/usr/local/Cellar/[email protected]/3.7.9_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ctypes/__init__.py", line 442, in LoadLibrary
    return self._dlltype(name)
  File "/usr/local/Cellar/[email protected]/3.7.9_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: dlopen(/Users/anson/numerai/example-scripts/venv/lib/python3.7/site-packages/lightgbm/lib_lightgbm.so, 6): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
  Referenced from: /Users/anson/numerai/example-scripts/venv/lib/python3.7/site-packages/lightgbm/lib_lightgbm.so
  Reason: image not found

Solution

brew install libomp
microsoft/LightGBM#1369

Adjust train-test era split

I think the current train-test era split risks missing some eras (the last few).

test_splits = [all_train_eras[i * len_split:(i + 1) * len_split] for i in range(cv)]

I stumbled upon it using cv=5 and fixed it by calculating the test splits as:

test_splits = [all_train_eras[i * len_split:(i + 1) * len_split] for i in range(cv - 1)] + \
        [all_train_eras[(cv - 1)  * len_split:]]

EDIT: addressed in #94
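
Another option, sketched below, is np.array_split, which distributes the remainder across splits so no trailing eras are dropped (using the same all_train_eras and cv as in the snippet above):

import numpy as np

# np.array_split handles the remainder automatically, so every era lands in
# exactly one test split even when len(all_train_eras) % cv != 0.
test_splits = [list(split) for split in np.array_split(all_train_eras, cv)]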

License

I would like to publish some of the code from this repository in an MIT-licensed library.
Are you ok with that? Any attribution required?

Thanks for sharing such great stuff!

The example_model.py file grinds my laptop to a halt on 16 GB RAM hardware

I have a reasonable recent laptop.

example_model.py maxed out all 16 GB of my RAM and used 9 GB of swap, and my whole laptop was unusable for anything else.

Is there a way to run the program with its RAM usage limited to 8 GB or so, so that I can keep using my laptop for browsing or code editing while the model trains?

Or should the minimum system requirements be bumped to 32 GB of RAM?

Could the parquet files be read from and written to a key-value DB like LMDB or RocksDB, to avoid having to upgrade my laptop from 16 GB to 32 GB of RAM?

Alternatively, should we add instructions on how to SSH into an EC2 instance with 32 GB of RAM for the purpose of running the example scripts?

Laptop overview: 6-core i7 @ 2.6 GHz, 16 GB RAM, 256 GB SSD

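One way to keep peak memory bounded without new hardware, sketched below: stream the validation parquet in row-group batches with pyarrow instead of loading it all at once (the column names here are illustrative):

import pyarrow.parquet as pq

# Stream the file batch by batch so only one chunk is resident at a time.
pf = pq.ParquetFile("v4/validation.parquet")
for batch in pf.iter_batches(batch_size=100_000, columns=["era", "target"]):
    chunk = batch.to_pandas()
    # ... generate predictions for this chunk, write them out, then drop it ...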

feature_subset actually submits medium/all neutralized

feature_subset is initially set to medium/serenity, the intended neutralization:

feature_subset = list(subgroups["medium"]["serenity"])

But later, in the "neutralizing different groups" section, it is reused inside the for loop and ends up set to medium/all, which is then pickled for the submission.

for group in groups:
    feature_subset = list(subgroups["medium"][group])
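
A minimal fix sketch: give the loop its own variable so the top-level feature_subset (medium/serenity) survives until submission (group_subset is a hypothetical name):

for group in groups:
    group_subset = list(subgroups["medium"][group])
    # ... neutralize against group_subset here, leaving feature_subset untouched ...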
