python-qds / qdscreen

Quasi-determinism screening for fast Bayesian Network Structure Learning (from T.Rahier's PhD thesis, 2018)

Home Page: https://python-qds.github.io/qdscreen/

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

functional-dependency correlation categorical feature-selection bayesian-network structure-learning determinism quasi screening

qdscreen's Issues

Add a `version` attribute

Following best practice https://smarie.github.io/python-getversion/#package-versioning-best-practices

Add mnsbc and pumpitup datasets for testing ?

Objective: get (almost) the same curves as in p154 of the thesis

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

Linked with #36

This bug arises when some columns in the dataframe are not categorical and are therefore removed by the model. If the same columns are provided later, for example to fit_selector, the error is raised:

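import pandas as pd
from qdscreen import qd_screen

# "nb" is numeric (not categorical), "name" is categorical
df = pd.DataFrame({
    "nb": [1, 2],
    "name": ["A", "B"]
})
qd_forest = qd_screen(df, categorical_mode="convert")
# the same dataframe, still containing "nb", is passed again here
feat_selector = qd_forest.fit_selector_model(df)
only_important_features_df = feat_selector.remove_qd(df)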

A good idea would be to protect our method against invalid inputs (not the expected names or data)
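As a rough illustration of such a guard, here is a sketch that compares the column names seen at screening time with the ones passed later. The function name `check_same_columns` and the error messages are hypothetical and not part of qdscreen.

```python
def check_same_columns(expected, provided):
    """Raise a ValueError if `provided` column names differ from `expected`."""
    expected, provided = list(expected), list(provided)
    missing = [c for c in expected if c not in provided]
    unexpected = [c for c in provided if c not in expected]
    if missing or unexpected:
        raise ValueError(
            "Invalid input columns. Missing: %r. Unexpected: %r." % (missing, unexpected)
        )
    if expected != provided:
        raise ValueError("Columns are in a different order than expected: %r" % (provided,))


# columns kept by the screening step (hypothetical) vs. columns passed later
check_same_columns(["name"], ["name"])  # passes silently
try:
    check_same_columns(["name"], ["nb", "name"])
except ValueError as e:
    message = str(e)
```

An explicit error like this would replace the opaque IndexingError with a message naming the offending columns.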

AttributeError: module 'numpy' has no attribute 'object'.

The `np.object` alias was deprecated in NumPy 1.20 and removed in 1.24, hence the AttributeError; we should use the builtin `object` instead.

Remove bandit<1.7.3 in requirements when not needed anymore

See tylerwince/flake8-bandit#21

We should remove our extra line in flake8-requirements.txt when not needed anymore

Warning with Scipy on python 3.9

https://python-qds.github.io/qdscreen/generated/gallery/1_remove_correlated_vars_demo/

On that page we can see these warnings:

/home/runner/work/qdscreen/qdscreen/.nox/publish-3-9/lib/python3.9/site-packages/pkg_resources/__init__.py:2804: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('mpl_toolkits')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(pkg)
/home/runner/work/qdscreen/qdscreen/.nox/publish-3-9/lib/python3.9/site-packages/sklearn/utils/multiclass.py:14: DeprecationWarning: Please use `spmatrix` from the `scipy.sparse` namespace, the `scipy.sparse.base` namespace is deprecated.
  from scipy.sparse.base import spmatrix
/home/runner/work/qdscreen/qdscreen/.nox/publish-3-9/lib/python3.9/site-packages/sklearn/utils/optimize.py:18: DeprecationWarning: Please use `line_search_wolfe2` from the `scipy.optimize` namespace, the `scipy.optimize.linesearch` namespace is deprecated.
  from scipy.optimize.linesearch import line_search_wolfe2, line_search_wolfe1
/home/runner/work/qdscreen/qdscreen/.nox/publish-3-9/lib/python3.9/site-packages/sklearn/utils/optimize.py:18: DeprecationWarning: Please use `line_search_wolfe1` from the `scipy.optimize` namespace, the `scipy.optimize.linesearch` namespace is deprecated.
  from scipy.optimize.linesearch import line_search_wolfe2, line_search_wolfe1

Split forest and feature selection model

That will be clearer to use and maintain

Rename `QDSSelector` to `QDScreen` and update the doc so that the `sklearn` submodule (rename it too) appears

from qdscreen.sklearn import QDScreen seems nice

Get rid of conda for builds

Let's use virtualenv now; it should be OK even on Windows.

Issue with one test on python 2.7 and 3.5

https://github.com/python-qds/qdscreen/actions/runs/4441558867/jobs/7796796417?pr=41#step:6:101

We should investigate this later; it is probably worth understanding, as it seems related to our "nan policy" (see #34).

For now I'll mark the test as xfail

CI Error: The version '3.6' with architecture 'x64' was not found for Ubuntu 22.04.

This happens with Python 3.5, 3.6 and 2.7.
See actions/setup-python#544

The issue can be fixed by forcing the Ubuntu runner to 20.04.

UserWarning: object dtype is not supported by sparse matrices

From the gallery example

from qdscreen import qd_screen

# detect strict deterministic relationships
qd_forest = qd_screen(df)

# Fit selector model
qd_forest.fit_selector_model(df)

yields

UserWarning: object dtype is not supported by sparse matrices
  warnings.warn("object dtype is not supported by sparse matrices")

It seems that this is not supported but works...
https://stackoverflow.com/questions/47845327/convert-numpy-object-array-to-sparse-matrix

"Interactive" mode: we could provide a method in qdeterscreen to return a list of arcs in increasing order of conditional/relative conditional entropy

Got this idea from qd_forest.print_arcs(): it would be cool, before actually creating the forest, to be able to get such a list so as to plot a bar chart of the ranking.

Seeing the bar chart could help us define a "good" threshold.
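To make the idea concrete, here is a small standard-library sketch that ranks all candidate arcs by conditional entropy H(child|parent). The function names and the toy data are illustrative assumptions; this is not how qdscreen computes entropies internally, and the relative variant is left out.

```python
from collections import Counter
from itertools import permutations
from math import log2


def conditional_entropy(x, y):
    """Return H(Y|X) in bits for two equally long sequences of categorical values."""
    n = len(x)
    joint = Counter(zip(x, y))
    marginal_x = Counter(x)
    # H(Y|X) = -sum over (x, y) of p(x, y) * log2(p(x, y) / p(x))
    return -sum((c / n) * log2(c / marginal_x[xv]) for (xv, _), c in joint.items())


def rank_arcs(columns):
    """Return [(parent, child, H(child|parent))] sorted by increasing entropy."""
    arcs = [
        (p, c, conditional_entropy(columns[p], columns[c]))
        for p, c in permutations(columns, 2)
    ]
    return sorted(arcs, key=lambda arc: arc[2])


data = {
    "city": ["Paris", "Lyon", "Paris", "Nice"],
    "country": ["FR", "FR", "FR", "FR"],
    "flag": ["a", "a", "b", "b"],
}
ranking = rank_arcs(data)
```

The third element of each tuple in `ranking` is what one would plot as a bar chart before picking a threshold.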

Support both dataframes and numpy arrays (both structured and unstructured) as input

`QDForest.get_arcs` returned order is different whether the object was created from adj matrix or array of arcs

This can be seen with get_arcs_str_list too, see test_qd_forest in test_core.py
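One possible way to make the order independent of the construction path is to sort the arcs in both cases. The sketch below assumes a parents array where -1 marks a root and an adjacency matrix where the row index is the parent; both conventions and both function names are assumptions for illustration.

```python
def arcs_from_parents(parents):
    """Arcs (parent, child) from a parents array, where -1 marks a root."""
    return sorted((p, c) for c, p in enumerate(parents) if p >= 0)


def arcs_from_adjmat(adjmat):
    """Arcs (parent, child) from a square 0/1 adjacency matrix (row = parent)."""
    return sorted(
        (p, c)
        for p, row in enumerate(adjmat)
        for c, has_arc in enumerate(row)
        if has_arc
    )


parents = [-1, 0, 0, 2]  # arcs 0->1, 0->2, 2->3
adjmat = [
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
```

With this, both representations of the same forest yield the same list of arcs.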

Provide a bridge to PyGOBN

From PyGOBN doc:

>>> gobn = GOBN()
>>> edge_reqs = {'A':['B','C'],'B':['D']} # require that A->B, A->C, and B->D
>>> ind_reqs = [('A','D'),(('A','B'),'D','C')] # require that A _|_ D and A,B _|_ D | C
>>> nonedge_reqs = {'B':['C']} # disallow that B->C
>>> gobn.set_constraints(edge_reqs, ind_reqs, nonedge_reqs)
>>> gobn.learn(data) # Call the GOBNILP solver

So we could provide a <QDForest>.to_pygobn() method that returns a tuple (edge_reqs, ind_reqs, nonedge_reqs). The user would then do:

from qdscreen import qdeterscreen
from pyGOBN import GOBN

# screen for (quasi-)determinism and get the constraints to pass to GOBN
qd_forest = qdeterscreen(data[, options])
reqs = qd_forest.to_pygobn()

# run GOBN with these constraints
gobn = GOBN()
gobn.set_constraints(*reqs)
gobn.learn(data)
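A sketch of what the conversion itself could look like, assuming the forest can be reduced to a list of (parent, child) arcs. The function name `arcs_to_pygobn_reqs` is hypothetical, and treating every arc as a required edge is a design assumption rather than something settled in this issue.

```python
from collections import defaultdict


def arcs_to_pygobn_reqs(arcs):
    """Turn (parent, child) arcs into an (edge_reqs, ind_reqs, nonedge_reqs) tuple."""
    edge_reqs = defaultdict(list)
    for parent, child in arcs:
        edge_reqs[parent].append(child)
    # this sketch only emits required edges: no independence or non-edge constraints
    return dict(edge_reqs), [], {}


reqs = arcs_to_pygobn_reqs([("A", "B"), ("A", "C"), ("B", "D")])
```

The first element has the same shape as `edge_reqs` in the PyGOBN doc excerpt above.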

Print methods for `QDForest`

It would be convenient to have a few methods to print information about QDForest, with an optional debug mode (showing the entropies on top of the links).

`IndexError` when a Nan is present in the dataframe

When NaNs are present in the dataframe, an IndexError can occur:

import numpy as np
import pandas as pd
from qdscreen import qd_screen

df = pd.DataFrame([
    ["A", "B"],
    ["A", "B"],
    ["N", np.nan],
])

qd_forest = qd_screen(df)
qd_forest.fit_selector_model(df)

yields

  File "C:\_dev\python_ws\_Libs_OpenSource\qdscreen\qdscreen\selector.py", line 112, in <lambda>
    levels_mapping_df = pd.DataFrame(X_ar[:, (parent, child)]).groupby(0).agg(lambda x: x.value_counts().index[0])
  File "C:\Miniconda3\envs\tools_py37\lib\site-packages\pandas\core\indexes\base.py", line 4297, in __getitem__
    return getitem(key)
IndexError: index 0 is out of bounds for axis 0 with size 0

Provide a parameter in order to decide if `None`/`np.nan` should be considered missing value or normal value

Two different concepts:

A missing value should not be used in the calculation of entropies or conditional entropies, and should only be used as a "last resort" in the feature selection model (Series.mode has a flag to ignore NaNs).
A "none" level, if considered an acceptable value, should be used in entropy calculations and in the calculation of the mode in the feature selection model.
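The two policies can be sketched for the mode computation as follows. The parameter name `nan_is_level` and both helpers are hypothetical, and the sketch assumes a non-empty input.

```python
from collections import Counter
from math import isnan


def is_missing(v):
    """True for None and for float nan."""
    return v is None or (isinstance(v, float) and isnan(v))


def most_frequent(values, nan_is_level=False):
    """Most frequent value; missing values only compete when `nan_is_level` is True."""
    # normalise every missing marker to None so that they are counted together
    kept = [None if is_missing(v) else v for v in values]
    if not nan_is_level:
        observed = [v for v in kept if v is not None]
        # "last resort": fall back to the missing marker only if nothing else exists
        kept = observed or kept
    return Counter(kept).most_common(1)[0][0]


values = ["A", None, float("nan"), None]
```

Here `most_frequent(values)` ignores the three missing entries, while `most_frequent(values, nan_is_level=True)` lets them win the vote.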

Provide a sklearn compliant feature selector

Provide a method to generate a graphviz-compliant representation of the forest
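A minimal sketch of such an export, assuming the forest is available as a list of (parent, child) arcs. The function name `arcs_to_dot` is hypothetical; node names containing double quotes are not escaped, and isolated roots are not emitted.

```python
def arcs_to_dot(arcs, name="qd_forest"):
    """Return a DOT `digraph` string for a list of (parent, child) arcs."""
    lines = ["digraph %s {" % name]
    for parent, child in arcs:
        lines.append('    "%s" -> "%s";' % (parent, child))
    lines.append("}")
    return "\n".join(lines)


dot = arcs_to_dot([("A", "B"), ("A", "C")])
```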

Support non-categorical columns

For now, all input columns must be categorical.

We could try to support numerical data too.

If this is not possible at all for some reason, we could provide a helper function to convert an input easily.
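Such a helper could be as simple as equal-width binning. The sketch below uses only the standard library; the function name `to_categories` and the `bin_<i>` labels are illustrative. With pandas, `pandas.cut` would probably be the more natural building block.

```python
def to_categories(values, n_bins=3):
    """Map numeric values to labels "bin_0" ... "bin_<n_bins-1>" using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid dividing by zero on constant columns
    return ["bin_%d" % min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [1, 2, 35, 40, 78, 80]
labels = to_categories(ages)
```

The resulting labels can then be screened like any other categorical column.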

Annoying scipy warning

The recent 0.6.3 release introduces an annoying scipy warning

warnings.warn("The input array could not be properly "

Get a string representation of arcs in the found forest

Use latest genbadge and remove dependency to xunitparser

See smarie/python-genbadge#18

`Entropies`: provide methods to return the ranked absolute and relative conditional entropies

To ease finding the best threshold

Move the examples to an example gallery

Using mkdocs-gallery

`predict_qd` raises `ValueError: invalid literal for int() with base 10`

This bug happens when a column starts with a nan and then contains a string. `numpy.vectorize` infers from the leading nan that the mapping operation returns a number, and then fails when it reaches a string.
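The mechanism can be reproduced in isolation, together with one possible fix: declaring the output type through the `otypes` argument of `numpy.vectorize`. The `mapping` dict and `map_level` function are stand-ins, not the actual qdscreen code, and the exact error message differs from the one in the title (a float conversion here rather than `int()`).

```python
import numpy as np

mapping = {"A": "a", "B": "b"}  # hypothetical parent level -> child level mapping


def map_level(v):
    # unknown values (here: nan) are passed through unchanged
    return mapping.get(v, v)


values = np.array([np.nan, "A", "B"], dtype=object)

# Without `otypes`, np.vectorize infers the output type from the first call
# (a float here, because of the leading nan) and then fails on the strings.
try:
    np.vectorize(map_level)(values)
    inferred_type_failed = False
except ValueError:
    inferred_type_failed = True

# Declaring the output type explicitly avoids the guess.
result = np.vectorize(map_level, otypes=[object])(values)
```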

UserWarning: Usage of dash-separated 'description-file' will not be supported in future versions

This warning happens during packaging

ValueError: object dtype is not supported by sparse matrices

In recent scipy (1.9.1) using sparse.dok_matrix does not work anymore with object-typed arrays

..\main.py:524: in fit_selector_model
    model.fit(X)
..\selector.py:100: in fit
    self._maps = maps = sparse.dok_matrix((n, n), dtype=object)
..\..\.nox\tests-3-8\lib\site-packages\scipy\sparse\_dok.py:78: in __init__
    self.dtype = getdtype(dtype, default=float)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

dtype = <class 'object'>, a = None, default = <class 'float'>

    def getdtype(dtype, a=None, default=None):
        """Function used to simplify argument processing. If 'dtype' is not
        specified (is None), returns a.dtype; otherwise returns a np.dtype
        object created from the specified dtype argument. If 'dtype' and 'a'
        are both None, construct a data type out of the 'default' parameter.
        Furthermore, 'dtype' must be in 'allowed' set.
        """
        # TODO is this really what we want?
        if dtype is None:
            try:
                newdtype = a.dtype
            except AttributeError as e:
                if default is not None:
                    newdtype = np.dtype(default)
                else:
                    raise TypeError("could not interpret data type") from e
        else:
            newdtype = np.dtype(dtype)
            if newdtype == np.object_:
>               raise ValueError(
                    "object dtype is not supported by sparse matrices"
                )
E               ValueError: object dtype is not supported by sparse matrices

..\..\.nox\tests-3-8\lib\site-packages\scipy\sparse\_sputils.py:113: ValueError
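One way around this, sketched under the assumption that the matrix is only used as a sparse lookup from (parent, child) to an arbitrary Python object, is a plain dict-backed container. The class name `PairMap` is hypothetical and this is not necessarily the fix adopted in qdscreen.

```python
class PairMap:
    """Sparse (n x n) container of arbitrary Python objects, backed by a dict."""

    def __init__(self, n):
        self.shape = (n, n)
        self._data = {}

    def __setitem__(self, key, value):
        self._check(key)
        self._data[key] = value

    def __getitem__(self, key):
        self._check(key)
        return self._data[key]  # KeyError if nothing was stored for this pair

    def _check(self, key):
        i, j = key
        if not (0 <= i < self.shape[0] and 0 <= j < self.shape[1]):
            raise IndexError("index %r out of bounds for shape %r" % (key, self.shape))

    def items(self):
        return self._data.items()


maps = PairMap(3)
maps[0, 2] = {"A": "x", "B": "y"}  # e.g. a per-arc mapping of levels
```

This keeps the `maps[i, j]` indexing style while dropping the dependency on scipy's dtype support.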

Don't use `scipy_mode` as this will be deprecated

Improve speed of qd_screen

We should ensure that this Python package is at least as fast as (or even faster than) the R experiments used in the thesis and paper.

For this we should do some profiling with pytest-profiling to understand where the time is spent and what could be optimized.

Note for Windows users: there is still a manual command to execute in order to see the svg, see man-group/pytest-plugins#162
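Outside of pytest, a quick first look can also be had with the standard library's `cProfile`, which is a different tool from the plugin mentioned above. In this sketch `workload` is a stand-in for a call such as `qd_screen(df)`, and `profile_call` is a hypothetical helper.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run `func` under cProfile and return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


def workload(n):
    # stand-in for the function to profile
    return sum(i * i for i in range(n))


result, report = profile_call(workload, 10000)
```

Printing `report` lists the ten entries with the highest cumulative time.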

Better internal structure for `QDForest`

Currently a QDForest can be created either from an adjacency matrix (adjmat) or from a parents array. Is this optimal?
We should probably use a scipy.sparse structure or two instead.

This should be tailored to usage:

get_arcs method should be efficient for QDSelectorModel.fit(X)
walk_arcs method should be efficient for QDSelectorModel.predict_qd(X)
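For the second requirement, a parents array already allows a cheap root-to-leaf traversal, presumably what prediction needs since a child can only be derived once its parent is known. The sketch below assumes -1 marks a root; it illustrates the idea and is not the existing `walk_arcs` implementation.

```python
from collections import defaultdict, deque


def walk_arcs(parents):
    """Yield (parent, child) arcs so that each node is reached after its own parent."""
    children = defaultdict(list)
    roots = []
    for node, parent in enumerate(parents):
        if parent < 0:
            roots.append(node)
        else:
            children[parent].append(node)
    # breadth-first traversal starting from the roots
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for child in children[node]:
            yield node, child
            queue.append(child)


parents = [2, 0, -1, 2]  # arcs 2->0, 2->3, 0->1
order = list(walk_arcs(parents))
```

Building the `children` index once costs O(n), after which every arc is visited exactly once.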

Remove `pyitlib` dependency in order to master (and possibly accelerate) entropy computations

Currently we use the following functions in pyitlib:

It would be great to include functions directly in our code so as to not rely on an external dependency "just for 2 functions". Also another interesting thing is that we could compile it, for example as shown here: https://gist.github.com/kudkudak/dabbed1af234c8e3868e

This might be related to #29

UserWarning: object dtype is not supported by sparse matrices warnings.warn("object dtype is not supported by sparse matrices")

In the doc when executing feat_selector = qd_forest.fit_selector_model(df) this warning appears
