
python-qds / qdscreen


Quasi-determinism screening for fast Bayesian Network Structure Learning (from T.Rahier's PhD thesis, 2018)

Home Page: https://python-qds.github.io/qdscreen/

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
functional-dependency correlation categorical feature-selection bayesian-network structure-learning determinism quasi screening

qdscreen's People

Contributors

smarie


qdscreen's Issues

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

Linked with #36

This bug arises when some columns in the dataframe are not categorical and are therefore removed by the model. If the same columns are provided later, for example to fit_selector_model, the error is raised:

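import pandas as pd
from qdscreen import qd_screen

# "nb" is numeric, "name" is categorical
df = pd.DataFrame({
    "nb": [1, 2],
    "name": ["A", "B"]
})
qd_forest = qd_screen(df, categorical_mode="convert")
# the error is reported when the same df is passed again, e.g. here
feat_selector = qd_forest.fit_selector_model(df)
only_important_features_df = feat_selector.remove_qd(df)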

A good idea would be to protect our methods against invalid inputs (unexpected column names or data).
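A minimal sketch of such a check, working on plain lists of column names. The helper name `check_columns` and the error message are hypothetical and not part of qdscreen:

```python
def check_columns(expected, provided):
    """Raise a clear error when `provided` column names differ from `expected`.

    Hypothetical helper, not part of qdscreen: both arguments are plain
    sequences of column names (for example `list(df.columns)`).
    """
    expected, provided = list(expected), list(provided)
    missing = [c for c in expected if c not in provided]
    unexpected = [c for c in provided if c not in expected]
    if missing or unexpected:
        raise ValueError(
            f"Column mismatch: missing={missing}, unexpected={unexpected}"
        )


# Illustration: only "name" is expected, but the full frame is passed later.
try:
    check_columns(expected=["name"], provided=["nb", "name"])
except ValueError as exc:
    message = str(exc)
    print(message)
```

Failing early with an explicit message like this would be easier to diagnose than the pandas IndexingError.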

Warning with Scipy on python 3.9

https://python-qds.github.io/qdscreen/generated/gallery/1_remove_correlated_vars_demo/

we can see this:

/home/runner/work/qdscreen/qdscreen/.nox/publish-3-9/lib/python3.9/site-packages/pkg_resources/__init__.py:2804: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('mpl_toolkits')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(pkg)
/home/runner/work/qdscreen/qdscreen/.nox/publish-3-9/lib/python3.9/site-packages/sklearn/utils/multiclass.py:14: DeprecationWarning: Please use `spmatrix` from the `scipy.sparse` namespace, the `scipy.sparse.base` namespace is deprecated.
  from scipy.sparse.base import spmatrix
/home/runner/work/qdscreen/qdscreen/.nox/publish-3-9/lib/python3.9/site-packages/sklearn/utils/optimize.py:18: DeprecationWarning: Please use `line_search_wolfe2` from the `scipy.optimize` namespace, the `scipy.optimize.linesearch` namespace is deprecated.
  from scipy.optimize.linesearch import line_search_wolfe2, line_search_wolfe1
/home/runner/work/qdscreen/qdscreen/.nox/publish-3-9/lib/python3.9/site-packages/sklearn/utils/optimize.py:18: DeprecationWarning: Please use `line_search_wolfe1` from the `scipy.optimize` namespace, the `scipy.optimize.linesearch` namespace is deprecated.
  from scipy.optimize.linesearch import line_search_wolfe2, line_search_wolfe1

UserWarning: object dtype is not supported by sparse matrices

From the gallery example

from qdscreen import qd_screen

# detect strict deterministic relationships
qd_forest = qd_screen(df)

# Fit selector model
qd_forest.fit_selector_model(df)

yields

UserWarning: object dtype is not supported by sparse matrices
  warnings.warn("object dtype is not supported by sparse matrices")

It seems that this is not supported but works...
https://stackoverflow.com/questions/47845327/convert-numpy-object-array-to-sparse-matrix

Provide a bridge to PyGOBN

From PyGOBN doc:

>>> gobn = GOBN()
>>> edge_reqs = {'A':['B','C'],'B':['D']} # require that A->B, A->C, and B->D
>>> ind_reqs = [('A','D'),(('A','B'),'D','C')] # require that A _|_ D and A,B _|_ D | C
>>> nonedge_reqs = {'B':['C']} # disallow that B->C
>>> gobn.set_constraints(edge_reqs, ind_reqs, nonedge_reqs)
>>> gobn.learn(data) # Call the GOBNILP solver

So we can maybe provide a <QDForest>.to_pygobn() method that would return a tuple edge_reqs, ind_reqs, and nonedge_reqs. The user would therefore do:

from qdscreen import qd_screen
from pyGOBN import GOBN

# screen for (quasi-)determinism and get the constraints to pass to GOBN
qd_forest = qd_screen(data)  # optional arguments omitted
reqs = qd_forest.to_pygobn()  # proposed method, does not exist yet

# run GOBN with these constraints
gobn = GOBN()
gobn.set_constraints(*reqs)
gobn.learn(data)
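A rough sketch of what the conversion could look like, assuming the forest is available as a parents array where `parents[i]` is the index of the parent of node `i` and -1 marks a root. The function name `parents_to_edge_reqs` is hypothetical, and whether the arcs should become required edges is a design choice still to be discussed:

```python
def parents_to_edge_reqs(parents, names):
    """Build a PyGOBN-style `edge_reqs` dict from a parents array.

    Assumption: `parents[i]` is the index of the parent of node `i`,
    or -1 when node `i` is a root.
    """
    edge_reqs = {}
    for child, parent in enumerate(parents):
        if parent >= 0:
            edge_reqs.setdefault(names[parent], []).append(names[child])
    return edge_reqs


names = ["A", "B", "C", "D"]
parents = [-1, 0, 0, 1]  # A -> B, A -> C, B -> D

edge_reqs = parents_to_edge_reqs(parents, names)
ind_reqs, nonedge_reqs = [], {}  # left empty in this sketch
reqs = (edge_reqs, ind_reqs, nonedge_reqs)
print(reqs)
```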

Print methods for `QDForest`

It would be convenient to have a few methods to print information about QDForest, with an optional debug mode (showing the entropies on top of the links).
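As a starting point, here is a sketch of a text rendering with an optional debug mode. It assumes a parents array (`parents[i]` is the parent index of node `i`, -1 for roots) and, for the debug mode, a dict of entropies keyed by `(parent, child)`. The function name and the output layout are hypothetical:

```python
def forest_to_str(parents, names, entropies=None):
    """Render a forest as indented text.

    Assumptions: `parents[i]` is the parent index of node `i` (-1 for roots);
    `entropies`, when given, maps (parent, child) index pairs to a float that
    is displayed next to the child (debug mode).
    """
    children, roots = {}, []
    for child, parent in enumerate(parents):
        if parent < 0:
            roots.append(child)
        else:
            children.setdefault(parent, []).append(child)

    lines = []

    def visit(node, depth):
        for child in children.get(node, []):
            label = names[child]
            if entropies is not None:
                label += f" [H={entropies[(node, child)]:.2f}]"
            lines.append("   " * depth + "+- " + label)
            visit(child, depth + 1)

    for root in roots:
        lines.append(names[root])
        visit(root, 0)
    return "\n".join(lines)


print(forest_to_str([-1, 0, 0, 1], ["A", "B", "C", "D"]))
print(forest_to_str([-1, 0], ["A", "B"], entropies={(0, 1): 0.0}))
```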

`IndexError` when a NaN is present in the dataframe

When NaNs are present in the dataframe, an IndexError can occur:

import numpy as np
import pandas as pd
from qdscreen import qd_screen

df = pd.DataFrame([
    ["A", "B"],
    ["A", "B"],
    ["N", np.nan],
])

qd_forest = qd_screen(df)
qd_forest.fit_selector_model(df)  # raises IndexError

yields

  File "C:\_dev\python_ws\_Libs_OpenSource\qdscreen\qdscreen\selector.py", line 112, in <lambda>
    levels_mapping_df = pd.DataFrame(X_ar[:, (parent, child)]).groupby(0).agg(lambda x: x.value_counts().index[0])
  File "C:\Miniconda3\envs\tools_py37\lib\site-packages\pandas\core\indexes\base.py", line 4297, in __getitem__
    return getitem(key)
IndexError: index 0 is out of bounds for axis 0 with size 0
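The likely cause is that pandas `value_counts()` drops NaN by default, so a group containing only NaN yields an empty result and `.index[0]` fails. The pure-Python sketch below shows the same "most frequent value" logic with a guard for that case. The helper names are hypothetical and this is not the library's code:

```python
import math
from collections import Counter


def is_missing(value):
    """True for None and for float NaN (np.nan is a float NaN)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def most_frequent(values, default=None):
    """Most frequent non-missing value, or `default` if there is none."""
    counts = Counter(v for v in values if not is_missing(v))
    if not counts:
        # This is the situation that triggers the IndexError above.
        return default
    return counts.most_common(1)[0][0]


print(most_frequent(["B", "B"]))
print(most_frequent([float("nan")]))
```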

Support non-categorical columns

By default all columns must be categorical in the input, for now.

We could try to support numerical data too.

If this is not possible at all for some reason, we could provide a helper function to convert an input easily.
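One possible shape for such a helper is equal-width binning, sketched here in plain Python. The name `to_categorical`, the `n_bins` parameter and the `bin_<k>` labels are all hypothetical, and other discretization schemes could fit better:

```python
def to_categorical(values, n_bins=3):
    """Map numeric values to string labels using equal-width bins.

    Hypothetical helper: labels are "bin_0" ... "bin_<n_bins - 1>".
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        # A constant column has a single category.
        return ["bin_0"] * len(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        # The maximum value would fall in bin n_bins, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        labels.append(f"bin_{idx}")
    return labels


print(to_categorical([1.0, 2.0, 9.0, 10.0], n_bins=3))
```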

Annoying scipy warning

The recent 0.6.3 release introduces an annoying scipy warning

warnings.warn("The input array could not be properly "

ValueError: object dtype is not supported by sparse matrices

In recent scipy (1.9.1), sparse.dok_matrix no longer works with object-typed arrays:

..\main.py:524: in fit_selector_model
    model.fit(X)
..\selector.py:100: in fit
    self._maps = maps = sparse.dok_matrix((n, n), dtype=object)
..\..\.nox\tests-3-8\lib\site-packages\scipy\sparse\_dok.py:78: in __init__
    self.dtype = getdtype(dtype, default=float)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

dtype = <class 'object'>, a = None, default = <class 'float'>

    def getdtype(dtype, a=None, default=None):
        """Function used to simplify argument processing. If 'dtype' is not
        specified (is None), returns a.dtype; otherwise returns a np.dtype
        object created from the specified dtype argument. If 'dtype' and 'a'
        are both None, construct a data type out of the 'default' parameter.
        Furthermore, 'dtype' must be in 'allowed' set.
        """
        # TODO is this really what we want?
        if dtype is None:
            try:
                newdtype = a.dtype
            except AttributeError as e:
                if default is not None:
                    newdtype = np.dtype(default)
                else:
                    raise TypeError("could not interpret data type") from e
        else:
            newdtype = np.dtype(dtype)
            if newdtype == np.object_:
>               raise ValueError(
                    "object dtype is not supported by sparse matrices"
                )
E               ValueError: object dtype is not supported by sparse matrices

..\..\.nox\tests-3-8\lib\site-packages\scipy\sparse\_sputils.py:113: ValueError
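A possible workaround, assuming the matrix is only used as a container for arbitrary Python objects indexed by (parent, child), is a plain dict keyed by index pairs. The names below are hypothetical and the stored values are illustrative:

```python
# Hypothetical stand-in for `sparse.dok_matrix((n, n), dtype=object)`:
# a dict keyed by (parent, child) index pairs can hold arbitrary objects.
maps = {}

# Store a level mapping for the arc 0 -> 2 (values are illustrative).
maps[(0, 2)] = {"A": "x", "B": "y"}


def get_map(maps, parent, child):
    """Return the mapping stored for an arc, or None when the arc is absent."""
    return maps.get((parent, child))


print(get_map(maps, 0, 2))
print(get_map(maps, 1, 2))
```

Like a DOK matrix, the dict only stores entries that were explicitly set, so memory use stays proportional to the number of arcs.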

Improve speed of qd_screen

We should ensure that this Python package is at least as fast as (and ideally faster than) the R experiments used in the thesis and paper.

For this we should do some profiling with pytest-profiling to understand where the time is spent and what could be optimized.
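Outside of pytest, a quick first look can also be taken with the standard-library profiler. In this sketch `workload` is a stand-in for the real call (for example `qd_screen(df)`) and is purely illustrative:

```python
import cProfile
import io
import pstats


def workload():
    """Stand-in for the call to profile; purely illustrative."""
    return sum(i * i for i in range(10_000))


profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Print the five entries with the highest cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```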

Note for Windows users: a manual command is still needed to view the SVG, see man-group/pytest-plugins#162

Better internal structure for `QDForest`

Currently QDForest can be created either from an adjmat or from a parents array. Is this optimal?
We should probably use one or two scipy.sparse structures instead.

This should be tailored to usage:

  • get_arcs method should be efficient for QDSelectorModel.fit(X)
  • walk_arcs method should be efficient for QDSelectorModel.predict_qd(X)
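To make the trade-off concrete, here is a small sketch built on a parents array (`parents[i]` is the parent index of node `i`, -1 for roots) with a precomputed children index. The class name is hypothetical, the real QDForest API may differ, and a scipy.sparse layout would be an alternative to the dict used here:

```python
class ParentsForest:
    """Hypothetical forest container backed by a parents array."""

    def __init__(self, parents):
        self.parents = list(parents)
        self.children = {}
        self.roots = []
        for child, parent in enumerate(self.parents):
            if parent < 0:
                self.roots.append(child)
            else:
                self.children.setdefault(parent, []).append(child)

    def get_arcs(self):
        """Return all (parent, child) arcs in child-index order."""
        return [(p, c) for c, p in enumerate(self.parents) if p >= 0]

    def walk_arcs(self):
        """Yield arcs from the roots downwards (an arc into a node is
        always yielded before the arcs leaving that node)."""
        stack = list(reversed(self.roots))
        while stack:
            node = stack.pop()
            kids = self.children.get(node, [])
            for child in kids:
                yield (node, child)
            stack.extend(reversed(kids))


forest = ParentsForest([2, -1, 1])  # 1 -> 2 -> 0
print(forest.get_arcs())
print(list(forest.walk_arcs()))
```

Both methods run in time linear in the number of nodes. The difference is the ordering guarantee, which matters when values must be propagated from the roots down.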

Remove `pyitlib` dependency in order to master (and possibly accelerate) entropy computations

Currently we use the following functions in pyitlib:

It would be great to include functions directly in our code so as to not rely on an external dependency "just for 2 functions". Also another interesting thing is that we could compile it, for example as shown here: https://gist.github.com/kudkudak/dabbed1af234c8e3868e
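For illustration, and assuming the two functions in question are a discrete entropy and a conditional entropy (the list above does not say), a dependency-free version could look like this. It is a plain-Python sketch, not a drop-in replacement for pyitlib:

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(y, x):
    """H(Y | X) computed as H(X, Y) - H(X)."""
    return entropy(list(zip(x, y))) - entropy(x)


x = ["a", "a", "b", "b"]
y = ["u", "u", "v", "v"]  # y is fully determined by x

print(entropy(x))
print(conditional_entropy(y, x))
```

A conditional entropy of zero is what strict determinism looks like, and quasi-determinism corresponds to a small positive value.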

This might be related to #29

Fix virtualenv / nox builds

Currently some builds are skipped silently because the interpreter is missing. We should make such cases fail, and then fix them.
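If the builds are driven by nox, one way to get the failure is the `error_on_missing_interpreters` option, shown here as a noxfile fragment. Where exactly it belongs in this project's noxfile is an assumption:

```python
# noxfile.py (fragment)
import nox

# Treat a missing interpreter as an error instead of skipping the session.
nox.options.error_on_missing_interpreters = True
```

The same behaviour is available per run with `nox --error-on-missing-interpreters`.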
