Comments (4)

chr1st1ank commented on May 14, 2024

Talking about modularization I have a question there:

It seems that the Check class (and possibly its subclasses) is hard to unit test because the __call__ and _vectorized_check methods take parent_schema and check_index as mandatory inputs. Is there any reason for this other than error formatting?
If there isn't, I would suggest decoupling the classes from each other.

That's the _vectorized_check function for example:

def _vectorized_check(self, parent_schema, check_index, check_obj):
    """Perform a vectorized check on a series.

    :param parent_schema: The schema object that is being checked and that
        was inherited from the parent class.
    :param check_index: The validator to check the series for
    :param dict check_obj: a dictionary of pd.Series to be used by
        `_check_fn` and `_vectorized_series_check`

    """
    val_result = self.fn(check_obj)
    if isinstance(val_result, pd.Series):
        if not val_result.dtype == PandasDtype.Bool.value:
            raise TypeError(
                "validator %d: %s must return bool or Series of type "
                "bool, found %s" %
                (check_index, self.fn.__name__, val_result.dtype))
        if val_result.all():
            return True
        elif isinstance(check_obj, dict) or \
                check_obj.shape[0] != val_result.shape[0] or \
                (check_obj.index != val_result.index).all():
            raise SchemaError(
                self.generic_error_message(parent_schema, check_index))
        else:
            raise SchemaError(self.vectorized_error_message(
                parent_schema, check_index, check_obj[~val_result]))
    else:
        if val_result:
            return True
        raise SchemaError(
            self.generic_error_message(parent_schema, check_index))

It seems that it essentially either returns True or raises an error indicating where something went wrong. Wouldn't it be more straightforward to return either True or the pandas index values of the failures? Then the failure handling could be done in the calling schema objects.

I see another benefit then: the schema objects could be configured to either throw errors as is or instead return the erroneous rows very easily.
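To make the idea concrete, here is a minimal sketch. All names in it (SimpleCheck, SimpleSchema, CheckFailed, raise_on_failure) are hypothetical and are not part of pandera's actual API. It only illustrates a check that returns True or the failing index values, with the schema deciding whether to raise or to return the offending rows. Treating a scalar False as "every row failed" is my own assumption.

```python
import pandas as pd


class CheckFailed(Exception):
    """Hypothetical error type, standing in for a schema error."""


class SimpleCheck:
    """Hypothetical check that knows nothing about its parent schema."""

    def __init__(self, fn):
        self.fn = fn

    def run(self, series):
        """Return True on success, otherwise the index values that failed."""
        result = self.fn(series)
        if isinstance(result, pd.Series):
            if result.dtype != bool:
                raise TypeError(
                    "%s must return bool or Series of type bool, found %s"
                    % (self.fn.__name__, result.dtype))
            if result.all():
                return True
            # Keep only the labels where the check evaluated to False.
            return result[~result].index
        if result:
            return True
        # Assumption: a scalar False carries no row information,
        # so every row is reported as failing.
        return series.index


class SimpleSchema:
    """Hypothetical schema that owns the failure handling."""

    def __init__(self, checks, raise_on_failure=True):
        self.checks = checks
        self.raise_on_failure = raise_on_failure

    def validate(self, series):
        """Raise on the first failure, or return all rows that failed."""
        failed = pd.Series(False, index=series.index)
        for i, check in enumerate(self.checks):
            outcome = check.run(series)
            if outcome is True:
                continue
            if self.raise_on_failure:
                raise CheckFailed(
                    "check %d failed for index values %s"
                    % (i, list(outcome)))
            failed.loc[outcome] = True
        return series[failed]
```

With this split, a check can be unit tested by calling run on a plain Series, without constructing any schema object.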

Just an idea. I don't want to pull the library apart. It is just that these are internal methods anyway, and maybe this could help decouple the classes, allow for separate testing and improve clarity a bit.

from pandera.

cosmicBboy commented on May 14, 2024

yes, I've been meaning to clean up this part of the internal API for some time now.

As you suggest, I think we can move the error formatting and reporting functionality currently in the Check class into a utility module, which would allow Check.__call__ to not rely on the parent_schema or check_index arguments, but I'd consider this another ticket. You're welcome to create an issue (you can just reference your comment above) and we can tackle that as a separate problem.
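As a rough sketch of what such a utility module could look like: the function names and message formats below are hypothetical, chosen only for illustration, and are not the existing generic_error_message or vectorized_error_message implementations.

```python
def format_generic_error(schema, check_index, check_name):
    """Build a message for a check that failed without row-level detail."""
    return "%s failed validator %d: %s" % (schema, check_index, check_name)


def format_failure_cases(schema, check_index, check_name, failure_cases):
    """Build a message that also lists the failing index/value pairs.

    failure_cases is anything with an items() method yielding
    (index label, value) pairs, e.g. a dict or a pandas Series.
    """
    header = format_generic_error(schema, check_index, check_name)
    lines = ["  %r: %r" % (label, value)
             for label, value in failure_cases.items()]
    return header + "\nfailure cases:\n" + "\n".join(lines)
```

Because these are plain functions of their arguments, the schema could call them when a check reports failures, and the check itself would not need to know which schema it belongs to.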

The scope of this issue is really to get out of single-file-hell :)

chr1st1ank commented on May 14, 2024

I created such an issue, though I don't expect that we will work on it very soon.

Are you actively working on the modularization right now? I am a bit afraid that the branches will drift too far apart if we don't take care. I am currently working on some built-in checks and on the YAML loading topic. I will try to keep them in separate files for now to avoid conflicts.

cosmicBboy commented on May 14, 2024

> Are you actively working on the modularization right now?

Should be pushing something up by the end of this weekend.

> I am currently working on some built-in checks

Let's move this discussion over to #74, as there are a couple of decisions we should discuss re: API design that I'd like your and @mastersplinter's feedback on.
