domanchi commented on May 31, 2024

I spoke with @KevinHock about this today, and we decided to write the conversation down for posterity.

Historical Context

Initially, the ini parser was written to catch secrets that do not need quote marks around them, namely those in config files.

$ cat config.ini
[private]
key=secret

The issue is that there's no easy way to identify whether a file is a config file. File extensions don't work, because config files don't have a typical set of extensions. There's also no special file header (magic number) that marks a file as a config file, so it's not like you could do:

$ file config.ini

Therefore, the only way to really identify whether a file is a config file is to try and parse it, and handle errors appropriately.
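
As a minimal sketch of the "try to parse it, and handle errors" approach, assuming Python's standard configparser (the helper name looks_like_ini is made up for illustration and is not part of detect-secrets):

```python
import configparser


def looks_like_ini(text):
    """Return True if configparser parses `text` and finds at least one section.

    Hypothetical helper for illustration; not detect-secrets code.
    """
    # interpolation=None so that '%' characters in values are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    try:
        parser.read_string(text)
    except configparser.Error:
        # Base class of MissingSectionHeaderError, ParsingError, and friends.
        return False
    # An empty parse tells us nothing, so treat it as "not a config file".
    return bool(parser.sections())
```

Note that the whole input has to be parsed before an answer comes back, which is the cost discussed under "Issue".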

Issue

It seems that this approach runs into two performance hits:

  1. Needing to parse the entire file, with configparser, before having usable results.
  2. Error traceback construction for large files takes a long time (as @killuazhu pointed out)

Possible Solutions

1. Use the first N lines to try and determine whether a file is actually a config file

Credit to @KevinHock for this idea. Essentially, if the following conditions hold true, we may be able to identify whether a file is a config file by reading the first few lines.

a. The first N lines are a representative sample for the entire file, and
b. The first N lines are independently parseable as a config file by themselves.

If we're able to do this, it would address both issues listed above, since we would no longer need to parse the entire file to determine whether it is suitable for ini file parsing.

Our issue is that we don't have a large enough sample set of config files to test out this method.
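
A rough sketch of this idea, assuming conditions (a) and (b) hold. The function name and the default sample size of 50 are arbitrary placeholders, not values taken from detect-secrets:

```python
import configparser
from itertools import islice


def sample_looks_like_ini(lines, sample_size=50):
    """Parse only the first `sample_size` lines and report whether they are ini.

    `lines` is any iterable of newline-terminated strings, e.g. an open file.
    Hypothetical helper for illustration.
    """
    sample = ''.join(islice(lines, sample_size))
    parser = configparser.ConfigParser(interpolation=None)
    try:
        parser.read_string(sample)
    except configparser.Error:
        return False
    return bool(parser.sections())
```

If condition (a) fails, this gives false positives: a file whose first two lines are "[private]" and "key=secret" passes with sample_size=2 even if everything after that is not ini at all. This is why a larger sample set of real config files would be needed to pick N.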

2. Try to use a different library for config file parsing

If we use a different library, we may be able to avoid that error traceback construction, and speed things along. Or similarly, we might be able to perform a special sub-classed invocation of configparser to avoid ParsingError recording every line of output.
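
To illustrate the general idea (stop at the first bad line instead of recording an error for every bad line), here is a hand-rolled pre-check. It is a stand-in, not a configparser subclass and not what the referenced PRs do; the regexes and names are assumptions, and it is stricter than configparser (for example, it does not support multi-line values):

```python
import re

# Simplified ini grammar, assumed for this sketch only.
SECTION = re.compile(r'^\[[^\]]+\]\s*$')
OPTION = re.compile(r'^[^=:\s][^=:]*[=:]')


def first_ini_error(lines):
    """Return the 1-based number of the first non-ini line, or None if all pass."""
    in_section = False
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        # Skip blank lines and comments.
        if not stripped or stripped[0] in '#;':
            continue
        if SECTION.match(stripped):
            in_section = True
        elif not (in_section and OPTION.match(stripped)):
            # Bail out immediately; nothing is accumulated for later lines.
            return lineno
    return None
```

Because it returns on the first failure, the cost for a large non-config file is bounded by how early the first bad line appears.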

3. Rethink how we approach config files

file_type_analyzers = (
    (self._analyze_ini_file(), configparser.Error,),
    (self._analyze_yaml_file, yaml.YAMLError,),
    (super(HighEntropyStringsPlugin, self).analyze, Exception,),
    (self._analyze_ini_file(add_header=True), configparser.Error,),
)

Maybe there's a better way to do this than trying to scan the ini file twice?
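
One possible direction, sketched under the assumption that add_header=True means "prepend a dummy section header to a file that has none": peek at the first meaningful line and decide up front, so the ini parser runs only once. The function name and the '[global]' header are hypothetical:

```python
import configparser


def parse_ini_once(text, default_header='[global]'):
    """Parse `text` once, prepending a section header only if it lacks one.

    Hypothetical helper; '[global]' is an arbitrary placeholder header.
    """
    for line in text.splitlines():
        stripped = line.strip()
        # Look at the first line that is neither blank nor a comment.
        if stripped and stripped[0] not in '#;':
            if not stripped.startswith('['):
                text = default_header + '\n' + text
            break
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)  # may still raise configparser.Error
    return parser
```

This still parses the whole file, so it only removes the duplicate pass; it does not help with the cost of a single pass over a very large file.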

from detect-secrets.

KevinHock commented on May 31, 2024

I ran into this today as well, with a file that was ~250k lines.

KevinHock commented on May 31, 2024

We implemented a short-term solution (number 2 from @domanchi's comment) in the PRs referenced above. They are live in version 0.12.2.

Thanks again for making this issue, I'm gonna keep it open until we improve on it more completely.

domanchi commented on May 31, 2024

Closing this issue, since #187 has evidence that the changes made have been effective for long files.

We can separately track performance for files with long lines.
