Giter VIP home page Giter VIP logo

Comments (16)

dmaluka avatar dmaluka commented on August 16, 2024 2

So I think we should restore the header functionality

Although perhaps we should make header have lower priority than filename, like we now already do with signature (and unlike in the old implementation, where it was non-deterministic whether filename match or header match would take precedence). Perhaps then we won't even need to fix json?

from micro.

taconi avatar taconi commented on August 16, 2024 1

But besides ^{$ is very unspecific so far and doesn't really help the initial problem still persists.

But the problem still persists but it will only happen if the file name matches more than one file type, the problem would be the signature patterns, they must be fixed, improved and added so that false positives can be corrected. The signature should be used in case the file type is unknown too and that false positives are reported by whoever finds it and that the signatures that would avoid these errors are implemented.

from micro.

taconi avatar taconi commented on August 16, 2024 1

Placing an option that activates or deactivates the use of signature would be trying to sweep the problem of detecting the file type from the file content down the table, in the same way as placing these bugs only in files that match with more than one filename.

Maybe warn that signature may present errors and ask the user to report bugs found with this option activated in the option documentation, something like:

* `detectsignature`:  If it is `true`, the `signature` field of the syntax files
   will be used to try to find the `filetype` when it is `unknown` or when
   more than one `filetype` is found (this function may end up generating false positives, if this happens please report the bug so we can improve its functionality).

	default value: `false`

But I still believe that signature should be used to try to find the type of files of type unknown by default, so signature will have a greater frequency of use than it is today, like #3183 (comment) that shows an example of use of the signature of the json files and in which scenarios this could happen unexpectedly.

from micro.

JoeKar avatar JoeKar commented on August 16, 2024 1

If I put the signature with "%YAML", any file that has this content of this type?

No, as described in #3183 the signature check is currently used as last resort to sort out multiple detected filenames (more or less as Vim does)

What is this field for?

See above.

How is it used by the micro to carry out detection?

It will iterate over the amount of detectlimit lines to search for e.g. keywords and or patterns with which the file type can be recognized.

In which scenarios would I have which behaviors?

As written, it's currently used to define the chosen one of multiple filename matches, depending on the given/defined patterns.

In what cases could there be an error?

In case the pattern is insufficient or filename will be changed without considering signature once again.
That's definitely something to keep in mind when we decide to add more filename patterns into already existing definitions and don't create new dedicated definitions.

How to make proper use of it?

As of now, I agree that it most probably was a fault to keep them available (with the same pattern as before with header) for everything else other than the C* headers. Because this caused unintended confusion.
But as already written, there is still some room to use it as a absolute last fallback.

How do I find this information without having to review PR's and issues?

Well, what shall I say/write?
We figured out by discussion, since confusion came up due to the lack of proper documentation.

As already often said:
This project is already larger than 2-3 maintainers can handle (which was the reason why I vote for you too). 😃
At least labeling should be possible for volunteers, but unfortunately it isn't.

Help is appreciated with support to prepare the proper doc update. 😉

from micro.

dmaluka avatar dmaluka commented on August 16, 2024 1

I've read the discussion and I come to a conclusion that we should admit that replacing "headers" with "signatures" was a hasty decision. It is not just a matter of a proper doc update, it is a matter of breakage of useful functionality. header and signature are different things, with quite different purposes, both are useful, we should have added the latter without obstructing the former.

If we have a file named foo with the following content:

#!/bin/sh

echo "hello world"

with v2.0.13 it is recognized and highlighted as a shell script. With the newest master it is not.

To me, this is a regression, and a rather serious one.

The majority of micro's syntax files have no header/signature patterns. In those syntax files that do, those patterns are mainly shebangs (#!/bin/sh, #!/usr/bin/awk etc) and things like <?xml ... ?> or <!DOCTYPE html5>, i.e. things that: 1. can be quite reliably used for detecting the filetype independently of filename, not just in addition to it. 2. should only occur in the first line of the file, not in the first 100 lines or so. In other words, the old header semantics was exactly what was needed for those filetypes, while the new signature semantics makes little sense for them.

So I think we should restore the header functionality, restore its usage instead of signature for most filetypes, and keep using signature probably just for distinguishing C/C++/Objective-C headers.

When it comes to false positives caused by header in the past, were #2894 and #3054 the only known cases of such? If so, in all cases json was guilty, so I think we should fix that simply by removing this "^{$" pattern from json.yaml, since this pattern is indeed too general and probably was causing more harm than good.

from micro.

JoeKar avatar JoeKar commented on August 16, 2024 1

To me, this is a regression, and a rather serious one.

Ok, then we keep it short and header will return. signature is and can be used to solve multiple hits with the same filename and header will step up in the moment no filename hit at all.

When it comes to false positives caused by header in the past, were #2894 and #3054 the only known cases of such?

Yep, grep tells that the json definition is right now the only one with a very generic pattern.

Last but not least the doc of signature can then be extended/optimized too.

from micro.

JoeKar avatar JoeKar commented on August 16, 2024

With #2819 it was changed in that way that it considers the signature-check only in case more than one file type check hit, to concretize the result (C* file types). Previously it has led to to false positive results that the header-check hit (#2894), before the possible later file type check could hit.

It can be discussed to try this guess as well in the moment no file type check hit at all, but this again could lead to false positives in the moment the signature-check identifies some lines by the "wrong" signature of a different syntax definition than expected due to an much to generalized wording within the signature definition. For example it detects a buzz word within a comment of your unspecified/unknown file.

So from that point there is no guarantee to hit exactly, it would be a guess and it could be wrong (with a chance for new issues).
The correct way would be to add an concrete file type in a new or existing syntax definition.

from micro.

taconi avatar taconi commented on August 16, 2024

In my opinion, the fields within the detect key should be used to detect the file type, as well as the header in 2.0.13.

But in view of the behavior of the signature field, I think it would be worth removing the signatures that don't make sense, for example, in what scenario would the signature of json files be used? I can't think of any filenames for this functionality to be used.
The change was made as if the behavior of the header and signature were the same, but some signatures will never be used and could be removed.

from micro.

JoeKar avatar JoeKar commented on August 16, 2024

...for example, in what scenario would the signature of json files be used?

But wasn't the same question legit for the old json header too?

I can't think of any filenames for this functionality to be used.

Currently me too, but this doesn't mean that they don't exist. I "broke" some expected default handling in the past.
But besides ^{$ is very unspecific so far and doesn't really help the initial problem still persists.

The signature check can help as fallback and heavily relies on sufficient rules.

the fields within the detect key should be used to detect the file type

So you'd recommend to consider to provide this as a fallback method and remove insufficient signatures, right?

from micro.

taconi avatar taconi commented on August 16, 2024

...for example, in what scenario would the signature of json files be used?

But wasn't the same question legit for the old json header too?

I don't think so, all files that start with { in the first line will be considered json files, but this is very vague and can generate a lot of false positives.

The point is that the header has been changed as if it were for an equivalent, which is not true.
The json signature will only be used if another syntax file matches the name of the file, currently json files are those that end with .json and no other syntax file matches a similar pattern, so the way it is in file types like json, this field is unusable.

I think the right thing to do in this case would be to remove it from the file, but that's not my decision.

the fields within the detect key should be used to detect the file type

So you'd recommend to consider to provide this as a fallback method and remove insufficient signatures, right?

No, actually the signature is used to define the type of the file in cases where more than one detect.filename match (as far as I understand, correct me if I'm wrong) and the header that was used to define the type of the file by the content of the first line of the file has been removed, this was an approved decision and merged.

I was just thinking that the signature was the same as the header, I hadn't correctly understood the function of signature, I'll close the subject.

from micro.

dustdfg avatar dustdfg commented on August 16, 2024

Maybe user can be asked about filetype if it was deduced from patterns or can be added one more option to control if patterns are used at all

from micro.

JoeKar avatar JoeKar commented on August 16, 2024

But I still believe that signature should be used to try to find the type of files of type unknown by default [...]

I'm open for that. We just need to keep in mind, that this will have downsides.

Just one example:
Any (new) language which has types (structs, classes, etc.) and allows to "open" them at the next line with {, but no file extension is given yet or the language is yet unknown, because it's new.
Do we want to have it highlighted as json? 😉

So proper highlighting should be produced with explicit file extensions or whole file names in the syntax definition. Everything else is some (better) guessing, especially in cases with more or less "wildcard" signatures.

from micro.

dustdfg avatar dustdfg commented on August 16, 2024

Ok. What vim does? You said in one of the PR's vim uses similar approach so do they have the issue?

from micro.

JoeKar avatar JoeKar commented on August 16, 2024

Vim detects it as json, but it does this by this simple fact:
https://github.com/vim/vim/blob/master/runtime/filetype.vim#L1627

So the file name "Pipfile.lock" is explicit tracked to be of a json file type. 😉

For *.h files for example Vim does that dynamic lookup...
https://github.com/vim/vim/blob/master/runtime/filetype.vim#L364 -> https://github.com/vim/vim/blob/master/runtime/autoload/dist/ft.vim#L183
...from which micros signature check was inspired. It doesn't do this by default.

from micro.

taconi avatar taconi commented on August 16, 2024

From what I understand, the FTheader function is only used for *.h files, just as the FTbas function is only used for *.sys\c files. Here signature can (?) run for all files definitions.

I believe that the solution for the Pipfile.lock file is to update the filename regex of json to match this file.

The question is, what does the signature field look like?

You may also provide an optional signature regex that will check a certain
amount of lines of a file to find specific marks. For example:

detect:
    filename: "\\.ya?ml$"
    signature: "%YAML"

If I put the signature with "%YAML", any file that has this content of this type?
We found out not with Pipfile.lock and signature from json, but that's not what the documentation says.

What is this field for?
How is it used by the micro to carry out detection?
In which scenarios would I have which behaviors?
In what cases could there be an error?
How to make proper use of it?

How do I find this information without having to review PR's and issues?

from micro.

dustdfg avatar dustdfg commented on August 16, 2024

So proper highlighting should be produced with explicit file extensions or whole file names in the syntax definition. Everything else is some (better) guessing, especially in cases with more or less "wildcard" signatures.

Off topic sorry... It is a good place for a tuned and lightweight AI similar to googles magika which weights 1MB and it would be really perfect for such a problem.

It is even open source

from micro.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.