surfraw-tools's People

Contributors

dependabot[bot], hoboneer

surfraw-tools's Issues

Support non-POST `<Param>`/`<param:Parameter>` element usage

I found some sites that fail because of this... this shouldn't be too hard to implement though.

Precedence (for GET):

  1. <Param>/<param:Parameter> elements
  2. Inline parameters in the query string

Now let's hope (use assert!) that websites don't use both at the same time...
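As a rough illustration of the precedence above, here is a minimal sketch using only the standard library. The function name `merge_get_params` and the choice to silently drop an inline parameter that a `<Param>` element also defines are assumptions for illustration; an assert could be used there instead, as the note above suggests.

```python
from typing import List, Tuple
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def merge_get_params(template: str, params: List[Tuple[str, str]]) -> str:
    """Merge <Param>-style (name, value) pairs into a GET URL template.

    Hypothetical helper: names coming from <Param> elements win over
    parameters that are already inline in the template's query string.
    """
    parts = urlsplit(template)
    inline = parse_qsl(parts.query, keep_blank_values=True)
    param_names = {name for name, _ in params}
    # Keep only the inline parameters that no <Param> element overrides.
    kept = [(name, value) for name, value in inline if name not in param_names]
    # Leave OpenSearch placeholders such as {searchTerms} or {count?} unescaped.
    query = urlencode(kept + list(params), safe="{}?:")
    return urlunsplit(parts._replace(query=query))
```

For example, merging `[("q", "{searchTerms}")]` into `https://example.com/search?q=old&lang=en` keeps `lang=en` and replaces the inline `q`.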

Add `--list-collapse=` option to `mkelvis`

Right now, --collapse= has special-cased implementations for collapsing list variables and non-list ("scalar" hereafter) variables. A separate --list-collapse= option would allow list-like scalar variables to have their values collapsed individually. Additionally, there would be symmetry with the other --{list-,}{inline,map}= options.

However, this would be a backwards-incompatible change. It's not too bad since I'm the only user of this program (as far as I know) and I only have ~30 elvi to change. This is easily done with sed.

A better question, though: is this even a desirable feature?
I'll have to think about it.

Validate elvis names

Only a subset of valid shell identifiers should be allowed. It should preserve the consistent naming scheme of the variables, so no underscores allowed.

Add script to generate elvi from an OpenSearch URL

Using OpenSearch would automate many of the uses of mkelvis (quick one-offs). It should generate the options that the site supports "for free"--stuff like the count and language OpenSearch parameters.

I guess I need to read the spec. Version 1.1 has been out for a looooong time. I should only support that--at least initially.

Regarding the language OpenSearch parameter, how should it be reconciled with the one in mkelvis? I'd like to maximise code reuse.

It should take a variable number of arguments ("+") with the metavar file_or_url. It should be able to take:

  1. A local file argument of the OpenSearch description file.
  2. A URL, from some domain, to the OpenSearch description file. This is unnecessary. I suspect most uses would be (3) and (1), in order of popularity.
  3. Any URL from some domain. The OpenSearch description file will be found automatically (with opensearch-discover or an internal finder). Since websites may have multiple description files, there needs to be some way to configure it (like opensearch-discover does).

For simplicity, it could be implemented in that order (1, 3).
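The argument handling described above could look roughly like this. It is only a sketch: the program name `opensearch2elvis` is borrowed from later issues, and `is_url` is a hypothetical helper that treats anything with an http or https scheme as a URL and everything else as a local file.

```python
import argparse
from urllib.parse import urlsplit


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="opensearch2elvis")
    # nargs="+" requires at least one argument, matching the "+" above.
    parser.add_argument(
        "file_or_url",
        nargs="+",
        metavar="file_or_url",
        help="local OpenSearch description file, or a URL to search for one",
    )
    return parser


def is_url(arg: str) -> bool:
    """Hypothetical classifier: case (3) if True, case (1) otherwise."""
    return urlsplit(arg).scheme in ("http", "https")
```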

Surfraw's opensearch-discover needs to handle HTTPS URLs though. I need to write a fix for that.

See Hoboneer/surfraw-elvis#4.

Improve documentation

If anything, this would help me design the program and remember how it works.

  • Document common conventions for elvi (with example elvi that do it)
  • Host on Read the Docs?
    1. General usage docs first: README, etc.
    2. Library code
  • State the reason behind the three names (repo, package, program)
  • List features under a "Features" heading in the README
  • Clarify what each option type means, including their typenames.
  • Clarify how each option type interacts with others.
  • Separate mkelvis options into groups in --help.
  • Installation docs (simple in README.md; more detail in INSTALL.md, see #14 (comment)).

Set up tests

The most testing I've done is ad-hoc regression testing by remaking all my elvi and checking the diffs between the new elvi and the elvi I've already installed locally.

I don't know what to test. mypy gets most of the bugs I introduce (hopefully) but the dynamic parts are a concern.

What to test:

  • Known usage errors in the CLI
  • Whether the constraints that the Elvis class expects of the relationships between SurfrawOption objects are followed

Support POST method searches

This might make generating OpenSearch elvi easier.

As for implementation, the elvis.in template could be modified to create a temporary file with a filled-in HTML form which auto-submits on page-load (using JavaScript). It would then output the name of this file, perhaps with a file:// scheme.
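To make the idea concrete, here is a minimal sketch of the auto-submitting form. The real implementation would live in the elvis.in shell template; Python is used here only to show the shape of the generated page. The function name `write_post_form` and the use of hidden inputs are assumptions.

```python
import html
import tempfile
from pathlib import Path
from typing import Dict


def write_post_form(action: str, fields: Dict[str, str]) -> Path:
    """Write an HTML page whose form POSTs `fields` to `action` on load."""
    inputs = "\n".join(
        '<input type="hidden" name="{}" value="{}">'.format(
            html.escape(name), html.escape(value)
        )
        for name, value in fields.items()
    )
    page = (
        "<!DOCTYPE html>\n"
        '<html><body onload="document.forms[0].submit()">\n'
        '<form method="post" action="{}">\n{}\n</form>\n'
        "</body></html>\n"
    ).format(html.escape(action), inputs)
    # delete=False so the browser can still open the file afterwards.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(page)
    return Path(handle.name)
```

Calling `.as_uri()` on the returned path gives the file:// form mentioned above. One caveat of this sketch: a field literally named "submit" would shadow the form's submit() method.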

Allow disabling completion code generation

Something like a --disable-completions or --no-completions option would be good. Don't know why a user would do that though. Perhaps it would make sense to have such an option before my merge request (to Surfraw) is completed and accepted.

Change exit codes

Currently, when the program fails, the exit codes correspond to those in /usr/include/sysexits.h (via Python's os module).

From some cursory research, it appears that this isn't actually standard. It should use common exit codes.

Improve `mkelvis` startup time

On my machine, running PYTHONPROFILEIMPORTTIME=1 mkelvis --help >/dev/null using the mkelvis from commit 38fb4e05a3174145c4682c784ad4b4521911b651 installed using pipx, importing surfraw_tools.mkelvis takes ~54,000 us +- 4000 us. Of that 54,000 us, jinja2 takes 24,000 us +- 1000 us.

This is 67.5% of the total execution time (~0.08 s) to generate the github elvis from commit cebb1380b4fbc9022d9f8a8e55eb51bca60924e6 of my elvi repo. The command used to time it was ./elvi/github.sh-in | grep -vE '^\s*#' | xargs time mkelvis github.

I've looked at the jinja2 source code and there's a lot of stuff that happens during import. The dependency graph between modules is rather complex too (see pydeps output on src/jinja2/). It's hard to know what causes it all. Take note of jinja2._identifier for the one-liner (logical line) with the most impact on import time--it's a pretty big regex.

import time: self [us] | cumulative | imported package
import time:       153 |        153 | zipimport
import time:       923 |        923 | _frozen_importlib_external
import time:        76 |         76 |     _codecs
import time:       501 |        577 |   codecs
import time:       394 |        394 |   encodings.aliases
import time:       813 |       1783 | encodings
import time:       213 |        213 | encodings.utf_8
import time:       129 |        129 | _signal
import time:       301 |        301 | encodings.latin_1
import time:        51 |         51 |     _abc
import time:       261 |        312 |   abc
import time:       297 |        609 | io
import time:        89 |         89 |   _locale
import time:       136 |        224 | _bootlocale
import time:        75 |         75 |       _stat
import time:       246 |        320 |     stat
import time:       159 |        159 |       genericpath
import time:       302 |        461 |     posixpath
import time:      1089 |       1089 |     _collections_abc
import time:       591 |       2459 |   os
import time:       198 |        198 |   _sitebuiltins
import time:       190 |        190 |     apport_python_hook
import time:       135 |        324 |   sitecustomize
import time:       631 |       3611 | site
import time:       311 |        311 |     types
import time:        93 |         93 |     _collections
import time:       837 |       1240 |   enum
import time:       137 |        137 |     _sre
import time:       308 |        308 |       sre_constants
import time:       388 |        695 |     sre_parse
import time:       312 |       1143 |   sre_compile
import time:        82 |         82 |     _functools
import time:        98 |         98 |         _operator
import time:       318 |        415 |       operator
import time:       156 |        156 |       keyword
import time:        53 |         53 |         _heapq
import time:       181 |        233 |       heapq
import time:       142 |        142 |       itertools
import time:       206 |        206 |       reprlib
import time:       928 |       2078 |     collections
import time:      1063 |       3222 |   functools
import time:       177 |        177 |   copyreg
import time:       695 |       6475 | re
import time:       167 |        167 |     surfraw_tools._package
import time:       192 |        359 |   surfraw_tools
import time:       168 |        168 |   __future__
import time:       955 |        955 |       locale
import time:      1089 |       2043 |     gettext
import time:       994 |       3037 |   argparse
import time:       357 |        357 |     warnings
import time:       162 |        162 |       fnmatch
import time:        97 |         97 |       errno
import time:       106 |        106 |       zlib
import time:       240 |        240 |         _compression
import time:       119 |        119 |           time
import time:       186 |        186 |                 token
import time:      1004 |       1190 |               tokenize
import time:       168 |       1357 |             linecache
import time:       402 |       1758 |           traceback
import time:       240 |        240 |           _weakrefset
import time:       802 |       2917 |         threading
import time:       296 |        296 |         _bz2
import time:       297 |       3748 |       bz2
import time:       307 |        307 |         _lzma
import time:       271 |        578 |       lzma
import time:        56 |         56 |       pwd
import time:        46 |         46 |       grp
import time:       592 |       5381 |     shutil
import time:        67 |         67 |       math
import time:      2446 |       2446 |         _hashlib
import time:        73 |         73 |         _blake2
import time:        95 |         95 |         _sha3
import time:       336 |       2948 |       hashlib
import time:        44 |         44 |         _bisect
import time:       204 |        247 |       bisect
import time:        43 |         43 |       _random
import time:       555 |       3858 |     random
import time:       506 |        506 |     weakref
import time:       453 |      10553 |   tempfile
import time:       186 |        186 |     collections.abc
import time:       669 |        669 |     contextlib
import time:      1708 |       2562 |   typing
import time:       449 |        449 |       surfraw_tools.validation
import time:       478 |        926 |     surfraw_tools.options
import time:      2289 |       3215 |   surfraw_tools.cliopts
import time:       211 |        211 |     importlib
import time:       144 |        144 |         importlib.machinery
import time:       525 |        669 |       importlib.abc
import time:       115 |        115 |           nt
import time:        85 |         85 |           nt
import time:        82 |         82 |           nt
import time:       259 |        540 |         ntpath
import time:       132 |        132 |           urllib
import time:      1124 |       1256 |         urllib.parse
import time:       873 |       2667 |       pathlib
import time:       463 |       3798 |     importlib.resources
import time:        56 |         56 |           _string
import time:       897 |        953 |         string
import time:       198 |        198 |         markupsafe._compat
import time:       173 |        173 |         markupsafe._speedups
import time:       682 |       2004 |       markupsafe
import time:        80 |         80 |               _struct
import time:       155 |        235 |             struct
import time:       268 |        268 |             _compat_pickle
import time:        84 |         84 |                 org
import time:        26 |        110 |               org.python
import time:        20 |        130 |             org.python.core
import time:       127 |        127 |             _pickle
import time:       973 |       1730 |           pickle
import time:       193 |       1923 |         jinja2._compat
import time:       199 |        199 |                 _json
import time:       536 |        734 |               json.scanner
import time:       509 |       1242 |             json.decoder
import time:       535 |        535 |             json.encoder
import time:       228 |       2004 |           json
import time:      1162 |       3166 |         jinja2.utils
import time:       384 |       5471 |       jinja2.bccache
import time:      1550 |       1550 |         jinja2.nodes
import time:       380 |        380 |           jinja2.exceptions
import time:       171 |        171 |             jinja2.visitor
import time:       271 |        442 |           jinja2.idtracking
import time:       160 |        160 |           jinja2.optimizer
import time:       877 |       1858 |         jinja2.compiler
import time:       593 |        593 |             jinja2.runtime
import time:       740 |       1332 |           jinja2.filters
import time:       987 |        987 |                 numbers
import time:       972 |       1959 |               _decimal
import time:       199 |       2157 |             decimal
import time:       407 |       2564 |           jinja2.tests
import time:       203 |       4098 |         jinja2.defaults
import time:        86 |         86 |             _ast
import time:       282 |        367 |           ast
import time:        56 |         56 |           unicodedata
import time:      1648 |       1648 |           jinja2._identifier
import time:      2023 |       4093 |         jinja2.lexer
import time:       353 |        353 |         jinja2.parser
import time:       818 |      12768 |       jinja2.environment
import time:       352 |        352 |       jinja2.loaders
import time:       358 |      20951 |     jinja2
import time:      1327 |      26285 |   surfraw_tools.common
import time:       416 |      46591 | surfraw_tools.mkelvis
import time:      1351 |       1351 | textwrap
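A listing like the one above is easier to read once sorted. The following is a small sketch that parses the import-time lines and ranks modules by self time; the function names are made up for illustration, and the regex assumes the column layout shown above.

```python
import re
from typing import List, Tuple

# Matches data rows such as "import time:   153 |   153 | zipimport".
# The header row is skipped because its columns are not numeric.
ROW = re.compile(r"^import time:\s+(\d+) \|\s+(\d+) \| (\s*)(\S+)$")

Row = Tuple[str, int, int, int]  # (module, self_us, cumulative_us, depth)


def parse_importtime(text: str) -> List[Row]:
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            self_us, cumulative_us, indent, name = match.groups()
            # Nested imports are indented by two spaces per level.
            rows.append((name, int(self_us), int(cumulative_us), len(indent) // 2))
    return rows


def slowest(rows: List[Row], count: int = 5) -> List[Row]:
    """Return the `count` rows with the largest self time."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:count]
```

Sorting by self time rather than cumulative time points at the modules that are expensive on their own, instead of those that merely import many others.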

Resolve flags in `add_flag` method

The current flag resolution code is messy.

E.g., flag_target.add_flag(new_flag) should raise some exception if the flag is invalid for some reason. Should it do a sanity check to see if the flag actually targets the option it's being added to?

The resolve_flags method should be removed since it's useless at that point.
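A minimal sketch of what validating inside add_flag might look like. The classes below are simplified stand-ins, not the actual surfraw_tools classes; `FlagError`, the field names, and the two checks are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


class FlagError(ValueError):
    """Raised when a flag cannot be attached to an option (hypothetical)."""


@dataclass
class Flag:
    name: str
    target: str  # name of the option this flag sets
    value: str


@dataclass
class EnumOption:
    name: str
    values: List[str]
    flags: List[Flag] = field(default_factory=list)

    def add_flag(self, flag: Flag) -> None:
        # Sanity check: does the flag actually target this option?
        if flag.target != self.name:
            raise FlagError(
                "flag {!r} targets {!r}, not {!r}".format(
                    flag.name, flag.target, self.name
                )
            )
        # Validity check: the flag must set one of the allowed values.
        if flag.value not in self.values:
            raise FlagError(
                "flag {!r} sets invalid value {!r}".format(flag.name, flag.value)
            )
        self.flags.append(flag)
```

With both checks done at attachment time, a separate resolution pass has nothing left to do.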

Use internal OpenSearch description finder

This would remove any dependence on Surfraw for creating elvi--especially since the fix to support HTTPS in opensearch-discover hasn't been released yet.

Here's to making opensearch.py longer...

Add arguments for `--use-{language,results}-option`

So, the --use-{language,results}-option options would take two arguments: VALUES and DEFAULT, in that order. This is reversed (compared to the usual) because the default for DEFAULT is whatever is in the corresponding Surfraw global config variable (SURFRAW_lang and SURFRAW_results respectively).

This would avoid having to define very similar options using --enum=, with all of the boilerplate involved (metavar, description, etc.), just to have a value-restricted version of that option.

Reduce logic in templates

The import time for jinja annoys me. Something like Mustache-compliant template engines (chevron?) would work, but they don't have any logic.

I'll have to do a lot of pre-processing in that case.

Polish library code

I'd like to make the underlying elvis and option objects nicer to use--as a library. I think this would make #18 easier to do. It's a bit of a pain right now.

Maybe there could be an Elvis class to hold it all together?

Now that these are done, it would be good to make it usable outside of this project. The main motivation, however, is to have a cleaner, shorter codebase.

Use logging

It would be nice to have repeatable --verbose and --quiet options. The print(..., file=sys.stderr) calls are getting unwieldy...
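One common way to get repeatable options is argparse's "count" action, mapped onto logging levels. This is a sketch only: the short flags -v and -q and the WARNING baseline are assumptions.

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mkelvis")
    # action="count" increments the value each time the option is given.
    parser.add_argument("-v", "--verbose", action="count", default=0)
    parser.add_argument("-q", "--quiet", action="count", default=0)
    return parser


def level_from_counts(verbose: int, quiet: int) -> int:
    """Each --verbose lowers the level by one step; each --quiet raises it."""
    level = logging.WARNING + 10 * (quiet - verbose)
    # Clamp to the range of the standard named levels.
    return max(logging.DEBUG, min(logging.CRITICAL, level))
```

The result can be passed to `logging.basicConfig(level=...)` once at startup, after which the print(..., file=sys.stderr) calls can become logger calls.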

Remove dependency on Jinja2

On commit 9bac109b2f8ba6b286071db0e085b4d77f7301fa of my elvi repo, and commit 9f265df3397e9ae00a7a170130b9501e1c976a47 of this repo, the serial compile time of all the elvi differed:

On Jinja2 v2.11.3: ~4.100s
On Jinja2 v3.0.2: ~7.900s

The import time definitely increased.

Some parts of the elvis.in template are already very complex, so moving it to code might make it more manageable too.

Also see #42

opensearch2elvis: Support `<Param />` elements

These should be children of <Url> elements.

From what I've seen in the wild, this element is used without the appropriate namespace--since it's an extension to OpenSearch (the parameter one). This was part of the main namespace before draft 3, but it has since moved (https://github.com/dewitt/opensearch/blob/master/mediawiki/Specifications/OpenSearch/Changelog.wiki)

Examples:

https://duckduckgo.com/opensearch_lite_v2.xml:

<?xml version="1.0" encoding="utf-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/"> 
  <ShortName>DuckDuckGo Lite</ShortName> 
  <Description>Search DuckDuckGo (Lite)</Description> 
  <InputEncoding>UTF-8</InputEncoding> 
  <LongName>DuckDuckGo Search (Lite, non-JS)</LongName> 
  <Image height="16" width="16"></Image>
  <Url type="text/html" method="post" template="https://lite.duckduckgo.com/lite/">
    <Param name="q" value="{searchTerms}"/>  
  </Url>
  <Url type="application/x-suggestions+json" template="https://duckduckgo.com/ac/?q={searchTerms}&amp;type=list"/>
</OpenSearchDescription> 

https://duckduckgo.com/opensearch_html_v2.xml:

<?xml version="1.0" encoding="utf-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/"> 
  <ShortName>DuckDuckGo HTML</ShortName> 
  <Description>Search DuckDuckGo (HTML)</Description> 
  <InputEncoding>UTF-8</InputEncoding> 
  <LongName>DuckDuckGo Search (HTML, non-JS)</LongName> 
  <Image height="16" width="16"></Image>
  <Url type="text/html" method="post" template="https://html.duckduckgo.com/html/">
    <Param name="q" value="{searchTerms}"/>  
  </Url>
  <Url type="application/x-suggestions+json" template="https://duckduckgo.com/ac/?q={searchTerms}&amp;type=list"/>
</OpenSearchDescription> 
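Since these elements show up both with and without the extension namespace, one option is to match on the local tag name only. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml to stay self-contained; `url_params` and `local_name` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"

UrlInfo = Tuple[str, str, List[Tuple[str, str]]]  # (method, template, params)


def local_name(tag: str) -> str:
    """Strip a "{namespace}" prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def url_params(description: bytes) -> List[UrlInfo]:
    root = ET.fromstring(description)
    found = []
    for url in root.iter("{%s}Url" % OPENSEARCH_NS):
        # Accept <Param> and <param:Parameter>, whatever namespace they carry.
        params = [
            (child.get("name", ""), child.get("value", ""))
            for child in url
            if local_name(child.tag) in ("Param", "Parameter")
        ]
        method = url.get("method", "get").lower()
        found.append((method, url.get("template", ""), params))
    return found
```

Run against the DuckDuckGo Lite description above, this would yield one POST URL carrying the `q` parameter and one GET suggestions URL with no parameters.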

Refactor surfraw option parsing code

Each SurfrawOption subclass should specify how an instance is created from:

  1. the CLI;
  2. the input file (see #49); and
  3. future ways of specifying elvi.

As for (1), each class could specify a from_cli(str) -> <Current Class> class method. However, how would the flag and alias options be done? They need to be resolved, so maybe their parsing could be deferred until all the non-resolving options ("Auxiliary Options" in the reference) have been parsed. That is, upon encountering each auxiliary option for the first time, the raw string should just be stored for later (grouped into flags and aliases, of course). A similar approach should be taken for (2) and (3).

The supporting code in Option that facilitates parsing from the CLI could then become either utility functions called by each SurfrawOption.from_cli method or decorators to be applied to each such method. Alternatively, SurfrawOption could take on that role.
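The two-pass idea could be sketched as follows. These classes are simplified stand-ins for the real SurfrawOption subclasses, and the NAME:DEFAULT:VALUES and NAME:TARGET:VALUE argument shapes are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class EnumOption:
    name: str
    default: str
    values: List[str]

    @classmethod
    def from_cli(cls, arg: str) -> "EnumOption":
        # Assumed shape: NAME:DEFAULT:VAL1,VAL2,...
        name, default, values = arg.split(":")
        return cls(name, default, values.split(","))


@dataclass
class FlagOption:
    name: str
    target: EnumOption
    value: str

    @classmethod
    def from_cli(cls, arg: str, options: Dict[str, EnumOption]) -> "FlagOption":
        # Assumed shape: NAME:TARGET:VALUE
        name, target, value = arg.split(":")
        if target not in options:
            raise LookupError(
                "flag {!r} targets unknown option {!r}".format(name, target)
            )
        return cls(name, options[target], value)


def parse_cli(
    raw_enums: List[str], raw_flags: List[str]
) -> Tuple[Dict[str, EnumOption], List[FlagOption]]:
    # First pass: options that need no resolution.
    options = {opt.name: opt for opt in map(EnumOption.from_cli, raw_enums)}
    # Second pass: the stored raw flag strings, resolved against the options.
    flags = [FlagOption.from_cli(raw, options) for raw in raw_flags]
    return options, flags
```

The order of the raw strings on the command line no longer matters, because nothing is resolved until every target exists.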

Allow ':' in option arguments

This could either be done by:

  1. Allow special characters (e.g., :) in mkelvis option arguments to be escaped; or
  2. Allow changing the argument delimiter with an option like --delimiter= (short: -d).

(1) would result in a more complex Option.parse_args. (2) would probably need a global variable to keep track of what delimiter is active.

Perhaps there could be a different command-line program that sidesteps this issue? Something like a file2elvis that takes a file describing the elvis in some format (e.g., JSON). Option.parse_args would have to accept any iterable as an argument, replacing raw_arg, and there would be a new method, say from_args, which accepts a list of arguments rather than a str (cf. from_arg). Is that name too similar? Option.from_arg could simply be a wrapper that splits arg and delegates to Option.from_args, like so: return cls.from_args(arg.split(":")).
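Approach (1) combined with the from_arg/from_args split might look like this. The `Option` class here is a bare stand-in for the real one, and backslash as the escape character is an assumption.

```python
from typing import Iterable, List


def split_escaped(arg: str, delimiter: str = ":", escape: str = "\\") -> List[str]:
    """Split `arg` on `delimiter`, honouring `escape` before any character."""
    parts = []
    current = []
    chars = iter(arg)
    for char in chars:
        if char == escape:
            # Take the next character literally; a trailing escape stays as-is.
            current.append(next(chars, escape))
        elif char == delimiter:
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
    parts.append("".join(current))
    return parts


class Option:
    def __init__(self, args: Iterable[str]) -> None:
        self.args = list(args)

    @classmethod
    def from_args(cls, args: Iterable[str]) -> "Option":
        return cls(args)

    @classmethod
    def from_arg(cls, arg: str) -> "Option":
        # Thin wrapper: split the raw string, then delegate.
        return cls.from_args(split_escaped(arg))
```

A file-based front end could then call from_args directly and never needs to care about delimiters.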

Support `bash` and other shells

Since my /bin/sh is dash, I haven't run into any issues with the $_ variable. But it would be good to support this since lots of my elvi use it, and end up having an empty query string and variables (if used in --collapse=) when run in bash.

It would be nice if other shells could be tested too.

Either document this (and tell users to use a shell that doesn't do it) or:

  1. find out how to turn off the special behaviour of the $_ variable;
  2. allow the user to set what the "it variable" is named; or
  3. change the name of this variable--to what? $it?

I'm leaning towards (2). I'd like to remain backwards-compatible... I don't want to change all my elvi.

Running this in the root of my elvi repo (including my elvi using the new --no-append-mappings option) gives the number of elvi using the variable:

$ grep '$_' -- src/*.in src/*.sh-in src/*.mediawiki-in | cut -f1 -d: | sort | uniq | wc -l
11

Not bad. That makes (3) more palatable to me.

Use logging everywhere

I think it'd be best to just use logging.getLogger(PROGRAM_NAME) everywhere. I don't want to have to pass the logger wherever it's needed.

However, how would the library code access the program name?

Determine minimum lxml version

Document this too. Make sure to mention that these are the versions that were (roughly) tested--it might work with earlier versions.

Allow "collapse" of search query

Something like --collapse= specialised for the search query would be nice. It would also generalise the --{list-,}inline= options.

I'm thinking it would be --query-collapse=. This would not have the first argument specify the variable it targets (cf. --collapse=).

Update licence headers

Remember to add the year of release to the header only if that file has changed since then.

Update documentation for new features

I don't want to make my README too long. Maybe it could be split into more files? So, three files as of now: the main README, and advanced docs for mkelvis and opensearch2elvis respectively.

Or not. It might be simpler this way.

Stuff to document:

  • opensearch2elvis:
    • only OpenSearch parameters supported
    • no distinction between optional and required parameters (excl. searchTerms--it must be required)
    • URL scheme must be HTTP because of limitations in opensearch-discover (pending merge and release in Surfraw) (This has now been implemented internally).
    • OpenSearch parameters extension not minimally supported, but any custom parameters are --anything options
    • only version 1.1 (draft 6) of OpenSearch supported, with support for <Param> elements to maximise compatibility with existing websites
  • The fact that the --verbose and --quiet options exist now (maybe only in the CHANGELOG [which is still TODO...])
  • Plain URLs are allowed for mkelvis (e.g., https://example.com instead of having to strip the scheme).
  • The tab-completions code is no longer WIP. It was merged!
  • Update jinja version in "Installation" page

Bug for search_url variable

$ mkelvis askubuntu 'www.askubuntu.com' 'https://askubuntu.com/search?q='

$ sr -p askubuntu firefox
https://https://askubuntu.com/search?q=firefox

It seems like an extra https:// gets prepended to the search_url variable when the script is generated.
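I have not checked how mkelvis builds the variable, so the following is only a guess at the kind of fix needed: split off any scheme the user supplied before a scheme is added back. The helper name `split_scheme` and the https default are assumptions.

```python
import re
from typing import Tuple

# A URL scheme followed by "://", e.g. "https://".
SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def split_scheme(url: str, default: str = "https") -> Tuple[str, str]:
    """Return (scheme, rest), using `default` when `url` has no scheme."""
    match = SCHEME.match(url)
    if match:
        return match.group(0)[:-3].lower(), url[match.end():]
    return default, url
```

With this, both 'askubuntu.com/search?q=' and 'https://askubuntu.com/search?q=' end up with a single scheme once it is joined back with "://".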

Add better debugging options for `opensearch2elvis`

A way to dump the output of intermediate steps (the request response, the downloaded HTML page, the downloaded OpenSearch description) would be good.

Exposing the internal intermediate results would also allow my surfraw-elvis repo to not depend on opensearch-discover for the OpenSearch elvi.

Allow `$mappings` to not be appended to search URL

This allows full control of where the mappings are; which, when used with --no-append-args (in mkelvis), allows full control over the search URL without having to manually map things.

I guess I'll have to finally document the available shell variables at each step. Also document that if a query parameter is defined, then it should be the last one in $mappings, and still opened.

For my own elvi, this would allow me to implement uoalib and the -lucky= option for ddg. They're in the query-collapse-elvi branch of my elvi repo. I'd like to get this done before release.

Reconsider use of `tempfile` module

In the interest of improving the startup time of mkelvis (#2), is it even worth using tempfile? The data that mkelvis handles isn't particularly sensitive, so it could probably just construct a tempfile name from the elvis name--something like .${elvisname}.tmp or .${elvisname}.${progname}.tmp. According to that issue, the tempfile import takes ~10,000 us, which is 12.5% of the total runtime (80,000 us).

A potential problem of this approach is in the case where multiple mkelvis processes run at the same time, with common elvis names. I don't suspect that this will be common (or useful!).
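A sketch of the tempfile-free approach, assuming the temporary file exists so that the elvis can be written fully before being moved into place. The function name `write_elvis` and its parameters are hypothetical.

```python
import os
from pathlib import Path


def write_elvis(
    elvisname: str,
    content: str,
    outdir: Path = Path("."),
    progname: str = "mkelvis",
) -> Path:
    """Write `content` to a predictable temp name, then move it into place."""
    tmp = outdir / ".{}.{}.tmp".format(elvisname, progname)
    final = outdir / elvisname
    tmp.write_text(content, encoding="utf-8")
    # os.replace overwrites an existing destination in one step.
    os.replace(str(tmp), str(final))
    return final
```

As noted above, two processes generating the same elvis name in the same directory would share the temporary file name, so this trades that safety for a cheaper import.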

Additionally, the program already makes the assumption that it is not being run from an archive (since decompressing on every invocation is too slow), evident in the use of jinja2.ModuleLoader with concrete filesystem paths and jinja2.FileSystemLoader. A more lightweight solution would be to use __file__, which would be guaranteed to exist in our case. Using importlib.resources results in tempfile also being imported.

Support XHTML pages

I'm not sure how many sites actually have XHTML pages, but opensearch2elvis doesn't handle it.

I'll fix this if actually needed.

Add completion support for query arguments

Options like --suggestions=TYPE:QUERY_URL would be good. This would need embedded shell snippets to have the correct output--like https://example.com/suggestions?q=$it&n=$SURFRAW_example_results. Maybe there could be mappings (MappingOption) for these URLs?

It's a bit annoying to do, but I intend to use this feature for my OpenSearch elvis generator.

This needs changes in my completions code for Surfraw.

Add more option types

As the need arises, it would be good to have other option types.

Some ideas:

  • Natural numbers (nat): Maybe numeric range expressions would be nice here too.
  • ...

In general, I think supporting the search syntax of most websites would be good. There's a lot in the github and stackexchange search engines (from my elvi repo) that are done as --anything= options. It could be better.

As a side note, how would these new option types' arguments be tab-completed?

Validate elvis names

As part of the tool's goal to make elvis creation more standard, ensure elvis names are at least valid shell variable names, and maybe conform to a certain syntax. I suggest just lowercase alnum chars. Result: more uniform variable names in the form SURFRAW_${ELVIS}_${VAR}. The user can just do -o NAME to give it special characters, anyway.

Possible regex: ^[a-z0-9]+$.

Not many people use the tool, so doing this isn't too breaking.
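A minimal sketch of that check. The function name and error message are made up; the pattern is the one proposed above, applied with fullmatch instead of the ^...$ anchors because $ would also match just before a trailing newline.

```python
import re

ELVIS_NAME = re.compile(r"[a-z0-9]+")


def validate_elvis_name(name: str) -> str:
    """Return `name` unchanged if it is lowercase alphanumeric, else raise."""
    if not ELVIS_NAME.fullmatch(name):
        raise ValueError(
            "invalid elvis name {!r}: only a-z and 0-9 are allowed".format(name)
        )
    return name
```

The function could also serve as an argparse `type=` callable if ValueError is the desired failure mode.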

Allow multiple uses of a single parameter in one URL

I'd also like to remove the span attribute from OpenSearchParameter. It's not used anywhere, not even in OpenSearchURL.get_surfraw_template(...), which was its main purpose (IIRC). The element order in OpenSearchURL.params should be good enough.

I'd also have to check if a param has already been seen when building Surfraw* objects.

Fix linter warnings

In particular, flake8 is complaining about missing docstrings for public modules, classes, methods, and functions.

Fix these before release.
