leaverou / bafr

Write batch find & replace scripts that transform files with a simple human-readable syntax


bafr's Introduction

brep (Batch REPlace)

For versions < 0.0.8 see bafr

Ever written some complex find & replace operations in a text editor, sed or whatever, and wished you could save them somewhere and re-run them in the future? This is exactly what brep (Batch Replace) does!

You write a brep script (see syntax below), and then you apply it from the command-line like:

brep myscript.brep.toml src/**/*.html

This will apply the script myscript.brep.toml to all HTML files in the src folder and its subfolders. You don’t need to specify the file paths every time if they don’t change: you can include them in your script as defaults (and still override them if needed).

Installation
Syntax
Syntax reference
CLI
JS API
Future plans

Installation

You will need to have Node.js installed. Then, to install brep, run:

npm install -g brep

Syntax

There are three main syntaxes, each more appropriate for different use cases:

TOML when your strings are multiline or have weird characters and you want their boundaries to be very explicit
YAML when you want a more concise syntax for simple replacements
JSON is also supported. It’s not recommended for writing by hand but can be convenient as the output from other tools.

Of all three, YAML is the most concise and human-readable, but can behave unpredictably or be confusing in edge cases (special symbols, multiline strings). TOML supports a very precise syntax, and multiline strings, but can look rather awkward. Lastly, JSON is very fragile and verbose, but has the best compatibility with other tools.

	TOML	YAML	JSON
Readability	★★★☆☆	★★★★☆	★★☆☆☆
Conciseness	★★☆☆☆	★★★★★	★★☆☆☆
Robustness¹	★★★☆☆	★★☆☆☆	★☆☆☆☆
Compatibility	★☆☆☆☆	★★☆☆☆	★★★★★
Supports comments?	✅	✅	🚫
Multiline strings?	✅	✅ but very complex	🚫²

The docs below will show TOML and YAML, and it’s up to you what you prefer.

Replacing text with different text

The most basic brep script is a single replacement consisting of a single static from declaration and a single to replacement.

As an example, here is how you can replace all instances of <br> with a line break character:

from = "<br>"
to = "\n"

from: <br>
to: "\n"

{
	"from": "<br>",
	"to": "\n"
}

Note that the YAML syntax allows you to not quote strings in many cases, which can be quite convenient but can also create errors if you’re not careful.

Multiline strings

This also works, and shows how you can do multiline strings:

from = "<br>"
to = """
"""

from: <br>
to: >+

I do not recommend using YAML for multiline strings but you can read more about the many different ways to do it here.

Regular expressions

Replacing fixed strings with other fixed strings is useful, but not very powerful. The real power of brep comes from its ability to use regular expressions. For example, here is how you’d convert every <blink> element to a <span class=blink>:

regexp = true
from = '<blink>([\S\s]+?)</blink>' # literal (single-quoted) string, so backslashes need no escaping
to = "<span class=blink>$1</span>" # $1 refers to the content of the tag

regexp: true
from: <blink>([\S\s]+?)</blink>
to: <span class=blink>$1</span> # $1 refers to the content of the tag

{
	"regexp": true,
	"from": "<blink>([\\S\\s]+?)</blink>",
	"to": "<span class=blink>$1</span>"
}

brep uses the JS dialect for regular expressions (cheatsheet) with the following flags:

g (global): Replace all occurrences, not just the first one
m (multiline): ^ and $ match the start and end of lines, not of the whole file.
s (dotAll): . matches any character, including newlines. Use [^\r\n] to match any character except newlines.
v (unicodeSets): More reasonable Unicode handling, and named Unicode classes as \p{…} (e.g. \p{Letter}).
The i flag (case-insensitive) is not on by default, but can be enabled with the ignore_case option.
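
To see what these flags mean in practice, here is a minimal sketch in plain JavaScript. It does not use brep itself; it only builds a native RegExp with the same flags. The fallback from v to u is an addition for older runtimes and is not something brep is documented to do.

```javascript
// Use the v flag where the runtime supports it (e.g. Node.js 20+), else fall back to u.
const unicodeFlag = (() => {
	try {
		new RegExp("", "v");
		return "v";
	} catch {
		return "u";
	}
})();
const flags = "gms" + unicodeFlag;

// g: every occurrence is replaced, and [\S\s] happily spans the line break
const html = "<blink>Hello\nworld</blink> and <blink>again</blink>";
const blink = new RegExp("<blink>([\\S\\s]+?)</blink>", flags);
const converted = html.replace(blink, "<span class=blink>$1</span>");

// m: ^ matches at the start of every line, not just the start of the text
const quoted = "first\nsecond".replace(new RegExp("^", flags), "> ");

// s: . also matches a newline
const dotMatchesNewline = new RegExp("a.b", flags).test("a\nb");

// Named Unicode classes such as \p{Letter}
const words = "naïve café".match(new RegExp("\\p{Letter}+", flags));

console.log(converted, quoted, dotMatchesNewline, words);
```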

Multiple find & replace operations

So far our script has specified only a single find & replace operation. A single brep script can also specify multiple find & replace operations, executed in order, with each operating on the result of the previous one. We will refer to each of these as a replacement in the rest of the docs.

Multiple replacements in TOML

To specify multiple find & replace operations, you simply add [[ replace ]] sections:

[[ replace ]]
from = "<blink>"
to = "<span class=blink>"

[[ replace ]]
from = "</blink>"
to = "</span>"

You can also do it like this:

replace = [
	{ from = "<blink>", to = "<span class=blink>" },
	{ from = "</blink>", to = "</span>" },
]

Multiple replacements in YAML

To specify multiple declarations, you need to enclose them in { }:

replace:
- { from: <blink>, to: '<span class="blink">' }
- { from: </blink>, to: "</span>" }

Multiple replacements in JSON

"replace": [
	{"from": "<blink>", "to": "<span class=blink>"},
	{"from": "</blink>", "to": "</span>"}
]
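
The "executed in order, each operating on the result of the previous one" behavior can be sketched in plain JavaScript. applyInOrder is a hypothetical helper written for this illustration, not part of brep, and it only covers plain-string replacements.

```javascript
// Hypothetical helper for illustration; not part of brep's API.
// Each replacement receives the output of the previous one.
function applyInOrder(text, replacements) {
	return replacements.reduce(
		(result, { from, to }) => result.replaceAll(from, to),
		text
	);
}

const blinkToSpan = [
	{ from: "<blink>", to: "<span class=blink>" },
	{ from: "</blink>", to: "</span>" },
];
const converted = applyInOrder("<blink>Hi</blink>", blinkToSpan);

// Order matters: the second replacement also sees the "b" produced by the first
const chained = applyInOrder("a", [
	{ from: "a", to: "b" },
	{ from: "b", to: "c" },
]);

console.log(converted, chained);
```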

Nested replacements

In some cases it’s more convenient to match a larger part of the text and then do more specific replacements inside just those matches. In a way, that is similar to a text editor’s "find in selection" feature, except on steroids.

# Match sequences of single-line JS comments
regexp: true
from: "(^//[^\n\r]*$)+"
to: "/* $& */" # Convert to block comments
replace:
# Strip comment character
- { regexp: true, from: "^//", to: "" }
- { regexp: true, from: '^/\* //', to: "/*" }

If you specify a to, it will be applied before the child replacements.
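
As a rough mental model of this ordering (the parent's to first, then the children, all scoped to each match), here is a sketch in plain JavaScript. nestedReplace is a hypothetical function, it handles only the $& placeholder, and it is not how brep is actually implemented.

```javascript
// Hypothetical helper for illustration; not brep's actual implementation.
// Only the $& placeholder is handled in `to`.
function nestedReplace(text, { from, to, replace = [] }) {
	return text.replace(from, (match) => {
		// 1. Apply the parent's `to` (or keep the match unchanged if there is none)
		let result = to === undefined ? match : to.replaceAll("$&", () => match);
		// 2. Run the child replacements on that result only
		for (const child of replace) {
			result = result.replace(child.from, child.to);
		}
		return result;
	});
}

const source = "// one\ncode();\n// two";
const converted = nestedReplace(source, {
	from: /^\/\/[^\n\r]*$/gm, // a single-line comment
	to: "/* $& */",
	replace: [
		{ from: /^\/\//gm, to: "" },
		{ from: /^\/\* \/\//gm, to: "/*" },
	],
});

console.log(converted);
```

Text outside the parent's matches (here, the code(); line) is never seen by the child replacements.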

Refer to the matched string

You can use $& to refer to the matched string (even when not in regexp mode). For example, to wrap every instance of "brep" with an <abbr> tag you can do:

from = "brep"
to = '<abbr title="Batch REPlace">$&</abbr>'

from: brep
to: '<abbr title="Batch REPlace">$&</abbr>'

Beyond $& there is a bunch of other special replacements, all starting with a dollar sign ($). To disable these special replacements, use literal = true / literal: true.
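
For reference, this is how $& behaves in JavaScript's own String.prototype.replaceAll, which is where these dollar-sign patterns come from. The second half shows one way to get a literal replacement in plain JS (a replacer function); whether brep's literal option works this way internally is an assumption.

```javascript
// $& is expanded to the matched text
const wrapped = "I use brep daily".replaceAll(
	"brep",
	'<abbr title="Batch REPlace">$&</abbr>'
);

// A replacer function returns its string as-is, so "$&" is not expanded
const literal = "Total: X".replaceAll("X", () => "$&100");

// $$ is the escape for a single dollar sign
const escaped = "Total: X".replaceAll("X", "$$100");

console.log(wrapped, literal, escaped);
```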

Append/prepend

While $& can be convenient, it’s also a little cryptic. To make it easier to append or prepend matches with a string, brep also supports before, after, and insert properties.

For example, this will insert "Bar" before every instance of "Foo":

before = "Foo"
insert = "Bar"

before: Foo
insert: Bar

after is also supported and works as you might expect.

Note

insert is literally just an alias of to; it just reads nicer in these cases.

You can also combine these with from to add additional criteria. For example this script:

from = "brep"
after = "using"
to = "awesome brep"

Will convert "I am using brep" to "I am using awesome brep".
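
One way to express the same intent in plain JavaScript is with lookarounds. This is only a sketch: the source does not say how brep implements before and after, and the optional whitespace in the second pattern is an assumption made so that the "I am using brep" example works.

```javascript
// before = "Foo", insert = "Bar":
// a zero-width lookahead matches the empty spot right before each "Foo"
const prepended = "Foo and Foo".replace(/(?=Foo)/g, "Bar");

// from = "brep", after = "using", to = "awesome brep":
// a lookbehind keeps only the matches that follow "using"
// (assumption: whitespace in between is tolerated)
const restricted = "I am using brep, not selling brep".replace(
	/(?<=using\s*)brep/g,
	"awesome brep"
);

console.log(prepended);
console.log(restricted);
```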

[from, to] shortcut syntax for many simple replacements

There are many cases where you want to make many replacements, all with the same settings (specified on their parent) and just different from/to values. brep supports a shortcut for this.

Instead of declarations, you can specify from/to pairs directly by enclosing them in brackets, separated by a comma. This can be combined with regular replacements, though far more easily in YAML:

replace:
- [foo, bar]
- [baz, quux]
- {from: yolo, to: hello} # regular replacement

In TOML, it cannot be combined with regular [[ replace ]] blocks, so all replacements need to be specified in a different way:

replace = [
	["foo", "bar"],
	["baz", "quux"],
	# cannot be combined with [[ replace ]] blocks
	{ from = "yolo", to = "hello", ignore_case = true },
]
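
Conceptually, each [from, to] pair is just a compact spelling of a regular replacement. A minimal sketch of that equivalence in JavaScript, using a hypothetical normalizeReplacement function (not part of brep's API):

```javascript
// Hypothetical helper for illustration: expand [from, to] pairs into objects
function normalizeReplacement(entry) {
	if (Array.isArray(entry)) {
		const [from, to] = entry;
		return { from, to };
	}
	return entry; // already a regular replacement
}

// The parsed form of the scripts above
const replace = [
	["foo", "bar"],
	["baz", "quux"],
	{ from: "yolo", to: "hello", ignore_case: true },
];

const normalized = replace.map(normalizeReplacement);
console.log(normalized);
```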

Syntax reference

Replacements

Key	Type	Default	Description
`from`	String or array of strings	(Mandatory)	The string to search for. If an array of strings, they are considered alternatives.
`to`	String	(matched string)	The string to replace the `from` string with.
`before`	String or array of strings	-	Match only strings before this one. Will be interpreted as a regular expression in regexp mode.
`after`	String or array of strings	-	Match only strings after this one. Will be interpreted as a regular expression in regexp mode.
`regexp`	Boolean	`false`	Whether the `from` field should be treated as a regular expression.
`ignore_case`	Boolean	`false`	Set to `true` to make the search case-insensitive.
`whole_word`	Boolean	`false`	Match only whole words: a match must either begin/end with a non-word character itself or be preceded/followed by a non-word character (or the start/end of the text). Unicode aware.
`recursive`	Boolean	`false`	Whether the replacement should be run recursively on its own output until it stops changing the output.
`files`	String or array of strings	-	Partial paths to filter against. This is an additional filter over the files being processed, to apply specific replacements only to some of the files.
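
The recursive option describes a fixed-point loop: keep applying the same replacement until the output stops changing. A minimal sketch in plain JavaScript follows; replaceUntilStable and its maxPasses guard are hypothetical and not taken from brep.

```javascript
// Hypothetical helper for illustration; not brep's actual implementation.
// maxPasses guards against replacements that never settle (e.g. "a" -> "aa").
function replaceUntilStable(text, pattern, to, maxPasses = 100) {
	let previous;
	let passes = 0;
	do {
		previous = text;
		text = text.replace(pattern, to);
		passes++;
	} while (text !== previous && passes < maxPasses);
	return text;
}

// A single pass turns four spaces into two; running until stable collapses them to one
const once = "a    b".replace(/ {2}/g, " ");
const stable = replaceUntilStable("a    b", / {2}/g, " ");

console.log(JSON.stringify(once), JSON.stringify(stable));
```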

Global settings

Key	Type	Default	Description
`files`	String or array of strings	-	A glob pattern to match files to process.
`suffix`	String	`""`	Instead of overwriting the original file, append this suffix to its filename
`extension`	String	-	Instead of overwriting the original file, change its extension to this value. Can start with a `.` but doesn’t need to.
`path`	String	-	Allows the new file to be in a different directory. Both absolute and relative paths are supported. If relative, it's resolved based on the original file's location. For example, `..` will write a file one directory level up.

CLI

To apply a brep script to the files specified in the script, simply run:

brep script.brep.toml

Where script.brep.toml is your brep script (and could be a .yaml or .json file).

To override the files specified in the script, specify them after the script file name, like so:

brep script.brep.toml src/*.md

The syntax (TOML, YAML, JSON) is inferred from the file extension. To override that (or to use an arbitrary file extension) you can use --format:

brep script.brep --format=toml

Note

You can name your script however you want, but ending the name in .brep.ext (e.g. .brep.toml) communicates more clearly that this is a brep script.

Supported flags

--verbose: Print out additional information
--dry-run: Just print out the output and don’t write anything
--version: Just print out the version and don’t do anything

Any root-level setting can also be specified as a flag, e.g. --suffix=-edited or --extension=txt.

JS API

You can access all of brep’s functionality via JS, and some of it even works client-side, in the browser!

Accessing the CLI via JS (Node.js only)

There is also the JS version of the CLI you can access as:

import brep from "brep/cli";
await brep("script.yaml");
// Do stuff after script runs

`Replacer` class (Browser-compatible)

This is the core of brep and takes care of applying the replacements on strings of text.

import { Replacer } from "brep/replacer";

or, in the browser:

import { Replacer } from "./node_modules/brep/src/replacer.js";

Instance methods:

new Replacer(script, parent): Create a new instance of the replacer. script is the script object, parent is the parent replacer (if any).
replacer.transform(content): Process a string and return the result.

`Brep` class (Node.js-only)

This takes care of reading script files, parsing them, creating Replacer instances, and applying the brep script to files.

import Brep from "brep";

Instance methods:

brep.text(content): Process a string (internally calls replacer.transform()).
brep.file(path [, outputPath]): Process a file and write the results back (async).
brep.files(paths): Process multiple files and write the results back
brep.glob(pattern): Process multiple files and write the results back

Future plans

I/O

A way to intersect globs, e.g. the script specifies **/*.html then the script user specifies folder/** and all HTML files in folder are processed.

CLI

Interactive mode
--help flag

¹ Opposite of error-proneness. How hard is it to make mistakes? This includes both syntax errors and writing syntax that behaves differently from what you expect.
² This refers to support for strings that can span multiple lines in your brep script. You can always include line breaks by using \n to represent them.

bafr's Issues

Support globs in replacement `files` property

Leftover from #4. Currently a child files is tested as a substring. It would be nice if it could be a glob. However, it seems none of the glob packages support testing of a path against a glob, and it seems overly intensive to apply the glob, get paths, then intersect these paths with the root paths.

Support changing extension in output path

Currently we can only add a suffix. If we could also change the extension, bafr can be used as a build tool for many types of processing.

Swap?

Currently, swapping a string for another means you need to convert it to a temp first. E.g. replacing a with b and b with a would look like this:

[[ replace ]]
from = "a"
to = "c"

[[ replace ]]
from = "b"
to = "a"

[[ replace ]]
from = "c"
to = "b"

This is pretty awkward, especially for nonprogrammers, and error-prone (what if the string you choose already exists in the doc?).

I propose we have a mode that does this automatically. Something like:

swap = true
from = "a"
to = "b"

Since this only works with exact matches, perhaps the syntax should not allow combining it with regex by switching them both to a mode (or strategy?) setting:

mode = "swap"
from = "a"
to = "b"

(and regexp = true would become mode = regexp)
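
For what it's worth, a swap does not need a temporary string if both terms are matched in a single pass. The following JavaScript sketch shows one possible approach; swap is a hypothetical function, not an existing or planned brep API.

```javascript
// Hypothetical helper for illustration: swap two fixed strings in one pass
function swap(text, a, b) {
	// Escape regex syntax characters so both terms are matched literally
	const escape = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
	// Try the longer term first in case one is a prefix of the other
	const alternatives = [a, b]
		.sort((x, y) => y.length - x.length)
		.map(escape)
		.join("|");
	const pattern = new RegExp(alternatives, "g");
	// Every match is either a or b; replace it with the other one
	return text.replace(pattern, (match) => (match === a ? b : a));
}

const swapped = swap("abba", "a", "b");
const withSymbols = swap("1 + 2 * 3", "+", "*");

console.log(swapped, withSymbols);
```

Because each position in the text is visited only once, there is no risk of a temporary placeholder colliding with existing content.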

Add `regex` as a flavor option for improved DX

Love the project. 😊

A significant downside of this project being written in JavaScript is that JS regexes, with any significant level of complexity, are awful with readability and maintainability compared to other major regex flavors (due to the lack of options for free spacing, comments, and any features for composing/reusing subpatterns). They're also especially vulnerable to catastrophic backtracking due to the lack of backtracking control syntax.

As a result, I think my recently released regex package could be a great fit to include with brep, either as an option or even as the default regex handler. It uses a strict superset of native JS regex syntax, so, apart from the differences that result from its implicit flags x and n, 100% of JS regexes (with flag v) work the same. If it wasn't used as the default, perhaps there could be a new option like regexp_flavor with values 'RegExp' (native) and 'regex'.

From the regex package's readme:

regex is a template tag that extends JavaScript regular expressions with features that make them more powerful and dramatically more readable. It returns native RegExp instances that equal or exceed native performance. It's also lightweight, supports all ES2024+ regex features, and can be used as a Babel plugin to avoid any runtime dependencies or added runtime cost.

Highlights include support for free spacing and comments, atomic groups via (?>…) that can help you avoid ReDoS, subroutines via \g<name> and definition groups via (?(DEFINE)…) that enable powerful composition, and context-aware interpolation of regexes, escaped strings, and partial patterns.

With the regex package, JavaScript steps up as one of the best regex flavors alongside PCRE and Perl, and maybe surpassing C++, Java, .NET, and Python.

What do you think?

Note: Since brep would need to call regex with dynamic input rather than using it with backticks as a template tag, that would work like this: regex({raw: [<pattern-string>]}) or regex(<flags-str>)({raw: [<pattern-string>]}).

A way to extract matches only and write them to another file

Instead of replacing matches and keeping whatever is in between, I’ve often wanted to extract matches and write them to a file. I wonder if that could be in scope as well.

Shorthand syntax?

For a given bafr script (replacing bibliography ids) we have a long series of:

[[ replace ]]
from = "@Bakke:2011:SUI:1978942.1979313"
to = "@Bakke2011AData"

[[ replace ]]
from = "@chang2016using"
to = "@Chang2016UsingSpreadsheets"

# ...

What if we supported a shorthand syntax for these kinds of replacements which don’t need any config and just have a from and a to? E.g. something like:

[[ replacements ]]

"@Bakke:2011:SUI:1978942.1979313" = "@Bakke2011AData"
"@chang2016using" = "@Chang2016UsingSpreadsheets"
# ...

The problem is that with that, these cannot take part in the flow of other replacements due to how TOML is converted to JSON (they’d be a separate array).

Alternatively it could be a replacement flag:

shorthand = true
"@Bakke:2011:SUI:1978942.1979313" = "@Bakke2011AData"
"@chang2016using" = "@Chang2016UsingSpreadsheets"
# ...

When that flag is on, any unknown key would be considered a replacement.

Or an array:

pairs = [
	"@Bakke:2011:SUI:1978942.1979313", "@Bakke2011AData",
	"@chang2016using", "@Chang2016UsingSpreadsheets",
]

Worth it or not?

Custom syntax?

Currently bafr scripts are based on TOML.
If we were willing to parse our own syntax, we could have nice things like:

No more awkward [[ replace ]] blocks, just a separator, or maybe even nothing (just make sure to start every replacement with the from key — so every time we encounter a new from declaration, we start a new replacement)
No need to quote keys containing special characters (see #1)
We could use the more natural : instead of =
We could use to instead of = in #1
We could have a less awkward multiline string syntax. Perhaps just a regular quote character (why terminate quotes on line endings anyway?).
We could support unquoted strings for simple things. Basically only quote your string if you don't want whitespace trimmed or it’s multiline (we don't need to distinguish numbers or other values from strings anyway)

The obvious downsides are:

WAY more effort
Reduced compatibility. No syntax highlight anywhere unless this becomes quite popular.

Alternatively, we could explore using YAML. I went with TOML to avoid all the weird whitespace sensitivity of YAML, but I just read it can also use {} for scope, so it might be worth another look: https://medium.com/@kasunbg/write-yaml-without-indentation-via-curly-braces-3c05ae8700ce

`whole_word` flag

To truly emulate text editors' find & replace, we also need a whole_word flag. However, this is not as trivial as wrapping the regex with \b(?: ... )\b. Word boundaries (\b) detect transitions from \w to \W (and vice versa). However when the match already starts or ends with a non-word character, it’s already a "whole word" match.

We probably need some lookarounds instead:

(?<=^|\W) = preceded by beginning of line/file OR non-word character
(?=$|\W) = followed by end of line/file OR non-word character

Furthermore, \W is not Unicode-aware and treats any letter outside ASCII (e.g. é) as a non-word character. For a Unicode-aware version, I think we need [^_\p{L}\p{N}].
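
Putting the lookarounds and the Unicode-aware class together, a sketch in JavaScript could look like this. wholeWord is a hypothetical function written for this discussion, not brep's implementation, and it handles literal search terms only.

```javascript
// Hypothetical helper for illustration: build a "whole word" regex for a literal term
function wholeWord(term) {
	// Escape regex syntax characters so the term is matched literally
	const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
	// Unicode-aware "non-word character": anything but _, letters, and digits
	const nonWord = "[^_\\p{L}\\p{N}]";
	return new RegExp(`(?<=^|${nonWord})${escaped}(?=$|${nonWord})`, "gmu");
}

// "café" is followed by the letter "s" in "cafés", so only the first one matches
const accents = "café cafés".replace(wholeWord("café"), "X");

// \bC\+\+\b would fail here, since there is no word boundary between "+" and " "
const symbols = "I like C++ a lot".replace(wholeWord("C++"), "Rust");

console.log(accents);
console.log(symbols);
```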

The default `to` should be "no change"

Defaulting to the empty string may save chars for some use cases, but can also be very surprising and destructive. Especially with subreplacements, a to that defaults to leaving the string unchanged can be useful, and if anything less surprising.

`from`/`before`/`after` should support arrays for alternatives

That’s a nice declarative non-regex way to do alternatives, and could be convenient even for regex cases.

Support `files` per replacement, to narrow down specific replacements

A way to output result files in another folder

Experimenting with https://github.com/LeaVerou/phd/issues/6, I thought how nice it would be if Bafr supported this feature. In that case, we could keep temporary .tex files in one place and remove them in one shot when we no longer need them.

It would be nice if we could specify such a folder either as a CLI parameter or a global setting (or both).
I'm unsure about the parameter's name. At first, I thought of something like folder, but with the files setting, it might be seen as a folder where the source files live. Another option might be output.

Suppose we go with output, then we can do the following:

bafr script.bafr.yaml --output=processed

And

files: content/*.md
output: processed
replace:
- from: <br>
