JuliaDocs / MarkdownAST.jl

Abstract syntax tree representation of Markdown documents in Julia

Home Page: https://juliadocs.github.io/MarkdownAST.jl

License: Other

markdownast.jl's Issues

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Pretty terminal printing of Markdown documents

That is, what CommonMark.jl does; the implementation can largely be lifted from there. However, we should probably drop the dependency on Crayons for it (x-ref: MichaelHatherly/CommonMark.jl#41 (comment)).

Implementation of the table element

The precise implementation of the Table element in CommonMark is currently unclear to me (not sure how the different sub-elements should be organized).

The Table extension needs to be properly hashed out and documented.
The conversion from standard library tables also needs to be fixed.
Remove the TableBody and TableHeader elements altogether, and just interpret the first row as the header.
Make TableCell a singleton, removing its fields. They could also be turned into some sort of dynamic properties that are determined by traversing the tree.
Make TableComponent subtype AbstractElement directly and don't use it for Table (to ensure that the internal nodes of a table are not allowed to exist in other contexts).
Add helper methods like rows, nrows, ncols etc. to dynamically determine the number of columns and rows of a table.

Backslash / soft break / line break elements

CommonMark has implemented dedicated inlines for these characters. But do we actually need them here?

MarkdownAST.jl/src/markdown.jl

Lines 467 to 469 in 72c5f6c

# struct Backslash <: AbstractInline end
# struct SoftBreak <: AbstractInline end
# struct LineBreak <: AbstractInline end

Implement LineBreak (b45a1e5, 2f77d57)
Implement or discard SoftBreak
Implement or discard Backslash

Document should not be a subtype of AbstractBlock

The Document element is not really a block, since it should not be contained in other elements (e.g. an Admonition probably should not contain a Document node as a child). Instead, it could subtype AbstractElement directly.

Julia expression interpolation element

This is currently not implemented here, but it is likely something we need in order to support converting from standard library Markdown objects that contain interpolations.

MarkdownAST.jl/src/markdown.jl

Line 466 in 72c5f6c

# src/extensions/interpolation.jl:struct JuliaValue <: AbstractInline

MarkdownAST.jl/src/stdlib.jl

Lines 119 to 127 in 72c5f6c

# TODO: Fallback methods. These should maybe use the interpolation extension?
# function _convert_inline(x)
#     @debug "Strange inline Markdown node (typeof(x) = $(typeof(x))), falling back to repr()" x
#     Text(repr(x))
# end
# function _convert_block(x)
#     @debug "Strange inline Markdown node (typeof(x) = $(typeof(x))), falling back to repr()" x
#     Paragraph([Text(repr(x))])
# end

Whole tree iteration

CommonMark iterates over the whole tree if you do for node in tree, which is currently not implemented here (we only have children).

I would argue, however, that it is not intuitively clear which type of iteration (over direct children? over whole tree? over direct and indirect children, but not parents?) should be the default. Hence I would advocate that for each iterator there should be a function (like children) that returns the iterator. For the whole-tree iteration it could be called tree(node).
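
A rough sketch of the named-iterator idea, assuming the `node.children` property and the `@ast` macro used elsewhere on this page. The helper name `tree` is the one proposed above and is hypothetical; this version eagerly collects the nodes in pre-order instead of returning a lazy iterator.

```julia
import MarkdownAST
using MarkdownAST: @ast, Document, Heading, Paragraph

# Hypothetical helper (not part of MarkdownAST): collect `node` and all of its
# descendants in pre-order (parent first, then its children left to right).
function tree(node::MarkdownAST.Node)
    nodes = [node]
    for child in node.children
        append!(nodes, tree(child))
    end
    return nodes
end

md = @ast Document() do
    Heading(1) do
        "Top-level heading"
    end
    Paragraph() do
        "Some paragraph text"
    end
end

# Explicit whole-tree iteration; `md.children` would only give direct children.
for node in tree(md)
    println(typeof(node.element))
end
```

Keeping the traversal behind a named function makes the call site say which kind of iteration is meant, which is the point argued above.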

Convert back to markdown string

Is there an easy way to convert a MarkdownAST tree back to the markdown string it represents? I managed to do it via conversion back to the stdlib-Markdown:

julia> using MarkdownAST: @ast, Document, Heading, Paragraph
julia> using Markdown: MD

julia> md = @ast Document() do
           Heading(1) do
               "Top-level heading"
           end
           Paragraph() do
               "Some paragraph text"
           end
       end

julia> print(string(convert(MD, md)))
# Top-level heading

Some paragraph text

It would be nice if string(md) or something like that could work directly.
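
Until something like that exists, the round trip can be wrapped in a small helper. This is only a sketch: `markdown_string` is a made-up name, and it relies on nothing beyond the `convert(MD, node)` call shown above.

```julia
import Markdown
import MarkdownAST
using MarkdownAST: @ast, Document, Paragraph

# Hypothetical helper (not part of MarkdownAST): render a tree as Markdown text
# by converting it to the standard library representation first.
markdown_string(node::MarkdownAST.Node) = string(convert(Markdown.MD, node))

md = @ast Document() do
    Paragraph() do
        "Some paragraph text"
    end
end

print(markdown_string(md))
```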

Missing closing back-ticks on jldoctest causes corrupt documentation but does not fail checks

I forgot some closing backticks in a jldoctest block (fixed in LilithHafner/AliasTables.jl#29). In order of preference, this should:

1. Fail the documentation build
2. Render okay
3. Break that docstring's rendering
4. Break that docstring's rendering and some subsequent content
5. Break the whole documentation page but still show up as a pass

Right now we're at a 3 or 4 here: https://aliastables.lilithhafner.com/v1.0.0/#AliasTables.probabilities

Moved from LuxDL/DocumenterVitepress.jl#116

Rename Node{T} -> GenericNode{T}

.. and define const Node = GenericNode{Nothing}. This way MarkdownAST.Node would always refer to a concrete type, and would also make it more clear in other packages when they define their own Node with their own T.

This is inspired by how IOBuffer is really an instance of GenericIOBuffer{T}.
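
The pattern can be illustrated with a toy type. The struct below is a stand-in with only two fields, not the package's actual `Node` definition; only the names `GenericNode` and `Node` come from the proposal above.

```julia
# Toy stand-in: the real node type has more fields (parent, siblings, ...).
mutable struct GenericNode{M}
    element::Any
    meta::M
end

# `Node` now always names one concrete type, just like `IOBuffer` does.
const Node = GenericNode{Nothing}

# Convenience constructor for the common "no metadata" case.
Node(element) = Node(element, nothing)

n = Node("some element")

# Another package can pick its own metadata type without redefining `Node`.
custom = GenericNode("some element", Dict{Symbol,Any}())

println(isconcretetype(Node))         # the alias is concrete
println(isconcretetype(GenericNode))  # the parametric type is not
```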

Rename Code -> CodeInline

For consistency with other elements (AbstractBlock / AbstractInline, HTMLBlock/HTMLInline, DisplayMath/InlineMath).

Enforce iscontainer and can_contain in mutating methods

The various tree mutation methods (push!, insert_after! etc.) do not enforce the requirements on elements that are described by the iscontainer and can_contain methods.
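
A minimal sketch of what such a check could look like, written as a wrapper rather than as a change to `push!` itself. `checked_push!` is a hypothetical name, and the sketch assumes that `iscontainer` takes an element and that `can_contain` takes the parent element followed by the child element.

```julia
import MarkdownAST

# Hypothetical helper (not part of MarkdownAST): validate the parent-child
# relationship before mutating the tree.
function checked_push!(parent::MarkdownAST.Node, child::MarkdownAST.Node)
    MarkdownAST.iscontainer(parent.element) ||
        error("$(typeof(parent.element)) can not have children")
    MarkdownAST.can_contain(parent.element, child.element) ||
        error("$(typeof(parent.element)) can not contain $(typeof(child.element))")
    push!(parent.children, child)
    return parent
end

paragraph = MarkdownAST.Node(MarkdownAST.Paragraph())

# Inline text inside a paragraph is accepted.
checked_push!(paragraph, MarkdownAST.Node(MarkdownAST.Text("inline text")))

# A block inside a paragraph is rejected before the tree is touched.
try
    checked_push!(paragraph, MarkdownAST.Node(MarkdownAST.Paragraph()))
catch err
    println("rejected: ", err)
end
```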

Remove/rename internal properties for `Node` types

We have these internal fields like .nxt and .first_child, and we do some unnecessary get/setproperty stuff:

MarkdownAST.jl/src/node.jl

Lines 103 to 137 in 99e0f82

function Base.getproperty(node::Node{T}, name::Symbol) where T
    if name === :element
        getfield(node, :t)
    elseif name === :children
        NodeChildren(node)
    elseif name === :next
        getfield(node, :nxt)
    elseif name === :previous
        getfield(node, :prv)
    elseif name === :parent
        getfield(node, :parent)
    elseif name === :meta
        getfield(node, :meta)
    else
        # TODO: error("type Node does not have property $(name)")
        @debug "Accessing private field $(name) of Node" stacktrace()
        getfield(node, name)
    end
end

function Base.setproperty!(node::Node, name::Symbol, x)
    if name === :element
        setfield!(node, :t, x)
    elseif name === :meta
        setfield!(node, :meta, x)
    elseif name in propertynames(node)
        # TODO: error("Unable to set property $(name) for Node")
        @debug "Setting private field :$(name) of Node" stacktrace()
        setfield!(node, name, x)
    else
        # TODO: error("type Node does not have property $(name)")
        @debug "Accessing private field :$(name) of Node" stacktrace()
        setfield!(node, name, x)
    end
end

We should clean that up and make sure that you can only access documented fields. Internally, we can use getfield and setfield! where necessary. But let's do that in 0.2.0.
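
As an illustration of the stricter behavior, here is a sketch on a toy type (`ToyNode` is made up for this example and has only two fields). Documented properties are mapped onto the private fields, and everything else raises an error instead of falling through to `getfield`.

```julia
# Toy stand-in with two private fields, mirroring `t` and `nxt` above.
mutable struct ToyNode
    t::Any
    nxt::Union{ToyNode,Nothing}
end

# Only the documented names are advertised (e.g. for tab completion).
Base.propertynames(::ToyNode) = (:element, :next)

function Base.getproperty(node::ToyNode, name::Symbol)
    if name === :element
        getfield(node, :t)
    elseif name === :next
        getfield(node, :nxt)
    else
        error("type ToyNode does not have property $(name)")
    end
end

function Base.setproperty!(node::ToyNode, name::Symbol, x)
    if name === :element
        setfield!(node, :t, x)
    else
        # Links are read-only here; internal code would call setfield! directly.
        error("Unable to set property $(name) for ToyNode")
    end
end

n = ToyNode("a", nothing)
n.element = "b"       # allowed: documented and settable
try
    n.nxt             # private field name: now an error
catch err
    println("error: ", err)
end
```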

X-ref: #19.

Implement `Base.replace` and `Base.replace!`

Continuing from the discussion on Discourse and in the context of implementing JuliaDocs/DocumenterCitations.jl#6, it would be extremely useful to implement the Base functions replace and replace! on AST trees.

I would propose the following implementation:

using Pkg
Pkg.activate(temp=true)
Pkg.add("MarkdownAST")

import MarkdownAST


"""
    replace(f::Function, root::Node)

Creates a copy of the tree where all child nodes of `root` are recursively
replaced by the result of `f(child)`.

The function `f(child::Node)` must return either a new `Node` to replace
`child` or a Vector of nodes that will be inserted as siblings, replacing
`child`.

Note that `replace` does not allow the construction of invalid trees, and
element replacements that require invalid parent-child relationships (e.g., a
block element as a child to an element expecting inlines) will throw an error.

# Example

The following snippet removes links from the given AST. That is, it replaces
`Link` nodes with their link text (which may contain nested inline markdown
elements):

```julia
new_mdast = replace(mdast) do node
    if node.element isa MarkdownAST.Link
        return [MarkdownAST.copy_tree(child) for child in node.children]
    else
        return node
    end
end
```
"""
function Base.replace(f::Function, root::MarkdownAST.Node{M}) where M
    new_root = MarkdownAST.Node{M}(root.element, deepcopy(root.meta))
    for child in root.children
        replaced_child = replace(f, child)
        transformed = f(replaced_child)
        if transformed isa MarkdownAST.Node
            push!(new_root.children, transformed)
        elseif transformed isa Vector
            append!(new_root.children, transformed)
        else
            error("Function `f` in `replace(f, root::MarkdownAST.Node)` must return either a Node or a Vector of nodes, not $(repr(typeof(transformed)))")
        end
    end
    return new_root
end


"""
    replace!(f::Function, root::Node)

Acts like `replace(f, root)`, but modifies `root` in-place.
"""
function Base.replace!(f::Function, root::MarkdownAST.Node{M}) where M
    new_root = replace(f, root)
    while !isempty(root.children)
        # `Base.empty!(root.children)` would be nice!
        MarkdownAST.unlink!(first(root.children))
    end
    append!(root.children, new_root.children)
    return root
end

It might be nice to also implement Base.empty!(::MarkdownAST.NodeChildren) (see comment): is there a better way to do that than the loop that I implemented?

To test the behavior in the context of my original intent with DocumenterCitations:

## TEST  ######################################################################
#
# As a test, we're resolving simple citation links in a format similar to
# https://juliadocs.org/DocumenterCitations.jl/stable/gallery/#Custom-style:-Citation-key-labels
#
# That test replaces a single Link node with a list of new inline nodes that
# mix text and links to a `references.md` page.
#
# Also, to test the simpler transformation of a node with a single new node, we
# replace Strong (bold) nodes with Emph (italic) nodes. This could also be
# done with MarkdownAST.copy_tree directly, but it's just a test.

Pkg.add(url="https://github.com/JuliaDocs/Documenter.jl", rev="master")
import Markdown
import Documenter

MD = raw"""
# Quantum Control

**[Quantum optimal control](https://qutip.org/docs/latest/guide/guide-control.html)**
[BrumerShapiro2003;BrifNJP2010;KochJPCM2016;SolaAAMOP2018;MorzhinRMS2019;
Wilhelm2003.10132;KochEPJQT2022](@cite) attempts to steer a quantum system in
some desired way.

## Methods used

We use the following methods:

* *[Krotov's method](https://github.com/JuliaQuantumControl/Krotov.jl)*
  [Krotov1996](@cite), and
* [**GRAPE** (*Gradient Ascent Pulse Engineering*)](https://github.com/JuliaQuantumControl/GRAPE.jl)
  [KhanejaJMR2005;FouquieresJMR2011](@cite).

This concludes the document.
"""

function parse_md_string(mdsrc)
    mdpage = Markdown.parse(mdsrc)
    return convert(MarkdownAST.Node, mdpage)
end

mdast = parse_md_string(MD)
println("====== IN =======")
println("AS AST:")
@show mdast
println("AS TEXT:")
print(string(convert(Markdown.MD, mdast)))
println("=== TRANSFORM ===")
replace!(mdast) do node
    if node.element == MarkdownAST.Link("@cite", "")
        text = first(node.children).element.text  # assume no nested markdown
        keys = [strip(key) for key in split(text, ";")]
        n = length(keys)
        if n == 1
            k = keys[1]
            new_md = "[[$k]](references.md#$k)"
        else
            k1 = keys[1]
            k2 = keys[end]
            if n > 2
                new_md = "[[$k1](references.md#$k1)-[$k2](references.md#$k2)]"
            else
                new_md = "[[$k1](references.md#$k1), [$k2](references.md#$k2)]"
            end
        end
        return Documenter.mdparse(new_md; mode=:span)
        # We probably wouldn't want to use `Documenter`, but it shouldn't be
        # hard to copy in a stripped-down version of `mdparse` here.
    elseif node.element == MarkdownAST.Strong()
        # Not sure if `copy_tree(f, node)` is really the most elegant way to do
        # this, but I wanted to try out how `copy_tree` can modify a node's
        # `element`.
        return MarkdownAST.copy_tree(node) do node, element
            element == MarkdownAST.Strong() ? MarkdownAST.Emph() : element
        end
    else
        return node
    end
end
println("====== OUT =======")
println("AS AST:")
@show mdast
println("AS TEXT:")
print(string(convert(Markdown.MD, mdast)))
println("====== END =======")

Second, to test the simple example from the docstring:

# TEST 2: delete links (example from the docstring)  ##########################
println("\n\n=====================================")
println("TEST2: ORIGINAL MD WITH LINKS REMOVED")
mdast = parse_md_string(MD)
replace!(mdast) do node
    if node.element isa MarkdownAST.Link
        return [MarkdownAST.copy_tree(child) for child in node.children]
    else
        return node
    end
end
print(string(convert(Markdown.MD, mdast)))
println("====== END =======")

@mortenpi Would you like me to start working on a PR for this with proper testing and documentation?

Any comments on the prototype?

Package exports

Currently, the package does not export anything, so everything has to be explicitly included. We probably want to export some (or all) of the following things:

@ast macro
Node type
Functions for querying and mutating trees (next, insert_after! etc).
Abstract element types (AbstractElement, AbstractInline etc).
Concrete element types (note: Text is ambiguous with a Base export; but the @ast macro does not actually need the Text() method, so it could remain unexported).

show methods for trees / Node

There are different representations of a Node object that are useful in different cases:

1. Something short that just says that this object is a Node with some element (current behavior).
2. Full AST printout (the current showast function, replicating the input of the @ast macro). This is useful when working with the tree manually, but is technical and can get pretty long.

   This can maybe be combined with (1), in that for large printouts we just put an ellipsis like we do for large arrays.
3. Pretty-printed document (the behaviour of CommonMark). This is useful for users who do not want to be concerned with the technical details of the representation, and also relevant for e.g. rendering docstrings.

We need to decide which one should be the default output if a Node is returned in the REPL, and how to access the other options.

Node accessor API

From MichaelHatherly/CommonMark.jl#41 (comment):

node[] for the AbstractContainer instance

Perhaps container() rather than overloading getindex, which adds inconsistencies in how you access particular parts of the nodes.

e.g. next()

Probably too generic, unless we're expecting to not export?

Yep, having the element access consistent with the rest of the API is probably a good idea. I would advocate calling it element though (~AbstractElement).
next and previous are indeed quite generic.

Maybe, to avoid the whole issue of exporting generic functions (e.g. parent, children, container/element are also kind of generic), we stick to having them be clearly documented fields/properties, e.g. .element, .next, .previous, .parent.

I would argue that setproperty! for many of them should be disallowed, so that it wouldn't be possible to construct nonsensical trees. You can always still call setfield! if you really need low-level access to the underlying fields (e.g. in basic functions such as insert_after!).

Another bikeshedding question is whether to have them be called nxt/prv or next/prev or next/previous. While slightly more verbose, I would advocate for the latter option for clarity.

Furthermore, we could also overload the iterator over children, such that you could add children with push!(node.children, child) and pushfirst!(node.children, child). Semantically, node.children feels array/list-like, and so overloading push!/pushfirst! seems appropriate.
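
For illustration, this is roughly how tree construction reads with that design. The sketch assumes that `push!` and `pushfirst!` are defined on `node.children` as proposed, and that `Node(element)` creates an unattached node.

```julia
import MarkdownAST

doc = MarkdownAST.Node(MarkdownAST.Document())
paragraph = MarkdownAST.Node(MarkdownAST.Paragraph())

# Add a child at the end, then another one in front of it.
push!(paragraph.children, MarkdownAST.Node(MarkdownAST.Text("world")))
pushfirst!(paragraph.children, MarkdownAST.Node(MarkdownAST.Text("Hello, ")))

push!(doc.children, paragraph)

# The children come back in document order.
for child in paragraph.children
    print(child.element.text)
end
println()
```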

Handling mutating trees with iterators

Currently when e.g. iterating over children(node) you can mutate the tree while the iteration is happening. This will likely lead to unexpected behavior (note: changing or updating the AbstractElement is fine).

We should minimally document that you should not do that. However, I wonder if there is something else we could do to make sure that you don't get bad behavior. A few options:

Collect the Nodes into an array when the iterator is constructed, and then naively iterate over that array instead. If some of them get unlinked etc., that won't affect the iteration per se. However, this means allocating a potentially big array (especially in the whole-tree case). We could have a keyword argument for iterator functions to allow for unsafe but efficient iteration (e.g. children(node, unsafe=true)).
Make the tree immutable while it is being iterated over. This would mean attaching some global metadata to each node (e.g. something as simple as a Ref{Bool}).
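
The first option can already be approximated on the caller's side. The sketch below assumes the `node.children` property and `unlink!` as used elsewhere on this page; it snapshots the children with `collect` before mutating, so the loop never follows a link that has just been removed.

```julia
import MarkdownAST
using MarkdownAST: @ast, Document, Paragraph

md = @ast Document() do
    Paragraph() do
        "first"
    end
    Paragraph() do
        "second"
    end
end

# Snapshot the children first; the array is unaffected by later unlinking.
snapshot = collect(md.children)
for child in snapshot
    MarkdownAST.unlink!(child)
end

println(isempty(md.children))
```

The cost is one array allocation per loop, which is the trade-off described in the first option.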

Methods for adding children

From MichaelHatherly/CommonMark.jl#41 (comment):

Instead of append_child and prepend_child, I overload push! and pushfirst! for this. I felt that "append"/"prepend" could be confusing, as in the standard library append! and prepend! concatenate two collections, rather than adding an element. However, at the same time, I am not really sure it makes sense to think of a node as a "collection of its children", which this choice implies.

Those were intentionally not added to the push! and pushfirst! methods since I didn't feel they could really reasonably be classed as "array-like" enough for it not to be punning.

We should change away from push!(::Node, ...) and pushfirst!(::Node, ...) for adding children, as it's not really semantically appropriate. But I don't really have a good idea for an alternative name, and still not a huge fan of "append" and "prepend".

A different option, together with #10, would be push!(node.children, child) and pushfirst!(node.children, child).

MarkdownAST attempted to iterate a `Markdown.Paragraph`

Hi, in trying to upgrade to Documenter 1.0 I've hit this issue (with DataToolkitBase).

ERROR: LoadError: MethodError: no method matching iterate(::Markdown.Paragraph)

Closest candidates are:
  iterate(::RegexMatch, Any...)
   @ Base regex.jl:284
  iterate(::ExponentialBackOff)
   @ Base error.jl:260
  iterate(::ExponentialBackOff, ::Any)
   @ Base error.jl:260
  ...

Stacktrace:
  [1] _convert(nodefn::MarkdownAST.NodeFn{Nothing}, c::MarkdownAST.Item, child_convert_fn::typeof(MarkdownAST._convert_block), md_children::Markdown.Paragraph)
    @ MarkdownAST ~/.julia/packages/MarkdownAST/CZtZT/src/stdlib/fromstdlib.jl:33
  [2] _convert_block(nodefn::MarkdownAST.NodeFn{Nothing}, b::Markdown.List)
    @ MarkdownAST ~/.julia/packages/MarkdownAST/CZtZT/src/stdlib/fromstdlib.jl:65
  [3] _convert(nodefn::MarkdownAST.NodeFn{Nothing}, c::MarkdownAST.Item, child_convert_fn::typeof(MarkdownAST._convert_block), md_children::Vector{Any})
    @ MarkdownAST ~/.julia/packages/MarkdownAST/CZtZT/src/stdlib/fromstdlib.jl:34
  [4] _convert_block(nodefn::MarkdownAST.NodeFn{Nothing}, b::Markdown.List)
    @ MarkdownAST ~/.julia/packages/MarkdownAST/CZtZT/src/stdlib/fromstdlib.jl:65
  [5] _convert(nodefn::MarkdownAST.NodeFn{Nothing}, c::MarkdownAST.Document, child_convert_fn::typeof(MarkdownAST._convert_block), md_children::Vector{Any})
    @ MarkdownAST ~/.julia/packages/MarkdownAST/CZtZT/src/stdlib/fromstdlib.jl:34
  [6] convert (repeats 2 times)
    @ Documenter ~/.julia/packages/MarkdownAST/CZtZT/src/stdlib/fromstdlib.jl:23 [inlined]
  [7] convert
    @ Documenter ~/.julia/packages/MarkdownAST/CZtZT/src/stdlib/fromstdlib.jl:21 [inlined]
  [8] (::Documenter.var"#49#50"{MarkdownAST.Node{Nothing}, Documenter.Page, Documenter.Document, LineNumberNode, Module, MarkdownAST.CodeBlock})()
    @ Documenter ~/.julia/packages/Documenter/Meee1/src/expander_pipeline.jl:630
  [9] cd(f::Documenter.var"#49#50"{MarkdownAST.Node{Nothing}, Documenter.Page, Documenter.Document, LineNumberNode, Module, MarkdownAST.CodeBlock}, dir::String)
    @ Base.Filesystem ./file.jl:112
 [10] runner(::Type{Documenter.Expanders.EvalBlocks}, node::MarkdownAST.Node{Nothing}, page::Documenter.Page, doc::Documenter.Document)
    @ Documenter ~/.julia/packages/Documenter/Meee1/src/expander_pipeline.jl:610
 [...]
 [20] top-level scope
    @ ~/.julia/dev/DataToolkitBase/docs/make.jl:19
in expression starting at /home/tec/.julia/dev/DataToolkitBase/docs/make.jl:19
