ariabuckles / simple-markdown

JavaScript markdown parsing, made simple

License: MIT License

simple-markdown's Introduction

🚚 As of April 2022 this repo is no longer the home of simple-markdown. The contents and development activity have moved into the Perseus repo here.

simple-markdown

simple-markdown is a markdown-like parser designed for simplicity and extensibility.

Change log

Philosophy

Most markdown-like parsers aim for speed or edge case handling. simple-markdown aims for extensibility and simplicity.

What does this mean? Many websites using markdown-like languages have custom extensions, such as @mentions or issue number linking. Unfortunately, most markdown-like parsers don't allow extension without forking, and can be difficult to modify even when forked. simple-markdown is designed to allow simple addition of custom extensions without needing to be forked.

At Khan Academy, we use simple-markdown to format over half of our math exercises, because we need markdown extensions for math text and interactive widgets.

simple-markdown is MIT licensed.

Getting started

First, let's parse and output some generic markdown using simple-markdown.

If you want to run these examples in node, you should run npm install in the simple-markdown folder or npm install simple-markdown in your project's folder. Then you can acquire the SimpleMarkdown variable with:

var SimpleMarkdown = require("simple-markdown");

Then let's get a basic markdown parser and outputter. SimpleMarkdown provides default parsers/outputters for generic markdown:

var mdParse = SimpleMarkdown.defaultBlockParse;
var mdOutput = SimpleMarkdown.defaultOutput;

mdParse can give us a syntax tree:

var syntaxTree = mdParse("Here is a paragraph and an *em tag*.");

Let's inspect our syntax tree:

    // pretty-print this with 4-space indentation:
    console.log(JSON.stringify(syntaxTree, null, 4));
    => [
        {
            "content": [
                {
                    "content": "Here is a paragraph and an ",
                    "type": "text"
                },
                {
                    "content": [
                        {
                            "content": "em tag",
                            "type": "text"
                        }
                    ],
                    "type": "em"
                },
                {
                    "content": ".",
                    "type": "text"
                }
            ],
            "type": "paragraph"
        }
    ]

Then to turn that into an array of React elements, we can call mdOutput:

    mdOutput(syntaxTree)
    => [ { type: 'div',
        key: null,
        ref: null,
        _owner: null,
        _context: {},
        _store: { validated: false, props: [Object] } } ]

Adding a simple extension

Let's add an underline extension! To do this, we'll need to create a new rule and then make a new parser/outputter. The next section will explain how all of these steps work in greater detail. (To follow along with these examples, you'll also need underscore.)

First, we create a new rule. We'll look for double underscores surrounding text.

We'll put underlines right before ems, so that __ will be parsed before _ for emphasis/italics.

A regex to capture this would look something like /^__([\s\S]+?)__(?!_)/. This matches __, followed by any content until it finds another __ not followed by a third _.
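To see what that regex captures, here is a small standalone check in plain JavaScript (it does not need simple-markdown). The sample strings are made up for illustration.

```javascript
// The underline regex described above.
var underlineRegex = /^__([\s\S]+?)__(?!_)/;

// Matches at the start of the source; capture[1] is the inner text.
var hit = underlineRegex.exec("__hello underlines__ and more");
console.log(hit[0]); // "__hello underlines__"
console.log(hit[1]); // "hello underlines"

// A closing `__` followed by a third `_` is rejected, so the extra
// underscore is pulled into the content instead.
var tail = underlineRegex.exec("__a___");
console.log(tail[1]); // "a_"

// No match when the source does not start with `__`.
var miss = underlineRegex.exec("say __hello__");
console.log(miss); // null
```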

var underlineRule = {
  // Specify the order in which this rule is to be run
  order: SimpleMarkdown.defaultRules.em.order - 0.5,

  // First we check whether a string matches
  match: function (source) {
    return /^__([\s\S]+?)__(?!_)/.exec(source);
  },

  // Then parse this string into a syntax node
  parse: function (capture, parse, state) {
    return {
      content: parse(capture[1], state),
    };
  },

  // Finally transform this syntax node into a
  // React element
  react: function (node, output) {
    return React.DOM.u(null, output(node.content));
  },

  // Or an html element:
  // (Note: you may only need to make one of `react:` or
  // `html:`, as long as you never ask for an outputter
  // for the other type.)
  html: function (node, output) {
    return "<u>" + output(node.content) + "</u>";
  },
};

Then, we need to add this rule to the other rules:

var rules = _.extend({}, SimpleMarkdown.defaultRules, {
  underline: underlineRule,
});

Finally, we need to build our parser and outputters:

var rawBuiltParser = SimpleMarkdown.parserFor(rules);
var parse = function (source) {
  var blockSource = source + "\n\n";
  return rawBuiltParser(blockSource, { inline: false });
};
// You probably only need one of these: choose depending on
// whether you want react nodes or an html string:
var reactOutput = SimpleMarkdown.outputFor(rules, "react");
var htmlOutput = SimpleMarkdown.outputFor(rules, "html");

Now we can use our custom parse and output functions to parse markdown with underlines!

    var syntaxTree = parse("__hello underlines__");
    console.log(JSON.stringify(syntaxTree, null, 4));
    => [
        {
            "content": [
                {
                    "content": [
                        {
                            "content": "hello underlines",
                            "type": "text"
                        }
                    ],
                    "type": "underline"
                }
            ],
            "type": "paragraph"
        }
    ]

    reactOutput(syntaxTree)
    => [ { type: 'div',
        key: null,
        ref: null,
        _owner: null,
        _context: {},
        _store: { validated: false, props: [Object] } } ]

    htmlOutput(syntaxTree)
    => '<div class="paragraph"><u>hello underlines</u></div>'

Basic parsing/output API

`SimpleMarkdown.defaultBlockParse(source)`

Returns a syntax tree of the result of parsing source with the default markdown rules. Assumes a block scope.

`SimpleMarkdown.defaultInlineParse(source)`

Returns a syntax tree of the result of parsing source with the default markdown rules, where source is assumed to be inline text. Does not emit `<p>` elements. Useful for allowing inline markdown formatting in one-line fields where paragraphs, lists, etc. are disallowed.

`SimpleMarkdown.defaultImplicitParse(source)`

Parses source as block if it ends with \n\n, or inline if not.
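As a rough sketch of that rule (not the library's actual code), the starting scope can be derived from the source alone. The helper name `initialStateFor` is hypothetical.

```javascript
// Hypothetical helper: picks the starting scope as described above,
// block if the source ends in "\n\n" and inline otherwise.
function initialStateFor(source) {
  var isBlock = /\n\n$/.test(source);
  return { inline: !isBlock };
}

console.log(initialStateFor("a paragraph\n\n")); // { inline: false }
console.log(initialStateFor("just *inline* text")); // { inline: true }
```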

`SimpleMarkdown.defaultOutput(syntaxTree)`

Returns React-renderable output for syntaxTree.

Note: raw html output will be coming soon

Extension Overview

Elements in simple-markdown are generally created from rules. For parsing, rules must specify match and parse methods. For output, rules must specify a react or html method (or both), depending on which outputter you create afterwards.

Here is an example rule, a slightly modified version of what simple-markdown uses for parsing strong (bold) text:

    strong: {
        match: function(source, state, lookbehind) {
            return /^\*\*([\s\S]+?)\*\*(?!\*)/.exec(source);
        },
        parse: function(capture, recurseParse, state) {
            return {
                content: recurseParse(capture[1], state)
            };
        },
        react: function(node, recurseOutput) {
            return React.DOM.strong(null, recurseOutput(node.content));
        },
        html: function(node, recurseOutput) {
            return '<strong>' + recurseOutput(node.content) + '</strong>';
        },
    },

Let's look at those three methods in more detail.

`match(source, state, lookbehind)`

simple-markdown calls your match function to determine whether the upcoming markdown source matches this rule or not.

source is the upcoming source, beginning at the current position of parsing (source[0] is the next character).

state is a mutable state object to allow for more complicated matching and parsing. The most common field on state is inline, which all of the default rules set to true when we are in an inline scope, and false or undefined when we are in a block scope.

DEPRECATED - use state.prevCapture instead. lookbehind is the string previously captured at this parsing level, to allow for lookbehind. For example, lists check that lookbehind ends with /^$|\n *$/ to ensure that lists only match at the beginning of a new line.

If this rule matches, match should return an object, array, or array-like object, which we'll call capture, where capture[0] is the full matched source, and any other fields can be used in the rule's parse function. The return value from RegExp.prototype.exec fits this requirement, and the common use case is to return the result of someRegex.exec(source).

If this rule does not match, match should return null.
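A match function does not have to use a regex. The sketch below is a hypothetical `@mention` matcher (not part of simple-markdown) that consults state.inline and builds the capture array by hand, with capture[0] as the full matched source.

```javascript
// Hypothetical matcher: accepts "@name" mentions, but only in inline scope.
function matchMention(source, state) {
  if (!state.inline || source[0] !== "@") {
    return null;
  }
  var end = 1;
  while (end < source.length && /[A-Za-z0-9_]/.test(source[end])) {
    end++;
  }
  if (end === 1) {
    return null; // a lone "@" is not a mention
  }
  // capture[0] is the full matched source; capture[1] is extra data
  // for the rule's parse function.
  return [source.slice(0, end), source.slice(1, end)];
}

console.log(matchMention("@aria says hi", { inline: true }));
// [ '@aria', 'aria' ]
console.log(matchMention("@aria says hi", { inline: false })); // null
console.log(matchMention("hi @aria", { inline: true })); // null
```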

NOTE: If you are using regexes in your match function, your regex should always begin with ^. Regexes without leading ^s can cause unexpected output or infinite loops.
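The difference is easy to see in isolation. In this standalone sketch (the sample string is made up), the unanchored regex reports a match that does not start at the current parsing position, so capture[0] is not the upcoming source.

```javascript
var source = "plain text then **bold**";

var unanchored = /\*\*([\s\S]+?)\*\*/.exec(source);
var anchored = /^\*\*([\s\S]+?)\*\*/.exec(source);

// The unanchored regex matches even though the next character to
// parse is "p", and capture[0] is not a prefix of the source.
console.log(unanchored[0]); // "**bold**"
console.log(source.indexOf(unanchored[0]) === 0); // false

// The anchored regex correctly reports no match at this position.
console.log(anchored); // null
```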

`parse(capture, recurseParse, state)`

parse takes the output of match and transforms it into a syntax tree node object, which we'll call node here.

capture is the non-null result returned from match.

recurseParse is a function that can be called on sub-content and state to recursively parse the sub-content. This returns an array.

state is the mutable state threading object, which can be examined or modified, and should be passed as the second argument to any recurseParse calls.

For example, to parse inline sub-content, you can add inline: true to state, or inline: false to force block parsing (to leave the parsing scope alone, you can just pass state with no modifications). For example:

var innerText = capture[1];
recurseParse(
  innerText,
  _.defaults(
    {
      inline: true,
    },
    state
  )
);

parse should return a node object, which can have custom fields that will be passed to output, below. The one reserved field is type, which designates the type of the node, which will be used for output. If no type is specified, simple-markdown will use the current rule's type (the common case). If you have multiple ways to parse a single element, it can be useful to have multiple rules that all return nodes of the same type.

`react(node, recurseOutput, state)`

react takes a syntax tree node and transforms it into React-renderable output.

node is the return value from parse, which has a type field of the same type as the current rule, as well as any custom fields created by parse.

recurseOutput is a function to recursively output sub-tree nodes created by using recurseParse in parse.

state is the mutable state threading object, which can be examined or modified, and should be passed as the second argument to any recurseOutput calls.

The simple-markdown API contains several helper methods for creating rules, as well as methods for creating parsers and outputters from rules.

Extension API

simple-markdown includes access to the default list of rules, as well as several functions to allow you to create parsers and outputters from modifications of those default rules, or even from a totally custom rule list.

These functions are separated so that you can customize intermediate steps in the parsing/output process, if necessary.

`SimpleMarkdown.defaultRules`

The default rules, specified as an object, where the keys are the rule types, and the values are objects containing order, match, parse, react, and html fields (these rules can be used for both parsing and outputting).

`SimpleMarkdown.parserFor(rules)`

Takes a rules object and returns a parser for the rule types in the rules object, in order of increasing order fields, which must be present and a finite number for each rule. In the case of order field ties, rules are ordered lexicographically by rule name. Each of the rules in the rules object must contain a match and a parse function.
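The ordering described above can be sketched as a plain sort. This is an illustration of the documented behavior, not the library's implementation; the rules and their order values are hypothetical.

```javascript
// Hypothetical rules with only the `order` field filled in.
var exampleRules = {
  underline: { order: 20.5 },
  em: { order: 21 },
  strong: { order: 20.5 },
  text: { order: 30 },
};

// Sort rule names by increasing order, breaking ties by rule name.
function sortedRuleNames(rules) {
  return Object.keys(rules).sort(function (a, b) {
    var diff = rules[a].order - rules[b].order;
    if (diff !== 0) {
      return diff;
    }
    return a < b ? -1 : a > b ? 1 : 0;
  });
}

console.log(sortedRuleNames(exampleRules));
// [ 'strong', 'underline', 'em', 'text' ]
```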

`SimpleMarkdown.outputFor(rules, key)`

Takes a rules object and a key that indicates which key in the rules object is mapped to the function that generates the output type you want. This will be 'react' or 'html' unless you are defining a custom output type.

It returns a function that outputs a single syntax tree node of any type that is in the rules object, given a node and a recursive output function.

Putting it all together

Given a set of rules, one can create a single function that takes an input content string and outputs a React-renderable as follows. Note that since many rules expect blocks to end in "\n\n", we append that to source input manually, in addition to specifying inline: false (inline: false is technically optional for all of the default rules, which assume inline is false if it is undefined).

var rules = {
    ...SimpleMarkdown.defaultRules,
    paragraph: {
        ...SimpleMarkdown.defaultRules.paragraph,
        react: (node, output, state) => {
            return <p key={state.key}>{output(node.content, state)}</p>;
        }
    }
};

var parser = SimpleMarkdown.parserFor(rules);
var reactOutput = SimpleMarkdown.outputFor(rules, 'react');
var htmlOutput = SimpleMarkdown.outputFor(rules, 'html');

var blockParseAndOutput = function(source) {
    // Many rules require content to end in \n\n to be interpreted
    // as a block.
    var blockSource = source + "\n\n";
    var parseTree = parser(blockSource, {inline: false});
    var outputResult = htmlOutput(parseTree);
    // Or for react output, use:
    // var outputResult = reactOutput(parseTree);
    return outputResult;
};

Extension rules helper functions

Coming soon

LICENSE

MIT. See the LICENSE file for text.

simple-markdown's Issues

Hash tags for headers don't work if the header is followed by regular text on a new line

if I have the following:

#test
test

The parser fails to turn the header text into a title. The following doesn't work either:

#test#
test

test
#test

It only works with the following:

#test

#test

test

Syntax Highlighting

Can you please add support for syntax highlighting using highlight.js in code blocks?

Unit testing custom rules

I wonder if there is a recommended route for unit testing custom rules?

In my scenario:

I'm modifying some of the default rule match, parse and react functions
I'm adding custom rules
I want to test all of these on strings

A simplified unit test might look something like

  it('Should be able to parse a body with markdown that is simple', function() {
    // custom parser that accepts options and uses simple-markdown at its core
    const actual = parseBody('hello *there* mate', { option: 'foo' })
    const expected = [ 'hello ', md('strong', 'there'), ' mate' ]
    expect(actual).to.deep.equal(expected)
  })

The md function in this test is what I'm having trouble defining.

I tried using React.createElement('strong') because I thought that would be what I'm checking against, but there were some differences between actual and expected (_store, originalProps, etc.).
I tried copying and pasting some of the not exported code from simple-markdown.js into my unit tests to try and recreate how the AST is created in the library (not ideal!). I stopped short of the implementing recursion as that put me on the verge of re-implementing the library in my test suite.
I thought maybe all I really care about in this test is whether it's valid simple-markdown AST or perhaps even just valid React AST, but how best to check this?

Do you have any recommendations please? Thank you.
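One possible approach, offered as a sketch under assumptions rather than as an official recommendation: call a rule's match and parse functions directly and compare plain syntax-tree objects, which sidesteps React element internals such as _store. The rule and the stub recurseParse below are hypothetical stand-ins.

```javascript
// Hypothetical custom rule under test (same shape as the README rules).
var underlineRule = {
  match: function (source) {
    return /^__([\s\S]+?)__(?!_)/.exec(source);
  },
  parse: function (capture, recurseParse, state) {
    return { content: recurseParse(capture[1], state) };
  },
};

// Stub for the recursive parser: wraps the text in a single text node.
function stubRecurseParse(source, state) {
  return [{ type: "text", content: source }];
}

var capture = underlineRule.match("__hi__ there", { inline: true });
var node = underlineRule.parse(capture, stubRecurseParse, { inline: true });

console.log(JSON.stringify(node));
// {"content":[{"type":"text","content":"hi"}]}
```

Because the result is plain data, a deep-equal assertion in any test framework can compare it against an expected object.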

Minimum Required Default Rules (could not find rule to match content)

Thought I'd document this in the issues so that others might find it.

Basically, there's a set of default rules that must be included in your parser and those are:

const bareMinimumDefaultRules = [
  'newline',
  'paragraph',
  'text'
]
const initialDefaults = _pick(SimpleMarkdown.defaultRules, bareMinimumDefaultRules)

Now for context, my original scenario:

// A subset of rules our custom parser wants to extract from the SimpleMarkdown.defaultRules
// Note the absence of `newline, paragraph, text`
const subset = [
  'blockQuote',
  'em',
  'strong',
  'url',
  'link',
  'text'
] 
const initialDefaults = _pick(SimpleMarkdown.defaultRules, subset)

// extendDefaults is custom function to extend the defaults with custom rules
// not included here for brevity
const extendedDefaults = extendDefaults(initialDefaults)

// sometime later in another file
const parser = SimpleMarkdown.parserFor(extendedDefaults)
const reactOutput = SimpleMarkdown.reactFor(SimpleMarkdown.ruleOutput(extendedDefaults, 'react'))

// Make it blockmode as from the repo examples, (I also got by for a while with {inline: true} instead)
const blockSource = source + "\n\n";
return reactOutput(parser(blockSource))

However, when matching with this subset you will run into errors where nothing gets matched by the rules in that subset.

So, as a result there is a 'bare minimum subset' of defaultRules that are required.

This feels fragile though, I feel a safer option would be to have a 'default' case where if no rules are matched, it can use text for example?

I'm going to mark issue as closed, but I'm open to discussion or for someone to tell me I'm doing this wayyy wrong and there's a sensible way of doing what I'm trying to do.

Parser aggressively breaks paragraphs up by punctuation.

If I parse a sentence like this:

md.defaultBlockParse('Lorem Ipsum Dolor. Sit, Amet')

the output I would expect is:

[
    {
        "content": [
            {
                "content": "Lorem Ipsum Dolor. Sit, Amet",
                "type": "text"
            }
        ],
        "type": "paragraph"
    }
]

Instead what I get is:

[
    {
        "content": [
            {
                "content": "Lorem Ipsum Dolor",
                "type": "text"
            },
            {
                "content": ". Sit",
                "type": "text"
            },
            {
                "content": ", Amet",
                "type": "text"
            }
        ],
        "type": "paragraph"
    }
]

For longer documents this creates quite a lot of redundant nodes which leads to bloated output and potential performance issues. I think it'd be great if either the library didn't break text up like this in the first place or had a streamlining step at the end to merge them into a single node.
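A streamlining step along those lines could look like the sketch below. It is a hypothetical post-processing helper (`mergeTextNodes` is not part of simple-markdown) and assumes text nodes have the shape shown above, with a string content field.

```javascript
// Merge runs of adjacent text nodes, recursing into array `content`.
// Nodes are copied, so the input tree is left unmodified.
function mergeTextNodes(nodes) {
  var merged = [];
  nodes.forEach(function (node) {
    var last = merged[merged.length - 1];
    if (node.type === "text" && last && last.type === "text") {
      last.content += node.content;
    } else if (Array.isArray(node.content)) {
      merged.push(
        Object.assign({}, node, { content: mergeTextNodes(node.content) })
      );
    } else {
      merged.push(Object.assign({}, node));
    }
  });
  return merged;
}

var tree = [
  {
    type: "paragraph",
    content: [
      { type: "text", content: "Lorem Ipsum Dolor" },
      { type: "text", content: ". Sit" },
      { type: "text", content: ", Amet" },
    ],
  },
];

var mergedTree = mergeTextNodes(tree);
console.log(mergedTree[0].content.length); // 1
console.log(mergedTree[0].content[0].content);
// "Lorem Ipsum Dolor. Sit, Amet"
```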

Escaped pipe chars in table creates new cells

Hi 👋 Thanks for a great library 🎉 Loving it ❤️

I've been running into an issue within storybook, that uses markdown-to-jsx, which again is using simple-markdown to parse markdown.

The issue I am experiencing can be found described in detail here.

I also made a crude test repo to check that the same base output also is found in the simple-markdown syntaxTree, which turned out to be the case: https://github.com/robaxelsen/markdown-to-jsx-table-pipe-test

Any chance someone could help fix this, or help illuminate the logic behind ignoring the escaped pipe characters, please? Any help greatly appreciated 🙏

Simplify regular expressions

Hello. I'd like to simplify the codebase a bit. I really like doing that.

What about simplifying the regular expressions by writing them as literals and getting their strings from the source property? It greatly decreases the number of backslashes and improves readability.

Example:

/\r\n?/g.source === '\\r\\n?' // true

Note that I'm not 100% sure about browser compatibility yet; there are many question marks in the MDN compatibility table. But I have used this on production websites and there were no issues in the browsers I needed to support. If the idea sounds good, I will test on many different browsers. By the way, is there a list of supported browsers somewhere?
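To make the idea concrete, here is a small standalone sketch of composing a pattern from regex literals through their source property. The variable names are hypothetical and the pattern mirrors the strong rule from the README.

```javascript
// A literal's `source` is the pattern text, with no hand-escaped backslashes.
var CR_NEWLINE = /\r\n?/g;
console.log(CR_NEWLINE.source === "\\r\\n?"); // true

// Compose small literals into one anchored pattern.
var OPEN = /\*\*/;
var BODY = /([\s\S]+?)/;
var NOT_STAR = /(?!\*)/;
var strong = new RegExp(
  "^" + OPEN.source + BODY.source + OPEN.source + NOT_STAR.source
);

console.log(strong.exec("**bold** text")[1]); // "bold"
```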

Case where italics do not parse

One of our users reported a case where italics were not parsing. Anytime you have an italic phrase ending with a single letter, it does not parse.

Example:

> JSON.stringify(SimpleMarkdown.defaultBlockParse('*test i*'))
'[{"content":[{"content":"*test i","type":"text"},{"content":"*","type":"text"}],"type":"paragraph"}]'

Each child in an array should have a unique "key" prop

How to deal with this warning? How can I assign keys to rendered output?

Unable to escape ordered list

I'm trying to get the following string to render as a paragraph:
"1. This is some text"

This gets converted into an ordered list, adding indentation. Is there any way to prevent ordered list rule from picking this up?

Not parsing more than one rule

I added custom rules and it only parses the first match. It seems to be a problem with my rules because it works with the default rules.

The code is here and if anyone can help find the problem, that'd be appreciated.

Handle nested bold and italics correctly

See #19 for a potential solution to this.

Error on custom extension

I set up the extension code with underlineRule from README.md. At first everything seemed to work, but I figured out that this is only because the rule is already in the default ruleset. So, when I change the regex to anything else and parse a string that matches, I get the following error:

 Uncaught TypeError: rules[ast.type][property] is not a function  //simple-markdown.js: 1293

Another parser issue, this time with links

Discord recently added Webhooks, which allow services to post custom links into chat (we don't allow this otherwise to protect users from bad links). I noticed during testing that links containing escaped markdown, like \) do not properly escape.

For example, [test link](https://test.link/\(test\)) becomes test link) with simple-markdown. GitHub renders this properly: test link

Here's a console output too:

> JSON.stringify(SimpleMarkdown.defaultBlockParse('[test link](https://test.link/\(test\))'))
'[{"content":[{"content":[{"content":"test link","type":"text"}],"target":"https://test.link/(test","type":"link"},{"content":")","type":"text"}],"type":"paragraph"}]'

could not find rule to match content:

In my case, the error could not find rule to match content: was raised because the rules do not match a block source that lacks the trailing \n\n, and the default option inline: false makes the rules expect a block.
The first match was a block, but its nested substring was not, so the text rule skipped it.

I solved it by setting stateNested.inline = true in return of the parse function.

parse: function(capture, parse, state) { 
  var stateNested = Object.assign({}, state, {inline: true})  //important to clone!
  return { content: parse(capture[1], stateNested)}
}

Might be useful for documentation

Trailing commas

Getting a whole bunch of errors when building warning about IE8. I'm removing the trailing commas manually for now but just thought I'd report it

scripts/simple-markdown.js:590: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                type: 'hr',
                ^

scripts/simple-markdown.js:777: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                type: ListWrapper,
                ^

scripts/simple-markdown.js:783: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                            type: 'li',
                            ^

scripts/simple-markdown.js:845: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                target: target,
                ^

scripts/simple-markdown.js:852: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                def: def,
                ^

scripts/simple-markdown.js:885: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                    type: 'tr',
                    ^

scripts/simple-markdown.js:890: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                                type: 'td',
                                ^

scripts/simple-markdown.js:907: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                type: 'table',
                ^

scripts/simple-markdown.js:911: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                        type: 'thead',
                        ^

scripts/simple-markdown.js:915: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                                type: 'tr',
                                ^

scripts/simple-markdown.js:926: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                        type: 'tbody',
                        ^

scripts/simple-markdown.js:976: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                type: 'div',
                ^

scripts/simple-markdown.js:1068: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                type: 'a',
                ^

scripts/simple-markdown.js:1102: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                type: 'img',
                ^

scripts/simple-markdown.js:1156: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                type: 'strong',
                ^

scripts/simple-markdown.js:1174: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                type: 'u',
                ^

scripts/simple-markdown.js:1217: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                type: 'em',
                ^

scripts/simple-markdown.js:1235: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                type: 'del',
                ^

scripts/simple-markdown.js:1257: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                type: 'code',
                ^

scripts/simple-markdown.js:1275: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
                type: 'br',
                ^

scripts/simple-markdown.js:1349: ERROR - Parse error. IE8 (and below) will parse trailing commas in array and object literals incorrectly. If you are targeting newer versions of JS, set the appropriate language_in option.
    defaultRules: defaultRules,
    ^

Document that users should sanitize HTML

I noticed that HTML is just passed through. This can have unwanted side effects if this plugin is used on a platform where users can provide the content, which is then displayed to other users.

Here is a demonstration of the problem: https://bl.ocks.org/domoritz/raw/05254ae24b0a69b6e5dbe8a5718ab506/cdf4e1ccf44f193de06f98d78bfa486ee1b53742/.

While it is not necessarily the responsibility of this module to sanitize HTML, I'd say there are two options. First, document in the readme that this module does not sanitize HTML. Second, sanitize all HTML. This would unfortunately mean that one cannot write HTML in markdown anymore but maybe that is not the end of the world.
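For anyone building HTML strings in a custom rule, one common precaution is to escape text before interpolating it. The sketch below is a generic illustration with hypothetical names (`escapeHtml`, `mentionHtml`), not simple-markdown's own sanitizer.

```javascript
// Hypothetical helper: escape the characters that are significant in HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#x27;");
}

// Example `html` output function for a hypothetical custom rule.
function mentionHtml(node) {
  return '<span class="mention">@' + escapeHtml(node.name) + "</span>";
}

console.log(mentionHtml({ name: "<script>alert(1)</script>" }));
// <span class="mention">@&lt;script&gt;alert(1)&lt;/script&gt;</span>
```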

ReDoS on link regex

The regex for the link rule is subject to DoS when there is a malformed link.

Working repro:

SimpleMarkdown.defaultBlockParse('[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n[+test)[]\n');

Handle inline code inside bold/italics/links

Currently, _italic text and code with_ pointers* doesn't match correctly. [Note: apparently it doesn't match correctly in gfm, either, so that's some validation :/.]

Handling this correctly (especially with simple-markdown's regex-based design) is tricky, but important.

We might have to write a custom matcher function for the rules where this is most important, instead of using regex matchers. For an example, see the math matching in Perseus.

Typescript definition files?

Where can I find Typescript definition files for this package?

Nested unordered lists parsing/indentation issue

Running into an issue where nested bullets are improperly indented.

Some background, I use react-rte (https://github.com/sstur/react-rte) to convert a user's rich text input to markdown which I pass to a Markdown component:

import React from 'react';
import SimpleMarkdown from 'simple-markdown';

import Content from 'components/Content';

const Markdown = ({ string, className }) => {
  if (!string) {
    return null;
  }

  const mdParser = SimpleMarkdown.defaultBlockParse;
  const mdOutput = SimpleMarkdown.defaultOutput;

  const syntaxTree = mdParser(string);

  const content = mdOutput(syntaxTree);

  return (
    <Content className={className}>
      {content}
    </Content>
  );
};

export default Markdown;

An example of the issue -- when I pass the following string (generated by the react-rte lib) to the Markdown component above:

- test one
    - thing one
- test two
    - thing two
    - one more thing two
- test three
    - thing three
- test four

I end up with the following output. Notice that the "one more thing two" item is treated as a new ul when it should just be another list item of the current ul:

Does not happen for ordered lists.

Here is a link to the syntaxTree output generated by the mdParser function.
Let me know if there is any additional info you need to help figure out what's going on. Thanks.

Typescript, dependency issues and missing definitions

Using version 7.1, the type definitions for markdown.htmlFor and markdown.ruleOutput are missing, and I'm running into an issue with the dependencies listed in the package:

"@types/node": ">=10.0.0",
"@types/react": ">=16.0.0"

I'm using @types/node@^12.0.0, which is causing a ton of compile errors. The @types packages should be listed as peerDependencies.

Line break in Simple Breakdown

I am getting standard markdown from my server as a string. I am trying to figure out if simple-markdown supports the standard line break and paragraph rules, i.e. "\n " (newline followed by 2 spaces) is a line break vs. "\n\n" is a paragraph break.

What I am observing is that "\n\n" indeed works as a paragraph break, however "\n " gives a line break followed by 2 spaces. Is this intentional or a bug? Can you please clarify?

modernize code

it would be cool to use es6/7 in the code to make it easier to maintain

Documentation: docs for the parsed syntax tree types

Could you provide documentation for the parsed syntax tree types, the mapping from supported markdown tags to those types, etc.?

Would be really helpful in creating a custom "renderer".

em/strong should handle \-escapes correctly

Currently, they end on * or ** without regard to whether it is escaped or not.
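One way to respect escapes is to find the closing delimiter with a small scanner instead of a plain regex. The sketch below is only an illustration; `indexOfUnescaped` is a hypothetical helper, not part of simple-markdown:

```javascript
// Returns the index of the first occurrence of `delimiter` in `source`,
// at or after `from`, that is not escaped by a backslash. Returns -1 if
// there is none.
function indexOfUnescaped(source, delimiter, from) {
  let i = from || 0;
  while (i < source.length) {
    if (source[i] === '\\') {
      i += 2; // skip the backslash and the character it escapes
      continue;
    }
    if (source.startsWith(delimiter, i)) {
      return i;
    }
    i++;
  }
  return -1;
}
```

With this, an em matcher would stop at the second `*` in `a \* b * c` rather than at the escaped one.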

Allow one level of balanced parens in link urls, per CommonMark

See the section on this in the CommonMark spec

simple-markdown does not try to conform to CommonMark, but does try to be compatible when possible and choose sensible defaults, which CommonMark also tries to do.
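To make "one level of balanced parens" concrete, here is a small sketch of a URL pattern that allows parenthesized groups, but no nesting inside them. The regex and the `matchLinkUrl` name are assumptions for illustration, not the library's actual link rule:

```javascript
// Matches a link destination that may contain one level of balanced
// parentheses, e.g. "http://a.com/Foo_(bar)". Stops at whitespace or at
// an unbalanced parenthesis.
const URL_WITH_PARENS = /^(?:[^\s()]|\([^\s()]*\))+/;

function matchLinkUrl(source) {
  const capture = URL_WITH_PARENS.exec(source);
  return capture ? capture[0] : null;
}
```

For example, in `[text](http://a.com/Foo_(bar))` the URL portion would be read as `http://a.com/Foo_(bar)`, leaving the final `)` to close the link.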

Why the extra \n?

Stumbled upon an issue with headings here Benjamin-Dobell/react-native-markdown-view#2 that is based on simple-markdown.

Have only looked briefly into the code, but what's the reason for the last \n here?
https://github.com/Khan/simple-markdown/blob/master/simple-markdown.js#L558

Do you parse the whole text at once, can this be another solution?
http://regexr.com/3gc2f or simply /^(#{1,6})([^\n]*)/gm?

So that you don't need to have an extra line break after a #. Can't find any specification for markdown that says you need to specify this extra line break? Please enlighten us @ariabuckles :)
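For reference, this is roughly how the suggested regex behaves when applied to a whole document at once. It is only a sketch of the proposal, and `findHeadings` is a hypothetical helper:

```javascript
// The regex suggested above, applied globally with the multiline flag.
const HEADING = /^(#{1,6})([^\n]*)/gm;

function findHeadings(source) {
  return Array.from(source.matchAll(HEADING), function (m) {
    return { level: m[1].length, text: m[2].trim() };
  });
}
```

Note that this pattern also treats a line such as `#tag` as a level-1 heading, since it does not require a space after the hashes.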

using linkify-it

I'm trying to use linkify-it with simple-markdown because the links don't fully work the way I want it to.
I tried doing this:

//import linkify-it
import linkify from 'linkify-it'
const linkifyInstance = linkify();

//... later in the code
match: function(source) {
   return linkifyInstance.match(source)
},

but I get this error:
Error in render: "Error: `match` must return a capture starting at index 0 (the current parse index). Did you forget a ^ at the start of the RegExp?"
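The error message suggests that `match` has to return something shaped like a `RegExp` capture whose full match begins at index 0 of the remaining source, whereas a link finder returns a list of links found anywhere in the text. A possible adapter is sketched below. `matchLeadingLink` is a hypothetical name, the finder is injected to keep the example self-contained, and the `index`, `raw` and `url` fields on each result are assumptions about what the finder returns:

```javascript
// Adapts a link finder to a capture-like return value.
// With linkify-it, `findLinks` might be
//   (text) => linkifyInstance.match(text)
function matchLeadingLink(source, findLinks) {
  const links = findLinks(source);
  if (!links || links.length === 0) {
    return null;
  }
  const first = links[0];
  if (first.index !== 0) {
    return null; // the link starts later; let other rules consume text first
  }
  // Mimic a RegExp capture: [fullMatch, ...groups]
  return [first.raw, first.url];
}
```

This alone may not be enough: if another rule consumes plain text up to and past the start of a link, the custom rule will never be asked to match at that position.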

This is what linkifyInstance.match(source) contains:

flow issue

gonna try a fix on this

Text is cut on Android after upgrade to Expo SDK 33 and react 16.8.3

We noticed that after upgrading to Expo SDK 33 and react 16.8.3, under some conditions the text is cut off at the end of paragraphs. We work with Hebrew (RTL).
It happens only on Android with larger screens. It looks like the internal size of a view is calculated incorrectly from the text. Adjusting padding or margins in styles does not fix it. We use simple-markdown via react-native-markdown-view; we tried switching to another markdown package, but it has the same issue.
What could be the reason for this, and how can it be fixed?

strong rule parse issue

When I use simple-markdown to parse the string **con**nect, it renders Conconnect. The rule I use is defaultRules.strong, but if I rewrite the react function of the rule as follows:

        react: function(node, output, state) {
            return <strong key={state.key}>{output(node.content, state)}</strong>;
        }

the error goes away. What might cause this problem?

Prettier

Continuing part of discussion from #55

I understand that prettier more likely will break flow annotations in comments, so it may require using syntax without comments and https://github.com/flowtype/flow-remove-types or something like that.

On the plus side it simplifies contributing (I really don't like to format my code manually anymore) and also tooling support is better, e.g. taking into account just vscode: flow/flow-for-vscode#128, flow/flow-for-vscode#96, flow/flow-for-vscode#46

So I think it's worth it. What do you think?

Cells with periods and commas seem to parse incorrectly

If there is a simple table with decimals and commas, it seems to create new lines for them...

This simple table...

Markdown	Test
Periods	Commas
25.7	1,000

Produces...

Flow error: Cannot assign function to match because:

Full error:


Error ┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈ node_modules/simple-markdown/simple-markdown.js:465:39

Cannot assign function to match because:
 • property 0 is missing in object type [1] but exists in Capture [2] in the return value.
 • ... 1 more error.

     node_modules/simple-markdown/simple-markdown.js
 [2]  78│ ) => ?Capture;
        :
     462│
     463│ // Creates a match function for an inline scoped element from a regex
     464│ var inlineRegex = function(regex /* : RegExp */) {
     465│     var match /* : MatchFunction */ = function(source, state) {
     466│         if (state.inline) {
     467│             return regex.exec(source);
     468│         } else {
     469│             return null;
     470│         }
     471│     };
     472│     match.regex = regex;
     473│     return match;
     474│ };

     /private/tmp/flow/flowlib_102fce2d/core.js
 [1] 288│ type RegExp$matchResult = Array<string> & {index: number, input: string, groups: ?{[name: string]: string}};

I am using flow version 0.86

Edit:
there are a few more errors flow 0.86 complains about. If there is interest, I can post their error messages as well.

Replacing new lines with breaks

Hi, I can't seem to figure out how to get a rule working for new lines (that I will eventually use to replace new lines with breaks). What I'm trying is:

const SimpleMarkdown = require('simple-markdown')

var underlineRule = {
  order: SimpleMarkdown.defaultRules.em.order - 0.5,
  match: function(source) {
      return /^__([\s\S]+?)__(?!_)/.exec(source);
  },
  parse: function(capture, parse, state) {
      return {
          content: parse(capture[1], state)
      };
  }
};

const newlineRule = {
  order: SimpleMarkdown.defaultRules.newline.order - 0.5,
  match: source => {
    return /^(\r\n|\r|\n)/.exec(source)
  },
  parse: (capture, recurseParse, state) => {
    return {
      content: recurseParse(capture[1], state)
    }
  }
}

var rules = {
  ...SimpleMarkdown.defaultRules, 
  myUnderline: underlineRule,
  myNewline: newlineRule
}

var rawBuiltParser = SimpleMarkdown.parserFor(rules);
var parse = function(source) {
    var blockSource = source + "\n\n";
    return rawBuiltParser(blockSource, {inline: false});
};


console.log(JSON.stringify(parse('__some__ foo\n\na new\nparagraph'), null, 2))

The output is:

[
  {
    "content": [
      {
        "content": [
          {
            "content": "some",
            "type": "text"
          }
        ],
        "type": "myUnderline"
      },
      {
        "content": " foo",
        "type": "text"
      }
    ],
    "type": "paragraph"
  },
  {
    "content": [
      {
        "content": "a new\nparagraph",
        "type": "text"
      }
    ],
    "type": "paragraph"
  }
]

The first item is correct: my underline shows up within the paragraph. But I can't get my new line rule working.

ReDoS attack with inline code blocks

Our users found another ReDoS, this time with inline code blocks.

Here's a repro:

const d = '`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
 ';
const start = Date.now();
SimpleMarkdown.defaultInlineParse(d);
console.log(`${Date.now() - start}ms`);
> 3936ms

With nested parsing intermixed with other markdown syntax, this can trigger very long parses (upwards of 30 seconds for us), since this regex gets triggered again and again.

how to block Source end with one \n

when one \n output paragraph

Commas in output

I'm seeing commas in the parser output (see here: lwansbrough/react-native-markdown#4) -- perhaps having a look at my implementation code might also help debug this: https://github.com/lwansbrough/react-native-markdown/blob/master/Markdown.js#L40

Here's what it looks like: https://i.imgur.com/zIzROOt.png

Custom rules

Is there any way I could make my own rule, such as spoilers, or something custom I'd like to do?

Move defaultRules to a separate module?

Hi!

I'm facing a task of implementing a subset of markdown for our webapp, that basically includes making links clickable, link images becoming images, having bold and italic texts. That's about as much as we need. I tried to find an existing library that does that, and your repo is the closest thing to it. However, since I will only need a subset of defaultRules, it looks like an overhead to load all of them on the browser. Do you think it would be possible to extract defaultRules to a separate repo to make it more modular/customizable?

Enable the rules I want

Is there any way to only enable the rules I want?
For example, I only want my custom rules + bold, link and nothing else.
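Since the rules are a plain object keyed by rule name, one approach is to build a smaller object and hand that to the parser builder. The helper below is a sketch; `pickRules` is a hypothetical name:

```javascript
// Builds a new rules object containing only the named rules.
// Throws on unknown names so a typo does not silently drop a rule.
function pickRules(allRules, names) {
  const picked = {};
  names.forEach(function (name) {
    if (!Object.prototype.hasOwnProperty.call(allRules, name)) {
      throw new Error('Unknown rule: ' + name);
    }
    picked[name] = allRules[name];
  });
  return picked;
}
```

Usage might look like `SimpleMarkdown.parserFor(Object.assign(pickRules(SimpleMarkdown.defaultRules, ['strong', 'link', 'text']), myCustomRules))`. Including `text` here is an assumption that the parser needs a fallback rule for plain text.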

Why paragraphs are rendered with divs?

It seems very unusual in comparison to other libraries. Would be nice to have at least a comment:

https://github.com/Khan/simple-markdown/blob/7627d864445616b6c749399bac2f64e3185a5434/simple-markdown.js#L1049-L1061

Silent break on Firefox when order is NaN

From #8

order: SimpleMarkdown.defaultRules.link -0.5

is actually NaN; it should be

order: SimpleMarkdown.defaultRules.link.order -0.5  // .order

Interestingly, it does work on Chrome and breaks silently on Firefox! :)
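The NaN comes from subtracting a number from the rule object itself. A comparison that involves NaN gives a sort comparator no consistent answer, which is possibly why browsers differ here. Below is a small sketch with a stand-in rule object and a hypothetical guard, `assertFiniteOrders`, that fails loudly instead:

```javascript
// Stand-in for SimpleMarkdown.defaultRules.link; the order value is made up.
const fakeRule = { order: 17 };

const wrong = fakeRule - 0.5;       // object coerced to a number: NaN
const right = fakeRule.order - 0.5; // 16.5

// Hypothetical guard to run before building a parser.
function assertFiniteOrders(rules) {
  Object.keys(rules).forEach(function (name) {
    if (!Number.isFinite(rules[name].order)) {
      throw new Error('Rule "' + name + '" has a non-finite order');
    }
  });
}
```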

Unable to install '../simple-markdown: Appears to be a git repo or submodule.'

npm ERR! path /Users/bar/commons/node_modules/simple-markdown
npm ERR! code EISGIT
npm ERR! git /Users/foo/bar/node_modules/simple-markdown: Appears to be a git repo or submodule.
npm ERR! git     /Users/foo/bar/node_modules/simple-markdown
npm ERR! git Refusing to remove it. Update manually,
npm ERR! git or move it out of the way first.

Rogue return character in the reactFor array?

I have no idea why this custom rule is showing a return character in the reactOutput array (see below for the output)

I'm trying to make a rule that matches >!spoilers!<, which in this case should take the text spoilers and render that text in react.

The Markdown component:

const Markdown: React.FC<MarkdownProps> = ({ theme, content }) => {
  const rules = generateRules(theme);
  const rawBuiltParser = SimpleMarkdown.parserFor(rules);
  const parse: Parser = (source) => {
    const blockSource = source + '\n\n';
    return rawBuiltParser(blockSource, { inline: false });
  };
  const reactOutput = SimpleMarkdown.reactFor(SimpleMarkdown.ruleOutput(rules, 'react'));
  console.log(reactOutput(parse(content)));
  return <View>
    {reactOutput(parse(content))}
  </View>;
};

The spoiler rule, which gets merged into the default rules with lodash _.merge():

    spoilers: {
      order: SimpleMarkdown.defaultRules.blockQuote.order - 0.5,
      match: source => /^>!([\s\S]+?)!</.exec(source),
      react: (node, output, state) => createElement(Text, {
        style: { textDecorationLine: 'line-through' },
        key: state.key
      }, output(node.content, state)),
      parse: (capture, parse, state) => ({
        content: parse(capture[1], state)
      })
    },

Reverse/recursive regex matching

Hi, my extension matches the [string1]{string2} pattern. I need to match the same or a similar pattern (one that includes characters that break other patterns) when it is nested like this:

Nested with [some [text]{inside}<-M1 ]{and [a nested]{text inside}<-M2 }<-M3 pattern.

in the inside out order, so that it first evaluates match1 and match2 then match3 with replaced values of match1 and match2.

Nested with [ some M1]{and M2} pattern.

and finally

Nested with M3 pattern.

I was hoping that it can be done with the order of the extensions?

Another idea was to match the outermost match M3 and then call the matching on its content, convert to a string and filter out to get the innerText, but it seems like not the right approach.

Here are my extensions:

{
      rule1: {
        order: SimpleMarkdown.defaultRules.em.order + 0.5 // +
      , match: function(source) {
          return /^\[([^\[\}]*\]\{[^\[\}]*)\}/.exec(source)
        }
      , parse: function(capture, parse, state) {
          return {
            content: parse(capture[1], state) // group drops first '[' and last '}'
          }
        }
      , react: function(node, output) {
          return <u children={output(node.content)} />
        }
      }
    , rule2: {
        order: SimpleMarkdown.defaultRules.em.order - 0.5 // -
      , match: function(source) {
          return /^(\[[^\[\}]*\]\{[^\[\}]*aaa\})/.exec(source)
        }
      , parse: function(capture, parse, state) {
          return {
            content: parse('matched_rule2', state) // directly replacing
          }
        }
      , react: function(node, output) {
          return <u children={output(node.content)} />
        }
      }

    }

the following will replace with rule2:
[as []{aaa} fd]{sdf} => [as matched_rule2 fd]{sdf}
and this with rule1:
[as fd]{sdf} => as fd]{sdf

What it would need to do is first replace with rule2, so that it removes the brackets that break the rule1 pattern, and then replace with rule1.

so expected would be:
[as []{aaa} fd]{sdf} => as matched_rule2 fd]{sdf

Thanks in advance
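One possible alternative to regex ordering is a hand-written matcher that counts bracket depth, so the outermost [text]{arg} is matched as a whole and the inner ones are left for a recursive parse of the captured text. This works outside-in rather than inside-out, so it is only a sketch of a different approach; `findClose` and `matchBracePair` are hypothetical names:

```javascript
// Returns the index just past the delimiter that closes the one at
// `start`, counting nested open/close pairs, or -1 if unbalanced.
function findClose(source, start, open, close) {
  let depth = 0;
  for (let i = start; i < source.length; i++) {
    if (source[i] === open) {
      depth++;
    } else if (source[i] === close) {
      depth--;
      if (depth === 0) {
        return i + 1;
      }
    }
  }
  return -1;
}

// Matches the outermost "[text]{arg}" at the start of `source`.
// Returns [fullMatch, text, arg] like a RegExp capture, or null.
function matchBracePair(source) {
  if (source[0] !== '[') {
    return null;
  }
  const textEnd = findClose(source, 0, '[', ']');
  if (textEnd === -1 || source[textEnd] !== '{') {
    return null;
  }
  const argEnd = findClose(source, textEnd, '{', '}');
  if (argEnd === -1) {
    return null;
  }
  return [
    source.slice(0, argEnd),
    source.slice(1, textEnd - 1),
    source.slice(textEnd + 1, argEnd - 1),
  ];
}
```

A rule's `parse` function could then call the recursive parse on the captured text and arg, so nested occurrences become child nodes.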

npm out of sync for 0.3.1?

Hi, first of all thanks for all the work on this project!

I'm wondering if the code on npm is out of sync with this repo and needs updating? It shows 0.3.1 in the releases here, but there is code that is not present on the 0.3.1 on npm.

Specifically the reactElement function

npm 0.3.1

    // This should just override an already present element._store, which
    // exists so that the class of this object doesn't change in V8
    element._store = {
        validated: true,
        originalProps: element.props
    };
    return element;
};

github 0.3.1

    return {
        $$typeof: TYPE_SYMBOL,
        type: type,
        key: key,
        ref: null,
        props: props,
        _owner: null
    };

The reason I noticed is because I'm trying to implement unit testing for my own rules, which I will create a separate ticket for.

Thanks again.

way to access context on react rendering

I want to know if there is a way for me to access the context object when rendering the markdown to react elements

New rule weighting logic sometimes does not choose a rule

Getting an occasional simple-markdown.js?eb4d:145 Uncaught TypeError: currRule.match is not a function on 0.3.0. This does not happen with 0.2.2.
