free-programming-books-parser's Issues

Parser doesn't take into account bold format in notes

Given the current code:

} else {
  // for now we assume that all previous ifs are mutually exclusive with this, may polish later
  if (i.type === "emphasis") {
    // this is the emphasis, add it in boldface and move on
    s += "*" + i.children[0].value + "*";
  } else if (i.type === "link") {
    // something has gone terribly wrong. this book must be viewed and edited manually.
    entry.manualReviewRequired = true;
    break;
  } else {
    // hopefully this is the end of the note
    let rightParen = i.value.indexOf(")");
    if (rightParen === -1) {
      // we have to go AGAIN
      s += i.value;
    } else {
      // finally, we have reached the end of the note
      entry.notes.push(stripParens(s + i.value.slice(0, rightParen + 1)));
      s = "";
      // this is a copypaste of another block of code. probably not a good thing tbh.
      leftParen = i.value.indexOf("(");
      while (leftParen != -1) {
        rightParen = i.value.indexOf(")", leftParen);
        if (rightParen === -1) {
          // there must be some *emphasis* found
          s += i.value.slice(leftParen);
          break;
        }
        entry.notes.push(i.value.slice(leftParen + 1, rightParen));
        leftParen = i.value.indexOf("(", rightParen);
      }
    }
  }

If bold formatting is found, i.value is undefined and the program crashes. The code should check whether i.type == "strong" or whether the node has i.children.

Resources affected:

In general, we should extend the fix in depth to other inline formats, such as emphasis (already handled), bold, code, image...
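A minimal sketch of such a check, assuming the nodes follow the usual Markdown AST shape (leaf nodes carry a string value, formatted nodes carry children). The helper name inlineText and the WRAPPERS table are hypothetical and not part of the parser:

```javascript
// Hypothetical helper: Markdown markers for the node types covered here.
const WRAPPERS = { emphasis: "*", strong: "**" };

// Returns the text of an inline node without assuming that `value` exists.
function inlineText(node) {
  // Leaf nodes such as text or inline code carry their content in `value`.
  if (typeof node.value === "string") return node.value;
  // Formatted nodes (emphasis, strong, ...) carry it in `children`.
  if (Array.isArray(node.children)) {
    const inner = node.children.map(inlineText).join("");
    const mark = WRAPPERS[node.type] || "";
    return mark + inner + mark;
  }
  // Nodes with neither field (for example images) contribute nothing.
  return "";
}
```

In the note-building loop, calling a helper like this instead of reading i.value directly would avoid calling indexOf on undefined when a bold node appears.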

Improve file media type extraction from directory name

It seems that the function in charge of extracting the file type doesn't work well.

/**
 * Retrieves the folder name from a string representing a directory and file
 * @param {String} dir A string representing a directory in the format "./directory/file"
 * @returns {String} The extracted directory name
 */
function getMediaFromDirectory(dir) {
  const slash = dir.lastIndexOf("/");
  let mediaType = dir.slice(2, slash);
  return mediaType;
}

It always returns "fpb" instead of "books", "courses"...

See https://raw.githubusercontent.com/EbookFoundation/free-programming-books-search/main/fpb.json

It is even worse if an unsanitized path is provided or the parser is executed with customized inputs.

Tasks

  • Sanitize the input so that it is independent of the OS.
  • Extract the right slug in both cases: when the input parameter is a file and when it is a directory.
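The two tasks above could be approached along these lines. This is only a sketch: the function name getMediaFromPath is hypothetical, and treating any last segment that has an extension as a file is an assumed heuristic, not something the parser currently does.

```javascript
const path = require("path");

// Hypothetical replacement for getMediaFromDirectory.
function getMediaFromPath(input) {
  // Convert Windows separators so the result does not depend on the OS,
  // then collapse "./", "//" and similar noise.
  const normalized = path.posix.normalize(input.replace(/\\/g, "/"));
  // Drop trailing slashes so "books/" and "books" behave the same.
  const trimmed = normalized.replace(/\/+$/, "");
  const last = path.posix.basename(trimmed);
  // Assumption: a last segment with an extension is a file,
  // so the media type is the name of its parent directory.
  if (path.posix.extname(last) !== "") {
    return path.posix.basename(path.posix.dirname(trimmed));
  }
  // Otherwise the input is taken to be the directory itself.
  return last;
}
```

With this sketch, "./fpb/books/free-programming-books.md", "./fpb/books" and ".\fpb\books\" would all yield "books".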

Improve extraction of section texts from Markdown headings

According to the current code...

if (item.type == "heading") {
  if (item.depth == 3) {
    // Heading is an h3
    currentDepth = 3;
    let newSection = {
      section: item.children[0].value, // Get the name of the section
      entries: [],
      subsections: [],
    };
    sections.push(newSection); // Push the section to the output array
  } else if (item.depth == 4) {
    // Heading is an h4
    currentDepth = 4;
    let newSubsection = {
      section: item.children[0].value, // Get the name of the subsection
      entries: [],
    };
    sections[sections.length - 1].subsections.push(newSubsection); // Add to subsection array of most recent h3
  }
} else if (item.type == "list") {
  item.children.forEach((listItem) => {

it seems that the parser supports neither HTML anchor aliases nor Markdown syntax in headings. It takes for granted that children[0] will be plain text.

It is necessary to run an ES6 Array.reduce over the heading's item.children, taking all cases into account, in order to rebuild the desired text. Cases by type:

  • html, link: ignore the node. It may be an anchor alias or a "back to top"/upper-section link
  • text: append value
  • emphasis: append the children's values wrapped in _ (italic Markdown tokens)
  • strong: append the children's values wrapped in ** (bold Markdown tokens)
  • ...
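The cases listed above could be sketched as a reduce like the following. The function name getHeadingText is hypothetical, and the fallback for node types not named in the list is an assumption:

```javascript
// Hypothetical helper: rebuilds the heading text from all of its children.
function getHeadingText(children) {
  const text = children.reduce((acc, node) => {
    switch (node.type) {
      case "html":
      case "link":
        // Anchor aliases and "back to top" links are not part of the title.
        return acc;
      case "text":
        return acc + node.value;
      case "emphasis":
        return acc + "_" + getHeadingText(node.children) + "_";
      case "strong":
        return acc + "**" + getHeadingText(node.children) + "**";
      default:
        // Assumed fallback: use nested children if present, else the raw value.
        return acc + (node.children ? getHeadingText(node.children) : node.value || "");
    }
  }, "");
  // Ignored nodes can leave stray spaces at either end.
  return text.trim();
}
```

The section name would then come from getHeadingText(item.children) rather than item.children[0].value.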

Improve Index section detection

Resolve these TODOs across localized files

// find where Index ends
// probably could be done better, review later
let i = 0,
  count = 0;
for (i; i < tree.length; i++) {
  if (tree[i].type == "heading" && tree[i].depth == "3") count++;
  if (count == 2) break;
}
tree.slice(i).forEach((item) => {
  // Start iterating after Index
  try {
    if (item.type == "heading" && item.children[0].value == "Index") return;
    if (item.type == "heading") {
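One possible way to tidy the TODO above is to locate the end of the Index purely by position, so that nothing depends on the heading text being the English word "Index". This is a sketch under the assumption that every localized file places its index under the first h3; the function name findIndexEnd is hypothetical:

```javascript
// Hypothetical helper: position of the second h3, i.e. the first
// section after the index. Returns tree.length when there is none.
function findIndexEnd(tree) {
  let headings = 0;
  for (let i = 0; i < tree.length; i++) {
    const node = tree[i];
    // Compare depth as a number instead of against the string "3".
    if (node.type === "heading" && Number(node.depth) === 3) {
      headings++;
      if (headings === 2) return i;
    }
  }
  return tree.length;
}
```

Iterating over tree.slice(findIndexEnd(tree)) would start at the first real section, which would make the later comparison against the literal "Index" unnecessary.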

part of EbookFoundation/free-programming-books#6988 (comment)

The word "Index" is not translated according to the file locale. E.g.:

Parser doesn't take into account resources organized in sublists (fascicles/parts)

Improve title text extraction

According to the current code

const [link, ...otherStuff] = listItem; // head of listItem = url, the rest is "other stuff"
entry.url = link.url;
entry.title = link.children[0].value;
// remember to get OTHER STUFF!! remember there may be multiple links!

the first node, children[0], is used as the resource title without checking whether there are more meaningful tokens. The rest is stripped, which sometimes makes it difficult to search resources by title.

Therefore, escaping is needed in the title part of resource links when submitting, and rebuilding the Markdown here is mandatory.
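A rough sketch of rebuilding the title from every child of the link node instead of only the first one. The function name getLinkTitle is hypothetical, and the set of node types handled (text, inlineCode, emphasis, strong) is an assumption about what titles may contain:

```javascript
// Hypothetical helper: rebuilds the Markdown of a link's title.
function getLinkTitle(link) {
  const render = (node) => {
    switch (node.type) {
      case "text":
        return node.value;
      case "inlineCode":
        return "`" + node.value + "`";
      case "emphasis":
        return "_" + node.children.map(render).join("") + "_";
      case "strong":
        return "**" + node.children.map(render).join("") + "**";
      default:
        // Assumed fallback for any other node type.
        return node.children ? node.children.map(render).join("") : node.value || "";
    }
  };
  return link.children.map(render).join("");
}
```

Assigning entry.title = getLinkTitle(link) would keep the tokens that link.children[0].value currently drops.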

Context

See EbookFoundation/free-programming-books#7086
Related to #2 (same workaround)
