
Comments (5)

dennyabrain commented on May 28, 2024

[Image: ogbv-inline-icons]

I was able to inject buttons at the tweet level.
For now, the ability to screenshot an individual tweet is in place.

from uli.

dennyabrain commented on May 28, 2024

My first attempt at finding the HTML element that contains the tweet text was as follows:

const TWEET_TEXT =
  "div > div > div > div > div > article > div > div > div:nth-child(1) >  div:nth-child(2) >  div:nth-child(2) > div:nth-child(2) > div:first-child > div:first-child > span";
const text = tweetDom.querySelector(TWEET_TEXT).innerText;

I discovered some edge cases where this does not work. For instance:
[Image: replying-to-tweets]
Tweets that are a reply to someone else have "Replying to" text in them, and that text sits within the same div as the tweet text. So basing the parsing logic on a child's index is prone to breaking.

I could write some code with conditions on the index of the div within its parent, but that sounds like code that would be hard for someone else to understand and maintain. Since this code is likely to break whenever Twitter changes its DOM structure, I wanted something that is easier to read and grok, even at the expense of speed (to an acceptable degree).

Alternate Approach: Depth-First Traversal of the DOM Tree

This approach relies on two assumptions:

  1. Tweet text is always a leaf node in the DOM tree; it is rendered as a span element.
  2. It is very easy for a user to inspect the DOM in a browser and find the location of the tweet div. This can be as simple as saying that the tweet is in a span that is inside a div that is inside a div, and so on. That nesting can be represented as a string like "DIV:DIV:DIV:DIV:ARTICLE:DIV:DIV:DIV:DIV:DIV:DIV:DIV:DIV:SPAN". If the DOM structure changes, we just need to figure out the new nesting order and replace this string in the code.
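
As a rough illustration of the second assumption, the nesting string for an inspected node could be produced by walking up its parents. This is only a sketch: `tagPathFromRoot` is a hypothetical helper that does not appear in the original code, and the mock nodes carry only `tagName` and `parentNode`. How its output lines up with the path the traversal builds also depends on the seed value passed to the traversal.

```javascript
// Hypothetical helper (not part of the original code): builds the
// colon-separated tag path for a node by walking up through parentNode
// until it reaches `root`.
function tagPathFromRoot(node, root) {
  const tags = [];
  let current = node;
  while (current) {
    // Nodes without a tagName (e.g. text nodes) are labelled "UND".
    tags.unshift(current.tagName === undefined ? "UND" : current.tagName);
    if (current === root) break;
    current = current.parentNode;
  }
  return tags.join(":");
}

// Mock nodes: only tagName and parentNode are needed here.
const root = { tagName: "DIV", parentNode: null };
const article = { tagName: "ARTICLE", parentNode: root };
const span = { tagName: "SPAN", parentNode: article };

console.log(tagPathFromRoot(span, root)); // prints "DIV:ARTICLE:SPAN"
```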

I wrote a depth-first traversal for the DOM element that contains the tweet. It finds the leaf nodes in the tree and checks whether the path it took to reach each leaf matches the path specified by the user:

const TWEET_SPAN_PATH =
  "DIV:DIV:DIV:DIV:ARTICLE:DIV:DIV:DIV:DIV:DIV:DIV:DIV:DIV:SPAN";

function DFT(node, currentPath, shouldContinue) {
  // Stop as soon as the path so far can no longer lead to the target.
  if (!shouldContinue) {
    return;
  }
  // Nodes without a tagName (e.g. text nodes) are labelled "UND".
  let elementTag = node.tagName;
  if (elementTag == undefined) {
    elementTag = "UND";
  }
  const newCurrentPath = currentPath + ":" + elementTag;
  // Descend only while the extended path is still a prefix of the target.
  const childShouldContinue = TWEET_SPAN_PATH.startsWith(newCurrentPath);
  node.childNodes.forEach((child) => {
    DFT(child, newCurrentPath, childShouldContinue);
  });
  // A leaf is compared using the path of its ancestors; its own tag is not appended.
  if (node.childNodes.length === 0) {
    if (currentPath === TWEET_SPAN_PATH) {
      console.log("Reached TWEET at ", currentPath);
      console.log(node);
    }
  }
}

This is working well as of now. It also helped me discover another edge case: not all tweet text is in a span. Hashtags are rendered as <a>, which is obvious in hindsight since they are clickable.

What I also like about this approach is that it makes edge cases like these easy to discover. With some slight changes, the algorithm can be used to find and print all the leaf nodes, which can then be inspected to check whether they match the expected nodes. Any time you discover an anomaly, you can add code to handle that case.
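
One possible shape for that leaf-printing variant is sketched below. It is an illustration rather than code from this thread: `collectLeaves` is a hypothetical name, and the mock object stands in for a real DOM element. It builds paths the same way as the DFT function above (a leaf's path is the seed plus the tags of its ancestors) but returns every leaf instead of logging only the matches.

```javascript
// Hypothetical variant of the traversal: collects every leaf together with
// the path that led to it, so unexpected leaves can be inspected.
function collectLeaves(node, currentPath, leaves = []) {
  const elementTag = node.tagName === undefined ? "UND" : node.tagName;
  if (node.childNodes.length === 0) {
    leaves.push({ path: currentPath, tag: elementTag });
    return leaves;
  }
  const newCurrentPath = currentPath + ":" + elementTag;
  node.childNodes.forEach((child) => {
    collectLeaves(child, newCurrentPath, leaves);
  });
  return leaves;
}

// Mock DOM: a text-like node (no tagName) inside a SPAN, and an A next to it.
const mock = {
  tagName: "DIV",
  childNodes: [
    { tagName: "SPAN", childNodes: [{ childNodes: [] }] },
    { tagName: "A", childNodes: [] },
  ],
};

const leaves = collectLeaves(mock, "DIV");
console.log(leaves);
// [ { path: "DIV:DIV:SPAN", tag: "UND" }, { path: "DIV:DIV", tag: "A" } ]
```

Listing the tag next to the path is what would surface a case like the hashtag one, where a leaf turns out to be an A rather than text inside a SPAN.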


dennyabrain commented on May 28, 2024

In case you want to test the code on your machine without a DOM, here's some test code.

var input = {
  tagName: "DIV",
  childNodes: [
    {
      tagName: "DIV",
      childNodes: [
        {
          tagName: "DIV",
          childNodes: [
            {
              tagName: "IMAGE",
              childNodes: [],
            },
          ],
        },
      ],
    },
    {
      tagName: "DIV",
      childNodes: [
        {
          tagName: "DIV",
          childNodes: [
            {
              tagName: "SPAN",
              childNodes: [],
            },
          ],
        },
      ],
    },
    {
      tagName: "DIV",
      childNodes: [
        {
          tagName: "IMAGE",
          childNodes: [],
        },
      ],
    },
  ],
};

// Declared with const so the script does not rely on implicit globals.
const TWEET_PATH = "";
const TWEET_TIME =
  "DIV:DIV:DIV:DIV:ARTICLE:DIV:DIV:DIV:DIV:DIV:DIV:DIV:DIV:DIV:A:TIME";

const TWEET_SPAN_PATH =
  "DIV:DIV:DIV:DIV:ARTICLE:DIV:DIV:DIV:DIV:DIV:DIV:DIV:DIV:SPAN";

function DFT(node, currentPath, shouldContinue) {
  // Stop as soon as the path so far can no longer lead to the target.
  if (!shouldContinue) {
    return;
  }
  // Nodes without a tagName (e.g. text nodes) are labelled "UND".
  let elementTag = node.tagName;
  if (elementTag == undefined) {
    elementTag = "UND";
  }
  const newCurrentPath = currentPath + ":" + elementTag;
  // Descend only while the extended path is still a prefix of the target.
  const childShouldContinue = TWEET_SPAN_PATH.startsWith(newCurrentPath);
  node.childNodes.forEach((child) => {
    DFT(child, newCurrentPath, childShouldContinue);
  });
  // A leaf is compared using the path of its ancestors; its own tag is not appended.
  if (node.childNodes.length === 0) {
    if (currentPath === TWEET_SPAN_PATH) {
      console.log("Reached TWEET at ", currentPath);
      console.log(node);
    }
  }
}

DFT(input, "DIV", true);

The input variable is a mock of a DOM element. The DFT function only cares about the tagName and childNodes properties of a real DOM element.
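
To check a positive match without a browser, a mock can also be generated from a path string. The sketch below is an assumption-laden illustration: `mockFromPath` and `findLeavesAt` are hypothetical names, the first tag of the path is treated as the seed that is passed to the traversal, and the innermost node is a text-like leaf with no tagName.

```javascript
// Hypothetical helper: builds a single-branch mock tree from a tag path.
// The first tag is the traversal seed, so it gets no node of its own.
function mockFromPath(path) {
  const tags = path.split(":").slice(1);
  let node = { childNodes: [] }; // text-like leaf, tagName is undefined
  for (let i = tags.length - 1; i >= 0; i--) {
    node = { tagName: tags[i], childNodes: [node] };
  }
  return node;
}

// Compact variant of the traversal that returns matching leaves
// instead of logging them.
function findLeavesAt(node, currentPath, targetPath, found = []) {
  if (node.childNodes.length === 0) {
    if (currentPath === targetPath) found.push(node);
    return found;
  }
  const tag = node.tagName === undefined ? "UND" : node.tagName;
  const nextPath = currentPath + ":" + tag;
  // Prune branches that can no longer reach the target path.
  if (!targetPath.startsWith(nextPath)) return found;
  node.childNodes.forEach((child) => {
    findLeavesAt(child, nextPath, targetPath, found);
  });
  return found;
}

const TARGET = "DIV:DIV:ARTICLE:DIV:SPAN";
const tree = mockFromPath(TARGET);
console.log(findLeavesAt(tree, "DIV", TARGET).length); // prints 1
```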


dennyabrain commented on May 28, 2024

Added support for matching against the child index within a div. It seems that Twitter's layout changed last night, and I was able to fix the parser fairly quickly with this new approach.

/**
 * Parses the DOM and collects the elements that hold the tweet text.
 * @param {Element} tweetDom - element obtained from a document.getElementById query
 * @return {Object} leaves - map from a generated id to the parent element of each matching leaf node
 */
function parseAndMakeTweet(tweetDom) {
  const TWEET_PATH_GENERAL = new RegExp(
    "DIV\\(0\\):DIV\\(0\\):DIV\\(0\\):DIV\\(0\\):ARTICLE\\(0\\):DIV\\(0\\):DIV\\(1\\):DIV\\(1\\):DIV\\(1\\):DIV\\([0-9]+\\):DIV\\(0\\):DIV\\([0-9]+\\):DIV\\([0-9]+\\):DIV\\(0\\):SPAN"
  );
  let leaves = {};

  function getId() {
    const id = `ogbv_tweet_${Math.floor(Math.random() * 999999)}`;
    return id;
  }

  function DFT(node, currentPath) {
    let elementTag = node.tagName;
    if (elementTag == undefined) {
      elementTag = "UND";
    }
    node.childNodes.forEach((a, ix) => {
      const newCurrentPath = currentPath + `(${ix})` + ":" + elementTag;
      DFT(a, newCurrentPath);
    });
    if (
      node.childNodes.length === 0 &&
      (elementTag === "SPAN" || elementTag === "A" || elementTag === "UND")
    ) {
      console.log({ currentPath, node });
      if (TWEET_PATH_GENERAL.test(currentPath)) {
        const id = getId();
        const parentElement = node.parentElement;
        parentElement.setAttribute("id", id);
        leaves[id] = parentElement;
      }
    }
  }

  DFT(tweetDom, "DIV");

  return leaves;
}
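
The hand-escaped regular expression above is hard to read and easy to mistype. One option, sketched here under assumptions, is to generate it from a more readable pattern. `buildPathRegex` and the `(*)` wildcard syntax are hypothetical and not part of the code in this thread; the resulting expression is left unanchored, like the original.

```javascript
// Hypothetical helper: turns a readable pattern such as
// "DIV(0):DIV(*):SPAN" into a RegExp. "(*)" stands for "any child index".
function buildPathRegex(pattern) {
  const source = pattern
    .split(":")
    .map((segment) => {
      const match = /^([A-Z0-9]+)(?:\((\*|\d+)\))?$/.exec(segment);
      if (!match) throw new Error(`Bad segment: ${segment}`);
      const [, tag, index] = match;
      if (index === undefined) return tag; // segment without an index
      return index === "*" ? `${tag}\\([0-9]+\\)` : `${tag}\\(${index}\\)`;
    })
    .join(":");
  return new RegExp(source);
}

const re = buildPathRegex("DIV(0):DIV(*):SPAN");
console.log(re.test("DIV(0):DIV(12):SPAN")); // true
console.log(re.test("DIV(0):DIV(x):SPAN")); // false
```

With a helper like this, a layout change would mean editing a short pattern string instead of a long escaped expression.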


dennyabrain commented on May 28, 2024

Just archiving some paths for other elements. They might come in handy later if their paths haven't changed.

// Paths without child indices.
const TWEET_TIME =
  "DIV:DIV:DIV:DIV:ARTICLE:DIV:DIV:DIV:DIV:DIV:DIV:DIV:DIV:DIV:A:TIME";

const TWEET_SPAN_PATH =
  "DIV:DIV:DIV:DIV:ARTICLE:DIV:DIV:DIV:DIV:DIV:DIV:DIV:DIV:SPAN";
const AUTHOR_PATH =
  "DIV(1):DIV(0):DIV(0):DIV(0):ARTICLE(0):DIV(0):DIV(1):DIV(1):DIV(0):DIV(0):DIV(0):DIV(0):DIV(0):DIV(0):DIV(0):DIV(0):DIV(0):A(0):DIV(0):DIV(0):SPAN(0):SPAN";
const AUTHOR_HANDLE_PATH =
  "DIV(1):DIV(0):DIV(0):DIV(0):ARTICLE(0):DIV(0):DIV(1):DIV(1):DIV(0):DIV(0):DIV(0):DIV(0):DIV(0):DIV(0):DIV(1):DIV(0):DIV(0):DIV(0):A(0):DIV(0):DIV(0):SPAN";

const TIMESTAMP_PATH =
  "DIV(1):DIV(0):DIV(0):DIV(0):ARTICLE(0):DIV(0):DIV(1):DIV(1):DIV(0):DIV(0):DIV(0):DIV(0):DIV(2):DIV(0):A(0):TIME";
const TWEET_PATH =
  "DIV(1):DIV(0):DIV(0):DIV(0):ARTICLE(0):DIV(0):DIV(1):DIV(1):DIV(1):DIV(0):DIV(0):DIV(0):DIV(0):SPAN";

const TWEET_HYPERLINK =
  "DIV(1):DIV(0):DIV(0):DIV(0):ARTICLE(0):DIV(0):DIV(1):DIV(1):DIV(1):DIV(0):DIV(0):DIV(1):DIV(1):A";
