Comments (5)
I was successfully able to inject buttons at the tweet level.
Right now the ability to screenshot an individual tweet is in place.
from uli.
My first attempt at finding the HTML element that contains the tweet text was as follows:
const TWEET_TEXT =
  "div > div > div > div > div > article > div > div > div:nth-child(1) > div:nth-child(2) > div:nth-child(2) > div:nth-child(2) > div:first-child > div:first-child > span";
const text = tweetDom.querySelector(TWEET_TEXT).innerText;
I discovered some edge cases where this does not work. For instance:
Tweets that are a reply to someone else have a "Replying to" text that sits within the same div as the tweet text. So basing the parsing logic on a child's index is prone to breaking.
I guess I could write some code with conditions on the index of the div within its parent, but that sounds like code that would be hard for someone else to understand and maintain. Since this code is susceptible to breaking whenever Twitter changes its DOM structure, I wanted something that was easier to read and grok, even at the expense of speed (to an acceptable degree).
Alternate Approach: Depth First Traversal of the DOM tree
This approach relies on two assumptions:
- Tweet text is always a leaf node in the DOM tree; it's rendered as a span element.
- It's very easy for a user to inspect the DOM in a browser and find the location of the tweet div. This can be as easy as saying that the tweet is in a span that's inside a div that is inside a div, and so on, which can be represented as a string like "DIV:DIV:DIV:DIV:ARTICLE:DIV:DIV:DIV:DIV:DIV:DIV:DIV:DIV:SPAN". If the DOM structure changes, we just need to figure out what the new nesting order is and replace the string in the code.
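As an aside on the second assumption, the nesting order could also be derived from a node instead of being read off the inspector by hand. This is only a minimal sketch, assuming nodes expose `tagName` and `parentNode` the way real DOM nodes do. `tagPath` is a hypothetical helper, not part of the extension, and the mock nodes exist only so the snippet runs without a browser.

```javascript
// Hypothetical helper: builds a "DIV:ARTICLE:SPAN" style string by walking
// from `node` up to (and including) `root`.
function tagPath(node, root) {
  const tags = [];
  let current = node;
  while (current) {
    // Text nodes have no tagName, so they are recorded as "UND".
    tags.unshift(current.tagName || "UND");
    if (current === root) break;
    current = current.parentNode;
  }
  return tags.join(":");
}

// Mock nodes standing in for real DOM elements (assumption for this demo).
const root = { tagName: "DIV", parentNode: null };
const article = { tagName: "ARTICLE", parentNode: root };
const span = { tagName: "SPAN", parentNode: article };

console.log(tagPath(span, root)); // DIV:ARTICLE:SPAN
```

The string this produces is the plain ancestor chain, so it may need adjusting to whatever starting element and seed value the traversal code uses.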
I wrote a depth first traversal of the DOM element that contains the tweet. It finds the leaf nodes in the tree and checks whether the path it took to reach each leaf matches the path specified by the user:
const TWEET_SPAN_PATH =
  "DIV:DIV:DIV:DIV:ARTICLE:DIV:DIV:DIV:DIV:DIV:DIV:DIV:DIV:SPAN";
// currentPath holds the tags on the way down to `node` (seeded by the
// caller); the node's own tag is only appended for its children.
function DFT(node, currentPath, shouldContinue) {
  // console.log(currentPath);
  if (!shouldContinue) {
    return;
  }
  let elementTag = node.tagName;
  if (elementTag == undefined) {
    elementTag = "UND"; // text nodes have no tagName
  }
  const newCurrentPath = currentPath + ":" + elementTag;
  node.childNodes.forEach((a) => {
    // Only descend while the path so far is still a prefix of the target.
    shouldContinue = TWEET_SPAN_PATH.startsWith(newCurrentPath);
    DFT(a, newCurrentPath, shouldContinue);
  });
  if (node.childNodes.length === 0) {
    // Leaf: compare the path that led here against the target path.
    if (currentPath === TWEET_SPAN_PATH) {
      console.log("Reached TWEET at ", currentPath);
      console.log(node);
    }
  }
}
This is working well as of now. It also helped me discover another edge case: not all tweet text is in a span. Hashtags are rendered as <a>, which is obvious in hindsight since they are clickable.
What I like about this approach is that it allows discovering edge cases like these. With some slight changes this algorithm can be used to find and print the leaf nodes, which can then be inspected to check whether they match the expected nodes. Anytime you discover an anomaly, you can add code to take care of that case.
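The "slight changes" for printing leaf nodes could look roughly like the sketch below. It is an illustration under assumptions, not the code used in the extension: `collectLeafPaths` is a hypothetical name, and unlike the DFT above it includes the leaf's own tag in the reported path.

```javascript
// Hypothetical variant of the traversal: instead of matching one target
// path, it records the tag path of every leaf under `node`.
function collectLeafPaths(node, parentPath = "", out = []) {
  const tag = node.tagName || "UND"; // text nodes have no tagName
  const path = parentPath ? `${parentPath}:${tag}` : tag;
  const children = Array.from(node.childNodes || []);
  if (children.length === 0) {
    out.push(path);
  }
  children.forEach((child) => collectLeafPaths(child, path, out));
  return out;
}

// Mock element: a span wrapping a text node, and an empty anchor.
const mock = {
  tagName: "DIV",
  childNodes: [
    { tagName: "SPAN", childNodes: [{ childNodes: [] }] },
    { tagName: "A", childNodes: [] },
  ],
};

const leafPaths = collectLeafPaths(mock);
console.log(leafPaths); // [ 'DIV:SPAN:UND', 'DIV:A' ]
```

Scanning this list for paths that end in something other than the expected tag is one way an anomaly like the hashtag case might show up.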
In case you want to test the code on your machine without a DOM, here's some test code.
var input = {
  tagName: "DIV",
  childNodes: [
    {
      tagName: "DIV",
      childNodes: [
        {
          tagName: "DIV",
          childNodes: [
            {
              tagName: "IMAGE",
              childNodes: [],
            },
          ],
        },
      ],
    },
    {
      tagName: "DIV",
      childNodes: [
        {
          tagName: "DIV",
          childNodes: [
            {
              tagName: "SPAN",
              childNodes: [],
            },
          ],
        },
      ],
    },
    {
      tagName: "DIV",
      childNodes: [
        {
          tagName: "IMAGE",
          childNodes: [],
        },
      ],
    },
  ],
};
const TWEET_PATH = "";
const TWEET_TIME =
  "DIV:DIV:DIV:DIV:ARTICLE:DIV:DIV:DIV:DIV:DIV:DIV:DIV:DIV:DIV:A:TIME";
const TWEET_SPAN_PATH =
  "DIV:DIV:DIV:DIV:ARTICLE:DIV:DIV:DIV:DIV:DIV:DIV:DIV:DIV:SPAN";
function DFT(node, currentPath, shouldContinue) {
  // console.log(currentPath);
  if (!shouldContinue) {
    return;
  }
  let elementTag = node.tagName;
  if (elementTag == undefined) {
    elementTag = "UND"; // text nodes have no tagName
  }
  const newCurrentPath = currentPath + ":" + elementTag;
  node.childNodes.forEach((a) => {
    // Only descend while the path so far is still a prefix of the target.
    shouldContinue = TWEET_SPAN_PATH.startsWith(newCurrentPath);
    DFT(a, newCurrentPath, shouldContinue);
  });
  if (node.childNodes.length === 0) {
    // Leaf: compare the path that led here against the target path.
    if (currentPath === TWEET_SPAN_PATH) {
      console.log("Reached TWEET at ", currentPath);
      console.log(node);
    }
  }
}
DFT(input, "DIV", true);
The input variable is a mock of a DOM element. The DFT function only cares about the tagName and childNodes properties of a real DOM element.
Added support for matching against the child index within a div. It seems that Twitter's layout changed last night, and I was able to fix the parser fairly quickly with this new approach.
/**
 * Parses the DOM and collects the elements that hold the tweet text.
 * @param {HTMLElement} tweetDom - element obtained from a document.getElementById query
 * @return {Object} leaves - map from a generated id to the parent element of each matching leaf
 */
function parseAndMakeTweet(tweetDom) {
  const TWEET_PATH_GENERAL = new RegExp(
    "DIV\\(0\\):DIV\\(0\\):DIV\\(0\\):DIV\\(0\\):ARTICLE\\(0\\):DIV\\(0\\):DIV\\(1\\):DIV\\(1\\):DIV\\(1\\):DIV\\([0-9]+\\):DIV\\(0\\):DIV\\([0-9]+\\):DIV\\([0-9]+\\):DIV\\(0\\):SPAN"
  );
  let leaves = {};
  function getId() {
    const id = `ogbv_tweet_${Math.floor(Math.random() * 999999)}`;
    return id;
  }
  function DFT(node, currentPath) {
    let elementTag = node.tagName;
    if (elementTag == undefined) {
      elementTag = "UND"; // text nodes have no tagName
    }
    node.childNodes.forEach((a, ix) => {
      // Append the child's index, then this node's tag, before descending.
      const newCurrentPath = currentPath + `(${ix})` + ":" + elementTag;
      DFT(a, newCurrentPath);
    });
    // Only leaves that can carry tweet text: spans, links and text nodes.
    if (
      node.childNodes.length === 0 &&
      (elementTag === "SPAN" || elementTag === "A" || elementTag === "UND")
    ) {
      console.log({ currentPath, node });
      if (TWEET_PATH_GENERAL.test(currentPath)) {
        // Tag the leaf's parent with a generated id so it can be found later.
        const id = getId();
        const parentElement = node.parentElement;
        parentElement.setAttribute("id", id);
        leaves[id] = parentElement;
      }
    }
  }
  DFT(tweetDom, "DIV");
  return leaves;
}
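Writing the escaped regular expression by hand is easy to get wrong, so one option is to generate it from a readable pattern. The following is a sketch under assumptions: `pathToRegExp` and the `(*)` wildcard syntax are hypothetical, and the result is anchored with `^` and `$`, which is stricter than the unanchored expression above.

```javascript
// Hypothetical helper: turns "DIV(0):DIV(*):SPAN" into a RegExp in which
// "(*)" matches any child index and everything else is matched literally.
function pathToRegExp(pattern) {
  const source = pattern
    .split(":")
    .map((segment) =>
      segment
        .replace(/[.*+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
        .replace("\\(\\*\\)", "\\([0-9]+\\)") // turn "(*)" into an index wildcard
    )
    .join(":");
  return new RegExp(`^${source}$`);
}

const re = pathToRegExp("DIV(0):DIV(*):SPAN");
console.log(re.test("DIV(0):DIV(12):SPAN")); // true
console.log(re.test("DIV(1):DIV(12):SPAN")); // false
```

With a helper like this, a layout change would mean editing the readable pattern string rather than the escaped expression.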
Just archiving some paths for other elements. They might come in handy later if their paths haven't changed.
// Older paths, without child indices
// TWEET_PATH = ""; // empty placeholder; the indexed TWEET_PATH is declared below
const TWEET_TIME =
  "DIV:DIV:DIV:DIV:ARTICLE:DIV:DIV:DIV:DIV:DIV:DIV:DIV:DIV:DIV:A:TIME";
const TWEET_SPAN_PATH =
  "DIV:DIV:DIV:DIV:ARTICLE:DIV:DIV:DIV:DIV:DIV:DIV:DIV:DIV:SPAN";
// Newer paths, with child indices
const AUTHOR_PATH =
  "DIV(1):DIV(0):DIV(0):DIV(0):ARTICLE(0):DIV(0):DIV(1):DIV(1):DIV(0):DIV(0):DIV(0):DIV(0):DIV(0):DIV(0):DIV(0):DIV(0):DIV(0):A(0):DIV(0):DIV(0):SPAN(0):SPAN";
const AUTHOR_HANDLE_PATH =
  "DIV(1):DIV(0):DIV(0):DIV(0):ARTICLE(0):DIV(0):DIV(1):DIV(1):DIV(0):DIV(0):DIV(0):DIV(0):DIV(0):DIV(0):DIV(1):DIV(0):DIV(0):DIV(0):A(0):DIV(0):DIV(0):SPAN";
const TIMESTAMP_PATH =
  "DIV(1):DIV(0):DIV(0):DIV(0):ARTICLE(0):DIV(0):DIV(1):DIV(1):DIV(0):DIV(0):DIV(0):DIV(0):DIV(2):DIV(0):A(0):TIME";
const TWEET_PATH =
  "DIV(1):DIV(0):DIV(0):DIV(0):ARTICLE(0):DIV(0):DIV(1):DIV(1):DIV(1):DIV(0):DIV(0):DIV(0):DIV(0):SPAN";
const TWEET_HYPERLINK =
  "DIV(1):DIV(0):DIV(0):DIV(0):ARTICLE(0):DIV(0):DIV(1):DIV(1):DIV(1):DIV(0):DIV(0):DIV(1):DIV(1):A";
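If these archived paths do get reused, one possible way to use them is as a lookup table that labels a leaf by the path that led to it. This is a rough sketch and not code from the extension: `ARCHIVED_PATHS` and `fieldForPath` are hypothetical names, only two of the paths above are copied in, and the comparison is exact string equality.

```javascript
// Two of the archived paths, copied verbatim; they may already be stale.
const ARCHIVED_PATHS = {
  timestamp:
    "DIV(1):DIV(0):DIV(0):DIV(0):ARTICLE(0):DIV(0):DIV(1):DIV(1):DIV(0):DIV(0):DIV(0):DIV(0):DIV(2):DIV(0):A(0):TIME",
  hyperlink:
    "DIV(1):DIV(0):DIV(0):DIV(0):ARTICLE(0):DIV(0):DIV(1):DIV(1):DIV(1):DIV(0):DIV(0):DIV(1):DIV(1):A",
};

// Hypothetical helper: returns the field name whose archived path equals
// `path`, or null when nothing matches.
function fieldForPath(path, paths = ARCHIVED_PATHS) {
  const match = Object.entries(paths).find(([, p]) => p === path);
  return match ? match[0] : null;
}

console.log(fieldForPath(ARCHIVED_PATHS.timestamp)); // timestamp
console.log(fieldForPath("DIV(0):SPAN")); // null
```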