connorjoleary / deepcite
Traversing links to find the deep source of information
License: GNU General Public License v3.0
This is a security concern because anyone can see this info. We should grab the value from Google Secret Manager instead.
Is your feature request related to a problem? Please describe.
When parsing the results of nested websites, the code currently doesn't check whether the response was something like a 404. This means we may receive an error page but treat it as if the website was returned as expected.
Describe the solution you'd like
When requesting a website, look for a status 200, and if it is something else, don't parse that page.
Describe alternatives you've considered
Give special results for specific codes (404, 403, ...).
Additional context
https://www.reddit.com/r/todayilearned/comments/nz6hl7/til_the_banana_plant_is_a_herb_distantly_related/
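A minimal sketch of the status check described above, using only the standard library (function names are hypothetical, not the actual DeepCite code):

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def fetch_parseable(url: str):
    """Return page HTML only for a clean 200 response; None otherwise."""
    try:
        resp = urlopen(url, timeout=10)
    except (HTTPError, URLError):
        return None
    if resp.getcode() != 200:
        return None
    return resp.read().decode("utf-8", errors="replace")

def describe_status(code: int) -> str:
    """The alternative considered: special-case a few specific codes."""
    special = {403: "access forbidden", 404: "page not found"}
    if code == 200:
        return "ok"
    return special.get(code, "skipped")
```

A caller would simply skip parsing whenever `fetch_parseable` returns None.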
Hi, I don't know if you are looking for any help with this project since it appears to be for school, but I'd love to help, maybe even after the course is done.
Anyway, I thought you might want to know that extension\node_modules\ exists in the repo even though extension\.gitignore excludes this folder. If you want, you can have a global .gitignore. To ignore all node_modules folders you can use the pattern node_modules/, or to ignore just a specific directory you could use test-server/node_modules (see the .gitignore pattern format).
Describe the bug
Apostrophes (') mess up the ability to submit a claim to Postgres.
To Reproduce
https://www.reddit.com/r/todayilearned/comments/otef3u/til_leonardo_da_vinci_wrote_all_the_branches_of_a/
Additional context
Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
    self.dialect.do_execute(
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
    cursor.execute(statement, parameters)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pg8000/core.py", line 350, in execute
    self._c.execute_unnamed(
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pg8000/core.py", line 1296, in execute_unnamed
    self.handle_messages(cursor)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pg8000/core.py", line 1463, in handle_messages
    raise self.error
pg8000.exceptions.ProgrammingError: {'S': 'ERROR', 'V': 'ERROR', 'C': '42601', 'M': 'syntax error at or near "s"', 'P': '603', 'F': 'scan.l', 'L': '1149', 'R': 'scanner_yyerror'}
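The syntax error at or near "s" comes from an apostrophe in the claim being spliced directly into the SQL string. Bound parameters avoid this. A minimal sqlite3 sketch of the same idea (the backend uses SQLAlchemy/pg8000, but the principle is identical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (body TEXT)")

claim = "the banana plant is a herb, it's true"
# The ? placeholder lets the driver escape the apostrophe instead of
# breaking the statement at the "s" after the quote.
conn.execute("INSERT INTO claims (body) VALUES (?)", (claim,))
row = conn.execute("SELECT body FROM claims").fetchone()
```

With SQLAlchemy the equivalent is passing a parameter dict to `execute()` rather than formatting the claim into the statement string.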
If the sentence is very short, it may not add legitimate paragraphs to the predict object. Also, when there are multiple sentences, the one with an associated link would be a better one to return.
When you right-click and cite a selection, it only populates the popup. Instead it should open the popup automatically, input the text, and hit submit.
It does not look like you can trigger the popup automatically from the background script, though, so it may be better to convert the popup to an iframe, similar to how Google Keep operates.
Right now this response is not sent to the user, nor is any other error:
    except Exception as e:
        if check_instance(e):
            full_pre_json['error'] = str(e)
        else:
            link = html_link('https://github.com/connorjoleary/DeepCite/issues')
            full_pre_json['error'] = ('Error 500: Internal Server Error ' + str(e) + "."
                                      + new_indention("Please add your error to " + link
                                                      + " with the corresponding claim and link."))
If we do want to show users errors, we need to improve what they say.
Is your feature request related to a problem? Please describe.
When a website yields a lot of possible sources, the tree it returns gets squished at the bottom.
Describe the solution you'd like
A dynamic page size which allows you to scroll left and right
We should already have an endpoint set up, as well as code to display a tree. They just need to be matched up, and the UI needs a button to show the tree.
Describe the bug
It didn't even pick up the Wiki page, why?
To Reproduce
https://www.reddit.com/r/todayilearned/comments/nolm6e/til_the_words_female_and_male_are_etymologically/
Most useful on today I learned
Possibly just look at current website
Should probably break this up into sub-tasks
When installing and running I noticed that some of the readme instructions are a little lacking. For example, the install for word2vec is a little unclear: the backend only requires the pretrained model, not the complete code, and we could easily expand those instructions with a few extra bullet points on how to download the pretrained model. I would also like to add a short instruction on how to install on Firefox. Could I take this issue?
There are quite a few global variables that could be established as environment variables. There should be a single module for the backend that validates and sources those variables for the backend to use.
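A sketch of such a module, with hypothetical variable names (the project's actual globals would replace them):

```python
import os

REQUIRED_VARS = ("DB_URL", "MODEL_PATH")  # hypothetical names

def load_config() -> dict:
    """Validate that every required variable is set, then return them all.

    Failing fast at startup beats a mysterious crash deep in a request.
    """
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: os.environ[name] for name in REQUIRED_VARS}
```

The rest of the backend would import this module instead of reading os.environ directly.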
Describe the bug
The reddit post https://www.reddit.com/r/todayilearned/comments/np87ef/til_of_the_golden_spruce_a_gorgeous_one_of_a_kind/ gives this error:
Traceback (most recent call last):
  File "/app/main.py", line 28, in deep_cite
    tree = Tree(link, claim)
  File "/app/tree.py", line 24, in __init__
    self.tree_root = Claim(url, claim)
  File "/app/claim.py", line 57, in __init__
    self.parse_child()
  File "/app/claim.py", line 266, in parse_child
    self.create_children(ref2text, scores)
  File "/app/claim.py", line 146, in create_children
    self.child.append(Claim(ref2text[words], words, scores[i], (self.height +1), self))  # does ref2text allow for multiple links
  File "/app/claim.py", line 57, in __init__
    self.parse_child()
  File "/app/claim.py", line 266, in parse_child
    self.create_children(ref2text, scores)
  File "/app/claim.py", line 165, in create_children
    self.child.append(Claim(ref2text[ref_key], words, scores[i], (self.height +1), self))
  File "/app/claim.py", line 57, in __init__
    self.parse_child()
  File "/app/claim.py", line 187, in parse_child
    citation = wiki(self.href, self.parent.href)
  File "/app/wiki_scraper.py", line 26, in wiki
    target_link = links[linkdict[link] + 1]
IndexError: list index out of range
Either print the error or print that there were no sources found
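The IndexError above comes from links[linkdict[link] + 1] walking past the end of the list. A guarded lookup (a hypothetical sketch, not the actual wiki_scraper code) would let the caller report "no sources found" instead of crashing:

```python
def next_link(links, index):
    """Return the link after `index`, or None when there is no next link."""
    if 0 <= index + 1 < len(links):
        return links[index + 1]
    return None
```

The caller can then branch on None to print the "no sources found" message.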
When looking at the extensions page (chrome://extensions/), there is nothing for DeepCite.
Something like a check box on the extension that passes a flag to the lambda
Is your feature request related to a problem? Please describe.
If a user submits the same text and link as one already stored in the db, it should return the previously computed result.
Describe the solution you'd like
Check against version number and text for a match. The text match should be done after trimming and can be a percent string match.
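A sketch of the match described above, using difflib for the percent string match (the function name and the threshold value are guesses):

```python
from difflib import SequenceMatcher

def is_cached_match(new_text, stored_text, new_version, stored_version,
                    threshold=0.95):
    """Reuse a stored result only for the same version and near-identical text.

    Text is trimmed first, then compared as a similarity ratio in [0, 1].
    """
    if new_version != stored_version:
        return False
    ratio = SequenceMatcher(None, new_text.strip(), stored_text.strip()).ratio()
    return ratio >= threshold
```

On a hit, the backend would return the stored tree instead of re-running the model.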
When we receive an error message from the server to display in the popup extension, we should be receiving strictly text. Instead, we are receiving formatted HTML. This forces us to use .innerHTML instead of .value or .innerText, opening the gates for potential cross-site scripting attacks. We should only receive text to display to the user so we don't have to use this.
In order for people to test their code in its entirety, the lambda must be runnable locally and able to interact with the model and extension, which can already be run locally.
This can occur in the lambda or the ECR instance (probably cheaper in the lambda, though). Either way, cutting out compute time should lower costs.
Backend error, soon to be fixed
Hi, I was just wondering if we can put in a small feature request for those of us that use pipenv. pipenv just generates some extra files, like a Pipfile and Pipfile.lock, so I think adding support for pipenv would be as simple as adding these files to the .gitignore, so that pipenv users don't have to worry about accidentally pushing them. Also, can I take this issue?
Describe the bug
The text fragment link is not selecting the text
To Reproduce
Steps to reproduce the behavior:
Making multiple requests at once causes an error.
Not sure if this is still a relevant issue now that we use gunicorn.
In order to reduce prices drastically, we need to use a smaller model. We should set up (in a jupyter notebook probably) the ability to compare how the new model would perform vs the old one based on the runs already submitted by users.
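One simple way to start that comparison in a notebook (a sketch; the right metric is an open question):

```python
def score_agreement(old_scores, new_scores):
    """Mean absolute difference between two models' similarity scores
    on the same set of stored runs (lower means closer to the old model)."""
    if len(old_scores) != len(new_scores):
        raise ValueError("score lists must cover the same runs")
    diffs = [abs(a - b) for a, b in zip(old_scores, new_scores)]
    return sum(diffs) / len(diffs)
```

The stored user runs would supply the inputs; each candidate model is re-scored on them and compared against the current model's recorded scores.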
Add tests and run them with travis ci.
For versioning I think we should follow semver standards.
https://www.reddit.com/r/todayilearned/comments/gzwlp6/til_6_years_after_resigning_nixon_testified_on
returns
"error": "Unable to obtain infomation from the website."
but this error is not reflected in the HTTP status code, and the original website and claim are not saved in RDS.
claim: 6 years after resigning, Nixon testified on behalf of former FBI assistant director Mark Felt at Felt's own trial, and gave money to Felt's defense fund. In 2005 Felt revealed he had been "Deep Throat", Bob Woodward's source while breaking the Watergate scandal that led to Nixon's resignation
Describe the bug
The true source is not given as an option.
To Reproduce
Steps to reproduce the behavior:
https://www.reddit.com/r/todayilearned/comments/n9evzh/til_theres_roughly_100_firefighter_arsonists/
full quote with that website as link
Expected behavior
There is literally a quote in the source, how did deepcite miss this?
The API Gateway on AWS has a timeout of 30 seconds, but our calls may take longer than that.
Describe the bug
The repeated url is being shown even when it shouldn't be
To Reproduce
https://www.reddit.com/r/todayilearned/comments/otqoog/til_of_research_during_1950s_allmale_combat/
Expected behavior
The url should be suppressed only when it appears in the direct path to the root, not when it appears on another branch, since the claims of the nodes above should influence the next nodes found.
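A sketch of the check this implies (the Node fields are hypothetical; the real Claim class differs):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    href: str
    parent: Optional["Node"] = None

def in_path_to_root(node: Node, url: str) -> bool:
    """True only when `url` already appears on the direct path to the root,
    so repeats on sibling branches stay allowed."""
    current = node
    while current is not None:
        if current.href == url:
            return True
        current = current.parent
    return False
```

During tree expansion, a candidate child url would be skipped only when this returns True.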
Instead of calling a Flask endpoint, the lambda should call another lambda with EFS, call a SageMaker endpoint, or call another API that finds the similarity of two sentences (word2vec).
Is your feature request related to a problem? Please describe.
This program should be able to parse PDFs.
Describe the solution you'd like
It should read the text of the PDF and find the links in it, much as Beautiful Soup does for HTML.
Additional context
Example:
https://www.reddit.com/r/todayilearned/comments/n8gf4n/til_that_in_1759_arthur_guinness_signed_a_9000/
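Extracting the text would need a PDF library such as pypdf; once the text is out, pulling bare links could be as simple as this sketch (regex and function name are assumptions):

```python
import re

# Matches http(s) URLs up to the first whitespace or closing bracket.
URL_RE = re.compile(r"https?://[^\s)>\]]+")

def links_from_text(text: str):
    """Find bare URLs in text already extracted from a PDF page."""
    return URL_RE.findall(text)
```

Unlike HTML, PDF text has no anchor tags, so this only catches URLs written out literally; link annotations embedded in the PDF would need the library's own API.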
Is your feature request related to a problem? Please describe.
An IP address is too personal a thing to track for what we need it for.
Describe the solution you'd like
Store an id and sync it between Chrome accounts.
chrome.storage.sync.set({"userId": ...}) or something like that.
Describe alternatives you've considered
None
Additional context
None
Should probably break this up into sub-tasks