karpathy / arxiv-sanity-preserver

Web interface for browsing, search and filtering recent arxiv submissions

Home Page: http://www.arxiv-sanity.com/

License: MIT License

Python 54.39% HTML 28.17% CSS 7.62% JavaScript 9.81%

arxiv-sanity-preserver's Introduction

I like deep neural nets.

arxiv-sanity-preserver's People

sisirkoppaka paengs kylemcdonald bikmaeff devnambi wangg12 andland johnny5550822 yoavg andreas-bulling rlwjr kgourgou zbessinger naturegirl jasonge27 sepehr125 markvanheeswijk mhlr cwgreene phunterlau ywteh aiyear85 wgapl imclab apsaltis aruneral01 rygbee rtvt123 ilyaraz edoput rsarxiv hans mdzahidh huamichaelchen jaym sunsocool cc13ny seanbe deeplearnphy coreysharris plus13 codeaudit moredread tuahzh veeresht rafaelcosman shatu tonydeep programmer-util paulhendricks skalskis yangjunpro yanweifu liyumeng bskaggs mpett mave5 madebyollin tianzq zhiyue-archive ecprice cequencer spillai ebagnaschi mmottahedi jinlmsft wanjinchang qgzang maxhodak noodlefrenzy piandpower benjamesbabala jgoldfar silky arrmac zachlungu saikswaroop harjatinsingh korry8911 rahimnathwani techstone corlobin srkm009 a455bcd9 prafiles mohammed90 nh007cs asiagood michaelmior neutralino isummer luismilanooliveira b-area pranavashok nunofernandes-plight kingtaurus outcastofmusic yangxs marlithjdm eglassman

arxiv-sanity-preserver's Issues

cs.IR

Those papers don't appear in arxiv-sanity:

I guess it's because they are listed under cs.IR, which isn't indexed by arxiv-sanity.

This is a bit strange as those papers could have been published under stat.ML or cs.CL.

Do you think cs.IR could be added to arxiv-sanity?

This issue is similar to #39 which seems to be fixed.

Feature Request: Allow searching by arXiv ID e.g. 1412.7210

Rather surprisingly, putting an ID e.g. 1412.7210 directly into the search field fails to procure the corresponding paper.
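
One way this request could be handled, sketched under assumptions: detect whether the query looks like a new-style arXiv identifier (YYMM.NNNN or YYMM.NNNNN, optionally with a version suffix) and try an exact lookup before falling back to the normal keyword search. The names `papers_by_id` and `keyword_search` are hypothetical stand-ins and are not taken from the project's code.

```python
import re

# New-style arXiv identifiers: four digits, a dot, then four or five digits,
# optionally followed by a version suffix such as "v2".
ARXIV_ID_RE = re.compile(r"^(\d{4}\.\d{4,5})(v\d+)?$")


def parse_arxiv_id(query):
    """Return the bare arXiv ID if the query looks like one, else None."""
    match = ARXIV_ID_RE.match(query.strip())
    return match.group(1) if match else None


def search(query, papers_by_id, keyword_search):
    """Try an exact ID lookup first, then fall back to keyword search.

    `papers_by_id` (dict) and `keyword_search` (callable) are hypothetical
    placeholders for whatever the site actually uses.
    """
    arxiv_id = parse_arxiv_id(query)
    if arxiv_id is not None and arxiv_id in papers_by_id:
        return [papers_by_id[arxiv_id]]
    return keyword_search(query)
```

Old-style identifiers (such as `cs/0112017`) are not covered by this pattern and would fall through to the keyword search.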

Top Hype (Last Year and All Time)

Let's add Top Hype papers for Last Year and All Time

It would be interesting to know how social media reacted to papers written last year.

Any plans for adding feature of showing number of citations?

Feature request: add/replace abstract with extracted conclusion

The conclusion section is often more compact and to the point.

It would be great to have 2 summary tabs - "abstract" and "conclusion" (where available) for each paper in the list.

Also, a general preference setting for choosing "abstract" or "conclusion" as the default summary view would be nice.

cs.AI

This recent deepmind paper doesn't appear in arxiv-sanity. I believe it's because it's listed under cs.AI, which isn't indexed by arxiv-sanity.

This is a bit strange as it seems relevant to the other categories included in arxiv-sanity. This particular paper could have just as easily been posted in stat.ML or cs.LG.

Issue installing on my local machine

I would like to add an RSS feed to the most recent papers tab. I was trying to set it up on my local machine based on the instructions in the README. It failed when I ran analyze.py:

C:\Users\<user>\Desktop\arxiv-sanity-preserver>python analyze.py
Traceback (most recent call last):
  File "analyze.py", line 29, in <module>
    txt = f.read()
  File "C:\Users\<user>\AppData\Local\Continuum\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2705: character maps to <undefined>

Any idea how to fix it?

Ravi
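
The traceback shows the file being decoded with the Windows default codec (cp1252) rather than an explicitly chosen one. A possible workaround, assuming the extracted text files are UTF-8 (which is not confirmed here), is to open them with an explicit encoding and a lenient error handler. The helper name `read_text` is made up for illustration.

```python
import io


def read_text(path):
    """Read a text file as UTF-8, replacing undecodable bytes.

    Assumes the file is (mostly) UTF-8. Bytes that cannot be decoded become
    U+FFFD instead of raising UnicodeDecodeError.
    """
    with io.open(path, "r", encoding="utf-8", errors="replace") as f:
        return f.read()
```

`io.open` is used so the same call works on both Python 2.7 and Python 3.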

Feature request: mobile view

Currently it's very difficult to view the website on mobile. It would be nice to have a mobile view.

Great project, btw.

I'm trying to do a study of texts of ML papers, and am using these scripts to acquire paper texts. After running download_pdfs.py (with about 20000 candidate papers acquired using fetch_papers.py) I was seemingly blocked by arxiv after downloading 1201 papers.
Has anyone experienced this sort of rate-limiting, not during fetch_papers but during download_pdfs? I can't access arxiv at all (including for my regular research), and am wondering whether this sort of blocking goes away after a little while, or if I need to start worrying.

Thanks!
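
For anyone running into the same thing, a minimal sketch of throttling a bulk download is shown below. It only illustrates pausing between requests: the `fetch` callable is a hypothetical placeholder for the actual download step, and the default delay is an assumed value, not a documented arXiv limit.

```python
import time


def download_politely(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests.

    `fetch` is a caller-supplied function (hypothetical here) that downloads
    one URL and returns its content. `delay_seconds` is an assumed value.
    `sleep` is injectable so the pacing can be tested without waiting.
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request after the first
        results[url] = fetch(url)
    return results
```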

System for sharing libraries

This is a great tool, and I was thinking about extending it with paper archives beyond arXiv. But if everyone set up their own version of arxiv-sanity, the benefit of having access to other users' libraries for the recommendation system disappears.

Would you consider a system for exporting the library database? I guess something as simple as providing a regular static dump of DOI sets corresponding to anonymized libraries would be a good start. Then every admin of an arxiv-sanity setup (or other tools for that matter) can benefit from the user base of others.

ImageMagick as dependency is not listed in README

Hi,

ImageMagick is needed for the thumbnail creation, but is not listed in the readme.

Feature Request: Expanding to all of arXiv

How feasible would it be to expand to all categories in arXiv?

Per #33, you mention that it's important to keep communities small so that "top papers" are still relevant. Couldn't this still be maintained by having a user specify as part of their account which subcategories they work in? And then top papers for a user would do some sort of cross-category normalization to account for multiple communities of different sizes. Maybe we could also crowdsource clustering of categories into different research areas and have those preset (like it has been done for ML currently).

Would love to see this platform become widely adopted!

Feature Request: Date filter

It would be really nice to have a rudimentary date filter.

Legal?

Here: https://arxiv.org/help/robots
it says "Robots Beware: Indiscriminate automated downloads from this site are not permitted."
This means your code is doing what is explicitly forbidden by arxiv.

Feature Requests/Suggestions

Some feature requests/suggestions:

- The ability to add/rename/delete folders in my library.
- Add personal notes to papers.
  - Similarly, allow user discussions on specific papers (or links to e.g. the reddit discussions)
- Add links to paper code/GitHub (similar to GitXiv) and other related links (e.g. YouTube demo)

P.S. this is awesome...Many thanks!

Feature Request: Alert for Customized Search Query

Like Google Alerts for search results or citation notifications in Google Scholar: users would be able to set an alert for a search query, and any new submission matching that query would trigger a notification or email.

Any plans for adding feature of commenting on papers?

Would be great, e.g. for research groups to discuss papers/add comments/thoughts/etc

mention poppler dependency in readme

poppler is required for the pdftotext dependency and should IMHO be mentioned in the readme

Convert does not work on Mac OS

Amazing work here!

Unfortunately I can't generate the images:

convert: unable to open image `pdf/********.pdf[0-7]`

The [0-7] should not be part of the file name; it is the index range of the pages I want. Any hints?

Top recent 3 days sometimes empty

http://www.arxiv-sanity.com/top?timefilter=3days&vfilter=all

This view is sometimes empty, just showing the 'load more' button. Might be related to #57.

Hosting "fork" for physics categories

I've started indexing some of the Physics categories. My plan is to cover all of them, but I've started with physics.* and astro-ph.* for now.

The site is currently hosted at http://physics.arxiv-sanity.nolife.de/

I'd be willing to host it long term, if you want to focus on the already covered categories. Alternatively I could forward the PDFs, thumbnails and extracted texts to you, if you want to incorporate them in your site. What is your plan at the moment?

How do you want to handle domain names for forks? As a sub domain, or should I register a different one?

Feature Request: Newsletter

Can a newsletter feature be added for suggested papers?

Newest papers?

Right now the most recent paper on arxiv-sanity is from the 11th of April, while on arxiv there are several new papers since then.
Is there a problem with the refresh?

this is very cool.

I'm glad I found what I was looking for.

Password reset

Hi Andrej,

Thanks a lot for your wonderful work, and especially your efforts to further democratize AI.

Quick question: is there any way to reset the password? I looked at the code and at http://www.arxiv-sanity.com/ but didn't find anything for that.

Thanks,
Rasool

Use file list instead of database in `analyze.py`

Would you consider a PR to use the list of txt files in analyze.py instead of querying the database? This would make it easier to make use of this script in other contexts. In addition, the script already skips over files when the text doesn't exist anyway. The only way the behaviour should be different is if there are some txt files that were manually placed in the folder for some reason.
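
A rough sketch of what the proposed change could look like: build the list of inputs from the directory contents instead of from the database. The function name and the assumption that every text file ends in `.txt` and sits directly in one folder are illustrative, not taken from `analyze.py`.

```python
import os


def list_txt_files(txt_dir):
    """Return the sorted paths of all .txt files directly inside txt_dir."""
    return sorted(
        os.path.join(txt_dir, name)
        for name in os.listdir(txt_dir)
        if name.endswith(".txt")
    )
```

Sorting keeps the order stable between runs, which matters if later steps index documents by position.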

UI bug with overlapping "Fork me on github" banner

At certain screen widths the "Fork me on github" banner overlays the paper's PDF button.

Normal case:

Problematic case:

I'd be happy to help solve this if there's interest and no UI rework already under way.

maybe you can use elasticsearch to index papers

elasticsearch is easy to use, to index your data, and it provides a RESTful API.

how to sign up?

Sorry for this silly question, but I couldn't find a sign-up option. How can I create an account and log in?

thanks

Would it make sense to add LSA/LDA vec to TF IDF representation?

I was wondering if using both a topic vector (LSA/LDA based, or even paragraph2vec...) plus tf idf would improve results.
The topic-vector score would be added to the tf-idf score with a low weight, so that words in common (with high tf-idf weight) stay very important, while the topic is still taken into account and can affect document order.

What do you think?
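
To make the proposal concrete, here is a minimal sketch of the weighted combination, assuming each paper already has a tf-idf vector and a topic vector (how the topic vectors are produced is left open). The function names and the default weight of 0.1 are arbitrary choices for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_score(tfidf_a, tfidf_b, topic_a, topic_b, topic_weight=0.1):
    """tf-idf similarity plus a low-weight topic similarity.

    `topic_weight` is an assumed value; it would need tuning on real data.
    """
    return cosine(tfidf_a, tfidf_b) + topic_weight * cosine(topic_a, topic_b)
```

With a small weight, the topic term mostly acts as a tie-breaker between papers whose tf-idf similarity is close.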

Integrate with Short Science / GitXiv

arxiv paper ios app

I'm writing an iOS app named RSarXiv; it aims to recommend arxiv papers based on the user's behavior.
Maybe you can try it.
I would be honored if you could give me some advice.
You can search for rsarxiv to find the app in the App Store.
Thanks a lot.

Feature request: alert when paper in my library is replaced with new revision

It would be nice to receive an alert of some form when the authors upload a new revision of a paper I have previously saved in my library.

library not showing my saved papers.

I have saved 74 papers. Yes that is a bit much, but not that much. I thought it would help the recommendation algorithm, and also store papers that looked interesting that I might want to read in the future.

Now arxiv-sanity refuses to show all of my papers. The papers I saved most recently do not appear in the list at all, whereas others do appear but are near the bottom of the list. And scrolling down hits "You hit the limit of number of papers to show is one result. [sic]"

I would like to at least be able to see the papers I saved most recently at the top. Ordering by time saved makes it much easier to find things I saved; it looks like they are instead ordered by the date of the paper itself.
I would also like to be able to see all of my saved papers, so that I can use arxiv-sanity as a store for papers I might want to read. I understand limits might need to exist to prevent abuse, but just storing the IDs of fewer than 100 papers shouldn't be that problematic.

I'm wondering if I now have to go through and unsave every paper and go back to using bookmarks or something. But I really like the convenience of arxiv-sanity, and the ability to take advantage of its recommendation algorithm.

Feature suggestion: search by full text similarity

First of all: I love the web app! I had actually built something similar (PubVis), when a reviewer made me aware of the arxiv sanity preserver. One feature that I had implemented and that I think from your setup you could probably easily add as well is a search using full text similarity. The idea here is that when you start drafting a paper, you want to make absolutely sure you didn't miss any essential references. Instead of conducting multiple keyword searches, with the full text search you can just paste your existing abstract (+ other text) and it is transformed into a tf-idf vector and then used to find related papers by computing the cosine similarity to the existing papers.
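
The idea described above can be sketched in a few lines. This is a toy, standard-library-only illustration (whitespace tokenization, a smoothed idf) and not the code of PubVis or arxiv-sanity; all function names are made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.log((1.0 + n) / (1.0 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(text, idf):
    """Sparse tf-idf vector (dict); terms unseen in the corpus are dropped."""
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(query_text, docs, top_k=3):
    """Return indices of the top_k documents most similar to the pasted text."""
    idf = build_idf(docs)
    query_vec = tfidf_vector(query_text, idf)
    scored = [(cosine(query_vec, tfidf_vector(d, idf)), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by index
    return [i for _, i in scored[:top_k]]
```

In a real deployment the document vectors would be precomputed once rather than rebuilt for every query.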

Can not start webserver. No such table in database.

Hi, sorry for my poor bug report. I'm new to GitHub and such.
I'm trying to use your program with two topics in the astrophysics domain.
Everything processed fine until the webserver-like thing tries to read some tables.

~/arxiv-sanity-preserver ❯❯❯ ./venv/bin/python serve.py --prod
/$HOME/arxiv-sanity-preserver/venv/lib/python2.7/site-packages/flask_limiter/extension.py:124: UserWarning: Use of the default get_ipaddr function is discouraged. Please refer to https://flask-limiter.readthedocs.org/#rate-limit-domain for the recommended configuration
UserWarning
Namespace(num_results=200, port=5000, prod=True)
loading db.p...
loading tfidf_meta.p...
loading sim_dict.p...
loading user_sim.p...
precomputing papers date sorted...
computing top papers...
Traceback (most recent call last):
  File "serve.py", line 415, in <module>
    top_counts = get_popular()
  File "serve.py", line 409, in get_popular
    libs = sqldb.execute('''select * from library''').fetchall()
sqlite3.OperationalError: no such table: library
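
The error says the SQLite database being opened has no `library` table. A small check like the sketch below could confirm that before starting the server. It only tests for the table and does not create it, since the actual schema is not shown in this report; the helper name `table_exists` is hypothetical.

```python
import sqlite3


def table_exists(conn, table_name):
    """Return True if the SQLite database has a table with this name."""
    row = conn.execute(
        "select name from sqlite_master where type='table' and name=?",
        (table_name,),
    ).fetchone()
    return row is not None
```

If the check fails, the database was most likely never initialized with the project's schema.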

Analyzing uses too much memory

Hi,

I'm not sure if this is normal, but analyzing a corpus of 800MB (ca. 16000 articles) runs out of memory on my machine with 8GB of RAM + 2GB of swap. Can someone with a background in data analysis judge if this is expected?

This might be the main obstacle to scaling the database for the physics section of arXiv, as I have only run the analysis on a small portion of it (less than a year for most sections, and not all relevant categories).

I'll try to profile the memory usage, but I hope the attempt isn't futile. :p
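
One thing that might help, depending on where the memory actually goes, is to avoid holding every document's text in a list at once. A minimal sketch, assuming the analysis step can consume an iterable of strings and that the files are UTF-8; the generator name is made up:

```python
import io


def iter_documents(paths):
    """Yield the text of each file one at a time instead of building a list.

    Assumes UTF-8 text. Undecodable bytes are replaced rather than raising.
    """
    for path in paths:
        with io.open(path, "r", encoding="utf-8", errors="replace") as f:
            yield f.read()
```

This only reduces the cost of holding the raw texts. If the vocabulary or the resulting matrix dominates, profiling would show that, and other measures would be needed.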

Recommended papers is empty

I registered a week or so ago and added some papers to my library, but the recommended papers tab is empty.

Any plan for including other fields? (and some suggestion about social features)

I'm from the quantum information field and would be interested in using a similar service.

By the way, we have a crowd-rating website for arXiv papers https://scirate3.herokuapp.com/
It might be interesting to combine these two features.

Add a LICENSE file, please.

Is arxiv-sanity-preserver under an open source license? Can we make modifications and contribute back?

Add SSL/TLS certificate and use secure cookies

Thanks for making this 💪

I think the site would benefit from having security improved. Unfortunately, people have a tendency to re-use passwords, and as of now, the password and the session cookie can be intercepted on the same network and in man-in-the-middle attacks.

Perhaps you can use Certbot (Let's Encrypt) for this?

Add Mathjax/KaTeX to display math content

As the text already contains $ and LaTeX code everywhere, it should be quite simple to display it properly using mathjax, or the very cool KaTeX.

Mathjax instructions: https://docs.mathjax.org/en/v2.6-latest/start.html
KaTeX instructions: https://github.com/Khan/KaTeX/blob/master/contrib/auto-render/README.md

See here for a version of the sanity-preserver that uses Mathjax: https://arxiv.babushk.in/.

Would it be useful to have an extension to this project where you can see the ancestors and predecessors of a research paper?

For instance, I'm reading this paper and I see it referred to ideas from previously published papers. I want to put this paper as a child of those research papers and maintain a tree so that I can keep track of the ideas from the paper in a systematic manner to aid my research. In other words, I want to visualize the path of knowledge that flows from one research paper to another.

More than 20 papers in "Most recent"

Papers often arrive in batches, or sometimes I can't check them for a few days. It would be nice to be able to see a chronological list that spans maybe a week.

PDF link goes to article page, not the PDF

The PDF link on articles goes to the article page, not the PDF. Is this by design?

Switch to Twitter Stream

Hi @karpathy,

A few weeks ago, I forked your code to add Twitter trends. I ended up with a different architecture (I wanted something more robust); anyway, I use postgres and sqlalchemy to record the Twitter stream.

I just open-sourced it, so feel free to use it if you want! It's pretty straightforward if you have a postgres db.
=> https://github.com/BenderV/twitter_stream/tree/arxiv

I also have the same thing (sqlalchemy) set up for arxiv (authors/papers/tags) if you are interested (I just need to do a little work before open-sourcing it).

add support for http://www.jmlr.org/

Can the support for http://www.jmlr.org/ papers be added?

Stemming, ELK

I noticed you started accepting pull requests. I'm adding stemming now. Should I prepare it for you to merge when it's done?

Also, I put the data into dockerized ElasticSearch/Kibana (similar to @rsarxiv's suggestion). Just several lines of code and you've got a nice Kibana GUI for exploration. I found some interesting insights there. Interested in this as well? But it likes good disks, preferably SSD, for indexing.

Any plans for adding feature of voting (e.g. like/dislike) for papers?

Again, would be nice to be able to rank/sort papers in a distributed fashion, e.g. among members of a research group. The more likes a paper collects the more likely it is to be discussed in the next reading group or the like.

Add ontology / connections graph visualization

Hi!
I was recently thinking about a similar service!
What do you think about ontology/connections graph visualization option?