
hypercane's People

Contributors

ato, dependabot[bot], ibnesayeed, machawk1, shawnmjones


hypercane's Issues

Allow users to specify their own boilerplate removal method

Hypercane uses ArticleExtractor from boilerpipe because it works best with sumgrams. In some scenarios this boilerplate removal method produces no output. Allowing the user to specify their own method, and perhaps an ordered list of preferred methods, would go a long way toward addressing this issue.
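
A minimal sketch of an ordered-fallback approach, assuming the boilerpipe Python wrapper's Extractor API; the extract_with_fallback helper and the default method list are hypothetical:

def extract_with_fallback(html, preferred_methods):
    # try each boilerplate removal method in the user's preferred order
    # and return the first non-empty result
    from boilerpipe.extract import Extractor
    for method in preferred_methods:
        text = Extractor(extractor=method, html=html).getText()
        if text and text.strip():
            return text
    return ""

# e.g., fall back to less aggressive extractors when ArticleExtractor produces nothing
content = extract_with_fallback(html, ["ArticleExtractor", "DefaultExtractor", "KeepEverythingExtractor"])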

Fix typo in the DSA1 implementation

After reworking Hypercane to use '.halg'-formatted files as part of the IIPC 2021 Grant work, the DSA1 algorithm implementation is now wrong. We execute the time-slice step twice instead of running the DBSCAN step:

# prevent extra work if we already have it from previous runs
if [ ! -e ${TIME_SLICE_FILE} ]; then
    echo "clustering mementos from remainder by time"
    hc cluster time-slice -i mementos -a ${ONLY_ENGLISH_FILE} -o ${TIME_SLICE_FILE} -l ${TIME_SLICE_LOG}
fi

# apply DBSCAN to cluster by Simhash distance
DBSCAN_FILE=${WORKING_DIRECTORY}/dsa1-dbscan.tsv
DBSCAN_LOG=${WORKING_DIRECTORY}/dsa1-cluster-dbscan.log

# prevent extra work if we already have it from previous runs
if [ ! -e ${DBSCAN_FILE} ]; then
    echo "clustering mementos from remainder by Simhash"
    hc cluster time-slice -i mementos -a ${TIME_SLICE_FILE} -o ${DBSCAN_FILE} -l ${DBSCAN_LOG}
fi

It needs to follow AlNoamany's Algorithm again, as it did during my dissertation work.
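
A minimal sketch of the corrected second step, assuming the DBSCAN clustering is exposed as hc cluster dbscan (the exact subcommand name should be verified against the current CLI):

# apply DBSCAN to cluster by Simhash distance (corrected)
if [ ! -e ${DBSCAN_FILE} ]; then
    echo "clustering mementos from remainder by Simhash"
    hc cluster dbscan -i mementos -a ${TIME_SLICE_FILE} -o ${DBSCAN_FILE} -l ${DBSCAN_LOG}
fi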

Move Hypercane from MongoDB to PostgreSQL for storage and caching

Hypercane uses MongoDB for caching memento content, headers, and derived data. It also uses PostgreSQL as part of its Web User Interface (WUI). Rather than having to install/maintain multiple databases for different purposes, we want to move Hypercane to PostgreSQL for the following reasons.

  1. MongoDB does not install "easily" for some users. I installed it on macOS with Homebrew but had to reinstall it after issues. On Ubuntu/RHEL, the admin needs to add a third-party yum/apt repository to install it. Almost every distro includes PostgreSQL.
  2. I've had issues dumping MongoDB data and restoring it across versions and systems. Sometimes the BSON is corrupted. I'm sure I was supposed to do something on one end or the other, but it seems like SQL databases have an easier time with this.
  3. As Hypercane has matured, I've saved more derived data in the database. The ability to query this with SQL is becoming more and more attractive over time. Such standardization may provide third-party tools with another interface for easy analysis.
  4. A point in favor of MongoDB is that I can shove any data we want into a record and not worry about creating standardized fields. We could achieve something similar with planned foreign keys and relations in SQL at the expense of planning time and schema changes. The truth is that function calls in the code have to correspond to database actions; hence, we will write some queries either way. Moving to PostgreSQL will require that we change the schema for each derived value that we want to store.
  5. For space reasons, a user may want to clear out the memento content and keep the derived data. With MongoDB, we have to save the parts we like, get rid of the whole record, and create a new record. With SQL and a decently designed schema, we can delete the records from the table storing the content.
  6. Another point in favor of MongoDB is its ability to expire records, which we do not currently use, but should. PostgreSQL does not natively support this as far as I can tell, but I can achieve something similar with triggers.
  7. MongoDB has a BSON size limit of 16MB unless I switch to GridFS. PostgreSQL has a maximum field size of 1GB. Currently, Hypercane discards anything over 16MB, which means that some images and other binary files are skipped rather than processed.
  8. Some claim that MongoDB is faster than PostgreSQL, but some studies show that PostgreSQL has caught up. (We need to add these links.) Performance depends on indexing, table structure, and the queries used in the test. We can likely get comparable performance with good database creation scripts.
  9. Some web archiving folks suggested storing the data as WARC/WAT/etc. files and maintaining a CDX instead. This was a good suggestion when we were only caching content, but it does not work as well for querying the derived data. If we store derived data in the CDX, it becomes a table. 🙂
  10. The choice of MongoDB came from needing to handle concurrent writes. Writing a single WARC per memento addresses concurrency but creates many files, and creating a CDX afterward must be timed well. Alternatives, like SQLite, don't handle concurrent writes well either. Database engines, like PostgreSQL or MongoDB, manage this with their own caching, checkpointing, and optimization.
  11. Thanks to the pilot, we have a better idea of the type of data we should store in the database, meaning that we have a better data model moving forward.

With all of this in mind, I will be using this issue to document ER diagrams and other insights as I experiment with this change.

Replace Hypercane GUI's Download button

Wooey has a confusing setup with respect to Downloads. We should replace the Download button with something that allows the user to download the file generated by the given action.

We can force a download rather than a browser render by adding the download attribute to an <a> tag. I'm not sure how to do this with a button, but I think the buttons in Wooey are largely decorative rather than true button tags.
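
A minimal sketch of the anchor-based approach; job.output_file.url is a hypothetical template variable standing in for wherever Wooey exposes the generated file:

<a class="btn btn-primary" href="{{ job.output_file.url }}" download>
<span class="glyphicon glyphicon-download" aria-hidden="true"></span> Download
</a>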

Create a Linux install for Hypercane

This script should take into account the lessons learned from oduwsdl/raintale#19, oduwsdl/raintale#20, oduwsdl/raintale#21.

Ideally, we would create an RPM installer for RedHat-based systems and a DEB installer for Debian-based systems, but that may be too much to test within the duration of the IIPC Grant. A tarball containing the necessary files and an installation script is likely enough. To make it administrator-friendly, we could apply Makeself as well. This way the user can download a single file and execute it, and it will extract our content, execute our script, and start up the Hypercane GUI.

Remove Wooey's Re-run and Resubmit buttons from the Hypercane GUI

They are confusing to us and will likely be so for users. Until we can articulate how to use them, we should remove them.

We just need someone to remove the lines from hypercane-gui/templates/jobs/job_view.html:

<button class="btn btn-primary btn-warning status-completed-toggle status-revoked-toggle status-failure-toggle" name="celery-command" value="rerun" type="submit">
<span class="glyphicon glyphicon-repeat" aria-hidden="true"></span> {% trans "Re-run" %}
</button>
<button class="btn btn-warning" name="celery-command" value="resubmit" type="submit">
<span class="glyphicon glyphicon-repeat" aria-hidden="true"></span> {% trans "Resubmit" %}
</button>

Create a Hypercane GUI convenience script that runs sample, report, and synthesize commands for a Raintale story

The user should be able to run a single script and produce a complex Raintale story JSON file, just as we do with the SHARI process.

In this case, the GUI should allow the user to select the appropriate sampling algorithm, and then the following happens (a sketch of such a script follows the list):

  1. sample from the collection with the algorithm selected by the user
  2. run an entity, sumgram, and image report on the sample
  3. synthesize a Raintale story JSON file from the sample file and reports
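
A minimal sketch of such a convenience script. The subcommand names are borrowed from elsewhere in this tracker where possible; the entity report invocation in particular is a hypothetical stand-in:

#!/bin/bash
algorithm=$1          # sampling algorithm selected by the user in the GUI
input_type=$2
input_argument=$3

hc sample ${algorithm} -i ${input_type} -a ${input_argument} -o sample-mementos.tsv
hc report entities -i mementos -a sample-mementos.tsv -o entity-report.json
hc report terms --use-sumgrams -i mementos -a sample-mementos.tsv -o terms.tsv
hc report image-data -i mementos -a sample-mementos.tsv -o image-report.json
hc synthesize raintale-story -i mementos -a sample-mementos.tsv --imagedata image-report.json --term-report terms.tsv -o story.json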

Uninstalling hypercane doesn't remove the hc script from .local/bin/

  • When uninstalling hypercane,
$ pip3 uninstall hypercane
/usr/lib/python3/dist-packages/secretstorage/dhcrypto.py:15: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
  from cryptography.utils import int_from_bytes
/usr/lib/python3/dist-packages/secretstorage/util.py:19: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
  from cryptography.utils import int_from_bytes
Found existing installation: hypercane 0.2021.3.10.202429
Uninstalling hypercane-0.2021.3.10.202429:
  Would remove:
    /home/marsh/.local/lib/python3.8/site-packages/hypercane
    /home/marsh/.local/lib/python3.8/site-packages/hypercane-0.2021.3.10.202429.egg-info
Proceed (y/n)? y
  Successfully uninstalled hypercane-0.2021.3.10.202429
  • With this, hypercane will be uninstalled but the hc script from .local/bin/ will not get automatically removed.
$ which hc
/home/marsh/.local/bin/hc

$/.local/lib/python3.8/site-packages

$ hc --help
hc (Hypercane) is a framework for building algorithms for sampling mementos from a web archive collection.
It is a complex toolchain requiring a supported action and additional arguments.

For example:
    hc sample dsa1 -i archiveit -a 8778 -o story-mementos.txt

This is the list of supported actions:

    * sample
    * report
    * synthesize
    * identify
    * filter
    * cluster
    * score
    * order

For each of these actions, you can view additional help by typing --help after the action name, for example:
    hc sample --help

$ hc sample --help
Traceback (most recent call last):
  File "/home/marsh/.local/bin/hc", line 54, in <module>
    actionmodule = importlib.import_module(supported_actions[action])
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'hypercane'

Hypercane WUI suspends rendering in Firefox

Due to issues downloading //fonts.googleapis.com/css?family=Pacifico, Firefox does not render Hypercane's WUI for quite a long time.

After removing line 14 from hypercane-gui/templates/base.html, the page loads fine.

Make domains configurable for hc score dsa1-score

Scoring mementos according to Padia's work depends upon lists of domains for the categories of news sites, image sharing sites, video sharing sites, blog sites, and social media sites. These lists will likely change over time and should be configurable by the end user.

The scoring is handled on lines 431 - 451 of hypercane/score/dsa1_ranking.py. We will have to make it generic; a sketch of one possible approach follows the domain lists below.

if domain in blog_sources:
    return 0.4
elif domain in wikipedia_imagesharing_sources:
    return 0.6
elif domain.upper() in pew_news_sources:
    return 0.7
elif domain in w3newspapers_sources:
    return 0.7
elif 'news' in domain:
    return 0.7
elif domain in wikipedia_video_sources:
    return 0.7
elif domain in adobe_socialmedia_sources:
    return 0.5

The lists themselves are on lines 15 - 386 of that same file.

blog_sources = [
'blogger.com',
'blogspot.com',
'wordpress.com',
'typepad.com'
]
# image sharing websites as per https://en.wikipedia.org/wiki/List_of_image-sharing_websites
wikipedia_imagesharing_sources = [
'500px.com',
'album2.com',
'bilddagboken.se',
'myphotodiary.com',
'kuvapaivakirja.fi',
'bildedagboka.no',
'billeddagbog.dk',
'deviantart.com',
'dronestagr.am',
'flickr.com',
'fotki.com',
'fotolog.com',
'fotolog.net',
'geograph.org.uk',
'photos.google.com',
'instagram.com',
'imgur.com',
'ipernity.com',
'jalbum.net',
'photobucket.com',
'pinterest.com',
'pixabay.com',
'securetribeapp.com',
'shutterflyinc.com',
'smugmug.com',
'snapfish.com',
'unsplash.com'
]
# video domains as per https://en.wikipedia.org/wiki/List_of_video_hosting_services#Specifically_dedicated_video_hosting_websites
wikipedia_video_sources = [
'acfun.cn',
'afreecatv.com',
'aparat.com',
'bigo.tv',
'bilibili.com',
'bitchute.com',
'dailymotion.com',
'godtube.com',
'iqiyi.com',
'liveleak.com',
'metacafe.com',
'mixer.com',
'nicovideo.jp',
'periscope.tv',
'rutube.ru',
'schooltube.com',
'smashcast.tv',
'trilulilu.ro',
'tudou.com',
'tune.pk',
'twitch.tv',
'vbox7.com',
'veoh.com',
'vimeo.com',
'youku.com',
'younow.com',
'youtube.com'
]
# social media domains as per https://helpx.adobe.com/analytics/kb/list-social-networks.html
adobe_socialmedia_sources = [
'12seconds.tv',
'4travel.jp',
'advogato.org',
'ameba.jp',
'anobii.com',
'answers.yahoo.com',
'asmallworld.net',
'avforums.com',
'backtype.com',
'badoo.com',
'bebo.com',
'bigadda.com',
'bigtent.com',
'biip.no',
'blackplanet.com',
'blog.seesaa.jp',
'blogspot.com',
'blogster.com',
'blomotion.jp',
'bolt.com',
'brightkite.com',
'buzznet.com',
'cafemom.com',
'care2.com',
'classmates.com',
'cloob.com',
'collegeblender.com',
'cyworld.co.kr',
'cyworld.com.cn',
'dailymotion.com',
'delicious.com',
'deviantart.com',
'digg.com',
'diigo.com',
'disqus.com',
'draugiem.lv',
'facebook.com',
'faceparty.com',
'fc2.com',
'flickr.com',
'flixster.com',
'fotolog.com',
'foursquare.com',
'friendfeed.com',
'friendsreunited.co.uk',
'friendsreunited.com',
'friendster.com',
'fubar.com',
'gaiaonline.com',
'geni.com',
'goodreads.com',
'grono.net',
'habbo.com',
'hatena.ne.jp',
'hi5.com',
'hotnews.infoseek.co.jp',
'hyves.nl',
'ibibo.com',
'identi.ca',
'imeem.com',
'instagram.com',
'intensedebate.com',
'irc-galleria.net',
'iwiw.hu',
'jaiku.com',
'jp.myspace.com',
'kaixin001.com',
'kaixin002.com',
'kakaku.com',
'kanshin.com',
'kozocom.com',
'last.fm',
'linkedin.com',
'livejournal.com',
'lnkd.in',
'matome.naver.jp',
'me2day.net',
'meetup.com',
'mister-wong.com',
'mixi.jp',
'mixx.com',
'mouthshut.com',
'mp.weixin.qq.com',
'multiply.com',
'mumsnet.com',
'myheritage.com',
'mylife.com',
'myspace.com',
'myyearbook.com',
'nasza-klasa.pl',
'netlog.com',
'nettby.no',
'netvibes.com',
'nextdoor.com',
'nicovideo.jp',
'ning.com',
'odnoklassniki.ru',
'ok.ru',
'orkut.com',
'pakila.jp',
'photobucket.com',
'pinterest.at',
'pinterest.be',
'pinterest.ca',
'pinterest.ch',
'pinterest.cl',
'pinterest.co',
'pinterest.co.kr',
'pinterest.co.uk',
'pinterest.com',
'pinterest.de',
'pinterest.dk',
'pinterest.es',
'pinterest.fr',
'pinterest.hu',
'pinterest.ie',
'pinterest.in',
'pinterest.jp',
'pinterest.nz',
'pinterest.ph',
'pinterest.pt',
'pinterest.se',
'plaxo.com',
'plurk.com',
'plus.google.com',
'plus.url.google.com',
'po.st',
'reddit.com',
'renren.com',
'skyrock.com',
'slideshare.net',
'smcb.jp',
'smugmug.com',
'sonico.com',
'studivz.net',
'stumbleupon.com',
't.163.com',
't.co',
't.hexun.com',
't.ifeng.com',
't.people.com.cn',
't.qq.com',
't.sina.com.cn',
't.sohu.com',
'tabelog.com',
'tagged.com',
'taringa.net',
'thefancy.com',
'toutiao.com',
'tripit.com',
'trombi.com',
'trytrend.jp',
'tuenti.com',
'tumblr.com',
'twine.com',
'twitter.com',
'uhuru.jp',
'viadeo.com',
'vimeo.com',
'vk.com',
'wayn.com',
'weibo.com',
'weourfamily.com',
'wer-kennt-wen.de',
'wordpress.com',
'xanga.com',
'xing.com',
'yammer.com',
'yaplog.jp',
'yelp.co.uk',
'yelp.com',
'youku.com',
'youtube.com',
'yozm.daum.net',
'yuku.com',
'zhihu.com',
'zooomr.com'
]
# news domains from https://pewresearch-org-preprod.go-vip.co/journalism/2019/07/23/state-of-the-news-media-methodology/#digital-native-news-outlet-audit
pew_news_sources = [
'12UP.COM',
'247SPORTS.COM',
'90MIN.COM',
'APLUS.COM',
'BGR.COM',
'BLEACHERREPORT.COM',
'BREITBART.COM',
'BUSINESSINSIDER.COM',
'BUSTLE.COM',
'BUZZFEED.COM',
'BUZZFEEDNEWS.COM',
'CHEATSHEET.COM',
'CINEMABLEND.COM',
'CNET.COM',
'COMICBOOK.COM',
'DAILYDOT.COM',
'DEADSPIN.COM',
'DIGITALTRENDS.COM',
'EATER.COM',
'ELITEDAILY.COM',
'ENGADGET.COM',
'FIVETHIRTYEIGHT.COM',
'GAMESPOT.COM',
'GIZMODO.COM',
'HELLOGIGGLES.COM',
'HOLLYWOODLIFE.COM',
'HUFFINGTONPOST.COM',
'IBTIMES.COM',
'IFLSCIENCE.COM',
'IGN.COM',
'IJR.COM',
'IJREVIEW.COM',
'INVESTOPEDIA.COM',
'JEZEBEL.COM',
'MARKETWATCH.COM',
'MASHABLE.COM',
'MAXPREPS.COM',
'MIC.COM',
'OPPOSINGVIEWS.COM',
'POLITICO.COM',
'POLYGON.COM',
'QZ.COM',
'RARE.US',
'RAWSTORY.COM',
'REFINERY29.COM',
'SALON.COM',
'SBNATION.COM',
'SLATE.COM',
'TECHRADAR.COM',
'THEBLAZE.COM',
'THEDAILYBEAST.COM',
'THEROOT.COM',
'THEVERGE.COM',
'THISISINSIDER.COM',
'THRILLIST.COM',
'TMZ.COM',
'TOPIX.COM',
'TOPIX.NET',
'UPROXX.COM',
'UPWORTHY.COM',
'VOX.COM'
]
# sources from https://www.w3newspapers.com/newssites/
w3newspapers_sources = [
'aljazeera.com',
'nytimes.com',
'wsj.com',
'huffpost.com',
'washingtonpost.com',
'latimes.com',
'reuters.com',
'abcnews.go.com',
'usatoday.com',
'bloomberg.com',
'nbcnews.com',
'dailymail.co.uk',
'theguardian.com',
'thesun.co.uk',
'mirror.co.uk',
'telegraph.co.uk',
'bbc.com',
'thestar.com',
'theglobeandmail.com',
'news.com.au',
'forbes.com',
'cnbc.com',
'chinadaily.com.cn',
'chron.com',
'nypost.com',
'usnews.com',
'dw.com',
'indiatimes.com',
'thehindu.com',
'indianexpress.com',
'hindustantimes.com',
'cbsnews.com',
'time.com',
'sfgate.com',
'thehill.com',
'thedailybeast.com',
'newsweek.com',
'theatlantic.com',
'nzherald.co.nz',
'herald.co.zw',
'vanguardngr.com',
'dailysun.co.za',
'thejakartapost.com',
'thestar.com.my',
'straitstimes.com',
'bangkokpost.com',
'japantimes.co.jp',
'thedailystar.net',
'dawn.com',
'alarabiya.net',
'hollywoodreporter.com',
'scmp.com',
'aljazeera.com',
'voanews.com'
]
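
A minimal sketch of what configurability could look like: the category lists loaded from a user-supplied JSON file, replacing the hard-coded if/elif chain with a generic lookup. The file format, the load_domain_categories helper, and the category_scores mapping are all hypothetical:

import json

# hypothetical default scores per category, mirroring the current if/elif chain
category_scores = {
    "blogs": 0.4,
    "image_sharing": 0.6,
    "news": 0.7,
    "video_sharing": 0.7,
    "social_media": 0.5
}

def load_domain_categories(config_path):
    # expects a JSON object mapping category names to lists of domains,
    # e.g., {"blogs": ["blogger.com", "blogspot.com"], "news": ["nytimes.com"]}
    with open(config_path) as f:
        return json.load(f)

def score_domain(domain, domain_categories):
    # categories are checked in insertion order, like the if/elif chain
    for category, domains in domain_categories.items():
        if domain.lower() in (d.lower() for d in domains):
            return category_scores.get(category, 0.0)
    # preserve the existing heuristic for unlisted news-like domains
    if 'news' in domain:
        return 0.7
    return 0.0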

Create Hypercane GUI script for identifying Memento objects based on collection IDs

The existing identify script only handles Memento objects. The command-line version of Hypercane can support files containing URIs (file handles) or collection identifiers (strings). Wooey doesn't support both of these at the same time, so we need to create a separate script that allows the user to execute an identify action and convert a collection identifier into the desired file listing Memento objects.

Add score as a filter

Add a filter that allows the user to specify the score range to include in the output. In case multiple score fields exist in the input, provide the user an argument with which to specify a given field. Perhaps options for upper and lower bounds should be available as well.
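
A hypothetical invocation, borrowing the include-only style used elsewhere in Hypercane; none of these flags exist yet:

# hc filter include-only score-range --score-field "dsa1-score" --lower-bound 0.4 --upper-bound 0.9 -i mementos -a scored-mementos.tsv -o filtered-mementos.tsv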

Make stopword list for hc report terms configurable

The stopwords for hc report terms are currently hardcoded. Even worse, they are hardcoded only in the sumgram code and not in the general n-gram code.

# TODO: load these from a file
added_stopwords = [
"associated press",
"com",
"donald trump",
"fox news",
"abc news",
"getty images",
"last month",
"last week",
"last year",
"pic",
"pinterest reddit",
"pm et",
"president donald",
"president donald trump",
"president trump",
"president trump's",
"print mail",
"reddit print",
"said statement",
"send whatsapp",
"sign up",
"trump administration",
"trump said",
"twitter",
"united states",
"washington post",
"white house",
"whatsapp pinterest",
"subscribe whatsapp",
"york times",
"privacy policy",
"terms use"
]
added_stopwords.append( "{} read".format(last_year) )
added_stopwords.append( "{} read".format(current_year) )
stopmonths = [
"january",
"february",
"march",
"april",
"may",
"june",
"july",
"august",
"september",
"october",
"november",
"december"
]
# add just the month to the stop words
added_stopwords.extend(stopmonths)
stopmonths_short = [
"jan",
"feb",
"mar",
"apr",
"may",
"jun",
"jul",
"aug",
"sep",
"oct",
"nov",
"dec"
]
added_stopwords.extend(stopmonths_short)
# add the day of the week, too
added_stopwords.extend([
"monday",
"tuesday",
"wednesday",
"thursday",
"friday",
"saturday",
"sunday"
])
added_stopwords.extend([
"mon",
"tue",
"wed",
"thu",
"fri",
"sat",
"sun"
])
# for i in range(1, 13):
# added_stopwords.append(
# datetime(current_year, i, current_date).strftime('%b %Y')
# )
# added_stopwords.append(
# datetime(last_year, i, current_date).strftime('%b %Y')
# )
# for i in range(1, 13):
# added_stopwords.append(
# datetime(current_year, i, current_date).strftime('%B %Y')
# )
# added_stopwords.append(
# datetime(last_year, i, current_date).strftime('%B %Y')
# )

The generic terms report will need to accept the same stopword list in get_document_tokens:

def get_document_tokens(urim, cache_storage, ngram_length):
    from hypercane.utils import get_boilerplate_free_content
    from nltk.corpus import stopwords
    from nltk import word_tokenize, ngrams
    import string

    # TODO: stoplist based on language of the document
    stoplist = list(set(stopwords.words('english')))
    punctuation = [ i for i in string.punctuation ]
    additional_stopchars = [ '’', '‘', '“', '”', '•', '·', '—', '–', '›', '»' ]
    stop_numbers = [ str(i) for i in range(0, 11) ]
    allstop = stoplist + punctuation + additional_stopchars + stop_numbers

    content = get_boilerplate_free_content(urim, cache_storage=cache_storage)
    doc_tokens = word_tokenize(content.decode('utf8').lower())
    doc_tokens = [ token for token in doc_tokens if token not in allstop ]
    table = str.maketrans('', '', string.punctuation)
    doc_tokens = [ w.translate(table) for w in doc_tokens ]
    doc_tokens = [ w for w in doc_tokens if len(w) > 0 ]
    doc_ngrams = ngrams(doc_tokens, ngram_length)

    return list(doc_ngrams)
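
A minimal sketch of a file-based alternative; the load_added_stopwords helper and its file format are hypothetical:

def load_added_stopwords(stopword_file):
    # one stopword or stop phrase per line; blank lines and comments are ignored
    added_stopwords = []
    with open(stopword_file) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith('#'):
                added_stopwords.append(line.lower())
    return added_stopwords

get_document_tokens would then accept an additional added_stopwords argument and extend allstop with it, so the sumgram and generic n-gram paths share one configurable list.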

Add the ability to only use the cache

Some users may have built a cache from prior runs and may not want to issue new HTTP requests that add to it. We may not be able to force non-network access for caches supplied via environment variables like HTTPS_PROXY, but the faster MongoDB cache used by requests-cache can be overridden so that cache misses do not trigger a network connection.

Create a new class named OnlyCachedSession that is a child of CachedSession. This class will skip the network connections provided by requests altogether.

Some code below that has worked in testing:

from requests.hooks import dispatch_hook
from requests_cache import CachedSession

class FailedCacheResponse(Exception):
    pass

class OnlyCachedSession(CachedSession):

    def send(self, request, **kwargs):

        cache_key = self.cache.create_key(request)

        # retained from CachedSession's implementation, but unused here:
        # cache misses raise FailedCacheResponse instead of touching the network
        def send_request_and_cache_response():
            response = super(CachedSession, self).send(request, **kwargs)
            if response.status_code in self._cache_allowable_codes:
                self.cache.save_response(cache_key, response)
            response.from_cache = False
            return response

        try:
            response, timestamp = self.cache.get_response_and_time(cache_key)
        except (ImportError, TypeError):
            raise FailedCacheResponse(
                "Import/Type Errors : could not get response and time : item {} is not in the cache".format(cache_key)
            )

        if response is None:
            raise FailedCacheResponse(
                "response is None : could not get response and time : item {} is not in the cache".format(cache_key)
            )

        # dispatch hook here, because we've removed it before pickling
        response.from_cache = True
        response = dispatch_hook('response', request.hooks, response, **kwargs)
        return response
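
Hypothetical usage, assuming the requests-cache MongoDB backend and a cache name chosen for illustration:

session = OnlyCachedSession('hypercane_cache', backend='mongodb')

try:
    response = session.get('https://example.org/some-memento')
except FailedCacheResponse:
    # the URI-M was never cached; no network request is attempted
    pass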

Update Hypercane to accept NLA collection identifiers as input for discovering mementos

Hypercane currently accepts an input type of archiveit and a number as the input argument identifying the Archive-It collection. We want to do the same for NLA, so someone can type Hypercane commands like:

# hc identify -i nla -a 13000 ...

Once this AIU issue is complete, Hypercane will be able to acquire metadata and a list of URI-Ms for each NLA collection. We just need to connect all of this together into a new input type.

We will likely need to add NLA functions that work like generate_archiveit_urits.

We will also need to update discover_mementos_by_input_type, discover_timemaps_by_input_type, and discover_original_resources_by_input_type to support a new input type of nla.

I think these are the only changes needed, but we will need to test to make sure.

Update Image Report to Score Images From Metadata Higher

The current scores produced by hc report image-data are not as effective as they could be. Humans may have already supplied their desired striking images in the metadata of the web pages making up the collection.

Hypercane's existing image scoring function in hypercane/report/imagedata.py:rank_images currently adds image properties to a list on lines 143 - 152

imageranking.append(
    (
        score,
        pixelsize,
        colorcount,
        1 / ratio,
        noverN,
        image_urim
    )
)

Add another column to the left containing values of 1 or 0: if Hypercane discovered the image in the metadata, set this column to 1; otherwise, set it to 0. This way, when the sorting occurs on line 154, all images discovered in the metadata will occupy the highest ranks in the list and will then be sorted by their MementoEmbed score.
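
A minimal sketch of the change; found_in_metadata is a hypothetical boolean computed while gathering the image's properties:

imageranking.append(
    (
        1 if found_in_metadata else 0,  # images discovered in metadata sort ahead of the rest
        score,
        pixelsize,
        colorcount,
        1 / ratio,
        noverN,
        image_urim
    )
)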

Improve the HALG file format

In v0.5, we introduced the HALG file format for executing Hypercane recipes.

To start, Hypercane needs two bash functions to make HALG more compact: cache_hc and move_output.

Consider the following script, simplified to illustrate a point:

#!/bin/bash

input_type=$1
input_argument=$2
working_directory=$3
output_file=$4

cd ${working_directory}

if [ ! -e identified-mementos.tsv ]; then
  hc identify mementos -i $input_type -a $input_argument -o identified-mementos.tsv
fi

if [ ! -e sample-mementos.tsv ]; then
  hc sample true-random -k 2000 -i mementos -a identified-mementos.tsv -o sample-mementos.tsv
fi

if [ ! -e image-report.json ]; then
  hc report imagedata -i mementos -a identified-mementos.tsv -o image-report.json
fi

if [ ! -e terms.tsv ]; then
  hc report terms -i mementos -a identified-mementos.tsv -o terms.tsv
fi

if [ ! -e story.json ]; then
  hc synthesize raintale-story -i mementos -a identified-mementos.tsv --imagedata image-report.json --term-report terms.tsv -o story.json
fi

cp ${working_directory}/story.json ${output_file}

which could be simplified to something like this:

#!/bin/bash

input_type=$1
input_argument=$2
working_directory=$3
output_file=$4

function cache_hc() { ... }

function move_output() { ... }

cache_hc "identify mementos" "${input_type}=${input_argument}" "694-mementos.tsv"
cache_hc "sample true-random -k 2000" "694-mementos.tsv" "sample-mementos.tsv"
cache_hc "report imagedata" "sample-mementos.tsv" "image-report.json"
cache_hc "report terms --use-sumgrams" "sample-mementos.tsv" "terms.tsv"
cache_hc "synthesize raintale-story --imagedata image-report.json --term-report terms.tsv" "sample-mementos.tsv" "story.json"
move_output "story.json" "${output_file}"

and we can even make the cache_hc and move_output functions available as part of the Hypercane installation itself.
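
A minimal sketch of what these helpers might look like; the argument conventions are hypothetical and only intended to match the calls above:

function cache_hc() {
    local hc_command="$1"    # e.g., "identify mementos"
    local input_spec="$2"    # either "type=argument" or an existing mementos file
    local output_file="$3"

    # prevent extra work if we already have the output from previous runs
    if [ ! -e "${output_file}" ]; then
        if [ -e "${input_spec}" ]; then
            hc ${hc_command} -i mementos -a "${input_spec}" -o "${output_file}"
        else
            hc ${hc_command} -i "${input_spec%%=*}" -a "${input_spec#*=}" -o "${output_file}"
        fi
    fi
}

function move_output() {
    cp "$1" "$2"
}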

This issue is the start of a conversation/documentation of thinking about this idea with the goal of making HALG more applicable as v0.6 development unfolds.

We also need to document HALG. So far, it differs from a regular shell script by offering comments in the following format:

#!/bin/bash
# algorithm name: DSA1
# algorithm description: An implementation of the algorithm from AlNoamany's dissertation.

These comments are used by Hypercane when displaying the possible algorithms. Getting HALG straight is an important step toward the Recipe Builder.

Duplicate error/help message in CLI

When running the hc command with an unsupported action, the error/help message is printed twice:

$ hc foo
ERROR: unsupported action foo

hc (Hypercane) is a complex toolchain requiring a supported action and additional arguments

For example:
    hc sample dsa1 -i archiveit -a 8778 -o story-mementos.txt

    Supported actions:

    * sample
    * report
    * synthesize
    * identify
    * filter
    * cluster
    * score
    * order

    For each of these actions, you can view additional help by typing --help after the action name, for example:
    hc sample --help

ERROR: unsupported action foo

hc (Hypercane) is a complex toolchain requiring a supported action and additional arguments

For example:
    hc sample dsa1 -i archiveit -a 8778 -o story-mementos.txt

    Supported actions:

    * sample
    * report
    * synthesize
    * identify
    * filter
    * cluster
    * score
    * order

    For each of these actions, you can view additional help by typing --help after the action name, for example:
    hc sample --help

Synthesize warc using regular vs raw stream

The synthesize warcs command will unintentionally switch back to the original stream instead of the raw stream. The bug seems to be resolved by making deep copies of all variables derived from the original stream.

Affected lines in hypercane/hypercane/synthesize/warcs.py:
76 - headers_list = copy.deepcopy(resp.raw.headers.items())
81 - warc_target_uri = str(resp.links[link]['url'])
88 - mdt = str(resp.headers['memento-datetime'])

Add a command for managing the cache

A command that allows the user to manage the cache would be very helpful after we have implemented #65.

I'm envisioning something like the following:

This command would list all URIs in the cache:

# hc-cache list-uris -o all-uris.txt

This command would purge all cache tables:

# hc-cache purge-all

This command would only purge the memento URI-Ms in the list:

# hc-cache purge -i memento-urims.txt

This command would only purge the cached content of memento URI-Ms, but leave the derived data:

# hc-cache purge -i memento-urims.txt --only-content

This command would preload the cache with a list of URIs:

# hc-cache preload -i uris.txt

This would export the cache into some (to be determined) file format:

# hc-cache export -o exported-cache-data.dat

Likewise, we can load the cache using some (to be determined) file format:

# hc-cache import -i some-elses-cache-data.dat

As time goes on, I'm sure I can think of other things.

Make domain category lists for hc score dsa1-score configurable.

As part of the DSA1 scoring equation, the original resource domain of the memento is given a different score based on its category according to Padia's 2012 work. Padia outlined the following categories:

  • news sources
  • image sharing sites
  • video sharing sites
  • blog sites
  • social media sites

Right now these domain lists are hard-coded and likely to change over time. Create a parameter that allows the user to supply them.

Synthesize Action: TypeError On Docker

When using the synthesize action on Docker, a TypeError occurs ("TypeError: 'Namespace' object is not iterable").

The command I used is listed below:
hc synthesize warcs -i archiveit -a 7760 -o South_Louisiana_Flood

(screenshot: synthesize action TypeError on Docker for Windows)

Utilize add_subparsers to unify CLI

It looks like argparse is being used in the individual actions, which are then called from bin/hc, where the top-level command (i.e., hc) implements argument parsing manually. We could perhaps use add_subparsers in the entrypoint script to leverage the built-in capabilities of the standard argument parsing package, as sketched below. We have used this technique in some of the other WSDL projects.
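
A minimal sketch of the add_subparsers approach; the action modules and their argument registration are hypothetical stand-ins for Hypercane's real ones:

import argparse

parser = argparse.ArgumentParser(
    prog='hc',
    description='hc (Hypercane) is a framework for building algorithms '
                'for sampling mementos from a web archive collection.'
)
subparsers = parser.add_subparsers(dest='action', required=True)

# each action registers its own subparser; 'sample' is shown as an example
sample_parser = subparsers.add_parser('sample', help='sample mementos from a collection')
sample_parser.add_argument('algorithm', help='the sampling algorithm to execute, e.g., dsa1')
sample_parser.add_argument('-i', dest='input_type', required=True)
sample_parser.add_argument('-a', dest='input_argument', required=True)
sample_parser.add_argument('-o', dest='output_file', required=True)
sample_parser.set_defaults(func=lambda args: print('sample action stub'))

args = parser.parse_args()
args.func(args)  # dispatch to the module registered for the chosen action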
