
hypercane's People

Contributors

ato, dependabot[bot], ibnesayeed, machawk1, shawnmjones


hypercane's Issues

Allow users to specify their own boilerplate removal method

Hypercane uses ArticleExtractor from boilerpipe because it works best with sumgrams. In some scenarios this boilerplate removal method produces no output. Allowing the user to specify their own method, and perhaps an ordered list of preferred methods, would go a long way toward addressing this issue.
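
A minimal sketch of an ordered-fallback approach, assuming the boilerpipe Python wrapper's Extractor API; the extract_with_fallback helper and the default method list are hypothetical:

def extract_with_fallback(html, preferred_methods):
    # try each boilerplate removal method in the user's preferred order
    # and return the first non-empty result
    from boilerpipe.extract import Extractor
    for method in preferred_methods:
        text = Extractor(extractor=method, html=html).getText()
        if text and text.strip():
            return text
    return ""

# e.g., fall back to less aggressive extractors when ArticleExtractor produces nothing
content = extract_with_fallback(html, ["ArticleExtractor", "DefaultExtractor", "KeepEverythingExtractor"])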

Fix typo in the DSA1 implementation

After reworking Hypercane to use '.halg'-formatted files as part of the IIPC 2021 Grant work, the DSA1 algorithm implementation is now wrong. We execute the time-slice step twice instead of running the DBSCAN step:

# prevent extra work if we already have it from previous runs
if [ ! -e ${TIME_SLICE_FILE} ]; then
    echo "clustering mementos from remainder by time"
    hc cluster time-slice -i mementos -a ${ONLY_ENGLISH_FILE} -o ${TIME_SLICE_FILE} -l ${TIME_SLICE_LOG}
fi

# apply DBSCAN to cluster by Simhash distance
DBSCAN_FILE=${WORKING_DIRECTORY}/dsa1-dbscan.tsv
DBSCAN_LOG=${WORKING_DIRECTORY}/dsa1-cluster-dbscan.log

# prevent extra work if we already have it from previous runs
if [ ! -e ${DBSCAN_FILE} ]; then
    echo "clustering mementos from remainder by Simhash"
    hc cluster time-slice -i mementos -a ${TIME_SLICE_FILE} -o ${DBSCAN_FILE} -l ${DBSCAN_LOG}
fi

It needs to follow AlNoamany's Algorithm again, as it did during my dissertation work.
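
A minimal sketch of the corrected second step, assuming the DBSCAN clustering is exposed as hc cluster dbscan (the exact subcommand name should be verified against the current CLI):

# apply DBSCAN to cluster by Simhash distance (corrected)
if [ ! -e ${DBSCAN_FILE} ]; then
    echo "clustering mementos from remainder by Simhash"
    hc cluster dbscan -i mementos -a ${TIME_SLICE_FILE} -o ${DBSCAN_FILE} -l ${DBSCAN_LOG}
fi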

Move Hypercane from MongoDB to PostgreSQL for storage and caching

Hypercane uses MongoDB for caching memento content, headers, and derived data. It also uses PostgreSQL as part of its Web User Interface (WUI). Rather than having to install/maintain multiple databases for different purposes, we want to move Hypercane to PostgreSQL for the following reasons.

  1. MongoDB does not install "easily" for some users. I installed it on macOS with Homebrew but had to reinstall it after issues. On Ubuntu/RHEL, the admin needs to add a third-party yum/apt repository to install it. Almost every distro includes PostgreSQL.
  2. I've had issues dumping MongoDB data and restoring it across versions and systems. Sometimes the BSON is corrupted. I'm sure I was supposed to do something on one end or the other, but it seems like SQL databases have an easier time with this.
  3. As Hypercane has matured, I've saved more derived data in the database. The ability to query this with SQL is becoming more and more attractive over time. Such standardization may provide third-party tools with another interface for easy analysis.
  4. A point in favor of MongoDB is that I can shove any data we want into a record and not worry about creating standardized fields. We could achieve something similar with planned foreign keys and relations in SQL at the expense of planning time and schema changes. The truth is that function calls in the code have to correspond to database actions; hence, we will write some queries either way. Moving to PostgreSQL will require that we change the schema for each derived value that we want to store.
  5. For space reasons, a user may want to clear out the memento content and keep the derived data. With MongoDB, we have to save the parts we like, get rid of the whole record, and create a new record. With SQL and a decently designed schema, we can delete the records from the table storing the content.
  6. Another point in favor of MongoDB is its ability to expire records, which we do not currently use, but should. PostgreSQL does not natively support this as far as I can tell, but I can achieve something similar with triggers.
  7. MongoDB has a BSON size limit of 16MB unless I switch to GridFS. PostgreSQL has a maximum field size of 1GB. Currently, Hypercane discards anything over 16MB, which means that some images and other binary files are skipped rather than processed.
  8. Some claim that MongoDB is faster than PostgreSQL, but some studies show that PostgreSQL has caught up. (We need to add these links.) Performance depends on indexing, table structure, and the queries used in the test. We can likely get comparable performance with good database creation scripts.
  9. Some web archiving folks suggested storing the data as WARC/WAT/etc. files and maintaining a CDX instead. This was a good suggestion when we were only caching content, but it does not work as well for querying the derived data. If we store derived data in the CDX, it becomes a table. 🙂
  10. The choice of MongoDB came from needing to handle concurrent writes. Writing a single WARC per memento addresses concurrency but creates many files, and creating a CDX afterward must be timed well. Alternatives, like SQLite, don't handle concurrent writes well either. Database engines, like PostgreSQL or MongoDB, manage this with their own caching, checkpointing, and optimization.
  11. Thanks to the pilot, we have a better idea of the type of data we should store in the database, meaning that we have a better data model moving forward.

With all of this in mind, I will be using this issue to document ER diagrams and other insights as I experiment with this change.

Replace Hypercane GUI's Download button

Wooey has a confusing setup with respect to Downloads. We should replace the Download button with something that allows the user to download the file generated by the given action.

We can force a download rather than a browser render by adding the download attribute to an <a> tag. I'm not sure how to do this with a button, but I think the buttons in Wooey are largely decorative rather than true button tags.
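
A minimal sketch of the anchor-based approach; job.output_file.url is a hypothetical template variable standing in for wherever Wooey exposes the generated file:

<a class="btn btn-primary" href="{{ job.output_file.url }}" download>
<span class="glyphicon glyphicon-download" aria-hidden="true"></span> Download
</a>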

Create a Linux install for Hypercane

This script should take into account the lessons learned from oduwsdl/raintale#19, oduwsdl/raintale#20, oduwsdl/raintale#21.

Ideally, we would create an RPM installer for RedHat-based systems and a DEB installer for Debian-based systems, but that may be too much to test within the duration of the IIPC Grant. A tarball containing the necessary files and an installation script is likely enough. To make it administrator-friendly, we could apply Makeself as well. This way the user can download a single file and execute it, and it will extract our content, execute our script, and start up the Hypercane GUI.

Remove Wooey's Re-run and Resubmit buttons from the Hypercane GUI

They are confusing to us and will likely be so for users. Until we can articulate how to use them, we should remove them.

We just need someone to remove the lines from hypercane-gui/templates/jobs/job_view.html:

<button class="btn btn-primary btn-warning status-completed-toggle status-revoked-toggle status-failure-toggle" name="celery-command" value="rerun" type="submit">
<span class="glyphicon glyphicon-repeat" aria-hidden="true"></span> {% trans "Re-run" %}
</button>
<button class="btn btn-warning" name="celery-command" value="resubmit" type="submit">
<span class="glyphicon glyphicon-repeat" aria-hidden="true"></span> {% trans "Resubmit" %}
</button>

Create a Hypercane GUI convenience script that runs sample, report, and synthesize commands for a Raintale story

The user should be able to run a single script and produce a complex Raintale story JSON file, just as we do with the SHARI process.

In this case, the GUI should allow the user to select the appropriate sampling algorithm, and then the following happens (a sketch of such a script follows the list):

  1. sample from the collection with the algorithm selected by the user
  2. run an entity, sumgram, and image report on the sample
  3. synthesize a Raintale story JSON file from the sample file and reports
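
A minimal sketch of such a convenience script. The subcommand names are borrowed from elsewhere in this tracker where possible; the entity report invocation in particular is a hypothetical stand-in:

#!/bin/bash
algorithm=$1          # sampling algorithm selected by the user in the GUI
input_type=$2
input_argument=$3

hc sample ${algorithm} -i ${input_type} -a ${input_argument} -o sample-mementos.tsv
hc report entities -i mementos -a sample-mementos.tsv -o entity-report.json
hc report terms --use-sumgrams -i mementos -a sample-mementos.tsv -o terms.tsv
hc report image-data -i mementos -a sample-mementos.tsv -o image-report.json
hc synthesize raintale-story -i mementos -a sample-mementos.tsv --imagedata image-report.json --term-report terms.tsv -o story.json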

Uninstalling hypercane doesn't remove the hc script from .local/bin/

  • When uninstalling hypercane,
$ pip3 uninstall hypercane
/usr/lib/python3/dist-packages/secretstorage/dhcrypto.py:15: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
  from cryptography.utils import int_from_bytes
/usr/lib/python3/dist-packages/secretstorage/util.py:19: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
  from cryptography.utils import int_from_bytes
Found existing installation: hypercane 0.2021.3.10.202429
Uninstalling hypercane-0.2021.3.10.202429:
  Would remove:
    /home/marsh/.local/lib/python3.8/site-packages/hypercane
    /home/marsh/.local/lib/python3.8/site-packages/hypercane-0.2021.3.10.202429.egg-info
Proceed (y/n)? y
  Successfully uninstalled hypercane-0.2021.3.10.202429
  • With this, hypercane will be uninstalled but the hc script from .local/bin/ will not get automatically removed.
$ which hc
/home/marsh/.local/bin/hc

$/.local/lib/python3.8/site-packages

$ hc --help
hc (Hypercane) is a framework for building algorithms for sampling mementos from a web archive collection.
It is a complex toolchain requiring a supported action and additional arguments.

For example:
    hc sample dsa1 -i archiveit -a 8778 -o story-mementos.txt

This is the list of supported actions:

    * sample
    * report
    * synthesize
    * identify
    * filter
    * cluster
    * score
    * order

For each of these actions, you can view additional help by typing --help after the action name, for example:
    hc sample --help

$ hc sample --help
Traceback (most recent call last):
  File "/home/marsh/.local/bin/hc", line 54, in <module>
    actionmodule = importlib.import_module(supported_actions[action])
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'hypercane'

Hypercane WUI suspends rendering in Firefox

Due to issues downloading //fonts.googleapis.com/css?family=Pacifico, Firefox does not render Hypercane's WUI for quite a long time.

After removing line 14 from hypercane-gui/templates/base.html, the page loads fine.

Make domains configurable for hc score dsa1-score

Scoring mementos according to Padia's work depends upon lists of domains for the categories of news sites, image sharing sites, video sharing sites, blog sites, and social media sites. These lists will likely change over time and should be configurable by the end user.

The scoring is handled on lines 431 - 451 of hypercane/score/dsa1_ranking.py. We will have to make it generic; a sketch of one possible approach follows the domain lists below.

if domain in blog_sources:
    return 0.4
elif domain in wikipedia_imagesharing_sources:
    return 0.6
elif domain.upper() in pew_news_sources:
    return 0.7
elif domain in w3newspapers_sources:
    return 0.7
elif 'news' in domain:
    return 0.7
elif domain in wikipedia_video_sources:
    return 0.7
elif domain in adobe_socialmedia_sources:
    return 0.5

The lists themselves are on lines 15 - 386 of that same file.

blog_sources = [
'blogger.com',
'blogspot.com',
'wordpress.com',
'typepad.com'
]
# image sharing websites as per https://en.wikipedia.org/wiki/List_of_image-sharing_websites
wikipedia_imagesharing_sources = [
'500px.com',
'album2.com',
'bilddagboken.se',
'myphotodiary.com',
'kuvapaivakirja.fi',
'bildedagboka.no',
'billeddagbog.dk',
'deviantart.com',
'dronestagr.am',
'flickr.com',
'fotki.com',
'fotolog.com',
'fotolog.net',
'geograph.org.uk',
'photos.google.com',
'instagram.com',
'imgur.com',
'ipernity.com',
'jalbum.net',
'photobucket.com',
'pinterest.com',
'pixabay.com',
'securetribeapp.com',
'shutterflyinc.com',
'smugmug.com',
'snapfish.com',
'unsplash.com'
]
# video domains as per https://en.wikipedia.org/wiki/List_of_video_hosting_services#Specifically_dedicated_video_hosting_websites
wikipedia_video_sources = [
'acfun.cn',
'afreecatv.com',
'aparat.com',
'bigo.tv',
'bilibili.com',
'bitchute.com',
'dailymotion.com',
'godtube.com',
'iqiyi.com',
'liveleak.com',
'metacafe.com',
'mixer.com',
'nicovideo.jp',
'periscope.tv',
'rutube.ru',
'schooltube.com',
'smashcast.tv',
'trilulilu.ro',
'tudou.com',
'tune.pk',
'twitch.tv',
'vbox7.com',
'veoh.com',
'vimeo.com',
'youku.com',
'younow.com',
'youtube.com'
]
# social media domains as per https://helpx.adobe.com/analytics/kb/list-social-networks.html
adobe_socialmedia_sources = [
'12seconds.tv',
'4travel.jp',
'advogato.org',
'ameba.jp',
'anobii.com',
'answers.yahoo.com',
'asmallworld.net',
'avforums.com',
'backtype.com',
'badoo.com',
'bebo.com',
'bigadda.com',
'bigtent.com',
'biip.no',
'blackplanet.com',
'blog.seesaa.jp',
'blogspot.com',
'blogster.com',
'blomotion.jp',
'bolt.com',
'brightkite.com',
'buzznet.com',
'cafemom.com',
'care2.com',
'classmates.com',
'cloob.com',
'collegeblender.com',
'cyworld.co.kr',
'cyworld.com.cn',
'dailymotion.com',
'delicious.com',
'deviantart.com',
'digg.com',
'diigo.com',
'disqus.com',
'draugiem.lv',
'facebook.com',
'faceparty.com',
'fc2.com',
'flickr.com',
'flixster.com',
'fotolog.com',
'foursquare.com',
'friendfeed.com',
'friendsreunited.co.uk',
'friendsreunited.com',
'friendster.com',
'fubar.com',
'gaiaonline.com',
'geni.com',
'goodreads.com',
'grono.net',
'habbo.com',
'hatena.ne.jp',
'hi5.com',
'hotnews.infoseek.co.jp',
'hyves.nl',
'ibibo.com',
'identi.ca',
'imeem.com',
'instagram.com',
'intensedebate.com',
'irc-galleria.net',
'iwiw.hu',
'jaiku.com',
'jp.myspace.com',
'kaixin001.com',
'kaixin002.com',
'kakaku.com',
'kanshin.com',
'kozocom.com',
'last.fm',
'linkedin.com',
'livejournal.com',
'lnkd.in',
'matome.naver.jp',
'me2day.net',
'meetup.com',
'mister-wong.com',
'mixi.jp',
'mixx.com',
'mouthshut.com',
'mp.weixin.qq.com',
'multiply.com',
'mumsnet.com',
'myheritage.com',
'mylife.com',
'myspace.com',
'myyearbook.com',
'nasza-klasa.pl',
'netlog.com',
'nettby.no',
'netvibes.com',
'nextdoor.com',
'nicovideo.jp',
'ning.com',
'odnoklassniki.ru',
'ok.ru',
'orkut.com',
'pakila.jp',
'photobucket.com',
'pinterest.at',
'pinterest.be',
'pinterest.ca',
'pinterest.ch',
'pinterest.cl',
'pinterest.co',
'pinterest.co.kr',
'pinterest.co.uk',
'pinterest.com',
'pinterest.de',
'pinterest.dk',
'pinterest.es',
'pinterest.fr',
'pinterest.hu',
'pinterest.ie',
'pinterest.in',
'pinterest.jp',
'pinterest.nz',
'pinterest.ph',
'pinterest.pt',
'pinterest.se',
'plaxo.com',
'plurk.com',
'plus.google.com',
'plus.url.google.com',
'po.st',
'reddit.com',
'renren.com',
'skyrock.com',
'slideshare.net',
'smcb.jp',
'smugmug.com',
'sonico.com',
'studivz.net',
'stumbleupon.com',
't.163.com',
't.co',
't.hexun.com',
't.ifeng.com',
't.people.com.cn',
't.qq.com',
't.sina.com.cn',
't.sohu.com',
'tabelog.com',
'tagged.com',
'taringa.net',
'thefancy.com',
'toutiao.com',
'tripit.com',
'trombi.com',
'trytrend.jp',
'tuenti.com',
'tumblr.com',
'twine.com',
'twitter.com',
'uhuru.jp',
'viadeo.com',
'vimeo.com',
'vk.com',
'wayn.com',
'weibo.com',
'weourfamily.com',
'wer-kennt-wen.de',
'wordpress.com',
'xanga.com',
'xing.com',
'yammer.com',
'yaplog.jp',
'yelp.co.uk',
'yelp.com',
'youku.com',
'youtube.com',
'yozm.daum.net',
'yuku.com',
'zhihu.com',
'zooomr.com'
]
# news domains from https://pewresearch-org-preprod.go-vip.co/journalism/2019/07/23/state-of-the-news-media-methodology/#digital-native-news-outlet-audit
pew_news_sources = [
'12UP.COM',
'247SPORTS.COM',
'90MIN.COM',
'APLUS.COM',
'BGR.COM',
'BLEACHERREPORT.COM',
'BREITBART.COM',
'BUSINESSINSIDER.COM',
'BUSTLE.COM',
'BUZZFEED.COM',
'BUZZFEEDNEWS.COM',
'CHEATSHEET.COM',
'CINEMABLEND.COM',
'CNET.COM',
'COMICBOOK.COM',
'DAILYDOT.COM',
'DEADSPIN.COM',
'DIGITALTRENDS.COM',
'EATER.COM',
'ELITEDAILY.COM',
'ENGADGET.COM',
'FIVETHIRTYEIGHT.COM',
'GAMESPOT.COM',
'GIZMODO.COM',
'HELLOGIGGLES.COM',
'HOLLYWOODLIFE.COM',
'HUFFINGTONPOST.COM',
'IBTIMES.COM',
'IFLSCIENCE.COM',
'IGN.COM',
'IJR.COM',
'IJREVIEW.COM',
'INVESTOPEDIA.COM',
'JEZEBEL.COM',
'MARKETWATCH.COM',
'MASHABLE.COM',
'MAXPREPS.COM',
'MIC.COM',
'OPPOSINGVIEWS.COM',
'POLITICO.COM',
'POLYGON.COM',
'QZ.COM',
'RARE.US',
'RAWSTORY.COM',
'REFINERY29.COM',
'SALON.COM',
'SBNATION.COM',
'SLATE.COM',
'TECHRADAR.COM',
'THEBLAZE.COM',
'THEDAILYBEAST.COM',
'THEROOT.COM',
'THEVERGE.COM',
'THISISINSIDER.COM',
'THRILLIST.COM',
'TMZ.COM',
'TOPIX.COM',
'TOPIX.NET',
'UPROXX.COM',
'UPWORTHY.COM',
'VOX.COM'
]
# sources from https://www.w3newspapers.com/newssites/
w3newspapers_sources = [
'aljazeera.com',
'nytimes.com',
'wsj.com',
'huffpost.com',
'washingtonpost.com',
'latimes.com',
'reuters.com',
'abcnews.go.com',
'usatoday.com',
'bloomberg.com',
'nbcnews.com',
'dailymail.co.uk',
'theguardian.com',
'thesun.co.uk',
'mirror.co.uk',
'telegraph.co.uk',
'bbc.com',
'thestar.com',
'theglobeandmail.com',
'news.com.au',
'forbes.com',
'cnbc.com',
'chinadaily.com.cn',
'chron.com',
'nypost.com',
'usnews.com',
'dw.com',
'indiatimes.com',
'thehindu.com',
'indianexpress.com',
'hindustantimes.com',
'cbsnews.com',
'time.com',
'sfgate.com',
'thehill.com',
'thedailybeast.com',
'newsweek.com',
'theatlantic.com',
'nzherald.co.nz',
'herald.co.zw',
'vanguardngr.com',
'dailysun.co.za',
'thejakartapost.com',
'thestar.com.my',
'straitstimes.com',
'bangkokpost.com',
'japantimes.co.jp',
'thedailystar.net',
'dawn.com',
'alarabiya.net',
'hollywoodreporter.com',
'scmp.com',
'aljazeera.com',
'voanews.com'
]
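
A minimal sketch of what configurability could look like: the category lists loaded from a user-supplied JSON file, replacing the hard-coded if/elif chain with a generic lookup. The file format, the load_domain_categories helper, and the category_scores mapping are all hypothetical:

import json

# hypothetical default scores per category, mirroring the current if/elif chain
category_scores = {
    "blogs": 0.4,
    "image_sharing": 0.6,
    "news": 0.7,
    "video_sharing": 0.7,
    "social_media": 0.5
}

def load_domain_categories(config_path):
    # expects a JSON object mapping category names to lists of domains,
    # e.g., {"blogs": ["blogger.com", "blogspot.com"], "news": ["nytimes.com"]}
    with open(config_path) as f:
        return json.load(f)

def score_domain(domain, domain_categories):
    # categories are checked in insertion order, like the if/elif chain
    for category, domains in domain_categories.items():
        if domain.lower() in (d.lower() for d in domains):
            return category_scores.get(category, 0.0)
    # preserve the existing heuristic for unlisted news-like domains
    if 'news' in domain:
        return 0.7
    return 0.0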

Create Hypercane GUI script for identifying Memento objects based on collection IDs

The existing identify script only handles Memento objects. The command-line version of Hypercane can support files containing URIs (file handles) or collection identifiers (strings). Wooey doesn't support both of these at the same time, so we need to create a separate script that allows the user to execute an identify action and convert a collection identifier into the desired file listing Memento objects.

Add score as a filter

Add a filter that allows the user to specify the score range to include in the output. In case multiple score fields exist in the input, provide the user an argument with which to specify a given field. Perhaps options for upper and lower bounds should be available as well.
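
A hypothetical invocation, borrowing the include-only style used elsewhere in Hypercane; none of these flags exist yet:

# hc filter include-only score-range --score-field "dsa1-score" --lower-bound 0.4 --upper-bound 0.9 -i mementos -a scored-mementos.tsv -o filtered-mementos.tsv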

Make stopword list for hc report terms configurable

The stopwords for hc report terms are currently hardcoded. Even worse, they are hardcoded only in the sumgram code and not in the general n-gram code.

# TODO: load these from a file
added_stopwords = [
"associated press",
"com",
"donald trump",
"fox news",
"abc news",
"getty images",
"last month",
"last week",
"last year",
"pic",
"pinterest reddit",
"pm et",
"president donald",
"president donald trump",
"president trump",
"president trump's",
"print mail",
"reddit print",
"said statement",
"send whatsapp",
"sign up",
"trump administration",
"trump said",
"twitter",
"united states",
"washington post",
"white house",
"whatsapp pinterest",
"subscribe whatsapp",
"york times",
"privacy policy",
"terms use"
]
added_stopwords.append( "{} read".format(last_year) )
added_stopwords.append( "{} read".format(current_year) )
stopmonths = [
"january",
"february",
"march",
"april",
"may",
"june",
"july",
"august",
"september",
"october",
"november",
"december"
]
# add just the month to the stop words
added_stopwords.extend(stopmonths)
stopmonths_short = [
"jan",
"feb",
"mar",
"apr",
"may",
"jun",
"jul",
"aug",
"sep",
"oct",
"nov",
"dec"
]
added_stopwords.extend(stopmonths_short)
# add the day of the week, too
added_stopwords.extend([
"monday",
"tuesday",
"wednesday",
"thursday",
"friday",
"saturday",
"sunday"
])
added_stopwords.extend([
"mon",
"tue",
"wed",
"thu",
"fri",
"sat",
"sun"
])
# for i in range(1, 13):
# added_stopwords.append(
# datetime(current_year, i, current_date).strftime('%b %Y')
# )
# added_stopwords.append(
# datetime(last_year, i, current_date).strftime('%b %Y')
# )
# for i in range(1, 13):
# added_stopwords.append(
# datetime(current_year, i, current_date).strftime('%B %Y')
# )
# added_stopwords.append(
# datetime(last_year, i, current_date).strftime('%B %Y')
# )

The generic terms report will need to accept the same stopword list in get_document_tokens:

def get_document_tokens(urim, cache_storage, ngram_length):
    from hypercane.utils import get_boilerplate_free_content
    from nltk.corpus import stopwords
    from nltk import word_tokenize, ngrams
    import string

    # TODO: stoplist based on language of the document
    stoplist = list(set(stopwords.words('english')))
    punctuation = [ i for i in string.punctuation ]
    additional_stopchars = [ '’', '‘', '“', '”', '•', '·', '—', '–', '›', '»' ]
    stop_numbers = [ str(i) for i in range(0, 11) ]
    allstop = stoplist + punctuation + additional_stopchars + stop_numbers

    content = get_boilerplate_free_content(urim, cache_storage=cache_storage)
    doc_tokens = word_tokenize(content.decode('utf8').lower())
    doc_tokens = [ token for token in doc_tokens if token not in allstop ]
    table = str.maketrans('', '', string.punctuation)
    doc_tokens = [ w.translate(table) for w in doc_tokens ]
    doc_tokens = [ w for w in doc_tokens if len(w) > 0 ]
    doc_ngrams = ngrams(doc_tokens, ngram_length)

    return list(doc_ngrams)
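
A minimal sketch of a file-based alternative; the load_added_stopwords helper and its file format are hypothetical:

def load_added_stopwords(stopword_file):
    # one stopword or stop phrase per line; blank lines and comments are ignored
    added_stopwords = []
    with open(stopword_file) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith('#'):
                added_stopwords.append(line.lower())
    return added_stopwords

get_document_tokens would then accept an additional added_stopwords argument and extend allstop with it, so the sumgram and generic n-gram paths share one configurable list.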

Add the ability to only use the cache

Some users may have built a cache from prior runs and may not want to issue new HTTP requests that add to it. We may not be able to force non-network access for caches supplied via environment variables like HTTPS_PROXY, but the faster MongoDB cache used by requests-cache can be overridden so that cache misses do not trigger a network connection.

Create a new class named OnlyCachedSession that is a child of CachedSession. This class will skip the network connections provided by requests altogether.

Some code below that has worked in testing:

from requests.hooks import dispatch_hook
from requests_cache import CachedSession

class FailedCacheResponse(Exception):
    pass

class OnlyCachedSession(CachedSession):

    def send(self, request, **kwargs):

        cache_key = self.cache.create_key(request)

        # retained from CachedSession's implementation, but unused here:
        # cache misses raise FailedCacheResponse instead of touching the network
        def send_request_and_cache_response():
            response = super(CachedSession, self).send(request, **kwargs)
            if response.status_code in self._cache_allowable_codes:
                self.cache.save_response(cache_key, response)
            response.from_cache = False
            return response

        try:
            response, timestamp = self.cache.get_response_and_time(cache_key)
        except (ImportError, TypeError):
            raise FailedCacheResponse(
                "Import/Type Errors : could not get response and time : item {} is not in the cache".format(cache_key)
            )

        if response is None:
            raise FailedCacheResponse(
                "response is None : could not get response and time : item {} is not in the cache".format(cache_key)
            )

        # dispatch hook here, because we've removed it before pickling
        response.from_cache = True
        response = dispatch_hook('response', request.hooks, response, **kwargs)
        return response
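
Hypothetical usage, assuming the requests-cache MongoDB backend and a cache name chosen for illustration:

session = OnlyCachedSession('hypercane_cache', backend='mongodb')

try:
    response = session.get('https://example.org/some-memento')
except FailedCacheResponse:
    # the URI-M was never cached; no network request is attempted
    pass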

Update Hypercane to accept NLA collection identifiers as input for discovering mementos

Hypercane currently accepts an input type of archiveit and a number as the input argument identifying the Archive-It collection. We want to do the same for NLA, so someone can type Hypercane commands like:

# hc identify -i nla -a 13000 ...

Once this AIU issue is complete, Hypercane will be able to acquire metadata and a list of URI-Ms for each NLA collection. We just need to connect all of this together into a new input type.

We will likely need to add NLA functions that work like generate_archiveit_urits.

We will also need to update discover_mementos_by_input_type, discover_timemaps_by_input_type, and discover_original_resources_by_input_type to support a new input type of nla.

I think these are the only changes needed, but we will need to test to make sure.

Update Image Report to Score Images From Metadata Higher

The current scores produced by hc report image-data are not as effective as they could be. Humans may have already supplied their desired striking images in the metadata of the web pages making up the collection.

Hypercane's existing image scoring function in hypercane/report/imagedata.py:rank_images currently adds image properties to a list on lines 143 - 152

imageranking.append(
    (
        score,
        pixelsize,
        colorcount,
        1 / ratio,
        noverN,
        image_urim
    )
)

Add another column to the left containing values of 1 or 0: if Hypercane discovered the image in the metadata, set this column to 1; otherwise, set it to 0. This way, when the sorting occurs on line 154, all images discovered in the metadata will occupy the highest ranks in the list and will then be sorted by their MementoEmbed score.
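
A minimal sketch of the change; found_in_metadata is a hypothetical boolean computed while gathering the image's properties:

imageranking.append(
    (
        1 if found_in_metadata else 0,  # images discovered in metadata sort ahead of the rest
        score,
        pixelsize,
        colorcount,
        1 / ratio,
        noverN,
        image_urim
    )
)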

Improve the HALG file format

In v0.5, we introduced the HALG file format for executing Hypercane recipes.

To start, Hypercane needs two bash functions to make HALG more compact: cache_hc and move_output.

Consider the following script, simplified to illustrate a point:

#!/bin/bash

input_type=$1
input_argument=$2
working_directory=$3
output_file=$4

cd ${working_directory}

if [ ! -e identified-mementos.tsv ]; then
  hc identify mementos -i $input_type -a $input_argument -o identified-mementos.tsv
fi

if [ ! -e sample-mementos.tsv ]; then
  hc sample true-random -k 2000 -i mementos -a identified-mementos.tsv -o sample-mementos.tsv
fi

if [ ! -e image-report.json ]; then
  hc report imagedata -i mementos -a identified-mementos.tsv -o image-report.json
fi

if [ ! -e terms.tsv ]; then
  hc report terms -i mementos -a identified-mementos.tsv -o terms.tsv
fi

if [ ! -e story.json ]; then
  hc synthesize raintale-story -i mementos -a identified-mementos.tsv --imagedata image-report.json --term-report terms.tsv -o story.json
fi

cp ${working_directory}/story.json ${output_file}

which could be simplified to something like this:

#!/bin/bash

input_type=$1
input_argument=$2
working_directory=$3
output_file=$4

function cache_hc() { ... }

function move_output() { ... }

cache_hc "identify mementos" "${input_type}=${input_argument}" "694-mementos.tsv"
cache_hc "sample true-random -k 2000" "694-mementos.tsv" "sample-mementos.tsv"
cache_hc "report imagedata" "sample-mementos.tsv" "image-report.json"
cache_hc "report terms --use-sumgrams" "sample-mementos.tsv" "terms.tsv"
cache_hc "synthesize raintale-story --imagedata image-report.json --term-report terms.tsv" "sample-mementos.tsv" "story.json"
move_output "story.json" "${output_file}"

and we can even make the cache_hc and move_output functions available as part of the Hypercane installation itself.
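
A minimal sketch of what these helpers might look like; the argument conventions are hypothetical and only intended to match the calls above:

function cache_hc() {
    local hc_command="$1"    # e.g., "identify mementos"
    local input_spec="$2"    # either "type=argument" or an existing mementos file
    local output_file="$3"

    # prevent extra work if we already have the output from previous runs
    if [ ! -e "${output_file}" ]; then
        if [ -e "${input_spec}" ]; then
            hc ${hc_command} -i mementos -a "${input_spec}" -o "${output_file}"
        else
            hc ${hc_command} -i "${input_spec%%=*}" -a "${input_spec#*=}" -o "${output_file}"
        fi
    fi
}

function move_output() {
    cp "$1" "$2"
}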

This issue is the start of a conversation/documentation of thinking about this idea with the goal of making HALG more applicable as v0.6 development unfolds.

We also need to document HALG. So far, it differs from a regular shell script by offering comments in the following format:

#!/bin/bash
# algorithm name: DSA1
# algorithm description: An implementation of the algorithm from AlNoamany's dissertation.

These comments are used by Hypercane when displaying the possible algorithms. Getting HALG straight is an important step toward the Recipe Builder.

Duplicate error/help message in CLI

When running the hc command with an unsupported action, the error/help message is printed twice:

$ hc foo
ERROR: unsupported action foo

hc (Hypercane) is a complex toolchain requiring a supported action and additional arguments

For example:
    hc sample dsa1 -i archiveit -a 8778 -o story-mementos.txt

    Supported actions:

    * sample
    * report
    * synthesize
    * identify
    * filter
    * cluster
    * score
    * order

    For each of these actions, you can view additional help by typing --help after the action name, for example:
    hc sample --help

ERROR: unsupported action foo

hc (Hypercane) is a complex toolchain requiring a supported action and additional arguments

For example:
    hc sample dsa1 -i archiveit -a 8778 -o story-mementos.txt

    Supported actions:

    * sample
    * report
    * synthesize
    * identify
    * filter
    * cluster
    * score
    * order

    For each of these actions, you can view additional help by typing --help after the action name, for example:
    hc sample --help

Synthesize warc using regular vs raw stream

The synthesize warcs command will unintentionally switch back to the original stream instead of the raw stream. The bug seems to be resolved by making deep copies of all variables derived from the original stream.

Affected lines in hypercane/hypercane/synthesize/warcs.py:
76 - headers_list = copy.deepcopy(resp.raw.headers.items())
81 - warc_target_uri = str(resp.links[link]['url'])
88 - mdt = str(resp.headers['memento-datetime'])

Add a command for managing the cache

A command that allows the user to manage the cache would be very helpful after we have implemented #65.

I'm envisioning something like the following:

This command would list all URIs in the cache:

# hc-cache list-uris -o all-uris.txt

This command would purge all cache tables:

# hc-cache purge-all

This command would only purge the memento URI-Ms in the list:

# hc-cache purge -i memento-urims.txt

This command would only purge the cached content of memento URI-Ms, but leave the derived data:

# hc-cache purge -i memento-urims.txt --only-content

This command would preload the cache with a list of URIs:

# hc-cache preload -i uris.txt

This would export the cache into some (to be determined) file format:

# hc-cache export -o exported-cache-data.dat

Likewise, we can load the cache using some (to be determined) file format:

# hc-cache import -i some-elses-cache-data.dat

As time goes on, I'm sure I can think of other things.

Make domain category lists for hc score dsa1-score configurable.

As part of the DSA1 scoring equation, the original resource domain of the memento is given a different score based on its category according to Padia's 2012 work. Padia outlined the following categories:

  • news sources
  • image sharing sites
  • video sharing sites
  • blog sites
  • social media sites

Right now these domain lists are hard-coded and likely to change over time. Create a parameter that allows the user to supply them.

Synthesize Action: TypeError On Docker

When using the synthesize action on Docker, a TypeError occurs ("TypeError: 'Namespace' object is not iterable").

The command I used is listed below:
hc synthesize warcs -i archiveit -a 7760 -o South_Louisiana_Flood

(screenshot: synthesize action TypeError on Docker for Windows)

Utilize add_subparsers to unify CLI

It looks like argparse is being used in the individual actions, which are then called from bin/hc, where the top-level command (i.e., hc) implements argument parsing manually. We could perhaps use add_subparsers in the entrypoint script to leverage the built-in capabilities of the standard argument parsing package, as sketched below. We have used this technique in some of the other WSDL projects.
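
A minimal sketch of the add_subparsers approach; the action modules and their argument registration are hypothetical stand-ins for Hypercane's real ones:

import argparse

parser = argparse.ArgumentParser(
    prog='hc',
    description='hc (Hypercane) is a framework for building algorithms '
                'for sampling mementos from a web archive collection.'
)
subparsers = parser.add_subparsers(dest='action', required=True)

# each action registers its own subparser; 'sample' is shown as an example
sample_parser = subparsers.add_parser('sample', help='sample mementos from a collection')
sample_parser.add_argument('algorithm', help='the sampling algorithm to execute, e.g., dsa1')
sample_parser.add_argument('-i', dest='input_type', required=True)
sample_parser.add_argument('-a', dest='input_argument', required=True)
sample_parser.add_argument('-o', dest='output_file', required=True)
sample_parser.set_defaults(func=lambda args: print('sample action stub'))

args = parser.parse_args()
args.func(args)  # dispatch to the module registered for the chosen action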
