oduwsdl / hypercane
A toolkit for developing algorithms that sample mementos from a web archive collection.
Home Page: https://oduwsdl.github.io/hypercane
License: MIT License
Hypercane uses ArticleExtractor from boilerpipe because that is what works best with sumgrams. There are some scenarios where this boilerplate removal method produces nothing. Allowing the user to specify their own method, and perhaps an ordered list of preferred methods, would go a long way toward addressing this issue.
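The ordered list of preferred methods could behave like a fallback chain. The following is a minimal sketch, not Hypercane's implementation: `extract_with_fallback`, `strict_extractor`, and `naive_extractor` are hypothetical names, and the two stand-in extractors are not boilerpipe APIs.

```python
import re
from typing import Callable, Iterable, Optional

# Hypothetical type: an extractor takes raw HTML and returns extracted text.
Extractor = Callable[[str], str]


def extract_with_fallback(html: str, extractors: Iterable[Extractor]) -> Optional[str]:
    """Return the first non-empty result from an ordered list of extractors."""
    for extractor in extractors:
        try:
            text = extractor(html)
        except Exception:
            # Treat a failing method the same as one that produced nothing.
            continue
        if text and text.strip():
            return text
    return None


# Stand-in extractors for illustration only.
def strict_extractor(html: str) -> str:
    return ""  # simulates a method that produces nothing for this page


def naive_extractor(html: str) -> str:
    # Crude tag stripping; a real method would be far more careful.
    return re.sub(r"<[^>]+>", " ", html).strip()


text = extract_with_fallback("<p>hello</p>", [strict_extractor, naive_extractor])
```

The caller controls the preference order simply by ordering the list, so the sumgram-friendly method can stay first while other methods cover the pages where it returns nothing.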
After reworking Hypercane to use '.halg' formatted files as part of the IIPC 2021 Grant work, the DSA1 algorithm implementation is now wrong. We execute the time slice twice instead of the DBSCAN step:
hypercane/hypercane/packaged_algorithms/dsa1.halg
Lines 79 to 93 in b965662
It needs to follow AlNoamany's Algorithm again, as it did during my dissertation work.
The existing CLI application must be reworked. This work was started already and needs to be tested.
Once that work is done, we can add the corresponding GUI script for the Wooey interface.
Hypercane uses MongoDB for caching memento content, headers, and derived data. It also uses PostgreSQL as part of its Web User Interface (WUI). Rather than having to install/maintain multiple databases for different purposes, we want to move Hypercane to PostgreSQL for the following reasons.
With all of this in mind, I will be using this issue to document ER diagrams and other insights as I experiment with this change.
Wooey has a confusing setup with respect to Downloads. We should replace the Download button with something that allows the user to download the file generated by the given action.
We can force a download rather than a browser render by adding the download attribute to an <a> tag. I'm not sure how to do this to a button, but I think the buttons in Wooey are largely decorative rather than true <button> tags.
The Hypercane WUI installer is almost finished. We need to document the process for users.
This script should take into account the lessons learned from oduwsdl/raintale#19, oduwsdl/raintale#20, oduwsdl/raintale#21.
Ideally, we would create an RPM install for RedHat-based systems and a DEB install for Debian-based systems, but that may be a bit too much to test for the duration of the IIPC Grant. A tarball containing the necessary files and an installation script is likely enough. To make it administrator-friendly, we could apply Makeself as well. This way the user can download a single file, execute it, and it will extract our content, execute our script, and start up the Hypercane GUI.
They are confusing to us and will likely be so for users. Until we can articulate how to use them, we should remove them.
We just need someone to remove the lines from hypercane-gui/templates/jobs/job_view.html:
hypercane/hypercane-gui/templates/jobs/job_view.html
Lines 114 to 119 in 037031e
The user should be able to run a single script and produce a complex Raintale Story JSON file, just like we do with the SHARI process.
In this case, the GUI should allow the user to select the appropriate sampling algorithm, and then the following happens:
$ pip3 uninstall hypercane
/usr/lib/python3/dist-packages/secretstorage/dhcrypto.py:15: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
from cryptography.utils import int_from_bytes
/usr/lib/python3/dist-packages/secretstorage/util.py:19: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
from cryptography.utils import int_from_bytes
Found existing installation: hypercane 0.2021.3.10.202429
Uninstalling hypercane-0.2021.3.10.202429:
Would remove:
/home/marsh/.local/lib/python3.8/site-packages/hypercane
/home/marsh/.local/lib/python3.8/site-packages/hypercane-0.2021.3.10.202429.egg-info
Proceed (y/n)? y
Successfully uninstalled hypercane-0.2021.3.10.202429
$ which hc
/home/marsh/.local/bin/hc
$/.local/lib/python3.8/site-packages
$ hc --help
hc (Hypercane) is a framework for building algorithms for sampling mementos from a web archive collection.
It is a complex toolchain requiring a supported action and additional arguments.
For example:
hc sample dsa1 -i archiveit -a 8778 -o story-mementos.txt
This is the list of supported actions:
* sample
* report
* synthesize
* identify
* filter
* cluster
* score
* order
For each of these actions, you can view additional help by typing --help after the action name, for example:
hc sample --help
$ hc sample --help
Traceback (most recent call last):
  File "/home/marsh/.local/bin/hc", line 54, in <module>
    actionmodule = importlib.import_module(supported_actions[action])
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'hypercane'
Due to issues downloading //fonts.googleapis.com/css?family=Pacifico, Firefox does not render Hypercane's WUI for quite a long time.
After removing line 14 from hypercane-gui/templates/base.html, the page loads fine.
The existing CLI application must be reworked.
Once that work is done, we can add the corresponding GUI script to the Wooey interface.
Most *nix commands honor the HTTP_PROXY and HTTPS_PROXY environment variables. Hypercane processes these variables and applies them in hypercane/utils.py as part of get_web_session. We need to test this with Squid or Varnish to ensure the system will actually use a proxy server as a datastore.
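As a rough illustration of the behavior to test, the sketch below builds a requests-style proxies mapping from the two environment variables. It is not the code in get_web_session; `proxies_from_environment` is a hypothetical helper, and the proxy address is only an example (3128 is Squid's default port).

```python
import os
from typing import Dict, Mapping, Optional


def proxies_from_environment(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build a requests-style proxies mapping from HTTP_PROXY / HTTPS_PROXY."""
    env = os.environ if environ is None else environ
    proxies: Dict[str, str] = {}
    for scheme, variable in (("http", "HTTP_PROXY"), ("https", "HTTPS_PROXY")):
        # Accept the lowercase spelling too, which many tools also honor.
        value = env.get(variable) or env.get(variable.lower())
        if value:
            proxies[scheme] = value
    return proxies


# Example with an explicit mapping instead of the real environment.
example = proxies_from_environment({"HTTP_PROXY": "http://localhost:3128"})
```

A mapping like this can be applied to a requests session with `session.proxies.update(...)`. A test against Squid or Varnish would then check the proxy's access log to confirm that requests actually pass through it.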
The function generate_warc_record_for_urim should implement the WARC-Source-URI and WARC-Creation-Date fields proposed by @ikreymer, as noted in the Twitter thread at https://twitter.com/IlyaKreymer/status/1487111893567246336.
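A minimal sketch of how the two fields might sit alongside the existing WARC headers follows. The semantics are assumptions, not confirmed by the thread as quoted here: WARC-Source-URI is taken to be the URI-M the content was fetched from, and WARC-Creation-Date the time the record was written, while WARC-Date carries the memento-datetime. `build_warc_headers` is a hypothetical helper and the URIs are placeholders.

```python
from datetime import datetime, timezone
from typing import List, Optional, Tuple

WARC_DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def build_warc_headers(
    urim: str,
    original_uri: str,
    memento_datetime: datetime,
    created: Optional[datetime] = None,
) -> List[Tuple[str, str]]:
    """Return WARC header pairs, including the two proposed fields."""
    created = created or datetime.now(timezone.utc)
    return [
        ("WARC-Target-URI", original_uri),
        # Assumption: WARC-Date holds the memento-datetime of the capture.
        ("WARC-Date", memento_datetime.astimezone(timezone.utc).strftime(WARC_DATE_FORMAT)),
        # Assumption: the URI-M the content was downloaded from.
        ("WARC-Source-URI", urim),
        # Assumption: when this WARC record was generated.
        ("WARC-Creation-Date", created.astimezone(timezone.utc).strftime(WARC_DATE_FORMAT)),
    ]


headers = dict(
    build_warc_headers(
        "https://archive.example/20200101000000/http://example.com/",
        "http://example.com/",
        datetime(2020, 1, 1, tzinfo=timezone.utc),
        created=datetime(2022, 1, 28, 12, 0, 0, tzinfo=timezone.utc),
    )
)
```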
Scoring mementos by Padia's work depends upon a list of domains for the categories of news, image sharing sites, video sharing sites, blog sites, and social media sites. These lists will likely change over time and should be configurable by the end user.
The scoring is handled on lines 431 - 451 of hypercane/score/dsa1_ranking.py. We will have to make it generic.
hypercane/hypercane/score/dsa1_ranking.py
Lines 431 to 451 in 2e071d7
The lists themselves are on lines 15 - 386 of that same file.
hypercane/hypercane/score/dsa1_ranking.py
Lines 15 to 386 in 2e071d7
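One way to make the lists configurable is to read them from a user-supplied file. The sketch below assumes a JSON object that maps a category name to a list of domains. The file format, the function names, and the example domains are all hypothetical and are not taken from dsa1_ranking.py.

```python
import json
from typing import Dict, List
from urllib.parse import urlparse


def load_domain_categories(json_text: str) -> Dict[str, List[str]]:
    """Parse a JSON object of the form {"category": ["domain", ...]}."""
    data = json.loads(json_text)
    return {category: [d.lower() for d in domains] for category, domains in data.items()}


def categorize_uri(uri: str, categories: Dict[str, List[str]]) -> str:
    """Return the first category whose domain list matches the URI's host."""
    host = (urlparse(uri).hostname or "").lower()
    for category, domains in categories.items():
        for domain in domains:
            # Match the domain itself or any of its subdomains.
            if host == domain or host.endswith("." + domain):
                return category
    return "other"


categories = load_domain_categories('{"news": ["news.example"], "video": ["video.example"]}')
category = categorize_uri("https://www.news.example/story", categories)
```

The scoring code would then look up a weight per category name instead of testing against hard-coded lists.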
The date and version in CITATION.cff must match the date and version used elsewhere in the software.
The existing identify script only handles Memento objects. The command line version of Hypercane can support files containing URIs (file handles) or collection identifiers (strings). Wooey doesn't support both of these at the same time, so we need to create a separate script that allows the user to execute an identify action and convert a collection identifier to the desired file listing Memento objects.
Add a filter that allows the user to specify the scoring range to include in the output. In case multiple score fields exist in the input, provide the user an argument with which to specify a given field. Perhaps options for upper and lower bound should be available as well.
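A minimal sketch of such a filter is shown below, assuming each memento is represented as a dictionary of fields. `filter_by_score` and its parameters are hypothetical names, not existing Hypercane options.

```python
from typing import Dict, Iterable, List, Optional


def filter_by_score(
    rows: Iterable[Dict[str, str]],
    field: str,
    lower: Optional[float] = None,
    upper: Optional[float] = None,
) -> List[Dict[str, str]]:
    """Keep rows whose score in `field` lies within the inclusive bounds."""
    kept = []
    for row in rows:
        if field not in row:
            # Rows without the requested score field are dropped.
            continue
        score = float(row[field])
        if lower is not None and score < lower:
            continue
        if upper is not None and score > upper:
            continue
        kept.append(row)
    return kept


rows = [
    {"URI-M": "https://archive.example/1", "Score---bm25": "0.2"},
    {"URI-M": "https://archive.example/2", "Score---bm25": "0.7"},
    {"URI-M": "https://archive.example/3", "Score---bm25": "0.9"},
]
selected = filter_by_score(rows, "Score---bm25", lower=0.5, upper=0.8)
```

Leaving either bound as `None` gives the open-ended "at least" and "at most" variants. The field name in the example is illustrative only.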
The stopwords for hc report terms are currently hardcoded. Even worse, they are hardcoded only in the sumgram code and not in the general n-gram code.
hypercane/hypercane/report/sumgrams.py
Lines 53 to 162 in 2e071d7
The generic terms report will need to accept the same stopword list at get_document_tokens:
hypercane/hypercane/report/terms.py
Lines 6 to 28 in 2e071d7
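The sketch below shows one way a shared, user-supplied stopword list could be passed into tokenization. It does not reproduce the signature of get_document_tokens. Both function names and the one-word-per-line file format are assumptions.

```python
import re
from typing import Iterable, List, Set


def load_stopwords(lines: Iterable[str]) -> Set[str]:
    """Read one stopword per line, skipping blank lines and # comments."""
    return {
        line.strip().lower()
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    }


def tokenize_without_stopwords(text: str, stopwords: Set[str]) -> List[str]:
    """Lowercase, split into simple word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in stopwords]


stopwords = load_stopwords(["the\n", "# comment\n", "\n", "of\n"])
tokens = tokenize_without_stopwords("The State of the Web", stopwords)
```

Loading the list once and handing the same set to both the sumgram report and the generic n-gram report would keep the two in agreement.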
Some users may have built a cache from prior runs and not want to issue new HTTP requests to add to it. We may not be able to force non-network access with caches supplied via environment variables like HTTPS_PROXY, but the faster MongoDB cache used by requests-cache can be overridden to not issue a network connection for cache misses.
Create a new class named OnlyCachedSession that is a subclass of CachedSession. This class will skip the network connections provided by requests altogether.
The code below has worked in testing:
from requests.hooks import dispatch_hook
from requests_cache import CachedSession


class FailedCacheResponse(Exception):
    """Raised when a request cannot be answered from the cache."""


class OnlyCachedSession(CachedSession):
    """A CachedSession that raises on a cache miss instead of using the network."""

    def send(self, request, **kwargs):
        cache_key = self.cache.create_key(request)

        # Defined but never called below, so no request reaches the network.
        def send_request_and_cache_response():
            response = super(CachedSession, self).send(request, **kwargs)
            if response.status_code in self._cache_allowable_codes:
                self.cache.save_response(cache_key, response)
            response.from_cache = False
            return response

        try:
            response, timestamp = self.cache.get_response_and_time(cache_key)
        except (ImportError, TypeError):
            raise FailedCacheResponse(
                "Import/Type Errors : could not get response and time : item {} is not in the cache".format(cache_key)
            )

        # A cache miss is an error rather than a trigger for a network request.
        if response is None:
            raise FailedCacheResponse(
                "response is None : could not get response and time : item {} is not in the cache".format(cache_key)
            )

        # dispatch hook here, because we've removed it before pickling
        response.from_cache = True
        response = dispatch_hook('response', request.hooks, response, **kwargs)
        return response
Hypercane currently accepts an input type of archiveit and a number as an input argument that identifies the Archive-It collection. We want to do the same for NLA collections, so someone can type Hypercane commands like:
# hc identify -i nla -a 13000 ...
Once this AIU issue is complete Hypercane will be able to acquire metadata and a list of URI-Ms for each NLA collection. We just want to connect all of this together into a new input type.
We will likely need to add functions for NLA that function like generate_archiveit_urits.
We will also need to update discover_mementos_by_input_type, discover_timemaps_by_input_type, and discover_original_resources_by_input_type to support a new input type of nla.
I think these are the only changes needed, but we will need to test to make sure.
Allow a user to order a set of mementos by the carbon date of their URI-R.
The current scores produced by hc report image-data are not as effective as they could be. Humans may have already supplied their desired striking images in the metadata of the web pages making up the collection.
Hypercane's existing image scoring function in hypercane/report/imagedata.py:rank_images currently adds image properties to a list on lines 143 - 152:
hypercane/hypercane/report/imagedata.py
Lines 143 to 152 in 44491c3
Add another column to the left containing values of 1 or 0. If Hypercane discovers the image in the metadata, set this column to 1 otherwise 0. This way, when the sorting occurs on line 154, all images discovered in the metadata will exist at the highest ranks in the list and then will be sorted by their MementoEmbed score.
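The effect of the extra column can be sketched with plain tuples. This is not the code in rank_images: `rank_with_metadata_flag` is a hypothetical name, each candidate is reduced to a (URI, score) pair, and the sketch assumes that a higher score is better.

```python
from typing import Iterable, List, Set, Tuple


def rank_with_metadata_flag(
    candidates: Iterable[Tuple[str, float]],
    metadata_images: Set[str],
) -> List[str]:
    """Rank image URIs, placing images found in page metadata first."""
    rows = []
    for uri, score in candidates:
        # Leading column: 1 if the image was named in the metadata, else 0.
        in_metadata = 1 if uri in metadata_images else 0
        rows.append((in_metadata, score, uri))
    # Tuples sort column by column, so the flag dominates and the score breaks ties.
    rows.sort(reverse=True)
    return [uri for _, _, uri in rows]


ranked = rank_with_metadata_flag(
    [("a.png", 0.9), ("b.png", 0.2), ("c.png", 0.5)],
    metadata_images={"b.png"},
)
```

Here `b.png` ranks first despite its low score because it appears in the metadata, and the remaining images keep their score order.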
I've been reluctant to provide this functionality in case it might be misused, but I needed it today.
Based on what we learn from @ato when addressing oduwsdl/raintale#30 we need to do the same for Hypercane.
In v0.5, we introduced the HALG file format for executing Hypercane recipes.
To start, Hypercane needs two bash functions to make HALG more compact: cache_hc and move_output.
Consider the following script, simplified to illustrate a point:
#!/bin/bash

input_type=$1
input_argument=$2
working_directory=$3
output_file=$4

# Stop if the working directory is unavailable.
cd "${working_directory}" || exit 1

# Each step is skipped when its output file already exists.
if [ ! -e identified-mementos.tsv ]; then
    hc identify mementos -i "${input_type}" -a "${input_argument}" -o identified-mementos.tsv
fi

if [ ! -e sample-mementos.tsv ]; then
    hc sample true-random -k 2000 -i mementos -a identified-mementos.tsv -o sample-mementos.tsv
fi

if [ ! -e image-report.json ]; then
    hc report imagedata -i mementos -a identified-mementos.tsv -o image-report.json
fi

if [ ! -e terms.tsv ]; then
    hc report terms -i mementos -a identified-mementos.tsv -o terms.tsv
fi

if [ ! -e story.json ]; then
    hc synthesize raintale-story -i mementos -a identified-mementos.tsv --imagedata image-report.json --term-report terms.tsv -o story.json
fi

cp "${working_directory}/story.json" "${output_file}"
which could be simplified to something like this:
#!/bin/bash
input_type=$1
input_argument=$2
working_directory=$3
output_file=$4
function cache_hc() { ... }
function move_output() { ... }
cache_hc "identify mementos" "${input_type}=${input_argument}" "694-mementos.tsv"
cache_hc "sample true-random -k 2000" "694-mementos.tsv" "sample-mementos.tsv"
cache_hc "report imagedata" "sample-mementos.tsv" "image-report.json"
cache_hc "report terms --use-sumgrams" "sample-mementos.tsv" "terms.tsv"
cache_hc "synthesize raintale-story --imagedata image-report.json --term-report terms.tsv" "sample-mementos.tsv" "story.json"
move_output "story.json" "${output_file}"
and we can even make the cache_hc and move_output functions available as part of the Hypercane installation itself.
This issue is the start of a conversation/documentation of thinking about this idea with the goal of making HALG more applicable as v0.6 development unfolds.
We also need to document HALG. So far, it differs from a regular shell script by offering comments in the following format:
#!/bin/bash
# algorithm name: DSA1
# algorithm description: An implementation of the algorithm from AlNoamany's dissertation.
These comments are used by Hypercane when displaying the possible algorithms. Getting HALG straight is an important step toward the Recipe Builder.
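Reading those comments could look roughly like the sketch below. It only assumes the two comment forms shown above. The function name and the regular expression are illustrative and are not Hypercane's actual parser.

```python
import re
from typing import Dict

# Matches "# algorithm name: ..." and "# algorithm description: ..." comments.
_HALG_COMMENT = re.compile(
    r"^#\s*algorithm\s+(name|description)\s*:\s*(.*)$", re.IGNORECASE
)


def read_halg_metadata(script_text: str) -> Dict[str, str]:
    """Collect the algorithm name and description from HALG comments."""
    metadata: Dict[str, str] = {}
    for line in script_text.splitlines():
        match = _HALG_COMMENT.match(line.strip())
        if match:
            # Keep the first value if a key is repeated.
            metadata.setdefault(match.group(1).lower(), match.group(2).strip())
    return metadata


example_halg = """#!/bin/bash
# algorithm name: DSA1
# algorithm description: An implementation of the algorithm from AlNoamany's dissertation.
"""
metadata = read_halg_metadata(example_halg)
```

Because the metadata lives in ordinary shell comments, a HALG file remains a valid bash script that can also be run directly.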
The existing CLI application must be reworked.
Once that work is done, we can add the corresponding GUI scripts to the Wooey interface.
When running the hc command with an unsupported action, the error and help text are printed twice:
$ hc foo
ERROR: unsupported action foo
hc (Hypercane) is a complex toolchain requiring a supported action and additional arguments
For example:
hc sample dsa1 -i archiveit -a 8778 -o story-mementos.txt
Supported actions:
* sample
* report
* synthesize
* identify
* filter
* cluster
* score
* order
For each of these actions, you can view additional help by typing --help after the action name, for example:
hc sample --help
ERROR: unsupported action foo
hc (Hypercane) is a complex toolchain requiring a supported action and additional arguments
For example:
hc sample dsa1 -i archiveit -a 8778 -o story-mementos.txt
Supported actions:
* sample
* report
* synthesize
* identify
* filter
* cluster
* score
* order
For each of these actions, you can view additional help by typing --help after the action name, for example:
hc sample --help
The synthesize warc command will unintentionally switch back to the original stream instead of the raw stream. The bug seems to be resolved by making deep copies of all variables from the original stream.
Affected lines in hypercane/hypercane/synthesize/warcs.py:
76 - headers_list = copy.deepcopy(resp.raw.headers.items())
81 - warc_target_uri = str(resp.links[link]['url'])
88 - mdt = str(resp.headers['memento-datetime'])
A command that allows the user to manage the cache would be very helpful after we have implemented #65.
I'm envisioning something like the following:
This command would list all URIs in the cache:
# hc-cache list-uris -o all-uris.txt
This command would purge all cache tables:
# hc-cache purge-all
This command would only purge the memento URI-Ms in the list:
# hc-cache purge -i memento-urims.txt
This command would only purge the cached content of memento URI-Ms, but leave the derived data:
# hc-cache purge -i memento-urims.txt --only-content
This command would preload the cache with a list of URIs:
# hc-cache preload -i uris.txt
This would export the cache into some (to be determined) file format:
# hc-cache export -o exported-cache-data.dat
Likewise, we can load the cache using some (to be determined) file format:
# hc-cache import -i some-elses-cache-data.dat
As time goes on, I'm sure I can think of other things.
As part of the DSA1 scoring equation, the original resource domain of the memento is given a different score based on its category according to Padia's 2012 work. Padia outlined the following categories:
Right now these domain lists are hard-coded and likely to change over time. Create a parameter that allows the user to supply them.
Feeding the mementos from a timemap generated by memgator to the "synthesize warcs" action in hypercane results in exceptions for mementos from archive.today. There appears to be a captcha.
Thanks to @himarshaj, MementoEmbed now handles NLA mementos properly. Hypercane needs this functionality so it can process them as well. Hypercane should use MementoEmbed's MementoResource class to extract the correct raw mementos, original resource domains, etc. so that we do not need to update code in two places.
It looks like argparse is used in the individual actions, which are then called from bin/hc, where the top-level command (i.e., hc) parses arguments manually. We could perhaps use add_subparsers in the entrypoint script to leverage the built-in capabilities of the standard argument parsing package. We have used this technique in some of the other WSDL projects.
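A minimal sketch of the add_subparsers approach follows. The action names come from the hc help text. The `subaction` positional and the `dest` names are assumptions made for illustration, and the real per-action options would differ.

```python
import argparse

ACTIONS = ["sample", "report", "synthesize", "identify", "filter", "cluster", "score", "order"]


def build_parser() -> argparse.ArgumentParser:
    """Build a top-level parser with one subparser per supported action."""
    parser = argparse.ArgumentParser(
        prog="hc",
        description="A framework for building algorithms for sampling mementos.",
    )
    subparsers = parser.add_subparsers(dest="action", required=True, metavar="ACTION")
    for action in ACTIONS:
        sub = subparsers.add_parser(action, help="run the {} action".format(action))
        # Hypothetical common arguments; each action would define its own.
        sub.add_argument("subaction", help="e.g. dsa1 for sample, terms for report")
        sub.add_argument("-i", dest="input_type", help="input type, e.g. archiveit")
        sub.add_argument("-a", dest="input_arguments", help="argument for the input type")
        sub.add_argument("-o", dest="output_filename", help="where to write the output")
    return parser


args = build_parser().parse_args(
    ["sample", "dsa1", "-i", "archiveit", "-a", "8778", "-o", "story-mementos.txt"]
)
```

With this layout argparse generates `hc --help` and `hc sample --help` itself, and it rejects an unsupported action with a single usage error, so the entrypoint no longer needs its own error and help printing.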
This is why #41 and other items failed during review on Monday.
The Hypercane GUI should refuse to start if the HC_STORAGE_CACHE environment variable is not set and it should notify the user.
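A startup check along these lines might be enough. This is a sketch only: `require_storage_cache` is a hypothetical function, the error wording is invented, and the example value is a placeholder rather than a documented format.

```python
import os
from typing import Mapping, Optional


def require_storage_cache(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return HC_STORAGE_CACHE, or exit with a message if it is not set."""
    env = os.environ if environ is None else environ
    value = env.get("HC_STORAGE_CACHE", "").strip()
    if not value:
        # SystemExit with a string prints the message to stderr and exits non-zero.
        raise SystemExit(
            "HC_STORAGE_CACHE is not set; set it before starting the Hypercane GUI."
        )
    return value


# Example with an explicit mapping instead of the real environment.
cache_setting = require_storage_cache({"HC_STORAGE_CACHE": "mongodb://localhost/example"})
```

Calling the check before any other startup work would let the user see the problem immediately instead of hitting a cache error partway through a job.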