
wikirec's People

Contributors

andrewtavis, bizzyvinci, dependabot[bot], imgbotapp, victle


wikirec's Issues

Add WikilinkNN Unwanted Links

The WikilinkNN model currently best supports book recommendations in wikirec, as the links that are removed are preset for books via the following in wikirec.model._wikilink_nn:

to_remove = [
    "hardcover",
    "paperback",
    "hardback",
    "e-book",
    "wikipedia:wikiproject books",
    "wikipedia:wikiproject novels",
]
wikilinks = [item for item in wikilinks if item not in to_remove]

It would be best if this could be adapted for other kinds of recommendation inputs. The style of input could potentially be passed to wikirec.model.gen_embeddings, but a discussion could also be had about other ways to derive which links should be removed.
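One direction, sketched below with hypothetical names (`LINKS_TO_REMOVE`, `filter_wikilinks`, and the non-book entries are illustrative assumptions), would be to key the preset removal lists by the style of input passed down from wikirec.model.gen_embeddings:

```python
# Hypothetical sketch: removal lists keyed by input type; only the "books"
# entries come from the current _wikilink_nn code, the rest are assumptions.
LINKS_TO_REMOVE = {
    "books": [
        "hardcover",
        "paperback",
        "hardback",
        "e-book",
        "wikipedia:wikiproject books",
        "wikipedia:wikiproject novels",
    ],
    "films": [
        "wikipedia:wikiproject film",  # assumed example entry
    ],
}

def filter_wikilinks(wikilinks, input_type="books"):
    """Remove links that carry no recommendation signal for the input type."""
    to_remove = set(LINKS_TO_REMOVE.get(input_type, []))
    return [link for link in wikilinks if link not in to_remove]
```

Unknown input types would then fall through with no links removed, which keeps the current behavior as the default.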

Add ability to change model results based on topic models

A potential addition to wikirec would be allowing a user to change the recommendations based on the topics. As of now this is only a sketch, but the general idea is that topic coherences could be returned to the user along with the words that define each topic. The user could then indicate that they want results more in line with a topic by passing a word or n-gram together with a score. A score of 0.5 could mean that topics including the passed word are not re-weighted, with values below or above implying that topic importances should be shifted based on the word's importance in them.

This would allow a user to express interest in genres, or simply say that results should be more similar to those that are focused on a similar topic keyword. kwx could be looked to for topic keyword derivation in this case.
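The weighting idea above could be sketched roughly as follows; the helper name, the array shapes, and the multiplicative re-weighting are all assumptions for discussion:

```python
import numpy as np

def reweight_by_topic(sims, doc_topic_weights, topic_contains_word, score=0.5):
    """
    Hypothetical sketch of shifting similarities by topic interest.

    sims: (n_titles,) similarity scores for a query
    doc_topic_weights: (n_titles, n_topics) topic distribution per title
    topic_contains_word: (n_topics,) boolean mask of topics whose defining
        words include the passed word or n-gram
    score: 0.5 leaves results unchanged; values above 0.5 boost titles whose
        topics include the word, values below penalize them
    """
    # How important the keyword's topics are within each title's document.
    topic_importance = doc_topic_weights[:, topic_contains_word].sum(axis=1)
    # Map score in [0, 1] to a shift in [-1, 1]; 0.5 maps to no shift.
    shift = 2 * (score - 0.5)
    return sims * (1 + shift * topic_importance)
```

With score=0.5 the similarities pass through untouched, so the feature would be opt-in by construction.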

Update gensim LDA to 4.x

This issue is for discussing and eventually implementing an update of the gensim implementations of LDA in wikirec. The package was originally written with 3.x versions of gensim, and 4.x versions apparently have some dramatic improvements in modeling options/efficiency and n-gram creation (for wikirec.data_utils.clean). Changes would need to be made in wikirec.data_utils and wikirec.model.

Documenting what would need to happen for the switch and then working towards implementing it would be very much appreciated :)

Thanks for your interest in contributing!

Devising ways to best combine recommendations

This issue is to discuss ways to best combine vector embeddings so that a wikirec user can optimally pass more than one argument to wikirec.model.recommend.

The current way of combining recommendations for more than one input is to simply take the arithmetic means of the similarity matrix rows for each passed title, which is depicted in the following snippet from wikirec.model.recommend:

import numpy as np

# inpt is one of the user-passed titles; first_input and sims are
# initialized before this loop in wikirec.model.recommend
for i, t in enumerate(titles):
    if t == inpt:
        if first_input:
            sims = sim_matrix[i]

            first_input = False

        else:
            # running pairwise mean with the previously combined similarities
            sims = [np.mean([s, sim_matrix[i][j]]) for j, s in enumerate(sims)]

A discussion of whether this is the best way to do this would be much appreciated! Furthermore, how could the above be changed to allow a user to express disinterest (as discussed in #33)?
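One point for the discussion: the running pairwise mean weights inputs unevenly once there are three or more titles, since it computes ((a + b)/2 + c)/2 rather than (a + b + c)/3. A vectorized alternative that treats all inputs equally, with a hypothetical helper name and an optional weights argument, could look like:

```python
import numpy as np

def combine_rows(sim_matrix, titles, inputs, weights=None):
    """Hypothetical sketch: average the similarity rows of all matched
    titles in one step, optionally weighting each input's contribution."""
    idx = [titles.index(t) for t in inputs]
    return np.average(np.asarray(sim_matrix)[idx], axis=0, weights=weights)
```

The weights argument would also give a natural hook for down-weighting (or negating) inputs later on.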

Allowing users to express disinterest in model.recommend

This issue is for discussing and potentially implementing a way for users to express disinterest in a title when calling wikirec.model.recommend. The general idea for now would be to allow users to pass a title with a negation indicator of some kind (e.g. "!title"), in which case the similarity-matrix selections for the given item would be reversed.

It would be great to know if the above would be an intuitive UI, and an implementation would be welcome!
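A minimal sketch of the parsing side, with hypothetical helper names and assuming similarities scaled to [0, 1]:

```python
import numpy as np

def split_inputs(inputs):
    """Separate liked titles from '!'-negated ones (hypothetical syntax)."""
    liked = [t for t in inputs if not t.startswith("!")]
    disliked = [t[1:] for t in inputs if t.startswith("!")]
    return liked, disliked

def disinterest_sims(sim_row):
    """Reverse a similarity row so the most similar items rank last."""
    return 1.0 - np.asarray(sim_row)  # assumes similarities in [0, 1]
```

How the reversed rows would then be combined with the liked rows circles back to the discussion in #35.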

Add Wikidata metadata

One way to provide more data for wikirec would be to add metadata for a given article via its Wikidata entry. This would change the manner in which the data is extracted, but article texts could also be derived via the Wikipedia pages that are linked to the Wikidata entity. Whether or not the project should shift to focus on Wikidata as its main data source could also be discussed, with tools like WikidataIntegrator being used to derive article categories and query the needed information.
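As a starting point, metadata could be pulled from the public Wikidata SPARQL endpoint before settling on a library. The endpoint URL and the property ID below are real (P136 is "genre"); the helper and how wikirec would consume the results are assumptions:

```python
# The Wikidata Query Service is a public SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def genre_query(qid):
    """Build a SPARQL query for the genres (P136) of a Wikidata entity."""
    return f"""
    SELECT ?genreLabel WHERE {{
      wd:{qid} wdt:P136 ?genre .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """
```

The same pattern extends to other properties (author, publication date, etc.), which could feed directly into the recommendation inputs.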

Create concise requirement and env files

This issue is for creating concise versions of requirements.txt and environment.yml for wikirec. It would be great if these files were created by hand with specific version numbers, or generated in a way such that sub-dependencies don't always need to be updated.

As of now both files are being created with the following commands in the package's conda virtual environment:

pip list --format=freeze > requirements.txt  
conda env export --no-builds | grep -v "^prefix: " > environment.yml

wikirec, en-core-web-sm (spacy package that breaks tests), and other obviously unneeded packages are then removed from these files before being uploaded.
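The manual removal step could be folded into the export itself; a sketch follows, where any excluded package names beyond wikirec and en-core-web-sm would be assumptions:

```shell
# Export and drop the packages that are removed by hand today.
pip list --format=freeze \
  | grep -vE "^(wikirec|en-core-web-sm)==" \
  > requirements.txt

# The same filter could follow the conda export:
#   conda env export --no-builds | grep -v "^prefix: " \
#     | grep -vE "wikirec|en-core-web-sm" > environment.yml
```

This wouldn't make the files more concise on its own, but it would make regenerating them reproducible.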

Any insights or help would be much appreciated!

Add t-SNE to wikirec

It would be helpful to be able to visualize the embeddings created by wikirec models, and one such way to achieve this is t-SNE. This would allow the results of models to be visually compared to see how relationships are being derived.

The Python package kwx has an implementation of t-SNE that could be adopted for this package, with another reference being the blog post that this package was originally based on. Ideally this would be put into a visuals.py module, which would further be added to the documentation and tested using pytest's monkeypatch feature (see the tests for kwx for an example). Partial implementations are more than welcome though!
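A minimal sketch of such a visuals.py helper, assuming embeddings as a 2-D array and using scikit-learn's TSNE (the function name and signature are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, as the pytest tests would use
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_tsne(embeddings, titles, perplexity=5, random_state=42):
    """Project model embeddings to 2-D and label each point with its title."""
    coords = TSNE(
        n_components=2, perplexity=perplexity, random_state=random_state
    ).fit_transform(np.asarray(embeddings))
    fig, ax = plt.subplots()
    ax.scatter(coords[:, 0], coords[:, 1])
    for (x, y), t in zip(coords, titles):
        ax.annotate(t, (x, y))
    return fig
```

Returning the figure (rather than calling plt.show) keeps the helper easy to test with monkeypatch.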

Please first indicate your interest in working on this, as it is a feature implementation :)

Thanks for your interest in contributing!

Add neural network model

This issue is for adding an embeddings neural network implementation to wikirec. This package was originally based on the linked blog post, but the original model implementation has not yet been included. That original work and the provided code could serve as the basis for adding such a model to wikirec, which ideally would also be included in the documentation and tested. That model was based on analyzing the links between pages, which could serve as a basis for the wikirec version with modifications to wikirec.data_utils, or the model could focus on the article texts. Partial implementations are more than welcome though :)
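Following the linked post's approach of learning embeddings from (item, wikilink) index pairs, a dot-product classifier in Keras could look roughly like the following; the builder name, layer sizes, and wiring are illustrative rather than the post's exact code:

```python
from tensorflow.keras import Model, layers

def build_link_model(n_items, n_links, embedding_size=50):
    """Learn item and link embeddings by classifying (item, link) pairs."""
    item_in = layers.Input(shape=(1,), name="item")
    link_in = layers.Input(shape=(1,), name="link")
    item_emb = layers.Embedding(n_items, embedding_size)(item_in)
    link_emb = layers.Embedding(n_links, embedding_size)(link_in)
    # The dot product of the two embeddings scores whether the pair is real.
    dot = layers.Dot(axes=2)([item_emb, link_emb])
    out = layers.Activation("sigmoid")(layers.Reshape((1,))(dot))
    model = Model(inputs=[item_in, link_in], outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

Training would sample genuine (item, link) pairs as positives and random pairs as negatives, after which the learned Embedding weights would serve as the similarity vectors.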

Please first indicate your interest in working on this, as it is a feature implementation :)

Thanks for your interest in contributing!

Implementing simple parsing arguments

This issue is to discuss and implement keys for wikirec.data_utils.input_conversion_dict to make it easier for people to find valid arguments to parse Wikipedia articles using wikirec.data_utils.parse_to_ndjson. Rather than needing to search for the given Infobox topic, a user could instead simply query the keys of input_conversion_dict for the desired language and see what would be valid values to pass to the topics argument. Suggestions and pull requests are welcome for any language :)
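The lookup could be as simple as the following sketch, where the dict entries and structure (language to {key: Infobox topic}) are illustrative stand-ins for what wikirec.data_utils.input_conversion_dict would contain:

```python
# Illustrative stand-in for wikirec.data_utils.input_conversion_dict,
# assumed to map user-friendly keys to Wikipedia Infobox topics per language.
input_conversion_dict = {
    "en": {"books": "Infobox book", "authors": "Infobox writer"},
    "de": {"books": "Infobox Buch"},
}

def valid_topics(language="en"):
    """Show which values can be passed to parse_to_ndjson's topics argument."""
    return sorted(input_conversion_dict.get(language, {}))
```

A user could then call valid_topics for their language instead of searching Wikipedia for the exact Infobox name.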

Thanks for your interest in contributing!

New recommendation models

Please use this issue to make suggestions for new models that could be added to wikirec. Suggestions would ideally include some of the following:

  • A blogpost or other source where the method is applied for related NLP tasks
  • A research paper that details the method and its potential applications
  • Source code for the method in Python or another language

Estimates of the model's efficacy would also be appreciated so that a new good first issue can be made and prioritized.

Thanks for any suggestions!
