
wordlist_filter's Introduction

What is this?

This is a tool that can be used to filter a large wordlist down to the "best" subset of words. Perfect for building 1d100 lists for oracles / inspirational prompts for writing / RPGs.

What is this not?

This tool cannot create wordlists from nothing; it can only take large wordlists and make them smaller. It will only output words that were in your input list.

How to use?

Python

  1. Download the model file, unzip it, put it inside the /model directory
  2. pip install -r requirements.txt
  3. python3 -c "import nltk; nltk.download('wordnet')"
  4. ./filter.py -f example.txt -n 10

Docker

  1. Download the model file, unzip it, put it inside the /model directory
  2. docker build . -t filter
  3. docker run -i --rm filter -f - -n 10 < example.txt

Examples

Get the 10 "best" verbs from the list:

./filter.py -f example.txt -n 10 -p VERB

Average
Betray
Calm
Control
Create
Depart
Empty
Fancy
Release
Travel

Give me the "best" 100 words from all my wordlists:

find /path/to/wordlists -type f -name '*.txt' -exec cat {} \; | ./filter.py -f - -n 100

Advance
Advantage
Affect
Animal
Armed
...

Why did you make this?

I wanted a way to take all the RPG Oracle wordlists I have, combine them together, and generate a single list that encompasses the best of all of them. More importantly, I wanted something that:

  • Was deterministic. The same inputs will yield the same outputs.
  • Could optionally filter words by their part-of-speech (noun, verb, adjective)
  • Took word similarity and diversity into account
  • Allowed me to specify priorities of words

How do I prioritize words?

Two ways:

  • Words that appear more than once in the list have higher priority
  • Words that appear towards the top of the list have higher priority

This means you can concatenate a bunch of wordlists together in a preferred order, then run the combined list through this filter to get the best words from all of them.

How does it work?

Ah, that's the complicated part.

First, it cleans the input words, removing comment lines, lines that contain multiple words, lines that contain symbols, etc.
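As a rough illustration of the cleaning step, here is a minimal sketch. It is not taken from filter.py: the function name `clean_words` is hypothetical, and it assumes that comments start with `#` and that a valid word consists only of letters. The real rules may differ.

```python
def clean_words(lines):
    """Keep only lines that hold exactly one purely alphabetic word."""
    cleaned = []
    for line in lines:
        word = line.strip()
        # Skip blank lines and comment lines (assumed to start with '#').
        if not word or word.startswith("#"):
            continue
        # Skip lines that contain more than one word.
        if len(word.split()) != 1:
            continue
        # Skip anything with symbols, digits, or punctuation.
        if not word.isalpha():
            continue
        cleaned.append(word)
    return cleaned
```

Duplicates are kept on purpose here, since repeated words are what drive the prioritization in the next step.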

Then it reorders the words by frequency, while maintaining the input order for words with the same counts.
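The reordering can be sketched with a counter and a stable sort. This is an assumption about how it could be done, not the actual code: `prioritize` is a hypothetical name, and collapsing duplicates into a single entry at this point is also an assumption.

```python
from collections import Counter


def prioritize(words):
    """Order unique words by how often they appear, most frequent first.

    Words with the same count keep the order of their first appearance.
    """
    counts = Counter(words)
    # dict.fromkeys drops duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(words))
    # sorted() is stable, so ties stay in input order.
    return sorted(unique, key=lambda word: -counts[word])
```

For example, `prioritize(["Calm", "Betray", "Create", "Betray", "Calm", "Betray"])` puts `Betray` first (three occurrences), then `Calm` (two), then `Create` (one).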

Next, it optionally filters on the specified part of speech, leaving only nouns/verbs/adjectives in the list.

Then, it uses the hdbscan algorithm to cluster words together based on their vector representation.

Word2Vec is a popular algorithm for word embedding, which represents words as dense vectors in a high-dimensional space based on their contextual usage patterns. It captures semantic and syntactic relationships between words, enabling mathematical operations on the word vectors, such as word similarity and analogy.

(This code uses a FastText model rather than Word2Vec to convert words to vectors. FastText builds vectors from subword pieces, so it can handle words outside of its training set.)

hdbscan will group similar words together, such as [dog, canine, puppy], into clusters.

The "best" word is chosen from each cluster, with the intent that the wordlist won't contain duplicate words in terms of meaning.

Then, from the remaining list, the n most diverse words are chosen. This is done by finding the words with the most dissimilar word vectors.

The final list is sorted alphabetically.
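One way to pick the n most dissimilar words is a greedy "farthest point" selection using cosine distance. The sketch below is an assumption, since the description above does not say which selection method is used. The name `most_diverse` and the toy vectors in the usage note are hypothetical, and the vectors are assumed to be non-zero.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def most_diverse(words, vectors, n):
    """Greedily choose n words that are far apart, then sort them.

    Starts from the first (highest priority) word, then repeatedly adds
    the word whose nearest already-chosen neighbour is farthest away.
    """
    if not words or n <= 0:
        return []
    chosen = [0]
    while len(chosen) < min(n, len(words)):
        best_index, best_score = None, -1.0
        for i in range(len(words)):
            if i in chosen:
                continue
            # Distance to the closest word that is already selected.
            score = min(cosine_distance(vectors[i], vectors[j]) for j in chosen)
            if score > best_score:  # strict '>' keeps the earlier word on ties
                best_index, best_score = i, score
        chosen.append(best_index)
    # The final list is sorted alphabetically.
    return sorted(words[i] for i in chosen)
```

With `words = ["dog", "puppy", "car", "sky"]` and toy vectors `[[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0, 1]]`, asking for three words drops `puppy`, because it sits almost on top of `dog`. Starting from the first word and breaking ties by input order keeps the result deterministic.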

TLDR:

  • Prioritize the words
  • Cluster them by similarity
  • Pick one word from each cluster to make a list
  • Choose the N most dissimilar words from that list

What do you mean by "best" words?

"Best" is subjective, but here is what the algorithm does. Given a cluster, it goes down this list of criteria:

  • Take the cluster and all of its synonyms, calculate an average vector, and find the single word closest to that average vector. Skip this if the cluster is too small.
  • Choose the word in the cluster that is the most frequent (due to duplicates). Skip if there is a tie.
  • Choose the word that has the most synonyms and antonyms, indicating that it is a versatile word that could have multiple interpretations.
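The cascade above could look roughly like this. It is a simplified sketch under several assumptions: `pick_best` and its parameters are hypothetical names, the `min_size` threshold of 3 is invented for illustration, the average vector is computed over the cluster members only (the synonym expansion is left out), and `related` is assumed to be a precomputed count of synonyms and antonyms per word.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def pick_best(cluster, vectors, counts, related, min_size=3):
    """Choose one representative word from a cluster.

    cluster: list of words; vectors: word -> vector;
    counts: word -> occurrences in the input;
    related: word -> number of synonyms and antonyms (assumed precomputed).
    """
    # 1. Word closest to the average vector, skipped for small clusters.
    if len(cluster) >= min_size:
        dims = len(vectors[cluster[0]])
        centroid = [
            sum(vectors[word][d] for word in cluster) / len(cluster)
            for d in range(dims)
        ]
        return max(cluster, key=lambda word: cosine_similarity(vectors[word], centroid))

    # 2. Most frequent word, skipped if there is a tie.
    top = max(counts[word] for word in cluster)
    leaders = [word for word in cluster if counts[word] == top]
    if len(leaders) == 1:
        return leaders[0]

    # 3. Word with the most synonyms and antonyms.
    return max(cluster, key=lambda word: related[word])
```

Because `max` returns the first of several equal candidates, ties at any stage fall back to input order, which keeps the choice deterministic.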

Where did the example.txt come from?

It is built from multiple wordlists, the primary ones coming from:

What is up with this code?

I pieced it together with the help of ChatGPT. It may not be the most efficient, but it is pretty well commented and broken out into small functions, so hopefully it's easy to read and understand.

wordlist_filter's People

Contributors

jrruethe
