Fakenews

Generate fake news headlines with the power of non-stationary Markov chains.
In reality this generator is not limited to news-headlines but it is designed to generate single sentence sequences rather than arbitrary length blobs.

How-to

The program is run with python3.

$   python3 fakenews.py

With the default configuration, this script generates a new model from the available data unless a cached model already exists. The output is then printed to stdout. The markovmodel.py module can also be used as a stand-alone generator without printed output. See the implementation of fakenews.py for more details.

Command-line options

-n, --samples       # specifies the number of headlines to generate (default 10)
-o, --order         # specifies the order of the Markov-chain (default 2)
--refresh-cache     # tells the program to ignore cached data and generate a new model (default false)
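The options above could be declared with argparse roughly as follows. This is a minimal sketch that only mirrors the documented flags and defaults; the function name build_parser and the help texts are assumptions, and the real fakenews.py may define its options differently.

```python
import argparse


def build_parser():
    # Hypothetical reconstruction of the documented options; not taken
    # from the actual fakenews.py source.
    parser = argparse.ArgumentParser(description="Generate fake news headlines.")
    parser.add_argument("-n", "--samples", type=int, default=10,
                        help="number of headlines to generate")
    parser.add_argument("-o", "--order", type=int, default=2,
                        help="order of the Markov chain")
    parser.add_argument("--refresh-cache", action="store_true",
                        help="ignore cached data and generate a new model")
    return parser


# Parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args(["-n", "100", "--order", "3"])
print(args.samples, args.order, args.refresh_cache)
```

Note that argparse exposes --refresh-cache as the attribute refresh_cache, and that store_true gives the documented default of false.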

Example usage

$   python3 fakenews.py -n 100 --refresh-cache --order 3

This operation generates 100 fake headlines, ignoring existing caches, using a fresh 3rd order Markov chain.

Tips and tricks

When the order is increased, the generated content gets closer to the original data. This often means that things like sentence structure, word ordering and grammar persist, which gives a more credible and lifelike result. However, increasing the order also adds constraints on the generator, and the amount of data required increases as well.

If, for example, the average length of the headlines in your dataset is 5 words and you generate headlines of order 2, the generator will use all pairs of consecutive words from the data. I've made the generator non-stationary, which basically means that the position of each pair is also taken into account. For an order 3 model, all triples of consecutive words are used, and so on. Thus, when the order increases, the number of combinations that can be made from these tuples decreases. So if the dataset is too small and the order is too high, the generated headlines will be identical to those of the dataset.
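To make the idea of a position-aware chain concrete, here is a minimal sketch. It is not the code in markovmodel.py: the names build_chain and generate, the boundary markers, and the choice to key transitions on (position, previous words) are all assumptions made for illustration. It expects an order of at least 1.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary markers


def build_chain(headlines, order=2):
    """Map (position, previous `order` words) to the words observed next."""
    chain = defaultdict(list)
    for line in headlines:
        words = [START] * order + line.split() + [END]
        for i in range(len(words) - order):
            state = tuple(words[i:i + order])
            # Including the position i in the key is what makes the
            # chain non-stationary in this sketch.
            chain[(i, state)].append(words[i + order])
    return chain


def generate(chain, order=2, rng=random):
    """Walk the chain from the start state until the end marker is drawn."""
    state, out, pos = (START,) * order, [], 0
    while True:
        word = rng.choice(chain[(pos, state)])
        if word == END:
            return " ".join(out)
        out.append(word)
        state = state[1:] + (word,)
        pos += 1


chain = build_chain(["a b c", "a b d", "x b c"], order=1)
print(generate(chain, order=1))
```

With order=1 this toy dataset can yield a headline such as "x b d", which does not occur in the data. With order=2 every state has a single observed continuation, so only the three source headlines can come out, which is the overfitting effect described above.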

Caching

Caching is turned on by default. It makes subsequent runs faster but can be bypassed at runtime. The chain used by the generator may take a fairly long time to calculate, depending on the size of your data and your CPU/memory, so if you plan to run the script several times I recommend that you leave caching of the chain enabled.

To invalidate the cache and calculate a new chain, the --refresh-cache command-line option can be specified. This is useful when new data has been added and you want to expand the chain. The cache can also be manually invalidated by removing the .cache.pkl storage.
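The caching behaviour described above can be sketched with pickle. Only the file name .cache.pkl comes from this README; the helper load_or_build and its parameters are hypothetical, and the real script may structure its caching differently.

```python
import os
import pickle

CACHE_PATH = ".cache.pkl"  # file name taken from this README


def load_or_build(build, refresh_cache=False, cache_path=CACHE_PATH):
    """Return the cached model if one exists, else build and cache a new one.

    `build` is any zero-argument callable that returns a picklable model.
    """
    if not refresh_cache and os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    model = build()
    with open(cache_path, "wb") as fh:
        pickle.dump(model, fh)
    return model
```

Passing refresh_cache=True and deleting the cache file have the same effect in this sketch: the next call rebuilds the model and overwrites the cache.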

Source data

The generator draws its source data from all the files stored in the data/ subdirectory, so place your datasets there. Each line of those files is treated as a single coherent headline by the parser.

Right now there is no good character filtering policy, so most Unicode characters are accepted. However, all double quotation characters are removed.
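A loader matching this description might look like the sketch below. The function name load_headlines, the UTF-8 encoding, the skipping of empty lines, and the exact set of quote characters stripped (ASCII and curly double quotes) are assumptions; the actual parser may differ on each of these points.

```python
from pathlib import Path

# Assumed set of "double quotation characters": ASCII and curly quotes.
QUOTES = str.maketrans("", "", '"\u201c\u201d')


def load_headlines(data_dir="data"):
    """Read every file in data_dir, treating each non-empty line as a headline."""
    headlines = []
    for path in sorted(Path(data_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.translate(QUOTES).strip()
                if line:
                    headlines.append(line)
    return headlines
```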

Currently I can't share the Swedish datasets I've been using during development. I'm working on a solution, but in the meantime you'll have to use your own. Open APIs or non-aggressive web scraping should do the trick, although note that intensive web scraping may be viewed as a denial-of-service attack.

fakenews's Issues

Please implement a realnews generator

I want the fake news generator to generate real news.

Implement real news?
Also implement clickbait headlines?
Can this fakenews generator also be a CLI text editor?

feat: some solution to avoid generating headlines identical to the source data

When the source data is too small or has too little variation and the order value is too large, overfitting will occur and generated headlines will start looking very similar, if not identical, to their counterparts in the source data.

One way or another, I'd like to at the very least make the user aware of the likeness. Some options I've thought of so far:

Add option to limit generated data with a likeness-threshold value, for example: "only generate headlines that are at most 75% identical to one of the source headlines."
Add the calculated likeness-value to the output so that the user can decide their relevance for themselves.

To keep things modular and to avoid forcing costly operations on every user, the ability to censor the output should perhaps not be part of the fakenews.py module itself. Checking a headline's likeness to the source data will likely be a fairly complex problem and may thus increase overhead significantly.
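As a starting point for discussion, both options could be prototyped in a separate helper built on difflib from the standard library. This is only a sketch: the names likeness and filter_by_likeness are hypothetical, and the word-level SequenceMatcher ratio is just one possible definition of likeness.

```python
from difflib import SequenceMatcher


def likeness(candidate, sources):
    """Highest word-level similarity (0.0 to 1.0) between candidate and any source."""
    cand_words = candidate.split()
    return max(
        (SequenceMatcher(None, cand_words, s.split()).ratio() for s in sources),
        default=0.0,
    )


def filter_by_likeness(candidates, sources, threshold=0.75):
    """Return (headline, score) pairs whose likeness is at most threshold."""
    scored = ((c, likeness(c, sources)) for c in candidates)
    return [(c, score) for c, score in scored if score <= threshold]
```

Each candidate is compared against every source headline, so the cost grows with the product of both counts, which is in line with the overhead concern above.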

feat: evaluating data's statistical variance

It seems like the data I have been using during development may contain an excessive number of similar, albeit not identical, headlines. This loss of variance, which also occurs when large portions of the input headlines are in fact identical, does not affect the generator directly, since all these occurrences are correctly represented in the transition matrix. However, it does increase the number of generated headlines that are identical or very similar to the source data.

This might be desirable in some scenarios, but not if you want a varied output. Thus, some tool for calculating the statistical variance of the source data would be appreciated. 🔬📊
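A very crude first step could be to count exact duplicates. The sketch below is not a measure of statistical variance and does not detect near-duplicates; the function name duplicate_stats and the case-insensitive comparison are assumptions.

```python
from collections import Counter


def duplicate_stats(headlines):
    """Summarise how many headlines are exact (case-insensitive) repeats."""
    counts = Counter(h.strip().lower() for h in headlines)
    total = sum(counts.values())
    unique = len(counts)
    return {
        "total": total,
        "unique": unique,
        # Share of headlines that repeat an earlier one.
        "duplicate_ratio": (1 - unique / total) if total else 0.0,
        "most_common": counts.most_common(3),
    }
```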
