chat-miner's Introduction

chat-miner: turn your chats into artwork

PyPI Version License: MIT Downloads codecov Code style: black


chat-miner provides lean parsers for every major platform transforming chats into dataframes. Artistic visualizations allow you to explore your data and create artwork from your chats.

1. Installation

The latest release, including dependencies, can be installed via PyPI:

pip install chat-miner

If you're interested in contributing, running the latest source code, or just like to build everything yourself:

git clone https://github.com/joweich/chat-miner.git
cd chat-miner
pip install -r requirements.txt

2. Exporting chat logs

Have a look at the official tutorials for WhatsApp, Signal, Telegram, Facebook Messenger, or Instagram Chats to learn how to export chat logs for your platform.

3. Parsing

The following code showcases the WhatsAppParser module. The usage of SignalParser, TelegramJsonParser, FacebookMessengerParser, and InstagramJsonParser follows the same pattern.

from chatminer.chatparsers import WhatsAppParser

parser = WhatsAppParser(FILEPATH)
parser.parse_file()
df = parser.parsed_messages.get_df(as_pandas=True) # as_pandas=False returns polars dataframe

Note: Depending on your operating system, you may need to write the file path as a raw string so that Python does not interpret backslashes as escape sequences.

import os
FILEPATH = r"C:\Users\Username\chat.txt" # Windows
FILEPATH = "/home/username/chat.txt" # Unix
assert os.path.isfile(FILEPATH)
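
The note on raw strings can be checked in isolation. In a normal string literal, `"C:\Users\..."` is a syntax error because `\U` starts a Unicode escape. The sketch below uses an illustrative path (not one taken from the project) and shows that a raw string and a string with doubled backslashes describe the same path.

```python
from pathlib import PureWindowsPath

# Illustrative path only; replace it with the location of your own export.
raw_path = r"C:\Users\Username\chat.txt"        # raw string: backslashes kept as-is
escaped_path = "C:\\Users\\Username\\chat.txt"  # same path with doubled backslashes

same_string = raw_path == escaped_path

# PureWindowsPath parses Windows-style paths on any operating system.
file_name = PureWindowsPath(raw_path).name
```

Either spelling can be passed to the parser as FILEPATH.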

4. Visualizing

import chatminer.visualizations as vis
import matplotlib.pyplot as plt

4.1 Heatmap: Message count per day

fig, ax = plt.subplots(2, 1, figsize=(9, 3))
ax[0] = vis.calendar_heatmap(df, year=2020, cmap='Oranges', ax=ax[0])
ax[1] = vis.calendar_heatmap(df, year=2021, linewidth=0, monthly_border=True, ax=ax[1])

4.2 Sunburst: Message count per daytime

fig, ax = plt.subplots(1, 2, figsize=(7, 3), subplot_kw={'projection': 'polar'})
ax[0] = vis.sunburst(df, highlight_max=True, isolines=[2500, 5000], isolines_relative=False, ax=ax[0])
ax[1] = vis.sunburst(df, highlight_max=False, isolines=[0.5, 1], color='C1', ax=ax[1])

4.3 Wordcloud: Word frequencies

fig, ax = plt.subplots(figsize=(8, 3))
stopwords = ['these', 'are', 'stopwords']
kwargs={"background_color": "white", "width": 800, "height": 300, "max_words": 500}
ax = vis.wordcloud(df, ax=ax, stopwords=stopwords, **kwargs)

4.4 Radarchart: Message count per weekday

if not vis.is_radar_registered():
    vis.radar_factory(7, frame="polygon")
fig, ax = plt.subplots(1, 2, figsize=(7, 3), subplot_kw={'projection': 'radar'})
ax[0] = vis.radar(df, ax=ax[0])
ax[1] = vis.radar(df, ax=ax[1], color='C1', alpha=0)

5. Natural Language Processing

5.1 Add Sentiment

from chatminer.nlp import add_sentiment

df_sentiment = add_sentiment(df)

5.2 Example Plot: Sentiment per Author in Groupchat

df_grouped = df_sentiment.groupby(['author', 'sentiment']).size().unstack(fill_value=0)
ax = df_grouped.plot(kind='bar', stacked=True, figsize=(8, 3))

6. Command Line Interface

The CLI supports parsing chat logs into csv files. As of now, you can't create visualizations from the CLI directly.

Example usage:

$ chatminer -p whatsapp -i exportfile.txt -o output.csv

Usage guide:

usage: chatminer [-h] [-p {whatsapp,instagram,facebook,signal,telegram}] [-i INPUT] [-o OUTPUT]

options:
  -h, --help 
                        Show this help message and exit
  -p {whatsapp,instagram,facebook,signal,telegram}, --parser {whatsapp,instagram,facebook,signal,telegram}
                        The platform from which the chats are imported
  -i INPUT, --input INPUT
                        Input file to be processed
  -o OUTPUT, --output OUTPUT
                        Output file for the results

chat-miner's People

Contributors

alfonso46674, bdfsaraiva, dependabot[bot], exterminator11, gajrajgchouhan, galatolofederico, gutjuri, joaoaab, joweich, kiankhadempour, louispires, luc-girod, massimopavoni, smty2018, unsurp4ssed, untriextv, victormihalache

chat-miner's Issues

Add minimal documentation

Goal

Docs should be generated with Sphinx and hosted on readthedocs. This issue is meant to introduce a first minimal documentation that is to be extended in future PRs.

Jobs to be done

  • Add minimal sphinx configuration
  • Deploy documentation on readthedocs
  • [OPTIONAL] Add GitHub workflow for automatic deployment if PRs are merged on main

Unknown String format

While trying to parse a chat exported from a Phone in India,

raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: 7.9.2020

ValueError(f"Invalid date format: {line}")

"11.01.2023 17:54:46 INFO
Depending on the platform, the message format in chat logs might not be
standardized accross devices/versions/localization and might change over
time. Please report issues including your message format via GitHub."
Hi, as the log message asks, I am reporting my message format.
"ValueError: Invalid date format: 20.15.21.15 h Stock Donnerstag 18.19 h Stock 19-20 Athletik" -> The program reads this as a date, even though it is just normal message text.

Facebook Messenger KeyError 'type'

When running the FacebookMessengerParser, I get the following error:

14.12.2022 14:20:56 INFO
Depending on the platform, the message format in chat logs might not be
standardized accross devices/versions/localization and might change over
time. Please report issues including your message format via GitHub.

14.12.2022 14:20:56 INFO Initialized parser.
14.12.2022 14:20:56 INFO Starting reading raw messages into memory...
14.12.2022 14:20:56 INFO Finished reading 7567 raw messages into memory.
14.12.2022 14:20:56 INFO Starting parsing raw messages into dataframe...
0%| | 0/7567 [00:00<?, ?it/s]


KeyError Traceback (most recent call last)
Input In [10], in <cell line: 4>()
1 from chatminer.chatparsers import FacebookMessengerParser
3 parser = FacebookMessengerParser('message_3.json')
----> 4 parser.parse_file_into_df()

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\chatminer\chatparsers.py:45, in Parser.parse_file_into_df(self)
43 with logging_redirect_tqdm():
44 for mess in tqdm(self.messages):
---> 45 parsed_mess = self._parse_message(mess)
46 if parsed_mess:
47 parsed_messages.append(parsed_mess)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\chatminer\chatparsers.py:213, in FacebookMessengerParser._parse_message(self, mess)
212 def _parse_message(self, mess):
--> 213 if mess["type"] == "Share":
214 body = mess["share"]["link"]
215 elif "sticker" in mess:

KeyError: 'type'
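
The traceback shows the parser indexing `mess["type"]` directly, which fails for any raw message that lacks that key. A minimal sketch of a more defensive lookup follows. It is not the library's fix: the function name `extract_body`, the `"content"` key for plain text, and the `"sticker"` placeholder are assumptions made for illustration.

```python
from typing import Any, Optional


def extract_body(mess: dict[str, Any]) -> Optional[str]:
    """Return a text body for one raw message, or None if nothing usable is found."""
    # dict.get returns None instead of raising KeyError when "type" is absent.
    if mess.get("type") == "Share" and "share" in mess:
        return mess["share"].get("link")
    if "sticker" in mess:
        return "sticker"  # placeholder label, an assumption
    # "content" as the key for plain text is an assumption about the export format.
    return mess.get("content")
```

With this shape, a message without a `type` key falls through to the remaining branches instead of aborting the whole parse.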

CLI to parse into csv files

Goal

Add a command line interface that allows for parsing chat export files into csv files.
In its minimal version, the usage would look like this:

chatminer --parser whatsapp --input chatlog.txt --output export.csv

Jobs to be done:

  • Add new file cli.py
  • Add logic parsing and checking command line arguments
  • Trigger chat file parsing and csv export of resulting dataframe
  • [OPTIONAL] Add output for the --help flag explaining the usage of the CLI
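
The argument handling described above can be sketched with the standard library's argparse. This is an illustration rather than the project's actual cli.py: the function name `build_arg_parser` is hypothetical, and the flag names and platform choices mirror the usage guide in the README. argparse generates the --help output automatically.

```python
import argparse

# Platform names as listed in the README's usage guide.
PLATFORMS = ["whatsapp", "instagram", "facebook", "signal", "telegram"]


def build_arg_parser() -> argparse.ArgumentParser:
    """Build a parser for the three options named in the issue."""
    parser = argparse.ArgumentParser(prog="chatminer")
    parser.add_argument(
        "-p", "--parser", choices=PLATFORMS,
        help="The platform from which the chats are imported",
    )
    parser.add_argument("-i", "--input", help="Input file to be processed")
    parser.add_argument("-o", "--output", help="Output file for the results")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_arg_parser().parse_args(
    ["--parser", "whatsapp", "--input", "chatlog.txt", "--output", "export.csv"]
)
```

The parsed values would then select a parser class, run it on `args.input`, and write the resulting dataframe to `args.output`.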

Sentiment analysis for each message

Goal

Add a sentiment rating to each message in the parsed log. This should be done with libraries such as NLTK and VADER or Hugging Face's transformers.

Jobs to be done

  • Create a new file nlp.py in chatminer
  • Create a function def add_sentiment(df) to nlp.py, which adds the sentiment scores to the parser's output dataframe
  • Adjust README with tutorial and example

WhatsApp info messages break the code

Hei,

I haven't looked into fixing it yet, but I have noticed that many messages break the code.

For instance, if someone changes a group chat logo, or if a member is added, removed, or leaves, the line in the log spawns an error.

Similarly, if a chat participant's name includes ", the code breaks.

Which file types should be passed to each parser?

I am new to this, and I could not find which file type (JSON, HTML, text?) each parser expects for each platform. Could you add more information to the README?

Contact with - in the name is not properly recognized

I have one contact in my WhatsApp export that has a - in their name. This contact does not seem to be recognized as the author. Instead, the author is set to System or to a part of the message if the parser catches something in there (more specifically https:// or a datetime).

TypeError: isinstance is missing 1 argument in the chatparser

Hi,

I just wanted to play around with the chat-miner, but I ran into the following error. I have the latest version of the module, and I tried to just copy the code in the README.md, but no luck. Could you please take a look at it?

File "...\chatminer\chatparsers.py", line 284, in <lambda>
    lambda m: m["text"] if isinstance(m) is dict else m,
                           ^^^^^^^^^^^^^
TypeError: isinstance expected 2 arguments, got 1

EDIT: I am using the TelegramJsonParser

ValueError: Unknown projection 'radar'

I get the following error when I try to plot the word count by weekday radar chart using this code:

fig, ax = plt.subplots(1, 2, figsize=(7, 3), subplot_kw={'projection': 'radar'})
ax[0] = vis.radar(parser.df, ax=ax[0])
ax[1] = vis.radar(parser.df, ax=ax[1], color='C1', alpha=0)

"ValueError: Unknown projection 'radar'"

any idea how to fix it?

Refactor telegram parser

I am currently adding typing to the files to make it easier to use chat-miner. While doing this I read (for the first time) the code for the Telegram parser. The _read_raw_messages_from_file method is really good (other than a potential spot for a tqdm) but the _parse_message method is littered with assert and isinstance. In addition, I don't really understand what the code is doing.

For reference, here is the current code:

def _parse_message(self, mess: dict[str, Any]):
        if "from" in mess and "text" in mess:
            if isinstance(mess["text"], str):
                body = mess["text"]
            elif isinstance(mess["text"], list):
                assert all(
                    [
                        (isinstance(m, dict) and "text" in m) or isinstance(m, str)
                        for m in mess["text"]
                    ]
                )
                body = " ".join(
                    map(
                        lambda m: m["text"] if isinstance(m, dict) else m,
                        mess["text"],
                    )
                )
            else:
                raise ValueError(f"Unable to parse type {type(mess['text'])} in {mess}")

            time = dt.datetime.fromtimestamp(int(mess["date_unixtime"]))
            author = mess["from"]
            return ParsedMessage(time, author, body)
        # NOTE: I changed the default return value to None instead of False to standardize the failure case
        return None

I propose that someone who knows how the Telegram JSON file is structured tries to rewrite this method (perhaps @galatolofederico) so that it is more readable, type-safe (not actually but it makes it harder to make mistakes), and pythonic.
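
One possible shape for such a rewrite is sketched below. It keeps the logic of the code quoted above but moves the handling of the `text` field into a helper. The names `flatten_text` and `parse_message` are hypothetical, `ParsedMessage` here is a stand-in whose field names are assumptions, and malformed list entries raise ValueError where the quoted code uses assert.

```python
import datetime as dt
from typing import Any, NamedTuple, Optional, Union


class ParsedMessage(NamedTuple):
    # Stand-in for the library's ParsedMessage; field names are assumptions.
    time: dt.datetime
    author: str
    body: str


def flatten_text(text: Union[str, list]) -> str:
    """Join the 'text' field, which is either a string or a list of parts."""
    if isinstance(text, str):
        return text
    if isinstance(text, list):
        parts = []
        for part in text:
            if isinstance(part, str):
                parts.append(part)
            elif isinstance(part, dict) and "text" in part:
                parts.append(part["text"])
            else:
                raise ValueError(f"Unable to parse text part {part!r}")
        return " ".join(parts)
    raise ValueError(f"Unable to parse type {type(text)}")


def parse_message(mess: dict[str, Any]) -> Optional[ParsedMessage]:
    """Return a ParsedMessage, or None for entries without author or text."""
    if "from" not in mess or "text" not in mess:
        return None
    time = dt.datetime.fromtimestamp(int(mess["date_unixtime"]))
    return ParsedMessage(time, mess["from"], flatten_text(mess["text"]))
```

Splitting out the helper keeps each function to one decision, and the early return removes one level of nesting.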

Ease use by creating a sample code

I think it would be much easier for the user to build .py files with the readme code, so you only have to set the path and it would work. It would just be the readme but in an already built fashion.

Can't run the parser

Is there a specific Python version that should be used to run this parser? I can't run this, every time I solve an issue I have another one. I'm currently using Python 3.10.0.

ValueError: Invalid date format (WhatsApp)

My WhatsApp exported chat format is 8/21/21, 11:06 PM - NAME: MESSAGE

I first tested Chat Miner on a chat that has only a few texts. It worked fine. But when I applied the same procedure to a chat that has a lot of texts, it gave me the following error:

05.01.2023 21:24:13 INFO
Depending on the platform, the message format in chat logs might not be
standardized accross devices/versions/localization and might change over
time. Please report issues including your message format via GitHub.

05.01.2023 21:24:13 INFO Initialized parser.
05.01.2023 21:24:13 INFO Starting reading raw messages into memory...
05.01.2023 21:24:13 INFO Finished reading 40000 raw messages into memory.

ValueError Traceback (most recent call last)
Cell In [16], line 3
1 from chatminer.chatparsers import WhatsAppParser
----> 3 parser = WhatsAppParser("wa.txt")
4 parser.parse_file_into_df()

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\chatminer\chatparsers.py:118, in WhatsAppParser.__init__(self, filepath)
116 def __init__(self, filepath):
117 super().__init__(filepath)
--> 118 self._infer_datetime_format()

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\chatminer\chatparsers.py:162, in WhatsAppParser._infer_datetime_format(self)
160 max_second = max(max_second, day_and_month[1])
161 if (max_first > 12) and (max_second > 12):
--> 162 raise ValueError(f"Invalid date format: {line}")
164 if max_first > 12 and max_second <= 12:
165 self._logger.info("Inferred DMY format for date.")

ValueError: Invalid date format: 24.4.24

What could be the error? In my chat file "wa.txt", I tried to find a date "8/24/24" but there are no such date formats.
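
The report does not show the offending line, so the cause is not confirmed. One possibility is that a line of message text beginning with digits (such as a continuation line of a multi-line message) is read as a date. The sketch below is not the library's implementation: the names and the regular expression are assumptions. It follows the max_first/max_second idea visible in the traceback but only considers lines that start with a date followed by a time.

```python
import re

# A header must start with a date AND a time; the accepted separators are an assumption.
HEADER = re.compile(r"^\[?(\d{1,2})[./-](\d{1,2})[./-]\d{2,4},? \d{1,2}:\d{2}")


def infer_date_order(lines: list[str]) -> str:
    """Return "DMY", "MDY", or "ambiguous", judged by which leading field exceeds 12."""
    max_first = max_second = 0
    for line in lines:
        match = HEADER.match(line)
        if match is None:
            continue  # message text or continuation line, not a message header
        max_first = max(max_first, int(match.group(1)))
        max_second = max(max_second, int(match.group(2)))
    if max_first > 12 and max_second > 12:
        raise ValueError("Both leading date fields exceed 12; cannot infer the order")
    if max_first > 12:
        return "DMY"
    if max_second > 12:
        return "MDY"
    return "ambiguous"
```

Requiring the time after the date means a text line such as "24.4.24 was a good day" no longer contributes to the inference.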

Radarchart - unknown projection radar

Hey,
I've been trying to use your library to create the graphs from my chat exports. Unfortunately, I have run into an issue with the radarchart.

The way the code executes creates a problem. First, it runs the matplotlib.subplots function with the subplot_kw argument. The radar projection is, however, defined in the vis.radar() function as a part of the radar_factory().

Therefore, I get ValueError: Unknown projection 'radar' because it is not yet defined, when the subplots function runs.

I am trying to build a GUI for this library and this is the only thing, which I can't get to work.

How did you get yours to work?

calendar_heatmap() missing 1 required positional argument: 'year'

17.11.2022 11:15:10 INFO     Initialized parser.
17.11.2022 11:15:10 INFO     Starting reading raw messages into memory...
17.11.2022 11:15:10 INFO     Finished reading 38739 raw messages into memory.
17.11.2022 11:15:10 INFO     Inferred DMY format for date.
17.11.2022 11:15:10 INFO     Starting parsing raw messages into dataframe...
100%|███████████████████████████████████| 38739/38739 [00:04<00:00, 8693.78it/s]
17.11.2022 11:15:15 INFO     Finished parsing raw messages into dataframe.
               hour         words       letters
count  38739.000000  38739.000000  38739.000000
mean      14.367691      6.338367     35.067503
std        4.371213      6.666926     39.636194
min        0.000000      1.000000      1.000000
25%       11.000000      2.000000     11.000000
50%       14.000000      5.000000     26.000000
75%       18.000000      8.000000     46.000000
max       23.000000    383.000000   2181.000000
Traceback (most recent call last):
  File XXXX, line 8, in <module>
    vis.calendar_heatmap(parser.df)
TypeError: calendar_heatmap() missing 1 required positional argument: 'year'

used date format (WhatsApp, iOS):

[DD.MM.YY, hh:mm:ss] user: message

Author does not support having an emoji in it

If i have a message that looks like this:

DD-MM-YYY hh-mm - Alfonso 🤓: Lorem impsum

The function get_message_author cannot interpret the emoji since the regex patterns do not take emojis into account.
I've found an expression that matches any emoji (supposedly): https://www.regextester.com/106421
As I see it, there are two options:

  1. Create more regex rules to incorporate the use of emojis in the author.
  2. Delete all emojis from the author except if it is made up only of emojis.

I think the better solution is to delete all emojis from the author, it is easier to do and yields better data since there is no need to take emojis into account.
What do you think @joweich?
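
The second option could look roughly like the sketch below. The function name `clean_author` is hypothetical, and the Unicode ranges are an approximation chosen for illustration, not a complete definition of what counts as an emoji.

```python
import re

# Approximate emoji ranges; an assumption, not a complete emoji definition.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that often follows an emoji
    "]+"
)


def clean_author(author: str) -> str:
    """Strip emojis from an author name unless nothing else would remain."""
    stripped = EMOJI_PATTERN.sub("", author).strip()
    # Keep the original name when it consists only of emojis.
    return stripped if stripped else author.strip()
```

Applied before the author regex runs, this would turn "Alfonso 🤓" into "Alfonso" while leaving an emoji-only name untouched.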

Implement parser for Facebook Messenger chatfiles

Implement a new parser enabling the processing of Facebook Messenger export files.
The logic should be aligned with the existing parser classes.

Please find more information about exporting Facebook Messenger chatlogs here.

Wordcloud also displays non printable characters (Workaround)

I tried this on a chat with my girlfriend. Because we are Slovak (Š, Č, Á, ô, ...) and also use a lot of emojis, these ended up in the word cloud as weird characters and lots of numbers.

I made a small "dirty" fix using a regex.

edit visualisations.py Line 108 to:

words = ["".join(re.findall(r'\b[a-zA-Z0-9]+\b', word)).lower() for sublist in messages for word in sublist]

This is a really cheap workaround, as I did not have much time to check the inner workings of the code, but it gets rid of all characters other than A-Z, a-z, and 0-9. If a word has an emoji in the middle, the emoji is removed and the two halves are joined.
I am not sure whether this code is ready to be added to the project, which is why I am filing it as an issue.
EDIT: This may only be a problem because I use Linux; the behaviour might not occur on Windows or macOS, as I have had a few problems with special characters before.
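
One side effect of restricting the pattern to `[a-zA-Z0-9]` is that accented letters such as Š or ô are removed as well. A variant that keeps letters from any alphabet is sketched below. It is an alternative to the workaround above, not code from the project; the name `clean_word` is hypothetical, and unlike the workaround it also drops digits.

```python
import re

# [^\W\d_] matches any Unicode letter: a word character that is neither a digit nor "_".
LETTERS = re.compile(r"[^\W\d_]+")


def clean_word(word: str) -> str:
    """Keep only letters (including accented ones) and lowercase the result."""
    return "".join(LETTERS.findall(word)).lower()


cleaned = [clean_word(w) for w in ["Čau", "ahoj😀svet", "2023", "ô"]]
```

Words that consist only of digits or emojis become empty strings and would need to be filtered out before building the word cloud.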

Unrecognized date format

Hi! I just did a whatsapp export and put it into chat-miner, but it seems my datetime format is not supported. I was able to get it working by modifying the code a bit, but not in a way that would be good for a PR, hence this issue instead.

My exported format is like this:
26-04-2022 12:11 - [name]: [message]

So it's day-month-year hour:minute.
Not only does chat-miner currently not expect a - as date separator, it also gets confused later on when it splits datetime and message content on - (splitting on - with spaces around it works better).
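
The splitting idea can be sketched as follows for the format given above. This is not the library's parser: the function name `parse_line` is hypothetical, and it assumes every line has the form "DD-MM-YYYY HH:MM - author: message".

```python
import datetime as dt


def parse_line(line: str) -> tuple[dt.datetime, str, str]:
    """Split 'DD-MM-YYYY HH:MM - author: message' into its three parts."""
    # Split only on the first " - " so dashes inside the date, the author
    # name, or the message body are left alone.
    timestamp, rest = line.split(" - ", 1)
    author, body = rest.split(": ", 1)
    return dt.datetime.strptime(timestamp, "%d-%m-%Y %H:%M"), author, body


time, author, body = parse_line("26-04-2022 12:11 - Anna-Lena: see you - later")
```

Limiting both splits to one occurrence keeps a hyphenated author name and any later " - " in the message intact.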

Pie chart for most used emoji

Goal

Create a pie chart to visualize the most used emoji. Matplotlib has a built-in pie chart, which we should leverage here. Eventually, it should look like the pie chart here.

Jobs to be done

  • Leverage matplotlib's pie chart in chat-miner's visualizations.py file as a new function def piechart_emoji(df, ...)
  • Introduce parameter to limit pie chart to the n most frequent emoji defaulting to 10
  • Introduce pandas logic to count emojis in parsed chat
  • Add example to README
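
The counting step can be sketched with the standard library alone; the issue proposes pandas, so this only illustrates the logic. The function name `top_emojis` is hypothetical and the emoji ranges are an approximation.

```python
import re
from collections import Counter

# Approximate emoji ranges; an assumption, not a complete emoji definition.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def top_emojis(messages: list[str], n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent emojis with their counts, most frequent first."""
    counts: Counter = Counter()
    for message in messages:
        counts.update(EMOJI.findall(message))
    return counts.most_common(n)
```

The emojis and counts returned here could serve as the labels and wedge sizes passed to matplotlib's pie chart.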

Discord Support

It would be great for this to support Discord chats. Perhaps this is something that can be looked into?
Several Discord chat export tools exist, for example DiscordChatExporter.

Index out of range while parsing

While trying this app, I ran into the following error when parsing the text file.

PS D:\Coding\Python> & C:/Users/qwerk/AppData/Local/Programs/Python/Python37/python.exe d:/Coding/Python/chat-miner/run.py
INFO:chatminer.chatparsers.WhatsAppParser.2067091229440:Finished reading 40001 messages.
INFO:chatminer.chatparsers.WhatsAppParser.2067091229440:Inferred month first format.
Traceback (most recent call last):
  File "d:/Coding/Python/chat-miner/run.py", line 5, in <module>
    parser.parse_file_into_df()
  File "d:\Coding\Python\chat-miner\chatminer\chatparsers.py", line 29, in parse_file_into_df
    self._parse_message(mess)
  File "d:\Coding\Python\chat-miner\chatminer\chatparsers.py", line 98, in _parse_message
    body = mess.split('-', 1)[1]
IndexError: list index out of range

Is it because of the long chats? (maybe?)
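
The IndexError means that `mess.split('-', 1)` returned a single element, so at least one raw message contained no "-" at all. Which message that was is not shown in the report. A minimal sketch of a split that reports a missing separator instead of raising follows; the function name `split_body` and the " - " separator are assumptions.

```python
from typing import Optional


def split_body(mess: str, separator: str = " - ") -> Optional[str]:
    """Return the text after the first separator, or None if it is missing."""
    # str.partition never raises: when the separator is absent, found is "".
    _, found, body = mess.partition(separator)
    return body if found else None
```

A caller can then skip or log messages for which the result is None rather than stopping the whole parse.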

Matrix support

It'd be great to have support for the federated Matrix platform.

Most major Matrix clients don't seem to offer an 'export chats' feature, so the easiest way to get the data is probably to interact with the user's homeserver directly.

Radar chart for message frequency per weekday

Goal

Create a radar chart (also called spider or star chart) to visualize the message frequency per day of the week. Matplotlib has a nice tutorial that already holds everything we need.

Jobs to be done

  • Migrate matplotlib example to chat-miner's visualizations.py file as a new function def radar(df, ...)
  • Introduce pandas logic to enable breakdown on weekday basis, like df_radar = df.groupby(by="weekday")["message"].count()
  • Add example to README

Add unit tests for the parsers

Goal:

Test the single parsers using unit tests.

Jobs to be done for each parser:

  1. Create dummy chatlog with the same formatting as the respective native export files
  2. Define the target output of the parser for the dummy chatlog
  3. Test the target and the actual output for equality

Next steps

  • [WhatsApp] Create tests (done in #23)
  • [All parsers] Create GitHub workflow running the test (done in #23)
  • [Telegram] Create tests
  • [Facebook] Create tests
  • [Signal] Create tests
  • [WhatsApp] Extend the dummy chat logs with edge cases (e.g. logged system notifications in export files)

Contribution

The unit tests for the different parsers may be created separately, although they should be aligned with each other.

Add unit test for Command Line Interface

Goal

The CLI does not yet have test coverage. This should be fixed within this issue. This stackoverflow post might be a hint on how to test CLIs in python.

Jobs to be done

  • Add unit test file test_cli.py to test directory

CLI for visualisation

It would be nice to have commands for some basic visualization, at least the basic plots from the examples.

I could look into that myself, when I have more free time :)

Implement parser for Telegram chatfiles in JSON format

Implement a new parser enabling the processing of Telegram export files. Telegram files may be exported in both JSON and HTML format.
This issue targets the parsing of JSON files. The corresponding issue for HTML files is #12.
The logic should be aligned with the existing parser classes.

Please find more information about exporting Telegram chatlogs here.

Refactor dataframe wrangling in calendar heat map

In calendar_heatmap (and to some extent in the other visualisations), we implicitly assume a certain dimensionality of the transformed dataframes (e.g. to have a scalar returned by df.min()). This should be made more explicit, e.g. by adding asserts or improve naming.
Originally raised in #98.

Parser for iMessage chatfiles

Goal

Implement a new parser enabling the processing of iMessage export files. The logic should be aligned with the existing parser classes.

Jobs to be done

  • Do research on how to export iMessage chats
  • Create a new class iMessageParser(Parser)
  • Create logic for _read_file_into_list() and _parse_message() like in the other parsers
  • Add export tutorial to README

`calendar_heatmap()` has a ValueError

I have exported my Signal chat according to the README and moved the index.md file generated by the export to the folder with the Python script.

This is the folder structure:

.
├── index.md
└── main.py

0 directories, 2 files

And this is the main.py

from chatminer.chatparsers import SignalParser

import chatminer.visualizations as vis
import matplotlib.pyplot as plt


parser = SignalParser("./index.md")
parser.parse_file_into_df()

fig, ax = plt.subplots(2, 1, figsize=(9, 3))
ax[0] = vis.calendar_heatmap(parser.df, year=2020, cmap='Oranges', ax=ax[0])
ax[1] = vis.calendar_heatmap(parser.df, year=2021, linewidth=0, monthly_border=True, ax=ax[1])

But I get this when running the script:

13.12.2022 10:10:18 INFO     
            Depending on the platform, the message format in chat logs might not be
            standardized accross devices/versions/localization and might change over
            time. Please report issues including your message format via GitHub.
            
13.12.2022 10:10:18 INFO     Initialized parser.
13.12.2022 10:10:18 INFO     Starting reading raw messages into memory...
13.12.2022 10:10:18 INFO     Finished reading 99018 raw messages into memory.
13.12.2022 10:10:18 INFO     Starting parsing raw messages into dataframe...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 99018/99018 [00:03<00:00, 31160.51it/s]
13.12.2022 10:10:22 INFO     Finished parsing raw messages into dataframe.
Traceback (most recent call last):
  File "/Users/victormihalache/Desktop/chatdata/main.py", line 11, in <module>
    ax[0] = vis.calendar_heatmap(parser.df, year=2020, cmap='Oranges', ax=ax[0])
  File "/opt/homebrew/lib/python3.10/site-packages/chatminer/visualizations.py", line 175, in calendar_heatmap
    pc, cax=cax, ticks=[min(vmin), int((min(vmin) + max(vmax)) / 2), max(vmax)]
ValueError: cannot convert float NaN to integer

Example of the index.md file:

[2022-12-12 10:48] Person2:
>
> The replied-to message
>
The message
[2022-12-12 10:49] Person2: Another message
[2022-12-12 10:49] Person2: And another one
[2022-12-12 10:49] Me: The answer
[2022-12-12 10:50] Person2: A response to the answer

(I have only changed the name of "Person2", not "Me", and the contents of the messages, but the format and dates is the same as the original)

I have no clue why it is not working.

EDIT:

I just noticed that sometimes we sent code snippets written in python, and VSCode interprets a python comment as a header, here is an example of me replying to a snippet (notice how the > is not present on every line)

[2022-02-17 10:59] Me:
>
> ```
# the comment
while True:
  print("Some python code")
\`\`\`
>
my eyes, my poor eyes

(the [\`\`\`] is just so formatting is not broken here, but is [```] in the original message)

EDIT 2:

However, doing print(parser.df) doesn't seem to show any anomaly

                 datetime author                                            message  weekday  hour  words  letters
0     2022-12-13 09:17:00    Pr2                                                msg  Tuesday     9      1        5
1     2022-12-13 09:17:00     Me  A seemingly long message so you see the dots h...  Tuesday     9      9      113
2     2022-12-13 09:17:00    Pr2                                            another  Tuesday     9      1        4
3     2022-12-13 09:16:00     Me                                         a response  Tuesday     9      4       25
4     2022-12-13 09:16:00    Pr2                              The words and letters  Tuesday     9      5       22
...                   ...    ...                                                ...      ...   ...    ...      ...
99013 2021-11-14 17:58:00    Pr2                                                ...   Sunday    17      1        3
99014 2021-11-14 17:58:00    Pr2            Are wrong because i changed the message   Sunday    17      1        4
99015 2021-11-14 17:58:00    Pr2                                                ...   Sunday    17      1        3
99016 2021-11-14 17:58:00     Me                                              hello   Sunday    17      1        6
99017 2021-11-14 13:05:00    Pr2                                                      Sunday    13      1        0

[99018 rows x 7 columns]

EDIT 3:

The sunburst works fine:

from chatminer.chatparsers import SignalParser

import chatminer.visualizations as vis
import matplotlib.pyplot as plt


parser = SignalParser("./index.md")
parser.parse_file_into_df()

fig, ax = plt.subplots(1, 2, figsize=(
    7, 3), subplot_kw={'projection': 'polar'})
ax[0] = vis.sunburst(parser.df, highlight_max=True, isolines=[
                     2500, 5000], isolines_relative=False, ax=ax[0])
ax[1] = vis.sunburst(parser.df, highlight_max=False,
                     isolines=[0.5, 1], color='C1', ax=ax[1])

plt.show()

Screenshot 2022-12-13 at 10 36 54 AM
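
The rows printed in EDIT 2 run from 2021-11-14 to 2022-12-13, while the failing call requests year=2020. If no messages fall in the requested year, the minimum and maximum daily counts would be NaN, and calling int() on NaN raises exactly this ValueError. This is an inference from the report, not a confirmed diagnosis. The sketch below shows the tick computation from the traceback with a guard in front of it; the function name `colorbar_ticks` and the error message are assumptions.

```python
import math


def colorbar_ticks(daily_counts: list[float]) -> list[int]:
    """Return [min, midpoint, max] ticks, refusing empty or all-NaN input."""
    valid = [count for count in daily_counts if not math.isnan(count)]
    if not valid:
        # Fail with a readable message instead of "cannot convert float NaN to integer".
        raise ValueError("No messages found for the requested year")
    low, high = min(valid), max(valid)
    return [int(low), int((low + high) / 2), int(high)]
```

Calling the heatmap with a year that is covered by the export (2021 or 2022 in the rows shown) would be the first thing to try.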

Implement parser for Telegram chatfiles in HTML format

Implement a new parser enabling the processing of Telegram export files. Telegram files may be exported in both JSON and HTML format.
This issue targets the parsing of HTML files. The corresponding issue for JSON files is #11.
The logic should be aligned with the existing parser classes.

Please find more information about exporting Telegram chatlogs here.

If you add custom stopwords all the stopwords get deleted (but still work, somehow)

I was trying to add default stopwords and I decided to print out the values of the variable before and after. I noticed that before updating it, it would be a completely fine set of strings, but after updating it would be None. Here is the testing code:

from wordcloud import STOPWORDS

print("\n", STOPWORDS, "\n")

stopwords = ["sent-share", "sent-photo", "sent-video", "sent-audio"]
stopwords = STOPWORDS.update(stopwords)

print(stopwords)

The result:

{'therefore', 'was', "where's", 'through', "he'll", 'up', 'him', "shan't", 'just', 'if', 'her', "let's", 'why', 'with', 'would', 'http', 'are', 'until', 'doing', 'about', "it's", 'most', 'own', 'under', "we'll", "hadn't", 'is', "isn't", 'r', 'however', 'before', "aren't", "i'm", 'more', "couldn't", 'on', 'it', 'too', 'you', "you'll", 'having', "who's", "she'd", "he'd", "shouldn't", 'a', 'com', "can't", 'yours', 'ours', 'to', "weren't", 'k', 'as', 'no', "they'll", 'same', "he's", 'such', 'after', 'themselves', "you'd", "you've", 'theirs', 'were', 'then', 'his', 'my', "you're", "wasn't", 'further', "she'll", 'yourselves', 'ourselves', 'she', 'am', 'by', 'for', 'how', 'during', 'like', 'there', "how's", 'myself', 'any', "won't", "she's", 'or', 'so', 'at', 'nor', 'have', 'than', 'herself', 'cannot', "haven't", 'that', "they're", 'all', 'also', 'between', "don't", "we're", 'being', 'get', 'while', "there's", 'should', 'because', 'else', 'of', 'has', 'below', 'not', 'been', 'do', 'what', "i'll", 'again', 'can', 'ever', 'other', 'them', 'does', 'could', 'each', "they've", 'which', 'off', 'both', 'but', "doesn't", "i've", "they'd", 'those', 'few', 'who', 'hers', 'did', "mustn't", 'this', 'www', 'from', 'be', 'ought', "we'd", 'here', 'he', 'itself', 'when', 'an', 'and', 'its', 'yourself', 'they', 'into', 'shall', 'out', 'hence', 'since', 'only', 'whom', "here's", "that's", "we've", "i'd", 'their', "what's", 'the', 'himself', "why's", 'very', "when's", 'down', "wouldn't", 'these', "didn't", 'above', 'me', 'otherwise', 'some', 'i', 'our', 'over', 'had', 'we', 'your', "hasn't", 'against', 'where', 'once', 'in'}

None

I think the problem is that stopwords is being assigned the return value of the .update method, which is None. Somehow, though, when I test the code everything works fine. The only problem is adding default stopwords such as in #65. Here is a suggested change:

From

if stopwords:
        stopwords = STOPWORDS.update(stopwords)

To

if stopwords:
        STOPWORDS.update(stopwords)

STOPWORDS.update(["default", "stopwords", "here"])
stopwords = STOPWORDS

I can add this change to #65 if you want.
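
The behaviour follows from how Python sets work: `set.update` changes the set in place and returns None. A likely explanation for why the word cloud still worked is that the shared STOPWORDS set had already been updated in place, and the wordcloud package falls back to it when it is given None. The sketch below uses a small stand-in set so that it runs without the package; `DEFAULT_STOPWORDS` and `build_stopwords` are hypothetical names.

```python
from typing import Iterable, Optional

# Stand-in for wordcloud.STOPWORDS so the sketch runs without the package.
DEFAULT_STOPWORDS = {"the", "and", "is"}


def build_stopwords(custom: Optional[Iterable[str]] = None) -> set[str]:
    """Return defaults plus custom stopwords without mutating the defaults."""
    # The union operator builds a new set and leaves DEFAULT_STOPWORDS untouched.
    return DEFAULT_STOPWORDS | set(custom or [])


# set.update mutates in place and returns None, so its result must not be assigned.
update_result = set(DEFAULT_STOPWORDS).update(["sent-photo"])
combined = build_stopwords(["sent-photo", "sent-video"])
```

Building a new set also avoids a side effect of the in-place update: custom stopwords from one call no longer leak into later calls.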

Examples for Natural Language Processing

Goal

We added NLP analysis in #81. Yet, these features don't appear in the README. This issue is meant to add a usage guide and examples to the README.

Jobs to be done

  • Add simple usage guide to README
  • Add example results to README. Examples may include (non-exhaustive): sentiment score per chat, per time of day or per day of week.

The jobs may be solved in multiple PRs.

Introduce new visualization components

We currently support sunburst charts and word clouds to create visualizations for chat logs.
I'm sure that there are a lot of other ways to expressively visualize chat data that haven't come to my mind yet.
This issue is intentionally unspecific to encourage your creativity.
There are no limitations on libraries that may be used, although the result should be more than simply leveraging a library function.

Add a README note on using double backslashes or a raw string for FILEPATH

While following the README instructions to use the package, I entered an unescaped file path ("C:\Users\Sheetali\Downloads\text.txt") directly, which caused a syntax error due to escape characters.

I think this is a very common error, and it could be avoided by adding a note to the README for the FILEPATH variable: either use double backslashes or a raw string, i.e. prefix the string with an 'r'.
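
As a minimal sketch of the options mentioned above, the following snippet shows three ways to write the same Windows path without tripping over escape sequences. The path is the placeholder from the README, not a real file, and the variable names are chosen for this example only.

```python
from pathlib import PureWindowsPath

# A plain literal such as "C:\Users\Username\chat.txt" does not compile,
# because \U is read as the start of a Unicode escape sequence.

raw_path = r"C:\Users\Username\chat.txt"        # raw string: backslashes kept as-is
escaped_path = "C:\\Users\\Username\\chat.txt"  # each backslash doubled
forward_path = "C:/Users/Username/chat.txt"     # forward slashes are also accepted on Windows

# The raw and the escaped literal are the same string.
same_string = raw_path == escaped_path

# Interpreted as Windows paths, all three refer to the same location.
same_location = PureWindowsPath(forward_path) == PureWindowsPath(raw_path)
```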

ValueError: not enough values to unpack (expected 2, got 1)

Seeing the following error while using the WhatsApp parser:

22.04.2023 12:04:32 INFO     
            Depending on the platform, the message format in chat logs might not be
            standardized accross devices/versions/localization and might change over
            time. Please report issues including your message format via GitHub.
            
22.04.2023 12:04:32 INFO     Initialized parser.
22.04.2023 12:04:32 INFO     Starting reading raw messages...
22.04.2023 12:04:33 INFO     Inferred date format: month/day/year
22.04.2023 12:04:33 INFO     Finished reading 39999 raw messages.
22.04.2023 12:04:33 INFO     Starting parsing raw messages...
  0%|                                                                                        | 1/39999 [00:00<?, ?it/s]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 4
      1 from chatminer.chatparsers import WhatsAppParser
      3 parser = WhatsAppParser("WhatsApp-Chat-Sarmad.txt")
----> 4 parser.parse_file()
      5 df = parser.parsed_messages.get_df()

File E:\Projects\whatsapp-chat-miner\whatsapp-analysis\lib\site-packages\chatminer\chatparsers.py:74, in Parser.parse_file(self)
     71 self._logger.info("Finished reading %i raw messages.", len(self._raw_messages))
     73 self._logger.info("Starting parsing raw messages...")
---> 74 self._parse_raw_messages()
     75 self._logger.info("Finished parsing raw messages.")

File E:\Projects\whatsapp-chat-miner\whatsapp-analysis\lib\site-packages\chatminer\chatparsers.py:84, in Parser._parse_raw_messages(self)
     82 with logging_redirect_tqdm():
     83     for raw_mess in tqdm(self._raw_messages):
---> 84         parsed_mess = self._parse_message(raw_mess)
     85         if parsed_mess:
     86             self.parsed_messages.append(parsed_mess)

File E:\Projects\whatsapp-chat-miner\whatsapp-analysis\lib\site-packages\chatminer\chatparsers.py:168, in WhatsAppParser._parse_message(self, mess)
    163 time = datetimeparser.parse(
    164     datestr, dayfirst=self._datefmt.is_dayfirst, fuzzy=True
    165 )
    167 if ":" in author_and_body:
--> 168     author, body = [x.strip() for x in author_and_body.split(": ", 1)]
    169 else:
    170     author = "System"

ValueError: not enough values to unpack (expected 2, got 1)

Might be because of a format that is not being covered.
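
Judging from the traceback alone, one plausible cause is that the guard checks for ":" while the split uses ": ". A line that contains a colon without a following space passes the check, but splitting it yields a single element. The sketch below reproduces that with a made-up message and shows a possible guard based on str.partition. The function name split_author_and_body and the fallback to a "System" author with the full text as body are assumptions for illustration, not the library's actual fix.

```python
from typing import Tuple


def split_author_and_body(author_and_body: str) -> Tuple[str, str]:
    # partition always returns three parts; sep is empty when ": " is absent.
    author, sep, body = author_and_body.partition(": ")
    if not sep:
        # Assumption: treat lines without an "author: " prefix as system messages.
        return "System", author_and_body.strip()
    return author.strip(), body.strip()


# Made-up example line: contains ":" but never ": ".
tricky = "Group name changed to Team:Alpha"

try:
    author, body = [x.strip() for x in tricky.split(": ", 1)]
    failed = False
except ValueError:
    # Same error as in the report: the split produced only one element.
    failed = True

regular = split_author_and_body("Alice: see you at 12:30")
fallback = split_author_and_body(tricky)
```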

Graphs for each person in conversation

Feature request – graphs for each person in conversation

It would be great if we could create separate graphs for each person in the conversation. This would give us even more data points to use for “analysis” or simply for our amusement.

Examples:

  • calendar heatmap for whole chat, and other figures under it featuring heatmap for different users
  • one radar chart figure containing two different graphs with the same scale – would allow comparison between two (maybe even more) people

The first example could be more useful for groups and the latter for one-on-one chats, in my opinion.

Which steps would have to be taken to achieve this?

I'd be interested in helping out, and later implementing it into my GUI.
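
One possible first step, sketched here with pandas only, is to split the parsed dataframe by author so that each subset can be passed to an existing plotting function. The column names "timestamp", "author" and "message", the toy data, and the helper split_by_author are assumptions made for this example.

```python
import pandas as pd

# Toy data standing in for a parsed chat (assumed column names).
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2021-01-01 09:15", "2021-01-01 09:40", "2021-01-02 21:05"]
        ),
        "author": ["Alice", "Bob", "Alice"],
        "message": ["Hi", "Hello", "Good night"],
    }
)


def split_by_author(frame, author_col="author"):
    # Hypothetical helper: one sub-dataframe per author.
    return {author: group for author, group in frame.groupby(author_col)}


per_author = split_by_author(df)

# Messages per hour of day for each author: the raw numbers behind a per-person chart.
hourly_counts = {
    author: group["timestamp"].dt.hour.value_counts().sort_index()
    for author, group in per_author.items()
}
```

Each sub-dataframe could then be drawn on its own axis, for example by calling vis.calendar_heatmap(sub_df, year=2021, ax=some_ax) in a loop, provided the visualization only relies on columns that survive the split.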
