
teledataparser's Introduction

Telegram Machine Learning Corpus Extraction

Python 3.*

Machine learning corpus extractor.

Parses JSON data files exported by Telegram.

  • Extract specific replies.
  • Extract specific statements.
  • Filter support.
  • Extract all messages.
  • Word-limit support.
  • Custom length calculation.

Extract a user's messages as training data for AI, and preserve your loved one's chat history.

At the moment, because I'm too lazy, only plain-text corpus extraction is implemented.

Run

  • Installation

Run pip3 install -r requirements.txt in the project directory.

  • Run

Configure config.ini, then run python3 main.py to generate the data.

Performance

Number of outputs -> run time (w = 万 = 10,000)

  • 100w (1,000,000) -> 28 s
  • 49w (490,000) -> 12 s
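
As a rough reading of these figures, the sketch below converts them to throughput. It assumes that `w` is the common shorthand for 10,000 (万); `throughput` is a hypothetical helper and not part of the project.

```python
# Assumption: "w" means 10,000, so 100w = 1,000,000 outputs.
W = 10_000


def throughput(outputs, seconds):
    """Return outputs processed per second."""
    return outputs / seconds


runs = [(100 * W, 28), (49 * W, 12)]  # (number of outputs, seconds)

for outputs, seconds in runs:
    rate = throughput(outputs, seconds)
    print(f"{outputs:,} outputs in {seconds} s = about {rate:,.0f} per second")
```

Both runs land in the same range of a few tens of thousands of outputs per second, which suggests the run time grows roughly linearly with the corpus size.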

Config

Constructing classes

from Core.Tool import TeleParser

Parser = TeleParser("JsonInput", "DataOutput", min_limit=5, max_limit=512)
dicts = Parser.get_all(lable="GIRL", showDate=False, ending="\n", uni_data=False)
print(dicts)
# See the comments in the source for details
# Returns a dict of counts: total written, skipped as non-conforming, deleted, and total signed messages

TeleParser API

__init__(self, json_path: str, out_path: str, min_limit: int = 5, max_limit: int = 512, Counter: str = 'chinese', filter_mode: str = False, filter: str = 'Not_need.txt')

:param json_path: input directory
:param out_path: output directory
:param min_limit: minimum length
:param max_limit: maximum length
:param Counter: length counter to use (see Length Gauge)
:param filter_mode: True to keep only sentences with keywords, False to keep only sentences without keywords
:param filter: path to the filter phrase file
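
To illustrate how the length limits and `filter_mode` could interact, here is a minimal sketch. `keep_sentence` is a hypothetical helper, not the library's code; it assumes the limits are inclusive and that a keyword matches as a plain substring.

```python
def keep_sentence(text, keywords, filter_mode, min_limit=5, max_limit=512, count=len):
    """Return True if `text` passes the length limits and the keyword filter.

    Assumptions: limits are inclusive, keywords match as substrings,
    and `count` is any function that measures the length of a string.
    """
    length = count(text)
    if length < min_limit or length > max_limit:
        return False
    has_keyword = any(word in text for word in keywords)
    # filter_mode=True keeps only sentences with a keyword,
    # filter_mode=False keeps only sentences without one.
    return has_keyword if filter_mode else not has_keyword


keywords = ["spam", "advert"]
print(keep_sentence("good morning everyone", keywords, filter_mode=False))  # True
print(keep_sentence("buy this spam now", keywords, filter_mode=False))      # False
print(keep_sentence("hi", keywords, filter_mode=False))                     # False, too short
```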

get_all(self, lable: str, showDate=False, ending='\n', uni_data=False, no_id: list = None) -> dict

:param lable: the label
:param no_id: sender IDs to skip (e.g. messages from service bots)
:param uni_data: whether to de-duplicate
:param ending: the suffix
:param showDate: whether to show the date
:return: dict
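
The options of `get_all` can be pictured with the sketch below. `collect_all` is a hypothetical stand-in that returns the collected lines instead of writing files and returning counts, and it assumes that `no_id` is matched against each message's `from_id`.

```python
def collect_all(messages, no_id=None, show_date=False, ending="\n", uni_data=False):
    """Collect the text of every message, honoring the options described above."""
    excluded = set(no_id or [])
    lines = []
    for msg in messages:
        text = msg.get("text")
        # Skip service entries and anything without plain, non-empty text.
        if msg.get("type") != "message" or not isinstance(text, str) or not text:
            continue
        if msg.get("from_id") in excluded:
            continue
        prefix = msg.get("date", "") + " " if show_date else ""
        lines.append(prefix + text + ending)
    if uni_data:
        # Drop duplicates while keeping the order of first occurrence.
        lines = list(dict.fromkeys(lines))
    return lines


demo = [
    {"id": 1, "type": "message", "date": "2022-01-28T01:35:46",
     "from_id": "user2333", "text": "Hi,GOOD MORNING"},
    {"id": 2, "type": "message", "date": "2022-01-28T01:36:00",
     "from_id": "bot1", "text": "Welcome"},
    {"id": 3, "type": "message", "date": "2022-01-28T01:37:00",
     "from_id": "user2333", "text": "Hi,GOOD MORNING"},
]
print(collect_all(demo, no_id=["bot1"], uni_data=True))
```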

get_all_reply(self, showDate=True, ending='\n', uni_data=False) -> dict

:param uni_data: whether to de-duplicate
:param ending: the suffix
:param showDate: whether to show the date
:return: dict

get_reply(self, lable, target_id, showDate=True, ending='\n', uni_data=False) -> dict

:param showDate: whether to show the date
:param ending: the suffix
:param uni_data: whether to de-duplicate
:param lable: the name tag
:param target_id: the target user's ID, including the user prefix (e.g. user2333)
:return: dict
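
One way to picture reply extraction is to index messages by `id` and follow `reply_to_message_id`. The sketch below is an assumption about the approach, not the library's code; `reply_pairs` is a hypothetical name, and it treats `target_id` as the author of the reply.

```python
def reply_pairs(messages, target_id):
    """Return (original text, reply text) pairs for replies written by target_id."""
    by_id = {m["id"]: m for m in messages if "id" in m}
    pairs = []
    for msg in messages:
        if msg.get("from_id") != target_id:
            continue
        original = by_id.get(msg.get("reply_to_message_id"))
        if original is None:
            continue  # not a reply, or the replied-to message is missing from the export
        pairs.append((original.get("text", ""), msg.get("text", "")))
    return pairs


demo = [
    {"id": 1, "from_id": "user2333", "text": "Hi,GOOD MORNING"},
    {"id": 2, "from_id": "user114514", "reply_to_message_id": 1, "text": "Morning!"},
    {"id": 3, "from_id": "user114514", "reply_to_message_id": 999, "text": "lost"},
    {"id": 4, "from_id": "user114514", "text": "no reply"},
]
print(reply_pairs(demo, "user114514"))  # [('Hi,GOOD MORNING', 'Morning!')]
```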

get_speech(self, lable, target_id, showDate=True, ending='\n', uni_data=False) -> dict

:param uni_data: whether to de-duplicate
:param ending: the suffix
:param showDate: whether to attach a date
:param lable: the name tag
:param target_id: the target user's ID, including the user prefix (e.g. user2333)
:return: dict
  • hint method

write_out(self, speech: list, path: str, Wash: bool = False)

:param speech: list of phrases
:param path: the name of the output file
:param Wash: whether to de-duplicate
:return:
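
A minimal sketch of what writing with optional de-duplication might look like. `write_lines` and its `wash` flag are hypothetical names that only mirror the parameters above, and the sketch assumes each phrase already carries its `ending`.

```python
import tempfile
from pathlib import Path


def write_lines(speech, path, wash=False):
    """Write phrases to `path`; with wash=True, drop duplicates first."""
    if wash:
        # dict.fromkeys keeps the first occurrence of each phrase, in order.
        speech = list(dict.fromkeys(speech))
    Path(path).write_text("".join(speech), encoding="utf-8")
    return len(speech)


with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "GIRL.txt"
    written = write_lines(["hi\n", "hello\n", "hi\n"], out_file, wash=True)
    content = out_file.read_text(encoding="utf-8")

print(written)        # 2
print(repr(content))  # 'hi\nhello\n'
```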

Length Gauge

class Tester(builtins.object)
Static methods defined here:

chinese(ask)
default(ask)
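
The README does not show what the two counters compute. The sketch below is only a guess at the idea: `SketchCounter` is a hypothetical class, its character-based and word-based rules are assumptions, and looking the method up by name mirrors the `Counter: str = 'chinese'` argument of the constructor.

```python
class SketchCounter:
    """Hypothetical length counters; the real Tester may measure differently."""

    @staticmethod
    def chinese(ask):
        # Assumption: count every non-whitespace character, which suits
        # text written without spaces between words.
        return sum(1 for ch in ask if not ch.isspace())

    @staticmethod
    def default(ask):
        # Assumption: count whitespace-separated words.
        return len(ask.split())


def get_counter(name):
    """Look up a counter by its name, e.g. 'chinese' or 'default'."""
    return getattr(SketchCounter, name)


print(get_counter("chinese")("你好 世界"))         # 4
print(get_counter("default")("good morning all"))  # 3
```

With a character-based counter, min_limit=5 rejects a four-character Chinese sentence, while a word-based counter would apply the same limit to a count of words.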

Config File

; Sample configuration file
[user]
user = Someone
user_id = user114514


[path]
input = JsonInput
output = DataOutput
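
A file in this format can be read with Python's standard configparser module. The sketch below parses the sample from a string so that it is self-contained; how main.py actually consumes these values is not shown in this README.

```python
import configparser

SAMPLE = """\
; Sample configuration file
[user]
user = Someone
user_id = user114514

[path]
input = JsonInput
output = DataOutput
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)  # for a real file: config.read("config.ini", encoding="utf-8")

user_id = config["user"]["user_id"]
input_dir = config["path"]["input"]
output_dir = config["path"]["output"]
print(user_id, input_dir, output_dir)  # user114514 JsonInput DataOutput
```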

Sample reference format

{
  "name": "Unknown | Private",
  "type": "private_supergroup",
  "id": 11451418180,
  "messages": [
    {
      "id": 1,
      "type": "message",
      "date": "2022-01-28T01:35:46",
      "date_unixtime": "1643333746",
      "edited": "2022-05-15T14:16:08",
      "edited_unixtime": "1652624168",
      "from": "Someone",
      "from_id": "user2333",
      "reply_to_message_id": 271065,
      "text": "Hi,GOOD MORNING"
    }
  ]
}
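
To show how an export in this shape can be read, here is a short sketch using only the standard json module. `plain_text` and `speech_of` are hypothetical helpers. In real Telegram exports the `text` field may also be a list that mixes strings with formatting objects, so the sketch flattens that case as well.

```python
import json

SAMPLE = """
{
  "name": "Unknown | Private",
  "type": "private_supergroup",
  "id": 11451418180,
  "messages": [
    {
      "id": 1,
      "type": "message",
      "date": "2022-01-28T01:35:46",
      "from": "Someone",
      "from_id": "user2333",
      "reply_to_message_id": 271065,
      "text": "Hi,GOOD MORNING"
    }
  ]
}
"""


def plain_text(text):
    """Flatten a `text` field that is either a string or a list of parts."""
    if isinstance(text, str):
        return text
    return "".join(
        part if isinstance(part, str) else part.get("text", "") for part in text
    )


def speech_of(export, target_id):
    """Return the plain text of every message sent by target_id."""
    return [
        plain_text(m.get("text", ""))
        for m in export["messages"]
        if m.get("type") == "message" and m.get("from_id") == target_id
    ]


export = json.loads(SAMPLE)
print(speech_of(export, "user2333"))  # ['Hi,GOOD MORNING']
```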


License

Use of this project for malicious purposes is not permitted.
This project is licensed under the Apache License.

teledataparser's People

Contributors

sudoskys
