Telegram Chat Harvester

This is an early version of a utility application that allows you to build a text dataset out of all your Telegram chats. To do so, the application uses the Telegram API.

The application is developed mainly to create text corpora. It is not a tool for exporting your Telegram information; see "Why not just export?" below.

Data Format

This version collects messages into large text corpora. A single corpus contains all messages from the corresponding chat. Messages are written in the following format:

<|cs|>
<|m|>Hi, Mike! How are you?<|--m|>
<|cs|>
<|m|><|--me|>Hi!<|--m|>
<|m|><|--me|>Pretty cool :)<|--m|>
<|m|><|media|><|--m|>

Each message is wrapped in "message tokens": <|m|> denotes the beginning of a message and <|--m|> denotes its end. A start token may be immediately followed by an "author token": <|--me|> marks a message written by you. The "change sender" token <|cs|> denotes the end of a block of messages written by one user, which is particularly useful in groups. Finally, the <|media|> token denotes a message that contains something other than text (a sticker, an image, etc.).
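To make the token layout concrete, here is a minimal Python sketch that reads a corpus back into structured messages. It is not part of the application: the names `parse_corpus` and `Message` are hypothetical, and the sketch assumes that message text never contains the literal tokens themselves.

```python
import re
from dataclasses import dataclass
from typing import List

CS = "<|cs|>"        # "change sender" token
ME = "<|--me|>"      # author token: message written by you
MEDIA = "<|media|>"  # placeholder for a non-text message

# Matches either a change-sender token or one complete <|m|>...<|--m|> message.
TOKEN_RE = re.compile(r"<\|cs\|>|<\|m\|>(.*?)<\|--m\|>", re.DOTALL)


@dataclass
class Message:
    block: int      # number of <|cs|> tokens seen before this message
    from_me: bool   # True if the message carries the <|--me|> author token
    is_media: bool  # True if the body is just the <|media|> token
    text: str       # message text ("" for media messages)


def parse_corpus(corpus: str) -> List[Message]:
    """Split a corpus string into messages, tracking sender blocks."""
    messages = []
    block = 0
    for match in TOKEN_RE.finditer(corpus):
        if match.group(0) == CS:
            # A change-sender token closes the current block of messages.
            block += 1
            continue
        body = match.group(1)
        from_me = body.startswith(ME)
        if from_me:
            body = body[len(ME):]
        is_media = body == MEDIA
        messages.append(Message(block, from_me, is_media, "" if is_media else body))
    return messages
```

Messages that share a `block` value were written by the same sender in a row. The format does not record who that sender is, only whether an individual message carries the <|--me|> token.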

Why not just export?

Telegram Desktop can export all your data in a machine-readable format (JSON). However, this application offers the following functionality that the desktop application does not:

  1. It can run on systems that have no GUI or cannot install the desktop application.
  2. It does the necessary preprocessing for producing text corpora.

However, the application cannot export media objects or other data. If you want to export your data, use the Export Tool.

Building

To build the application and run it on your desktop machine, you can either build it locally (using CMake or any other tool) or build a Docker image.

Local

To build locally, you have to link TDLib to the executable. Here is an example of how this can be done using CMake:

git clone https://github.com/tdlib/td lib/td
mkdir build
cd build
cmake ..
make

It will take a while to build the library together with the application. Once it is done, you can run:

./tx_harvestchats --help

Docker

To build the application with Docker, run:

docker build -t tx-hc .
docker run -it -v /save/results/here:/txhc tx-hc bash -c "cd /txhc && /app/build/tx_harvestchats"

The results will be saved in the /save/results/here folder. Feel free to substitute a path of your choice.
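Once a run has finished, a quick sanity check is to count the messages in each file of the results folder. The Python sketch below is only an illustration: the function name `count_messages` is hypothetical, and the README does not say how the output files are named or laid out, so it simply treats every regular file directly inside the folder as a possible corpus.

```python
from pathlib import Path


def count_messages(results_dir: str) -> dict:
    """Map each file name in results_dir to its number of <|m|> start tokens."""
    counts = {}
    for path in sorted(Path(results_dir).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories
        # errors="replace" keeps the scan going if a file is not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        counts[path.name] = text.count("<|m|>")
    return counts
```

For example, `count_messages("/save/results/here")` returns a dictionary keyed by file name. Files that are not corpora will typically show a count of 0.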

Contributors

teexone
