
Anti-Spam bot for Telegram

Home Page: https://tg-spam.umputun.dev

License: MIT License


tg-spam

TG-Spam is an effective, self-hosted anti-spam bot specifically crafted for Telegram groups. Setting it up as a Docker container is straightforward: it needs just a Telegram token and a group name or ID to get started. Once activated, TG-Spam oversees messages, using several spam detection methods to pinpoint and eliminate spam content.

What is it and how does it work?

TG-Spam keeps an eye on messages in Telegram groups, looking out for spam. It's quick to act, deleting spammy messages and banning the users who send them. The bot is also smart and gets smarter over time, learning from human guidance to catch new kinds of spam. It's a self-hosted tool that's pretty flexible in how you set it up, working great as a Docker container on anything from a small VPS to a Raspberry Pi. Plus, its Docker image supports various architectures like amd64, arm64, and armv7, and there are also binaries available for Linux, macOS, and Windows.

TG-Spam's spam detection algorithm is multifaceted, incorporating several criteria to ensure high accuracy and efficiency:

Message Analysis: It evaluates messages for similarities to known spam, flagging those that match typical spam characteristics.
Integration with Combot Anti-Spam System (CAS): It cross-references users with the Combot Anti-Spam System, a reputable external anti-spam database.
Spam Message Similarity Check: TG-Spam assesses the overall resemblance of each message to known spam patterns.
Stop Words Comparison: Messages are compared against a curated list of stop words commonly found in spam.
OpenAI Integration: TG-Spam may optionally use OpenAI's GPT models to analyze messages for spam patterns.
Emoji Count: Messages with an excessive number of emojis are scrutinized, as this is a common trait in spam messages.
Meta checks: TG-Spam can optionally check the message for the number of links and the presence of images. If the number of links is greater than the specified limit, or if the message contains images but no text, it will be marked as spam.
Automated Action: If a message is flagged as spam, TG-Spam takes immediate action by deleting the message and banning the responsible user.

TG-Spam can also run as a server, providing a simple HTTP API to check messages for spam. This is useful for integration with other tools not related to Telegram. For more details see the Running with webapi server section below. In addition, it provides a WEB UI to perform some useful admin tasks. For more details see the WEB UI section below. All the spam detection modules can also be used as a library. For more details see the Using tg-spam as a library section below.

Installation

The primary method of installation is via Docker. TG-Spam is available as a Docker image, making it easy to deploy and run as a container. The image is available on Docker Hub at umputun/tg-spam as well as on GitHub Packages at ghcr.io/umputun/tg-spam.
Binary releases are also available on the releases page.
TG-Spam can be installed by cloning the repository and building the binary from source by running make build.
It can also be installed using brew tap umputun/apps && brew install umputun/apps/tg-spam on macOS.

Configuration

All the configuration is done via environment variables or command line arguments. Out of the box the bot has reasonable defaults, so the user can run it without much hassle.

There are some mandatory parameters that have to be set:

--telegram.token=, [$TELEGRAM_TOKEN] - telegram bot token. See below how to get it.
--telegram.group=, [$TELEGRAM_GROUP] - group name/id. This can be a group name (for public groups it will look like mygroup) or group id (for private groups it will look like -123456789). To get the group id you can use this bot or others like it.

As long as these two parameters are set, the bot will work. Don't forget to add the bot to the group as an admin, otherwise it will not be able to delete messages and ban users.

There are some important customizations available:

First of all, sample files: the bot uses some data files to detect spam. They are located in the /srv/data directory of the container and can be mounted from the host. The files are: spam-samples.txt, ham-samples.txt, exclude-tokens.txt and stop-words.txt.

The user can specify a custom location for them with the --files.samples=, [$FILES_SAMPLES] parameter. This should be a directory where all the files are located.

Second, the messages the bot sends. There are three messages the user may want to customize:

--message.startup=, [$MESSAGE_STARTUP] - message sent to the group when bot is started, can be empty
--message.spam=, [$MESSAGE_SPAM] - message sent to the group when spam detected
--message.dry=, [$MESSAGE_DRY] - message sent to the group when spam detected in dry mode

By default, the bot reports back to the group with the message "this is spam", or "this is spam (dry mode)" in dry mode. In non-dry mode, the bot will delete the spam message and ban the user permanently. It is possible to suppress those reports with the --no-spam-reply, [$NO_SPAM_REPLY] parameter.

There are 4 files used by the bot to detect spam:

spam-samples.txt - list of spam samples. Each line in this file is the full text of a spam message with EOLs removed, i.e. the original message represented as a single line. EOLs can be replaced by spaces
ham-samples.txt - list of ham (non-spam) samples. Each line in this file is a full text of ham message with removed EOL
exclude-tokens.txt - list of tokens to exclude from spam detection, usually common words. Each line in this file is a single token (word), or a comma-separated list of words in dbl-quotes.
stop-words.txt - list of stop words to detect spam right away. Each line in this file is a single phrase (can be one or more words). The bot checks if any of those phrases are present in the message and if so, it marks the message as spam.

The bot dynamically reloads all 4 files, so the user can change them on the fly without restarting the bot.
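
As a rough illustration of the file formats described above, the sketch below turns a multi-line message into a single-line sample and splits one line of exclude-tokens.txt into tokens. The function names normalizeSample and parseExcludeLine are hypothetical and are not part of tg-spam; the bot's actual parsing may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeSample collapses a multi-line message into a single line.
// Line breaks and any other runs of whitespace become single spaces.
func normalizeSample(msg string) string {
	return strings.Join(strings.Fields(msg), " ")
}

// parseExcludeLine splits one line of exclude-tokens.txt into tokens.
// It accepts either a bare token or a comma-separated list of quoted words.
func parseExcludeLine(line string) []string {
	var tokens []string
	for _, part := range strings.Split(line, ",") {
		tok := strings.Trim(strings.TrimSpace(part), `"`)
		if tok != "" {
			tokens = append(tokens, tok)
		}
	}
	return tokens
}

func main() {
	fmt.Println(normalizeSample("Win a prize!\nClick here\r\nnow"))
	fmt.Println(parseExcludeLine(`"the", "and", "of"`))
	fmt.Println(parseExcludeLine("hello"))
}
```

A sample prepared this way can be appended to spam-samples.txt or ham-samples.txt as one line.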

Another useful feature is the ability to keep the list of approved users persistently and keep other meta-information about detected spam and received messages. The bot will not ban approved users and won't check their messages for spam because they have already passed the initial check. All this info is stored in the internal storage under the --files.dynamic=, [$FILES_DYNAMIC] directory. The user should mount this directory from the host to keep the data persistent. All the files in this directory are handled by the bot automatically.

Configuring spam detection modules and parameters

Message Analysis

This is the main spam detection module. It uses the list of spam and ham samples to detect spam with a Bayes classifier. The check is enabled as long as --files.samples=, [$FILES_SAMPLES] points to an existing directory with all the sample files (see above). There is also a parameter to set the minimum spam probability percent to ban the user. If the probability of spam is less than --min-probability=, [$MIN_PROBABILITY] (default is 50), the message is not marked as spam.
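
To make the idea of a probability threshold more concrete, here is a minimal sketch of a generic two-class naive Bayes classifier with add-one smoothing. It is not the bot's implementation: the classifier type, its methods and the whitespace tokenizer are all assumptions made for illustration, and the sketch expects at least one sample in each class.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// classifier holds token statistics for the two classes "spam" and "ham".
type classifier struct {
	counts map[string]map[string]int // class -> token -> occurrences
	totals map[string]int            // class -> total number of tokens
	docs   map[string]int            // class -> number of learned samples
	vocab  map[string]struct{}       // all tokens seen in any class
}

func newClassifier() *classifier {
	return &classifier{
		counts: map[string]map[string]int{"spam": {}, "ham": {}},
		totals: map[string]int{},
		docs:   map[string]int{},
		vocab:  map[string]struct{}{},
	}
}

// learn adds one sample text to the given class ("spam" or "ham").
func (c *classifier) learn(class, text string) {
	c.docs[class]++
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		c.counts[class][tok]++
		c.totals[class]++
		c.vocab[tok] = struct{}{}
	}
}

// logScore returns log P(class) plus the sum of log P(token|class),
// using add-one (Laplace) smoothing for unseen tokens.
func (c *classifier) logScore(class, text string) float64 {
	score := math.Log(float64(c.docs[class]) / float64(c.docs["spam"]+c.docs["ham"]))
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		p := float64(c.counts[class][tok]+1) / float64(c.totals[class]+len(c.vocab))
		score += math.Log(p)
	}
	return score
}

// spamProbability returns the posterior probability of spam, in percent.
func (c *classifier) spamProbability(text string) float64 {
	s, h := c.logScore("spam", text), c.logScore("ham", text)
	return 100 / (1 + math.Exp(h-s))
}

func main() {
	c := newClassifier()
	c.learn("spam", "buy cheap crypto now")
	c.learn("spam", "cheap crypto investment profit")
	c.learn("ham", "see you at the meeting tomorrow")
	c.learn("ham", "thanks for the review")

	minProbability := 50.0 // plays the role of --min-probability
	for _, msg := range []string{"cheap crypto profit", "see you tomorrow"} {
		p := c.spamProbability(msg)
		fmt.Printf("%q: %.1f%% spam=%v\n", msg, p, p >= minProbability)
	}
}
```

Raising the threshold in this sketch trades missed spam for fewer false positives, which is the same trade-off --min-probability controls.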

Spam message similarity check

This check uses the provided sample files and is active by default. The bot compares the message with the samples and if the similarity is greater than --similarity-threshold=, [$SIMILARITY_THRESHOLD] (default is 0.5), the message is marked as spam. Setting the similarity threshold to 1 will effectively disable this check.
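
The text above does not say which similarity measure is used. The sketch below assumes cosine similarity over token counts, a common choice for this kind of comparison; the helper names tokenCounts, cosine and isSimilarToSpam are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// tokenCounts lowercases the text, splits it on whitespace and counts tokens.
func tokenCounts(text string) map[string]float64 {
	counts := make(map[string]float64)
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		counts[tok]++
	}
	return counts
}

// cosine returns the cosine similarity of two token-count vectors.
// It returns 0 if either vector is empty.
func cosine(a, b map[string]float64) float64 {
	var dot, normA, normB float64
	for tok, x := range a {
		dot += x * b[tok]
		normA += x * x
	}
	for _, y := range b {
		normB += y * y
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// isSimilarToSpam reports whether msg is more similar than threshold
// to at least one of the spam samples.
func isSimilarToSpam(msg string, samples []string, threshold float64) bool {
	m := tokenCounts(msg)
	for _, s := range samples {
		if cosine(m, tokenCounts(s)) > threshold {
			return true
		}
	}
	return false
}

func main() {
	samples := []string{"earn money fast with crypto"}
	fmt.Println(isSimilarToSpam("earn money fast today", samples, 0.5))
	fmt.Println(isSimilarToSpam("see you at the meetup", samples, 0.5))
}
```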

Stop Words Comparison

If the stop words file is present, the bot will check the message for the presence of any of the phrases in the file. The check is enabled as long as the stop-words.txt file is present in the samples directory and is not empty.
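
A phrase check of this kind can be sketched in a few lines. The example assumes case-insensitive substring matching, which the text above does not specify, and containsStopPhrase is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// containsStopPhrase reports whether msg contains any of the phrases.
// Matching ignores case (an assumption); blank phrases are skipped.
func containsStopPhrase(msg string, phrases []string) bool {
	lower := strings.ToLower(msg)
	for _, p := range phrases {
		p = strings.ToLower(strings.TrimSpace(p))
		if p != "" && strings.Contains(lower, p) {
			return true
		}
	}
	return false
}

func main() {
	phrases := []string{"free money", "click here"}
	fmt.Println(containsStopPhrase("Get FREE money today", phrases))
	fmt.Println(containsStopPhrase("Lunch at noon?", phrases))
}
```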

Combot Anti-Spam System (CAS) integration

Nothing is needed to enable CAS integration; it is enabled by default. To disable it, set --cas.api=, [$CAS_API] to an empty string.

OpenAI integration

Setting --openai.token [$OPENAI_TOKEN] enables OpenAI integration. All other parameters for OpenAI integration are optional and have reasonable defaults; for more details see the All Application Options section below.

To keep the number of calls low and the price manageable, the bot uses the following approach:

Only the first message(s) from a given user is checked for spam. If --paranoid mode is enabled, openai will not be used at all.
OpenAI check is the last in the chain of checks. Unless --openai.veto is set, the bot will not even call OpenAI if any of the previous checks marked the message as spam. However, if --openai.veto is set, it will be called and the message will be marked as spam only if OpenAI thinks so.
By default, OpenAI integration is disabled.
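
The ordering rules above can be summarized as a small decision function. This is only a sketch: decide and askOpenAI are hypothetical names, and the behavior in veto mode for messages that earlier checks did not flag (treated here as not spam, without calling OpenAI) is an assumption.

```go
package main

import "fmt"

// decide combines the verdict of the earlier checks with an optional
// OpenAI verdict. askOpenAI is invoked only when a call is needed.
func decide(flaggedByOthers, veto bool, askOpenAI func() bool) bool {
	if veto {
		// veto mode: OpenAI must confirm spam found by the earlier checks
		return flaggedByOthers && askOpenAI()
	}
	// default mode: OpenAI is skipped if the message is already flagged
	return flaggedByOthers || askOpenAI()
}

func main() {
	calls := 0
	openAISaysHam := func() bool { calls++; return false }

	fmt.Println(decide(true, false, openAISaysHam), calls) // flagged, no veto
	fmt.Println(decide(true, true, openAISaysHam), calls)  // flagged, veto overturns
}
```

Relying on short-circuit evaluation keeps the number of OpenAI calls low, which is the stated goal of this design.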

Emoji Count

If the number of emojis in the message is greater than --max-emoji=, [$MAX_EMOJI] (default is 2), the message is marked as spam. Setting the max emoji count to -1 will effectively disable this check. Note: setting it to 0 will mark all the messages with any emoji as spam.
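
The threshold semantics (-1 disables, 0 flags any emoji) can be sketched as follows. The emoji detection here is a rough approximation based on a couple of Unicode ranges and is an assumption, not the bot's actual detection; isEmoji, countEmoji and tooManyEmoji are hypothetical names.

```go
package main

import "fmt"

// isEmoji is a rough approximation: it covers the main pictograph and
// emoticon blocks and also matches some non-emoji symbols in these ranges.
func isEmoji(r rune) bool {
	switch {
	case r >= 0x1F300 && r <= 0x1FAFF: // pictographs, emoticons, transport, extended symbols
		return true
	case r >= 0x2600 && r <= 0x27BF: // miscellaneous symbols and dingbats
		return true
	}
	return false
}

// countEmoji counts the runes in msg that isEmoji accepts.
func countEmoji(msg string) int {
	n := 0
	for _, r := range msg {
		if isEmoji(r) {
			n++
		}
	}
	return n
}

// tooManyEmoji applies the limit: a negative limit disables the check,
// otherwise the message is flagged when the count exceeds maxEmoji.
func tooManyEmoji(msg string, maxEmoji int) bool {
	if maxEmoji < 0 {
		return false
	}
	return countEmoji(msg) > maxEmoji
}

func main() {
	fmt.Println(tooManyEmoji("🔥🔥🔥 buy now", 2))
	fmt.Println(tooManyEmoji("nice work 😀", 2))
	fmt.Println(tooManyEmoji("nice work 😀", 0))
}
```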

Minimum message length

This is not a separate check, but rather a parameter to control the minimum message length. If the message length is less than --min-msg-len=, [$MIN_MSG_LEN] (default is 50), the message won't be checked for spam. Setting the min message length to 0 will effectively disable this check. This check is needed to avoid false positives on short messages.

Maximum links in message

This option is disabled by default. If set to a positive number, the bot will check the message for the number of links. If the number of links is greater than --meta.links-limit=, [$META_LINKS_LIMIT] (default is -1), the message will be marked as spam. Setting the limit to -1 will effectively disable this check.
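
A minimal sketch of such a limit is shown below. Counting links by their http:// and https:// prefixes is an assumption made for illustration (the bot may rely on Telegram message entities instead), and countLinks and tooManyLinks are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// countLinks counts occurrences of the http:// and https:// prefixes.
func countLinks(msg string) int {
	return strings.Count(msg, "http://") + strings.Count(msg, "https://")
}

// tooManyLinks flags a message whose link count exceeds limit.
// A negative limit disables the check.
func tooManyLinks(msg string, limit int) bool {
	if limit < 0 {
		return false
	}
	return countLinks(msg) > limit
}

func main() {
	msg := "deals at https://a.example and http://b.example and https://c.example"
	fmt.Println(countLinks(msg))
	fmt.Println(tooManyLinks(msg, 2))
	fmt.Println(tooManyLinks(msg, -1))
}
```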

Links only check

This option is disabled by default. If set to true, the bot will check the message for the presence of any text. If the message contains links but no text, it will be marked as spam.

Image only check

This option is disabled by default. If set to true, the bot will check the message for the presence of any image. If the message contains images but no text, it will be marked as spam.

Admin chat/group

Optionally, user can specify the admin chat/group name/id. In this case, the bot will send a message to the admin chat as soon as a spammer is detected. Admin can see all the spam and all banned users and could also unban the user, confirm the ban or get results of spam checks by clicking a button directly on the message.

To allow such a feature, --admin.group=, [$ADMIN_GROUP] must be specified. This can be a group name (for public groups), but usually it is a group id (for private groups) or personal accounts.

Screenshots

Admin commands

Admins can reply to the spam message with the text spam or /spam to mark it as spam. This is useful for training purposes as the bot will learn from the spam messages marked by the admin and will be able to detect similar spam in the future.
Replying to the message with the text ban or /ban will ban the user who sent the message. This is useful for post-moderation purposes. Essentially this is the same as sending /spam but without adding the message to the spam samples file.
Replying to the message with the text warn or /warn will remove the original message, and send a warning message to the user who sent the message. This is useful for post-moderation purposes. The warning message is defined by --message.warn=, [$MESSAGE_WARN] parameter.

Updating spam and ham samples dynamically

The bot can be configured to update spam samples dynamically. To enable this feature, reporting to the admin chat must be enabled (see --admin.group=, [$ADMIN_GROUP] above). If any of the privileged users (--super=, [$SUPER_USER]) forwards a message to the admin chat or replies to a message with /spam or spam, the bot will add this message to the internal spam samples file (spam-dynamic.txt) and reload it. This allows the bot to learn new spam patterns on the fly. In addition, the bot will do its best to remove the original spam message from the group and ban the user who sent it.

This is not always possible, as forwarding strips the original user id. To address this limitation, tg-spam keeps a list of the latest messages (in fact, it stores hashes) associated with the user id and the message id. This information is used to find the original message and ban the user. Two parameters control the lookup of the original message: --history-duration= (default: 24h) [$HISTORY_DURATION] and --history-min-size= (default: 1000) [$HISTORY_MIN_SIZE]. Together they define how many messages to keep in the internal cache and for how long. In other words, if a message is older than --history-duration= and the total number of stored messages is greater than --history-min-size=, the bot will remove the message from the lookup table. This keeps the lookup table small and fast. The default values are reasonable and should work for most cases.

Updating ham samples dynamically works differently. If any of the privileged users unbans a message in the admin chat, the bot will add this message to the internal ham samples file (ham-dynamic.txt), reload it and unban the user. This allows the bot to learn new ham patterns on the fly.

Both dynamic spam and ham files are located in the directory set by --files.dynamic=, [$FILES_DYNAMIC] parameter. User should mount this directory from the host to keep the data persistent.

Logging

The default logging prints spam reports to the console (stdout). The bot can log all the spam messages to the file as well. To enable this feature, set --logger.enabled, [$LOGGER_ENABLED] to true. By default, the bot will log to the file tg-spam.log in the current directory. To change the location, set --logger.file, [$LOGGER_FILE] to the desired location. The bot will rotate the log file when it reaches the size specified in --logger.max-size, [$LOGGER_MAX_SIZE] (default is 100M). The bot will keep up to --logger.max-backups, [$LOGGER_MAX_BACKUPS] (default is 10) of the old, compressed log files.

Setting up the telegram bot

Getting the token

To get a token, talk to BotFather. All you need is to send the /newbot command and choose the name for your bot (it must end in bot). That is it, and you get a token which you'll need to put into the tg-spam configuration as TELEGRAM_TOKEN.

Example of such a "talk":

Umputun:
/newbot

BotFather:
Alright, a new bot. How are we going to call it? Please choose a name for your bot.

Umputun:
example_comments

BotFather:
Good. Now let's choose a username for your bot. It must end in `bot`. Like this, for example: TetrisBot or tetris_bot.

Umputun:
example_comments_bot

BotFather:
Done! Congratulations on your new bot. You will find it at t.me/example_comments_bot. You can now add a description, about section and profile picture for your bot, see /help for a list of commands. By the way, when you've finished creating your cool bot, ping our Bot Support if you want a better username for it. Just make sure the bot is fully operational before you do this.

Use this token to access the HTTP API:
12345678:xy778Iltzsdr45tg

Disabling privacy mode

In some cases, for example for private groups, the bot has to have privacy mode disabled. To do that, the user needs to send BotFather the command /setprivacy and choose the needed bot. Then choose Disable. Example of such a conversation:

Umputun:
/setprivacy

BotFather:
Choose a bot to change group messages settings.

Umputun:
example_comments_bot

BotFather:
'Enable' - your bot will only receive messages that either start with the '/' symbol or mention the bot by username.
'Disable' - your bot will receive all messages that people send to groups.
Current status is: DISABLED

Umputun:
Disable

BotFather:
Success! The new status is: DISABLED. /help

Important: privacy mode has to be disabled before the bot is added to the group. If it was done afterwards, the user should remove the bot from the group and add it again.

All Application Options

      --admin.group=                admin group name, or channel id [$ADMIN_GROUP]
      --disable-admin-spam-forward  disable forwarding spam messages to admin group [$DISABLE_ADMIN_SPAM_FORWARD]      
      --testing-id=                 testing ids, allow bot to reply to them [$TESTING_ID]
      --history-duration=           history duration (default: 24h) [$HISTORY_DURATION]
      --history-min-size=           history minimal size to keep (default: 1000) [$HISTORY_MIN_SIZE]
      --super=                      super-users [$SUPER_USER]
      --no-spam-reply               do not reply to spam messages [$NO_SPAM_REPLY]
      --similarity-threshold=       spam threshold (default: 0.5) [$SIMILARITY_THRESHOLD]
      --min-msg-len=                min message length to check (default: 50) [$MIN_MSG_LEN]
      --max-emoji=                  max emoji count in message, -1 to disable check (default: 2) [$MAX_EMOJI]
      --min-probability=            min spam probability percent to ban (default: 50) [$MIN_PROBABILITY]
      --paranoid                    paranoid mode, check all messages [$PARANOID]
      --first-messages-count=       number of first messages to check (default: 1) [$FIRST_MESSAGES_COUNT]
      --training                    training mode, passive spam detection only [$TRAINING]
      --soft-ban                    soft ban mode, restrict user actions but not ban [$SOFT_BAN]
      
      --dry                         dry mode, no bans [$DRY]
      --dbg                         debug mode [$DEBUG]
      --tg-dbg                      telegram debug mode [$TG_DEBUG]

telegram:
      --telegram.token=             telegram bot token [$TELEGRAM_TOKEN]
      --telegram.group=             group name/id [$TELEGRAM_GROUP]
      --telegram.timeout=           http client timeout for telegram (default: 30s) [$TELEGRAM_TIMEOUT]
      --telegram.idle=              idle duration (default: 30s) [$TELEGRAM_IDLE]

logger:
      --logger.enabled              enable spam rotated logs [$LOGGER_ENABLED]
      --logger.file=                location of spam log (default: tg-spam.log) [$LOGGER_FILE]
      --logger.max-size=            maximum size before it gets rotated (default: 100M) [$LOGGER_MAX_SIZE]
      --logger.max-backups=         maximum number of old log files to retain (default: 10) [$LOGGER_MAX_BACKUPS]

cas:
      --cas.api=                    CAS API (default: https://api.cas.chat) [$CAS_API]
      --cas.timeout=                CAS timeout (default: 5s) [$CAS_TIMEOUT]

meta:
      --meta.links-limit=           max links in message, disabled by default (default: -1) [$META_LINKS_LIMIT]
      --meta.image-only             enable image only check [$META_IMAGE_ONLY]

openai:
      --openai.token=               openai token, disabled if not set [$OPENAI_TOKEN]
      --openai.veto                 veto mode, confirm detected spam [$OPENAI_VETO]
      --openai.prompt=              openai system prompt, if empty uses builtin default [$OPENAI_PROMPT]
      --openai.model=               openai model (default: gpt-4) [$OPENAI_MODEL]
      --openai.max-tokens-response= openai max tokens in response (default: 1024) [$OPENAI_MAX_TOKENS_RESPONSE]
      --openai.max-tokens-request=  openai max tokens in request (default: 2048) [$OPENAI_MAX_TOKENS_REQUEST]
      --openai.max-symbols-request= openai max symbols in request, failback if tokenizer failed (default: 16000) [$OPENAI_MAX_SYMBOLS_REQUEST]

files:
      --files.samples=              samples data path (default: data) [$FILES_SAMPLES]
      --files.dynamic=              dynamic data path (default: data) [$FILES_DYNAMIC]
      --files.watch-interval=       watch interval for dynamic files (default: 5s) [$FILES_WATCH_INTERVAL]

message:
      --message.startup=            startup message [$MESSAGE_STARTUP]
      --message.spam=               spam message (default: this is spam) [$MESSAGE_SPAM]
      --message.dry=                spam dry message (default: this is spam (dry mode)) [$MESSAGE_DRY]
      --message.warn=               warn message (default: You've violated our rules and this is your first and last warning. Further violations will lead to permanent access denial. Stay compliant or face the consequences!) [$MESSAGE_WARN]

server:
      --server.enabled              enable web server [$SERVER_ENABLED]
      --server.listen=              listen address (default: :8080) [$SERVER_LISTEN]
      --server.auth=                basic auth password for user 'tg-spam' (default: auto) [$SERVER_AUTH]

Help Options:
  -h, --help                        Show this help message

Application Options in details

--super - defines the list of privileged users. It can be repeated multiple times or provided as a comma-separated list in the environment. Those users are immune to spam detection and can also unban other users. All the admins of the group are privileged by default.
--no-spam-reply - if set to true, the bot will not reply to spam messages. By default, the bot replies to spam messages with the text "this is spam", or "this is spam (dry mode)" in dry mode. With this option set, in non-dry mode the bot deletes the spam message and bans the user permanently with no reply to the group.
--history-duration - defines how long to keep the message in the internal cache. If the message is older than this value, it will be removed from the cache. The default value is 24 hours. The cache is used to match the original message with the forwarded one. See the Updating spam and ham samples dynamically section for more details.
--history-min-size - defines the minimal number of messages to keep in the internal cache. If the number of messages is greater than this value and history-duration is exceeded, the oldest messages will be removed from the cache.
--testing-id - this is needed to debug things if something unusual is going on. All it does is add a chat ID to the list of chats the bot will listen to. This is useful for debugging purposes only and should not be used in production.
--paranoid - if set to true, the bot will check all the messages for spam, not just the first one. This is useful for testing and training purposes.
--first-messages-count - defines how many messages to check for spam. By default, the bot checks only the first message from a given user. However, in some cases, it is useful to check more than one message. For example, if the observed spam starts with a few non-spam messages, the bot will not be able to detect it. Setting this parameter to a higher value will allow the bot to detect such spam. Note: this parameter is ignored if --paranoid mode is enabled.
--training - if set, the bot will not ban users and delete messages but will learn from them. This is useful for training purposes.
--soft-ban - if set, the bot will restrict user actions but won't ban. This is useful for chats where the false-positive is hard or costly to recover from. With soft ban, the user won't be removed from the chat but will be restricted in actions. Practically, it means the user won't be able to send messages, but the recovery is easy - just unban the user, and they won't need to rejoin the chat.
--disable-admin-spam-forward - if set to true, the bot will not treat messages forwarded to the admin chat as spam.
--dry - if set to true, the bot will not ban users and delete messages. This is useful for testing purposes.
--dbg - if set to true, the bot will print debug information to the console.
--tg-dbg - if set to true, the bot will print debug information from the telegram library to the console.
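
The interaction between --paranoid and --first-messages-count can be illustrated with a tiny helper. shouldCheck is a hypothetical function, not the bot's code, and it leaves out other conditions such as privileged and approved users.

```go
package main

import "fmt"

// shouldCheck reports whether a message should go through spam checks.
// seen is the number of messages already received from this user.
func shouldCheck(seen, firstMessagesCount int, paranoid bool) bool {
	if paranoid {
		// paranoid mode checks every message and ignores the count
		return true
	}
	return seen < firstMessagesCount
}

func main() {
	fmt.Println(shouldCheck(0, 1, false)) // first message, default count
	fmt.Println(shouldCheck(1, 1, false)) // second message, default count
	fmt.Println(shouldCheck(1, 3, false)) // second message, count raised to 3
	fmt.Println(shouldCheck(9, 1, true))  // paranoid mode
}
```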

Running the bot with an empty set of samples

The provided set of samples is just an example collected by the bot author. It is not enough to detect all the spam, in all groups and all languages. However, the bot is designed to learn on the fly, so it is possible to start with an empty set of samples and let the bot learn from the spam detected by humans.

To do so, several conditions must be met:

--files.samples [$FILES_SAMPLES] must be set to the new location (directory) without spam-samples.txt and ham-samples.txt files.
--files.dynamic [$FILES_DYNAMIC] must be set to the new location (directory) where the bot will keep all the dynamic data files. In the case of docker container, this directory must be mapped to the host volume.
admin chat should be enabled, see Admin chat/group section above.
admin name(s) should be set with --super [$SUPER_USER] parameter.

After that, the moment an admin runs into a spam message, they can forward it to the tg-spam bot. The bot will add this message to the spam samples file, ban the user and delete the message. By doing so, the bot will learn new spam patterns on the fly and eventually will be able to detect spam without admin help. Note: the only thing the admin should do is forward the message to the bot; there is no need to add any text or comments, or to remove/ban the original spammer. The bot will do all the work.

Training the bot on a live system safely

If such active training on a live system is not possible, the bot can be trained without banning users and deleting messages automatically. Setting the --training parameter disables banning and deleting messages by the bot right away, but the rest of the functionality stays the same. This is useful for testing and training purposes, as the bot can be trained on false-positive samples by unbanning them in the admin chat, as well as on false-negative samples by forwarding them to the bot. Alternatively, an admin can reply to the spam message with the text spam or /spam to mark it as spam.

In this mode admin can ban users manually by clicking the "confirm ban" button on the message. This allows running the bot as a post-moderation tool and training it on the fly.

Please note: missed spam messages forwarded to the admin chat will be removed from the primary chat group and the user will be banned.

Running with webapi server

The bot can be run with a webapi server. This is useful for integration with other tools. The server is disabled by default, to enable it pass --server.enabled [$SERVER_ENABLED]. The server will listen on the port specified by --server.listen [$SERVER_LISTEN] parameter (default is :8080).

By default, the server is protected by basic auth with user tg-spam and a randomly generated password. This password is printed to the console on startup. If the user wants to set a custom auth password, it can be done with the --server.auth [$SERVER_AUTH] parameter. Setting it to an empty string will disable basic auth protection.

It is truly a bad idea to run the server without basic auth protection, as it allows adding/removing users and updating spam samples to anyone who knows the endpoint. The only reason to run it without protection is inside the trusted network or for testing purposes. Exposing the server directly to the internet is not recommended either, as basic auth is not secure enough if used without SSL. It is better to use a reverse proxy with TLS termination in front of the server.

endpoints:

GET /ping - returns pong if the server is running
POST /check - return spam check result for the message passed in the body. The body should be a json object with the following fields:
- msg - message text
- user_id - user id
- user_name - username
POST /update/spam - update spam samples with the message passed in the body. The body should be a json object with the following fields:
- msg - spam text
POST /update/ham - update ham samples with the message passed in the body. The body should be a json object with the following fields:
- msg - ham text
POST /delete/spam - delete spam samples with the message passed in the body. The body should be a json object with the following fields:
- msg - spam text
POST /delete/ham - delete ham samples with the message passed in the body. The body should be a json object with the following fields:
- msg - ham text
POST /users/add - add user to the list of approved users. The body should be a json object with the following fields:
- user_id - user id to add
- user_name - username, used for user_id lookup if user_id is not set
POST /users/delete - remove user from the list of approved users. The body should be a json object with the following fields:
- user_id - user id to delete
- user_name - username, used for user_id lookup if user_id is not set
GET /users - get the list of approved users. The response is a json object with the following fields:
- user_ids - array of user ids
GET /samples - get the list of spam and ham samples. The response is a json object with the following fields:
- spam - array of spam samples
- ham - array of ham samples
PUT /samples - reload dynamic samples
GET /settings - return the current settings of the bot

For real examples of HTTP requests, see the webapp.rest file.
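
For clients written in Go, a /check request could be built roughly as follows. This is a sketch under assumptions: the field types (user_id is sent as a string here), the base URL and the password are made up for illustration, the basic auth user tg-spam is taken from the --server.auth option description, and the response format is not decoded because it is not described here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// checkRequest mirrors the fields listed for POST /check.
// The string type of UserID is an assumption.
type checkRequest struct {
	Msg      string `json:"msg"`
	UserID   string `json:"user_id"`
	UserName string `json:"user_name"`
}

// newCheckRequest builds a POST /check request with a JSON body
// and basic auth for the user tg-spam.
func newCheckRequest(baseURL, password string, body checkRequest) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, fmt.Errorf("marshal body: %w", err)
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/check", bytes.NewReader(payload))
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.SetBasicAuth("tg-spam", password)
	return req, nil
}

func main() {
	req, err := newCheckRequest("http://localhost:8080", "secret",
		checkRequest{Msg: "hello there", UserID: "123", UserName: "alice"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	// to send it: resp, err := http.DefaultClient.Do(req)
}
```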

How it works

The server is using the same spam detection logic as the bot itself. It is using the same set of samples and the same set of parameters. The only difference is that the server is not banning users and deleting messages. It also doesn't assume any particular flow user should follow. For example, the /check api call doesn't update dynamic spam/ham samples automatically.

However, if users want to update spam/ham dynamic samples, they should call the corresponding endpoint /update/<spam|ham>. On the other hand, updating the approved users list is a part of the /check api call, so the user doesn't need to call it separately. If the list of approved users should be managed by the client application, it is possible to call the /users endpoints directly.

Generally, this is a very basic server, but should be sufficient for most use cases. If a user needs more functionality, it is possible to run the bot as a library and implement custom logic on top of it.

See also examples for small but complete applications using the bot as a library.

WEB UI

If the webapi server is enabled (see the Running with webapi server section above), the bot will serve a simple web UI on the root path. It is a basic UI to check a message for spam, manage samples and handle approved users. It is protected by basic auth the same way as the webapi server.

Screenshots

Example of docker-compose.yml

This is an example of a docker-compose.yml file to run the bot. It is using the latest stable version of the bot from docker hub and running as a non-root user with uid:gid 1000:1000 (matching host's uid:gid) to avoid permission issues with mounted volumes. The bot is using the host timezone and has a few super-users set. It is logging to the host directory ./log/tg-spam and keeps all the dynamic data files in ./var/tg-spam. The bot is using the admin chat and has a secret to protect generated links. It is also using the default set of samples and stop words.

services:
  
  tg-spam:
    image: umputun/tg-spam:latest
    hostname: tg-spam
    restart: always
    container_name: tg-spam
    user: "1000:1000" # set uid:gid to host user to avoid permission issues with mounted volumes
    logging: &default_logging
      driver: json-file
      options:
        max-size: "10m"
        max-file: "5"
    environment:
      - TZ=America/Chicago
      - TELEGRAM_TOKEN=ххххх
      - TELEGRAM_GROUP=example_chat # public group name to monitor and protect
      - ADMIN_GROUP=-403767890 # private group id for admin spam-reports
      - LOGGER_ENABLED=true
      - LOGGER_FILE=/srv/log/tg-spam.log
      - LOGGER_MAX_SIZE=5M
      - FILES_DYNAMIC=/srv/var
      - NO_SPAM_REPLY=true
      - DEBUG=true
    volumes:
      - ./log/tg-spam:/srv/log
      - ./var/tg-spam:/srv/var
    command: --super=name1 --super=name2 --super=name3

Getting spam samples from CAS

CAS provides an API to get spam samples, which can be used to create a set of spam samples for the bot. The provided cas-export.sh script automates the process, and the result (messages.txt) can be used as a base for the spam-samples.txt file. The script requires jq and curl to be installed, and running it will take a long time.

curl -s https://raw.githubusercontent.com/umputun/tg-spam/master/cas-export.sh > cas-export.sh
chmod +x cas-export.sh
./cas-export.sh

Please note: using the results of this script as-is may not be a good idea, because a particular chat group may have a different spam pattern. It is better to use them as a base, picking the samples that seem appropriate for a given chat and adding more spam samples from the group itself.

Updating spam and ham samples from remote git repository

A small utility and a docker container are provided to update spam and ham samples from a remote git repository. The utility is designed to run as a docker container, as a standalone script, or as part of a cron job. For more details, see updater/README.md.

It also has an example of docker-compose.yml to run it as a container side-by-side with the bot.

Running tgspam for multiple groups

It is not possible to run a single instance of the bot for multiple groups, as the bot is designed to work with a single group only. However, it is possible to run multiple instances of the bot with different tokens and different groups. Note: each instance needs its own token, because Telegram doesn't allow using the same token for multiple bots at the same time, and such a reuse attempt will prevent the bots from working properly.

At the same time, multiple instances of the bot can share the same set of samples and dynamic data files. To do so, mount the same directory with samples and dynamic data files into all instances of the bot.

Using tg-spam as a library

The bot can be used as a library as well. To do so, import the packages under github.com/umputun/tg-spam/lib (the example below uses lib/tgspam and lib/spamcheck) and create a new instance of the Detector struct. Then call the Check method with a request carrying the message and user ID. The method returns true if the message is spam and false otherwise. In addition, it returns the list of applied rules along with the spam-related details.

For more details, see the docs on pkg.go.dev

Example:

package main

import (
  "fmt"
  "io"
  "net/http"
  "strings"

  "github.com/umputun/tg-spam/lib/spamcheck"
  "github.com/umputun/tg-spam/lib/tgspam"
)

func main() {
  // Initialize a new Detector with a Config
  detector := tgspam.NewDetector(tgspam.Config{
    MaxAllowedEmoji:  5,
    MinMsgLen:        10,
    FirstMessageOnly: true,
    CasAPI:           "https://cas.example.com",
    HTTPClient:       &http.Client{},
  })

  // Load stop words
  stopWords := strings.NewReader("\"word1\"\n\"word2\"\n\"hello world\"\n\"some phrase\", \"another phrase\"")
  res, err := detector.LoadStopWords(stopWords)
  if err != nil {
    fmt.Println("Error loading stop words:", err)
    return
  }
  fmt.Println("Loaded", res.StopWords, "stop words")

  // Load spam and ham samples
  spamSamples := strings.NewReader("spam sample 1\nspam sample 2\nspam sample 3")
  hamSamples := strings.NewReader("ham sample 1\nham sample 2\nham sample 3")
  excludedTokens := strings.NewReader("\"the\", \"a\", \"an\"")
  res, err = detector.LoadSamples(excludedTokens, []io.Reader{spamSamples}, []io.Reader{hamSamples})
  if err != nil {
    fmt.Println("Error loading samples:", err)
    return
  }
  fmt.Println("Loaded", res.SpamSamples, "spam samples and", res.HamSamples, "ham samples")

  // check a message for spam
  isSpam, info := detector.Check(spamcheck.Request{Msg: "This is a test message", UserID: "user1", UserName: "John Doe"})
  if isSpam {
    fmt.Println("The message is spam, info:", info)
  } else {
    fmt.Println("The message is not spam, info:", info)
  }

}
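The second value returned by Check can be inspected per rule. The sketch below uses a local stand-in type rather than the library's own, with name, spam, and details fields mirroring what the bot prints in its logs; the exact library type and field names should be checked on pkg.go.dev.

```go
package main

import (
	"fmt"
	"strings"
)

// checkResult is a local stand-in for a single rule result.
// It is an assumption for this sketch, not the library type itself.
type checkResult struct {
	Name    string
	Spam    bool
	Details string
}

// triggered returns the names of the rules that flagged the message as spam.
func triggered(results []checkResult) []string {
	var names []string
	for _, r := range results {
		if r.Spam {
			names = append(names, r.Name)
		}
	}
	return names
}

func main() {
	results := []checkResult{
		{Name: "stopword", Spam: false, Details: "not found"},
		{Name: "emoji", Spam: false, Details: "0/2"},
		{Name: "similarity", Spam: true, Details: "1.00/0.50"},
		{Name: "classifier", Spam: true, Details: "probability of spam: 100.00%"},
		{Name: "cas", Spam: false, Details: "record not found"},
	}
	fmt.Println("flagged by:", strings.Join(triggered(results), ", "))
}
```

Keeping the per-rule breakdown around is handy when tuning thresholds, since it shows which rule caused a false positive.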


tg-spam's Issues

Do not remove users after unban

According to the docs:

By default, this method guarantees that after the call the user is not a member of the chat, but will be able to join it. So if the user is a member of the chat they will also be removed from the chat

In my use case, I would like to avoid this. To do so, the only_if_banned parameter has to be passed. I would like to add it.

I will do it if you are ok with it. Do you think it should be a run parameter, or is it fine to make it the only possible behaviour of tg-spam?
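For reference, a minimal sketch of the parameters involved, built with the standard library only. The chat_id, user_id, and only_if_banned names come from the Telegram Bot API unbanChatMember method; the helper itself is hypothetical and is not how tg-spam is implemented.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// unbanParams builds form parameters for the Bot API unbanChatMember call.
// When onlyIfBanned is true, a user who is not banned is left in the chat.
func unbanParams(chatID, userID int64, onlyIfBanned bool) url.Values {
	v := url.Values{}
	v.Set("chat_id", strconv.FormatInt(chatID, 10))
	v.Set("user_id", strconv.FormatInt(userID, 10))
	if onlyIfBanned {
		v.Set("only_if_banned", "true")
	}
	return v
}

func main() {
	// the IDs below are placeholders
	params := unbanParams(-1001234567890, 42, true)
	fmt.Println(params.Encode())
}
```

Whether this becomes an option or the default is the open question of this issue; the sketch only shows that the change is a single extra parameter.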

Unban directly from admin group

If possible, it would be great to be able to unban users directly from the admin group where banned messages are forwarded, ideally with a button under each message.

And if it's run in dry mode, add a "BAN" button instead.

For example, I currently run the bot in dry mode, and there are false-positive messages. But I can't tell the bot "it's not spam" to teach it, so I can't switch off dry mode because the bot is not trained yet.

Add example webapp showing real-life usage of library/api

It would be nice to have some basic examples demonstrating the use of the library and/or API. Some basic htmx+templates allow users to enter a text message and show it on the common page, which will do the trick. This will be some one-topic forum or something like this. Maybe some fake auth to allow simulation of multiple users.

As soon as the message is submitted, it will be checked against the tg-spam library/api, and if spam is detected, the check results will be added to this page instead of the original message. Adding an "unban" button for each banned message would also be nice, restoring the original text and adding the user ID to the list of approved users.

The goal here is not to build something useful but to make a minimal web app demonstrating as many aspects of integration with library/API as possible. Another goal is to get a feel for the usability of the library/api

This is also an excellent exercise for someone who wants to play with htmx go application and integration with third-party apis

Allow forwards without bans

We have an admin group for our main group, and sometimes we just forward messages from one to the other to discuss something. I would really like the authors of forwarded messages not to be banned; instead, I would prefer to mark messages as spam/ham manually.

automatic ban on multiple /warn

Currently, the /warn command doesn't keep track of which user received a warning and how many times. It would be beneficial to record this information and automatically ban users if a certain threshold is reached.
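A possible shape for such tracking, as an in-memory sketch only. The type name, the threshold value, and the lack of persistence are all assumptions for illustration; a real implementation would likely store the counters alongside the bot's other data.

```go
package main

import (
	"fmt"
	"sync"
)

// warnTracker counts warnings per user and reports when a threshold is reached.
type warnTracker struct {
	mu        sync.Mutex
	threshold int
	counts    map[int64]int
}

// newWarnTracker creates a tracker that signals a ban at the given threshold.
func newWarnTracker(threshold int) *warnTracker {
	return &warnTracker{threshold: threshold, counts: make(map[int64]int)}
}

// Warn records one warning for the user and returns the running count
// and whether the user has reached the ban threshold.
func (w *warnTracker) Warn(userID int64) (count int, ban bool) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.counts[userID]++
	count = w.counts[userID]
	return count, count >= w.threshold
}

func main() {
	tracker := newWarnTracker(3)
	for i := 0; i < 3; i++ {
		count, ban := tracker.Warn(1001)
		fmt.Printf("user 1001: warnings=%d ban=%v\n", count, ban)
	}
}
```

The mutex is there because warnings may arrive from concurrent update handlers.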

unban failure

Hi umputun,

I used docker-compose to deploy the application, and when I try to unban a user I get a warning

tg-spam | 2024/03/24 11:10:32.014 [WARN] failed to process callback: failed to unban user: failed to update ham for "Нужен человек для удаленного заработка. С вас, телефон и 2 часа свободного времени. Доход достойный. Для связи (+) в ЛС": can't update ham samples: can't update ham samples: failed to open /srv/data/ham-dynamic.txt: open /srv/data/ham-dynamic.txt: no such file or directory

tg-spam.log permission denied

Hi umputun,

I used docker-compose to deploy the application, and I see a warning

tg-spam    | 2024/03/24 11:09:18.928 [INFO]  user **** as spammer: {name: stopword, spam: false, details: not found}, {name: emoji, spam: false, details: 0/2}, {name: similarity, spam: true, details: 1.00/0.50}, {name: classifier, spam: true, details: probability of spam: 100.00%}, {name: cas, spam: false, details: record not found}, "Нужен человек для удаленного заработка. С вас, телефон и 2 часа свободного времени. Доход достойный. Для связи (+) в ЛС"
tg-spam    | 2024/03/24 11:09:18.928 [WARN]  can't write to log, can't open new logfile: open /srv/logs/tg-spam.log: permission denied

In the container, I see that the folders were created by the root user:

/srv $ ls -la
total 14356
drwxr-xr-x    1 root     root          4096 Mar 24 11:07 .
drwxr-xr-x    1 root     root          4096 Mar 24 11:07 ..
drwxrwxr-x    2 app      app           4096 Mar 24 11:09 data
drwxr-xr-x    2 root     root          4096 Mar 24 11:07 logs
-rwxr-xr-x    1 root     root      14684160 Mar 24 07:22 tg-spam
/srv $ ps -a
PID   USER     TIME  COMMAND
    1 app       0:04 /srv/tg-spam
   22 app       0:00 /bin/sh
   30 app       0:00 ps -a

On the host machine, the ./logs folder was created by root:

user@ubuntu-vm:~/tg-spam$ ls -la
total 20
drwxrwxr-x 3 user   user   4096 Mar 24 08:07 .
drwxr-x--- 8 user   user   4096 Mar 23 19:06 ..
-rw-r--r-- 1 user   user   2132 Mar 24 08:07 docker-compose.yml
-rw-r--r-- 1 user   user    638 Mar 23 20:49 .env
drwxr-xr-x 2 root root 4096 Mar 24 08:07 logs

# example of a compose file with server enabled, logging turned on and samples on named volume
services:
  tg-spam:
    image: umputun/tg-spam:latest # use :master tag for latest (unstable) version
    hostname: tg-spam
    user: app
    restart: always
    container_name: tg-spam
    deploy:
      resources:
        limits:
          cpus: '0.25'
          memory: 50M
      restart_policy:
        condition: on-failure
        max_attempts: 3
    logging: &default_logging
      driver: json-file
      options:
        max-size: "5m"
        max-file: "2"
    env_file:
      - .env
    volumes:
      - tg-data:/srv/data       # mount volume with samples and dynamic files
      - ./logs:/srv/logs        # mount logs location to host's ./log directory
    ports:
      - 127.0.0.1:4080:8080
volumes:
  tg-data:
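The symptom above (a process running as a non-root user, a bind-mounted directory owned by root) can be detected early with a small write probe. The sketch below is a generic check and not something tg-spam is known to do; the function name is hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// checkWritable tries to create and remove a temporary file in dir,
// returning an error if the current user can't write there.
func checkWritable(dir string) error {
	f, err := os.CreateTemp(dir, ".write-test-*")
	if err != nil {
		return fmt.Errorf("directory %s is not writable: %w", dir, err)
	}
	name := f.Name()
	if err := f.Close(); err != nil {
		return fmt.Errorf("can't close probe file: %w", err)
	}
	if err := os.Remove(name); err != nil {
		return fmt.Errorf("can't remove probe file: %w", err)
	}
	return nil
}

func main() {
	// os.TempDir() is used here only to have a directory to probe
	for _, dir := range []string{os.TempDir(), "/nonexistent-dir-for-sketch"} {
		if err := checkWritable(dir); err != nil {
			fmt.Println("problem:", err)
			continue
		}
		fmt.Println("ok:", dir)
	}
}
```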

Add messages forwarded (by admins) to admin spam group as spam

Every message forwarded to the admin spam group should be added to the dynamic-spam file (see #2) and also should trigger spam removal and user ban right away

Controversial results on a spam message

It's a single message, but the bot reported two contradictory results for it: one says it's spam, the other says it's ok.

Soft ban mode not working for me

I'm trying to use the soft ban functionality; it is critical for me.
I write a phrase from the stop-words.txt dictionary in my test chat, but the bot does not restrict me; instead, it bans me and removes me from the chat.

What's my mistake?

I'm using v1.12.0; here is the container configuration:

services:
  tg-spam:
    image: umputun/tg-spam:latest # use :master tag for latest (unstable) version
    hostname: tg-spam
    restart: always
    container_name: tg-spam
    logging: &default_logging
      driver: json-file
      options:
        max-size: "10m"
        max-file: "5"
    environment:
      - TELEGRAM_TOKEN=000000Q
      - TELEGRAM_GROUP=-00000
      - TZ=Europe/Sofia          
      - ADMIN_GROUP=-000000  
      - LOGGER_ENABLED=true      
      - LOGGER_FILE=/srv/logs/tg-spam.log
      - LOGGER_MAX_SIZE=5M      
      - FIRST_MESSAGES_COUNT=4   
      - META_IMAGE_ONLY=true     
      - META_LINKS_ONLY=true     
      - MAX_EMOJI=-1             
      - NO_SPAM_REPLY=false      
      - SERVER_ENABLED=false      
      - NO_SPAM_REPLY=true
      - SOFT_BAN=true
      - DEBUG=true 			 
      - DRY=false
      - PARANOID=true
      - TRAINING=false
    volumes:
      - /telegram/tg-spam/data/:/srv/data      
      - /telegram/tg-spam/logs/:/srv/logs

Potentially, the problem is that in main.go the TelegramListener structure does not initialize the SoftBanMode parameter from the options struct.
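To illustrate the suspected bug class, here is a toy sketch with hypothetical types; these are not the real tg-spam structures. If the constructor forgets to copy the option into the listener, the zero value (false) silently selects the hard ban.

```go
package main

import "fmt"

// options and listener are simplified, hypothetical stand-ins.
type options struct {
	SoftBan bool
	Dry     bool
}

type listener struct {
	SoftBanMode bool
	Dry         bool
}

// newListener copies every relevant option; omitting SoftBanMode here
// would leave it false regardless of the configuration.
func newListener(opts options) listener {
	return listener{SoftBanMode: opts.SoftBan, Dry: opts.Dry}
}

// action reports what the listener would do with a detected spammer.
func (l listener) action() string {
	switch {
	case l.Dry:
		return "none"
	case l.SoftBanMode:
		return "restrict"
	default:
		return "ban"
	}
}

func main() {
	fmt.Println(newListener(options{SoftBan: true}).action())
	fmt.Println(newListener(options{}).action())
}
```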

Update the list of approved users

It is not clear how to update the list of approved users.

It only loads the list of users from a file during the bot start: https://github.com/umputun/tg-spam/blob/master/app/main.go#L162 , but how (and when) can I add a user to that list?

My assumption is that the list has to be updated after an admin clicks unban?

Anyway, could you please explain how it should work from your perspective and I will do my best to fix that if needed.

Add some progress indication on potentially slow operations in web ui

With 1000+ records in the custom spam list, clicking the trash icon takes a second or two, and it would be nice to show a progress indicator. HTMX supports this kind of thing, so it should be easy enough.

As a side note, need to check why it takes that long; maybe an index is missing or something like that.

CAPTCHA support

Hello! Thanks for your opensource bot - I am already using it in a fairly large group (several thousand users).

Is it appropriate to add captcha support for this bot? Most spam bots are not smart enough to pass even the simplest captcha checks, so this feature will help a lot.

I have opened this issue for discussion - if you agree, I would like to develop this feature myself.

issues with running latest v.1.3.1 bot

I'm receiving the following failure:

tg-spam v1.3.1-e9aa206-20231222T15:42:53
2023/12/23 19:10:18.341 [ERROR] can't make data db, unable to open database file: out of memory (14)
2023/12/23 19:10:18.341 [ERROR] can't make data db, unable to open database file: out of memory (14)
>>> stack trace:
main.main()
	/build/app/main.go:139 +0x465

for the following run command:

docker run ghcr.io/umputun/tg-spam:v1.3.1 --telegram.token=*** --telegram.group=*** --admin.group=*** --telegram.preserve-unbanned --no-spam-reply --max-emoji=-1

If I run from locally built go binary, everything is fine. Also, everything is fine if I run from locally built docker image.

What am I doing wrong?

Bot deletes messages of other bots, even ones promoted to admin

The bot flagged a message from another bot as spam: it was the message from the channel, posted by Telegram itself to a group linked to the channel.
After unbanning and posting the same message again, the bot detected spam again. I have added Telegram (ID: 777000) as a superuser to make the bot ignore its messages.

Logs are below.

2023/12/14 12:54:48 stdout 2023/12/14 10:54:48.530 [DEBUG] {events/events.go:338 events.(*TelegramListener).sendBotResponse} bot response - permanently banned {777000  Telegram}\n⛔︎ unban if wrong ⛔︎\n\nВот вы всё "Apple", "Google", "монополия сторов" — мелко это всё. Вот поезда — это круто. В Польше одна региональная железная дорога решила сэкономить на обслуживании купленных пассажирских поездов Impuls от компании NEWAG и наняла независимую (от производителя) компанию для этого. После того, как сервисные работы были закончены, несколько поездов отказались заводиться. Ситуация была откровенно угрожающей (поездов не хватало для обеспечения перевозок), когда кто-то из персонала набрал в гугле "польские хакеры" и в итоге к проблеме подключились белые хакеры из группы Dragon Sector.   Им удалось разобраться с загадочными ошибками и обнаружить, что в поездах работает система распознавания "workshop-detection", которая начинает препятствовать работе, если обнаружено вмешательство неавторизованного механика. Короче говоря, производитель делает из поезда "кирпич", если его ремонтирует кто-то другой, показывая массу сообщений о нарушении копирайта и даже вроде имея возможность заблокировать поезд удаленно.  Хакерам удалось обойти это ограничение и запустить поезда. А производитель теперь отказывается от ответственности и заявляет, что ничего такого не делал и вообще грозит судиться с хакерами за клевету. Настаивая при этом, что ошибки были вызваны недостаточной квалификацией ремонтников, потому что обслуживать поезда должны только сотрудники производителя.   @blognot  https://www.404media.co/polish-hackers-repaired-trains-the-manufacturer-artificially-bricked-now-the-train-company-is-threatening-them/\n\n, reply-to:0

2023/12/14 12:54:48 stdout 2023/12/14 10:54:48.530 [WARN]  {server/server.go:232 server.(*SpamWeb).UnbanURL} failed to compress message "Вот вы всё "Apple", "Google", "монополия сторов" — мелко это всё. Вот поезда — это круто. В Польше одна региональная железная дорога решила сэкономить на обслуживании купленных пассажирских поездов Impuls от компании NEWAG и наняла независимую (от производителя) компанию для этого. После того, как сервисные работы были закончены, несколько поездов отказались заводиться. Ситуация была откровенно угрожающей (поездов не хватало для обеспечения перевозок), когда кто-то из персонала набрал в гугле "польские хакеры" и в итоге к проблеме подключились белые хакеры из группы Dragon Sector.   Им удалось разобраться с загадочными ошибками и обнаружить, что в поездах работает система распознавания "workshop-detection", которая начинает препятствовать работе, если обнаружено вмешательство неавторизованного механика. Короче говоря, производитель делает из поезда "кирпич", если его ремонтирует кто-то другой, показывая массу сообщений о нарушении копирайта и даже вроде имея возможность заблокировать поезд удаленно.  Хакерам удалось обойти это ограничение и запустить поезда. А производитель теперь отказывается от ответственности и заявляет, что ничего такого не делал и вообще грозит судиться с хакерами за клевету. Настаивая при этом, что ошибки были вызваны недостаточной квалификацией ремонтников, потому что обслуживать поезда должны только сотрудники производителя.   @blognot  https://www.404media.co/polish-hackers-repaired-trains-the-manufacturer-artificially-bricked-now-the-train-company-is-threatening-them/", encoded string is too long: 1622 characters

2023/12/14 12:54:48 stdout 2023/12/14 10:54:48.529 [DEBUG] {events/events.go:306 events.(*TelegramListener).reportToAdminChat} report to admin chat, ban data for {777000  Telegram}, group: 120025072

2023/12/14 12:54:48 stdout 2023/12/14 10:54:48.529 [INFO]  {events/events.go:206 events.(*TelegramListener).procEvents} {777000  Telegram} banned by bot for 9600h0m0s

2023/12/14 12:54:48 stdout 2023/12/14 10:54:48.436 [DEBUG] {app/main.go:260 main.execute.makeSpamLogger.func11} spam message: this is spam: "Telegram" (777000)

2023/12/14 12:54:48 stdout 2023/12/14 10:54:48.436 [INFO]  {app/main.go:259 main.execute.makeSpamLogger.func11} spam detected from {777000  Telegram}, response: this is spam: "Telegram" (777000)

2023/12/14 12:54:48 stdout 2023/12/14 10:54:48.436 [DEBUG] {events/events.go:198 events.(*TelegramListener).procEvents} ban initiated for {Text:this is spam: "Telegram" (777000) Send:true BanInterval:9600h0m0s User:{ID:777000 Username: DisplayName:Telegram} ChannelID:0 ReplyTo:42734 DeleteReplyTo:true}

2023/12/14 12:54:48 stdout 2023/12/14 10:54:48.161 [DEBUG] {events/events.go:338 events.(*TelegramListener).sendBotResponse} bot response - this is spam: "Telegram" (777000), reply-to:42734

2023/12/14 12:54:48 stdout 2023/12/14 10:54:48.161 [INFO]  {bot/spam.go:84 bot.(*SpamFilter).OnMessage} user Telegram detected as spammer: {name: stopword, spam: false, details: not found}, {name: emoji, spam: false, details: 0/2}, {name: similarity, spam: false, details: 0.10/0.50}, {name: classifier, spam: true, details: probability: NaN%, certain: true}, {name: cas, spam: false, details: Record not found.}, "Вот вы всё "Apple", "Google", "монополия сторов" — мелко это всё. Вот поезда — это круто. В Польше одна региональная железная дорога решила сэкономить на обслуживании купленных пассажирских поездов Impuls от компании NEWAG и наняла независимую (от производителя) компанию для этого. После того, как сервисные работы были закончены, несколько поездов отказались заводиться. Ситуация была откровенно угрожающей (поездов не хватало для обеспечения перевозок), когда кто-то из персонала набрал в гугле "польские хакеры" и в итоге к проблеме подключились белые хакеры из группы Dragon Sector. \n\nИм удалось разобраться с загадочными ошибками и обнаружить, что в поездах работает система распознавания "workshop-detection", которая начинает препятствовать работе, если обнаружено вмешательство неавторизованного механика. Короче говоря, производитель делает из поезда "кирпич", если его ремонтирует кто-то другой, показывая массу сообщений о нарушении копирайта и даже вроде имея возможность заблокировать поезд удаленно.\n\nХакерам удалось обойти это ограничение и запустить поезда. А производитель теперь отказывается от ответственности и заявляет, что ничего такого не делал и вообще грозит судиться с хакерами за клевету. Настаивая при этом, что ошибки были вызваны недостаточной квалификацией ремонтников, потому что обслуживать поезда должны только сотрудники производителя. \n\n@blognot\n\nhttps://www.404media.co/polish-hackers-repaired-trains-the-manufacturer-artificially-bricked-now-the-train-company-is-threatening-them/"

2023/12/14 12:54:47 stdout 2023/12/14 10:54:47.916 [DEBUG] {events/events.go:184 events.(*TelegramListener).procEvents} incoming msg: Вот вы всё "Apple", "Google", "монополия сторов" — мелко это всё. Вот поезда — это круто. В Польше одна региональная железная дорога решила сэкономить на обслуживании купленных пассажирских поездов Impuls от компании NEWAG и наняла независимую (от производителя) компанию для этого. После того, как сервисные работы были закончены, несколько поездов отказались заводиться. Ситуация была откровенно угрожающей (поездов не хватало для обеспечения перевозок), когда кто-то из персонала набрал в гугле "польские хакеры" и в итоге к проблеме подключились белые хакеры из группы Dragon Sector.   Им удалось разобраться с загадочными ошибками и обнаружить, что в поездах работает система распознавания "workshop-detection", которая начинает препятствовать работе, если обнаружено вмешательство неавторизованного механика. Короче говоря, производитель делает из поезда "кирпич", если его ремонтирует кто-то другой, показывая массу сообщений о нарушении копирайта и даже вроде имея возможность заблокировать поезд удаленно.  Хакерам удалось обойти это ограничение и запустить поезда. А производитель теперь отказывается от ответственности и заявляет, что ничего такого не делал и вообще грозит судиться с хакерами за клевету. Настаивая при этом, что ошибки были вызваны недостаточной квалификацией ремонтников, потому что обслуживать поезда должны только сотрудники производителя.   @blognot  https://www.404media.co/polish-hackers-repaired-trains-the-manufacturer-artificially-bricked-now-the-train-company-is-threatening-them/

2023/12/14 12:54:47 stdout 2023/12/14 10:54:47.916 [DEBUG] {events/events.go:162 events.(*TelegramListener).procEvents} {"message_id":42734,"from":{"id":777000,"first_name":"Telegram"},"sender_chat":{"id":-1001065632275,"type":"channel","title":"БлоGнот","username":"blognot","photo":null,"location":null},"date":1702551287,"chat":{"id":-1001226560034,"type":"supergroup","title":"БлоGнот комментарии","username":"blognot_chat","photo":null,"location":null},"forward_from_chat":{"id":-1001065632275,"type":"channel","title":"БлоGнот","username":"blognot","photo":null,"location":null},"forward_from_message_id":4533,"forward_date":1702551284,"is_automatic_forward":true,"text":"Вот вы всё "Apple", "Google", "монополия сторов" — мелко это всё. Вот поезда — это круто. В Польше одна региональная железная дорога решила сэкономить на обслуживании купленных пассажирских поездов Impuls от компании NEWAG и наняла независимую (от производителя) компанию для этого. После того, как сервисные работы были закончены, несколько поездов отказались заводиться. Ситуация была откровенно угрожающей (поездов не хватало для обеспечения перевозок), когда кто-то из персонала набрал в гугле "польские хакеры" и в итоге к проблеме подключились белые хакеры из группы Dragon Sector. \n\nИм удалось разобраться с загадочными ошибками и обнаружить, что в поездах работает система распознавания "workshop-detection", которая начинает препятствовать работе, если обнаружено вмешательство неавторизованного механика. Короче говоря, производитель делает из поезда "кирпич", если его ремонтирует кто-то другой, показывая массу сообщений о нарушении копирайта и даже вроде имея возможность заблокировать поезд удаленно.\n\nХакерам удалось обойти это ограничение и запустить поезда. А производитель теперь отказывается от ответственности и заявляет, что ничего такого не делал и вообще грозит судиться с хакерами за клевету. Настаивая при этом, что ошибки были вызваны недостаточной квалификацией ремонтников, потому что обслуживать поезда должны только сотрудники производителя. \n\n@blognot\n\nhttps://www.404media.co/polish-hackers-repaired-trains-the-manufacturer-artificially-bricked-now-the-train-company-is-threatening-them/","entities":[{"type":"mention","offset":1374,"length":8},{"type":"url","offset":1384,"length":135}],"message_auto_delete_timer_changed":null,"proximity_alert_triggered":null,"voice_chat_scheduled":null,"voice_chat_started":null,"voice_chat_ended":null,"voice_chat_participants_invited":null}

2023/12/14 12:54:12 stdout 2023/12/14 10:54:12.289 [INFO]  {server/server.go:179 server.(*SpamWeb).unbanHandler} unban user 777000
2023/12/14 12:54:12	stdout	2023/12/14 10:54:12.289 [INFO]  {server/server.go:179 server.(*SpamWeb).unbanHandler} unban user 777000

Extend supported commands

The idea is to add more commands:

admins only

/ban - does the same thing as /spam but does not add the message to the spam samples. Sometimes users post a message that is not spam in general but counts as spam in a particular context, for example when someone is promoting their own product.
/warn - replaces the message with a predefined text response and optionally restricts the user for some period of time

regular users

/report - lets regular users report spam; see the discussion in #58 for more details

failed to process direct spam report

When we reply to a message with spam, we get the following message in the admin channel:

However, the post is not removed, and we get the following errors in the console:

tg-spam  | 2024/03/29 19:53:35.575 [WARN]  can't write to log, can't make directories for new logfile: mkdir /srv/logs: permission denied
tg-spam  | 2024/03/29 19:53:35.582 [INFO]  detected spam entry added for user_id:6609487300, name:KamaraShell
tg-spam  | 2024/03/29 19:53:35.928 [INFO]  {6609487300 KamaraShell Shelby Kamara} banned by bot for 9600h0m0s
tg-spam  | 2024/03/29 20:31:31.396 [INFO]  user Сергей Гуськов detected as spammer: {name: stopword, spam: false, details: not found}, {name: emoji, spam: false, details: 0/2}, {name: similarity, spam: false, details: 0.14/0.50}, {name: classifier, spam: true, details: probability of spam: 99.42%}, {name: cas, spam: false, details: record not found}, {name: openai, spam: true, details: Promotion of a get-rich-quick scheme, confidence: 95%}, "A dream come true is when you double your savings in just 1 night, @WhalesofRocketmoonsignal is the only way to make it happen"
tg-spam  | 2024/03/29 20:31:31.396 [WARN]  can't write to log, can't make directories for new logfile: mkdir /srv/logs: permission denied
tg-spam  | 2024/03/29 20:31:31.404 [INFO]  detected spam entry added for user_id:7088948055, name:Thomas0Wright7
tg-spam  | 2024/03/29 20:31:31.732 [INFO]  {7088948055 Thomas0Wright7 Сергей Гуськов} banned by bot for 9600h0m0s
tg-spam  | 2024/03/30 00:04:44.719 [INFO]  user "1752434186" added to approved users
tg-spam  | 2024/03/30 00:07:17.847 [INFO]  user "1752434186" added to approved users
tg-spam  | 2024/03/30 02:52:33.163 [INFO]  user "edgepillar" (1560869066) added to approved users
tg-spam  | 2024/03/30 03:31:20.490 [INFO]  user "Yourself_ZNN" (6172198354) added to approved users
tg-spam  | 2024/03/30 06:07:12.981 [INFO]  user "triplea_z" (482795147) added to approved users
tg-spam  | 2024/03/30 07:08:32.013 [INFO]  remove aproved user: 6017513172
tg-spam  | 2024/03/30 07:08:32.233 [WARN]  failed to process direct spam report: failed to update spam for "": can't update spam samples: can't update spam samples: failed to open /srv/data/spam-dynamic.txt: open /srv/data/spam-dynamic.txt: no such file or directory
tg-spam  | 2024/03/30 07:22:10.251 [INFO]  remove aproved user: 6017513172
tg-spam  | 2024/03/30 07:22:10.460 [WARN]  failed to process direct spam report: failed to update spam for "": can't update spam samples: can't update spam samples: failed to open /srv/data/spam-dynamic.txt: open /srv/data/spam-dynamic.txt: no such file or directory
tg-spam  | 2024/03/30 07:22:44.962 [INFO]  remove aproved user: 6017513172
tg-spam  | 2024/03/30 07:22:45.178 [WARN]  failed to process direct spam report: failed to update spam for "": can't update spam samples: can't update spam samples: failed to open /srv/data/spam-dynamic.txt: open /srv/data/spam-dynamic.txt: no such file or directory

Here is the docker-compose.yml file. We are also getting a "can't write to log, can't make directories for new logfile: mkdir /srv/logs: permission denied" error. Another discussion says this folder needs to be created, and it has been: there is a logs folder in the same folder as docker-compose.yml.

services:
  tg-spam:
    image: umputun/tg-spam:latest
    hostname: tg-spam
    restart: always
    container_name: tg-spam
    user: "1000:1000" # set uid:gid to host user to avoid permission issues with mounted volumes
    logging: &default_logging
      driver: json-file
      options:
        max-size: "10m"
        max-file: "5"
    environment:
      - TZ=America/Chicago
      - TELEGRAM_TOKEN=6901977077:AAEcZdP4i9nFu4yak2uUNzq9S0VZJB7QYRE
      - TELEGRAM_GROUP=zenonnetwork
      - ADMIN_GROUP=-1001991632774 # admin group id
      - LOGGER_ENABLED=true     # enable logging
      - LOGGER_FILE=/srv/logs/tg-spam.log
      - LOGGER_MAX_SIZE=5M      # max log file size in megabytes before rotation
      - NO_SPAM_REPLY=true      # do not reply to spam messages in the public group
      - SERVER_ENABLED=false    # enable server, default port is 8080
      - OPENAI_TOKEN=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
    volumes:
      - ./logs:/srv/log
      - data:/srv/data
    command: --openai.veto

volumes:
  data:

Deletion of spam sample duplicates

If, in the admin chat, you confirm the ban of several users who posted the same message, that message is added to the samples twice. When you then try to remove a duplicate from the sample database, both lines are deleted. This is not a problem as long as you remember it; it is a side effect of keeping the sample database in a text file.

I'll switch to Russian, if the author doesn't mind.
I would suggest storing the samples some other way, but I have no ideas other than a file-based store. Besides, I don't know much about Golang.
Honestly, I think the author didn't expect there to be so many spam samples =)
It might be worth adding some kind of duplicate cleanup for the database in the future, or a check for an identical existing entry when a sample is added, because it is impossible to remember what has already been added and what has not.
Thanks for your attention =)
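
Both ideas from this report (a one-off cleanup and a check on add) can be sketched in a few lines of Go. This is only an illustration under assumptions: the names `normalize`, `dedupe` and `addIfMissing` are hypothetical and not taken from tg-spam, samples are assumed to be stored one per line, and reading and writing the samples file is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize collapses all whitespace runs (including newlines) into single
// spaces so that a sample always fits on one line of the samples file.
func normalize(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// dedupe returns the samples with empty and repeated entries removed,
// keeping the first occurrence of each entry and preserving order.
func dedupe(samples []string) []string {
	seen := make(map[string]struct{}, len(samples))
	out := make([]string, 0, len(samples))
	for _, s := range samples {
		n := normalize(s)
		if n == "" {
			continue
		}
		if _, ok := seen[n]; ok {
			continue
		}
		seen[n] = struct{}{}
		out = append(out, n)
	}
	return out
}

// addIfMissing appends sample unless an identical entry already exists.
// The boolean reports whether the sample was actually added.
func addIfMissing(samples []string, sample string) ([]string, bool) {
	n := normalize(sample)
	if n == "" {
		return samples, false
	}
	for _, s := range samples {
		if normalize(s) == n {
			return samples, false
		}
	}
	return append(samples, n), true
}

func main() {
	samples := dedupe([]string{"buy crypto now", "buy  crypto now", "join our team"})
	samples, added := addIfMissing(samples, "join our team")
	fmt.Println(len(samples), added)
}
```

With a check like this on the add path, confirming two bans for the same text would leave a single line in the file, so removing that sample later would no longer delete more lines than expected.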

Add support of auto-generated spam and ham files

In addition to static samples, the system should maintain and use another pair of files that it updates by itself:

dynamic-spam.txt - every message detected as spam with high probability is added to this file. High probability means that multiple filters reacted, CAS reported the user, or (maybe) the emoji filter triggered
dynamic-ham.txt - every message that admins unbanned is added to this file
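
The high-probability rule could look roughly like the sketch below. It assumes that each check yields a name and a spam flag, as in the `{name: ..., spam: ...}` entries the bot logs. The `checkResult` type, the `highProbability` function and the `emojiCounts` switch are hypothetical names, and the threshold of two checks is one reading of "multiple filters reacted".

```go
package main

import "fmt"

// checkResult mirrors the {name, spam} pairs seen in the bot's log lines.
// The type itself is an assumption made for this sketch.
type checkResult struct {
	Name string
	Spam bool
}

// highProbability reports whether a message should go to the dynamic spam
// file: CAS flagged it, the emoji check flagged it (only if emojiCounts is
// set, since the proposal marks this as "maybe"), or at least two checks
// flagged it.
func highProbability(results []checkResult, emojiCounts bool) bool {
	flagged := 0
	for _, r := range results {
		if !r.Spam {
			continue
		}
		flagged++
		if r.Name == "cas" {
			return true
		}
		if r.Name == "emoji" && emojiCounts {
			return true
		}
	}
	return flagged >= 2
}

func main() {
	// only the classifier reacted: not enough for the dynamic file
	single := []checkResult{{"stopword", false}, {"classifier", true}, {"cas", false}}
	// the classifier and a second check both reacted
	double := []checkResult{{"classifier", true}, {"cas", false}, {"openai", true}}
	fmt.Println(highProbability(single, false), highProbability(double, false))
}
```

Under this rule a message flagged by the classifier alone would still be treated as spam but would not be recorded as a new sample, which limits the damage from a single false positive.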

Add a http port to an unban link

When the bot generates an unban link for an admin, it uses only the domain name, assuming that a frontend in front of the bot forwards the request to it as a backend. It would be useful to have an additional public port parameter for the case where the standalone bot has no frontend in front of it.
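
A minimal sketch of how such a parameter might be applied when the link is built. The function name `unbanURL`, the `publicPort` parameter, and the `/unban?user=...` path and query are assumptions for illustration only and are not taken from the bot's code.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strconv"
	"strings"
)

// unbanURL builds an unban link from a base URL. When publicPort is
// greater than zero it replaces whatever port the base URL carries;
// otherwise the base URL is used as is.
func unbanURL(base string, publicPort int, userID int64) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parse base url: %w", err)
	}
	if u.Scheme == "" || u.Hostname() == "" {
		return "", fmt.Errorf("base url %q must include scheme and host", base)
	}
	if publicPort > 0 {
		u.Host = net.JoinHostPort(u.Hostname(), strconv.Itoa(publicPort))
	}
	u.Path = strings.TrimRight(u.Path, "/") + "/unban" // hypothetical path
	q := url.Values{}
	q.Set("user", strconv.FormatInt(userID, 10)) // hypothetical query key
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	link, err := unbanURL("https://example.com", 8080, 777000)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(link)
}
```

Leaving the port at zero keeps the current behaviour, so setups that do have a frontend would not need to change anything.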

Problem with spam messages containing pictures

Today I had a case where the bot didn't delete a spam message forwarded to the admin group. I'm not sure whether this problem is specific to training mode or general.
It looks like the bot didn't get the message text.
The SHA-256 hash e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 is the hash of an empty string.
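
The claim about the hash is easy to check with the Go standard library. The helper name `msgHash` is made up for this sketch, and how the bot itself computes the hash is not shown here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// msgHash returns the hex-encoded SHA-256 digest of the given text.
func msgHash(text string) string {
	sum := sha256.Sum256([]byte(text))
	return hex.EncodeToString(sum[:])
}

func main() {
	const fromLog = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
	// true means the hash in the log is the hash of an empty message text
	fmt.Println(msgHash("") == fromLog)
}
```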

Here are the logs:

2023/12/28 09:03:24.411 [DEBUG] {events/listener.go:238 events.(*TelegramListener).isAdminChat} message in admin chat -4058117643, from pretty_friend
2023/12/28 09:03:24.411 [DEBUG] {events/admin.go:58 events.(*admin).MsgHandler} message from admin chat: msg id: 401, update id: 84145103, from: pretty_friend, sender: ""
2023/12/28 09:03:24.411 [DEBUG] {events/admin.go:69 events.(*admin).MsgHandler} forwarded message from superuser "pretty_friend" to admin chat -4058117643: ""
2023/12/28 09:03:24.412 [DEBUG] {storage/locator.go:131 storage.(*Locator).Message} failed to find message by hash "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855": sql: no rows in result set
2023/12/28 09:03:24.412 [WARN]  {events/listener.go:115 events.(*TelegramListener).Do} failed to process admin chat message: not found "" in locator
2023/12/28 09:03:24.413 [DEBUG] {events/listener.go:272 events.(*TelegramListener).sendBotResponse} bot response - error: not found "" in locator, reply-to:0

Here is the message, maybe it will be useful for a test case:

💎🎄ГОTОВЬСЯ K HOBОMУ ΓOДᎩ С НОBЫМ ДОХОДOМ🎄💎

❗️𝖬ecтa огрaничeны❗️ 

💰Πpисoединяйcя к нaш𝚎й кoманде в сф𝚎p𝚎 цифpoвых активоⲃ💰

🔥Тeбя ждет нoвoe интереcнoе нaпpaⲃлeние, пacивный yдaл𝚎нный доxoд oт 10 000 ₽₽₽/день, и бecплaтноe обyчeние🧑‍💻💸

Bcё, чтo нyжнo – вoзpаcт от 18 л𝚎т, смаpтфон и 1-2 часа ⲃремени ⲃ день
❗️Всe л𝚎гально и б𝚎з пр𝚎доплaт❗️

Давaй встречать Нoвый гoд c увеpеннocтью ⲃ cвоем Փинaнcовом yспехe! 💰🚀

‼️Пиши напрямyю pукоⲃодителю 👉 @ser______ratov

Here is what it looks like:

I suppose it may be because of the picture. As far as I know, Telegram sends a picture as a separate message and then glues it together with the text.
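
One possible explanation, offered as an assumption rather than a confirmed diagnosis: in the Telegram Bot API a photo message carries its text in the `caption` field, while `text` is only set for plain text messages. A handler that reads only `text` would then see an empty string for a picture with a caption. The `message` type and `messageText` helper below are hypothetical and only sketch the fallback.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// message holds just the two Bot API fields relevant here.
type message struct {
	Text    string `json:"text"`
	Caption string `json:"caption"`
}

// messageText returns the text of a message, falling back to the caption
// for media messages such as photos.
func messageText(m message) string {
	if m.Text != "" {
		return m.Text
	}
	return m.Caption
}

func main() {
	// a photo message: no "text" field, the text lives in "caption"
	raw := []byte(`{"caption":"limited offer, write to the manager"}`)
	var m message
	if err := json.Unmarshal(raw, &m); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("text=%q effective=%q\n", m.Text, messageText(m))
}
```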

Originally posted by @alehano in #17 (comment)

UPD: Today I had a second case after forwarding a message to the admin group.

Warnings in logs

During the ban-unban cycle I see the following messages in the log:

tg-spam  | 2024/02/14 08:06:41.823 [INFO]  user Мария Свиткова detected as spammer: {name: stopword, spam: false, details: not found}, {name: emoji, spam: false, details: 0/2}, {name: similarity, spam: false, details: 0.04/0.50}, {name: classifier, spam: true, details: probability of spam: 61.25%}, {name: cas, spam: false, details: record not found}, "не переживай) из 5 северных сияний, которые были у нас в городе, я проспала 2. И еще 2 были видны из любого села по соседству, но не у нас, потому что над нашим стояли очередные облака"
tg-spam  | 2024/02/14 08:06:41.833 [INFO]  detected spam entry added for user_id:370236309, name:irda_noire
tg-spam  | 2024/02/14 08:06:41.877 [INFO]  {370236309 irda_noire Мария Свиткова} banned by bot for 9600h0m0s
tg-spam  | 2024/02/14 08:06:53.479 [WARN]  failed to send message as markdown, Bad Request: can't parse entities: Can't find end of the entity starting at byte offset 34
tg-spam  | 2024/02/14 08:06:57.660 [INFO]  add aproved user: id:370236309, name:"irda_noire"
tg-spam  | 2024/02/14 08:06:57.663 [INFO]  user "irda_noire" (370236309) added to approved users
tg-spam  | 2024/02/14 08:06:57.667 [WARN]  failed to send message as markdown, Bad Request: can't parse entities: Can't find end of the entity starting at byte offset 603
tg-spam  | 2024/02/14 08:06:57.778 [INFO]  user unbanned, chatID: -1002096077129, userID: 370236309:66269, orig: "permanently banned {370236309 irda_noire Мария Свиткова}\n\nне переживай) из 5 северных сияний, которые были у нас в городе, я проспала 2. И еще 2 были видны из любого села по соседству, но не у нас, потому что над нашим стояли очередные облака\n\n**spam detection results**\n- stopword: ham, not found\n- emoji: ham, 0/2\n- similarity: ham, 0.04/0.50\n- classifier: spam, probability of spam: 61.25%\n- cas: ham, record not found"

Lines 4 and 7 are of particular interest: it looks like the bot tries to send a message and fails. Perhaps it then resends the message?
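
A possible cause, stated as an assumption: in the notification text quoted in the last log line, the underscore in irda_noire sits at byte offset 34, which matches the first warning. In Telegram's legacy Markdown mode an unescaped `_` opens an italic entity, and the Bot API documentation says that `_`, `*`, the backtick and `[` must be escaped with a backslash outside of entities. The helpers below are a sketch with hypothetical names, not the bot's code.

```go
package main

import (
	"fmt"
	"strings"
)

// markdownSpecial lists the characters that legacy Telegram Markdown
// requires to be escaped outside of entities.
const markdownSpecial = "_*`["

// firstSpecialOffset returns the byte offset of the first special
// character in s, or -1 if there is none.
func firstSpecialOffset(s string) int {
	return strings.IndexAny(s, markdownSpecial)
}

// escapeMarkdown prefixes every special character with a backslash.
func escapeMarkdown(s string) string {
	return strings.NewReplacer(
		"_", `\_`,
		"*", `\*`,
		"`", "\\`",
		"[", `\[`,
	).Replace(s)
}

func main() {
	msg := "permanently banned {370236309 irda_noire Мария Свиткова}"
	fmt.Println(firstSpecialOffset(msg))
	fmt.Println(escapeMarkdown("irda_noire"))
}
```

Escaping only the user-supplied parts (user names and the quoted message) before they are inserted into the template would keep intentional formatting, such as the bold "spam detection results" header, intact.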

Make it possible to run from AWS Lambda

Lambda seems to be a nice fit for this type of service. Technically, it is not trivial but should be doable, as Telegram provides webhooks that the bot can register, and the Lambda function will be invoked through them. In addition, all the persistent parts should be adjusted to support EBS, EFS, or DynamoDB, whichever is cheaper.

I don't have any need for such a mode personally; however, if someone can submit a PR with minimal disruption to the current functionality and without much new complexity, I'll merge it.
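
In webhook mode each invocation receives a single Telegram update as a JSON body instead of polling for updates. The core of such a handler, independent of any Lambda runtime library, might look like the sketch below. The `update` type and the `parseUpdate` function are hypothetical names, the field names follow the Bot API update format, and the wiring to Lambda and to the bot's spam checks is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// update holds the subset of a Telegram update needed for spam checks.
type update struct {
	UpdateID int64 `json:"update_id"`
	Message  *struct {
		Text string `json:"text"`
		Chat struct {
			ID int64 `json:"id"`
		} `json:"chat"`
	} `json:"message"`
}

// parseUpdate decodes one webhook request body. ok is false when the
// update carries no message (for example an edited post or a callback).
func parseUpdate(body []byte) (chatID int64, text string, ok bool, err error) {
	var u update
	if err := json.Unmarshal(body, &u); err != nil {
		return 0, "", false, fmt.Errorf("decode update: %w", err)
	}
	if u.Message == nil {
		return 0, "", false, nil
	}
	return u.Message.Chat.ID, u.Message.Text, true, nil
}

func main() {
	body := []byte(`{"update_id":1,"message":{"chat":{"id":-1001226560034},"text":"hello"}}`)
	chatID, text, ok, err := parseUpdate(body)
	fmt.Println(chatID, text, ok, err)
}
```

Because every invocation starts without in-memory state, anything the bot currently keeps in local files (samples, approved users, the message locator) would have to be loaded from the chosen persistent store on each call.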