- Support Python 2.7, 3.3, 3.4, 3.5.
- Simple interfaces.
- Easy to extend.
pip install text-cleaner
WARNING FOR PYTHON 2.7 USERS: Only UCS-4 build is supported(--enable-unicode=ucs4
), UCS-2 build (see this) is NOT SUPPORTED in the latest version.
from text_cleaner import remove, keep
from text_cleaner.processor.common import ASCII
from text_cleaner.processor.chinese import CHINESE
from text_cleaner.processor.misc import RESTRICT_URL
# remove url and ascii characters.
# return: u'点击 查看 '
remove(
'点击http://t.cn/RtU0mZ1 查看,123456,test',
[RESTRICT_URL, ASCII],
)
# keep chinese characters and url.
# return: u'点击 http://t.cn/RtU0mZ1 查看'
keep(
'点击http://t.cn/RtU0mZ1 查看,123456,test',
[CHINESE, RESTRICT_URL],
)
# use processor directly.
# return: u'点击 查看'
RESTRICT_URL.remove('点击http://t.cn/RtU0mZ1 查看')
# return: u'点击<URL> 查看'
RESTRICT_URL.replace('<URL>').remove('点击http://t.cn/RtU0mZ1 查看')
text_cleaner.remove(text, processors):
- text:
str
orbytes
(unicode
orstr
for Python 2). - processors: iterable of processors. remove invokes
remove
of each processor to handle text.
text_cleaner.keep(text, processors):
- same as remove, but invoke
keep
method of processors instead.
DEFAULT_REPLACE_TEXT: ' '
, single space.
RegexProcessor(regex, replace_text=DEFAULT_REPLACE_TEXT)
- contruct a regex processor for regex, replace unmatched components with replace_text.
- replace(self, new_replace_text): create a new processor, with new replace_text is set.
- remove(self, text): remove all occurences of regex from text.
- keep(self, text): keep only the occurences of regex, remove all unmatched components from text.
- verify(self, text): return True if text match regex, otherwise returns False.
UnicodeRange(begin, end):
- begin: int, the begin of unicode range.
- end: int, the end of unicode range.
UnicodeRangeProcessor(ranges, replace_text=DEFAULT_REPLACE_TEXT)
- subclass of RegexProcessor.
- ranges: iterable of instances of UnicodeRange.
Following processors are defined by UnicodeRange and regex. Read the source code if you are sure about what's going on.
text_cleaner.processor.common
, for common usage:
ALPHA
DIGIT
SYMBOLS_AND_PUNCTUATION
ASCII
ALPHA_EXTENSION
DIGIT_EXTENSION
SYMBOLS_AND_PUNCTUATION_EXTENSION
GENERAL_PUNCTUATION
text_cleaner.processor.misc
, misellanious processors:
URL
RESTRICT_URL
ESCAPED_WHITESPACE
WECHAT_EMOJI_EN
WECHAT_EMOJI_ZHCN
WECHAT_EMOJI
text_cleaner.processor.chinese
, Chinese processing:
CHINESE_CHARACTER
: only common characters.CHINESE
: common characters + symbols and puntuations.CHINESE_ALL
: all CJK characters.CHINESE_EXTENSION
CHINESE_COMPATIBILITY
CHINESE_SYMBOLS_AND_PUNCTUATION
How to define URLs is a complex problem. We provide two choices for our users.
URL
: truncate urls till whitespaces.RESTRICT_URL
: truncate urls till non-whitespace ASCII ([!-~] in the ASCII table)
For Chinese users, we recommend using RESTRICT_URL
.
from text_cleaner.processor.misc import RESTRICT_URL, URL
URL.remove('点击http://t.cn/RtU0mZ1 查看')
# '点击 查看'
URL.remove('点击http://t.cn/RtU0mZ1查看')
# '点击 '
RESTRICT_URL.remove('点击http://t.cn/RtU0mZ1查看')
# '点击 查看'