
twitter-korean-text's Introduction

twitter-korean-text

[//]: # (Travis has been deactivated: Build Status)

An open-source Korean text processor built by Twitter

A Scala library for processing Korean text, with a Java wrapper. twitter-korean-text currently provides Korean normalization, tokenization, stemming, and phrase extraction. Please join our community at the Google Forum. This processor is not limited to short tweet texts.

A Korean text processor written in Scala. It currently supports text normalization, tokenization (morphological analysis), and stemming, and it can handle long-form text as well as short tweets. If you would like to take part in development, please join the Google Forum. Everyone is welcome, from beginners who want to learn how to use the library to those who want to contribute code.

The goal of twitter-korean-text is to extract index terms from big data and similar sources through lightweight Korean processing; it does not aim to be a full-scale morphological analyzer.

twitter-korean-text supports four features: normalization, tokenization, stemming, and phrase extraction.

Normalization (입니닼ㅋㅋ -> 입니다 ㅋㅋ, 샤릉해 -> 사랑해)

  • 한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ -> 한국어를 처리하는 예시입니다 ㅋㅋ

Tokenization

  • 한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어(Noun), 를(Josa), 처리(Noun), 하는(Verb), 예시(Noun), 입(Adjective), 니다(Eomi), ㅋㅋ(KoreanParticle)

Stemming (입니다 -> 이다)

  • 한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어(Noun), 를(Josa), 처리(Noun), 하다(Verb), 예시(Noun), 이다(Adjective), ㅋㅋ(KoreanParticle)

Phrase extraction

  • 한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어, 처리, 예시, 처리하는 예시

Introductory Presentation: Google Slides

Try it here

Gunja Agrawal kindly created a test API webpage for this project: http://gunjaagrawal.com/langhack/

Open-sourced here: twitter-korean-tokenizer-api

API

scaladoc

mavendoc

Maven

To include this in your Maven-based JVM project, add the following lines to your pom.xml:

  <dependency>
    <groupId>com.twitter.penguin</groupId>
    <artifactId>korean-text</artifactId>
    <version>4.4</version>
  </dependency>

The Maven site is available at http://twitter.github.io/twitter-korean-text/ and the scaladocs at http://twitter.github.io/twitter-korean-text/scaladocs/.
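
If you build with sbt instead of Maven, the same artifact coordinates should work as an ordinary library dependency. A minimal sketch based on the Maven coordinates above (adjust if a Scala-suffixed artifact is published):

  libraryDependencies += "com.twitter.penguin" % "korean-text" % "4.4"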

Support for other languages.

.NET

modamoda kindly offered a .NET wrapper: https://github.com/modamoda/TwitterKoreanProcessorCS

node.js

Ch0p kindly offered a node.js wrapper: twtkrjs

Youngrok Kim kindly offered a node.js wrapper: node-twitter-korean-text

Python

Baeg-il Kim kindly offered a Python version: https://github.com/cedar101/twitter-korean-py

Jaepil Jeong kindly offered a Python wrapper: https://github.com/jaepil/twkorean

  • The Python Korean NLP project KoNLPy now includes twitter-korean-text (via the twkorean wrapper), making it easy to use from Python.

Ruby

jun85664396 kindly offered a Ruby wrapper: twitter-korean-text-ruby

  • This provides access to com.twitter.penguin.korean.TwitterKoreanProcessorJava (Java wrapper).

Jaehyun Shin kindly offered a Ruby wrapper: twitter-korean-text-ruby

  • This provides access to com.twitter.penguin.korean.TwitterKoreanProcessor (Original Scala Class).

Elasticsearch

socurites's Korean analyzer for elasticsearch based on twitter-korean-text: tkt-elasticsearch

Get the source

Clone the git repository and build it with Maven:

git clone https://github.com/twitter/twitter-korean-text.git
cd twitter-korean-text
mvn compile

Open 'pom.xml' from your favorite IDE.

Usage

You can find these examples in the examples folder.

from Scala

import com.twitter.penguin.korean.TwitterKoreanProcessor
import com.twitter.penguin.korean.phrase_extractor.KoreanPhraseExtractor.KoreanPhrase
import com.twitter.penguin.korean.tokenizer.KoreanTokenizer.KoreanToken

object ScalaTwitterKoreanTextExample {
  def main(args: Array[String]) {
    val text = "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ #한국어"

    // Normalize
    val normalized: CharSequence = TwitterKoreanProcessor.normalize(text)
    println(normalized)
    // 한국어를 처리하는 예시입니다ㅋㅋ #한국어

    // Tokenize
    val tokens: Seq[KoreanToken] = TwitterKoreanProcessor.tokenize(normalized)
    println(tokens)
    // List(한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하는(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 입니(Adjective: 12, 2), 다(Eomi: 14, 1), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4))

    // Stemming
    val stemmed: Seq[KoreanToken] = TwitterKoreanProcessor.stem(tokens)

    println(stemmed)
    // List(한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4))

    // Phrase extraction
    val phrases: Seq[KoreanPhrase] = TwitterKoreanProcessor.extractPhrases(tokens, filterSpam = true, enableHashtags = true)
    println(phrases)
    // List(한국어(Noun: 0, 3), 처리(Noun: 5, 2), 처리하는 예시(Noun: 5, 7), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4))
  }
}

from Java

import java.util.List;

import scala.collection.Seq;

import com.twitter.penguin.korean.TwitterKoreanProcessor;
import com.twitter.penguin.korean.TwitterKoreanProcessorJava;
import com.twitter.penguin.korean.phrase_extractor.KoreanPhraseExtractor;
import com.twitter.penguin.korean.tokenizer.KoreanTokenizer;

public class JavaTwitterKoreanTextExample {
  public static void main(String[] args) {
    String text = "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ #한국어";

    // Normalize
    CharSequence normalized = TwitterKoreanProcessorJava.normalize(text);
    System.out.println(normalized);
    // 한국어를 처리하는 예시입니다ㅋㅋ #한국어


    // Tokenize
    Seq<KoreanTokenizer.KoreanToken> tokens = TwitterKoreanProcessorJava.tokenize(normalized);
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaStringList(tokens));
    // [한국어, 를, 처리, 하는, 예시, 입니, 다, ㅋㅋ, #한국어]
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList(tokens));
    // [한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하는(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 입니(Adjective: 12, 2), 다(Eomi: 14, 1), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4)]


    // Stemming
    Seq<KoreanTokenizer.KoreanToken> stemmed = TwitterKoreanProcessorJava.stem(tokens);
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaStringList(stemmed));
    // [한국어, 를, 처리, 하다, 예시, 이다, ㅋㅋ, #한국어]
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList(stemmed));
    // [한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4)]


    // Phrase extraction
    List<KoreanPhraseExtractor.KoreanPhrase> phrases = TwitterKoreanProcessorJava.extractPhrases(tokens, true, true);
    System.out.println(phrases);
    // [한국어(Noun: 0, 3), 처리(Noun: 5, 2), 처리하는 예시(Noun: 5, 7), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4)]

  }
}

Basics

TwitterKoreanProcessor.scala is the central object that provides the interface for all the features.

Running Tests

mvn test will run all of the unit tests.

Tools

We provide tools for quality assurance and test resources. They can be found under src/main/scala/com/twitter/penguin/korean/qa and src/main/scala/com/twitter/penguin/korean/tools.

Contribution

Refer to the general contribution guide. A project-specific contribution guide will be added later.

Detailed guide to installing and modifying the project

Performance

Tested on an Intel i7 at 2.3 GHz.

Initial loading time: 2-4 seconds

Average parsing time per chunk (eojeol): 0.12 ms

Tweets (average length ~50 characters)

Tweets             100K    200K    300K    400K    500K    600K    700K    800K    900K    1M
Time in seconds    57.59   112.09  165.05  218.11  270.54  328.52  381.09  439.71  492.94  542.12
Average per tweet: 0.54212 ms
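
These numbers can be approximated with a simple timing loop. Below is a minimal, hypothetical sketch; the sample corpus and repetition count are placeholders, not the setup used for the table above:

  import com.twitter.penguin.korean.TwitterKoreanProcessor

  object TokenizeBenchmark {
    def main(args: Array[String]): Unit = {
      // Placeholder corpus: substitute real tweets (~50 chars each) for a meaningful number.
      val tweets = Seq.fill(100000)("한국어를 처리하는 예시입니다 ㅋㅋ #한국어")

      val start = System.nanoTime()
      tweets.foreach { t =>
        TwitterKoreanProcessor.tokenize(TwitterKoreanProcessor.normalize(t))
      }
      val elapsedMs = (System.nanoTime() - start) / 1e6

      println(f"total: $elapsedMs%.2f ms, average per tweet: ${elapsedMs / tweets.size}%.5f ms")
    }
  }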

Benchmark test by KoNLPy


From http://konlpy.org/ko/v0.4.2/morph/

Author(s)

License

Copyright 2014 Twitter, Inc.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

twitter-korean-text's People

Contributors

alanbato, bistros, caniszczyk, hohyon-ryu, jhsbeat, jun85664396, keepcosmos, readmecritic, retrieverjo, rokoroku, stray-leone


twitter-korean-text's Issues

Add adjective/verb stemming

Instead of full adj/verb analysis, we can have this simple stemming, which is more useful for search:

새로운 스테밍을 추가했었다. -> 새롭다 + 스테밍 + 을 + 추가 + 하다
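
Stemming of this kind is now exposed through the stem step shown in the Usage section. A minimal sketch using this issue's example sentence (the output shown is approximate and also includes space and punctuation tokens):

  import com.twitter.penguin.korean.TwitterKoreanProcessor

  val tokens  = TwitterKoreanProcessor.tokenize("새로운 스테밍을 추가했었다.")
  val stemmed = TwitterKoreanProcessor.stem(tokens)
  println(stemmed.map(_.text).mkString(" + "))
  // roughly: 새롭다 + 스테밍 + 을 + 추가 + 하다 (plus space/punctuation tokens)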

Add email regex

twitter-text/java/src/com/twitter/Regex.java is missing an email regex.
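
For illustration only, here is a minimal (deliberately permissive) email pattern in Scala; this is a sketch of the kind of pattern that could be added, not the twitter-text implementation:

  // Simplified email pattern; production-grade email matching is considerably more involved.
  val SimpleEmail = """[\w.+-]+@[\w-]+(\.[\w-]+)+""".r

  SimpleEmail.findAllIn("문의는 penguin@example.com 으로 보내주세요").toList
  // List(penguin@example.com)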

Hashtag classifying issue

Hello,

When certain special characters precede a hashtag, as in -#hash or "#hash, the leading character gets included in the hashtag token (for example pos=Hashtag, text="hash). On Twitter itself, only the #hash part is treated as a hashtag. This is probably an issue on the twitter-text side, but since I am not sure of the exact structure, I am filing it on this repo.

slf4j-nop needs a non-default scope

Not sure how this is being used, but it shouldn't be included in the library's default scope, because it will become present on the referencing project's classpath, where it will compete with other logging implementations.

Please move this dependency to a non-default scope, such as test:
http://www.slf4j.org/faq.html#maven2
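
A sketch of the suggested pom.xml change, assuming the dependency is currently declared without an explicit scope (the version element is left as it is in the project):

  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-nop</artifactId>
    <version><!-- existing version --></version>
    <scope>test</scope>
  </dependency>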

Cannot check out the repository on Windows

The repository cannot be checked out on Windows; git fails with:

fatal: cannot create directory at 'src/main/resources/com/twitter/penguin/korean/util/aux': Invalid argument

I think this is because 'aux' is a reserved device name on Windows and cannot be used as a directory name in a path.

Current API offset/length are confusing: Decouple 3 components of tokenize()

Normalization, tokenization, and stemming should be separate components.

Context:

From Hongju Lee and Zhigang Qi

The offsets become impractical when a token (e.g. "입니닼") is transformed into its root form ("이다") during token splitting and POS tagging. Strictly speaking, such a transformation is neither POS tagging nor tokenizing but lemmatizing (or stemming). Putting all of these different components behind one interface that outputs the same data structure makes the API easier to use, but it also causes confusion.

In my opinion, alongside the current convenient interface, it would be better to expose the three components as separate runnables so that they can be chained together, especially when there is more than one option for each component.
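
For reference, the three steps can already be invoked and chained explicitly through the current Scala API, as in the Usage section above; a minimal sketch:

  import com.twitter.penguin.korean.TwitterKoreanProcessor

  val normalized = TwitterKoreanProcessor.normalize("한국어를 처리하는 예시입니닼ㅋㅋ")
  val tokens     = TwitterKoreanProcessor.tokenize(normalized)
  val stemmed    = TwitterKoreanProcessor.stem(tokens)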

How does the offset() method behave when the stemmer and normalizer are enabled?

In the KoreanTokenJava class, the result of the offset() method looks wrong.
Input:
"한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ"

Output:
[한국어(Noun: 0, 3), 를(Josa: 3, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2), 한국어(Noun: 17, 3), 를(Josa: 20, 1), 처리(Noun: 22, 2), 하다(Verb: 24, 2), 예시(Noun: 27, 2), 이다(Adjective: 29, 3), ㅋㅋ(KoreanParticle: 32, 2)]

TwitterKoreanProcessorJava.java#L94

TwitterKoreanProcessorJava.java#L94: this line returns a Scala collection, and iterating over the result from Java is unbearably slow. Is there a way to make iterating over the result of tokenizeToStrings(CharSequence text) faster in Java?

Maybe this is not a problem for Scala users, but it is one for Java users.
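
One possible workaround, sketched below with the conversion helper already shown in the Usage section, is to convert the Scala Seq into a java.util.List once and then iterate over that on the Java side (whether this helps depends on where the cost actually is):

  import java.util.List;

  import scala.collection.Seq;

  import com.twitter.penguin.korean.TwitterKoreanProcessorJava;
  import com.twitter.penguin.korean.tokenizer.KoreanTokenizer;

  public class ConvertOnceExample {
    public static void main(String[] args) {
      String text = "한국어를 처리하는 예시입니다 ㅋㅋ";

      // Convert the Scala Seq to a java.util.List once, then iterate on the Java side.
      Seq<KoreanTokenizer.KoreanToken> tokens = TwitterKoreanProcessorJava.tokenize(text);
      List<String> tokenStrings = TwitterKoreanProcessorJava.tokensToJavaStringList(tokens);
      for (String token : tokenStrings) {
        System.out.println(token);
      }
    }
  }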

Detokenizer throws exception with certain inputs

I've come across a minor bug in the newly added detokenization routine, where some inputs result in a java.lang.UnsupportedOperationException. Example:

com.twitter.penguin.korean.TwitterKoreanProcessor.detokenize(List("", "제품을", "사용하겠습니다"))
// throws java.lang.UnsupportedOperationException: empty.init

It seems like this could be easily fixed by initialising the list to be output differently. For now, I'm circumventing the problem by always prepending an empty string to the input:

com.twitter.penguin.korean.TwitterKoreanProcessor.detokenize(List("", "", "제품을", "사용하겠습니다"))
// works

This is neither critical nor urgent, but it would be nice if it could be fixed in a future release of this great library.

Change the required Scala version to 2.11.5+

Hello,
Thank you for this great library; I am enjoying using it.

In the Tokenize example from the README, calling the TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList method returns a Java List<KoreanTokenJava>.

The KoreanTokenJava class has no getters, so it is hard to access its contents. Because of that, I am implementing my own class and method corresponding to KoreanTokenJava and tokensToJavaKoreanTokenList.

While implementing my version of tokensToJavaKoreanTokenList, I noticed the following code (line 69 of com.twitter.penguin.korean.TwitterKoreanProcessor):

....
public static List<KoreanTokenJava> tokensToJavaKoreanTokenList(Seq<KoreanToken> tokens) {
    Iterator<KoreanToken> tokenized = tokens.iterator();
....

When this code is used under JDK 8 with scala-library-2.10.4.jar, the following error message occurs:

The method iterator() is ambiguous for the type Seq<KoreanTokenizer.KoreanToken>

From what I understand, Scala up to 2.11.4 had a violation of the JVM spec, which was fixed in 2.11.5. (source)

Would it be possible to bump the scala.version specified in pom.xml to 2.11.5 or newer?

If I have misunderstood the cause of the problem, please point it out.

Also, I would like to ask whether there is any way, other than the following, to read tokens that have been tokenized and stored in a Seq (Seq<SomeKindOfToken> tokens) from Java:

List<KoreanTokenJava> list = TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList(tokens);


[Bug] The set of tokenization solutions is sometimes empty.

This was thrown in a scalding job:

Caused by: java.lang.UnsupportedOperationException: empty.minBy
at scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:252)
at scala.collection.AbstractTraversable.minBy(Traversable.scala:104)
at com.twitter.penguin.korean.v2.tokenizer.KoreanTokenizer$.com$twitter$penguin$korean$v2$tokenizer$KoreanTokenizer$$parseKoreanChunk(KoreanTokenizer.scala:244)
at com.twitter.penguin.korean.v2.tokenizer.KoreanTokenizer$$anonfun$tokenize$1.apply(KoreanTokenizer.scala:257)
at com.twitter.penguin.korean.v2.tokenizer.KoreanTokenizer$$anonfun$tokenize$1.apply(KoreanTokenizer.scala:254)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
at scala.collection.immutable.List.flatMap(List.scala:344)
at com.twitter.penguin.korean.v2.tokenizer.KoreanTokenizer$.tokenize(KoreanTokenizer.scala:254)
at com.twitter.penguin.korean.v2.TwitterKoreanProcessor$.tokenize(TwitterKoreanProcessor.scala:49)
at com.twitter.penguin.korean.v2.TwitterKoreanProcessor.tokenize(TwitterKoreanProcessor.scala)
at com.twitter.common_internal.text.tokenizer.korean.TwitterKoreanTextTokenizerV2.reset(TwitterKoreanTextTokenizerV2.java:117)
at com.twitter.common.text.token.TwitterTokenStream.reset(TwitterTokenStream.java:96)
at com.twitter.common_internal.text.tokenizer.TokenStreamSwitcher.reset(TokenStreamSwitcher.java:25)
at com.twitter.common.text.token.TwitterTokenStream.reset(TwitterTokenStream.java:96)
at com.twitter.common.text.token.TokenProcessor.reset(TokenProcessor.java:48)
at com.twitter.common.text.token.TwitterTokenStream.reset(TwitterTokenStream.java:96)
at com.twitter.common.text.token.TokenProcessor.reset(TokenProcessor.java:48)
at com.twitter.common.text.token.TwitterTokenStream.reset(TwitterTokenStream.java:96)
at com.twitter.common.text.combiner.ExtractorBasedTokenCombiner.reset(ExtractorBasedTokenCombiner.java:57)
at com.twitter.common.text.token.TokenProcessor.reset(TokenProcessor.java:48)
at com.twitter.common.text.token.TwitterTokenStream.reset(TwitterTokenStream.java:96)
at com.twitter.common.text.combiner.ExtractorBasedTokenCombiner.reset(ExtractorBasedTokenCombiner.java:57)
at com.twitter.common.text.token.TokenProcessor.reset(TokenProcessor.java:48)
at com.twitter.common.text.token.TwitterTokenStream.reset(TwitterTokenStream.java:96)
at com.twitter.common.text.combiner.ExtractorBasedTokenCombiner.reset(ExtractorBasedTokenCombiner.java:57)
at com.twitter.common.text.token.TokenProcessor.reset(TokenProcessor.java:48)
at com.twitter.common.text.token.TwitterTokenStream.reset(TwitterTokenStream.java:96)
at com.twitter.common.text.combiner.ExtractorBasedTokenCombiner.reset(ExtractorBasedTokenCombiner.java:57)
at com.twitter.common.text.token.TokenProcessor.reset(TokenProcessor.java:48)
at com.twitter.common.text.token.TwitterTokenStream.reset(TwitterTokenStream.java:96)
at com.twitter.common.text.combiner.ExtractorBasedTokenCombiner.reset(ExtractorBasedTokenCombiner.java:57)
at com.twitter.common.text.token.TokenProcessor.reset(TokenProcessor.java:48)
at com.twitter.common.text.token.TwitterTokenStream.reset(TwitterTokenStream.java:96)
at com.twitter.common.text.combiner.ExtractorBasedTokenCombiner.reset(ExtractorBasedTokenCombiner.java:57)
at com.twitter.common.text.token.TokenProcessor.reset(TokenProcessor.java:48)
at com.twitter.common.text.token.TwitterTokenStream.reset(TwitterTokenStream.java:96)
at com.twitter.common.text.token.TokenProcessor.reset(TokenProcessor.java:48)
at com.twitter.common.text.token.TwitterTokenStream.reset(TwitterTokenStream.java:96)
at com.twitter.common.text.combiner.ExtractorBasedTokenCombiner.reset(ExtractorBasedTokenCombiner.java:57)
at com.twitter.common.text.token.TokenProcessor.reset(TokenProcessor.java:48)
at com.twitter.common.text.token.TwitterTokenStream.reset(TwitterTokenStream.java:96)
at com.twitter.common.text.combiner.ExtractorBasedTokenCombiner.reset(ExtractorBasedTokenCombiner.java:57)
at com.twitter.common.text.token.TokenProcessor.reset(TokenProcessor.java:48)
at com.twitter.common.text.token.TwitterTokenStream.reset(TwitterTokenStream.java:96)
at com.twitter.common.text.token.TokenProcessor.reset(TokenProcessor.java:48)
at com.twitter.common.text.token.TwitterTokenStream.reset(TwitterTokenStream.java:96)
at com.twitter.common.text.token.TokenProcessor.reset(TokenProcessor.java:48)
at com.twitter.common.text.token.TwitterTokenStream.reset(TwitterTokenStream.java:96)
at com.twitter.common.text.combiner.ExtractorBasedTokenCombiner.reset(ExtractorBasedTokenCombiner.java:57)
at com.twitter.common.text.token.TokenProcessor.reset(TokenProcessor.java:48)
at com.twitter.common.text.token.TwitterTokenStream.reset(TwitterTokenStream.java:96)
at com.twitter.common.text.token.TokenProcessor.reset(TokenProcessor.java:48)
at com.twitter.common.text.token.TwitterTokenStream.reset(TwitterTokenStream.java:96)
at com.twitter.common.text.token.TokenizedCharSequenceStream.reset(TokenizedCharSequenceStream.java:132)
at com.twitter.common.text.token.TokenizedCharSequence.createFrom(TokenizedCharSequence.java:346)
at com.twitter.common_internal.text.pipeline.TwitterTextTokenizer.tokenize(TwitterTextTokenizer.java:156)
at com.twitter.cortex.util.PenguinTweetTokenizer.tokens(TweetTokenizer.scala:45)
at com.twitter.cortex.util.PenguinTweetTokenizer.hashtags(TweetTokenizer.scala:50)
at com.twitter.cortex.datatypes.Status$.apply(Status.scala:157)

Tokenizer throws exception with certain input

The tokenizer throws an UnsupportedOperationException with the following input:

해쵸쵸쵸쵸쵸쵸쵸쵸춏

It also seems to throw the exception with more than 8 of the '쵸' character in the middle, but doesn't fail with less than 8. Here's a more complete stack trace:

java.lang.UnsupportedOperationException: empty.minBy
    at scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:252)
    at scala.collection.AbstractTraversable.minBy(Traversable.scala:104)
    at com.twitter.penguin.korean.tokenizer.KoreanTokenizer$.com$twitter$penguin$korean$tokenizer$KoreanTokenizer$$parseKoreanChunk(KoreanTokenizer.scala:197)
    at com.twitter.penguin.korean.tokenizer.KoreanTokenizer$$anonfun$tokenize$1.apply(KoreanTokenizer.scala:99)
    at com.twitter.penguin.korean.tokenizer.KoreanTokenizer$$anonfun$tokenize$1.apply(KoreanTokenizer.scala:96)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
    at scala.collection.immutable.List.flatMap(List.scala:344)
    at com.twitter.penguin.korean.tokenizer.KoreanTokenizer$.tokenize(KoreanTokenizer.scala:96)
    at com.twitter.penguin.korean.TwitterKoreanProcessor$.tokenize(TwitterKoreanProcessor.scala:49)
    at com.twitter.penguin.korean.TwitterKoreanProcessor.tokenize(TwitterKoreanProcessor.scala)
    at com.twitter.penguin.korean.TwitterKoreanProcessorJava.tokenize(TwitterKoreanProcessorJava.java:56)

Thanks!
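
Until the empty.minBy case is handled inside the tokenizer, one defensive workaround is to guard the call and fall back to an empty result; a minimal sketch, not part of the library:

  import scala.util.{Failure, Success, Try}

  import com.twitter.penguin.korean.TwitterKoreanProcessor
  import com.twitter.penguin.korean.tokenizer.KoreanTokenizer.KoreanToken

  // Returns an empty token list instead of propagating UnsupportedOperationException.
  def safeTokenize(text: CharSequence): Seq[KoreanToken] =
    Try(TwitterKoreanProcessor.tokenize(text)) match {
      case Success(tokens)                           => tokens
      case Failure(_: UnsupportedOperationException) => Seq.empty
      case Failure(other)                            => throw other
    }

  safeTokenize("해쵸쵸쵸쵸쵸쵸쵸쵸춏")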

Implement detokenization

NLP applications such as natural language generation or machine translation produce morpheme-tokenized text if the underlying models are trained on morpheme-tokenized training material which, with agglutinative languages such as Korean, often has a positive impact on performance. The tokenized system output, however, has to be detokenized to accommodate end users.

It would be great if TwitterKoreanProcessor would implement a method to convert a sequence of tokens back into a string:

val tokens = List("한국어", "를", "처리", "하는", "예시", "입니", "다", "ㅋㅋ", "#한국어")
val segment: String = TwitterKoreanProcessor.detokenize(tokens)
// "한국어를 처리하는 예시입니다ㅋㅋ #한국어"

Basically speaking, adjacent items in tokens should be joined together with either an empty string or whitespace, depending on whether or not they belong to the same word unit.
