twitter / twitter-text

Twitter Text Libraries. This code is used at Twitter to tokenize and parse text to meet the expectations for what can be used on the platform.

Home Page: https://developer.twitter.com/en/docs/counting-characters

License: Apache License 2.0

Languages: Shell 0.02%, Ruby 14.66%, HTML 40.41%, CSS 0.89%, Java 12.76%, JavaScript 22.60%, Objective-C 8.26%, Scala 0.39%
Topics: twitter-text, tweet, twitter, unicode, emoji, java, ruby, nodejs, objective-c

twitter-text's Introduction

twitter-text

This repository is a collection of libraries and conformance tests to standardize parsing of Tweet text. It synchronizes development, testing, creating issues, and pull requests for twitter-text's implementations and specification. These libraries are responsible for determining the quantity of characters in a Tweet and identifying and linking any URL, @username, #hashtag, or $cashtag.

See implementations and conformance in this repository below:

Other language implementations

The following implementations exist in other programming languages, but are not supported by or used by Twitter. We'd like to thank the authors for building and maintaining these alternatives.

If you would like to contribute a link to other implementations, please consider sending a Pull Request, or letting us know via the Twitter Developer Community forums.

Copyright and License

Copyright 2012-2020 Twitter, Inc and other contributors

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

twitter-text's People

Contributors

amatsuda, andypiper, bcherry, bigloser, caniszczyk, changok, codemonkey3045, couch, dlamacchia, edengol, eileencodes, geedew, hoverbird, howardr, jakl, jsha, kaushlakers, kennethkufluk, kl-7, kscanne, moz65535, mzsanford, niw, psychs, rcastera, sayrer, sferik, shinypb, sudhee, toddwschneider

twitter-text's Issues

Valid URL is not extracted

Conformance has a test for ipv4 and ipv6 addresses:

text: "http://192.168.0.1/index.html?src=asdf"

However calling extractUrlsWithIndices() doesn't extract those URLs. This example code

var twitter = require('twitter-text');
console.log(twitter.extractUrlsWithIndices('http://192.168.0.1/index.html?src=asdf'));

returns [].

Document options

There is no documentation of the options and how they customize the output. From what I could find, these are the public options:

cashtagClass
cashtagUrlBase
checkUrlOverlap
hashtagClass
hashtagUrlBase
htmlAttrs
htmlEscapeNonEntities
invisibleTagAttrs
linkAttributeBlock
linkTextBlock
listClass
listUrlBase
suppressDataScreenName
suppressLists
suppressNoFollow
symbolTag
targetBlank
textWithSymbolTag
title
urlClass
urlEntities
urlTarget
usernameClass
usernameIncludeSymbol
usernameUrlBase

Wrong URL extracted when in angle brackets and pipe separated text

Here is the case: we have some texts that are copied and pasted from a wiki and use a markup format for putting a link into a page, <URL|description>. The extract method returns the full text inside the angle brackets.
Here is a sample message:
"Please refer to <https://dev1.mycompany.com:8080/d/documents/248647|#248647>"

The url should be: https://dev1.mycompany.com:8080/d/documents/248647
but instead I'm getting: https://dev1.mycompany.com:8080/d/documents/248647|#248647

If you can think of a way of tweaking the regexes to fix this, that would be awesome.
Thanks
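Until the extraction regexes handle this case, one possible workaround is to pull the URL out of the wiki markup before handing the text to the extractor. The following is a minimal sketch under that assumption; `extractWikiLinkUrls` is a hypothetical helper, not part of twitter-text, and it only recognizes the `<URL|description>` shape described above.

```javascript
// Hypothetical helper, not part of twitter-text.
// Collects the URL part of wiki-style <URL|description> markup.
function extractWikiLinkUrls(text) {
  // "<", a URL containing no "|", ">" or whitespace, an optional "|description", ">".
  const pattern = /<(https?:\/\/[^|>\s]+)(?:\|[^>]*)?>/g;
  const urls = [];
  for (const match of text.matchAll(pattern)) {
    urls.push(match[1]);
  }
  return urls;
}

const sample =
  "Please refer to <https://dev1.mycompany.com:8080/d/documents/248647|#248647>";
console.log(extractWikiLinkUrls(sample));
// [ 'https://dev1.mycompany.com:8080/d/documents/248647' ]
```

Text outside angle brackets is ignored by this helper, so it complements the normal extraction rather than replacing it.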

enable protocol in autolinked URLs

I'm using the Java library and am wondering how I can make sure the protocol is included in the extracted href. Not including the protocol in the link can cause issues when a non-HTTPS external URL is linked to from a site with SSL. Is there a way to ensure the protocol is included instead of producing href="//..?
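The idea of forcing a protocol onto a protocol-relative href is language-independent. The sketch below shows it in JavaScript as a post-processing step; `ensureProtocol` and its default of `http:` are assumptions made for illustration, and nothing here reflects an option of the Java library.

```javascript
// Hypothetical helper, not part of twitter-text.
// Prefixes protocol-relative URLs ("//host/path") with an explicit protocol.
function ensureProtocol(href, defaultProtocol = "http:") {
  if (href.startsWith("//")) {
    return defaultProtocol + href;
  }
  // Anything else (absolute URLs, relative paths) is returned unchanged.
  return href;
}

console.log(ensureProtocol("//example.com/page"));        // http://example.com/page
console.log(ensureProtocol("https://example.com/page"));  // https://example.com/page
```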

RTL hashtag support?

Suppose I leave a comment with hashtags like this: #door #porte #porta דלת#.

This is an interesting edge case. The hash tag should be at the beginning of the word, so with the Hebrew word (read Right To Left) the hashtag is technically in the correct position.

However, adding a hash to the right of the RTL word doesn't parse as a hashtag.

Should it?

Unable to install nokogiri 1.5.10

I am unable to install twitter-text due to the gemspec using an old version of nokogiri.

$ bundle install
Fetching gem metadata from http://rubygems.org/.........
Fetching version metadata from http://rubygems.org/..
Resolving dependencies...
Using rake 10.4.2
Using diff-lcs 1.2.5
Using docile 1.1.5
Using json 1.8.3
Using multi_json 1.11.2
Installing nokogiri 1.5.11 (was 1.6.6.4) with native extensions

Gem::Ext::BuildError: ERROR: Failed to build gem native extension.

    current directory: /private/var/folders/7l/398p04d15r14xrvtxpwvcvv00000gp/T/bundler20151121-58319-1tybnxrnokogiri-1.5.11/gems/nokogiri-1.5.11/ext/nokogiri
/System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/bin/ruby -r ./siteconf20151121-58319-4nx6fd.rb extconf.rb
checking for libxml/parser.h... yes
checking for libxslt/xslt.h... yes
checking for libexslt/exslt.h... yes
checking for iconv_open() in iconv.h... no
checking for iconv_open() in -liconv... yes
checking for xmlParseDoc() in -lxml2... no
-----
libxml2 is missing.  please visit http://nokogiri.org/tutorials/installing_nokogiri.html for help with installing dependencies.
-----
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary
libraries and/or headers.  Check the mkmf.log file for more details.  You may
need configuration options.

Provided configuration options:
    --with-opt-dir
    --without-opt-dir
    --with-opt-include
    --without-opt-include=${opt-dir}/include
    --with-opt-lib
    --without-opt-lib=${opt-dir}/lib
    --with-make-prog
    --without-make-prog
    --srcdir=.
    --curdir
    --ruby=/System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/bin/ruby
    --with-zlib-dir
    --without-zlib-dir
    --with-zlib-include
    --without-zlib-include=${zlib-dir}/include
    --with-zlib-lib
    --without-zlib-lib=${zlib-dir}/lib
    --with-iconv-dir
    --without-iconv-dir
    --with-iconv-include
    --without-iconv-include=${iconv-dir}/include
    --with-iconv-lib
    --without-iconv-lib=${iconv-dir}/lib
    --with-xml2-dir
    --without-xml2-dir
    --with-xml2-include
    --without-xml2-include=${xml2-dir}/include
    --with-xml2-lib
    --without-xml2-lib=${xml2-dir}/lib
    --with-xslt-dir
    --without-xslt-dir
    --with-xslt-include
    --without-xslt-include=${xslt-dir}/include
    --with-xslt-lib
    --without-xslt-lib=${xslt-dir}/lib
    --with-libxslt-config
    --without-libxslt-config
    --with-pkg-config
    --without-pkg-config
    --with-libxml-2.0-config
    --without-libxml-2.0-config
    --with-libiconv-config
    --without-libiconv-config
    --with-iconvlib
    --without-iconvlib
    --with-xml2lib
    --without-xml2lib

To see why this extension failed to compile, please check the mkmf.log which can be found here:

  /var/folders/7l/398p04d15r14xrvtxpwvcvv00000gp/T/bundler20151121-58319-1tybnxrnokogiri-1.5.11/extensions/universal-darwin-15/2.0.0/nokogiri-1.5.11/mkmf.log

extconf failed, exit code 1

Gem files will remain installed in /var/folders/7l/398p04d15r14xrvtxpwvcvv00000gp/T/bundler20151121-58319-1tybnxrnokogiri-1.5.11/gems/nokogiri-1.5.11 for inspection.
Results logged to /var/folders/7l/398p04d15r14xrvtxpwvcvv00000gp/T/bundler20151121-58319-1tybnxrnokogiri-1.5.11/extensions/universal-darwin-15/2.0.0/nokogiri-1.5.11/gem_make.out
An error occurred while installing nokogiri (1.5.11), and Bundler cannot continue.
Make sure that `gem install nokogiri -v '1.5.11'` succeeds before bundling.

This looks like the same issue as nomad/cupertino issue #161 which was resolved by updating the nokogiri dependency to 1.6.3. Actually it looks like @jakl was already using nokogiri 1.6.3 in the latest Gemfile.lock. Can we update twitter-text.gemspec?

Remove hashbangs from search urls

For example:

DEFAULT_CASHTAG_URL_BASE = "https://twitter.com/#!/search?q=%24";

Would become:

DEFAULT_CASHTAG_URL_BASE = "https://twitter.com/search?q=%24";

java.lang.ExceptionInInitializerError

Using the latest Java library com.twitter:twitter-text:1.11.0 I'm getting the following crash when calling new Extractor().extractEntitiesWithIndices(tweet).

java.lang.ExceptionInInitializerError
    at com.twitter.Extractor.extractURLsWithIndices(Extractor.java:297)
    at com.twitter.Extractor.extractEntitiesWithIndices(Extractor.java:156)

Line 297: Matcher matcher = Regex.VALID_URL.matcher(text);

So there might be an issue compiling the URL regex pattern?
The exception is not thrown when using the old 1.9.9 version.

Mentions don't include @ character

When autoLink-ing tweets, the @ sign is not included in the mention link.

On twitter.com, the @ sign is wrapped in an <s> HTML tag so you can style it with a matching color. Is there any way to achieve this?

Tweet Tokenization Code

I would like to suggest adding Tweet tokenization code to your library. Tweet tokenization is a common task for people who work on Twitter data analysis. Existing libraries for tokenizing English are not sufficient for Twitter data because of the noisy language being used. There is a library that implements tweet tokenization efficiently, but unfortunately it is available under the GPLv2 license ( https://github.com/brendano/ark-tweet-nlp ), unlike your Apache License.

It would be great if you could add a simple tokenizer that treats hashtags, @mentions, URLs, punctuation, and words as individual tokens rather than just splitting the string on spaces.
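As a rough illustration of the kind of tokenizer requested here, the sketch below uses a single alternation regex. It is a naive assumption-laden example: `tokenizeTweet` is a hypothetical name, the patterns are ASCII-oriented simplifications, and they do not follow twitter-text's conformance rules for URLs, mentions, or hashtags.

```javascript
// Naive sketch, not part of twitter-text.
// Alternatives are tried in order: URL, @mention, #hashtag, word, punctuation.
const TOKEN_PATTERN = /https?:\/\/\S+|@\w+|#\w+|\w+(?:'\w+)?|[^\s\w]/g;

function tokenizeTweet(text) {
  // match() with a global regex returns all matches, or null when there are none.
  return text.match(TOKEN_PATTERN) || [];
}

console.log(tokenizeTweet("Loving #nlp, thanks @someone! http://example.com/a?b=1"));
// [ 'Loving', '#nlp', ',', 'thanks', '@someone', '!', 'http://example.com/a?b=1' ]
```

Because the URL alternative consumes every non-space character, punctuation that directly follows a URL ends up inside the URL token; a real tokenizer would need the stricter URL rules the library already encodes.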

/pkg folder

Greetings,

Thanks for developing the package; it's really useful.

I am wondering whether we need to pack the files under /pkg into the npm package when building it. Because every released version of the JS file is archived there, the npm package is quite large, yet we do not use the files under that folder directly. Alternatively, keeping only the current version's file in each package version would be great.

@hankhsiao

autoLink and extractUrlsWithoutProtocol

Is there a reason why you can't extract URLs without a protocol when using auto_link?

  twttr.txt.autoLink = function(text, options) {
    var entities = twttr.txt.extractEntitiesWithIndices(text, {extractUrlsWithoutProtocol: false});
    return twttr.txt.autoLinkEntities(text, entities, options);
  };

Currently it always passes false, with no option to change it.
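One way the requested behavior could look is a thin wrapper that reads the flag from the caller's options. This is a sketch, not the library's code: `autoLinkWithUrlOption` is a hypothetical name, `txt` stands in for the `twttr.txt` object, and only the two functions that appear in the snippet above are called. Removing the flag before forwarding the remaining options is a precaution based on the assumption that unknown options might otherwise be treated as link attributes.

```javascript
// Hypothetical wrapper, not part of twitter-text.
function autoLinkWithUrlOption(txt, text, options = {}) {
  // Split the extraction flag from the options meant for link rendering.
  const { extractUrlsWithoutProtocol = false, ...linkOptions } = options;
  const entities = txt.extractEntitiesWithIndices(text, {
    extractUrlsWithoutProtocol: extractUrlsWithoutProtocol === true,
  });
  return txt.autoLinkEntities(text, entities, linkOptions);
}

// Stand-in object so the sketch runs without the library; it only records
// which extraction options the wrapper passed along.
const recordedOptions = [];
const stubTxt = {
  extractEntitiesWithIndices(text, extractOptions) {
    recordedOptions.push(extractOptions);
    return [];
  },
  autoLinkEntities(text, entities, options) {
    return text;
  },
};

autoLinkWithUrlOption(stubTxt, "see example.com", { extractUrlsWithoutProtocol: true });
autoLinkWithUrlOption(stubTxt, "see example.com");
console.log(recordedOptions);
// [ { extractUrlsWithoutProtocol: true }, { extractUrlsWithoutProtocol: false } ]
```

With the real library, `stubTxt` would be replaced by `twttr.txt`; the default stays `false`, so existing callers would see no change.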

Twitter returning @mentions preceded by @

I wanted to duplicate this here in case an issue would be responded to differently than a pull request.
As shown here:
https://twitter.com/kung_fu_mike/status/570024458640973824
https://twitter.com/StreetGeekEnt/statuses/554800426768277504
The second one also contains this in its API response JSON:
"user_mentions": [
  {
    "screen_name": "NME",
    "name": "NME",
    "id": 19063323,
    "id_str": "19063323",
    "indices": [
      1,
      5
    ]
  }
]

I have a pull request out here:
#34

However I also thought this might be some sort of accidental regression?

twitter-text/rb/lib/twitter-text/rewriter.rb was changed 5 months ago and pulled into 1.12.0 and has a bug in it

Twitter::Rewriter.rewrite_entities was changed on 1/10/2015. This line of code, indices = entity.respond_to?(:indices) ? entity.indices : entity[:indices], does not work as intended (on line 9 and line 15). respond_to? returns true for the entity, which is a hash, but entity.indices returns an EMPTY array, so line 16 blows up with a 'bad range' error. I have reverted to version 1.11.0 of this gem for the time being. Please fix. Thanks.

mvn release:perform javadoc plexus stringutils error

I wasn't able to include the javadocs in the 1.12.0 release because of this error. I tried updating all plugin versions to their latest, and adding javadoc failOnError=false to the pom. Nothing seemed to fix it. This should be looked at later by someone familiar with maven and pom files.

mvn javadoc:javadoc # this javadoc generation is one step that's part of mvn release:perform
...
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:2.10.3:jar (attach-javadocs) on project twitter-text: Execution attach-javadocs of goal org.apache.maven.plugins:maven-javadoc-plugin:2.10.3:jar failed: An API incompatibility was encountered while executing org.apache.maven.plugins:maven-javadoc-plugin:2.10.3:jar: java.lang.NoSuchMethodError: org.codehaus.plexus.util.StringUtils.unifyLineSeparators(Ljava/lang/String;)Ljava/lang/String;

docs for Java api outdated

I am trying to port the extractor methods to Go and, while referring to the Javadoc, found that it was outdated. This folder should ideally either be removed entirely or regenerated with each commit.

Add email regex.

twitter-text/java/src/com/twitter/Regex.java is missing an email regex.

localhost should be considered a valid domain

Somewhat similar to #40, localhost should be considered a valid domain.

For example:

http://localhost/
or
http://localhost:8080/

However, the following code

twttr.txt.extractUrls("http://localhost/example");

returns [].

Different length on mac-osx & ec2 amazon-linux

I am using the twitter-text module in my Node project. While debugging, I noticed that twitter-text gives me a different length on different machines.
Sample text - "Get your 12-month EconyPink collection @Kissstarter http://kck.sp/1LTPRhj #amazing #lipstick #beauty #eyeliner ..."
116 on mac osx
115 on ec2 machine amazon-linux

twitterText.extract() doesn't extract some links correctly

The problem occurs when I try to extract a link to a YouTube playlist from a string, like this:

https://www.youtube.com/watch?v=f1DVxtjiBc4&feature=youtu.be&list=PLMBnwIwFEFHeaOuHqSEpZPhcFkRSrunE-

The plugin doesn't see the last character, -, so I can see the video, but the playlist for this video isn't shown.
When I try the same thing on Twitter, I see the same result:

https://www.youtube.com/watch?v=f1DVxt…-

The link shortener doesn't see the - either.

Thank you for your response.

Use of carriage returns in `tweet_length`

We encountered an issue where the Twitter.com web interface is reporting a different length than the Ruby gem when a "\r\n" line break is used. eg:

Web interface (3 characters):

Ruby (4 characters):

> Twitter::Validation.tweet_length "a\r\nb"
=> 4

As a workaround, we are going to use a string replace before validating to change "\r\n" to "\n".

Not sure if this should be considered a bug in this gem or a string encoding issue (problem was found on Mac OS 10.10), but wanted to report in case it is a bug. Any thoughts?
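The workaround described above amounts to normalizing line breaks before measuring. The sketch below shows the same idea in JavaScript rather than Ruby; `normalizeLineBreaks` is a hypothetical helper, and plain `length` is used only to show the effect on the string, not as a substitute for the library's tweet-length rules.

```javascript
// Hypothetical helper, not part of twitter-text.
// Collapses Windows-style "\r\n" line breaks into a single "\n".
function normalizeLineBreaks(text) {
  return text.replace(/\r\n/g, "\n");
}

const raw = "a\r\nb";
console.log(raw.length);                       // 4
console.log(normalizeLineBreaks(raw).length);  // 3
```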

Twitter text js regexp questions

Currently I'm studying the regexps of the twitter-text JS repo on GitHub. As you're a great contributor, I was wondering if I could ask you some questions. If you have some time, could you please answer the following?

The questions are about the following regexps:

1)  twttr.txt.regexen.endHashtagMatch = regexSupplant(/^(?:#{hashSigns}|:\/\/)/);
2)  twttr.txt.regexen.hashtagBoundary = regexSupplant(/(?:^|$|[^&a-z0-9_#{latinAccentChars}#{nonLatinHashtagChars}])/);
3)  twttr.txt.regexen.validHashtag = regexSupplant(/(#{hashtagBoundary})(#{hashSigns})(#{hashtagAlphaNumeric}*#{hashtagAlpha}#{hashtagAlphaNumeric}*)/gi);
4)  twttr.txt.regexen.endMentionMatch = regexSupplant(/^(?:#{atSigns}|[#{latinAccentChars}]|:\/\/)/);
5)  twttr.txt.regexen.validReply = regexSupplant(/^(?:#{spaces})*#{atSigns}([a-zA-Z0-9_]{1,20})/);

Question 1:
Regexp 2 is used as some kind of boundary, and it starts with ^|$, which means the beginning or end of a line. But as regexp 3 shows, the boundary sits at the beginning of the pattern, so why is the $ there? Is this to support RTL languages?

Question 2:
At the end of regexp 1 there is something like ://. What is this used for? I can't figure it out. Something similar happens in regexp 4.

Question 3:
Near the beginning of regexp 2 there is something like |[^&a-z0-9. What does the & mean in this context?

Question 4:
What are regexps 1 and 4 used for? What do endHashtagMatch and endMentionMatch do?

Question 5:
Regexp 3 has #{hashtagBoundary} at the beginning. Why doesn't regexp 5 have something similar?
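To read the regexps quoted above, it helps to see how `#{name}` placeholders can be expanded into one combined pattern. The sketch below is a simplified stand-in written for illustration: `supplant` is a hypothetical function, the real `regexSupplant` may behave differently, and the building blocks are toy character classes far narrower than the library's.

```javascript
// Simplified stand-in for regexSupplant; the real helper may differ.
// Replaces each #{name} in the pattern source with the source of parts[name].
function supplant(regex, parts) {
  const source = regex.source.replace(/#\{(\w+)\}/g, (whole, name) => {
    const part = parts[name];
    if (part === undefined) return whole; // leave unknown placeholders as-is
    return part instanceof RegExp ? part.source : String(part);
  });
  return new RegExp(source, regex.flags);
}

// Toy building blocks (assumptions, ASCII only apart from the full-width hash).
const parts = {
  hashSigns: /[#\uFF03]/,
  hashtagAlpha: /[a-z_]/,
  hashtagAlphaNumeric: /[a-z0-9_]/,
  hashtagBoundary: /(?:^|$|[^&a-z0-9_])/,
};

const validHashtag = supplant(
  /(#{hashtagBoundary})(#{hashSigns})(#{hashtagAlphaNumeric}*#{hashtagAlpha}#{hashtagAlphaNumeric}*)/gi,
  parts
);

// Group 3 holds the hashtag text without the hash sign.
const found = [..."#door #porte".matchAll(validHashtag)].map((m) => m[3]);
console.log(found); // [ 'door', 'porte' ]
```

In this toy version the first hashtag is preceded by the `^` alternative of the boundary and the second by a space, which shows why the boundary group has to accept both a position and a character.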

Using target _blank while parsing twitter tweets using twitter text

I want to add the target="_blank" attribute to all URLs in tweets parsed with twitter-text. I followed this example https://github.com/twitter/twitter-text/tree/master/js#auto-linking-examples and was able to parse the entities successfully, but I can't find a way to add target="_blank" to all the URLs. Can anyone tell me whether the library supports an option for this? If not, is there any other way to achieve it?
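The option list in the "Document options" issue on this page names `targetBlank`, but its behavior is not described there. As a library-independent fallback, the generated HTML can be post-processed. The sketch below assumes the input is a string of auto-linked HTML with simple `<a ...>` tags; `addTargetBlank` is a hypothetical helper, and a regex is used only because the markup is assumed to be machine-generated and regular.

```javascript
// Hypothetical post-processing step, not part of twitter-text.
// Adds target="_blank" to every opening <a ...> tag that has no target yet.
function addTargetBlank(html) {
  return html.replace(/<a\b(?![^>]*\btarget=)([^>]*)>/gi, '<a$1 target="_blank">');
}

console.log(addTargetBlank('<a href="https://example.com">example.com</a>'));
// <a href="https://example.com" target="_blank">example.com</a>
```

Links opened with `target="_blank"` are commonly paired with `rel="noopener"`; merging that into an existing `rel` attribute is left out of the sketch.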

Add "·" MIDDLEDOT (U+00B7) support

*Note: this issue is copied from old "twitter-text-conformance" repo
twitter-archive/twitter-text-conformance#63

Hi,

MIDDLEDOT (U+00B7) is widely used as inner-word punctuation in Catalan; it is a mandatory diacritical character under Catalan orthography rules. Currently Twitter doesn't allow "·" in several places, so I am requesting better support for it in Twitter.

I requested it in the Twitter support forum, without feedback, so I am requesting it here. If this is not the right place, please pass it on to the Twitter L10N team.

For instance:

  1. It is not possible to make hashtags like #il·lusió
  2. It is not possible to set valid URLs like "http://www.l·l.cat" in a user's profile
  3. It is not possible to create or name a list like "al·lucinant"

About 1 and 3
You can work around this with the legacy compatibility characters ŀ (U+0140) / Ŀ (U+013F). According to Unicode, their decompositions, l+· and L+·, are preferred. So the odd effect is that you can use ĿL in hashtags (#iŀlusió works fine), but not the preferred Unicode encoding L·L (#il·lusió fails).

About 2
MIDDLEDOT (U+00B7) is a valid character (between two Ls) in the .CAT and .ES TLDs, and it's allowed by RFC 5892.

So, please, improve U+00B7 support in Twitter.

Thanks in advance.

Related links

Emoji escaping

Hello,

This is probably more a question than an issue report. I'm using twitter-text to escape non-standard characters for use in a webpage. It works really well (thanks for making it) except with emojis. Should I expect twitter-text to work with emojis?

Here's an example of what I'm doing...

var text = "Jason Grigsby, ☁4"
console.log(twitter.htmlEscape(text));

I'm passing the escaped values through handlebars.js.

If html escaping isn't possible with emojis, are you able to tell me the best way to make sure emojis are presented properly?
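For comparison, HTML escaping in general only needs to touch the handful of characters that have special meaning in markup, so emoji can pass through untouched. The sketch below illustrates that under stated assumptions: `escapeHtml` is a hypothetical helper, it is not a description of what `twitter.htmlEscape` does, and it assumes the page is served as UTF-8 so the emoji render as-is.

```javascript
// Hypothetical helper, not part of twitter-text.
// Escapes only the characters that are special in HTML; everything else,
// including emoji, is left unchanged.
const HTML_ENTITIES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

function escapeHtml(text) {
  return text.replace(/[&<>"']/g, (ch) => HTML_ENTITIES[ch]);
}

console.log(escapeHtml('Jason Grigsby, \u2601 <b>"hi"</b>'));
// Jason Grigsby, ☁ &lt;b&gt;&quot;hi&quot;&lt;/b&gt;
```

If the template engine escapes values again on output, the text ends up double-escaped, so only one of the two layers should do the escaping.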

Thanks,
/t

Single/Double quote character length!

When I type a single or double quote, getTweetLength returns a character length of 5.

I am using
var remainingCharacters = 140 - twttr.txt.getTweetLength(tweet);
