twitter / twitter-text

Twitter Text Libraries. This code is used at Twitter to tokenize and parse text to meet the expectations for what can be used on the platform.

Home Page: https://developer.twitter.com/en/docs/counting-characters

License: Apache License 2.0

Shell 0.02% Ruby 14.66% HTML 40.41% CSS 0.89% Java 12.76% JavaScript 22.60% Objective-C 8.26% Scala 0.39%

emoji java nodejs objective-c ruby tweet twitter twitter-text unicode

twitter-text's Issues

Add "·" MIDDLEDOT (U+00B7) support

Note: this issue was copied from the old "twitter-text-conformance" repo:
twitter-archive/twitter-text-conformance#63

Hi,

MIDDLEDOT (U+00B7) is widely used as inner-word punctuation in Catalan, where it is a mandatory character under the orthography rules. Currently Twitter doesn't allow "·" in several places, so I'm requesting better support for it.

I requested it in Twitter support forum, without feedback. So, I request it here. If that's not the place, please, report it to L10N Twitter team.

For instance:

1. You can't make hashtags like #il·lusió
2. You can't set valid URLs like "http://www.l·l.cat" in a user's profile
3. You can't create or name a list like "al·lucinant"

About 1 and 3
You can work around this with the legacy compatibility characters ŀ (U+0140) / Ŀ (U+013F). According to Unicode, their decompositions are preferred: l+· and L+·. So the weird effect is that you can use ĿL in hashtags (#iŀlusió works fine), but not the preferred Unicode encoding L·L (#il·lusió fails).

About 2
MIDDLEDOT (U+00B7) is a valid character (between two Ls) in the .CAT and .ES TLDs, and it's allowed by RFC 5892.

So, please, improve U+00B7 support in Twitter.

Thanks in advance.

Using target _blank while parsing twitter tweets using twitter text

I want to add a target="_blank" attribute to all the URLs that twitter-text produces when parsing tweets. I followed this example https://github.com/twitter/twitter-text/tree/master/js#auto-linking-examples and was able to parse the entities, but I can't find a way to add target="_blank" to all the URLs. Does the library support an option for this? If not, is there any other way to achieve it?
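A library-independent workaround is to post-process the generated HTML. The sketch below is not part of twitter-text; addTargetBlank is a hypothetical helper that assumes the input is already-linked HTML with simple anchor tags. The option list further down this page also names a targetBlank entry, which may be the supported route, though its behavior is not documented here.

```javascript
// Hypothetical helper (not a twitter-text API): adds target="_blank" to every
// <a> tag in an HTML string that does not already declare a target attribute.
function addTargetBlank(html) {
  return html.replace(/<a\b([^>]*)>/gi, (tag, attrs) =>
    /\btarget\s*=/i.test(attrs) ? tag : `<a${attrs} target="_blank">`
  );
}

const linked = 'See <a href="https://t.co/abc">t.co/abc</a>';
console.log(addTargetBlank(linked));
```

Running the helper a second time leaves the tags unchanged, because anchors that already carry a target attribute are skipped.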

RTL hashtag support?

Suppose I leave a comment with hashtags like this: #door #porte #porta דלת#.

This is an interesting edge case. The hash tag should be at the beginning of the word, so with the Hebrew word (read Right To Left) the hashtag is technically in the correct position.

However, adding a hash to the right of the RTL word doesn't parse as a hashtag.

Should it?

Emojis in URLs

Context - Coke started a campaign with emojis in URLs. They are not hyperlinked in Twitter apps.
http://mashable.com/2015/02/20/coke-emoji-web-addresses/

Whether or not they should be clickable is debatable:
http://www.washingtonpost.com/news/the-intersect/wp/2015/02/23/the-surprisingly-complex-reason-you-never-see-emoji-urls/

There certainly are some gTLDs that support this format of URL but they are not common.

Twitter returning @mentions preceded by @

I wanted to duplicate this here in case an issue would be responded to differently than a pull request.
As shown here:
https://twitter.com/kung_fu_mike/status/570024458640973824
https://twitter.com/StreetGeekEnt/statuses/554800426768277504
The second one also contains this in its API response JSON:
"user_mentions": [
  {
    "screen_name": "NME",
    "name": "NME",
    "id": 19063323,
    "id_str": "19063323",
    "indices": [
      1,
      5
    ]
  }
]

I have a pull request out here:
#34

However I also thought this might be some sort of accidental regression?

JS: autoLinkWithJSON mutates the json input param

The json parameter passed into autoLinkWithJSON gets mutated here: https://github.com/twitter/twitter-text/blob/master/js/twitter-text.js#L683

This means that if you have tweet text with emoji at the start and some links or usernames after the emoji, and you call autoLinkWithJSON with the same text and entities twice, it'll give you a different result each time.
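Until the mutation is fixed, a caller can protect its own data by passing a deep copy. The sketch below assumes the entities are plain JSON (no functions, dates, or cycles); cloneJson is a hypothetical helper, and the index change only simulates what an in-place adjustment would do.

```javascript
// Hypothetical helper: deep-copies plain JSON data so a callee can mutate
// the copy without touching the caller's object.
function cloneJson(value) {
  return JSON.parse(JSON.stringify(value));
}

const entities = { user_mentions: [{ screen_name: "NME", indices: [1, 5] }] };
const copy = cloneJson(entities);

// Simulate an in-place index adjustment on the copy only.
copy.user_mentions[0].indices[0] = 99;

console.log(entities.user_mentions[0].indices); // the original indices are untouched
```

Passing cloneJson(entities) on every call keeps repeated calls working from the same starting data.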

Add Python support

regexen.validHashtag.test returns `true` and then `false` for the same input.

Steps to reproduce:

console.log(twttr.txt.regexen.validHashtag.test("#hello")); // -> true
console.log(twttr.txt.regexen.validHashtag.test("#hello")); // -> false

Expected:

Both calls should be true

Here is a gist: http://jsbin.com/japicakefi/1/edit?html,css,js,console,output

Note the console log shows true false true for the same input!
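The alternating result is consistent with standard JavaScript behavior for regular expressions that carry the g flag: test() starts searching at the regex's lastIndex and updates it after each call. The sketch below uses a simplified stand-in pattern rather than the library's validHashtag; testFresh is a hypothetical helper.

```javascript
// Simplified stand-in for a global regex such as validHashtag.
const pattern = /#\w+/g;

// First call matches and advances pattern.lastIndex past the match.
const first = pattern.test("#hello");
// Second call resumes from lastIndex, finds nothing, and resets lastIndex to 0.
const second = pattern.test("#hello");

// Hypothetical helper: reset lastIndex so every test starts from the beginning.
function testFresh(re, text) {
  re.lastIndex = 0;
  return re.test(text);
}

console.log(first, second, testFresh(pattern, "#hello"), testFresh(pattern, "#hello"));
```

Another option is to test against a non-global copy of the pattern, which never tracks lastIndex between calls.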

https://en.wikipedia.org/wiki/"Crocodile"_Dundee

This URL fails to get linkified: https://en.wikipedia.org/wiki/"Crocodile"_Dundee

Wrong URL extracted when in angle brackets and pipe separated text

Here is the case: we have some texts that are copied and pasted from a wiki, which uses the markup <URL|description> to put a link into a page. The extract method returns everything inside the angle brackets.
Here is a sample message:
"Please refer to <https://dev1.mycompany.com:8080/d/documents/248647|#248647>"

The url should be: https://dev1.mycompany.com:8080/d/documents/248647
but instead I'm getting: https://dev1.mycompany.com:8080/d/documents/248647|#248647

If you can think of a way of tweaking the regexes to fix this, that would be awesome.
Thanks
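One way to sidestep the problem without touching the library's regexes is to parse the wiki markup first and hand only the URL part to the extractor. The sketch below assumes the markup is always <URL|description> with no nested angle brackets; extractWikiLinks is a hypothetical helper, not a twitter-text function.

```javascript
// Hypothetical pre-processing step: pull out <URL|description> pairs before
// any URL extraction runs on the text.
function extractWikiLinks(text) {
  const links = [];
  // Group 1: everything up to the pipe (no spaces, pipes, or brackets).
  // Group 2: the description up to the closing bracket.
  const markup = /<([^<>|\s]+)\|([^<>]*)>/g;
  let match;
  while ((match = markup.exec(text)) !== null) {
    links.push({ url: match[1], description: match[2] });
  }
  return links;
}

const message =
  "Please refer to <https://dev1.mycompany.com:8080/d/documents/248647|#248647>";
console.log(extractWikiLinks(message));
```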

/pkg folder

Greetings,

Thanks for developing the package, it's really useful.

I'm wondering whether we need to pack the files under /pkg into the npm package when building it. Because every past version of the JS is archived there, the npm package is quite large, yet we don't use the files in that folder directly. Keeping only the current file for each package version would be great.

@hankhsiao

Curly Brackets in URL parameters not being correctly linked

The URL example.com/?id={1234} is not being correctly linked.

Example - https://twitter.com/carwash/status/591508694111653888

The URL is http://archaeopress.com/ArchaeopressShop/Public/displayProductDetail.asp?id={E35F9954-5653-493D-884B-4A7D2DE66610}

Twitter ignores the { - so it renders as http://archaeopress.com/ArchaeopressShop/Public/displayProductDetail.asp?id={E35F9954-5653-493D-884B-4A7D2DE66610}
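A possible workaround on the posting side is to percent-encode the braces before tweeting. The sketch below only shows the encoding step; encodeBraces is a hypothetical helper, and whether the target server accepts the encoded form, or whether Twitter then links the full URL, is an untested assumption.

```javascript
// Hypothetical helper: percent-encode curly braces so the URL contains only
// characters that link parsers commonly accept.
function encodeBraces(url) {
  return url.replace(/[{}]/g, (ch) => encodeURIComponent(ch));
}

console.log(encodeBraces("http://example.com/?id={1234}"));
```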

autoLink and extractUrlsWithoutProtocol

Is there a reason why you can't extract URLs without a protocol when using autoLink?

  twttr.txt.autoLink = function(text, options) {
    var entities = twttr.txt.extractEntitiesWithIndices(text, {extractUrlsWithoutProtocol: false});
    return twttr.txt.autoLinkEntities(text, entities, options);
  };

Currently it always passes false, with no option to change it.
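Until such an option exists, the same two calls can be composed by hand with the flag flipped. This is a sketch, assuming extractEntitiesWithIndices and autoLinkEntities keep the signatures shown in the snippet above; autoLinkWithBareUrls is a hypothetical name, and the library object is passed in as txt so the wrapper does not depend on how twitter-text is loaded.

```javascript
// Hypothetical wrapper: same flow as autoLink, but asks the extractor to
// include URLs that lack a protocol. `txt` is the library namespace
// (for example twttr.txt in the browser).
function autoLinkWithBareUrls(txt, text, options) {
  const entities = txt.extractEntitiesWithIndices(text, {
    extractUrlsWithoutProtocol: true,
  });
  return txt.autoLinkEntities(text, entities, options);
}

// Usage, assuming twttr.txt is available:
// autoLinkWithBareUrls(twttr.txt, "visit www.example.com", {});
```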

Remove hashbangs from search urls

For example:

DEFAULT_CASHTAG_URL_BASE = "https://twitter.com/#!/search?q=%24";

Would become:

DEFAULT_CASHTAG_URL_BASE = "https://twitter.com/search?q=%24";

getTweetLength returns 23 for 'www.google.com'

Any idea why? It's not an https link, so should be 22, right?

Use of carriage returns in `tweet_length`

We encountered an issue where the Twitter.com web interface reports a different length than the Ruby gem when a "\r\n" line break is used, e.g.:

Web interface (3 characters):

Ruby (4 characters):

> Twitter::Validation.tweet_length "a\r\nb"
=> 4

As a workaround, we are going to use a string replace before validating to change "\r\n" to "\n".

Not sure if this should be considered a bug in this gem or a string encoding issue (problem was found on Mac OS 10.10), but wanted to report in case it is a bug. Any thoughts?
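The report concerns the Ruby gem, but the string-replace workaround looks the same in any language. The sketch below shows it in JavaScript and counts code points with the spread operator instead of calling the library; normalizeNewlines is a hypothetical helper name.

```javascript
// Hypothetical helper: collapse Windows-style line breaks to a single "\n"
// before the text is measured or validated.
function normalizeNewlines(text) {
  return text.replace(/\r\n/g, "\n");
}

const raw = "a\r\nb";
const normalized = normalizeNewlines(raw);

// Count code points rather than UTF-16 units.
console.log([...raw].length, [...normalized].length);
```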

Emails containing RT@ should not be usernames

I believe this was the only fix in the middle of active development during the switch to this mono-repo.
Would someone like to start a unified pull request here?
/cc @psychs @niw @jugyo

Previous work, not yet in the monorepo:
twitter-archive/twitter-text-conformance#83
twitter-archive/twitter-text-rb#123
twitter-archive/twitter-text-objc@18b3b2b
twitter-archive/twitter-text-objc@a3bdbaa

Twitter text js regexp questions

Currently I'm studying the regexps in the twitter-text JS repo on GitHub. As you're a great contributor, I was wondering if I could ask you some questions. If you have some time, could you please answer the following?

The questions are about the following regexps:

1)  twttr.txt.regexen.endHashtagMatch = regexSupplant(/^(?:#{hashSigns}|:\/\/)/);
2)  twttr.txt.regexen.hashtagBoundary = regexSupplant(/(?:^|$|[^&a-z0-9_#{latinAccentChars}#{nonLatinHashtagChars}])/);
3)  twttr.txt.regexen.validHashtag = regexSupplant(/(#{hashtagBoundary})(#{hashSigns})(#{hashtagAlphaNumeric}*#{hashtagAlpha}#{hashtagAlphaNumeric}*)/gi);
4)  twttr.txt.regexen.endMentionMatch = regexSupplant(/^(?:#{atSigns}|[#{latinAccentChars}]|:\/\/)/);
5)  twttr.txt.regexen.validReply = regexSupplant(/^(?:#{spaces})*#{atSigns}([a-zA-Z0-9_]{1,20})/);

Question 1:
Regexp 2 is used as some kind of boundary and starts with ^|$, meaning the beginning or end of a line. But as regexp 3 shows, the boundary sits at the start of the hashtag pattern, so why is the $ there? Is it to support RTL languages?

Question 2:
At the end of regexp 1 there's something like :// . What is this used for? I can't figure it out. Something similar happens in regexp 4.

Question 3:
At the beginning of regexp 2 there's something like |[^&a-z0-9 . What does the & mean in this context?

Question 4:
What are regexps 1 and 4 used for? What do endHashtagMatch and endMentionMatch do?

Question 5:
Regexp 3 has #{hashtagBoundary} at the beginning. Why doesn't regexp 5 have something similar?

twitter-text says tweet is valid but fails to post with error 32: "Could not authenticate you."

When a tweet starts with an asterisk, twitter-text says it is valid, but Twitter sends back an error: "Could not authenticate you." (code 32)

Twitter::Validation.tweet_invalid?("*TEST* Does this work?") # returns false

client.update("*TEST* Does this work?") # raises Twitter::Error::Unauthorized: Could not authenticate you.

client.update("TEST Does this work?") # works

JS: autoLinkWithJSON overwrites attributes with undefined values

If you use the extraction methods for mentions, hashtags and cashtags to produce JSON for use with autoLinkWithJSON, autoLinkWithJSON overwrites screenName, hashtag and cashtag attributes with undefined, e.g.: https://github.com/twitter/twitter-text/blob/master/js/twitter-text.js#L667

Add email regex.

twitter-text/java/src/com/twitter/Regex.java misses Email Regex.

Unable to install nokogiri 1.5.10

I am unable to install twitter-text due to the gemspec using an old version of nokogiri.

$ bundle install
Fetching gem metadata from http://rubygems.org/.........
Fetching version metadata from http://rubygems.org/..
Resolving dependencies...
Using rake 10.4.2
Using diff-lcs 1.2.5
Using docile 1.1.5
Using json 1.8.3
Using multi_json 1.11.2
Installing nokogiri 1.5.11 (was 1.6.6.4) with native extensions

Gem::Ext::BuildError: ERROR: Failed to build gem native extension.

    current directory: /private/var/folders/7l/398p04d15r14xrvtxpwvcvv00000gp/T/bundler20151121-58319-1tybnxrnokogiri-1.5.11/gems/nokogiri-1.5.11/ext/nokogiri
/System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/bin/ruby -r ./siteconf20151121-58319-4nx6fd.rb extconf.rb
checking for libxml/parser.h... yes
checking for libxslt/xslt.h... yes
checking for libexslt/exslt.h... yes
checking for iconv_open() in iconv.h... no
checking for iconv_open() in -liconv... yes
checking for xmlParseDoc() in -lxml2... no
-----
libxml2 is missing.  please visit http://nokogiri.org/tutorials/installing_nokogiri.html for help with installing dependencies.
-----
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary
libraries and/or headers.  Check the mkmf.log file for more details.  You may
need configuration options.

Provided configuration options:
    --with-opt-dir
    --without-opt-dir
    --with-opt-include
    --without-opt-include=${opt-dir}/include
    --with-opt-lib
    --without-opt-lib=${opt-dir}/lib
    --with-make-prog
    --without-make-prog
    --srcdir=.
    --curdir
    --ruby=/System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/bin/ruby
    --with-zlib-dir
    --without-zlib-dir
    --with-zlib-include
    --without-zlib-include=${zlib-dir}/include
    --with-zlib-lib
    --without-zlib-lib=${zlib-dir}/lib
    --with-iconv-dir
    --without-iconv-dir
    --with-iconv-include
    --without-iconv-include=${iconv-dir}/include
    --with-iconv-lib
    --without-iconv-lib=${iconv-dir}/lib
    --with-xml2-dir
    --without-xml2-dir
    --with-xml2-include
    --without-xml2-include=${xml2-dir}/include
    --with-xml2-lib
    --without-xml2-lib=${xml2-dir}/lib
    --with-xslt-dir
    --without-xslt-dir
    --with-xslt-include
    --without-xslt-include=${xslt-dir}/include
    --with-xslt-lib
    --without-xslt-lib=${xslt-dir}/lib
    --with-libxslt-config
    --without-libxslt-config
    --with-pkg-config
    --without-pkg-config
    --with-libxml-2.0-config
    --without-libxml-2.0-config
    --with-libiconv-config
    --without-libiconv-config
    --with-iconvlib
    --without-iconvlib
    --with-xml2lib
    --without-xml2lib

To see why this extension failed to compile, please check the mkmf.log which can be found here:

  /var/folders/7l/398p04d15r14xrvtxpwvcvv00000gp/T/bundler20151121-58319-1tybnxrnokogiri-1.5.11/extensions/universal-darwin-15/2.0.0/nokogiri-1.5.11/mkmf.log

extconf failed, exit code 1

Gem files will remain installed in /var/folders/7l/398p04d15r14xrvtxpwvcvv00000gp/T/bundler20151121-58319-1tybnxrnokogiri-1.5.11/gems/nokogiri-1.5.11 for inspection.
Results logged to /var/folders/7l/398p04d15r14xrvtxpwvcvv00000gp/T/bundler20151121-58319-1tybnxrnokogiri-1.5.11/extensions/universal-darwin-15/2.0.0/nokogiri-1.5.11/gem_make.out
An error occurred while installing nokogiri (1.5.11), and Bundler cannot continue.
Make sure that `gem install nokogiri -v '1.5.11'` succeeds before bundling.

This looks like the same issue as nomad/cupertino issue #161 which was resolved by updating the nokogiri dependency to 1.6.3. Actually it looks like @jakl was already using nokogiri 1.6.3 in the latest Gemfile.lock. Can we update twitter-text.gemspec?

auto-linking example does not match twitter api

twitter api result:

this project's provided example:

[INFO] I'm working to port implementation of Extractor to Go Language

Hi! I wanted to share that I'm working on porting the implementation of Extractor functions to the Go Language based on the Java implementation. It is currently still limited in the sense that it doesn't support URL extraction (and hence overlap of hashtags/URLs), lists with mentions and cashtags yet. However, it passes the conformance tests for hashtags (except url overlap), mentions & replies.

add php support or update the old one

Document options

There is no documentation of the options and how they customize the output. From what I could find, these are the public options:

cashtagClass
cashtagUrlBase
checkUrlOverlap
hashtagClass
hashtagUrlBase
htmlAttrs
htmlEscapeNonEntities
invisibleTagAttrs
linkAttributeBlock
linkTextBlock
listClass
listUrlBase
suppressDataScreenName
suppressLists
suppressNoFollow
symbolTag
targetBlank
textWithSymbolTag
title
urlClass
urlEntities
urlTarget
usernameClass
usernameIncludeSymbol
usernameUrlBase

big npm package size cause of 520 MB file inside

Here is fresh twitter-text package - https://registry.npmjs.org/twitter-text/-/twitter-text-1.13.0.tgz
Size = 184 mb

Seems like it is because of ctags was published with the package:
520M Apr 10 01:58 tags

Please unpublish this version and publish it again without it.

Add 'click' to the gTLD list

I cannot see 'click' in the gTLD list.

npm WARN install: Refusing to install twitter-text as a dependency of itself

upon following the npm install instructions in README.md: npm install twitter-text:

npm WARN install Refusing to install twitter-text as a dependency of itself

java.lang.ExceptionInInitializerError

Using the latest Java library com.twitter:twitter-text:1.11.0 I'm getting the following crash when calling new Extractor().extractEntitiesWithIndices(tweet).

java.lang.ExceptionInInitializerError
    at com.twitter.Extractor.extractURLsWithIndices(Extractor.java:297)
    at com.twitter.Extractor.extractEntitiesWithIndices(Extractor.java:156)

Line 297: Matcher matcher = Regex.VALID_URL.matcher(text);

So there might be an issue compiling the URL regex pattern?
The exception is not thrown when using the old 1.9.9 version.

Tweet Tokenization Code

I would like to suggest adding Tweet tokenization code to your library. Tweet tokenization is a common task for people who work on Twitter data analysis. Existing libraries for tokenizing English are not sufficient for Twitter data because of the noisy language used. There is a library that implements tweet tokenization efficiently, but unfortunately it is available under the GPLv2 license ( https://github.com/brendano/ark-tweet-nlp ), unlike your Apache License.

It would be great if you could add a simple tokenizer that treats hashtags, @mentions, URLs, punctuation marks and words as individual tokens rather than just splitting the string on spaces.
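To make the request concrete, here is a minimal sketch of such a tokenizer. It is not based on twitter-text's own regexes: tokenize is a hypothetical function, the URL rule simply grabs everything up to the next whitespace, and \w only covers ASCII letters, digits and underscore, so accented or non-Latin text would need a broader pattern.

```javascript
// Hypothetical tokenizer: alternatives are tried in order, so URLs win over
// hashtags/mentions, which win over plain words, which win over punctuation.
function tokenize(text) {
  const token = /https?:\/\/\S+|[@#]\w+|\w+(?:'\w+)?|[^\s\w]/g;
  return text.match(token) || [];
}

console.log(tokenize("Loving #nodejs, thanks @bob! See https://example.com/a now"));
```

Whitespace never matches any alternative, so it acts only as a separator and produces no tokens.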

Valid URL is not extracted

Conformance has a test for ipv4 and ipv6 addresses:

twitter-text/conformance/validate.yml

Line 129 in 10a8a18

text: "http://192.168.0.1/index.html?src=asdf"

However calling extractUrlsWithIndices() doesn't extract those URLs. This example code

var twitter = require('twitter-text');
console.log(twitter.extractUrlsWithIndices('http://192.168.0.1/index.html?src=asdf'));

returns [].

Mentions don't include @ character

When autoLink-ing tweets, the @ sign is not included in the mention link.

On twitter.com, the @ sign is wrapped in an s HTML tag so you can style it with a matching color. Is there any way to achieve this?

Emoji in text seems to break text replacing

I use the "autoLinkWithJSON" function with tweets and everything works well except for tweets that contain emojis in their text. This is an example of how a tweet with emoji appears after using the autoLinkWithJSON function:

And this is the original tweet: https://twitter.com/HototPlus/status/673133696803840000

Is it a bug, or is there a specific function for handling tweets with emoji? Thanks.
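One plausible cause, stated here as an assumption rather than a diagnosis, is an index mismatch: if the entity indices count Unicode code points while JavaScript strings are indexed in UTF-16 code units, every emoji outside the Basic Multilingual Plane shifts later offsets by one. The sketch below converts a code point index to a UTF-16 offset; codePointToUtf16Index is a hypothetical helper.

```javascript
// Hypothetical helper: translate an index counted in code points into the
// matching offset in UTF-16 code units.
function codePointToUtf16Index(text, cpIndex) {
  let units = 0;
  let seen = 0;
  for (const ch of text) { // for...of walks the string one code point at a time
    if (seen === cpIndex) break;
    units += ch.length;    // 1 for BMP characters, 2 for astral ones such as emoji
    seen += 1;
  }
  return units;
}

const text = "😀 @bob";
const start = codePointToUtf16Index(text, 2); // code point index of "@"
const end = codePointToUtf16Index(text, 6);   // code point index just past "b"
console.log(text.slice(start, end));
```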

localhost should be considered a valid domain

Somewhat similar to #40, localhost should be considered a valid domain.

For example:

http://localhost/
or
http://localhost:8080/

However, the following code

twttr.txt.extractUrls("http://localhost/example");

returns [].
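While the extractor rejects these URLs, a caller who needs them can check for localhost separately. The sketch below uses the standard URL class available in browsers and Node.js rather than any twitter-text function; isLocalhostUrl is a hypothetical helper that only accepts absolute http or https URLs.

```javascript
// Hypothetical supplementary check: true for absolute http(s) URLs whose
// host is exactly "localhost", with or without a port.
function isLocalhostUrl(candidate) {
  try {
    const { protocol, hostname } = new URL(candidate);
    return (protocol === "http:" || protocol === "https:") && hostname === "localhost";
  } catch {
    return false; // not parseable as an absolute URL
  }
}

console.log(isLocalhostUrl("http://localhost/example"), isLocalhostUrl("http://localhost:8080/"));
```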

twitterText.extract() doesn't extract some links correctly

The problem occurs when I try to extract a link to a YouTube playlist from a string, like this:

https://www.youtube.com/watch?v=f1DVxtjiBc4&feature=youtu.be&list=PLMBnwIwFEFHeaOuHqSEpZPhcFkRSrunE-

The plugin doesn't see the last character, -, so I can see the video, but the playlist for this video isn't shown.
When I try the same thing on Twitter, I see the same result:

https://www.youtube.com/watch?v=f1DVxt…-

The link shortener doesn't see the - either.

Thank you for your response.

JS Documentation

Where is the JS/library documentation?

twitter-text/rb/lib/twitter-text/rewriter.rb was changed 5 months ago and pulled into 1.12.0 and has a bug in it

Twitter::Rewriter.rewrite_entities was changed on 1/10/2015. This line of code, indices = entity.respond_to?(:indices) ? entity.indices : entity[:indices], does not work as intended (on line 9 and line 15). respond_to? returns true for the entity, which is a hash, but entity.indices returns an EMPTY array, so line 16 blows up with a 'bad range' error. I reverted to version 1.11.0 of this gem for the time being. Please fix. Thanks.

Add all valid .tld

Issue: Currently, Twitter will auto-link a URL like example.zip but will not auto-link www.bbc

Proof: https://twitter.com/edent/status/581394369804136450

Solution: The latest TLDs from http://data.iana.org/TLD/tlds-alpha-by-domain.txt should be added to https://github.com/twitter/twitter-text/blob/master/conformance/tld_lib.yml and https://github.com/twitter/twitter-text/blob/master/conformance/tlds.yml

mvn release:perform javadoc plexus stringutils error

I wasn't able to include the javadocs in the 1.12.0 release because of this error. I tried updating all plugin versions to their latest, and adding javadoc failOnError=false to the pom. Nothing seemed to fix it. This should be looked at later by someone familiar with maven and pom files.

mvn javadoc:javadoc # this javadoc generation is one step that's part of mvn release:perform
...
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:2.10.3:jar (attach-javadocs) on project twitter-text: Execution attach-javadocs of goal org.apache.maven.plugins:maven-javadoc-plugin:2.10.3:jar failed: An API incompatibility was encountered while executing org.apache.maven.plugins:maven-javadoc-plugin:2.10.3:jar: java.lang.NoSuchMethodError: org.codehaus.plexus.util.StringUtils.unifyLineSeparators(Ljava/lang/String;)Ljava/lang/String;

docs for Java api outdated

I am trying to port the extractor methods to Go Language and while referring to the Javadoc found it was outdated. This folder should ideally either be removed entirely or re-generated with each commit

Updated Cocoapod Spec

Looks like the podspec is currently referencing the deprecated objc repository.

Also it'd be great to remove the existing pod from the Specs repository and update with a new one referencing this repo... https://github.com/CocoaPods/Specs/blob/master/Specs/twitter-text-objc/1.6.1/twitter-text-objc.podspec.json

Thanks for the updates! :)

NEW TLD .amsterdam

Is missing in the library. thanks

Single/Double quote character length!

When I type a single or a double quote, the character length is returned as 5.

I am using:
var remainingCharacters = 140 - twttr.txt.getTweetLength(tweet);

where is TldLists.GTLDS ?

Can't compile the java version

add js to bower

please add this to bower. thanks!

enable protocol in autolinked URLs

I'm using the Java library and am wondering how I can make sure the protocol is included in the extracted href. Leaving the protocol out of the link can cause issues when a non-https external URL is linked to from a site with SSL. Is there a way to ensure the protocol is included instead of producing href="//..?

include conformance directory inside gem

currently the tests fail because it can't find conformance files inside test folder

Different length on mac-osx & ec2 amazon-linux

I am using the twitter-text module in my Node project. While debugging, I noticed that twitter-text gives me different lengths on different machines.
Sample text - "Get your 12-month EconyPink collection @Kissstarter http://kck.sp/1LTPRhj #amazing #lipstick #beauty #eyeliner ..."
116 on mac osx
115 on ec2 machine amazon-linux

Emoji escaping

Hello,

This is probably more a question than a reporting of an issue. I'm using twitter-text to escape non standard characters for use in a webpage. It works really well - thanks for making it - except with emojis. Should I expect twitter-text to work with emojis?

Here's an example of what I'm doing...

var text = "Jason Grigsby, ☁4"
console.log(twitter.htmlEscape(text));

I'm passing the escaped values through handlebars.js.

If html escaping isn't possible with emojis, are you able to tell me the best way to make sure emojis are presented properly?

Thanks,
/t
