
google-10000-english's People

Contributors

daviddliu, dgoedtkindt, dmuth, elizafox, first20hours, hingston, jakebathman, koseki, nickvollmar, skotzko, vgel, worldlywisdom

google-10000-english's Issues

Unclear license

LICENSE.md doesn't actually contain a license, but rather attribution. Is any or all of this material in the public domain, all rights reserved, or something in between?

Some bad words not filtered from clean versions

Here are some of the bad or potentially offensive words I've found that aren't being filtered:

sexcam, livesex, jo (slang abbreviation for masturbation), worldsex, vibrators, cumshots, twinks, xnxx (porn site), shemales, upskirts, milfhunter, milfs, bangbus (porn site)

There are probably a few others that I'm not noticing, or that my moral compass doesn't consider a big deal, but those are the big ones.
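One way to regenerate a clean list is to filter the word list against a blocklist. This is a minimal sketch, not the repository's actual filtering script: the function name `filter_words` and the sample data are hypothetical, and it assumes one word per entry.

```python
def filter_words(words, blocklist):
    """Return the words that are not in the blocklist, preserving order."""
    # Normalize case and whitespace so "Xnxx " matches "xnxx".
    blocked = {w.strip().lower() for w in blocklist}
    return [w for w in words if w.strip().lower() not in blocked]


# Hypothetical sample data; a real run would read the list files from disk.
words = ["the", "of", "livesex", "and", "xnxx", "to"]
blocklist = ["livesex", "xnxx"]
clean = filter_words(words, blocklist)
print(clean)
```

Because the filter keeps the original order, the frequency ranking of the remaining words is unchanged.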

Clearer copyright

There is a licence file, but it only shows where the data originated. I clicked one of the links and read the licence there. One of its terms was:

User shall not publish, retransmit, display, redistribute, reproduce or commercially exploit the Data in any form, except that...

So does this mean that you can never use this data in a paid app in any way? It may be better to state the usage terms in the licence text.

Frequency Fail

I have a hard time believing "information" is more frequent than "when".

Also, there are numerous entries for single letters like "x" and state abbreviations like "sd", which IMO are not useful entries.
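A filter for such entries could look like the sketch below. The names `STATE_ABBREVIATIONS`, `ONE_LETTER_WORDS`, and `is_useful` are hypothetical, and the abbreviation set is a small illustrative subset rather than a complete list.

```python
# Illustrative subset only; a full list of US state abbreviations is longer.
STATE_ABBREVIATIONS = {"sd", "nd", "tx", "ca"}
# Single letters that are genuine English words and should be kept.
ONE_LETTER_WORDS = {"a", "i"}


def is_useful(word):
    """Return False for stray single letters and listed state abbreviations."""
    word = word.lower()
    if len(word) == 1:
        return word in ONE_LETTER_WORDS
    return word not in STATE_ABBREVIATIONS


entries = ["the", "x", "a", "sd", "information", "when"]
useful = [w for w in entries if is_useful(w)]
print(useful)
```

Some state abbreviations, such as "in" and "or", are also ordinary words, so a blanket filter over all of them would remove legitimate entries. The abbreviation set would need to be curated by hand.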

missing very common words

According to the Oxford 3k word list, the following words are missing:

uncontrolled
generously
disturb
alarming
poisonous
steeply
rumour
thickly
congratulate
artificially
grandson
unimportant
unfriendly
nervously
skilfully
unexpectedly
injure
swollen
dissolve
coldly
midday
faithfully
irritating
photocopy
violently
salty
amused
bitterly
irritated
knitted
disgust
criticize
cupboard
entertainer
complicate
accuse
dishonestly
immoral
wrongly
unkind
congratulation
offend
invent
spoil
tonne
cheerfully
strangely
embarrassed
awfully
unload
confidently
anxiously
disapprove
oddly
fasten
beak
swearing
gravely
motorbike
depress
unwillingly
reckon
untidy
entitle
irritate
frighten
milligram
faintly
rudely
frightening
yawn
impatient
kilogram
drugstore
unlucky
dishonest
grandparent
angrily
willingly
alarmed
lorry
coughing
enthusiastically
awkwardly
disgusted
millimetre
deserted
noisily
annoy
thirsty
unsteady
grandchild
hairdresser
embarrassment
suitcase
amuse
downwards
curiously
chairwoman
disapproving
niece
cheaply
exaggerated
artistically
granddaughter
neatly
pleasantly
centimetre
skilful
admiration
confuse
clap
devote
insulting
farthest
politely
waiter
exaggerate
bandage
tiring
brightly
frightened
loudly
contrasting
embarrass
kilometre
disappoint
depressing
carelessly
stiffly
wildly
sideways
calmly
cannot
careless
upsetting
confine
disapproval
disgusting
unfairly
morally
pronounce
amaze
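A comparison like this can be reproduced with a simple membership check. This is a sketch with made-up sample data, and `missing_words` is a hypothetical helper. A real run would load both the reference list and the frequency list from files.

```python
def missing_words(reference, frequency_list):
    """Return reference words absent from frequency_list, in reference order."""
    present = {w.lower() for w in frequency_list}
    return [w for w in reference if w.lower() not in present]


# Tiny made-up samples standing in for the two full lists.
reference = ["disturb", "when", "cupboard", "information"]
frequency_list = ["the", "information", "when"]
print(missing_words(reference, frequency_list))
```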

Not In Order by Frequency in English Language

I find it very hard to believe that "ebay" is the 217th most common word in the English language, and Google's ngram viewer agrees with me (words in linked ngram search appear AFTER "ebay" in "google-10000-english.txt").

Further analysis of the data, using the Google ngram viewer itself, indicates that the order of this list in no way represents the actual relative frequencies at which these words are used.

Of course this list is still useful to many who are looking for a list of common words, but I take issue with the claim that these are the "most common English words in order of frequency."
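To check the ordering systematically rather than word by word, the list could be compared against frequency counts from another source. In the sketch below the counts are invented for illustration only; they are not real ngram figures, and `out_of_order` is a hypothetical helper.

```python
def out_of_order(ranked_words, counts):
    """Return adjacent (earlier, later) pairs where the earlier word is rarer."""
    # Skip words for which the external source has no count.
    known = [w for w in ranked_words if w in counts]
    return [(a, b) for a, b in zip(known, known[1:]) if counts[a] < counts[b]]


ranked_words = ["the", "ebay", "when", "zebra"]
# Made-up counts, NOT real data; a real check would load external frequencies.
counts = {"the": 1000, "ebay": 5, "when": 400, "zebra": 1}
print(out_of_order(ranked_words, counts))
```

Each reported pair is a place where the list ranks a word above a neighbor that the other source considers more frequent.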

Masturbating?

This word exists in the "10000 english long words no swears" list. It's not really a swear, but it could be inappropriate.

Is there a Spanish version?

This is very interesting and nice work. I have been searching for a list similar to this in Spanish and other languages for some time now without any luck. Do you know where I could find such a list or at least the data sets to create one? Thanks!

Various brand names are included

I don't know whether the list is supposed to include brand names, but there are several, including toshiba, lexus, kijiji, levitra, paxil, firewire, nextel, hewlett, ericsson, garmin, and more.

top 10k english words that are words?

Hi, are you interested in having another permutation of the 10k list that contains only valid words? I needed that, so I munged the list a little. You probably would want the whole 10k, but it's pretty close to what I ran.

This is relevant to #1.

Contractions?

Is it possible to add contractions to the list? e.g. "you're" or "can't"
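Whether contractions can appear depends on how the source text is tokenized. The sketch below counts tokens while keeping apostrophes inside words. The pattern `TOKEN` and the helper `count_tokens` are hypothetical, it only handles the straight ASCII apostrophe, and it says nothing about how the original list was actually built.

```python
import re
from collections import Counter

# Letters, optionally followed by an apostrophe and more letters (e.g. "can't").
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)*")


def count_tokens(text):
    """Count lowercase tokens, treating contractions as single words."""
    return Counter(TOKEN.findall(text.lower()))


counts = count_tokens("You're right, I can't go. You're late.")
print(counts.most_common(1))
```

A tokenizer that splits on every non-letter character would instead yield fragments such as "you" and "re", so contractions would never reach the frequency list.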
