
picpic-core's Introduction

Hey.

My name is Martin Schön. I am a web developer.

Currently, I am building and maintaining the software behind one of Europe’s largest news websites, together with my colleagues at SPIEGEL Tech Lab.

Although I have worked in the fast and sometimes chaotic world of news publishers for the last ten years, I enjoy the web most when it’s tidy, calm and made by passionate individuals. I like

🍦 progressive enhancement

🌱 the sustainable web

🧑‍💻 the indie web

Projects I am proud of

  • Building an audio player from scratch
  • Adding bookmarks to DER SPIEGEL
  • Reducing the amount of data DER SPIEGEL fetches for article rendering by a factor of 10
  • Relaunching ZEIT ONLINE’s festival platform Z2X
  • Building a robot photo editor

picpic-core's People

Contributors

mgschoen


picpic-core's Issues

Common interface for all ML approaches

  • move keyword threshold from FFNN to Benchmark
  • make FFNN return a list of probabilities instead of ones and zeros
  • create a wrapper module ste-ml.js that can be initialised with only a modelType (currently svm or ffnn), a modelPath (path to the trained, exported .model file) and a list of stemmedUniqueTerms, and that behaves similarly to ste-statistical.js
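
The three points above could fit together roughly as follows. This is only a sketch under assumptions: the class name SteMl, the loaders argument, the predict method, the tf field and selectKeywords are all hypothetical, and the stub loaders stand in for code that would actually read a trained .model file. The interface of ste-statistical.js is not shown here, so nothing below claims to mirror it.

```javascript
// Hypothetical stand-ins for real model loaders. A real loader would read
// the exported .model file at modelPath; these stubs ignore it.
const stubLoaders = {
  svm: (modelPath) => ({ predict: (terms) => terms.map(() => 0.5) }),
  ffnn: (modelPath) => ({ predict: (terms) => terms.map((t) => t.tf) })
}

class SteMl {
  constructor ({ modelType, modelPath, stemmedUniqueTerms }, loaders) {
    if (!(modelType in loaders)) {
      throw new Error(`Unsupported modelType: ${modelType}`)
    }
    this.model = loaders[modelType](modelPath)
    this.terms = stemmedUniqueTerms
  }

  // Returns one probability per term. No thresholding happens here,
  // so every backend exposes the same kind of output.
  getProbabilities () {
    const probabilities = this.model.predict(this.terms)
    return this.terms.map((term, i) => ({
      stemmedTerm: term.stemmedTerm,
      probability: probabilities[i]
    }))
  }
}

// The keyword threshold lives with the caller (for example the benchmark),
// not inside the model.
function selectKeywords (scoredTerms, threshold) {
  return scoredTerms
    .filter((entry) => entry.probability >= threshold)
    .map((entry) => entry.stemmedTerm)
}
```

Keeping the threshold outside the wrapper means the benchmark can sweep over several thresholds without retraining or reloading a model.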

Special chars break tokenization

In this case, the name of the Spanish king's brother-in-law Iñaki Urdangarin was not matched with the image keyword Inaki Urdangarin because the original term was tokenized to I aki Urdangarin. Make sure the tokenizer's split heuristic is aware of such special chars.

Use RegexpTokenizer instead of WordTokenizer:

const Tokenizer = new Natural.WordTokenizer()
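
To illustrate the failure mode without depending on any library, here is a minimal sketch in plain JavaScript. The function names are hypothetical, and asciiTokenize is only an approximation of an ASCII-based split heuristic, not the actual implementation of WordTokenizer. A Unicode-aware pattern of this kind is what a regex-based tokenizer would need to be configured with.

```javascript
// ASCII-only split: every character outside A-Z, a-z, 0-9 and _ is
// treated as a separator, so "ñ" splits the name in two.
function asciiTokenize (text) {
  return text.split(/[^A-Za-z0-9_]+/).filter(Boolean)
}

// Unicode-aware tokenization: normalize to NFC first so that a decomposed
// "n" + combining tilde becomes a single letter, then match runs of
// letters and digits from any script.
function unicodeTokenize (text) {
  return text.normalize('NFC').match(/[\p{L}\p{N}]+/gu) || []
}

// Optional folding step: strip combining marks so that "Iñaki" can be
// compared with a keyword spelled "Inaki".
function foldDiacritics (token) {
  return token.normalize('NFD').replace(/\p{M}/gu, '')
}
```

Note that correct tokenization alone does not make Iñaki equal to the keyword Inaki. Whether folding diacritics before matching is acceptable for this system is an open design choice.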

Exclude illustration and graphics from Getty results

This system is designed for assigning news photography to articles, but Getty's database also includes illustration and infographics. Both go beyond the scope of this system and should therefore be excluded from search results.

See Node API:

https://github.com/gettyimages/gettyimages-api_nodejs/blob/82060b68003ab6be5f37fa12edb32321886514dc/lib/searchimages.js#L129-L132

and API Docs:

http://developers.gettyimages.com/docs/#operation/Search_GetImagesByPhrase

(&graphical_styles=illustration&graphical_styles_filter_type=exclude)
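
As a rough sketch, the two parameters above could be appended to a search query string like this. The function name buildSearchQuery is hypothetical, the phrase parameter name is assumed from the GetImagesByPhrase operation, and only the illustration style is listed because it is the only value given above. How the Node API wrapper exposes these filters is documented at the links above and is not reproduced here.

```javascript
// Builds the query string for an image search that excludes the given
// graphical styles. Repeated keys are used for multiple styles.
function buildSearchQuery (phrase, excludedStyles = ['illustration']) {
  const params = new URLSearchParams()
  params.append('phrase', phrase) // assumed parameter name
  for (const style of excludedStyles) {
    params.append('graphical_styles', style)
  }
  params.append('graphical_styles_filter_type', 'exclude')
  return params.toString()
}
```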

Enable ArticlePreprocessor to exclude stopwords from stemmed terms

Preprocessor.prototype.getStemmedTerms = function (sortFunction) {
    if (this.stemmedUniqueTerms) {
        let terms = []
        for (let term in this.stemmedUniqueTerms) {
            let entry = this.stemmedUniqueTerms[term]
            entry.stemmedTerm = term
            terms.push(entry)
        }
        if (sortFunction) {
            return terms.sort(sortFunction)
        }
        return terms
    } else {
        return []
    }
}

This function returns all the preprocessed terms in an article. Add an argument that specifies whether it should return all terms or exclude stopwords.

E.g.

Preprocessor.prototype.getStemmedTerms = function (sortFunction, excludeStopwords) {
    ...
        if (excludeStopwords) {
            // terms holds entry objects, so compare the stemmedTerm string
            terms = terms.filter(entry => stopwords.indexOf(entry.stemmedTerm) < 0)
        }
        if (sortFunction) {
            return terms.sort(sortFunction)
        }
    ...
}
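
For reference, a self-contained version of this idea might look like the sketch below. The Preprocessor constructor, the STOPWORDS list and the count field are simplified, hypothetical stand-ins. It also assumes that stopwords can be compared directly against stemmed terms, which may not hold if the stemmer alters them. Unlike the original function, this version copies each entry instead of writing stemmedTerm onto the stored object.

```javascript
// Hypothetical stopword list; the real project would supply its own.
const STOPWORDS = new Set(['the', 'and', 'of'])

// Simplified stand-in for the real constructor.
function Preprocessor (stemmedUniqueTerms) {
  this.stemmedUniqueTerms = stemmedUniqueTerms
}

Preprocessor.prototype.getStemmedTerms = function (sortFunction, excludeStopwords) {
  if (!this.stemmedUniqueTerms) {
    return []
  }
  let terms = []
  for (const term of Object.keys(this.stemmedUniqueTerms)) {
    // Copy the entry so the stored object is not mutated.
    terms.push(Object.assign({}, this.stemmedUniqueTerms[term], { stemmedTerm: term }))
  }
  if (excludeStopwords) {
    terms = terms.filter((entry) => !STOPWORDS.has(entry.stemmedTerm))
  }
  return sortFunction ? terms.sort(sortFunction) : terms
}
```

Because excludeStopwords is the second argument and defaults to undefined, existing callers that pass only a sort function keep their current behaviour.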
