Giter VIP home page Giter VIP logo

topn-go's Introduction

Topn-go

Topn implemented by Go.

Topn-go is used to find the top n most frequently occurring urls from a large file. It can process huge files with limited memory efficiently.

Main idea

Preliminary implementation

To handle a large file, we can firstly split it into small files, then deal with them one by one easily. So the pipeline is like this:

  1. Partition a large file into small files that can be placed in memory. (Map phase)
  2. Merge and sort urls in small files, and find the top n candidates from each file. Then sort all candidates and write to file. (Reduce phase)

The Reduce phase is a classic parallel problem. Since the sort operation on small file are independent of each other, we run them in parallel naturally. I leverage min-heap to store the top n candidates from each file, then merge them in another min-heap and write it to file.

So the biggest problem is how to partition file efficiently in Map phase. In the preliminary implementation, i think of the relation between file reader and writers like producer and consumers. The producer reads file line by line, then send the url to corresponding chan in Go-lang. The consumers use select to listen its chan, receive url, and append it to file. It looks a reasonable pattern in this scenario. However, it is inefficient due to the feature of Hard Disk Drive(HDD).

For each read and write to the disk, the latency highly depends on the track location. In the Producer-Consumer pattern, consumers are writing url simultaneously. Every time the data is received, a write operation is triggered. Lots of time is wasted in seeking tracks. So the efficiency is very low.

Optimization

Inspired by Combiner in map-reduce model, we can apply Combiner function that does partial merging of url before it write to the disk. I maintain a buffer map to to keep url pairs, and combine the url pair if they have the same address. When map size reached a certain value, pairs in map will be flush to corresponding file. Experimental results show that partial combining significantly reduces redundant disk IO and speeds up the MapReduce operation.

Run and Test

How to run topn-go:

make test_topn

How to clean up all test data:

make clean

How to generate test data:

make gendata

NOTE: go 1.12 is required

MIT License

topn-go's People

Contributors

hezuojiao avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.