Giter VIP home page Giter VIP logo

datafusion-contest-2022's Introduction

Data Fusion Contest 2022. 1-st place on the Matching Task.

Link to competition

Our solution acheived 1-st place on the private leaderboard.

Overview

Validation

We split train data at 6 folds. So we have about 3000 pairs in fold. This is similar as contest validation and test set.

Final validation quality (with ensemble):

     mean   [values by folds]
pre: 0.5425 [0.549 , 0.533 , 0.546 , 0.541 , 0.541 , 0.545 ]	
mrr: 0.2028 [0.202 , 0.204 , 0.204 , 0.202 , 0.202 , 0.203 ]
r1:  0.2952 [0.296 , 0.295 , 0.297 , 0.294 , 0.294 , 0.295 ]	

Final validation quality (with single model):

     mean   [values by folds]
pre: 0.5025 [0.514 , 0.497 , 0.506 , 0.496 , 0.498 , 0.504 ]	
mrr: 0.1947 [0.194 , 0.198 , 0.194 , 0.194 , 0.193 , 0.195 ]
r1:  0.2808 [0.282 , 0.283 , 0.281 , 0.279 , 0.279 , 0.281 ]

Model

This is neural network based solution. We use pytorch-lifestream library to simplify work with sequences data.

We create neural network to encode transaction sequences and clickstream to vector representation. Matching and ranking based on L2 distance between transaction sequence embedding and clickstream embedding.

Network include two stages:

  • TrxEncoder, which transform each individual transaction or click into vector representation
  • SequenceEncoder, which reduce sequence of vectors from TrxEncoder into one final vector

Training batch contains 128 pairs of bank-rtk users. Short transaction subsequence sampled twice from each bank user and the same for clickstream. So we have 256 trx sequences samples for 128 bank users and 256 click sequences samples for 128 rtk users. Then we make 256*256 matrix with pairwise L2 distances, and the matrix with match-unmatched labels. Labels used to sample positive and negative samples for Softmax Loss.

SequenceEncoder is RNN network, shared for trx and clicks. Weights initialised randomly. TrxEncoder is a mix of embedding for categorical features and normalised numerical. We make TrxEncoder for transactions and TrxEncoder for clicks. We pretrain weights for both TrxEncoder.

We were inspired BERT architecture for pretrain. We use TrxEncoder + TransformerEncoder and MLM (Masked Language Model) task. TransformerEncoder predict masked TrxEncoder vectors.

We use ensembling in final solution. We train 11 models and average paired distances from output.

Timing

We used Tesla V100 with 32Gb. The same settings also works with 16Gb.

Stage Time
Full train time with 5 models in ensemble 142 min
- Data load and preprocessing 8 min
- Pretrain TrxEncoder (both trx and click) 19 min
- Train one SeqEncoder (there are 5 in ensemble) 23 min
Full inference time (2930 bank uids, 2445 rtk uids) 2 min

How to reproduce

Install and use environment

pipenv sync --dev

pipenv shell

Run

# Load data
sh get_data.sh

# Split data
python src/split_folds.py

cd src/

# Pipeline with validation, valid_fold_id from [0, 1, 2, 3, 4, 5] shold be provided as parameter
python nn_train.py valid_fold_id=4
python nn_inference.py valid_fold_id=4
python nn_evaluation.py valid_fold_id=4


# Pipeline without validation, train model before submit
python nn_distance_train.py valid_fold_id=None
# there are no data for validation, all data was in train
# python nn_distance_inference.py

datafusion-contest-2022's People

Contributors

ivkireev86 avatar ovsovnikita avatar dllllb avatar

Stargazers

Artem avatar  avatar  avatar Omar Zoloev avatar  avatar  avatar  avatar Oleg Kulikov avatar Aleksandr Kalashnikov avatar Lev Novitskiy avatar Ivanochkin Alexander avatar WizardofOz avatar  avatar Stanislav avatar  avatar Rinat Mahmutov avatar Dmitry Sokolov avatar  avatar Alexander avatar  avatar Anton Klenitskiy avatar Nikita Romanov avatar Vlad Samarov avatar  avatar Serg Iv avatar  avatar  avatar

Watchers

 avatar

Forkers

dllllb tdl77 tsebaka

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.