Comments (4)
这种问题貌似前人都已经实现了,看这里
http://www.searchtb.com/2011/07/doclist-compress.html?spm=0.0.0.0.LBo3qP
http://www.searchtb.com/2013/04/pfordelta.html
from mdrill.
发现lucene在4.0的版本里提供了这种功能(我们决定采用这种方式,测试了下解压缩速度很理想)
参考地址 http://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings/lucene/src/java/org/apache/lucene/util/pfor2/PForDelta.java
from mdrill.
已发布
from mdrill.
��针对不同的数据列 采用不同的压缩算法,常规列使用PForDelta,像男女,日期采用RepeatCompress
的实现代码如下RepeatCompress
package com.alimama.mdrill.buffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
public class RepeatCompress {
public static class RepeatCompressRtn{
public int[] bytes;
public int index;
}
public static void main(String[] args) {
long t1=System.currentTimeMillis();
int[] test=new int[10240];
for(int j=0;j<test.length;j++)
{
test[j]=j/100;
}
for(int i=0;i<1;i++)
{
RepeatCompressRtn compress2=compress(test, test.length);
int[] compress=new int[compress2.index];
for(int k=0;k<compress.length;k++)
{
compress[k]=compress2.bytes[k];
}
// System.out.println(Arrays.toString(test));
// System.out.println(Arrays.toString(compress));
int[] result=decompress(compress);
System.out.println(Arrays.equals(result, test)+","+compress.length+","+test.length+","+compress.length*100d/test.length);
}
System.out.println(System.currentTimeMillis()-t1);
// long t1=System.currentTimeMillis();
//
// for(int k=0;k<10;k++)
// {
// int[] test=new int[10240];
// for(int j=0;j<test.length;j++)
// {
// test[j]=(int) (100*Math.random());
//
// }
// RepeatCompressRtn rtn=groupNumEncode(test, test.length);
// int[] compress=new int[rtn.index];
// for(int i=0;i<compress.length;i++)
// {
// compress[i]=rtn.bytes[i];
// }
//
//
// int[] result=groupNumDecode(compress);
// }
//
// System.out.println(System.currentTimeMillis()-t1);
}
public static class groupCount implements Comparable<groupCount>{
int cnt;
@Override
public String toString() {
return "[cnt=" + cnt + ", val=" + val + "]";
}
int val;
@Override
public int compareTo(groupCount o) {
return (this.cnt<o.cnt ? 1 : (this.cnt==o.cnt ? 0 : -1));
}
}
private static RepeatCompressRtn groupNumEncode(final int[] inBlock, int blockSize)
{
HashMap<Integer,groupCount> groupCount=new HashMap<Integer,groupCount>(blockSize);
for(int i=0;i<blockSize;i++)
{
int val=inBlock[i];
groupCount g=groupCount.get(val);
if(g==null)
{
g=new groupCount();
g.cnt=0;
g.val=val;
groupCount.put(val,g);
}
g.cnt++;
}
ArrayList<groupCount> sortlist=new ArrayList<RepeatCompress.groupCount>();
sortlist.addAll(groupCount.values());
Collections.sort(sortlist );
HashMap<Integer,Integer> repeat=new HashMap<Integer,Integer>(blockSize);
int[] groupNum=new int[blockSize<<1];
int groupNumIndex=1;
for(groupCount g:sortlist)
{
int groupNumval=groupNumIndex++;
repeat.put(g.val, groupNumval);
groupNum[groupNumval]=g.val;
}
RepeatCompressRtn rtn=new RepeatCompressRtn();
rtn.bytes=new int[blockSize<<2];
rtn.index=0;
for(int i=0;i<blockSize;i++)
{
int val=inBlock[i];
Integer g=repeat.get(val);
rtn.bytes[rtn.index++]=g;
}
for(int i=1;i<groupNumIndex;i++)
{
rtn.bytes[rtn.index++]=groupNum[i];
}
rtn.bytes[rtn.index++]=groupNumIndex-1;
return rtn;
}
private static int[] groupNumDecode(final int[] inBlock)
{
int lastNum=inBlock.length-1;
int[] groupNum=new int[inBlock[lastNum]+1];
int start=lastNum-inBlock[lastNum];
for(int i=start;i<lastNum;i++)
{
groupNum[i-start+1]=inBlock[i];
}
int[] rtn=new int[start];
for(int i=0;i<start;i++)
{
rtn[i]=groupNum[inBlock[i]];
}
return rtn;
}
public static RepeatCompressRtn compress(final int[] inBlock, int blockSize)
{
RepeatCompressRtn group=groupNumEncode(inBlock, blockSize);
// RepeatCompressRtn group=new RepeatCompressRtn();
// group.bytes=inBlock;
// group.index=blockSize;
RepeatCompressRtn rtn=new RepeatCompressRtn();
rtn.bytes=new int[group.index<<2];
rtn.index=0;
int last=Integer.MIN_VALUE;
int cnt=0;
for(int i=0;i<group.index;i++)
{
int val=group.bytes[i];
if(last!=val)
{
rtn.bytes[rtn.index++]=val<<1;
// System.out.println(i+"@"+val+"@"+Arrays.toString(rtn.bytes));
last=val;
cnt=1;
}else{
cnt++;
if(cnt==2)
{
// System.out.println(i+"@"+"22>"+val+"@"+Arrays.toString(rtn.bytes));
rtn.bytes[rtn.index-1]=(val<<1)+1;
rtn.bytes[rtn.index++]=2;
// System.out.println(i+"@"+"44>"+val+"@"+Arrays.toString(rtn.bytes));
}else{
// System.out.println(i+"@"+"33"+val+"@"+Arrays.toString(rtn.bytes));
rtn.bytes[rtn.index-1]++;
}
}
}
rtn.bytes[rtn.index++]=group.index;
return rtn;
}
public static int[] decompress(final int[] compress)
{
int groupsize=compress.length-1;
int[] rtn=new int[compress[groupsize]];
int index=0;
for(int i=0;i<groupsize;)
{
int val=compress[i++];
int num=val>>1;
int type=val-(num<<1);
int cnt=1;
if(type==1)
{
cnt=compress[i++];
}
for(int j=0;j<cnt;j++)
{
rtn[index++]=num;
}
}
return groupNumDecode(rtn);
}
}
from mdrill.
Related Issues (20)
- make index成功,tablelist中能看到有记录,但jdbc查询时返回出错,何解
- 翻页BUG
- 实时部分的内存索引可以多个
- 6万多条记录的merger sort 耗时1秒多 HOT 4
- 非utf8环境下的查询含有中文的列 有BUG
- 如果请求的分区过多,导致传递的shards太长
- mdrill这名 怎么读呢? HOT 1
- distinct count目前存在的问题以及改进思路 HOT 1
- 细节性能优化 HOT 6
- 细节性能优化2 HOT 2
- 数据不准确,怎么搞呀? HOT 1
- 谁能共享下mdrill安装组件包?
- 麻烦 谁有Hadoop-Myeclipse插件 发个!网上找的都不能用啊! HOT 1
- 找不到hbase:0.94-adh3u3.1-cdh4依赖 HOT 2
- 资源列表有几个文档没法下载, 在阿里内网
- 在执行./bluewhale mdrill create ./create.sql时报错
- Report a misuse of ConcurrentHashMap
- 现在aliyun的ADS是基于mdrill的么
- [bug] if语句中的condition恒为true (SameObjEquals)
- [bug] 使用 “==” 比较两个相同的表达式 (EqualToSameExpression)
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from mdrill.