starrocks / starrocks Goto Github PK

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. InfoWorld’s 2023 BOSSIE Award for best open source software.

Home Page: https://starrocks.io

License: Apache License 2.0

CMake 0.25% C++ 44.93% C 0.47% Shell 0.17% Python 0.49% Lex 0.01% Yacc 0.01% Java 53.03% Makefile 0.01% Thrift 0.40% CSS 0.01% JavaScript 0.01% Mustache 0.01% HTML 0.04% ANTLR 0.13% Dockerfile 0.04%

analytics big-data cloudnative database datalake delta-lake distributed-database hudi iceberg join lakehouse lakehouse-platform mpp olap real-time-analytics real-time-updates realtime-database sql star-schema vectorized

starrocks's Introduction

Download | Docs | Benchmarks | Demo

StarRocks, a Linux Foundation project, is the next-generation data platform designed to make data-intensive real-time analytics fast and easy. It delivers query speeds 5 to 10 times faster than other popular solutions. StarRocks can perform real-time analytics well while updating historical records. It can also enhance real-time analytics with historical data from data lakes easily. With StarRocks, you can get rid of the de-normalized tables and get the best performance and flexibility.

Learn more 👉🏻 Introduction to StarRocks

Features

🚀 Native vectorized SQL engine: StarRocks adopts vectorization technology to make full use of the parallel computing power of CPU, achieving sub-second query returns in multi-dimensional analyses, which is 5 to 10 times faster than previous systems.
📊 Standard SQL: StarRocks supports ANSI SQL syntax (fully supported TPC-H and TPC-DS). It is also compatible with the MySQL protocol. Various clients and BI software can be used to access StarRocks.
💡 Smart query optimization: StarRocks can optimize complex queries through CBO (Cost Based Optimizer). With a better execution plan, the data analysis efficiency will be greatly improved.
⚡ Real-time update: The updated model of StarRocks can perform upsert/delete operations according to the primary key, and achieve efficient query while concurrent updates.
🪟 Intelligent materialized view: The materialized view of StarRocks can be automatically updated during the data import and automatically selected when the query is executed.
✨ Querying data in data lakes directly: StarRocks allows direct access to data from Apache Hive™, Apache Iceberg™, and Apache Hudi™ without importing.
🎛️ Resource management: This feature allows StarRocks to limit resource consumption for queries and implement isolation and efficient use of resources among tenants in the same cluster.
💠 Easy to maintain: Simple architecture makes StarRocks easy to deploy, maintain and scale out. StarRocks tunes its query plan agilely, balances the resources when the cluster is scaled in or out, and recovers the data replica under node failure automatically.

Architecture Overview

StarRocks’s streamlined architecture is mainly composed of two modules: Frontend (FE) and Backend (BE). The entire system eliminates single points of failure through seamless and horizontal scaling of FE and BE, as well as replication of metadata and data.

Starting from version 3.0, StarRocks supports a new shared-data architecture, which can provide better scalability and lower costs.

Resources

📚 Read the docs

Section	Description
Deploy	Learn how to run and configure StarRocks.
Articles	How-tos, Tutorials, Best Practices and Architecture Articles.
Docs	Full documentation.
Blogs	StarRocks deep dive and user stories.

❓ Get support

Slack community: join technical discussions, ask questions, and meet other users!
YouTube channel: subscribe to the latest video tutorials and webcasts.
GitHub issues: report an issue with StarRocks.

Contributing to StarRocks

We welcome all kinds of contributions from the community, individuals and partners. We owe our success to your active involvement.

See Contributing.md to get started.
Set up StarRocks development environment:

IDE Setup

Understand our GitHub workflow for opening a pull request; use this PR Template when submitting a pull request.
Pick a good first issue and start contributing.

📝 License: StarRocks is licensed under Apache License 2.0.

👥 Community Membership: Learn more about different contributor roles in StarRocks community.

Used By

This project is used by the following companies. Learn more about their use cases:

starrocks's People

Contributors

Stargazers

Watchers

Forkers

yongbingwang johnsonginati yingtingdong kangkaisen liuyehcf yuchen914 littlebeebee deepthinker666 jackwener reminiscent sakuraluck wz2cool wittech 924060929 yuanjianfaith jerrylen wuxueyang96 tigerfeilin jsnorman ikaruga4600 shunpeizhang jackylee-ch trueeyu 13671653088 francisoliverlee aa8808021 wqwl611 luoyuxia wg1026688210 gavinh1984 pangpangzhao zyclove yunzailin xiaozan-dev hustnn deriskoi linkerist aer1208 xiejiajun sherlockzn zzcclp yuexingri qidaye zbtzbtzbt ielihs xinghuayu007 spongedu happenlee blockspacer coldzoo jshonwang caneguy run-lin kasher001 duoluodexiaokeke tinkerrrr feihengye lakeshen huozhanfeng yusha2016 jaogoy harveysun dixingxing0 wuleistarrocks 2pc evan17 seaven dragonly cygusmile aaaaaaron jaylinyu aload winbillatgithub kioco yubobo alanzhanglei achx90 gavinljj 3aceshowhand b-xiang robincoin gump518 zhanglei miamia0 leiysky yuehan124 drlimd1987 chenyjsr suqcnn symbolss satanson 97090576 smallmartial forestlzj princekellyxyc magmongoing godliness y520y wangshisan dh-cloud

starrocks's Issues

hyperscan thirdparty build should use TP_INSTALL_DIR

if build use default prefix it will cause some permission problem.
and it should use cmake not $CUSTM_CMAKE
then after install and use hs/hs.h to include

Add some comments for deleting old query executor related codes

test starRocks error in Debian 9 linux 4.14

hi, I run cmd sh ./run-ut.sh --be --run in my Debian 9 linux 4.14.
And got the error ./run-ut.sh: 19: set: Illegal option -o pipefail

Rename OlapMeta to KVStore

starrocks/be/src/storage/olap_meta.h

Line 45 in dbb993b

class OlapMeta {

The class OlapMeta is a thin wrapper on RocksDB to provides key-value storage interfaces, the OlapMeta is somehow inaccurate, suggest renaming it to KVStore.

Reduce the possibility of port conflicts and simplify deployment steps

starrocks version: StarRocks-1.18.2
today I deployed a starrocks cluster according to documents.
It failed when I started Be, then I checked the log (log/be.out ):
W0908 16:38:03.955521 11189 task_worker_pool.cpp:1060] Fail to report task to :0, err=-1
E0908 16:38:03.959837 10949 doris_main.cpp:236] Doris Be http service did not start correctly, exiting

I noticed a problem with Be http service，but I still don't know why.
After some efforts, I know that some port in the configuration file conflicted.

I believe this has also troubled many friends. Let's share it here。
I came up with two solutions to make it easier for everyone to use：
1.Before start the script(start_be.sh or start_fe.sh), we first check whether the configured ports have conflicts
We can add the following logic to the startup script：
confdir=cd "$curdir"; pwd/../conf/be.conf
port_detected() {
serviceandport=grep -v ^# $confdir | grep "_port" | sed 's/ //g'
serviceandportarray=(${serviceandport// / })
for pair in ${serviceandportarray[@]}
do
service=echo $pair | awk -F '=' '{print $1}'
port=echo $pair | awk -F '=' '{print $2}'
res=netstat -anp | grep $port
if [ ! -z "$res" ]; then
echo "port $port already in use ! "
echo "Please check the configuration $service in $confdir"
exit 1
fi
done
}

2.Generate more log information in the doris_main.cpp

Should not treat Status::EndOfFile as an error

Should not print an error log if the status is EndOfFile In the following line:

starrocks/be/src/exec/vectorized/olap_scan_node.cpp

Line 256 in 1817cb8

QUERY_LOG(ERROR) << status;

librdkafka memory leak in old version

When using lz4 compress, there may be exist memory leak or memory usage in old version of librdkafka.
Should update to be a new version

confluentinc/librdkafka@3cf6848
confluentinc/librdkafka@4970df4

Missing Code Style Guidelines

No code style guidelines in CONTRIBUTING.md

Deploy.md mkdir -p meta or mkdir -p doris-meta

When I follow the documentation to install，I encountered this error when I started FE。

2021-09-08 14:47:18,863 ERROR (main|1) [Catalog.initialize():774] /home/starrock/StarRocks-1.18.2/fe/doris-meta does not exist, 
will exit
2021-09-08 14:47:31,350 ERROR (main|1) [Catalog.initialize():774] /home/starrock/StarRocks-1.18.2/fe/doris-meta does not exist, 
will exit

The second step in the documentation is mkdir -p meta,There should be something wrong with the documentation

Reduce the number of disk accesses while parsing a segment footer

In the current implementation, parsing a segment footer needs two disk accesses:

starrocks/be/src/storage/rowset/segment_v2/segment.cpp

Line 172 in 97daf1d

RETURN_IF_ERROR(rblock->read(file_size - 12, Slice(fixed_buf, 12)));

starrocks/be/src/storage/rowset/segment_v2/segment.cpp

Line 187 in 97daf1d

RETURN_IF_ERROR(rblock->read(file_size - 12 - footer_length, footer_buf));

If there are a large number of segment files, this will hurt the performance significantly. I have written a demo program to reduce the number of I/Os by reading more data at once, and the query results improved from 47s to 31 seconds. However, since the size of the footer can vary over a wide range, it is difficult to decide how much data is appropriate to read at once

Reduce memory usage of rowset meta

In the current implementation, once you opened a segment file, the path of the segment file will be saved in a lot of places:

starrocks/be/src/storage/rowset/segment_v2/segment.h

Line 155 in 7068a63

std::string _fname;

starrocks/be/src/storage/rowset/segment_v2/column_reader.h

Line 236 in 7068a63

const std::string& _file_name;

starrocks/be/src/storage/rowset/segment_v2/indexed_column_reader.h

Line 140 in 7068a63

std::string _file_name;

If a table has many columns or a table has a lot of segment files, this will consume a lot of memory.

Replace OLAPStatus with Status

Status is used in StarRocks to report success and various kinds of errors, but some legacy code still using the deprecated OLAPStatus as the return value, should replace them with Status and remove OLAPStatus finally.

undefined symbol for some function when sue old query engine

mysql> select * from test_basic where substr(id_varchar, 1, 2) = 'k';
ERROR 1064 (HY000): Unable to find _ZN9starrocks15StringFunctions9substringEPN9starrocks_udf15FunctionContextERKNS1_9StringValERKNS1_6IntValES9_
dlerror: /home/disk1/kks/starrocks/output/be/lib/starrocks_be: undefined symbol: _ZN9starrocks15StringFunctions9substringEPN9starrocks_udf15FunctionContextERKNS1_9StringValERKNS1_6IntValES9_

Deploy.md SHOW PROC '/backends'\\G And SHOW PROC '/frontends'\\G ERROR

When I follow the documentation to install , mysql client cannot parse when I use the following command

SHOW PROC '/frontends'\\G
ERROR: Unknown command '\\'.
-> ;
ERROR 1064 (HY000): Please check your sql, we meet an error when parsing.

But this command can be executed

show backends;
+-----------+-----------------+-----------------+---------------+--------+----------+----------+---------------------+-------------- 
-------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+------- 
--+----------------+--------+----------------+--------------------------------------------------------+
| BackendId | Cluster         | IP              | HeartbeatPort | BePort | HttpPort | BrpcPort | LastStartTime       | LastHeartbeat       | 
Alive | SystemDecommissioned | ClusterDecommissioned | TabletNum | DataUsedCapacity | AvailCapacity | TotalCapacity | 
UsedPct | MaxDiskUsedPct | ErrMsg | Version        | Status                                                 |
+-----------+-----------------+-----------------+---------------+--------+----------+----------+---------------------+-------------- 
-------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+------- 
--+----------------+--------+----------------+--------------------------------------------------------+
| 10002     | default_cluster | 192.168.134.128 | 9050          | 9060   | 8040     | 8060     | 2021-09-08 15:01:37 | 2021-09-08 
15:13:47 | true  | false                | false                 | 10        | .000             | 1.364 GB      | 16.986 GB     | 91.97 % | 91.97 %        |        
| 1.18.2-caa8b52 | {"lastSuccessReportTabletsTime":"2021-09-08 15:13:37"} |

And so does this command

SHOW PROC '/frontends'\\G

compatible with bitmap/hll/percentile append_value_multiple_times usage by using append_strings instead in default value column iterator.

As I mentioned in #9, if cast raw data directly as follows,

    _pool.emplace_back(*reinterpret_cast<T*>(slice->data));

will cause crash, as illustrated below

create database alter_table_test_db_1630980262173;

use alter_table_test_db_1630980262173;

CREATE TABLE `aggregate_table_with_null` ( `k1` date, `k2` datetime, `k3` char(20), `k4` varchar(20), `k5` boolean, `v1` tinyint sum, `v2` smallint sum, `v3` int sum, `v4` bigint max, `v5` largeint max, `v6` float min, `v7` double min, `v8` decimal(27,9) sum ) ENGINE=OLAP AGGREGATE KEY(`k1`, `k2`, `k3`, `k4`, `k5`) COMMENT "OLAP" DISTRIBUTED BY HASH(`k1`, `k2`, `k3`, `k4`, `k5`) BUCKETS 3 PROPERTIES ( "replication_num" = "1", "storage_format" = "v2" );

$ curl --location-trusted -u root: -T /home/disk1/jenkins/workspace/doris_daily_build/dorisdb-test/lib/../common/data/basic_types_data -XPUT -H label:stream_load_aggregate_table_with_null_1630980262 -H column_separator:	 http://172.26.92.141:8034/api/alter_table_test_db_1630980262173/aggregate_table_with_null/_stream_load

select sum(cast(v1 as int)), sum(v2), sum(v3), max(v4), max(v5), min(v6), min(v7), sum(v8) from aggregate_table_with_null where k1 = '2020-06-23' and k2 <= '2020-06-23 18:11:00';

ALTER TABLE aggregate_table_with_null ADD COLUMN add_key BITMAP BITMAP_UNION AFTER v8;

$ curl --location-trusted -u root: -T /home/disk1/jenkins/workspace/doris_daily_build/dorisdb-test/lib/../common/data/basic_types_data -XPUT -H label:stream_load_1630980262326 -H column_separator:	 http://172.26.92.141:8034/api/alter_table_test_db_1630980262173/aggregate_table_with_null/_stream_load

SHOW COLUMNS FROM aggregate_table_with_null
select sum(cast(v1 as int)), sum(v2), sum(v3), max(v4), max(v5), min(v6), min(v7), sum(v8) from aggregate_table_with_null where k1 = '2020-06-23' and k2 <= '2020-06-23 18:11:00';

$ curl --location-trusted -u root: -T /home/disk1/jenkins/workspace/doris_daily_build/dorisdb-test/lib/../case/system_case/test_ddl/test_alter_table/basic_types_data_add_key_bitmap -XPUT -H label:stream_load_1630980290588 -H columns:k1,k2,k3,k4,k5,v1,v2,v3,v4,v5,v6,v7,v8,add_key=to_bitmap(v1) -H column_separator:	 http://172.26.92.141:8034/api/alter_table_test_db_1630980262173/aggregate_table_with_null/_stream_load

select count(distinct add_key) from aggregate_table_with_null;

crash:
*** Aborted at 1630980290 (unix time) try "date -d @1630980290" if you are using GNU date ***
PC: @          0x14e7794 std::_Rb_tree<>::_M_copy<>()
*** SIGSEGV (@0xffffffffffffff48) received by PID 154029 (TID 0x7fa598bbb700) from PID 18446744073709551432; stack trace: ***
    @          0x30ca922 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fa5be00b630 (unknown)
    @          0x14e7794 std::_Rb_tree<>::_M_copy<>()
    @          0x14f3886 starrocks::vectorized::ObjectColumn<>::append_value_multiple_times()
    @          0x14da184 starrocks::vectorized::NullableColumn::append_value_multiple_times()
    @          0x2ae0a6d starrocks::segment_v2::DefaultValueColumnIterator::next_batch()
    @          0x2a2bfa9 starrocks::vectorized::SegmentIterator::_do_get_next()
    @          0x2a302d1 starrocks::vectorized::SegmentIterator::do_get_next()
    @          0x1e06f07 starrocks::SegmentIteratorWrapper::do_get_next()
    @          0x1bb0cab starrocks::vectorized::TimedChunkIterator::do_get_next()
    @          0x1bfe695 starrocks::vectorized::UnionIterator::do_get_next()
    @          0x1bb0cab starrocks::vectorized::TimedChunkIterator::do_get_next()
    @          0x1c3205d starrocks::vectorized::AggregateIterator::do_get_next()
    @          0x1bb0cab starrocks::vectorized::TimedChunkIterator::do_get_next()
    @          0x1bf1c0a starrocks::vectorized::Reader::do_get_next()
    @          0x2651fe4 starrocks::vectorized::OlapScanner::get_chunk()
    @          0x249352c starrocks::vectorized::OlapScanNode::_scanner_thread()
    @          0x1e34077 starrocks::PriorityThreadPool::work_thread()
    @          0x307aa77 thread_proxy
    @     0x7fa5be003ea5 start_thread
    @     0x7fa5bc36e8dd __clone

this can be compatible with bitmap/hll/percentile append_value_multiple_times usage by using append_strings instead in default value column iterator.

No such file or directory msg when firstly start FE

bin/start_fe.sh: line 81: /home/disk1/kks/starrocks/output/fe/log/fe.out: No such file or directory
bin/start_fe.sh: line 82: /home/disk1/kks/starrocks/output/fe/log/fe.out: No such file or directory

Please keep the NOTICE file of the Apache Doris project

Hello, I am Ming Wen, Mentor of Apache Doris(Incubating), Member of Apache Software Foundation.
StarRocks is forked from Apache Doris(Incubating), so you need to follow the Apache 2.0 License and keep the complete NOTICE file.

After copying the data from src to dst, the null flag should be set to false.

There is no corresponding setting for AggregateFuncTraits of VARCHAR type.
#113

the deploy is broken

readme.md install章节，deploy链接404

[Bug] ES-Extern Table couldn't report error when the value from ES table could't convert to Schema Type

Steps to reproduce the behavior (Required)

First we create a ES Extern Table

CREATE EXTERNAL TABLE `but_es_extern_table` (
  `key1` int(11) NOT NULL COMMENT "",
  `key2` int(11) NOT NULL COMMENT "",
) ENGINE=ELASTICSEARCH 
COMMENT "ELASTICSEARCH"
PROPERTIES (
"hosts" = "$ES_HOST",
"user" = "root",
"password" = "",
"index" = "doris_external_data_not_null",
"type" = "_doc",
"transport" = "http",
"enable_docvalue_scan" = "true",
"max_docvalue_fields" = "20",
"enable_keyword_sniff" = "true"
);

The Value key2 in ES is "Hello World"

select key2 from but_es_extern_table where key1 = 1;

Expected behavior (Required)

report a ERROR

Real behavior (Required)

StarRocks version (Required)

trunk-c6dc14045e

cannot access Deploy.md

"For detailed instructions, please refer to deploy."
deploy links to https://github.com/StarRocks/docs/blob/master/quick_start/Deploy.md, which is 404

Add some comments for deleting old query optimizer related codes

baidu/palo仓库改名为StarRocks / starrocks吗

baidu/palo我看说明

下载说明
Palo 发行版为基于 Apache Doris 官方 Release 版本的 3 位迭代版本(tags)。包括快速的 Bug 修复和新功能更新。

是改了名字为StarRocks / starrocks吗？

Loading progress does not update

The loading progress in the output of command show load does not update until the load job is finished.

[Optimizer] Pushdown equivalence predicate for outer/semi join

select join1.id from join1 left join join2 on join1.id = join2.id where join1.id > 1;

We can know join2.id > 1;, and can push down to join2 scan node

pom.xml dependencies are obsolete: sleepycat version 7.3.7 not found

Can I use this version ：com.sleepycat je-18.3.12
maven repository not found sleepycat version 7.3.7

Compiling StarRocks supports docker

目前发现只能在真实环境编译，真实环境依赖关系比较复杂，搭建编译环境成为一件比较头痛的事情，希望可以做一个docker版本的编译环境，能够更加的让小伙伴们都参与到源代码修改上来。

using a cast expression when creating a materialized view may result in a be null pointer error

as illustrated below

-- create table
CREATE TABLE `test`.`demo_spark_load_tbl` (
`k1`  varchar(50) NULL  COMMENT "",
`v1`  String      NULL  COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`k1` )
COMMENT "OLAP"
DISTRIBUTED BY HASH(`v1` ) BUCKETS 3
PROPERTIES (
"replication_num" = "1",
"in_memory" = "false",
"storage_format" = "DEFAULT"
);

-- create materialized views
create materialized view groupk1
as
select k1, sum(cast(v1 as double))
from demo_spark_load_tbl
group by k1 
order by k1;

this will cause a null point error in be

crash
cProgram terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000003ab9a36 in doris::AggregateInfo::update (this=0x0, dst=0x7fcf215281801 src=..., mem_pool=0x0) at /home/disk2/StarRocks/be/src/olap/aggregate_func.h:69 69 _update_fn(dst, src, mem_pool);
[Current thread is 1 (Thread 0x7fcf2152e700 (LWP 163957))] 
(gdb) bt
#0 0x0000000003ab9a36 in doris::AggregateInfo::update (this=0x0, dst=0x7fcf21528180, src=..., mem_pool=0x0) at /home/disk2/StarRocks/be/src/olap/aggregate_func.h:69
 #1 0x0000000003ab9ab6 xllb04bdx/fcf21528180, src=..., mem_pool=0x0) at /home/disk2/StarRocks/be/src/olap/field.h:94
#2 0x0000000004el05dc in doris::agg_update_row<doris::RowCursor, doris::RowCursor> (dst=0x7fcf21528200, src=..., mem_pool=0x0) at /home/disk2/StarRocks/be/src/olap/row.h:155
#3 0x0000000004e03fa0 in doris::RowBlockMerger::merge (this=0x7fcf215284a0, row_block_arr=..., rowset_writer=0xb217c00, merged_rows=0x7fcf215284e8) at /home/disk2/StarRocks/be/src/olap/schema_change.cpp:964
#4 0x0000000004e07853 in doris::SchemaChangeWithSorting::_internal_sorting (this=0xllafe460, row_block_arr=..., version=..., version_hash=8594233787343111260, new_tablet=...,
new_rowset_type=doris::BETA_ROWSET, segments_overlap=doris::NONOVERLAPPING, rowset=0x7fcf215286e0) at /home/disk2/StarRocks/be/src/olap/schema_change.cpp:1453
#5 0x0000000004e068b7 in doris::SchemaChangeWithSorting::process (this=0xllafe460, rowset_reader=..., new_rowset_writer=0xb217a40, new_tablet=..., base_tablet=...)
at /home/disk2/StarRocks/be/src/olap/schema_change・cpp:1356
#6 0x0000000004e0a9f6 in doris::SchemaChangeHandler::_convert_historical_rowsets (sc_params=...) at /home/disk2/StarRocks/be/src/olap/schema_change.cpp:1859
#7 0x0000000004e096fd in doris::SchemaChangeHandler::_do_process_alter_tablet_v2_normal (this=0x7fcf215296f0, request=..., base_tablet=..., new_tablet=...)
at /home/disk2/StarRocks/be/src/olap/schema_change.cpp:1724
#8 0x0000000004e08833 in doris::SchemaChangeHandler::_do_process_alter_tablet_v2 (this=0x7fcf215296f0, request=...) at /home/disk2/StarRocks/be/src/olap/schema_change.cpp:1568 in 
#9 0x0000000004e0815b doris::SchemaChangeHandler::process_alter_tablet_v2 (this=0x7fcf215296f0, request=...) at /home/disk2/StarRocks/be/src/olap/schema_change.cpp:1506
#10 0x0000000004d62dce in doris::EngineAlterTabletTask::execute (this=0x7fcf215299b0) at /home/disk2/StarRocks/be/src/olap/task/engine_alter_tablet_task.cpp:43
#11 0x000000000384f896 in doris::StorageEngine::execute_task (this=0xa638840, task=0x7fcf215299b0) at /home/disk2/StarRocks/be/src/olap/storage_engine.cpp:916
#12 0x0000000004263abb in doris::TaskWorkerPool::_alter_tablet (this=0xa63a5a0, worker_pool_this=0xa63a5a0, agent_task_req=..., signature=10168, task_type=doris::TTaskType::ALTER, finish_task_request= /home/disk2/StarRocks/be/src/agent/task_worker_pool.cpp:508

The reason is that fe ignore cast when creating the materialized view. So be will try to perform sum aggregation for k1 whose type is string, but be does not support sum aggregation for string types

Missing pull request template

With the pull request template, we can encourage contributors to include specific, structured information.

pull request template

Show tablet detail failed

Show tablet detail using the DetailCmd from show tablet tablet_id throws NullPointerException.

cpu memory and network cost weights

public static double getRealCost(CostEstimate costEstimate) {
double cpuCostWeight = 0.5;
double memoryCostWeight = 2;
double networkCostWeight = 1.5;
return costEstimate.getCpuCost() * cpuCostWeight +
costEstimate.getMemoryCost() * memoryCostWeight +
costEstimate.getNetworkCost() * networkCostWeight;
}

why memory weights most in starrocks' cost model?

ASAN Crash for SQL: select * from t0 where a (abs(1) in (1, 2));

SQL will cause crash in ASAN mode

select * from t0 where a (abs(1) in (1, 2));

Forgot to remove the skipped tablet from the queue of trash sweep

Should rewrite the following line of code to it = _shutdown_tablets.erase(it);

starrocks/be/src/storage/tablet_manager.cpp

Line 1047 in 97dc139

++it;

deploy need centos system, is ubuntu ok?

The be number is not calculated during broadcast estimation

The number of Be is not taken into account when determining whether to enable BROADCAST_ROW_LIMIT

Doc Error：Replica Management 404

link : https://docs.dorisdb.com/zh-cn/master/administration/Rplica
404 error

Improve metrics API of FE

Currently FE's metrics API (http://fe_host:fe_port/metrics) return metrics like this:

...
# HELP starrocks_fe_job job statistics
# TYPE starrocks_fe_job gauge
starrocks_fe_job{job="load", type="HADOOP", state="UNKNOWN"} 0
# HELP starrocks_fe_job job statistics
# TYPE starrocks_fe_job gauge
starrocks_fe_job{job="load", type="HADOOP", state="PENDING"} 0
# HELP starrocks_fe_job job statistics
# TYPE starrocks_fe_job gauge
starrocks_fe_job{job="load", type="HADOOP", state="ETL"} 0
...

And we encountered an exception when using prometheus's text-parser to parse the metrics text:

text format parsing error in line 40: second HELP line for metric name "starrocks_fe_job"

StarRocks should follow the format definition of prometheus and return metrics like this:

...
# HELP starrocks_fe_job job statistics
# TYPE starrocks_fe_job gauge
starrocks_fe_job{job="load", type="HADOOP", state="UNKNOWN"} 0
starrocks_fe_job{job="load", type="HADOOP", state="PENDING"} 0
starrocks_fe_job{job="load", type="HADOOP", state="ETL"} 0
...

I'll submit a PR to improve it.

Support show load for insert 0 row

When I do the insert statement，If the amount of data inserted is 0 rows，show load can not work well for this operation。

select * from detail;
+---------------------+------------+---------+-------------+---------+
| event_time          | event_type | user_id | device_code | channel |
+---------------------+------------+---------+-------------+---------+
| 2021-01-01 00:00:00 |          1 |       2 |           3 |       4 |
+---------------------+------------+---------+-------------+---------+
insert into detail with label `test-0` select * from detail where user_id < 2;
Query OK, 0 rows affected (0.02 sec)

show load where label ='test-0';
Empty set (0.00 sec)


insert into detail  with label `test-1` select * from detail where user_id = 2;
Query OK, 2 rows affected (0.04 sec)
{'label':'test-1', 'status':'VISIBLE', 'txnId':'1006'}

show load where label ='test-1';
| 30029 | test-1 | FINISHED | ETL:100%; LOAD:100% | INSERT | NULL    | cluster:N/A; timeout(s):3600; max_filter_ratio:0.0 | NULL     
| 2021-09-08 11:50:07 | 2021-09-08 11:50:07 | 2021-09-08 11:50:07 | 2021-09-08 11:50:07 | 2021-09-08 11:50:07 |      | 
{"Unfinished backends":{},"ScannedRows":0,"TaskNumber":0,"All backends":{},"FileNumber":0,"FileSize":0} |

[Optimizer] Optimize duplicate aggregation nodes

Like：

select distinct v2 from (select distinct v2 from t0) as x;
select SUM(v2) from (select SUM(v2) as v2 from t0 group by v1, v3) as x group by v1;

HashJoin adds the defense that the total length of the string column in the right table exceeds the range of uint32_t

Currently, in order to achieve simplicity, HashJoinNode uses BigChunk,
Splice the Chunks from Scan on the right table into a big Chunk
In some scenarios, such as when the left and right tables are selected incorrectly or when the large table is joined, the (BinaryColumn) in the Chunk exceeds the range of uint32_t, which will cause the output of wrong data.
Currently, a defense needs to be added. After a better solution is available, the BigChunk mechanism can be removed.

[Bug] Explain verbose plan for hdfsnode has redundant information

cardinality is redundant

[BUG] BE DataStreamSender cannot report detailed error information in case of a compress error

case : bitmap column

CREATE TABLE `bitmap_64` (
  `k` int(11) NULL COMMENT "",
  `v` bitmap BITMAP_UNION NULL COMMENT ""
) ENGINE=OLAP 
AGGREGATE KEY(`k`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`k`) BUCKETS 1 
PROPERTIES (
"replication_num" = "1",
"in_memory" = "false",
"storage_format" = "DEFAULT"
);

data:

10 lines
k is from 1 - 10
v is bitmap column each bitmap is about 10000000 items. The data is evenly distributed between 0 and 2^64

execute sql

 select bitmap_union_count(v) from bitmap_64;

result:

    ERROR 1064 (HY000): Internal_error

except:

    ERROR 1064 (HY000): The input size for compression should be less than 2113929216

Copy From Apache Doris！

link-1:
https://mp.weixin.qq.com/s?__biz=Mzg5MDEyODc1OA==&mid=2247498504&idx=1&sn=d6723f7d4fb92f45060c2e7311c2633f
link-2:
#127

相信这两天很多社区小伙伴都看到 StarRocks 所谓”开源“的动态了，开源用户群里有很多小伙伴在讨论，也有很多关心 Apache Doris 的朋友来问我们，诸如“如何看待 StarRocks ‘开源' ”、” Apache Doris 跟 StarRocks 是什么关系“、”社区分化的原因是什么“、“为什么 StarRocks 不回馈给 Apache Doris ”的问题。

作为 Apache Doris 主要维护团队，我们觉得有必要给大家澄清一些事情。

关于 Apache Doris 和 DorisDB、StarRocks 的关系

Apache Doris 的前世今生相信很多同学都有些许了解，之前在公众号里有过历史文章阐明关系，在 Apache Doris X Apache Pulsar 联合 Meetup 上也做过题为 “ Apache Doris 的过去、现在和未来 ”的分享。

Doris 最早是解决百度凤巢统计报表的专用系统，随着百度业务的飞速发展对系统进行了多次迭代，逐渐承担起百度内部业务的统计报表和多维分析需求。2013 年，我们把 Doris 进行了 MPP 框架的升级，并将新系统命名为 Palo ，2017 年我们以百度 Palo 的名字在 GitHub 上进行了开源，2018 年贡献给 Apache 基金会时，由于与国外数据库厂商重名，因此选择用回最初的名字，这就是 Apache Doris 的由来。

那么 StarRocks 以及 DorisDB 是什么？

2020 年 2 月，百度 Doris 团队的个别同学离职创业，基于 Apache Doris 之前的版本做了自己的商业化闭源产品 DorisDB ，这就是 StarRocks 的前身。

关于社区分化的原因
按照 Apache License，基于开源产品进行商业化是被允许的。所以我们初期是希望能共同建设 Apache Doris 社区的，个人在职业上的选择与社区无关。在开源社区，每个人的社区身份都是被认可的。

后来我们发现，事情发展与我们的预期背道而驰。

比如 DorisDB 团队在对外宣传时，会宣称自己“是 Apache Doris 的主创团队”、“ Apache Doris 的核心开发人员大部分在任职”等诸类话术。

实际上， GitHub 上公开的数据显示，Apache Doris 贡献代码前三的 Contributor 全部在百度 Doris 团队就职，不知所谓的“大部分”和“主创”从何说起。

最近一年，提交 Commits 数量前二十的 Contributor 中，有一半来自百度 Doris 团队，另一半来自小米、美团、字节跳动、蜀海、网易等 Apache Doris 的开源用户，在此也对所有的 Contributors 表示由衷地感激。

而唯一一个 DorisDB 的 Contributor ，入职 DorisDB 时间为 2021 年 8 月 27 日。没错，入职 DorisDB 快两周了，之前在百度 Doris 团队。

实际上，从 2020 年初起， DorisDB 团队几乎没有向 Apache Doris 提交过一行代码。少部分开发者原本是 Apache Doris 的 Contributor ，在加入 DorisDB 团队后，同样不再向 Apache Doris 贡献一行代码。

比如 DorisDB 团队在人员扩张时，会故意定向挖 Apache Doris 企业用户的员工。开源社区的发展离不开用户的支持，挖用户墙角更无异于自掘坟墓。对于员工个人主动的选择我们不去评判，但这让企业用户对自己员工的培养做了嫁衣。而短视的人是不会看到这些的，更认为与他们毫无关系， Apache Doris 的死活与他们无关，只要自己能招到人就行。

比如 DorisDB 的商标问题，从品牌角度来说，开源项目与商业化产品的品牌必须存在区分度，比如 Linux 和 RedHat 、 Hadoop 与 Cloudera 、Apache Kylin 和 Kyligence 。

而 DorisDB 和 Apache Doris ，相信很多开源用户在初次接触 Doris 的时候都会迷惑这两个产品的区别是什么，甚至以为是同一个产品。这也是 DorisDB 的目的所在，品牌上的混淆可以带来用户流量，这就够了。而 Apache 基金会对此事件有过多次发声， DorisDB 及其团队不管不问，企图继续混淆视听，直到最后在 Apache 基金会的压力下，才不得不通过所谓的“开源”来更名。

比如所谓的“致 Clickhouse 的一封信”。Apache Doris 与 Clickhouse 都是 MPP 数据库领域的优秀产品，各自擅长的领域或适用的场景存在差异，所有用户可以基于技术认知和业务需求来抉择到底该选择哪一款产品，甚至在大多场景里两者是可以并存和相互补足的。

Apache Doris不会、也十分不认可，通过贬低 Clickhouse 来达到推广自己的目的，这与开源的精神十分不符。而 DorisDB 选择向 Clickhouse 开战的行为，也使 Apache Doris 承受了许多本不应该由我们承担的骂名和非议。

比如 Apache Doris 的向量化执行引擎，本来至少提前一个季度就可以与用户们见面。DorisDB 已经有接近两年没有参与过一次社区讨论，唯独在我们把向量化引擎的代码提交 PR 并发起 Veto 这一关键的时间点，给了唯一的 -1 。DorisDB 给 -1 的理由我想不言而喻，无非是为了自己的商业化利益来阻拦社区的关键发展。

尽管无意义的 -1 可以忽视，但我们仍遵守社区规范，这无疑带来了我们许多额外的工作量，也打乱了我们原定的发版节奏。不过幸好最晚 9 月中旬，我们自己的向量化引擎就会提交到社区了，欢迎所有小伙伴关注。

………

诸如此类的事情日积月累，我们明白其实社区的分化已经无可避免。作为 Apache Doris 的维护团队，我们其实不愿意面对这样的局面，但当少数人想要凌驾于社区规则之上并持续向社区吸血时，附骨之蛆不要也罢。

关于如何看待 StarRocks “开源”
两个方面来看。

对于改名。

从 2021 年下半年开始，我们就在努力地筹备 Apache Doris 毕业的事宜，横在我们面前的阻碍，其中最重要的事情之一就是 DorisDB 对 Apache Doris 的品牌侵权问题。

因为他们最初将产品命名为 DorisDB 就受到了 Apache 基金会的质疑，进而阻碍了 Apache Doris 的毕业进程，也给 Apache Doris 社区带来了困扰。最终在 Apache 基金会的施压和我们的抗议下，不得已作出了改名的行为。

改名对我们来说，意味着 Branding 问题不复存在，意味着扫清了毕业路上最大的一个障碍，我们也会继续尽全力投入在 Apache Doris 的毕业筹备工作上。

对于“开源”

注意“开源”这个地方打了引号，其实可以给大家科普一下开源许可协议的背景和差异。

开源促进组织OSI （Open Source Initiative，也被译为开放源代码促进会，官网地址 https://opensource.org/ ）是一个推动开源软件发展的非盈利组织。OSI 定义了近百种开源协议，这已经成为了开源协议的事实标准。换句话说，不被 OSI 认可就不是开源。

Apache License 2.0 作为最主流的开源协议，被 OSI 认定为“受欢迎且被广泛使用或具有强大社区的许可证“（The following OSI-approved licenses are popular, widely used, or have strong communities）。

有关 Apache License 2.0 的具体内容，可以在 Apache 官网（ http://www.apache.org/licenses/LICENSE-2.0 ）查阅。简单来说，分发完全自由、允许项目代码被修改、允许作为开源或商业化软件再次发布，一旦授权永久有效，在修改代码或有源代码衍生的代码中需要带有原来代码的协议、专利声明等。这是对于任何商业化公司和开源用户都极其友好的协议，而 Apache Doris 作为 Apache 基金会的项目，遵守的就是 Apache License 2.0。

StarRocks 呢？遵守的是 Elastic License 2.0。

Elastic 和 AWS 的纷争我们不再去重提。Elastic 修改开源协议为 SSPL 和 Elastic License 双许可，其本意是保护自身原厂的权益，要求未对项目作出贡献的情况下不得发布自己的开源及商业化产品。StarRocks Fork 的是 Apache Doris ，这身份已经本末倒置了，请问 DorisDB 是否有回馈上游？这不仅不遵守基本的 Upstream First 原则，还声称要保护 StarRocks 的知识产权，这也是十分双标了。

再者，Elastic License 的内容如果有心去看，可以发现有“不得将产品作为托管服务提供给其他人”、“不得规避许可证密钥功能或删除/隐藏受许可证密钥保护的功能”、“不能更改许可证”等条件，简单来说就是，你想在云上用？不行！如果有些商业化功能我屏蔽了不让你用，你想不花钱就用？不行！你只能遵守我的协议，如果分发后想换成别的协议？不行。

种种限定的孰是孰非我们不去细究，至少 OSI 不认可 Elastic 协议，认定其是“伪开源协议”，充其量叫“源代码可获取”。不信？可以去看看 Apache Skywalking 是如何回应 Elasticsearch 修改开源协议的。

为什么 StarRocks 不回馈给 Apache Doris ？
说完了如何看待 StarRocks “开源”，再说下他们为什么不回馈上游的 Apache Doris。

我们之前一直是希望能够共同建设 Apache Doris 社区的，至少在他们所谓的“开源”之前，但是多次沟通没有任何结果。

至于 StarRocks “开源”之后，那就更不可能了，因为 StarRocks 选择的Elastic License 开源协议不被认可， StarRocks 的代码也无法被引入到任何 Apache 以及所有 OSI 认可的项目中，这包括了大数据领域几乎所有耳熟能详的项目，诸如 Hive、Spark、Flink、Pulsar、Kafka、Impala、Kylin、Clickhouse、Hudi 等等。换句话说，基本上已经被主流的开源大数据项目隔绝开来。当然，Apache Doris 也引入不了任何 StarRocks 的代码，这也是 StarRocks 选择 Elastic License 开源协议的出发点之一。

作为 Fork 自 Apache Doris 的项目，StarRocks 开出新的分支后不仅分裂了社区，而且不回馈上游，甚至还修改了开源协议、誓要与上游彻底决裂，这种所谓的 “开源”行为本身就很违背开源精神。

诚然，法无禁止即可为。Apache License 上约束不了这样的恶意行为，但大多社区用户是知道的，真正理解开源的人也是知道的，到底有何是不可为的。

总结一下
OK，前面讲的东西比较多，大家消化消化，话题收敛。

谴责也好，澄清也好，抱怨也好，在既定的事实面前都没有任何作用，说点实际一些的。

这个时代不会阻止任何一朵星光闪耀，每个人都有自己的星辰大海。

只不过未来各自朝着各自的启明星和灯塔航行而已。

祝未来一切都好，毕竟 StarRocks 飞得再高，也是站在了巨人肩膀上。

话题回到 Apache Doris。

作为 Apache Doris 的创始和维护团队，我们一直坚持在做，也会持续去做，会更加开放地拥抱开源，不会因为任何人的任何行为改变自己的方向。

我们很清楚接下来该如何开展后续工作，将会持续在社区用户的支持、产品性能的增强以及功能边界的拓展上投入更多人力和精力。我们希望能帮助更多社区用户，解决数据分析的痛点与难题，这是Apache Doris 开源之初的愿景，一路走来，从未改变。我们也希望，能让 Apache Doris 进一步完善，向世界顶级的分析型数据库产品之路上再进一步。

说几点我们最近在做的事情吧。

Apache Doris 向量化执行引擎， 9 月底会出第一版，预计在单表上性能会有数倍提升，年底向量化执行引擎就会全面推出，敬请期待。

Doris Manager 可视化运维监控平台，我们也会尽快推出并开源，让运维更便捷和简易。

还有很多事情都在之前的开发者会议上提到了，可以查看之前的开发者会议总结社区活动｜ Apache Doris 社区开发者会议总结

社区活动方面，不管是 SIG 、征文活动、开发者会议还是即将举办的 Meetup ，希望大家多多参与，希望每个人都能在社区获得成长、有更多收获。

最后，感谢陪伴。

以上。

Bothering open source community problems...

Hi StarRocks developers!
I'm an average developer from the open source community. Just before I am about to take a look at those issues and try to make some contribution, I found a very confusing article which seems to point out some very bad open source behaviors.
I just want to make things clear, and everyone believe in the spirit of open source happy. Thanks!

ref: https://mp.weixin.qq.com/s?__biz=Mzg5MDEyODc1OA==&mid=2247498504&idx=1&sn=d6723f7d4fb92f45060c2e7311c2633f

Fe Core Class Missing: com.starrocks.analysis.SqlParser and com.starrocks.analysis.SqlScanner

com/starrocks/load/loadv2/BulkLoadJob.java need these two files ，but I can't find it in the source code .
And another file not found ----> com.starrocks.thrift.TUniqueId thrift package not found

Some files licensed header are wrong

Does it support importing data with flink

Remove old query executor codes

[Optimizer] Support in/exists subquery in expression

starrocks support scalar subquery like:

select * from t0 where case v1 + 1 > (select max(v1) from t2) when 1 then 1 end;

but don't support like:

select * from t0 where case v1 + 1 IN (select max(v1) from t2) when 1 then 1 end;
select * from t0 where case Exists (select max(v1) from t2) when 1 then 1 end;

Now the optimizer supports IN/Exists subqueries transform to outer join, can be try to support the above case

Should support count (*) when creating a materialized view

Currently when creating a materialized view, we must define count function in count(someColumn) way, which will be convert to SUM CASE WHEN someColumn IS NULL THEN 0 ELSE 1 END.
But for most scenes, users actually need count (*) instead of count (someColumn), so support count(*) would be a strong demand.