
pinot's Introduction

Apache Pinot

Join the chat at https://communityinviter.com/apps/apache-pinot/apache-pinot

What is Apache Pinot?

Apache Pinot is a real-time distributed OLAP datastore, built to deliver scalable real-time analytics with low latency. It can ingest from batch data sources (such as Hadoop HDFS, Amazon S3, Azure ADLS, Google Cloud Storage) as well as stream data sources (such as Apache Kafka).

Pinot was built by engineers at LinkedIn and Uber and is designed to scale up and out with no upper bound. Performance always remains constant based on the size of your cluster and an expected query per second (QPS) threshold.

For getting started guides, deployment recipes, tutorials, and more, please visit our project documentation at https://docs.pinot.apache.org.

Apache Pinot

Features

Pinot was originally built at LinkedIn to power rich interactive real-time analytic applications such as Who Viewed Profile, Company Analytics, Talent Insights, and many more. UberEats Restaurant Manager is another example of a customer-facing analytics app. At LinkedIn, Pinot powers 50+ user-facing products, ingesting millions of events per second and serving 100k+ queries per second at millisecond latency.

  • Column-oriented: a column-oriented database with various compression schemes such as Run Length, Fixed Bit Length.

  • Pluggable indexing: pluggable indexing technologies Sorted Index, Bitmap Index, Inverted Index.

  • Query optimization: ability to optimize query/execution plan based on query and segment metadata.

  • Stream and batch ingest: near real time ingestion from streams and batch ingestion from Hadoop.

  • Query: SQL based query execution engine.

  • Upsert during real-time ingestion: update the data at-scale with consistency

  • Multi-valued fields: support for multi-valued fields, allowing you to query fields as comma separated values.

  • Cloud-native on Kubernetes: Helm chart provides a horizontally scalable and fault-tolerant clustered deployment that is easy to manage using Kubernetes.

Apache Pinot query console

When should I use Pinot?

Pinot is designed to execute real-time OLAP queries with low latency on massive amounts of data and events. In addition to real-time stream ingestion, Pinot also supports batch use cases with the same low latency guarantees. It is suited to contexts where fast analytics, such as aggregations, are needed on immutable data, possibly with real-time data ingestion. Pinot works very well for querying time series data with lots of dimensions and metrics.

Example query:

SELECT sum(clicks), sum(impressions) FROM AdAnalyticsTable
  WHERE
       ((daysSinceEpoch >= 17849 AND daysSinceEpoch <= 17856)) AND
       accountId IN (123456789)
  GROUP BY
       daysSinceEpoch TOP 100
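
For completeness, here is a minimal Java sketch of issuing a similar query programmatically. It assumes the pinot-java-client library (ConnectionFactory, Connection, ResultSetGroup) and a cluster registered in ZooKeeper at localhost:2181/pinot, and it uses the SQL form (LIMIT instead of the PQL TOP shown above); treat it as illustrative, not part of the official quick start.

import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSet;
import org.apache.pinot.client.ResultSetGroup;

public class AdAnalyticsQuery {
  public static void main(String[] args) {
    // Assumption: the cluster's ZooKeeper ensemble is reachable at localhost:2181.
    Connection connection = ConnectionFactory.fromZookeeper("localhost:2181/pinot");
    ResultSetGroup resultSetGroup = connection.execute(
        "SELECT sum(clicks), sum(impressions) FROM AdAnalyticsTable "
            + "WHERE daysSinceEpoch BETWEEN 17849 AND 17856 AND accountId IN (123456789) "
            + "GROUP BY daysSinceEpoch LIMIT 100");
    ResultSet resultSet = resultSetGroup.getResultSet(0);
    for (int row = 0; row < resultSet.getRowCount(); row++) {
      // Each row holds the two aggregations for one daysSinceEpoch group.
      System.out.println(resultSet.getString(row, 0) + ", " + resultSet.getString(row, 1));
    }
  }
}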

Building Pinot

More detailed instructions can be found in the Quick Demo section of the documentation.

# Clone a repo
$ git clone https://github.com/apache/pinot.git
$ cd pinot

# Build Pinot
$ mvn clean install -DskipTests -Pbin-dist

# Run the Quick Demo
$ cd build/
$ bin/quick-start-batch.sh

Deploying Pinot to Kubernetes

Please refer to Running Pinot on Kubernetes in our project documentation. Pinot also provides Kubernetes integrations with the interactive query engines Trino and Presto, and the data visualization tool Apache Superset.

Join the Community

Documentation

Check out Pinot documentation for a complete description of Pinot's features.

License

Apache Pinot is licensed under the Apache License, Version 2.0.

pinot's People

Contributors

akshayrai, antumbde, apucher, brandtg, cyy0714, dependabot[bot], dhaval2025, gortiz, harleyjj, jackie-jiang, jackjlli, jenniferdai, jfim, jihaozh, justyves, kishoreg, kkcorps, klsince, mayankshriv, mcvsubbu, npawar, puneetjaiswal, richardstartin, saurabhd336, siddharthteotia, sunithabeeram, ttbach, walterddr, xiangfu0, zhtaoxiang


pinot's Issues

Performance Benchmark

It's great work. Thanks for making it available to the public.
Have you done any benchmarking? I'd like to know what kind of hardware I need to run an analytical query in sub-second time over n billion records. Trying to get a ballpark figure.

Join tables

Is there a way to join between 2 tables?

After the server restart

After a server node hangs and is restarted, the amount of data returned by queries is not correct: sometimes the returned values are correct, sometimes they are less than the correct values. All tables show this behavior.

StartServerCommand.class, line 139, has a bug

Configuration configuration = readConfigFromFile(_configFileName);
if (configuration == null) {
  if (_configFileName != null) {
    LOGGER.error("Error: Unable to find file {}.", _configFileName);
    return false;
  }
  configuration = new PropertiesConfiguration(); // useless
}

realtime and offline not working together

I want to test the hybrid mode, so I followed the tutorial: https://github.com/linkedin/pinot/wiki/How-To-Use-Pinot. Here are my steps:

Offline flow

  • create an offline table named "flights" by:
    bin/pinot-admin.sh AddSchema -schemaFile flights-schema.json -exec
    bin/pinot-admin.sh AddTable -filePath flights-definition.json
  • ingest flights data 1988-2014 from Hadoop...
  • after that, I checked by counting all records in the "flights" table:
    bin/pinot-admin.sh PostQuery -query "select count(*) from flights"
    and see that all data has been loaded successfully:
    Result: {"traceInfo":{},"numDocsScanned":844482,"aggregationResults":[{"function":"count_star","value":"844482"}],"timeUsedMs":14,"segmentStatistics":[],"exceptions":[],"totalDocs":844482}

Realtime flow

  • streaming 2015 flights data into kafka:
    bin/pinot-admin.sh StreamAvroIntoKafka -avroFile flights-2015_1.avro -kafkaTopic flights-realtime &
  • create a realtime table named "flights" to consume from the Kafka topic:
    bin/pinot-admin.sh AddTable -filePath flights-definition-realtime.json
    I didn't call AddSchema, since it's already created in offline flow.

The doc says that if a table has the same name for offline and realtime, query processing for this table will switch to hybrid mode. So I suppose I should be able to get both offline and realtime data from table "flights" now. Unfortunately, after creating the realtime table "flights", I can't query the count from "flights" any longer; here is what I got:
result: {"traceInfo":{},"numDocsScanned":0,"aggregationResults":[],"timeUsedMs":0,"segmentStatistics":[],"exceptions":[],"totalDocs":0}

BUT, I found I can query flights_REALTIME for realtime data and flights_OFFLINE for offline data. These two table names, I guess, are created internally by Pinot.

Don't know what's wrong, any help would be appreciated.

Adding/removing physical segment for a logical table

I have the following setup:

2 (two) Pinot servers hosting the same real-time table (same table name / Kafka consumer group id); each of the servers receives half of the total messages from the Kafka stream.
1 (one) ZooKeeper, 1 (one) broker and 1 (one) controller

What happens if one of the servers crashes (the server process itself)? I understand that the remaining server will consume 100% of the messages while the crashed server is down (I can see that the broker query returns about half of the document count in this case), but how can I add the second server back (if at all) in order to restore the original setup?

What's going to happen if a broker/controller crashes?

In general, how should I manage adding/removing Pinot segments to rebalance workload and/or recover from a system crash?

Thank you

How do I configure a real-time node to store data for only 10 minutes?

I have some questions about retentionTime.

I want to store the data in the real-time node for only 10 minutes.
I tested the configuration below:

{
    "tableName":"loadTest",
    "segmentsConfig" : {
        "retentionTimeUnit":"MINUTES",
        "retentionTimeValue":"10",
        "segmentPushFrequency":"daily",
        "segmentPushType":"APPEND",
        "replication" : "1",
        "schemaName" : "loadTest",
        "timeColumnName" : "time",
        "timeType" : "MILLISECONDS",
        "segmentAssignmentStrategy" : "BalanceNumSegmentAssignmentStrategy"
    },
    "tableIndexConfig" : {
        "loadMode"  : "HEAP",
        "lazyLoad"  : "false",
                "streamConfigs": {
                        "streamType": "kafka",
                        "stream.kafka.consumer.type": "highLevel",
                        "stream.kafka.topic.name": "loadTest",
                        "stream.kafka.decoder.class.name": "com.linkedin.pinot.core.realtime.impl.kafka.KafkaJSONMessageDecoder",
                        "stream.kafka.zk.broker.url": "localhost:2181",
                        "stream.kafka.hlc.zk.connect.string": "localhost:2181"
                }
    },
    "tableType":"REALTIME",
        "tenants" : {
                "broker":"brokerOne",
                "server":"serverOne"
        },
    "metadata": {
    }
}

An error had occurred.

2015/06/24 17:18:53.593 ERROR [com.linkedin.pinot.controller.helix.core.retention.RetentionManager] [] Error creating TimeRetentionStrategy for table: loadTest_REALTIME, Exception: java.lang.NullPointerException

Can you explain the retention configuration?
How do I configure the real-time node to store data for only 10 minutes?

Support a 'contains' PredicateEvaluator for dimension columns

I have found that Pinot contains a regex predicate, but it is not implemented yet. I suggest that a 'contains' predicate on dimension strings should be a much higher priority, because it is mostly used by filter queries.

To implement it with the current dictionary algorithm, Pinot would have to scan the whole dictionary index of the data source, which would be inefficient. Maybe Pinot could build another index file for this filter, as Lucene does?

java.net.UnknownHostException occurs during quick-start-offline.sh

The reason is that PostQueryCommand._brokerHost is only initialized in execute() but not in run(). Please see the trace below.

Offline quickstart complete
Total number of documents in the table
Query : select count(*) from baseballStats limit 0
2015/06/26 16:06:20.135 ERROR [org.apache.zookeeper.server.NIOServerCnxn] [] Thread Thread[main,5,main] died
java.net.UnknownHostException: null
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:178)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at java.net.Socket.connect(Socket.java:528)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
        at sun.net.www.http.HttpClient.New(HttpClient.java:308)
        at sun.net.www.http.HttpClient.New(HttpClient.java:326)
        at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
        at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)
        at com.linkedin.pinot.tools.admin.command.PostQueryCommand.run(PostQueryCommand.java:92)
        at com.linkedin.pinot.tools.admin.command.QuickstartRunner.runQuery(QuickstartRunner.java:148)
        at com.linkedin.pinot.tools.Quickstart.execute(Quickstart.java:174)
        at com.linkedin.pinot.tools.Quickstart.main(Quickstart.java:218)
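
A minimal sketch of the kind of guard that would work around this (the class and field names come from the description above; defaulting to localhost and the overall method shape are my assumptions, not the actual project code):

// Inside PostQueryCommand (sketch only): make run() usable even when
// execute() has not initialized _brokerHost.
public boolean run() throws Exception {
  if (_brokerHost == null) {
    _brokerHost = "localhost"; // assumed default; execute() normally sets this
  }
  // ... existing logic that builds the broker URL from _brokerHost and posts the query ...
  return true;
}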

Broker responds to GET requests?

In PinotClientRequestServlet in the Broker, it looks like there is a handler for GET requests that's looking for a "bql" parameter:

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    try {
      resp.getOutputStream().print(handleRequest(new JSONObject(req.getParameter("bql"))).toJson().toString());
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
    } catch (final Exception e) {
      resp.getOutputStream().print(e.getMessage());
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
      LOGGER.error("Caught exception while processing GET request", e);
      brokerMetrics.addMeteredValue(null, BrokerMeter.UNCAUGHT_GET_EXCEPTIONS, 1);
    }
  }

Is GET part of the officially supported API? If so, should "bql" be "pql" to be consistent with the documentation?

Also, should a more efficient / robust library like Jackson be used to parse JSON here?
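
If Jackson were adopted, the parsing step in doGet might look roughly like this sketch (my illustration only; ObjectMapper and readTree are standard Jackson APIs, and the "pql" parameter name follows the question above):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Sketch only: parse the request parameter with Jackson instead of org.json,
// reading the documented "pql" parameter name.
ObjectMapper mapper = new ObjectMapper();
JsonNode requestJson = mapper.readTree(req.getParameter("pql"));
// handleRequest(...) would then need to accept a JsonNode rather than a JSONObject.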

Adding schema/table in realtime mode

Hello.

Have you changed something in terms of adding a schema/table in realtime (Kafka) mode?
I'm using the same schema/table files that I've been using since my first test last week, and there was always a feedback line with some info printed to stdout after a './pinot-admin AddSchema ...' command, as well as after the './pinot-admin AddTable ...' command.

I add schema and table after running broker, controller and server.

Since June 26 there have been no such printouts anymore, the query returns a 'no table hit' error, and there's no consumption from Kafka, as expected...

Thanks.

metricField with long type throws IndexOutOfBoundsException

table schema file

{
  "dimensionFieldSpecs": [
    {
      "name": "field",
      "dataType": "STRING",
      "delimiter": null,
      "singleValueField": true
    },
    {
      "name": "host",
      "dataType": "STRING",
      "delimiter": null,
      "singleValueField": true
    }
  ],
  "timeFieldSpec": {
    "incomingGranularitySpec": {
      "timeType": "SECONDS",
      "dataType": "LONG",
      "name": "time"
    }
  },
  "metricFieldSpecs": [
    {
      "name": "value",
      "dataType": "LONG",
      "delimiter": null,
      "singleValueField": true
    }
  ],
  "schemaName": "changefield"
}
table definition file

{
  "tableName": "changefield",
  "segmentsConfig": {
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "7",
    "segmentPushFrequency": "daily",
    "segmentPushType": "APPEND",
    "replication": "1",
    "schemaName": "changefield",
    "timeColumnName": "time",
    "timeType": "SECONDS",
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy"
  },
  "tableIndexConfig": {
    "invertedIndexColumns": ["field", "host"],
    "loadMode": "HEAP",
    "lazyLoad": "false",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "highLevel",
      "stream.kafka.topic.name": "changefield-realtime",
      "stream.kafka.decoder.class.name": "com.linkedin.pinot.core.realtime.impl.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.zk.broker.url": "localhost:2181",
      "stream.kafka.hlc.zk.connect.string": "localhost:2181"
    }
  },
  "tableType": "REALTIME",
  "tenants": {
    "broker": "brokerOne",
    "server": "serverOne"
  },
  "metadata": {
  }
}
One example record: ["field311", "host10", "1435636716222", "815"]

When I use the above schema, I get an IndexOutOfBoundsException when the segment is dumped, but if I change the value dataType from "LONG" to "INT", it succeeds.
Exception:
2015/06/29 11:48:33.386 INFO [com.linkedin.pinot.core.data.manager.realtime.RealtimeSegmentDataManager] [] Trying to build segment!
2015/06/29 11:48:33.394 INFO [com.linkedin.pinot.core.segment.creator.impl.SegmentIndexCreationDriverImpl] [] Start building StatsCollector!
2015/06/29 11:49:59.837 ERROR [com.linkedin.pinot.core.data.manager.realtime.RealtimeSegmentDataManager] [] Caught exception in the realtime indexing thread
java.lang.IndexOutOfBoundsException
at java.nio.Buffer.checkIndex(Buffer.java:538)
at java.nio.DirectByteBuffer.getLong(DirectByteBuffer.java:766)
at com.linkedin.pinot.core.index.reader.impl.FixedByteWidthRowColDataFileReader.getLong(FixedByteWidthRowColDataFileReader.java:204)
at com.linkedin.pinot.core.index.readerwriter.impl.FixedByteSingleColumnSingleValueReaderWriter.getLong(FixedByteSingleColumnSingleValueReaderWriter.java:140)
at com.linkedin.pinot.core.realtime.impl.RealtimeSegmentImpl.getRawValueRowAt(RealtimeSegmentImpl.java:539)
at com.linkedin.pinot.core.realtime.converter.RealtimeSegmentRecordReader.next(RealtimeSegmentRecordReader.java:78)
at com.linkedin.pinot.core.segment.creator.impl.SegmentIndexCreationDriverImpl.build(SegmentIndexCreationDriverImpl.java:105)
at com.linkedin.pinot.core.realtime.converter.RealtimeSegmentConverter.build(RealtimeSegmentConverter.java:88)
at com.linkedin.pinot.core.data.manager.realtime.RealtimeSegmentDataManager$2.run(RealtimeSegmentDataManager.java:153)
at java.lang.Thread.run(Thread.java:744)

Can't upload any segments after loading about 10000 segment parts

After loading 9000 segments, we couldn't upload any segments successfully and Pinot didn't log any useful information. We looked into it and found two places we should fix.
First: ZooKeeper doesn't support a znode size bigger than 1 MB without adding the jute.maxbuffer parameter to the system environment.
Second: Helix cuts the ZooKeeper message down to 1 MB at ZNRecordStreamingSerializer.java line 168.

After fixing the above two places, we can load more segments now.

cZxid = 0x20002aec3
ctime = Fri Aug 21 14:01:46 CST 2015
mZxid = 0x2000a17a2
mtime = Sun Aug 23 21:20:26 CST 2015
pZxid = 0x20002aec3
cversion = 0
dataVersion = 2024
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 1171020
numChildren = 0

Hope I've made myself clear!
Thanks to my buddies @iamcrzay @code4love

About Pinot's develop plans and timelines

Hi, I would like to know about the feature plans and ETA/timelines of Pinot.
We need to upgrade our realtime analytics system and have done a lot of investigation on different kinds of systems. Pinot is great and looks pretty suitable for us, but currently it does lack some query features we really need. They are:

  1. Calculation on dimensions and aggregated metrics. e.g. select (dimA / 100) as dimB, (sum(metricA) + sum(metricB)) as metricC from table group by dimB order by metricC.
  2. Not only the topN, but the real & completed order by support. e.g. select dimA, sum(metricA), sum(metricB) from table group by dimA order by sum(metricA) desc, sum(metricB) asc offset 10, 20
  3. Having clause.

Broker InstanceName Problem

if I set

./pinot-admin.sh StartBroker -brokerPort 8099 -zkAddress "localhost:2181" -brokerInstName "MyNameIsBroker"

Then I received the exception below:

ZKExceptionHandler - Exception while invoking init callback for listener:com.linkedin.pinot.broker.broker.helix.LiveInstancesChangeListenerImpl@50b7114a
java.lang.ArrayIndexOutOfBoundsException: 1
    at com.linkedin.pinot.broker.broker.helix.LiveInstancesChangeListenerImpl.onLiveInstanceChange(LiveInstancesChangeListenerImpl.java:75)
    at org.apache.helix.manager.zk.CallbackHandler.invoke(CallbackHandler.java:173)
    at org.apache.helix.manager.zk.CallbackHandler.init(CallbackHandler.java:324)
    at org.apache.helix.manager.zk.CallbackHandler.<init>(CallbackHandler.java:114)
    at org.apache.helix.manager.zk.ZKHelixManager.addListener(ZKHelixManager.java:299)
    at org.apache.helix.manager.zk.ZKHelixManager.addLiveInstanceChangeListener(ZKHelixManager.java:318)
    at com.linkedin.pinot.broker.broker.helix.HelixBrokerStarter.<init>(HelixBrokerStarter.java:120)
    at com.linkedin.pinot.tools.admin.command.StartBrokerCommand.execute(StartBrokerCommand.java:98)
    at com.linkedin.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:85)
    at com.linkedin.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:94)

So is this an issue?

How to use binary operator

Hello

I want to execute the query 'select sum_duration/sum_usage_count from test limit 1', but I got an error like:
org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
org.antlr.runtime.DFA.predict(DFA.java:144)
com.linkedin.pinot.pql.parsers.PQLParser.selection_list(PQLParser.java:2133)
com.linkedin.pinot.pql.parsers.PQLParser.select_stmt(PQLParser.java:1459)
com.linkedin.pinot.pql.parsers.PQLParser.statement(PQLParser.java:1246)
com.linkedin.pinot.pql.parsers.PQLCompiler.compile(PQLCompiler.java:50)
com.linkedin.pinot.broker.servlet.PinotClientRequestServlet.handleRequest(PinotClientRequestServlet.java:99)
com.linkedin.pinot.broker.servlet.PinotClientRequestServlet.doPost(PinotClientRequestServlet.java:82)
javax.servlet.http.HttpServlet.service(HttpServlet.java:688)
javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:669)
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
org.eclipse.jetty.server.Server.handle(Server.java:365)
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
java.lang.Thread.run(Thread.java:745)

The table test's structure:

action_data STRING
area STRING
device_type STRING
is_new STRING
sum_duration DOUBLE
sum_usage_count INT
time_range INT
user_count INT

Is there any error in the query statement? Waiting for your answer!

Incorrect secondary ordering

Using quick-start-offline.sh with the repository as of commit dd3b058, a secondary ordering term is not working as expected.

For example, curl -X POST -d '{"pql":"select yearID, playerName from baseballStats order by yearID, playerName"}' http://localhost:7000/query yields:

{
  "selectionResults": {
    "columns": [
      "yearID",
      "playerName"
    ],
    "results": [
      [
        "1871",
        "Caleb Clark"
      ],
      [
        "1871",
        "Asahel"
      ],
      [
        "1871",
        "Arthur Algernon"
      ],
      [
        "1871",
        "Andrew Jackson"
      ],
      [
        "1871",
        "Alfred L."
      ],
      [
        "1871",
        "Alfred James"
      ],
      [
        "1871",
        "Albert Goodwill"
      ],
      [
        "1871",
        "Albert G."
      ],
      [
        "1871",
        "Adrian Constantine"
      ],
      [
        "1871",
        "Calvin Alexander"
      ]
    ]
  },
  "traceInfo": {},
  "numDocsScanned": 97889,
  "aggregationResults": [],
  "timeUsedMs": 20,
  "segmentStatistics": [],
  "exceptions": [],
  "totalDocs": 97889
}

Notice that the playerName values in the grouping are not in the correct order.

quick-start-realtime.sh gives exception on the start "URI is not hierarchical"

I've built Pinot according to the README.

/var/www/RnD/pinot/pinot-distribution/target/pinot-0.016-pkg/bin$ ./quick-start-realtime.sh
Exception in thread "main" java.lang.IllegalArgumentException: URI is not hierarchical
at java.io.File.<init>(File.java:418)
at com.linkedin.pinot.tools.RealtimeQuickStart.main(RealtimeQuickStart.java:35)

/var/www/RnD/pinot/pinot-distribution/target/pinot-0.016-pkg/bin$ java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-2~deb7u1)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

quick-start-offline.sh return to error "URI is not hierarchical"

Dear LinkedIn,

After compiling Pinot, I tried to start it:

./quick-start-offline.sh
Exception in thread "main" java.lang.IllegalArgumentException: URI is not hierarchical
at java.io.File.<init>(File.java:392)
at com.linkedin.pinot.tools.Quickstart.execute(Quickstart.java:112)
at com.linkedin.pinot.tools.Quickstart.main(Quickstart.java:201)

but it returns the error shown above.

Error when closing inverted index for column

This exception happens during SegmentTarPush.
I have three pageview log files, d1, d2, d3. After SegmentCreation, I push the index to the controller.
d1 and d3 work fine, but exceptions happen when pushing d2. It seems the index was broken (part of d1's data cannot be queried).

Inverted Index

Can you explain how the inverted index is used within Pinot? I have used Druid previously, and they implement a bitmap index for the dimensions so they can do very fast filtering on low cardinality data sets. I imagine the inverted index is used similarly here, but it is not documented.

Also, may I suggest having a Google group so these questions can go there instead.

quickstart consumes 100% of one CPU core for unknown reason.

Hello,

Just found a strange issue - quickstart consumes 100% of one core. I'm using it for local development/testing on a Mac Pro. As a result, the fan is always running, which is uncomfortable.

After a quick investigation I found this code in
/pinot-tools/src/main/java/com/linkedin/pinot/tools/Quickstart.java:

long st = System.currentTimeMillis();
while (true) {
  if (System.currentTimeMillis() - st >= (60 * 60) * 1000) {
    break;
  }
}

Is it possible to add Thread.yield() or Thread.sleep() here to reduce the burden?
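
A minimal sketch of that suggestion, replacing the busy-wait with a periodic Thread.sleep (the one-hour bound is taken from the snippet above; the one-second sleep interval is my choice):

long st = System.currentTimeMillis();
// Sketch: sleep in short intervals instead of spinning, so the quickstart still
// exits after roughly one hour without pegging a CPU core.
while (System.currentTimeMillis() - st < (60 * 60) * 1000L) {
  try {
    Thread.sleep(1000); // check once per second instead of busy-looping
  } catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    break;
  }
}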

Thank you!

error mvn install package -DskipTests

[INFO] Total time: 51.248s
[INFO] Finished at: Wed Jun 17 13:19:19 EDT 2015
[INFO] Final Memory: 121M/237M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.5.5:single (make-bundles) on project pinot-distribution: Failed to create assembly: Error creating assembly archive bin: Problem creating zip: Execution exception: Java heap space -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.5.5:single (make-bundles) on project pinot-distribution: Failed to create assembly: Error creating assembly archive bin: Problem creating zip: Execution exception
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
    at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)

Hybrid quickstart

We should also have a hybrid mode quick start for users to play with.

Data was changed itself

Hello

I'm so interested in Pinot that I'm trying it out.
I'm having trouble with the issue below.

I sent data to the realtime node using Kafka.
After a while, the data changed by itself.

The first results are as follows (SQL query result):

{
    "exceptions": [],
    "traceInfo": {},
    "timeUsedMs": 1,
    "totalDocs": 56700,
    "segmentStatistics": [],
    "aggregationResults": [],
    "selectionResults": {
        "results": [
            ["A", "1435048500111", "1"],
            ["A", "1435048500113", "1"],
            ["A", "1435048500115", "1"],
            ["A", "1435048500117", "1"],
            ["A", "1435048500119", "1"],
        ],
        "columns": ["name", "time", "value"]
    },
    "numDocsScanned": 5
}

But after a while, the values in column "value" changed by themselves.
The next results for the same query are as follows (SQL query result):

{
    "exceptions": [],
    "traceInfo": {},
    "timeUsedMs": 1,
    "totalDocs": 56700,
    "segmentStatistics": [],
    "aggregationResults": [],
    "selectionResults": {
        "results": [
            ["A", "1435048500111", "0"],
            ["A", "1435048500113", "0"],
            ["A", "1435048500115", "0"],
            ["A", "1435048500117", "0"],
            ["A", "1435048500119", "0"],
        ],
        "columns": ["name", "time", "value"]
    },
    "numDocsScanned": 5
}

Is that because the column name "value" is a reserved keyword?

cf.
schema file

{
        "dimensionFieldSpecs" : [
                {
                        "name": "name",
                        "dataType" : "STRING",
                        "delimiter" : null,
                        "singleValueField" : true
                }
        ],
        "timeFieldSpec" : {
                "incomingGranularitySpec" : {
                        "timeType" : "MILLISECONDS",
                        "dataType" : "LONG",
                        "name" : "time"
                }
        },
        "metricFieldSpecs" : [
                {
                        "name" : "value",
                        "dataType" : "INT",
                        "delimiter" : null,
                        "singleValueField" : true
                }
        ],
        "schemaName" : "loadTest"
}

table file

{
    "tableName":"loadTest",
    "segmentsConfig" : {
        "retentionTimeUnit":"DAYS",
        "retentionTimeValue":"7",
        "segmentPushFrequency":"daily",
        "segmentPushType":"APPEND",
        "replication" : "1",
        "schemaName" : "loadTest",
        "timeColumnName" : "time",
        "timeType" : "MILLISECONDS",
        "segmentAssignmentStrategy" : "BalanceNumSegmentAssignmentStrategy"
    },
    "tableIndexConfig" : {
        "invertedIndexColumns" : ["time"],
        "loadMode"  : "HEAP",
        "lazyLoad"  : "false",
                "streamConfigs": {
                        "streamType": "kafka",
                        "stream.kafka.consumer.type": "highLevel",
                        "stream.kafka.topic.name": "loadTest",
                        "stream.kafka.decoder.class.name": "com.linkedin.pinot.core.realtime.impl.kafka.KafkaJSONMessageDecoder",
                        "stream.kafka.zk.broker.url": "localhost:2181",
                        "stream.kafka.hlc.zk.connect.string": "localhost:2181"
                }
    },
    "tableType":"REALTIME",
        "tenants" : {
                "broker":"brokerOne",
                "server":"serverOne"
        },
    "metadata": {
    }
}

So is this an issue?
Thanks.

Exception from pinotBroker.log after issuing any query

2015-10-10 18:03:29,552 INFO  [qtp1861781750-549] com.linkedin.pinot.broker.servlet.PinotClientRequestServlet - Broker received Query String is: select count(*)  from campaign
2015-10-10 18:03:29,552 ERROR [pool-3-thread-810] 
2015-10-12 10:46:46,258 ERROR [pool-3-thread-26] com.linkedin.pinot.transport.scattergather.ScatterGatherImpl - Got exception sending request (538). Setting error future
java.lang.NullPointerException
        at com.linkedin.pinot.transport.scattergather.ScatterGatherImpl$SingleRequestHandler.run(ScatterGatherImpl.java:392)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
        at com.linkedin.pinot.transport.scattergather.ScatterGatherImpl$SingleRequestHandler.run(ScatterGatherImpl.java:392)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
com.linkedin.pinot.transport.scattergather.ScatterGatherImpl - Got exception sending request (14679). Setting error future
java.lang.NullPointerException
2015-10-10 18:03:29,552 INFO  [qtp1861781750-549] com.linkedin.pinot.transport.common.AsyncResponseFuture - Error Future for request 14679 Request is no longer pending. Cannot cancel !!
2015-10-10 18:03:29,552 INFO  [qtp1861781750-549] com.linkedin.pinot.common.metrics.AbstractMetrics -  Phase:REDUCE took 0
2015-10-10 18:03:29,552 INFO  [qtp1861781750-549] com.linkedin.pinot.common.metrics.AbstractMetrics -  Phase:QUERY_EXECUTION took 0
2015-10-10 18:03:29,552 INFO  [qtp1861781750-549] com.linkedin.pinot.broker.servlet.PinotClientRequestServlet - Broker Response : BrokerResponse(totalDocs:0, numDocsScanned:0, timeUsedMs:0, aggregationResults:[], selectionResults:null, segmentStatistics:[], exceptions:[], traceInfo:{})

In my test it only happens with many concurrent client connections; e.g., it works fine when I use 10 threads for the query test, but it throws exceptions after I raise it to 100 threads.

Count Distinct

Can someone give an example of count distinct?
I'm getting an error when I try to query this:

select count( distinct AtBatting) from baseballStats limit 10
Or

select countDistinct ( AtBatting) from baseballStats limit 10

A big problem about inverted index creation

SegmentColumnarIndexCreator.java

if (config.createInvertedIndexEnabled()) {
  invertedIndexCreatorMap.put(
      column,
      new BitmapInvertedIndexCreator(file, indexCreationInfo.getSortedUniqueElementsArray().length, schema
          .getFieldSpecFor(column)));
}

The config.createInvertedIndexEnabled() method, which comes from SegmentGeneratorConfig.java, decides whether an inverted index is created.
This is a problem, because we need to depend on the table config info (invertedIndexColumns) to decide whether a column should get an inverted index or not.
The config.setCreateInvertedIndex method is never called in the code, so config.createInvertedIndexEnabled() is always false and the table config info (invertedIndexColumns) can't take effect.

Table Config
{
  "tableIndexConfig": {
    "invertedIndexColumns": ["numberOfGamesAsBatter"],
    "loadMode": "HEAP",
    "lazyLoad": "false"
  },
  "tenants": {"server": "", "broker": ""},
  "tableType": "OFFLINE",
  "metadata": {"customConfigs": {"d2Name": "xlntBetaPinot"}},
  "segmentsConfig": {
    "retentionTimeUnit": "DAYS",
    "segmentPushFrequency": "daily",
    "replication": 1,
    "timeColumnName": "",
    "retentionTimeValue": "700",
    "timeType": "",
    "segmentPushType": "APPEND",
    "schemaName": "baseball",
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy"
  },
  "tableName": "baseballStats"
}

TimeoutException, only run select * from table limit 10

Hi,
I only run select * from table limit 10,
on about 700 million records,
with 4 server nodes;
3 of the servers have 8 GB of RAM and 4 cores.

Could you please give me some help?
Do I need to add servers to the cluster,
or configure the timeoutms?

ProcessingException(errorCode:350, message:java.util.concurrent.TimeoutException\n\tat java.util.concurrent.FutureTask.get(FutureTask.java:205)\n\tat com.linkedin.pinot.core.operator.MCombineOperator.nextBlock(MCombineOperator.java:191)\n\tat com.linkedin.pinot.core.operator.UResultOperator.nextBlock(UResultOperator.java:52)\n\tat com.linkedin.pinot.core.plan.GlobalPlanImplV0.execute(GlobalPlanImplV0.java:58)\n\tat com.linkedin.pinot.core.query.executor.ServerQueryExecutorV1Impl.processQuery(ServerQueryExecutorV1Impl.java:133)\n\tat com.linkedin.pinot.server.request.SimpleRequestHandler.processRequest(SimpleRequestHandler.java:81)\n\tat com.linkedin.pinot.transport.netty.NettyServer$NettyChannelInboundHandler.channelRead(NettyServer.java:218)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)\n\tat io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)\n\tat io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)\n\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)\n\tat io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)\n\tat java.lang.Thread.run(Thread.java:745)\n)", "ProcessingException(errorCode:350, message:java.util.concurrent.TimeoutException\n\tat java.util.concurrent.FutureTask.get(FutureTask.java:205)\n\tat com.linkedin.pinot.core.operator.MCombineOperator.nextBlock(MCombineOperator.java:191)\n\tat com.linkedin.pinot.core.operator.UResultOperator.nextBlock(UResultOperator.java:52)\n\tat com.linkedin.pinot.core.plan.GlobalPlanImplV0.execute(GlobalPlanImplV0.java:58)\n\tat com.linkedin.pinot.core.query.executor.ServerQueryExecutorV1Impl.processQuery(ServerQueryExecutorV1Impl.java:133)\n\tat com.linkedin.pinot.server.request.SimpleRequestHandler.processRequest(SimpleRequestHandler.java:81)\n\tat com.linkedin.pinot.transport.netty.NettyServer$NettyChannelInboundHandler.channelRead(NettyServer.java:218)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)\n\tat io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)\n\tat io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)\n\tat 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)\n\tat io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)\n\tat java.lang.Thread.run(Thread.java:745)\n)", "ProcessingException(errorCode:350, message:java.util.concurrent.TimeoutException\n\tat java.util.concurrent.FutureTask.get(FutureTask.java:205)\n\tat com.linkedin.pinot.core.operator.MCombineOperator.nextBlock(MCombineOperator.java:191)\n\tat com.linkedin.pinot.core.operator.UResultOperator.nextBlock(UResultOperator.java:52)\n\tat com.linkedin.pinot.core.plan.GlobalPlanImplV0.execute(GlobalPlanImplV0.java:58)\n\tat com.linkedin.pinot.core.query.executor.ServerQueryExecutorV1Impl.processQuery(ServerQueryExecutorV1Impl.java:133)\n\tat com.linkedin.pinot.server.request.SimpleRequestHandler.processRequest(SimpleRequestHandler.java:81)\n\tat com.linkedin.pinot.transport.netty.NettyServer$NettyChannelInboundHandler.channelRead(NettyServer.java:218)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)\n\tat io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)\n\tat io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)\n\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)\n\tat io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)\n\tat java.lang.Thread.run(Thread.java:745)\n)

EXTRA_JVM_ARGUMENTS in pinot-admin.sh prevents configuration of log4j file and -Xms -Xmx

Currently the only way to increase log verbosity from ERROR to INFO on pinot is to modify the pinot-log4j.properties file inside the pinot distribution.

The main problem is that pinot-admin.sh appends the EXTRA_JVM_ARGUMENTS JVM properties after the user provided JAVA_OPTS.

This causes a few problems.

  • There is no simple way to specify an alternate log4j file. Usually this would simply be done by adding -Dlog4j.configuration=... to JAVA_OPTS, but EXTRA_JVM_ARGUMENTS overrides the settings.
  • Any -Xms or -Xmx settings provided in JAVA_OPTS are ignored.

Given that Pinot has a complex classpath, developers really must start it using pinot-admin.sh. And in order to properly operate and debug Pinot, developers will often need to increase log verbosity and properly tune JVM settings.

Grouping data by time interval

I've been working on adding Pinot as a datasource to Grafana, a visualization tool for time series charts.

I'm looking for the best way to query Pinot.

In InfluxDB's query language, grouping results into time buckets is usually done like:

SELECT MEAN(value) FROM measure WHERE time > (now() - 1h) GROUP BY time(5m)

Where the result set would look something like

{
  "columns": ["value", "time"],
  "results": [
    [40,1441756680100],
    [45,1441756680200],
    [43,1441756680300],
    ...
  ]
}

And can easily be plotted.

Does Pinot support this sort of query? I've been trying a few group by queries but have been unable to formulate one that provides this sort of grouped response. If this is not yet supported, would it be possible to add support?

When there are more than eight tables, there is no consumption from Kafka

When there are more than 8 tables in the cluster, the next table can't consume from Kafka.

I tested many times, and finally discovered that a newly created table does not receive data properly.

The tables were created in order: table1_realtime, table1_offline,
                                  table2_realtime, table2_offline,
                                  table3_realtime, table3_offline,
                                  table4_realtime, table4_offline,
                                  table5_realtime
table5_realtime cannot consume from Kafka.

"select *" failed with exception

Caught exception while reducing results
java.lang.UnsupportedOperationException
at java.util.AbstractList.remove(AbstractList.java:161)
at java.util.AbstractList$Itr.remove(AbstractList.java:374)
at java.util.AbstractList.removeRange(AbstractList.java:571)
at java.util.AbstractList.clear(AbstractList.java:234)
at com.linkedin.pinot.core.query.selection.SelectionOperatorUtils.getSelectionColumns(SelectionOperatorUtils.java:99)
at com.linkedin.pinot.core.query.selection.SelectionOperatorUtils.render(SelectionOperatorUtils.java:156)
at com.linkedin.pinot.core.query.reduce.DefaultReduceService.reduceOnSelectionResults(DefaultReduceService.java:201)
at com.linkedin.pinot.core.query.reduce.DefaultReduceService.reduceOnDataTable(DefaultReduceService.java:157)
at com.linkedin.pinot.requestHandler.BrokerRequestHandler$1.call(BrokerRequestHandler.java:305)
at com.linkedin.pinot.requestHandler.BrokerRequestHandler$1.call(BrokerRequestHandler.java:302)
at com.linkedin.pinot.common.metrics.AbstractMetrics.timePhase(AbstractMetrics.java:103)
at com.linkedin.pinot.requestHandler.BrokerRequestHandler.getDataTableFromBrokerRequest(BrokerRequestHandler.java:302)
at com.linkedin.pinot.requestHandler.BrokerRequestHandler.processSingleTableBrokerRequest(BrokerRequestHandler.java:157)
at com.linkedin.pinot.requestHandler.BrokerRequestHandler.processBrokerRequest(BrokerRequestHandler.java:123)
at com.linkedin.pinot.broker.servlet.PinotClientRequestServlet$1.call(PinotClientRequestServlet.java:141)
at com.linkedin.pinot.broker.servlet.PinotClientRequestServlet$1.call(PinotClientRequestServlet.java:136)
at com.linkedin.pinot.common.metrics.AbstractMetrics.timePhase(AbstractMetrics.java:103)
at com.linkedin.pinot.broker.servlet.PinotClientRequestServlet.handleRequest(PinotClientRequestServlet.java:135)
at com.linkedin.pinot.broker.servlet.PinotClientRequestServlet.doPost(PinotClientRequestServlet.java:84)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:688)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:669)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:365)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)

Encoding problem

PinotClientRequestServlet has encoding problem:

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    try {
      resp.getOutputStream().print(handleRequest(new JSONObject(req.getParameter("bql"))).toJson().toString());
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
    } catch (final Exception e) {
      resp.getOutputStream().print(e.getMessage());
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
      LOGGER.error("Caught exception while processing GET request", e);
      brokerMetrics.addMeteredValue(null, BrokerMeter.UNCAUGHT_GET_EXCEPTIONS, 1);
    }
  }

  @Override
  protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    try {
      resp.getOutputStream().print(handleRequest(extractJSON(req)).toJson().toString());
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
    } catch (final Exception e) {
      resp.getOutputStream().print(e.getMessage());
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
      LOGGER.error("Caught exception while processing POST request", e);
      brokerMetrics.addMeteredValue(null, BrokerMeter.UNCAUGHT_POST_EXCEPTIONS, 1);
    }
  }

It should be


 @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    try {
      resp.setCharacterEncoding("UTF-8");
      resp.getOutputStream().write(handleRequest(new JSONObject(req.getParameter("bql"))).toJson().toString().getBytes("UTF-8"));
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
    } catch (final Exception e) {
      resp.getOutputStream().print(e.getMessage());
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
      LOGGER.error("Caught exception while processing GET request", e);
      brokerMetrics.addMeteredValue(null, BrokerMeter.UNCAUGHT_GET_EXCEPTIONS, 1);
    }
  }

  @Override
  protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    try {
      resp.setCharacterEncoding("UTF-8");
      resp.getOutputStream().write(handleRequest(extractJSON(req)).toJson().toString().getBytes("UTF-8"));
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
    } catch (final Exception e) {
      resp.getOutputStream().print(e.getMessage());
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
      LOGGER.error("Caught exception while processing POST request", e);
      brokerMetrics.addMeteredValue(null, BrokerMeter.UNCAUGHT_POST_EXCEPTIONS, 1);
    }
  }

Server get wrong segment download url from zookeeper

We found that pinot-server gets the wrong segment download URL from ZooKeeper, like http://localhost:9999/segments/myTable/mySegment_0.tar.gz. It seems that the controller sets "localhost" as its hostname.

[pool-2-thread-1] 2015/06/19 12:36:58.665 INFO [com.linkedin.pinot.common.utils.FileUploadUtils] [] Get file from:http://localhost:9999/segments/myTable/mySegment_0.tar.gz to local file:/tmp/PinotSegment8098/segmentTar/mySegment_0.tar.gz
[pool-2-thread-1] 2015/06/19 12:36:58.666 INFO [org.apache.commons.httpclient.HttpMethodDirector] [] I/O exception (java.net.ConnectException) caught when processing request: Connection refused
[pool-2-thread-1] 2015/06/19 12:36:58.666 INFO [org.apache.commons.httpclient.HttpMethodDirector] [] Retrying request
[pool-2-thread-1] 2015/06/19 12:36:58.667 INFO [org.apache.commons.httpclient.HttpMethodDirector] [] I/O exception (java.net.ConnectException) caught when processing request: Connection refused
[pool-2-thread-1] 2015/06/19 12:36:58.667 INFO [org.apache.commons.httpclient.HttpMethodDirector] [] Retrying request
[pool-2-thread-1] 2015/06/19 12:36:58.667 INFO [org.apache.commons.httpclient.HttpMethodDirector] [] I/O exception (java.net.ConnectException) caught when processing request: Connection refused
[pool-2-thread-1] 2015/06/19 12:36:58.667 INFO [org.apache.commons.httpclient.HttpMethodDirector] [] Retrying request
[pool-2-thread-1] 2015/06/19 12:36:58.667 ERROR [com.linkedin.pinot.common.utils.FileUploadUtils] [] Caught exception
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at java.net.Socket.connect(Socket.java:538)
        at java.net.Socket.<init>(Socket.java:434)
        at java.net.Socket.<init>(Socket.java:286)
        at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
        at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
        at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
        at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at com.linkedin.pinot.common.utils.FileUploadUtils.getFile(FileUploadUtils.java:90)
        at com.linkedin.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.downloadSegmentToLocal(SegmentOnlineOfflineStateModelFactory.java:327)
        at com.linkedin.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOfflineForOfflineSegment(SegmentOnlineOfflineStateModelFactory.java:212)
        at com.linkedin.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:128)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:344)
        at org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:290)
        at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:93)
        at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:50)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

PS: In com.linkedin.pinot.tools.admin.command.StartControllerCommand, you can find the following code:

 conf.setControllerHost("localhost");
 conf.setControllerVipHost("localhost");

I think it should be:

String hostname = InetAddress.getLocalHost().getHostName();
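
For context, a minimal sketch of how that suggestion could look (the resolveControllerHost helper and the UnknownHostException fallback are my own additions, not the actual Pinot code; only the two conf setters come from StartControllerCommand itself):

import java.net.InetAddress;
import java.net.UnknownHostException;

// Resolve the machine's hostname instead of hard-coding "localhost";
// falling back to "localhost" keeps today's behaviour if resolution fails.
private static String resolveControllerHost() {
  try {
    return InetAddress.getLocalHost().getHostName();
  } catch (UnknownHostException e) {
    return "localhost";
  }
}

// ... in StartControllerCommand:
conf.setControllerHost(resolveControllerHost());
conf.setControllerVipHost(resolveControllerHost());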

Terminating quickstart with SIGINT throws IllegalStateException because of shutdown hook

Terminating bin/quick-start-offline.sh with SIGINT (Ctrl+C) throws the following exception:

java.lang.IllegalStateException: Shutdown in progress
    at java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:66)
    at java.lang.Runtime.addShutdownHook(Runtime.java:211)
    at com.linkedin.pinot.tools.admin.command.AbstractBaseCommand.<init>(AbstractBaseCommand.java:44)
    at com.linkedin.pinot.tools.admin.command.StopProcessCommand.<init>(StopProcessCommand.java:32)
    at com.linkedin.pinot.tools.admin.command.QuickstartRunner.stop(QuickstartRunner.java:109)
    at com.linkedin.pinot.tools.Quickstart$1.run(Quickstart.java:145)

The problem is that AbstractBaseCommand always registers the shutdown hook without handling the case where a shutdown is already in progress. For StopProcessCommand the shutdown is always in progress (modulo slow timings).

I think it's OK to ignore the exception or just call cleanup in the constructor if the general contract is that cleanup must be called in any case.
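
A minimal sketch of the kind of guard I have in mind for AbstractBaseCommand (the cleanup hook body is an assumption based on the discussion above, not the actual code):

// Inside AbstractBaseCommand's constructor:
try {
  Runtime.getRuntime().addShutdownHook(new Thread(this::cleanup));
} catch (IllegalStateException e) {
  // "Shutdown in progress": the JVM is already shutting down, e.g. when
  // StopProcessCommand is constructed from within another shutdown hook.
  // Ignoring is safe here, or cleanup() could be invoked directly instead.
}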

NPE inside query executor

Just hit the NPE below; I believe it was triggered by this query:

SELECT COUNT(timestamp) FROM spans WHERE timestamp BETWEEN 1435708800 AND 1437519830 GROUP BY toServiceName

java.lang.NullPointerException
    at com.linkedin.pinot.core.operator.MCombineOperator.trimToSize(MCombineOperator.java:257)
    at com.linkedin.pinot.core.operator.MCombineOperator.nextBlock(MCombineOperator.java:246)
    at com.linkedin.pinot.core.operator.UResultOperator.nextBlock(UResultOperator.java:52)
    at com.linkedin.pinot.core.plan.GlobalPlanImplV0.execute(GlobalPlanImplV0.java:58)
    at com.linkedin.pinot.core.query.executor.ServerQueryExecutorV1Impl.processQuery(ServerQueryExecutorV1Impl.java:133)
    at com.linkedin.pinot.server.request.SimpleRequestHandler.processRequest(SimpleRequestHandler.java:81)
    at com.linkedin.pinot.transport.netty.NettyServer$NettyChannelInboundHandler.channelRead(NettyServer.java:218)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)

Thirdeye Setup, Documentation, Examples

I have been trying to get Thirdeye running with an example dataset and have encountered some difficulties. Some of these are resolved, and I've written them down here for completeness; others are requests for help. I can create new issues for some of these, if required.

  • The Thirdeye Hadoop jobs do not take in Hadoop job memory settings (e.g. -Dmapreduce.map.memory.mb), and job.properties memory settings are not configured.
  • There is no example of config.yml. Here is one that worked for me:
collection: registrations
dimensions:
  - name: MY_DIM_1
  - name: MY_DIM_2
metrics:
  - name: count
    type: INT
time:
  columnName: date
  input:
    size: 1
    unit: HOURS
  bucket:
    size: 1
    unit: HOURS
  retention:
    size: 365
    unit: DAYS
rollup:
  functionClass: com.linkedin.thirdeye.bootstrap.rollup.TotalAggregateBasedRollupFunction
  functionConfig:
    metricName: "count"
    threshold: "100"
split:
  threshold: 3

I am unsure exactly what can be put here; there was also a TopKRollupSpec that looked interesting...

  • In the bootstrap phases, I had to add some extra properties to job.properties, or the jobs would error out. Overall, I added the following:
thirdeye.time.min=2010-01-01
thirdeye.time.max=2020-01-01
thirdeye.dimension.index.ref=data_2010-01-01-000000_2020-01-01-000000
thirdeye.flow.schedule=MONTHLY

In particular, I believe the first two lines were needed in the analysis job, as per https://github.com/linkedin/pinot/blob/master/thirdeye/thirdeye-bootstrap/src/main/java/com/linkedin/thirdeye/bootstrap/ThirdEyeJob.java#L1131

The third line was required for the startree_bootstrap_phase1 and startree_bootstrap_phase2 jobs, and thirdeye.flow.schedule was needed for the server_package job.

  • Without Kerberos set up in our Hadoop cluster, I had to use the server_package job in the thirdeye-bootstrap package to combine the {thirdeye.root}/{thirdeye.collection}/METRIC_INDEX/DATE_RANGE/startree_bootstrap_phase2/data-x.tar.gz files into a single data.tar.gz.
    Then I had to POST that data.tar.gz file to the /collections/{thirdeye.collection} endpoint, and after restarting thirdeye-server, the collection got picked up.
    Commands I used:
curl -XPOST \
  --data-binary @{MY_PATH_TO}/config.yml \
  -H 'Content-Type:application/octet-stream' \
  localhost:8080/collections/registrations
curl -XPOST \
  --data-binary @{MY_PATH_TO}/data.tar.gz \
  -H 'Content-Type:application/octet-stream' \
  localhost:8080/collections/registrations/data/{START_MILLIS}/{END_MILLIS}

Here {START_MILLIS} and {END_MILLIS} are the milliseconds-since-epoch values for the dates I set as thirdeye.time.min and thirdeye.time.max, respectively.
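
For reference, a minimal plain-Java sketch (not a Thirdeye API) of how those millisecond values can be derived, assuming the dates above are interpreted as midnight UTC:

import java.time.Instant;

// Compute {START_MILLIS} and {END_MILLIS} from thirdeye.time.min/max.
long startMillis = Instant.parse("2010-01-01T00:00:00Z").toEpochMilli(); // 1262304000000
long endMillis = Instant.parse("2020-01-01T00:00:00Z").toEpochMilli();   // 1577836800000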

  • Just having thirdeye-server doesn't seem to be enough if I want the dashboard UI. I had to start thirdeye-dashboard with its own YAML config (the default config makes it start on port 8080, which conflicted with thirdeye-server).
    Configs I used:
serverUri: http://localhost:8080
server:
  applicationConnectors:
  - type: http
    port: 8082
  adminConnectors:
  - type: http
    port: 9001
funnelConfigRoot: SOME_FOLDER
  • The funnelConfigRoot was required, or the FunnelsDataProvider would not be instantiated, even if the folder doesn't contain a funnel.yml. I believe this behavior is because of https://github.com/linkedin/pinot/blob/master/thirdeye/thirdeye-dashboard/src/main/java/com/linkedin/thirdeye/dashboard/ThirdEyeDashboard.java#L73
  • I got some SocketTimeoutExceptions for queries that thirdeye-dashboard was making to thirdeye-server, but they seem to have disappeared now. I'm not sure if there is a place to set the timeout.
  • In general, there are a lot of 'custom dashboard' and 'custom funnel' features that I am unsure how to get working. Some documentation there would be great.
  • My dataset, I believe, is sparse enough that for some combinations of dimension values there are 0 rows. This resulted in a cryptic DataTables error message of DataTables warning: table id=tabularView-metric_0 - Requested unknown parameter '8' for row 396. For more information about this error, please see http://datatables.net/tn/4
    Using the Chrome console, I got an error message of:
    "FreeMarker template error (DEBUG mode; use RETHROW in production!): The following has evaluated to null or missing: ==> cell.statsMap['volume_difference'] [in template "com/linkedin/thirdeye/dashboard/views/dimension/heat-map.ftl" at line 162, column 44]...
    The full message is available here: https://gist.github.com/cliu587/704d0855fc9ae4d8bddc

I am pretty stuck on the last part, but it only affects the heatmap tab.
The Contributors tab works fine.

Hopefully this is helpful!
