
pinot's Introduction

Apache Pinot

Join the chat at https://communityinviter.com/apps/apache-pinot/apache-pinot

What is Apache Pinot?

Apache Pinot is a real-time distributed OLAP datastore, built to deliver scalable real-time analytics with low latency. It can ingest from batch data sources (such as Hadoop HDFS, Amazon S3, Azure ADLS, Google Cloud Storage) as well as stream data sources (such as Apache Kafka).

Pinot was built by engineers at LinkedIn and Uber and is designed to scale up and out with no upper bound. Performance always remains constant based on the size of your cluster and an expected query per second (QPS) threshold.

For getting started guides, deployment recipes, tutorials, and more, please visit our project documentation at https://docs.pinot.apache.org.

Apache Pinot

Features

Pinot was originally built at LinkedIn to power rich interactive real-time analytic applications such as Who Viewed Profile, Company Analytics, Talent Insights, and many more. UberEats Restaurant Manager is another example of a customer-facing analytics app. At LinkedIn, Pinot powers 50+ user-facing products, ingesting millions of events per second and serving 100k+ queries per second at millisecond latency.

  • Column-oriented: a column-oriented database with various compression schemes such as Run Length, Fixed Bit Length.

  • Pluggable indexing: pluggable indexing technologies Sorted Index, Bitmap Index, Inverted Index.

  • Query optimization: ability to optimize query/execution plan based on query and segment metadata.

  • Stream and batch ingest: near real time ingestion from streams and batch ingestion from Hadoop.

  • Query: SQL based query execution engine.

  • Upsert during real-time ingestion: update the data at-scale with consistency

  • Multi-valued fields: support for multi-valued fields, allowing you to query fields as comma separated values.

  • Cloud-native on Kubernetes: Helm chart provides a horizontally scalable and fault-tolerant clustered deployment that is easy to manage using Kubernetes.

Apache Pinot query console

When should I use Pinot?

Pinot is designed to execute real-time OLAP queries with low latency on massive amounts of data and events. In addition to real-time stream ingestion, Pinot also supports batch use cases with the same low latency guarantees. It is suited to contexts where fast analytics, such as aggregations, are needed on immutable data, possibly with real-time data ingestion. Pinot works very well for querying time series data with lots of dimensions and metrics.

Example query:

SELECT sum(clicks), sum(impressions) FROM AdAnalyticsTable
  WHERE
       ((daysSinceEpoch >= 17849 AND daysSinceEpoch <= 17856)) AND
       accountId IN (123456789)
  GROUP BY
       daysSinceEpoch TOP 100
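
For completeness, here is a minimal Java sketch of issuing a similar query programmatically. It assumes the pinot-java-client library (ConnectionFactory, Connection, ResultSetGroup) and a cluster registered in ZooKeeper at localhost:2181/pinot, and it uses the SQL form (LIMIT instead of the PQL TOP shown above); treat it as illustrative, not part of the official quick start.

import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSet;
import org.apache.pinot.client.ResultSetGroup;

public class AdAnalyticsQuery {
  public static void main(String[] args) {
    // Assumption: the cluster's ZooKeeper ensemble is reachable at localhost:2181.
    Connection connection = ConnectionFactory.fromZookeeper("localhost:2181/pinot");
    ResultSetGroup resultSetGroup = connection.execute(
        "SELECT sum(clicks), sum(impressions) FROM AdAnalyticsTable "
            + "WHERE daysSinceEpoch BETWEEN 17849 AND 17856 AND accountId IN (123456789) "
            + "GROUP BY daysSinceEpoch LIMIT 100");
    ResultSet resultSet = resultSetGroup.getResultSet(0);
    for (int row = 0; row < resultSet.getRowCount(); row++) {
      // Each row holds the two aggregations for one daysSinceEpoch group.
      System.out.println(resultSet.getString(row, 0) + ", " + resultSet.getString(row, 1));
    }
  }
}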

Building Pinot

More detailed instructions can be found in the Quick Demo section of the documentation.

# Clone a repo
$ git clone https://github.com/apache/pinot.git
$ cd pinot

# Build Pinot
$ mvn clean install -DskipTests -Pbin-dist

# Run the Quick Demo
$ cd build/
$ bin/quick-start-batch.sh

Deploying Pinot to Kubernetes

Please refer to Running Pinot on Kubernetes in our project documentation. Pinot also provides Kubernetes integrations with the interactive query engines Trino and Presto, and the data visualization tool Apache Superset.

Join the Community

Documentation

Check out Pinot documentation for a complete description of Pinot's features.

License

Apache Pinot is licensed under the Apache License, Version 2.0.

pinot's People

Contributors

akshayrai, antumbde, apucher, brandtg, cyy0714, dependabot[bot], dhaval2025, gortiz, harleyjj, jackie-jiang, jackjlli, jenniferdai, jfim, jihaozh, justyves, kishoreg, kkcorps, klsince, mayankshriv, mcvsubbu, npawar, puneetjaiswal, richardstartin, saurabhd336, siddharthteotia, sunithabeeram, ttbach, walterddr, xiangfu0, zhtaoxiang


pinot's Issues

Performance Benchmark

It's great work. Thanks for making it available to the public.
Have you done any benchmarking? I'd like to know what kind of hardware I need to run an analytical query in sub-second time over n billion records. Trying to get a ballpark figure.

Join tables

Is there a way to join between 2 tables?

After the server restart

After a server node hangs and is restarted, the amount of data returned by queries is not correct: sometimes the returned values are correct, sometimes they are less than the correct values. All tables show this behavior.

StartServerCommand.class, line 139, has a bug

Configuration configuration = readConfigFromFile(_configFileName);
if (configuration == null) {
  if (_configFileName != null) {
    LOGGER.error("Error: Unable to find file {}.", _configFileName);
    return false;
  }
  configuration = new PropertiesConfiguration(); // useless
}

realtime and offline not working together

I want to test the hybrid mode, so I followed the tutorial: https://github.com/linkedin/pinot/wiki/How-To-Use-Pinot. Here are my steps:

Offline flow

  • create an offline table named "flights" by:
    bin/pinot-admin.sh AddSchema -schemaFile flights-schema.json -exec
    bin/pinot-admin.sh AddTable -filePath flights-definition.json
  • ingest flights data 1988-2014 from Hadoop...
  • after that, I checked by counting all records in the "flights" table:
    bin/pinot-admin.sh PostQuery -query "select count(*) from flights"
    and see that all data has been loaded successfully:
    Result: {"traceInfo":{},"numDocsScanned":844482,"aggregationResults":[{"function":"count_star","value":"844482"}],"timeUsedMs":14,"segmentStatistics":[],"exceptions":[],"totalDocs":844482}

Realtime flow

  • streaming 2015 flights data into kafka:
    bin/pinot-admin.sh StreamAvroIntoKafka -avroFile flights-2015_1.avro -kafkaTopic flights-realtime &
  • create a realtime table named "flights" to consume from the Kafka topic:
    bin/pinot-admin.sh AddTable -filePath flights-definition-realtime.json
    I didn't call AddSchema, since it's already created in offline flow.

The doc says that if a table has the same name for offline and realtime, query processing for this table will switch to hybrid mode. So I suppose I should be able to get both offline and realtime data from table "flights" now. Unfortunately, after creating the realtime table "flights", I can't query the count from "flights" any longer; here is what I got:
result: {"traceInfo":{},"numDocsScanned":0,"aggregationResults":[],"timeUsedMs":0,"segmentStatistics":[],"exceptions":[],"totalDocs":0}

BUT, I found I can query flights_REALTIME for realtime data and flights_OFFLINE for offline data. These two table names, I guess, are created internally by Pinot.

Don't know what's wrong, any help would be appreciated.

Adding/removing physical segment for a logical table

I have the following setup:

2 (two) Pinot servers hosting the same real-time table (same table name / Kafka consumer group id); each of the servers receives half of the total messages from the Kafka stream.
1 (one) ZooKeeper, 1 (one) broker and 1 (one) controller

What happens if one of the servers crashes (the server process itself)? I understand that the remaining server will consume 100% of the messages while the crashed server is down (I can see that the broker query returns about half of the document count in this case), but how can I add the second server back (if at all) in order to restore the original setup?

What's going to happen if a broker/controller crashes?

In general, how should I manage adding/removing Pinot segments to rebalance workload and/or recover from a system crash?

Thank you

How do I configure a real-time node to store data for only 10 minutes?

I have some questions about retentionTime.

I want to store the data in the real-time node for only 10 minutes.
I tested the configuration below:

{
    "tableName":"loadTest",
    "segmentsConfig" : {
        "retentionTimeUnit":"MINUTES",
        "retentionTimeValue":"10",
        "segmentPushFrequency":"daily",
        "segmentPushType":"APPEND",
        "replication" : "1",
        "schemaName" : "loadTest",
        "timeColumnName" : "time",
        "timeType" : "MILLISECONDS",
        "segmentAssignmentStrategy" : "BalanceNumSegmentAssignmentStrategy"
    },
    "tableIndexConfig" : {
        "loadMode"  : "HEAP",
        "lazyLoad"  : "false",
                "streamConfigs": {
                        "streamType": "kafka",
                        "stream.kafka.consumer.type": "highLevel",
                        "stream.kafka.topic.name": "loadTest",
                        "stream.kafka.decoder.class.name": "com.linkedin.pinot.core.realtime.impl.kafka.KafkaJSONMessageDecoder",
                        "stream.kafka.zk.broker.url": "localhost:2181",
                        "stream.kafka.hlc.zk.connect.string": "localhost:2181"
                }
    },
    "tableType":"REALTIME",
        "tenants" : {
                "broker":"brokerOne",
                "server":"serverOne"
        },
    "metadata": {
    }
}

An error had occurred.

2015/06/24 17:18:53.593 ERROR [com.linkedin.pinot.controller.helix.core.retention.RetentionManager] [] Error creating TimeRetentionStrategy for table: loadTest_REALTIME, Exception: java.lang.NullPointerException

Can you explain the retention configuration?
How do I configure the real-time node to store data for only 10 minutes?

Support a 'contains' PredicateEvaluator for dimension columns

I have found that Pinot contains a regex predicate, but it is not implemented yet. I suggest that a 'contains' predicate on dimension strings should be a much higher priority, because it is mostly used by filter queries.

To implement it with the current dictionary algorithm, Pinot would have to scan the whole dictionary index of the data source, which would be inefficient. Maybe Pinot could build another index file for this filter, as Lucene does?

java.net.UnknownHostException occurs during quick-start-offline.sh

The reason is that PostQueryCommand._brokerHost is only initialized in execute() but not in run(). Please see the trace below.

Offline quickstart complete
Total number of documents in the table
Query : select count(*) from baseballStats limit 0
2015/06/26 16:06:20.135 ERROR [org.apache.zookeeper.server.NIOServerCnxn] [] Thread Thread[main,5,main] died
java.net.UnknownHostException: null
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:178)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at java.net.Socket.connect(Socket.java:528)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
        at sun.net.www.http.HttpClient.New(HttpClient.java:308)
        at sun.net.www.http.HttpClient.New(HttpClient.java:326)
        at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
        at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)
        at com.linkedin.pinot.tools.admin.command.PostQueryCommand.run(PostQueryCommand.java:92)
        at com.linkedin.pinot.tools.admin.command.QuickstartRunner.runQuery(QuickstartRunner.java:148)
        at com.linkedin.pinot.tools.Quickstart.execute(Quickstart.java:174)
        at com.linkedin.pinot.tools.Quickstart.main(Quickstart.java:218)
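
A minimal sketch of the kind of guard that would work around this (the class and field names come from the description above; defaulting to localhost and the overall method shape are my assumptions, not the actual project code):

// Inside PostQueryCommand (sketch only): make run() usable even when
// execute() has not initialized _brokerHost.
public boolean run() throws Exception {
  if (_brokerHost == null) {
    _brokerHost = "localhost"; // assumed default; execute() normally sets this
  }
  // ... existing logic that builds the broker URL from _brokerHost and posts the query ...
  return true;
}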

Broker responds to GET requests?

In PinotClientRequestServlet in the Broker, it looks like there is a handler for GET requests that's looking for a "bql" parameter:

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    try {
      resp.getOutputStream().print(handleRequest(new JSONObject(req.getParameter("bql"))).toJson().toString());
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
    } catch (final Exception e) {
      resp.getOutputStream().print(e.getMessage());
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
      LOGGER.error("Caught exception while processing GET request", e);
      brokerMetrics.addMeteredValue(null, BrokerMeter.UNCAUGHT_GET_EXCEPTIONS, 1);
    }
  }

Is GET part of the officially supported API? If so, should "bql" be "pql" to be consistent with the documentation?

Also, should a more efficient / robust library like Jackson be used to parse JSON here?
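
If Jackson were adopted, the parsing step in doGet might look roughly like this sketch (my illustration only; ObjectMapper and readTree are standard Jackson APIs, and the "pql" parameter name follows the question above):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Sketch only: parse the request parameter with Jackson instead of org.json,
// reading the documented "pql" parameter name.
ObjectMapper mapper = new ObjectMapper();
JsonNode requestJson = mapper.readTree(req.getParameter("pql"));
// handleRequest(...) would then need to accept a JsonNode rather than a JSONObject.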

Adding schema/table in realtime mode

Hello.

Have you changed something in terms of adding a schema/table in realtime (Kafka) mode?
I'm using the same schema/table files that I've been using since my first test last week, and there was always a feedback line with some info printed to stdout after a './pinot-admin AddSchema ...' command, as well as after the './pinot-admin AddTable ...' command.

I add schema and table after running broker, controller and server.

Since June 26 there have been no such printouts anymore, the query returns a 'no table hit' error, and there's no consumption from Kafka, as expected...

Thanks.

metricField with long type throws IndexOutOfBoundsException

table schema file

{
  "dimensionFieldSpecs": [
    {
      "name": "field",
      "dataType": "STRING",
      "delimiter": null,
      "singleValueField": true
    },
    {
      "name": "host",
      "dataType": "STRING",
      "delimiter": null,
      "singleValueField": true
    }
  ],
  "timeFieldSpec": {
    "incomingGranularitySpec": {
      "timeType": "SECONDS",
      "dataType": "LONG",
      "name": "time"
    }
  },
  "metricFieldSpecs": [
    {
      "name": "value",
      "dataType": "LONG",
      "delimiter": null,
      "singleValueField": true
    }
  ],
  "schemaName": "changefield"
}
table definition file

{
  "tableName": "changefield",
  "segmentsConfig": {
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "7",
    "segmentPushFrequency": "daily",
    "segmentPushType": "APPEND",
    "replication": "1",
    "schemaName": "changefield",
    "timeColumnName": "time",
    "timeType": "SECONDS",
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy"
  },
  "tableIndexConfig": {
    "invertedIndexColumns": ["field", "host"],
    "loadMode": "HEAP",
    "lazyLoad": "false",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "highLevel",
      "stream.kafka.topic.name": "changefield-realtime",
      "stream.kafka.decoder.class.name": "com.linkedin.pinot.core.realtime.impl.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.zk.broker.url": "localhost:2181",
      "stream.kafka.hlc.zk.connect.string": "localhost:2181"
    }
  },
  "tableType": "REALTIME",
  "tenants": {
    "broker": "brokerOne",
    "server": "serverOne"
  },
  "metadata": {
  }
}
One example record: ["field311", "host10", "1435636716222", "815"]

When I use the above schema, I get an IndexOutOfBoundsException when the segment is dumped, but if I change the value dataType from "LONG" to "INT", it succeeds.
Exception:
2015/06/29 11:48:33.386 INFO [com.linkedin.pinot.core.data.manager.realtime.RealtimeSegmentDataManager] [] Trying to build segment!
2015/06/29 11:48:33.394 INFO [com.linkedin.pinot.core.segment.creator.impl.SegmentIndexCreationDriverImpl] [] Start building StatsCollector!
2015/06/29 11:49:59.837 ERROR [com.linkedin.pinot.core.data.manager.realtime.RealtimeSegmentDataManager] [] Caught exception in the realtime indexing thread
java.lang.IndexOutOfBoundsException
at java.nio.Buffer.checkIndex(Buffer.java:538)
at java.nio.DirectByteBuffer.getLong(DirectByteBuffer.java:766)
at com.linkedin.pinot.core.index.reader.impl.FixedByteWidthRowColDataFileReader.getLong(FixedByteWidthRowColDataFileReader.java:204)
at com.linkedin.pinot.core.index.readerwriter.impl.FixedByteSingleColumnSingleValueReaderWriter.getLong(FixedByteSingleColumnSingleValueReaderWriter.java:140)
at com.linkedin.pinot.core.realtime.impl.RealtimeSegmentImpl.getRawValueRowAt(RealtimeSegmentImpl.java:539)
at com.linkedin.pinot.core.realtime.converter.RealtimeSegmentRecordReader.next(RealtimeSegmentRecordReader.java:78)
at com.linkedin.pinot.core.segment.creator.impl.SegmentIndexCreationDriverImpl.build(SegmentIndexCreationDriverImpl.java:105)
at com.linkedin.pinot.core.realtime.converter.RealtimeSegmentConverter.build(RealtimeSegmentConverter.java:88)
at com.linkedin.pinot.core.data.manager.realtime.RealtimeSegmentDataManager$2.run(RealtimeSegmentDataManager.java:153)
at java.lang.Thread.run(Thread.java:744)

Can't upload any segments after loading about 10000 segment parts

After loading 9000 segments, we couldn't upload any segments successfully and Pinot didn't log any useful information. We looked into it and found two places we should fix.
First: ZooKeeper doesn't support a znode size bigger than 1 MB without adding the jute.maxbuffer parameter to the system environment.
Second: Helix cuts the ZooKeeper message down to 1 MB at ZNRecordStreamingSerializer.java line 168.

After fixing the above two places, we can load more segments now.

cZxid = 0x20002aec3
ctime = Fri Aug 21 14:01:46 CST 2015
mZxid = 0x2000a17a2
mtime = Sun Aug 23 21:20:26 CST 2015
pZxid = 0x20002aec3
cversion = 0
dataVersion = 2024
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 1171020
numChildren = 0

Hope I've made myself clear!
Thanks to my buddies @iamcrzay @code4love

About Pinot's develop plans and timelines

Hi, I would like to know about the feature plans and ETA/timelines of Pinot.
We need to upgrade our realtime analytics system and have done a lot of investigation on different kinds of systems. Pinot is great and looks pretty suitable for us, but currently it does lack some query features we really need. They are:

  1. Calculation on dimensions and aggregated metrics. e.g. select (dimA / 100) as dimB, (sum(metricA) + sum(metricB)) as metricC from table group by dimB order by metricC.
  2. Not only the topN, but the real & completed order by support. e.g. select dimA, sum(metricA), sum(metricB) from table group by dimA order by sum(metricA) desc, sum(metricB) asc offset 10, 20
  3. Having clause.

Broker InstanceName Problem

if I set

./pinot-admin.sh StartBroker -brokerPort 8099 -zkAddress "localhost:2181" -brokerInstName "MyNameIsBroker"

Then I received the exception below:

ZKExceptionHandler - Exception while invoking init callback for listener:com.linkedin.pinot.broker.broker.helix.LiveInstancesChangeListenerImpl@50b7114a
java.lang.ArrayIndexOutOfBoundsException: 1
    at com.linkedin.pinot.broker.broker.helix.LiveInstancesChangeListenerImpl.onLiveInstanceChange(LiveInstancesChangeListenerImpl.java:75)
    at org.apache.helix.manager.zk.CallbackHandler.invoke(CallbackHandler.java:173)
    at org.apache.helix.manager.zk.CallbackHandler.init(CallbackHandler.java:324)
    at org.apache.helix.manager.zk.CallbackHandler.<init>(CallbackHandler.java:114)
    at org.apache.helix.manager.zk.ZKHelixManager.addListener(ZKHelixManager.java:299)
    at org.apache.helix.manager.zk.ZKHelixManager.addLiveInstanceChangeListener(ZKHelixManager.java:318)
    at com.linkedin.pinot.broker.broker.helix.HelixBrokerStarter.<init>(HelixBrokerStarter.java:120)
    at com.linkedin.pinot.tools.admin.command.StartBrokerCommand.execute(StartBrokerCommand.java:98)
    at com.linkedin.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:85)
    at com.linkedin.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:94)

So is this an issue?

How to use binary operator

Hello

I want to execute the query 'select sum_duration/sum_usage_count from test limit 1', but I got an error like:
org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
org.antlr.runtime.DFA.predict(DFA.java:144)
com.linkedin.pinot.pql.parsers.PQLParser.selection_list(PQLParser.java:2133)
com.linkedin.pinot.pql.parsers.PQLParser.select_stmt(PQLParser.java:1459)
com.linkedin.pinot.pql.parsers.PQLParser.statement(PQLParser.java:1246)
com.linkedin.pinot.pql.parsers.PQLCompiler.compile(PQLCompiler.java:50)
com.linkedin.pinot.broker.servlet.PinotClientRequestServlet.handleRequest(PinotClientRequestServlet.java:99)
com.linkedin.pinot.broker.servlet.PinotClientRequestServlet.doPost(PinotClientRequestServlet.java:82)
javax.servlet.http.HttpServlet.service(HttpServlet.java:688)
javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:669)
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
org.eclipse.jetty.server.Server.handle(Server.java:365)
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
java.lang.Thread.run(Thread.java:745)

The table test's structure:

action_data STRING
area STRING
device_type STRING
is_new STRING
sum_duration DOUBLE
sum_usage_count INT
time_range INT
user_count INT

Is there any error in the query statement? Waiting for your answer!

Incorrect secondary ordering

Using quick-start-offline.sh with the repository as of commit dd3b058, a secondary ordering term is not working as expected.

For example, curl -X POST -d '{"pql":"select yearID, playerName from baseballStats order by yearID, playerName"}' http://localhost:7000/query yields:

{
  "selectionResults": {
    "columns": [
      "yearID",
      "playerName"
    ],
    "results": [
      [
        "1871",
        "Caleb Clark"
      ],
      [
        "1871",
        "Asahel"
      ],
      [
        "1871",
        "Arthur Algernon"
      ],
      [
        "1871",
        "Andrew Jackson"
      ],
      [
        "1871",
        "Alfred L."
      ],
      [
        "1871",
        "Alfred James"
      ],
      [
        "1871",
        "Albert Goodwill"
      ],
      [
        "1871",
        "Albert G."
      ],
      [
        "1871",
        "Adrian Constantine"
      ],
      [
        "1871",
        "Calvin Alexander"
      ]
    ]
  },
  "traceInfo": {},
  "numDocsScanned": 97889,
  "aggregationResults": [],
  "timeUsedMs": 20,
  "segmentStatistics": [],
  "exceptions": [],
  "totalDocs": 97889
}

Notice that the playerName values in the grouping are not in the correct order.

quick-start-realtime.sh gives exception on the start "URI is not hierarchical"

I've built Pinot according to the README.

/var/www/RnD/pinot/pinot-distribution/target/pinot-0.016-pkg/bin$ ./quick-start-realtime.sh
Exception in thread "main" java.lang.IllegalArgumentException: URI is not hierarchical
at java.io.File.<init>(File.java:418)
at com.linkedin.pinot.tools.RealtimeQuickStart.main(RealtimeQuickStart.java:35)

/var/www/RnD/pinot/pinot-distribution/target/pinot-0.016-pkg/bin$ java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-2~deb7u1)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

quick-start-offline.sh return to error "URI is not hierarchical"

Dear LinkedIn,

After compiling Pinot, I tried to start it:

./quick-start-offline.sh
Exception in thread "main" java.lang.IllegalArgumentException: URI is not hierarchical
at java.io.File.<init>(File.java:392)
at com.linkedin.pinot.tools.Quickstart.execute(Quickstart.java:112)
at com.linkedin.pinot.tools.Quickstart.main(Quickstart.java:201)

but it returns the error shown above.

Error when closing inverted index for column

This exception happens during SegmentTarPush.
I have three pageview log files, d1, d2, d3. After SegmentCreation, I push the index to the controller.
d1 and d3 work fine, but exceptions happen when pushing d2. It seems the index was broken (part of d1's data cannot be queried).

Inverted Index

Can you explain how the inverted index is used within Pinot? I have used Druid previously, and they implement a bitmap index for the dimensions so they can do very fast filtering on low cardinality data sets. I imagine the inverted index is used similarly here, but it is not documented.

Also, may I suggest having a Google group so these questions can go there instead.

quickstart consumes 100% of one CPU core for unknown reason.

Hello,

Just found a strange issue - quickstart consumes 100% of one core. I'm using it for local development/testing on a Mac Pro. As a result, the fan is always running, which is uncomfortable.

After a quick investigation I found this code in
/pinot-tools/src/main/java/com/linkedin/pinot/tools/Quickstart.java:

long st = System.currentTimeMillis();
while (true) {
  if (System.currentTimeMillis() - st >= (60 * 60) * 1000) {
    break;
  }
}

Is it possible to add Thread.yield() or Thread.sleep() here to reduce the burden?
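
A minimal sketch of that suggestion, replacing the busy-wait with a periodic Thread.sleep (the one-hour bound is taken from the snippet above; the one-second sleep interval is my choice):

long st = System.currentTimeMillis();
// Sketch: sleep in short intervals instead of spinning, so the quickstart still
// exits after roughly one hour without pegging a CPU core.
while (System.currentTimeMillis() - st < (60 * 60) * 1000L) {
  try {
    Thread.sleep(1000); // check once per second instead of busy-looping
  } catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    break;
  }
}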

Thank you!

error mvn install package -DskipTests

[INFO] Total time: 51.248s
[INFO] Finished at: Wed Jun 17 13:19:19 EDT 2015
[INFO] Final Memory: 121M/237M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.5.5:single (make-bundles) on project pinot-distribution: Failed to create assembly: Error creating assembly archive bin: Problem creating zip: Execution exception: Java heap space -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.5.5:single (make-bundles) on project pinot-distribution: Failed to create assembly: Error creating assembly archive bin: Problem creating zip: Execution exception
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
    at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)

Hybrid quickstart

We should also have a hybrid mode quick start for users to play with.

Data was changed itself

Hello

I'm so interested in Pinot that I'm trying it out.
I'm having trouble with the issue below.

I sent data to the realtime node using Kafka.
After a while, the data changed by itself.

The first results are as follows (SQL query result):

{
    "exceptions": [],
    "traceInfo": {},
    "timeUsedMs": 1,
    "totalDocs": 56700,
    "segmentStatistics": [],
    "aggregationResults": [],
    "selectionResults": {
        "results": [
            ["A", "1435048500111", "1"],
            ["A", "1435048500113", "1"],
            ["A", "1435048500115", "1"],
            ["A", "1435048500117", "1"],
            ["A", "1435048500119", "1"],
        ],
        "columns": ["name", "time", "value"]
    },
    "numDocsScanned": 5
}

But after a while, the values in column "value" changed by themselves.
The next results for the same query are as follows (SQL query result):

{
    "exceptions": [],
    "traceInfo": {},
    "timeUsedMs": 1,
    "totalDocs": 56700,
    "segmentStatistics": [],
    "aggregationResults": [],
    "selectionResults": {
        "results": [
            ["A", "1435048500111", "0"],
            ["A", "1435048500113", "0"],
            ["A", "1435048500115", "0"],
            ["A", "1435048500117", "0"],
            ["A", "1435048500119", "0"],
        ],
        "columns": ["name", "time", "value"]
    },
    "numDocsScanned": 5
}

Is that because the column name "value" is a reserved keyword?

cf.
schema file

{
        "dimensionFieldSpecs" : [
                {
                        "name": "name",
                        "dataType" : "STRING",
                        "delimiter" : null,
                        "singleValueField" : true
                }
        ],
        "timeFieldSpec" : {
                "incomingGranularitySpec" : {
                        "timeType" : "MILLISECONDS",
                        "dataType" : "LONG",
                        "name" : "time"
                }
        },
        "metricFieldSpecs" : [
                {
                        "name" : "value",
                        "dataType" : "INT",
                        "delimiter" : null,
                        "singleValueField" : true
                }
        ],
        "schemaName" : "loadTest"
}

table file

{
    "tableName":"loadTest",
    "segmentsConfig" : {
        "retentionTimeUnit":"DAYS",
        "retentionTimeValue":"7",
        "segmentPushFrequency":"daily",
        "segmentPushType":"APPEND",
        "replication" : "1",
        "schemaName" : "loadTest",
        "timeColumnName" : "time",
        "timeType" : "MILLISECONDS",
        "segmentAssignmentStrategy" : "BalanceNumSegmentAssignmentStrategy"
    },
    "tableIndexConfig" : {
        "invertedIndexColumns" : ["time"],
        "loadMode"  : "HEAP",
        "lazyLoad"  : "false",
                "streamConfigs": {
                        "streamType": "kafka",
                        "stream.kafka.consumer.type": "highLevel",
                        "stream.kafka.topic.name": "loadTest",
                        "stream.kafka.decoder.class.name": "com.linkedin.pinot.core.realtime.impl.kafka.KafkaJSONMessageDecoder",
                        "stream.kafka.zk.broker.url": "localhost:2181",
                        "stream.kafka.hlc.zk.connect.string": "localhost:2181"
                }
    },
    "tableType":"REALTIME",
        "tenants" : {
                "broker":"brokerOne",
                "server":"serverOne"
        },
    "metadata": {
    }
}

So is this an issue?
Thanks.

Exception from pinotBroker.log after issuing any query

2015-10-10 18:03:29,552 INFO  [qtp1861781750-549] com.linkedin.pinot.broker.servlet.PinotClientRequestServlet - Broker received Query String is: select count(*)  from campaign
2015-10-10 18:03:29,552 ERROR [pool-3-thread-810] 
2015-10-12 10:46:46,258 ERROR [pool-3-thread-26] com.linkedin.pinot.transport.scattergather.ScatterGatherImpl - Got exception sending request (538). Setting error future
java.lang.NullPointerException
        at com.linkedin.pinot.transport.scattergather.ScatterGatherImpl$SingleRequestHandler.run(ScatterGatherImpl.java:392)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
        at com.linkedin.pinot.transport.scattergather.ScatterGatherImpl$SingleRequestHandler.run(ScatterGatherImpl.java:392)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
com.linkedin.pinot.transport.scattergather.ScatterGatherImpl - Got exception sending request (14679). Setting error future
java.lang.NullPointerException
2015-10-10 18:03:29,552 INFO  [qtp1861781750-549] com.linkedin.pinot.transport.common.AsyncResponseFuture - Error Future for request 14679 Request is no longer pending. Cannot cancel !!
2015-10-10 18:03:29,552 INFO  [qtp1861781750-549] com.linkedin.pinot.common.metrics.AbstractMetrics -  Phase:REDUCE took 0
2015-10-10 18:03:29,552 INFO  [qtp1861781750-549] com.linkedin.pinot.common.metrics.AbstractMetrics -  Phase:QUERY_EXECUTION took 0
2015-10-10 18:03:29,552 INFO  [qtp1861781750-549] com.linkedin.pinot.broker.servlet.PinotClientRequestServlet - Broker Response : BrokerResponse(totalDocs:0, numDocsScanned:0, timeUsedMs:0, aggregationResults:[], selectionResults:null, segmentStatistics:[], exceptions:[], traceInfo:{})

In my test it only happens with many concurrent client connections; e.g., it works fine when I use 10 threads for the query test, but it throws exceptions after I raise it to 100 threads.

Count Distinct

Can someone give an example of count distinct?
I'm getting an error when I try to query this:

select count( distinct AtBatting) from baseballStats limit 10
Or

select countDistinct ( AtBatting) from baseballStats limit 10

A big problem about inverted index creation

SegmentColumnarIndexCreator.java

if (config.createInvertedIndexEnabled()) {
  invertedIndexCreatorMap.put(
      column,
      new BitmapInvertedIndexCreator(file, indexCreationInfo.getSortedUniqueElementsArray().length, schema
          .getFieldSpecFor(column)));
}

The config.createInvertedIndexEnabled() method, which comes from SegmentGeneratorConfig.java, decides whether an inverted index is created.
This is a problem, because we need to depend on the table config info (invertedIndexColumns) to decide whether a column should get an inverted index or not.
The config.setCreateInvertedIndex method is never called in the code, so config.createInvertedIndexEnabled() is always false and the table config info (invertedIndexColumns) can't take effect.

Table Config
{
  "tableIndexConfig": {
    "invertedIndexColumns": ["numberOfGamesAsBatter"],
    "loadMode": "HEAP",
    "lazyLoad": "false"
  },
  "tenants": {"server": "", "broker": ""},
  "tableType": "OFFLINE",
  "metadata": {"customConfigs": {"d2Name": "xlntBetaPinot"}},
  "segmentsConfig": {
    "retentionTimeUnit": "DAYS",
    "segmentPushFrequency": "daily",
    "replication": 1,
    "timeColumnName": "",
    "retentionTimeValue": "700",
    "timeType": "",
    "segmentPushType": "APPEND",
    "schemaName": "baseball",
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy"
  },
  "tableName": "baseballStats"
}

TimeoutException, only run select * from table limit 10

Hi,
I only run select * from table limit 10,
on about 700 million records,
with 4 server nodes;
3 of the servers have 8 GB of RAM and 4 cores.

Could you please give me some help?
Do I need to add servers to the cluster,
or configure the timeoutms?

ProcessingException(errorCode:350, message:java.util.concurrent.TimeoutException\n\tat java.util.concurrent.FutureTask.get(FutureTask.java:205)\n\tat com.linkedin.pinot.core.operator.MCombineOperator.nextBlock(MCombineOperator.java:191)\n\tat com.linkedin.pinot.core.operator.UResultOperator.nextBlock(UResultOperator.java:52)\n\tat com.linkedin.pinot.core.plan.GlobalPlanImplV0.execute(GlobalPlanImplV0.java:58)\n\tat com.linkedin.pinot.core.query.executor.ServerQueryExecutorV1Impl.processQuery(ServerQueryExecutorV1Impl.java:133)\n\tat com.linkedin.pinot.server.request.SimpleRequestHandler.processRequest(SimpleRequestHandler.java:81)\n\tat com.linkedin.pinot.transport.netty.NettyServer$NettyChannelInboundHandler.channelRead(NettyServer.java:218)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)\n\tat io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)\n\tat io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)\n\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)\n\tat io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)\n\tat java.lang.Thread.run(Thread.java:745)\n)", "ProcessingException(errorCode:350, message:java.util.concurrent.TimeoutException\n\tat java.util.concurrent.FutureTask.get(FutureTask.java:205)\n\tat com.linkedin.pinot.core.operator.MCombineOperator.nextBlock(MCombineOperator.java:191)\n\tat com.linkedin.pinot.core.operator.UResultOperator.nextBlock(UResultOperator.java:52)\n\tat com.linkedin.pinot.core.plan.GlobalPlanImplV0.execute(GlobalPlanImplV0.java:58)\n\tat com.linkedin.pinot.core.query.executor.ServerQueryExecutorV1Impl.processQuery(ServerQueryExecutorV1Impl.java:133)\n\tat com.linkedin.pinot.server.request.SimpleRequestHandler.processRequest(SimpleRequestHandler.java:81)\n\tat com.linkedin.pinot.transport.netty.NettyServer$NettyChannelInboundHandler.channelRead(NettyServer.java:218)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)\n\tat io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)\n\tat io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)\n\tat 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)\n\tat io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)\n\tat java.lang.Thread.run(Thread.java:745)\n)", "ProcessingException(errorCode:350, message:java.util.concurrent.TimeoutException\n\tat java.util.concurrent.FutureTask.get(FutureTask.java:205)\n\tat com.linkedin.pinot.core.operator.MCombineOperator.nextBlock(MCombineOperator.java:191)\n\tat com.linkedin.pinot.core.operator.UResultOperator.nextBlock(UResultOperator.java:52)\n\tat com.linkedin.pinot.core.plan.GlobalPlanImplV0.execute(GlobalPlanImplV0.java:58)\n\tat com.linkedin.pinot.core.query.executor.ServerQueryExecutorV1Impl.processQuery(ServerQueryExecutorV1Impl.java:133)\n\tat com.linkedin.pinot.server.request.SimpleRequestHandler.processRequest(SimpleRequestHandler.java:81)\n\tat com.linkedin.pinot.transport.netty.NettyServer$NettyChannelInboundHandler.channelRead(NettyServer.java:218)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)\n\tat io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)\n\tat io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)\n\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)\n\tat io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)\n\tat java.lang.Thread.run(Thread.java:745)\n)

EXTRA_JVM_ARGUMENTS in pinot-admin.sh prevents configuration of log4j file and -Xms -Xmx

Currently the only way to increase log verbosity from ERROR to INFO on pinot is to modify the pinot-log4j.properties file inside the pinot distribution.

The main problem is that pinot-admin.sh appends the EXTRA_JVM_ARGUMENTS JVM properties after the user provided JAVA_OPTS.

This causes a few problems.

  • There is no simple way to specify an alternate log4j file. Usually this would simply be done by adding -Dlog4j.configuration=... to JAVA_OPTS, but EXTRA_JVM_ARGUMENTS overrides the settings.
  • Any -Xms or -Xmx settings provided in JAVA_OPTS are ignored.

Given that Pinot has a complex classpath, developers really must start it using pinot-admin.sh. And in order to properly operate and debug Pinot, developers will often need to increase log verbosity and properly tune JVM settings.

Grouping data by time interval

I've been working on adding Pinot as a datasource to Grafana, a visualization tool for time series charts.

I'm looking for the best way to query Pinot.

In InfluxDB's query language, grouping results into time buckets is usually done like:

SELECT MEAN(value) FROM measure WHERE time > (now() - 1h) GROUP BY time(5m)

Where the result set would look something like

{
  "columns": ["value", "time"],
  "results": [
    [40,1441756680100],
    [45,1441756680200],
    [43,1441756680300],
    ...
  ]
}

And can easily be plotted.

Does Pinot support this sort of query? I've been trying a few group by queries but have been unable to formulate one that provides this sort of grouped response. If this is not yet supported, would it be possible to add support?

When there are more than eight tables, there is no consumption from Kafka

When there are more than 8 tables in the cluster, the next table can't consume from Kafka.

I tested many times, and finally discovered that a newly created table does not receive data properly.

The tables were created in order: table1_realtime, table1_offline,
                                  table2_realtime, table2_offline,
                                  table3_realtime, table3_offline,
                                  table4_realtime, table4_offline,
                                  table5_realtime
table5_realtime cannot consume from Kafka.

"select *" failed with exception

Caught exception while reducing results
java.lang.UnsupportedOperationException
at java.util.AbstractList.remove(AbstractList.java:161)
at java.util.AbstractList$Itr.remove(AbstractList.java:374)
at java.util.AbstractList.removeRange(AbstractList.java:571)
at java.util.AbstractList.clear(AbstractList.java:234)
at com.linkedin.pinot.core.query.selection.SelectionOperatorUtils.getSelectionColumns(SelectionOperatorUtils.java:99)
at com.linkedin.pinot.core.query.selection.SelectionOperatorUtils.render(SelectionOperatorUtils.java:156)
at com.linkedin.pinot.core.query.reduce.DefaultReduceService.reduceOnSelectionResults(DefaultReduceService.java:201)
at com.linkedin.pinot.core.query.reduce.DefaultReduceService.reduceOnDataTable(DefaultReduceService.java:157)
at com.linkedin.pinot.requestHandler.BrokerRequestHandler$1.call(BrokerRequestHandler.java:305)
at com.linkedin.pinot.requestHandler.BrokerRequestHandler$1.call(BrokerRequestHandler.java:302)
at com.linkedin.pinot.common.metrics.AbstractMetrics.timePhase(AbstractMetrics.java:103)
at com.linkedin.pinot.requestHandler.BrokerRequestHandler.getDataTableFromBrokerRequest(BrokerRequestHandler.java:302)
at com.linkedin.pinot.requestHandler.BrokerRequestHandler.processSingleTableBrokerRequest(BrokerRequestHandler.java:157)
at com.linkedin.pinot.requestHandler.BrokerRequestHandler.processBrokerRequest(BrokerRequestHandler.java:123)
at com.linkedin.pinot.broker.servlet.PinotClientRequestServlet$1.call(PinotClientRequestServlet.java:141)
at com.linkedin.pinot.broker.servlet.PinotClientRequestServlet$1.call(PinotClientRequestServlet.java:136)
at com.linkedin.pinot.common.metrics.AbstractMetrics.timePhase(AbstractMetrics.java:103)
at com.linkedin.pinot.broker.servlet.PinotClientRequestServlet.handleRequest(PinotClientRequestServlet.java:135)
at com.linkedin.pinot.broker.servlet.PinotClientRequestServlet.doPost(PinotClientRequestServlet.java:84)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:688)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:669)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:365)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)

Encoding problem

PinotClientRequestServlet has encoding problem:

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    try {
      resp.getOutputStream().print(handleRequest(new JSONObject(req.getParameter("bql"))).toJson().toString());
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
    } catch (final Exception e) {
      resp.getOutputStream().print(e.getMessage());
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
      LOGGER.error("Caught exception while processing GET request", e);
      brokerMetrics.addMeteredValue(null, BrokerMeter.UNCAUGHT_GET_EXCEPTIONS, 1);
    }
  }

  @Override
  protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    try {
      resp.getOutputStream().print(handleRequest(extractJSON(req)).toJson().toString());
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
    } catch (final Exception e) {
      resp.getOutputStream().print(e.getMessage());
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
      LOGGER.error("Caught exception while processing POST request", e);
      brokerMetrics.addMeteredValue(null, BrokerMeter.UNCAUGHT_POST_EXCEPTIONS, 1);
    }
  }

It should be


 @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    try {
      resp.setCharacterEncoding("UTF-8");
      resp.getOutputStream().write(handleRequest(new JSONObject(req.getParameter("bql"))).toJson().toString().getBytes("UTF-8"));
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
    } catch (final Exception e) {
      resp.getOutputStream().print(e.getMessage());
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
      LOGGER.error("Caught exception while processing GET request", e);
      brokerMetrics.addMeteredValue(null, BrokerMeter.UNCAUGHT_GET_EXCEPTIONS, 1);
    }
  }

  @Override
  protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    try {
      resp.setCharacterEncoding("UTF-8");
      resp.getOutputStream().write(handleRequest(extractJSON(req)).toJson().toString().getBytes("UTF-8"));
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
    } catch (final Exception e) {
      resp.getOutputStream().print(e.getMessage());
      resp.getOutputStream().flush();
      resp.getOutputStream().close();
      LOGGER.error("Caught exception while processing POST request", e);
      brokerMetrics.addMeteredValue(null, BrokerMeter.UNCAUGHT_POST_EXCEPTIONS, 1);
    }
  }

Server get wrong segment download url from zookeeper

We found that pinot-server gets the wrong segment download URL from ZooKeeper, like http://localhost:9999/segments/myTable/mySegment_0.tar.gz. It seems that the controller sets "localhost" as its hostname.

[pool-2-thread-1] 2015/06/19 12:36:58.665 INFO [com.linkedin.pinot.common.utils.FileUploadUtils] [] Get file from:http://localhost:9999/segments/myTable/mySegment_0.tar.gz to local file:/tmp/PinotSegment8098/segmentTar/mySegment_0.tar.gz
[pool-2-thread-1] 2015/06/19 12:36:58.666 INFO [org.apache.commons.httpclient.HttpMethodDirector] [] I/O exception (java.net.ConnectException) caught when processing request: Connection refused
[pool-2-thread-1] 2015/06/19 12:36:58.666 INFO [org.apache.commons.httpclient.HttpMethodDirector] [] Retrying request
[pool-2-thread-1] 2015/06/19 12:36:58.667 INFO [org.apache.commons.httpclient.HttpMethodDirector] [] I/O exception (java.net.ConnectException) caught when processing request: Connection refused
[pool-2-thread-1] 2015/06/19 12:36:58.667 INFO [org.apache.commons.httpclient.HttpMethodDirector] [] Retrying request
[pool-2-thread-1] 2015/06/19 12:36:58.667 INFO [org.apache.commons.httpclient.HttpMethodDirector] [] I/O exception (java.net.ConnectException) caught when processing request: Connection refused
[pool-2-thread-1] 2015/06/19 12:36:58.667 INFO [org.apache.commons.httpclient.HttpMethodDirector] [] Retrying request
[pool-2-thread-1] 2015/06/19 12:36:58.667 ERROR [com.linkedin.pinot.common.utils.FileUploadUtils] [] Caught exception
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at java.net.Socket.connect(Socket.java:538)
        at java.net.Socket.<init>(Socket.java:434)
        at java.net.Socket.<init>(Socket.java:286)
        at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
        at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
        at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
        at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at com.linkedin.pinot.common.utils.FileUploadUtils.getFile(FileUploadUtils.java:90)
        at com.linkedin.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.downloadSegmentToLocal(SegmentOnlineOfflineStateModelFactory.java:327)
        at com.linkedin.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOfflineForOfflineSegment(SegmentOnlineOfflineStateModelFactory.java:212)
        at com.linkedin.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:128)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:344)
        at org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:290)
        at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:93)
        at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:50)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

PS: In com.linkedin.pinot.tools.admin.command.StartControllerCommand, you can find the following code:

 conf.setControllerHost("localhost");
 conf.setControllerVipHost("localhost");

I think it should be:

String hostname = InetAddress.getLocalHost().getHostName();
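
For context, a minimal sketch of how that suggestion could look (the resolveControllerHost helper and the UnknownHostException fallback are my own additions, not the actual Pinot code; only the two conf setters come from StartControllerCommand itself):

import java.net.InetAddress;
import java.net.UnknownHostException;

// Resolve the machine's hostname instead of hard-coding "localhost";
// falling back to "localhost" keeps today's behaviour if resolution fails.
private static String resolveControllerHost() {
  try {
    return InetAddress.getLocalHost().getHostName();
  } catch (UnknownHostException e) {
    return "localhost";
  }
}

// ... in StartControllerCommand:
conf.setControllerHost(resolveControllerHost());
conf.setControllerVipHost(resolveControllerHost());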

Terminating quickstart with SIGINT throws IllegalStateException because of shutdown hook

Terminating bin/quick-start-offline.sh with SIGINT (Ctrl+C) throws the following exception:

java.lang.IllegalStateException: Shutdown in progress
    at java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:66)
    at java.lang.Runtime.addShutdownHook(Runtime.java:211)
    at com.linkedin.pinot.tools.admin.command.AbstractBaseCommand.<init>(AbstractBaseCommand.java:44)
    at com.linkedin.pinot.tools.admin.command.StopProcessCommand.<init>(StopProcessCommand.java:32)
    at com.linkedin.pinot.tools.admin.command.QuickstartRunner.stop(QuickstartRunner.java:109)
    at com.linkedin.pinot.tools.Quickstart$1.run(Quickstart.java:145)

The problem is that AbstractBaseCommand always registers the shutdown hook without handling the case where a shutdown is already in progress. For StopProcessCommand the shutdown is always in progress (modulo slow timings).

I think it's OK to ignore the exception or just call cleanup in the constructor if the general contract is that cleanup must be called in any case.
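
A minimal sketch of the kind of guard I have in mind for AbstractBaseCommand (the cleanup hook body is an assumption based on the discussion above, not the actual code):

// Inside AbstractBaseCommand's constructor:
try {
  Runtime.getRuntime().addShutdownHook(new Thread(this::cleanup));
} catch (IllegalStateException e) {
  // "Shutdown in progress": the JVM is already shutting down, e.g. when
  // StopProcessCommand is constructed from within another shutdown hook.
  // Ignoring is safe here, or cleanup() could be invoked directly instead.
}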

NPE inside query executor

Just hit the NPE below; I believe it was triggered by this query:

SELECT COUNT(timestamp) FROM spans WHERE timestamp BETWEEN 1435708800 AND 1437519830 GROUP BY toServiceName

java.lang.NullPointerException
    at com.linkedin.pinot.core.operator.MCombineOperator.trimToSize(MCombineOperator.java:257)
    at com.linkedin.pinot.core.operator.MCombineOperator.nextBlock(MCombineOperator.java:246)
    at com.linkedin.pinot.core.operator.UResultOperator.nextBlock(UResultOperator.java:52)
    at com.linkedin.pinot.core.plan.GlobalPlanImplV0.execute(GlobalPlanImplV0.java:58)
    at com.linkedin.pinot.core.query.executor.ServerQueryExecutorV1Impl.processQuery(ServerQueryExecutorV1Impl.java:133)
    at com.linkedin.pinot.server.request.SimpleRequestHandler.processRequest(SimpleRequestHandler.java:81)
    at com.linkedin.pinot.transport.netty.NettyServer$NettyChannelInboundHandler.channelRead(NettyServer.java:218)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)

Thirdeye Setup, Documentation, Examples

I have been trying to get Thirdeye running with an example dataset and have encountered some difficulties. Some of these are resolved, and I've written them down here for completeness; others are requests for help. I can create new issues for some of these, if required.

  • The Thirdeye Hadoop jobs do not take in Hadoop job memory settings (e.g. -Dmapreduce.map.memory.mb), and job.properties memory settings are not configured.
  • There is no example of config.yml. Here is one that worked for me:
collection: registrations
dimensions:
  - name: MY_DIM_1
  - name: MY_DIM_2
metrics:
  - name: count
    type: INT
time:
  columnName: date
  input:
    size: 1
    unit: HOURS
  bucket:
    size: 1
    unit: HOURS
  retention:
    size: 365
    unit: DAYS
rollup:
  functionClass: com.linkedin.thirdeye.bootstrap.rollup.TotalAggregateBasedRollupFunction
  functionConfig:
    metricName: "count"
    threshold: "100"
split:
  threshold: 3

I am unsure exactly what can be put here; there was also a TopKRollupSpec that looked interesting...

  • In the bootstrap phases, I had to add some extra properties to job.properties, or the jobs would error out. Overall, I added the following:
thirdeye.time.min=2010-01-01
thirdeye.time.max=2020-01-01
thirdeye.dimension.index.ref=data_2010-01-01-000000_2020-01-01-000000
thirdeye.flow.schedule=MONTHLY

In particular, I believe the first two lines were needed in the analysis job, as per https://github.com/linkedin/pinot/blob/master/thirdeye/thirdeye-bootstrap/src/main/java/com/linkedin/thirdeye/bootstrap/ThirdEyeJob.java#L1131

The third line was required for the startree_bootstrap_phase1 and startree_bootstrap_phase2 jobs, and thirdeye.flow.schedule was needed for the server_package job.

  • Without Kerberos set up in our Hadoop cluster, I had to use the server_package job in the thirdeye-bootstrap package to combine the {thirdeye.root}/{thirdeye.collection}/METRIC_INDEX/DATE_RANGE/startree_bootstrap_phase2/data-x.tar.gz files into a single data.tar.gz.
    Then I had to POST that data.tar.gz file to the /collections/{thirdeye.collection} endpoint, and after restarting thirdeye-server, the collection got picked up.
    Commands I used:
curl -XPOST \
  --data-binary @{MY_PATH_TO}/config.yml \
  -H 'Content-Type:application/octet-stream' \
  localhost:8080/collections/registrations
curl -XPOST \
  --data-binary @{MY_PATH_TO}/data.tar.gz \
  -H 'Content-Type:application/octet-stream' \
  localhost:8080/collections/registrations/data/{START_MILLIS}/{END_MILLIS}

Here {START_MILLIS} and {END_MILLIS} are the milliseconds-since-epoch values for the dates I set as thirdeye.time.min and thirdeye.time.max, respectively.
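
For reference, a minimal plain-Java sketch (not a Thirdeye API) of how those millisecond values can be derived, assuming the dates above are interpreted as midnight UTC:

import java.time.Instant;

// Compute {START_MILLIS} and {END_MILLIS} from thirdeye.time.min/max.
long startMillis = Instant.parse("2010-01-01T00:00:00Z").toEpochMilli(); // 1262304000000
long endMillis = Instant.parse("2020-01-01T00:00:00Z").toEpochMilli();   // 1577836800000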

  • Just having thirdeye-server doesn't seem to be enough if I want the dashboard UI. I had to start thirdeye-dashboard with its own YAML config (the default config makes it start on port 8080, which conflicted with thirdeye-server).
    Configs I used:
serverUri: http://localhost:8080
server:
  applicationConnectors:
  - type: http
    port: 8082
  adminConnectors:
  - type: http
    port: 9001
funnelConfigRoot: SOME_FOLDER
  • The funnelConfigRoot was required, or the FunnelsDataProvider would not be instantiated, even if the folder doesn't contain a funnel.yml. I believe this behavior is because of https://github.com/linkedin/pinot/blob/master/thirdeye/thirdeye-dashboard/src/main/java/com/linkedin/thirdeye/dashboard/ThirdEyeDashboard.java#L73
  • I got some SocketTimeoutExceptions for queries that thirdeye-dashboard was making to thirdeye-server, but they seem to have disappeared now. I'm not sure if there is a place to set the timeout.
  • In general, there are a lot of 'custom dashboard' and 'custom funnel' features that I am unsure how to get working. Some documentation there would be great.
  • My dataset, I believe, is sparse enough that for some combinations of dimension values there are 0 rows. This resulted in a cryptic DataTables error message of DataTables warning: table id=tabularView-metric_0 - Requested unknown parameter '8' for row 396. For more information about this error, please see http://datatables.net/tn/4
    Using the Chrome console, I got an error message of:
    "FreeMarker template error (DEBUG mode; use RETHROW in production!): The following has evaluated to null or missing: ==> cell.statsMap['volume_difference'] [in template "com/linkedin/thirdeye/dashboard/views/dimension/heat-map.ftl" at line 162, column 44]...
    The full message is available here: https://gist.github.com/cliu587/704d0855fc9ae4d8bddc

I am pretty stuck on the last part, but it only affects the heatmap tab.
The Contributors tab works fine.

Hopefully this is helpful!
