Apache Drill parallel queries run sequentially over REST

I tested Apache Drill on a local file system.
I used the REST API to query some Parquet files. When I run a REST query, I cannot execute another query; it waits until the first query finishes. I want two queries to each use half of the CPU, but multiple queries seem to finish sequentially.

This is a regression present in versions 1.13 and 1.14:
https://issues.apache.org/jira/browse/DRILL-6693
It has since been resolved. The fix is in the master branch and will be part of the upcoming Drill 1.15 release.

Under the Apache Drill options in the web UI, check the following options:
exec.queue.enable
exec.queue.large
exec.queue.small
Description:
exec.queue.enable: Changes the state of query queues. False allows unlimited concurrent queries.
exec.queue.large: Sets the number of large queries that can run concurrently in the cluster. Range: 0-1000
exec.queue.small: Sets the number of small queries that can run concurrently in the cluster. Range: 0-1001
It also depends on the complexity of the query: if the query has joins, Drill considers it multiple queries internally, and exec.queue.large should be set higher.
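As a quick sketch, the same options can also be set from any Drill client with ALTER SYSTEM; the values below are illustrative starting points, not recommendations:

ALTER SYSTEM SET `exec.queue.enable` = true;
ALTER SYSTEM SET `exec.queue.small` = 10; -- up to 10 small queries may run concurrently
ALTER SYSTEM SET `exec.queue.large` = 2;  -- up to 2 large queries may run concurrently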

ksqlDB recommendations for deploying large set of queries

I am running a ksqlDB streaming application that consists of a large number of queries (>60 queries), including many joins and aggregations. My data comes from various sources, and requires plenty of manipulation to produce the desired processed data, hence the large number of queries. I've run this set of queries on a single machine, using interactive mode, and it produces the right results. But I observe an increasing consumer lag when I increase the amount of data fed into the application.
I read on ksqlDB's Capacity Planning page that I can scale by adding more servers, which is what I plan to do.
Under Important Sizing Factors, it's also stated that "You should avoid running a large number of queries on one ksqlDB cluster. Instead, use interactive mode to play with your data and develop sets of queries that function together. Then, run these in their own headless cluster." However, I am unsure how to do this; my queries are all dependent on each other.
Does anyone have any general recommendations on how to deploy a large number of interdependent ksql queries? As an added requirement, the data is refreshed each day and is independent for each new day, so I need to do some sort of refresh of the queries each day.
I think that's just a recommendation: if you can, group queries that depend on each other, and then split those groups across headless-mode servers.
Another way, if you use interactive mode, is to partition your topics and add more ksql servers to your cluster. This will allow ksql to split the workload across the cluster, each server consuming and processing a subset of partitions. Say you have 4 partitions per topic and 2 servers: one server processes 2 partitions and the other server processes the other 2. This should decrease the workload on each server.
Another improvement is to reduce the number of stream threads. Each query you create runs with 4 Kafka Streams threads by default. The more threads, the more parallel work is done on the server; with a large number of queries, performance decreases and lag grows. Try 1 thread and see if that works. Set ksql.streams.num.stream.threads=1 in ksql-server.properties to configure it, as in the snippet below.
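A minimal ksql-server.properties snippet (the value 1 is just a starting point to experiment with):

# Use one Kafka Streams thread per query instead of the default 4
ksql.streams.num.stream.threads=1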

client_backend vs parallel_worker?

I'm running:
select *
from pg_stat_activity
It shows 2 rows with the same query text (in the query field), both in the active state, but one row shows client backend as the backend_type and the other shows parallel worker.
Why do I have 2 instances of the same query? (I have run just one query in my app.)
What is the difference between client backend and parallel worker?
Since PostgreSQL v10 there is parallel processing for queries:
If the optimizer decides it is a good idea and there are enough resources, PostgreSQL will start parallel worker processes that execute the query together with your client backend. Eventually, the client backend will gather all the information from the parallel workers and finish query processing.
This speeds up query processing, but uses more resources on the database server.
The parameters that govern this include max_parallel_workers, which caps the total number of parallel worker processes, and max_parallel_workers_per_gather, which limits the number of parallel workers for a single query.
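As a quick sketch, you can watch both process types for a running query in pg_stat_activity, and inspect or change the limits, like this:

SELECT pid, backend_type, state, query
FROM pg_stat_activity
WHERE state = 'active';

SHOW max_parallel_workers;                -- total limit for parallel worker processes
SHOW max_parallel_workers_per_gather;     -- limit per query
SET max_parallel_workers_per_gather = 0;  -- disables parallel query for this session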

Apache NiFi: Oracle to MongoDB data transfer

I want to transfer data from Oracle to MongoDB using Apache NiFi. Oracle has a total of 9 million records.
I have created a NiFi flow using the QueryDatabaseTable and PutMongoRecord processors. This flow is working fine but has some performance issues.
After starting the NiFi flow, records in the queue between SplitJson -> PutMongoRecord keep increasing.
Is there any way to slow down the rate at which the SplitJson processor puts records into the queue?
OR
Is there a way to increase the rate of insertion in PutMongoRecord?
Right now 100k records are inserted in 30 minutes; how can I speed this up?
@Vishal, the solution you are looking for is to increase the concurrency of PutMongoRecord (the Concurrent Tasks setting on the processor's Scheduling tab).
You can also experiment with the batch size in the processor's configuration tab.
You can also slow down the scheduling of SplitJson. However, you should remember this processor is going to take 1 FlowFile and make a LOT of FlowFiles regardless of the timing.
How much you can increase concurrency is going to depend on how many NiFi nodes you have, and how many CPU cores each node has. Be experimental and methodical here. Move up in single increments (1, 2, 3, etc.) and test the flow at each increment. If you only have 1 node, you may not be able to tune the flow to your performance expectations. Tune the flow instead for stability and make it as fast as you can; then consider scaling.
How much you can increase concurrency and batch size is also going to depend on the MongoDB data source and the total number of connections you can get from NiFi to Mongo.
In addition to Steven's answer, there are two properties on QueryDatabaseTable that you should experiment with:
Max Results Per Flowfile
Use Avro logical types
With the latter, you might be able to do a direct shift from Oracle to MongoDB, because it'll convert Oracle date types into Avro ones, and those should in turn be converted directly into proper Mongo date types. Max Results Per Flowfile should also allow you to specify appropriate batching without having to use the extra processors.

How to read all data from OrientDB in parallel

I want to read all the data from an OrientDB database, and I don't want to get an iterator; I want to read all the data in parallel, in chunks, from distinct PCs across the network. Is there any way to read the database's clusters in parallel, or any other way to do this?
I have seen the Spark connector for OrientDB; it queries the clusters of the OrientDB classes directly in order to read the values of a complete class in parallel.
Orient-Spark
Git-code
You can use PARALLEL in a SELECT query.
See: https://orientdb.com/docs/last/SQL-Query.html
From the documentation: PARALLEL executes the query against x concurrent threads, where x refers to the number of processors or cores found on the host operating system of the query. You may find PARALLEL execution useful on long-running queries or queries that involve multiple clusters. For simple queries, using PARALLEL may cause a slowdown due to the overhead inherent in using multiple threads.
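A minimal sketch, assuming a hypothetical class named Person:

SELECT FROM Person WHERE age > 30 PARALLEL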

PagingItemReader vs CursorItemReader in Spring Batch

I have a Spring Batch job with multiple steps, some sequential and some parallel. Some of these steps involve fetching millions of rows, and the query has multiple joins and left joins. I tried using JdbcPagingItemReader, but the order by clause simply hangs the query; I don't get results even after 30 minutes of waiting. So I switched to JdbcCursorItemReader.
Is that approach fine? I understand that the JdbcCursorItemReader fetches all the data at once and writes it out based on the commit interval. Is there any option to tell the reader to fetch, say, 50,000 records at a time, so that my application and the system are not overloaded?
Thank you for your response, Michael. I have 22 customized item readers extended from JdbcCursorItemReader. If there are multiple threads, how would Spring Batch handle the ResultSet? Is there a possibility of multiple threads reading from the same ResultSet in this case as well?
The JdbcCursorItemReader has the ability to configure the fetchSize (how many records are returned from the database with each request); however, that depends on your database and its configuration. With most databases you can configure the fetch size and it's honored, but MySQL requires you to set the fetch size to Integer.MIN_VALUE in order to stream results. SQLite is another database with special requirements.
That being said, it is important to know that JdbcCursorItemReader is not thread safe (multiple threads would be reading from the same ResultSet).
I personally would advocate for tuning your query, but assuming the above conditions, you should be able to use the JdbcCursorItemReader just fine.
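A minimal sketch of setting fetchSize on a JdbcCursorItemReader (the reader name, SQL, and fetch size are hypothetical placeholders; rows come back as plain column/value maps to keep the example self-contained):

import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.jdbc.core.ColumnMapRowMapper;

public class ReaderConfig {
    // Streams rows through a cursor, pulling 50,000 rows per database round trip.
    public JdbcCursorItemReader<Map<String, Object>> reader(DataSource dataSource) {
        return new JdbcCursorItemReaderBuilder<Map<String, Object>>()
                .name("customerReader")               // hypothetical name
                .dataSource(dataSource)
                .sql("SELECT id, name FROM customer") // hypothetical query
                .rowMapper(new ColumnMapRowMapper())  // each row becomes a column -> value map
                .fetchSize(50_000)                    // JDBC hint; MySQL needs Integer.MIN_VALUE to stream
                .build();
    }
}

Remember that a single reader instance must not be shared across threads, since all of them would read from the same ResultSet.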