Apache Druid throwing network error in console

When we try to query a datasource, the query runs for 5+ minutes and then throws a network error in the console. We are trying to fetch a huge result set, in the millions of rows. Is this a limitation in Druid where we can't fetch huge record sets? Other aggregated queries run fine and produce results.
SELECT * FROM "datasource"
WHERE "__time" >= TIMESTAMP '2021-06-21' and "__time" <= TIMESTAMP '2021-06-23' and consumerid=1234
Segment Granularity: DAY
Query Granularity: DAY
Segments Created: 736
Avg Segment Size: 462 MB
Total Datasource Size: 340.28 GB
Replicated Size: 680.55 GB
Secondary Partition: single_dim (consumerid)
Is there some way to overcome this issue?
I've also tried this via the API; after about 5 minutes 30 seconds it throws an error.
curl --location --request POST 'https://druiddev-druid.int.org/druid/v2/sql' --header 'Authorization: Basic Username/p' --header 'Content-Type: application/json' --data-raw '{
"query": "SELECT * FROM datasource WHERE consumerid=1234 and buydate='\''01/01/2021'\''",
"resultFormat" : "csv",
"batchSize":20480
}' > output.dat
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17.9M    0 17.9M    0   165  58699      0 --:--:--  0:05:20 --:--:-- 86008
curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
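One workaround worth trying (a sketch, not a confirmed fix: it assumes your Druid version supports SQL LIMIT/OFFSET, which arrived in 0.20, and that ordered paging is acceptable for your use case) is to pull the result set in bounded pages rather than streaming millions of rows through a single HTTP response:

# Fetch the result in pages; each request stays small enough to finish
# before broker/proxy timeouts hit. PAGE_SIZE is illustrative, not a default.
PAGE_SIZE=100000
OFFSET=0
> output.dat
while : ; do
  curl -s --location --request POST 'https://druiddev-druid.int.org/druid/v2/sql' \
    --header 'Authorization: Basic Username/p' \
    --header 'Content-Type: application/json' \
    --data-raw "{\"query\": \"SELECT * FROM datasource WHERE consumerid = 1234 ORDER BY __time LIMIT ${PAGE_SIZE} OFFSET ${OFFSET}\", \"resultFormat\": \"csv\"}" \
    > page.csv || break
  rows=$(wc -l < page.csv)
  cat page.csv >> output.dat
  # A short page means the end of the result set has been reached.
  [ "$rows" -lt "$PAGE_SIZE" ] && break
  OFFSET=$((OFFSET + PAGE_SIZE))
done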

Related

How to add result generated by pgbench on terminal to csv file

I'm trying to add pgbench results to a CSV file. The result I get is in the format given below:
pgbench -c 10 -j 2 -t 10000 example
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 50
query mode: simple
number of clients: 10
number of threads: 2
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
latency average: 4.176 ms
tps = 2394.718707 (including connections establishing)
tps = 2394.874350 (excluding connections establishing)
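Since pgbench only prints this human-readable summary, one option (a minimal sketch that assumes the exact output format above; the CSV column names are my own invention) is to scrape the fields with awk:

# Parse the pgbench summary into a one-row CSV with a header line.
pgbench -c 10 -j 2 -t 10000 example 2>/dev/null | awk '
  /scaling factor/         { sf  = $NF }
  /number of clients/      { cl  = $NF }
  /latency average/        { lat = $(NF-1) }
  /including connections/  { tps = $3 }
  END {
    print "clients,scaling_factor,latency_ms,tps"
    print cl "," sf "," lat "," tps
  }
' > pgbench_result.csv

For per-transaction data rather than the summary, pgbench's -l/--log option writes raw transaction logs you could aggregate instead.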

How does the JDBC sink connector insert values into a Postgres database

I'm using the JDBC sink connector to load data from a Kafka topic into a Postgres database.
Here is my configuration:
curl --location --request PUT 'http://localhost:8083/connectors/sink_1/config' \
--header 'Content-Type: application/json' \
--data-raw '{
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"connection.url":"jdbc:postgresql://localhost:5432/postgres",
"connection.user":"user",
"connection.password":"passwd",
"tasks.max" : "10",
"topics":"<topic_name_same_as_tablename>",
"insert.mode":"insert",
"key.converter":"org.apache.kafka.connect.converters.ByteArrayConverter",
"value.converter":"org.apache.kafka.connect.json.JsonConverter",
"quote.sql.identifiers":"never",
"errors.tolerance":"all",
"errors.deadletterqueue.topic.name":"failed_records",
"errors.deadletterqueue.topic.replication.factor":"1",
"errors.log.enable":"true"
}'
My table has 100k+ records, so I partitioned the topic into 10 partitions and set tasks.max to 10 to speed up the loading process, which was much faster compared to a single partition.
Can someone help me understand how the sink connector loads data into Postgres? Which form of insert statement will it use: approach-1 or approach-2? If it is approach-1, can we achieve approach-2, and if so, how?
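One way to answer this empirically (a sketch, assuming superuser access to the Postgres instance; the log path is an example and varies by installation) is to turn on statement logging and watch what the connector actually sends:

# Log every statement the server receives, then reload the config.
psql -U user -d postgres -c "ALTER SYSTEM SET log_statement = 'all';"
psql -U user -d postgres -c "SELECT pg_reload_conf();"

# Watch the inserts arrive while the connector runs (adjust the log path).
tail -f /var/lib/postgresql/data/log/postgresql-*.log | grep -i insert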

Kafka producer quota and timeout exceptions

I am trying to come up with a configuration that enforces a producer quota based on the average byte rate of a producer.
I did a test with a 3-node cluster. The topic, however, was created with 1 partition and a replication factor of 1, so that producer_byte_rate is measured against only one broker (the leader).
I set the producer_byte_rate to 20480 on client id test_producer_quota.
I used kafka-producer-perf-test to test out the throughput and throttle.
kafka-producer-perf-test --producer-props bootstrap.servers=SSL://kafka-broker1:6667 \
client.id=test_producer_quota \
--topic quota_test \
--producer.config /myfolder/client.properties \
--record-size 2048 --num-records 4000 --throughput -1
I expected the producer client to learn about the throttle and eventually smooth out the requests sent to the broker. Instead I noticed alternating throughput of 98 recs/sec and 21 recs/sec for a period of more than 30 seconds. During this time the average latency slowly kept increasing, and finally, when it hit 120000 ms, I started to see the timeout exception below:
org.apache.kafka.common.errors.TimeoutException : Expiring 7 records for quota_test-0: 120000 ms has passed since batch creation.
What is possibly causing this issue?
The producer hits the timeout when latency reaches 120 seconds (the default value of delivery.timeout.ms).
Why isn't the producer learning about the throttle and quota and slowing down or backing off?
What other producer configuration could help alleviate this timeout issue?
(2048 * 4000) / 20480 = 400 (sec)
This means that if your producer tries to send the 4000 records at full speed (which is the case, because you set throughput to -1), it will batch them and put them in the queue within maybe one or two seconds (depending on your CPU).
Then, given your quota setting (20480 bytes/sec), you can be sure that the broker won't 'complete' the processing of those 4000 records in under roughly 398-400 seconds.
The broker does not return an error when a client exceeds its quota, but instead attempts to slow the client down. The broker computes the amount of delay needed to bring a client under its quota and delays the response for that amount of time.
Since delivery.timeout.ms is at its default of 120 seconds, far less than the ~400 seconds the quota imposes, you then get this TimeoutException.
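One mitigation worth trying (a sketch built on the arithmetic above, not a confirmed fix) is to cap the test client at the quota rate, so records never wait in the accumulator longer than delivery.timeout.ms:

# 20480 bytes/sec quota / 2048-byte records = 10 records/sec.
# Capping --throughput at that rate keeps the producer under its quota,
# so the broker never has to throttle it. The run will still take
# ~400 s end to end; the quota makes that unavoidable for 4000 records.
kafka-producer-perf-test --topic quota_test \
  --producer.config /myfolder/client.properties \
  --producer-props bootstrap.servers=SSL://kafka-broker1:6667 \
    client.id=test_producer_quota \
  --record-size 2048 --num-records 4000 --throughput 10

Alternatively, raising delivery.timeout.ms above the ~400 s the transfer needs would let the throttled run complete without expiring batches.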

Should a topic created via kafka-topics automatically have associated subjects created?

I'm attempting to mimic 'confluent load' (which isn't recommended for production usage) to add the connectors which automatically creates the topic, subject, etc. that allows for ksql stream and table creation. I'm using curl to interact with the rest interface.
When kafka-topics is used to create topics, does this also create the associated subjects for "topicName-value", etc.?
$ curl -X GET http://localhost:8082/topics | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   199  100   199    0     0  14930      0 --:--:-- --:--:-- --:--:-- 15307
[
"Topic_OracleSource2"
]
curl -X GET http://localhost:8081/subjects | jq
[]
Nothing shows. However, performing a curl:
curl -X POST -H "Content-Type: application/vnd.kafka.avro.v2+json" -H "Accept: application/vnd.kafka.v2+json" --data '{"value_schema": "{\"type\": \"record\", \"name\": \"User\", \"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}", "records": [{"value": {"name": "testUser"}}]}' "http://localhost:8082/topics/avrotest"
creates the subject:
curl -X GET http://localhost:8081/subjects | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    18  100    18    0     0   2020      0 --:--:-- --:--:-- --:--:--  2250
[
"avrotest-value"
]
As far as I know, doing it this way isn't recommended, since topics are created on the fly rather than pre-created in a controlled environment.
The reason this question comes up is that the 'topicName-value/key' subject pair seems to be needed to create streams for the topic inside KSQL.
Without the subjects, I can see data coming across from the Avro-based connector I created, but I can't perform further transformations using KSQL streams and tables.
kafka-topics only interacts with Zookeeper and Kafka. It has no notion of the existence of a Schema Registry.
The process that creates the Avro schema/subject is the Avro serializer configured on the producer. If a Kafka Connect source is configured with the AvroConverter, it will register a schema itself upon getting data, so you should not need curl, assuming you are satisfied with the generated schema.
To my knowledge, there's no way to prevent KSQL from auto-registering a schema in the registry.
seems the subject 'topicName-value/key' pair is needed to create streams for the topic inside KSQL.
If you want to use Avro, yes. But no, it's not 'needed' for the other data formats KSQL supports.
can't further perform transformation using ksql stream and table.
You'll need to be more explicit about why that is. Are you getting errors?
When kafka-topics is used to create topics, does this also create the associated subjects for "topicName-value", etc.?
No, the subjects are not created automatically. (kafka-topics today doesn't even allow you to pass an Avro schema.)
Might be worth a feature request?
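In the meantime, if you need the subject to exist before any Avro data flows, you can register a schema manually against the Schema Registry REST API (a sketch; the host, port, and schema are placeholders to adapt):

# Register a schema under the subject KSQL expects for the topic's values.
# The convention is <topic>-value for values and <topic>-key for keys.
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}"}' \
  http://localhost:8081/subjects/Topic_OracleSource2-value/versions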

Performance testing in Kafka

Can someone please explain how performance is tested in Kafka using,
bin/kafka-consumer-perf-test.sh --topic benchmark-3-3-none \
--zookeeper kafka-zk-1:2181,kafka-zk-2:2181,kafka-zk-3:2181 \
--messages 15000000 \
--threads 1
and
bin/kafka-producer-perf-test.sh --topic benchmark-1-1-none \
--num-records 15000000 \
--record-size 100 \
--throughput 15000000 \
--producer-props \
acks=1 \
bootstrap.servers=kafka-kf-1:9092,kafka-kf-2:9092,kafka-kf-3:9092 \
buffer.memory=67108864 \
compression.type=none \
batch.size=8196
I am not clear on what the parameters are and what output should be obtained. How do I check performance and acknowledgement if I send 1000 messages to Kafka topics?
When we run these we get the following:
Producer
| start.time              | end.time                | compression | message.size | batch.size | total.data.sent.in.MB | MB.sec | total.data.sent.in.nMsg | nMsg.sec |
| 2016-02-03 21:38:28:094 | 2016-02-03 21:38:28:449 | 0           | 100          | 200        | 0.01                  | 0.0269 | 100                     | 281.6901 |
Where,
• total.data.sent.in.MB shows the total data sent to the cluster, in MB.
• MB.sec indicates how much data was transferred per second, in MB (throughput on size).
• total.data.sent.in.nMsg shows the count of total messages sent during this test.
• And last, nMsg.sec shows how many messages were sent per second (throughput on message count); a quick cross-check of these two rates follows below.
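As a sanity check (a sketch using only the sample row above: 100 messages of 100 bytes sent over 0.355 s, from 21:38:28:094 to 21:38:28:449), both throughput columns can be reproduced by hand:

# Recompute MB.sec and nMsg.sec from the raw numbers in the sample producer row.
awk 'BEGIN {
  bytes = 100 * 100          # 100 messages x 100 bytes each
  secs  = 0.355              # 21:38:28:449 minus 21:38:28:094
  printf "MB.sec   = %.4f\n", bytes / 1048576 / secs   # ~0.0269, matches
  printf "nMsg.sec = %.4f\n", 100 / secs               # ~281.6901, matches
}'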
Consumer
| start.time              | end.time                | fetch.size | data.consumed.in.MB | MB.sec | data.consumed.in.nMsg | nMsg.sec   |
| 2016-02-04 11:29:41:806 | 2016-02-04 11:29:46:854 | 1048576    | 0.0954              | 1.9869 | 1001                  | 20854.1667 |
where,
• start.time and end.time show when the test started and completed.
• fetch.size shows the amount of data to fetch in a single request.
• data.consumed.in.MB shows the size of all messages consumed.
• MB.sec indicates how much data was transferred per second, in MB (throughput on size).
• data.consumed.in.nMsg shows the count of total messages consumed during this test.
• And last, nMsg.sec shows how many messages were consumed per second (throughput on message count).
I would rather suggest going for a specialized performance testing tool like Apache JMeter with the Pepper-Box - Kafka Load Generator in order to load test your Kafka installation.
This way you will be able to conduct the load test with full control over threads, ramp-up time, message size and content, etc. You will also be able to generate an HTML Reporting Dashboard with tables and charts of interesting metrics.
See the Apache Kafka - How to Load Test with JMeter article for more details if needed.
If anyone runs into this question, please note that kafka-producer-perf-test.sh produces different output as of Kafka v2.12-3.3.2.
For example, to send 1000 messages to a Kafka topic, use the command line parameter --num-records 1000 (and --topic <topic_name>, of course). The generated output should resemble the following, including the number of messages sent in steps, speed in messages per second and MB per second, and average latencies (I chose to send 1M messages):
323221 records sent, 64644.2 records/sec (63.13 MB/sec), 7.5 ms avg latency, 398.0 ms max latency.
381338 records sent, 76267.6 records/sec (74.48 MB/sec), 1.2 ms avg latency, 27.0 ms max latency.
1000000 records sent, 70244.450688 records/sec (68.60 MB/sec), 15.21 ms avg latency, 475.00 ms max latency, 1 ms 50th, 96 ms 95th, 353 ms 99th, 457 ms 99.9th.
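Putting that together as a complete invocation (a sketch reusing the bootstrap servers from the question; adjust to your cluster):

# Send 1000 messages of 100 bytes as fast as possible and report
# throughput plus latency percentiles when the run finishes.
bin/kafka-producer-perf-test.sh --topic benchmark-1-1-none \
  --num-records 1000 \
  --record-size 100 \
  --throughput -1 \
  --producer-props acks=1 \
    bootstrap.servers=kafka-kf-1:9092,kafka-kf-2:9092,kafka-kf-3:9092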