SQL request never ends - Scala

I am trying to get some data from my Cassandra database using Spark SQL in a Scala program, but the request never completes.
My Spark configuration looks like this:
object ExternalConf {
  var cassandraHost: String = "cassandra_cassandra-001,cassandra_cassandra-002,cassandra_cassandra-003,cassandra_cassandra-004"
  var masterSpark: String = "local[*]"
}

object Spark {
  val session: SparkSession = SparkSession
    .builder()
    .appName("KStreaming")
    // .config("spark.cassandra.connection.host", ExternalConf.cassandraHost) //default value or args
    .config("spark.cassandra.connection.host", "cassandra_node") //preprod
    .config("spark.cassandra.auth.username", "cassandra")
    .config("spark.cassandra.auth.password", "cassandra")
    .config("output.batch.grouping.buffer.size", "50")
    .config("output.batch.size.bytes", "102400")
    .config("spark.driver.maxResultSize", "4g")
    .config("spark.sql.broadcastTimeout", "1800")
    .master(ExternalConf.masterSpark)
    .getOrCreate()

  session.sql("CREATE OR REPLACE TEMPORARY VIEW dbv2_product_categories USING org.apache.spark.sql.cassandra OPTIONS (table 'dbv2_product_categories', keyspace 'preprod', pushdown 'true')")

  import session.implicits._
}
And the code that makes a SELECT:
def extractData(data: RDD[ConsumerRecord[String, String]]) = {
  import Spark.session.implicits._

  data
    .foreach(message => {
      var persistedProductCategory: DataFrame = Spark.session.sql(
        "SELECT * FROM dbv2_product_categories WHERE account_id = '" + accountId + "' AND name = '" + shopifyProduct.product_type + "'")
    })
}
The request never ends. Here is my stderr (I stripped the beginning of it to make it shorter):
22/08/29 09:50:04 INFO ConsumerConfig: ConsumerConfig values:
metric.reporters = []
metadata.max.age.ms = 300000
partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor]
reconnect.backoff.ms = 50
sasl.kerberos.ticket.renew.window.factor = 0.8
max.partition.fetch.bytes = 1048576
bootstrap.servers = [kafka:9092]
ssl.keystore.type = JKS
enable.auto.commit = false
sasl.mechanism = GSSAPI
interceptor.classes = null
exclude.internal.topics = true
ssl.truststore.password = null
client.id = consumer-4
ssl.endpoint.identification.algorithm = null
max.poll.records = 2147483647
check.crcs = true
request.timeout.ms = 40000
heartbeat.interval.ms = 3000
auto.commit.interval.ms = 5000
receive.buffer.bytes = 65536
ssl.truststore.type = JKS
ssl.truststore.location = null
ssl.keystore.password = null
fetch.min.bytes = 1
send.buffer.bytes = 131072
value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
group.id = spark-executor--183698333
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
ssl.trustmanager.algorithm = PKIX
ssl.key.password = null
fetch.max.wait.ms = 500
sasl.kerberos.min.time.before.relogin = 60000
connections.max.idle.ms = 540000
session.timeout.ms = 30000
metrics.num.samples = 2
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
ssl.protocol = TLS
ssl.provider = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.keystore.location = null
ssl.cipher.suites = null
security.protocol = PLAINTEXT
ssl.keymanager.algorithm = SunX509
metrics.sample.window.ms = 30000
auto.offset.reset = none
22/08/29 09:50:04 WARN ConsumerConfig: The configuration schema.registry.url = http://schema-registry:8081 was supplied but isn't a known config.
22/08/29 09:50:04 INFO AppInfoParser: Kafka version : 0.10.0.1
22/08/29 09:50:04 INFO AppInfoParser: Kafka commitId : a7a17cdec9eaa6c5
22/08/29 09:50:04 INFO CachedKafkaConsumer: Initial fetch for spark-executor--183698333 62c44e48be54d0002900bd62_products 0 0
22/08/29 09:50:04 INFO AbstractCoordinator: Discovered coordinator kafka:9092 (id: 2147483646 rack: null) for group spark-executor--183698333.
22/08/29 09:50:04 WARN SparkContext: Using an existing SparkContext; some configuration may not take effect.
22/08/29 09:50:04 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark/work/driver-20220829094956-7907/spark-warehouse').
22/08/29 09:50:04 INFO SharedState: Warehouse path is 'file:/opt/spark/work/driver-20220829094956-7907/spark-warehouse'.
22/08/29 09:50:05 INFO JobScheduler: Added jobs for time 1661766605000 ms
22/08/29 09:50:05 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
22/08/29 09:50:05 INFO SparkSqlParser: Parsing command: CREATE OR REPLACE TEMPORARY VIEW dbv2_product_categories USING org.apache.spark.sql.cassandra OPTIONS (table 'dbv2_product_categories', keyspace 'kiliba', pushdown 'true')
22/08/29 09:50:07 INFO ClockFactory: Using native clock to generate timestamps.
22/08/29 09:50:07 INFO NettyUtil: Found Netty's native epoll transport in the classpath, using it
22/08/29 09:50:07 INFO Cluster: New Cassandra host cassandra_node/10.0.0.177:9042 added
22/08/29 09:50:07 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
22/08/29 09:50:08 INFO SparkSqlParser: Parsing command: SELECT * FROM dbv2_product_categories WHERE account_id = 'shopifytest_62c44e48be54d0002900bd62' AND name = ''
22/08/29 09:50:08 INFO CassandraSourceRelation: Input Predicates: [IsNotNull(account_id), IsNotNull(name), EqualTo(account_id,shopifytest_62c44e48be54d0002900bd62), EqualTo(name,)]
22/08/29 09:50:08 INFO CassandraSourceRelation: Input Predicates: [IsNotNull(account_id), IsNotNull(name), EqualTo(account_id,shopifytest_62c44e48be54d0002900bd62), EqualTo(name,)]
22/08/29 09:50:08 INFO CodeGenerator: Code generated in 175.144922 ms
22/08/29 09:50:08 INFO CodeGenerator: Code generated in 16.62666 ms
22/08/29 09:50:08 INFO SparkContext: Starting job: count at Kstreaming.scala:361
22/08/29 09:50:08 INFO DAGScheduler: Registering RDD 7 (count at Kstreaming.scala:361)
22/08/29 09:50:08 INFO DAGScheduler: Got job 1 (count at Kstreaming.scala:361) with 1 output partitions
22/08/29 09:50:08 INFO DAGScheduler: Final stage: ResultStage 2 (count at Kstreaming.scala:361)
22/08/29 09:50:08 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
22/08/29 09:50:08 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
22/08/29 09:50:09 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[7] at count at Kstreaming.scala:361), which has no missing parents
22/08/29 09:50:09 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 15.8 KB, free 5.2 GB)
22/08/29 09:50:09 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 7.8 KB, free 5.2 GB)
22/08/29 09:50:09 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.0.0.229:43123 (size: 7.8 KB, free: 5.2 GB)
22/08/29 09:50:09 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
22/08/29 09:50:09 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[7] at count at Kstreaming.scala:361) (first 15 tasks are for partitions Vector(0))
22/08/29 09:50:09 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
22/08/29 09:50:10 INFO JobScheduler: Added jobs for time 1661766610000 ms
22/08/29 09:50:15 INFO JobScheduler: Added jobs for time 1661766615000 ms
22/08/29 09:50:16 INFO CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
22/08/29 09:50:20 INFO JobScheduler: Added jobs for time 1661766620000 ms
22/08/29 09:50:25 INFO JobScheduler: Added jobs for time 1661766625000 ms
22/08/29 09:50:30 INFO JobScheduler: Added jobs for time 1661766630000 ms
22/08/29 09:50:35 INFO JobScheduler: Added jobs for time 1661766635000 ms
22/08/29 09:50:40 INFO JobScheduler: Added jobs for time 1661766640000 ms
22/08/29 09:50:45 INFO JobScheduler: Added jobs for time 1661766645000 ms
22/08/29 09:50:50 INFO JobScheduler: Added jobs for time 1661766650000 ms
22/08/29 09:50:55 INFO JobScheduler: Added jobs for time 1661766655000 ms
22/08/29 09:51:00 INFO JobScheduler: Added jobs for time 1661766660000 ms
22/08/29 09:51:05 INFO JobScheduler: Added jobs for time 1661766665000 ms
22/08/29 09:51:10 INFO JobScheduler: Added jobs for time 1661766670000 ms
22/08/29 09:51:15 INFO JobScheduler: Added jobs for time 1661766675000 ms
22/08/29 09:51:20 INFO JobScheduler: Added jobs for time 1661766680000 ms
22/08/29 09:51:25 INFO JobScheduler: Added jobs for time 1661766685000 ms
22/08/29 09:51:30 INFO JobScheduler: Added jobs for time 1661766690000 ms
When I execute the SQL command directly in cqlsh, it works and I get a result almost instantly:
cqlsh:kiliba> SELECT * FROM dbv2_product_categories WHERE account_id = 'shopifytest_62c44e48be54d0002900bd62' AND name = '';
account_id | name | id | breadcrumb | parent_id
------------+------+----+------------+-----------
(0 rows)
How come my program fails to give me a response and seems to hang forever?
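One observation, offered as a possible culprit rather than a confirmed diagnosis: Spark.session.sql is called inside data.foreach, which runs in executor code, and a SparkSession can only be used to run queries from the driver, not from inside an RDD closure. Below is a minimal sketch of keeping the SQL on the driver, assuming the batch is small enough to collect and reusing the Spark object from the question; the accountId and product type extractions are left as placeholders, exactly as they are undefined in the snippet above.

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

def extractData(data: RDD[ConsumerRecord[String, String]]): Unit = {
  // Map to plain strings so the batch can be collected, then run the
  // Spark SQL lookup from driver-side code only, never inside foreach on the RDD.
  data.map(record => record.value).collect().foreach { value =>
    val accountId = "..."      // placeholder: parsed from value in the real job
    val productType = "..."    // placeholder: parsed from value in the real job

    val persistedProductCategory: DataFrame = Spark.session.sql(
      "SELECT * FROM dbv2_product_categories WHERE account_id = '" + accountId +
        "' AND name = '" + productType + "'")

    persistedProductCategory.show()
  }
}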

Related

Kafka Avro Consumer (kafka-avro-console-consumer) Logging Level

Is there any way to turn off the "INFO" logging level in /usr/bin/kafka-avro-console-consumer?
Or, is there an alternate way to view data that is in an avro schema?
Right now I use the program (via Docker) and the output contains a large number of log messages (which I do not want) before it emits the data on the topic that I do want:
[2022-05-10 14:08:33,090] INFO Registered kafka:type=kafka.Log4jController MBean (kafka.utils.Log4jControllerRegistration$)
[2022-05-10 14:08:33,643] INFO ConsumerConfig values:
allow.auto.create.topics = true
auto.commit.interval.ms = 5000
auto.offset.reset = earliest
bootstrap.servers = [kafka:9092]
check.crcs = true
client.dns.lookup = use_all_dns_ips
client.id = consumer-console-consumer-25832-1
client.rack =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = console-consumer-25832
group.instance.id = null
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
internal.throw.on.fetch.stable.offset.unsupported = false
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 500
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
security.providers = null
send.buffer.bytes = 131072
session.timeout.ms = 10000
socket.connection.setup.timeout.max.ms = 30000
socket.connection.setup.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.3]
ssl.endpoint.identification.algorithm = https
ssl.engine.factory.class = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.certificate.chain = null
ssl.keystore.key = null
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLSv1.3
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.certificates = null
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
(org.apache.kafka.clients.consumer.ConsumerConfig)
[2022-05-10 14:08:33,809] INFO Kafka version: 6.2.0-ce (org.apache.kafka.common.utils.AppInfoParser)
[2022-05-10 14:08:33,809] INFO Kafka commitId: 5c753752ae1445a1 (org.apache.kafka.common.utils.AppInfoParser)
[2022-05-10 14:08:33,809] INFO Kafka startTimeMs: 1652191713801 (org.apache.kafka.common.utils.AppInfoParser)
[2022-05-10 14:08:33,814] INFO [Consumer clientId=consumer-console-consumer-25832-1, groupId=console-consumer-25832] Subscribed to topic(s): dbserver1.public.x_account (org.apache.kafka.clients.consumer.KafkaConsumer)
[2022-05-10 14:08:34,471] INFO [Consumer clientId=consumer-console-consumer-25832-1, groupId=console-consumer-25832] Cluster ID: 9-BvG22VQrimBsWAceE02Q (org.apache.kafka.clients.Metadata)
[2022-05-10 14:08:34,473] INFO [Consumer clientId=consumer-console-consumer-25832-1, groupId=console-consumer-25832] Discovered group coordinator kafka:9092 (id: 2147483646 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2022-05-10 14:08:34,477] INFO [Consumer clientId=consumer-console-consumer-25832-1, groupId=console-consumer-25832] (Re-)joining group (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2022-05-10 14:08:34,501] INFO [Consumer clientId=consumer-console-consumer-25832-1, groupId=console-consumer-25832] (Re-)joining group (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2022-05-10 14:08:34,507] INFO [Consumer clientId=consumer-console-consumer-25832-1, groupId=console-consumer-25832] Successfully joined group with generation Generation{generationId=1, memberId='consumer-console-consumer-25832-1-d482e04b-8ed0-4d68-b533-a010dde3c99a', protocol='range'} (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2022-05-10 14:08:34,511] INFO [Consumer clientId=consumer-console-consumer-25832-1, groupId=console-consumer-25832] Finished assignment for group at generation 1: {consumer-console-consumer-25832-1-d482e04b-8ed0-4d68-b533-a010dde3c99a=Assignment(partitions=[dbserver1.public.x_account-0])} (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2022-05-10 14:08:34,524] INFO [Consumer clientId=consumer-console-consumer-25832-1, groupId=console-consumer-25832] Successfully synced group in generation Generation{generationId=1, memberId='consumer-console-consumer-25832-1-d482e04b-8ed0-4d68-b533-a010dde3c99a', protocol='range'} (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2022-05-10 14:08:34,525] INFO [Consumer clientId=consumer-console-consumer-25832-1, groupId=console-consumer-25832] Notifying assignor about the new Assignment(partitions=[dbserver1.public.x_account-0]) (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2022-05-10 14:08:34,529] INFO [Consumer clientId=consumer-console-consumer-25832-1, groupId=console-consumer-25832] Adding newly assigned partitions: dbserver1.public.x_account-0 (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2022-05-10 14:08:34,540] INFO [Consumer clientId=consumer-console-consumer-25832-1, groupId=console-consumer-25832] Found no committed offset for partition dbserver1.public.x_account-0 (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
The command that generates the above output is provided below:
docker exec -i [schema-registry-container] /usr/bin/kafka-avro-console-consumer \
--bootstrap-server kafka:9092 \
--topic [some-topic] \
--from-beginning \
--property schema.registry.url="http://schema-registry:8081"
For the Confluent Schema Registry image, you can add this env var to completely disable any consumer package logs:
SCHEMA_REGISTRY_LOG4J_LOGGERS="org.apache.kafka.clients.consumer=OFF"
Comma-separate more packages to further configure Log4j.
As for an "alternate way to view data that is in an Avro schema": kcat, or tools like AKHQ, Conduktor, etc.

Kafka consumer does not poll records intermittently

I have written a simple utility in Scala to read Kafka messages as byte arrays.
The utility works on one machine but not on the other. Both are on the same OS (CentOS 7) and use the same Kafka server (which is on another machine altogether).
However, Kafka Tool (www.kafkatool.com) works on the machine where the utility does not, so it is not likely an accessibility issue.
The following is the essence of the consumer code:
import java.io.{BufferedOutputStream, FileOutputStream}
import java.util.Properties

import scala.collection.JavaConverters._

import org.apache.kafka.clients.consumer.KafkaConsumer

val outputFile = "output.txt"
val topic = "test_topic"
val server = "localhost:9092"
val id = "record-tool"

val props = new Properties()
props.put("bootstrap.servers", server)
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("auto.offset.reset", "earliest")
props.put("enable.auto.commit", "false")
props.put("max.partition.fetch.bytes", "104857600")
props.put("group.id", id)

val bos = new BufferedOutputStream(new FileOutputStream(outputFile))
val consumer = new KafkaConsumer[String, Array[Byte]](props)
consumer.subscribe(Seq(topic).asJava)

Stream.continually(consumer.poll(5000).asScala.toList).takeWhile(_.nonEmpty).flatten.foreach(c => bos.write(c.value))

consumer.close()
bos.close()
I don't see any errors in the logs either; the following is the debug log:
[root@vm util]# bin/record-tool consume --server=kafka-server:9092 --topic=test_topic --asBin --debug
16:44:02.548 [main] INFO org.apache.kafka.clients.consumer.ConsumerConfig - ConsumerConfig values:
metric.reporters = []
metadata.max.age.ms = 300000
partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor]
reconnect.backoff.ms = 50
sasl.kerberos.ticket.renew.window.factor = 0.8
max.partition.fetch.bytes = 104857600
bootstrap.servers = [kafka-server:9092]
ssl.keystore.type = JKS
enable.auto.commit = false
sasl.mechanism = GSSAPI
interceptor.classes = null
exclude.internal.topics = true
ssl.truststore.password = null
client.id =
ssl.endpoint.identification.algorithm = null
max.poll.records = 2147483647
check.crcs = true
request.timeout.ms = 40000
heartbeat.interval.ms = 3000
auto.commit.interval.ms = 5000
receive.buffer.bytes = 65536
ssl.truststore.type = JKS
ssl.truststore.location = null
ssl.keystore.password = null
fetch.min.bytes = 1
send.buffer.bytes = 131072
value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
group.id = record-tool
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
ssl.trustmanager.algorithm = PKIX
ssl.key.password = null
fetch.max.wait.ms = 500
sasl.kerberos.min.time.before.relogin = 60000
connections.max.idle.ms = 540000
session.timeout.ms = 30000
metrics.num.samples = 2
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
ssl.protocol = TLS
ssl.provider = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.keystore.location = null
ssl.cipher.suites = null
security.protocol = PLAINTEXT
ssl.keymanager.algorithm = SunX509
metrics.sample.window.ms = 30000
auto.offset.reset = earliest
16:44:02.550 [main] DEBUG org.apache.kafka.clients.consumer.KafkaConsumer - Starting the Kafka consumer
16:44:02.621 [main] DEBUG org.apache.kafka.clients.Metadata - Updated cluster metadata version 1 to Cluster(nodes = [kafka-server:9092 (id: -1 rack: null)], partitions = [])
16:44:02.632 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name connections-closed:
16:44:02.636 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name connections-created:
16:44:02.637 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name bytes-sent-received:
16:44:02.637 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name bytes-sent:
16:44:02.638 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name bytes-received:
16:44:02.638 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name select-time:
16:44:02.639 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name io-time:
16:44:02.649 [main] INFO org.apache.kafka.clients.consumer.ConsumerConfig - ConsumerConfig values:
metric.reporters = []
metadata.max.age.ms = 300000
partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor]
reconnect.backoff.ms = 50
sasl.kerberos.ticket.renew.window.factor = 0.8
max.partition.fetch.bytes = 104857600
bootstrap.servers = [kafka-server:9092]
ssl.keystore.type = JKS
enable.auto.commit = false
sasl.mechanism = GSSAPI
interceptor.classes = null
exclude.internal.topics = true
ssl.truststore.password = null
client.id = consumer-1
ssl.endpoint.identification.algorithm = null
max.poll.records = 2147483647
check.crcs = true
request.timeout.ms = 40000
heartbeat.interval.ms = 3000
auto.commit.interval.ms = 5000
receive.buffer.bytes = 65536
ssl.truststore.type = JKS
ssl.truststore.location = null
ssl.keystore.password = null
fetch.min.bytes = 1
send.buffer.bytes = 131072
value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
group.id = record-tool
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
ssl.trustmanager.algorithm = PKIX
ssl.key.password = null
fetch.max.wait.ms = 500
sasl.kerberos.min.time.before.relogin = 60000
connections.max.idle.ms = 540000
session.timeout.ms = 30000
metrics.num.samples = 2
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
ssl.protocol = TLS
ssl.provider = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.keystore.location = null
ssl.cipher.suites = null
security.protocol = PLAINTEXT
ssl.keymanager.algorithm = SunX509
metrics.sample.window.ms = 30000
auto.offset.reset = earliest
16:44:02.657 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name heartbeat-latency
16:44:02.657 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name join-latency
16:44:02.657 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name sync-latency
16:44:02.659 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name commit-latency
16:44:02.663 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name bytes-fetched
16:44:02.664 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name records-fetched
16:44:02.664 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name fetch-latency
16:44:02.664 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name records-lag
16:44:02.664 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name fetch-throttle-time
16:44:02.666 [main] INFO org.apache.kafka.common.utils.AppInfoParser - Kafka version : 0.10.0.1
16:44:02.666 [main] INFO org.apache.kafka.common.utils.AppInfoParser - Kafka commitId : a7a17cdec9eaa6c5
16:44:02.668 [main] DEBUG org.apache.kafka.clients.consumer.KafkaConsumer - Kafka consumer created
16:44:02.680 [main] DEBUG org.apache.kafka.clients.consumer.KafkaConsumer - Subscribed to topic(s): test_topic
16:44:02.681 [main] DEBUG org.apache.kafka.clients.consumer.internals.AbstractCoordinator - Sending coordinator request for group record-tool to broker kafka-server:9092 (id: -1 rack: null)
16:44:02.695 [main] DEBUG org.apache.kafka.clients.NetworkClient - Initiating connection to node -1 at kafka-server:9092.
16:44:02.816 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name node--1.bytes-sent
16:44:02.817 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name node--1.bytes-received
16:44:02.818 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name node--1.latency
16:44:02.820 [main] DEBUG org.apache.kafka.clients.NetworkClient - Completed connection to node -1
16:44:02.902 [main] DEBUG org.apache.kafka.clients.NetworkClient - Sending metadata request {topics=[test_topic]} to node -1
16:44:02.981 [main] DEBUG org.apache.kafka.clients.Metadata - Updated cluster metadata version 2 to Cluster(nodes = [kafka-server.mydomain.com:9092 (id: 0 rack: null)], partitions = [Partition(topic = test_topic, partition = 0, leader = 0, replicas = [0,], isr = [0,]])
16:44:02.982 [main] DEBUG org.apache.kafka.clients.consumer.internals.AbstractCoordinator - Received group coordinator response ClientResponse(receivedTimeMs=1583225042982, disconnected=false, request=ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler#434a63ab, request=RequestSend(header={api_key=10,api_version=0,correlation_id=0,client_id=consumer-1}, body={group_id=record-tool}), createdTimeMs=1583225042692, sendTimeMs=1583225042904), responseBody={error_code=0,coordinator={node_id=0,host=kafka-server.mydomain.com,port=9092}})
16:44:02.983 [main] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator - Discovered coordinator kafka-server.mydomain.com:9092 (id: 2147483647 rack: null) for group record-tool.
16:44:02.983 [main] DEBUG org.apache.kafka.clients.NetworkClient - Initiating connection to node 2147483647 at kafka-server.mydomain.com:9092.
16:44:02.986 [main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Revoking previously assigned partitions [] for group record-tool
16:44:02.986 [main] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator - (Re-)joining group record-tool
16:44:02.988 [main] DEBUG org.apache.kafka.clients.consumer.internals.AbstractCoordinator - Sending JoinGroup ({group_id=record-tool,session_timeout=30000,member_id=,protocol_type=consumer,group_protocols=[{protocol_name=range,protocol_metadata=java.nio.HeapByteBuffer[pos=0 lim=25 cap=25]}]}) to coordinator kafka-server.mydomain.com:9092 (id: 2147483647 rack: null)
16:44:03.051 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name node-2147483647.bytes-sent
16:44:03.051 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name node-2147483647.bytes-received
16:44:03.052 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name node-2147483647.latency
16:44:03.052 [main] DEBUG org.apache.kafka.clients.NetworkClient - Completed connection to node 2147483647
16:44:03.123 [main] DEBUG org.apache.kafka.clients.consumer.internals.AbstractCoordinator - Received successful join group response for group record-tool: {error_code=0,generation_id=3,group_protocol=range,leader_id=consumer-1-f82633ab-a06e-4474-8ddb-1ec096d6c7f2,member_id=consumer-1-f82633ab-a06e-4474-8ddb-1ec096d6c7f2,members=[{member_id=consumer-1-f82633ab-a06e-4474-8ddb-1ec096d6c7f2,member_metadata=java.nio.HeapByteBuffer[pos=0 lim=25 cap=25]}]}
16:44:03.123 [main] DEBUG org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Performing assignment for group record-tool using strategy range with subscriptions {consumer-1-f82633ab-a06e-4474-8ddb-1ec096d6c7f2=Subscription(topics=[test_topic])}
16:44:03.124 [main] DEBUG org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Finished assignment for group record-tool: {consumer-1-f82633ab-a06e-4474-8ddb-1ec096d6c7f2=Assignment(partitions=[test_topic-0])}
16:44:03.124 [main] DEBUG org.apache.kafka.clients.consumer.internals.AbstractCoordinator - Sending leader SyncGroup for group record-tool to coordinator kafka-server.mydomain.com:9092 (id: 2147483647 rack: null): {group_id=record-tool,generation_id=3,member_id=consumer-1-f82633ab-a06e-4474-8ddb-1ec096d6c7f2,group_assignment=[{member_id=consumer-1-f82633ab-a06e-4474-8ddb-1ec096d6c7f2,member_assignment=java.nio.HeapByteBuffer[pos=0 lim=33 cap=33]}]}
16:44:03.198 [main] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator - Successfully joined group record-tool with generation 3
16:44:03.199 [main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Setting newly assigned partitions [test_topic-0] for group record-tool
16:44:03.200 [main] DEBUG org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Group record-tool fetching committed offsets for partitions: [test_topic-0]
16:44:03.268 [main] DEBUG org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Group record-tool has no committed offset for partition test_topic-0
16:44:03.269 [main] DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Resetting offset for partition test_topic-0 to earliest offset.
16:44:03.270 [main] DEBUG org.apache.kafka.clients.NetworkClient - Initiating connection to node 0 at kafka-server.mydomain.com:9092.
16:44:03.336 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name node-0.bytes-sent
16:44:03.337 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name node-0.bytes-received
16:44:03.337 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name node-0.latency
16:44:03.338 [main] DEBUG org.apache.kafka.clients.NetworkClient - Completed connection to node 0
16:44:03.407 [main] DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Fetched offset 0 for partition test_topic-0
16:44:06.288 [main] DEBUG org.apache.kafka.clients.consumer.internals.AbstractCoordinator - Received successful heartbeat response for group record-tool
16:44:07.702 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Removed sensor with name connections-closed:
16:44:07.702 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Removed sensor with name connections-created:
16:44:07.702 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Removed sensor with name bytes-sent-received:
16:44:07.702 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Removed sensor with name bytes-sent:
16:44:07.703 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Removed sensor with name bytes-received:
16:44:07.703 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Removed sensor with name select-time:
16:44:07.704 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Removed sensor with name io-time:
16:44:07.704 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Removed sensor with name node--1.bytes-sent
16:44:07.705 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Removed sensor with name node--1.bytes-received
16:44:07.705 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Removed sensor with name node--1.latency
16:44:07.705 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Removed sensor with name node-2147483647.bytes-sent
16:44:07.706 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Removed sensor with name node-2147483647.bytes-received
16:44:07.706 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Removed sensor with name node-2147483647.latency
16:44:07.706 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Removed sensor with name node-0.bytes-sent
16:44:07.707 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Removed sensor with name node-0.bytes-received
16:44:07.707 [main] DEBUG org.apache.kafka.common.metrics.Metrics - Removed sensor with name node-0.latency
16:44:07.707 [main] DEBUG org.apache.kafka.clients.consumer.KafkaConsumer - The Kafka consumer has closed.
What I noticed is that inside takeWhile(_.nonEmpty), the list is empty.
Is there any mistake in the code? Thanks.
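One thing worth noting, as an observation rather than a confirmed fix: Stream.continually(...).takeWhile(_.nonEmpty) stops at the very first empty poll, and the first poll(5000) can legitimately return no records while the consumer is still joining the group and resetting offsets, which would match the intermittent behaviour. Here is a sketch of a loop that tolerates a few consecutive empty polls before stopping, reusing consumer, bos, and the JavaConverters import from the snippet above; the maxEmptyPolls threshold is illustrative, not from the original code.

// Keep polling until several consecutive polls return nothing.
val maxEmptyPolls = 3                       // illustrative threshold, tune as needed
var consecutiveEmptyPolls = 0
while (consecutiveEmptyPolls < maxEmptyPolls) {
  val records = consumer.poll(5000).asScala.toList
  if (records.isEmpty) {
    consecutiveEmptyPolls += 1
  } else {
    consecutiveEmptyPolls = 0
    records.foreach(c => bos.write(c.value))
  }
}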

Flink application sink KafkaProducer is throwing java heap space error (outofmemory)

I have created a Flink app which takes a DataStream of strings and sinks it to Kafka. The DataStream is built from a simple collection of strings:
List<String> listOfStrings = new ArrayList<>();
listOfStrings.add("testkafka1");
listOfStrings.add("testkafka2");
listOfStrings.add("testkafka3");
listOfStrings.add("testkafka4");
DataStream<String> testStringStream = env.fromCollection(listOfStrings);
Flink runs on Kubernetes with parallelism 1 and one task manager. As soon as the Flink job starts, it fails with the following error:
ERROR org.apache.kafka.common.utils.KafkaThread - Uncaught exception in kafka-producer-network-thread | producer-1:
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:97)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:75)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:203)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:167)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:381)
at org.apache.kafka.common.network.Selector.poll(Selector.java:326)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:433)
at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:71)
at org.apache.kafka.clients.producer.internals.Sender.awaitLeastLoadedNodeReady(Sender.java:409)
at org.apache.kafka.clients.producer.internals.Sender.maybeSendTransactionalRequest(Sender.java:337)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:204)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:162)
at java.lang.Thread.run(Thread.java:748)
The task manager config I have is (taken from the task manager logs):
Starting Task Manager
config file:
jobmanager.rpc.address: component-app-adb71002-tm-5c6f4d58bd-rtblz
jobmanager.rpc.port: 6123
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
taskmanager.numberOfTaskSlots: 2
parallelism.default: 1
jobmanager.execution.failover-strategy: region
blob.server.port: 6124
query.server.port: 6125
blob.server.port: 6125
fs.s3a.aws.credentials.provider: org.apache.flink.fs.s3base.shaded.com.amazonaws.auth.DefaultAWSCredentialsProviderChain
jobmanager.heap.size: 524288k
jobmanager.rpc.port: 6123
jobmanager.web.port: 8081
metrics.internal.query-service.port: 50101
metrics.reporter.dghttp.apikey: f52362263f032f2ebc3622cafc0171cd
metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
metrics.reporter.dghttp.tags: componentingestion,dev
query.server.port: 6124
taskmanager.heap.size: 1048576k
taskmanager.numberOfTaskSlots: 1
web.upload.dir: /opt/flink
jobmanager.rpc.address: component-app-adb71002
taskmanager.host: 10.42.6.6
Starting taskexecutor as a console application on host component-app-adb71002-tm-5c6f4d58bd-rtblz.
2020-02-11 15:19:20,519 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --------------------------------------------------------------------------------
2020-02-11 15:19:20,520 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Starting TaskManager (Version: 1.9.2, Rev:c9d2c90, Date:24.01.2020 # 08:44:30 CST)
2020-02-11 15:19:20,520 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - OS current user: flink
2020-02-11 15:19:20,520 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Current Hadoop/Kerberos user: <no hadoop dependency found>
2020-02-11 15:19:20,520 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.242-b08
2020-02-11 15:19:20,521 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Maximum heap size: 922 MiBytes
2020-02-11 15:19:20,521 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JAVA_HOME: /usr/local/openjdk-8
2020-02-11 15:19:20,521 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - No Hadoop Dependency available
2020-02-11 15:19:20,521 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM Options:
2020-02-11 15:19:20,521 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:+UseG1GC
2020-02-11 15:19:20,521 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xms922M
2020-02-11 15:19:20,521 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xmx922M
2020-02-11 15:19:20,521 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:MaxDirectMemorySize=8388607T
2020-02-11 15:19:20,521 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2020-02-11 15:19:20,522 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2020-02-11 15:19:20,522 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Program Arguments:
2020-02-11 15:19:20,522 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --configDir
2020-02-11 15:19:20,522 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - /opt/flink/conf
2020-02-11 15:19:20,522 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Classpath: /opt/flink/lib/flink-metrics-datadog-1.9.2.jar:/opt/flink/lib/flink-table-blink_2.12-1.9.2.jar:/opt/flink/lib/flink-table_2.12-1.9.2.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.15.jar:/opt/flink/lib/flink-dist_2.12-1.9.2.jar:::
The producer config that I have is:
acks = 1
batch.size = 16384
bootstrap.servers = [XXXXXXXXXXXXXXXX] ---I masked it intentionally
buffer.memory = 33554432
client.id =
compression.type = none
connections.max.idle.ms = 540000
enable.idempotence = false
interceptor.classes = null
key.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
linger.ms = 0
max.block.ms = 60000
max.in.flight.requests.per.connection = 5
max.request.size = 1048576
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
receive.buffer.bytes = 32768
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retries = 3
retry.backoff.ms = 100
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
transaction.timeout.ms = 60000
transactional.id = Source: Collection Source -> Sink: Unnamed-eb99017e0f9125fa6648bf56123bdcf7-19
value.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
Most of the producer config is the default; is there something I am missing here, or something wrong with the config?
As Dominik suggested, the issue is not related to the heap.
If the broker is set up with SSL authentication and the client is not configured for SSL auth, this exception is thrown.
This is a bug with Kafka: https://issues.apache.org/jira/browse/KAFKA-4090
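For reference, a hedged sketch of what the client-side SSL settings look like for a Kafka producer when the broker listens on SSL; every host, location, and password below is a placeholder rather than a value from the question, and the properties would be passed to the Flink Kafka sink alongside the existing producer properties.

import java.util.Properties

// Placeholder SSL settings for a producer whose broker requires SSL.
val producerProps = new Properties()
producerProps.put("bootstrap.servers", "broker:9093")                            // placeholder
producerProps.put("security.protocol", "SSL")
producerProps.put("ssl.truststore.location", "/path/to/client.truststore.jks")   // placeholder
producerProps.put("ssl.truststore.password", "changeit")                         // placeholder
// Only needed if the broker also enforces client (mutual TLS) authentication:
producerProps.put("ssl.keystore.location", "/path/to/client.keystore.jks")       // placeholder
producerProps.put("ssl.keystore.password", "changeit")                           // placeholder
producerProps.put("ssl.key.password", "changeit")                                // placeholder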

To print output of SparkSQL to dataframe

I'm currently running an ANALYZE command for a particular table and can see the statistics being printed in the Spark console.
However, when I try to write the output to a DataFrame, I cannot see the statistics.
Spark Version : 1.6.3
val a : DataFrame = sqlContext.sql("ANALYZE TABLE sample PARTITION (company='aaa', market='aab', edate='2019-01-03', pdate='2019-01-10') COMPUTE STATISTICS").collect()
Output in the Spark shell:
Partition sample{company=aaa, market=aab, etdate=2019-01-03, p=2019-01-10} stats: [numFiles=1, numRows=215, totalSize=7551, rawDataSize=461390]
19/03/22 02:49:33 INFO Task: Partition sample{company=aaa, market=aab, edate=2019-01-03, pdate=2019-01-10} stats: [numFiles=1, numRows=215, totalSize=7551, rawDataSize=461390]
Output of the DataFrame:
19/03/22 02:49:33 INFO PerfLogger: </PERFLOG method=runTasks start=1553237373445 end=1553237373606 duration=161 from=org.apache.hadoop.hive.ql.Driver>
19/03/22 02:49:33 INFO PerfLogger: </PERFLOG method=Driver.execute start=1553237373445 end=1553237373606 duration=161 from=org.apache.hadoop.hive.ql.Driver>
19/03/22 02:49:33 INFO Driver: OK
19/03/22 02:49:40 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
19/03/22 02:49:40 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 940 bytes result sent to driver
19/03/22 02:49:40 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 4 ms on localhost (1/1)
19/03/22 02:49:40 INFO DAGScheduler: ResultStage 2 (show at <console>:47) finished in 0.004 s
19/03/22 02:49:40 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
19/03/22 02:49:40 INFO DAGScheduler: Job 2 finished: show at <console>:47, took 0.007774 s
+------+
|result|
+------+
+------+
Could you please let me know how to get the same statistics output into the DataFrame?
Thanks!
If you want to print from a DataFrame the way you are using it, you can use:
val a : DataFrame = sqlContext.sql("ANALYZE TABLE sample PARTITION (company='aaa', market='aab', edate='2019-01-03', pdate='2019-01-10') COMPUTE STATISTICS")
a.select("*").show()
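If the goal is to get the statistics themselves into a DataFrame rather than the (empty) result of ANALYZE, one possible route, sketched here under the assumption that the table is a Hive table and that the DESCRIBE command is passed through to Hive (not verified on Spark 1.6.3), is to read the partition metadata back after ANALYZE has run:

// Sketch: the computed stats (numFiles, numRows, totalSize, rawDataSize)
// appear among the partition parameter rows returned by DESCRIBE FORMATTED.
val stats = sqlContext.sql(
  "DESCRIBE FORMATTED sample PARTITION (company='aaa', market='aab', edate='2019-01-03', pdate='2019-01-10')")
stats.show(200, false)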

Spark SQL freeze

I have a problem with Spark SQL. I read some data from CSV files, then do a groupBy and a join, and the final task writes the joined data to a file. My problem is a time gap, visible in the log below as the jump from 23:39:54 to 00:14:22.
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1069
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1003
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 965
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1073
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1038
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 900
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 903
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 938
18/08/07 23:39:40 INFO storage.BlockManagerInfo: Removed broadcast_84_piece0 on 10.4.110.24:36423 in memory (size: 32.8 KB, free: 4.1 GB)
18/08/07 23:39:40 INFO storage.BlockManagerInfo: Removed broadcast_84_piece0 on omm104.in.nawras.com.om:43133 in memory (size: 32.8 KB, free: 4.1 GB)
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 969
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1036
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 970
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1006
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1039
18/08/07 23:39:47 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
18/08/07 23:39:54 INFO parquet.ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Post-Scan Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Output Data Schema: struct<_c0: string, _c1: string, _c2: string, _c3: string, _c4: string ... 802 more fields>
18/08/08 00:14:22 INFO execution.FileSourceScanExec: Pushed Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Post-Scan Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Output Data Schema: struct<_c0: string, _c1: string, _c2: string, _c3: string, _c4: string ... 802 more fields>
18/08/08 00:14:22 INFO execution.FileSourceScanExec: Pushed Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Post-Scan Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Output Data Schema: struct<_c0: string, _c1: string, _c2: string, _c3: string, _c4: string ... 802 more fields>
18/08/08 00:14:22 INFO execution.FileSourceScanExec: Pushed Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with:
The DataFrames are small: ~5000 records and ~800 columns.
I am using the following code:
val parentDF = ...
val childADF = ...
val childBDF = ...

val aggregatedAColName = "CHILD_A"
val aggregatedBColName = "CHILD_B"

val columns = List("key_col_0", "key_col_1", "key_col_2", "key_col_3", "val_0")
val keyColumns = List("key_col_0", "key_col_1", "key_col_2", "key_col_3")

val nestedAColumns = keyColumns.map(x => col(x)) :+ struct(columns.map(col): _*).alias(aggregatedAColName)

val childADataFrame = childADF
  .select(nestedAColumns: _*)
  .repartition(keyColumns.map(col): _*)
  .groupBy(keyColumns.map(col): _*)
  .agg(collect_list(aggregatedAColName).alias(aggregatedAColName))

val joinedWithA = parentDF.join(childADataFrame, keyColumns, "left")

val nestedBColumns = keyColumns.map(x => col(x)) :+ struct(columns.map(col): _*).alias(aggregatedBColName)

val childBDataFrame = childBDF
  .select(nestedBColumns: _*)
  .repartition(keyColumns.map(col): _*)
  .groupBy(keyColumns.map(col): _*)
  .agg(collect_list(aggregatedBColName).alias(aggregatedBColName))

val joinedWithB = joinedWithA.join(childBDataFrame, keyColumns, "left")
Processing time for 30 files (~85k records in total) is strangely high: ~38 minutes.
Have you ever seen a similar problem?
Try to avoid the repartition() call, as it causes unnecessary data movement within the nodes.
According to Learning Spark:
Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
Put simply, coalesce() only decreases the number of partitions; it does not shuffle all the data, it just merges existing partitions.
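To illustrate the suggestion on one of the branches, here is a sketch with the explicit repartition() removed, relying on the shuffle that groupBy already performs on the key columns; the names are reused from the question's code.

// Same aggregation as childADataFrame, but without the extra repartition
// step: groupBy on the key columns already shuffles the data by those keys.
val childADataFrameNoRepartition = childADF
  .select(nestedAColumns: _*)
  .groupBy(keyColumns.map(col): _*)
  .agg(collect_list(aggregatedAColName).alias(aggregatedAColName))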