Snappy ingestion into Druid

I am facing a problem with Snappy ingestion into Druid. Things start to break after org.apache.hadoop.mapred.LocalJobRunner logs "map task executor complete." The job is able to fetch the input file.
My spec JSON file:
{
"hadoopCoordinates": "org.apache.hadoop:hadoop-client:2.6.0",
"spec": {
"dataSchema": {
"dataSource": "apps_searchprivacy",
"granularitySpec": {
"intervals": [
"2017-01-23T00:00:00.000Z/2017-01-23T01:00:00.000Z"
],
"queryGranularity": "HOUR",
"segmentGranularity": "HOUR",
"type": "uniform"
},
"metricsSpec": [
{
"name": "count",
"type": "count"
},
{
"fieldName": "event_value",
"name": "event_value",
"type": "longSum"
},
{
"fieldName": "landing_impression",
"name": "landing_impression",
"type": "longSum"
},
{
"fieldName": "user",
"name": "DistinctUsers",
"type": "hyperUnique"
},
{
"fieldName": "cost",
"name": "cost",
"type": "doubleSum"
}
],
"parser": {
"parseSpec": {
"dimensionsSpec": {
"dimensionExclusions": [
"landing_page",
"skip_url",
"ua",
"user_id"
],
"dimensions": [
"t3",
"t2",
"t1",
"aff_id",
"customer",
"evt_id",
"install_date",
"install_week",
"install_month",
"install_year",
"days_since_install",
"months_since_install",
"weeks_since_install",
"success_url",
"event",
"chrome_version",
"value",
"event_label",
"rand",
"type_tag_id",
"channel_name",
"cid",
"log_id",
"extension",
"os",
"device",
"browser",
"cli_ip",
"t4",
"t5",
"referal_url",
"week",
"month",
"year",
"browser_version",
"browser_name",
"landing_template",
"strvalue",
"customer_group",
"extname",
"countrycode",
"issp",
"spdes",
"spsc"
],
"spatialDimensions": []
},
"format": "json",
"timestampSpec": {
"column": "time_stamp",
"format": "yyyy-MM-dd HH:mm:ss"
}
},
"type": "hadoopyString"
}
},
"ioConfig": {
"inputSpec": {
"dataGranularity": "hour",
"filePattern": ".*\\..*",
"inputPath": "hdfs://c8-auto-hadoop-service-1.srv.media.net:8020/data/apps_test_output",
"pathFormat": "'ts'=yyyyMMddHH",
"type": "granularity"
},
"type": "hadoop"
},
"tuningConfig": {
"ignoreInvalidRows": "true",
"type": "hadoop",
"useCombiner": "false"
}
},
"type": "index_hadoop"
}
The error I am getting:
2017-02-03T14:39:50,738 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.MapTask - (EQUATOR) 0 kvi 26214396(104857584)
2017-02-03T14:39:50,738 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.MapTask - mapreduce.task.io.sort.mb: 100
2017-02-03T14:39:50,738 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.MapTask - soft limit at 83886080
2017-02-03T14:39:50,738 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.MapTask - bufstart = 0; bufvoid = 104857600
2017-02-03T14:39:50,738 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.MapTask - kvstart = 26214396; length = 6553600
2017-02-03T14:39:50,738 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.MapTask - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2017-02-03T14:39:50,847 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.MapTask - Starting flush of map output
2017-02-03T14:39:50,849 INFO [Thread-22] org.apache.hadoop.mapred.LocalJobRunner - map task executor complete.
2017-02-03T14:39:50,850 WARN [Thread-22] org.apache.hadoop.mapred.LocalJobRunner - job_local233667772_0001
java.lang.Exception: java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) ~[hadoop-mapreduce-client-common-2.6.0.jar:?]
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522) [hadoop-mapreduce-client-common-2.6.0.jar:?]
Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native Method) ~[hadoop-common-2.6.0.jar:?]
at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63) ~[hadoop-common-2.6.0.jar:?]
at org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:192) ~[hadoop-common-2.6.0.jar:?]
at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:176) ~[hadoop-common-2.6.0.jar:?]
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:90) ~[hadoop-mapreduce-client-core-2.6.0.jar:?]
at org.apache.hadoop.mapreduce.lib.input.DelegatingRecordReader.initialize(DelegatingRecordReader.java:84) ~[hadoop-mapreduce-client-core-2.6.0.jar:?]
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:545) ~[hadoop-mapreduce-client-core-2.6.0.jar:?]
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:783) ~[hadoop-mapreduce-client-core-2.6.0.jar:?]
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) ~[hadoop-mapreduce-client-core-2.6.0.jar:?]
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243) ~[hadoop-mapreduce-client-common-2.6.0.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_121]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_121]
2017-02-03T14:39:51,130 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Job job_local233667772_0001 failed with state FAILED due to: NA
2017-02-03T14:39:51,139 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Counters: 0
2017-02-03T14:39:51,143 INFO [task-runner-0-priority-0] io.druid.indexer.JobHelper - Deleting path[var/druid/hadoop-tmp/apps_searchprivacy/2017-02-03T143903.262Z_bb7a812bc0754d4aabcd4bc103ed648a]
2017-02-03T14:39:51,158 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[HadoopIndexTask{id=index_hadoop_apps_searchprivacy_2017-02-03T14:39:03.257Z, type=index_hadoop, dataSource=apps_searchprivacy}]
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]
at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:204) ~[druid-indexing-service-0.9.2.jar:0.9.2]
at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:208) ~[druid-indexing-service-0.9.2.jar:0.9.2]
at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.2.jar:0.9.2]
at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.2.jar:0.9.2]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_121]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_121]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_121]
at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.2.jar:0.9.2]
... 7 more
Caused by: com.metamx.common.ISE: Job[class io.druid.indexer.IndexGeneratorJob] failed!
at io.druid.indexer.JobHelper.runJobs(JobHelper.java:369) ~[druid-indexing-hadoop-0.9.2.jar:0.9.2]
at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:94) ~[druid-indexing-hadoop-0.9.2.jar:0.9.2]
at io.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessing.runTask(HadoopIndexTask.java:261) ~[druid-indexing-service-0.9.2.jar:0.9.2]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_121]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_121]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_121]
at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.2.jar:0.9.2]
... 7 more
2017-02-03T14:39:51,165 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_hadoop_apps_searchprivacy_2017-02-03T14:39:03.257Z] status changed to [FAILED].
2017-02-03T14:39:51,168 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
"id" : "index_hadoop_apps_searchprivacy_2017-02-03T14:39:03.257Z",
"status" : "FAILED",
"duration" : 43693
}

It seems the JVM running the map task can't load the Hadoop native shared library (the .so/.dll that provides Snappy support). Check that the native libraries are available on the machine(s) running the task (hadoop checknative -a will report whether Snappy is available), and if so, check that their directory is on the JVM's native library path (java.library.path).
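One way to make the native libraries visible to the indexing job, as a minimal sketch: pass java.library.path down to the MapReduce task JVMs through jobProperties in the tuningConfig. The path /usr/lib/hadoop/lib/native below is an assumption; adjust it to wherever libhadoop.so and libsnappy.so actually live on your nodes. Also note that because the log shows LocalJobRunner, the map tasks run inside the Druid peon JVM itself, so you may instead (or additionally) need -Djava.library.path=... in druid.indexer.runner.javaOpts on the middle manager.
"tuningConfig": {
  "type": "hadoop",
  "ignoreInvalidRows": "true",
  "useCombiner": "false",
  "jobProperties": {
    "mapreduce.map.java.opts": "-Djava.library.path=/usr/lib/hadoop/lib/native",
    "mapreduce.reduce.java.opts": "-Djava.library.path=/usr/lib/hadoop/lib/native"
  }
}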

Related

Kafka Connect Mongo Sink connector is failing due to schema

I'm taking my first steps with Kafka Connect and things are not going as expected.
I created a Mongo sink connector (Kafka to MongoDB), but it fails to consume the messages with the following error:
Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:489)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:469)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:325)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:228)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:196)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:184)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:234)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.connect.errors.DataException: Failed to deserialize data for topic test_topic to Avro:
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:114)
at org.apache.kafka.connect.storage.Converter.toConnectData(Converter.java:87)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$1(WorkerSinkTask.java:489)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
... 13 more
I don't want to use any schema, but for some reason it fails on deserialization, probably because the schema configuration is missing.
My configuration:
{
"name": "kafka-to-mongo",
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
"tasks.max": "1",
"topics": "test_topic",
"connection.uri": "mongodb://mongodb4:27017/test-db",
"database": "auto-resolve",
"value.converter.schemas.enable": "false",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"collection": "test_collection",
"document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
"document.id.strategy.partial.value.projection.list": "id",
"document.id.strategy.partial.value.projection.type": "AllowList",
"writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.ReplaceOneBusinessKeyStrategy"
}
}
Example of a message in the topic:
{
"_id": "62a6e4d88c1e1e0011902616",
"type": "alert",
"key": "test444",
"timestamp": 1655104728,
"source_system": "api.test",
"tags": [
{
"type": "aaa",
"value": "bbb"
},
{
"type": "fff",
"value": "rrr"
}
],
"ack": "no",
"main": "nochange",
"all_ids": []
}
Any ideas? Thanks!

Kafka sink connector --> postgres, fails with avro JSON data

I set up a Kafka JDBC sink to send events to PostgreSQL.
I wrote this simple producer that sends JSON-with-schema (Avro) data to a topic, as follows:
producer.py (kafka-python)
biometrics = {
"heartbeat": self.pulse, # integer
"oxygen": self.oxygen,# integer
"temprature": self.temprature, # float
"time": time # string
}
avro_value = {
"schema": open(BASE_DIR+"/biometrics.avsc").read(),
"payload": biometrics
}
producer.send("biometrics",
key="some_string",
value=avro_value
)
Value Schema:
{
"type": "record",
"name": "biometrics",
"namespace": "athlete",
"doc": "athletes biometrics",
"fields": [
{
"name": "heartbeat",
"type": "int",
"default": 0
},
{
"name": "oxygen",
"type": "int",
"default": 0
},
{
"name": "temprature",
"type": "float",
"default": 0.0
},
{
"name": "time",
"type": "string",
"default": ""
}
]
}
Connector config (without hosts, passwords etc)
{
"name": "jdbc_sink",
"connector.class": "io.aiven.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"key.converter": "org.apache.kafka.connect.storage.StringConverter ",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"topics": "biometrics",
"insert.mode": "insert",
"auto.create": "true"
}
But my connector fails with three errors, and I am unable to spot the reason for any of them.
TL;DR log version:
(Error 1) Caused by: org.apache.kafka.connect.errors.DataException: biometrics
(Error 2) Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
(Error 3) Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
Full log
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:206)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:132)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:498)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:475)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:325)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:229)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:201)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:185)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:235)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.kafka.connect.errors.DataException: biometrics
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:98)
at org.apache.kafka.connect.storage.Converter.toConnectData(Converter.java:87)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$1(WorkerSinkTask.java:498)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:156)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:190)
... 13 more
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
Could someone help me understand those errors and the underlying reason?
The error occurs because you need to use the JsonConverter class with value.converter.schemas.enable=true in your connector, since that matches what you produced; but the "schema" you embed is an Avro schema, not the Connect JSON schema representation of the payload that the JsonConverter expects, so it might still fail even with those changes.
If you want to actually send Avro, then use the AvroProducer from the confluent-kafka library, which requires running the Schema Registry.
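A minimal sketch of that second option with confluent-kafka-python's AvroProducer (the API available around this Kafka Connect era); the bootstrap server and Schema Registry URL are placeholders for your environment, and the payload mirrors the schema in the question:
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

# Parse the Avro schema once; do not embed the schema text in every message.
value_schema = avro.loads(open("biometrics.avsc").read())

producer = AvroProducer(
    {
        "bootstrap.servers": "localhost:9092",           # placeholder
        "schema.registry.url": "http://localhost:8081",  # placeholder
    },
    default_value_schema=value_schema,
)

# Same shape as the question's payload (keeping its 'temprature' spelling).
biometrics = {"heartbeat": 72, "oxygen": 98, "temprature": 36.6, "time": "12:00:00"}

# AvroProducer registers the schema with the registry and writes real Avro
# (magic byte + schema id + binary payload), which is what the AvroConverter
# on the Connect side expects.
producer.produce(topic="biometrics", value=biometrics)
producer.flush()
With that in place, the connector's value.converter can stay io.confluent.connect.avro.AvroConverter, pointed at the same registry via value.converter.schema.registry.url.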

Orientdb 3.0.26 OETL JDBC oracle Driver not found

I am trying to import data from an Oracle database into an OrientDB graph using OETL.
I've noticed that in jdbc-drivers.json the URL for the Oracle driver is not working:
{
"Oracle": {
"className": "oracle.jdbc.driver.OracleDriver",
## "className": "oracle.jdbc.driver.OracleDriver",
"url": "http://svn.ets.berkeley.edu/nexus/content/repositories/myberkeley/com/oracle/ojdbc7/12.1.0.1/ojdbc7-12.1.0.1.jar",
"version": "7-12.1.0.1",
"format": [
"jdbc:oracle:thin:#<HOST>:<PORT>:<SID>"
]
},
So I replaced it with both
"url": "https://github.com/MHTaleb/ojdbc7/raw/master/ojdbc7.jar",
and also locally with
"url": "C:\Users\sviu\Desktop\orientdb-3.0.26\ojdbc7.jar",
I'm using the following configuration
{
"config": {
"log": "error"
},
"extractor" : {
"jdbc": {
"driver": "oracle.jdbc.OracleDriver",
## "driver": "oracle.jdbc.driver.OracleDriver",
"url": "jdbc:oracle:thin:#URL:1521:SID",
"userName": "DB_OWNER",
"userPassword": "DBPWD",
"query": "select * from LOT" }
},
"transformers" : [
{ "vertex": { "class": "Lot"} }
],
"loader" : {
"orientdb": {
"dbURL": "plocal:../databases/LDS",
"dbUser": "admin",
"dbPassword": "admin",
"dbAutoCreate": true,
"dbType": "graph"
}
}
}
I tried both driver names and both className variants, in combination, in jdbc-drivers.json and lots.json.
I always get "driver not found":
C:\Users\sviu\Desktop\orientdb-3.0.26\bin>oetl.bat lots.json
OrientDB etl v.3.0.26 - Veloce (build 39bd95d70410374ab1cf6770ef2feca26145b04f, branch 3.0.x) https://www.orientdb.com
Exception in thread "main" com.orientechnologies.orient.core.exception.OConfigurationException: Error on creating ETL processor
at com.orientechnologies.orient.etl.OETLProcessorConfigurator.parse(OETLProcessorConfigurator.java:153)
at com.orientechnologies.orient.etl.OETLProcessorConfigurator.parseConfigAndParameters(OETLProcessorConfigurator.java:92)
at com.orientechnologies.orient.etl.OETLProcessor.main(OETLProcessor.java:116)
Caused by: com.orientechnologies.orient.core.exception.OConfigurationException: [JDBC extractor] JDBC Driver oracle.jdbc.OracleDriver not found
at com.orientechnologies.orient.etl.extractor.OETLJDBCExtractor.configure(OETLJDBCExtractor.java:69)
at com.orientechnologies.orient.etl.OETLProcessorConfigurator.configureComponent(OETLProcessorConfigurator.java:158)
at com.orientechnologies.orient.etl.OETLProcessorConfigurator.configureExtractor(OETLProcessorConfigurator.java:216)
at com.orientechnologies.orient.etl.OETLProcessorConfigurator.parse(OETLProcessorConfigurator.java:115)
... 2 more
Caused by: java.lang.ClassNotFoundException: oracle.jdbc.OracleDriver
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at com.orientechnologies.orient.etl.extractor.OETLJDBCExtractor.configure(OETLJDBCExtractor.java:67)
... 5 more
Any insight would be appreciated.
BR

S3 sink record field TimeBasedPartitioner not working

I am trying to deploy an S3 sink connector where the S3 partitions need to be based on a field in the incoming data, props.eventTime.
Following is my config:
{
"name" : "test_timeBasedPartitioner",
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"partition.duration.ms": "3600000",
"s3.region": "us-east-1",
"topics.dir": "dream11",
"flush.size": "50000",
"topics": "test_topic",
"s3.part.size": "5242880",
"tasks.max": "5",
"timezone": "Etc/UTC",
"locale": "en",
"format.class": "io.confluent.connect.s3.format.json.JsonFormat",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"rotate.schedule.interval.ms": "1800000",
"path.format": "'EventDate'=YYYYMMdd",
"s3.bucket.name": "test_bucket",
"partition.duration.ms": "86400000",
"timestamp.extractor": "RecordField",
"timestamp.field": "props.eventTime"
}
Following is a sample JSON message present in the Kafka topic:
{
"eventName": "testEvent",
"props": {
"screen_resolution": "1436x720",
"userId": 0,
"device_name": "1820",
"eventTime": "1565792661712"
}
}
And the exception that I am getting is:
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:546)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:302)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:205)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:173)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Invalid format: "1564561484906" is malformed at "4906"
at org.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187)
at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:826)
at io.confluent.connect.storage.partitioner.TimeBasedPartitioner$RecordFieldTimestampExtractor.extract(TimeBasedPartitioner.java:281)
at io.confluent.connect.s3.TopicPartitionWriter.executeState(TopicPartitionWriter.java:199)
at io.confluent.connect.s3.TopicPartitionWriter.write(TopicPartitionWriter.java:176)
at io.confluent.connect.s3.S3SinkTask.put(S3SinkTask.java:195)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:524)
... 10 more
Is there something that I am missing here to configure?
Any help would be appreciated.
Your field props.eventTime is coming in as microseconds, not milliseconds.
This can be identified from the stack trace and by inspecting the relevant code in org.joda.time's doParseMillis method, which is used by the connector's TimeBasedPartitioner and its RecordFieldTimestampExtractor (the extractor that reads the timestamp from the message payload) when timestamp.field is a STRING:
Caused by: java.lang.IllegalArgumentException: Invalid format: "1564561484906" is malformed at "4906"
at org.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187)
at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:826)
at io.confluent.connect.storage.partitioner.TimeBasedPartitioner$RecordFieldTimestampExtractor.extract(TimeBasedPartitioner.java:281)
You could follow one of these solutions:
Write your own TimestampExtractor that supports microseconds. You can check how to write a custom TimestampExtractor here.
Change/transform your source data so the field comes in as milliseconds instead of microseconds (a minimal sketch of this option follows the list).
Follow up on the issues where the default TimestampExtractor's flexibility is discussed, and suggest or contribute support for your use case.
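A minimal sketch of the second option, assuming (as above) that props.eventTime really is an epoch value in microseconds and that you control the producer; the message shape mirrors the sample in the question, and the 16-digit value is purely illustrative:
def event_time_to_millis(event: dict) -> dict:
    # Per the diagnosis above, the extractor wants milliseconds, so turn the
    # microsecond string into a numeric epoch-millisecond value before the
    # message is published to the topic.
    micros = int(event["props"]["eventTime"])
    event["props"]["eventTime"] = micros // 1000
    return event

# Illustrative message with a hypothetical microsecond timestamp:
msg = {"eventName": "testEvent", "props": {"userId": 0, "eventTime": "1565792661712000"}}
print(event_time_to_millis(msg)["props"]["eventTime"])  # 1565792661712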

java.lang.IllegalArgumentException: Failed to parse query: {"query":

I am trying to execute an Elasticsearch DSL query in Spark 2.2 with Scala 2.11.8. The version of Elasticsearch is 2.4.4, while the version of Kibana I use is 4.6.4.
This is the library that I use in Spark:
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-spark-20_2.11</artifactId>
<version>5.2.2</version>
</dependency>
This is my code:
val spark = SparkSession.builder()
.config("es.nodes","localhost")
.config("es.port",9200)
.config("es.nodes.wan.only","true")
.config("es.index.auto.create","true")
.appName("ES test")
.master("local[*]")
.getOrCreate()
val myquery = """{"query":
{"bool": {
"must": [
{
"has_child": {
"filter": {
...
}
}
}
]
}
}}"""
val df = spark.read.format("org.elasticsearch.spark.sql")
.option("query", myquery)
.option("pushdown", "true")
.load("myindex/items")
I have included only the main body of the DSL query above. I am getting the error:
java.lang.IllegalArgumentException: Failed to parse query: {"query":
Initially, I thought the problem was the version of Elasticsearch. As far as I understand from this GitHub link, version 4 of Elasticsearch is not supported.
However, if I run the same code with a simple query, it correctly retrieves records from Elasticsearch.
var df = spark.read
.format("org.elasticsearch.spark.sql")
.option("es.query", "?q=public:*")
.load("myindex/items")
Therefore I assume that the issue is not related to the version, but rather to the way I represent the query.
This query works fine with cURL, but maybe it needs to be adapted in some way before being passed to Spark?
Full error stacktrace:
Previous exception in task: Failed to parse query: {"query":
{"bool": {
"must": [
{
"has_child": {
"filter": {
"bool": {
"must": [
{
"term": {
"project": 579
}
},
{
"terms": {
"status": [
0,
1,
2
]
}
}
]
}
},
"type": "status"
}
},
{
"has_child": {
"filter": {
"bool": {
"must": [
{
"term": {
"project": 579
}
},
{
"terms": {
"entity": [
4634
]
}
}
]
}
},
"type": "annotation"
}
},
{
"term": {
"project": 579
}
},
{
"range": {
"publication_date": {
"gte": "2017/01/01",
"lte": "2017/04/01",
"format": "yyyy/MM/dd"
}
}
},
{
"bool": {
"should": [
{
"terms": {
"typology": [
"news",
"blog",
"forum",
"socialnetwork"
]
}
},
{
"terms": {
"publishing_platform": [
"twitter"
]
}
}
]
}
}
]
}}
org.elasticsearch.hadoop.rest.query.QueryUtils.parseQuery(QueryUtils.java:59)
org.elasticsearch.hadoop.rest.RestService.createReader(RestService.java:417)
org.elasticsearch.spark.rdd.AbstractEsRDDIterator.reader$lzycompute(AbstractEsRDDIterator.scala:49)
org.elasticsearch.spark.rdd.AbstractEsRDDIterator.reader(AbstractEsRDDIterator.scala:42)
org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:61)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
org.apache.spark.scheduler.Task.run(Task.scala:108)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:138)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:116)
at org.apache.spark.scheduler.Task.run(Task.scala:118)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
18/01/22 21:43:12 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.util.TaskCompletionListenerException: Failed to parse query: {"query":
and this one:
8/01/22 21:47:40 WARN ScalaRowValueReader: Field 'cluster' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
18/01/22 21:47:40 WARN ScalaRowValueReader: Field 'project' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
18/01/22 21:47:40 WARN ScalaRowValueReader: Field 'client' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
18/01/22 21:47:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
scala.MatchError: Buffer(13473953) (of class scala.collection.convert.Wrappers$JListWrapper)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379)
at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:61)
at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:58)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
18/01/22 21:47:40 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): scala.MatchError: Buffer(13473953) (of class scala.collection.convert.Wrappers$JListWrapper)
So the error says
Caused by: org.codehaus.jackson.JsonParseException: Unexpected character (':' (code 58)):
at [Source: java.io.StringReader@76aeea7a; line: 2, column: 33]
and if you look at your query
val myquery = """{"query":
"bool": {
you'll see that this probably maps to the : just after "bool", and what you have is clearly invalid JSON. To make that obvious, re-formatted it reads as
{"query": "bool": { ...
Most probably you forgot the { after "query" and the matching } at the end. Compare this to the example in the official docs:
{
"query": {
"bool" : {
"must" : {