I'm trying to set up an SftpCsvSourceConnector in my local environment and I'm having trouble setting a schema on the connector. This is what I'm trying to do:
curl -i -X PUT -H "Accept:application/json" \
-H "Content-Type:application/json" http://localhost:8083/connectors/nc-csv-02/config \
-d '{
"tasks.max" : "1",
"connector.class" : "io.confluent.connect.sftp.SftpCsvSourceConnector",
"kafka.topic": "sftp-csv-00",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"input.path" : "/",
"csv.separator.char" : 59,
"finished.path" : "/finished",
"error.path" : "/error",
"schema.generation.key.fields" : "msisdn",
"input.file.pattern" : ".*\\.dat",
"schema.generation.enabled" : "false",
"csv.first.row.as.header" : "true",
The exception I see in the worker task is
org.apache.kafka.common.config.ConfigException: Invalid value com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "fields" (class com.github.jcustenborder.kafka.connect.utils.jackson.SchemaSerializationModule$Storage), not marked as ignorable (10 known properties: "defaultValue", "valueSchema", "doc", "type", "name", "keySchema", "version", "parameters", "isOptional", "fieldSchemas"])
at [Source: (String)"{"fields":[{"default":null,"name":"msisdn","type":["null","string"]}],"name":"NCKeySchema","type":"record"}"; line: 1, column: 12] (through reference chain: com.github.jcustenborder.kafka.connect.utils.jackson.SchemaSerializationModule$Storage["fields"]) for configuration Could not read schema from 'key.schema'
at io.confluent.connect.sftp.source.SftpSourceConnectorConfig.readSchema(SftpSourceConnectorConfig.java:334)
at io.confluent.connect.sftp.source.SftpSourceConnectorConfig.<init>(SftpSourceConnectorConfig.java:117)
at io.confluent.connect.sftp.source.SftpCsvSourceConnectorConfig.<init>(SftpCsvSourceConnectorConfig.java:156)
at io.confluent.connect.sftp.SftpCsvSourceConnector.start(SftpCsvSourceConnector.java:44)
at org.apache.kafka.connect.runtime.WorkerConnector.doStart(WorkerConnector.java:185)
at org.apache.kafka.connect.runtime.WorkerConnector.start(WorkerConnector.java:210)
at org.apache.kafka.connect.runtime.WorkerConnector.doTransitionTo(WorkerConnector.java:349)
at org.apache.kafka.connect.runtime.WorkerConnector.doTransitionTo(WorkerConnector.java:332)
at org.apache.kafka.connect.runtime.WorkerConnector.doRun(WorkerConnector.java:141)
at org.apache.kafka.connect.runtime.WorkerConnector.run(WorkerConnector.java:118)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
The schemas I'm trying to use for key and value are:
{
  "fields": [
    {
      "default": null,
      "name": "msisdn",
      "type": ["null", "string"]
    }
  ],
  "name": "NCKeySchema",
  "type": "record"
}
"name" : "NCPortabilityMovementEvent",
"type" : "record",
"fields" : [
"default" : null,
"name" : "action",
"type" : [
"default" : null,
"name" : "msisdn",
"type" : [
"default" : null,
"name" : "previousNRN",
"type" : [
"default" : null,
"name" : "newNRN",
"type" : [
"default" : null,
"name" : "effectiveDate",
"type" : [
"default" : null,
"name" : "referenceID",
"type" : [
What am I doing wrong here?
When I set schema.generation.enabled=true and removed key.schema and value.schema, the connector worked just fine.

You're providing Avro schemas, which the connector cannot parse for key.schema and value.schema. You need to define Connect schemas instead: a record is declared as type=STRUCT with its fields listed under fieldSchemas. The format itself is not well documented, but there are examples here: https://docs.confluent.io/kafka-connect-sftp/current/source-connector/csv_source_connector.html#sftp-connector-csv-with-schema-example
The source code of the schema JSON deserializer is here: https://github.com/jcustenborder/connect-utils/tree/master/connect-utils-jackson/src/main/java/com/github/jcustenborder/kafka/connect/utils/jackson
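To make the difference concrete, here is a minimal Python sketch that rewrites a flat Avro record (like the key schema from the question) into the STRUCT/fieldSchemas shape. The property names come from the "known properties" list in the exception; the helper names and the restriction to primitives and ["null", primitive] unions are assumptions of this sketch, not part of the connector.

```python
import json

# Avro primitive name -> Kafka Connect Schema.Type name
AVRO_TO_CONNECT = {
    "string": "STRING",
    "int": "INT32",
    "long": "INT64",
    "float": "FLOAT32",
    "double": "FLOAT64",
    "boolean": "BOOLEAN",
    "bytes": "BYTES",
}


def avro_field_to_connect(avro_type):
    """Map an Avro primitive or a ["null", primitive] union to a Connect field schema."""
    optional = False
    if isinstance(avro_type, list):
        # Only two-branch unions with "null" are handled in this sketch.
        if len(avro_type) != 2 or "null" not in avro_type:
            raise ValueError(f"unsupported union: {avro_type}")
        optional = True
        avro_type = next(t for t in avro_type if t != "null")
    if not isinstance(avro_type, str) or avro_type not in AVRO_TO_CONNECT:
        raise ValueError(f"unsupported Avro type: {avro_type}")
    return {"type": AVRO_TO_CONNECT[avro_type], "isOptional": optional}


def avro_record_to_connect(avro_schema):
    """Convert a flat Avro record into a Connect STRUCT schema dictionary."""
    if avro_schema.get("type") != "record":
        raise ValueError("only flat Avro records are handled in this sketch")
    return {
        "name": avro_schema["name"],
        "type": "STRUCT",
        "isOptional": False,
        "fieldSchemas": {
            field["name"]: avro_field_to_connect(field["type"])
            for field in avro_schema["fields"]
        },
    }


# The key schema from the exception message in the question.
avro_key = {
    "fields": [{"default": None, "name": "msisdn", "type": ["null", "string"]}],
    "name": "NCKeySchema",
    "type": "record",
}

connect_key = avro_record_to_connect(avro_key)
# The connector expects the schema as a JSON string in "key.schema".
key_schema_config_value = json.dumps(connect_key)
print(key_schema_config_value)
```

The printed string is what would go into "key.schema"; the value schema can be converted the same way, provided its fields are primitives or nullable primitives.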


PySpark: MutableLong cannot be cast to MutableInt (no long in dataframe)

I'm trying to read a profiles table from Athena in PySpark using the Glue client from boto3 and check whether it's empty. Why does Spark fail on a conversion between Int and Long, given that the table I'm reading has no Long column? I found nothing on Google or Stack Overflow that answers this problem.
Here is a summary of the code:
dataframe = GlueContext(session.sparkContext).create_dynamic_frame.from_catalog(
if dataframe.rdd.isEmpty():
dataframe = session.sparkContext.emptyRDD().toDF(schema)
I'm getting the error:
ERROR GlueExceptionAnalysisListener: [Glue Exception Analysis] Event: GlueETLJobExceptionEvent
File "/myScript.py", line 246, in load_table
if dataframe.rdd.isEmpty()
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file s3://bucket/path/to/profiles/vault=c27/subgroup=1/part-00003-a97d95f5-713c-4756-808b-38c3866842cb.c000.snappy.parquet
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
Here is the Athena DDL:
`id` string,
`anonymousids` array<string>,
`lastconsentinsightusage` boolean,
`lastconsentactivationusage` boolean,
`gender` string,
`age` int,
`iata` string,
`continent` string,
`country` string,
`city` string,
`state` string,
`brandvisit` int,
`knownprofileinmarket` date,
`devicebrowser` string,
`devicebrand` string,
`deviceos` string,
`returnedinmarket` date)
`vault` varchar(5),
`subgroup` int)
And here is the parquet schema:
"type" : "record",
"name" : "spark_schema",
"fields" : [ {
"name" : "Id",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "anonymousIds",
"type" : [ "null", {
"type" : "array",
"items" : {
"type" : "record",
"name" : "list",
"fields" : [ {
"name" : "element",
"type" : [ "null", "string" ],
"default" : null
} ]
}
} ],
"default" : null
}, {
"name" : "lastConsentInsightUsage",
"type" : [ "null", "boolean" ],
"default" : null
}, {
"name" : "lastConsentActivationUsage",
"type" : [ "null", "boolean" ],
"default" : null
}, {
"name" : "gender",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "age",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "iata",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "continent",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "country",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "city",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "state",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "brandVisit",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "knownProfileInMarket",
"type" : [ "null", {
"type" : "int",
"logicalType" : "date"
} ],
"default" : null
}, {
"name" : "deviceBrowser",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "deviceBrand",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "deviceOs",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "returnedInMarket",
"type" : [ "null", {
"type" : "int",
"logicalType" : "date"
} ],
"default" : null
} ]
And a line of the parquet file:
{"Id": "34e9bbcd3dd577d6bc3f9b82d9dd99666dafa0203486d2a604f59b7702d50d7d", "anonymousIds": [{"element": "5510"}], "lastConsentInsightUsage": true, "lastConsentActivationUsage": true, "gender": "F", "age": 40, "iata": "", "continent": "EU", "country": "DE", "city": "Frankfurt (Oder)", "state": "BB", "brandVisit": 9, "knownProfileInMarket": 18765, "deviceBrowser": "Googlebot", "deviceBrand": "Spider", "deviceOs": "Other", "returnedInMarket": 18765}

Debezium MongoDB Kafka connector not producing some records in the topic as they are in MongoDB

My MongoDB collection contains this data:
mongo01:PRIMARY> db.col.find({"_id" : ObjectId("5d8777f188fef5555b")})
{ "_id" : ObjectId("5d8777f188fef5555b"), "attachments" : [ { "name" : "Je", "src" : "https://google.co", "type" : "image/png" } ], "tags" : [ 51, 52 ], "last_comment" : [ ], "hashtags" : [ "Je" ], "badges" : [ ], "feed_id" : "1", "company_id" : 1, "message" : "aJsm9LtK", "group_id" : "106", "feed_type" : "post", "thumbnail" : "", "group_tag" : false, "like_count" : 0, "clap_count" : 0, "comment_count" : 0, "created_by" : 520, "created_at" : "1469577278628", "updated_at" : "1469577278628", "status" : 1, "__v" : 0 }
mongo01:PRIMARY> db.col.find({"_id" : ObjectId("5d285b4554e3b584bf97759")})
{ "_id" : ObjectId("5d285b4554e3b584bf97759"), "attachments" : [ ], "tags" : [ ], "last_comment" : [ ], "company_id" : 1, "group_id" : "00e35289", "feed_type" : "post", "group_tag" : false, "status" : 1, "feed_id" : "3dc44", "thumbnail" : "{}", "message" : "s2np1HYrPuFF", "created_by" : 1, "html_content" : "", "created_at" : "144687057949", "updated_at" : "144687057949", "like_count" : 0, "clap_count" : 0, "comment_count" : 0, "__v" : 0, "badges" : [ ], "hashtags" : [ ] }
I am using this Debezium MongoDB connector to get the MongoDB data into a Kafka topic:
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" \
http://localhost:8083/connectors/ -d '{
"name": "mongo_connector-4",
"config": {
"connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
"mongodb.hosts": "mongo01/localhost:27017",
"mongodb.name": "mongo_1",
"collection.whitelist": "data.col",
"key.converter.schemas.enable": false,
"value.converter.schemas.enable": false,
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"transforms" : "unwrap",
"transforms.unwrap.type" : "io.debezium.connector.mongodb.transforms.UnwrapFromMongoDbEnvelope",
"transforms.unwrap.drop.tombstones" : "false",
"transforms.unwrap.delete.handling.mode" : "drop",
"transforms.unwrap.operation.header" : "true",
"errors.tolerance" : "all",
Now, when I print the topic in ksql, some records arrive with all columns (as they are in MongoDB), while other records are missing some columns.
ksql> print 'mongo_1.data.col' from beginning;
Why is this happening, and how can I resolve it?
PS: The only difference I found is that the two records have their columns in a different order.
While searching for this issue, the only related thing I found was https://github.com/hpgrahsl/kafka-connect-mongodb
which mentions post-processing and redacting fields that hold sensitive data. But as you can see, both of my records are similar and contain no sensitive data (by sensitive data I mean encrypted data; maybe they meant something else).
Do the missing values appear after updates? Keep in mind that for updates the MongoDB connector provides a patch field, not an after field: https://debezium.io/documentation/reference/0.10/connectors/mongodb.html#change-events-value
If you need the full after state for MongoDB, you need to introduce a Kafka Streams pipeline that stores the insert event in a persistent store and then merges each patch with the original insert to create the final event.
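The merge step can be sketched outside Kafka Streams to show the idea. The Python below is only an illustration under stated assumptions: an in-memory dictionary stands in for the persistent store, the event shape (JSON strings in "after" and "patch") is simplified, only $set and $unset with dotted paths into nested documents are handled, and the names DocumentStore and apply_patch are hypothetical.

```python
import copy
import json


def _set_path(doc, dotted_key, value):
    """Set a value at a dotted path, creating nested dictionaries as needed."""
    parts = dotted_key.split(".")
    for part in parts[:-1]:
        doc = doc.setdefault(part, {})
    doc[parts[-1]] = value


def _unset_path(doc, dotted_key):
    """Remove the value at a dotted path if it exists."""
    parts = dotted_key.split(".")
    for part in parts[:-1]:
        doc = doc.get(part)
        if not isinstance(doc, dict):
            return
    doc.pop(parts[-1], None)


def apply_patch(document, patch_json):
    """Return a copy of the document with a $set/$unset patch applied."""
    patch = json.loads(patch_json)
    merged = copy.deepcopy(document)
    for key, value in patch.get("$set", {}).items():
        _set_path(merged, key, value)
    for key in patch.get("$unset", {}):
        _unset_path(merged, key)
    return merged


class DocumentStore:
    """In-memory stand-in for the persistent store, keyed by document id."""

    def __init__(self):
        self._docs = {}

    def handle_event(self, key, event):
        """Store inserts as-is, merge update patches, return the full document."""
        if event.get("after") is not None:
            self._docs[key] = json.loads(event["after"])
        elif event.get("patch") is not None:
            self._docs[key] = apply_patch(self._docs.get(key, {}), event["patch"])
        return self._docs.get(key)


store = DocumentStore()
store.handle_event(
    "1", {"op": "c", "after": '{"_id": "1", "message": "hello", "like_count": 0}'}
)
full_document = store.handle_event(
    "1",
    {"op": "u", "patch": '{"$v": 1, "$set": {"like_count": 3}, "$unset": {"message": true}}'},
)
print(full_document)
```

In a real pipeline the store would need to be durable and partitioned by key, and array positions in dotted paths would need extra handling that this sketch leaves out.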

Kapacitor: how to create a task from a template via the REST API?

I can successfully create templates and tasks using the REST API.
How do I create a task from a template via the REST API?
Which endpoint should I use?
Okay, I found out how:
Use the same task REST endpoint: send a POST request and pass in the JSON.
In the JSON you can specify template-id and the vars, like below.
{
"status": "disabled"
,"id": "test_task4"
,"template-id": "generic_mean_alert"
,"vars" : {
"measurement": {"type" : "string", "value" : "cpu" },
"where_filter": {"type": "lambda", "value": "\"cpu\" == 'cpu-total'"},
"groups": {"type": "list", "value": [{"type":"string", "value":"host"},{"type":"string", "value":"dc"}]},
"field": {"type" : "string", "value" : "usage_idle" },
"warn": {"type" : "lambda", "value" : "\"mean\" < 30.0" },
"crit": {"type" : "lambda", "value" : "\"mean\" < 10.0" },
"window": {"type" : "duration", "value" : "1m" },
"slack_channel": {"type" : "string", "value" : "#alerts_testing" }
}
,"dbrps": [ { "db": "test","rp": "autogen" } ]
,"type": "stream"
}
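For reference, the same request could be built from Python roughly as follows. This is a sketch, not a verified client: the base URL http://localhost:9092 and the path /kapacitor/v1/tasks are assumptions about a default Kapacitor 1.x install, the "template-id" key is spelled as in the JSON above (check your version's API docs, since the spelling may differ), and build_task_request is a hypothetical helper name.

```python
import json
import urllib.request


def build_task_request(base_url, task_id, template_id, variables, dbrps,
                       task_type="stream", status="disabled"):
    """Build (but do not send) a POST request that creates a task from a template."""
    payload = {
        "id": task_id,
        "template-id": template_id,  # key spelled as in the post above
        "type": task_type,
        "dbrps": dbrps,
        "vars": variables,
        "status": status,
    }
    return urllib.request.Request(
        url=f"{base_url}/kapacitor/v1/tasks",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_task_request(
    base_url="http://localhost:9092",  # assumed default Kapacitor address
    task_id="test_task4",
    template_id="generic_mean_alert",
    variables={
        "measurement": {"type": "string", "value": "cpu"},
        "field": {"type": "string", "value": "usage_idle"},
        "window": {"type": "duration", "value": "1m"},
    },
    dbrps=[{"db": "test", "rp": "autogen"}],
)
print(request.get_method(), request.full_url)

# To actually send it (requires a running Kapacitor instance):
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```

Only three of the template vars are filled in here to keep the example short; the remaining vars from the JSON above would be passed the same way.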

How to insert data into druid via tranquility

By following the tutorial at http://druid.io/docs/latest/tutorials/tutorial-loading-streaming-data.html, I was able to insert data into Druid via the Kafka console.
Kafka console
The spec file looks as follows:
"dataSchema" : {
"dataSource" : "wikipedia",
"parser" : {
"type" : "string",
"parseSpec" : {
"format" : "json",
"timestampSpec" : {
"column" : "timestamp",
"format" : "auto"
"dimensionsSpec" : {
"dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
"dimensionExclusions" : [],
"spatialDimensions" : []
"metricsSpec" : [{
"type" : "count",
"name" : "count"
}, {
"type" : "doubleSum",
"name" : "added",
"fieldName" : "added"
}, {
"type" : "doubleSum",
"name" : "deleted",
"fieldName" : "deleted"
}, {
"type" : "doubleSum",
"name" : "delta",
"fieldName" : "delta"
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "DAY",
"queryGranularity" : "NONE"
"ioConfig" : {
"type" : "realtime",
"firehose": {
"type": "kafka-0.8",
"consumerProps": {
"zookeeper.connect": "localhost:2181",
"zookeeper.connection.timeout.ms" : "15000",
"zookeeper.session.timeout.ms" : "15000",
"zookeeper.sync.time.ms" : "5000",
"group.id": "druid-example",
"fetch.message.max.bytes" : "1048586",
"auto.offset.reset": "largest",
"auto.commit.enable": "false"
"feed": "wikipedia"
"plumber": {
"type": "realtime"
"tuningConfig": {
"type" : "realtime",
"maxRowsInMemory": 500000,
"intermediatePersistPeriod": "PT10m",
"windowPeriod": "PT10m",
"basePersistDirectory": "\/tmp\/realtime\/basePersist",
"rejectionPolicy": {
"type": "messageTime"
I start the realtime node via:
java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Ddruid.realtime.specFile=examples/indexing/wikipedia.spec -classpath config/_common:config/realtime:lib/* io.druid.cli.Main server realtime
In Kafka console, I paste and enter the following
{"timestamp": "2013-08-10T01:02:33Z", "page": "Good Bye", "language" : "en", "user" : "catty", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
Then I run a query by creating select.json and running curl -X POST 'http://localhost:8084/druid/v2/?pretty' -H 'content-type: application/json' -d @select.json
"queryType": "select",
"dataSource": "wikipedia",
"granularity": "all",
"intervals": [
"filter" : {"type":"and",
"fields" : [
{ "type": "selector", "dimension": "user", "value": "catty" }
"pagingSpec":{"pagingIdentifiers": {}, "threshold":500}
I was able to get the following result.
[ {
"timestamp" : "2013-08-10T01:02:33.000Z",
"result" : {
"pagingIdentifiers" : {
"wikipedia_2013-08-10T00:00:00.000Z_2013-08-11T00:00:00.000Z_2013-08-10T00:00:00.000Z" : 0
"events" : [ {
"segmentId" : "wikipedia_2013-08-10T00:00:00.000Z_2013-08-11T00:00:00.000Z_2013-08-10T00:00:00.000Z",
"offset" : 0,
"event" : {
"timestamp" : "2013-08-10T01:02:33.000Z",
"continent" : "North America",
"robot" : "false",
"country" : "United States",
"city" : "San Francisco",
"newPage" : "true",
"unpatrolled" : "true",
"namespace" : "article",
"anonymous" : "false",
"language" : "en",
"page" : "Good Bye",
"region" : "Bay Area",
"user" : "catty",
"deleted" : 200.0,
"added" : 57.0,
"count" : 1,
"delta" : -143.0
} ]
} ]
It seems that I have set up Druid correctly.
Now I would like to insert data via an HTTP endpoint. According to "How realtime data input to Druid?", the recommended way seems to be Tranquility.
I have the indexing service started via:
java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath config/_common:config/overlord:lib/*: io.druid.cli.Main server overlord
conf/server.json looks like
"dataSources" : [
"spec" : {
"dataSchema" : {
"dataSource" : "wikipedia",
"parser" : {
"type" : "string",
"parseSpec" : {
"format" : "json",
"timestampSpec" : {
"column" : "timestamp",
"format" : "auto"
"dimensionsSpec" : {
"dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
"dimensionExclusions" : [],
"spatialDimensions" : []
"metricsSpec" : [{
"type" : "count",
"name" : "count"
}, {
"type" : "doubleSum",
"name" : "added",
"fieldName" : "added"
}, {
"type" : "doubleSum",
"name" : "deleted",
"fieldName" : "deleted"
}, {
"type" : "doubleSum",
"name" : "delta",
"fieldName" : "delta"
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "DAY",
"queryGranularity" : "NONE"
"tuningConfig" : {
"windowPeriod" : "PT10M",
"type" : "realtime",
"intermediatePersistPeriod" : "PT10M",
"maxRowsInMemory" : "100000"
"properties" : {
"task.partitions" : "1",
"task.replicants" : "1"
"properties" : {
"zookeeper.connect" : "localhost",
"http.port" : "8200",
"http.threads" : "8"
Then, I start the server using
bin/tranquility server -configFile conf/server.json
I send a POST request to http://xx.xxx.xxx.xxx:8200/v1/post/wikipedia, with Content-Type set to application/json:
{"timestamp": "2013-08-10T01:02:33Z", "page": "Selamat Pagi", "language" : "en", "user" : "catty", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
I get the following response:
{"result":{"received":1,"sent":0}}
It seems that tranquility has received our data, but failed to send it to druid!
I then run curl -X POST 'http://localhost:8084/druid/v2/?pretty' -H 'content-type: application/json' -d @select.json, but I don't get the row I inserted via Tranquility.
Any idea why? Thanks.
This generally happens when the data you send falls outside the window period. If you are inserting data manually, use the exact current UTC timestamp in milliseconds. If you generate the data with a script, this is easy to do; just make sure the timestamp is the current UTC time.
It is extremely difficult to set up Druid to work properly with real-time data insertion.
The best option I found is https://github.com/implydata . Imply is a set of wrappers around Druid that make it easier to use.
However, real-time insertion in Imply is not perfect either. I hit an OutOfMemoryException after inserting 30 million items via real-time ingestion, which caused the loss of the previously inserted 30 million rows.
Details about the data loss can be found here: https://groups.google.com/forum/#!topic/imply-user-group/95xpYojxiOg
An issue ticket has been filed: https://github.com/implydata/distribution/issues/8
The Druid streaming windowPeriod is very short (10 minutes in this configuration). Events outside this period are ignored.
Since you got {"result":{"received":1,"sent":0}}, your worker threads are working fine. Tranquility decides which data is sent to Druid based on the timestamp associated with the data.
This period is set by the "windowPeriod" configuration. So if your type is realtime ("type":"realtime") and the window period is PT10M ("windowPeriod" : "PT10M"), Tranquility sends only data whose timestamp lies between t-10 and t+10 minutes, where t is the current time, and sends nothing outside this period.
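The window check described above can be illustrated with a few lines of Python. This is a simplified model of the rule as stated in this answer (current time plus or minus windowPeriod), not Tranquility's actual implementation; the function names and the fixed "now" used for the demonstration are assumptions of the sketch.

```python
import json
from datetime import datetime, timedelta, timezone


def within_window(event_time, now, window=timedelta(minutes=10)):
    """Return True if event_time lies in [now - window, now + window]."""
    return now - window <= event_time <= now + window


def make_event(fields, now=None):
    """Copy the fields and stamp them with the current UTC time in ISO 8601 form."""
    now = now or datetime.now(timezone.utc)
    event = dict(fields)
    event["timestamp"] = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return event


# A fixed "now" keeps the demonstration reproducible.
now = datetime(2016, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
# The timestamp used in the question is years older than any 10-minute window.
old_event_time = datetime(2013, 8, 10, 1, 2, 33, tzinfo=timezone.utc)

print(within_window(old_event_time, now))               # False
print(within_window(now - timedelta(minutes=5), now))   # True
print(json.dumps(make_event({"page": "Selamat Pagi", "user": "catty"}, now)))
```

Calling make_event without the now argument stamps the event with the real current UTC time, which is what a posted event needs in order to fall inside the window.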
I disagree about the insertion efficiency problems: we have been sending 3 million rows every 15 minutes since June 2016, and it has been running beautifully. Of course, we have a stronger infrastructure sized for that scale.
Another reason data may not be inserted is that the coordinator/overlord is running out of memory.

Meteor, MongoDB - db.collection.find() for OR condition

In MongoDB, I have the following documents in a collection named "Jobs":
{
  "userId": "testUser1",
  "default": "true",
  "someData": "data"
}
{
  "userId": "testUser1",
  "default": "false",
  "someData": "data"
}
{
  "userId": "testUser2",
  "default": "true",
  "someData": "data"
}
{
  "userId": "testUser2",
  "default": "false",
  "someData": "data"
}
In Meteor, I am trying to select documents based on two conditions:
- Select documents where userId matches the given user OR default is true
I have the following code in Meteor:
Jobs.find({$or:[{userid:"testUser1"}, {default:"true"}]});
But it selects only two documents:
{
  "userId": "testUser1",
  "default": "true",
  "someData": "data"
}
{
  "userId": "testUser1",
  "default": "false",
  "someData": "data"
}
and it does NOT return the document below:
{
  "userId": "testUser2",
  "default": "true",
  "someData": "data"
}
I experimented with $where, but even that did not work.
How do I retrieve the right documents from MongoDB?
Try without $or
Jobs.find({userId: "testUser2", "default": "true"});
Just to be clear, you're trying to get all three of the records you mention, right? If so, I think your issue is that the 'true' values are stored as strings, not booleans, and I'm guessing that you're searching on a boolean. Try this:
{"userId" : "testUser1", "default" : "true", "someData" : "data" }
{"userId" : "testUser1", "default" : "false", "someData" : "data" }
{"userId" : "testUser2", "default" : "true", "someData" : "data" }
{"userId" : "testUser2", "default" : "false", "someData" : "data" }
db.Jobs.find({ $or: [{ userId: 'testUser1' }, { default : 'true' } ] })
{"userId" : "testUser1", "default" : "true", "someData" : "data" }
{"userId" : "testUser1", "default" : "false", "someData" : "data" }
{"userId" : "testUser2", "default" : "true", "someData" : "data" }
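The effect of a type or field-name mismatch can be reproduced without a database. The Python below is a small emulation of exact-match $or semantics over in-memory dictionaries, written only to illustrate the point; it is not the MongoDB query engine, and find_or and matches_or are hypothetical helper names.

```python
jobs = [
    {"userId": "testUser1", "default": "true", "someData": "data"},
    {"userId": "testUser1", "default": "false", "someData": "data"},
    {"userId": "testUser2", "default": "true", "someData": "data"},
    {"userId": "testUser2", "default": "false", "someData": "data"},
]


def matches_or(doc, clauses):
    """True if the document satisfies at least one clause (exact field equality)."""
    return any(
        all(doc.get(field) == value for field, value in clause.items())
        for clause in clauses
    )


def find_or(docs, clauses):
    """Return the documents that match any of the clauses."""
    return [doc for doc in docs if matches_or(doc, clauses)]


# String "true" matches the stored string values: three documents.
string_query = find_or(jobs, [{"userId": "testUser1"}, {"default": "true"}])

# Boolean True does not equal the string "true": only the testUser1 documents.
bool_query = find_or(jobs, [{"userId": "testUser1"}, {"default": True}])

# Field names are case-sensitive: "userid" matches nothing, so only the
# "default" clause contributes.
lowercase_query = find_or(jobs, [{"userid": "testUser1"}, {"default": "true"}])

print(len(string_query), len(bool_query), len(lowercase_query))  # 3 2 2
```

Either mismatch shrinks the result from three documents to two, so both the value type of default and the spelling of userId in the query are worth checking.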