Kafka MongoDB source connector not working - apache-kafka

Hi, in my POC I am using both the sink and the source MongoDB connectors.
The sink connector works fine, but the source connector does not push data into the resulting topic. The objective is to push the full documents of all changes (insert and update) in a collection called 'request'.
Below is the configuration.
curl -X PUT http://localhost:8083/connectors/source-mongodb-request/config -H "Content-Type: application/json" -d '{
"tasks.max":1,
"connector.class":"com.mongodb.kafka.connect.MongoSourceConnector",
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"value.converter":"org.apache.kafka.connect.storage.StringConverter",
"connection.uri":"mongodb://localhost:27017",
"pipeline":"[]",
"database":"proj",
"publish.full.document.only":"true",
"collection":"request",
"topic.prefix": ""
}'
No messages are getting pushed to the proj.request topic. The topic does get created once I insert a record into the 'request' collection.
Would be great to get help on this, as it's a make-or-break task for the POC.
Things work fine with the connectors on Confluent Cloud, but it's the on-premises setup that I need to get this working on.

Make sure you have a valid pipeline, with the stages included in your connector configuration, such as this one:
"pipeline": "[{\"$match\": {\"operationType\": {\"$in\": [\"insert\", \"update\", \"replace\"]}}}]",
Refer to: https://docs.mongodb.com/manual/reference/operator/aggregation-pipeline/
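A sketch of the full PUT with such a pipeline might look like this (operationType is the field that change-stream events carry; the change.stream.full.document=updateLookup setting is, as far as I recall the connector's options, what makes update events carry the full document rather than just the delta):
# sketch only - change.stream.full.document is my addition, verify it against your connector version
curl -X PUT http://localhost:8083/connectors/source-mongodb-request/config -H "Content-Type: application/json" -d '{
  "tasks.max": 1,
  "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "value.converter": "org.apache.kafka.connect.storage.StringConverter",
  "connection.uri": "mongodb://localhost:27017",
  "database": "proj",
  "collection": "request",
  "publish.full.document.only": "true",
  "change.stream.full.document": "updateLookup",
  "topic.prefix": "",
  "pipeline": "[{\"$match\": {\"operationType\": {\"$in\": [\"insert\", \"update\", \"replace\"]}}}]"
}'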

Related

Run Spark job using Databricks REST API

I am using the Databricks REST API to run Spark jobs.
I am using the following commands:
curl -X POST -H "Authorization: XXXX" 'url/api/2.0/jobs/create' -d ' {"name":"jobname","existing_cluster_id":"0725-095337-jello70","libraries": [{"jar": "dbfs:/mnt/pathjar/name-9edeec0f.jar"}],"email_notifications":{},"timeout_seconds":0,"spark_jar_task": {"main_class_name": "com.company.DngApp"}}'
curl -X POST -H "Authorization: XXXX" 'url/api/2.0/jobs/run-now' -d '{"job_id":25854,"jar_params":["--param","value"]}'
Here param is an input argument, but I want to find a way to override Spark driver properties. Usually I do:
--driver-java-options='-Dparam=value'
but I am looking for the equivalent on the Databricks REST API side.
You cannot use "--driver-java-options" in Jar params.
Reason:
Note: Jar_params is a list of parameters for jobs with JAR tasks, e.g. "jar_params": ["john doe", "35"].
The parameters will be used to invoke the main function of the main class specified in the Spark JAR task. If not specified upon run-now, it will default to an empty list. jar_params cannot be specified in conjunction with notebook_params. The JSON representation of this field (i.e. {"jar_params":["john doe","35"]}) cannot exceed 10,000 bytes.
For more details, see Azure Databricks - Jobs API - Run Now.
You can use spark_conf to pass in a string of user-specified spark configuration key-value pairs.
An object containing a set of optional, user-specified Spark configuration key-value pairs. You can also pass in a string of extra JVM options to the driver and the executors via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions respectively.
Example Spark confs: {"spark.speculation": true, "spark.streaming.ui.retainedBatches": 5} or {"spark.driver.extraJavaOptions": "-verbose:gc -XX:+PrintGCDetails"}
For more details, refer to "NewCluster configuration".
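For example, a sketch of a jobs/create payload with the driver option moved into spark_conf (this sits under new_cluster, since as far as I know an existing cluster's Spark conf cannot be overridden per job; the spark_version and node_type_id values below are placeholders):
# spark_version and node_type_id are placeholders - pick values valid for your workspace
curl -X POST -H "Authorization: XXXX" 'url/api/2.0/jobs/create' -d '{
  "name": "jobname",
  "new_cluster": {
    "spark_version": "5.3.x-scala2.11",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "spark_conf": {"spark.driver.extraJavaOptions": "-Dparam=value"}
  },
  "libraries": [{"jar": "dbfs:/mnt/pathjar/name-9edeec0f.jar"}],
  "spark_jar_task": {"main_class_name": "com.company.DngApp"}
}'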
Hope this helps.

Read file created in HDFS with Livy

I am using Livy to run the wordcount example by creating a jar file, which works perfectly fine and writes its output to HDFS. Now I want to get the result back to my HTML page. I am using Spark, Scala, sbt, HDFS and Livy.
The GET /batches REST API only shows log and state.
How do I get the output results?
Or how can I read a file in HDFS using the Livy REST API? Please help me out with this.
Thanks in advance.
If you check the status of the batch using curl, you will get the status of the Livy batch job, which will show as Finished (if the Spark driver has launched successfully).
To read the output:
1. You can SSH (for example with paramiko) to the machine where HDFS is running and run hdfs dfs -ls / to check the output and perform your desired tasks.
2. Using the Livy REST API, you need to write a script which does step 1; that script can then be submitted through a curl command to fetch the output from HDFS, but in this case Livy will launch a separate Spark driver and the output will appear in the STDOUT of that driver's logs.
curl -vvv -u <user>:<password> <livy-host>:<port>/batches -X POST --data '{"file": "http://<path-to-script>"}' -H "Content-Type: application/json"
The first one is the sure way of getting the output, though I am not 100% sure how the second approach will behave.
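For step 1 above, the whole check can be reduced to a one-liner, whether you drive it through paramiko or plain ssh (the host and HDFS path below are placeholders for your cluster):
# host and HDFS output path are placeholders
ssh user@namenode-host "hdfs dfs -cat /user/output/wordcount/part-*"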
You can use WebHDFS in your REST call. Have WebHDFS enabled by your admin first, then:
Use the WebHDFS URL
Create an HttpURLConnection object
Set the request method to GET
Then use a BufferedReader on getInputStream to read the response.
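The same read can also be sketched with plain curl against the WebHDFS REST endpoint (namenode host, port and path below are placeholders; the default HTTP port is typically 50070 on Hadoop 2 and 9870 on Hadoop 3, and -L follows the redirect to the datanode that serves the data):
# namenode host, port and file path are placeholders
curl -L "http://namenode-host:50070/webhdfs/v1/user/output/wordcount/part-00000?op=OPEN"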

How to purge artemis queue from command line?

Is there any way to purge Artemis queues? I have already tried purging them by going into cd data/paging under the location where I have installed my Artemis broker.
There is a UI for Artemis called hawtio; though I have deleted all the files in the paging directory, it still shows the messages on the UI, which in the correct case should not be there.
Please suggest.
From the command line, in your broker instance's bin folder:
artemis queue delete --user user --password password --name queue-name
The Artemis broker provides a REST management API (via Jolokia) that users can use to read and change many of the broker's parameters at runtime. Therefore, it's possible to purge a queue from the command line with something like this:
curl -X POST -H "Content-Type: application/json" -d '{ "type": "EXEC", "mbean": "org.apache.activemq.artemis:address=\"test.performance.queue\",broker=\"0.0.0.0\",component=addresses,queue=\"test.performance.queue\",routing-type=\"anycast\",subcomponent=queues", "operation": "removeMessages(java.lang.String)", "arguments": [ "" ] }' http://localhost:8161/jolokia/exec | jq .
In the example above, I am purging the contents of a queue named test.performance.queue on a broker instance named 0.0.0.0. These parameters need to be adjusted for your specific case.
Note that I used jq . simply to make the response JSON prettier (you don't need to do that if you don't care about the response):
{
"request": {
"mbean": "org.apache.activemq.artemis:address=\"test.performance.queue\",broker=\"0.0.0.0\",component=addresses,queue=\"test.performance.queue\",routing-type=\"anycast\",subcomponent=queues",
"arguments": [
""
],
"type": "exec",
"operation": "removeMessages(java.lang.String)"
},
"value": 13001,
"timestamp": 1503740691,
"status": 200
}
Another possibility might be to use the bmic tool, which provides access to several APIs used for managing ActiveMQ 6 and Artemis brokers (disclaimer: I am the maintainer of the tool). Using it, you can do the same thing with this command:
./bmic queue -u admin -p admin -s localhost --name test.performance.queue --purge
One benefit of the tool over the curl command is that you don't need to care about the broker parameters, as the tool will (try to) do the discovery for you.
There are lots of ways to manage an instance of Apache ActiveMQ Artemis. For example, you can use:
JMX via a GUI tool like JConsole or JVisualVM
Web-based console
REST via Jolokia
Management messages (e.g. via core, JMS, AMQP, etc.)
However, you cannot simply delete files out from underneath the broker.

Storing Avro schema in schema registry

I am using Confluent's JDBC connector to send data into Kafka in the Avro format. I need to store this schema in the schema registry, but I'm not sure what format it accepts. I've read the documentation here, but it doesn't mention much.
I have tried this (taking the Avro output and pasting it in - for one int and one string field):
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"type":"struct","fields":[{"type":"int64","optional":true,"field":"id"},{"type":"string","optional":true,"field":"serial"}],"optional":false,"name":"test"}' http://localhost:8081/subjects/view/versions
but I get the error: {"error_code":422,"message":"Unrecognized field: type"}
The schema that you send as JSON should be wrapped in a 'schema' key; the actual schema that you provide will be the value of that key.
So your request should look like this:
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"schema" : "{\"type\":\"string\",\"fields\":[{\"type\":\"int64\",\"optional\":true,\"field\":\"id\"},{\"type\":\"string\",\"optional\":true,\"field\":\"serial\"}],\"optional\":false,\"name\":\"test\"}"}' http://localhost:8081/subjects/view/versions
I've made two other changes to the command:
I've escaped each double quote within the value of the schema key.
I've changed the struct data structure to string. I'm not sure why it isn't taking complex structures though.
Check out how they've modeled the schema here, for the first POST request described in the documentation.
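For what it's worth, the registry expects an actual Avro schema here, and in Avro the complex type is record rather than Connect's struct (int64 likewise becomes long). A sketch of registering a record with the same two fields, keeping the view subject from the question:
# the subject 'view' and the field names are taken from the question
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"schema": "{\"type\":\"record\",\"name\":\"test\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"serial\",\"type\":\"string\"}]}"}' http://localhost:8081/subjects/view/versions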
First, do you need to store the schema in advance? If you use the JDBC connector with the Avro converter (which is part of the schema registry package), the JDBC connector will figure out the schema of the table from the database and register it for you. You will need to specify the converter in your KafkaConnect config file. You can use this as an example: https://github.com/confluentinc/schema-registry/blob/master/config/connect-avro-standalone.properties
If you really want to register the schema yourself, there's some chance the issue is with the shell command - escaping JSON in shell is tricky. I installed Advanced Rest Client in Chrome and use that to work with the REST APIs of both schema registry and KafkaConnect.
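The relevant converter lines in that example file look roughly like this (the localhost URL assumes a locally running schema registry):
# excerpt in the spirit of connect-avro-standalone.properties
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081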

Cannot set more than one metadata value with an OpenStack Swift object

I am trying to set metadata on an object stored in a Swift container. I am using the following command (note that my container is 'container1' and the object is 'employee.json'):
curl -X POST -H "X-Auth-Token:$TOKEN" -H 'X-Object-Meta-metadata1: value' $STORAGE_URL/container1/employee.json
It works fine with one metadata value. But whenever I try to set more than one by issuing several curl commands, only the last metadata value is actually set.
I don't think there should be a limit of only one metadata value per Swift object. Am I doing anything wrong?
FYI: I am using the Havana release of OpenStack Swift.
Thank you.
I think I have figured it out... My bad that I did not read the documentation carefully.
It [1] says: "A POST request will delete all existing metadata added with a previous PUT/POST."
So I tried this and it worked:
curl -X POST -H "X-Auth-Token:$TOKEN" -H 'X-Object-Meta-p1:[P1]' -H 'X-Object-Meta-p2:[P2]' $STORAGE_URL/container1/employee.json
Here, instead of two POST requests, I now set multiple metadata values in a single POST request.
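To confirm that both values stuck, a HEAD request on the object should list every X-Object-Meta-* header (same variables as above):
# HEAD the object; Swift returns the stored X-Object-Meta-* headers
curl -I -H "X-Auth-Token:$TOKEN" $STORAGE_URL/container1/employee.json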
Again, thanks.
Ref:
[1] http://docs.openstack.org/api/openstack-object-storage/1.0/content/update-object-metadata.html