Spark REST API: Failed to find data source: com.databricks.spark.csv

I have a PySpark file stored on S3, and I am trying to run it using the Spark REST API.
I am running the following command:
curl -X POST http://<ip-address>:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "action" : "CreateSubmissionRequest",
  "appArgs" : [ "testing.py" ],
  "appResource" : "s3n://accessKey:secretKey/<bucket-name>/testing.py",
  "clientSparkVersion" : "1.6.1",
  "environmentVariables" : {
    "SPARK_ENV_LOADED" : "1"
  },
  "mainClass" : "org.apache.spark.deploy.SparkSubmit",
  "sparkProperties" : {
    "spark.driver.supervise" : "false",
    "spark.app.name" : "Simple App",
    "spark.eventLog.enabled" : "true",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://<ip-address>:6066",
    "spark.jars" : "spark-csv_2.10-1.4.0.jar",
    "spark.jars.packages" : "com.databricks:spark-csv_2.10:1.4.0"
  }
}'
The testing.py file contains this snippet:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(conf=SparkConf())
myContext = SQLContext(sc)
format = "com.databricks.spark.csv"

# location1, location2 and outLocation are defined elsewhere in the script
dataFrame1 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter", ",").load(location1).repartition(1)
dataFrame2 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter", ",").load(location2).repartition(1)
outDataFrame = dataFrame1.join(dataFrame2, dataFrame1.values == dataFrame2.valuesId)
outDataFrame.write.format(format).option("header", "true").option("nullValue", "").save(outLocation)
But on this line:
dataFrame1 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter",",").load(location1).repartition(1)
I get this exception:
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
I have tried various things; one of them was logging into the <ip-address> machine and running this command:
./bin/spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
so that it would download spark-csv into the .ivy2/cache folder. But that didn't solve the problem. What am I doing wrong?

(Posted on behalf of the OP).
I first added spark-csv_2.10-1.4.0.jar to the driver and worker machines and added these properties:
"spark.driver.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar",
"spark.executor.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar",
Then I got the following error:
java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
Caused by: java.lang.ClassNotFoundException: org.apache.commons.csv.CSVFormat
Then I added commons-csv-1.4.jar on both machines as well and set:
"spark.driver.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",
"spark.executor.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",
And that solved my problem.
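For reference, here is how the whole submission looks with those classpath entries folded into sparkProperties, written as a small Python sketch using the requests library (the jar paths are placeholders; everything else mirrors the original curl request):
# Sketch: the same REST submission, now carrying the extraClassPath properties.
# The jar paths are placeholders; adjust them to where the jars live on your machines.
import requests

payload = {
    "action": "CreateSubmissionRequest",
    "appArgs": ["testing.py"],
    "appResource": "s3n://accessKey:secretKey/<bucket-name>/testing.py",
    "clientSparkVersion": "1.6.1",
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "mainClass": "org.apache.spark.deploy.SparkSubmit",
    "sparkProperties": {
        "spark.driver.supervise": "false",
        "spark.app.name": "Simple App",
        "spark.eventLog.enabled": "true",
        "spark.submit.deployMode": "cluster",
        "spark.master": "spark://<ip-address>:6066",
        # The two properties that fixed the ClassNotFoundException:
        "spark.driver.extraClassPath": "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",
        "spark.executor.extraClassPath": "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",
    },
}

response = requests.post(
    "http://<ip-address>:6066/v1/submissions/create",
    json=payload,
    headers={"Content-Type": "application/json;charset=UTF-8"},
)
print(response.json())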

Related

Spark Job SUBMITTED but not RUNNING after submit via REST API

Following the instructions on this website, I'm trying to submit a job to Spark via the REST API /v1/submissions.
I tried to submit the SparkPi example:
$ ./create.sh
{
"action" : "CreateSubmissionResponse",
"message" : "Driver successfully submitted as driver-20211212044718-0003",
"serverSparkVersion" : "3.1.2",
"submissionId" : "driver-20211212044718-0003",
"success" : true
}
$ ./status.sh driver-20211212044718-0003
{
"action" : "SubmissionStatusResponse",
"driverState" : "SUBMITTED",
"serverSparkVersion" : "3.1.2",
"submissionId" : "driver-20211212044718-0003",
"success" : true
}
create.sh:
curl -X POST http://172.17.197.143:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "appResource": "/home/ruc/spark-3.1.2/examples/jars/spark-examples_2.12-3.1.2.jar",
  "sparkProperties": {
    "spark.master": "spark://172.17.197.143:7077",
    "spark.driver.memory": "1g",
    "spark.driver.cores": "1",
    "spark.app.name": "REST API - PI",
    "spark.jars": "/home/ruc/spark-3.1.2/examples/jars/spark-examples_2.12-3.1.2.jar",
    "spark.driver.supervise": "true"
  },
  "clientSparkVersion": "3.1.2",
  "mainClass": "org.apache.spark.examples.SparkPi",
  "action": "CreateSubmissionRequest",
  "environmentVariables": {
    "SPARK_ENV_LOADED": "1"
  },
  "appArgs": [
    "400"
  ]
}'
status.sh:
export DRIVER_ID=$1
curl http://172.17.197.143:6066/v1/submissions/status/$DRIVER_ID
But when I try to get the status of the job (even after a few minutes), I get "SUBMITTED" rather than "RUNNING" or "FINISHED".
Then I looked at the log and found:
21/12/12 04:47:18 INFO master.Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper
21/12/12 04:47:18 WARN master.Master: Driver driver-20211212044718-0003 requires more resource than any of Workers could have.
# ...
21/12/12 04:49:02 WARN master.Master: Driver driver-20211212044718-0003 requires more resource than any of Workers could have.
However, in my spark-env.sh, I have
export SPARK_WORKER_MEMORY=10g
export SPARK_WORKER_CORES=2
I have no idea what happened. How can I make it run normally?
Since you've checked the resources and you have enough, it might be a network issue: the executor may not be able to connect back to the driver program. Allow traffic on both the master and the workers.
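One quick way to double-check what the master actually sees is to query its web UI JSON endpoint; a minimal Python sketch, assuming the standalone master UI is on its default port 8080 and exposes /json:
# Sketch: list the workers the standalone master has registered, with the
# resources they report. Assumes the master web UI runs on its default
# port (8080) and serves the /json endpoint.
import requests

state = requests.get("http://172.17.197.143:8080/json").json()
for worker in state.get("workers", []):
    print(worker)  # each entry includes the worker's cores and memory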

Connection issue: Databricks - Snowflake

I am trying to connect to Snowflake from a Databricks notebook through the externalbrowser authenticator, but without any success.
CMD1
sfOptions = {
"sfURL" : "xxxxx.west-europe.azure.snowflakecomputing.com",
"sfAccount" : "xxxxx",
"sfUser" : "ivan.lorencin#xxxxx",
"authenticator" : "externalbrowser",
"sfPassword" : "xxxxx",
"sfDatabase" : "DWH_PROD",
"sfSchema" : "APLSDB",
"sfWarehouse" : "SNOWFLAKExxxxx",
"tracing" : "ALL",
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
CMD2
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
.options(**sfOptions) \
.option("query", "select 1 as my_num union all select 2 as my_num") \
.load()
CMD2 never completes; it just shows ".. Running command ..." forever.
Can anybody help with what is going wrong here? How can I establish a connection?
It looks like you're setting authenticator to externalbrowser, but according to the docs it should be sfAuthenticator - is this intentional? If you are trying to do an OAuth type of auth, why do you also have a password?
If your account/user requires OAuth to log in, I'd remove the password entry from sfOptions, rename that one entry to sfAuthenticator, and try again.
If that does not work, you should ensure that your Spark cluster can reach out to all the required Snowflake hosts (see SnowCD for assistance).
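In code, the suggestion above would look roughly like this (a sketch only: the key renamed to sfAuthenticator and the password entry dropped, with spark being the session a Databricks notebook already provides):
# Sketch of CMD1/CMD2 with the suggested changes applied: "authenticator"
# renamed to "sfAuthenticator" and sfPassword removed for the externalbrowser flow.
sfOptions = {
    "sfURL": "xxxxx.west-europe.azure.snowflakecomputing.com",
    "sfAccount": "xxxxx",
    "sfUser": "ivan.lorencin#xxxxx",
    "sfAuthenticator": "externalbrowser",
    "sfDatabase": "DWH_PROD",
    "sfSchema": "APLSDB",
    "sfWarehouse": "SNOWFLAKExxxxx",
}

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

# `spark` is the SparkSession that Databricks notebooks provide.
df = (spark.read.format(SNOWFLAKE_SOURCE_NAME)
      .options(**sfOptions)
      .option("query", "select 1 as my_num union all select 2 as my_num")
      .load())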

Cassandra Sink Connector for Confluent Platform

I am trying to run the Cassandra sink connector for the Confluent Platform. The cassandra-sink.json file is as below:
{
  "name" : "cassandra-sink",
  "config" : {
    "connector.class" : "io.confluent.connect.cassandra.CassandraSinkConnector",
    "tasks.max" : "1",
    "topics" : "topic1",
    "cassandra.contact.points" : "127.0.0.1",
    "cassandra.keyspace" : "test",
    "confluent.topic.bootstrap.servers" : "127.0.0.1:9092",
    "cassandra.write.mode" : "Update",
    "connect.cassandra.port" : "127.0.0.1:9042"
  }
}
I installed the connector with confluent-hub install confluentinc/kafka-connect-cassandra:latest as per the link.
I am able to load the file, but when I check the status I get the error below. I am unable to figure out what the issue is.
FAILED worker_id:127.0.0.1:8083,trace:com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed
com.datastax.driver.core.exceptions.TransportException: [/127.0.0.1:9042] Cannot connect
com.datastax.driver.core.ControlConnection.reconnectInternal
com.datastax.driver.core.ControlConnection.connect
com.datastax.driver.core.Cluster$Manager.negotiateProtocolVersionAndConnect
com.datastax.driver.core.Cluster$Manager.init
com.datastax.driver.core.Cluster.init
com.datastax.driver.core.SessionManager.initAsync
com.datastax.driver.core.SessionManager.executeAsync
com.datastax.driver.core.AbstractSession.execute
io.confluent.connect.cassandra.CassandraSessionImpl.executeStatement
io.confluent.connect.cassandra.CassandraSinkConnector.doStart
io.confluent.connect.cassandra.CassandraSinkConnector.start
org.apache.kafka.connect.runtime.WorkerConnector.doStart
org.apache.kafka.connect.runtime.WorkerConnector.start
org.apache.kafka.connect.runtime.WorkerConnector.transitionTo
org.apache.kafka.connect.runtime.Worker.startConnector
org.apache.kafka.connect.runtime.distributed.DistributedHerder.startConnector
org.apache.kafka.connect.runtime.distributed.DistributedHerder.access$1300
org.apache.kafka.connect.runtime.distributed.DistributedHerder$14
org.apache.kafka.connect.runtime.distributed.DistributedHerder$14
java.util.concurrent.FutureTask.run
java.util.concurrent.ThreadPoolExecutor.runWorker
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run
Please guide.

How to convert Livy curl call to Livy Rest API call

I am getting started with Livy. In my setup, the Livy server is running on a Unix machine, I am able to curl it and execute jobs. I have created a fat jar, uploaded it to HDFS, and I am simply calling its main method from Livy. My JSON payload for Livy looks like this:
{
  "file" : "hdfs:///user/data/restcheck/spark_job_2.11-3.0.0-RC1-SNAPSHOT.jar",
  "proxyUser" : "test_user",
  "className" : "com.local.test.spark.pipeline.path.LivyTest",
  "files" : ["hdfs:///user/data/restcheck/hivesite.xml", "hdfs:///user/data/restcheck/log4j.properties"],
  "driverMemory" : "5G",
  "executorMemory" : "10G",
  "executorCores" : 5,
  "numExecutors" : 10,
  "queue" : "user.queue",
  "name" : "LivySampleTest2",
  "conf" : {
    "spark.master" : "yarn",
    "spark.executor.extraClassPath" : "/etc/hbase/conf/",
    "spark.executor.extraJavaOptions" : "-Dlog4j.configuration=file:log4j.properties",
    "spark.driver.extraJavaOptions" : "-Dlog4j.configuration=file:log4j.properties",
    "spark.ui.port" : 4100,
    "spark.port.maxRetries" : 100,
    "JAVA_HOME" : "/usr/java/jdk1.8.0_60",
    "HADOOP_CONF_DIR" : "/etc/hadoop/conf:/etc/hive/conf:/etc/hbase/conf",
    "HIVE_CONF_DIR" : "/etc/hive/conf"
  }
}
and below is my curl call to it:
curl -X POST --negotiate -u:"test_user" --data @/user/data/Livy/SampleFile.json -H "Content-Type: application/json" https://livyhost:8998/batches
I am trying to convert this to a REST API call, following the WordCount example provided by Cloudera, but I am not able to convert my curl call to the REST API. I have all the jars already added in HDFS, so I don't think I need to do the upload-jar call.
It should work with curl as well. Please try the JSON below:
curl -H "Content-Type: application/json" https://livyhost:8998/batches \
  -X POST --data '{
  "name" : "LivyREST",
  "className" : "com.local.test.spark.pipeline.path.LivyTest",
  "file" : "/user/data/restcheck/spark_job_2.11-3.0.0-RC1-SNAPSHOT.jar"
}'
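If you would rather call the REST endpoint from code instead of curl, the same request can be made with Python's requests library (a sketch; if your Livy server needs the SPNEGO/Kerberos auth implied by --negotiate, pass a suitable handler such as requests_kerberos.HTTPKerberosAuth via the auth= argument):
# Sketch: the same POST to Livy's /batches endpoint, without curl.
import requests

payload = {
    "name": "LivyREST",
    "className": "com.local.test.spark.pipeline.path.LivyTest",
    "file": "/user/data/restcheck/spark_job_2.11-3.0.0-RC1-SNAPSHOT.jar",
}

response = requests.post(
    "https://livyhost:8998/batches",
    json=payload,
    headers={"Content-Type": "application/json"},
)
print(response.status_code, response.json())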
Also, I am adding some more references
http://gethue.com/how-to-use-the-livy-spark-rest-job-server-api-for-submitting-batch-jar-python-and-streaming-spark-jobs/

Unable to percolate: org.elasticsearch.index.percolator.PercolatorException: [myindex] failed to parse query [myDesignatedQueryName]

I am following this guide and converting the percolate API Java code to Scala, but when I run it in SBT it throws the following exception:
[error] (run-main-0) org.elasticsearch.index.percolator.PercolatorException: [myindex] failed to parse query [myDesignatedQueryName]
org.elasticsearch.index.percolator.PercolatorException: [myindex] failed to parse query [myDesignatedQueryName]
at org.elasticsearch.index.percolator.PercolatorQueriesRegistry.parsePercolatorDocument(PercolatorQueriesRegistry.java:194)
at org.elasticsearch.index.percolator.PercolatorQueriesRegistry$RealTimePercolatorOperationListener.preIndex(PercolatorQueriesRegistry.java:309)
at org.elasticsearch.index.indexing.ShardIndexingService.preIndex(ShardIndexingService.java:139)
at org.elasticsearch.index.shard.service.InternalIndexShard.index(InternalIndexShard.java:420)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:193)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:511)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:419)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.query.QueryParsingException: [myindex] Strict field resolution and no field mapping can be found for the field with name [content]
at org.elasticsearch.index.query.QueryParseContext.failIfFieldMappingNotFound(QueryParseContext.java:393)
at org.elasticsearch.index.query.QueryParseContext.smartFieldMappers(QueryParseContext.java:372)
at org.elasticsearch.index.query.TermQueryParser.parse(TermQueryParser.java:95)
at org.elasticsearch.index.query.QueryParseContext.parseInnerQuery(QueryParseContext.java:277)
at org.elasticsearch.index.query.IndexQueryParserService.parseInnerQuery(IndexQueryParserService.java:321)
at org.elasticsearch.index.percolator.PercolatorQueriesRegistry.parseQuery(PercolatorQueriesRegistry.java:207)
at org.elasticsearch.index.percolator.PercolatorQueriesRegistry.parsePercolatorDocument(PercolatorQueriesRegistry.java:191)
at org.elasticsearch.index.percolator.PercolatorQueriesRegistry$RealTimePercolatorOperationListener.preIndex(PercolatorQueriesRegistry.java:309)
at org.elasticsearch.index.indexing.ShardIndexingService.preIndex(ShardIndexingService.java:139)
at org.elasticsearch.index.shard.service.InternalIndexShard.index(InternalIndexShard.java:420)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:193)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:511)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:419)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[trace] Stack trace suppressed: run last compile:run for the full output.
Here is my code:
import org.elasticsearch.node.NodeBuilder.nodeBuilder
import org.elasticsearch.index.query.QueryBuilders
import org.elasticsearch.common.xcontent.XContentFactory
import org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder

object PercolateApiES extends App {
  val node = nodeBuilder().client(true).node()
  val client = node.client()

  val qb = QueryBuilders.termQuery("content", "amazing")
  val res = client.prepareIndex("myindex", ".percolator", "myDesignatedQueryName")
    .setSource(jsonBuilder()
      .startObject()
      .field("query", qb) // Register the query
      .endObject())
    .setRefresh(true) // Needed when the query shall be available immediately
    .execute().actionGet()

  val docBuilder = XContentFactory.jsonBuilder().startObject()
  docBuilder.field("doc").startObject() // This is needed to designate the document
  docBuilder.field("content", "This is amazing!")
  docBuilder.endObject() // End of the doc field
  docBuilder.endObject() // End of the JSON root object

  // Percolate
  val response = client.preparePercolate()
    .setIndices("myindex")
    .setDocumentType("myDocumentType")
    .setSource(docBuilder).execute().actionGet()

  node.close
}
When I run these commands using curl, they work fine:
curl -XPUT 'localhost:9200/myindex1' -d '{
  "mappings": {
    "mytype": {
      "properties": {
        "content": {
          "type": "string"
        }
      }
    }
  }
}'
{"acknowledged":true}
curl -XPUT 'localhost:9200/myindex1/.percolator/myDesignatedQueryName' -d '{
  "query" : {
    "term" : {
      "content" : "amazing"
    }
  }
}'
{"_index":"myindex1","_type":".percolator","_id":"myDesignatedQueryName","_version":1,"created":true}
curl -XGET 'localhost:9200/myindex1/content/_percolate' -d '{
  "doc" : {
    "content" : "This is amazing!"
  }
}'
{"took":231,"_shards":{"total":5,"successful":5,"failed":0},"total":1,"matches":[{"_index":"myindex1","_id":"myDesignatedQueryName"}]}
I am using Elasticsearch 1.4.1. Please help me figure out where I am making a mistake. I also want to see the results, as I don't know how to get at this part in code:
matches":[{"_index":"myindex1","_id":"myDesignatedQueryName"}
How can I fetch the result? Please help and guide me. Thanks.
ES 1.4.0.Beta1 has introduced a breaking change which affects the percolator API. Basically, we need to set the index.query.parse.allow_unmapped_fields explicitly to true. Refer to http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-dynamic-mapping.html#_unmapped_fields_in_queries for details.
The actual conversation on adding this change is in their Github https://github.com/elasticsearch/elasticsearch/issues/6664.
Here is another related issue: index.query.parse.allow_unmapped_fields setting does not seem to allow unmapped fields in alias filters
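If you want the setting applied for a single index rather than cluster-wide in elasticsearch.yml, something along these lines can be tried when (re)creating the index (a sketch only, assuming a local 1.4.x node honours index.query.parse.allow_unmapped_fields as an index-level setting):
# Sketch: create the index with the setting from the answer above plus the
# "content" mapping used in the curl examples, then register the percolator query.
import json
import requests

index_body = {
    "settings": {"index.query.parse.allow_unmapped_fields": True},
    "mappings": {"mytype": {"properties": {"content": {"type": "string"}}}},
}
print(requests.put("http://localhost:9200/myindex", data=json.dumps(index_body)).json())

query_body = {"query": {"term": {"content": "amazing"}}}
print(requests.put("http://localhost:9200/myindex/.percolator/myDesignatedQueryName",
                   data=json.dumps(query_body)).json())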