Error while submitting PySpark Application through Livy REST API

I want to submit a PySpark application to Livy through its REST API to invoke the Hive Warehouse Connector. Based on this answer in the Cloudera community
https://community.cloudera.com/t5/Community-Articles/How-to-Submit-Spark-Application-through-Livy-REST-API/ta-p/247502
I created a test1.json as follows:
{
"jars": ["hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar"],
"pyFiles": ["pyspark_hwc-1.0.0.3.1.0.0-78.zip"],
"file": ["test1.py"]
}
and call it with InvokeHTTP, but I get this error:
"Cannot deserialize instance of java.lang.String out of START_ARRAY token\n at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 1, column: 224] (through reference chain: org.apache.livy.server.batch.CreateBatchRequest[\"file\"
I think the 'file' field with test1.py is wrong. Can anyone tell me how to submit this?
The script works with a simple spark-submit test1.py.
All suggestions are welcome.

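The following works. The original request fails because "file" must be a single string, not an array, which is what the START_ARRAY error is reporting.
For basic Hive access, use the JSON below: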
{
    "file":"hdfs-path/test1.py"
}
For Hive LLAP access, use JSON like this:
{
"jars": ["<path-to-jar>/hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar"],
"pyFiles": ["<path-to-zip>/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.0.0-78.zip"],
"file": "<path-to-file>/test3.py"
}
Interestingly, when I put the zip in the "archives" field it gives an error. It works in the "pyFiles" field, though, as shown above.
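A minimal sketch, in Python, of building the batch payload so that "file" stays a single string while "jars" and "pyFiles" stay lists. The helper name build_batch_payload is an illustrative assumption, the paths are the same placeholders as in the JSON above, and the request to Livy is only prepared here, not sent.

```python
import json


def build_batch_payload(file, jars=None, py_files=None):
    """Build the JSON body for a Livy batch submission.

    `file` must be one path (a string); `jars` and `pyFiles` are lists.
    """
    if not isinstance(file, str):
        raise TypeError(
            "'file' must be a single string path, not %s" % type(file).__name__
        )
    payload = {"file": file}
    if jars:
        payload["jars"] = list(jars)
    if py_files:
        payload["pyFiles"] = list(py_files)
    return payload


# Placeholder paths, mirroring the JSON above.
payload = build_batch_payload(
    "<path-to-file>/test3.py",
    jars=["<path-to-jar>/hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar"],
    py_files=["<path-to-zip>/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.0.0-78.zip"],
)
body = json.dumps(payload)  # this string is what would be sent as the request body
```

Passing a list for file, as in the original test1.json, raises a TypeError locally instead of producing the deserialization error on the server.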

Related

Google Cloud Data Fusion -- building pipeline from REST API endpoint source

I am attempting to build a pipeline that reads from a third-party REST API endpoint.
I am using the HTTP plugin (version 1.2.0) found in the Hub.
The request URL is: https://api.example.io/v2/somedata?return_count=false
A sample of the response body:
{
"paging": {
"token": "12456789",
"next": "https://api.example.io/v2/somedata?return_count=false&__paging_token=123456789"
},
"data": [
{
"cID": "aerrfaerrf",
"first": true,
"_id": "aerfaerrfaerrf",
"action": "aerrfaerrf",
"time": "1970-10-09T14:48:29+0000",
"email": "example@aol.com"
},
{...}
]
}
The main error in the logs is:
java.lang.NullPointerException: null
at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.getNextPage(BaseHttpPaginationIterator.java:118) ~[1580429892615-0/:na]
at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.ensurePageIterable(BaseHttpPaginationIterator.java:161) ~[1580429892615-0/:na]
at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.hasNext(BaseHttpPaginationIterator.java:203) ~[1580429892615-0/:na]
at io.cdap.plugin.http.source.batch.HttpRecordReader.nextKeyValue(HttpRecordReader.java:60) ~[1580429892615-0/:na]
at io.cdap.cdap.etl.batch.preview.LimitingRecordReader.nextKeyValue(LimitingRecordReader.java:51) ~[cdap-etl-core-6.1.1.jar:na]
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:128) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.Task.run(Task.scala:109) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) [spark-core_2.11-2.3.3.jar:2.3.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_232]
Possible issues
After trying to troubleshoot this for a while, I think the issue might be with one of the following.
Pagination
- The Data Fusion HTTP plugin has a lot of methods to deal with pagination.
- Based on the response body above, it seems like the best option for Pagination Type is Link in Response Body.
- For the required Next Page JSON/XML Field Path parameter, I've tried $.paging.next and paging/next. Neither works.
- I have verified that the link in /paging/next works when opened in Chrome.
Authentication
- When simply trying to view the request URL in Chrome, a prompt pops up asking for a username and password.
- I only need to enter the API key as the username to get past this prompt in Chrome.
- To do this in the Data Fusion HTTP plugin, the API key is used as the Username in the Basic Authentication section.
Has anyone had any success creating a pipeline in Google Cloud Data Fusion where the data source is a REST API?
In answer to
Anyone have any success in creating a pipeline in Google Cloud Data Fusion where the data source is a REST API?
This is not the optimal way to achieve this. A better approach would be to ingest the data from the service API (see Service APIs Overview) into Pub/Sub and then use Pub/Sub as the source for your pipeline. This provides a simple and reliable staging location for your data on its way to processing, storage, and analysis; see the documentation for the Pub/Sub API. To use this in conjunction with Dataflow, follow the steps in the official documentation, Using Pub/Sub with Dataflow.
I think your problem is in the data format that you receive. The exception:
java.lang.NullPointerException: null
occurs when you do not specify a correct output schema (no schema in this case I believe)
Solution 1
To solve it, try configuring the HTTP Data Fusion plugin to:
Receive format: Text.
Output Schema: name: user Type: String
This should work to obtain the response from the API in string format. Once that is done, use a JSONParser to convert the string into a table-like object.
Solution 2
Configure the HTTP Data Fusion plugin to:
Receive format: json
JSON/XML Result Path : data
JSON/XML Fields Mapping: include the fields you presented (see attached photo).
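To check outside Data Fusion that the records really sit under data and the next link under paging.next, a minimal Python sketch of "link in response body" pagination may help. This is not the plugin's implementation; fetch_page is a hypothetical stand-in for the real HTTP call, the two fake pages are made up for illustration, and the assumption that the last page carries no paging.next is mine.

```python
def iter_records(first_url, fetch_page):
    """Yield every record, following paging.next until it is absent.

    `fetch_page` is a hypothetical callable: URL -> parsed JSON dict.
    """
    url = first_url
    seen = set()  # guards against a next link that loops back
    while url and url not in seen:
        seen.add(url)
        page = fetch_page(url)
        for record in page.get("data", []):
            yield record
        url = (page.get("paging") or {}).get("next")


FIRST = "https://api.example.io/v2/somedata?return_count=false"
SECOND = FIRST + "&__paging_token=123456789"

# Two fake pages standing in for the real API.
FAKE_PAGES = {
    FIRST: {"paging": {"next": SECOND}, "data": [{"_id": "a"}, {"_id": "b"}]},
    SECOND: {"paging": {}, "data": [{"_id": "c"}]},
}

records = list(iter_records(FIRST, FAKE_PAGES.__getitem__))
```

If this loop works against the real endpoint with your credentials but the plugin still fails, the problem is more likely in the plugin configuration (field path or output schema) than in the API.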

Authenticate with ECE Elasticsearch Sink from Apache Flink (Scala code)

I get a compiler error when using the example provided in the Flink documentation. The documentation provides sample Scala code to set the REST client factory parameters when talking to Elasticsearch: https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/elasticsearch.html.
When trying out this code I get a compiler error in IntelliJ which says "Cannot resolve symbol restClientBuilder".
I found the following SO question, which is EXACTLY my problem except that it is in Java and I am doing this in Scala:
Apache Flink (v1.6.0) authenticate Elasticsearch Sink (v6.4)
I tried copy-pasting the solution code from that question into IntelliJ, but the auto-converted code also has compiler errors.
// provide a RestClientFactory for custom configuration on the internally created REST client
// I only show setMaxRetryTimeoutMillis for illustration purposes; the actual code will use a custom HTTP callback
esSinkBuilder.setRestClientFactory(
restClientBuilder -> {
restClientBuilder.setMaxRetryTimeoutMillis(10)
}
)
Then I tried this (Java-to-Scala code auto-generated by IntelliJ):
// provide a RestClientFactory for custom configuration on the internally created REST client
import org.apache.http.auth.AuthScope
import org.apache.http.auth.UsernamePasswordCredentials
import org.apache.http.client.CredentialsProvider
import org.apache.http.impl.client.BasicCredentialsProvider
import org.apache.http.impl.nio.client.HttpAsyncClientBuilder
import org.elasticsearch.client.RestClientBuilder
esSinkBuilder.setRestClientFactory((restClientBuilder) => {
def foo(restClientBuilder) = restClientBuilder.setHttpClientConfigCallback(new RestClientBuilder.HttpClientConfigCallback() {
override def customizeHttpClient(httpClientBuilder: HttpAsyncClientBuilder): HttpAsyncClientBuilder = { // elasticsearch username and password
val credentialsProvider = new BasicCredentialsProvider
credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(es_user, es_password))
httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider)
}
})
foo(restClientBuilder)
})
The original code snippet produces the error "cannot resolve RestClientFactory", and the Java-to-Scala conversion shows several other errors.
So basically I need a Scala version of the solution described in Apache Flink (v1.6.0) authenticate Elasticsearch Sink (v6.4).
Update 1: I was able to make some progress with help from IntelliJ. The following code compiles and runs, but there is another problem.
esSinkBuilder.setRestClientFactory(
  new RestClientFactory {
    override def configureRestClientBuilder(restClientBuilder: RestClientBuilder): Unit = {
      restClientBuilder.setHttpClientConfigCallback(new RestClientBuilder.HttpClientConfigCallback() {
        override def customizeHttpClient(httpClientBuilder: HttpAsyncClientBuilder): HttpAsyncClientBuilder = {
          // elasticsearch username and password
          val credentialsProvider = new BasicCredentialsProvider
          credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(es_user, es_password))
          httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider)
          httpClientBuilder.setSSLContext(trustfulSslContext)
        }
      })
    }
  }
)
The problem is that I am not sure whether I should be creating a new RestClientFactory object. The application connects to the Elasticsearch cluster but then discovers that the SSL certificate is not valid, so I had to add the trustfulSslContext (as described here: https://gist.github.com/iRevive/4a3c7cb96374da5da80d4538f3da17cb). This got me past the SSL issue, but now the ES REST client does a ping test, the ping fails, an exception is thrown, and the app shuts down. I suspect the ping fails because of the SSL error and that it may not be using the trustfulSslContext I set up as part of the new RestClientFactory. That makes me think I should not have created a new one and that there should be a simple way to update the existing RestClientFactory object. All of this is probably down to my lack of Scala knowledge.
Happy to report that this is resolved. The code I posted in Update 1 is correct. The ping to ECE was not working for two reasons:
1) The certificate needs to include the complete chain: the root CA, the intermediate CA, and the cert for the ECE. Fixing this removed the need for the whole trustfulSslContext workaround.
2) The ECE was sitting behind an HAProxy, which mapped the hostname in the HTTP request to the actual deployment cluster name in ECE. The mapping logic did not take into account that the Java High Level REST Client uses the org.apache.http.HttpHost class, which builds the host as hostname:port_number even when the port number is 443. Because of the :443 suffix no mapping was found, so the ECE returned a 404 error instead of 200 OK (the only way to find this was to look at unencrypted packets at the HAProxy). Once the mapping logic in HAProxy was fixed, the mapping was found and the pings are now successful.
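The second cause comes down to the host arriving as hostname:443 rather than plain hostname. A minimal Python sketch of the kind of normalization a mapping layer could apply before looking up the cluster name is shown below. It is an illustration only, not HAProxy configuration; normalize_host and the sample hostname are hypothetical.

```python
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_host(host_header, scheme="https"):
    """Strip the port from a Host header value when it is the scheme's default."""
    host, sep, port = host_header.rpartition(":")
    if sep and port.isdigit():
        if int(port) == DEFAULT_PORTS.get(scheme):
            return host.lower()  # "name:443" over https maps like "name"
        return "%s:%s" % (host.lower(), port)  # keep non-default ports
    return host_header.lower()  # no port present


with_default_port = normalize_host("my-cluster.example.com:443")
with_custom_port = normalize_host("my-cluster.example.com:9243")
```

With this in place, a request carrying my-cluster.example.com:443 and one carrying my-cluster.example.com resolve to the same mapping key.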

How to create and call cloud code functions on parse-server?

I'm trying to run a very basic cloud code function on my parse-server and I get the same error every time: 141 Invalid function. I'm just adding a main.js file with my function in the cloud directory and trying to call it using Postman, but it looks like the file is never loaded.
I've tried locally and in a Docker container. Whether the function exists or not, I get the same result, and I tried restarting the container after adding the code. I also tried adding a body to the request with parameters like master and functionName.
Here's my cloud code function (cloud/main.js):
Parse.Cloud.define('hello', function(req, res) {
return "function called";
});
Calling the function with a POST request on https://myurl/parse/functions/hello
and getting:
{
"code": 141,
"error": "Invalid function: \"hello\""
}
The response object has been removed from Parse Server Cloud Code post v3.0.0.
Your Cloud Code function should look like this...
Parse.Cloud.define("hello", async (request) => {
return "function called";
});
Please read the migration guide for more details on updating your cloud code to v3.0.0 or above.
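For the calling side, here is a minimal Python sketch of assembling the POST request. It assumes the server is mounted at /parse and identifies the app with the X-Parse-Application-Id header; "myAppId" and the helper name build_function_call are placeholders, and the request is only built here, not sent.

```python
import json


def build_function_call(server_url, function_name, app_id, params=None):
    """Return (url, headers, body) for calling a Cloud Code function over REST."""
    url = "%s/functions/%s" % (server_url.rstrip("/"), function_name)
    headers = {
        "X-Parse-Application-Id": app_id,  # placeholder app id
        "Content-Type": "application/json",
    }
    # Function parameters go in the JSON body; an empty object means no parameters.
    body = json.dumps(params or {}).encode("utf-8")
    return url, headers, body


url, headers, body = build_function_call("https://myurl/parse", "hello", "myAppId")
```

The function name belongs in the URL path, so there is no need for a functionName field in the request body.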

Fetching data from json file in protractor

I am using the Protractor testing framework to test an AngularJS application.
I am stuck on a scenario where I want to read data (e.g. test URL, username, password) from an external JSON file.
I have created a JSON file for this and written code to read the data from it, but I am getting this error:
DeprecationWarning: os.tmpDir() is deprecated. Use os.tmpdir() instead.
[11:03:10] I/launcher - Running 1 instances of WebDriver
[11:03:10] I/local - Starting selenium standalone server...
[11:03:12] I/local - Selenium standalone server started at http://10.212.134.201:59697/wd/hub
[11:03:15] E/launcher - Error: Error: Cannot find module 'E:LAM WAH EE_Testing EnviornmentDetailed_Sheet - Copy.geojson'
Here Detailed_Sheet - Copy.geojson is the file in which I have put the URL, username, and password.
If anyone has an idea about this, please help me understand where my mistake is.
Please see the example below. It assumes we have two files in the same folder:
1) login.auth.json, with the following content:
{
"loginurl": "http://172.16.99.47:3001",
"username": "azizi@hlwe.com",
"password": "abcd#1234"
}
Note: for a JSON file you do not need module.exports or exports.xxx to export the data. You only need to write valid JSON in the file and import it with require(); the call returns the parsed JSON data.
2) test.js, with the following content:
var auth = require('./login.auth.json');

describe('Open the clinicare website by logging into the site', function() {
  it('Should Add a text in username and password fields and hit login button', function() {
    browser.driver.manage().window().maximize();
    browser.get(auth.loginurl);
    // Perform Login: UserName
    element(by.model('accessCode')).sendKeys(auth.username);
    // Perform Login: Password
    element(by.model('password')).sendKeys(auth.password);
    // Perform Login: LoginButton
    element(by.css('.btn.btn-primary.pull-right')).click();
  });
});

How to get all jobs status through spark REST API?

I am using Spark 1.5.1 and I'd like to retrieve the status of all jobs through the REST API.
I get the correct result from /api/v1/applications/{appId}, but when I access the jobs at /api/v1/applications/{appId}/jobs I get a "no such app:{appID}" response.
How should I pass the app ID here to retrieve the job status of an application using the Spark REST API?
Spark provides four hidden RESTful APIs:
1) Submit a job - curl -X POST http://SPARK_MASTER_IP:6066/v1/submissions/create
2) Kill a job - curl -X POST http://SPARK_MASTER_IP:6066/v1/submissions/kill/driver-id
3) Check the status of a job - curl http://SPARK_MASTER_IP:6066/v1/submissions/status/driver-id
4) Status of the Spark cluster - http://SPARK_MASTER_IP:8080/json/
If you want to use other APIs, you can try Livy or Lucidworks:
https://doc.lucidworks.com/fusion/3.0/Spark_ML/Spark-Getting-Started.html
This is supposed to work when accessing a live driver's API endpoints, but since you're using Spark 1.5.x I think you're running into SPARK-10531, a bug where the Spark Driver UI incorrectly mixes up application names and application ids. As a result, you have to use the application name in the REST API url, e.g.
http://localhost:4040/api/v1/applications/Spark%20shell/jobs
According to the JIRA ticket, this only affects the Spark Driver UI; application IDs should work as expected with the Spark History Server's API endpoints.
This is fixed in Spark 1.6.0, which should be released soon. If you want a workaround which should work on all Spark versions, though, then the following approach should work:
The api/v1/applications endpoint misreports application names as application ids, so you should be able to hit that endpoint, extract the id field (which is actually an application name), then use that to construct the URL for the current application's job list. (The /applications endpoint will only ever return a single application in the Spark Driver UI, which is why this approach should be safe; because of this, we don't have to worry about the non-uniqueness of application names.) For example, in Spark 1.5.2 the /applications endpoint can return a response which contains a record like
{
id: "Spark shell",
name: "Spark shell",
attempts: [
{
startTime: "2015-09-10T06:38:21.528GMT",
endTime: "1969-12-31T23:59:59.999GMT",
sparkUser: "",
completed: false
}]
}
If you use the contents of this id field to construct the applications/<id>/jobs URL then your code should be future-proofed against upgrades to Spark 1.6.0, since the id field will begin reporting the proper IDs in Spark 1.6.0+.
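A minimal Python sketch of that workaround: take the id field from one record of the /applications response, URL-encode it, and build the jobs URL. The helper name jobs_url is an illustrative assumption, and fetching the /applications response itself is left out.

```python
from urllib.parse import quote


def jobs_url(base_url, application):
    """Build the jobs URL from one record of the /applications response."""
    # The id may really be an application name containing spaces, so encode it.
    app_id = quote(application["id"], safe="")
    return "%s/api/v1/applications/%s/jobs" % (base_url.rstrip("/"), app_id)


record = {"id": "Spark shell", "name": "Spark shell"}
url = jobs_url("http://localhost:4040", record)
```

Because the id is used verbatim, the same code keeps working once the endpoint starts returning real application IDs.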
For those who have this problem and are running on YARN:
According to the docs,
when running in YARN cluster mode, [app-id] will actually be [base-app-id]/[attempt-id], where [base-app-id] is the YARN application ID
So if your call to https://HOST:PORT/api/v1/applications/application_12345678_0123 returns something like
{
"id" : "application_12345678_0123",
"name" : "some_name",
"attempts" : [ {
"attemptId" : "1",
<...snip...>
} ]
}
you can get eg. jobs by calling
https://HOST:PORT/api/v1/applications/application_12345678_0123/1/jobs
(note the "1" before "/jobs").
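The same URL construction can be sketched in Python, inserting the attempt ID when the application record lists one. The helper name yarn_jobs_urls is hypothetical, and HOST:PORT is the same placeholder as above.

```python
def yarn_jobs_urls(base_url, application):
    """Return one jobs URL per attempt: .../applications/<id>/<attemptId>/jobs."""
    base = "%s/api/v1/applications/%s" % (base_url.rstrip("/"), application["id"])
    attempts = [
        a["attemptId"] for a in application.get("attempts", []) if "attemptId" in a
    ]
    if not attempts:
        # No attempt IDs reported: fall back to the plain form.
        return [base + "/jobs"]
    return ["%s/%s/jobs" % (base, attempt_id) for attempt_id in attempts]


app = {
    "id": "application_12345678_0123",
    "name": "some_name",
    "attempts": [{"attemptId": "1"}],
}
urls = yarn_jobs_urls("https://HOST:PORT", app)
```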
If you want to use the REST API to control Spark, you're probably best off adding the Spark Jobserver to your installation, which gives you a much more comprehensive REST API than the private REST APIs you're currently querying.
Poking around, I've managed to get the job status for a single application by running
curl http://127.0.0.1:4040/api/v1/applications/Spark%20shell/jobs/
which returned
[ {
"jobId" : 0,
"name" : "parquet at <console>:19",
"submissionTime" : "2015-12-21T10:46:02.682GMT",
"stageIds" : [ 0 ],
"status" : "RUNNING",
"numTasks" : 2,
"numActiveTasks" : 2,
"numCompletedTasks" : 0,
"numSkippedTasks" : 0,
"numFailedTasks" : 0,
"numActiveStages" : 1,
"numCompletedStages" : 0,
"numSkippedStages" : 0,
"numFailedStages" : 0 }]
Spark has some hidden RESTful APIs that you can try.
Note that I have not tried them yet, but I will.
For example, to get the status of a submitted application you can run:
curl http://spark-cluster-ip:6066/v1/submissions/status/driver-20151008145126-0000
Note: "driver-20151008145126-0000" is the submission ID.
For more detail, see the post on this from arturmkrtchyan on GitHub.