Store execution plan of Spark´s dataframe - scala

I am currently trying to store the execution plan of a Spark´s dataframe into HDFS (through dataframe.explain(true) command)
The issue I am finding is that when I am using the explain(true) command, I am able to see the output by the command line and by the logs, however if I create a file (let´s say a .txt) with the content of the dataframe´s explain the file will appear empty.
I believe the issue relates to the configuration of Spark, but I am unable to
find any information about this in internet
(for those who want to see more about the plan execution of the dataframes using the explain function please refer to https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-dataset-operators.html#explain)

if I create a file (let´s say a .txt) with the content of the dataframe´s explain
How exactly did you try to achieve this?
explain writes its result to console, using println, and returns Unit, as can be seen in Dataset.scala:
def explain(extended: Boolean): Unit = {
val explain = ExplainCommand(queryExecution.logical, extended = extended)
sparkSession.sessionState.executePlan(explain).executedPlan.executeCollect().foreach {
// scalastyle:off println
r => println(r.getString(0))
// scalastyle:on println
}
}
So, unless you redirect the console output to write to your file (along with anything else printed to the console...), you won't be able to write explain's output to file.

The best way I have found is to redirect the output to a file when you run the job. I have used the following command :
spark-shell --master yarn -i test.scala > getlogs.log
my scala file has the following simple commands :
val df = sqlContext.sql("SELECT COUNT(*) FROM testtable")
df.explain(true)
exit()

Related

Remove all files with a given extension using scala spark

I have some csv.crc files generated when I try to write a dataframe into a csv file using spark. Therefore I want to delete all files with .csv.crc extension
val fs = FileSystem.get(existingSparkSession.sparkContext.hadoopConfiguration)
val srcPath=new Path("./src/main/resources/myDirectory/*.csv.crc")
println(fs.exists(srcPath))
println(fs.isFile(srcPath))
if(fs.exists(srcPath) && fs.isFile(srcPath)) {
fs.delete(srcPath,true)
}
both prinln lines give false as the value. therefor its not even going into the if condition. How can I delete all.csv.crc files using scala and spark
You can use below option to avoid crc file while writing.(Note :you're eliminating checksum).
fs.setVerifyChecksum(false).
else you can avoid crc files while reading using below,
config.("dfs.client.read.shortcircuit.skip.checksum", "true").

How do I ensure that my Apache Spark setup code runs only once?

I'm writing a Spark job in Scala that reads in parquet files on S3, does some simple transforms, and then saves them to a DynamoDB instance. Each time it runs we need to create a new table in Dynamo so I've written a Lambda function which is responsible for table creation. The first thing my Spark job does is generates a table name, invokes my Lambda function (passing the new table name to it), waits for the table to be created, and then proceeds normally with the ETL steps.
However it looks as though my Lambda function is consistently being invoked twice. I cannot explain that. Here's a sample of the code:
def main(spark: SparkSession, pathToParquet: String) {
// generate a unique table name
val tableName = generateTableName()
// call the lambda function
val result = callLambdaFunction(tableName)
// wait for the table to be created
waitForTableCreation(tableName)
// normal ETL pipeline
var parquetRDD = spark.read.parquet(pathToParquet)
val transformedRDD = parquetRDD.map((row: Row) => transformData(row), encoder=kryo[(Text, DynamoDBItemWritable)])
transformedRDD.saveAsHadoopDataset(getConfiguration(tableName))
spark.sparkContext.stop()
}
The code to wait for table creation is pretty-straightforward, as you can see:
def waitForTableCreation(tableName: String) {
val client: AmazonDynamoDB = AmazonDynamoDBClientBuilder.defaultClient()
val waiter: Waiter[DescribeTableRequest] = client.waiters().tableExists()
try {
waiter.run(new WaiterParameters[DescribeTableRequest](new DescribeTableRequest(tableName)))
} catch {
case ex: WaiterTimedOutException =>
LOGGER.error("Timed out waiting to create table: " + tableName)
throw ex
case t: Throwable => throw t
}
}
And the lambda invocation is equally simple:
def callLambdaFunction(tableName: String) {
val myLambda = LambdaInvokerFactory.builder()
.lambdaClient(AWSLambdaClientBuilder.defaultClient)
.lambdaFunctionNameResolver(new LambdaByName(LAMBDA_FUNCTION_NAME))
.build(classOf[MyLambdaContract])
myLambda.invoke(new MyLambdaInput(tableName))
}
Like I said, when I run spark-submit on this code, it definitely does hit the Lambda function. But I can't explain why it hits it twice. The result is that I get two tables provisioned in DynamoDB.
The waiting step also seems to fail within the context of running this as a Spark job. But when I unit-test my waiting code it seems to work fine on its own. It successfully blocks until the table is ready.
At first I theorized that perhaps spark-submit was sending this code to all of the worker nodes and they were independently running the whole thing. Initially I had a Spark cluster with 1 master and 2 workers. However I tested this out on another cluster with 1 master and 5 workers, and there again it hit the Lambda function exactly twice, and then apparently failed to wait for table creation because it dies shortly after invoking the Lambdas.
Does anyone have any clues as to what Spark might be doing? Am I missing something obvious?
UPDATE: Here's my spark-submit args which are visible on the Steps tab of EMR.
spark-submit --deploy-mode cluster --class com.mypackage.spark.MyMainClass s3://my-bucket/my-spark-job.jar
And here's the code for my getConfiguration function:
def getConfiguration(tableName: String) : JobConf = {
val conf = new Configuration()
conf.set("dynamodb.servicename", "dynamodb")
conf.set("dynamodb.input.tableName", tableName)
conf.set("dynamodb.output.tableName", tableName)
conf.set("dynamodb.endpoint", "https://dynamodb.us-east-1.amazonaws.com")
conf.set("dynamodb.regionid", "us-east-1")
conf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
conf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
new JobConf(conf)
}
Also here is a Gist containing some of the exception logs I see when I try to run this.
Thanks #soapergem for adding logging and options. I add an answer (a try one) since it may be a little bit longer than a comment :)
To wrap-up:
nothing strange with spark-submit and configuration options
in https://gist.github.com/soapergem/6b379b5a9092dcd43777bdec8dee65a8#file-stderr-log you can see that the application is executed twice. It passes twice from an ACCEPTED to RUNNING state. And that's consistent with EMR defaults (How to prevent EMR Spark step from retrying?). To confirm that, you can check whether you have 2 tables created after executing the step (I suppose here that you're generating tables with dynamic names; a different name per execution which in case of retry should give 2 different names)
For your last question:
It looks like my code might work if I run it in "client" deploy mode, instead of "cluster" deploy mode? Does that offer any hints to anyone here?
For more information about the difference, please check https://community.hortonworks.com/questions/89263/difference-between-local-vs-yarn-cluster-vs-yarn-c.html In your case, it looks like the machine executing spark-submit in client mode has different IAM policies than the EMR jobflow. My supposition here is that your jobflow role is not allowed to dynamodb:Describe* and that's why you're getting the exception with 500 code (from your gist):
Caused by: com.amazonaws.services.dynamodbv2.model.ResourceNotFoundException: Requested resource not found: Table: EmrTest_20190708143902 not found (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: V0M91J7KEUVR4VM78MF5TKHLEBVV4KQNSO5AEMVJF66Q9ASUAAJG)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:4243)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:4210)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeDescribeTable(AmazonDynamoDBClient.java:1890)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.describeTable(AmazonDynamoDBClient.java:1857)
at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:129)
at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:126)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:80)
To confirm this hypothesis, you an execute your part creating the table and waiting for creation locally (no Spark code here, just a simple java command of your main function) and:
for the first execution ensure that you have all permissions. IMO it will be dynamodb:Describe* on Resources: * (if it's the reason, AFAIK you should use somthing Resources: Test_Emr* in production for principle of least privilege )
for the 2nd execution remove dynamodb:Describe* and check whether you're getting the same stack trace like in the gist
I encountered the same problem in cluster mode too (v2.4.0). I workaround it by launching my apps programmatically using SparkLauncher instead of using spark-submit.sh. You could move your lambda logic into your main method that starts your spark app like this:
def main(args: Array[String]) = {
// generate a unique table name
val tableName = generateTableName()
// call the lambda function
val result = callLambdaFunction(tableName)
// wait for the table to be created
waitForTableCreation(tableName)
val latch = new CountDownLatch(1);
val handle = new SparkLauncher(env)
.setAppResource("/path/to/spark-app.jar")
.setMainClass("com.company.SparkApp")
.setMaster("yarn")
.setDeployMode("cluster")
.setConf("spark.executor.instances", "2")
.setConf("spark.executor.cores", "2")
// other conf ...
.setVerbose(true)
.startApplication(new SparkAppHandle.Listener {
override def stateChanged(sparkAppHandle: SparkAppHandle): Unit = {
latch.countDown()
}
override def infoChanged(sparkAppHandle: SparkAppHandle): Unit = {
}
})
println("app is launching...")
latch.await()
println("app exited")
}
your spark job starts before the table is actually created because defining operations one by one doesn't mean they will wait until previous one is finished
you need to change the code so that block related to spark is starting after table is created, and in order to achieving it you have to either use for-comprehension that insures every step is finished or put your spark pipeline into the callback of waiter called after the table is created (if you have any, hard to tell)
you can also use andThen or simple map
the main point is that all the lines of code written in your main are executed one by one immediately without waiting for previous one to finish

How to check a file/folder is present using pyspark without getting exception

I am trying to keep a check for the file whether it is present or not before reading it from my pyspark in databricks to avoid exceptions? I tried below code snippets but i am getting exception when file is not present
from pyspark.sql import *
from pyspark.conf import SparkConf
SparkSession.builder.config(conf=SparkConf())
try:
df = sqlContext.read.format('com.databricks.spark.csv').option("delimiter",",").options(header='true', inferschema='true').load('/FileStore/tables/HealthCareSample_dumm.csv')
print("File Exists")
except IOError:
print("file not found")`
When i have file, it reads file and "prints File Exists" but when the file is not there it will throw "AnalysisException: 'Path does not exist: dbfs:/FileStore/tables/HealthCareSample_dumm.csv;'"
Thanks #Dror and #Kini. I run spark on cluster, and I must add sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]), here s3 is the prefix of the file system of your cluster.
def path_exists(path):
# spark is a SparkSession
sc = spark.sparkContext
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
sc._jsc.hadoopConfiguration(),
)
return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))
The answer posted by #rosefun worked for me but it took lot of time for me to get it working. So I am giving some details about how that solution is working and what are the stuffs you should avoid.
def path_exists(path):
# spark is a SparkSession
sc = spark.sparkContext
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
sc._jsc.hadoopConfiguration(),
)
return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
The function is same and it works fine to check whether a file exists or not in the S3 bucket path that you provided.
You will have to change this function based on how you are specifying your path value to this function.
path = f"s3://bucket-name/import/data/"
pathexists = path_exists(path)
if the path variable that you are defining is having the s3 prefix in the path then it would work.
Also the portion of the code which split the string gets you just the bucket name as follows:
path.split("/")[2] will give you `bucket-name`
but if you don't have s3 prefix in the path then you will have to use the function by changing some code and which is as below:
def path_exists(path):
# spark is a SparkSession
sc = spark.sparkContext
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
sc._jvm.java.net.URI.create("s3://" + path),
sc._jsc.hadoopConfiguration(),
)
return fs.exists(sc._jvm.org.apache.hadoop.fs.Path("s3://" + path))
Looks like you should change except IOError: to except AnalysisException:.
Spark throws different errors/exception than regular python in a lot of cases. It’s not doing typical python io operations when reading a file, so makes sense for it to throw a different exception.
nice to see you on StackOverFlow.
I second dijksterhuis's solution, with one exception -
Analysis Exception is very general exception in Spark, and may be resulted for various reasons, not only due to missing file.
If you want to check whether the file exists or not, you'll need to bypass Spark's FS abstraction, and access the storage system directly (Whether is s3, posix, or something else). The down side of this solution is the lack of abstraction - once you'll change your underlying FS, you'll need to change your code as well.
You can validate existence of a file as seen here:
import os
if os.path.isfile('/path/file.csv'):
print("File Exists")
my_df = spark.read.load("/path/file.csv")
...
else:
print("File doesn't exists")
dbutils.fs.ls(file_location)
Do not import dbutils. It's already there when you start your cluster.

how to run sc.textFile() inside a foreach loop and do a union on it?

so this is what I have been trying and I'm a newbie here working with spark!
I'm trying to execute this code
val ii=sc.parallelize(Seq(("e.txt"),("r.txt"))).foreach{i => sc.textFile(i)}
but I'm getting "Nullpointer exception"
Thanks!
You can just add multiple files to the sc.textFile. You should not use the sc inside of a map operation. The map function will be distributed to the different executors, and the sc lives in the driver. Therefore it will throw a Nullpointer exception.
a.txt contents:
a.txt:line1
a.txt:line2
b.txt contents:
b.txt:line1
b.txt:line2
Spark allows you to add more files in the same operation:
scala> sc.textFile("a.txt,b.txt").collect()
res1: Array[String] = Array(a.txt:line1, a.txt:line2, b.txt:line1, b.txt:line2)
Hope this helps and have fun with Spark!

how to run scalding test in local mode with local input file

Scalding has a great utility to run an integration test for the job flow.
In this way the inputs and outputs are the in-memory buffer
val input = List("0" -> "This a a day")
val expectedOutput = List(("This", 1),("a", 2),("day", 1))
JobTest(classOf[WordCountJob].getName)
.arg("input", "input-data")
.arg("output", "output-data")
.source(TextLine("input-data"), input)
.sink(Tsv("output-data")) {
buffer: mutable.Buffer[(String, Int)] => {
buffer should equal(expectedOutput)
}
}.run
How can I transfare/write another code that will read input and write output to the real local file? Like FileTap/LFS in cascading - and not an in-memory approach
You might check out HadoopPlatformJobTest and the TypedParquetTupleTest.scala example which uses a local mini-cluster.
This unit test writes to a "MiniLocalCLuster" - While it's not directly a file, but accessible via reading the local minicluster with Hadoop filesystem.
Given you local file scenario, maybe you can copies the files with local reads to the mini-HDFS.