I'm writing a map-only Spark SQL job which looks like this:
import sqlContext.implicits._ // needed for .toDF on an RDD

val lines = sc.textFile(inputPath)
val df = lines.map { line => ... }.toDF("col0", "col1")
df.write.parquet(output)
Since the job takes quite a long time to compute, I would like to keep the results of the tasks that terminate successfully, even if the overall job fails or gets killed.
I noticed that, during the computation, some temporary files are created in the output directory.
I inspected them and noticed that, since my job has only a mapper, what is saved there is the output of the successful tasks.
The problem is that when the job failed, the temporary files were deleted, so I couldn't analyse what it had managed to compute.
Does anyone have an idea how to deal with this situation?
Cheers!
Change the output committer to DirectParquetOutputCommitter.
sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class", "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
Note that if you've turned on speculative execution, then you have to turn it off to use a direct output committer.
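For reference, a rough sketch of both settings together (hedged: this assumes a Spark 1.x build, since DirectParquetOutputCommitter was removed in later releases, and the app name is made up):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: speculation has to be disabled before the context is created,
// otherwise two attempts of the same task could both write to the final path.
val conf = new SparkConf()
  .setAppName("map-only-parquet-job") // hypothetical app name
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)

// Route Parquet writes through the direct committer so that the output of
// completed tasks lands in the destination directory rather than _temporary.
sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")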
In Azure Data Factory I am using a Lookup activity to get a list of files to download, then pass it to a ForEach where a data flow processes each file.
I do not have 'Sequential' mode turned on, so I would assume the data flows should be running in parallel. However, their runtimes are not the same; there is an almost constant gap between them (the first data flow ran 4 mins, the second 6, the third 8, and so on). It seems as if the second data flow waits for the first one to finish and then uses its cluster to process the file.
Is that intended behavior? I have a TTL set on the cluster, but that did not help much. If it is intended, what is a workaround? I am currently working on creating a list of files first and using that instead of a ForEach, but I am not sure whether I am going to see an increase in efficiency.
I have not been able to solve the issue of the parallel data flows not executing in parallel, but I have managed to change the solution in a way that increases performance.
What was before: A lookup activity that would get a list of files to process, passed on to a ForEach loop with a data flow activity.
What I am testing now: A data flow activity that gets the list of files and saves it in a text file in ADLS, then the data flow activity that was previously in the ForEach loop, with its source changed to use "List of Files" and pointed at that list.
The result was an increase in efficiency (using the same cluster, 40 files would take around 80 mins with ForEach and only 2-3 mins with List of Files); however, debugging is not as easy now that everything is in one data flow.
You can overwrite the list-of-files file on each run, or use dynamic expressions and name the file after the pipelineId or something else.
I'm trying to read a lot of avro files into a spark dataframe. They all share the same s3 filepath prefix, so initially I was running something like:
path = "s3a://bucketname/data-files"
df = spark.read.format("avro").load(path)
which was successfully identifying all the files.
The individual files are something like:
"s3a://bucketname/data-files/timestamp=20201007123000/id=update_account/0324345431234.avro"
Upon attempting to manipulate the data, the code kept erroring out with a message that one of the files was not an Avro data file. The actual error message received is: org.apache.spark.SparkException: Job aborted due to stage failure: Task 62476 in stage 44102.0 failed 4 times, most recent failure: Lost task 62476.3 in stage 44102.0 (TID 267428, 10.96.134.227, executor 9): java.io.IOException: Not an Avro data file.
To circumvent the problem, I was able to get the explicit filepaths of the avro files I'm interested in. After putting them in a list (file_list), I was successfully able to run spark.read.format("avro").load(file_list).
The issue now is this: I'm interested in adding a number of fields to the dataframe that are part of the filepath (i.e. the timestamp and the id from the example above).
While using just the bucket and prefix filepath to find the files (approach #1), these fields were automatically appended to the resulting dataframe. With the explicit filepaths, I don't get that advantage.
I'm wondering if there's a way to include these columns while using spark to read the files.
Sequentially processing the files would look something like:
from pyspark.sql.functions import lit

for file in file_list:
    df = spark.read.format("avro").load(file)
    id, timestamp = parse_filename(file)  # parse_filename is my own helper
    df = df.withColumn("id", lit(id)) \
           .withColumn("timestamp", lit(timestamp))
but there are over 500k files and this would take an eternity.
I'm new to Spark, so any help would be much appreciated, thanks!
Two separate things to tackle here:
Specifying Files
Spark has built-in handling for reading all files of a particular type in a given path. As #Sri_Karthik suggested, try supplying a path like "s3a://bucketname/data-files/*.avro" (if that doesn't work, maybe try "s3a://bucketname/data-files/**/*.avro"... I can't remember the exact pattern-matching syntax Spark uses), which should grab only avro files and get rid of that error where you are seeing non-avro files in those paths. In my opinion this is more elegant than manually fetching the file paths and explicitly specifying them.
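For illustration, a hedged sketch of what that could look like (the glob depth is a guess based on the example path above, and the basePath option is what should keep Spark inferring the timestamp/id partition columns):

// Rough sketch, not verified against this bucket: two partition-style directory
// levels (timestamp=.../id=...) sit between the prefix and the .avro files, so
// the glob descends that far while skipping any non-avro metadata files.
val path = "s3a://bucketname/data-files"
val df = spark.read
  .format("avro")
  .option("basePath", path) // anchors partition column inference at the prefix
  .load(s"$path/*/*/*.avro")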
As an aside, the reason you are seeing this is likely because folders typically get marked with metadata files like .SUCCESS or .COMPLETED to indicate they are ready for consumption.
Extracting metadata from filepaths
If you check out this stackoverflow question, it shows how you can add the filename as a new column (both for Scala and pyspark). You could then use the regexp_extract function to parse out the desired elements from that filename string. I've never used Scala in Spark, so I can't help you there, but it should be similar to the pyspark version.
Why don't you try reading the files first with the wholeTextFiles method, which adds the path name into the data itself at the beginning? Then you can extract the file names from the data and add them as a column while creating the dataframe. I agree it's a two-step process, but it should work. To get the timestamp of a file you will need a FileSystem object, which is not serializable, i.e. it can't be used in Spark's parallelized operations. So you will have to create a local collection of (file, timestamp) pairs and join it somehow with the RDD you created with wholeTextFiles.
I have a spark-streaming job in which I receive data from a message queue and process a bunch of records. In the process, I call take() on a dataset. Although the take action happens in the expected manner, in the DAG visualization I see multiple job ids created, and all of them have the same take action. This happens only when the data is in the order of hundreds of thousands of records. I didn't observe redundant jobs while running with tens of records on my local machine. Can anyone help me understand the reasoning behind this behavior?
The job ids (91 to 95) are basically running the same action. The following is the code snippet corresponding to the action mentioned above.
val corruptedMessageArray: Array[String] = corruptedMessageDs.take(1)
if (!corruptedMessageArray.isEmpty) {
  val firstCorruptedMessage: String = corruptedMessageArray(0)
}
Your question seems to be whether duplicate jobs are created by Spark.
If you look at the screenshot you will see that the jobs have a different number of tasks, hence it is not a simple matter of duplication.
I am not sure exactly what is happening, but it seems that for large datasets take() launches several quick successive jobs. This is likely because take(n) first scans a small number of partitions and, if it has not collected n rows yet, retries over progressively more partitions until it has enough, so each retry shows up as another job with the same description.
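As a conceptual illustration only (this is not Spark's actual implementation, just a hedged sketch of that incremental strategy), logic like the following would also show up in the UI as several short jobs running the same action:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Scan a few partitions; if fewer than n rows came back, launch another job
// over more partitions. Each call to runJob is one visible job in the UI.
def incrementalTake[T: ClassTag](rdd: RDD[T], n: Int): Array[T] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  val totalPartitions = rdd.partitions.length
  var scanned = 0
  var batch = 1
  while (buf.size < n && scanned < totalPartitions) {
    val parts = scanned until math.min(scanned + batch, totalPartitions)
    val remaining = n - buf.size
    val results = rdd.sparkContext.runJob(
      rdd, (it: Iterator[T]) => it.take(remaining).toArray, parts)
    results.foreach(buf ++= _)
    scanned += parts.size
    batch *= 4 // grow the number of partitions scanned in the next round
  }
  buf.take(n).toArray
}

As far as I remember, the real take/limit path grows the scan by a configurable scale-up factor in much the same way, which is why the number of extra jobs depends on how much data there is.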
I have a dataFrame of 1000 columns, and I am trying to get some statistics by doing some operations on each column. I need to sort each column, so I basically can't do multi-column operations on it. I am doing all these column operations in a function called processColumn:
def processColumn(df: DataFrame): Double = {
// sort the column
// get some statistics
}
To get this done, I am persisting the dataframe in memory and doing Scala multi-threaded processing on it. So the code is something like this.
Let's say the initial dataframe is df:
df.columns.grouped(100).foreach { columnGroups =>
  val newDf = df.select(columnGroups.head, columnGroups.tail: _*)
  newDf.persist()
  val parallelCol = columnGroups.par
  parallelCol.tasksupport = new ForkJoinTaskSupport(
    new scala.concurrent.forkjoin.ForkJoinPool(4)
  )
  parallelCol.foreach { columnName =>
    val result = processColumn(df.select(columnName))
    // I am storing result here to a synchronized list
  }
  newDf.unpersist()
}
So, as you can see, I am specifying 4 threads to run at a time. But what sometimes happens is that one of the threads gets stuck, I end up with more than 4 active jobs running, and the ones that get stuck never finish.
I suspect the threads started by the Scala parallel collection have a timeout, so that sometimes they don't wait for all jobs to finish; unpersist then gets called, and the active job is stuck forever. I have been going through the source code to see whether Scala collection operations have a timeout, but haven't been able to figure it out for sure.
Any help will be highly appreciated. Also, please let me know if you have any questions. Thank you.
Here's a code snippet (problem description below).
val broadcastVarbannedDNS = sc.broadcast(filterList)

val INPUT = hc.table("tableName")
  .where("DS BETWEEN 2016120100 AND 2016120100")
  .rdd
  .filter(x => !broadcastVarbannedDNS.value.map(str => x.getString(2).contains(str)).contains(true))

INPUT.count()
filterList is a csv with 200k+ lines. Comes out to about 9MB.
When I run with the whole filter list, the job hangs at the filter stage with no exception or hint of what the problem is. Nothing in the logs either. However, when I cut the filterList file down to a few hundred lines, it runs like a hot knife through butter. One could immediately conclude that the "bigger" file is the issue, although 9MB is minute compared to something that should be able to handle GBs of cache. Any help is appreciated.
I would imagine the issue is that if you have N records in the RDD and M lines in the filter then your processing is O(M*N) as you are checking each record against each line.
This means that if processing the N records once takes 1 second and you have 200K filter lines, then you are looking at roughly 200K seconds of work, which would look just like hanging.
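For what it's worth, here is a hedged sketch of the same filter (reusing the names from the question) written so that the per-record check short-circuits on the first match. It does not change the O(M*N) worst case, but it avoids building a 200k-element Boolean list for every record:

// Sketch only: exists stops at the first matching entry instead of mapping all
// 200k+ patterns to Booleans per record. The worst case is still O(M * N),
// so the size of the filter list remains the dominant cost.
val INPUT = hc.table("tableName")
  .where("DS BETWEEN 2016120100 AND 2016120100")
  .rdd
  .filter { x =>
    val field = x.getString(2)
    !broadcastVarbannedDNS.value.exists(str => field.contains(str))
  }
INPUT.count()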