I have a DataFrame of 1000 columns, and I am trying to get some statistics by doing some operations on each column. I need to sort each column, so I basically can't do multi-column operations on it. I am doing all these column operations in a function called processColumn:
def processColumn(df: DataFrame): Double = {
  // sort the column
  // get some statistics
}
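For illustration only, a possible body might look like the sketch below (the actual statistics are not shown in the question; the numeric column type and the median-style placeholder value are assumptions):
import org.apache.spark.sql.DataFrame

def processColumn(df: DataFrame): Double = {
  // Hypothetical placeholder: sort the single column, collect it, and return a middle value as the "statistic"
  val values = df.sort(df.columns.head).collect().map(_.getDouble(0))
  if (values.isEmpty) 0.0 else values(values.length / 2)
}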
To get this done, I am persisting the DataFrame in memory and processing the columns with Scala parallel collections (multi-threading). The code looks something like this, where the initial DataFrame is df:
import scala.collection.parallel.ForkJoinTaskSupport

df.columns.grouped(100).foreach { columnGroups =>
  val newDf = df.select(columnGroups.head, columnGroups.tail: _*)
  newDf.persist()

  val parallelCol = columnGroups.par
  parallelCol.tasksupport = new ForkJoinTaskSupport(
    new scala.concurrent.forkjoin.ForkJoinPool(4)
  )

  parallelCol.foreach { columnName =>
    // select from the persisted newDf so the cached data is actually reused
    val result = processColumn(newDf.select(columnName))
    // I am storing result here to a synchronized list
  }

  newDf.unpersist()
}
So, as you can see, I am limiting it to 4 threads at a time. But sometimes one of the threads gets stuck, I see more than 4 active jobs running, and the stuck ones never finish.
My suspicion is that the threads started by the Scala parallel collection have some kind of timeout, so that sometimes the foreach does not wait for all jobs to finish; unpersist() then gets called while a job is still running, and that active job is stuck forever. I have been going through the source code to see whether parallel collection operations have a timeout, but I haven't been able to confirm it.
Any help will be highly appreciated. Also, please let me know if you have any questions. Thank you.
Related
I have a spark-streaming job in which I receive data from a message queue and process a bunch of records. In the process, I call take() on a dataset. Although the take action happens as expected, in the DAG visualization I see multiple job IDs created, and all of them have the same take action. This happens only when the data is on the order of hundreds of thousands of records; I didn't observe redundant jobs while running with tens of records on my local machine. Can anyone help me understand the reasoning behind this behavior?
The job IDs (91 to 95) are basically running the same action. The following is the code snippet corresponding to the action mentioned above.
val corruptedMessageArray: Array[String] = corruptedMessageDs.take(1)
if (corruptedMessageArray.nonEmpty) {
  val firstCorruptedMessage: String = corruptedMessageArray(0)
}
Your question seems to be whether Spark creates duplicate jobs.
If you look at the screenshot you will see that the jobs have a different number of tasks, so it is not a simple matter of duplication.
I am not sure exactly what is happening, but it seems that for large datasets take() needs several quick successive jobs, perhaps because it divides the work into chunks, or perhaps because it needs to probe how much work has to be done.
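As a hedged addition (my reading of Spark's behavior, not something stated in the question): take(n) first launches a job over a small number of partitions and, if that does not return n rows, launches follow-up jobs over progressively more partitions, each of which shows up as its own job ID. Assuming a reasonably recent Spark 2.x version, the growth rate between attempts is controlled by spark.sql.limit.scaleUpFactor, roughly like this (spark is the SparkSession, assumed to be in scope in the streaming job):
// Scan more partitions per retry, so fewer jobs are launched for a single take(1)
spark.conf.set("spark.sql.limit.scaleUpFactor", "8")
val firstCorrupted = corruptedMessageDs.take(1)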
This question already has answers here:
How to run concurrent jobs(actions) in Apache Spark using single spark context (2 answers)
Processing multiple files as independent RDD's in parallel (3 answers)
How to run multiple Spark jobs in parallel? (3 answers)
Closed 4 years ago.
I am using the SQL API of Spark 2.0.0.
I would like to know what the good practice is when I have two independent actions that have to be performed on my data. Here is a basic example:
val ds = sc.parallelize(List(
("2018-12-07T15:31:48Z", "AAA",3),
("2018-12-07T15:32:48Z", "AAA",25),
("2018-12-07T15:33:48Z", "AAA",20),
("2018-12-07T15:34:48Z", "AAA",10),
("2018-12-07T15:35:48Z", "AAA",15),
("2018-12-07T15:36:48Z", "AAA",16),
("2018-12-07T15:37:48Z", "AAA",8),
("2018-12-07T15:31:48Z", "BBB",15),
("2018-12-07T15:32:48Z", "BBB",0),
("2018-12-07T15:33:48Z", "BBB",0),
("2018-12-07T15:34:48Z", "BBB",1),
("2018-12-07T15:35:48Z", "BBB",8),
("2018-12-07T15:36:48Z", "BBB",7),
("2018-12-07T15:37:48Z", "BBB",6)
)).toDF("timestamp","tag","value")
val newDs = commonTransformation(ds).cache();
newDs.count() // force computation of the dataset
val dsAAA = newDs.filter($"tag"==="AAA")
val dsBBB = newDs.filter($"tag"==="BBB")
actionAAA(dsAAA)
actionBBB(dsBBB)
Using the following functions :
def commonTransformation(ds: Dataset[Row]): Dataset[Row] = {
  ds // do multiple transformations on the dataframe
}

def actionAAA(ds: Dataset[Row]): Unit = {
  Thread.sleep(5000) // Sleep to simulate an action that takes time
  ds.show()
}

def actionBBB(ds: Dataset[Row]): Unit = {
  Thread.sleep(5000) // Sleep to simulate an action that takes time
  ds.show()
}
In this example, we have an input dataset that contains multiple time series identified by the 'tag' column. Some transformations are applied to this whole dataset.
Then I want to apply different actions to my data depending on the tag of the time series.
In my example I get the expected result, but I had to wait a long time for both actions to execute, even though I had executors available.
I partially solved the problem by using the Java Future class, which lets me start my actions asynchronously. But with this solution, Spark becomes very slow if I start too many actions relative to its resources, and ends up taking more time than running the actions one by one.
So for now my solution is to start multiple actions with a maximum limit on the number of actions running at the same time (see the sketch below), but I don't feel this is the right way to do it (and the maximum limit is hard to guess).
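For reference, a minimal sketch of that "bounded number of concurrent actions" approach using plain Scala Futures over a fixed-size thread pool (the function and dataset names are the ones from the example above; the pool size of 2 is just an illustrative value):
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// A fixed-size pool caps how many Spark actions are submitted concurrently
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(2))

val fAAA = Future { actionAAA(dsAAA) } // each action is submitted from its own thread
val fBBB = Future { actionBBB(dsBBB) }

// Block the main thread until both actions finish
Await.result(Future.sequence(Seq(fAAA, fBBB)), 10.minutes)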
I have created a database with Slick, and I am trying to create table schemas, select some information, and so on. Here is my code for schema creation:
val createUserTable = UserTable.table.schema.create
val createTaskTable = TaskTable.table.schema.create
Await.result(db.run(DBIO.seq(Queries.createUserTable, Queries.createTaskTable)), 2 seconds)
This code works just fine, but I do not want to use Await.result with every query. What I am looking for is executing them in batches, at least grouped by purpose (creation, selection, and so on). So I have created this method to pass different actions:
def executeAction[T](action: DBIO[T]) =
Await.result(db.run(action), 2 seconds)
So I am curious how I can change it to pass some data structure which holds a sequence of queries, for example List(createUserTable, createTaskTable)?
Your help is appreciated!
There are two ways to avoid an Await for every DBIO action (both are sketched after this list):
Create a list of DBIO actions, gather them with DBIO.seq, and execute the result.
Use a for-comprehension to compose all the DBIO actions into one DBIO action.
Either way you avoid calling Await again and again to wait for the results of your intermediate DBIO actions.
In both cases, you still have to wait for the final result in the main thread (i.e. stop the main thread from exiting) using Await.result at least once.
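A minimal sketch of both options (the action names reuse the Queries object from the question; db and the schema actions are assumed to be in scope):
import scala.concurrent.Await
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import slick.dbio.DBIO

// Option 1: gather a sequence of actions and run them as one
val schemaActions = List(Queries.createUserTable, Queries.createTaskTable)
val createAll: DBIO[Unit] = DBIO.seq(schemaActions: _*)

// Option 2: compose the actions with a for-comprehension into a single action
val composed: DBIO[Unit] = for {
  _ <- Queries.createUserTable
  _ <- Queries.createTaskTable
} yield ()

// Only the final combined action needs an Await in the main thread
Await.result(db.run(createAll), 2.seconds) // composed would be run the same way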
I'm writing a map-only Spark SQL job which looks like this:
val lines = sc.textFile(inputPath)
val df = lines.map { line => ... }.toDF("col0", "col1")
df.write.parquet(output)
As the job takes quite a long time to compute, I would like to save and keep the results of the tasks which successfully terminated, even if the overall job fails or gets killed.
I noticed that, during the computation, in the output directory some temporary files are created.
I inspected them and noticed that, since my job has only a mapper, what is saved there is the output of the successful tasks.
The problem is that the job failed and I couldn't analyse what it had computed, because the temp files were deleted.
Does anyone have some idea how to deal with this situation?
Cheers!
Change the output committer to DirectParquetOutputCommitter.
sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class", "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
Note that if you've turned on speculative execution, then you have to turn it off to use a direct output committer.
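A minimal sketch of both settings together, assuming a Spark 1.x build where this committer class is available (spark.speculation has to be set before the SparkContext is created, e.g. in the SparkConf or via --conf on spark-submit; the app name is just a placeholder):
import org.apache.spark.{SparkConf, SparkContext}

// Speculative execution must be disabled up front; direct committers are unsafe with it
val conf = new SparkConf()
  .setAppName("map-only parquet job") // hypothetical app name
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)

// Point Spark SQL's parquet writer at the direct committer, which writes task output
// straight to the destination instead of a temporary directory that gets cleaned up
sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")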
I have a Spark job that takes a file with 8 records from HDFS, does a simple aggregation, and saves it back to HDFS. I notice there are hundreds of tasks when I do this.
I am also not sure why there are multiple jobs for this. I thought a job happened more or less when an action happened. I can speculate as to why, but my understanding was that this code should be one job broken down into stages, not multiple jobs. Why doesn't it just break it down into stages; why does it break it into jobs?
As for the 200-plus tasks: since the amount of data and the number of nodes is minuscule, it doesn't make sense that there are roughly 25 tasks for each row of data when there is only one aggregation and a couple of filters. Why wouldn't there just be one task per partition per atomic operation?
Here is the relevant Scala code:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object TestProj {
  def main(args: Array[String]) {
    /* set the application name in the SparkConf object */
    val appConf = new SparkConf().setAppName("Test Proj")

    /* env settings that I don't need to set in REPL */
    val sc = new SparkContext(appConf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val rdd1 = sc.textFile("hdfs://node002:8020/flat_files/miscellaneous/ex.txt")

    /* the below rdd will have the schema defined in the Record class */
    val rddCase = sc.textFile("hdfs://node002:8020/flat_files/miscellaneous/ex.txt")
      .map(x => x.split(" ")) // split each file record into an array of strings on spaces
      .map(x => Record(
        x(0).toInt,
        x(1),
        x(2),
        x(3).toInt))

    /* the below dataframe groups on the first letter of the first name and counts it */
    val aggDF = rddCase.toDF()
      .groupBy($"firstName".substr(1, 1).alias("firstLetter"))
      .count
      .orderBy($"firstLetter")

    /* save to hdfs */
    aggDF.write.format("parquet").mode("append").save("/raw/miscellaneous/ex_out_agg")
  }

  case class Record(id: Int, firstName: String, lastName: String, quantity: Int)
}
Below is the screenshot after clicking on the application.
Below are the stages shown when viewing the specific "job" of ID 0.
Below is the first part of the screen when clicking on the stage with over 200 tasks
This is the second part of the screen inside the stage
Below is after clicking on the "executors" tab
As requested, here are the stages for Job ID 1
Here are the details for the stage in job ID 1 with 200 tasks
This is a classic Spark question.
The two tasks used for reading (Stage Id 0 in the second figure) come from the defaultMinPartitions setting, which is set to 2. You can get this parameter by reading the value of sc.defaultMinPartitions in the REPL. It should also be visible in the Spark UI under the "Environment" tab.
You can take a look at the code on GitHub to see that this is exactly what is happening. If you want more partitions to be used on read, just pass the number as a parameter, e.g., sc.textFile("a.txt", 20).
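To make that concrete, a small sketch reusing the HDFS path from the question (the value 20 is arbitrary):
println(sc.defaultMinPartitions) // typically 2, which explains the two read tasks
val rddMorePartitions =
  sc.textFile("hdfs://node002:8020/flat_files/miscellaneous/ex.txt", 20) // request ~20 partitions on read
println(rddMorePartitions.getNumPartitions)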
Now the interesting part is the 200 partitions that appear in the second stage (Stage Id 1 in the second figure). Each time there is a shuffle, Spark needs to decide how many partitions the shuffled RDD will have, and as you can imagine, the default is 200.
You can change that using:
sqlContext.setConf("spark.sql.shuffle.partitions", "4")
If you run your code with this configuration, you will see that the 200 partitions are no longer there. How to set this parameter is something of an art; maybe choose 2x the number of cores you have (or whatever works for your workload).
I think Spark 2.0 has a way to automatically infer the best number of partitions for shuffle RDDs. Looking forward to that!
Finally, the number of jobs you get has to do with how many RDD actions the resulting optimized DataFrame code produced. If you read the Spark documentation, it says that each RDD action triggers one job. When your action involves a DataFrame or Spark SQL, the Catalyst optimizer figures out an execution plan and generates some RDD-based code to execute it. It's hard to say exactly why it uses two actions in your case; you may need to look at the optimized query plan to see exactly what it is doing.
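One way to do that last step, using the aggDF DataFrame from the question (this only prints plans and runs no job):
aggDF.explain(true) // prints the parsed, analyzed, optimized and physical plans
// or, programmatically:
println(aggDF.queryExecution.optimizedPlan)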
I am having a similar problem, but in my scenario the collection I am parallelizing has fewer elements than the number of tasks scheduled by Spark (causing Spark to behave oddly sometimes). Forcing the partition number (the optional numSlices argument of parallelize) fixed this issue for me.
It was something like this:
collection = range(10) # In the real scenario it was a complex collection
sc.parallelize(collection).map(lambda e: e + 1) # also a more complex operation in the real scenario
Then, I saw in the Spark log:
INFO YarnClusterScheduler: Adding task set 0.0 with 512 tasks