Scala: Process a DataFrame while a column value meets a condition

I have to process a huge DataFrame and download files from a service based on the id column of the DataFrame. The download logic and all the transformations are ready, but I am not sure what the best way is to loop over this. I run this on Databricks, which is why I need to perform the processing in chunks.
The dataframe has a "status" column, which can hold the following values:
"todo", "processing", "failed", "succeeded"
In the while loop I want to perform the following tasks:
while (there are rows with status "todo") {
- get the first 10 rows if status is todo (DONE)
- start processing the dataframe, update status to processing (DONE)
- download files (call UDF), update status to succeeded or failed
(DONE, not in the code here)
}
I would like to run this until every row's status is something other than "todo". The problem is that the while loop never finishes, because the DataFrame itself is never updated. The result needs to be assigned to another DataFrame, but then how do I feed the new one back into the loop?
My code right now:
while (statusDoc.where("status == 'todo'").count > 0) {
  val todoDF = statusDoc.filter("status == 'todo'")
  val processingDF = todoDF.limit(10)
    .withColumn("status", when(col("status") === "todo", "processing").otherwise(col("status")))
  statusDoc.join(processingDF, Seq("id"), "outer")
    .select($"id",
      statusDoc("fileUrl"),
      coalesce(processingDF("status"), statusDoc("status")).alias("status"))
}
The join should go like this:
val update = statusDoc.join(processingDF, Seq("id"), "outer")
  .select($"id", statusDoc("fileUrl"),
    coalesce(processingDF("status"), statusDoc("status")).alias("status"))
Then this new update dataframe should be used for the next round of loop.

One thing to remember here is that Spark DataFrames are not mutable, because they are distributed. You have no guarantee that a given modification would be properly propagated across the whole network of executors if you made one, and you also have no guarantee that a given portion of the data has not already been used somewhere else (on another node, for example).
One thing you can do though is add another column with the updated values and remove the old column.
val update = statusDoc
  .withColumnRenamed("status", "status_doc")
  .join(processingDF, Seq("id"), "outer")
  .withColumn("updated_status",
    udf((stnew: String, stold: String) => if (stnew != null) stnew else stold)
      .apply(col("status"), col("status_doc")))
  .drop("status_doc", "status")
  .withColumnRenamed("updated_status", "status")
  .select("id", "fileUrl", "status")
Then make sure you replace statusDoc with the update DataFrame at the end of each iteration. Do not forget to declare the DataFrame as a var instead of a val; I'm surprised your IDE has not complained yet.
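As a minimal sketch of the overall loop (combining the join from the question with the reassignment; how statusDoc is initially built and the actual download step are left out):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, when}

var statusDoc: DataFrame = ???   // built however it is in the original code

while (statusDoc.where("status == 'todo'").count > 0) {
  // take the next batch of 10 "todo" rows and mark them as "processing"
  val processingDF = statusDoc
    .filter("status == 'todo'")
    .limit(10)
    .withColumn("status", when(col("status") === "todo", "processing").otherwise(col("status")))

  // ... download the files for processingDF and set "succeeded"/"failed" here ...

  // merge the batch status back into the full DataFrame and reassign it,
  // so the next iteration of the loop sees the updated statuses
  statusDoc = statusDoc.alias("doc")
    .join(processingDF.alias("proc"), Seq("id"), "outer")
    .select(col("id"), col("doc.fileUrl"),
      coalesce(col("proc.status"), col("doc.status")).alias("status"))
}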
Also, I'm sure you can think of a way of distributing the problem so that you avoid the while loop entirely; I can help you do that, but I would need a clearer description of your issue. If you use a while loop, you won't use the full capabilities of your cluster, because the loop itself only runs on the driver and you only process 10 rows at a time. I'm sure you can append all the data you need to the whole DataFrame in a single map operation.
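For example, assuming the download logic can be wrapped in a UDF that returns the final status (downloadStatusUdf below is hypothetical, and id/fileUrl are assumed to be string columns), the whole DataFrame could be processed in a single distributed pass:
import org.apache.spark.sql.functions.{col, udf, when}

// Hypothetical UDF: downloads the file for a given id/url and
// returns "succeeded" or "failed" for that row.
val downloadStatusUdf = udf((id: String, fileUrl: String) => {
  try {
    // ... perform the actual download here ...
    "succeeded"
  } catch {
    case _: Exception => "failed"
  }
})

// One pass over the whole DataFrame, executed on the cluster,
// instead of a driver-side while loop over batches of 10 rows.
val processed = statusDoc.withColumn(
  "status",
  when(col("status") === "todo", downloadStatusUdf(col("id"), col("fileUrl")))
    .otherwise(col("status"))
)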

Related

Databricks - All records from Dataframe/Tempview get removed after merge

I am observing a weird issue; I am not sure whether it is a gap in my Spark knowledge or something else.
I have a DataFrame as shown in the code below. I create a temp view from it, and I am observing that after the merge operation the temp view becomes empty. I am not sure why.
val myDf = getEmployeeData()
myDf.createOrReplaceTempView("myView")
// Result1: Below lines display all the records
myDf.show()
spark.table("myView").show()
// performing merge operation
val sql = s"""MERGE INTO employee AS a
USING myView AS b
ON a.Id = b.Id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *"""
spark.sql(sql)
// Result2: ISSUE is here. myDf & myView are both empty
myDf.show()
spark.table("myView").show()
Edit
getEmployeeData method performs join between two dataframes and returns the result.
df1.as(df1Alias).join(df2.as(df2Alias), expr(joinString), "inner").filter(finalFilterString).select(s"$df1Alias.*")
DataFrames in Spark are lazily evaluated, i.e. not executed until an action like .show or .collect is called, or until they are used in a SQL operation. This also means that if you refer to a DataFrame once more, it will get re-evaluated again.
Assuming there is no other background activity that could interfere, your function getEmployeeData apparently depends on the employee table. It gets executed both before and after the merge, and might yield a different result.
To prevent this you can checkpoint the DataFrame (checkpoint() returns a new DataFrame with a truncated lineage):
val myDfCheckpointed = myDf.checkpoint()
or explicitly materialize it:
myDf.write.saveAsTable("myViewMaterialized")
and later refer to the materialized version.
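A minimal sketch of both options, assuming the variables from the question (the checkpoint directory and table name are placeholders):
// Option 1: checkpoint. Requires a checkpoint directory; checkpoint()
// returns a new DataFrame whose lineage is cut off at the checkpoint.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")      // placeholder path
val myDfCheckpointed = myDf.checkpoint()
myDfCheckpointed.createOrReplaceTempView("myView")

// Option 2: materialize to a table and refer to the materialized copy.
myDf.write.saveAsTable("myViewMaterialized")                 // placeholder table name
val myDfMaterialized = spark.table("myViewMaterialized")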
I agree with the points that @Kombajn zbozowy made: DataFrames in Spark are lazily evaluated and will be re-evaluated if you call an action on them once more.
I would like to point out that this is normal, expected behavior and might not have anything to do with the merge operation.
For example, suppose the DataFrame you get as the output of the join only contains inserts, and you perform those inserts on the target table using the DataFrame write API. If you then run df.show() again, it would show empty output, because when the DataFrame is re-evaluated by performing the join it no longer finds any difference and does not output any records.
The same holds true for the merge operation: it also updates the target table with inserts/updates, so when you rerun the show it won't output any rows.

Accumulator in Spark Scala: Counter value is wrong when calculated in a filter and used with withColumn at the end

I'm trying to count the number of valid and invalid records present in a file. Below is the code:
val badDataCountAcc = spark.sparkContext.longAccumulator("BadDataAcc")
val goodDataCountAcc = spark.sparkContext.longAccumulator("GoodDataAcc")
val dataframe = spark
  .read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load(path)
  .filter(data => {
    val matcher = regex.matcher(data.toString())
    if (matcher.find()) {
      goodDataCountAcc.add(1)
      println("GoodDataCountAcc: " + goodDataCountAcc.value)
      true
    } else {
      badDataCountAcc.add(1)
      println("BadDataCountAcc: " + badDataCountAcc.value)
      false
    }
  })
  .withColumn("FileName", input_file_name())
dataframe.show()
val filename = dataframe
.select("FileName")
.distinct()
val name = filename.collectAsList().get(0).toString()
println("" + filename)
println("Bad data Count Acc: " + badDataCountAcc.value)
println("Good data Count Acc: " + goodDataCountAcc.value)
I ran this code on sample data that has 2 valid and 3 invalid records. Inside the filter, where I'm printing the counts, the values are correct. But outside the filter, when I print the counts, I get 4 for good data and 6 for bad data.
Questions:
When I remove the withColumn statement at the end - along with the code which calculates the distinct filename - the values are printed correctly. I'm not sure why.
I do have a requirement to get the input filename as well. What would be best way to do that here?
First of all, Accumulator belongs to the RDD API, while you are using Dataframes. Dataframes are compiled down to RDDs in the end, but they are at a higher level of abstraction. It is better to use aggregations instead of Accumulators in this context.
From the Spark Accumulators documentation:
For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
Accumulators do not change the lazy evaluation model of Spark. If they are being updated within an operation on an RDD, their value is only updated once that RDD is computed as part of an action. Consequently, accumulator updates are not guaranteed to be executed when made within a lazy transformation like map(). The below code fragment demonstrates this property:
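The fragment referred to there is, paraphrased from the Spark documentation, roughly the following (data stands for any RDD of numbers):
val accum = spark.sparkContext.longAccumulator
// map is a lazy transformation, so nothing runs yet
data.map { x => accum.add(x); x }
// Here, accum is still 0, because no action has forced the map to be computed.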
Your DataFrame filter is compiled to an RDD filter, which is not an action but a transformation (and thus lazy), so this only-once guarantee does not hold in your case.
How many times your code is executed is implementation-dependent, and may change between Spark versions, so you should not rely on it.
Regarding your two questions:
(BEFORE EDIT) This cannot be answered based on your code snippet because it doesn't contain any actions. Is it even the exact code snippet you use? I suspect that if you actually execute the code you posted without any additions except for the missing imports, it should print 0 two times because nothing is executed. Either way, you should always assume that an accumulator inside an RDD transformation is potentially executed multiple times (or even not at all if it is in a DataFrame operation which can possibly be optimized out).
Your approach of using withColumn is perfectly fine.
I'd suggest using DataFrame expressions and aggregations (or the equivalent Spark SQL if you prefer that). The regex matching can be done with rlike, using the columns instead of relying on toString(), e.g. .withColumn("IsGoodData", $"myColumn1".rlike(regex1) && $"myColumn2".rlike(regex2)).
Then you can count the good and bad records with a single aggregation like dataframe.groupBy($"IsGoodData").count().
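Put together, a sketch of that approach could look like the following (the column names and regexes are placeholders; path is the same as in the question):
import org.apache.spark.sql.functions.{col, input_file_name}

// Placeholder regexes and column names; adapt them to the real schema.
val goodPattern1 = "^[A-Za-z]+$"
val goodPattern2 = "^[0-9]+$"

val flagged = spark.read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load(path)
  .withColumn("FileName", input_file_name())
  .withColumn("IsGoodData",
    col("myColumn1").rlike(goodPattern1) && col("myColumn2").rlike(goodPattern2))

// Count good vs. bad records with a single aggregation instead of accumulators.
flagged.groupBy(col("IsGoodData")).count().show()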
EDIT: With the additional lines, the answer to your first question is also clear: the first execution comes from dataframe.show() and the second from filename.collectAsList(), which you probably also removed since it depends on the added column. Please make sure you understand the distinction between Spark transformations and actions and Spark's lazy evaluation model; otherwise you won't be very happy with it :-)

How do I tell whether caching is actually being used in Spark?

In my Scala/Spark application I create a DataFrame that I plan to use several times throughout the program, which is why I decided to use the .cache() method on it. As you can see, inside the loop I filter the DataFrame several times with different values, yet for some reason .count() always returns the same result. In fact, it should return two different count values. I also notice strange behavior in Mesos; it feels like the .cache() call is not being executed. After creating the DataFrame, the program reaches if (!df.head(1).isEmpty) and spends a very long time there. I assumed that the caching process would run for a long time, and that the other operations would then use the cache and run quickly. What do you think is the problem?
import org.apache.spark.sql.DataFrame

var df: DataFrame = spark
  .read
  .option("delimiter", "|")
  .csv("/path_to_the_files/")
  .filter(col("col5").isin("XXX", "YYY", "ZZZ"))
df.cache()

var array1 = Array("111", "222")
var array2 = Array("333")
var storage = Array(array1, array2)

if (!df.head(1).isEmpty) {
  for (item <- storage) {
    df.filter(
      col("col1").isin(item: _*)
    )
    println("count: " + df.count())
  }
}
In fact, it must return two different count values.
Why? You are calling it on the same df. Maybe you meant something like
val df1 = df.filter(...)
println("count: " + df1.count())
I assumed that the caching process would run for a long time, and the other processes would use this cache and run quickly.
It does, but only when the first action which depends on this DataFrame is executed, and head is that action. So you should expect exactly what you observed:
the program goes to this part of code if (!df.head(1).isEmpty) and performs it for a very long time
Without caching, you'd also get the same time for both df.count() calls, unless Spark detects it and enables caching on its own.
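A minimal sketch of the corrected loop, assigning the filtered result before counting it:
import org.apache.spark.sql.functions.col

df.cache()

if (!df.head(1).isEmpty) {
  for (item <- storage) {
    // assign the filtered DataFrame and count that, not the original df
    val filtered = df.filter(col("col1").isin(item: _*))
    println("count: " + filtered.count())
  }
}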

Caching Large Dataframes in Spark Effectively

I am currently working with 11,000 files. Each file generates a DataFrame which is unioned with the previous one. Below is the code:
var df1 = sc.parallelize(Array(("temp", 100))).toDF("key", "value").withColumn("Filename", lit("Temp"))

files.foreach(filename => {
  val a = filename.getPath.toString()
  val m = a.split("/")
  val name = m(6)
  println("FILENAME: " + name)
  if (name == "_SUCCESS") {
    println("Cannot Process '_SUCCESS' Filename")
  } else {
    val freqs = doSomething(a).toDF("key", "value").withColumn("Filename", lit(name))
    df1 = df1.unionAll(freqs)
  }
})
First, I got a java.lang.StackOverflowError on the 11,000 files. Then I added the following line after df1 = df1.unionAll(freqs):
df1 = df1.cache()
It resolves the problem, but after each iteration it gets slower. Can somebody please suggest what should be done to avoid the StackOverflowError without the slowdown?
Thanks!
The issue is that Spark manages a DataFrame as a set of transformations. It begins with the toDF of the first DataFrame, then performs the transformations on it (e.g. withColumn), then unions it with the previous DataFrame, etc.
The unionAll is just another such transformation, and the tree becomes very long (with 11K unionAll calls you have an execution tree of depth 11K). Building up that plan in unionAll can lead to a stack overflow.
The caching doesn't solve this. However, I imagine you are adding some action along the way (otherwise nothing would run besides building the transformations). When you perform caching, Spark might skip some of the steps, and therefore the stack overflow would simply arrive later.
You can go back to RDDs for an iterative process (your example is actually not iterative but purely parallel; you can simply save each separate DataFrame along the way, then convert them to RDDs and use an RDD union).
Since your case seems to be just unioning a bunch of DataFrames without true iteration, you can also do the union in a tree manner (i.e. union pairs, then union pairs of pairs, etc.); this changes the depth from O(N) to O(log N), where N is the number of unions.
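A sketch of such a pairwise (tree) union, assuming the per-file DataFrames have first been collected into a non-empty Seq:
import org.apache.spark.sql.DataFrame

// Union a sequence of DataFrames pairwise so the plan depth is
// O(log N) instead of O(N). Assumes dfs is non-empty.
def treeUnion(dfs: Seq[DataFrame]): DataFrame =
  if (dfs.size == 1) dfs.head
  else treeUnion(dfs.grouped(2).map {
    case Seq(a, b) => a.union(b)
    case Seq(a)    => a
  }.toSeq)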
Lastly, you can read and write the DataFrame to/from disk. The idea is that after every X (e.g. 20) unions, you do df1.write.parquet(filex) and then df1 = spark.read.parquet(filex). When you read it back, the lineage of the single DataFrame is just the file read itself. The cost, of course, is the writing and reading of the file.
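A rough sketch of that approach, following the structure of the original loop (the batch size and checkpoint paths are placeholders, and a SparkSession named spark is assumed):
import org.apache.spark.sql.functions.lit

val checkpointEvery = 20                          // placeholder batch size

files.zipWithIndex.foreach { case (file, i) =>
  val a = file.getPath.toString()
  val name = a.split("/")(6)
  if (name != "_SUCCESS") {
    val freqs = doSomething(a).toDF("key", "value").withColumn("Filename", lit(name))
    df1 = df1.unionAll(freqs)
  }

  // every `checkpointEvery` files, write df1 to disk and read it back,
  // so its lineage is reset to a plain file scan
  if ((i + 1) % checkpointEvery == 0) {
    val path = s"/tmp/df1_checkpoint_$i"          // placeholder path
    df1.write.parquet(path)
    df1 = spark.read.parquet(path)
  }
}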

Recursively adding rows to a dataframe

I am new to Spark. I have some JSON data that comes as an HttpResponse. I need to store this data in Hive tables. Every HttpGet request returns a JSON document which will be a single row in the table. Because of this, I am having to write single rows as files in the Hive table directory.
But I feel that having too many small files will reduce speed and efficiency. So is there a way I can recursively add new rows to a DataFrame and write it to the Hive table directory all at once? I feel this would also reduce the runtime of my Spark code.
Example:
for (i <- 1 to 10) {
  newDF = hiveContext.read.json("path")
  df = df.union(newDF)
}
df.write()
I understand that the dataframes are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track. What you want to do is obtain the multiple single records as a Seq[DataFrame], and then reduce that Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"
(0 until BatchSize).
map(_ => hiveContext.read.json("path")).
reduce(_ union _).
write.insertInto(HiveTableName)
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
jsonArray.
map { parameter =>
obtainRecord(parameter)
}.
reduce(_ union _)
batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
You are clearly misusing Spark. Apache Spark is an analytical system, not a database API. There is no benefit to using Spark to modify a Hive database like this. It will only bring a severe performance penalty, without benefiting from any of Spark's features, including distributed processing.
Instead you should use a Hive client directly to perform transactional operations.
If you can batch-download all of the data first (for example with a script using curl or some other program) and store it in a file (or many files; Spark can load an entire directory at once), you can then load those files all at once into Spark to do your processing. I would also check whether the web API has any endpoints to fetch all the data you need instead of just one record at a time.
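For instance, if the responses have already been saved as JSON files into one directory, they can be loaded and written in a single step (the directory and table name below are placeholders):
// Load every JSON file in the staging directory as one DataFrame,
// then write it into the Hive table in a single operation.
val allRecords = spark.read.json("/staging/http_responses/")   // placeholder path
allRecords.write.insertInto("table")                           // placeholder table name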