Spark Scala - Comparing Datasets Column by Column

I'm just getting started with Spark; I've previously used Python with pandas. One of the things I do very regularly is compare datasets to see which columns have differences. In pandas this looks something like:
merged = df1.merge(df2, on="by_col")
for col in cols:
    diff = merged[col + "_x"] != merged[col + "_y"]
    if diff.sum() > 0:
        print(f"{col} has {diff.sum()} diffs")
I'm simplifying a bit, but that's the gist of it. Of course, after this I'd drill down and look at, for example:
col = "col_to_compare"
diff = merged[col+"_x"] != merged[col+"_y"]
print(merged[diff][[col+"_x",col+"_y"]])
Now in Spark/Scala this is turning out to be extremely inefficient. The same logic works, but the dataset is roughly 300 columns wide, and the following code takes about 45 minutes to run on a 20 MB dataset, because it submits 300 separate Spark jobs in sequence rather than in parallel, so I seem to be paying Spark's per-job overhead 300 times. For reference, the pandas version takes something like 300 ms.
for (col <- cols) {
  val cnt = merged.filter(merged("dev_" + col) <=> merged("prod_" + col)).count
  if (cnt != merged.count) {
    println(col + " = " + cnt + " / " + merged.count)
  }
}
What's the faster, more Spark-like way of doing this kind of thing? My understanding is that I want this to be a single Spark job that builds one plan. I looked at transposing to a very tall dataset, and while that could potentially work, it ends up being very complicated and the code is not straightforward at all. Also, although this example fits in memory, I'd like to be able to use this function across datasets, and we have a few that are multiple terabytes, so it needs to scale to large datasets as well, whereas with pandas that would be a pain.
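For reference, a minimal sketch of that single-job idea, assuming the merged, cols and "dev_"/"prod_" names from the question: every per-column mismatch count is expressed as an aggregate column and computed in one agg, so Spark builds one plan and runs one job.

import org.apache.spark.sql.functions.{col, count, when}

// one counter per column: rows where dev and prod disagree (null-safe compare)
val diffExprs = cols.map { c =>
  count(when(!(col("dev_" + c) <=> col("prod_" + c)), 1)).as(c)
}

// a single .agg(...) produces one row with ~300 counters in a single job
val diffRow = merged.agg(diffExprs.head, diffExprs.tail: _*).head()

cols.zipWithIndex.foreach { case (c, i) =>
  val cnt = diffRow.getLong(i)
  if (cnt > 0) println(s"$c has $cnt diffs")
}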

Related

Convert collect-map-foreach scala code block to spark/sql library functions

I have a spark dataframe (let's call it "records") like the following one:
id   name
a1   john
b"2  alice
c3'  joe
As you can see, the primary key column (id) values may contain single or double quotes (like the second and third rows in the dataframe).
I wrote the following Scala code to check for quotes in the primary key column values:
import scala.util.control.Breaks.{break, breakable}

def checkForQuotesInPrimaryKeyColumn(primaryKey: String, records: DataFrame): Boolean = {
  // Extract primary key column values
  val pkcValues = records.select(primaryKey).collect().map(_(0)).toList

  // Check for single and double quotes in the values
  var checkForQuotes = false // indicates no quotes
  breakable {
    pkcValues.foreach { pkcValue =>
      if (pkcValue.toString.contains("\"") || pkcValue.toString.contains("\'")) {
        checkForQuotes = true
        println("Value that has quotes: " + pkcValue.toString)
        break()
      }
    }
  }
  checkForQuotes
}
This code works, but it doesn't take advantage of Spark's functionality. I want to make use of Spark executors (and other features) so the task completes faster.
The updated function looks like the following:
def checkForQuotesInPrimaryKeyColumnsUpdated(primaryKey: String, records: DataFrame): Boolean = {
  val findQuotes = udf((s: String) => if (s.contains("\"") || s.contains("\'")) true else false)
  records
    .select(findQuotes(col(primaryKey)) as "quotes")
    .filter(col("quotes") === true)
    .collect()
    .nonEmpty
}
The unit tests give similar runtimes on my machine for both functions when run on a dataframe with 100 entries.
Is the updated function any faster (and/or better) than the original function? Is there any way the function can be improved?
Your first approach collects the entire dataframe to the driver. If the data does not fit into the driver's memory, it will break. And you are right: it does not take advantage of Spark.
The second approach uses Spark to detect the quotes. That's better. The problem is that you then collect to the driver a dataframe with one boolean per record that contains a quote, just to check whether there is at least one. That is a waste of time, especially if many records contain quotes. It is also a shame to use a UDF for this, since UDFs are known to be slower than Spark SQL primitives.
You could simply use Spark to count the number of records containing a quote, without collecting anything:
records
  .where(col(primaryKey).contains("\"") || col(primaryKey).contains("'"))
  .count > 0
Since you do not actually care about the number of records, only whether there is at least one, you can use limit(1). Spark SQL will be able to optimize the query further:
records
  .where(col(primaryKey).contains("\"") || col(primaryKey).contains("'"))
  .limit(1)
  .count > 0
NB: it makes sense that in unit tests, with little data, both of your queries take about the same time. Spark is meant for big data and has some overhead. With real data, your second approach should be faster than the first, and the one I propose faster still. Also, your first approach will OOM on the driver as soon as you add more data.
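A sketch of the same check wrapped back into the original function shape (the name is hypothetical; on recent Spark versions Dataset.isEmpty is an equivalent shortcut for the limit(1)-style early exit):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// returns true as soon as any primary-key value contains a quote
def hasQuotesInPrimaryKeyColumn(primaryKey: String, records: DataFrame): Boolean =
  !records
    .where(col(primaryKey).contains("\"") || col(primaryKey).contains("'"))
    .isEmpty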

Pyspark improving performance for multiple column operations

I have written a class which performs standard scaling over grouped data.
class Scaler:
    .
    .
    .
    .
    def __transformOne__(self, df_with_stats, newName, colName):
        return df_with_stats\
            .withColumn(newName,
                        (F.col(colName) - F.col(f'avg({colName})')) / (F.col(f'stddev_samp({colName})') + self.tol))\
            .drop(colName)\
            .withColumnRenamed(newName, colName)

    def transform(self, df):
        df_with_stats = df.join(....)  # calculate stats here by doing a groupby and then a join
        return reduce(lambda df_with_stats, kv: self.__transformOne__(df_with_stats, *kv),
                      self.__tempNames__(), df_with_stats)[df.columns]
The idea is to keep the means and standard deviations in columns and simply do a column subtraction/division on the column I want to scale. That part is done in __transformOne__, so basically it's an arithmetic operation on one column.
If I want to scale multiple columns, I just call __transformOne__ multiple times, a bit more efficiently using functools.reduce (see the transform function). The class works fast enough for a single column, but when I have multiple columns it takes too much time.
I have no idea about the internals of Spark, so I'm a complete newbie. Is there a way I can improve this computation over multiple columns?
My solution makes a lot of calls to withColumn, so I changed it to use select instead of withColumn. There is a substantial difference in the physical plans of the two approaches. For my application, the runtime improved from 15 minutes to 2 minutes by using select. More information about this in this SO post.
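To make the difference concrete, here is a minimal sketch of the select-based version of this scaling, written with Spark's Scala API; dfWithStats, featureCols, keyCols and tol are assumed names mirroring the question:

import org.apache.spark.sql.functions.col

// build every scaled column as an expression first...
val tol = 1e-8
val scaledCols = featureCols.map { c =>
  ((col(c) - col(s"avg($c)")) / (col(s"stddev_samp($c)") + tol)).as(c)
}

// ...then apply them in one select: a single projection in the plan,
// instead of one nested projection per withColumn call
val scaled = dfWithStats.select(keyCols.map(col) ++ scaledCols: _*)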

Caching Large Dataframes in Spark Effectively

I am currently working with 11,000 files. Each file generates a dataframe which is unioned with the previous one. Below is the code:
var df1 = sc.parallelize(Array(("temp", 100))).toDF("key", "value").withColumn("Filename", lit("Temp"))

files.foreach( filename => {
  val a = filename.getPath.toString()
  val m = a.split("/")
  val name = m(6)
  println("FILENAME: " + name)
  if (name == "_SUCCESS") {
    println("Cannot Process '_SUCCESS' Filename")
  } else {
    val freqs = doSomething(a).toDF("key", "value").withColumn("Filename", lit(name))
    df1 = df1.unionAll(freqs)
  }
})
First, I got a java.lang.StackOverflowError on the 11,000 files. Then I added the following line after df1 = df1.unionAll(freqs):
df1 = df1.cache()
It resolves the problem, but after each iteration it gets slower. Can somebody please suggest what should be done to avoid the StackOverflowError without the slowdown?
Thanks!
The issue is that Spark manages a dataframe as a chain of transformations. It begins with the toDF of the first dataframe, then performs the transformations on it (e.g. withColumn), then the unionAll with the previous dataframe, etc.
Each unionAll is just another such transformation, and the tree becomes very long (with 11K unionAlls you have an execution tree of depth 11K). Building that lineage is what can run into a stack overflow.
Caching doesn't solve this. However, I imagine you are adding some action along the way (otherwise nothing would run besides building the transformations). When you perform caching, Spark might skip some of the steps, and therefore the stack overflow simply arrives later.
You can go back to RDDs for an iterative process (your example is actually not iterative but purely parallel): simply save each separate dataframe along the way, then convert them to RDDs and use RDD union.
Since your case seems to be unioning a bunch of dataframes without true iteration, you can also do the union in a tree manner (i.e. union pairs, then pairs of pairs, etc.). This changes the depth from O(N) to O(log N), where N is the number of unions.
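A minimal sketch of that tree-shaped union, assuming dfs is the collection of per-file dataframes (a hypothetical name):

import org.apache.spark.sql.DataFrame

def unionInPairs(dfs: Seq[DataFrame]): DataFrame = {
  require(dfs.nonEmpty, "need at least one dataframe")
  if (dfs.size == 1) dfs.head
  else {
    // union adjacent pairs, then recurse on the halved sequence
    val paired = dfs.grouped(2).map {
      case Seq(a, b) => a.union(b)  // unionAll on older Spark versions
      case Seq(a)    => a
    }.toSeq
    unionInPairs(paired)
  }
}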
Lastly, you can write the dataframe to disk and read it back. The idea is that after every X (e.g. 20) unions, you do df1.write.parquet(filex) and then df1 = spark.read.parquet(filex). After the read, the lineage of the dataframe is just the file read itself. The cost, of course, is writing and reading the file.
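A sketch of that periodic truncation, where checkpointDir is a hypothetical path and dataFrames is assumed to be an iterator over the per-file dataframes:

var acc = dataFrames.next()
var n = 0
dataFrames.foreach { df =>
  acc = acc.union(df)
  n += 1
  if (n % 20 == 0) {               // every 20 unions, materialize and reload
    val path = s"$checkpointDir/chunk_$n"
    acc.write.mode("overwrite").parquet(path)
    acc = spark.read.parquet(path) // lineage is now just this parquet read
  }
}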

Scala Iterator/Looping Technique - Large Collections

I have really large tab-delimited files (10 GB-70 GB) and need to read them, do some data manipulation, and write the result to a separate file. The files can range from 100 to 10K columns, with 2 million to 5 million rows.
The first x columns are static and are required for reference. Sample file format:
#ProductName  Brand   Customer1  Customer2  Customer3
Corolla       Toyota  Y          N          Y
Accord        Honda   Y          Y          N
Civic         Honda   0          1          1
I need to use the first 2 columns to get a product id then generate an output file similar to:
ProductID1 Customer1 Y
ProductID1 Customer2 N
ProductID1 Customer3 Y
ProductID2 Customer1 Y
ProductID2 Customer2 Y
ProductID2 Customer3 N
ProductID3 Customer1 N
ProductID3 Customer2 Y
ProductID3 Customer3 Y
Current sample code:
val fileNameAbsPath = filePath + fileName
val outputFile = new PrintWriter(filePath + outputFileName)
var customerList = Array[String]()

for (line <- scala.io.Source.fromFile(fileNameAbsPath).getLines()) {
  if (line.startsWith("#")) {
    customerList = line.split("\t")
  } else {
    val cols = line.split("\t")
    val productid = getProductID(cols(0), cols(1))
    for (i <- 2 until cols.length) {
      val rowOutput = productid + "\t" + customerList(i) + "\t" + parser(cols(i))
      outputFile.println(rowOutput)
      outputFile.flush()
    }
  }
}
outputFile.close()
One of the tests that I ran took about 12 hours to process a 70 GB file with 3 million rows and 2,500 columns. The final output file was about 250 GB with 800+ million rows.
My question is: is there anything in Scala other than what I'm already doing that can offer quicker performance?
OK, some ideas (a short sketch pulling a few of them together follows at the end) ...
As mentioned in the comments, you don't want to flush after every line. So, yeah, get rid of it.
Moreover, PrintWriter only flushes after every println if it was constructed with autoFlush enabled (in which case you would effectively be flushing twice :)). Use a two-argument constructor when creating the PrintWriter and make sure the second parameter is false.
You don't need to create a BufferedWriter explicitly; PrintWriter already buffers by default. The default buffer size is 8K. You might want to play around with it, but it will probably not make any difference because, last I checked, the underlying FileOutputStream ignores all that and flushes kilobyte-sized chunks either way.
Get rid of gluing rows together in a variable, and just write each field straight to the output.
If you do not care about the order in which lines appear in the output, you can trivially parallelize the processing (if you do care about the order, you still can, just a little bit less trivially), and write several files at once. That would help tremendously, if you place your output chunks on different disks and/or if you have multiple cores to run this code. You'd need to rewrite your code in (real) scala to make it thread safe, but that should be easy enough.
Compress data as it is being written, using GZIPOutputStream for example. That not only reduces the physical amount of data actually hitting the disks, but also allows for a much larger buffer.
Check out what your parser thingy is doing. You didn't show the implementation, but something tells me it is likely not free.
split can get prohibitively expensive on huge strings. People often forget that its parameter is actually a regular expression. You are probably better off writing a custom iterator or just using good old StringTokenizer to parse the fields out as you go, rather than splitting up front. At the very least, it'll save you one extra scan per line.
Finally, last but by no means least: consider using Spark and HDFS. This kind of problem is the very area where those tools really excel.
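Here is the promised sketch, pulling a few of these ideas together: gzip-compressed output with a large buffer, no per-line flushing, fields written directly, and StringTokenizer instead of split. getProductID and parser are the question's own helpers; the file names are assumptions.

import java.io.{BufferedOutputStream, FileOutputStream, PrintWriter}
import java.util.StringTokenizer
import java.util.zip.GZIPOutputStream

// compressed bytes are buffered (1 MB) before hitting the disk; autoFlush = false
val out = new PrintWriter(
  new GZIPOutputStream(
    new BufferedOutputStream(new FileOutputStream(filePath + outputFileName + ".gz"), 1 << 20)),
  false)

var customerList = Array[String]()
for (line <- scala.io.Source.fromFile(filePath + fileName).getLines()) {
  if (line.startsWith("#")) customerList = line.split("\t")
  else {
    // note: StringTokenizer skips empty fields; use a hand-rolled indexOf loop
    // instead if empty cells are possible in the input
    val tok = new StringTokenizer(line, "\t")
    val productId = getProductID(tok.nextToken(), tok.nextToken())
    var i = 2
    while (tok.hasMoreTokens) {
      // write each field straight to the output instead of building a row string
      out.print(productId); out.print('\t')
      out.print(customerList(i)); out.print('\t')
      out.println(parser(tok.nextToken()))
      i += 1
    }
  }
}
out.close()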

Spark: How to combine 2 sorted RDDs so that order is preserved after union?

I have 2 sorted RDDs:
val rdd_a = some_pair_rdd.sortByKey().
  zipWithIndex.filter(f => f._2 < n).
  map(f => f._1)

val rdd_b = another_pair_rdd.sortByKey().
  zipWithIndex.filter(f => f._2 < n).
  map(f => f._1)

val all_rdd = rdd_a.union(rdd_b)
In all_rdd, I see that the order is not necessarily maintained as I'd imagined (that all elements of rdd_a come first, followed by all elements of rdd_b). Is my assumption incorrect (about the contract of union), and if so, what should I use to append multiple sorted RDDs into a single rdd?
I'm fairly new to Spark so I could be wrong, but from what I understand union is a narrow transformation. That is, each executor combines only its local partitions of RDD a with its local partitions of RDD b and then returns that to the driver.
As an example, let's say that you have 2 executors and 2 RDDs.
RDD_A = ["a","b","c","d","e","f"]
and
RDD_B = ["1","2","3","4","5","6"]
Let Executor 1 contain the first half of both RDDs and Executor 2 contain the second half of both RDDs. When they perform the union on their local partitions, it would look something like:
Union_executor1 = ["a","b","c","1","2","3"]
and
Union_executor2 = ["d","e","f","4","5","6"]
So when the executors pass their parts back to the driver you would have ["a","b","c","1","2","3","d","e","f","4","5","6"]
Again, I'm new to Spark and I could be wrong. I'm just sharing based on my understanding of how it works with RDDs. Hopefully we can both learn something from this.
You can't. Spark does not have a merge sort, because you can't make assumptions about the way that the RDDs are actually stored on the nodes. If you want things in sort order after you take the union, you need to sort again.
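If what you ultimately want is key order after combining them, the simplest correct option is to sort again after the union; a minimal sketch using the question's own variables (which are still pair RDDs at that point):

// rdd_a and rdd_b are pair RDDs here, so sortByKey() applies after the union
val all_rdd_sorted = rdd_a.union(rdd_b).sortByKey()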