Is spark persist() (then action) really persisting? - scala

I always understood that calling persist() or cache() on a dataframe, followed by an action to trigger the DAG, computes the result and keeps it in memory for later use. Many threads here recommend caching to improve the performance of a frequently used dataframe.
Recently I ran a test and was confused, because that does not seem to be the case.
temp_tab_name = "mytablename"
x = spark.sql("select * from " + temp_tab_name + " limit 10")
x = x.persist()
x.count()   # action to activate all the above steps
x.show()    # x should have been persisted in memory here, DAG evaluated, no going back to "select..." whenever referred to
x.is_cached # True
spark.sql("drop table " + temp_tab_name)
x.is_cached # Still true!!
x.show()    # Error, table not found here
So it seems to me that x is never calculated and persisted; the next reference to x still goes back and evaluates its DAG definition ("select ..."). Anything I missed here?

cache and persist don't completely detach the computation result from its source.
They just make a best effort to avoid recalculation, so, generally speaking, deleting the source before you are done with the dataset is a bad idea.
What could go wrong in your particular case (off the top of my head):
1) show doesn't need all records of the table, so maybe it triggers computation for only a few partitions; most of the partitions may still not be calculated at that point.
2) Spark needs some auxiliary information from the table (e.g. for partitioning)
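If you really need to drop the source table before you are done, one way to fully detach the result first is to checkpoint it, which writes the data out and cuts the lineage back to the table. A minimal sketch, assuming the same temp_tab_name as above and a hypothetical checkpoint directory:
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # hypothetical directory
x = spark.sql("select * from " + temp_tab_name + " limit 10")
x = x.checkpoint(eager=True)  # eagerly materializes the result and cuts the lineage back to the table
spark.sql("drop table " + temp_tab_name)
x.show()  # no longer needs the dropped table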

The correct syntax is below. Here is some additional documentation for "uncaching" tables: https://spark.apache.org/docs/latest/sql-performance-tuning.html. You can confirm the examples below in the Spark UI under the "Storage" tab to see the objects being "cached" and "uncached".
# df method
df = spark.range(10)
df.cache() # cache
# df.persist() # acts same as cache
df.count() # action to materialize df object in ram
# df.foreach(lambda x: x) # another action to materialize df object in ram
df.unpersist() # remove df object from ram
# temp table method
df.createOrReplaceTempView("df_sql")
spark.catalog.cacheTable("df_sql") # cache
spark.sql("select * from df_sql").count() # action to materialize temp table in ram
spark.catalog.uncacheTable("df_sql") # remove temp table from ram
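If you prefer to check programmatically rather than in the Spark UI, the catalog also exposes a cached-status check; a small follow-up to the temp table example above:
spark.catalog.isCached("df_sql") # True while cached, False after uncacheTable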

Related

Databricks - All records from Dataframe/Tempview get removed after merge

I am observing a weird issue. I am not sure whether it is a lack of knowledge in Spark on my part or what.
I have a dataframe as shown in the code below. I create a temp view from it, and I am observing that after the merge operation that temp view becomes empty. Not sure why.
val myDf = getEmployeeData()
myDf.createOrReplaceTempView("myView")
// Result1: Below lines display all the records
myDf.show()
spark.table("myView").show()
// performing merge operation
val sql = s"""MERGE INTO employee AS a
USING myView AS b
ON a.Id = b.Id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *"""
spark.sql(sql)
// Result2: ISSUE is here. myDf & myView both are empty
myDf.show()
spark.table("myView").show()
Edit
The getEmployeeData method performs a join between two dataframes and returns the result.
df1.as(df1Alias).join(df2.as(df2Alias), expr(joinString), "inner").filter(finalFilterString).select(s"$df1Alias.*")
Dataframes in Spark are lazily evaluated, i.e. not executed until an action like .show or .collect is called, or until they are referenced in a SQL statement that gets executed (like the MERGE here). This also means that if you refer to the dataframe once more, it will get reevaluated again.
Assuming there is no other background activity that can mess things up, your function getEmployeeData apparently depends on the employee table. It gets executed both before and after the merge and might yield a different result.
To prevent this you can checkpoint the dataframe (after setting a checkpoint directory via spark.sparkContext.setCheckpointDir):
val myDfStable = myDf.checkpoint()
or explicitly materialize it:
myDf.write.saveAsTable("myViewMaterialized")
and later refer to the materialized version.
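For example, after materializing you could point the view at the stable copy so the rest of the job (including the merge) reads from it; a minimal sketch using the table name from above:
spark.table("myViewMaterialized").createOrReplaceTempView("myView")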
I agree with the points that @Kombajn zbozowy made about dataframes in Spark being lazily evaluated and being reevaluated again if you call an action on them once more.
I would like to point out that this is normal, expected behavior and might not have anything to do with the merge operation.
For example, suppose the dataframe you get as output of the join only contains inserts, and you perform those inserts on the target table using the dataframe write API. If you then run df.show() again, the output would be empty, because when the content of the dataframe is reevaluated by performing the join again, it won't find any differences and will not output any records.
The same holds true for the merge operation: it also updates the target table with inserts/updates, and when you rerun the dataframe it won't show any output rows.

Skewed Window Function & Hive Source Partitions?

The data I am reading via Spark is a highly skewed Hive table with the following stats.
Partition size / record count (MIN, 25TH, MEDIAN, 75TH, MAX) via Spark UI:
MIN:    1506.0 B / 0
25TH:   232.4 KB / 27288
MEDIAN: 247.3 KB / 29025
75TH:   371.0 KB / 42669
MAX:    269.0 MB / 27197137
I believe it is causing problems downstream in the job when I perform some Window Funcs, and Pivots.
I tried setting this parameter to limit the partition size, however nothing changed and the partitions are still skewed upon read.
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728) # value in bytes; 128 MB shown here as an example
Also, when I cache this DF with the Hive table as source it takes a few minutes and even causes some GC in the Spark UI, most likely because of the skew as well.
Does spark.sql.files.maxPartitionBytes work on Hive tables or only on files?
What is the best course of action for handling this skewed Hive source?
Would something like a stage-barrier write to parquet, or salting, be suitable for this problem?
I would like to avoid .repartition() on read, as it adds another layer to an already data-roller-coaster of a job.
Thank you
==================================================
After further research it appears the window function is causing skewed data too, and this is where the Spark job hangs.
I am performing some time-series filling via a double window function (forward then backward fill to impute all the null sensor readings) and am trying to follow this article to try a salt method to evenly distribute the work ... however the following code produces all null values, so the salt method is not working.
Not sure why I am getting skew after the window, since each measure item I am partitioning by has roughly the same number of records after checking via .groupBy() ... thus why would salt be needed?
+--------------------+-------+
| measure | count|
+--------------------+-------+
| v1 |5030265|
| v2 |5009780|
| v3 |5030526|
| v4 |5030504|
...
salt post => https://medium.com/appsflyer/salting-your-spark-to-scale-e6f1c87dd18
from pyspark.sql import functions as F
from pyspark.sql.window import Window

nSaltBins = 300 # based off number of "measure" values
df_fill = df_fill.withColumn("salt", (F.rand() * nSaltBins).cast("int"))
# FILLS [FORWARD + BACKWARD]
window = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)
# FORWARD FILLING IMPUTER
ffill_imputer = F.last(df_fill['new_value'], ignorenulls=True)\
    .over(window)
fill_measure_DF = df_fill.withColumn('value_impute_temp', ffill_imputer)\
    .drop("value", "new_value")
window = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(0, Window.unboundedFollowing)
# BACKWARD FILLING IMPUTER
bfill_imputer = F.first(fill_measure_DF['value_impute_temp'], ignorenulls=True)\
    .over(window)
df_fill = fill_measure_DF.withColumn('value_impute_final', bfill_imputer)\
    .drop("value_impute_temp")
Salting can be helpful when a single partition is big enough that it does not fit in memory on a single executor. This can happen even if all the keys are equally distributed (as in your case).
You have to include the salt column in the partitionBy clause that you use to create the window:
window = Window.partitionBy('measure', 'salt')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)
Then you have to create another window, which will operate on the intermediate result:
window1 = Window.partitionBy('measure')\
    .orderBy('measure', 'date')\
    .rowsBetween(Window.unboundedPreceding, 0)
Hive-based solution:
You can enable skew join optimization using Hive configuration. The applicable settings are:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
See the Databricks tips for this:
skew hints may work in this case

Spark Dataframe returns an inconsistent value on count()

I am using pyspark to perform some computations on data obtained from a PostgreSQL database. My pipeline is something like this:
limit = 1000
query = "(SELECT * FROM table LIMIT {}) as filter_query"
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://path/to/db") \
    .option("dbtable", query.format(limit)) \
    .option("user", "user") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()
df.createOrReplaceTempView("table")
df.count() # 1000
So far, so good. The problem starts when I perform some transformations on the data:
counted_data = spark.sql("SELECT column1, count(*) as count FROM table GROUP BY column1").orderBy("column1")
counted_data.count() # First value
counted_data_with_additional_column = counted_data.withColumn("column1", my_udf_function("column1"))
counted_data_with_additional_column.count() # Second value, inconsistent with the first count (should be the same)
The first transformation alters the number of rows (the value should be <= 1000). However, the second one does not; it just adds a new column. How can it be that I am getting a different result for count()?
The explanation is actually quite simple, but a bit tricky. Spark might perform additional reads from the input source (in this case a database). Since some other process is inserting data into the database, these additional calls read slightly different data than the original read, causing this inconsistent behaviour. A simple call to df.cache() after the read prevents further reads. I figured this out by analyzing the traffic between the database and my computer: some further SQL commands were indeed issued that matched my transformations. After adding the cache() call, no further traffic appeared.
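In code, the fix is just a cache plus an action right after the read, so later queries reuse the same snapshot instead of going back to Postgres; a minimal sketch based on the pipeline above:
df = df.cache()  # right after the JDBC read
df.count()       # action that materializes the cached snapshot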
Since you are using LIMIT 1000, you might be getting a different 1000 records on each execution. And since you get different records each time, the result of the aggregation will be different. In order to get consistent behaviour with LIMIT, you can try the following approaches.
Either try to cache your dataframe with cache() or persist(), which will ensure that Spark uses the same data for as long as it is available in memory.
But the better approach could be to sort the data based on some unique column and then get the 1000 records, which will ensure that you get the same 1000 records each time.
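A minimal sketch of that second approach: push a deterministic ordering into the JDBC subquery before the LIMIT, assuming the table has a unique id column (a hypothetical name), and keep the rest of the read options from the question unchanged:
query = "(SELECT * FROM table ORDER BY id LIMIT {}) as filter_query"
# then use .option("dbtable", query.format(limit)) exactly as in the original read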
Hope it helps.

Should I persist a Spark dataframe if I keep adding columns in it?

I could not find any discussion of the topic below in any forum I searched on the internet. It may be because I am new to Spark and Scala and am not asking a valid question. If there are any existing threads discussing the same or a similar topic, the links would be very helpful. :)
I am working on a process which uses Spark and Scala and creates a file by reading a lot of tables and deriving a lot of fields by applying logic to the data fetched from the tables. So, the structure of my code is like this:
val driver_sql = "SELECT ..."
var df_res = spark.sql(driver_sql)
df_res = df_res.withColumn("Col1", <logic>)
df_res = df_res.withColumn("Col2", <logic>)
df_res = df_res.withColumn("Col3", <logic>)
.
.
.
df_res = df_res.withColumn("Col20", <logic>)
Basically, there is a driver query which creates the "driver" dataframe. After that, separate logic (functions) is executed based on a key or keys in the driver dataframe to add new columns/fields. The "logic" part is not always a one-liner; sometimes it is a separate function which runs another query, does some kind of join on df_res, and adds a new column. The record count also changes, since I use an "inner" join with other tables/dataframes in some cases.
So, here are my questions:
Should I persist df_res at any point in time?
Can I persist df_res again and again after columns are added? I mean, does it add value?
If I persist df_res (disk only) every time a new column is added, is the data in the disk replaced? Or does it create a new copy/version of df_res in the disk?
Is there a better technique to persist/cache data in a scenario like this (to avoid doing a lot of stuff in memory)?
The first thing is that persisting a dataframe helps when you are going to apply iterative operations on it.
What you are doing here is applying transformation operations on your dataframes. There is no need to persist these dataframes here.
For example, persisting would be helpful if you are doing something like this:
val df = spark.sql("select * from ...").persist
df.count
val df1 = df.select("..").withColumn("xyz",udf(..))
df1.count
val df2 = df.select("..").withColumn("abc",udf2(..))
df2.count
Now, if you persist df here, it will be beneficial in calculating both df1 and df2.
One more thing to notice here: the reason I did df.count is that a dataframe is persisted only when an action is applied on it. From the Spark docs:
"The first time it is computed in an action, it will be kept in memory on the nodes." And this answers your second question as well.
Every time you persist, a new copy will be created, but you should unpersist the previous one first.
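As a minimal PySpark sketch of that unpersist-then-re-persist pattern (assuming an existing SparkSession named spark; DISK_ONLY matches the disk-only level asked about, and the added columns are placeholders):
from pyspark import StorageLevel
from pyspark.sql import functions as F

df_res = spark.range(100).withColumn("Col1", F.rand())
df_res.persist(StorageLevel.DISK_ONLY)
df_res.count()      # action: materializes the first persisted copy

prev = df_res
df_res = df_res.withColumn("Col2", F.rand())
df_res.persist(StorageLevel.DISK_ONLY)
df_res.count()      # materializes the new copy (stored separately, not replacing the old one)
prev.unpersist()    # then release the previous copy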

Distributed loading of a wide row into Spark from Cassandra

Let's assume we have a Cassandra cluster with RF = N and a table containing wide rows.
Our table could have an index something like this: pk / ck1 / ck2 / ....
If we create an RDD from a row in the table as follows:
val wide_row = sc.cassandraTable(KS, TABLE).select("c1", "c2").where("pk = ?", PK)
I notice that one Spark node has 100% of the data and the others have none. I assume this is because the spark-cassandra-connector has no way of breaking down the query token range into smaller sub ranges because it's actually not a range -- it's simply the hash of PK.
At this point we could simply call repartition(N) to spread the data across the Spark cluster before processing, but this has the effect of moving data across the network to nodes that already have the data locally in Cassandra (remember RF = N).
What we would really like is to have each Spark node load a subset (slice) of the row locally from Cassandra.
One approach which came to mind is to generate an RDD containing a list of distinct values of the first cluster key (ck1) when pk = PK. We could then use mapPartitions() to load a slice of the wide row based on each value of ck1.
Assuming we already have our list values for ck1, we could write something like this:
val ck1_list = .... // RDD
val ck1_parts = ck1_list.repartition(ck1_list.count().toInt) // create a partition for each value of ck1
val wide_row = ck1_parts.mapPartitions(f)
Within the partition iterator, f(), we would like to call another function g(pk, ck1) which loads the row slice from Cassandra for partition key pk and cluster key ck1. We could then apply flatMap to ck1_list so as to create a fully distributed RDD of the wide row without any shuffling.
So here's the question:
Is it possible to make a CQL call from within a Spark task? What driver should be used? Can it be set up only once and reused for subsequent tasks?
Any help would be greatly appreciated, thanks.
For the sake of future reference, I will explain how I solved this.
I actually used a slightly different method to the one outlined above, one which does not involve calling Cassandra from inside Spark tasks.
I started off with ck_list, a list of distinct values for the first cluster key when pk = PK. The code is not shown here, but I actually downloaded this list directly from Cassandra in the Spark driver using CQL.
I then transform ck_list into a list of RDDs. Next we combine the RDDs (each one representing a Cassandra row slice) into one unified RDD (wide_row).
The cast on the CassandraRDD is necessary because union returns type org.apache.spark.rdd.RDD.
After running the job I was able to verify that wide_row had x partitions, where x is the size of ck_list. A useful side effect is that wide_row is partitioned by the first cluster key, which is also the key I want to reduce by, so even more shuffling is avoided.
I don't know if this is the best way to achieve what I wanted, but it certainly works.
val ck_list // list first cluster key values where pk = PK
val wide_row = ck_list.map( ck =>
  sc.cassandraTable(KS, TBL)
    .select("c1", "c2").where("pk = ? and ck1 = ?", PK, ck)
    .asInstanceOf[org.apache.spark.rdd.RDD[com.datastax.spark.connector.CassandraRow]]
).reduce( (x, y) => x.union(y) )