Spark: is DataFrame caching/persistence transferred from one DataFrame to another? - scala

Assume I have this code (Spark 1.6.2):
val finalDF: DataFrame = if (test) {
  val df = sqlContext.read.parquet(url).cache
  df.write.parquet(url2)
  df
} else {
  sqlContext.read.parquet(other_url)
}
If I run finalDF.unpersist, will it indeed clean the data of finalDF/df from memory? If not, how can I do it?

Yes (if test is true).
Basically, cache changes the state of the DataFrame it is called on (in that sense the DataFrame is not immutable) and returns that same DataFrame. So if finalDF is df, calling finalDF.unpersist unpersists df. If test is false, df is never created and the result of sqlContext.read.parquet is not cached, but calling unpersist on it does no harm.
You can check this yourself in the Spark UI (port 4040 by default): the Storage tab shows the cached df before unpersist, and the entry disappears afterwards.
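As a minimal sketch of the point above, the snippet below checks that cache returns the same instance and that unpersisting through a second reference clears the cache. It assumes Spark 2.1 or later (SparkSession and storageLevel on a DataFrame are not available in 1.6), and it uses a small generated DataFrame in place of the parquet input; the local master and all variable names are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Assumption: a local session for experimentation (Spark 2.1+)
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("unpersist-alias-sketch")
  .getOrCreate()

// Stand-in for sqlContext.read.parquet(url)
val source = spark.range(100).toDF("id")

// cache marks the DataFrame for caching and returns the same instance
val df = source.cache()
val finalDF = df

val sameInstance = finalDF eq source
val levelWhileCached = finalDF.storageLevel

// Unpersist through the second reference, waiting until the blocks are freed
finalDF.unpersist(blocking = true)
val levelAfterUnpersist = df.storageLevel
```

Since both names point at one object, it does not matter which of them is used to call unpersist.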

Spark automatically evicts old cached partitions in least-recently-used (LRU) order. However, if you need to free the cache manually, dataFrame.unpersist() works as you expect.
Refer to http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence for details.

Related

How do I understand that caching is used in Spark?

In my Scala/Spark application, I create a DataFrame that I plan to use several times throughout the program. That is why I decided to call the .cache() method on that DataFrame. As you can see, inside the loop I filter the DataFrame several times with different values. For some reason the .count() method always returns the same result. In fact, it must return two different count values. I also notice strange behavior in Mesos: it feels like the .cache() method is not being executed. After creating the DataFrame, the program reaches the line if (!df.head(1).isEmpty) and spends a very long time on it. I assumed that the caching process would run for a long time, and that the other processes would use this cache and run quickly. What do you think is the problem?
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
var df: DataFrame = spark
  .read
  .option("delimiter", "|")
  .csv("/path_to_the_files/")
  .filter(col("col5").isin("XXX", "YYY", "ZZZ"))
df.cache()
var array1 = Array("111", "222")
var array2 = Array("333")
var storage = Array(array1, array2)
if (!df.head(1).isEmpty) {
  for (item <- storage) {
    df.filter(
      col("col1").isin(item:_*)
    )
    println("count: " + df.count())
  }
}
In fact, it must return two different count values.
Why? You are calling it on the same df. Maybe you meant something like
val df1 = df.filter(...)
println("count: " + df1.count())
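To illustrate the suggestion, here is a small sketch in which the count is taken on the DataFrame returned by filter rather than on the original one. The in-memory input data, the local session, and the names counts and total are assumptions made for the example; the column name and the filter values follow the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimentation
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("filter-count-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the CSV input
val df = Seq("111", "222", "333").toDF("col1").cache()

val storage = Seq(Seq("111", "222"), Seq("333"))

// filter returns a new DataFrame; count has to be called on that result
val counts = storage.map { item =>
  val filtered = df.filter(col("col1").isin(item: _*))
  filtered.count()
}

// df itself is not changed by the filters
val total = df.count()
```

With this input, counts holds one value per filter (2 and 1), while total stays at 3.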
I assumed that the caching process would run for a long time, and the other processes would use this cache and run quickly.
It does, but only when the first action which depends on this dataframe is executed, and head is that action. So you should expect exactly
the program goes to this part of code if (!df.head(1).isEmpty) and performs it for a very long time
Without caching, you'd also get the same time for both df.count() calls, unless Spark detects it and enables caching on its own.
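The lazy behaviour of cache can be observed with a rough sketch like the one below. It counts how often a deliberately instrumented UDF runs; the accumulator, the UDF, and the generated input are all assumptions made for illustration, and accumulator values updated inside transformations are only a debugging aid, not an exact measurement.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for experimentation
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("lazy-cache-sketch")
  .getOrCreate()

// Counts how many times the "expensive" step is evaluated
val evaluations = spark.sparkContext.longAccumulator("evaluations")
val expensive = udf { (x: Long) => evaluations.add(1L); x }

val df = spark.range(10).select(expensive(col("id")).as("id")).cache()

// cache() alone runs nothing, so the UDF has not been called yet
val beforeAnyAction = evaluations.value.longValue

// The first action computes the data and fills the cache
df.count()
val afterFirstCount = evaluations.value.longValue

// The second action reads from the cache, so the UDF is not called again
df.count()
val afterSecondCount = evaluations.value.longValue
```

The cost of building the cache is paid by whichever action runs first, which matches the long-running head call described in the question.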

Spark - use dataframe many times without many unloads

I have a question: how can I reuse a DataFrame without unloading the data from Redshift again?
val companiesData = spark.read.format("com.databricks.spark.redshift")
  .option("url","jdbc:redshift://xxxx:5439/cf?user="+user+"&password="+password)
  .option("query","select * from cf_core.company")
  //.option("dbtable",schema+"."+table)
  .option("aws_iam_role","arn:aws:iam::xxxxxx:role/somerole")
  .option("tempdir","s3a://xxxxx/Spark")
  .load()
// companiesData (defined above) is assumed to be in scope here
class test {
  val secondDF = filteredDF(companiesData)
  def filteredDF(df: DataFrame): DataFrame = {
    val result = df.select("companynumber")
    result
  }
}
In this case the data will be unloaded twice: first for select * from the table, and a second time for the select of only companynumber. How can I unload the data once and operate on it many times? This is a serious problem for me. Thanks for the help.
By "unload", do you mean read the data? If so, why are you sure it's being read twice? In fact, you don't have any action in your code, so I'm not even sure if the data is being read at all. If you do try to access secondDF somewhere else in the code, spark should only read the column you select in your class 'test'. I'm not 100% sure of this because I've never used redshift to load data into spark before.
In general, if you want to reuse a dataframe, you should cache it using
companiesData.cache()
Then the first action you call on the DataFrame materializes it in memory, and later actions reuse the cached data instead of reading it from the source again.
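A minimal sketch of that pattern follows. It replaces the Redshift source with a tiny in-memory DataFrame, so the data, the extra columns, thirdDF, and the plan check are assumptions for illustration only; filteredDF mirrors the function from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: a local session for experimentation
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("reuse-cached-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the DataFrame loaded from Redshift
val companiesData: DataFrame = Seq(
  ("001", "Acme", "US"),
  ("002", "Globex", "DE")
).toDF("companynumber", "name", "country")

companiesData.cache()
companiesData.count() // the first action materializes the cache

def filteredDF(df: DataFrame): DataFrame = df.select("companynumber")

// Both derived DataFrames are built on top of the cached data
val secondDF = filteredDF(companiesData)
val thirdDF = companiesData.filter($"country" === "US")

// The optimized plan of a derived DataFrame should refer to the cached relation
val usesCache =
  secondDF.queryExecution.optimizedPlan.toString.contains("InMemoryRelation")
val rows = secondDF.count()
```

Calling companiesData.unpersist() once the derived results are no longer needed releases the memory again.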

Spark DataFrame row count is inconsistent between runs

When I run my Spark job (version 2.1.1) on EMR, each run counts a different number of rows in a DataFrame. I first read data from S3 into 4 different DataFrames, and these counts are always consistent. Then, after joining the DataFrames, the result of the join has different counts. Afterwards I also filter the result, and that also has a different count on each run. The variations are small, 1-5 rows of difference, but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
  .join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
  .join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
  .join(chartSiteInstance, impJoinKey, "left")
  .withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
  .withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using filter method instead of where, but with same results
Any thought?
Thanks
Nir
Is it possible that one of the data sources changes over time?
Since impressionsJoined is not cached, Spark will re-evaluate it from scratch on every action, and that includes reading the data again from the source.
Try caching impressionsJoined after the join.

Recursively adding rows to a dataframe

I am new to spark. I have some json data that comes as an HttpResponse. I'll need to store this data in hive tables. Every HttpGet request returns a json which will be a single row in the table. Due to this, I am having to write single rows as files in the hive table directory.
But I feel that having too many small files will reduce speed and efficiency. So is there a way I can repeatedly add new rows to the DataFrame and write it to the Hive table directory all at once? I feel this would also reduce the runtime of my Spark code.
Example:
for(i <- 1 to 10){
  newDF = hiveContext.read.json("path")
  df = df.union(newDF)
}
df.write()
I understand that the dataframes are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track. What you want to do is obtain the individual records as a Seq[DataFrame], and then reduce the Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"
(0 until BatchSize).
  map(_ => hiveContext.read.json("path")).
  reduce(_ union _).
  write.insertInto(HiveTableName)
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
  jsonArray.
    map { parameter =>
      obtainRecord(parameter)
    }.
    reduce(_ union _)
batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
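For experimenting with the batching idea without Hive or HTTP, a sketch along the following lines may help. The JSON strings stand in for the HTTP response bodies and are purely hypothetical. The second variant is not part of the answer above: it swaps the union for a single parse of all strings and assumes Spark 2.2 or later, where spark.read.json accepts a Dataset[String].

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Assumption: a local session for experimentation (Spark 2.2+)
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("batch-json-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the bodies of the HTTP responses
val responses: Seq[String] = Seq(
  """{"id": 1, "name": "a"}""",
  """{"id": 2, "name": "b"}""",
  """{"id": 3, "name": "c"}"""
)

// Variant 1: one DataFrame per record, reduced to a single DataFrame with union
val perRecord: Seq[DataFrame] = responses.map(r => spark.read.json(Seq(r).toDS()))
val unioned: DataFrame = perRecord.reduce(_ union _)

// Variant 2: collect all JSON strings into one Dataset[String] and parse once
val jsonDs: Dataset[String] = responses.toDS()
val parsedOnce: DataFrame = spark.read.json(jsonDs)

val unionedCount = unioned.count()
val parsedCount = parsedOnce.count()
```

Either result can then be written in one go, for example with write.insertInto as shown above. The single-parse variant avoids building one small DataFrame per record, which may matter when the batch is large.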
You are clearly misusing Spark. Apache Spark is an analytical system, not a database API. There is no benefit in using Spark to modify a Hive database like this. It will only bring a severe performance penalty without benefiting from any of the Spark features, including distributed processing.
Instead you should use Hive client directly to perform transactional operations.
If you can batch-download all of the data (for example with a script using curl or some other program) and store it in a file first (or in many files, since Spark can load an entire directory at once), you can then load that file (or those files) into Spark all at once to do your processing. I would also check whether the web API has any endpoints that fetch all the data you need instead of just one record at a time.

Broadcast not happening while joining dataframes in Spark 1.6

Below is the sample code that I am running. When this Spark job runs, the DataFrame joins are performed using SortMergeJoin instead of BroadcastJoin.
def joinedDf(sqlContext: SQLContext,
             txnTable: DataFrame,
             countriesDfBroadcast: Broadcast[DataFrame]): DataFrame = {
  txnTable.as("df1").join(
    countriesDfBroadcast.value.withColumnRenamed("CNTRY_ID", "DW_CNTRY_ID").as("countries"),
    $"df1.USER_CNTRY_ID" === $"countries.DW_CNTRY_ID", "inner")
}
joinedDf(sqlContext, txnTable, countriesDfBroadcast).write.parquet("temp")
The broadcastjoin is not happening even when I specify a broadcast() hint in the join statement.
The optimizer is hashpartitioning the dataframe and it is causing data skew.
Has anyone seen this behavior?
I am running this on YARN using Spark 1.6 with HiveContext as the SQLContext. The Spark job runs on 200 executors, the data size of txnTable is 240 GB, and the data size of countriesDf is 5 MB.
Both the way you broadcast the DataFrame and the way you access it are incorrect.
Standard broadcasts cannot be used to handle distributed data structures. If you want to perform a broadcast join on a DataFrame, you should use the broadcast function, which marks the given DataFrame for broadcasting:
import org.apache.spark.sql.functions.broadcast
val countriesDf: DataFrame = ???
// Mark the small DataFrame for broadcasting
val tmp: DataFrame = broadcast(
  countriesDf.withColumnRenamed("CNTRY_ID", "DW_CNTRY_ID").as("countries")
)
// tmp already carries the broadcast hint, so it is not wrapped again
txnTable.as("df1").join(
  tmp, $"df1.USER_CNTRY_ID" === $"countries.DW_CNTRY_ID", "inner")
Internally, it collects tmp without converting it from the internal representation and then broadcasts it.
The arguments of join are evaluated eagerly. Even if it were possible to use SparkContext.broadcast with a distributed data structure, the broadcast value is evaluated locally before join is called. That's why your function works at all but doesn't perform a broadcast join.
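One way to check whether the hint takes effect is to compare the physical plans with and without it, roughly as sketched below. The sketch assumes Spark 2.x or later with a local session and small generated tables in place of the real ones. The two configuration settings are assumptions made so that only the explicit hint decides the join strategy; they are not recommendations for a production job.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col}

// Assumption: a local session for experimentation
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("broadcast-hint-sketch")
  .getOrCreate()
import spark.implicits._

// Assumption: turn off automatic broadcasting and adaptive execution
// so that the plans below depend only on the explicit hint
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.conf.set("spark.sql.adaptive.enabled", "false")

// Hypothetical stand-ins for the large and the small table
val txnTable = spark.range(1000)
  .select(col("id").as("TXN_ID"), (col("id") % 3).as("USER_CNTRY_ID"))
val countriesDf = Seq((0L, "US"), (1L, "DE"), (2L, "IN")).toDF("CNTRY_ID", "CNTRY_NAME")

val countries = countriesDf.withColumnRenamed("CNTRY_ID", "DW_CNTRY_ID").as("countries")
val joinCond = $"df1.USER_CNTRY_ID" === $"countries.DW_CNTRY_ID"

val withoutHint = txnTable.as("df1").join(countries, joinCond, "inner")
val withHint = txnTable.as("df1").join(broadcast(countries), joinCond, "inner")

// Inspect the physical plans as text
val planWithoutHint = withoutHint.queryExecution.executedPlan.toString
val planWithHint = withHint.queryExecution.executedPlan.toString

val hintedUsesBroadcast = planWithHint.contains("BroadcastHashJoin")
val unhintedUsesBroadcast = planWithoutHint.contains("BroadcastHashJoin")
```

The same check can be done interactively with withHint.explain(), which prints the physical plan to the console.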