Spark dataframe seems to be recomputed twice

Spark dataframe seems to be recomputed twice - scala

SOLVED: I solved the issue, that was due to a very stupid, silly, idiotic mistake in one of the first passages of the flow.
Basically, I was computing a dataframe that was written to a Hive table; this dataframe then needed to be used to create the temporaryDF after many passages, but I was originally querying the table from scratch instead of using a copy of the dataframe to-be-written in the table. The mistake lies in the fact that the just-computed dataframes was missing previous partitions (due to the specific logic of the flow), whereas next computations to create temporaryDF needed also at least two previous partitions. I don't know why, I can't remember when, I decided to cache the just-computed one, thus losing information and getting an empty one under Oozie (in Spark-Shell I was always using at least three partitions, due to manually updating the table after some time - each new partition came every 15min). I was probably in a late night working sprint and my brain decided it was worthy to mess it up.
I upvoted and accepted #thebluephantom answer because he is very right within the specific circumstance I was describing.
Original:
I'm having a strange behaviour using Spark-Submit with Spark v.2.2.0.2.6.4.105-1 (using Scala) in Hadoop 2 under an Oozie workflow vs using Spark-Shell.
I have a Hive table that contains records that keep track of some processes every 15 minutes. The table is overwritten every time with new records or 'old' records that still satisfy the logic of the processes of interest.
I keep track of 'the age' of the records through a column that I will here call times_investigated, which ranges from 1 to 9.
I create a temporary dataframe, let's call it temporayDF, that contains both the old and the new entries (both the types need to be present to run useful computations). This temporayDF is then split between the new entries and the old ones, based on $"times_investigated" === 1 and $"times_investigated > 1" (or =!= 1).
Then, the processed entries are merged with a union in a final dataframe that is then written into the original Hive table.
// Before, I run the query on the 'old' Hive table and the logic over old and new entries.
// I now have a temporary dataframe
val temporaryDF = previousOtherDF
.withColumn("original_col_new", conditions)
.withColumn("original_other_col_new", otherConditions)
.withColumn("times_investigated_new", nvl($"times_investigated" + 1, 1))
.select(
previousColumns,
$"original_col_new".as("original_col"),
$"original_other_col_new".as("original_other_col"),
$"times_investigated_new".as("times_investigated"))
.cache
// Now I need to split the temporayDF in 2 to run some other logic on the new entries.
val newEntriesDF = temporaryDF
.filter($"times_investigated" === 1)
.join(neededDF, conditions, "leftouter")
.join(otherNeededDF, conditions, "leftouter")
.groupBy(cols)
.agg(min(colOne),
max(colTwo),
min(colThree),
max(colFour))
.withColumn("original_col_five_new",
when(conditions).otherwise(somethingElse))
.withColumn("original_col_six_new",
when(conditions).otherwise(somethingElse))
.select(orderedColumns)
val oldEntriesDF = temporaryDF.filter($"times_investigated" > 1)
val finalTableDF = oldEntriesDF.union(newEntriesDF)
// Now I write the table
finalTableDF.createOrReplaceTempView(tempFinalTableDF)
sql("""INSERT OVERWRITE TABLE $finalTableDF
SELECT * FROM tempFinalTableDF """)
// I would then need to re-use the newly-computed table to process further information...
The Problem:
The Hive table does not present the new entries with times_investigated = 1. It just processes the old ones, so, after those 9 times an entry can stay inside the table, it gets completely empty.
I run some tests within Spark-Shell and everything worked perfectly for many iterations, even manually writing the Hive table from the shell produced the expected results in the Hive table, but when I launched the workflow under Oozie, the strange behavior appeared again.
What I noticed within Spark-Shell is that, after writing the Hive table, if I went to compute a temporaryDF.show(), the new entries would be updated to $"times_investigated" = 2!
I tried to create a copy of temporaryDF to work on separate dataframes with the new and the old entries, but also this copyOfTemporaryDF gets updated after writing the Hive table.
It seems that this re-computation is happening before writing the Hive table under Oozie.
I know that I can compute the operations in a different manner, but I need to find a quick temporary fix on the current flow if possible.
Above all, I would love to understand what is happening under the hood, in order to avoid getting myself in such a circumstance later on.
Do you guys have any clue and/or advice?
I tried caching the intermediate dataframes, but without success.
P.S. Sorry for the probably bad coding practices
EDIT. More context: the temporaryDF comes from other intermediate dataframes, used just once to compute this on of interest. The last passages that create temporaryDF are withColumn operations, where $"times_investigated" is updated with a custom nvl function (that works exactly like the SQL one) and never gave problems in older versions of the flow (see below for the passages).
Edit2: I also tried to merge the operations on new and old entries in one long chained series, so that temopraryDF is actually the final dataframe to be written in the Hive table, but the new entries with times_investigated = 1 are still not considered (yet I have no issues via Spark-Shell and .showing the dataframe after writing to table makes it re-compute, so the times investigated are +1).

Use .cache otherwise you will get re-computation. You should do this for the appropriate dataframe or RDD if the RDD or DF is to be used multiple times in a single Spark App - not even Action dependent, sometimes you get "skipped stages".
val temporaryDF = previousOperations...cache()
2 vals use temporaryDF and without caching the recomputations will be as you see, and they may well give different results. That should be cached.
Of course if a Worker dies, or the partition evicted, some recomputing is needed.
.cache may not be ideal for datasets larger than available cluster memory. Each partition that is evicted will be rebuilt from source and that is a costly affair.
Also, using suitable partitioning and iterating a few times be better than persisting / caching; but it all depends.

Related

Databricks - Reduce delta version compute time

I've got a process which is really bogged down by the version computing for the target delta table.
Little bit of context - there are other things that run, all contributing uniform structured dataframes that I want to persist in a delta table. Ultimately these are all compiled into lots_of_dataframes to be logged.
for i in lots_of_dataframes:
i.write.insertInto("target_delta_table")
# ... take a while to compute version
I've got in to the documentation but couldn't find any setting to ignore the version compute. I did see vacuuming, but not sure that'll do the since there will still be a lot of activity in a small window of time.
I know that I can union all of the dataframes together and just do the insert once, but I'm wondering if there is a more Databricks-ian way to do it. Like a configuration to only maintain 1 version at a time and not worry about computing for a restore.

Most probably, but it's hard to say exactly without details, the problem arise from the following facts:
Spark is lazy - the actual data processing doesn't happen until you perform action, like writing data into a destination table. So if you have a lot of transformations, etc., they will happen when you're writing data.
You're writing data in the loop - you can potentially speedup it a bit by doing a union of all tables into a single dataframe, that will be written into one go:
import functools
unioned = functools.reduce(lambda x,y: x.union(y), lots_of_dataframes)
unioned.write.insertInto("target_delta_table")

Pyspark Dataframe count taking too long

So we have a Pyspark Dataframe which has around 25k records. We are trying to perform a count/empty check on this and it is taking too long. We tried,
df.count()
df.rdd.isEmpty()
len(df.head(1))==0
Converted to Pandas and tried pandas_df.empty()
Tried the arrow option
df.cache() and df.persist() before the counts
df.repartition(n)
Tried writing the df to DBFS, but writing is also taking quite a long time(cancelled after 20 mins)
Could you please help us on what we are doing wrong.
Note : There are no duplicate values in df and we have done multiple joins to form the df

Without looking at the df.explain() it's challenging to know specifically the issue but it certainly seems like you have could have a skewed data set.
(Skew usually is represented in the Spark UI with 1 executor taking a lot longer than the other partitions to finish.) If you on a recent version of spark there are tools to help with this out of the box:
spark.sql.adaptive.enabled = true
spark.sql.adaptive.skewJoin.enabled = true
Count is not taking too long. It's taking the time it needs to, to complete what you asked spark to do. To refine what it's doing you should do things you are likely already doing, filter the data first before joining so only critical data is being transferred to the joins. Reviewing your data for Skew, and programming around it, if you can't use adaptive query.
Convince yourself this is a data issue. Limit your source [data/tables] to 1000 or 10000 records and see if it runs fast. Then one at a time, remove the limit from only one [table/data source] (and apply limit to all others) and find the table that is the source of your problem. Then study the [table/data source] and figure out how you can work around the issue.(If you can't use adaptive query to fix the issue.)
(Finally If you are using hive tables, you should make sure the table stats are up to date.)
ANALYZE TABLE mytable COMPUTE STATISTICS;

Bring data of DataFrame back to local node for further actions (count / show) in spark/scala

I'm using Spark 1.6 in Scala.
I know it's some of the ideas behind the Spark Framework. But I couldn't answer it to myself by reading different tutorials.. (maybe the wrong ones).
I joined two DataFrames to a new one (nDF). Now I know, it's not yet proceeded, as long I say show, first or count.
But since I want to do exactly this, I want to inspect nDF in different ways:
nDF.show
nDF.count
nDF.filter()
..and so on, it would each time take a long time, since the original DataFrames are big. Couldn't I bring/copy the data to this new one. So I could solve these new actions as quick as on the original sets? (First I thought it's 'collect', but it only returns a Array, no DataFrame)

This is a classic scenario. When you join 2 Dataframes spark doesn't do any operation as it evaluates lazily when an action called on the resulting dataframe . Action mean show, count, print etc.
Now when show, count is being called on nDF, spark is evaluating the resultant dataframe every time i.e once when you called show, then when count is being called and so on. This means internally it is performing map/reduce every time an action is called on the resultant dataframe.
Spark doesn't cache the resulting dataframe in memory unless it is hinted to do so by doing df.cache / df.persist.
So when you do
val nDF = a.join(b).persist
And then call the count/show it will evaluate the nDF once and store the resulting dataframe in memory. Hence subsequent actions will be faster.
However the fist evaluation might be little slower also you need to using little more executor memory.

If the memory available to you is good with respect to the size of your dataset, what you're probably looking for is df.cache(). If the size of your dataset is too much, consider using df.persist() as it allows different levels of persistence.
Hope this is what you're looking for. Cheers

How to distribute processing to find waldos in csv using spark scala in a clustered environment?

I have a spark cluster of 1 master, 3 workers. I have a simple, but gigantic CSV file like this:
FirstName, Age
Waldo, 5
Emily, 7
John, 4
Waldo, 9
Amy, 2
Kevin, 4
...
I want to get all the records where FirstName is waldo. Normally on one machine and local mode, if I have an RDD, I can do a ".parallelize()" to get an RDD, then assuming the variable is "mydata", I can do:
mydata.foreach(x => // check if the first row's first column value contains "Waldo", if so, then print it)
From my understanding, using the above method, every spark slave would have to perform that iteration on the entire gigantic csv to get the result instead of each machine processing a third of the file (correct me if I am wrong).
Problem is, if I have 3 machines, how do I make it so that:
The csv file is broken up into 3 different "sets" to each of the
workers so that each worker is working with a much smaller file
(1/3rd of the original)
Each worker processes it, and finds all the "FirstName=Waldo"s"
The resulting list of "Waldo" records are
reported back to me in a way that actually takes advantage of the
cluster.

Mmm, lot of points to make here. First, if you are using HDFS, you file is already partitioned, it is a distributed file system. You probably even have the data replicated 3 times, as that is the default (depends on your config though).
Second, Spark will indeed make use of this partitioning when you told it to load data, and will process chunks locally. Shuffling data around is only required when you want to, for instance, re-partition you data by some criteria, like keys in a key/value pair, etc.
Third, Spark is indeed great for doing batch processing and some datamining if you don't want to structure a database or don't have predefined access patterns. In short, for what you seem to need. You don't even need to write and compile code since you can run a Spark Shell and try with a few lines. I do recommend you to look at the docs, since you don't seem to have a clear grasp of the platform yet.
Fourth, I don't have an IDE or anything here, but the code you need should be something line this (sort of pseudocode, but should be VERY close):
sc
.textFile("my_hdfs_path")
.keyBy(_.split('\t')(0))
.filter(_._1 == "Waldo")
.map(_._2)
.saveAsTextFile("my_hdfs_out")
if not too big, you can also use collect to bring all results to the driver location instead of saving to file, but after that you are back in a single machine. Hope it helps!

Apache spark streaming - cache dataset for joining

I'm considering using Apache Spark streaming for some real-time work but I'm not sure how to cache a dataset for use in a join/lookup.
The main input will be json records coming from Kafka that contain an Id, I want to translate that id into a name using a lookup dataset. The lookup dataset resides in Mongo Db but I want to be able to cache it inside the spark process as the dataset changes very rarely (once every couple of hours) so I don't want to hit mongo for every input record or reload all the records in every spark batch but I need to be able to update the data held in spark periodically (e.g. every 2 hours).
What is the best way to do this?
Thanks.

I've thought long and hard about this myself. In particular I've wondered is it possible to actually implement a database DB in Spark of sorts.
Well the answer is kind of yes. First you want a program that first caches the main data set into memory, then every couple of hours does an optimized join-with-tiny to update the main data set. Now apparently Spark will have a method that does a join-with-tiny (maybe it's already out in 1.0.0 - my stack is stuck on 0.9.0 until CDH 5.1.0 is out).
Anyway, you can manually implement a join-with-tiny, by taking the periodic bi-hourly dataset and turning it into a HashMap then broadcasting it as a broadcast variable. What this means is that the HashMap will be copied, but only once per node (compare this with just referencing the Map - it would be copied once per task - a much greater cost). Then you take your main dataset and add on the new records using the broadcasted map. You can then periodically (nightly) save to hdfs or something.
So here is some scruffy pseudo code to elucidate:
var mainDataSet: RDD[KeyType, DataType] = sc.textFile("/path/to/main/dataset")
.map(parseJsonAndGetTheKey).cache()
everyTwoHoursDo {
val newData: Map[KeyType, DataType] = sc.textFile("/path/to/last/two/hours")
.map(parseJsonAndGetTheKey).toarray().toMap
broadcast(newData)
val mainDataSetNew =
mainDataSet.map((key, oldValue) => (key,
newData.get(key).map(newDataValue =>
update(oldValue, newDataValue))
.getOrElse(oldValue)))
.cache()
mainDataSetNew.someAction() // to force execution
mainDataSet.unpersist()
mainDataSet = mainDataSetNew
}
I've also thought that you could be very clever and use a custom partioner with your own custom index, and then use a custom way of updating the partitions so that each partition itself holds a submap. Then you can skip updating partitions that you know won't hold any keys that occur in the newData, and also optimize the updating process.
I personally think this is a really cool idea, and the nice thing is your dataset is already ready in memory for some analysis / machine learning. The down side is your kinda reinventing the wheel a bit. It might be a better idea to look at using Cassandra as Datastax is partnering with Databricks (people who make Spark) and might end up supporting some kind of thing like this out of box.
Further reading:
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
http://www.datastax.com/2014/06/datastax-unveils-dse-45-the-future-of-the-distributed-database-management-system

Here is a fairly simple work-flow:
For each batch of data:
Convert the batch of JSON data to a DataFrame (b_df).
Read the lookup dataset from MongoDB as a DataFrame (m_df). Then cache, m_df.cache()
Join the data using b_df.join(m_df, "join_field")
Perform your required aggregation and then write to a data source.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Spark dataframe seems to be recomputed twice - scala

Related

Databricks - Reduce delta version compute time

Pyspark Dataframe count taking too long

Bring data of DataFrame back to local node for further actions (count / show) in spark/scala

How to distribute processing to find waldos in csv using spark scala in a clustered environment?

Apache spark streaming - cache dataset for joining

Categories

Resources