I am sure this is due to my inability to fully grasp the concept of .par under the hood, but I am seeing something a bit strange when using it along with a ForkJoinPool.
I have an ETL process that uses multi-threading to make parallel requests to the source system (normal stuff - Postgres, SQL Server, Oracle, etc) and then does quite a bit of work before putting the data into Azure Data Lake storage, and visualizing the tables through Databricks. We are using Databricks for our processing as well. The ETL process is written in Scala and is a JAR file that gets pulled from our internal Nexus repo upon ETL start, for a given ETL pipeline.
So, we have a bit of code like this:
val jobConcurrency = 5
// Get all the "things" we want to ingest, for a given pipeline. (This is a spark dataframe)
val configDF = getIngestionObjects(....)
// Collect and make par
val configDFPAR = configDF.toJSON.collect().par
// Add to new pool. (I have used both scala and java ForkJoinPool, with the exact same results)
configDFPAR.tasksupport = new ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(jobConcurrency))
// Do things
configDFPAR.foreach { element =>
  // GO TO SOURCE
  // PARTITION AND CLEAN DATA
  // INSERT OR MERGE DATA
  // RUN OPTIMIZE STATEMENTS
  // CLEAN UP STAGING TABLES
  // BUILD BRONZE/SILVER/GOLD TABLES
  // DO ANY POST-PROCESSING
  // NEXT
}
If I use grouped(x) together with .par, it works exactly as I would expect: 5 tables come in, those 5 run in parallel, and the next 5 drop in once all 5 original tables finish. The issue is that one slow buffalo of a table causes the entire group to wait. I would rather keep everything under a single .par and get a queue-based approach, where a new table starts as soon as a slot frees up. What I am seeing instead is that the job starts 5 tables, then 5 more, then 5 more, and so on, without waiting for any of them to finish. I can watch the Spark logs and see that at some point I might have 8 or 16 or even 32 tables ingesting at the same time (depending on my cluster size).
So, is my understanding of .par wrong? I assume it hits some "checkpoint" where it says, "Ok, next", but I am surprised that it's using more than 5 threads at a time. Again, I am probably doing something wrong here.
Thanks for any suggestions!
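One possible explanation, which I have not verified against this cluster: a ForkJoinPool treats its parallelism as a target rather than a hard cap, and it can start extra "compensation" threads when a running task signals that it is blocked. A fixed-size thread pool does not do that. The queue behaviour described above can be sketched as follows. This is plain Python rather than Scala, ingest and the table names are hypothetical stand-ins, and on the JVM the analogous piece would be java.util.concurrent.Executors.newFixedThreadPool.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

JOB_CONCURRENCY = 5  # same limit as jobConcurrency in the question

running = 0          # tables being ingested right now
peak = 0             # highest value `running` ever reached
lock = threading.Lock()


def ingest(table):
    """Hypothetical stand-in for the per-table ETL work."""
    global running, peak
    with lock:
        running += 1
        peak = max(peak, running)
    time.sleep(0.01)  # pretend to call the source system
    with lock:
        running -= 1
    return table


tables = [f"table_{i}" for i in range(20)]

# A fixed-size pool never grows past max_workers. A free worker picks up
# the next table as soon as it finishes its current one, so one slow table
# only occupies one of the five slots.
with ThreadPoolExecutor(max_workers=JOB_CONCURRENCY) as pool:
    done = list(pool.map(ingest, tables))
```

The peak counter is only there to make the cap observable; it should never exceed JOB_CONCURRENCY.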
I've got a process which is really bogged down by the version computing for the target delta table.
Little bit of context - there are other things that run, all contributing uniform structured dataframes that I want to persist in a delta table. Ultimately these are all compiled into lots_of_dataframes to be logged.
for i in lots_of_dataframes:
    i.write.insertInto("target_delta_table")
    # ... takes a while to compute version
I've gone through the documentation but couldn't find any setting to skip the version compute. I did see vacuuming, but I'm not sure that'll do the trick, since there will still be a lot of activity in a small window of time.
I know that I can union all of the dataframes together and just do the insert once, but I'm wondering if there is a more Databricks-ian way to do it. Like a configuration to only maintain 1 version at a time and not worry about computing for a restore.
It's hard to say exactly without details, but most probably the problem arises from the following facts:
Spark is lazy - the actual data processing doesn't happen until you perform an action, like writing data into a destination table. So if you have a lot of transformations, they will all run when you write the data.
You're writing data in a loop - you can potentially speed it up a bit by doing a union of all the dataframes into a single dataframe that is written in one go:
import functools
unioned = functools.reduce(lambda x,y: x.union(y), lots_of_dataframes)
unioned.write.insertInto("target_delta_table")
SOLVED: I solved the issue, that was due to a very stupid, silly, idiotic mistake in one of the first passages of the flow.
Basically, I was computing a dataframe that was written to a Hive table. This dataframe was then needed to create temporaryDF after many passages. Originally I was querying the table from scratch for that, instead of using a copy of the dataframe that was about to be written to the table. The mistake lies in the fact that the just-computed dataframe was missing the previous partitions (due to the specific logic of the flow), whereas the later computations that create temporaryDF also needed at least two previous partitions. I don't know why, and I can't remember when, but I decided to cache the just-computed one, thus losing information and getting an empty one under Oozie (in Spark-Shell I was always using at least three partitions, because I was manually updating the table after some time, and each new partition arrived every 15 minutes). I was probably in a late-night working sprint and my brain decided it was worthy to mess it up.
I upvoted and accepted #thebluephantom answer because he is very right within the specific circumstance I was describing.
Original:
I'm having a strange behaviour using Spark-Submit with Spark v.2.2.0.2.6.4.105-1 (using Scala) in Hadoop 2 under an Oozie workflow vs using Spark-Shell.
I have a Hive table that contains records that keep track of some processes every 15 minutes. The table is overwritten every time with new records or 'old' records that still satisfy the logic of the processes of interest.
I keep track of 'the age' of the records through a column that I will here call times_investigated, which ranges from 1 to 9.
I create a temporary dataframe, let's call it temporaryDF, that contains both the old and the new entries (both types need to be present to run useful computations). This temporaryDF is then split between the new entries and the old ones, based on $"times_investigated" === 1 and $"times_investigated" > 1 (or =!= 1).
Then, the processed entries are merged with a union in a final dataframe that is then written into the original Hive table.
// Before, I run the query on the 'old' Hive table and the logic over old and new entries.
// I now have a temporary dataframe
val temporaryDF = previousOtherDF
  .withColumn("original_col_new", conditions)
  .withColumn("original_other_col_new", otherConditions)
  .withColumn("times_investigated_new", nvl($"times_investigated" + 1, 1))
  .select(
    previousColumns,
    $"original_col_new".as("original_col"),
    $"original_other_col_new".as("original_other_col"),
    $"times_investigated_new".as("times_investigated"))
  .cache

// Now I need to split the temporaryDF in 2 to run some other logic on the new entries.
val newEntriesDF = temporaryDF
  .filter($"times_investigated" === 1)
  .join(neededDF, conditions, "leftouter")
  .join(otherNeededDF, conditions, "leftouter")
  .groupBy(cols)
  .agg(min(colOne),
    max(colTwo),
    min(colThree),
    max(colFour))
  .withColumn("original_col_five_new",
    when(conditions).otherwise(somethingElse))
  .withColumn("original_col_six_new",
    when(conditions).otherwise(somethingElse))
  .select(orderedColumns)

val oldEntriesDF = temporaryDF.filter($"times_investigated" > 1)
val finalTableDF = oldEntriesDF.union(newEntriesDF)

// Now I write the table
finalTableDF.createOrReplaceTempView("tempFinalTableDF")
sql("""INSERT OVERWRITE TABLE $finalTableDF
       SELECT * FROM tempFinalTableDF """)
// I would then need to re-use the newly-computed table to process further information...
The Problem:
The Hive table does not present the new entries with times_investigated = 1. It just processes the old ones, so, after those 9 times an entry can stay inside the table, it gets completely empty.
I ran some tests within Spark-Shell and everything worked perfectly for many iterations. Even manually writing the Hive table from the shell produced the expected results in the Hive table. But when I launched the workflow under Oozie, the strange behavior appeared again.
What I noticed within Spark-Shell is that, after writing the Hive table, if I went to compute a temporaryDF.show(), the new entries would be updated to $"times_investigated" = 2!
I tried to create a copy of temporaryDF to work on separate dataframes with the new and the old entries, but also this copyOfTemporaryDF gets updated after writing the Hive table.
It seems that this re-computation is happening before writing the Hive table under Oozie.
I know that I can compute the operations in a different manner, but I need to find a quick temporary fix on the current flow if possible.
Above all, I would love to understand what is happening under the hood, in order to avoid getting myself in such a circumstance later on.
Do you guys have any clue and/or advice?
I tried caching the intermediate dataframes, but without success.
P.S. Sorry for the probably bad coding practices
EDIT. More context: the temporaryDF comes from other intermediate dataframes, used just once to compute this one of interest. The last passages that create temporaryDF are withColumn operations, where $"times_investigated" is updated with a custom nvl function (that works exactly like the SQL one) and never gave problems in older versions of the flow (see below for the passages).
Edit2: I also tried to merge the operations on new and old entries into one long chained series, so that temporaryDF is actually the final dataframe to be written to the Hive table, but the new entries with times_investigated = 1 are still not considered (yet I have no issues via Spark-Shell, and calling .show on the dataframe after writing to the table makes it re-compute, so the times investigated are +1).
Use .cache, otherwise you will get re-computation. You should do this for the appropriate dataframe or RDD whenever it is going to be used multiple times in a single Spark app. It does not even depend on the action, and sometimes you will see "skipped stages" as a result.
val temporaryDF = previousOperations...cache()
Two vals use temporaryDF. Without caching, it is recomputed for each of them, as you are seeing, and the recomputations may well give different results. So temporaryDF should be cached.
Of course, if a worker dies or a partition is evicted, some recomputing is still needed.
.cache may not be ideal for datasets larger than the available cluster memory. Each partition that is evicted will be rebuilt from source, and that is a costly affair.
Also, using suitable partitioning and iterating a few times may be better than persisting / caching; but it all depends.
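To make the re-computation point concrete, here is a toy model of it. It is not Spark: the list table plays the Hive table, and the hypothetical LazyFrame class plays a dataframe whose lineage re-reads that table on every action. Note one deliberate simplification: Spark's .cache is itself lazy and only fills on the first action, while the materialize method below is eager to keep the sketch short.

```python
# `table` stands in for the Hive table that the flow reads and overwrites.
table = [{"id": "a", "times_investigated": 1}]


class LazyFrame:
    """Hypothetical stand-in for a dataframe: holds a recipe, not data."""

    def __init__(self, compute):
        self._compute = compute
        self._pinned = None

    def materialize(self):
        # Run the recipe once and keep the result (eager, unlike .cache).
        self._pinned = self._compute()
        return self

    def collect(self):
        # Without a pinned result, every action re-runs the recipe.
        if self._pinned is not None:
            return self._pinned
        return self._compute()


def bump_from_table():
    # Re-reads the *current* contents of `table` each time it is called.
    return [
        {**row, "times_investigated": row["times_investigated"] + 1}
        for row in table
    ]


# Uncached: overwrite the table, then look at the "same" frame again.
lazy = LazyFrame(bump_from_table)
first = lazy.collect()
table[:] = first            # like INSERT OVERWRITE
second = lazy.collect()     # recomputed from the new table contents

# Pinned: the result no longer depends on what the table holds now.
table[:] = [{"id": "a", "times_investigated": 1}]
pinned = LazyFrame(bump_from_table).materialize()
before = pinned.collect()
table[:] = before
after = pinned.collect()
```

Here first holds times_investigated = 2 while second holds 3, which mirrors the "+1 after writing" effect described in the question; before and after are identical.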
I have a spark cluster of 1 master, 3 workers. I have a simple, but gigantic CSV file like this:
FirstName, Age
Waldo, 5
Emily, 7
John, 4
Waldo, 9
Amy, 2
Kevin, 4
...
I want to get all the records where FirstName is waldo. Normally on one machine and local mode, if I have an RDD, I can do a ".parallelize()" to get an RDD, then assuming the variable is "mydata", I can do:
mydata.foreach(x => if (x.split(",")(0).trim == "Waldo") println(x)) // print the row if its first column is "Waldo"
From my understanding, using the above method, every spark slave would have to perform that iteration on the entire gigantic csv to get the result instead of each machine processing a third of the file (correct me if I am wrong).
Problem is, if I have 3 machines, how do I make it so that:
The csv file is broken up into 3 different "sets", one for each of the workers, so that each worker is working with a much smaller file (1/3rd of the original)
Each worker processes its set and finds all the "FirstName=Waldo" records
The resulting list of "Waldo" records is reported back to me in a way that actually takes advantage of the cluster.
Mmm, a lot of points to make here. First, if you are using HDFS, your file is already partitioned, since it is a distributed file system. You probably even have the data replicated 3 times, as that is the default (it depends on your config though).
Second, Spark will indeed make use of this partitioning when you tell it to load data, and will process chunks locally. Shuffling data around is only required when you want to, for instance, re-partition your data by some criterion, like keys in a key/value pair.
Third, Spark is indeed great for doing batch processing and some data mining if you don't want to structure a database or don't have predefined access patterns. In short, for what you seem to need. You don't even need to write and compile code, since you can run a Spark Shell and try things out with a few lines. I do recommend that you look at the docs, since you don't seem to have a clear grasp of the platform yet.
Fourth, I don't have an IDE or anything here, but the code you need should be something like this (sort of pseudocode, but it should be VERY close):
sc
  .textFile("my_hdfs_path")
  .keyBy(_.split(',')(0).trim) // key each line by its first column (the sample data is comma-separated)
  .filter(_._1 == "Waldo")
  .map(_._2)
  .saveAsTextFile("my_hdfs_out")
If the result is not too big, you can also use collect to bring all results to the driver instead of saving to a file, but after that you are back on a single machine. Hope it helps!
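For intuition only, the three steps from the question (split, filter each part, gather the results) can be mimicked on one machine. This is plain Python with threads standing in for workers, not Spark; split_into and find_waldo are hypothetical helper names, and the rows are the sample from the question without the header line.

```python
from concurrent.futures import ThreadPoolExecutor

lines = ["Waldo, 5", "Emily, 7", "John, 4", "Waldo, 9", "Amy, 2", "Kevin, 4"]


def split_into(rows, n):
    """Cut rows into n contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def find_waldo(chunk):
    """Keep only the rows whose first column is exactly 'Waldo'."""
    return [row for row in chunk if row.split(",")[0].strip() == "Waldo"]


# Step 1: one chunk per "worker".
partitions = split_into(lines, 3)

# Step 2: each worker filters only its own chunk.
with ThreadPoolExecutor(max_workers=3) as workers:
    partial = list(workers.map(find_waldo, partitions))

# Step 3: gather the per-chunk results, like collect on the driver.
waldos = [row for part in partial for row in part]
```

Spark does the splitting for you based on how the file is stored, so in practice only the filter needs to be written.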
I am trying to run a Spark job over data from multiple Cassandra tables which are grouped as part of the job. I am trying to get an end-to-end run with a huge data set of 13m data points, and it has failed at multiple points. As I fix each failure and move ahead, I encounter the next problem, fix it, and restart the job again. Is there a way to speed up the testing cycle on real data so that I can restart/resume a previously failed job from a specific checkpoint?
You can checkpoint your RDDs to disk at various midpoints, which would let you restart from there if necessary. You would have to save the intermediates as a sequence file or text file, and do a little work to make sure everything goes to and from disk cleanly.
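The save-and-resume idea can be sketched without Spark. This is a minimal single-machine illustration, assuming the intermediate result fits in a JSON file; checkpointed and expensive_load are hypothetical names, and with RDDs the equivalent would be writing to and reading from a text or sequence file.

```python
import json
import os
import tempfile


def checkpointed(path, compute):
    """Return the result saved at `path`, computing and saving it first if absent."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    result = compute()
    with open(path, "w") as fh:
        json.dump(result, fh)
    return result


calls = {"load": 0}


def expensive_load():
    """Stand-in for a slow stage, e.g. reading and grouping the source tables."""
    calls["load"] += 1
    return [1, 2, 3, 4]


workdir = tempfile.mkdtemp()
stage1_path = os.path.join(workdir, "stage1.json")

first_run = checkpointed(stage1_path, expensive_load)   # computes and saves
second_run = checkpointed(stage1_path, expensive_load)  # reads the saved file
```

After a failure in a later stage, a rerun picks up the saved file and skips the slow stage; deleting the file forces it to be recomputed.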
I find it more useful to start up the spark-shell and build my data flow in there. If you can identify a subset of your data which is representative, even better. Once you get into the REPL you can create RDDs, check the first value or take(100) and print them to stdout, count various result data sets, and so on. The REPL is what makes spark 10x more productive than hadoop for me.
Once I have built, in the REPL, a flow of transformations and actions that gets me the result that I need, then I can form it into a scala file and refactor that to be clean; extract functions that can be reused and unit tested, tune the parallelism, whatever.
I often find myself going back into the REPL when I need to extend my data flow, so I copy and paste code from my scala file to get to a good starting point, and experiment with the extension from there.
I'm considering using Apache Spark streaming for some real-time work but I'm not sure how to cache a dataset for use in a join/lookup.
The main input will be json records coming from Kafka that contain an Id, I want to translate that id into a name using a lookup dataset. The lookup dataset resides in Mongo Db but I want to be able to cache it inside the spark process as the dataset changes very rarely (once every couple of hours) so I don't want to hit mongo for every input record or reload all the records in every spark batch but I need to be able to update the data held in spark periodically (e.g. every 2 hours).
What is the best way to do this?
Thanks.
I've thought long and hard about this myself. In particular, I've wondered whether it is possible to actually implement a database of sorts in Spark.
Well, the answer is kind of yes. You want a program that first caches the main data set in memory, then every couple of hours does an optimized join-with-tiny to update the main data set. Now apparently Spark will have a method that does a join-with-tiny (maybe it's already out in 1.0.0 - my stack is stuck on 0.9.0 until CDH 5.1.0 is out).
Anyway, you can manually implement a join-with-tiny, by taking the periodic bi-hourly dataset and turning it into a HashMap then broadcasting it as a broadcast variable. What this means is that the HashMap will be copied, but only once per node (compare this with just referencing the Map - it would be copied once per task - a much greater cost). Then you take your main dataset and add on the new records using the broadcasted map. You can then periodically (nightly) save to hdfs or something.
So here is some scruffy pseudo code to elucidate:
// parseJsonAndGetTheKey, update and everyTwoHoursDo are placeholders you would supply
var mainDataSet: RDD[(KeyType, DataType)] = sc.textFile("/path/to/main/dataset")
  .map(parseJsonAndGetTheKey).cache()

everyTwoHoursDo {
  val newData: Map[KeyType, DataType] = sc.textFile("/path/to/last/two/hours")
    .map(parseJsonAndGetTheKey).collect().toMap
  // ship the map to each node once instead of once per task
  val newDataBc = sc.broadcast(newData)

  val mainDataSetNew = mainDataSet.map { case (key, oldValue) =>
    (key, newDataBc.value.get(key)
      .map(newDataValue => update(oldValue, newDataValue))
      .getOrElse(oldValue))
  }.cache()

  mainDataSetNew.count() // any action will do; this forces execution
  mainDataSet.unpersist()
  mainDataSet = mainDataSetNew
}
I've also thought that you could be very clever and use a custom partitioner with your own custom index, and then use a custom way of updating the partitions so that each partition itself holds a submap. Then you can skip updating partitions that you know won't hold any keys that occur in newData, and also optimize the updating process.
I personally think this is a really cool idea, and the nice thing is that your dataset is already in memory, ready for some analysis / machine learning. The downside is that you're kind of reinventing the wheel a bit. It might be a better idea to look at using Cassandra, as Datastax is partnering with Databricks (the people who make Spark) and might end up supporting something like this out of the box.
Further reading:
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
http://www.datastax.com/2014/06/datastax-unveils-dse-45-the-future-of-the-distributed-database-management-system
Here is a fairly simple work-flow:
For each batch of data:
Convert the batch of JSON data to a DataFrame (b_df).
Read the lookup dataset from MongoDB as a DataFrame (m_df), then cache it with m_df.cache()
Join the data using b_df.join(m_df, "join_field")
Perform your required aggregation and then write to a data source.
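The steps above do not say when the cached lookup gets refreshed, which the question asks for (roughly every 2 hours). One way to think about that part is a lookup holder that reloads itself once it is older than a time-to-live. The following is a plain-Python sketch of the idea, not Spark or MongoDB code: RefreshingLookup, load_names and enrich are hypothetical names, and the clock is injected only so the behaviour can be shown without waiting.

```python
import time


class RefreshingLookup:
    """Caches a lookup dict and reloads it once it is older than ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        if self._data is None or now - self._loaded_at >= self._ttl:
            self._data = self._loader()   # stand-in for reading from MongoDB
            self._loaded_at = now
        return self._data


loads = []


def load_names():
    """Hypothetical loader; records each call so reloads can be counted."""
    loads.append(1)
    return {1: "alice", 2: "bob"}


fake_now = [0.0]  # manually advanced clock, in seconds
lookup = RefreshingLookup(load_names, ttl_seconds=7200, clock=lambda: fake_now[0])


def enrich(batch):
    """Translate each record's id into a name; unknown ids get None."""
    names = lookup.get()
    return [{**rec, "name": names.get(rec["id"])} for rec in batch]


out1 = enrich([{"id": 1}, {"id": 3}])  # first batch triggers the initial load
fake_now[0] = 3600.0
out2 = enrich([{"id": 2}])             # one hour later: served from the cache
fake_now[0] = 7200.0
out3 = enrich([{"id": 1}])             # two hours later: lookup is reloaded
```

In a streaming job the same check would run on the driver once per batch, before the join, so that the lookup source is hit at most once per refresh interval.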