Selectively load data into Spark DataFrame - mongodb

I have a use case where I want to read a subset of the data and then perform some operations on it.
I am doing something like this in Spark:
DataFrame salesDf = sqlContext.read().format("com.stratio.datasource.mongodb").options(options).load();
This seems to load all the data in the table/collection. When I then perform a select operation using
salesDf.registerTempTable("sales");
sqlContext.sql("SELECT * FROM sales WHERE exit_date IS 08-08-2016").show();
it seems that the select operation is being performed on the DataFrame that was just loaded in full.
My table/collection has close to 1 billion records and I want to process just 100,000 of them; loading all 1 billion records into an object only to filter them afterwards seems like a complete waste of space to me.
Please excuse the naivety of my question; I am very new to Spark. It would be wonderful if someone could show me a way to load just a subsection of the table data instead of loading everything and then processing it.
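For what it's worth, a minimal PySpark sketch of the usual approach (the same idea applies to the Java API): load() is lazy, so applying the filter before any action lets Spark push the predicate down to the data source, provided the connector supports predicate pushdown. The connection options below are placeholders.

options = {"host": "localhost:27017", "database": "mydb", "collection": "sales"}  # hypothetical options

salesDf = (sqlContext.read
           .format("com.stratio.datasource.mongodb")
           .options(**options)
           .load())

# No data has been materialized yet; the filter becomes part of the query plan,
# so only matching documents need to be fetched when an action runs.
subsetDf = salesDf.filter(salesDf.exit_date == "2016-08-08")
subsetDf.show()

Even without pushdown, the DataFrame is not "an object holding 1B records" on the driver; Spark reads the data in partitions on the executors when an action runs.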

Related

Pyspark Dataframe count taking too long

So we have a PySpark DataFrame which has around 25k records. We are trying to perform a count/empty check on it and it is taking too long. We tried:
df.count()
df.rdd.isEmpty()
len(df.head(1))==0
Converted to Pandas and tried pandas_df.empty()
Tried the arrow option
df.cache() and df.persist() before the counts
df.repartition(n)
Tried writing the df to DBFS, but writing is also taking quite a long time (cancelled after 20 mins)
Could you please help us figure out what we are doing wrong?
Note: there are no duplicate values in the df, and we have done multiple joins to form it.
Without looking at the df.explain() it's challenging to know the specific issue, but it certainly seems like you could have a skewed data set.
(Skew usually shows up in the Spark UI as one executor taking a lot longer than the other partitions to finish.) If you are on a recent version of Spark, there are tools to help with this out of the box:
spark.sql.adaptive.enabled = true
spark.sql.adaptive.skewJoin.enabled = true
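These can also be set at runtime from the session, e.g.:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")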
Count is not taking too long; it's taking the time it needs to complete what you asked Spark to do. To refine what it's doing, you should do things you are likely already doing: filter the data before joining so only the critical data is transferred into the joins, and review your data for skew and program around it if you can't use adaptive query execution.
Convince yourself this is a data issue. Limit your source data/tables to 1,000 or 10,000 records and see if it runs fast. Then, one at a time, remove the limit from only one table/data source (keeping the limit on all the others) and find the table that is the source of your problem. Then study that table/data source and figure out how you can work around the issue (if you can't use adaptive query execution to fix it).
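A rough PySpark sketch of that debugging loop (table, column, and key names are placeholders):

# Cap every source at a small number of rows and confirm the job runs fast.
a = spark.table("table_a").filter("event_date = '2024-01-01'").limit(10000)  # filter early, then limit
b = spark.table("table_b").limit(10000)
c = spark.table("table_c").limit(10000)

joined = a.join(b, "key").join(c, "key")
print(joined.count())  # should return quickly if the join logic itself is fine

# Now remove the limit from one source at a time (keeping it on the others)
# and re-run the count; the source that makes the runtime blow up is the one
# to inspect for skew.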
Finally, if you are using Hive tables, you should make sure the table stats are up to date:
ANALYZE TABLE mytable COMPUTE STATISTICS;

Spark binary file and Delta Table

I have binary files (~3 MB each) that I receive in batches of ~20,000 files at a time. These files are used downstream for further processing, but I want to process them and store them in Delta tables.
I can do this easily:
df = spark.read.format("binaryFile").load(<path-to-batch>)
df = df.withColumn("id", expr("uuid()"))
dt = DeltaTable.forName("myTable")
dt.alias("a").merge(
    df.alias("b"),
    "a.path = b.path"
).whenNotMatchedInsert(
    values={"id": "b.id", "content": "b.content"}
).execute()
This already makes the table quite slow, but later I need to query certain IDs, collect them, and write them individually back out as binary files.
Questions:
Would my table benefit from a batch column and partition?
Should I partition by id? I know this is not ideal, but it might make querying individual rows easier?
Is there a better way to write the files out again, rather than .collect()? I have seen that when I select about 1,000 specific ids and write them out, about 10 minutes is spent just on the collect and less than a minute on the write. I do something like:
for row in df.collect():
    with open(row.id, "wb") as fw:
        fw.write(row.content)
As uuid() returns random values, I'm afraid we cannot use it to compare existing data with new records. (Sorry if I misunderstood the idea)
I don't think partitioning by id will help, as the id column obviously has high cardinality.
Instead of using collect(), which loads all records into the driver, I think it would be better to write the records of the Spark DataFrame directly and in parallel from all the worker nodes into a temporary location on ADLS first, and then aggregate the data files from that location.
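A minimal sketch of writing from the executors instead of the driver, assuming the target directory (here /dbfs/tmp/binary-out, a hypothetical mount path) is writable from the worker nodes:

def write_rows(rows):
    # Runs on the executors, so files are written in parallel instead of
    # being funnelled through the driver via collect().
    for row in rows:
        with open(f"/dbfs/tmp/binary-out/{row.id}", "wb") as fw:  # hypothetical path
            fw.write(row.content)

df.select("id", "content").foreachPartition(write_rows)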

Spark dataframe seems to be recomputed twice

SOLVED: I solved the issue; it was due to a very stupid, silly mistake in one of the first steps of the flow.
Basically, I was computing a dataframe that was written to a Hive table; this dataframe then needed to be used, many steps later, to create temporaryDF, but I was originally querying the table from scratch instead of using a copy of the dataframe about to be written into the table. The mistake lies in the fact that the just-computed dataframe was missing previous partitions (due to the specific logic of the flow), whereas the subsequent computations that create temporaryDF also needed at least two previous partitions. I don't know why, and I can't remember when, but I decided to cache the just-computed one, thus losing information and getting an empty one under Oozie (in Spark-Shell I was always working with at least three partitions, because I updated the table manually after some time - a new partition arrived every 15 minutes). I was probably in a late-night working sprint and my brain decided it was worth messing it up.
I upvoted and accepted thebluephantom's answer because it is very much right within the specific circumstance I was describing.
Original:
I'm seeing strange behaviour using spark-submit with Spark v2.2.0.2.6.4.105-1 (using Scala) on Hadoop 2 under an Oozie workflow, compared with using Spark-Shell.
I have a Hive table that contains records that keep track of some processes every 15 minutes. The table is overwritten every time with new records or 'old' records that still satisfy the logic of the processes of interest.
I keep track of the 'age' of the records through a column that I will call times_investigated here, which ranges from 1 to 9.
I create a temporary dataframe, let's call it temporaryDF, that contains both the old and the new entries (both types need to be present to run useful computations). This temporaryDF is then split between the new entries and the old ones, based on $"times_investigated" === 1 and $"times_investigated" > 1 (or =!= 1).
Then, the processed entries are merged with a union in a final dataframe that is then written into the original Hive table.
// Before, I run the query on the 'old' Hive table and the logic over old and new entries.
// I now have a temporary dataframe
val temporaryDF = previousOtherDF
  .withColumn("original_col_new", conditions)
  .withColumn("original_other_col_new", otherConditions)
  .withColumn("times_investigated_new", nvl($"times_investigated" + 1, 1))
  .select(
    previousColumns,
    $"original_col_new".as("original_col"),
    $"original_other_col_new".as("original_other_col"),
    $"times_investigated_new".as("times_investigated"))
  .cache
// Now I need to split the temporaryDF in 2 to run some other logic on the new entries.
val newEntriesDF = temporaryDF
  .filter($"times_investigated" === 1)
  .join(neededDF, conditions, "leftouter")
  .join(otherNeededDF, conditions, "leftouter")
  .groupBy(cols)
  .agg(min(colOne),
       max(colTwo),
       min(colThree),
       max(colFour))
  .withColumn("original_col_five_new",
              when(conditions).otherwise(somethingElse))
  .withColumn("original_col_six_new",
              when(conditions).otherwise(somethingElse))
  .select(orderedColumns)
val oldEntriesDF = temporaryDF.filter($"times_investigated" > 1)
val finalTableDF = oldEntriesDF.union(newEntriesDF)
// Now I write the table; finalHiveTable holds the name of the original Hive table
finalTableDF.createOrReplaceTempView("tempFinalTableDF")
sql(s"""INSERT OVERWRITE TABLE $finalHiveTable
        SELECT * FROM tempFinalTableDF""")
// I would then need to re-use the newly-computed table to process further information...
The Problem:
The Hive table does not contain the new entries with times_investigated = 1. It only processes the old ones, so after the 9 times an entry can stay inside the table, the table ends up completely empty.
I ran some tests within Spark-Shell and everything worked perfectly for many iterations; even manually writing the Hive table from the shell produced the expected results, but when I launched the workflow under Oozie, the strange behaviour appeared again.
What I noticed within Spark-Shell is that, after writing the Hive table, if I went to compute a temporaryDF.show(), the new entries would be updated to $"times_investigated" = 2!
I tried to create a copy of temporaryDF to work on separate dataframes with the new and the old entries, but also this copyOfTemporaryDF gets updated after writing the Hive table.
It seems that this re-computation is happening before writing the Hive table under Oozie.
I know that I can compute the operations in a different manner, but I need to find a quick temporary fix on the current flow if possible.
Above all, I would love to understand what is happening under the hood, in order to avoid getting myself in such a circumstance later on.
Do you guys have any clue and/or advice?
I tried caching the intermediate dataframes, but without success.
P.S. Sorry for the probably bad coding practices
EDIT. More context: temporaryDF comes from other intermediate dataframes, used just once to compute this one of interest. The last steps that create temporaryDF are withColumn operations, where $"times_investigated" is updated with a custom nvl function (which works exactly like the SQL one) and never gave problems in older versions of the flow (see above for the steps).
Edit 2: I also tried to merge the operations on new and old entries into one long chained series, so that temporaryDF is actually the final dataframe to be written into the Hive table, but the new entries with times_investigated = 1 are still not considered (yet I have no issues via Spark-Shell, and .show-ing the dataframe after writing to the table makes it re-compute, so the times investigated get +1).
Use .cache, otherwise you will get re-computation. You should do this for the appropriate dataframe or RDD if it is to be used multiple times in a single Spark app - it is not even action-dependent; sometimes you get "skipped stages".
val temporaryDF = previousOperations...cache()
Two vals use temporaryDF, and without caching the re-computations will happen as you observe, and they may well give different results. That dataframe should be cached.
Of course, if a worker dies or a partition is evicted, some re-computation is needed.
.cache may not be ideal for datasets larger than the available cluster memory. Each partition that is evicted will be rebuilt from source, and that is a costly affair.
Also, using suitable partitioning and iterating a few times may be better than persisting/caching; but it all depends.
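A minimal PySpark sketch of the pattern being described (names are illustrative): cache the shared dataframe once, derive both branches from the cached copy, and only then write.

from pyspark.sql.functions import col

# temporary_df is consumed by two downstream branches, so cache it once.
temporary_df = source_df.cache()  # source_df is a placeholder for the upstream computation

new_entries_df = temporary_df.filter(col("times_investigated") == 1)
old_entries_df = temporary_df.filter(col("times_investigated") > 1)

# Both branches read the cached data instead of re-running the lineage
# (which here would re-read the Hive table that is about to be overwritten).
final_df = old_entries_df.union(new_entries_df)
final_df.write.mode("overwrite").saveAsTable("my_table")  # hypothetical target table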

pyspark dataframe writing results

I am working on a project where I need to read 12 files; the average file size is 3 GB. I read them with RDDs and create a dataframe with spark.createDataFrame. Now I need to run 30 SQL queries on the dataframe; most of them need the output of the previous one (they depend on each other), so I save all my intermediate state in dataframes and create a temp view for each of them.
The program takes only 2 minutes for the execution part, but writing the results to a CSV file, showing them, or calling count() takes too much time. I have tried repartitioning, but it is still taking too much time.
1. What could be the solution?
2. Why does writing take so much time when all the processing takes only a small amount of time?
I solved the above problem with persist and cache in PySpark.
Spark is lazily evaluated. There are two types of Spark RDD operations: transformations and actions. A transformation is a function that produces a new RDD from the existing RDDs, while an action is performed when we want to work with the actual dataset; when an action is triggered, it returns a result instead of forming a new RDD the way a transformation does.
Every operation I did was just a transformation, so every time I used a particular dataframe, Spark would re-run its parent queries, since Spark is lazy. Adding persist stopped the parent queries from being called multiple times, and it saved a lot of processing time.
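A minimal sketch of that pattern (the queries and table names are placeholders):

# Persist each intermediate result so downstream queries reuse it instead of
# re-running the whole chain of parent queries.
step1_df = spark.sql("SELECT id, amount FROM raw_data WHERE amount > 0")  # hypothetical query
step1_df.persist()                          # cache is populated on the first action
step1_df.createOrReplaceTempView("step1")

step2_df = spark.sql("SELECT id, SUM(amount) AS total FROM step1 GROUP BY id")
step2_df.persist()
step2_df.createOrReplaceTempView("step2")

# Without the persist() calls, writing step2 would recompute step1 from scratch.
step2_df.write.mode("overwrite").csv("/tmp/step2_out")  # placeholder output path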

Apache spark streaming - cache dataset for joining

I'm considering using Apache Spark streaming for some real-time work but I'm not sure how to cache a dataset for use in a join/lookup.
The main input will be JSON records coming from Kafka that contain an id; I want to translate that id into a name using a lookup dataset. The lookup dataset resides in MongoDB, but I want to be able to cache it inside the Spark process since it changes very rarely (once every couple of hours). I don't want to hit Mongo for every input record or reload all the records in every Spark batch, but I do need to be able to update the data held in Spark periodically (e.g. every 2 hours).
What is the best way to do this?
Thanks.
I've thought long and hard about this myself. In particular, I've wondered whether it is possible to actually implement a database of sorts in Spark.
Well, the answer is kind of yes. First you want a program that caches the main data set into memory, then every couple of hours does an optimized join-with-tiny to update the main data set. Now, apparently Spark will have a method that does a join-with-tiny (maybe it's already out in 1.0.0 - my stack is stuck on 0.9.0 until CDH 5.1.0 is out).
Anyway, you can manually implement a join-with-tiny by taking the periodic bi-hourly dataset and turning it into a HashMap, then broadcasting it as a broadcast variable. This means the HashMap will be copied, but only once per node (compare this with just referencing the Map, which would be copied once per task - a much greater cost). You then take your main dataset and add on the new records using the broadcast map. You can then periodically (e.g. nightly) save to HDFS or something.
So here is some scruffy pseudo code to elucidate:
var mainDataSet: RDD[(KeyType, DataType)] = sc.textFile("/path/to/main/dataset")
  .map(parseJsonAndGetTheKey).cache()

everyTwoHoursDo {
  val newData: Map[KeyType, DataType] = sc.textFile("/path/to/last/two/hours")
    .map(parseJsonAndGetTheKey).collect().toMap

  // Broadcast the small dataset so each node receives a single copy
  val newDataBroadcast = sc.broadcast(newData)

  val mainDataSetNew = mainDataSet.map { case (key, oldValue) =>
    (key,
     newDataBroadcast.value.get(key)
       .map(newDataValue => update(oldValue, newDataValue))
       .getOrElse(oldValue))
  }.cache()

  mainDataSetNew.someAction() // to force execution

  mainDataSet.unpersist()
  mainDataSet = mainDataSetNew
}
I've also thought that you could be very clever and use a custom partitioner with your own custom index, and then use a custom way of updating the partitions so that each partition itself holds a sub-map. Then you can skip updating partitions that you know won't hold any keys that occur in the newData, and also optimize the updating process.
I personally think this is a really cool idea, and the nice thing is your dataset is already sitting in memory, ready for some analysis / machine learning. The downside is that you're kind of reinventing the wheel a bit. It might be a better idea to look at using Cassandra, as Datastax is partnering with Databricks (the people who make Spark) and might end up supporting something like this out of the box.
Further reading:
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
http://www.datastax.com/2014/06/datastax-unveils-dse-45-the-future-of-the-distributed-database-management-system
Here is a fairly simple work-flow:
For each batch of data:
Convert the batch of JSON data to a DataFrame (b_df).
Read the lookup dataset from MongoDB as a DataFrame (m_df), then cache it: m_df.cache()
Join the data using b_df.join(m_df, "join_field")
Perform your required aggregation and then write to a data source.
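A minimal PySpark sketch of that workflow; the MongoDB format string and options below are placeholders that depend on which connector version you use, and the paths and field names are illustrative:

# Lookup dataset from MongoDB, cached so subsequent batches reuse it.
m_df = (spark.read.format("mongodb")                 # format name depends on the connector
        .option("database", "mydb")                  # hypothetical options
        .option("collection", "id_to_name")
        .load()
        .cache())

# For each batch of JSON data:
b_df = spark.read.json("/path/to/batch.json")        # placeholder batch input

# Join the incoming batch against the cached lookup, aggregate, and write out.
result = (b_df.join(m_df, "id")
          .groupBy("name")
          .count())
result.write.mode("append").parquet("/tmp/output")   # placeholder sink

To refresh the lookup every couple of hours, unpersist m_df and re-run the read/cache step.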