Multiple data inserts in same dataframe - scala

We have customer data in one Hive table and sales data in another Hive table, which holds terabytes of data. We are trying to pull the sales data for multiple customers and save it to a file.
What we tried so far:
We tried a left outer join between the customer and sales tables, but because of the huge sales data it is not working:
val data = customer.join(sales, customer("id") === sales("customerID"), "left_outer")
So the alternative is to pull the data from the sales table for a specific list of customer regions, check whether each region's data contains the customer data, and if it does, save it to another dataframe, loading the data into that same dataframe for all the regions.
My question here is whether loading data into the same dataframe multiple times like this is supported in Spark.

If the sales dataframe is larger than the customer dataframe, you could simply switch the order of the dataframes in the join operation:
val data = sales.join(customer, sales("customerID") === customer("id"), "left_outer")
You could also add a hint for Spark to broadcast the smaller dataframe, though I believe it needs to be smaller than 2GB:
import org.apache.spark.sql.functions.broadcast
val data = sales.join(broadcast(customer), sales("customerID") === customer("id"), "left_outer")
The other approach, iteratively merging dataframes, is also possible. For this purpose you can use the union method (Spark 2.0+) or unionAll (older versions), which appends one dataframe to another. In the case where you have a list of dataframes that you want to merge with each other, you can use union together with reduce:
val dataframes = Seq(df1, df2, df3)
dataframes.reduce(_ union _)
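For example, a minimal sketch of the per-region approach from the question, assuming a hypothetical regions list and a region column on the sales table:
val regions = Seq("NORTH", "SOUTH", "EAST", "WEST") // hypothetical region list
val regionData = regions
  .map { r =>
    val regionSales = sales.filter(sales("region") === r) // pull one region at a time
    // keep only the region rows that have matching customer data
    regionSales.join(customer, regionSales("customerID") === customer("id"))
  }
  .reduce(_ union _) // append the per-region results into a single dataframe
Note that reduce requires a non-empty list; use reduceOption if the region list can be empty.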

Related

Spark Scala, grabbing the max value of 1 column, but keep all columns

I have a dataframe with 3 columns (customer, associations, timestamp).
I want to grab the latest customer by looking at timestamps.
Attempt
import org.apache.spark.sql.functions.{col, last}

val rdd = readRdd.select(col("value"))
val val_columns = Seq("value.timestamp").map(x => last(col(x)).alias(x))
rdd.orderBy("value.timestamp")
  .groupBy("value.customer")
  .agg(val_columns.head, val_columns.tail: _*)
  .show()
I believe the above code is working, but I am trying to figure out how to include all the columns (i.e. associations). If I understand correctly, adding it to the groupBy would mean I'm grabbing the latest combination of customer and associations together, but I only want to take the latest row based on the customer column, not look at multiple columns together.
Edit:
I might be onto something by adding:
val val_columns = Seq("value.lastRefresh", "value.associations")
.map(x => last(col(x)).alias(x))
Curious on thoughts.
If you want to return the latest customer data by the timestamp column, you can just order your dataframe by value.timestamp in descending order and apply limit(1):
import org.apache.spark.sql.functions._
df.orderBy(desc("value.timestamp")).limit(1).show()
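If you instead want the latest row per customer while keeping all of its columns, one common pattern (a sketch, not part of the original answer) is a window function that ranks the rows within each customer by timestamp:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// rank the rows within each customer, newest first
val w = Window.partitionBy("value.customer").orderBy(desc("value.timestamp"))

df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1) // keep only the newest row per customer
  .drop("rn")
  .show()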

Spark Persist method odd behavior

I am exploring the Spark persist function. It seems some dataframes are persisted whereas others are not, even though I have called the persist method on all of them.
Here is my code with an explanation:
// loading csv as dataframe and creating a view
val src_data=spark.read.option("header",true).csv("sources/data.csv")
src_data.createTempView("src_data")
**There is already a table called test in Hive**
Here I am creating 3 dataframes using src_data and test, and calling persist on all 3 for later use:
//dataframe 1
val changed_data= spark.sql("select sc.* from src_data sc inner join default.test t on sc.id=t.id where t.value!=sc.value or t.description!=sc.description ")
changed_data.persist().show()
changed_data.createOrReplaceTempView("changed_data")
// dataframe 2
val new_data= spark.sql("select * from src_data where id not in (select distinct id from default.test)")
println("new_data")
new_data.persist().show()
new_data.createOrReplaceTempView("new_data")
// dataframe 3
val unchanged_data= spark.sql("select * from test where id not in (select id from changed_data)")
unchanged_data.persist().show()
unchanged_data.createTempView("unchanged_data")
**Then I truncate the table test**
spark.sql("truncate table test")
***Then I print the 3 dataframes I persisted***
new_data.show()
unchanged_data.show()
changed_data.show()
Before truncating test I can see data for all 3 dataframes using show, but afterwards I see data for only one dataframe.
I get data only for new_data (which is dataframe 2), even though I persisted all 3 dataframes and all three use the table test.
Why this odd behaviour?
The dataframes will only be fully persisted if you invoke an action that goes through every record of the dataframe.
Remember that show() only shows the top 20 rows, as documented in the ScalaDocs, so only the partitions needed to produce those rows end up in the cache.
In contrast, you could apply something like count(), but that will obviously have some negative impact on your performance.
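A sketch of that fix, forcing each dataframe to be fully materialized before the table is truncated:
// count() scans every record, so the entire dataframe ends up in the cache
changed_data.persist().count()
new_data.persist().count()
unchanged_data.persist().count()
spark.sql("truncate table test") // the cached data now survives the truncate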

Spark: transform dataframe

I work with Spark 1.6.1 in Scala.
I have one dataframe, and I want to create different dataframes from it while reading the source only once.
For example, the dataframe has two columns, ID and TYPE, and I want to create two dataframes: one with the rows where type = A and the other with the rows where type = B.
I've checked other posts on Stack Overflow, but found only the option to read the dataframe twice.
However, I would like a solution with the best possible performance.
Kind regards.
Spark will read from the data source multiple times if you perform multiple actions on the data. The way to avoid this is to use cache(). That way, the data is saved to memory after the first action, which makes subsequent actions on the data faster.
Your two dataframes can be created in this way (shown with the Spark 2.x API), requiring only one read of the data source:
val df = spark.read.csv(path).cache()
val dfA = df.filter($"TYPE" === "A").drop("TYPE")
val dfB = df.filter($"TYPE" === "B").drop("TYPE")
The "TYPE" column is dropped as it should be unnecessary after the separation.

Spark DataFrame row count is inconsistent between runs

When I run my Spark job (version 2.1.1) on EMR, each run counts a different number of rows in a dataframe. I first read data from S3 into 4 different dataframes; these counts are always consistent. But after joining the dataframes, the result of the join has a different count on each run. Afterwards I also filter the result, and that too has a different count on each run. The variations are small, a difference of 1-5 rows, but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using the filter method instead of where, but with the same results.
Any thoughts?
Thanks
Nir
Is it possible that one of the data sources changes over time?
Since impressionsJoined is not cached, Spark will reevaluate it from scratch on every action, and that includes reading the data again from the source.
Try caching impressionsJoined after the join.
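A sketch of that suggestion, using your own join code and pinning the result so every downstream action reuses the same snapshot of the sources:
val impressionsJoined: DataFrame = impressionDsNoDuplicates
  .join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
  .join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
  .join(chartSiteInstance, impJoinKey, "left")
  .withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
  .withColumn("url", coalesce($"realUrl", $"url"))
  .cache() // the first action materializes this once; later actions reuse it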

Scripts for generating csv files for Spark Cassandra data

I want to generate csv files from a table in Cassandra as per the logic below.
val df = sc.parallelize(Seq(("a",1,"abc@gmail.com"), ("b",2,"def@gmail.com"),("a",1,"xyz@gmail.com"),("a",2,"abc@gmail.com"))).toDF("col1","col2","emailId")
Since there are 3 distinct emailIds, I need to generate 3 distinct csv files, one for each of the 3 queries below:
select * from table where emailId='abc@gmail.com'
select * from table where emailId='def@gmail.com'
select * from table where emailId='xyz@gmail.com'
How can I do this? Can anyone please help me with this?
Version:
Spark 1.6.2
Scala 2.10
Create a distinct list of the emails, then iterate over it. On each iteration, filter for only the rows matching that email and save the resulting dataframe as a csv file.
import sqlContext.implicits._
val emailData = sc.parallelize(Seq(("a",1,"abc@gmail.com"), ("b",2,"def@gmail.com"),("a",1,"xyz@gmail.com"),("a",2,"abc@gmail.com"))).toDF("col1","col2","emailId")
val distinctEmails = emailData.select("emailId").distinct().as[String].collect()
for (email <- distinctEmails) {
  val subsetEmailsDF = emailData.filter($"emailId" === email).coalesce(1)
  // ... save the subset dataframe as a csv file
}
Note: coalesce(1) sends all the data to one node, which can cause memory issues if the dataframe is too large.
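A sketch of the save step itself, assuming the spark-csv package (the usual csv writer on Spark 1.6) and a hypothetical output directory:
// goes inside the loop above; the output path is hypothetical
subsetEmailsDF.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save(s"/output/emails/${email.replace("@", "_at_")}") // sanitize the email for use in a path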