Spark: remove cancelled transaction rows - Scala

I am working with DataFrames in Scala for a banking process, and I need to remove some rows when a transaction is a cancellation. For example, if I have a cancellation, I must remove the previous row. If I have three consecutive cancellations, I must remove the three previous rows.
Initial DataFrame:
Expected DataFrame:
I would appreciate your help.

A combination of built-in functions, a UDF and a window function should get you your desired result (commented for clarity):
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

def windowSpec = Window.partitionBy("Account").orderBy("Sequence").rowsBetween(Long.MinValue, Long.MaxValue)
def filterUdf = udf((array: Seq[Long], sequence: Long) => !array.contains(sequence))

df.withColumn("collection", sum(when(col("Type") === "Cancellation", 1).otherwise(0)).over(windowSpec)) // total number of cancellations in each Account group
  .withColumn("Sequence", when(col("Type") === "Cancellation", col("Sequence") - col("collection")).otherwise(col("Sequence"))) // subtract that count from the sequence number to get the sequence number of the row being cancelled
  .withColumn("collection", collect_set(when(col("Type") === "Cancellation", col("Sequence")).otherwise(0)).over(windowSpec)) // collect the shifted sequence numbers of the cancellations
  .filter(filterUdf(col("collection"), col("Sequence"))) // filter out the rows whose sequence number was collected
  .drop("collection")
  .show(false)
which should give you
+-------+-----------+--------+
|Account|Type |Sequence|
+-------+-----------+--------+
|11047 |Aggregation|11 |
|1030583|Aggregation|1 |
|1030583|Aggregation|4 |
+-------+-----------+--------+
Note: this solution works only when the cancellations within each Account group are consecutive.

I think a map of stacks is useful in this case, keyed by account id: you push the Aggregation rows onto the stack until you encounter a Cancellation, then you pop the stack.
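A minimal, hedged sketch of that idea (the Txn case class and the dropCancelled helper are illustrative names, not from the question); it processes one account's rows in sequence order:
import scala.collection.mutable

// Hypothetical row type mirroring the columns in the question
case class Txn(account: String, txnType: String, sequence: Long)

// Push non-cancellation rows; each cancellation pops the most recent surviving row
def dropCancelled(rows: Seq[Txn]): Seq[Txn] = {
  val kept = mutable.ListBuffer.empty[Txn]
  rows.sortBy(_.sequence).foreach { row =>
    if (row.txnType == "Cancellation") {
      if (kept.nonEmpty) kept.remove(kept.size - 1) // pop
    } else {
      kept += row // push
    }
  }
  kept.toList
}
In Spark this could be applied per account with something like df.as[Txn].groupByKey(_.account).flatMapGroups((_, rows) => dropCancelled(rows.toSeq)), at the cost of materializing each account's rows on a single executor.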

Related

Spark Dataframe changing through execution

I'm fairly new to Spark, so most likely I have a huge gap in my understanding. Apologies in advance if what you see here is very silly. So, what I'm trying to achieve is:
Get a set of rows from a table in Hive (let's call it T_A) and save them in a DataFrame (let's call it DF_A). This is done.
Get extra information from another Hive table (T_B) and join it with DF_A to get a new Dataframe (DF_B). And then cache it. This is also done.
val DF_A = sparkSession.sql("select * from T_A where whatever=something").toDF()
val extraData = sparkSession.sql("select * from T_B where whatever=something").toDF()
val DF_B = DF_A.join(extraData,
  DF_A("something_else") === extraData("other_thing"), "left"
).cache()
Now, this is me assuming that Spark + Hive works similarly to a regular Java app + SQL, which is where I might need a hard course correction.
Here I attempt to store the N rows I transformed from DF_B (Tr1(DF_B)) into one of the Hive tables I used before (T_B), partitioned by column X. I use:
val DF_C = DF_B.map(row => {
Tr1(row)
}).toDF()
DF_C.write.mode(SaveMode.Overwrite).insertInto("T_B")
After saving it to this table, I want to reuse the information from DF_B (not the transformed data reinserted in T_B, but the joined data in DF_B based on previous state of T_B) to make a second transformation over it (Tr2(DF_B)).
I want to update the same N rows written in T_B with the data transformed by previous step, using an "INSERT OVERWRITE" operation and the same partition column X.
val DF_D = DF_B.map(row => {
Tr2(row)
}).toDF()
DF_D.write.mode(SaveMode.Overwrite).insertInto("T_B")
What I expect:
T_B having N rows.
Having DF_B unchanged, with N rows.
What is actually happening:
DF_B having 3*N rows.
T_B having 3*N rows.
Now, after some debugging, I found that DF_B has 3*N rows after the DF_C write finishes. So DF_D will have 3*N rows too, and that will cause T_B to have 3*N rows as well.
So, my question is... Is there a way to retain the original DF_B data and use it for the second map transformation, since it relies on the original DF_B state for the transformation process? Is there a reference somewhere I can read to know why this happens?
EDIT: I don't know if this is useful information, but I log the count of records before and after doing the first write, and I get the following:
val DF_C = DF_B.map(row => {
Tr1(row)
}).toDF()
Logger.info("DF_C.count {} - DF_B.count {}"...
DF_C.write.mode(SaveMode.Overwrite).insertInto("T_B")
Logger.info("DF_C.count {} - DF_B.count {}"...
With persist(MEMORY_AND_DISK), or no persist at all, instead of cache, and 3 test rows, I get:
DF_C.count 3 - DF_B.count 3
write
DF_C.count 3 - DF_B.count 9
With cache, I get:
DF_C.count 3 - DF_B.count 3
write
DF_C.count 9 - DF_B.count 9
Any idea?
Thank you so much.
In Spark, execution happens lazily, only when an action gets called.
So when you call an action twice on the same DataFrame (in your case DF_B), that DataFrame (DF_B) will be built and transformed twice from the beginning at execution time.
So try to persist your DataFrame DF_B before calling the first action; then you can use the same DataFrame for both Tr1 and Tr2.
After persisting, the DataFrame will be stored in memory/disk and can be reused multiple times.
You can learn more about persistence here.
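A minimal, hedged sketch of that advice, reusing the names from the question (the join condition is a placeholder, and the encoders/implicits needed for Tr1/Tr2 are assumed to be in scope as in the original code). Note that if cached partitions are evicted, Spark may still recompute DF_B from the now-modified T_B:
import org.apache.spark.sql.SaveMode
import org.apache.spark.storage.StorageLevel

val DF_B = DF_A.join(extraData, DF_A("something_else") === extraData("other_thing"), "left")
  .persist(StorageLevel.MEMORY_AND_DISK)
DF_B.count() // materialize the cache *before* the first write changes T_B underneath it

val DF_C = DF_B.map(row => Tr1(row)).toDF()
DF_C.write.mode(SaveMode.Overwrite).insertInto("T_B")

val DF_D = DF_B.map(row => Tr2(row)).toDF() // reuses the cached DF_B instead of re-reading T_B
DF_D.write.mode(SaveMode.Overwrite).insertInto("T_B")

DF_B.unpersist() // free the cache when done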

How to debug Spark dropDuplicates and join calls?

There is a table with duplicated rows. I am trying to remove the duplicates and keep the row with the latest my_date (if there are rows with the same my_date, it does not matter which one is used):
val dataFrame = readCsv()
.dropDuplicates("my_id", "my_date")
.withColumn("my_date_int", $"my_date".cast("bigint"))
import org.apache.spark.sql.functions.{min, max, grouping}
val aggregated = dataFrame
.groupBy(dataFrame("my_id").alias("g_my_id"))
.agg(max(dataFrame("my_date_int")).alias("g_my_date_int"))
val output = dataFrame.join(aggregated, dataFrame("my_id") === aggregated("g_my_id") && dataFrame("my_date_int") === aggregated("g_my_date_int"))
.drop("g_my_id", "g_my_date_int")
But after this code, when I grab the distinct my_id values, I get about 3000 fewer than in the source table. What could be the reason?
How can I debug this situation?
After doing dropDuplicates, do an except of this DataFrame against the original DataFrame; that should give some insight into the rows that are additionally getting dropped. Most probably there are null or empty values in those columns, and such rows are being treated as duplicates.
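A small sketch of that debugging step, assuming readCsv() and the column names from the question (the variable names here are illustrative, and spark.implicits._ is assumed to be imported as in the question):
val raw = readCsv()
val deduped = raw.dropDuplicates("my_id", "my_date")

// Rows present in the raw data but gone after dropDuplicates
val dropped = raw.except(deduped)
dropped.show(50, truncate = false)

// Check whether null/empty keys are collapsing several source rows into one
raw.filter($"my_id".isNull || $"my_date".isNull || $"my_id" === "").count()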

Apache Spark SQL: how to optimize chained join for dataframe

I have to make a left join between a principal data frame and several reference data frames, so it is a chained join computation, and I wonder how to make this efficient and scalable.
Method 1 is easy to understand and is also the current method, but I'm not satisfied with it: all the transformations are chained and wait for the final action to trigger the computation, and if I keep adding transformations and the data volume grows, Spark ends up failing, so this method is not scalable.
Method 1:
def pipeline(refDF1: DataFrame, refDF2: DataFrame, refDF3: DataFrame, refDF4: DataFrame, refDF5: DataFrame): DataFrame => DataFrame = {
val transformations: List[DataFrame => DataFrame] = List(
castColumnsFromStringToLong(ColumnsToCastToLong),
castColumnsFromStringToFloat(ColumnsToCastToFloat),
renameColumns(RenameMapping),
filterAndDropColumns,
joinRefDF1(refDF1),
joinRefDF2(refDF2),
joinRefDF3(refDF3),
joinRefDF4(refDF4),
joinRefDF5(refDF5),
calculate()
)
transformations.reduce(_ andThen _)
}
pipeline(refDF1, refDF2, refDF3, refDF4, refDF5)(principleDF)
Method 2: I've not found a real way to achieve my idea, but I hope to trigger the computation of each join immediately.
According to my tests, count() is too heavy for Spark and useless for my application, and I don't know how to trigger the join computation with a more efficient action. That kind of action is, in fact, the answer to this question.
val joinedDF_1 = castColumnsFromStringToLong(principleDF, ColumnsToCastToLong)
joinedDF_1.cache() // joinedDF_1 is not always reused, but some of these data frames are, so I add cache() to indicate the intended usage
joinedDF_1.count()
val joinedDF_2 = castColumnsFromStringToFloat(joinedDF_1, ColumnsToCastToFloat)
joinedDF_2.cache()
joinedDF_2.count()
val joinedDF_3 = renameColumns(joinedDF_2, RenameMapping)
joinedDF_3.cache()
joinedDF_3.count()
val joinedDF_4 = filterAndDropColumns(joinedDF_3)
joinedDF_4.cache()
joinedDF_4.count()
...
When you want to force the computation of a given join (or of any transformation that is not final) in Spark, you can use a simple show or count on your DataFrame. This kind of terminal point forces the computation of the result, because otherwise the action simply cannot be executed.
Only after this will your DataFrame be effectively stored in your cache.
Once you're finished with a given DataFrame, don't hesitate to unpersist it; this frees the data when your cluster needs more room for further computation.
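For example, applied to the Method 2 snippet above (same placeholder functions and configs as in the question), the count-then-unpersist pattern would look roughly like this:
val joinedDF_1 = castColumnsFromStringToLong(principleDF, ColumnsToCastToLong)
joinedDF_1.cache()
joinedDF_1.count() // forces the computation; only now is joinedDF_1 actually in the cache

val joinedDF_2 = castColumnsFromStringToFloat(joinedDF_1, ColumnsToCastToFloat)
joinedDF_2.cache()
joinedDF_2.count()

joinedDF_1.unpersist() // joinedDF_2 is materialized, so the previous stage can be freed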
You need to repartition your datasets by the join columns before calling the join transformation.
Example:
val df1Repart = df1.repartition(col("col1"), col("col2"))
val df2Repart = df2.repartition(col("col1"), col("col2"))
val joinDF = df1Repart.join(df2Repart, df1Repart("col1") === df2Repart("col1") && ...)
Try creating a new dataframe based on it.
Ex:
val dfTest = session.createDataFrame(df.rdd, df.schema).cache()
dfTest.storageLevel.useMemory // result should be true

Pyspark actions on dataframe [duplicate]

I have the following code:
val df_in = sqlcontext.read.json(jsonFile) // the file resides in hdfs
//some operations in here to create df as df_in with two more columns "terms1" and "terms2"
val intersectUDF = udf( (seq1:Seq[String], seq2:Seq[String] ) => { seq1 intersect seq2 } ) //intersects two sequences
val symmDiffUDF = udf( (seq1:Seq[String], seq2:Seq[String] ) => { (seq1 diff seq2) ++ (seq2 diff seq1) } ) //compute the difference of two sequences
val df1 = (df.withColumn("termsInt", intersectUDF(df("terms1"), df("terms2")))
.withColumn("termsDiff", symmDiffUDF(df("terms1"), df("terms2")))
.where( size(col("termsInt")) >0 && size(col("termsDiff")) > 0 && size(col("termsDiff")) <= 2 )
.cache()
) // add the intersection and difference columns and filter the resulting DF
df1.show()
df1.count()
The app works properly and fast up to the show(), but at the count() step it creates 40000 tasks.
My understanding is that df1.show() should trigger the full creation of df1, and that df1.count() should then be very fast. What am I missing here? Why is count() that slow?
Thank you very much in advance,
Roxana
show is indeed an action, but it is smart enough to know when it doesn't have to run everything. If you had an orderBy it would take very long too, but in this case all your operations are map operations and so there's no need to calculate the whole final table. However, count needs to physically go through the whole table in order to count it and that's why it's taking so long. You could test what I'm saying by adding an orderBy to df1's definition - then it should take long.
EDIT: Also, the 40k tasks are likely due to the number of partitions your DF is partitioned into. Try using df1.repartition(<a sensible number here, depending on cluster and DF size>) and running count again.
show() by default displays only 20 rows; if the first partitions already return 20 rows, the remaining partitions are not executed.
Note that show has several variants. show(false) only disables truncation of long values (it still shows at most 20 rows), while show(n) with a large n may need to execute more partitions and take more time. So show() equals show(20), which is a partial action.
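A quick illustration of those variants (the numbers are the defaults of Dataset.show):
df1.show()           // same as show(20): at most 20 rows, long values truncated
df1.show(false)      // still at most 20 rows, but values are not truncated
df1.show(100, false) // ask for 100 untruncated rows; may need to scan more partitions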

How to create key-value pairs DStream in Spark Streaming

I'm new to Spark Streaming. There's a project using Spark Streaming; the input is a key-value pair string like "productid,price".
The requirement is to process each line as a separate transaction, and to have an RDD produced every 1 second (i.e. a 1-second batch interval).
In each interval I have to calculate the total price for each individual product, like
select productid, sum(price) from T group by productid
My current thought is that I have to do the following steps:
1) split the whole input by \n: val lineMap = lines.map{x => x.split("\n")}
2) split each line by ",": val recordMap = lineMap.map{x => x.map{y => y.split(",")}}
Now I'm confused about how to make the first column the key and the second column the value, and then use the reduceByKey function to get the total sum.
Please advise.
Thanks
Once you have split each row, you can do something like this:
rowItems.map { case Array(product, price) => product -> price }
This way you obtain a DStream[(String, String)] on which you can apply pair transformations like reduceByKey (don't forget to import the required implicits).
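A minimal, hedged end-to-end sketch under those assumptions (the socket source on localhost:9999 is hypothetical, prices are parsed as Double so they can be summed, and malformed lines are not handled):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("ProductTotals").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batch interval

val lines = ssc.socketTextStream("localhost", 9999) // each record is one "productid,price" line
val totals = lines
  .map(_.split(","))
  .map { case Array(productId, price) => productId -> price.toDouble } // key-value pairs
  .reduceByKey(_ + _) // total price per product within each 1-second batch

totals.print()
ssc.start()
ssc.awaitTermination()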