Incremental update in rdd or dataframe apache spark - scala

I have a use case in which I have a set of data (e.g. a CSV file containing around 10 million rows and around 25 columns),
and I have a set of rules (around 1000) that I need to use to update the records. These rules have to execute sequentially.
I wrote code that loops over the rules and updates the data once per rule.
Suppose a rule looks like this:
if col1=5 and col2=10 then col25=updatedValue
rulesList.foreach(rule => {
  // data is a var holding the RDD; each rule replaces it with a newly mapped RDD
  data = data.map { case line(col1, col2, .., col25) =>
    if (rule) line(col1, col2, .., updatedValue)
    else line(col1, col2, .., col25)
  }
})
These rules execute sequentially, and at the end I get the updated records.
The problem is that with few rules and little data this runs fine, but with large data I get a StackOverflowError. The reason may be that the maps for all rules are only chained and then executed together at the end, like map-reduce.
Is there any way I can update this data incrementally?

Try mapping over the RDD once and looping over the rules inside the map; that means a lot less data movement. All the rules are applied locally to each record, producing the updated record, instead of creating 1000 RDDs.

Given a record in the RDD, if you can apply all the updates to it incrementally and independently of the other records, I would suggest you map first and then iterate through the rulesList inside the map:
val result = data.map { case line(col1, col2, ..., col25) =>
var col25_mutable = col25
rulesList.foreach{ rule =>
col25_mutable = if(rule) updatedValue else col25_mutable
}
line(col1, col2, ..., col25_mutable)
}
This approach should be thread-safe if rulesList is a simple iterable object, such as Array or List.
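To make the idea concrete, here is a minimal sketch on plain Scala collections rather than an RDD. The `Record` and `Rule` types, the three-column layout, and the sample rules are assumptions made up for illustration; the original post does not show how rules are represented. Each record is folded through the whole rule list, so a later rule sees the result of an earlier one.

```scala
object RuleDemo {
  // Hypothetical record: three columns stand in for the 25 in the question.
  final case class Record(col1: Int, col2: Int, col25: String)

  // Hypothetical rule shape: a condition plus the new value for col25.
  final case class Rule(matches: Record => Boolean, updatedValue: String)

  val rules: List[Rule] = List(
    Rule(r => r.col1 == 5 && r.col2 == 10, "first"),
    // This rule depends on the outcome of the previous one,
    // which is why the rules must run in order.
    Rule(r => r.col25 == "first", "second")
  )

  // Apply every rule, in order, to a single record.
  def applyRules(record: Record): Record =
    rules.foldLeft(record) { (current, rule) =>
      if (rule.matches(current)) current.copy(col25 = rule.updatedValue)
      else current
    }

  val data: List[Record] = List(Record(5, 10, "old"), Record(1, 2, "old"))

  // One pass over the data; all rules are applied inside the map.
  val result: List[Record] = data.map(applyRules)
}
```

On an RDD the same per-record function would presumably be passed to a single map call, so only one new RDD is created regardless of the number of rules.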
I hope it works for you, or that it at least helps you achieve your goal.
Cheers

Related

hash join in spark scala on pair rdd

I am trying to perform a partition+broadcast join in spark scala. I have a dictionary that I am broadcasting to all the nodes. The structure of the dictionary is as follows:
{ key: Option[List[Strings]] } // I created this dictionary using a groupByKey first and then called collectAsMap before broadcasting.
The above dictionary was created using the table whose structure is similar to the table mentioned below.
I have a table that is a pair RDD whose structure is as follows:
Col A | Col B
I am trying to perform a join as follows:
val join_output = table.flatMap{
case(key, value) => custom_dictionary.value.get(key).map(
otherValue => otherValue.foreach((value, _))
)
}
My goal is to get a pair-RDD as an output whose contents are ( from table, from list stored in the dictionary).
The code runs and compiles successfully but when I check the output, I only see this: "()" as the output being saved. Where am I going wrong?
I did have a look at some of the other posts that did reflect up to some extent on this matter, but none of the options worked. I request some guidance on this issue. Also, if there is a post that exactly points to this, please let me know.
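One likely cause of the "()" output is that foreach returns Unit, so the inner expression produces no pairs. The sketch below models the join on plain Scala collections to show the shape that does produce pairs; `table`, `dictionary`, and their sample contents are stand-ins invented for illustration, with `dictionary` playing the role of the broadcast value.

```scala
object HashJoinDemo {
  // Stand-in for the pair RDD (Col A, Col B).
  val table: Seq[(String, String)] =
    Seq(("k1", "a"), ("k2", "b"), ("k3", "c"))

  // Stand-in for custom_dictionary.value (key -> list of values).
  val dictionary: Map[String, List[String]] =
    Map("k1" -> List("x", "y"), "k2" -> List("z"))

  // For each row, look up the key and emit one pair per dictionary entry.
  // map (not foreach) is used so that the pairs are actually returned.
  val joined: Seq[(String, String)] = table.flatMap { case (key, value) =>
    dictionary.get(key) match {
      case Some(others) => others.map(other => (value, other))
      case None         => Nil // keys missing from the dictionary yield nothing
    }
  }
}
```

Here a key that is absent from the dictionary simply drops out, which corresponds to inner-join behaviour.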

How to debug Spark dropDuplicates and join function calls?

I have a table with duplicated rows. I am trying to remove the duplicates and keep the row with the latest my_date for each my_id (if several rows have the same my_date, it does not matter which one is kept).
val dataFrame = readCsv()
.dropDuplicates("my_id", "my_date")
.withColumn("my_date_int", $"my_date".cast("bigint"))
import org.apache.spark.sql.functions.{min, max, grouping}
val aggregated = dataFrame
.groupBy(dataFrame("my_id").alias("g_my_id"))
.agg(max(dataFrame("my_date_int")).alias("g_my_date_int"))
val output = dataFrame.join(aggregated, dataFrame("my_id") === aggregated("g_my_id") && dataFrame("my_date_int") === aggregated("g_my_date_int"))
.drop("g_my_id", "g_my_date_int")
But when I select the distinct my_id values after running this code, I get about 3000 fewer than in the source table. What could be the reason?
How can I debug this situation?
After dropping duplicates, do an except of this data frame against the original data frame; this should give some insight into the rows that are additionally being dropped. Most probably there are null or empty values in those columns that are being treated as duplicates.
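As an illustration of how null values could make ids disappear, here is a small model on plain Scala collections, not Spark code. It assumes SQL semantics in which max ignores NULLs and an equality join never matches NULL against NULL; `None` stands in for NULL, and the sample rows are invented. Whether this is what happens in the asker's data is not known.

```scala
object NullJoinDemo {
  // Hypothetical rows: (my_id, my_date_int); None models SQL NULL.
  val rows: Seq[(String, Option[Long])] = Seq(
    ("a", Some(1L)), ("a", Some(2L)),
    ("b", None),
    ("c", Some(5L)), ("c", None)
  )

  // Model of groupBy + max: NULLs are ignored, and a group containing
  // only NULLs gets NULL as its maximum.
  val latest: Map[String, Option[Long]] =
    rows.groupBy(_._1).map { case (id, group) =>
      val dates = group.flatMap(_._2)
      id -> (if (dates.isEmpty) None else Some(dates.max))
    }

  // Model of the inner join on id and date: a row survives only when both
  // sides are non-null and equal.
  val joined: Seq[(String, Option[Long])] = rows.filter { case (id, date) =>
    (date, latest(id)) match {
      case (Some(d), Some(m)) => d == m
      case _                  => false
    }
  }

  // Model of the "except" step: ids in the source that are gone after the join.
  val lostIds: Set[String] = rows.map(_._1).toSet -- joined.map(_._1).toSet
}
```

In this model the id "b", whose only date is null, vanishes, while "c" survives through its non-null row.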

Is there a Scala collection that maintains the order of insert?

I have a List:hdtList which contain columns that represent the columns of a Hive table:
forecast_id bigint,period_year bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_system_name string,source_record_type string,gl_source_name string,gl_source_system_name string,year string
I have a List: partition_columns which contains two elements: source_system_name, period_year
Using the List: partition_columns, I am trying to match them and move the corresponding columns in List: hdtList to the end of it as below:
val (pc, notPc) = hdtList.partition(c => partition_columns.contains(c.takeWhile(x => x != ' ')))
But when I print them as: println(notPc.mkString(",") + "," + pc.mkString(","))
I see the output unordered as below:
forecast_id bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string,period_year bigint,source_system_name string
The column period_year comes first and source_system_name last. Is there any way I can produce the data as below, so that the order of the columns in the List partition_columns is maintained?
forecast_id bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string,source_system_name string,period_year bigint
I know there is an option to reverse a List but I'd like to learn if I can implement a collection that maintains that order of insert.
It doesn't matter which collections you use; you only use partition_columns to call contains which doesn't depend on its order, so how could it be maintained?
But your code does maintain order: it's just hdtList's.
Something like
// get is ugly, but safe here
val pc1 = partition_columns.map(x => pc.find(y => y.startsWith(x)).get)
after your code will give you the desired order, though there is probably a more efficient way to do it.
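Put together, the whole reordering can be sketched as follows. The shortened column list and the `columnName` helper are assumptions for illustration. This variant matches on the exact column name rather than using startsWith, which avoids one partition column name accidentally matching a longer column name, and it uses flatMap over find so that no get is needed.

```scala
object ColumnOrderDemo {
  // Shortened, hypothetical version of hdtList.
  val hdtList: List[String] = List(
    "forecast_id bigint", "period_year bigint", "period_num bigint",
    "source_system_name string", "year string"
  )
  val partitionColumns: List[String] = List("source_system_name", "period_year")

  // Column name = everything before the first space.
  def columnName(column: String): String = column.takeWhile(_ != ' ')

  // partition keeps hdtList's order in both halves.
  val (pc, notPc) = hdtList.partition(c => partitionColumns.contains(columnName(c)))

  // Re-order the partition columns by walking partitionColumns itself.
  val pcOrdered: List[String] =
    partitionColumns.flatMap(p => pc.find(c => columnName(c) == p))

  val reordered: List[String] = notPc ++ pcOrdered
}
```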

Iterate through a dataframe and dynamically assign ID to records based on substring [Spark][Scala]

Currently I have an input file(millions of records) where all the records contain a 2 character Identifier. Multiple lines in this input file will be concatenated into only one record in the output file, and how this is determined is SOLELY based on the sequential order of the Identifier
For example, the records would begin as below
1A
1B
1C
2A
2B
2C
1A
1C
2B
2C
1A
1B
1C
1A marks the beginning of a new record, so the output file would have 3 records in this case. Everything between the "1A"s will be combined into one record
1A+1B+1C+2A+2B+2C
1A+1C+2B+2C
1A+1B+1C
The number of records between the "1A"s varies, so I have to iterate through and check the Identifier.
I am unsure how to approach this situation using scala/spark.
My strategy is to:
Load the Input file into the dataframe.
Create an Identifier column based on substring of record.
Create a new column, TempID and a variable, x that is set to 0
Iterate through the dataframe
if Identifier =1A, x = x+1
TempID= variable x
Then create a UDF to concat records with the same TempID.
To summarize my question:
How would I iterate through the dataframe, check the value of Identifier column, then assign a tempID(whose value increases by 1 if the value of identifier column is 1A)
This is dangerous. The issue is that Spark does not guarantee that elements keep the same order, especially since they might cross partition boundaries. So when you iterate over them you could get a different order back. This also has to happen entirely sequentially, so at that point why not skip Spark entirely and run it as regular Scala code as a preprocessing step before getting to Spark?
My recommendation would be to either look into writing a custom data inputformat/data source, or perhaps you could use "1A" as a record delimiter similar to this question.
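A minimal sketch of the plain-Scala preprocessing idea, assuming the lines can be read sequentially in file order (for a real file the iterator might come from something like scala.io.Source.fromFile(path).getLines(), which is not shown here). The `groupRecords` name and the "+" separator are chosen for illustration.

```scala
object SequentialGrouping {
  // Start a new output record at every line beginning with "1A";
  // any other line is appended to the record currently being built.
  def groupRecords(lines: Iterator[String]): Vector[Vector[String]] =
    lines.foldLeft(Vector.empty[Vector[String]]) { (records, line) =>
      if (line.startsWith("1A") || records.isEmpty) records :+ Vector(line)
      else records.init :+ (records.last :+ line)
    }

  // Sample input from the question.
  val input: List[String] = List(
    "1A", "1B", "1C", "2A", "2B", "2C",
    "1A", "1C", "2B", "2C",
    "1A", "1B", "1C"
  )

  // Concatenate each group into one output record.
  val output: Vector[String] = groupRecords(input.iterator).map(_.mkString("+"))
}
```

Because the fold visits lines strictly in order, the grouping does not depend on any ordering guarantee from Spark.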
First - usually "iterating" over a DataFrame (or Spark's other distributed collection abstractions like RDD and Dataset) is either wrong or impossible. The term simply does not apply. You should transform these collections using Spark's functions instead of trying to iterate over them.
You can achieve your goal (or - almost, details to follow) using Window Functions. The idea here would be to (1) add an "id" column to sort by, (2) use a Window function (based on that ordering) to count the number of previous instances of "1A", and then (3) using these "counts" as the "group id" that ties all records of each group together, and group by it:
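import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._ // assumes a SparkSession named spark is in scope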
// sample data:
val df = Seq("1A", "1B", "1C", "2A", "2B", "2C", "1A", "1C", "2B", "2C", "1A", "1B", "1C").toDF("val")
val result = df.withColumn("id", monotonically_increasing_id()) // add row ID
.withColumn("isDelimiter", when($"val" === "1A", 1).otherwise(0)) // add group "delimiter" indicator
.withColumn("groupId", sum("isDelimiter").over(Window.orderBy($"id"))) // add groupId using Window function
.groupBy($"groupId").agg(collect_list($"val") as "list") // NOTE: order of list might not be guaranteed!
.orderBy($"groupId").drop("groupId") // removing groupId
result.show(false)
// +------------------------+
// |list |
// +------------------------+
// |[1A, 1B, 1C, 2A, 2B, 2C]|
// |[1A, 1C, 2B, 2C] |
// |[1A, 1B, 1C] |
// +------------------------+
(if having the result as a list does not fit your needs, I'll leave it to you to transform this column to whatever you need)
The major caveat here is that collect_list does not necessarily guarantee preserving order - once you use groupBy, the order is potentially lost. So - the order within each resulting list might be wrong (the separation to groups, however, is necessarily correct). If that's important to you, it can be worked around by collecting a list of a column that also contains the "id" column and using it later to sort these lists.
EDIT: realizing this answer isn't complete without solving this caveat, and realizing it's not trivial - here's how you can solve it:
Define the following UDF (it needs scala.collection.mutable and org.apache.spark.sql.Row imported):
val getSortedValues = udf { (input: mutable.Seq[Row]) => input
.map { case Row (id: Long, v: String) => (id, v) }
.sortBy(_._1)
.map(_._2)
}
Then, replace the row .groupBy($"groupId").agg(collect_list($"val") as "list") in the suggested solution above with these rows:
.groupBy($"groupId")
.agg(collect_list(struct($"id" as "_1", $"val" as "_2")) as "list")
.withColumn("list", getSortedValues($"list"))
This way we necessarily preserve the order (with the price of sorting these small lists).
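The three steps (row id, running count of "1A", group and sort by id) can be modelled on plain Scala collections to see why the grouping works. This is only a sketch of the logic, not Spark code; the names are made up for illustration.

```scala
object WindowModel {
  val values: Vector[String] = Vector(
    "1A", "1B", "1C", "2A", "2B", "2C",
    "1A", "1C", "2B", "2C",
    "1A", "1B", "1C"
  )

  // (1) Attach a row id to every value.
  val withId: Vector[(String, Int)] = values.zipWithIndex

  // (2) Running count of "1A" seen so far, including the current row.
  //     This plays the role of sum("isDelimiter") over the ordered window.
  val groupIds: Vector[Int] =
    values.scanLeft(0)((count, v) => if (v == "1A") count + 1 else count).tail

  // (3) Group by the running count, then restore the order inside each
  //     group by sorting on the row id, as the UDF does.
  val grouped: Vector[Vector[String]] =
    withId.zip(groupIds)
      .groupBy(_._2)
      .toVector
      .sortBy(_._1)
      .map { case (_, members) => members.map(_._1).sortBy(_._2).map(_._1) }
}
```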

Scala/Spark: Immutable Dataframes and Memory

I am very new to Scala. I have experience in Java and R
I am confused about the immutability of DataFrames and memory management. The reason is this:
A DataFrame in R is also immutable, and in R this was found to be unworkable. Put simplistically: when working with a very large number of columns, each transformation led to a new DataFrame, so 1000 consecutive operations on 1000 consecutive columns would lead to 1000 DataFrame objects. Now most data scientists prefer R's data.table, which performs operations by reference on a single data.table object.
Scala's DataFrame seems (to a newbie) to have a similar problem. The following code, for example, seems to create 1000 DataFrames when casting 1000 columns. Despite the foldLeft(), each call to withColumn() creates a new instance of DataFrame.
So, do I trust Scala to have very efficient garbage collection, or do I need to try to limit the number of immutable instances created? If the latter, what techniques should I be looking at?
def castAllTypedColumnsTo(df: DataFrame,
sourceType: DataType, targetType: DataType):
DataFrame =
{
val columnsToBeCasted = df.schema
.filter(s => s.dataType == sourceType)
if (columnsToBeCasted.length > 0)
{
println(s"Found ${columnsToBeCasted.length} columns " +
s"(${columnsToBeCasted.map(s => s.name).mkString(",")})" +
s" - casting to ${targetType.typeName.capitalize}Type")
}
columnsToBeCasted.foldLeft(df)
{ (foldedDf, col) =>
castColumnTo(foldedDf, col.name, targetType)
}
}
This method will return a new instance on each call
private def castColumnTo(df: DataFrame, cn: String, tpe: DataType):
DataFrame =
{
//println("castColumnTo")
df.withColumn(cn, df(cn).cast(tpe)
)
}
The difference is essentially laziness. Each new DataFrame that is returned is not materialized in memory. It just stores the base DataFrame and the function that should be applied to it. It's essentially an execution plan for how to create some data, not the data itself.
When it comes time to actually execute and save the result somewhere, then all 1000 operations can be applied to each row in parallel, so you get 1 additional output DataFrame. Spark condenses as many operations together as possible, and does not materialize anything unnecessary or that hasn't been explicitly requested to be saved or cached.
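As a loose analogy in plain Scala (not how Spark is implemented), folding 1000 transformations can be modelled as composing 1000 functions into one plan. Building the plan touches no data; work only happens when the plan is finally applied. The counter and sample values below are made up for illustration.

```scala
object LazyPlanDemo {
  // Counts how many times a transformation step actually runs.
  var evaluations = 0

  // 1000 "transformations", each just a function from Int to Int.
  val steps: Seq[Int => Int] =
    Seq.fill(1000)((x: Int) => { evaluations += 1; x + 1 })

  // Folding composes the steps into a single plan; nothing is executed yet.
  val noOp: Int => Int = x => x
  val plan: Int => Int = steps.foldLeft(noOp)(_ andThen _)

  val evaluationsBeforeRun: Int = evaluations

  // Only now is the whole chain applied, once per element.
  val result: Vector[Int] = Vector(1, 2, 3).map(plan)

  val evaluationsAfterRun: Int = evaluations
}
```

The 1000 intermediate function objects are small descriptions of work, comparable in spirit to the intermediate DataFrame instances, which hold a plan rather than a copy of the data.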