Make Spark code more efficient and cleaner - Scala

I have the following code that cleans a corpus of documents (pipelineClean(corpus)) and returns a DataFrame with two columns:
"id": Long
"tokens": Array[String]
After that, the code produces a DataFrame with the following columns:
"term": String
"postingList": a list of (Long, Long) pairs (the first Long is the document ID, the second is the term frequency inside that document)
pipelineClean(corpus)
.select($"id" as "documentId", explode($"tokens") as "term") // explode creates a new row for each element in the given array column
.groupBy("term", "documentId").count //group by and then count number of rows per group, returning a df with groupings and the counting
.where($"term" =!= "") // seems like there are some tokens that are empty, even though Tokenizer should remove them
.withColumn("posting", struct($"documentId", $"count")) // merge columns as a single {docId, termFreq}
.select("term", "posting")
.groupBy("term").agg(collect_list($"posting") as "postingList") // we do another grouping in order to collect the postings into a list
.orderBy("term")
.persist(StorageLevel.MEMORY_ONLY_SER)
My question is: would it be possible to make this code shorter and/or more efficient? For example, is it possible to do the grouping within a single groupBy?

It doesn't look like you can do much more than what you've got apart from skipping the withColumn call and using a straight select:
.select(col("term"), struct(col("documentId"), col("count")) as "posting")
instead of
.withColumn("posting", struct($"documentId", $"count")) // merge columns as a single {docId, termFreq}
.select("term", "posting")
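On the single-groupBy question: the two groupings do different jobs. The first counts occurrences per (term, documentId) pair and the second collects those counts per term, which is why collapsing them into one groupBy is not straightforward. A minimal sketch of the same computation on plain Scala collections (no Spark; the toy corpus `docs` and the name `postings` are made up for illustration) shows the two stages:

```scala
// Hypothetical toy corpus: (documentId, tokens)
val docs: Seq[(Long, Seq[String])] = Seq(
  1L -> Seq("spark", "scala", "spark", ""),
  2L -> Seq("scala", "index")
)

val postings: Seq[(String, Seq[(Long, Long)])] =
  docs
    .flatMap { case (id, tokens) => tokens.map(t => (t, id)) } // like explode
    .filter { case (term, _) => term.nonEmpty }                // drop empty tokens
    .groupBy(identity)                                         // stage 1: group by (term, documentId)
    .toSeq                                                     // keep one entry per pair (a Map would merge keys)
    .map { case ((term, id), hits) => (term, (id, hits.size.toLong)) }
    .groupBy(_._1)                                             // stage 2: group by term
    .toSeq
    .map { case (term, ps) => (term, ps.map(_._2).sortBy(_._1)) }
    .sortBy(_._1)                                              // like orderBy("term")
```

Here the posting lists are sorted by document ID for readability; the Spark version does not promise any order inside postingList.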


Array manipulation in Spark, Scala

I'm new to scala, spark, and I have a problem while trying to learn from some toy dataframes.
I have a dataframe having the following two columns:
Name_Description Grade
Name_Description is an array, and Grade is just a letter. It's Name_Description that I'm having a problem with: I'm trying to transform this column using Scala on Spark.
Name_Description is not a fixed-size array. It could be something like
['asdf_ Brandon', 'Ca%abc%rd']
['fthhhhChris', 'Rock', 'is the %abc%man']
The only problems are the following:
1. The first element of the array ALWAYS starts with 6 garbage characters, so the real content starts at the 7th character.
2. %abc% randomly pops up in elements, so I want to remove it.
Is there any way to achieve those two things in Scala? For instance, I just want
['asdf_ Brandon', 'Ca%abc%rd'], ['fthhhhChris', 'Rock', 'is the %abc%man']
to change to
['Brandon', 'Card'], ['Chris', 'Rock', 'is the man']
What you're trying to do might be hard to achieve using standard Spark functions, but you could define a UDF for that:
import org.apache.spark.sql.functions.udf
val removeGarbage = udf { arr: Seq[String] =>
  // headOption handles the case where the array is empty
  arr.headOption
    // drop the first 6 characters from the first element, then remove %abc% from every element
    .map(head => (head.drop(6) +: arr.tail).map(_.replace("%abc%", "")))
    .getOrElse(arr)
}
Then you just need to use this UDF on your Name_Description column:
val df = List(
(1, Array("asdf_ Brandon", "Ca%abc%rd")),
(2, Array("fthhhhChris", "Rock", "is the %abc%man"))
).toDF("Grade", "Name_Description")
df.withColumn("Name_Description", removeGarbage($"Name_Description")).show(false)
Show prints:
+-----+-------------------------+
|Grade|Name_Description |
+-----+-------------------------+
|1 |[Brandon, Card] |
|2 |[Chris, Rock, is the man]|
+-----+-------------------------+
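The cleaning logic can also be written as a plain Scala function and checked without a Spark session. A minimal sketch (the name `cleanNames` is hypothetical), which strips %abc% from every element, including the first:

```scala
// Drop the 6 leading garbage characters from the first element only,
// then remove every "%abc%" marker from all elements.
def cleanNames(names: Seq[String]): Seq[String] =
  names.zipWithIndex.map { case (s, i) =>
    val trimmed = if (i == 0) s.drop(6) else s
    trimmed.replace("%abc%", "")
  }
```

Wrapping it with `udf(cleanNames _)` should give a column function that can be applied to Name_Description in the same way as the UDF above.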
We are always encouraged to use Spark SQL functions and to avoid UDFs where we can. Here is a simpler solution that uses only Spark SQL functions.
val d = Array((1,Array("asdf_ Brandon","Ca%abc%rd")),(2,Array("fthhhhChris", "Rock", "is the %abc%man")))
val df = spark.sparkContext.parallelize(d).toDF("Grade","Name_Description")
This is how I created the input dataframe.
df.select('Grade,posexplode('Name_Description)).createOrReplaceTempView("data")
We explode the array along with the position of each element in the array. I register the dataframe in order to use a query to generate the required output.
spark.sql("""
  select Grade, collect_list(Names)
  from (
    select Grade,
           case when pos = 0 then substring(col, 7) else replace(col, "%abc%", "") end as Names
    from data
  ) a
  group by Grade
""").show
This query produces the required output, although collect_list does not strictly guarantee the order of elements after the grouping.

How to group by similar element in the lists

In my program I want to perform a groupBy operation on the dataframe using the common element in the list. For example the following dataframe:
visitorId |trackingIds     |emailIds
----------+----------------+--------
[a158]    |[666b,666b,777c]|[12]
[7g21]    |[c0b5,c0b4]     |[45, 87]
[p9098]   |[666b]          |[90]
[8u7t]    |[c0b5]          |[40]
should be grouped by the column trackingIds, which is actually a List[String]:
visitorId    |trackingIds     |emailIds
-------------+----------------+------------
[a158, p9098]|[666b,666b,777c]|[12, 90]
[7g21, 8u7t] |[c0b5,c0b4]     |[45, 87, 40]
I have a solution using a simple function that finds the element in other rows and merges accordingly, but I am looking for a more cost-effective solution, as the operation has to run on a big dataframe with millions of rows.

Is there a Scala collection that maintains the order of insert?

I have a List, hdtList, which contains the columns of a Hive table:
forecast_id bigint,period_year bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_system_name string,source_record_type string,gl_source_name string,gl_source_system_name string,year string
I have another List, partition_columns, which contains two elements: source_system_name, period_year
Using partition_columns, I am trying to find the matching columns in hdtList and move them to the end of it, as below:
val (pc, notPc) = hdtList.partition(c => partition_columns.contains(c.takeWhile(x => x != ' ')))
But when I print them as: println(notPc.mkString(",") + "," + pc.mkString(","))
I see the output unordered as below:
forecast_id bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string,period_year bigint,source_system_name string
The column period_year comes first and source_system_name last. Is there any way to produce the output below, so that the order of the columns in partition_columns is maintained?
forecast_id bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string,source_system_name string,period_year bigint
I know there is an option to reverse a List, but I'd like to learn whether I can use a collection that maintains insertion order.
It doesn't matter which collections you use; you only use partition_columns to call contains which doesn't depend on its order, so how could it be maintained?
But your code does maintain order: it's just hdtList's.
Something like
// get is ugly, but safe here
val pc1 = partition_columns.map(x => pc.find(y => y.startsWith(x)).get)
after your code will give you the desired order, though there's probably a more efficient way to do it.
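Put together, the idea looks roughly like the sketch below. It uses a shortened, made-up version of the column list, camel-case names for the two lists, and an exact match on the column name instead of startsWith, so that a name such as period cannot accidentally match period_year:

```scala
// Shortened, hypothetical version of the column list from the question
val hdtList = List(
  "forecast_id bigint", "period_year bigint", "period_num bigint",
  "source_system_name string", "year string"
)
val partitionColumns = List("source_system_name", "period_year")

// Column name = everything before the first space
def columnName(c: String): String = c.takeWhile(_ != ' ')

// partition keeps hdtList's order in both halves
val (pc, notPc) = hdtList.partition(c => partitionColumns.contains(columnName(c)))

// Walk partitionColumns so the result follows ITS order; flatMap skips
// names that have no matching column instead of throwing like .get would
val orderedPc = partitionColumns.flatMap(name => pc.find(c => columnName(c) == name))

val reordered = notPc ++ orderedPc
```

List already preserves insertion order; the reordering comes from iterating over partitionColumns rather than over hdtList.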

Iterate through a dataframe and dynamically assign ID to records based on substring [Spark][Scala]

Currently I have an input file (millions of records) where every record contains a 2-character Identifier. Multiple lines in this input file will be concatenated into a single record in the output file, and how this is determined is based SOLELY on the sequential order of the Identifier.
For example, the records would begin as below
1A
1B
1C
2A
2B
2C
1A
1C
2B
2C
1A
1B
1C
1A marks the beginning of a new record, so the output file would have 3 records in this case. Everything between the "1A"s will be combined into one record
1A+1B+1C+2A+2B+2C
1A+1C+2B+2C
1A+1B+1C
The number of records between the "1A"s varies, so I have to iterate through and check the Identifier.
I am unsure how to approach this situation using scala/spark.
My strategy is to:
1. Load the input file into a DataFrame.
2. Create an Identifier column based on a substring of the record.
3. Create a new column, TempID, and a variable, x, that is set to 0.
4. Iterate through the DataFrame:
   if Identifier = 1A, x = x + 1
   TempID = x
5. Create a UDF to concatenate records with the same TempID.
To summarize my question:
How would I iterate through the DataFrame, check the value of the Identifier column, and then assign a TempID (whose value increases by 1 whenever the Identifier is 1A)?
This is dangerous. The issue is that Spark is not guaranteed to keep the same order among elements, especially since they might cross partition boundaries, so when you iterate over them you could get a different order back. This also has to happen entirely sequentially, so at that point why not skip Spark entirely and run it as regular Scala code, as a preprocessing step before getting to Spark?
My recommendation would be to either look into writing a custom data inputformat/data source, or perhaps you could use "1A" as a record delimiter similar to this question.
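If the file can be read sequentially on one machine, the plain-Scala preprocessing step mentioned above might look like the sketch below. It assumes lines arrive in file order and that a record starts at any line beginning with "1A"; `groupRecords` and `isStart` are hypothetical names.

```scala
// Fold over the lines in order: a start line opens a new group,
// any other line is appended to the most recent group.
def groupRecords(lines: Iterator[String], isStart: String => Boolean): Vector[Vector[String]] =
  lines.foldLeft(Vector.empty[Vector[String]]) { (groups, line) =>
    if (isStart(line) || groups.isEmpty) groups :+ Vector(line)
    else groups.init :+ (groups.last :+ line)
  }

// Sample identifiers from the question
val input = Seq("1A", "1B", "1C", "2A", "2B", "2C", "1A", "1C", "2B", "2C", "1A", "1B", "1C")

val records: Vector[String] =
  groupRecords(input.iterator, _.startsWith("1A")).map(_.mkString("+"))
```

For a real file, the iterator could come from `scala.io.Source.fromFile(path).getLines()`, and the concatenated records could then be written out and loaded into Spark.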
First - usually "iterating" over a DataFrame (or Spark's other distributed collection abstractions like RDD and Dataset) is either wrong or impossible. The term simply does not apply. You should transform these collections using Spark's functions instead of trying to iterate over them.
You can achieve your goal (or almost, details to follow) using Window functions. The idea is to (1) add an "id" column to sort by, (2) use a Window function (based on that ordering) to count the number of previous instances of "1A", and then (3) use these counts as the "group id" that ties all records of each group together, and group by it:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
// sample data:
val df = Seq("1A", "1B", "1C", "2A", "2B", "2C", "1A", "1C", "2B", "2C", "1A", "1B", "1C").toDF("val")
val result = df.withColumn("id", monotonically_increasing_id()) // add row ID
.withColumn("isDelimiter", when($"val" === "1A", 1).otherwise(0)) // add group "delimiter" indicator
.withColumn("groupId", sum("isDelimiter").over(Window.orderBy($"id"))) // add groupId using Window function
.groupBy($"groupId").agg(collect_list($"val") as "list") // NOTE: order of list might not be guaranteed!
.orderBy($"groupId").drop("groupId") // removing groupId
result.show(false)
// +------------------------+
// |list |
// +------------------------+
// |[1A, 1B, 1C, 2A, 2B, 2C]|
// |[1A, 1C, 2B, 2C] |
// |[1A, 1B, 1C] |
// +------------------------+
(if having the result as a list does not fit your needs, I'll leave it to you to transform this column to whatever you need)
The major caveat here is that collect_list does not necessarily guarantee preserving order - once you use groupBy, the order is potentially lost. So - the order within each resulting list might be wrong (the separation to groups, however, is necessarily correct). If that's important to you, it can be worked around by collecting a list of a column that also contains the "id" column and using it later to sort these lists.
EDIT: realizing this answer isn't complete without solving this caveat, and realizing it's not trivial - here's how you can solve it:
Define the following UDF:
import org.apache.spark.sql.Row
// sort the collected (id, value) structs by id, then keep only the values
val getSortedValues = udf { (input: Seq[Row]) => input
  .map { case Row(id: Long, v: String) => (id, v) }
  .sortBy(_._1)
  .map(_._2)
}
Then, replace the line .groupBy($"groupId").agg(collect_list($"val") as "list") in the suggested solution above with these lines:
.groupBy($"groupId")
.agg(collect_list(struct($"id" as "_1", $"val" as "_2")) as "list")
.withColumn("list", getSortedValues($"list"))
This way the order is necessarily preserved (at the price of sorting these small lists).

How to create key-value pairs DStream in Spark Streaming

I'm new to Spark Streaming. I'm working on a project that uses Spark Streaming, where the input is a key-value pair string like "productid,price".
The requirement is to process each line as a separate transaction, with an RDD produced every 1 second.
In each interval I have to calculate the total price for each individual product, like
select productid, sum(price) from T group by productid
My current thought is that I have to do the following steps
1) Split the whole input on "\n":
val lineMap = lines.map { x => x.split("\n") }
2) Split each line on ",":
val recordMap = lineMap.map { x => x.map { y => y.split(",") } }
Now I'm confused about how to make the first column the key and the second column the value, and use the reduceByKey function to get the total sum.
Please advise.
Thanks
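Once you have split each row (split returns an Array, so match on Array rather than Seq), you can do something like this:
rowItems.map { case Array(product, price) => product -> price.toDouble }
This way you obtain a DStream[(String, Double)] on which you can apply pair transformations like reduceByKey(_ + _) (on older Spark versions, don't forget to import the required implicits from org.apache.spark.streaming.StreamingContext._).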
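The per-batch logic can be tried out on plain collections first. The sketch below is only an illustration under assumptions: `parseLine`, `batch`, and `totals` are made-up names, malformed lines are silently dropped, and groupBy plus sum stands in for what reduceByKey(_ + _) would do on the stream.

```scala
import scala.util.Try

// Parse "productid,price" into a key-value pair; None for malformed input
def parseLine(line: String): Option[(String, Double)] =
  line.split(",").map(_.trim) match {
    case Array(product, price) => Try(price.toDouble).toOption.map(p => (product, p))
    case _                     => None
  }

// One hypothetical batch of input lines
val batch = Seq("p1,10.5", "p2,3", "p1,4.5", "garbage")

// Equivalent of: select productid, sum(price) from T group by productid
val totals: Map[String, Double] =
  batch
    .flatMap(line => parseLine(line))
    .groupBy(_._1)
    .map { case (product, rows) => product -> rows.map(_._2).sum }
```

On a DStream[String] of lines, the same parser could be used as `lines.flatMap(line => parseLine(line)).reduceByKey(_ + _)`.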