Scala RDD Operation - scala

I am new to Scala.
I have a CSV file stored in HDFS. I am reading that file in Scala using
val salesdata = sc.textFile("hdfs://localhost:9000/home/jayshree/sales.csv")
Here is a small sample of the "sales" data:
C_ID T_ID ITEM_ID ITEM_Price
5 199 1 500
33 235 1 500
20 249 3 749
35 36 4 757
19 201 4 757
17 94 5 763
39 146 5 763
42 162 5 763
49 41 6 824
3 70 6 824
24 161 6 824
48 216 6 824
I have to perform the following operations on it.
1. Apply a discount to each item on the ITEM_Price column, say 30%; the formula is d = d - 0.3*d.
2. Find the customer-wise minimum and maximum item value after applying the 30% discount to each item.
I tried to multiply the ITEM_Price values by 30 (for the 30% discount). The problem is that the value is read as a string, so multiplying it by a number just repeats the string that many times, e.g. "500" * 3 = "500500500".
I can convert it into a DataFrame and do it there, but I just want to know whether these operations can be done on an RDD without converting it into a DataFrame.

Discount
case class Transaction(cId: Int, tId: Int, itemId: Int, itemPrice: Double)
Given val salesdata: RDD[String], map the RDD: inside the map, split each line by your separator and convert the resulting Array into the Transaction case class, calling array(i).toInt / array(i).toDouble to cast the fields. The target of this step is an RDD[Transaction].
Map the RDD again, copying each transaction with the discount applied (t => t.copy(itemPrice = 0.7 * t.itemPrice)).
You will have a new RDD[Transaction]
Customer wise
Take the last RDD and apply keyBy(_.cId) to get an RDD[(Int, Transaction)] where the key is the customer.
Reduce by key, adding up the prices. Goal => RDD[(Int, Double)], where you get the total for each customer.
Find your target clients!
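A minimal sketch of those steps (assuming whitespace-separated fields as in the sample, a header row to skip, and itemPrice declared as a Double so the discount doesn't truncate):
import org.apache.spark.rdd.RDD

case class Transaction(cId: Int, tId: Int, itemId: Int, itemPrice: Double)

val header = salesdata.first()                      // "C_ID T_ID ITEM_ID ITEM_Price"
val transactions: RDD[Transaction] = salesdata
  .filter(_ != header)
  .map(_.trim.split("\\s+"))                        // swap the pattern for "," if the file is comma-separated
  .map(a => Transaction(a(0).toInt, a(1).toInt, a(2).toInt, a(3).toDouble))

val discounted = transactions.map(t => t.copy(itemPrice = 0.7 * t.itemPrice))  // 30% off

val totalPerCustomer = discounted
  .keyBy(_.cId)                                     // RDD[(Int, Transaction)], keyed by customer
  .mapValues(_.itemPrice)
  .reduceByKey(_ + _)                               // total discounted spend per customer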

Since you want more of a guide, let's look at this outside of Spark for a second and think about things as typical Scala collections.
Your data would look like this:
val data = Array(
(5, 199, 5, 100),
(33, 235, 5, 100),
...
)
I think you will have no trouble mapping your salesdata RDD of strings to an RDD of Array or Tuple4 using a split or regular expression or something.
Let's go with a tuple. Then you can do this:
data.map {
case (cId, tId, item, price) => (cId, tId, item, price * .7)
}
That maps the original RDD of tuples to another RDD of tuples where the last values, the prices, are reduced by 30%. So the result is a Tuple4[Int, Int, Int, Double].
To be honest, I don't know what you mean by customer-wise min and max, but maybe it is something like this:
data.map {
case (cId, tId, item, price) => (cId, tId, item, price * .7)
}.groupBy(_._1)
.mapValues { tuples =>
val discountedPrices = tuples.map(_._4)
(discountedPrices.min, discountedPrices.max)
}
First, I do a groupBy, which produces a Map from cId (the first value in the tuple, which explains the ._1) to a collection of full tuples--so a Map of cId to a collection of rows pertaining to that cId. In Spark, this would produce a PairRDD.
Map and PairRDD both have a mapValues function, which lets me preserve the keys (the cIds) while transforming each collection of tuples. Here I simply map each collection to the discounted prices by taking the 4th element of each tuple, then call min and max on that collection and return those two values as a tuple.
So the result is a Map of customer ID to a tuple of the min and max of the discounted prices. The beauty of the RDD API is that it follows the conventional Scala collection API so closely, so it is basically the same thing.
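If you stay in Spark, a sketch of the same computation on an RDD of those tuples (assuming data is an RDD[(Int, Int, Int, Int)]) could use reduceByKey instead of groupBy, computing each customer's (min, max) in a single pass without materializing the groups:
val perCustomerMinMax = data
  .map { case (cId, tId, item, price) => (cId, price * 0.7) }
  .mapValues(p => (p, p))                           // seed (min, max) with the price itself
  .reduceByKey { case ((lo1, hi1), (lo2, hi2)) =>
    (math.min(lo1, lo2), math.max(hi1, hi2))
  }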

Related

Persisting loop dataframes for group concat functions in Pyspark

I'm trying to aggregate a spark dataframe up to a unique ID, selecting the first non-null value from that column for that ID given a sort column. Basically replicating MySQL's group_concat function.
The SO post here Spark SQL replacement for MySQL's GROUP_CONCAT aggregate function was very helpful in replicating the group_concat for a single column. I need to do this for a dynamic list of columns.
I would rather not have to copy this code for each column (a dozen plus, and potentially dynamic in the future), so I am trying to implement it in a loop (frowned upon in Spark, I know!) given a list of column names. The loop runs successfully, but the previous iterations don't persist even when the intermediate df is cached/persisted (re: Cacheing and Loops in (Py)Spark).
Any help, pointers or a more elegant non-looping solution would be appreciated (not afraid to try a bit of scala if there is a functional programming approach more suitable)!
Given the following df:
unique_id  row_id   first_name  last_name  middle_name  score
1000000    1000002  Simmons     Bonnie     Darnell      88
1000000    1000006  Dowell      Crawford   Anne         87
1000000    1000007  NULL        Eric       Victor       89
1000000    1000000  Zachary     Fields     Narik        86
1000000    1000003  NULL        NULL       Warren       92
1000000    1000008  Paulette    Ronald     Irvin        85
group_column = "unique_id"
concat_list = ['first_name','last_name','middle_name']
sort_column = "score"
sort_order = False
df_final=df.select(group_column).distinct()
for i in concat_list:\
df_helper=df
df_helper=df_helper.groupBy(group_column)\
.agg(sort_array(collect_list(struct(sort_column,i)),sort_order).alias('collect_list'))\
.withColumn("sorted_list",col("collect_list."+str(i)))\
.withColumn("first_item",slice(col("sorted_list"),1,1))\
.withColumn(i,concat_ws(",",col("first_item")))\
.drop("collect_list")\
.drop("sorted_list")\
.drop("first_item")
print(i)
df_final=df_final.join(df_helper,group_column,"inner")
df_final.cache()
df_final.display() #I'm using databricks
My result looks like:
unique_id  middle_name
1000000    Warren
My desired result is:
unique_id  first_name  last_name  middle_name
1000000    Simmons     Eric       Warren
I found a solution to my own question: Add a .collect() call on my dataframe as I join to it, not a persist() or cache(); this will produce the expected dataframe.
group_column = "unique_id"
enter code hereconcat_list = ['first_name','last_name','middle_name']
sort_column = "score"
sort_order = False
df_final=df.select(group_column).distinct()
for i in concat_list:\
df_helper=df
df_helper=df_helper.groupBy(group_column)\
.agg(sort_array(collect_list(struct(sort_column,i)),sort_order).alias('collect_list'))\
.withColumn("sorted_list",col("collect_list."+str(i)))\
.withColumn("first_item",slice(col("sorted_list"),1,1))\
.withColumn(i,concat_ws(",",col("first_item")))\
.drop("collect_list")\
.drop("sorted_list")\
.drop("first_item")
print(i)
df_final=df_final.join(df_helper,group_column,"inner")
df_final.collect()
df_final.display() #I'm using databricks

Cumulative function in spark scala

I have tried this to calculate a cumulative value, but when the date field is the same, those values all get added to the cumulative field at once. Can someone suggest a solution? (Similar to this question.)
val windowval = (Window.partitionBy($"userID").orderBy($"lastModified")
.rangeBetween(Window.unboundedPreceding, 0))
val df_w_cumsum = ms1_userlogRewards.withColumn("totalRewards", sum($"noOfJumps").over(windowval)).orderBy($"lastModified".asc)
df_w_cumsum.filter($"batchType".isNull).filter($"userID"==="355163").select($"userID", $"noOfJumps", $"totalRewards",$"lastModified").show()
Note that your very first totalRewards=147 is the sum of the previous value 49 + all the values with timestamp "2019-08-07 18:25:06": 49 + (36 + 0 + 60 + 2) = 147.
The first option would be to aggregate all the values with the same timestamp first, e.g. groupBy($"userId", $"lastModified").agg(sum($"noOfJumps").as("noOfJumps")) (or something like that), and then run your aggregate sum. This removes the duplicate timestamps altogether.
The second option is to use row_number to define an order among rows with the same lastModified value, and then run your aggregate sum with .orderBy($"lastModified", $"row_number") (or something like that). This keeps all records and gives you partial sums along the way: totalRewards = 49 -> 85 -> 85 -> 145 -> 147 (or something similar, depending on the order defined by row_number).
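For instance, the first option might look something like this (a sketch using the column names from the question):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._

// Collapse rows that share the same timestamp first, so the running sum
// advances one step per distinct lastModified value.
val deduped = ms1_userlogRewards
  .groupBy($"userID", $"lastModified")
  .agg(sum($"noOfJumps").as("noOfJumps"))

val windowval = Window.partitionBy($"userID")
  .orderBy($"lastModified")
  .rangeBetween(Window.unboundedPreceding, 0)

val df_w_cumsum = deduped
  .withColumn("totalRewards", sum($"noOfJumps").over(windowval))
  .orderBy($"lastModified".asc)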
I think you want to sum by userid and timestamp.
So you need to partition by userID and lastModified and use a window function to sum, like the following:
import org.apache.spark.sql.functions.{col, sum}
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy("userID", "lastModified")
df.withColumn("cumulativeSum", sum(col("noOfJumps")).over(window))

Iterate through a dataframe and dynamically assign ID to records based on substring [Spark][Scala]

Currently I have an input file (millions of records) where every record contains a 2-character identifier. Multiple lines in this input file will be concatenated into only one record in the output file, and how this is determined is based solely on the sequential order of the identifier.
For example, the records would begin as below
1A
1B
1C
2A
2B
2C
1A
1C
2B
2C
1A
1B
1C
1A marks the beginning of a new record, so the output file would have 3 records in this case. Everything between the "1A"s will be combined into one record
1A+1B+1C+2A+2B+2C
1A+1C+2B+2C
1A+1B+1C
The number of records between the "1A"s varies, so I have to iterate through and check the Identifier.
I am unsure how to approach this situation using scala/spark.
My strategy is to:
Load the Input file into the dataframe.
Create an Identifier column based on substring of record.
Create a new column, TempID, and a variable x that is set to 0
Iterate through the dataframe
if Identifier == "1A", then x = x + 1
set TempID = x
Then create a UDF to concat records with the same TempID.
To summarize my question:
How would I iterate through the dataframe, check the value of the Identifier column, and then assign a TempID (whose value increases by 1 whenever the value of the Identifier column is "1A")?
This is dangerous. The issue is that Spark is not guaranteed to keep the same order among elements, especially since they might cross partition boundaries, so when you iterate over them you could get a different order back. This also has to happen entirely sequentially, so at that point why not just skip Spark entirely and run it as regular Scala code as a preprocessing step before getting to Spark?
My recommendation would be to either look into writing a custom data inputformat/data source, or perhaps you could use "1A" as a record delimiter similar to this question.
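For example, the record-delimiter route might look roughly like this (a sketch, not from the original answer: it relies on Hadoop's textinputformat.record.delimiter so that everything between consecutive "1A" markers arrives as one value; the path is illustrative):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "1A")    // split records on "1A" instead of newlines

val records = sc
  .newAPIHadoopFile("hdfs:///path/to/input",          // illustrative path
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString.trim }
  .filter(_.nonEmpty)                                  // drop the empty split before the first "1A"
  .map(body => ("1A" +: body.split("\n").map(_.trim).filter(_.nonEmpty)).mkString("+"))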
First - usually "iterating" over a DataFrame (or Spark's other distributed collection abstractions like RDD and Dataset) is either wrong or impossible. The term simply does not apply. You should transform these collections using Spark's functions instead of trying to iterate over them.
You can achieve your goal (or - almost, details to follow) using Window Functions. The idea here would be to (1) add an "id" column to sort by, (2) use a Window function (based on that ordering) to count the number of previous instances of "1A", and then (3) using these "counts" as the "group id" that ties all records of each group together, and group by it:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
// sample data:
val df = Seq("1A", "1B", "1C", "2A", "2B", "2C", "1A", "1C", "2B", "2C", "1A", "1B", "1C").toDF("val")
val result = df.withColumn("id", monotonically_increasing_id()) // add row ID
.withColumn("isDelimiter", when($"val" === "1A", 1).otherwise(0)) // add group "delimiter" indicator
.withColumn("groupId", sum("isDelimiter").over(Window.orderBy($"id"))) // add groupId using Window function
.groupBy($"groupId").agg(collect_list($"val") as "list") // NOTE: order of list might not be guaranteed!
.orderBy($"groupId").drop("groupId") // removing groupId
result.show(false)
// +------------------------+
// |list |
// +------------------------+
// |[1A, 1B, 1C, 2A, 2B, 2C]|
// |[1A, 1C, 2B, 2C] |
// |[1A, 1B, 1C] |
// +------------------------+
(if having the result as a list does not fit your needs, I'll leave it to you to transform this column to whatever you need)
The major caveat here is that collect_list does not necessarily guarantee preserving order - once you use groupBy, the order is potentially lost. So - the order within each resulting list might be wrong (the separation to groups, however, is necessarily correct). If that's important to you, it can be worked around by collecting a list of a column that also contains the "id" column and using it later to sort these lists.
EDIT: realizing this answer isn't complete without solving this caveat, and realizing it's not trivial - here's how you can solve it:
Define the following UDF:
import scala.collection.mutable
import org.apache.spark.sql.Row

val getSortedValues = udf { (input: mutable.Seq[Row]) => input
  .map { case Row(id: Long, v: String) => (id, v) }
  .sortBy(_._1)
  .map(_._2)
}
Then, replace the row .groupBy($"groupId").agg(collect_list($"val") as "list") in the suggested solution above with these rows:
.groupBy($"groupId")
.agg(collect_list(struct($"id" as "_1", $"val" as "_2")) as "list")
.withColumn("list", getSortedValues($"list"))
This way we necessarily preserve the order (with the price of sorting these small lists).

Creating "running total" in Scala

I have a history DataFrame that has the following structure
id amount date
12345 150 1/1/2016
12345 50 1/4/2016
12345 250 1/4/2016
12345 950 1/9/2016
I would like to have a cumulative sum of the amount with respect to date, such that the resulting sum for a given date is calculated over that day and all earlier days with the same ID. Results should be generated even for dates that do not have entries in the source DataFrame, assuming they are between the start and end dates. The expected output for the example input can be seen below.
ID date cumulative_sum
12345 1/1/2016 150
12345 1/2/2016 150
12345 1/3/2016 150
12345 1/4/2016 450
12345 1/5/2016 450
12345 1/6/2016 450
12345 1/7/2016 450
12345 1/8/2016 450
12345 1/9/2016 1400
Does anyone know how to calculate this sort of running total?
Basically, you first find subtotals for each date (doesn't really have to happen as a separate step, but this makes things a little more generic - I'll explain why below):
val subtotals = data
.groupBy(_.date)
.mapValues(_.map(_.amount).sum)
.withDefault(_ => 0)
Now, you can scan through the date range, and sum things up with something like this:
(0 to numberOfDays)
  .map(d => startDate.plusDays(d.toLong))
  .scanLeft(startDate -> 0) { case ((_, sum), date) =>
    date -> (subtotals(date) + sum)
  }.drop(1) // drop the seed element
This is how you would do it in plain Scala. Now, because you mentioned "data frame" in your question, I suspect you are actually using Spark. That makes it a little more complicated, because the data may be distributed. The good news is that, while you may have a huge number of transactions, there aren't enough days in the history of the world to make it impossible to process the aggregated data as a single task.
So, you just need to replace the first step above with a distributed equivalent:
val subtotals = dataFrame
  .rdd
  .map(row => row.getAs[java.sql.Date]("date").toLocalDate -> row.getAs[Int]("amount"))  // assumes a DateType date and an integer amount column
  .reduceByKey(_ + _)
  .collect
  .toMap
And now you can do the second step in exactly the same way I showed above.
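For reference, the values the plain-Scala snippet leaves implicit (data, startDate, numberOfDays) might be set up like this; the Tx record type and the values are illustrative, assuming java.time.LocalDate dates:
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Illustrative stand-in for one row of the history DataFrame.
case class Tx(id: Long, amount: Int, date: LocalDate)

val data = Seq(
  Tx(12345L, 150, LocalDate.of(2016, 1, 1)),
  Tx(12345L, 50,  LocalDate.of(2016, 1, 4)),
  Tx(12345L, 250, LocalDate.of(2016, 1, 4)),
  Tx(12345L, 950, LocalDate.of(2016, 1, 9))
)

implicit val dateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)
val startDate    = data.map(_.date).min
val endDate      = data.map(_.date).max
val numberOfDays = ChronoUnit.DAYS.between(startDate, endDate).toInt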

How to create key-value pairs DStream in Spark Streaming

I'm new to Spark Streaming. There's a project using Spark Streaming where the input is a key-value pair string like "productid,price".
The requirement is to process each line as a separate transaction, with a new RDD (batch) triggered every 1 second.
In each interval I have to calculate the total price for each individual product, like
select productid, sum(price) from T group by productid
My current thought is that I have to do the following steps:
1) split the whole input by \n: val lineMap = lines.map{x => x.split("\n")}
2) split each line by ",": val recordMap = lineMap.map{x => x.map{y => y.split(",")}}
Now I'm confused about how to make the first column the key and the second column the value, and then use the reduceByKey function to get the total sum.
Please advise.
Thanks
Once you have split each row, you can do something like this:
rowItems.map { case Seq(product, price) => product -> price }
This way you obtain a DStream[(String, String)] on which you can apply pair transformations like reduceByKey (don't forget to import the required implicits).
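Putting it together end to end, a sketch might look like this (the socket source, host/port, and batch interval are illustrative assumptions; the price is parsed to Double so reduceByKey sums numbers rather than concatenating strings):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("ProductTotals").setMaster("local[2]")  // at least 2 threads: receiver + processing
val ssc  = new StreamingContext(conf, Seconds(1))                             // 1-second batch interval

val lines = ssc.socketTextStream("localhost", 9999)   // illustrative source of "productid,price" lines

val totals = lines
  .map(_.split(","))
  .flatMap {
    case Array(product, price) => Seq(product -> price.toDouble)  // keep well-formed rows
    case _                     => Seq.empty
  }
  .reduceByKey(_ + _)                                              // total price per product, per batch

totals.print()
ssc.start()
ssc.awaitTermination()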