Aggregating sum for RDD in Scala (Spark)

If I have a variable such as books: RDD[(String, Integer, Integer)], how do I merge entries with the same String (which could represent a title), and then sum the corresponding two integers (which could represent pages and price)?
ex:
[("book1", 20, 10),
("book2", 5, 10),
("book1", 100, 100)]
becomes
[("book1", 120, 110),
("book2", 5, 10)]

With an RDD you can use reduceByKey.
case class Book(name: String, i: Int, j: Int) {
  def +(b: Book) =
    if (name == b.name) Book(name, i + b.i, j + b.j)
    else throw new IllegalArgumentException("Can only add Books with the same name")
}
val rdd = sc.parallelize(Seq(
  Book("book1", 20, 10),
  Book("book2", 5, 10),
  Book("book1", 100, 100)))
val aggRdd = rdd.map(book => (book.name, book))
  .reduceByKey(_ + _) // reduce calling our defined `+` method
  .map(_._2)          // we don't need the tuple anymore, just get the Books
aggRdd.foreach(println)
// Book(book1,120,110)
// Book(book2,5,10)

Alternatively, just use the DataFrame/Dataset API:
val spark: SparkSession = SparkSession.builder.getOrCreate()
val rdd = spark.sparkContext.parallelize(Seq(
("book1", 20, 10), ("book2", 5, 10), ("book1", 100, 100)
))
spark.createDataFrame(rdd).groupBy("_1").sum().show()
// +-----+-------+-------+
// | _1|sum(_2)|sum(_3)|
// +-----+-------+-------+
// |book1| 120| 110|
// |book2| 5| 10|
// +-----+-------+-------+
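A small optional refinement, in case the _1/_2/_3 headers are too opaque: name the columns when building the DataFrame. A minimal sketch using illustrative column names ("title", "pages", "price") that are not from the question:

// Same aggregation, but with named columns instead of _1/_2/_3
spark.createDataFrame(rdd)
  .toDF("title", "pages", "price")
  .groupBy("title")
  .sum("pages", "price")
  .show()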

Try converting it first to a key-tuple RDD and then performing a reduceByKey:
yourRDD.map(t => (t._1, (t._2, t._3)))
.reduceByKey((acc, elem) => (acc._1 + elem._1, acc._2 + elem._2))
Output:
(book2,(5,10))
(book1,(120,110))
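If you want flat triples back, matching the shape asked for in the question, one more map can remove the nesting:

yourRDD.map(t => (t._1, (t._2, t._3)))
  .reduceByKey((acc, elem) => (acc._1 + elem._1, acc._2 + elem._2))
  .map { case (name, (pages, price)) => (name, pages, price) } // back to (String, Int, Int)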

use print in scala

I am trying to use println to print the output in a certain format but am not able to get it.
val vgdataLines = sc.textFile("hdfs:///user/ashhall1616/bdc_data/assignment/t1/vgsales-small.csv")
val vgdata = vgdataLines.map(_.split(";"))
val countPublisher = vgdata.map(r => (r(4),1))
val totalcount= countPublisher.count().toInt
val reducePublisher = countPublisher.reduceByKey(_+_)
def toPercentage(x: Int): Double = {x * 100/totalcount}
val top50 = countPublisher.map(r => (r._1, r._2, toPercentage(r._2)))
val top50desc= top50.sortBy(_._2, false)
println(top50desc.take(50))
Expected output format:
(Ubisoft,3,15.0)
(Activision,3,15.0)
(Electronic Arts,2,10.0)
(Nintendo,2,10.0)
(Acclaim Entertainment,1,5.0)
(Sega,1,5.0)
(3DO,1,5.0)
(Namco Bandai Games,1,5.0)
Format I am getting:
res1: Array[(String, Int, Double)] = Array((Sony Computer Entertainment,1,5.0), (Activision,1,5.0), (Nintendo,1,5.0), (Activision,1,5.0), (Nintendo,1,5.0), (3DO,1,
5.0), (Sega,1,5.0), (TDK Mediactive,1,5.0), (Capcom,1,5.0), (Atari,1,5.0), (Konami Digital Entertainment,1,5.0), (Namco Bandai Games,1,5.0), (Electronic Arts,1,5.0
), (Kalypso Media,1,5.0), (Ubisoft,1,5.0), (Ubisoft,1,5.0), (Electronic Arts,1,5.0), (Ubisoft,1,5.0), (Acclaim Entertainment,1,5.0), (Activision,1,5.0))
This is what I get when I use top50desc.take(50) instead of println(top50desc.take(50)).
Given
val l = List[(String, Int, Double)](
("Ubisoft", 3, 15.0),
("Activision", 3, 15.0),
("Electronic Arts", 2, 10.0)
)
Note the difference between printing each element of the collection:
l.foreach(println)
// (Ubisoft,3,15.0)
// (Activision,3,15.0)
// (Electronic Arts,2,10.0)
and printing the collection itself:
println(l)
// List((Ubisoft,3,15.0), (Activision,3,15.0), (Electronic Arts,2,10.0))
foreach is intended for when we wish to apply some side effect, such as printing, to each element.
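Applied to the snippet in the question, that means printing each element of the array returned by take rather than the array itself:

top50desc.take(50).foreach(println) // one (publisher, count, percentage) line per element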

Scala - String and Column objects

Here the variable "exprs" is of Column type (i.e. exprs: Array[org.apache.spark.sql.Column] = Array(sum(country), sum(value), sum(price))).
Why does exprs: _* run into an error? Why should I provide head and tail, which, as far as I understand, is only needed for the String-based variants?
val resGroupByDF2 = data.groupBy($"country").agg(exprs: _*) // why does this not work
case class cname(
  country: String,
  value: Double,
  price: Double
)

val data = Seq(
  cname("NA", 2, 14),
  cname("EU", 4, 61),
  cname("FE", 5, 1)
).toDF()
val exprs = data.columns.map(sum(_)) // here it returns exprs: Array[org.apache.spark.sql.Column] = Array(sum(country), sum(value), sum(price))
val resGroupByDF2 = data.groupBy($"country").agg(exprs.head, exprs.tail: _*) // why just agg(exprs: _*) does not work in select or agg as it is already a column type
It is because of the signature of agg.
The signature is (expr: org.apache.spark.sql.Column, exprs: org.apache.spark.sql.Column*): org.apache.spark.sql.DataFrame. It expects at least one Column, plus an optional var-arg of further Columns.
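For contrast, select is declared as a pure var-arg (select(cols: Column*)), so expanding the whole array there is fine; agg needs its first Column spelled out explicitly. A short sketch, assuming exprs is non-empty:

// select takes only a var-arg of Columns, so the expansion works as-is
data.select(exprs: _*)
// agg requires one explicit Column plus a var-arg, hence the head/tail split
data.groupBy($"country").agg(exprs.head, exprs.tail: _*)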

How to Loop through multiple Col values in a dataframe to get count

I have a list of tables, let's say x, y, and z, and each table has some columns, for example test, test1, test2, test3 for table x, columns like rem, rem1, rem2 for table y, and similarly for table z. The requirement is that we loop through each column in a table and get a row count based on the scenario below.
If test is not NULL and all the others (test1, test2, test3) are NULL, then that row counts as 1.
We have to loop through each table, find the columns like test*, and mark a row with a count of 1 if it satisfies the above condition.
I'm pretty new to Scala, but I thought of the approach below.
// rough pseudocode of the intended approach
for each $tablename {
  val df = sql(s"select * from $tablename")
  val coldf = df.select(df.columns.filter(_.startsWith("test")).map(df(_)): _*)
  val df_filtered = coldf.map(eachrow => df.filter($"$eachrow".isNull))
}
It is not working for me, and I have no idea where to put the count variable. If someone can help with this, I would really appreciate it.
I'm using Spark 2 with Scala.
Code update:
Below is the code for generating the table list and the table-to-column mapping list.
val table_names = sql("SELECT t1.Table_Name ,t1.col_name FROM table_list t1 LEFT JOIN db_list t2 ON t2.tableName == t1.Table_Name WHERE t2.tableName IS NOT NULL ").toDF("tabname", "colname")
//List of all tables in the db as list of df
val dfList = table_names.select("tabname").map(r => r.getString(0)).collect.toList
val dfTableList = dfList.map(spark.table)
//Mapping each table with keycol
val tabColPairList = table_names.rdd.map( r => (r(0).toString, r(1).toString)).collect
val dfColMap = tabColPairList.map{ case (t, c) => (spark.table(t), c) }.toMap
After this I'm using the methods below.
def createCount(row: Row, keyCol: String, checkCols: Seq[String]): Int = {
if (row.isNullAt(row.fieldIndex(keyCol))) 0 else {
val nonNulls = checkCols.map(row.fieldIndex(_)).foldLeft( 0 )(
(acc, i) => if (row.isNullAt(i)) acc else acc + 1
)
if (nonNulls == 0) 1 else 0
}
}
val dfCountList = dfTableList.map{ df =>
val keyCol = dfColMap(df)
//println(keyCol)
val colPattern = s"$keyCol\\d+".r
val checkCols = df.columns.map( c => c match {
case colPattern() => Some(c)
case _ => None
} ).flatten
val rddWithCount = df.rdd.map{ case r: Row =>
Row.fromSeq(r.toSeq ++ Seq(createCount(r, keyCol, checkCols)))
}
spark.createDataFrame(rddWithCount, df.schema.add("count", IntegerType))
}
It's giving me the error below:
createCount: (row: org.apache.spark.sql.Row, keyCol: String, checkCols: Seq[String])Int
java.util.NoSuchElementException: key not found: [id: string, projid: string ... 40 more fields]
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at $$$$cfae60361b5eed1a149235c6e9b59b24$$$$$anonfun$1.apply(<console>:121)
at $$$$cfae60361b5eed1a149235c6e9b59b24$$$$$anonfun$1.apply(<console>:120)
at scala.collection.immutable.List.map(List.scala:273)
... 78 elided
Given your requirement, I would suggest taking advantage of RDD's functionality and using a Row-based method that creates your count for each Row per DataFrame:
val dfX = Seq(
("a", "ma", "a1", "a2", "a3"),
("b", "mb", null, null, null),
("null", "mc", "c1", null, "c3")
).toDF("xx", "mm", "xx1", "xx2", "xx3")
val dfY = Seq(
("nd", "d", "d1", null),
("ne", "e", "e1", "e2"),
("nf", "f", null, null)
).toDF("nn", "yy", "yy1", "yy2")
val dfZ = Seq(
("g", null, "g1", "g2", "qg"),
("h", "ph", null, null, null),
("i", "pi", null, null, "qi")
).toDF("zz", "pp", "zz1", "zz2", "qq")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.IntegerType
val dfList = List(dfX, dfY, dfZ)
val dfColMap = Map(dfX -> "xx", dfY -> "yy", dfZ -> "zz")
def createCount(row: Row, keyCol: String, checkCols: Seq[String]): Int = {
if (row.isNullAt(row.fieldIndex(keyCol))) 0 else {
val nonNulls = checkCols.map(row.fieldIndex(_)).foldLeft( 0 )(
(acc, i) => if (row.isNullAt(i)) acc else acc + 1
)
if (nonNulls == 0) 1 else 0
}
}
val dfCountList = dfList.map{ df =>
val keyCol = dfColMap(df)
val colPattern = s"$keyCol\\d+".r
val checkCols = df.columns.map( c => c match {
case colPattern() => Some(c)
case _ => None
} ).flatten
val rddWithCount = df.rdd.map{ case r: Row =>
Row.fromSeq(r.toSeq ++ Seq(createCount(r, keyCol, checkCols)))
}
spark.createDataFrame(rddWithCount, df.schema.add("count", IntegerType))
}
dfCountList(0).show
// +----+---+----+----+----+-----+
// | xx| mm| xx1| xx2| xx3|count|
// +----+---+----+----+----+-----+
// | a| ma| a1| a2| a3| 0|
// | b| mb|null|null|null| 1|
// |null| mc| c1|null| c3| 0|
// +----+---+----+----+----+-----+
dfCountList(1).show
// +---+---+----+----+-----+
// | nn| yy| yy1| yy2|count|
// +---+---+----+----+-----+
// | nd| d| d1|null| 0|
// | ne| e| e1| e2| 0|
// | nf| f|null|null| 1|
// +---+---+----+----+-----+
dfCountList(2).show
// +---+----+----+----+----+-----+
// | zz| pp| zz1| zz2| qq|count|
// +---+----+----+----+----+-----+
// | g|null| g1| g2| qg| 0|
// | h| ph|null|null|null| 1|
// | i| pi|null|null| qi| 1|
// +---+----+----+----+----+-----+
[UPDATE]
Note that the above solution works for any number of DataFrames as long as you have them in dfList and their corresponding key columns in dfColMap.
If you have a list of Hive tables instead, simply convert them into DataFrames using spark.table(), as below:
val tableList = List("tableX", "tableY", "tableZ")
val dfList = tableList.map(spark.table)
// dfList: List[org.apache.spark.sql.DataFrame] = List(...)
Now you still have to tell Spark what the key column for each table is. Let's say you have the key columns in the same order as the table list. You can zip the two lists to create dfColMap, and you'll have everything needed to apply the above solution:
val keyColList = List("xx", "yy", "zz")
val dfColMap = dfList.zip(keyColList).toMap
// dfColMap: scala.collection.immutable.Map[org.apache.spark.sql.DataFrame,String] = Map(...)
[UPDATE #2]
If you have the Hive table names and their corresponding key column names stored in a DataFrame, you can generate dfColMap as follows:
val dfTabColPair = Seq(
("tableX", "xx"),
("tableY", "yy"),
("tableZ", "zz")
).toDF("tabname", "colname")
val tabColPairList = dfTabColPair.rdd.map( r => (r(0).toString, r(1).toString)).
collect
// tabColPairList: Array[(String, String)] = Array((tableX,xx), (tableY,yy), (tableZ,zz))
val dfColMap = tabColPairList.map{ case (t, c) => (spark.table(t), c) }.toMap
// dfColMap: scala.collection.immutable.Map[org.apache.spark.sql.DataFrame,String] = Map(...)
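As a side note, the same per-row count can also be expressed purely with Column functions instead of going through the RDD API. A minimal sketch under the same dfList/dfColMap assumptions (and assuming every table has at least one keyColN-style column, so the reduce below never sees an empty array):

import org.apache.spark.sql.functions.{col, lit, when}

val dfCountList2 = dfList.map { df =>
  val keyCol = dfColMap(df)
  // columns whose names look like keyCol followed by digits, e.g. xx1, xx2
  val checkCols = df.columns.filter(_.matches(s"$keyCol\\d+"))
  val othersAllNull = checkCols.map(col(_).isNull).reduce(_ && _)
  df.withColumn("count",
    when(col(keyCol).isNotNull && othersAllNull, lit(1)).otherwise(lit(0)))
}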

filtering dataframe in scala

Let's say I have a DataFrame created from a text file using a case class schema. Below is the data stored in the DataFrame:
id, Type, qt, P
1, X, 10, 100.0
2, Y, 20, 200.0
1, Y, 15, 150.0
1, X, 5, 120.0
I need to filter the DataFrame by "id" and Type, and for every "id" iterate through the DataFrame for some calculation.
I tried it this way, but it did not work. Code snapshot:
case class MyClass(id: Int, type: String, qt: Long, PRICE: Double)
val df = sc.textFile("xyz.txt")
.map(_.split(","))
.map(p => MyClass(p(0).trim.toInt, p(1), p(2).trim.toLong, p(3).trim.toDouble)
.toDF().cache()
val productList: List[Int] = df.map{row => row.getInt(0)}.distinct.collect.toList
val xList: List[RDD[MyClass]] = productList.map {
productId => df.filter({ item: MyClass => (item.id== productId) && (item.type == "X" })}.toList
val yList: List[RDD[MyClass]] = productList.map {
productId => df.filter({ item: MyClass => (item.id== productId) && (item.type == "Y" })}.toList
Taking the distinct idea from your example, simply iterate over all the IDs and filter the DataFrame according to the current ID. After this you have a DataFrame with only the relevant data:
val df3 = sc.textFile("src/main/resources/importantStuff.txt") //Your data here
.map(_.split(","))
.map(p => MyClass(p(0).trim.toInt, p(1), p(2).trim.toLong, p(3).trim.toDouble)).toDF().cache()
val productList: List[Int] = df3.map{row => row.getInt(0)}.distinct.collect.toList
println(productList)
productList.foreach(id => {
val sqlDF = df3.filter(df3("id") === id)
sqlDF.show()
})
sqlDF inside the loop is the DataFrame with the relevant data; you can later run your calculations on it.
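Since the question also mentions filtering by Type, both conditions can be combined inside the same loop; a minimal sketch, assuming the column is named "Type" as in the sample data:

productList.foreach(id => {
  // hypothetical: restrict to the current id and a given Type value
  val xDF = df3.filter(df3("id") === id && df3("Type") === "X")
  val yDF = df3.filter(df3("id") === id && df3("Type") === "Y")
  xDF.show()
  yDF.show()
})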

Scala groupBy of a tuple to calculate stock basis

I am working on an exercise to calculate stock basis given a list of stock purchases in the form of triples (ticker, qty, stock_price). I've got it working, but I would like to do the calculation part in a more functional way. Does anyone have an answer for this?
// input:
// List(("TSLA", 20, 200),
//      ("TSLA", 20, 100),
//      ("FB", 10, 100))
// output:
// List(("FB", (10, 100)),
//      ("TSLA", (40, 150)))
def generateBasis(trades: Iterable[(String, Int, Int)]) = {
val basises = trades groupBy(_._1) map {
case (key, pairs) =>
val quantity = pairs.map(_._2).toList
val price = pairs.map(_._3).toList
var totalPrice: Int = 0
for (i <- quantity.indices) {
totalPrice += quantity(i) * price(i)
}
key -> (quantity.sum, totalPrice / quantity.sum)
}
basises
}
This looks like it might work for you. (updated)
def generateBasis(trades: Iterable[(String, Int, Int)]) =
trades.groupBy(_._1).mapValues {
_.foldLeft((0,0)){case ((tq,tp),(_,q,p)) => (tq + q, tp + q * p)}
}.map{case (k, (q,p)) => (k,q,p/q)} // turn Map into tuples (triples)
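For reference, a quick check with the sample trades from the question (output ordering may vary, since groupBy returns a Map):

val trades = List(("TSLA", 20, 200), ("TSLA", 20, 100), ("FB", 10, 100))
generateBasis(trades).foreach(println)
// (TSLA,40,150)  -- 40 shares, 6000 total cost, basis 150
// (FB,10,100)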
I came up with the solution below. Thanks everyone for their input. I'd love to hear if anyone had a more elegant solution.
// input:
// List(("TSLA", 20, 200),
//      ("TSLA", 10, 100),
//      ("FB", 5, 50))
// output:
// List(("FB", (5, 50)),
//      ("TSLA", (30, 166)))
def generateBasis(trades: Iterable[(String, Int, Int)]) = {
val groupedTrades = (trades groupBy(_._1)) map {
case (key, pairs) =>
key -> (pairs.map(e => (e._2, e._3)))
} // List((FB,List((5,50))), (TSLA,List((20,200), (10,100))))
val costBasises = for {groupedTrade <- groupedTrades
tradeCost = for {tup <- groupedTrade._2 // (qty, cost)
} yield tup._1 * tup._2 // (trade_qty * trade_cost)
tradeQuantity = for { tup <- groupedTrade._2
} yield tup._1 // trade_qty
} yield (groupedTrade._1, tradeQuantity.sum, tradeCost.sum / tradeQuantity.sum )
costBasises.toList // List(("FB", (5, 50)),("TSLA", (30, 166)))
}