How to aggregate data in Spark using Scala?

I have a data set test1.txt. It contains data like the following:
2::1::3
1::1::2
1::2::2
2::1::5
2::1::4
3::1::2
3::1::1
3::2::2
I have created a DataFrame using the code below.
case class Test(userId: Int, movieId: Int, rating: Float)

def pRating(str: String): Test = {
  val fields = str.split("::")
  assert(fields.size == 3)
  Test(fields(0).toInt, fields(1).toInt, fields(2).toFloat)
}
val ratings = spark.read.textFile("C:/Users/test/Desktop/test1.txt").map(pRating).toDF()
The resulting DataFrame contains rows like:
2,1,3
1,1,2
1,2,2
2,1,5
2,1,4
3,1,2
3,1,1
3,2,2
But I want to print the output like below, i.e. removing duplicate (userId, movieId) combinations and replacing the third field with the sum of ratings for each combination:
1,1,2.0
1,2,2.0
2,1,12.0
3,1,3.0
3,2,2.0
Please help me with this: how can I achieve it?

To drop duplicates, use df.distinct. To aggregate, you first groupBy and then agg. Putting this all together:
import spark.implicits._
import org.apache.spark.sql.functions.sum

case class Rating(userId: Int, movieId: Int, rating: Float)

def pRating(str: String): Rating = {
  val fields = str.split("::")
  assert(fields.size == 3)
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat)
}

val ratings = spark.read.textFile("C:/Users/test/Desktop/test1.txt").map(pRating)

val totals = ratings.distinct
  .groupBy('userId, 'movieId)
  .agg(sum('rating).as("rating"))
  .as[Rating]
I am not sure you'd want the final result as a Dataset[Rating], or whether the distinct and sum logic is exactly what you want, as the example in the question is not very clear, but hopefully this will give you what you need.
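Since the question's sample data is small, the distinct-then-sum logic can be sanity-checked without a Spark session; here is a minimal sketch on plain Scala collections (not Spark code) that reproduces the expected output:

```scala
// The question's sample data as (userId, movieId, rating) triples.
val sample = Seq(
  (2, 1, 3f), (1, 1, 2f), (1, 2, 2f), (2, 1, 5f),
  (2, 1, 4f), (3, 1, 2f), (3, 1, 1f), (3, 2, 2f)
)

// Drop exact duplicates, then sum ratings per (userId, movieId) key,
// mirroring distinct + groupBy + agg(sum) in the Spark answer.
val totals = sample.distinct
  .groupBy { case (u, m, _) => (u, m) }
  .map { case ((u, m), rows) => (u, m, rows.map(_._3).sum) }
  .toSeq
  .sorted
// totals: (1,1,2.0), (1,2,2.0), (2,1,12.0), (3,1,3.0), (3,2,2.0)
```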

ratings.groupBy("userId", "movieId").sum("rating")

Related

Technique to write multiple columns into a single function in Scala

Below are two methods, using Spark with Scala, where I am trying to find whether a column contains a string and then sum the number of occurrences (1 or 0). Is there a better way to write this as a single function, so that we can avoid writing a new method each time a condition gets added? Thanks in advance.
def sumFunctDays1cols(columnName: String, dayid: String, processday: String, fieldString: String, newColName: String): Column = {
  sum(when(('visit_start_time > dayid).and('visit_start_time <= processday).and(lower(col(columnName)).contains(fieldString)), 1).otherwise(0)).alias(newColName)
}

def sumFunctDays2cols(columnName: String, dayid: String, processday: String, fieldString1: String, fieldString2: String, newColName: String): Column = {
  sum(when(('visit_start_time > dayid).and('visit_start_time <= processday).and(lower(col(columnName)).contains(fieldString1) || lower(col(columnName)).contains(fieldString2)), 1).otherwise(0)).alias(newColName)
}
Below is where I am calling the function.
sumFunctDays1cols("columnName", "2019-01-01", "2019-01-10", "mac", "cust_count")
sumFunctDays2cols("columnName", "2019-01-01", "2019-01-10", "mac", "lenovo", "prod_count")
You could do something like below (Not tested yet)
def sumFunctDays2cols(columnName: String, dayid: String, processday: String, newColName: String, fields: String*): Column = {
  sum(
    when(
      ('visit_start_time > dayid)
        .and('visit_start_time <= processday)
        .and(fields.map(lower(col(columnName)).contains(_)).reduce(_ || _)),
      1
    ).otherwise(0)).alias(newColName)
}
And you can use it as
sumFunctDays2cols(
  "columnName",
  "2019-01-01",
  "2019-01-10",
  "prod_count",
  "mac", "lenovo"
)
Hope this helps!
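The varargs-plus-reduce pattern itself is plain Scala; a small sketch with ordinary Boolean predicates (hypothetical sample strings, for illustration only) shows how the OR chain is assembled:

```scala
// Build "value contains any of the given substrings" by mapping each
// substring to a Boolean and OR-ing the results together, the same
// shape as fields.map(...).reduce(_ || _) in the Column version.
def containsAny(value: String, fields: String*): Boolean =
  fields.map(value.toLowerCase.contains(_)).reduce(_ || _)

val hits = Seq("MacBook", "Lenovo T480", "HP Envy")
  .count(containsAny(_, "mac", "lenovo"))
// hits: 2
```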
Make the parameter to your function a list: instead of String1, String2, ..., pass a list of strings.
I have implemented a small example for you:
import org.apache.spark.sql.functions.udf

val df = Seq(
  (1, "mac"),
  (2, "lenovo"),
  (3, "hp"),
  (4, "dell")).toDF("id", "brand")

// dictionary Set of words to check
val dict = Set("mac", "leno", "noname")

val checkerUdf = udf { (s: String) => dict.exists(s.contains(_)) }

df.withColumn("brand_check", checkerUdf($"brand")).show()
I hope this solves your issue. But if you still need help, upload the entire code snippet and I will help you with it.
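The body of the UDF is ordinary Scala, so the check can be verified without a SparkSession; a quick sketch over the same sample brands:

```scala
val dict = Set("mac", "leno", "noname")
val brands = Seq("mac", "lenovo", "hp", "dell")

// Same predicate as the UDF body: does the brand contain any dictionary word?
val checks = brands.map(b => b -> dict.exists(b.contains(_))).toMap
// checks: mac -> true, lenovo -> true, hp -> false, dell -> false
```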

How to print a Monocle Lens as a property accessor style string

Using Monocle I can define a Lens to read a case class member without issue,
val md5Lens = GenLens[Message](_.md5)
This can be used to compare the value of md5 between two objects and fail with an error message that includes the field name when the values differ.
Is there a way to produce a user-friendly string from the Lens alone that identifies the field being read by the lens? I want to avoid providing the field name explicitly
val md5LensAndName = (GenLens[Message](_.md5), "md5")
If there is a solution that also works with lenses with more than one component then even better. For me it would be good even if the solution only worked to a depth of one.
This is fundamentally impossible. Conceptually, a lens is nothing more than a pair of functions: one to get a value from an object and one to obtain a new object using a given value. Those functions may or may not be implemented by accessing the source object's fields. In fact, even the GenLens macro can use a chain of field accessors like _.field1.field2 to generate composite lenses to the fields of nested objects. That can be confusing at first, but this feature has its uses. For example, you can decouple the format of data storage from its representation:
import monocle._

case class Person private (value: String) {
  import Person._

  private def replace(
    array: Array[String], index: Int, item: String
  ): Array[String] = {
    val copy = Array.ofDim[String](array.length)
    array.copyToArray(copy)
    copy(index) = item
    copy
  }

  def replaceItem(index: Int, item: String): Person = {
    val array = value.split(delimiter)
    val newArray = replace(array, index, item)
    val newValue = newArray.mkString(delimiter)
    Person(newValue)
  }

  def getItem(index: Int): String = {
    val array = value.split(delimiter)
    array(index)
  }
}

object Person {
  private val delimiter: String = ";"
  val nameIndex: Int = 0
  val cityIndex: Int = 1
  def apply(name: String, address: String): Person =
    Person(Array(name, address).mkString(delimiter))
}

val name: Lens[Person, String] =
  Lens[Person, String](
    _.getItem(Person.nameIndex)
  )(
    name => person => person.replaceItem(Person.nameIndex, name)
  )

val city: Lens[Person, String] =
  Lens[Person, String](
    _.getItem(Person.cityIndex)
  )(
    city => person => person.replaceItem(Person.cityIndex, city)
  )

val person = Person("John", "London")
val personAfterMove = city.set("New York")(person)
println(name.get(personAfterMove)) // John
println(city.get(personAfterMove)) // New York
While not very performant, that example illustrates the idea: the Person class doesn't have name or city fields, but by wrapping a data extractor and a string-rebuilding function into a Lens, we can pretend it has them. For more complex objects, lens composition works as usual: the inner lens just operates on the extracted object, relying on the outer one to pack it back.
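To make the "pair of functions" view concrete, here is a minimal hand-rolled lens (an illustration, not Monocle's actual implementation) that shows why no field name can be recovered from a lens value:

```scala
// A lens is just a getter plus a setter; nothing about the underlying
// field name survives in the pair, which is why it cannot be printed back.
case class MyLens[S, A](get: S => A, set: A => S => S) {
  def compose[B](inner: MyLens[A, B]): MyLens[S, B] =
    MyLens(s => inner.get(get(s)), b => s => set(inner.set(b)(get(s)))(s))
}

case class Address(city: String)
case class User(name: String, address: Address)

val addressLens = MyLens[User, Address](_.address, a => u => u.copy(address = a))
val cityLens    = MyLens[Address, String](_.city, c => a => a.copy(city = c))
val userCity    = addressLens.compose(cityLens)

val moved = userCity.set("New York")(User("John", Address("London")))
// userCity.get(moved) == "New York", moved.name == "John"
```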

Scala; How can I iterate through a list and iterate through another list within each iteration

I will try to explain my question as well as I can.
So I have a list of stock items within a delivery. I would like to iterate through this list, find the matching product id by iterating through a product list until the product ids of the stock item and the product match, and then increment the product's quantity.
So the StockItem class looks like this:
case class StockItem(val spiId : Int, val pId :Int, val sdId :Int, val quan : Int)
And my Product class looks like this:
case class Product(val prodId : Int, val name : String, val desc : String, val price : Double, var quantity : Int, val location : String)
I have a method to find all the StockItems that have a particular spId, which returns a list of StockItems:
def findAllItemsInSP(sdId: Int): List[StockPurchaseItem] = stockPurchaseItems2.filter(_.sdId == sdId)
I have another method which is unfinished to iterate through this list and increment the quantities of each product:
def incrementStock(spID: Int){
val items = StockPurchaseItem.findAllItemsInSP(spID)
for (i <- items){
if(i.pId == products.prodId)
}
}
products is a set of Product objects. Obviously products.prodId doesn't work, as I need to refer to one element of the products set, not the whole set. I don't know how to find the matching product id in the set of Products. Any help given, I would be very grateful for.
Note: sdId and spId refer to the same thing.
Many thanks
Jackie
1st - All parameters to a case class are automatically class values, so you don't need those val labels.
2nd - You say you "have a method to find all the StockItems that have a particular spId", but the code is filtering for _.sdId == sdId. spId? sdId? A bit confusing.
3rd - You say that "products is a set of Project objects." Did you mean Product objects? I don't see any "Project" code.
So, one thing you could do is make items a Map[Int,Int], which translates pId to quantity, but with a default of zero.
val items = StockPurchaseItem.findAllItemsInSP(spID).map(x => (x.pId, x.quan)).toMap.withDefaultValue(0)
Now you can walk through products, incrementing each quantity by items(prodId).
products.foreach(p => p.quantity += items(p.prodId)) // or something like that
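A runnable sketch of the Map-with-default idea on toy data (hypothetical ids and quantities; note it sums duplicate stock entries per product id, which a plain toMap would silently drop):

```scala
case class Prod(prodId: Int, var quantity: Int)

// Stock items as (pId, quan) pairs, summed per product id,
// with a default of 0 for products that received no stock.
val stock = Seq((1, 5), (2, 3), (1, 2))
val items = stock.groupBy(_._1)
  .map { case (id, xs) => id -> xs.map(_._2).sum }
  .withDefaultValue(0)

val products = Seq(Prod(1, 10), Prod(2, 0), Prod(3, 7))
products.foreach(p => p.quantity += items(p.prodId))
// quantities are now 17, 3, 7
```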
You could do it with a for-comprehension (see the docs) in a similar manner to below.
val stockItems: List[StockItem] = ???
val products: List[Product] = ???
val result: List[Product] = for {
  item <- stockItems
  p <- products
  if p.prodId == item.pId
} yield p.copy(quantity = p.quantity + item.quan) // I expect you'd want different logic to this
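The same for-comprehension join, runnable on toy data with simplified versions of the two case classes:

```scala
case class StockItem(pId: Int, quan: Int)
case class Product(prodId: Int, name: String, quantity: Int)

val stockItems = List(StockItem(1, 5), StockItem(2, 3))
val products   = List(Product(1, "mac", 10), Product(2, "hp", 0), Product(3, "dell", 7))

// Inner join on product id; unmatched products (prodId 3 here) are dropped.
val result = for {
  item <- stockItems
  p <- products
  if p.prodId == item.pId
} yield p.copy(quantity = p.quantity + item.quan)
// result: List(Product(1,mac,15), Product(2,hp,3))
```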
def incrementStock(spID: Int) {
  val items = StockPurchaseItem.findAllItemsInSP(spID)
  for (i <- items) {
    for (p <- products) {
      if (i.pId == p.prodId)
        products += p
    }
  }
}
OK, I have managed to do what I wanted by creating a new Product with the existing properties of that product id and then adding to the quantity.
def incrementStock(spID: Int) {
  val items = StockPurchaseItem.findAllItemsInSP(spID)
  for (i <- items) {
    val existing = Product.findProduct(i.pId).get
    val currentProd = existing.copy(quantity = existing.quantity + i.quan)
    products -= existing
    products += currentProd
  }
  printProductInfo()
}

How to select the most recent records from a table by DateTime in Slick, by nested query

How do I get the last (x) records, according to a date-time field, using Slick, Scala and the JodaTime library? The following code wants to implement the last(x) definition used to explain the question:
import org.joda.time._
import com.github.tototoshi.slick.PostgresJodaSupport._
def last(x: Int): DateTime=???
val callInfo = callTable.filter( _.calltime >= last(x))
Is it possible to implement something like the following for the last method, aiming to produce a nested query?
def last(x: Int) = {
  callTable.sortBy(_.calltime.desc).take(x).sortBy(_.calltime.asc).take(1).map(_.calltime)
}
This returns a Query[Rep[Option[DateTime]], Option[DateTime], Seq], not a DateTime!
I believe what you want is to sort calltime and then take the desired number of elements:
callTable.sortBy(_.calltime.desc).take(x)
The complete code would be something like:
def last(x: Int) = callTable.sortBy(_.calltime.desc).take(x).drop(x - 1).map(_.calltime).firstOption
val callInfo = last(10).map { lastDate => callTable.filter(_.calltime > lastDate).list }
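Independent of Slick, the "most recent x" logic is just sort, take, and filter by a cutoff; it can be checked on plain collections (epoch-millisecond timestamps here, purely for illustration):

```scala
// Call times as epoch milliseconds.
val callTimes = Seq(100L, 400L, 200L, 500L, 300L)

// The x newest timestamps, newest first.
def lastN(x: Int): Seq[Long] = callTimes.sortBy(-_).take(x)

// The cutoff is the oldest of the x newest; filtering by it
// recovers the same rows, mirroring the nested-query idea.
val cutoff = lastN(3).min
val recent = callTimes.filter(_ >= cutoff).sorted
// recent: 300, 400, 500
```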

Scala, finding max value in arrays

This is the first time I've had to ask a question here; there is not enough info on Scala out there for a newbie like me.
Basically what I have is a file filled with hundreds of thousands of lists formatted like this:
(type, date, count, object)
Rows look something like this:
(food, 30052014, 400, banana)
(food, 30052014, 2, pizza)
All I need to do is find the one row with the highest count.
I know I did this a couple of months ago but can't seem to wrap my head around it now. I'm sure I can do this without a function too. All I want to do is set a value and put that row in it but I can't figure it out.
I think basically what I want to do is a Math.max on the 3rd element in the lists, but I just can't get it.
Any help will be kindly appreciated. Sorry if my wording or formatting of this question isn't the best.
EDIT: There's some extra info I've left out that I should probably add:
All the records are stored in a tsv file. I've done this to split them:
val split_food = food.map(_.split("\t"))
so basically I think I need to use split_food... somehow
Modified version of @Szymon's answer with your edit addressed:
val split_food = food.map(_.split("\t"))
val max_food = split_food.maxBy(tokens => tokens(2).toInt)
or, analogously:
val max_food = split_food.maxBy { case Array(_, _, count, _) => count.toInt }
In case you're using apache spark's RDD, which has limited number of usual scala collections methods, you have to go with reduce
val max_food = split_food.reduce { (max: Array[String], current: Array[String]) =>
val curCount = current(2).toInt
val maxCount = max(2).toInt // you probably would want to preprocess all items,
// so .toInt will not be called again and again
if (curCount > maxCount) current else max
}
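Both approaches can be checked on the question's two sample rows, pre-split into token arrays:

```scala
// Sample rows pre-split into (type, date, count, object) tokens.
val splitFood = Seq(
  Array("food", "30052014", "400", "banana"),
  Array("food", "30052014", "2", "pizza")
)

// maxBy on the parsed count column.
val maxByRow = splitFood.maxBy(tokens => tokens(2).toInt)

// The same selection with reduce, as needed for a Spark RDD.
val reduceRow = splitFood.reduce { (max, current) =>
  if (current(2).toInt > max(2).toInt) current else max
}
// both select the "banana" row (count 400)
```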
You should use maxBy function:
case class Purchase(category: String, date: Long, count: Int, name: String)

object Purchase {
  def apply(s: String) = s.split("\t") match {
    case Array(cat, date, count, name) => Purchase(cat, date.toLong, count.toInt, name)
  }
}
foodRows.map(row => Purchase(row)).maxBy(_.count)
Simply:
case class Record(food:String, date:String, count:Int)
val l = List(Record("ciccio", "x", 1), Record("buffo", "y", 4), Record("banana", "z", 3))
l.maxBy(_.count)
>>> res8: Record = Record(buffo,y,4)
Not sure if you got the answer yet, but I had the same issue with maxBy. I found that once I added import scala.io.Source, I was able to use maxBy and it worked.