How to get Map with matching values - scala

I have a file with values like this:
user id | item id | rating | timestamp
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
62 257 2 879372434
200 222 5 876042340
210 40 3 891035994
224 29 3 888104457
303 785 3 879485318
122 387 5 879270459
194 274 2 879539794
......
I want to find all rows where item id = "590"
and build a Map from the rating values (1-5) to their counts, like this: {1->6, 2->5, 3->10, 4->6, 5->14}
object Parse {
  def main(args: Array[String]): Unit = {
    // pull the data out of u.data
    var a: List[(String, String, String, String)] = List()
    for (line <- io.Source.fromFile("F:\\big data\\u.data").getLines()) {
      val fields = line.replace("\t", ",").split(",")
      if (fields.length >= 4) { // skip malformed lines; Scala has no break statement
        val userId = fields(0)
        val itemId = fields(1)
        val rating = fields(2)
        val timestamp = fields(3)
        a = a :+ ((userId, itemId, rating, timestamp))
      }
    }
    a = a.filter(_._2.equals("590")) // this filters the list of tuples correctly
    val ratings: List[String] = a.map(_._3) // I tried to get the list of all ratings with _._2, but that selects the item id
  }
}
How can I create the map of rating counts?
From this related question, I can see we can generate a map of matching values:
Scala groupBy for a list

If what you want is a Map of rating->count for a given "item id", this should do it.
util.Using(io.Source.fromFile("../junk.txt")) { file =>
  val rec = raw"\d+\s+590\s+(\d+)\s+\d+".r // only this item id
  file.getLines()
    .collect { case rec(rating) => rating }
    .foldLeft(Map.empty[String, Int]) {
      case (m, r) => m + (r -> (m.getOrElse(r, 0) + 1))
    }
}.getOrElse(Map.empty[String, Int])
Note that fromFile() is automatically closed at the end of the Using block.
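For reference, here is how that counting step behaves on a few in-memory lines (a minimal sketch; the rows are made up, since the question's excerpt shows no item 590 rows):
val lines = List(
  "196\t242\t3\t881250949", // different item id: dropped by the regex
  "22\t590\t1\t878887116",
  "62\t590\t1\t879372434",
  "200\t590\t3\t876042340"
)
val rec = raw"\d+\s+590\s+(\d+)\s+\d+".r
val counts = lines
  .collect { case rec(rating) => rating }
  .foldLeft(Map.empty[String, Int]) { case (m, r) => m + (r -> (m.getOrElse(r, 0) + 1)) }
// counts == Map("1" -> 2, "3" -> 1)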

I think a for-loop is not the best choice here. Try to look at your problem as a data-stream problem rather than an array problem. scala.io.Source.fromFile("F:\\big data\\u.data").getLines() returns an Iterator[String] over your lines, and it is more suitable to use it as a data stream than as an array of data. Under your conditions it is better to use a combination of the map, filter, collect and groupBy functions to get the rows grouped by rating.
Full correct code:
val sourceFile = scala.io.Source.fromFile("F:\\big data\\u.data")
try {
  // materialise the lines first: otherwise the validation below would consume the iterator
  val linesOfArrays = sourceFile.getLines().map(_.split("\t")).toList // the file is tab-separated
  require(!linesOfArrays.exists(_.length < 4)) // your data schema validation
  val ratingCountsMap: Map[String, Int] = linesOfArrays
    .collect {
      case rowValuesArray if rowValuesArray(1) == "590" =>
        // this yields the rating together with a 1 for counting
        rowValuesArray(2) -> 1
    }
    .groupBy { case (rating, _) => rating }
    .mapValues { groupWithSameRating => groupWithSameRating.length }
    .toMap
} finally sourceFile.close()
And don't forget to release the resource (in your case, the file) by calling close in a finally block, or use the scala-arm library (more about resources here).
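On Scala 2.13+, the grouping and counting can also be written more compactly with groupMapReduce (a sketch, assuming the materialised linesOfArrays list from the code above):
val ratingCounts: Map[String, Int] = linesOfArrays
  .filter(_(1) == "590")               // keep only the wanted item id
  .groupMapReduce(_(2))(_ => 1)(_ + _) // key = rating, value = running count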

Related

Split Map to multiple Maps

I need to process a diff between two (huge) Maps. To parallelize the task, I would like to split the two Maps by key hash value and create smaller Maps (one per range of hash values).
How would I achieve that in (idiomatic) Scala?
Here's a rough sketch to get you started with the Scala syntax:
// create two (slightly different) maps, print them as a table side by side
val rnd = new util.Random
val originalMap1 = (0 to 10).map(i => (i, i * i)).toMap
val originalMap2 = (0 to 10).map(i => (i, i * i + rnd.nextInt(2))).toMap
for (i <- 0 to 10) {
  val a = originalMap1(i)
  val b = originalMap2(i)
  val marker = if (a == b) "" else " <-"
  println(s"$i: $a $b $marker")
}
// subdivide into smaller maps
val numSubmaps = 5
val submaps1 = originalMap1.groupBy(_._1.hashCode % numSubmaps)
val submaps2 = originalMap2.groupBy(_._1.hashCode % numSubmaps)
// compare each corresponding pair of maps separately, merge diffs
val diffs = (for (s <- 0 until numSubmaps) yield {
  val m1 = submaps1(s)
  val m2 = submaps2(s)
  for {
    k <- m1.keys
    a = m1(k)
    b = m2(k)
    if a != b
  } yield (k, (a, b))
}).reduce(_ ++ _)
println(diffs.toList.sortBy(_._1))
Input:
0: 0 1 <-
1: 1 2 <-
2: 4 4
3: 9 9
4: 16 16
5: 25 26 <-
6: 36 36
7: 49 49
8: 64 65 <-
9: 81 82 <-
10: 100 101 <-
Output:
List((0,(0,1)), (1,(1,2)), (5,(25,26)), (8,(64,65)), (9,(81,82)), (10,(100,101)))
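One caveat when adapting this sketch to arbitrary key types: in Scala, as in Java, % can yield a negative remainder for negative hash codes, so some keys would land in negative bucket indices and submaps1(s) would miss them. A hedged fix is to normalise the bucket index first:
// normalise hashCode into [0, n) so negative hash codes land in valid buckets
def bucket(key: Any, n: Int): Int = ((key.hashCode % n) + n) % n
val submaps1 = originalMap1.groupBy(kv => bucket(kv._1, numSubmaps))
val submaps2 = originalMap2.groupBy(kv => bucket(kv._1, numSubmaps))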

Want to parse a file and reformat it to create a pairRDD in Spark through Scala

I have dataset in a file in the form:
1: 1664968
2: 3 747213 1664968 1691047 4095634 5535664
3: 9 77935 79583 84707 564578 594898 681805 681886 835470 880698
4: 145
5: 8 57544 58089 60048 65880 284186 313376
6: 8
I need to transform this to something like below using Spark and Scala as a part of preprocessing of data:
1 1664968
2 3
2 747213
2 1664968
2 4095634
2 5535664
3 9
3 77935
3 79583
3 84707
And so on....
Can anyone provide input on how this can be done?
The length of the original rows in the file varies, as shown in the dataset example above.
I am not sure how to go about doing this transformation.
I tried something like the code below, which gives me a pair of the key and the first element after the colon.
But I am not sure how to iterate over the entire data and generate the pairs as needed.
def main(args: Array[String]): Unit = {
  val sc = new SparkContext(new SparkConf().setAppName("Graphx").setMaster("local"))
  val rawLinks = sc.textFile("src/main/resources/links-simple-sorted-top100.txt")
  rawLinks.take(5).foreach(println)
  val formattedLinks = rawLinks.map { rows =>
    val fields = rows.split(":")
    val fromVertex = fields(0)
    val toVerticesArray = fields(1).split(" ")
    (fromVertex, toVerticesArray(1))
  }
  val topFive = formattedLinks.take(5)
  topFive.foreach(println)
}
val rdd = sc.parallelize(List("1: 1664968", "2: 3 747213 1664968 1691047 4095634 5535664"))
val keyValues = rdd.flatMap { line =>
  val Array(key, values) = line.split(":", 2)
  for (value <- values.trim.split("""\s+"""))
    yield (key, value.trim)
}
keyValues.collect
Split each row into 2 parts, then map over the variable number of columns:
def transform(s: String): Array[String] = {
  val Array(head, tail) = s.split(":", 2)
  tail.trim.split("""\s+""").map(x => s"$head $x")
}
> transform("2: 3 747213 1664968 1691047 4095634 5535664")
// Array(2 3, 2 747213, 2 1664968, 2 1691047, 2 4095634, 2 5535664)
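If the goal is an actual pair RDD of (key, value) tuples rather than formatted strings, the same split can yield tuples instead (a sketch; transformPairs is a hypothetical helper name, applied to rawLinks from the question):
def transformPairs(s: String): Array[(String, String)] = {
  val Array(head, tail) = s.split(":", 2)
  tail.trim.split("""\s+""").map(x => (head, x)) // yield (key, value) pairs
}
val pairRdd = rawLinks.flatMap(transformPairs) // RDD[(String, String)]
pairRdd.take(10).foreach(println)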

Scala Saddle Filtering Column Values

I am new to Scala Saddle. I have three columns (customer name, age, and status) in a frame. I have to apply a filter on the age column: if a customer's age is more than 18, I need to set the status to "eligible", otherwise to "noteligible".
Code:
f.col("age").filterAt(x => x > 18) // but how do I update the Status column?
Frames are immutable containers, so it is probably better to build your frame with the values fully initialised than to start with a partially initialised Frame.
import org.saddle._

object Test {
  def main(args: Array[String]): Unit = {
    val names: Vec[Any] = Vec("andy", "bruce", "cheryl", "dino", "edgar", "frank", "gollum", "harvey")
    val ages: Vec[Any] = Vec(4, 89, 7, 21, 14, 18, 23004, 65)
    def status(age: Any): Any = if (age.asInstanceOf[Int] >= 18) "eligible" else "noteligible"
    def mapper(indexAge: (Int, Any)): (Int, _) = indexAge match {
      case (index, age) => (index, status(age))
    }
    val nameAge: Frame[Int, String, Any] = Frame("name" -> names, "age" -> ages)
    val ageCol: Series[Int, Any] = nameAge.colAt(1)
    val eligible: Series[Int, Any] = ageCol.map(mapper)
    println("" + nameAge)
    println("" + eligible)
    val nameAgeStatus: Frame[Int, String, _] = nameAge.joinSPreserveColIx(eligible, how = index.LeftJoin, "status")
    println("" + nameAgeStatus)
  }
}
If you really need to start from a partially initialised Frame, you can always drop the uninitialised column and add it back with the correctly calculated values.
Although I would prefer to strongly type the data columns, I think a Frame only contains data of one type, and the common type for "Int" and "String" is "Any". This also affects the type signatures of the methods, although you might want to inline them without the type information anyway.
I found that looking at the scaladoc helped a lot.
This is the output from the final println call:
[8 x 3]
name age status
------ ----- -----------
0 -> andy 4 noteligible
1 -> bruce 89 eligible
2 -> cheryl 7 noteligible
3 -> dino 21 eligible
4 -> edgar 14 noteligible
5 -> frank 18 eligible
6 -> gollum 23004 eligible
7 -> harvey 65 eligible

Scala: .take(1) in for-comprehension?

val SumABC = 1000
val Max = 468
val Min = 32
val p9 = for {
  a <- Max to 250 by -1
  b <- Min + (Max - a) to 249
  if a * a + b * b == (SumABC - a - b) * (SumABC - a - b)
} yield a * b * (SumABC - a - b)
Can I .take(1) here? (I tried to translate it to flatmap, filter, etc, but since I failed I guess it wouldn't be as readable anyway...)
If I understood your cryptic question, what you would like to do is the following:
val p9 = (for {
  a <- Max to 250 by -1
  b <- Min + (Max - a) to 249
  if a * a + b * b == (SumABC - a - b) * (SumABC - a - b)
} yield a * b * (SumABC - a - b)).take(1)
Just add parentheses before the for and after the yield expression to ensure the take method is called on the result of the for block.
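Note that a Range-based for-comprehension is strict, so the whole sequence is still computed before take(1) trims it. If you want the search to stop at the first hit, one option (a sketch) is to run the generators over iterators, which makes take(1) short-circuit:
val p9 = (for {
  a <- (Max to 250 by -1).iterator
  b <- (Min + (Max - a) to 249).iterator
  if a * a + b * b == (SumABC - a - b) * (SumABC - a - b)
} yield a * b * (SumABC - a - b)).take(1).toList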

Is there a data structure / library to do in memory olap / pivot tables in Java / Scala?

Relevant questions
This question is quite relevant, but is 2 years old: In memory OLAP engine in Java
Background
I would like to create a pivot-table-like matrix from a given tabular dataset, in memory,
e.g. an age by marital status count (rows are ages, columns are marital statuses).
The input: a List of People, each with an age and some Boolean property (e.g. married).
The desired output: a count of People, by age (row) and isMarried (column).
What I've tried (Scala)
case class Person(age: Int, isMarried: Boolean)
...
val people: List[Person] = ...
val peopleByAge = people.groupBy(_.age) // only by age
val peopleByMaritalStatus = people.groupBy(_.isMarried) // only by marital status
I managed to do it the naive way: first group by age, then map each group to a count by marital status, and finally foldRight to aggregate the running totals:
TreeMap(peopleByAge.toSeq: _*).map { x =>
  val age = x._1
  val rows = x._2
  val numMarried = rows.count(_.isMarried)
  val numNotMarried = rows.length - numMarried
  (age, numMarried, numNotMarried)
}.foldRight(List[FinalResult]()) { (row, list) =>
  val cumMarried = row._2 +
    (if (list.isEmpty) 0 else list.last.cumMarried)
  val cumNotMarried = row._3 +
    (if (list.isEmpty) 0 else list.last.cumNotMarried)
  list :+ new FinalResult(row._1, row._2, row._3, cumMarried, cumNotMarried)
}.reverse
I don't like the above code, it's not efficient, hard to read, and I'm sure there is a better way.
The question(s)
How do I groupBy "both"? And how do I count each subgroup, e.g.:
How many people are exactly 30 years old and married?
Another question is how to compute a running total, to answer:
How many people above 30 are married?
Edit:
Thank you for all the great answers.
Just to clarify, I would like the output to include a "table" with the following columns:
Age (ascending)
Num Married
Num Not Married
Running Total Married
Running Total Not Married
I want not only to answer those specific queries, but to produce a report that allows answering all questions of this type.
Here is an option that is a little more verbose, but does this in a generic fashion instead of using strict data types. You could of course use generics to make this nicer, but I think you get the idea.
/** Creates a new pivot structure by finding correlated values
  * and performing an operation on these values
  *
  * @param accuOp the accumulator function (e.g. sum, max, etc)
  * @param xCol the "x" axis column
  * @param yCol the "y" axis column
  * @param accuCol the column to collect and perform accuOp on
  * @return a new Pivot instance that has been transformed with the accuOp function
  */
def doPivot(accuOp: List[String] => String)(xCol: String, yCol: String, accuCol: String) = {
  // create list of indexes that correlate to x, y, accuCol
  val colsIdx = List(xCol, yCol, accuCol).map(headers.getOrElse(_, 1))
  // group by x and y, sending the resulting collection of
  // accumulated values to the accuOp function for post-processing
  val data = body.groupBy(row => {
    (row(colsIdx(0)), row(colsIdx(1)))
  }).map(g => {
    (g._1, accuOp(g._2.map(_(colsIdx(2)))))
  }).toMap
  // get distinct axis values
  val xAxis = data.map(g => { g._1._1 }).toList.distinct
  val yAxis = data.map(g => { g._1._2 }).toList.distinct
  // create result matrix
  val newRows = yAxis.map(y => {
    xAxis.map(x => {
      data.getOrElse((x, y), "")
    })
  })
  // collect it with axis labels for results
  Pivot(List((yCol + "/" + xCol) +: xAxis) :::
    newRows.zip(yAxis).map(x => { x._2 +: x._1 }))
}
My Pivot type is pretty basic:
class Pivot(val rows: List[List[String]]) {
  val headers = rows.head.zipWithIndex.toMap
  val body = rows.tail
  ...
}
And to test it, you could do something like this:
val marriedP = Pivot(
  List(
    List("Name", "Age", "Married"),
    List("Bill", "42", "TRUE"),
    List("Heloise", "47", "TRUE"),
    List("Thelma", "34", "FALSE"),
    List("Bridget", "47", "TRUE"),
    List("Robert", "42", "FALSE"),
    List("Eddie", "42", "TRUE")
  )
)

def accum(values: List[String]) = {
  values.map(x => { 1 }).sum.toString
}

println(marriedP + "\n")
println(marriedP.doPivot(accum)("Age", "Married", "Married"))
Which yields:
Name     Age  Married
Bill     42   TRUE
Heloise  47   TRUE
Thelma   34   FALSE
Bridget  47   TRUE
Robert   42   FALSE
Eddie    42   TRUE

Married/Age  47  42  34
TRUE         2   2
FALSE            1   1
The nice thing is that you can use currying to pass in any function for the values, like you would in a traditional Excel pivot table.
More can be found here: https://github.com/vinsonizer/pivotfun
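As a usage variation on that currying point, the accumulator can just as easily sum a numeric column instead of counting rows (a sketch; sumAccum is a hypothetical helper and assumes the chosen column parses as Int):
def sumAccum(values: List[String]): String =
  values.map(_.toInt).sum.toString // sum the collected column values
println(marriedP.doPivot(sumAccum)("Age", "Married", "Age"))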
You can
val groups = people.groupBy(p => (p.age, p.isMarried))
and then
val thirty_and_married = groups((30, true))
val over_thirty_and_married_count =
  groups.filterKeys(k => k._1 > 30 && k._2).map(_._2.length).sum
I think it would be better to use the count method on Lists directly.
For question 1
people.count { p => p.age == 30 && p.isMarried }
For question 2
people.count { p => p.age > 30 && p.isMarried }
If you also want the actual groups of people who conform to those predicates, use filter.
people.filter { p => p.age > 30 && p.isMarried }
You could probably optimise these by doing the traversal only once but is that a requirement?
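If a single traversal matters, both counts can be accumulated in one pass with foldLeft (a minimal sketch using the Person case class from the question):
val (exactlyThirtyMarried, overThirtyMarried) =
  people.foldLeft((0, 0)) { case ((exact, over), p) =>
    (exact + (if (p.age == 30 && p.isMarried) 1 else 0),
     over + (if (p.age > 30 && p.isMarried) 1 else 0))
  }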
You can group using a tuple:
val res1 = people.groupBy(p => (p.age, p.isMarried)) // or
val res2 = people.groupBy(p => (p.age, p.isMarried)).mapValues(_.size) // if you don't care about the People instances
You can answer both questions like this:
res2.getOrElse((30, true), 0)
res2.filter{case (k, _) => k._1 > 30 && k._2}.values.sum
res2.filterKeys(k => k._1 > 30 && k._2).values.sum // nicer with filterKeys from Rex Kerr's answer
You could answer both questions with the count method on List:
people.count(p => p.age == 30 && p.isMarried)
people.count(p => p.age > 30 && p.isMarried)
Or using filter and size:
people.filter(p => p.age == 30 && p.isMarried).size
people.filter(p => p.age > 30 && p.isMarried).size
Edit:
A slightly cleaner version of your code:
TreeMap(peopleByAge.toSeq: _*).map { case (age, ps) =>
  val (married, notMarried) = ps.partition(_.isMarried) // partition splits into (married, notMarried) regardless of order
  (age, married.size, notMarried.size)
}.foldLeft(List[FinalResult]()) { case (acc, (age, married, notMarried)) =>
  def prevValue(f: FinalResult => Int) = acc.headOption.map(f).getOrElse(0)
  new FinalResult(age, married, notMarried, prevValue(_.cumMarried) + married, prevValue(_.cumNotMarried) + notMarried) :: acc
}.reverse
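The running totals can also be expressed with scanLeft, which makes the cumulative step explicit (a sketch over hypothetical (age, married, notMarried) triples instead of the FinalResult class):
val rows = List((25, 2, 1), (30, 3, 2), (42, 1, 4)) // hypothetical (age, married, notMarried) triples
val withTotals = rows.scanLeft((0, 0, 0, 0, 0)) {
  case ((_, _, _, cumM, cumNM), (age, m, nm)) =>
    (age, m, nm, cumM + m, cumNM + nm)
}.tail // drop the all-zero seed
// withTotals: List((25,2,1,2,1), (30,3,2,5,3), (42,1,4,6,7))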