Scala Saddle Filtering Column Values

I am new to Scala Saddle. I have three columns (customer name, age and status) in a frame, and I need to filter on the age column: if a customer's age is more than 18, the status should be set to "eligible", otherwise to "noteligible".
Code:
f.col("age").filterAt(x => x > 18) //but how to update Status column

Frames are immutable containers, so it is probably better to build your frame with the values fully initialised than to start with a partially initialised Frame.
import org.saddle._

object Test {
  def main(args: Array[String]): Unit = {
    val names: Vec[Any] = Vec("andy", "bruce", "cheryl", "dino", "edgar", "frank", "gollum", "harvey")
    val ages: Vec[Any] = Vec(4, 89, 7, 21, 14, 18, 23004, 65)

    // a customer is eligible from age 18 up
    def status(age: Any): Any = if (age.asInstanceOf[Int] >= 18) "eligible" else "noteligible"

    // map over (row index, age) pairs, keeping the index and replacing the age with its status
    def mapper(indexAge: (Int, Any)): (Int, _) = indexAge match {
      case (index, age) => (index, status(age))
    }

    val nameAge: Frame[Int, String, Any] = Frame("name" -> names, "age" -> ages)
    val ageCol: Series[Int, Any] = nameAge.colAt(1)
    val eligible: Series[Int, Any] = ageCol.map(mapper)

    println("" + nameAge)
    println("" + eligible)

    val nameAgeStatus: Frame[Int, String, _] = nameAge.joinSPreserveColIx(eligible, how = index.LeftJoin, "status")
    println("" + nameAgeStatus)
  }
}
If you really need to start from a partially initialised Frame, you can always drop the uninitialised column and add it back with the correctly calculated values.
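For instance, here is a rough, untested sketch of that route, reusing the mapper helper from the example above. It assumes Frame has a filterIx method that filters columns by their column-index value; check the scaladoc for your Saddle version before relying on it:
// rough sketch, not tested: drop the uninitialised "status" column and join a recomputed one back on
val partial: Frame[Int, String, Any] = ???                // some frame with a placeholder "status" column
val withoutStatus = partial.filterIx(_ != "status")       // assumed: filters columns by column key
val recomputed: Series[Int, Any] = withoutStatus.colAt(1).map(mapper)
val repaired = withoutStatus.joinSPreserveColIx(recomputed, how = index.LeftJoin, "status")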
Although I would prefer to strongly type the data columns, I think a Frame only holds data of a single type, and the common supertype of Int and String is Any. This also affects the type signatures of the methods, although you might want to inline them without the type annotations anyway.
I found that looking at the scaladoc helped a lot.
This is the output from the final println call:
[8 x 3]
name age status
------ ----- -----------
0 -> andy 4 noteligible
1 -> bruce 89 eligible
2 -> cheryl 7 noteligible
3 -> dino 21 eligible
4 -> edgar 14 noteligible
5 -> frank 18 eligible
6 -> gollum 23004 eligible
7 -> harvey 65 eligible

Related

Filtering unique values from a text file

How do I find and filter unique values from a text file?
I tried the following, but it's not working:
val spark = SparkSession.builder().master("local").appName("distinct").getOrCreate()
var data = spark.sparkContext.textFile("text/file/opath")
val uniqueval = data.map { rec => (rec.split(",")(3).distinct) }
var fils = data.filter(line => line.split(",")(3).equals(uniqueval)).map(x => (x)).foreach { println }
Sample Data:
ID | Name
1 john
1 john
2 david
3 peter
4 steve
Required Output:
1 john
2 david
3 peter
4 steve
You almost have it right: .distinct() just needs to be called on the RDD itself.
I'd replace statement 3 with:
val uniqueval = data.distinct().map...
This assumes that similar records will have identical lines in the text file.
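Put together, a minimal sketch of the corrected job might look like the following. It assumes duplicate records are byte-for-byte identical lines (so distinct() on whole lines is enough), and the input path is a placeholder:
import org.apache.spark.sql.SparkSession

object DistinctLines {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("distinct").getOrCreate()
    val data = spark.sparkContext.textFile("text/file/path")

    val unique = data.distinct()         // dedupe whole lines on the RDD itself
    unique.collect().foreach(println)    // or saveAsTextFile(...) for large outputs

    spark.stop()
  }
}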
Is core scala allowed?
scala> val text = List ("single" , "double", "mono", "double")
text: List[String] = List(single, double, mono, double)
scala> val u = text.distinct
u: List[String] = List(single, double, mono)
scala> val d = text.diff(u)
d: List[String] = List(double)
scala> val s = u.diff (d)
s: List[String] = List(single, mono)
Your code can be something like:
sparkContext.textFile("sample-data.txt").distinct()
  .saveAsTextFile("sample-data-dist.txt")
The distinct method can do what you want.

Saving to a custom output format in Spark / Hadoop

I have an RDD that contains multiple data structures, where one of these data structures is a Map[String, Int].
To visualize it easily I get the following after a map transformation:
val data = ... // This is a RDD[Map[String, Int]]
In one of the elements of this RDD, the Map contains the following:
*key value*
map_id -> 7753
Oscar -> 39
Jaden -> 13
Thomas -> 1
Chris -> 52
Other elements of the RDD then contain other names and numbers, and each map contains its own map_id. Anyhow, if I simply do data.saveAsTextFile(path), I will get the following output in my file:
Map(map_id -> 7753, Oscar -> 39, Jaden -> 13, Thomas -> 1, Chris -> 52)
Map(...)
Map(...)
However, I would like to format it as the following:
---------------------------
map_id: 7753
---------------------------
Oscar: 39
Jaden: 13
Thomas: 1
Chris: 52
---------------------------
map_id: <some other id>
---------------------------
Name: nbr
Name2: nbr2
Basically, the map_id acts as a kind of header, followed by the contents, one blank line, and then the next element.
Which brings me to my question: the data RDD seems to offer only two options, saving as a text file or as an object file, and as far as I can see neither lets me customise the formatting. How could I go about doing this?
You can just map to String and write the result. For example:
def format(map: Map[String, Int]): String = {
  val id = map.get("map_id").map(_.toString).getOrElse("unknown")
  val content = map.collect {
    case (k, v) if k != "map_id" => s"$k: $v"
  }.mkString("\n")
  s"""|---------------------------
      |map_id: $id
      |---------------------------
      |$content
      |""".stripMargin
}
data.map(format(_)).saveAsTextFile(path)
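If you also want the blank line between elements that the question asks for, one option (my assumption, not part of the original answer) is to append a newline in the same map:
data.map(format(_) + "\n").saveAsTextFile(path)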

Updating a 2d table of counts

Suppose I want a Scala data structure that implements a 2-dimensional table of counts that can change over time (i.e., individual cells in the table can be incremented or decremented). What should I be using to do this?
I could use a 2-dimensional array:
val x = Array.fill(3, 3)(0) // e.g. a 3 x 3 table of zeroes
x(1)(2) += 1
But Arrays are mutable, and I guess I should slightly prefer immutable data structures.
So I thought about using a 2-dimensional Vector:
val x = Vector.fill(3, 3)(0)
// how do I update this? I want to write something like val newX : Vector[Vector[Int]] = x.add((1, 2), 1)
// but I'm not sure how
But I'm not sure how to get a new vector with only a single element changed.
What's the best approach?
Best depends on what your criteria are. The simplest immutable variant is to use a map from (Int,Int) to your count:
var c = (for (i <- 0 to 99; j <- 0 to 99) yield (i,j) -> 0).toMap
Then you access your values with c(i,j) and set them with c += ((i,j) -> n); c += ((i,j) -> (c(i,j)+1)) is a little bit annoying, but it's not too bad.
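If that increment pattern comes up a lot, a small helper keeps it readable (a sketch of my own, not from the original answer; bump is a hypothetical name):
// hypothetical helper: increment (or decrement) one cell of the immutable map of counts
def bump(c: Map[(Int, Int), Int], i: Int, j: Int, by: Int = 1): Map[(Int, Int), Int] =
  c + ((i, j) -> (c.getOrElse((i, j), 0) + by))

c = bump(c, 82, 49)      // increment cell (82, 49)
c = bump(c, 82, 49, -1)  // and decrement it again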
Faster is to use nested Vectors--by about a factor of 2 to 3, depending on whether you tend to re-set the same element over and over or not--but it has an ugly update method:
var v = Vector.fill(100,100)(0)
v(82)(49) // Easy enough
v = v.updated(82, v(82).updated(49, v(82)(49)+1)) // Ouch!
Faster yet (by about 2x) is to have only one vector which you index into:
var u = Vector.fill(100*100)(0)
u(82*100 + 49) // Um, you think I can always remember to do this right?
u = u.updated(82*100 + 49, u(82*100 + 49)+1) // Well, that's actually better
If you don't need immutability and your table size isn't going to change, just use an array as you've shown. It's ~200x faster than the fastest vector solution if all you're doing is incrementing and decrementing an integer.
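For completeness, a sketch of that plain mutable-array version (assuming a fixed 100 x 100 table, as in the examples above):
val a = Array.fill(100, 100)(0)  // mutable table of counts
a(82)(49) += 1                   // increment a cell
a(82)(49) -= 1                   // decrement it again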
If you want to do this in a very general and functional (but not necessarily performant) way, you can use lenses. Here's how you could use Scalaz 7's implementation, for example:
import scalaz._
def at[A](i: Int): Lens[Seq[A], A] = Lens.lensg(a => a.updated(i, _), (_(i)))
def at2d[A](i: Int, j: Int) = at[Seq[A]](i) andThen at(j)
And a little bit of setup:
val table = Vector.tabulate(3, 4)(_ + _)
def show[A](t: Seq[Seq[A]]) = t.map(_ mkString " ") mkString "\n"
Which gives us:
scala> show(table)
res0: String =
0 1 2 3
1 2 3 4
2 3 4 5
We can use our lens like this:
scala> show(at2d(1, 2).set(table, 9))
res1: String =
0 1 2 3
1 2 9 4
2 3 4 5
Or we can just get the value at a given cell:
scala> val v: Int = at2d(2, 3).get(table)
v: Int = 5
Or do a lot of more complex things, like apply a function to a particular cell:
scala> show(at2d(2, 2).mod(((_: Int) * 2), table))
res8: String =
0 1 2 3
1 2 3 4
2 3 8 5
And so on.
There isn't a built-in method for this, perhaps because it would require the Vector to know that it contains Vectors, or Vectors of Vectors, etc., whereas most methods are generic, and it would require a separate method for each number of dimensions, because you need to specify a co-ordinate arg for each dimension.
However, you can add these yourself; the following will take you up to 4D, although you could just add the bits for 2D if that's all you need:
object UpdatableVector {
  implicit def vectorToUpdatableVector2[T](v: Vector[Vector[T]]) = new UpdatableVector2(v)
  implicit def vectorToUpdatableVector3[T](v: Vector[Vector[Vector[T]]]) = new UpdatableVector3(v)
  implicit def vectorToUpdatableVector4[T](v: Vector[Vector[Vector[Vector[T]]]]) = new UpdatableVector4(v)

  class UpdatableVector2[T](v: Vector[Vector[T]]) {
    def updated2(c1: Int, c2: Int)(newVal: T) =
      v.updated(c1, v(c1).updated(c2, newVal))
  }

  class UpdatableVector3[T](v: Vector[Vector[Vector[T]]]) {
    def updated3(c1: Int, c2: Int, c3: Int)(newVal: T) =
      v.updated(c1, v(c1).updated2(c2, c3)(newVal))
  }

  class UpdatableVector4[T](v: Vector[Vector[Vector[Vector[T]]]]) {
    def updated4(c1: Int, c2: Int, c3: Int, c4: Int)(newVal: T) =
      v.updated(c1, v(c1).updated3(c2, c3, c4)(newVal))
  }
}
In Scala 2.10 you don't need the implicit defs and can just add the implicit keyword to the class definitions.
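For example, the 2D case as a 2.10-style implicit class might look like this (a sketch along the same lines, not taken from the original answer):
object UpdatableVector210 {
  // same idea as UpdatableVector2 above, written as a Scala 2.10+ implicit class
  implicit class UpdatableVector2[T](val v: Vector[Vector[T]]) extends AnyVal {
    def updated2(c1: Int, c2: Int)(newVal: T): Vector[Vector[T]] =
      v.updated(c1, v(c1).updated(c2, newVal))
  }
}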
Test:
import UpdatableVector._
val v2 = Vector.fill(2,2)(0)
val r2 = v2.updated2(1,1)(42)
println(r2) // Vector(Vector(0, 0), Vector(0, 42))
val v3 = Vector.fill(2,2,2)(0)
val r3 = v3.updated3(1,1,1)(42)
println(r3) // etc
Hope that's useful.

Is there a data structure / library to do in memory olap / pivot tables in Java / Scala?

Relevant questions
This question is quite relevant, but is 2 years old: In memory OLAP engine in Java
Background
I would like to create a pivot-table-like matrix from a given tabular dataset, in memory,
e.g. an age by marital status count (rows are age, columns are marital status).
The input: a List of People, each with an age and some Boolean property (e.g. married).
The desired output: a count of People, by age (row) and isMarried (column).
What I've tried (Scala)
case class Person(val age:Int, val isMarried:Boolean)
...
val people:List[Person] = ... //
val peopleByAge = people.groupBy(_.age) //only by age
val peopleByMaritalStatus = people.groupBy(_.isMarried) //only by marital status
I managed to do it the naive way: first group by age, then map each group to a count by marital status, and finally foldRight to aggregate the running totals:
TreeMap(peopleByAge.toSeq: _*).map(x => {
  val age = x._1
  val rows = x._2
  val numMarried = rows.count(_.isMarried)
  val numNotMarried = rows.length - numMarried
  (age, numMarried, numNotMarried)
}).foldRight(List[FinalResult]()) { case (row, list) =>
  val cumMarried = row._2 +
    (if (list.isEmpty) 0 else list.last.cumMarried)
  val cumNotMarried = row._3 +
    (if (list.isEmpty) 0 else list.last.cumNotMarried)
  list :+ new FinalResult(row._1, row._2, row._3, cumMarried, cumNotMarried)
}.reverse
I don't like the above code, it's not efficient, hard to read, and I'm sure there is a better way.
The question(s)
How do I groupBy "both"? And how do I do a count for each subgroup, e.g.
How many people are exactly 30 years old and married?
Another question is how to do a running total, to answer the question:
How many people above 30 are married?
Edit:
Thank you for all the great answers.
Just to clarify, I would like the output to include a "table" with the following columns:
Age (ascending)
Num Married
Num Not Married
Running Total Married
Running Total Not Married
Not just answering those specific queries, but producing a report that will allow answering all questions of this type.
Here is an option that is a little more verbose, but does this in a generic fashion instead of using strict data types. You could of course use generics to make this nicer, but I think you get the idea.
/** Creates a new pivot structure by finding correlated values
  * and performing an operation on these values
  *
  * @param accuOp  the accumulator function (e.g. sum, max, etc)
  * @param xCol    the "x" axis column
  * @param yCol    the "y" axis column
  * @param accuCol the column to collect and perform accuOp on
  * @return a new Pivot instance that has been transformed with the accuOp function
  */
def doPivot(accuOp: List[String] => String)(xCol: String, yCol: String, accuCol: String) = {
  // create list of indexes that correlate to x, y, accuCol
  val colsIdx = List(xCol, yCol, accuCol).map(headers.getOrElse(_, 1))

  // group by x and y, sending the resulting collection of
  // accumulated values to the accuOp function for post-processing
  val data = body.groupBy(row => {
    (row(colsIdx(0)), row(colsIdx(1)))
  }).map(g => {
    (g._1, accuOp(g._2.map(_(colsIdx(2)))))
  }).toMap

  // get distinct axis values
  val xAxis = data.map(g => {g._1._1}).toList.distinct
  val yAxis = data.map(g => {g._1._2}).toList.distinct

  // create result matrix
  val newRows = yAxis.map(y => {
    xAxis.map(x => {
      data.getOrElse((x, y), "")
    })
  })

  // collect it with axis labels for results
  Pivot(List((yCol + "/" + xCol) +: xAxis) :::
    newRows.zip(yAxis).map(x => {x._2 +: x._1}))
}
My Pivot type is pretty basic:
class Pivot(val rows: List[List[String]]) {
  val headers = rows.head.zipWithIndex.toMap
  val body = rows.tail
  ...
}
And to test it, you could do something like this:
val marriedP = Pivot(
  List(
    List("Name", "Age", "Married"),
    List("Bill", "42", "TRUE"),
    List("Heloise", "47", "TRUE"),
    List("Thelma", "34", "FALSE"),
    List("Bridget", "47", "TRUE"),
    List("Robert", "42", "FALSE"),
    List("Eddie", "42", "TRUE")
  )
)

def accum(values: List[String]) = {
  values.map(x => {1}).sum.toString
}

println(marriedP + "\n")
println(marriedP.doPivot(accum)("Age", "Married", "Married"))
Which yields:
Name     Age  Married
Bill     42   TRUE
Heloise  47   TRUE
Thelma   34   FALSE
Bridget  47   TRUE
Robert   42   FALSE
Eddie    42   TRUE

Married/Age  47  42  34
TRUE         2   2
FALSE            1   1
The nice thing is that you can use currying to pass in any function for the values like you would in a traditional excel pivot table.
More can be found here: https://github.com/vinsonizer/pivotfun
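For instance, here is a sketch of currying in a different function (namesAccum is a hypothetical accumulator of my own, reusing the marriedP pivot from above):
// hypothetical accumulator: list the correlated names instead of counting them
def namesAccum(values: List[String]): String = values.mkString(", ")

// same doPivot, different curried function and a different accumulated column
println(marriedP.doPivot(namesAccum)("Age", "Married", "Name"))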
You can
val groups = people.groupBy(p => (p.age, p.isMarried))
and then
val thirty_and_married = groups((30, true))
val over_thirty_and_married_count =
groups.filterKeys(k => k._1 > 30 && k._2).map(_._2.length).sum
I think it would be better to use the count method on Lists directly
For question 1
people.count { p => p.age == 30 && p.isMarried }
For question 2
people.count { p => p.age > 30 && p.isMarried }
If you also want the actual groups of people who conform to those predicates, use filter:
people.filter { p => p.age > 30 && p.isMarried }
You could probably optimise these by doing the traversal only once, but is that a requirement?
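If it ever is, a single pass with foldLeft could compute both counts at once (a plain-Scala sketch, not from the original answer):
// one traversal, two counters: (exactly 30 and married, over 30 and married)
val (exactly30Married, over30Married) =
  people.foldLeft((0, 0)) { case ((eq30, gt30), p) =>
    (eq30 + (if (p.age == 30 && p.isMarried) 1 else 0),
     gt30 + (if (p.age > 30 && p.isMarried) 1 else 0))
  }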
You can group using a tuple:
val res1 = people.groupBy(p => (p.age, p.isMarried)) //or
val res2 = people.groupBy(p => (p.age, p.isMarried)).mapValues(_.size) //if you dont care about People instances
You can answer both questions like this:
res2.getOrElse((30, true), 0)
res2.filter{case (k, _) => k._1 > 30 && k._2}.values.sum
res2.filterKeys(k => k._1 > 30 && k._2).values.sum // nicer with filterKeys from Rex Kerr's answer
You could answer both questions with the count method on List:
people.count(p => p.age == 30 && p.isMarried)
people.count(p => p.age > 30 && p.isMarried)
Or using filter and size:
people.filter(p => p.age == 30 && p.isMarried).size
people.filter(p => p.age > 30 && p.isMarried).size
Edit:
A slightly cleaner version of your code:
TreeMap(peopleByAge.toSeq: _*).map { case (age, ps) =>
  val (married, notMarried) = ps.partition(_.isMarried)
  (age, married.size, notMarried.size)
}.foldLeft(List[FinalResult]()) { case (acc, (age, married, notMarried)) =>
  def prevValue(f: (FinalResult) => Int) = acc.headOption.map(f).getOrElse(0)
  new FinalResult(age, married, notMarried, prevValue(_.cumMarried) + married, prevValue(_.cumNotMarried) + notMarried) :: acc
}.reverse

How can I accumulate my totals in a more functional manner?

At the moment I have
val orders = new HashMap[Int, Int]
orders.put(36, 110)
orders.put(35, 90)
orders.put(34, 80)
orders.put(33, 60)
I would like to keep a running total so that the end mapping appears as follows:
36 -> 110
35 -> 200
34 -> 280
33 -> 340
At the moment I do this imperatively as follows
val keys = orders.keys.toList.sortBy(x => -x)
val accum = new HashMap[Int, Int]
accum.put(keys.head, orders(keys.head))
for (i <- 1 to keys.length - 1) {
  accum.put(keys(i), orders(keys(i)) + accum(keys(i-1)))
}
accum.foreach {
  x => println(x._1, x._2)
}
Is there a more functional way of doing this using mapping, folding, etc.? I would be able to do it with a straight List, but I can't quite wrap my head around how to do this with a HashMap.
Edit: Ordering is important. The left column (36, 35, 34, 33) needs to be in descending order
Since HashMaps aren't sorted, it's not so simple to do this directly, so convert to an ordered sequence first:
val elems = orders.toSeq.sortBy(-_._1)
  .scanLeft((0, 0))((x, y) => (y._1, x._2 + y._2)).tail
// ArrayBuffer((36,110), (35,200), (34,280), (33,340))
If you actually want to stick these in an ordered map with reverse ordering, rather than just print them out, you could do this:
val accum = collection.SortedMap(elems: _*)(
new Ordering[Int] { def compare(x: Int, y: Int) = y compare x })
// SortedMap[Int,Int] = Map(36 -> 110, 35 -> 200, 34 -> 280, 33 -> 340)
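As a small aside (my addition, not part of the original answer), the hand-rolled Ordering can be replaced with the standard library's reversed one:
// same SortedMap, using the built-in reversed Int ordering
val accum = collection.SortedMap(elems: _*)(Ordering[Int].reverse)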
This should work:
val orders = new HashMap[Int, Int]
orders.put(36, 110)
orders.put(35, 90)
orders.put(34, 80)
orders.put(33, 60)
val accum = new HashMap[Int, Int]
orders.toList.sortBy(-_._1).foldLeft(0) {
  case (sum, (k, v)) => {
    accum.put(k, sum + v)
    sum + v
  }
}
For the record, here is a solution using the inits method:
import scala.collection.mutable._
// use a LinkedHashMap to keep the order
val orders = new LinkedHashMap[Int, Int]
orders.put(36, 110)
orders.put(35, 90)
orders.put(34, 80)
orders.put(33, 60)
// create a list of init sequences with no empty element
orders.toSeq.inits.toList.dropRight(1).
// > this returns
// ArrayBuffer((36,110), (35,90), (34,80), (33,60))
// ArrayBuffer((36,110), (35,90), (34,80))
// ArrayBuffer((36,110), (35,90))
// ArrayBuffer((36,110))
// now take the last key of each sequence and sum the values of the sequence
map(init => (init.last._1, init.map(_._2).sum)).reverse.toMap.mkString("\n")
36 -> 110
35 -> 200
34 -> 280
33 -> 340
I think you are doing it wrong. Don't create a Map directly: create a sequence. In this case, a ListBuffer is probably the most appropriate, so that you can easily append elements to it. It also supports constant time toList, though that shouldn't matter here.
If you must use a functional approach, you can either prepend to a List and reverse it, or go the way of iteratees. I'm not comfortable enough with the latter to explain them, though.
Once you have your collection, you'll scanLeft it. Or, if you built a List, you could scanRight it instead of having to reverse it. After that, it is a simple matter of calling toMap on the result.
Roughly speaking:
var accum: List[(Int, Int)] = Nil
accum ::= 36 -> 110
accum ::= 35 -> 90
accum ::= 34 -> 80
accum ::= 33 -> 60

val orders = accum.scanRight(0 -> 0) {
  case ((k, v), (_, acc)) => (k, v + acc)
}.init.toMap
The init drops the seed. I could have avoided having to do that using tail and head, but that would require a check to see if accum is empty.
The var can be removed using either iteratees, or, perhaps, using the state monad at a higher level.
var sum = 0
orders.toList.sortBy (-_._1).map (o =>
{sum += o._2; (o._1 -> sum) }).toMap
Not very elegant, since it uses a var.