Retrieving a list of tuples with the Scala anorm stream API

When I read the Play! docs I find a way to parse the result into a List[(String, String)], like this:
// Create an SQL query
val selectCountries = SQL("Select * from Country")
// Transform the resulting Stream[Row] as a List[(String,String)]
val countries = selectCountries().map(row =>
  row[String]("code") -> row[String]("name")
).toList
I want to do this, but my tuple will contain more data.
I'm doing like this:
val getObjects = SQL("SELECT a, b, c, d, e, f, g FROM table")
val objects = getObjects().map(row =>
  row[Long]("a") -> row[Long]("b") -> row[Long]("c") -> row[String]("d") ->
  row[Long]("e") -> row[Long]("f") -> row[String]("g")
).toList
Each tuple I get will be in this format; of course, that's what I'm asking for in the code above:
((((((Long, Long), Long), String), Long), Long), String)
But I want this:
(Long, Long, Long, String, Long, Long, String)
What I'm asking is how I should parse the result to get a tuple like the last one above. I want to do what the documentation does for List[(String, String)], but with more data.
Thanks

You are getting ((((((Long, Long), Long), String), Long), Long), String) because of ->: each call to it wraps two elements into a pair. So with each -> you get a tuple, then you take that tuple and make a new pair from it and the next element, and so on. You need to replace the arrows with commas and wrap everything in parentheses:
val objects = getObjects().map(row =>
  (row[Long]("a"), row[Long]("b"), row[Long]("c"), row[String]("d"),
   row[Long]("e"), row[Long]("f"), row[String]("g"))
).toList
But remember that tuples can currently have no more than 22 elements.
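To see the difference in isolation, here is a minimal plain-Scala sketch (no anorm involved) contrasting -> chaining with a tuple literal:
val nested = 1L -> 2L -> "x" // ((Long, Long), String), i.e. ((1,2),x)
val flat   = (1L, 2L, "x")   // (Long, Long, String), i.e. (1,2,x)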

Related

Find count in WindowedStream - Flink

I am pretty new to the world of streams and I am facing some issues on my first try.
More specifically, I am trying to implement count and groupBy functionality in a sliding window using Flink.
I've done it on a normal DataStream but I am not able to make it work on a WindowedStream.
Do you have any suggestions on how I can do it?
val parsedStream: DataStream[(String, Response)] = stream
  .mapWith(_.decodeOption[Response])
  .filter(_.isDefined)
  .map { record =>
    (
      s"${record.get.group.group_country}, ${record.get.group.group_state}, ${record.get.group.group_city}",
      record.get
    )
  }
val result: DataStream[((String, Response), Int)] = parsedStream
  .map((_, 1))
  .keyBy(_._1._1)
  .sum(1)
// The output of result is
// ((us, GA, Atlanta,Response()), 14)
// ((us, SA, Atlanta,Response()), 4)
result
  .keyBy(_._1._1)
  .timeWindow(Time.seconds(5))
  // the following part doesn't compile
  .apply(
    new WindowFunction[(String, Int), (String, Int), String, TimeWindow] {
      def apply(
        key: Tuple,
        window: TimeWindow,
        values: Iterable[(String, Response)],
        out: Collector[(String, Int)]
      ) {}
    }
  )
Compilation Error:
overloaded method value apply with alternatives:
[R](function: (String, org.apache.flink.streaming.api.windowing.windows.TimeWindow, Iterable[((String, com.flink.Response), Int)], org.apache.flink.util.Collector[R]) => Unit)(implicit evidence$28: org.apache.flink.api.common.typeinfo.TypeInformation[R])org.apache.flink.streaming.api.scala.DataStream[R] <and>
[R](function: org.apache.flink.streaming.api.scala.function.WindowFunction[((String, com.flink.Response), Int),R,String,org.apache.flink.streaming.api.windowing.windows.TimeWindow])(implicit evidence$27: org.apache.flink.api.common.typeinfo.TypeInformation[R])org.apache.flink.streaming.api.scala.DataStream[R]
cannot be applied to (org.apache.flink.streaming.api.functions.windowing.WindowFunction[((String, com.flink.Response), Int),(String, com.flink.Response),String,org.apache.flink.streaming.api.windowing.windows.TimeWindow]{def apply(key: String,window: org.apache.flink.streaming.api.windowing.windows.TimeWindow,input: Iterable[((String, com.flink.Response), Int)],out: org.apache.flink.util.Collector[(String, com.flink.Response)]): Unit})
.apply(
This is a simpler example that we can work on:
val source: DataStream[(JsonField, Int)] = env.fromElements(("hello", 1), ("hello", 2))
val window2 = source
  .keyBy(0)
  .timeWindow(Time.minutes(1))
  .apply(new WindowFunction[(JsonField, Int), Int, String, TimeWindow] {})
I have tried your code and found the errors; it seems that you have an error when declaring the types for your WindowFunction.
The documentation says that the expected type parameters for WindowFunction are WindowFunction[IN, OUT, KEY, W <: Window]. Now, if you take a look at your code, your IN is the type of the data stream that you are computing windows on. The type of that stream is ((String, Response), Int), not (String, Int) as declared in the code.
Changing the part that does not compile to the following should fix it:
.apply(new WindowFunction[((String, Response), Int), (String, Response), String, TimeWindow] {
  override def apply(key: String, window: TimeWindow, input: Iterable[((String, Response), Int)], out: Collector[(String, Response)]): Unit = ???
})
EDIT: As for the second example, the error occurs for essentially the same reason. When you use keyBy with a tuple, there are two applicable overloads: keyBy(fields: Int*), which accesses tuple fields by the provided index (this is what you used), and keyBy(fun: T => K), where you provide a function to extract the key.
There is one important difference between them: one returns the key as a Java Tuple, while the other returns the key with its exact type.
So basically, if you change String to Tuple in your simplified example, it should compile cleanly.
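Alternatively, you can keep the exact key type by keying with a lambda instead of an index. A minimal sketch of the simplified example under that approach, assuming JsonField is an alias for String (the question does not show its definition):
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

type JsonField = String // assumption: not defined in the question

val window2 = source
  .keyBy(_._1) // key extractor: KEY is String, not a Java Tuple
  .timeWindow(Time.minutes(1))
  .apply(new WindowFunction[(JsonField, Int), Int, String, TimeWindow] {
    override def apply(key: String, window: TimeWindow,
                       input: Iterable[(JsonField, Int)], out: Collector[Int]): Unit =
      out.collect(input.map(_._2).sum) // e.g. sum the counts in the window
  })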

How to extract values from Some() in Scala

I have Some() values of type Map[String, String], such as:
Array[Option[Any]] = Array(Some(Map(String, String)))
I want to return it as
Array(Map(String, String))
I've tried a few different ways of extracting it.
Let's say
val x = Array(Some(Map(String, String)))
val x1 = for (i <- 0 until x.length) yield { x.apply(i) }
but this returns IndexedSeq(Some(Map)), which is not what I want.
I tried pattern matching,
x.foreach { i =>
  i match {
    case Some(value) => value
    case _ => println("nothing")
  }
}
Another thing I tried that was somewhat successful was
x.apply(0).get.asInstanceOf[Map[String, String]]
will do something what I want, but it only gets 0th index of the entire array and I'd want all the maps in the array.
How can I extract Map type out of Some?
If you want an Array[Any] from your Array[Option[Any]], you can use this for expression:
for {
  opt <- x
  value <- opt
} yield value
This will put the values of all the non-empty Options inside a new array.
It is equivalent to this:
x.flatMap(_.toArray[Any])
Here, all options will be converted to an array of either 0 or 1 element. All these arrays will then be flattened back to one single array containing all the values.
Generally, the pattern is to use transformations on the Option[T], like map, flatMap, filter, etc.
The problem is that we need a type cast to retrieve the underlying Map[String, String] from Any. So we'll use flatten to remove the None values and unwrap the Options, and asInstanceOf to retrieve the type:
scala> val y = Array(Some(Map("1" -> "1")), Some(Map("2" -> "2")), None)
y: Array[Option[scala.collection.immutable.Map[String,String]]] = Array(Some(Map(1 -> 1)), Some(Map(2 -> 2)), None)
scala> y.flatten.map(_.asInstanceOf[Map[String, String]])
res7: Array[Map[String,String]] = Array(Map(1 -> 1), Map(2 -> 2))
Also, when you're dealing with just a single value, you can try Some("test").head, and for a null simply Some(null).flatten.
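If you want the unwrapping and the cast in a single pass, here is a sketch using collect (the map contents are made up for illustration):
val x: Array[Option[Any]] = Array(Some(Map("1" -> "1")), None)
val maps: Array[Map[String, String]] = x.collect {
  case Some(m: Map[String, String] @unchecked) => m
}
// maps: Array(Map(1 -> 1))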

Flatten an RDD - Nested lists in value of key value pair

It took me a while to figure this out and I wanted to share my solution. Improvements are definitely welcome.
References: Flattening a Scala Map in an RDD, Spark Flatten Seq by reversing groupby, (i.e. repeat header for each sequence in it)
I have an RDD of the form: RDD[(Int, List[(String, List[(String, Int, Float)])])]
Key: Int
Value: List[(String, List[(String, Int, Float)])]
With a goal of flattening to: RDD[(Int, String, String, Int, Float)]
binHostCountByDate.foreach(println)
Gives the example:
(516361, List((2013-07-15, List((s2.rf.ru,1,0.5), (s1.rf.ru,1,0.5))), (2013-08-15, List((p.secure.com,1,1.0)))))
The final RDD should give the following
(516361,2013-07-15,s2.rf.ru,1,0.5)
(516361,2013-07-15,s1.rf.ru,1,0.5)
(516361,2013-08-15,p.secure.com,1,1.0)
It's a simple one-liner (and with destructuring in the for-comprehension we can use better names than _1, _2._1, etc., which makes it easier to be sure we're getting the right result):
// Use an outer list in place of an RDD for test purposes
val t = List((516361,
  List(("2013-07-15", List(("s2.rf.ru", 1, 0.5), ("s1.rf.ru", 1, 0.5))),
       ("2013-08-15", List(("p.secure.com", 1, 1.0))))))
t flatMap { case (k, xs) => for ((d, ys) <- xs; (dom, a, b) <- ys) yield (k, d, dom, a, b) }
//> res0: List[(Int, String, String, Int, Double)] =
//    List((516361,2013-07-15,s2.rf.ru,1,0.5),
//         (516361,2013-07-15,s1.rf.ru,1,0.5),
//         (516361,2013-08-15,p.secure.com,1,1.0))
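The same function works unchanged on the real RDD, since RDD also provides flatMap; a sketch assuming a SparkContext named sc:
val rdd = sc.parallelize(t) // RDD[(Int, List[(String, List[(String, Int, Double)])])]
val flat = rdd.flatMap { case (k, xs) =>
  for ((d, ys) <- xs; (dom, a, b) <- ys) yield (k, d, dom, a, b)
} // RDD[(Int, String, String, Int, Double)]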
My approach is as follows:
I flatten the first key value pair. This "removes" the first list.
val binHostCountForDate = binHostCountByDate.flatMapValues(identity)
Gives me an RDD of the form: RDD[(Int, (String, List[(String, Int, Float)]))]
binHostCountForDate.foreach(println)
(516361,(2013-07-15,List((s2.rf.ru,1,0.5), (s1.rf.ru,1,0.5))))
(516361,(2013-08-15,List((p.secure.com,1,1.0))))
Now I map the first two items into a tuple creating a new key and the second tuple as the value. Then apply the same procedure as above to flatten on the new key value pair.
val binDataRemapKey = binHostCountForDate.map(f =>((f._1, f._2._1), f._2._2)).flatMapValues(identity)
This gives the flattened RDD: RDD[((Int, String), (String, Int, Float))]
If this form is fine, then we are done, but we can go one step further and remove the tuples to get the final form we were originally looking for.
val binData = binDataRemapKey.map(f => (f._1._1, f._1._2, f._2._1, f._2._2, f._2._3))
This gives us the final form of: RDD[(Int, String, String, Int, Float)]
We now have a flattened RDD that has preserved the parents of each list.

Access a tuple inside a tuple for an anonymous map job in Spark

This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used below as the answer.
I have an RDD that contains a set of tuples of type (String, String), and since they aren't unique I want to see how many times each (String, String) combination occurs, so I use countByValue like so:
val PairCount = Pairs.countByValue().toSeq
This gives me tuples of the form ((String, String), Long), where the Long is the number of times that the (String, String) tuple appeared.
These Strings can be repeated in different combinations and I essentially want to run word count on this PairCount variable so I tried something like this to start:
PairCount.map(x => (x._1._1, x._2))
But the output this spits out is String1->1, String2->1, String3->1, etc.
How do I output a key value pair from a map job in this case where the key is going to be one of the String values from the inner tuple, and the value is going to be the Long value from the outer tuple?
Update:
@vitalii gets me almost there. The answer gets me to a Seq[(String, Long)], but what I really need is to turn that into a map so that I can run reduceByKey on it afterwards. When I run
PairCount.flatMap { case ((x, y), n) => Seq(x -> n) }.toMap
for each unique x I get x->1.
For example, the above line of code generates mom->1, dad->1 even if the tuples out of the flatMap included (mom,30), (dad,59), (mom,2), (dad,14), in which case I would expect toMap to provide mom->30, dad->59, mom->2, dad->14. However, I'm new to Scala, so I might be misinterpreting the functionality.
How can I get the Tuple2 sequence converted to a map so that I can reduce on the map keys?
If I understand the question correctly, you need flatMap:
val pairCountRDD = pairs.countByValue() // RDD[((String, String), Int)]
val res: RDD[(String, Int)] = pairCountRDD.flatMap { case ((s1, s2), n) =>
  Seq(s1 -> n, s2 -> n)
}
Update: I didn't quite understand what your final goal is, but here are a few more examples that may help. By the way, the code above is incorrect: I missed the fact that countByValue returns a Map, not an RDD:
val pairs = sc.parallelize(
  List(
    "mom" -> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo"
  )
)
// don't use countByValue; if pairs is large you will run out of memory
val pairCountRDD = pairs.map(x => (x, 1)).reduceByKey(_ + _)
val wordCount = pairs.flatMap { case (a, b) => Seq(a -> 1, b -> 1) }.reduceByKey(_ + _)
wordCount.take(10)
// count in how many pairs each word occur, keys and values:
val wordPairCount = pairs.flatMap { case (a, b) =>
  if (a == b) {
    Seq(a -> 1)
  } else {
    Seq(a -> 1, b -> 1)
  }
}.reduceByKey(_ + _)
wordPairCount.take(10)
To get the histograms for the (String, String) RDD I used this code:
val Hist_X = histogram.map(x => (x._1 -> 1.0)).reduceByKey(_ + _).collect().toMap
val Hist_Y = histogram.map(x => (x._2 -> 1.0)).reduceByKey(_ + _).collect().toMap
val Hist_XY = histogram.map(x => (x -> 1.0)).reduceByKey(_ + _)
where histogram was the (String, String) RDD.
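Since the marginals are just sums over the joint histogram, Hist_X and Hist_Y can also be derived from Hist_XY without rescanning the RDD; a sketch:
val joint: Map[(String, String), Double] = Hist_XY.collect().toMap
val histX = joint.groupBy { case ((x, _), _) => x }.map { case (x, m) => x -> m.values.sum }
val histY = joint.groupBy { case ((_, y), _) => y }.map { case (y, m) => y -> m.values.sum }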

Reading from CSV file in scala

I have a file that contains various data about a certain population.
Example of the file format:
1880,Mary,F,7065
1880,Anna,F,2604
1880,Emma,F,2003
1880,Elizabeth,F,1939
We can interpret this data as "In the year 1880, 7065 female babies were born named Mary".
I have a function that reads from the file
fromFile(name:String):List[List[String]]
fromFile returns a list of lists:
List( List("1880","Mary", "F","7065"))
I am having trouble figuring out how to read the data and parse it for a function like this, which takes the nested list and a number, and returns a list of the entries for that year.
For example, if 'n' is 1880, the returned list would contain all the entries for 1880, including Mary's.
object readFile {
  val years = CSV.fromFile("my_file.csv")
  def yearIs(data: List[List[String]], n: Int): List[List[String]] =
    ???
}
I'm trying to figure out how to access each element in the returned list, compare it to the given Int, and return all the matching data.
I would always recommend first converting the input data into an appropriate structure, doing all conversions and possibly error reporting there, and then doing what you want to do on that.
So an appropriate structure for one record would be:
case class Record(year: Int, name: String, female: Boolean, count: Int)
Let's convert your data:
val data = CSV.fromFile("my_file.csv").map {
  case List(year, name, female, count) =>
    Record(year.toInt, name, female == "F", count.toInt)
}
You should catch a MatchError and a NumberFormatException here, or try to detect these errors, if you care about error handling.
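A sketch of that error handling with scala.util.Try, silently dropping malformed rows (adapt it if you would rather report them):
import scala.util.Try

val safeData: List[Record] = CSV.fromFile("my_file.csv").flatMap { row =>
  Try {
    val List(year, name, female, count) = row // MatchError on wrong arity
    Record(year.toInt, name, female == "F", count.toInt) // NumberFormatException on bad numbers
  }.toOption // None for malformed rows, which flatMap then drops
}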
Now we can define your method yearIs in a type-safe and concise manner:
def yearIs(data: List[Record], year: Int) = data.filter(_.year == year)
You could also directly create a Map from years to lists of records:
val byYear: Map[Int, List[Record]] = data.groupBy(_.year)
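Looking up one year then becomes a single map access; a brief usage sketch:
val for1880: List[Record] = byYear.getOrElse(1880, Nil)
val girls1880 = for1880.filter(_.female) // e.g. all female entries for 1880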
I think the best way to get "a list of the years from n onward" is to compare n with the year, i.e. the first element of each inner List, using filter.
scala> def yearIs(data: List[List[String]], n: Int): List[List[String]] = {
| data.filter(xs => xs.head.toInt > n)
| }
yearIs: (data: List[List[String]], n: Int)List[List[String]]
scala> yearIs(data, 1880)
res6: List[List[String]] = List()
scala> yearIs(data, 1879)
res7: List[List[String]] = List(List(1880, Mary, F, 7065), List(1880, Anna, F, 2604), List(1880, Emma, F, 2003))