Partitioning of a Stream - scala

I'm not sure if this is possible, but I want to partition a stream based on a condition that depends on the output of the stream. I think it will make sense with an example.
I will create a bunch of orders and stream them, since the actual use case is a stream of incoming orders, so neither the next order nor the full list of orders is known up front:
scala> case class Order(item : String, qty : Int, price : Double)
defined class Order
scala> val orders = List(Order("bike", 1, 23.34), Order("book", 3, 2.34), Order("lamp", 1, 9.44), Order("bike", 1, 23.34))
orders: List[Order] = List(Order(bike,1,23.34), Order(book,3,2.34), Order(lamp,1,9.44), Order(bike,1,23.34))
Now I want to partition/group these orders into one set that contains the duplicate orders and another set that contains the unique orders. So in the above example, when I force the stream it should create two streams: one with the two bike orders (since they are the same) and another containing all the other orders.
I tried the following:
created the partitioning function:
scala> def matchOrders(o : Order, s : Stream[Order]) = s.contains(o)
matchOrders: (o: Order, s: Stream[Order])Boolean
then tried to apply this to stream:
scala> val s : (Stream[Order], Stream[Order]) = orders.toStream.partition(matchOrders(_, s._1))
I got a NullPointerException, I guess because s._1 is empty initially? I'm not sure. I've tried other ways but I'm not getting very far. Is there a way to achieve this partitioning?

That would not work anyway, because the first duplicate Order would have already gone to the unique Stream by the time you processed its duplicate.
The best way is to create a Map[Order, Boolean] that tells you whether an Order appears more than once in the original orders list.
val matchOrders = orders.groupBy(identity).mapValues(_.size > 1)
val s : (Stream[Order], Stream[Order]) = orders.toStream.partition(matchOrders(_))
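For the sample orders from the question, forcing both halves gives something like this (a quick sketch; the commented results show what the sample data implies rather than captured REPL output):
val (dupes, uniques) = s
dupes.toList   // List(Order(bike,1,23.34), Order(bike,1,23.34))
uniques.toList // List(Order(book,3,2.34), Order(lamp,1,9.44))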

Note that you can only know that an order has no duplicates after your stream finishes. So since the standard Stream constructors require you to know whether the stream is empty, it seems they aren't lazy enough: you have to force your original stream to even begin building the no-duplicates stream. And of course if you do this, Helder Pereira's answer applies.

Related

How to filter RDD relying on hash map?

I'm new to using spark and scala but I have to solve the following problem:
I have one ORC file containing rows which I have to check against a certain condition coming from a hash map.
I build the hash map (filename → timestamp) with 120,000 entries this way (getTimestamp returns an Option[Long]):
val tgzFilesRDD = sc.textFile("...")
val fileNameTimestampRDD = tgzFilesRDD.map(itr => {
  (itr, getTimestamp(itr))
})
val fileNameTimestamp = fileNameTimestampRDD.collect.toMap
And retrieve the RDD with 6 million entries like this:
val sessionDataDF = sqlContext.read.orc("...")
case class SessionEvent(archiveName: String, eventTimestamp: Long)
val sessionEventsRDD = sessionDataDF.as[SessionEvent].rdd
And do the check:
val sessionEventsToReport = sessionEventsRDD.filter(se => {
  val timestampFromFile = fileNameTimestamp.getOrElse(se.archiveName, None)
  se.eventTimestamp < timestampFromFile.getOrElse[Long](Long.MaxValue)
})
Is this the right and performant way to do it? Is caching recommended?
Will the Map fileNameTimestamp get shuffled to the cluster nodes where the partitions are processed?
fileNameTimestamp will get serialized for each task, and with 120,000 entries, it may be quite expensive. You should broadcast large objects and reference the broadcast variables:
val fileNameTimestampBC = sc.broadcast(fileNameTimestampRDD.collect.toMap)
Now only one copy of this object will be shipped to each worker. There is also no need to drop down to the RDD API, as the Dataset API has a filter method:
val sessionEvents = sessionDataDF.as[SessionEvent]
val sessionEventsToReport = sessionEvents.filter(se => {
  val timestampFromFile = fileNameTimestampBC.value.getOrElse(se.archiveName, None)
  se.eventTimestamp < timestampFromFile.getOrElse[Long](Long.MaxValue)
})
The fileNameTimestamp Map you collected exists on the Spark driver. In order to be referenced efficiently like this in a query, the worker nodes need to have access to it. This is done by broadcasting.
In essence, you have rediscovered the broadcast hash join: you are left-joining sessionEventsRDD with tgzFilesRDD to gain access to the optional timestamp, and then you filter accordingly.
When using RDDs, you need to code the joining strategy explicitly. The DataFrame/Dataset API has a query optimizer that can make that choice for you, and you can also explicitly ask the API to use the broadcast-join technique above behind the scenes. You can find examples for both approaches here.
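For reference, a minimal sketch of the explicit broadcast-join variant. It assumes the file timestamps are also loaded as a DataFrame called fileTimestampsDF with columns archiveName and fileTimestamp (that name and those columns are made up for illustration), and that the ORC columns match the SessionEvent field names, which the .as[SessionEvent] in the question implies:
import org.apache.spark.sql.functions.{broadcast, col}

// left-join the events against the small, broadcast lookup table, then filter;
// events with no matching file (null fileTimestamp) are kept, mirroring the
// getOrElse(Long.MaxValue) fallback in the original code
val sessionEventsToReport = sessionDataDF
  .join(broadcast(fileTimestampsDF), Seq("archiveName"), "left_outer")
  .filter(col("fileTimestamp").isNull || col("eventTimestamp") < col("fileTimestamp"))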
Let me know if this is clear enough :)

Issue accessing to DataSet in scala

I'm trying to calculate the mean time between a question and its response on Twitter, using Apache Spark, the Twitter API, and Cassandra.
But when I try to access the DataSet returned by the CassandraConnector I only get a NullPointerException.
def getTweets(inReplyToStatusId: Long, timestamp: DateTime): String = {
  if (inReplyToStatusId > 0) {
    CassandraConnector(TwitterStreamingApp.conf).withSessionDo { session =>
      val reply_id = session.execute("SELECT created_at FROM twitter_streaming.tweets WHERE tweet_id = " + inReplyToStatusId + " ALLOW FILTERING")
      val reply_time_it = reply_id.all().get(0).getString("created_at")
      print(reply_time_it)
    }
  }
}
Any idea how to do this in Scala? It seems pretty easy, but I am struggling a lot with this!
Thank you
The most suspicious line is
val reply_time_it = reply_id.all().get(0).getString("created_at")
Since any non-key field could be absent in the Cassandra record, it's very likely to be null sometimes. You can wrap it in an Option, like
val reply_time_it = Option( reply_id.all().get(0).getString("created_at"))
Then you can use methods like getOrElse to get the value with a default, foreach to execute a side-effecting method if the value is present, and map to derive new values from it.
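For example (a small sketch continuing from the wrapped value above; the "unknown" default is just an illustration):
reply_time_it.foreach(println)                     // side effect only if a value is present
val createdAt = reply_time_it.getOrElse("unknown") // fall back to a default when absent
val length    = reply_time_it.map(_.length)        // derive an Option[Int] from it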
Also you should probably:
create one session for each partition using mapPartitions or mapPartitionsWithIndex
create a PreparedStatement per query/partition for performance reasons (a rough sketch follows)
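A rough sketch of the prepared-statement point, assuming the same DataStax driver session that withSessionDo provides; the bind parameter also replaces the string concatenation from the question:
val stmt = session.prepare(
  "SELECT created_at FROM twitter_streaming.tweets WHERE tweet_id = ? ALLOW FILTERING")
val row = Option(session.execute(stmt.bind(Long.box(inReplyToStatusId))).one())
val createdAt = row.flatMap(r => Option(r.getString("created_at")))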

In Scala [2.11.6] how would one create a lazy stream of objects from an ordered set of Longs

In a nutshell, what I wish to do is take a set of Longs, arbitrarily ordered as in (7, 3, 9, 14, 123, 2), and have available a series of objects:
Set(SomeObject(7),SomeObject(3),SomeObject(9),SomeObject(14),SomeObject(123),SomeObject(2))
However, I do not want the SomeObject objects initialized until I actually ask for them. I also wish to be able to ask for them in arbitrary order: as in, give me the 3rd SomeObject (by index), or give me the SomeObject that maps to the Long value 7. All of that without triggering initializations down the stack.
I understand lazy streams; however, I'm not quite sure how to connect the dots between the initial Set of Longs (map would do that instantly, of course, as in map { x => SomeObject(x) }) and ending up with a lazy stream (in the same initial arbitrary order, please!).
One of the additional rules is that this needs to be Set based, so I never have the same Long (and its matching SomeObject) appear twice.
An additional need is to handle multiple Sets of Longs initially being mashed together, while maintaining the (FIFO) order and uniqueness, but I believe that is all built into a subclass of Set to begin with.
Set doesn't provide indexed access, so you can't get the "3rd SomeObject". Also, Set can't provide you any operations without evaluating the values it contains, because those values need to be ordered (in case of a Tree-based implementation) or hashed (in case of a HashSet), and you can't sort or hash a value you do not know.
If creating SomeObject is resource-consuming, maybe it is better to create a SomeObjectHolder class that creates SomeObject on demand and provides hashing operations that do not require creating a SomeObject.
Then you will have
Set(SomeObjectHolder(7),SomeObjectHolder(3),SomeObjectHolder(9),...
And each SomeObjectHolder will create the corresponding SomeObject for you when you need it.
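A minimal sketch of that idea, assuming SomeObject(id: Long) is the expensive class from the question: equality and hashing use only the Long, and the SomeObject itself is built lazily on first access.
case class SomeObjectHolder(id: Long) {
  lazy val obj: SomeObject = SomeObject(id)  // created only when first accessed
}

val holders: Set[SomeObjectHolder] =
  Set(7L, 3L, 9L, 14L, 123L, 2L).map(SomeObjectHolder(_))
holders.find(_.id == 7L).map(_.obj)          // only SomeObject(7) gets created here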
Some of your requirements can be satisfied by a lazy view of an indexed sequence:
case class SomeObject(v: Long) {
  println(s"$v created")
}
val source = Vector(0L, 1L, 2L, 3L, 4L)
val col = source.view.map(SomeObject.apply)
In this case, when you access individual elements by index, e.g. col(2), only the requested elements are evaluated. However, when you request a slice, all elements from 0 up to the slice's endpoint are evaluated.
col.slice(1, 2).toList
Prints:
0 created
1 created
This approach has several drawbacks:
when you request an element several times, it gets evaluated each time
when you request a slice, all elements from the beginning are evaluated
you can't request a mapping for an arbitrary key (only for an index)
To satisfy all your requirements, a custom class should be created:
import scala.collection.mutable

class CachedIndexedSeq[K, V](source: IndexedSeq[K], func: K => V) extends IndexedSeq[V] {
  private val cache = mutable.Map[K, V]()
  def getMapping(key: K): V = cache.getOrElseUpdate(key, func(key))
  override def length: Int = source.length
  override def apply(idx: Int): V = getMapping(source(idx))
}
This class takes a source indexed sequence as an argument, along with a mapping function. It lazily evaluates elements and also provides a getMapping method to lazily map an arbitrary key.
val source = Vector(0L, 1L, 2L)
val col2 = new CachedIndexedSeq[Long, SomeObject](source, SomeObject.apply)
col2.slice(1, 3).toList
col2(1)
col2(1)
col2.getMapping(1L)
Prints:
1 created
2 created
The only remaining requirement is the ability to avoid duplicates. Set doesn't combine well with requesting elements by index, so I suggest putting all your initial Longs into an indexed seq (such as Vector) and calling distinct on it before wrapping it in a CachedIndexedSeq.
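Putting that last suggestion into code (a sketch that reuses the CachedIndexedSeq and SomeObject defined above):
val raw     = Vector(7L, 3L, 9L, 14L, 123L, 2L, 7L)  // duplicates allowed in the input
val keys    = raw.distinct                           // keeps first-occurrence order
val objects = new CachedIndexedSeq[Long, SomeObject](keys, SomeObject.apply)

objects(2)              // evaluates only SomeObject(9)
objects.getMapping(7L)  // evaluates SomeObject(7) once, cached for later calls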

Scala counting map objects with specific attribute

I have referrals: Map[String, Referral] and am looking for the best way to count how many of those Referral objects have a certain phase attribute.
case class Referral(
name: String,
phase: String
)
I need a count of how many have phase equal to "phase1".
I have been able to simply loop over the Map to collect the Referrals with "phase1" into an Iterable, but I have a hunch that's an unnecessary extra step, and I can't wrap my head around how to do this more fluidly.
val phase1_refs = for (ref <- referrals.values if ref.phase == "phase1") yield ref.name
val phase1_count = phase1_refs.size
What is the syntax to get the size of the phase1_refs using the for? I've been playing with filters on the values but keep confusing myself.
Thanks!
Use
referrals.values.count(_.phase == "phase1")
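A quick check with the Referral case class from the question (the sample data is made up):
val referrals = Map(
  "r1" -> Referral("Alice", "phase1"),
  "r2" -> Referral("Bob", "phase2"),
  "r3" -> Referral("Carol", "phase1")
)
referrals.values.count(_.phase == "phase1")  // 2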

scala: map-like structure that doesn't require casting when fetching a value?

I'm writing a data structure that converts the results of a database query. The raw structure is a Java ResultSet, and it would be converted to a map or class that permits accessing different fields either by a named method call or by passing a string into apply(). Clearly, different values may have different types. To reduce the burden on clients of this data structure, my preference is that one should not need to cast the values, yet the value fetched should still have the correct type.
For example, suppose I'm doing a query that fetches two column values, one an Int, the other a String, where the names of the columns are "a" and "b" respectively. Some ideal syntax might be the following:
val javaResultSet = dbQuery("select a, b from table limit 1")
// with ResultSet, particular values can be accessed like this:
val a = javaResultSet.getInt("a")
val b = javaResultSet.getString("b")
// but this syntax is undesirable.
// since I want to convert this to a single data structure,
// the preferred syntax might look something like this:
val newStructure = toDataStructure[Int, String](javaResultSet)("a", "b")
// that is, I'm willing to state the types during the instantiation
// of such a data structure.
// then,
val a: Int = newStructure("a") // OR
val a: Int = newStructure.a
// in both cases, "val a" does not require asInstanceOf[Int].
I've been trying to determine what sort of data structure might allow this and I could not figure out a way around the casting.
The other requirement is obviously that I would like to define a single data structure used for all db queries. I realize I could easily define a case class or similar per call and that solves the typing issue, but such a solution does not scale well when many db queries are being written. I suspect some people are going to propose using some sort of ORM, but let us assume for my case that it is preferred to maintain the query in the form of a string.
Anyone have any suggestions? Thanks!
To do this without casting, one needs more information about the query, and one needs that information at compile time.
I suspect some people are going to propose using some sort of ORM, but let us assume for my case that it is preferred to maintain the query in the form of a string.
Your suspicion is right and you will not get around this. If current ORMs or DSLs like squeryl don't suit your fancy, you can create your own. But I doubt you will be able to use query strings.
The basic problem is that you don't know how many columns there will be in any given query, and so you don't know how many type parameters the data structure should have and it's not possible to abstract over the number of type parameters.
There is, however, a data structure that exists in different variants for different numbers of type parameters: the tuple (e.g. Tuple2, Tuple3, etc.). You could define parameterized mapping functions for different numbers of parameters that return tuples, like this:
def toDataStructure2[T1, T2](rs: ResultSet)(c1: String, c2: String) =
  (rs.getObject(c1).asInstanceOf[T1],
   rs.getObject(c2).asInstanceOf[T2])
def toDataStructure3[T1, T2, T3](rs: ResultSet)(c1: String, c2: String, c3: String) =
  (rs.getObject(c1).asInstanceOf[T1],
   rs.getObject(c2).asInstanceOf[T2],
   rs.getObject(c3).asInstanceOf[T3])
You would have to define these for as many columns as you expect to have in your tables (max 22).
This of course depends on getObject returning something that can safely be cast to the given type.
In your example you could use the resulting tuple as follows:
val (a, b) = toDataStructure2[Int, String](javaResultSet)("a", "b")
If you decide to go the route of heterogeneous collections, there are some very interesting posts on heterogeneously typed lists:
One, for instance, is
http://jnordenberg.blogspot.com/2008/08/hlist-in-scala.html
http://jnordenberg.blogspot.com/2008/09/hlist-in-scala-revisited-or-scala.html
with an implementation at
http://www.assembla.com/wiki/show/metascala
A second great series of posts starts with
http://apocalisp.wordpress.com/2010/07/06/type-level-programming-in-scala-part-6a-heterogeneous-list%C2%A0basics/
The series continues with parts "b", "c", and "d", linked from part "a".
Finally, there is a talk by Daniel Spiewak which touches on HOMaps:
http://vimeo.com/13518456
So all this is to say that perhaps you can build your solution from these ideas. Sorry that I don't have a specific example, but I admit I haven't tried these out yet myself!
Joshua Bloch has introduced a heterogeneous collection, which can be written in Java. I once adapted it a little; it now works as a value register. It is basically a wrapper around two maps. Here is the code and this is how you can use it. But this is just FYI, since you are interested in a Scala solution.
In Scala I would start by playing with tuples. Tuples are kind of heterogeneous collections. The elements can be, but do not have to be, accessed through fields like _1, _2, _3 and so on. But you don't want that, you want names. This is how you can assign names to them:
scala> val tuple = (1, "word")
tuple: (Int, String) = (1,word)
scala> val (a, b) = tuple
a: Int = 1
b: String = word
So as mentioned before I would try to build a ResultSetWrapper around tuples.
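A very rough sketch of what such a wrapper could look like for two columns; the ResultSetWrapper name comes from the answer, but the implementation is a guess, and it only hides the casts rather than removing them:
import java.sql.ResultSet

class ResultSetWrapper2[T1, T2](rs: ResultSet, c1: String, c2: String) {
  // one cast per column, confined to this class
  def row: (T1, T2) =
    (rs.getObject(c1).asInstanceOf[T1], rs.getObject(c2).asInstanceOf[T2])
}

// val (a, b) = new ResultSetWrapper2[Int, String](javaResultSet, "a", "b").row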
If you want to "extract the column value by name" on a plain bean instance, you can probably:
use reflection and casts, which you (and I) don't like.
use a ResultSetToJavaBeanMapper provided by most ORM libraries, which is a little heavy and coupled.
write a Scala compiler plugin, which is too complex to control.
So, I guess a lightweight ORM with the following features may satisfy you:
support raw SQL
support a lightweight, declarative and adaptive ResultSetToJavaBeanMapper
nothing else.
I made an experimental project based on that idea. Note that it's still an ORM, but I think it may be useful to you, or at least give you some hints.
Usage:
declare the model:
//declare the DB schema
trait UserDef extends TableDef {
  var name = property[String]("name", title = Some("姓名"))
  var age1 = property[Int]("age", primary = true)
}
//declare the model; it mixes in properties as {var name = ""}
@BeanInfo class User extends Model with UserDef
//declare an object.
//it mixes in properties as {var name = Property[String]("name") }
//and object User is a Mapper[User]; thus, it can translate a ResultSet to a User instance.
object `package` {
  @BeanInfo implicit object User extends Table[User]("users") with UserDef
}
then call raw sql, the implicit Mapper[User] works for you:
val users = SQL("select name, age from users").all[User]
users.foreach{user => println(user.name)}
or even build a type safe query:
val users = User.q.where(User.age > 20).where(User.name like "%liu%").all[User]
for more, see unit test:
https://github.com/liusong1111/soupy-orm/blob/master/src/test/scala/mapper/SoupyMapperSpec.scala
project home:
https://github.com/liusong1111/soupy-orm
It uses abstract types and implicits heavily to make the magic happen, and you can check the source code of TableDef, Table, and Model for details.
Several million years ago I wrote an example showing how to use Scala's type system to push and pull values from a ResultSet. Check it out; it matches up with what you want to do fairly closely.
implicit val conn = connect("jdbc:h2:f2", "sa", "");
implicit val s: Statement = conn << setup;
val insertPerson = conn prepareStatement "insert into person(type, name) values(?, ?)";
for (val name <- names)
  insertPerson<<rnd.nextInt(10)<<name<<!;
for (val person <- query("select * from person", rs => Person(rs,rs,rs)))
  println(person.toXML);
for (val person <- "select * from person" <<! (rs => Person(rs,rs,rs)))
  println(person.toXML);
Primitive types are used to guide the Scala compiler into selecting the right functions on the ResultSet.