What does "row +: savedState.toSeq" do in StateStoreRestoreExec.doExecute? - scala

we can see StateStoreRestoreExec as follows.
case class StateStoreRestoreExec(
keyExpressions: Seq[Attribute],
stateId: Option[OperatorStateId],
child: SparkPlan)
extends UnaryExecNode with StateStoreReader {
override protected def doExecute(): RDD[InternalRow] = {
val numOutputRows = longMetric("numOutputRows")
child.execute().mapPartitionsWithStateStore(
getStateId.checkpointLocation,
operatorId = getStateId.operatorId,
storeVersion = getStateId.batchId,
keyExpressions.toStructType,
child.output.toStructType,
sqlContext.sessionState,
Some(sqlContext.streams.stateStoreCoordinator)) { case (store, iter) =>
val getKey = GenerateUnsafeProjection.generate(keyExpressions, child.output)
iter.flatMap { row =>
val key = getKey(row)
val savedState = store.get(key)
numOutputRows += 1
row +: savedState.toSeq
}
}
Here, I wonder the meaning of row +: savedState.toSeq . I think row is a instance of UnsafeRow and savedState.toSeq is a instance of Seq. So how can we operate them with +:. On the other hand, I think savedState is a instance of UnsafeRow and toSeq is not a member of UnsaveRow, So how does savedState.toSeq work?

row is an instance of an InternalRow, and savedState is an Option[UnsafeRow], which extends InternalRow. What happens here is that the saved state is transformed from Option[UnsafeRow] to a Seq[UnsafeRow] and then the row instance is prepended to that sequence.
When you flatMap over these UnsafeRow objects, you get back an Iterator[UnsafeRow].

Related

how do I loop a array in udf and return multiple variable value

I'm fresh with scala and udf, now I would like to write a udf which accept 3 parameters from a dataframe columns(one of them is array), for..loop current array, parse and return a case class which will be used afterwards. here's a my code roughly:
case class NewFeatures(dd: Boolean, zz: String)
val resultUdf = udf((arrays: Option[Row], jsonData: String, placement: Int) => {
for (item <- arrays) {
val aa = item.getAs[Long]("aa")
val bb = item.getAs[Long]("bb")
breakable {
if (aa <= 0 || bb <= 0) break
}
val cc = item.getAs[Long]("cc")
val dd = cc > 0
val jsonData = item.getAs[String]("json_data")
val jsonDataObject = JSON.parseFull(jsonData).asInstanceOf[Map[String, Any]]
var zz = jsonDataObject.getOrElse("zz", "").toString
NewFeatures(dd, zz)
}
})
when I run it, it will get exception:
java.lang.UnsupportedOperationException: Schema for type Unit is not supported
how should I modify above udf
First of all, try better naming for your variables, for instance in your case, "arrays" is of type Option[Row]. Here, for (item <- arrays) {...} is basically a .map method, using map on Options, you should provide a function, that uses Row and returns a value of some type (~= signature: def map[V](f: Row => V): Option[V], what you want in your case: def map(f: Row => NewFeatures): Option[NewFeature]). While you're breaking out of this map in some circumstances, so there's no assurance for the compiler that the function inside map method would always return an instance of NewFeatures. So it is Unit (it only returns on some cases, and not all).
What you want to do could be enhanced in something similar to this:
val funcName: (Option[Row], String, Int) => Option[NewFeatures] =
(rowOpt, jsonData, placement) => rowOpt.filter(
/* your break condition */
).map { row => // if passes the filter predicate =>
// fetch data from row, create new instance
}

Distinct elements in priority queue

I am using this implementation of a bounded priority queue, https://gist.github.com/ryanlecompte/5746241, and it works perfectly. However, I want this queue to not contain any element with the same ordering 'ord'. How can I achieve this?
I tried to update the maybeReplaceLowest function by giving the function lteq instead of lt.
private def maybeReplaceLowest(a: A) {
if (ord.lteq(a, head)) {
dequeue()
super.+=(a)
}
}
But I think it does not work because the element which has the same ord as the new element might not be at the head. What could be a quick workaround for this problem?
Many thanks.
Scala PriorityQueue is implemented using Array. Which is very inefficient for lookup operation, which you need to do for every insert (to check if the element already exists in the queue.
For TreeSet is ideal for lookup and ordered storage.
But since this is a sealed class (scala-2.12.8) I created composite class
import scala.collection.mutable
class PrioritySet[A](val maxSize:Int)(implicit ordering:Ordering[A]) {
var set = mutable.TreeSet[A]()(ordering)
private def removeAdditional(): this.type = {
val additionalElements = set.size - maxSize
if( additionalElements > 0) { set = set.drop(additionalElements) }
this
}
def +(elem: A): this.type = {
set += elem
removeAdditional()
}
def ++=(elem: A, elements: A*) : this.type = {
set += elem ++= elements
removeAdditional()
}
override def toString = this.getClass.getName + set.mkString("(", ", ", ")")
}
This can be used as
>>> val set = new PrioritySet[Int](4)
>>> set ++=(1000,3,42,2,5, 1,2,4, 100)
>>> println(set)
PrioritySet(5, 42, 100, 1000)

Getting previous and next element of a value in a Scala enumeration

I would like to add two new operations to a Scala Enumeration to get the previous and the next value given a certain value if it exists. For example, I would like to write something like:
object Nums extends MyNewEnumerationType {
type Nums = Value
val One,Two,Three = Value
}
Nums.nextOf(One) // Some(Two)
Nums.prevOf(One) // None
My idea was to create a new class and add the methods in this way:
class PrevNextEnum extends Enumeration {
val prevOf = values.zip(None +: values.map{_.some}.toSeq).toMap
val nextOf = {
if (values.isEmpty) Map.empty
else values.zip(values.tail.map{_.some}.toSeq :+ None).toMap
}
}
The problem is that this doesn't work because when prevOf and nextOf are initialized, values is empty.
First question: why values is empty and when it is filled with the values?
Second question: how can I implement prevOf and nextOf?
Third question: is it possible to add the methods prevOf and nextOf to the value type instead of the enumeration? Writing One.next feels more natural than writing Num.nextOf(One)
try the following codes:
class PrevNextEnum extends Enumeration {
lazy val prevOf = {
val list = values.toList
val map = list.tail.zip(list.map(Some(_))).toMap + (list.head -> None)
map
}
lazy val nextOf = {
val list = values.toList
val map = (list.zip(list.tail.map(Some(_)) :+ None).toMap)
map
}
}
object Nums extends PrevNextEnum {
type Nums = Value
val One, Two, Three = Value
}
object App extends App {
println(Nums.prevOf(Nums.Two))
println(Nums.nextOf(Nums.One))
println(Nums.nextOf(Nums.Three))
println(Nums.prevOf(Nums.One))
}
Building on the answer of user1484819 :
class PrevNextEnum extends Enumeration {
lazy val prevOf = {
val list = values.toList
val map = list.tail.zip(list).toMap
v:Value => map.get(v)
}
lazy val nextOf = {
val list = values.toList
val map = list.zip(list.tail).toMap
v:Value => map.get(v)
}
}
object Nums extends PrevNextEnum {
type Nums = Value
val One, Two, Three = Value
}
This has basically the same structure, but uses the fact that Map can return Options itself when using get instead of apply.

Counting pattern in scala list

My list looks like the following: List(Person,Invite,Invite,Person,Invite,Person...). I am trying to match based on a inviteCountRequired, meaning that the Invite objects following the Person object in the list is variable. What is the best way of doing this? My match code so far looks like this:
aList match {
case List(Person(_,_,_),Invitee(_,_,_),_*) => ...
case _ => ...
}
First stack question, please go easy on me.
Let
val aList = List(Person(1), Invite(2), Invite(3), Person(2), Invite(4), Person(3), Invite(6), Invite(7))
Then index each location in the list and select Person instances,
val persons = (aList zip Stream.from(0)).filter {_._1.isInstanceOf[Person]}
namely, List((Person(1),0), (Person(2),3), (Person(3),5)) . Define then sublists where the lower bound corresponds to a Person instance,
val intervals = persons.map{_._2}.sliding(2,1).toArray
res31: Array[List[Int]] = Array(List(0, 3), List(3, 5))
Construct sublists,
val latest = aList.drop(intervals.last.last) // last Person and Invitees not paired
val associations = intervals.map { case List(pa,pb,_*) => b.slice(pa,pb) } ++ latest
Hence the result looks like
Array(List(Person(1), Invite(2), Invite(3)), List(Person(2), Invite(4)), List(Person(3), Invite(6), Invite(7)))
Now,
associations.map { a =>
val person = a.take(1)
val invitees = a.drop(1)
// ...
}
This approach may be seen as a variable size sliding.
Thanks for your tips. I ended up creating another case class:
case class BallotInvites(person:Person,invites:List[Any])
Then, I populated it from the original list:
def constructBallotList(ballots:List[Any]):List[BallotInvites] ={
ballots.zipWithIndex.collect {
case (iv:Ballot,i) =>{
BallotInvites(iv,
ballots.distinct.takeRight(ballots.distinct.length-(i+1)).takeWhile({
case y:Invitee => true
case y:Person =>true
case y:Ballot => false})
)}
}}
val l = Ballot.constructBallotList(ballots)
Then to count based on inviteCountRequired, I did the following:
val count = l.count(b=>if ((b.invites.count(x => x.isInstanceOf[Person]) / contest.inviteCountRequired)>0) true else false )
I am not sure I understand the domain but you should only need to iterate once to construct a list of person + invites tuple.
sealed trait PorI
case class P(i: Int) extends PorI
case class I(i: Int) extends PorI
val l: List[PorI] = List(P(1), I(1), I(1), P(2), I(2), P(3), P(4), I(4))
val res = l.foldLeft(List.empty[(P, List[I])])({ case (res, t) =>
t match {
case p # P(_) => (p, List.empty[I]) :: res
case i # I(_) => {
val head :: tail = res
(head._1, i :: head._2) :: tail
}
}
})
res // List((P(4),List(I(4))), (P(3),List()), (P(2),List(I(2))), (P(1),List(I(1), I(1))))

Scala: Group an Iterable into an Iterable of Iterables by a predicate

I have very large Iterators that I want to split into pieces. I have a predicate that looks at an item and returns true if it is the start of a new piece. I need the pieces to be Iterators, because even the pieces will not fit into memory. There are so many pieces that I would be wary of a recursive solution blowing out your stack. The situation is similar to this question, but I need Iterators instead of Lists, and the "sentinels" (items for which the predicate is true) occur (and should be included) at the beginning of a piece. The resulting iterators will only be used in order, though some may not be used at all, and they should only use O(1) memory. I imagine this means they should all share the same underlying iterator. Performance is important.
If I were to take a stab at a function signature, it would be this:
def groupby[T](iter: Iterator[T])(startsGroup: T => Boolean): Iterator[Iterator[T]] = ...
I would have loved to use takeWhile, but it loses the last element. I investigated span, but it buffers results. My current best idea involves BufferedIterator, but maybe there is a better way.
You'll know you've got it right because something like this doesn't crash your JVM:
groupby((1 to Int.MaxValue).iterator)(_ % (Int.MaxValue / 2) == 0).foreach(group => println(group.sum))
groupby((1 to Int.MaxValue).iterator)(_ % 10 == 0).foreach(group => println(group.sum))
You have an inherent problem. Iterable implies that you can get multiple iterators. Iterator implies that you can only pass through once. That means that your Iterable[Iterable[T]] should be able to produce Iterator[Iterable[T]]s. But when this returns an element--an Iterable[T]--and that asks for multiple iterators, the underlying single iterator can't comply without either caching the results of the list (too big) or calling the original iterable and going through absolutely everything again (very inefficient).
So, while you could do this, I think you should conceive of your problem in a different way.
If you could start with a Seq instead, you could grab subsets as ranges.
If you already know how you want to use your iterable, you could write a method
def process[T](source: Iterable[T])(starts: T => Boolean)(handlers: T => Unit *)
which increments through the set of handlers each time starts fires off a "true". If there's any way you can do your processing in one sweep, something like this is the way to go. (Your handlers will have to save state via mutable data structures or variables, however.)
If you can permit iteration on the outer list to break the inner list, you could have an Iterable[Iterator[T]] with the additional constraint that once you iterate to a later sub-iterator, all previous sub-iterators are invalid.
Here's a solution of the last type (from Iterator[T] to Iterator[Iterator[T]]; one can wrap this to make the outer layers Iterable instead).
class GroupedBy[T](source: Iterator[T])(starts: T => Boolean)
extends Iterator[Iterator[T]] {
private val underlying = source
private var saved: T = _
private var cached = false
private var starting = false
private def cacheNext() {
saved = underlying.next
starting = starts(saved)
cached = true
}
private def oops() { throw new java.util.NoSuchElementException("empty iterator") }
// Comment the next line if you do NOT want the first element to always start a group
if (underlying.hasNext) { cacheNext(); starting = true }
def hasNext = {
while (!(cached && starting) && underlying.hasNext) cacheNext()
cached && starting
}
def next = {
if (!(cached && starting) && !hasNext) oops()
starting = false
new Iterator[T] {
var presumablyMore = true
def hasNext = {
if (!cached && !starting && underlying.hasNext && presumablyMore) cacheNext()
presumablyMore = cached && !starting
presumablyMore
}
def next = {
if (presumablyMore && (cached || hasNext)) {
cached = false
saved
}
else oops()
}
}
}
}
Here's my solution using BufferedIterator. It doesn't let you skip iterators correctly, but it's fairly simple and functional. The first element(s) go into a group even if !startsGroup(first).
def groupby[T](iter: Iterator[T])(startsGroup: T => Boolean): Iterator[Iterator[T]] =
new Iterator[Iterator[T]] {
val base = iter.buffered
override def hasNext = base.hasNext
override def next() = Iterator(base.next()) ++ new Iterator[T] {
override def hasNext = base.hasNext && !startsGroup(base.head)
override def next() = if (hasNext) base.next() else Iterator.empty.next()
}
}
Update: Keeping a little state lets you skip iterators and prevent people from messing with previous ones:
def groupby[T](iter: Iterator[T])(startsGroup: T => Boolean): Iterator[Iterator[T]] =
new Iterator[Iterator[T]] {
val base = iter.buffered
var prev: Iterator[T] = Iterator.empty
override def hasNext = base.hasNext
override def next() = {
while (prev.hasNext) prev.next() // Exhaust previous iterator; take* and drop* do NOT always work!! (Jira SI-5002?)
prev = Iterator(base.next()) ++ new Iterator[T] {
var hasMore = true
override def hasNext = { hasMore = hasMore && base.hasNext && !startsGroup(base.head) ; hasMore }
override def next() = if (hasNext) base.next() else Iterator.empty.next()
}
prev
}
}
If you are looking at memory constraints then the following will work. You can only use it if your underlying iterable object supports views. This implementation will iterate over the Iterable and then generate IterableViews which can then be iterated over. This implementation does not care if the very first element tests as a start group since it will be regardless.
def groupby[T](iter: Iterable[T])(startsGroup: T => Boolean): Iterable[Iterable[T]] = new Iterable[Iterable[T]] {
def iterator = new Iterator[Iterable[T]] {
val i = iter.iterator
var index = 0
var nextView: IterableView[T, Iterable[T]] = getNextView()
private def getNextView() = {
val start = index
var hitStartGroup = false
while ( i.hasNext && ! hitStartGroup ) {
val next = i.next()
index += 1
hitStartGroup = ( index > 1 && startsGroup( next ) )
}
if ( hitStartGroup ) {
if ( start == 0 ) iter.view( start, index - 1 )
else iter.view( start - 1, index - 1 )
} else { // hit end
if ( start == index ) null
else if ( start == 0 ) iter.view( start, index )
else iter.view( start - 1, index )
}
}
def hasNext = nextView != null
def next() = {
if ( nextView != null ) {
val next = nextView
nextView = getNextView()
next
} else null
}
}
}
You can maintain low memory foot-print by using Streams. Use result.toIterator, if you an iterator again.
With streams, there's no mutable state, only a single conditional and it's nearly as concise as Jay Hacker's solution.
def batchBy[A,B](iter: Iterator[A])(f: A => B): Stream[(B, Iterator[A])] = {
val base = iter.buffered
val empty = Stream.empty[(B, Iterator[A])]
def getBatch(key: B) = {
Iterator(base.next()) ++ new Iterator[A] {
def hasNext: Boolean = base.hasNext && (f(base.head) == key)
def next(): A = base.next()
}
}
def next(skipList: Option[Iterator[A]] = None): Stream[(B, Iterator[A])] = {
skipList.foreach{_.foreach{_=>}}
if (base.isEmpty) empty
else {
val key = f(base.head)
val batch = getBatch(key)
Stream.cons((key, batch), next(Some(batch)))
}
}
next()
}
I ran the tests:
scala> batchBy((1 to Int.MaxValue).iterator)(_ % (Int.MaxValue / 2) == 0)
.foreach{case(_,group) => println(group.sum)}
-1610612735
1073741823
-536870909
2147483646
2147483647
The second test prints too much to paste to Stack Overflow.
import scala.collection.mutable.ArrayBuffer
object GroupingIterator {
/**
* Create a new GroupingIterator with a grouping predicate.
*
* #param it The original iterator
* #param p Predicate controlling the grouping
* #tparam A Type of elements iterated
* #return A new GroupingIterator
*/
def apply[A](it: Iterator[A])(p: (A, IndexedSeq[A]) => Boolean): GroupingIterator[A] =
new GroupingIterator(it)(p)
}
/**
* Group elements in sequences of contiguous elements that satisfy a predicate. The predicate
* tests each single potential next element of the group with the help of the elements grouped so far.
* If it returns true, the potential next element is added to the group, otherwise
* a new group is started with the potential next element as first element
*
* #param self The original iterator
* #param p Predicate controlling the grouping
* #tparam A Type of elements iterated
*/
class GroupingIterator[+A](self: Iterator[A])(p: (A, IndexedSeq[A]) => Boolean) extends Iterator[IndexedSeq[A]] {
private[this] val source = self.buffered
private[this] val buffer: ArrayBuffer[A] = ArrayBuffer()
def hasNext: Boolean = source.hasNext
def next(): IndexedSeq[A] = {
if (hasNext)
nextGroup()
else
Iterator.empty.next()
}
private[this] def nextGroup(): IndexedSeq[A] = {
assert(source.hasNext)
buffer.clear()
buffer += source.next
while (source.hasNext && p(source.head, buffer)) {
buffer += source.next
}
buffer.toIndexedSeq
}
}