Spark mapWithState API explanation


I have been using the mapWithState API in Spark Streaming, but 2 things are not clear about the StateSpec.function:
Let's say my function is:
def trackStateForKey(batchTime: Time,
key: Long,
newValue: Option[JobData],
currentState: State[JobData]): Option[(Long, JobData)]
Why is the new value an Option[T] type? As far as I've seen, it was always defined for me, and since the method is supposed to be called with a new state, I don't really see the point why it could be optional.
What does the return value mean? I tried to find some pointers in the documentation and source code, but none of them describe what it is used for. Since I'm modifying the state of a key using state.remove() and state.update(), why would I have to do the same with return values?
In my current implementation I return None if I remove the key, and Some(newState) if I update it, but I'm not sure if that is correct.

Why is the new value an Option[T] type? As far as I've seen, it was
always defined for me, and since the method is supposed to be called
with a new state, I don't really see the point why it could be
optional.
It is an Option[T] for the reason that if you set a timeout using StateSpec.timeout, e.g:
StateSpec.function(spec _).timeout(Milliseconds(5000))
then the value passed in once the state times out will be None, and the isTimingOut method on State[T] will yield true. This makes sense, because a timeout of the state doesn't mean that a new value has arrived for the specified key. It is also generally safer than passing null for T (which wouldn't work for primitives anyway), since the user is expected to operate safely on an Option[T].
You can see that in Spark's implementation:
// Get the timed out state records, call the mapping function on each and collect the
// data returned
if (removeTimedoutData && timeoutThresholdTime.isDefined) {
  newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>
    wrappedState.wrapTimingOutState(state)
    val returned = mappingFunction(batchTime, key, None, wrappedState) // <-- This.
    mappedData ++= returned
    newStateMap.remove(key)
  }
}
What does the return value mean? I tried to find some pointers in the
documentation and source code, but none of them describe what it is
used for. Since I'm modifying the state of a key using state.remove()
and state.update(), why would I have to do the same with return
values?
The return value is a way to pass intermediate results along the Spark graph. For example, assume that I want to update my state but also perform some operation in my pipeline with the intermediate data, e.g.:
dStream
  .mapWithState(stateSpec)
  .map(optionIntermediateResult => optionIntermediateResult.map(_ * 2))
  .foreachRDD( /* other stuff */)
That return value is exactly what allows me to continue operating on said data. If you don't care for the intermediate result and only want the complete state, then outputting None is perfectly fine.
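To make both points concrete, here is a minimal sketch of a tracking function that handles the timeout case and uses the return value for downstream data. It deliberately does not depend on Spark: SketchState is a hypothetical stand-in for Spark's State[S] (only getOption, update, remove and isTimingOut), the batchTime parameter is dropped, and the JobData fields are invented for illustration.

```scala
// Hypothetical stand-in for Spark's State[S]: just enough surface to run without Spark.
final class SketchState[S](private var value: Option[S], val isTimingOut: Boolean) {
  def getOption: Option[S] = value
  def update(newState: S): Unit = {
    require(!isTimingOut, "cannot update a state that is timing out")
    value = Some(newState)
  }
  def remove(): Unit = {
    require(!isTimingOut, "cannot remove a state that is timing out")
    value = None
  }
}

// Hypothetical payload; the real JobData from the question is not shown there.
final case class JobData(progress: Int, finished: Boolean)

object MapWithStateSketch {
  def trackStateForKey(key: Long,
                       newValue: Option[JobData],
                       state: SketchState[JobData]): Option[(Long, JobData)] =
    newValue match {
      case Some(job) if job.finished =>
        state.remove()     // the key is done: drop its state ...
        Some(key -> job)   // ... and emit the final record downstream
      case Some(job) =>
        state.update(job)  // still running: remember it, emit nothing
        None
      case None =>
        // Reached on timeout: no new record arrived, and the state must not be
        // updated or removed here. Emit the last known value instead.
        state.getOption.map(last => key -> last)
    }
}
```

With the real API the same function body would take a State[JobData], and whatever it returns is what the stream produced by mapWithState carries on to the next operator.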
Edit:
I've written a blog post (following this question) which attempts to give an in-depth explanation to the API.

Related

(Scala) Am I using Options correctly?

I'm currently working on my functional programming, and I am fairly new to it. Am I using Options correctly here? I feel pretty insecure about my skills at the moment. I want my code to be as safe as possible. Can anyone point out what I am doing wrong here, or is it not that bad? My code is pretty straightforward:
def main(args: Array[String]): Unit =
{
  val file = "myFile.txt"
  val myGame = Game(file) //I have my game that returns an Option here
  if(myGame.isDefined) //Check if I indeed passed a .txt file
  {
    val solutions = myGame.get.getAllSolutions() //This returns options as well
    if(solutions.isDefined) //Is it possible to solve the puzzle(crossword)
    {
      for(i <- solutions.get){ //print all solutions to the crossword
        i.solvedCrossword foreach println
      }
    }
  }
}
-Thanks!! ^^

When using Option, it is recommended to use pattern matching (match/case) instead of calling isDefined and get.
Instead of the Java-style for loop, use higher-order functions:
myGame match {
  case Some(game) =>
    // getAllSolutions() also returns an Option, so unwrap it before iterating
    game.getAllSolutions().foreach { solutions =>
      solutions.foreach(_.solvedCrossword.foreach(println))
    }
  case None => // no game could be loaded, so there is nothing to print
}

As a rule of thumb, you can think of Option as a replacement for Java's null pointer. That is, in cases where you might want to use null in Java, it often makes sense to use Option in Scala.
Your Game() function uses None to represent errors. So you're not really using it as a replacement for null (at least I'd consider it poor practice for an equivalent Java method to return null there instead of throwing an exception), but as a replacement for exceptions. That's not a good use of Option because it loses error information: you can no longer differentiate between the file not existing, the file being in the wrong format or other types of errors.
Instead you should use Either. Either consists of the cases Left and Right where Right is like Option's Some, but Left differs from None in that it also takes an argument. Here that argument can be used to store information about the error. So you can create a case class containing the possible types of errors and use that as an argument to Left. Or, if you never need to handle the errors differently, but just present them to the user, you can use a string with the error message as the argument to Left instead of case classes.
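As a rough sketch of that idea: the error cases, the Game fields and the parseGame helper below are all made up for illustration (the question doesn't show how Game is built), and the file contents are passed in as lines so the example runs without touching the disk.

```scala
// Hypothetical error cases; a real loader would probably have more of them.
sealed trait GameError
final case class NotATextFile(fileName: String) extends GameError
final case class EmptyPuzzle(fileName: String) extends GameError

// Hypothetical model of a loaded game.
final case class Game(rows: Vector[String])

object EitherSketch {
  // Left carries the reason for the failure, Right the loaded game.
  def parseGame(fileName: String, lines: Seq[String]): Either[GameError, Game] =
    if (!fileName.endsWith(".txt")) Left(NotATextFile(fileName))
    else if (lines.isEmpty) Left(EmptyPuzzle(fileName))
    else Right(Game(lines.toVector))

  // The caller can now tell the failures apart and report them properly.
  def describe(result: Either[GameError, Game]): String = result match {
    case Right(game)                  => s"loaded ${game.rows.size} rows"
    case Left(NotATextFile(fileName)) => s"$fileName is not a .txt file"
    case Left(EmptyPuzzle(fileName))  => s"$fileName contains no puzzle"
  }
}
```

Because GameError is sealed, the compiler can warn when describe forgets one of the cases, which is something a bare None cannot give you.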
In getAllSolutions you're just using None as a replacement for the empty list. That's unnecessary because the empty list needs no replacement. It's perfectly fine to just return an empty list when there are no solutions.
When it comes to interacting with the Options, you're using isDefined + get, which is a bit of an anti-pattern. get can be used as a shortcut if you know that the option you have is never None, but should generally be avoided. isDefined should generally only be used in situations where you need to know whether an option contains a value, but don't need to know the value.
In cases where you need to know both whether there is a value and what that value is, you should either use pattern matching or one of Option's higher-order functions, such as map, flatMap, getOrElse (which is kind of a higher-order function if you squint a bit and consider by-name arguments as kind-of like functions). For cases where you want to do something with the value if there is one and do nothing otherwise, you can use foreach (or equivalently a for loop), but note that you really shouldn't do nothing in the error case here. You should tell the user about the error instead.
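A small illustration of those alternatives, using a plain Option[String] rather than the asker's types (the names and messages here are invented):

```scala
object OptionSketch {
  // map transforms the value if there is one and leaves None untouched.
  def shout(word: Option[String]): Option[String] = word.map(_.toUpperCase)

  // getOrElse supplies a fallback; its argument is by-name, so it is only
  // evaluated when the Option is empty.
  def orPlaceholder(word: Option[String]): String = word.getOrElse("<no solution>")

  // Pattern matching handles both cases explicitly, including the error case.
  def report(word: Option[String]): String = word match {
    case Some(w) => s"solved: $w"
    case None    => "could not solve the puzzle"
  }
}
```

None of these needs isDefined or get, and report shows one way of telling the user about the failure rather than silently doing nothing.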

If all you need here is to print the solutions when everything goes well, you can use a for-comprehension, which is considered quite idiomatic in Scala:
for {
  myGame <- Game("myFile.txt")
  solutions <- myGame.getAllSolutions()
  solution <- solutions
  crossword <- solution.solvedCrossword
} println(crossword)

Confused about Observable vs. Single in functions like readCharacteristic()

In the RxJava2 version of RxAndroidBle, the functions readCharacteristic() and writeCharacteristic() return Single<byte[]>.
The example code to read a characteristic is:
device.establishConnection(false).flatMap(rxBleConnection -> rxBleConnection.readCharacteristic(characteristicUUID))
But the documentation for flatMap() says the mapping function is supposed to return an ObservableSource. Here, it returns a Single. How can this work?
Update: I looked at possibilities using operators like .single() and .singleOrError() but they all seem to require that the upstream emits one item and then completes. But establishConnection() doesn't ever complete. (This is one reason I suggested that perhaps establishConnection() should be reimagined as a Maybe, and some other way be provided to disconnect rather than just unsubscribing.)

You're totally correct, this example cannot be compiled. It's probably a leftover from the RxJava1 version of the library, where these calls returned an Observable rather than a Single.
A simple fix with the same result is to use RxJava2's flatMapSingle, for instance:
device.establishConnection(false)
.flatMapSingle(rxBleConnection -> rxBleConnection.readCharacteristic(characteristicUUID))
flatMapSingle accepts a mapping function that returns a Single, and emits each Single's success value from the resulting Observable.
The point is that RxJava2 has more specific reactive types, which express the series of emissions you can expect from a stream. Some methods now return a Single because that matches the logic of the operation (readCharacteristic()), while others return an Observable because they emit more than one item (establishConnection(), whose connection status can change over time).
RxJava2 also provides many operators to convert between the different types, and which one to use really depends on your needs and scenario.

Thanks Rob!
In fact, the README was outdated and needed some polishing here and there. Please have a look and see if it's OK now.

I think I found the answer I was looking for. The crucial point:
Single.fromObservable(observableSource) doesn't do anything until it receives the second item from observableSource! Assuming that the first item it receives is a valid emission, then if the second item is:
onComplete(), it passes the first item to onSuccess();
onNext(), it signals IndexOutOfBoundsException since a Single can't emit more than one item;
onError(), it presumably forwards the error downstream.
Now, device.establishConnection() is a 1-item, non-completing Observable. The RxBleConnection it emits is flatMapped to a Single with readCharacteristic(). But (another gotcha), flatMapSingle subscribes to these Singles and combines them into an Observable, which doesn't complete until the source establishConnection() does. But the source doesn't ever complete! Therefore the Single we're trying to create won't emit anything, since it doesn't receive that necessary second item.
The solution is to force the generation of onComplete() after the first (and only) item, which can be done with take(1). This will satisfy the Single we're creating, and cause it to emit the Characteristic value we're interested in. Hope that's clear.
The code:
Single<byte[]> readCharacteristicSingle( RxBleDevice device, UUID characteristicUUID ) {
    return Single.fromObservable(
        device.establishConnection( false )
            .flatMapSingle( connection -> connection.readCharacteristic( characteristicUUID ) )
            .take( 1L ) // make flatMapSingle's output Observable complete after the first emission
                        // (this makes the Single call onSuccess())
    );
}

scala save slick result into new object

Is there a way to save the result of a Slick query into a new object?
This is my slick result, there is only one "object" in the list
val result: Future[Seq[ProcessTemplatesModel]] = db.run(action)
The result should be mapped on ProcessTemplatesModel because I want to access the values like this
process.title
Is this possible?
Thanks

TL;DR: you should keep the context as long as you can.
Future denotes the fact that the value will be given at some time in the future (this is what I call some context for the value).
The bad way to use it would be to block your thread, until such value is found, and then work with it.
A better way is to tell your program: "Once the value is found (whenever that is), do something with it". That's a continuation, or call-back, and is implemented with map and flatMap in scala.
Seq is another context for your value. It means that you actually have different possible values. If you want to make sure that you have at most one value, you can always do seq.headOption to switch context from Seq to Option.
The bad way to use it would be to take the first value without bothering checking if it exists or not.
A better way is to tell your program: "No matter how many values you have, do this for each of them".
Now, how do you work in context? You use the Functor and/or Monad operators: map, flatMap.
For instance, if you want to apply a function convertToSomethingElse to each element of your context, just do
result.map(list => list.map(process => convertToSomethingElse(process)))
And you'll get a Future[Seq[SomethingElse]].
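Applied to the original question (getting at process.title), the same idea might look like the sketch below. The case class here is a stand-in with an invented id field; only title comes from the question, and firstTitle is a hypothetical helper name.

```scala
import scala.concurrent.{ExecutionContext, Future}

// Stand-in for the Slick row type; only `title` is taken from the question.
final case class ProcessTemplatesModel(id: Int, title: String)

object FutureSeqSketch {
  // Stay inside both contexts: Future for "later", Option for "maybe no row at all".
  def firstTitle(result: Future[Seq[ProcessTemplatesModel]])
                (implicit ec: ExecutionContext): Future[Option[String]] =
    result.map(rows => rows.headOption.map(process => process.title))
}
```

The caller gets a Future[Option[String]] back and can keep mapping over it; nothing blocks, and an empty result list simply becomes None instead of an exception.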
Another example: if you want to save the result somewhere else, you'll probably have some IO or database operations, which may take some time and possibly fail. We will assume you have a function save(entity: ProcessTemplatesModel): Future[Boolean] that allows you to save one of your models. The fact that the function will take some time (and that it will be started in another thread) and possibly fail is visible in the return type Future[Boolean] (Boolean is not important here, it's the fact that we have again the Future context that matters).
Here, you will have to do (assuming you just want to save the first element in your list):
val savedFirstResult: Future[Option[Boolean]] = result.flatMap { list =>
  // Switch the Option and Future contexts: Option[Future[Boolean]] becomes Future[Option[Boolean]]
  list.headOption match {
    case Some(process) => save(process).map(Some(_))
    case None          => Future.successful(None)
  }
}
So as you can see, we can do most of what we want by staying inside the contexts that are returned by Slick. You shouldn't want to get outside of them because
most of the time there's no need to, since map lets you apply a function written for plain values to a value inside the context
the extracting methods are mostly unsafe: Option#get throws an exception if the Option is empty, and Await.result(future, duration) may block your computations or throw exceptions
responses in Play! can be given as Futures in a controller, using Action.async

How to Design and Call Scala API with Futures

Getting started learning Scala and designing/implementing for asynchronous execution. My question is about how to design (and then call) APIs for operations that return Unit but may not finish right away. For example, in the snippet below, the function (using Slick 3.0) inserts a user into the DB. Is Unit the correct return type for this function, and if so, how do callers know if/when the newly inserted user is successful?
override def insertOrUpdate(entity: User): Unit = {
database.run(users.insertOrUpdate(entity))
}
For example, if the above executes asynchronously and a caller looks something like
//create and insert new user with id = 7
val newUser = User(7, "someName")
userRepo.insertOrUpdate(newUser)
How does a caller know whether or not it is safe to do
userRepo.findById(7)
In unit testing, I know that if I follow up the insert call immediately by a findById call, the findById will return nothing, but if I introduce some latency between the insert and find call, it finds the new user. To summarize, what is the proper way to design an API for a function that executes asynchronously but has no natural return value to wrap in a Future?

Generally when working with Futures, you'll want to do any further processing via methods called on the returned Future. E.g.:
val newUser = User(7, "someName")
val future = userRepo.insertOrUpdate(newUser) // assumes insertOrUpdate is changed to return a Future
future.onSuccess { case outcome => // onSuccess takes a partial function. Here, 'outcome' will hold whatever was contained in the Future - Unit in your description above.
  val storedUser = userRepo.findById(7) // This will only execute once the future completes (successfully).
  ...
}
There are plenty of other useful methods for manipulating a Future (or a collection of them), such as "onFailure", "recover", "map" and "flatMap". Note that onSuccess and onFailure are deprecated since Scala 2.12 and removed in 2.13; foreach and onComplete cover the same ground.
Put off waiting on the Future for as long as possible; preferably let Play or Spray or whatever other framework you might happen to be using take care of it for you (see here for Play documentation on doing this, for example).
Finally, in terms of your DB call to insert, I'd look into having the call return at least a boolean, or better still the primary key the new entry was inserted with, rather than Unit.
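Putting those suggestions together, one possible shape for such an API is sketched below. InMemoryUserRepo is a hypothetical in-memory stand-in for the Slick-backed repository (a TrieMap instead of a database), and insertThenFind is an invented helper; the point is only the return types and the chaining.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.{ExecutionContext, Future}

final case class User(id: Long, name: String)

// Hypothetical in-memory stand-in for a Slick-backed repository.
final class InMemoryUserRepo(implicit ec: ExecutionContext) {
  private val rows = TrieMap.empty[Long, User]

  // Returns the key of the stored row instead of Unit, wrapped in a Future,
  // so callers can tell when (and whether) the write finished.
  def insertOrUpdate(user: User): Future[Long] =
    Future { rows.put(user.id, user); user.id }

  def findById(id: Long): Future[Option[User]] =
    Future(rows.get(id))
}

object FutureApiSketch {
  // flatMap sequences the two calls: the lookup starts only after the insert succeeded.
  def insertThenFind(repo: InMemoryUserRepo, user: User)
                    (implicit ec: ExecutionContext): Future[Option[User]] =
    repo.insertOrUpdate(user).flatMap(id => repo.findById(id))
}
```

With Slick, db.run already returns a Future, so in a real repository it would usually be enough to return that result (possibly mapped) instead of discarding it.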

The Scala equivalent of PHP's isset()

How do I test and see if a variable is set in Scala. In PHP you would use isset()
I am looking for a way to see if a key is set in an array.

First, Arrays in Scala do not have keys. They have indices, and all indices have values in them. See the edit below about how those values might be initialized, though.
You probably mean Map, which has keys. You can check whether a key is present (and, therefore, a value) by using isDefinedAt or contains:
map isDefinedAt key
map contains key
There's no practical difference between the two. Now, you see in the edit that Scala favors the use of Option, and there's just such a method when dealing with maps. If you do this:
map get key
You'll receive an Option back, which will be None if the key (and, therefore, the value) is not present.
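For instance, with a small made-up map (the names and values are just placeholders):

```scala
object MapLookupSketch {
  val ages: Map[String, Int] = Map("alice" -> 30, "bob" -> 25)

  // contains / isDefinedAt answer "is the key set?" without touching the value.
  def isSet(name: String): Boolean = ages.contains(name)

  // get returns an Option, so the "not set" case has to be handled explicitly.
  def describe(name: String): String = ages.get(name) match {
    case Some(age) => s"$name is $age"
    case None      => s"$name is not set"
  }
}
```

If a default value is all you need, ages.getOrElse(name, 0) does the lookup and the fallback in one step.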
EDIT
This is the original answer. I've noticed now that the question is not exactly about this.
As a practical matter, all fields on the JVM are pre-initialized by the JVM itself, which zeroes them. In practice, all reference fields end up pointing to null, booleans are initialized with false and all other primitives are initialized with their version of zero.
There's no such thing in Scala as an "undefined" field -- you cannot even write such a thing. You can write var x: Type = _, but that simply results in the JVM initialization value. You can use null to stand for uninitialized where it makes sense, but idiomatic Scala code tries to avoid doing so.
The usual way of indicating the possibility that a value is not present is using Option. If you have a value, then you get Some(value). If you don't, you get None. See other Stack Overflow questions about various ways of using Option, since you don't use it like variable.isDefined in idiomatic code either (though that works).
Finally, note that idiomatic Scala code doesn't use var much, preferring val. That means you won't set things, but, instead, produce a new copy of the thing with that value set to something else.
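A tiny sketch of what "produce a new copy" can look like with a case class; the Person fields and the withEmail helper are invented for the example:

```scala
// Hypothetical model: the email may or may not be known, hence Option.
final case class Person(name: String, email: Option[String])

object CopySketch {
  // Nothing is "set" in place: copy builds a new Person with one field changed.
  def withEmail(person: Person, email: String): Person =
    person.copy(email = Some(email))
}
```

The original value is left untouched, so any code still holding it keeps seeing email = None.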

PHP and Scala are so different that there is no direct equivalent. First of all Scala promotes immutable variables (final in Java world) so typically we strive for variables that are always set.
You can check for null:
var person: Person = null
//...
if(person == null) {//not set
  //...
}
person = new Person()
if(person != null) {//set
  //...
}
But it is a poor practice. The most idiomatic way would be to use Option:
var person: Option[Person] = None
//...
if(person.isEmpty) {//not set
  //...
}
person = Some(new Person())
if(person.isDefined) {//set
  //...
}
Again, using isDefined isn't the most idiomatic way. Consider map and pattern matching.
