Spark mapWithState updated states output - scala

After upgrading to Spark 1.6.1, I've started refactoring an application to replace updateStateByKey with mapWithState.
In order to take advantage of the performance advantages of the new API, I don't want to call stateSnapshots, which loads all states. I only want the updated states.
The mapWithState API returns a DStream of [key, input, state, output], where each state is the partially updated state after an input is ingested. How can I extract the latest states alone from this DStream (i.e. the state after all corresponding inputs have been ingested / mapped)?
I can do a map (to drop the input and output) and reduceByKey on the MapWithStateDStream, choosing the state with the newer timestamp (which I set inside the update function), but I have no guarantee there won't be two partial states with the same timestamp, even if using a custom, by key, partitioner.
How can I tell which partial state is the latest in the MapWithStateDStream output of mapWithState?

The mapping function passed to mapWithState is only invoked for keys that receive new data in the current micro-batch. One way to achieve what you want is to return a Some[S] only when the state has actually been updated.
StateSpec.function takes a method with the following signature:
mappingFunction:
(Time, KeyType, Option[ValueType], State[StateType]) => Option[MappedType]
What we can do is make sure that our Option[MappedType] is always Some[MappedType] when the value has been updated, otherwise None.
For example:
def updateState(key: Int, value: Option[Int], state: State[Int]): Option[Int] = {
  value match {
    case Some(something) if something > 10 =>
      val updatedVal = something * something
      state.update(updatedVal)
      Some(updatedVal)
    case _ => None
  }
}
And then, on your pair DStream (here called stream), you can do:
val spec = StateSpec.function(updateState _)
stream.mapWithState(spec).filter(_.isDefined).foreachRDD(/* do stuff on updated state */)
This way you filter out every key whose state was not updated and keep only the updated snapshots you're looking for.

One solution that would work, if your update algorithm allows it, is to call reduceByKey on the input stream before calling mapWithState. Then there is only a single update per key in each batch, and no partial states are output.
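For illustration, a minimal sketch of this idea (the pair DStream of (Int, Int) and the sum as the combine step are assumptions; updateState is the function from the answer above):

import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

def latestStates(input: DStream[(Int, Int)]): DStream[Option[Int]] = {
  def updateState(key: Int, value: Option[Int], state: State[Int]): Option[Int] =
    value match {
      case Some(v) if v > 10 =>
        val updated = v * v
        state.update(updated)
        Some(updated)
      case _ => None
    }

  input
    .reduceByKey(_ + _)                              // one combined value per key per batch
    .mapWithState(StateSpec.function(updateState _)) // hence at most one state update per key per batch
    .filter(_.isDefined)
}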

Convert collect-map-foreach scala code block to spark/sql library functions

I have a spark dataframe (let's call it "records") like the following one:
id     name
a1     john
b"2    alice
c3'    joe
If you notice, the primary key column (id) values may have single/double quotes in them (like the second and third row in the dataframe).
I wrote the following Scala code to check for quotes in the primary key column values:
import scala.util.control.Breaks._
import org.apache.spark.sql.DataFrame

def checkForQuotesInPrimaryKeyColumn(primaryKey: String, records: DataFrame): Boolean = {
  // Extract primary key column values
  val pkcValues = records.select(primaryKey).collect().map(_(0)).toList
  // Check for single and double quotes in the values
  var checkForQuotes = false // indicates no quotes
  breakable {
    pkcValues.foreach(pkcValue => {
      if (pkcValue.toString.contains("\"") || pkcValue.toString.contains("\'")) {
        checkForQuotes = true
        println("Value that has quotes: " + pkcValue.toString)
        break()
      }
    })
  }
  checkForQuotes
}
This code works. But it doesn't take advantage of spark functionalities. I wish to make use of spark executors (and other features) that can complete this task faster.
The updated function looks like the following:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

def checkForQuotesInPrimaryKeyColumnsUpdated(primaryKey: String, records: DataFrame): Boolean = {
  val findQuotes = udf((s: String) => if (s.contains("\"") || s.contains("\'")) true else false)
  records
    .select(findQuotes(col(primaryKey)) as "quotes")
    .filter(col("quotes") === true)
    .collect()
    .nonEmpty
}
The unit tests give similar runtimes on my machine for both the functions when run on a dataframe with 100 entries.
Is the updated function any faster (and/or better) than the original function? Is there any way the function can be improved?
Your first approach collects the entire dataframe to the driver. If your data does not fit into the driver's memory, it is going to break. Also, you are right: you do not take advantage of Spark.
The second approach uses Spark to detect the quotes. That's better. The problem is that you then collect to the driver a dataframe containing one boolean per record that contains a quote, just to check whether there is at least one. This is a waste of time, especially if many records contain quotes. It is also a shame to use a UDF for this, since UDFs are known to be slower than Spark SQL primitives.
You could simply use Spark to count the number of records containing a quote, without collecting anything:
records.where(col(primaryKey).contains("\"") || col(primaryKey).contains("'"))
  .count > 0
Since you do not actually care about the number of records, and only want to check whether there is at least one, you could use limit(1). Spark SQL will be able to further optimize the query:
records.where(col(primaryKey).contains("\"") || col(primaryKey).contains("'"))
  .limit(1).count > 0
NB: it makes sense that in unit tests, with little data, both of your queries take about the same time. Spark is meant for big data and has some overhead. With real data, your second approach should be faster than the first, and the one I propose even more so. Also, your first approach will get an OOM on the driver as soon as you add more data.
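Wrapped back into the question's helper, a minimal sketch of the suggested approach could look like this (keeping the original signature; no UDF, no collect, and limit(1) lets Spark stop early):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def checkForQuotesInPrimaryKeyColumn(primaryKey: String, records: DataFrame): Boolean =
  records
    .where(col(primaryKey).contains("\"") || col(primaryKey).contains("'"))
    .limit(1)
    .count() > 0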

Confused about intervalJoin

I'm trying to come up with a solution that involves applying some logic after the join operation to pick one event from streamB among the multiple EventBs. It would be like a reduce function, but one that returns only a single element instead of reducing incrementally. So the end result would be a single (EventA, EventB) pair instead of a cross product of one EventA and multiple EventBs.
streamA
  .keyBy((a: EventA) => a.common_key)
  .intervalJoin(streamB.keyBy((b: EventB) => b.common_key))
  .between(Time.days(-30), Time.days(0))
  .process(new MyJoinFunction)
The data would be ingested like (assuming they have the same key):
EventB ts: 1616686386000
EventB ts: 1616686387000
EventB ts: 1616686388000
EventB ts: 1616686389000
EventA ts: 1616686390000
Each EventA key is guaranteed to arrive only once.
Assume a join operation like the one above generated 1 EventA and 4 EventBs, successfully joined and collected in MyJoinFunction. What I want now is to access all of these values at once and apply some logic to match the EventA to exactly one EventB.
For example, for the above dataset I need (EventA 1616686390000, EventB 1616686387000).
MyJoinFunction will be invoked for each (EventA, EventB) pair, but I'd like an operation after this that lets me access an iterator so I can look through all EventB events for each EventA.
I am aware that I can apply another windowing operation after the join to group all the pairs, but I want this to happen immediately after the join succeeds. So if possible, I'd like to avoid adding another window since my window is already large (30 days).
Is Flink the correct choice for this use case, or am I on completely the wrong track?
This could be implemented as a KeyedCoProcessFunction. You would key both streams by their common key, connect them, and process both streams together.
You can use ListState to store the events from B (for a given key), and ValueState for A (again, for a given key). You can use an event time timer to know when the time has come to look through the B events in the ListState, and produce your result. Don't forget to clear the state once you are finished with it.
If you're not familiar with this part of the Flink API, the tutorial on Process Functions should be helpful.
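For illustration, here is a minimal sketch of such a KeyedCoProcessFunction (the event case classes, field names, the one-second event-time grace period, and the "pick the earliest EventB" rule are all assumptions or placeholders for your own matching logic):

import org.apache.flink.api.common.state.{ListState, ListStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction
import org.apache.flink.util.Collector

import scala.collection.JavaConverters._

// Hypothetical event types standing in for the question's EventA / EventB.
case class EventA(common_key: String, ts: Long)
case class EventB(common_key: String, ts: Long)

class MatchFunction
    extends KeyedCoProcessFunction[String, EventA, EventB, (EventA, EventB)] {

  // All EventBs seen so far for this key, plus the single EventA once it arrives.
  private var bufferedB: ListState[EventB] = _
  private var pendingA: ValueState[EventA] = _

  override def open(parameters: Configuration): Unit = {
    bufferedB = getRuntimeContext.getListState(
      new ListStateDescriptor[EventB]("buffered-b", classOf[EventB]))
    pendingA = getRuntimeContext.getState(
      new ValueStateDescriptor[EventA]("pending-a", classOf[EventA]))
  }

  override def processElement1(
      a: EventA,
      ctx: KeyedCoProcessFunction[String, EventA, EventB, (EventA, EventB)]#Context,
      out: Collector[(EventA, EventB)]): Unit = {
    pendingA.update(a)
    // Fire shortly after EventA's timestamp; the grace period is an assumption.
    ctx.timerService().registerEventTimeTimer(a.ts + 1000L)
  }

  override def processElement2(
      b: EventB,
      ctx: KeyedCoProcessFunction[String, EventA, EventB, (EventA, EventB)]#Context,
      out: Collector[(EventA, EventB)]): Unit = {
    bufferedB.add(b)
  }

  override def onTimer(
      timestamp: Long,
      ctx: KeyedCoProcessFunction[String, EventA, EventB, (EventA, EventB)]#OnTimerContext,
      out: Collector[(EventA, EventB)]): Unit = {
    val a = pendingA.value()
    val candidates = bufferedB.get().asScala.toList
    if (a != null && candidates.nonEmpty) {
      // Placeholder selection rule: take the earliest EventB.
      out.collect((a, candidates.minBy(_.ts)))
    }
    // Clear the state once this key is done.
    pendingA.clear()
    bufferedB.clear()
  }
}

Keying both streams by the common key, connecting them, and calling process(new MatchFunction) would then replace the interval join:

streamA.keyBy(_.common_key)
  .connect(streamB.keyBy(_.common_key))
  .process(new MatchFunction)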

Counting latest state of stateful entities in streaming with Flink

I tried to create my first real-time analytics job in Flink. The approach is kappa-architecture-like, so I have my raw data on Kafka where we receive a message for every change of state of any entity.
So the messages are of the form:
(id, newStatus, timestamp)
We want to compute, for every time window, the count of items in a given status. So the output should be of the form:
(outputTimestamp, state1:count1,state2:count2 ...)
or equivalent. These rows should contain, at any given time, the count of the items in a given status, where the status associated with an id is the one from the most recent message observed for that id. The status of an id should be counted in any case, even if its last event is much older than those currently being processed. So the sum of all the counts should be equal to the number of distinct ids observed in the system. A later step could be forgetting about items that have been in a final state for a while, but this is not a strict requirement right now.
This will be written on elasticsearch and then queried.
I tried many different paths and none of them completely satisfied the requirement. Using a sliding window I could easily achieve the expected behaviour, except that when the beginning of the sliding window moved past the timestamp of an event, that event was lost for the count, as you may expect. Other approaches failed to be consistent when working with a backlog, because I did some tricks with keys and timestamps that broke down when the data was processed all at once.
So I would like to know, even at a high level, how I should approach this problem. It looks like a relatively common use case, but the fact that the relevant information for a given id must be retained indefinitely to count the entities correctly creates a lot of problems.
I think I have a solution for your problem:
Given a DataStream of (id, state, time) as:
val stateUpdates: DataStream[(Long, Int, Long)] = ???
You derive the actual state changes as follows:
val stateCntUpdates: DataStream[(Int, Int)] = stateUpdates // (state, cntUpdate)
  .keyBy(_._1) // key by id
  .flatMap(new StateUpdater)
StateUpdater is a stateful FlatMapFunction. It has a keyed state that stores the last state of each id. For each input record it returns two state count update records: (oldState, -1), (newState, +1). The (oldState, -1) record ensures that counts of previous states are reduced.
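For illustration, a minimal sketch of such a StateUpdater using keyed ValueState (the (id, state, time) tuple layout follows the stream above; the boxed Integer to represent "no previous state" and everything else here is an assumption):

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// Emits (state, cntUpdate) records: -1 for the state the id leaves, +1 for the state it enters.
class StateUpdater extends RichFlatMapFunction[(Long, Int, Long), (Int, Int)] {

  // Last observed state per id; boxed Integer so that "no previous state" can be null.
  private var lastState: ValueState[java.lang.Integer] = _

  override def open(parameters: Configuration): Unit = {
    lastState = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Integer]("last-state", classOf[java.lang.Integer]))
  }

  override def flatMap(update: (Long, Int, Long), out: Collector[(Int, Int)]): Unit = {
    val newState = update._2
    val oldState = lastState.value()
    if (oldState == null) {
      out.collect((newState, 1))             // first time we see this id
      lastState.update(newState)
    } else if (oldState.intValue() != newState) {
      out.collect((oldState.intValue(), -1)) // the previous state loses this id
      out.collect((newState, 1))             // the new state gains it
      lastState.update(newState)
    }
  }
}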
Next you aggregate the state count changes per state and window:
val cntUpdatesPerWindow: DataStream[(Int, Int, Long)] = stateCntUpdates // (state, cntUpdate, time)
  .keyBy(_._1) // key by state
  .timeWindow(Time.minutes(10)) // window should be non-overlapping, e.g. tumbling
  .apply(new SumReducer(), new YourWindowFunction())
SumReducer sums the cntUpdates and YourWindowFunction assigns the timestamp of your window. This step aggregates all state changes for each state in a window.
Finally, we adjust the current count with the count updates.
val stateCnts: DataStream[(Int, Int, Long)] = cntUpdatesPerWindow // (state, count, time)
  .keyBy(_._1) // key by state again
  .map(new CountUpdater)
CountUpdater is a stateful MapFunction. It has a keyed state that stores the current count for each state. For each incoming record, the count is adjusted and a record (state, newCount, time) is emitted.
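Along the same lines, a sketch of what CountUpdater could look like (again, everything beyond the (state, cntUpdate, time) layout described above is an assumption):

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration

// Keeps a running count per state and emits (state, newCount, time) for each window update.
class CountUpdater extends RichMapFunction[(Int, Int, Long), (Int, Int, Long)] {

  private var currentCount: ValueState[java.lang.Integer] = _

  override def open(parameters: Configuration): Unit = {
    currentCount = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Integer]("current-count", classOf[java.lang.Integer]))
  }

  override def map(update: (Int, Int, Long)): (Int, Int, Long) = {
    val (state, cntUpdate, time) = update
    val newCount = Option(currentCount.value()).map(_.intValue()).getOrElse(0) + cntUpdate
    currentCount.update(newCount)
    (state, newCount, time)
  }
}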
Now you have a stream with new counts for each state (one record for each state). If possible, you can update your Elasticsearch index using these records. If you need to collect all state counts for a given time you can do that using a window.
Please note: The state size of this program depends on the number of unique ids. That might cause problems if the id space grows very fast.

Is there a way to add extra metadata for Spark dataframes?

Is it possible to add extra metadata to DataFrames?
Reason
I have Spark DataFrames for which I need to keep extra information. Example: A DataFrame, for which I want to "remember" the highest used index in an Integer id column.
Current solution
I use a separate DataFrame to store this information. Of course, keeping this information separately is tedious and error-prone.
Is there a better solution to store such extra information on DataFrames?
To expand and Scala-fy nealmcb's answer (the question was tagged scala, not python, so I don't think this answer will be off-topic or redundant), suppose you have a DataFrame:
import org.apache.spark.sql
val df = sc.parallelize(Seq.fill(100) { scala.util.Random.nextInt() }).toDF("randInt")
And some way to get the max or whatever you want to memoize on the DataFrame:
val randIntMax = df.rdd.map { case sql.Row(randInt: Int) => randInt }.reduce(math.max)
sql.types.Metadata can only hold strings, booleans, some types of numbers, and other metadata structures. So we have to use a Long:
val metadata = new sql.types.MetadataBuilder().putLong("columnMax", randIntMax).build()
DataFrame.withColumn() actually has an overload that permits supplying a metadata argument at the end, but it's inexplicably marked [private], so we just do what it does — use Column.as(alias, metadata):
val newColumn = df.col("randInt").as("randInt_withMax", metadata)
val dfWithMax = df.withColumn("randInt_withMax", newColumn)
dfWithMax now has (a column with) the metadata you want!
dfWithMax.schema.foreach(field => println(s"${field.name}: metadata=${field.metadata}"))
> randInt: metadata={}
> randInt_withMax: metadata={"columnMax":2094414111}
Or programmatically and type-safely (sort of; Metadata.getLong() and others do not return Option and may throw a "key not found" exception):
dfWithMax.schema("randInt_withMax").metadata.getLong("columnMax")
> res29: Long = 209341992
Attaching the max to a column makes sense in your case, but in the general case of attaching metadata to a DataFrame and not a column in particular, it appears you'd have to take the wrapper route described by the other answers.
As of Spark 1.2, StructType schemas have a metadata attribute which can hold an arbitrary mapping / dictionary of information for each Column in a Dataframe. E.g. (when used with the separate spark-csv library):
customSchema = StructType([
    StructField("cat_id", IntegerType(), True,
                {'description': "Unique id, primary key"}),
    StructField("cat_title", StringType(), True,
                {'description': "Name of the category, with underscores"})])

categoryDumpDF = (sqlContext.read.format('com.databricks.spark.csv')
                  .options(header='false')
                  .load(csvFilename, schema=customSchema))
f = categoryDumpDF.schema.fields
["%s (%s): %s" % (t.name, t.dataType, t.metadata) for t in f]
["cat_id (IntegerType): {u'description': u'Unique id, primary key'}",
"cat_title (StringType): {u'description': u'Name of the category, with underscores.'}"]
This was added in [SPARK-3569] Add metadata field to StructField - ASF JIRA, and designed for use in Machine Learning pipelines to track information about the features stored in columns, such as categorical/continuous, the number of categories, and the category-to-index map. See the SPARK-3569: Add metadata field to StructField design document.
I'd like to see this used more widely, e.g. for descriptions and documentation of columns, the unit of measurement used in the column, coordinate axis information, etc.
Issues include how to appropriately preserve or manipulate the metadata information when the column is transformed, how to handle multiple sorts of metadata, how to make it all extensible, etc.
For the benefit of those thinking of expanding this functionality in Spark dataframes, I reference some analogous discussions around Pandas.
For example, see xray - bring the labeled data power of pandas to the physical sciences which supports metadata for labeled arrays.
And see the discussion of metadata for Pandas at Allow custom metadata to be attached to panel/df/series? · Issue #2485 · pydata/pandas.
See also discussion related to units: ENH: unit of measurement / physical quantities · Issue #10349 · pydata/pandas
If you want to have less tedious work, I think you can add an implicit conversion between DataFrame and your custom wrapper (haven't tested it yet though).
implicit class WrappedDataFrame(val df: DataFrame) {
  var metadata = scala.collection.mutable.Map[String, Long]()

  def addToMetaData(key: String, value: Long) {
    metadata += key -> value
  }

  ...[other methods you consider useful, getters, setters, whatever]...
}
If the implicit wrapper is in DataFrame's scope, you can just use normal DataFrame as if it was your wrapper, ie.:
df.addToMetaData("size", 100)
This approach also makes your metadata mutable, so you are not forced to compute it only once and carry it around.
I would store a wrapper around your dataframe. For example:
case class MyDFWrapper(dataFrame: DataFrame, metadata: Map[String, Long])
val maxIndex = df1.agg("index" ->"MAX").head.getLong(0)
MyDFWrapper(df1, Map("maxIndex" -> maxIndex))
A lot of people saw the word "metadata" and went straight to "column metadata". This does not seem to be what you wanted, and it was not what I wanted when I had a similar problem. Ultimately, the problem here is that a DataFrame is an immutable data structure: whenever an operation is performed on it, the data carries over but the rest of the DataFrame does not. This means that you can't simply put a wrapper on it, because as soon as you perform an operation you've got a whole new DataFrame (potentially of a completely new type, especially with Scala/Spark's tendency toward implicit conversions). Finally, if the DataFrame ever escapes its wrapper, there's no way to reconstruct the metadata from the DataFrame.
I had this problem in Spark Streaming, which focuses on RDDs (the underlying data structure of the DataFrame as well), and came to one simple conclusion: the only place to store the metadata is in the name of the RDD. An RDD name is never used by the core Spark system except for reporting, so it's safe to repurpose it. Then you can create your wrapper based on the RDD name, with an explicit conversion between any DataFrame and your wrapper, complete with metadata.
Unfortunately, this still leaves you with the problem of immutability and new RDDs being created with every operation. The RDD name (our metadata field) is lost with each new RDD. That means you need a way to re-add the name to your new RDD. This can be solved by providing a method that takes a function as an argument: it extracts the metadata before calling the function, calls the function to get the new RDD/DataFrame, and then names it with the metadata:
def withMetadata(fn: DataFrame => DataFrame): MetaDataFrame = {
  val meta = df.rdd.name
  val result = fn(df)
  result.rdd.setName(meta)
  MetaDataFrame(result)
}
Your wrapping class (MetaDataFrame) can provide convenience methods for parsing and setting metadata values, as well as implicit conversions back and forth between Spark DataFrame and MetaDataFrame. As long as you run all your mutations through the withMetadata method, your metadata will carry along through your entire transformation pipeline. Using this method for every call is a bit of a hassle, yes, but the simple reality is that there is no first-class metadata concept in Spark.
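For illustration, a minimal sketch of such a wrapper under the stated assumptions (the key=value;... encoding of the RDD name, the method names, and the implicit conversions are choices of this sketch, not an established API):

import org.apache.spark.sql.DataFrame
import scala.language.implicitConversions

// Sketch of the wrapper described above. Metadata is serialized into the underlying RDD's name.
case class MetaDataFrame(df: DataFrame) {

  def setMeta(meta: Map[String, String]): MetaDataFrame = {
    df.rdd.setName(meta.map { case (k, v) => s"$k=$v" }.mkString(";"))
    this
  }

  def getMeta: Map[String, String] =
    Option(df.rdd.name).filter(_.nonEmpty).toSeq
      .flatMap(_.split(";"))
      .flatMap { kv =>
        kv.split("=", 2) match {
          case Array(k, v) => Some(k -> v)
          case _           => None
        }
      }
      .toMap

  // The withMetadata method from the answer above lives here as well,
  // re-applying the name after every transformation.
  def withMetadata(fn: DataFrame => DataFrame): MetaDataFrame = {
    val meta = df.rdd.name
    val result = fn(df)
    result.rdd.setName(meta)
    MetaDataFrame(result)
  }
}

object MetaDataFrame {
  implicit def toDataFrame(mdf: MetaDataFrame): DataFrame = mdf.df
  implicit def fromDataFrame(df: DataFrame): MetaDataFrame = MetaDataFrame(df)
}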

OutOfBoundsException with ALS - Flink MLlib

I'm building a recommendation system for movies, using the MovieLens datasets available here:
http://grouplens.org/datasets/movielens/
To build this recommendation system, I use Flink's ML library in Scala, and in particular the ALS algorithm (org.apache.flink.ml.recommendation.ALS).
I first map the ratings of the movie into a DataSet[(Int, Int, Double)] and then create a trainingSet and a testSet (see the code below).
My problem is that there is no bug when I'm using the ALS.fit function with the whole dataset (all the ratings), but if I just remove only one rating, the fit function doesn't work anymore, and I don't understand why.
Do You have any ideas? :)
Code used :
Rating.scala
case class Rating(userId: Int, movieId: Int, rating: Double)
PreProcessing.scala
object PreProcessing {
  def getRatings(env: ExecutionEnvironment, ratingsPath: String): DataSet[Rating] = {
    env.readCsvFile[(Int, Int, Double)](
      ratingsPath, ignoreFirstLine = true,
      includedFields = Array(0, 1, 2)).map { r => new Rating(r._1, r._2, r._3) }
  }
}
Processing.scala
object Processing {

  private val ratingsPath: String = "Path_to_ratings.csv"

  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val ratings: DataSet[Rating] = PreProcessing.getRatings(env, ratingsPath)

    val trainingSet: DataSet[(Int, Int, Double)] =
      ratings
        .map(r => (r.userId, r.movieId, r.rating))
        .sortPartition(0, Order.ASCENDING)
        .first(ratings.count().toInt)

    val als = ALS()
      .setIterations(10)
      .setNumFactors(10)
      .setBlocks(150)
      .setTemporaryPath("/tmp/tmpALS")

    val parameters = ParameterMap()
      .add(ALS.Lambda, 0.01) // After some tests, this value seems to fit the problem
      .add(ALS.Seed, 42L)

    als.fit(trainingSet, parameters)
  }
}
"But if I just remove only one rating"
val trainingSet: DataSet[(Int, Int, Double)] =
  ratings
    .map(r => (r.userId, r.movieId, r.rating))
    .sortPartition(0, Order.ASCENDING)
    .first((ratings.count() - 1).toInt)
The error :
06/19/2015 15:00:24 CoGroup (CoGroup at org.apache.flink.ml.recommendation.ALS$.updateFactors(ALS.scala:570))(4/4) switched to FAILED
java.lang.ArrayIndexOutOfBoundsException: 5
at org.apache.flink.ml.recommendation.ALS$BlockRating.apply(ALS.scala:358)
at org.apache.flink.ml.recommendation.ALS$$anon$111.coGroup(ALS.scala:635)
at org.apache.flink.runtime.operators.CoGroupDriver.run(CoGroupDriver.java:152)
...
The problem is the first operator in combination with the setTemporaryPath parameter of Flink's ALS implementation. In order to understand the problem, let me quickly explain how the blocking ALS algorithm works.
The blocking implementation of alternating least squares first partitions the given ratings matrix user-wise and item-wise into blocks. For these blocks, routing information is calculated. This routing information says which user/item block receives which input from which item/user block, respectively. Afterwards, the ALS iteration is started.
Since Flink's underlying execution engine is a parallel streaming dataflow engine, it tries to execute as many parts of the dataflow as possible in a pipelined fashion. This requires all operators of the pipeline to be online at the same time. The advantage is that Flink avoids materializing intermediate results, which might be prohibitively large. The disadvantage is that the available memory has to be shared among all running operators. In the case of ALS, where the individual DataSet elements (e.g. user/item blocks) are rather large, this is not desired.
In order to solve this problem, not all operators of the implementation are executed at the same time if you have set a temporaryPath. The path defines where the intermediate results can be stored. Thus, if you've defined a temporary path, ALS first calculates the routing information for the user blocks and writes it to disk, then calculates the routing information for the item blocks and writes it to disk, and last but not least starts the ALS iteration, for which it reads the routing information from the temporary path.
The calculation of the routing information for the user and item blocks both depend on the given ratings data set. In your case, when the user routing information is calculated, the ratings data set is read and the first operator is applied to it. The first operator returns n arbitrary elements from the underlying data set. The problem is that Flink does not store the result of this first operation for the calculation of the item routing information. Instead, when the calculation of the item routing information starts, Flink re-executes the dataflow from its sources. This means that it reads the ratings data set from disk and applies the first operator on it again. In many cases this gives you a different set of ratings than the first invocation of first returned. Therefore, the generated routing information is inconsistent and ALS fails.
You can circumvent the problem by materializing the result of the first operator and using this result as the input for the ALS algorithm. The object FlinkMLTools contains a method persist which takes a DataSet, writes it to the given path and then returns a new DataSet which reads the DataSet that was just written. This allows you to break up the resulting dataflow graph.
val firstTrainingSet: DataSet[(Int, Int, Double)] =
  ratings
    .map(r => (r.userId, r.movieId, r.rating))
    .first((ratings.count() - 1).toInt)

val trainingSet = FlinkMLTools.persist(firstTrainingSet, "/tmp/tmpALS/training")

val als = ALS()
  .setIterations(10)
  .setNumFactors(10)
  .setBlocks(150)
  .setTemporaryPath("/tmp/tmpALS/")

val parameters = ParameterMap()
  .add(ALS.Lambda, 0.01) // After some tests, this value seems to fit the problem
  .add(ALS.Seed, 42L)

als.fit(trainingSet, parameters)
Alternatively, you can try to leave the temporaryPath unset. Then all steps (routing information calculation and the ALS iteration) are executed in a pipelined fashion. This means that both the user and item routing information calculations use the same input data set, which results from the first operator.
The Flink community is currently working on keeping intermediate results of operators in memory. This will make it possible to pin the result of the first operator so that it won't be calculated twice and, thus, won't produce differing results due to its non-deterministic nature.