This is what my streaming data looks like:
time | id | group
---- | --- | -----
1 | a1 | b1
2 | a1 | b2
3 | a1 | b3
4 | a2 | b3
Consider all of the rows above as falling within one window. My use case is to keep only the latest row for each distinct id, so I need the output to look like this:
time | id | group
---- | --- | -----
3 | a1 | b3
4 | a2 | b3
How can I achieve this in Flink?
I am aware of WindowFunction, but I cannot wrap my head around how to apply it here. I have tried the following just to get the distinct ids -- how can I extend this function to my use case?
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

class DistinctGrid extends WindowFunction[UserMessage, String, Tuple, TimeWindow] {
  override def apply(key: Tuple, window: TimeWindow, input: Iterable[UserMessage], out: Collector[String]): Unit = {
    // Collect the distinct ids seen in this window and emit each one once.
    val distinctIds = input.map(_.id).toSet
    for (id <- distinctIds) {
      out.collect(id)
    }
  }
}
If you key the stream by the id field, then there's no need to think about distinct ids -- you'll have a separate window for each distinct key. Your window function then just needs to iterate over the window contents, find the UserMessage with the largest timestamp, and output that as the result of the window (for that key). However, there's a built-in function that does exactly that -- look at the documentation for maxBy() -- so no window function is needed in this case.
Roughly speaking then, this will look like
stream.keyBy("id")
  .timeWindow(Time.minutes(10))
  .maxBy("time")
  .print()
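If you did want to spell it out yourself with a WindowFunction (closer to what you started above), here is a minimal sketch -- assuming UserMessage has a numeric time field, as in your table, and reusing the imports from your snippet:
class LatestPerId extends WindowFunction[UserMessage, UserMessage, Tuple, TimeWindow] {
  override def apply(key: Tuple, window: TimeWindow, input: Iterable[UserMessage], out: Collector[UserMessage]): Unit = {
    // The stream is keyed by id, so this window only holds messages for one id;
    // emit the single message with the largest timestamp (assumes a time field on UserMessage).
    out.collect(input.maxBy(_.time))
  }
}
You would plug it in with stream.keyBy("id").timeWindow(Time.minutes(10)).apply(new LatestPerId), but maxBy("time") is simpler and does the same thing.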
Related
I am new to Scala and am currently studying Datasets for Scala and Spark. Based on my input dataset below, I am trying to create a new dataset (also shown below). In the new dataset, I aim to have a column that contains a Seq[order_summary], where order_summary is a Scala trait holding the corresponding Name, Ticket Number, and Seat Number taken from the input dataset.
I have used input_dataset.groupBy("Name") to organise the dataset and have tried df.withColumn("NewColumn", struct(df("a"), df("b"))) to combine columns. However, I would like to use a Scala trait instead, and I am also stuck on matching each name to its ticket numbers. Would anyone know how to resolve this or point me in the right direction?
Input dataset: input_dataset
Name Type is String. Ticket Number Type is Int
+----+---------------+-------------+
|Name| Ticket Number | Seat Number |
+----+---------------+-------------+
|Adam| 123 | AB |
|Adam| 456 | AC |
|Adam| 789 | AD |
|Bob | 1234 | BA |
|Bob | 5678 | BB |
|Sam | 987 | CA |
|Sam | 654 | CB |
|Sam | 321 | CC |
|Sam | 876 | CD |
+----+---------------+-------------+
Output dataset
Name Type is String. Purchase Order Summary is a trait, Seq[order_summary]
+----+-----------------------------------------------------+
|Name| Purchase Order Summary |
+----+-----------------------------------------------------+
|Adam|((Adam,123,AB),(Adam,456,AC),(Adam,789,AD)) |
|Bob |((Bob,1234,BA),(Bob,5678,BB)) |
|Sam |((Sam,987,CA),(Sam,654,CB),(Sam,321,CC),(Sam,876,CD))|
+----+-----------------------------------------------------+
Pretty sure Spark has a map method.
So you could just create a case class
case class PurchaseOrderSummary(name: String, ticketNum: Int, seatNum: String)
and instantiate it inside a map from your DF, then collect it into a list.
df.map(row => PurchaseOrderSummary(row.getString(0), row.getInt(1), row.getString(2))).collectAsList
collectAsList brings the data back to the driver as a java.util.List[PurchaseOrderSummary]; use collect instead if you prefer a Scala collection.
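Since the desired output has one row per Name with a sequence of summaries, here is a minimal sketch of the grouping step -- assuming a SparkSession in scope as spark, the column order shown in input_dataset, and the case class above (a case class rather than a trait, so Spark can derive an encoder for it):
import spark.implicits._

val grouped = input_dataset
  .map(row => PurchaseOrderSummary(row.getString(0), row.getInt(1), row.getString(2)))
  .groupByKey(_.name)                                      // one group per Name
  .mapGroups((name, summaries) => (name, summaries.toSeq)) // one row per Name with its summaries
  .toDF("Name", "Purchase Order Summary")

grouped.show(truncate = false)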
I have two tables:
A - shows whether a given topic has been processed
B - all possible topicNames related to a given projectId
What I'd like returned is a table showing the topics left to be processed. Assuming table B contains all possible topicNames, I want to exclude the rows that appear in table A and show only B - A (ghi, jkl). To illustrate this, please look at table C below:
I'm really struggling to get the right query. Any hints on that?
A:
fieldId | projectId | topicName
-------------------------------
1 | A | abc
1 | A | def
B:
fieldId | projectId | topicName
--------------------------------
1 | A | abc
1 | A | def
1 | A | ghi
1 | A | jkl
What I want - Table C:
C:
fieldId | projectId | topicName
-------------------------------
1 | A | ghi
1 | A | jkl
You are looking for EXCEPT, applied to two queries; it is basically the opposite of UNION.
EXCEPT returns all rows that are in the result of query1 but not in the result of query2. (This is sometimes called the difference between two queries.) Again, duplicates are eliminated unless EXCEPT ALL is used.
For your case, something like:
select fieldId, projectId, topicName from B
except
select fieldId, projectId, topicName from A;
Consider the dataframe below, listing the stores and the books available in them:
+-----------+------+-------+
| storename | book | price |
+-----------+------+-------+
| S1 | B11 | 10$ | <<
| S2 | B11 | 11$ |
| S1 | B15 | 29$ | <<
| S2 | B10 | 25$ |
| S2 | B16 | 30$ |
| S1 | B09 | 21$ | <
| S3 | B15 | 22$ |
+-----------+------+-------+
Suppose we need to find the stores which have two particular books, B11 and B15. Here the answer is S1, as it stocks both books.
One way of doing this is to find the intersection of the stores having book B11 with the stores having book B15, using the command below:
val df_select = df.filter($"book" === "B11").select("storename")
.join(df.filter($"book" === "B15").select("storename"), Seq("storename"), "inner")
which gives the names of the stores having both. But instead I want a table like this:
+-----------+------+-------+
| storename | book | price |
+-----------+------+-------+
| S1 | B11 | 10$ | <<
| S1 | B15 | 29$ | <<
| S1 | B09 | 21$ | <
+-----------+------+-------+
which contains all records for the store that fulfils the requirement. Note that B09 is not left out. (Use case: the user can explore some other books in the same store as well.)
We can get this by joining the result above back to the original dataframe:
df_select.join(df, Seq("storename"), "inner")
But I see a scalability and readability issue with step 1, as I have to keep joining one dataframe to another whenever there are more than two books. That is painful and error-prone. Is there a more elegant way to do the same thing? Something like:
val storewise = Window.partitionBy("storename")
df.filter($"book".contains{"B11", "B15"}.over(storewise))
Found a simple solution using the array_except function:
Add the required set of field values as an array in a new column, req_books.
Add a column, all_books, holding all the books stocked by a store, using a Window.
Using the above two columns, determine whether the store is missing any required book, and filter it out if it is.
Drop the extra columns created.
Code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df1 = df.withColumn("req_books", array(lit("B11"), lit("B15")))
  .withColumn("all_books", collect_set("book").over(Window.partitionBy("storename")))

df1.withColumn("missing_books", array_except(col("req_books"), col("all_books")))
  .filter(size(col("missing_books")) === 0)
  .drop("missing_books", "all_books", "req_books")
  .show
Alternatively, use a window function to collect an array of all the books per store and check whether it contains all the necessary values:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val bookList = List("B11", "B15") // list of books to search for
def arrayContainsMultiple(bookList: Seq[String]) = udf((allBooks: WrappedArray[String]) => allBooks.intersect(bookList).sorted.equals(bookList.sorted))
val filteredDF = input
  .withColumn("allBooks", collect_set($"book").over(Window.partitionBy($"storename")))
  .filter(arrayContainsMultiple(bookList)($"allBooks"))
  .drop($"allBooks")
I have a dataframe like this:
|-----+-----+-------+---------|
| foo | bar | fox | cow |
|-----+-----+-------+---------|
| 1 | 2 | red | blue | // row 0
| 1 | 2 | red | yellow | // row 1
| 2 | 2 | brown | green | // row 2
| 3 | 4 | taupe | fuschia | // row 3
| 3 | 4 | red | orange | // row 4
|-----+-----+-------+---------|
I need to group the records by "foo" and "bar" and then perform some magical computation on "fox" and "cow" to produce "badger", which may insert or delete records:
|-----+-----+-------+---------+---------|
| foo | bar | fox | cow | badger |
|-----+-----+-------+---------+---------|
| 1 | 2 | red | blue | zebra |
| 1 | 2 | red | blue | chicken |
| 1 | 2 | red | yellow | cougar |
| 2 | 2 | brown | green | duck |
| 3 | 4 | red | orange | peacock |
|-----+-----+-------+---------+---------|
(In this example, row 0 has been split into two "badger" values, and row 3 has been deleted from the final output.)
My best approach so far looks like this:
val groups = df.select("foo", "bar").distinct
groups.flatMap(row => {
  val (foo, bar): (String, String) = (row.getString(0), row.getString(1))
  val group: DataFrame = df.where(s"foo == '$foo' AND bar == '$bar'")
  val rowsWithBadgers: List[Row] = makeBadgersFor(group)
  rowsWithBadgers
})
This approach has a few problems:
It's clumsy to match on foo and bar individually. (A utility method can clean that up, so not a big deal.)
It throws an Invalid tree: null\nnull error because of the nested operation, where I try to refer to df from inside groups.flatMap. I don't know how to get around that one yet.
I'm not sure whether this mapping and filtering actually leverages Spark distributed computation correctly.
Is there a more performant and/or elegant approach to this problem?
This question is very similar to Spark DataFrame: operate on groups, but I'm including it here because 1) it's not clear if that question requires addition and deletion of records, and 2) the answers in that question are out-of-date and lacking detail.
I don't see a way to accomplish this with groupBy and a user-defined aggregate function, because an aggregation function aggregates to a single row. In other words,
udf(<records with foo == 'foo' && bar == 'bar'>) => [foo,bar,aggregatedValue]
I need to possibly return two or more different rows, or zero rows after analyzing my group. I don't see a way for aggregation functions to do this -- if you have an example, please share.
A user-defined aggregator could be used.
The single value it returns per group can contain a list.
You can then explode that list into multiple rows and reconstruct the columns.
The aggregator:
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoders.kryo
import org.apache.spark.sql.expressions.Aggregator

case class StuffIn(foo: BigInt, bar: BigInt, fox: String, cow: String)
case class StuffOut(foo: BigInt, bar: BigInt, fox: String, cow: String, badger: String)

object StuffOut {
  def apply(stuffIn: StuffIn): StuffOut =
    new StuffOut(stuffIn.foo, stuffIn.bar, stuffIn.fox, stuffIn.cow, "dummy")
}

object MultiLineAggregator extends Aggregator[StuffIn, Seq[StuffOut], Seq[StuffOut]] {
  // The buffer is the full sequence of output rows built so far for the group.
  def zero: Seq[StuffOut] = Seq[StuffOut]()
  def reduce(buffer: Seq[StuffOut], stuff: StuffIn): Seq[StuffOut] = {
    // Stands in for the per-group "badger" logic; it may add or drop rows.
    makeBadgersForDummy(buffer, stuff)
  }
  def merge(b1: Seq[StuffOut], b2: Seq[StuffOut]): Seq[StuffOut] = b1 ++: b2
  def finish(reduction: Seq[StuffOut]): Seq[StuffOut] = reduction
  def bufferEncoder: Encoder[Seq[StuffOut]] = kryo[Seq[StuffOut]]
  def outputEncoder: Encoder[Seq[StuffOut]] = kryo[Seq[StuffOut]]
}
The call:
val badgerAggregator: TypedColumn[StuffIn, Seq[StuffOut]] = MultiLineAggregator.toColumn

val res: DataFrame =
  ds.groupByKey(x => (x.foo, x.bar))
    .agg(badgerAggregator)
    .map(_._2)
    .withColumn("value", explode($"value"))
    .withColumn("foo", $"value.foo")
    .withColumn("bar", $"value.bar")
    .withColumn("fox", $"value.fox")
    .withColumn("cow", $"value.cow")
    .withColumn("badger", $"value.badger")
    .drop("value")
I have no idea what's going on here. Maybe I've been staring at this code for too long.
The query I have is as follows:
CREATE VIEW v_sku_best_before AS
SELECT
sw.sku_id,
sw.sku_warehouse_id "A",
sbb.sku_warehouse_id "B",
sbb.best_before,
sbb.quantity
FROM SKU_WAREHOUSE sw
LEFT OUTER JOIN SKU_BEST_BEFORE sbb
ON sbb.sku_warehouse_id = sw.warehouse_id
ORDER BY sbb.best_before
I can post the table definitions if that helps, but I'm not sure it will. Suffice it to say that SKU_WAREHOUSE.sku_warehouse_id is an identity column, and SKU_BEST_BEFORE.sku_warehouse_id is a child column that uses that identity as a foreign key.
Here's the result when I run the query:
+--------+-----+----+-------------+----------+
| sku_id | A | B | best_before | quantity |
+--------+-----+----+-------------+----------+
| 20251 | 643 | 11 | <<null>> | 140 |
+--------+-----+----+-------------+----------+
(1 row)
The join specifies that the sku_warehouse_id columns have to be equal, but when I pull the ID from each table (labelled as A and B) they're different.
What am I doing wrong?
Perhaps the join should be ON sbb.sku_warehouse_id = sw.sku_warehouse_id instead of sw.warehouse_id? As written, the query compares SKU_BEST_BEFORE.sku_warehouse_id against SKU_WAREHOUSE.warehouse_id, a different column, which is why the two IDs labelled A and B don't match.