How to access `global state` instead of just `group state` within `flatMapGroupsWithState`? - spark-structured-streaming

I'm trying to migrate a stream processing project to Spark Structured Streaming.
Within this project, there is a correlation logic like this:
A dict with initial values:
{
  1: [2, 3],
  4: [5, 6],
}
Then a new input arrives, saying that 2 and 5 should be correlated together.
We know the key for 2 is 1 and the key for 5 is 4, so we merge everything from entry 4 into entry 1 (the key 4 itself becomes a value too).
Finally, the dict becomes { 1: [2, 3, 4, 5, 6] }.
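In plain Scala, that merge step looks roughly like this (a sketch using a local mutable Map in place of the distributed database):

import scala.collection.mutable

// Initial state: key -> correlated values
val state = mutable.Map(1 -> mutable.Buffer(2, 3), 4 -> mutable.Buffer(5, 6))

// Correlate a and b: find the entry owning each value, then fold the
// second entry (its key included) into the first and drop it
def correlate(a: Int, b: Int): Unit = {
  val keyA = state.collectFirst { case (k, vs) if vs.contains(a) => k }.get
  val keyB = state.collectFirst { case (k, vs) if vs.contains(b) => k }.get
  if (keyA != keyB) {
    state(keyA) ++= keyB +: state(keyB)
    state -= keyB
  }
}

correlate(2, 5)
// state is now Map(1 -> Buffer(2, 3, 4, 5, 6))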
Currently, we use a distributed database to store the dict. But with Spark, I want to retire the database and rely only on Spark's in-memory state.
According to this tutorial, I created a mapping function:
def mappingFunction(
    key: String,
    values: Iterator[Input],
    state: GroupState[State]
): Iterator[...] = {
}
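For context, here is roughly how such a function gets wired up (a sketch; Input, State, Output, and keyFor are hypothetical placeholders):

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Input(a: Int, b: Int)
case class State(values: List[Int])
case class Output(key: String, values: List[Int])

// inputs: Dataset[Input] read from the stream, spark.implicits._ in scope
val results = inputs
  .groupByKey(in => keyFor(in))  // everything below is scoped to one key
  .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.NoTimeout())(
    mappingFunction
  )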
But it seems I can only access the state of the specific key (the first parameter of this function).
My questions are:
If I receive <2, 5>, how can I update the group state of key 1 and delete the group state of key 4?
Can we rely on Spark for maintaining a complicated state like this? Or is a distributed global state store always needed for this case?
Thanks!

Related

How to do multiple WHERE queries without affecting data in TypeORM?

I want to do multiple WHERE queries without affecting the data. I want to get rows that include at least one value from each array. Pseudocode:
data = [1, 3]
array1 = [1, 2]
array2 = [3, 4]
if (data.IsIntersect(array1) and data.IsIntersect(array2))
IsIntersect checks whether there is an intersection between two arrays.
What I have so far:
queryBuilder.andWhere(
'properties.id IN (:...sizeIds) AND properties.id IN (:...colorIds)',
{ sizeIds: [1, 2], colorIds: [3, 4] },
);
It returns empty because it first checks properties for sizeIds and then checks for colorIds. For example:
properties includes 1 and 3
the check for sizeIds returns 1
the check for colorIds then returns empty
How can I do that with TypeORM?
How can properties.id be 1 and 3? And if it is, how could 1 or 3 be in both? You're asking for the impossible.
I assume you mean to ask for when properties.id is 1 or 3, because if it is [1,3] then you should use the Postgres array syntax {1,3} and the ANY keyword (some variation on this: Check if value exists in Postgres array).
TL;DR: I think all you need is brackets and OR instead of AND:
queryBuilder.andWhere(
'(properties.id IN (:...sizeIds) OR properties.id IN (:...colorIds))',
{ sizeIds: [1, 2], colorIds: [3, 4] },
);
If properties.id is in fact an array, then please add the entity definition to your question. If you want to merge the rows where properties.id is in the list, you will need a GROUP BY (https://orkhan.gitbook.io/typeorm/docs/select-query-builder).

Can I retrieve the second to last value of a stream?

I'm building a function that requires both the previous and the current value of a stream.
I managed to work around it, but I was wondering if there is some way to retrieve the second-to-last value of a stream.
You can use rxdart's pairwise:
RangeStream(1, 4)
.pairwise()
.listen(print); // prints [1, 2], [2, 3], [3, 4]
You will always get a List containing the current emitted value and the previous one. Just be aware this will only emit once there are 2 items, so if you need the first value ASAP, this might not be the best solution for you.
A simple way to solve this is to just save the emitted value to an external variable. This usually isn't recommended, as Streams are supposed to be encapsulated from external code, but for many cases it is the simpler option.
If you really need the first value, you can duplicate your stream and consume the first value only once, then let pairwise() do its magic. Here's one solution using the async and rxdart packages:
Stream<int> stream = Stream.fromIterable([0, 1, 2, 3, 4, 5]);
List<Stream<int>> splitted = StreamSplitter.splitFrom(stream);
splitted[0].take(1).listen(print); // prints 0 immediately
splitted[1].pairwise().listen(print); // prints [0, 1], [1, 2], [2, 3], [3, 4], [4, 5]
Of course you can also merge them and get all of it in one stream.

Data structures being used in MongoDB (B-Trees etc)

Assume I am going to insert the following 5 documents into a MongoDB collection:
{ id: 1, name: "Bob", age: 34 }
{ id: 2, name: "Jane", age: 22 }
{ id: 3, name: "Mike", age: 44 }
{ id: 4, name: "Sam", age: 55 }
{ id: 5, name: "Joe", age: 21 }
1) What data structure are these 5 objects stored in (before building an index)?
2) I now build an index on the age field. As I understand it, a B-tree will now be created containing those 5 objects. But what happens to the previous data structure: are the objects still located in it as well?
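For reference, the same setup in code, using the MongoDB Scala driver (a sketch; database names and connection details are hypothetical):

import org.mongodb.scala._
import org.mongodb.scala.model.Indexes
import scala.concurrent.Await
import scala.concurrent.duration._

val client = MongoClient("mongodb://localhost:27017")
val people = client.getDatabase("test").getCollection("people")

// Insert the 5 documents
val docs = Seq(
  Document("id" -> 1, "name" -> "Bob", "age" -> 34),
  Document("id" -> 2, "name" -> "Jane", "age" -> 22),
  Document("id" -> 3, "name" -> "Mike", "age" -> 44),
  Document("id" -> 4, "name" -> "Sam", "age" -> 55),
  Document("id" -> 5, "name" -> "Joe", "age" -> 21)
)
Await.result(people.insertMany(docs).toFuture(), 10.seconds)

// Build a secondary index on age: this creates a separate B-tree keyed by
// age; it does not move or replace the stored documents themselves
Await.result(people.createIndex(Indexes.ascending("age")).toFuture(), 10.seconds)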

Dynamic Json Keys in Scala

I'm new to Scala (coming from Python) and I'm trying to create a JSON object that has dynamic keys. I would like to use some starting number as the top-level key and then combinations involving that number as second-level keys.
From reading the play-json docs/examples, I've seen how to build these nested structures. While that will work for the top-level keys (there are only 17 of them), this is a combinatorial problem, and the power set contains ~130k combinations that would be the second-level keys, so it isn't feasible to write that structure out. I also saw the use of a case class for structures; however, the parameter name becomes the key in those instances, which is not what I'm looking for.
Currently, I'm considering using HashMaps with the MultiMap trait so that I can map multiple combinations to the same original starting number and then second-level keys would be the combinations themselves.
I have Python code that does this, but it takes 3-4 days to work through the up-to-9-number combinations for all 17 starting numbers. The ideal final format would look something like below.
Perhaps it isn't possible to do in Scala, given the goal of using immutable structures. I suppose using regex on a string of the output might be an option as well. I'm open to any solutions regarding data structures to hold the info and how to approach the problem. Thanks!
{
  "2": {
    "(2, 3, 4, 5, 6)": {
      "best_permutation": "(2, 4, 3, 5, 6)",
      "amount": 26.0
    },
    "(2, 4, 5, 6)": {
      "best_permutation": "(2, 5, 4, 6)",
      "amount": 21.0
    }
  },
  "3": {
    "(3, 2, 4, 5, 6)": {
      "best_permutation": "(3, 4, 2, 5, 6)",
      "amount": 26.0
    },
    "(3, 4, 5, 6)": {
      "best_permutation": "(3, 5, 4, 6)",
      "amount": 21.0
    }
  }
}
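For what it's worth, play-json doesn't require a case class per key: a JsObject wraps a sequence of (String, JsValue) pairs, so dynamic keys can come straight from Map keys. A minimal sketch (Result is a hypothetical holder type):

import play.api.libs.json._

case class Result(bestPermutation: String, amount: Double)

// Nested Maps with arbitrary string keys -> nested JsObjects
def toJson(results: Map[String, Map[String, Result]]): JsObject =
  JsObject(results.map { case (start, combos) =>
    start -> JsObject(combos.map { case (combo, r) =>
      combo -> Json.obj(
        "best_permutation" -> r.bestPermutation,
        "amount" -> r.amount
      )
    }.toSeq)
  }.toSeq)

// Json.prettyPrint(toJson(results)) then renders the structure above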
EDIT:
There is no real data source other than the matrix I'm using as my lookup table. I've posted links to the lookup table and the program in case they help, but essentially I'm generating the content myself within the code.
For a given combination, I have a function that takes the first value of the combination (which is the starting point) and uses the tail of the combination to generate a permutation.
After that, I prepend the starting location to the front of each permutation and use sliding(2) to work my way through it, looking up the amount for each sliding pair in the breeze.linalg.DenseMatrix provided below, using the two values (each minus 1, to account for 0-based indexing) as indexes, and summing the amounts gathered this way.
At this point, it is just a matter of gathering the information (starting_location, combination, best_permutation, and the amount) and constructing the nested HashMap. I'm using Scala 2.11.8, if it makes any difference.
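In code, that sliding lookup boils down to something like this (a sketch; names are hypothetical):

import breeze.linalg.DenseMatrix

// Amount of one permutation: sum the matrix entries for each consecutive
// pair, subtracting 1 from each value for 0-based indexing
def pathAmount(matrix: DenseMatrix[Double], perm: Seq[Int]): Double =
  perm.sliding(2).map { case Seq(a, b) => matrix(a - 1, b - 1) }.sum

// e.g. pathAmount(matrix, Seq(2, 4, 3, 5, 6))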
MATRIX: see here.
PROGRAM: see here.

Why does Data.append(MutableRangeReplaceableRandomAccessSlice<Data>) append slice.count bytes from the beginning of the base collection?

Using Data.append(MutableRangeReplaceableRandomAccessSlice), I expected the bytes within the start/end indexes of the provided slice to be appended onto the Data instance. Instead, it appears Slice.count bytes from the beginning of the Slice.base underlying collection are appended. In contrast, instantiating Data with a slice results in the bytes between the slice's start and end indexes populating the instance.
// Swift Playground, Xcode Version 8.3 (8E162)
import Foundation
var fooData = Data()
let barData = Data([0, 1, 2, 3, 4, 5])
let slice = barData.suffix(from: 3)
fooData.append(slice) // [0, 1, 2]
Data(slice) // [3, 4, 5]
Is this the expected behavior and, if so, what might help me better understand the behavior of Data.append in this context, and its differences from Data.init?
Additionally, given that the docs for MutableRangeReplaceableRandomAccessSlice encourage using slices "only for transient computation", do Data.init and Data.append reference the Slice.base collection or create their own copy of the bytes?
I've filed a JIRA issue, which is likely the best place to continue tracking a possible answer:
https://bugs.swift.org/browse/SR-4473