Apache Beam - join two collections with unequal number of rows - apache-beam

I have two collections with an unequal number of elements, say A and B. I need to split A and B into groups and randomly assign rows to one another within those groups.
Assumptions:
A and B are split into an equal number of groups.
The number of elements in each group derived from B is 10 times the number in the corresponding group derived from A.
An element from a B group can be assigned to at most one element from the matching A group (or not matched at all).
Elements within matching groups should be assigned randomly.
My current code looks like this:
def join(grouped_ab):
    (key, (rows_a, rows_b)) = grouped_ab
    sample = random.sample(rows_b, len(rows_a))
    for row_a, row_b in zip(rows_a, sample):
        yield {
            "a": row_a,
            "b": row_b
        }
a = p | "read a" >> ReadFromText("a.csv") \
| "split a" >> beam.ParDo(Split())
b = p | "read b" >> ReadFromText("b.csv") \
| "split b" >> beam.ParDo(Split())
(a, b) | beam.CoGroupByKey() \
| beam.ParDo(join) \
| WriteToText("out")
The above code works fine, but the problem is that the distribution of the number of elements across groups is unequal. There are hot groups which can contain up to 30% of all elements.
How can I improve my code?

Looking at the problem statement, it appears that B has many more rows than A, and that all the rows of A should be available alongside each group of B so that the lookups can happen locally. With CoGroupByKey, a shuffle of both collections is involved.
For this use case it would be better to use side inputs. The idea of a side input is that, when joining two collections, if one of them is small enough to fit in memory, the smaller dataset is broadcast to all the workers so that lookups happen locally, thereby speeding up performance.
Details can be found here - https://beam.apache.org/documentation/programming-guide/#side-inputs
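As a rough illustration, here is a minimal sketch of that approach, assuming the per-key groups of A fit in memory. Split() is the parsing DoFn from the question; the helper join_group and the side-input wiring are illustrative names, not from the original post. B is still grouped per key, but A is broadcast to the workers as a dict side input instead of being co-shuffled:
import random

import apache_beam as beam

def join_group(grouped_b, a_by_key):
    # grouped_b is (key, iterable of B rows); a_by_key maps key -> A rows.
    key, rows_b = grouped_b
    rows_a = list(a_by_key.get(key, []))
    # Per the stated assumptions, each B group is ~10x larger than its A group.
    sample = random.sample(list(rows_b), len(rows_a))
    for row_a, row_b in zip(rows_a, sample):
        yield {"a": row_a, "b": row_b}

with beam.Pipeline() as p:
    a = p | "read a" >> beam.io.ReadFromText("a.csv") \
          | "split a" >> beam.ParDo(Split())
    b = p | "read b" >> beam.io.ReadFromText("b.csv") \
          | "split b" >> beam.ParDo(Split())
    # Broadcast A, grouped per key, as a dict side input.
    a_by_key = beam.pvalue.AsDict(a | "group a" >> beam.GroupByKey())
    (b | "group b" >> beam.GroupByKey()
       | "join" >> beam.FlatMap(join_group, a_by_key=a_by_key)
       | beam.io.WriteToText("out"))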
Thanks,
Jayadeep

Related

Postgres extended statistics with partitioning

I am using Postgres 13 and have created a table with columns A, B and C. The table is partitioned by A with 2 possible values. Partition 1 contains 100 possible values each for B and C, whereas partition 2 has 100 completely different values for B and 1 different value for C. I have set the statistics target for both columns to the maximum so that this definitely doesn't cause any issue.
If I group by B and C on either partition, Postgres estimates the number of groups correctly. However, if I run the query against the base table, where I really want it, it assumes no functional dependency between A, B and C, i.e. it estimates (p1B + p2B) * (p1C + p2C) = 200 * 101 groups, as opposed to the reality of p1B * p1C + p2B * p2C = 10000 + 100.
I guess I was half expecting it to sum over the underlying partitions rather than use the full counts of 200 B values and 101 C values that the base table can see. Moreover, if I also add A to the GROUP BY, the estimate erroneously doubles further still, as it then thinks that this set will also be duplicated for each value of A.
This all made me think that I need an extended statistic to tell it that A influences either B or C or both. However, if I create one on the base table and analyze, the value in pg_statistic_ext_data.stxdndistinct is null. If I create it on the partitions themselves, this does appear to work, though it isn't particularly useful because the estimate is already correct at that level. How do I get Postgres to estimate against the base table correctly, without having to run the query against all of the partitions and unioning them together?
You can define extended statistics on a partitioned table, but PostgreSQL doesn't collect any data in that case. You'll have to create extended statistics on all partitions individually.
You can confirm that by querying the collected data after an ANALYZE:
SELECT s.stxrelid::regclass AS table_name,
       s.stxname AS statistics_name,
       d.stxdndistinct AS ndistinct,
       d.stxddependencies AS dependencies
FROM pg_statistic_ext AS s
JOIN pg_statistic_ext_data AS d
    ON d.stxoid = s.oid;
There is certainly room for improvement here; perhaps don't allow defining extended statistics on a partitioned table in the first place.
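For example, a hedged Python sketch of that per-partition workaround might look like this (psycopg2; the connection string, the parent table name "base", and the columns b and c are placeholder assumptions, not from the question):
import psycopg2

# Create the same ndistinct/dependencies statistics on every partition of
# the assumed parent table "base", then ANALYZE each partition.
conn = psycopg2.connect("dbname=test")
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    SELECT inhrelid::regclass::text
    FROM pg_inherits
    WHERE inhparent = 'base'::regclass
""")
for (partition,) in cur.fetchall():
    stat_name = "stx_" + partition.replace(".", "_").replace('"', "")
    cur.execute(
        f"CREATE STATISTICS IF NOT EXISTS {stat_name} (ndistinct, dependencies) "
        f"ON b, c FROM {partition}"
    )
    cur.execute(f"ANALYZE {partition}")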
I found that I just had to turn enable_partitionwise_aggregate on (SET enable_partitionwise_aggregate = on;) to get this to estimate correctly.

CASE in JOIN not working PostgreSQL

I got the following tables:
Teams
Matches
I want to get an output like:
matches.semana | teams.nom_equipo | teams.nom_equipo | Winner
1              | AMERICA          | CRUZ AZUL        | AMERICA
1              | SANTOS           | MORELIA          | MORELIA
1              | LEON             | CHIVAS           | LEON
The columns matches.num_eqpo_loc and matches.num_eqpo_vis reference teams.num_eqpo, and teams.nom_equipo is then used to get the name of each team from its id.
Edit: I have used the following:
SELECT m.semana, t_loc.nom_equipo AS LOCAL, t_vis.nom_equipo AS VISITANTE,
       CASE WHEN m.goles_loc > m.goles_vis THEN 'home'
            WHEN m.goles_vis > m.goles_loc THEN 'visitor'
            ELSE 'tie'
       END AS Vencedor
FROM matches AS m
JOIN teams AS t_loc ON (m.num_eqpo_loc = t_loc.num_eqpo)
JOIN teams AS t_vis ON (m.num_eqpo_vis = t_vis.num_eqpo)
ORDER BY m.semana;
But as you can see from the Matches table, in row #5 the goles_loc (home team) and goles_vis (visitor) columns show 2 vs 2 (goals, home vs visitor), which is a tie, yet when I run the code I get something that is not a tie:
Matches' score
Result set from the SELECT:
I also noticed that from row #5 onward the names of both teams in the matches are not correct (both visitor and home team).
So the SELECT brings back correct data, but in an order different from the original order of the matches table.
The order from the second week must be:
row | matches.semana | teams.nom_equipo | teams.nom_equipo | Winner
5   | 2              | CRUZ AZUL        | TOLUCA           | TIE
6   | 2              | MORELIA          | LEON             | LEON
7   | 2              | CHIVAS           | SANTOS           | TIE
Row 8 from the result set must be row #5, and so on.
Any help would be really appreciated!
When a SELECT list includes a literal NULL for a column, that's the value the column will always have, so a winner column written that way will never be populated.
Something like this is probably more along the lines of what you want:
SELECT m.semana, t_loc.nom_equipo AS loc_equipo, t_vis.nom_equipo AS vis_equipo,
       CASE WHEN m.goles_loc - m.goles_vis > 0 THEN t_loc.nom_equipo
            WHEN m.goles_vis - m.goles_loc > 0 THEN t_vis.nom_equipo
            ELSE NULL
       END AS winner
FROM matches AS m
JOIN teams AS t_loc ON (m.num_eqpo_loc = t_loc.num_eqpo)
JOIN teams AS t_vis ON (m.num_eqpo_vis = t_vis.num_eqpo)
ORDER BY m.semana;
Untested, but this should give the general approach. Basically, you JOIN to the teams table twice, but using different conditions, and then you calculate the winner from the scores. I'm using NULL to indicate a tie here.
Edit in response to comment from OP:
It's the same table, teams, but the JOINs produce different results because the query uses different JOIN conditions in each JOIN.
The first JOIN, for t_loc, compares m.num_eqpo_loc to t_loc.num_eqpo. This means it fetches the teams rows for the home team.
The second JOIN, for t_vis, compares m.num_eqpo_vis to t_vis.num_eqpo. This means it fetches the teams rows for the visiting team.
Therefore, in the CASE expression, t_loc refers to the home team while t_vis refers to the visiting one, so both can be used there and the winning team's name can be selected.
Edit in response to follow-up comment from OP:
My original query was sorting only by m.semana, which means the other columns can appear in any order (essentially whichever order Postgres finds most efficient).
If you want the resulting table to be sorted exactly the same way as the matches table, use the same column tuple in its ORDER BY.
So, the ORDER BY clause would then become:
ORDER BY m.semana, m.num_eqpo_loc, m.num_eqpo_vis
Basically, the matches table PRIMARY KEY tuple.

Merge neo4j relationships into one while returning the result if certain condition satisfies

My use case is:
I have to return the whole graph in the result, but with one condition:
If there is more than one relationship between two particular nodes in the same direction, I have to merge them into a single relationship. For example, say there are two nodes 'm' and 'n' with three relationships between them, r1, r2 and r3 (all in the same direction); then the result of the cypher query should contain only one relationship between 'n' and 'm'.
I also need to perform some operations on top of this: the resultant relationship produced by the merge should carry the properties and values that I want to retain. In practice I will retain all the properties of one of the merged relationships, chosen according to the timestamp property that every relationship has.
Note: all my relationships have the same properties (the number and names of the properties are the same across all relationships; the values may of course differ).
Any help would be appreciated. Thanks in advance.
You mean something like this?
Delete all except the first
MATCH (a)-[r]->(b)
WITH a,b,type(r) as type, collect(r) as rels
FOREACH (r in rels[1..] | DELETE r)
Ordering by timestamp first
MATCH (a)-[r]->(b)
WITH a,r,b
ORDER BY r.timestamp DESC
WITH a,b,type(r) as type, collect(r) as rels
FOREACH (r in rels[1..] | DELETE r)
If you want to do all those operations virtually, just on the query results, you'd do them in your programming language of choice, as sketched below.
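If you go that client-side route, here is a hypothetical Python sketch using the official neo4j driver: it groups parallel relationships by start node, end node and type (mirroring the Cypher above) and keeps the property map of the relationship with the latest timestamp. The connection details are placeholders:
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (a)-[r]->(b)
RETURN id(a) AS src, id(b) AS dst, type(r) AS rel_type, properties(r) AS props
"""

merged = {}  # (src, dst, rel_type) -> properties of the newest relationship
with driver.session() as session:
    for record in session.run(query):
        key = (record["src"], record["dst"], record["rel_type"])
        props = record["props"]
        # Retain whichever parallel relationship carries the greatest timestamp.
        if key not in merged or props["timestamp"] > merged[key]["timestamp"]:
            merged[key] = props

for (src, dst, rel_type), props in merged.items():
    print(src, rel_type, dst, props)

driver.close()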

How can I merge two streams with identical columns in Pentaho?

I am a new user of Pentaho and maybe my question is very simple. I have two streams with identical columns, e.g. stream S1 has the columns A, B, C, and stream S2 has the columns A, B, C (same names, same order, same data types). I want to merge or append these two streams into a single stream containing the columns A, B, C. However, when I use Merge join (with the option FULL OUTER JOIN), my result is a stream with the columns A, B, C, A_1, B_1, C_1, which is not what I want. I tried to use the Append streams step, but in that case nothing appeared in the preview.
As per your requirement, first create the two streams. Here we have taken two streams, i.e. "stream1.xls" and "stream2.xls", and then built the transformation using the "Sorted merge" step.
For a better understanding, please refer to the screenshots.

Why doesn't "((left union right) union other)" behave associatively?

The code in the following gist is lifted almost verbatim out of a lecture in Martin Odersky's Functional Programming Principles in Scala course on Coursera:
https://gist.github.com/aisrael/7019350
The issue occurs in line 38, within the definition of union in class NonEmpty:
def union(other: IntSet): IntSet =
  // The following expression doesn't behave associatively
  ((left union right) union other) incl elem
With the given expression, ((left union right) union other), largeSet.union(Empty) takes an inordinate amount of time to complete for sets of 100 elements or more.
When the expression is changed to (left union (right union other)), the union operation finishes almost instantly.
ADDED: Here's an updated worksheet that shows how even with larger sets/trees with random elements, the expression ((left ∪ right) ∪ other) can take forever but (left ∪ (right ∪ other)) will finish instantly.
https://gist.github.com/aisrael/7020867
The answer to your question is very much connected to relational databases and the smart choices they make. When a database "unions" tables, a smart controller system will make decisions around things like: how large is table A? Would it make more sense to join A & B first, or A & C, when the user writes:
A Join B Join C
Anyhow, you can't expect the same behavior when you are writing the code by hand, because you have specified exactly the order you want, using parentheses. None of those smart decisions can happen automatically. (Though in theory they could, and that's part of why Oracle, Teradata and MySQL exist.)
Consider a ridiculously large example:
Set A - 1 Billion Records
Set B - 500 Million Records
Set C - 10 Records
For argument's sake, assume that the union operator does O(N) work, where N is the size of the SMALLEST of the two sets being joined. This is reasonable: each key can be looked up in the other set as a hashed retrieval.
A & B runtime = O(N) = 500 million comparisons
(let's assume the class is just smart enough to use the smaller of the two sets for the lookups)
so
(A & B) & C
Results in:
O(N) 500 million + O(N) 10 = 500,000,010 comparisons
Again, this points to the fact that it was forced to compare the 1 billion records to the 500 million records FIRST, per the inner parentheses, and only then pull in the 10 more.
But consider this:
A & (B & C)
Well now something amazing happens:
(B & C) runtime O(N) = 10 record comparisons (each of the 10 C records is checked against B for existence)
then
A & (result) = O(N) = 10
Total = 20 comparisons
Notice that once (B & C) was completed, we only had to bump 10 records against 1 billion!
Both examples produce the exact same result; one in 20 comparisons, the other in 500,000,010!
To summarize, this problem illustrates in just a small way some of the complex thinking that goes into database design and the smart optimization that happens in that software. These things do not always happen automatically in programming languages unless you've coded them that way, or are using a library of some sort. You could, for example, write a function that takes several sets and intelligently decides the union order, as sketched below. But the issue becomes unbelievably complex if other set operations have to be mixed in. Hope this helps.
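As a toy sketch of that last idea (the helper name and strategy are illustrative, not from any library): union any number of sets by size, copying the largest once and merging the smaller ones into it, much like a planner picking a cheap join order.
def smart_union(sets):
    ordered = sorted(sets, key=len, reverse=True)  # largest set first
    result = set(ordered[0]) if ordered else set()
    for s in ordered[1:]:
        result |= s  # each merge only touches the smaller set's elements
    return result

# The 3-element set is merged last, touching 3 elements instead of millions.
print(smart_union([set(range(1000000)), set(range(500000)), {1, 2, 3}]) == set(range(1000000)))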
Associativity is not about performance. Two expressions may be equivalent by associativity but one may be vastly harder than the other to actually compute:
(23 * (14/2)) * (1/7)
Is the same as
23 * ((14/2) * (1/7))
But if it were me evaluating the two, I'd reach the answer (23) in a jiffy with the second one, but take longer if I forced myself to work with just the first one.
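To make that concrete for the set example, here is a small, hypothetical Python re-creation of the lecture's binary-tree IntSet that counts union calls under both associations (an illustration, not the course code):
import random

calls = 0  # global counter of union invocations

class Empty:
    def incl(self, x):
        return NonEmpty(x, Empty(), Empty())
    def union_left(self, other):
        global calls; calls += 1
        return other
    def union_right(self, other):
        global calls; calls += 1
        return other

class NonEmpty:
    def __init__(self, elem, left, right):
        self.elem, self.left, self.right = elem, left, right
    def incl(self, x):
        if x < self.elem:
            return NonEmpty(self.elem, self.left.incl(x), self.right)
        if x > self.elem:
            return NonEmpty(self.elem, self.left, self.right.incl(x))
        return self
    def union_left(self, other):
        # The slow form: ((left union right) union other) incl elem
        global calls; calls += 1
        return self.left.union_left(self.right).union_left(other).incl(self.elem)
    def union_right(self, other):
        # The fast form: (left union (right union other)) incl elem
        global calls; calls += 1
        return self.left.union_right(self.right.union_right(other)).incl(self.elem)

random.seed(1)
elems = list(range(16))
random.shuffle(elems)
s = Empty()
for x in elems:
    s = s.incl(x)

calls = 0; s.union_right(Empty()); print("right-leaning calls:", calls)
calls = 0; s.union_left(Empty()); print("left-leaning calls:", calls)
Both orders build the same set, but the left-leaning count is already dramatically larger for just 16 elements, and at around 100 elements it effectively never finishes, exactly as the question reports.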