I'm trying to join multiple tables in q:
a                b                c
key | valuea     key | valueb     key | valuec
1   | xa         1   | xb         2   | xc
2   | ya         2   | yb         4   | wc
3   | za
The expected result is
key | valuea | valueb | valuec
1   | xa     | xb     |
2   | ya     | yb     | xc
3   | za     |        |
4   |        |        | wc
This can be achieved simply with
(a uj b) uj c
BUT does anyone know how I can do it in functional form?
I don't know how many tables I actually have.
Basically I need a function that will go over the list and smash any number of keyed tables together...
f:{[x] x uj priorx};
f[] each (a;b;c;d;e...)
Can anyone help? or suggest anything?
Thanks!
Another solution, specific to your problem, which is also a little faster than your solution:
a (,')/(b;c)
figured it out... ;)
f:{[r;t]r uj t};
f/[();(a;b;c)]
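For readers less familiar with q, the fold above can be sketched in Python rather than q. This is only an illustration of the idea, not a translation of uj itself: the keyed tables are modelled as plain dicts, `union_join` is a hypothetical helper, and it assumes right-hand values win when both tables have the same key and column.

```python
from functools import reduce

def union_join(left, right):
    """Merge two keyed tables (dict: key -> {column: value}).

    Assumption for this sketch: on overlapping key/column pairs the
    right-hand value wins. Missing columns are simply absent.
    """
    merged = {key: dict(row) for key, row in left.items()}
    for key, row in right.items():
        merged.setdefault(key, {}).update(row)
    return merged

a = {1: {"valuea": "xa"}, 2: {"valuea": "ya"}, 3: {"valuea": "za"}}
b = {1: {"valueb": "xb"}, 2: {"valueb": "yb"}}
c = {2: {"valuec": "xc"}, 4: {"valuec": "wc"}}

# Fold over a list of any length, starting from an empty table,
# in the same spirit as f/[();(a;b;c)].
result = reduce(union_join, [a, b, c], {})
```

The list passed to `reduce` can hold any number of tables, which is the property the question asks for.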
I have a square table similar to this:
| c | d |
| - | - |
a | 1 | 2 |
b | 3 | 4 |
I want to calculate matrix multiplication result where this table is multiplied by itself, i.e., this:
| c | d |
| -- | - |
a | 7 | 10 |
b | 15 | 22 |
While I understand that SQL should not be my language of choice for this task, I need to do this in that language. How do I do this?
It will make your life easier if you represent your matrix elements as (i,j,a[i,j]).
WITH matrix AS (SELECT * FROM
(VALUES ('a','a',1), ('a','b',1), ('b','a',2), ('b','b',3)) AS t(i,j,a))
-- C[i,j] = sum over k of A[i,k] * A[k,j], so pair rows where m1.j = m2.i
SELECT m1.i AS i, m2.j AS j, sum(m1.a * m2.a) AS a
FROM matrix m1
JOIN matrix m2 ON m1.j = m2.i
GROUP BY m1.i, m2.j
ORDER BY i, j
This will handle sparse matrices nicely as well, since missing (zero) elements simply have no row.
Here is a dbfiddle where you can see it in action.
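To make the (i, j, a[i,j]) representation concrete, here is a minimal Python sketch of the same join-and-sum idea. The function name `matmul_triples` and the use of integer indices are assumptions made for illustration; the values are the 2x2 matrix from the question.

```python
from collections import defaultdict

def matmul_triples(left, right):
    """Multiply two matrices given as (row, col, value) triples.

    Mirrors the SQL: pair elements where left.col == right.row,
    then sum the products per (left.row, right.col).
    """
    acc = defaultdict(int)
    for i, k, x in left:
        for k2, j, y in right:
            if k == k2:
                acc[(i, j)] += x * y
    return dict(acc)

# The matrix from the question: [[1, 2], [3, 4]]
m = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
product = matmul_triples(m, m)
```

Elements that are zero can be left out of the list entirely, which is why this layout suits sparse matrices.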
Let's say I have a bunch of penguins around the country and I need to allocate food provisions (which are also distributed around the country) to the penguins.
I tried to simplify the problem as follows:
Input
The distribution of the penguins by area, grouped by proximity and prioritized as
+------------+------+-------+--------------------------------------+----------+
| PENGUIN ID | AREA | GROUP | PRIORITY (lower are allocated first) | QUANTITY |
+------------+------+-------+--------------------------------------+----------+
| P1 | A | A1 | 1 | 5 |
| P2 | A | A1 | 2 | 5 |
| P3 | A | A2 | 1 | 5 |
| P4 | B | B1 | 1 | 5 |
| P5 | B | B2 | 1 | 5 |
+------------+------+-------+--------------------------------------+----------+
The distribution of the food by area, also grouped by proximity and prioritized as
+---------+------+-------+--------------------------------------+----------+
| FOOD ID | AREA | GROUP | PRIORITY (lower are allocated first) | QUANTITY |
+---------+------+-------+--------------------------------------+----------+
| F1 | A | A1 | 2 | 5 |
| F2 | A | A1 | 1 | 2 |
| F3 | A | A2 | 1 | 7 |
| F4 | B | B1 | 1 | 7 |
+---------+------+-------+--------------------------------------+----------+
Expected output
The challenge is to allocate the food to the penguins from the same group first, respecting the priority order of both food and penguins, and then carry any leftover food over to the wider area.
So based on the above data we would first allocate within the same area and group as:
Stage 1: A1 (same area and group)
+------+-------+---------+------------+--------------------+
| AREA | GROUP | FOOD ID | PENGUIN ID | ALLOCATED_QUANTITY |
+------+-------+---------+------------+--------------------+
| A | A1 | F2 | P1 | 2 |
| A | A1 | F1 | P1 | 3 |
| A | A1 | F1 | P2 | 2 |
| A | A1 | X | P2 | 3 |
+------+-------+---------+------------+--------------------+
Stage 1: A2 (same area and group)
+------+-------+---------+------------+--------------------+
| AREA | GROUP | FOOD ID | PENGUIN ID | ALLOCATED_QUANTITY |
+------+-------+---------+------------+--------------------+
| A | A2 | F3 | P3 | 5 |
| A | A2 | F3 | X | 2 |
+------+-------+---------+------------+--------------------+
Stage 2: A (same area, food left from Stage 1:A2 can now be delivered to Stage 1:A1 penguin)
+------+---------+------------+--------------------+
| AREA | FOOD ID | PENGUIN ID | ALLOCATED_QUANTITY |
+------+---------+------------+--------------------+
| A | F2 | P1 | 2 |
| A | F1 | P1 | 3 |
| A | F1 | P2 | 2 |
| A | F3 | P3 | 5 |
| A | F3 | P2 | 2 |
| A | X | P2 | 1 |
+------+---------+------------+--------------------+
We then continue the same way for Stage 3 (across AREA), Stage 4 (across AREA2 (by train), which is a different geographic cut than AREA (by truck), so we can't just re-aggregate), Stage 5, and so on.
What I tried
I know how to do this efficiently in R with a handful of for loops and array pointers, creating the output row by row for each allocation. With Spark/Scala, however, I could only come up with large, inefficient code for such a simple problem, so I'd like to reach out to the community because I have probably just missed some Spark functionality.
I can do it with many Spark row transformations (withColumn, groupBy, agg(sum), join, union, filter), but the DAG ends up so big that building it starts to slow down after 5 or 6 stages. I can work around that by saving the output to a file after each stage, but then I run into an I/O issue, as I have millions of records to save per stage.
I can also do it by running a UDAF (using a .split() buffer) for each stage, exploding the result, then joining back to the original table to update the quantities per stage. That makes the DAG much simpler and faster to build, but unfortunately, likely due to the string manipulation inside the UDAF, it is too slow for a few partitions.
In the end both of the above methods feel wrong, as they are more like hacks, and there must be a simpler way to solve this. Ideally I would prefer to use transformations so as not to lose lazy evaluation, as this is just one step among many other transformations.
Thanks a lot for your time. I'm happy to discuss any suggested approach.
This is pseudocode/description only, but here is my solution to Stage 1. The problem is pretty interesting, and I thought you described it quite well.
My thought is to use Spark's window, struct, collect_list (and maybe sortWithinPartitions), cumulative sums, and lagging to get to something like this:
C1 | C2 | C3 | C4 | C5 | C6               | C7   | C8
P1 | A  | A1 | 5  | 0  | [(F1,2), (F2,7)] | [F2] | 2
P2 | A  | A1 | 10 | 5  | [(F1,2), (F2,7)] | []   | -3
C1-C3 = penguin id, area, group
C4 = cumulative sum of penguin quantity, grouped by area/group, ordered by priority
C5 = lag of C4 down a row, with null replaced by 0
C6 = array of (food id, quantity) structs, where quantity is the cumulative sum of food quantity
C7/C8 = remaining food ids / remaining food quantity
Now you can use a plain udf to return the array of food groups that belong to a penguin, since you can find the first instance where C5 < C6.quantity and the first instance where C4 > C6.quantity. Everything in between is returned. If C4 is never larger than C6.quantity, then you can append X. Exploding the resulting array gives you every penguin, including any penguin that does not get food.
To determine whether there is extra food, you can have a udf which calculates the amount of "remaining food" for each row and use a window and row_number to get the last area that is fed. If remaining food > 0, those food ids have leftover food, which will be reflected in the array, and you can also make it a struct to map to the number of food items left over.
I think in the end I'm still doing a fair number of aggregations, but hopefully grouping some things together into arrays makes it faster to do comparisons across each individual item.
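The cumulative-sum-and-lag idea can be sketched for a single group in plain Python (not Spark). This is only an illustration under stated assumptions: `cumulative_intervals` and `allocate_group` are hypothetical helper names, items are (id, priority, quantity) tuples, and "X" marks unmet demand or unused food as in the question's tables. Each penguin and each food item occupies an interval on the cumulative quantity axis, and an allocation row is the overlap of two intervals.

```python
from itertools import accumulate

def cumulative_intervals(items):
    """items: (id, priority, qty) tuples -> (id, start, end), ordered by priority.

    'end' is the cumulative sum of qty (C4) and 'start' is its lag (C5).
    """
    ordered = sorted(items, key=lambda item: item[1])
    ends = list(accumulate(qty for _, _, qty in ordered))
    starts = [0] + ends[:-1]
    return [(item[0], s, e) for item, s, e in zip(ordered, starts, ends)]

def allocate_group(penguins, foods):
    """Return (food_id, penguin_id, allocated_qty) rows for one group."""
    p_iv = cumulative_intervals(penguins)
    f_iv = cumulative_intervals(foods)
    total_p = p_iv[-1][2] if p_iv else 0
    total_f = f_iv[-1][2] if f_iv else 0
    # Pad the shorter side with a placeholder "X" covering the gap.
    if total_f < total_p:
        f_iv.append(("X", total_f, total_p))
    elif total_p < total_f:
        p_iv.append(("X", total_p, total_f))
    rows = []
    for pid, p_start, p_end in p_iv:
        for fid, f_start, f_end in f_iv:
            overlap = min(p_end, f_end) - max(p_start, f_start)
            if overlap > 0:
                rows.append((fid, pid, overlap))
    return rows

# Group A1 and A2 from the question.
a1 = allocate_group([("P1", 1, 5), ("P2", 2, 5)], [("F1", 2, 5), ("F2", 1, 2)])
a2 = allocate_group([("P3", 1, 5)], [("F3", 1, 7)])
```

Rows whose food id or penguin id is "X" are what would be carried into the next stage, where the same function could be applied to the wider grouping.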
I have a dataframe in Spark like the one below, and I want to convert all the code columns into separate rows, keyed by the first column, id.
+----------------------------------+
| id code1 code2 code3 code4 code5 |
+----------------------------------+
| 1 A B C D E |
| 1 M N O P Q |
| 1 P Q R S T |
| 2 P A C D F |
| 2 S D F R G |
+----------------------------------+
I want the output in the format below:
+-------------+
| id code |
+-------------+
| 1 A |
| 1 B |
| 1 C |
| 1 D |
| 1 E |
| 1 M |
| 1 N |
| 1 O |
| 1 P |
| 1 Q |
| 1 P |
| 1 Q |
| 1 R |
| 1 S |
| 1 T |
| 2 P |
| 2 A |
| 2 C |
| 2 D |
| 2 F |
| 2 S |
| 2 D |
| 2 F |
| 2 R |
| 2 G |
+-------------+
Can anyone please help me get the above output with Spark and Scala?
Using the array, explode and drop functions should give you the desired output:
df.withColumn("code", explode(array("code1", "code2", "code3", "code4", "code5")))
.drop("code1", "code2", "code3", "code4", "code5")
OR
as suggested by undefined_variable, you can just use select:
df.select($"id", explode(array("code1", "code2", "code3", "code4", "code5")).as("code"))
df.select(col("id"), explode(split(concat_ws(",", col("code1"), col("code2"), col("code3"), col("code4"), col("code5")), ",")).as("code"))
The basic idea is to first concatenate all the required columns into one delimited string, split that string into an array, and then explode it.
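To show what the array-then-explode step does to each row, here is a small plain-Python sketch (not Spark). The helper name `explode_codes` is hypothetical, and only two of the sample rows are used to keep it short.

```python
def explode_codes(rows):
    """Turn (id, code1, ..., codeN) rows into one (id, code) row per code."""
    return [(row[0], code) for row in rows for code in row[1:]]

rows = [
    (1, "A", "B", "C", "D", "E"),
    (2, "P", "A", "C", "D", "F"),
]
long_rows = explode_codes(rows)
```

Each input row yields as many output rows as it has code columns, with the id repeated on every one.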
I'm trying to find a way to mark duplicated cases similar to this question.
However, instead of counting occurrences of duplicated values, I'd like to mark them as 0 and 1, for duplicated and unique cases respectively. This is very similar to SPSS's identify duplicate cases function. For example if I have a dataset like:
Name State Gender
John TX M
Katniss DC F
Noah CA M
Katniss CA F
John SD M
Ariel FL F
If I wanted to flag those with a duplicated name, the output would be something like this:
Name State Gender Dup
John TX M 1
Katniss DC F 1
Noah CA M 1
Katniss CA F 0
John SD M 0
Ariel FL F 1
A bonus would be a query statement that will handle which case to pick when determining the unique case.
SELECT name, state, gender
, NOT EXISTS (SELECT 1 FROM names nx
WHERE nx.name = na.name
AND nx.gender = na.gender
AND nx.ctid < na.ctid) AS Is_not_a_dup
FROM names na
;
Explanation: [NOT] EXISTS(...) yields a boolean value, which can be converted to an integer. Casting to integer requires an extra pair of parentheses, though:
SELECT name, state, gender
, (NOT EXISTS (SELECT 1 FROM names nx
WHERE nx.name = na.name
AND nx.gender = na.gender
AND nx.ctid < na.ctid))::integer AS is_not_a_dup
FROM names na
;
Results:
DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 6
name | state | gender | nodup
---------+-------+--------+-------
John | TX | M | t
Katniss | DC | F | t
Noah | CA | M | t
Katniss | CA | F | f
John | SD | M | f
Ariel | FL | F | t
(6 rows)
name | state | gender | nodup
---------+-------+--------+-------
John | TX | M | 1
Katniss | DC | F | 1
Noah | CA | M | 1
Katniss | CA | F | 0
John | SD | M | 0
Ariel | FL | F | 1
(6 rows)
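The same "first occurrence wins" logic can be sketched outside the database in Python. This is only an illustration: `flag_first_occurrence` is a hypothetical helper, input order stands in for the ctid ordering used in the query, and the key here is the name alone, as in the question.

```python
def flag_first_occurrence(rows, key):
    """Return 1 for the first row seen with each key, 0 for later repeats."""
    seen = set()
    flags = []
    for row in rows:
        k = key(row)
        flags.append(0 if k in seen else 1)
        seen.add(k)
    return flags

people = [
    ("John", "TX", "M"),
    ("Katniss", "DC", "F"),
    ("Noah", "CA", "M"),
    ("Katniss", "CA", "F"),
    ("John", "SD", "M"),
    ("Ariel", "FL", "F"),
]
flags = flag_first_occurrence(people, key=lambda row: row[0])
```

Sorting `people` before calling the function is one way to control which case is kept as the unique one.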
I need to join two tables based on names. The problem is that a name may be slightly misspelled in one of the databases. I have remedied this problem in the past using Stata's and Python's fuzzy merging, where names are matched based on how similar they are, but I am wondering if this is possible to do in PostgreSQL.
For example, my data may be something similar to this:
Table A:
first_name_a | last_name_a | id_a
----------------------------------
William | Hartnell | 1
Matt | Smithe | 2
Paul | McGann | 3
David | Tennant | 4
Colin | Baker | 5
Table B:
first_name_b | last_name_b | id_b
----------------------------------
Matt | Smith | a
Peter | Davison | b
Dave | Tennant | c
Colin | Baker | d
Will | Hartnel | e
And in the end, I hope my results would look something like:
first_name_a | last_name_a | id_a | first_name_b | last_name_b | id_b
----------------------------------------------------------------------
William | Hartnell | 1 | Will | Hartnel | e
Matt | Smithe | 2 | Matt | Smith | a
Paul | McGann | 3 | | |
David | Tennant | 4 | Dave | Tennant | c
Colin | Baker | 5 | Colin | Baker | d
| | | Peter | Davison | b
My Sonic Screwdriver gives me some pseudo-code like this:
SELECT a.*, b.* FROM A a
JOIN B b
WHERE LEVENSHTEIN(first_name_a, first_name_b) IS LESS THAN 1
AND LEVENSHTEIN(last_name_a, last_name_b) IS LESS THAN 1
The DML you mention:
SELECT a.*, b.* FROM A a
JOIN B b
WHERE LEVENSHTEIN(first_name_a, first_name_b) IS LESS THAN 1
AND LEVENSHTEIN(last_name_a, last_name_b) IS LESS THAN 1
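Looks correct as pseudo-code; just bump up the 'fuzziness': in place of the 1 in 'IS LESS THAN 1', substitute the fuzziness level that you require. In actual PostgreSQL the comparison is written with < or <= and the conditions go in the join's ON clause.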
See http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html for reference info on LEVENSHTEIN.
Done up as an SQLFiddle. Play with the thresholds/look at some of the other mapping functions mentioned in matching fuzzy strings.
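To get a feel for which thresholds the sample data needs, here is a small Python sketch of the edit-distance comparison. It is an illustration only: `levenshtein` is a textbook dynamic-programming implementation rather than PostgreSQL's function, `fuzzy_pairs` is a hypothetical helper, and it performs an inner join, so the unmatched rows from the expected output are not produced.

```python
def levenshtein(s, t):
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

table_a = [("William", "Hartnell", 1), ("Matt", "Smithe", 2), ("Paul", "McGann", 3),
           ("David", "Tennant", 4), ("Colin", "Baker", 5)]
table_b = [("Matt", "Smith", "a"), ("Peter", "Davison", "b"), ("Dave", "Tennant", "c"),
           ("Colin", "Baker", "d"), ("Will", "Hartnel", "e")]

def fuzzy_pairs(max_first, max_last):
    """Return (id_a, id_b) for rows whose names are within the given distances."""
    return [(id_a, id_b)
            for first_a, last_a, id_a in table_a
            for first_b, last_b, id_b in table_b
            if levenshtein(first_a, first_b) <= max_first
            and levenshtein(last_a, last_b) <= max_last]

exact_only = fuzzy_pairs(0, 0)   # what "less than 1" amounts to
matches = fuzzy_pairs(3, 1)      # thresholds loose enough for William/Will
```

With a distance below 1 only the exact match (Colin Baker) is paired; William/Will is three edits apart, so the first-name threshold has to be at least 3 to recover all four expected matches on this data.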