PostgreSQL matrix multiplication of a table (multiply a table by itself) - postgresql

I have a square table similar to this:

|   | c | d |
|---|---|---|
| a | 1 | 2 |
| b | 3 | 4 |

I want to calculate the matrix product of this table with itself, i.e. this:

|   | c  | d  |
|---|----|----|
| a | 7  | 10 |
| b | 15 | 22 |
While I understand that SQL should not be my language of choice for this task, I need to do this in that language. How do I do this?

It will make your life easier if you represent your matrix elements as triples (i, j, a[i,j]). Note the join condition m1.j = m2.i, which lines up the columns of the first factor with the rows of the second:
WITH matrix AS (
    SELECT * FROM (VALUES
        ('a','a',1), ('a','b',2),
        ('b','a',3), ('b','b',4)
    ) AS t(i,j,a)
)
SELECT m1.i AS i, m2.j AS j, sum(m1.a * m2.a) AS a
FROM matrix m1
JOIN matrix m2 ON m1.j = m2.i
GROUP BY m1.i, m2.j
ORDER BY i, j;
This also handles sparse matrices nicely: entries that are missing are implicitly zero, since they simply never join.
Here is a dbfiddle where you can try it out.
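The same triple-based representation can be multiplied outside SQL as well. This Python sketch (an illustration only, not part of the answer) computes result[i,j] = sum over k of m1[i,k] * m2[k,j] by matching the first factor's column index against the second factor's row index, exactly as the SQL join and GROUP BY do:

```python
from collections import defaultdict

def matmul_triples(m1, m2):
    """Multiply two matrices stored as {(i, j): value} dicts of
    non-zero entries: result[i, j] = sum_k m1[i, k] * m2[k, j]."""
    result = defaultdict(int)
    for (i, k1), v1 in m1.items():
        for (k2, j), v2 in m2.items():
            if k1 == k2:  # column of the first factor meets row of the second
                result[(i, j)] += v1 * v2
    return dict(result)

# The matrix from the question, squared
m = {('a', 'a'): 1, ('a', 'b'): 2, ('b', 'a'): 3, ('b', 'b'): 4}
squared = matmul_triples(m, m)
# squared is {('a','a'): 7, ('a','b'): 10, ('b','a'): 15, ('b','b'): 22}
```

Missing (i, j) keys behave as zeros, so sparse matrices come for free here too.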

Related

Extract contents from cell array

I have a series of images stored in a cell array A, so every entry of A contains an image (a matrix). All matrices are equally sized.
Now I want to extract the value at a specific position (pixel), but my current approach seems slow, and I think there may be a better way to do it.
% Create data that resembles my problem
N = 5;
A = cell(N,1);
for i = 1:N
    A{i} = rand(5,5);
end
% my current approach
I = size(A{1},1);
J = size(A{1},2);
B = zeros(N,1);   % preallocate the per-pixel vector
for i = 1:I
    for j = 1:J
        for k = 1:N
            B(k) = A{k}(i,j);
        end
        % do further operations on B for current i,j, don't save B
    end
end
I was thinking there should be some way along the lines of A{:}(i,j) or vertcat(A{:}(i,j)), but both lead to
??? Bad cell reference operation.
I'm using MATLAB R2008b.
For further information, I use fft on B afterwards.
Here are the results of the answer by Cris
| Code         | # images | Extracting values | FFT      | Overall   |
|--------------|----------|-------------------|----------|-----------|
| Original     | 16       | 12.809 s          | 19.728 s | 62.884 s  |
| Original     | 128      | 105.974 s         | 23.242 s | 177.280 s |
| Answer       | 16       | 42.122 s          | 27.382 s | 104.565 s |
| Answer       | 128      | 36.807 s          | 26.623 s | 102.601 s |
| Answer (mod) | 16       | 14.772 s          | 27.797 s | 77.784 s  |
| Answer (mod) | 128      | 13.637 s          | 28.095 s | 83.839 s  |
The answer's code was modified to B = double(squeeze(A3(i,j,:))), because without the conversion to double the FFT took much longer.
Answer (mod) uses B = double(A3(i,j,:)), i.e. without squeeze.
So the improvement really kicks in for larger sets of images; I currently plan to process ~500 images per run.
Update
Measured with the profile function, here is the result of using vs. omitting squeeze:
| Code | # Calls | Time |
|--------------------------------|---------|----------|
| B = double(squeeze(A(i,j,:))); | 1431040 | 36.325 s |
| B= double(A(i,j,:)); | 1431040 | 14.289 s |
A{:}(i,j) does not work because A{:} is a comma-separated list of elements, equivalent to A{1},A{2},A{3},...,A{end}. You cannot index into such a list.
To speed up your operation, I recommend that you create a 3D matrix out of your data, like this:
A3 = cat(3,A{:});
Of course, this will only work if all elements of A have the same size (as was originally specified in the question).
Now you can quickly access the data like so:
for i = 1:I
for j = 1:J
B = squeeze(A3(i,j,:));
% do further operations on B for current i,j, don't save B
end
end
Depending on the operations you apply to each B, you could vectorize those operations as well.
Edit: Since you apply fft to each B, you can obtain that also without looping:
B_fft = fft(A3,[],3); % 3 is the dimension along which to apply the FFT
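For readers porting this to other environments, the same stack-then-vectorize idea looks like this in NumPy (a sketch, not MATLAB code): stack the 2-D arrays along a third axis, slice per pixel, and FFT along the stacking axis in one call.

```python
import numpy as np

rng = np.random.default_rng(0)
A = [rng.random((5, 5)) for _ in range(4)]  # list of equally sized images

# Stack along a third axis: the NumPy analogue of A3 = cat(3, A{:})
A3 = np.stack(A, axis=2)                    # shape (5, 5, 4)

# Each pixel's series across images is now a cheap slice, not a loop:
B = A3[2, 3, :]                             # analogue of squeeze(A3(i,j,:))

# FFT of every pixel's series at once: analogue of fft(A3, [], 3)
B_fft = np.fft.fft(A3, axis=2)
```

As in the MATLAB version, this only works because all images share the same size.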

How to handle redistribution/allocation algorithm using Spark in Scala

Let's say I have a bunch of penguins around the country and I need to allocate food provisions (which are also distributed around the country) to them.
I tried to simplify the problem to the following:
Input
The distribution of the penguins by area, grouped by proximity and prioritized as
+------------+------+-------+--------------------------------------+----------+
| PENGUIN ID | AREA | GROUP | PRIORITY (lower are allocated first) | QUANTITY |
+------------+------+-------+--------------------------------------+----------+
| P1 | A | A1 | 1 | 5 |
| P2 | A | A1 | 2 | 5 |
| P3 | A | A2 | 1 | 5 |
| P4 | B | B1 | 1 | 5 |
| P5 | B | B2 | 1 | 5 |
+------------+------+-------+--------------------------------------+----------+
The distribution of the food by area, also grouped by proximity and prioritized as
+---------+------+-------+--------------------------------------+----------+
| FOOD ID | AREA | GROUP | PRIORITY (lower are allocated first) | QUANTITY |
+---------+------+-------+--------------------------------------+----------+
| F1 | A | A1 | 2 | 5 |
| F2 | A | A1 | 1 | 2 |
| F3 | A | A2 | 1 | 7 |
| F4 | B | B1 | 1 | 7 |
+---------+------+-------+--------------------------------------+----------+
Expected output
The challenge is to allocate food to penguins from the same group first, respecting the priority order of both food and penguins, and then move the leftover food to the other areas.
So based on above data we would first allocate within same area and group as:
Stage 1: A1 (same area and group)
+------+-------+---------+------------+--------------------+
| AREA | GROUP | FOOD ID | PENGUIN ID | ALLOCATED_QUANTITY |
+------+-------+---------+------------+--------------------+
| A | A1 | F2 | P1 | 2 |
| A | A1 | F1 | P1 | 3 |
| A | A1 | F1 | P2 | 2 |
| A | A1 | X | P2 | 3 |
+------+-------+---------+------------+--------------------+
Stage 1: A2 (same area and group)
+------+-------+---------+------------+--------------------+
| AREA | GROUP | FOOD ID | PENGUIN ID | ALLOCATED_QUANTITY |
+------+-------+---------+------------+--------------------+
| A | A2 | F3 | P3 | 5 |
| A | A2 | F3 | X | 2 |
+------+-------+---------+------------+--------------------+
Stage 2: A (same area; food left over from Stage 1:A2 can now be delivered to the Stage 1:A1 penguins)
+------+---------+------------+--------------------+
| AREA | FOOD ID | PENGUIN ID | ALLOCATED_QUANTITY |
+------+---------+------------+--------------------+
| A | F2 | P1 | 2 |
| A | F1 | P1 | 3 |
| A | F1 | P2 | 2 |
| A | F3 | P3 | 5 |
| A | F3 | P2 | 2 |
| A | X | P2 | 1 |
+------+---------+------------+--------------------+
and then we continue the same way for Stage 3 (across AREAs), Stage 4 (across AREA2, which is a different geographic cut than AREA: by train instead of by truck, so we can't just re-aggregate), Stage 5, and so on.
What I tried
I know how to do this efficiently in plain R, using a bunch of for loops, array pointers, and building the output row by row for each allocation. With Spark/Scala, however, I could only come up with big, inefficient code for solving such a simple problem, and I would like to reach out to the community, because it's probably just that I missed some Spark functionality.
I can do it using a lot of Spark row transformations [withColumn, groupBy, agg(sum), join, union, filter], but the DAG ends up so big that the DAG build-up starts to slow down after 5 or 6 stages. I can work around that by saving the output to a file after each stage, but then I hit an IO issue, as I have millions of records to save per stage.
I can also do it by running a UDAF (using a .split() buffer) for each stage, exploding the result, then joining back to the original table to update the quantities per stage. That makes the DAG much simpler and faster to build, but unfortunately, likely due to the string manipulation inside the UDAF, it is too slow on a few partitions.
In the end both of the above methods feel wrong, as they are more like hacks, and there must be a simpler way to solve this. Ideally I would prefer to use transformations, so as not to lose lazy evaluation, since this is just one step among many other transformations.
Thanks a lot for your time. I'm happy to discuss any suggested approach.
This is pseudocode/description, but here is my solution to Stage 1. The problem is pretty interesting, and I thought you described it quite well.
My thought is to use Spark's window, struct, collect_list (and maybe a sortWithinPartitions), cumulative sums, and lagging to get to something like this:
| C1 | C2 | C3 | C4 | C5 | C6               | C7   | C8 |
|----|----|----|----|----|------------------|------|----|
| P1 | A  | A1 | 5  | 0  | [(F1,2), (F2,7)] | [F2] | 2  |
| P2 | A  | A1 | 10 | 5  | [(F1,2), (F2,7)] | []   | -3 |
C4 = cumulative sum of quantity, grouped by area/group, ordered by priority
C5 = lag of C4 down a row, and null = 0
C6 = structure of food / quantity, with a cumulative sum of food quantity
C7/C8 = remaining food/food ids
Now you can use a plain UDF to return the array of foods that belong to a penguin: find the first instance where C5 < C6.quantity and the first instance where C4 > C6.quantity, and everything in between is returned. If C4 is never larger than C6.quantity, you can append X. Exploding the resulting array gets you all penguins, including penguins that did not get food.
To determine whether there is extra food, you can have a UDF which calculates the amount of "remaining food" for each row, and use a window with row_number to get the last area that is fed. If remaining food > 0, those food ids have food left over; it will be reflected in the array, and you can also make it a struct mapping to the number of food items left over.
I think in the end I'm still doing a fair number of aggregations, but hopefully grouping some things together into arrays makes it faster to do comparisons across each individual item.
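Stripped of the Spark machinery, the per-group matching that the cumulative sums encode is just a greedy merge of two priority-ordered lists. This plain Python sketch (an illustration of the logic only; the Spark version would express the same thing with windowed sums) reproduces the Stage 1 tables from the question:

```python
def allocate(penguins, foods):
    """Greedy allocation within one area/group.

    penguins and foods are (id, quantity) lists, already sorted by
    priority. Returns (allocations, leftover) where allocations is a
    list of (food_id, penguin_id, quantity); unmet demand is paired
    with the placeholder 'X', as in the question's expected output.
    """
    allocations = []
    fi = 0
    remaining = foods[0][1] if foods else 0
    for pid, need in penguins:
        while need > 0 and fi < len(foods):
            take = min(need, remaining)
            if take > 0:
                allocations.append((foods[fi][0], pid, take))
                need -= take
                remaining -= take
            if remaining == 0:          # current food exhausted, advance
                fi += 1
                remaining = foods[fi][1] if fi < len(foods) else 0
        if need > 0:                    # no food left for this penguin
            allocations.append(('X', pid, need))
    leftover = []
    if fi < len(foods) and remaining > 0:
        leftover.append((foods[fi][0], remaining))
    leftover += foods[fi + 1:]
    return allocations, leftover

# Group A1: foods sorted by priority (F2 first), matching Stage 1: A1
print(allocate([('P1', 5), ('P2', 5)], [('F2', 2), ('F1', 5)]))
```

Leftovers from each group then feed the next stage's input, which is where the repeated joins (and the growing DAG) come from in the Spark formulation.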

Cross tab with a list of values instead of summation

I want a cross tab that lists field values and counts them, instead of just giving a count for the summation. I know I could do this with groups, but I can't list the values vertically that way. From my research I believe I have to use a display string formula.
SQL Field Data
-------------------------------------------------
| Play # | Formation |Back Set | R/P | PLAY |
-------------------------------------------------
| 1 | TREY | FG | R | TRUCK |
-------------------------------------------------
| 2 | T | FG | R | RHINO |
-------------------------------------------------
| 3 | D | FG | P | 5 STEP |
-------------------------------------------------
| 4 | D | FG | P | 5 STEP |
-------------------------------------------------
| 5 | K JET | NG | R | DOG |
-------------------------------------------------
Desired report structure:
-----------------------------------------------------------
| Back Set & Formation | Run | Pass |
-----------------------------------------------------------
| NG K JET | BULLA 1 | |
| | HELL 3 | |
-----------------------------------------------------------
| FG D | | 5 STEP 2 |
-----------------------------------------------------------
| NG K JET | DOG | |
-----------------------------------------------------------
| FG T | RHINO | |
-----------------------------------------------------------
Don't see why a crosstab is necessary for this, especially if the entire body of the report is just that table.
Group your records by Back Set and Formation. If that's not something natively configured in your table, make a new formula field and group on that.
Drop the 3 relevant fields into whichever section you need to display. (It might be a footer, depending on whether or not you want repeats.)
Write a formula to determine whether Run or Pass is displayed, and place it in their suppression field. (Good luck getting a crosstab to do that for you! It tends to prefer 0s over blanks.)
If there's more to the report than just this table, you can cheat the system by placing your "table" into a subreport. And of course you can stretch Line objects across the sections, and they will stretch to form the table outlines.

How to set a sequence number of sub-elements in T-SQL using the same element as parent?

I need to generate a sequence in T-SQL where the first column contains a repeating group marker and another column provides the ordering.
It is hard to explain, so I'll try with an example.
This is what I need:
|------------|-------------|----------------|
| Group Col | Order Col | Desired Result |
|------------|-------------|----------------|
| D | 1 | NULL |
| A | 2 | 1 |
| C | 3 | 1 |
| E | 4 | 1 |
| A | 5 | 2 |
| B | 6 | 2 |
| C | 7 | 2 |
| A | 8 | 3 |
| F | 9 | 3 |
| T | 10 | 3 |
| A | 11 | 4 |
| Y | 12 | 4 |
|------------|-------------|----------------|
So my marker is A (each time I meet A I must start a new group in my result). All rows before the first A must be set to NULL.
I know I can achieve this with a loop, but that would be slow, and I need to update a lot of rows (sometimes several thousand).
Is there a way to achieve this without a loop?
You can use window version of COUNT to get the desired result:
SELECT [Group Col], [Order Col],
       COUNT(CASE WHEN [Group Col] = 'A' THEN 1 END)
           OVER (ORDER BY [Order Col]) AS [Desired Result]
FROM mytable
If you need all rows before first A set to NULL then use SUM instead of COUNT.
Demo here
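The window function is doing nothing more than a running count of the marker. A plain Python sketch of the same semantics (illustrative only; None here plays the role of NULL, matching the SUM variant, since SUM over a window of all NULLs yields NULL):

```python
def group_sequence(values, marker='A'):
    """Running count of marker occurrences, in order.
    Rows before the first marker get None (the SUM-over-window
    behaviour); replace None with 0 for the COUNT behaviour."""
    out, count = [], 0
    for v in values:
        if v == marker:
            count += 1
        out.append(count if count > 0 else None)
    return out

# The Group Col values from the question, in Order Col order
cols = ['D', 'A', 'C', 'E', 'A', 'B', 'C', 'A', 'F', 'T', 'A', 'Y']
print(group_sequence(cols))
# [None, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4]
```

This reproduces the Desired Result column exactly, which is why no loop is needed on the SQL side either.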

How to join multiple tables

I'm trying to join multiple tables in q
a:                 b:                 c:
key | valuea       key | valueb       key | valuec
1   | xa           1   | xb           2   | xc
2   | ya           2   | yb           4   | wc
3   | za
The expected result is
key | valuea | valueb | valuec
1   | xa     | xb     |
2   | ya     | yb     | xc
3   | za     |        |
4   |        |        | wc
This can be achieved simply with
(a uj b) uj c
BUT does anyone know how I can do it in functional form? I don't know in advance how many tables I actually have.
I basically need a function that will go over the list and smash any number of keyed tables together, something like:
f:{[x] x uj priorx};
f[] each (a;b;c;d;e...)
Can anyone help, or suggest anything?
Thanks!
Another solution, particular to your problem, which is also a little faster than your solution:
a (,')/(b;c)
figured it out... ;)
f:{[r;t]r uj t};
f/[();(a;b;c)]
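The pattern here is a left fold of a two-argument merge over a list of keyed tables. For readers unfamiliar with q, this Python sketch (an analogy only, not q code; the dict-of-dicts layout and uj2 name are made up for illustration) shows the same fold with functools.reduce:

```python
from functools import reduce

def uj2(left, right):
    """Merge two keyed 'tables' ({key: {column: value}} dicts):
    keys are unioned, and right's columns extend/override left's,
    loosely like q's uj on keyed tables."""
    out = {k: dict(v) for k, v in left.items()}
    for k, cols in right.items():
        out.setdefault(k, {}).update(cols)
    return out

# The three tables from the question
a = {1: {'valuea': 'xa'}, 2: {'valuea': 'ya'}, 3: {'valuea': 'za'}}
b = {1: {'valueb': 'xb'}, 2: {'valueb': 'yb'}}
c = {2: {'valuec': 'xc'}, 4: {'valuec': 'wc'}}

# reduce(uj2, ...) is the analogue of f/[...] folding uj over the list
merged = reduce(uj2, [a, b, c])
```

The fold works for any number of tables, which is exactly what the / (over) iterator buys you in the q version.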