PySpark - Creating a single column from multiple columns with some basic math - pyspark

Consider the following PySpark dataframe:
+----+----+----+
|Col1|Col2|Col3|
+----+----+----+
|A, B|D, G|A, G|
|C, F|C, D|A, G|
|C, F|C, D|A, G|
+----+----+----+
I'd like to create a new dataframe with 2 columns, the first with all the different combinations, and the second column is the ratio: Frequency of Combination / Total Number of Combinations. For example,
+-----------+-----------+
|Combination|Ratio      |
+-----------+-----------+
|A, B       |0.111 (1/9)|
|C, F       |0.222 (2/9)|
|D, G       |0.111 (1/9)|
|C, D       |0.222 (2/9)|
|A, G       |0.333 (3/9)|
+-----------+-----------+

You can unpivot, then group by and count:
from pyspark.sql import functions as F, Window
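# stack(n, col1, ..., coln) unpivots the n columns into a single column
df2 = df.selectExpr(
    'stack(' + str(len(df.columns)) + ', ' + ', '.join(df.columns) + ') as combination'
).groupBy('combination').count().withColumn(
    'ratio',
    # divide each count by the total over the whole (unpartitioned) window
    F.col('count') / F.sum('count').over(Window.orderBy())
).drop('count')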
df2.show()
+-----------+------------------+
|combination|             ratio|
+-----------+------------------+
|       A, B|0.1111111111111111|
|       C, F|0.2222222222222222|
|       C, D|0.2222222222222222|
|       D, G|0.1111111111111111|
|       A, G|0.3333333333333333|
+-----------+------------------+
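To make the arithmetic behind the ratio explicit, here is a minimal plain-Python sketch (not PySpark) of the same unpivot, count, and divide steps. The rows list mirrors the example dataframe, and the helper name combination_ratios is made up for illustration.

```python
from collections import Counter

# Rows of the example dataframe (Col1, Col2, Col3).
rows = [
    ("A, B", "D, G", "A, G"),
    ("C, F", "C, D", "A, G"),
    ("C, F", "C, D", "A, G"),
]

def combination_ratios(rows):
    """Return {combination: frequency / total number of cells}."""
    # "Unpivot": flatten every cell of every column into one list.
    cells = [cell for row in rows for cell in row]
    counts = Counter(cells)
    total = len(cells)
    return {combo: n / total for combo, n in counts.items()}

ratios = combination_ratios(rows)
for combo, ratio in sorted(ratios.items()):
    print(combo, round(ratio, 3))
```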

Related

PostgreSQL - How to use index for this kind of query

We got this query:
SELECT * FROM table WHERE A AND ( B = X1 OR B = X2 ) AND ( C = X3 OR D = TRUE ) AND E = 0;
I created this index:
CREATE INDEX _my_index ON public.table USING btree (A, B, C, D, E);
But performance did not improve. How should I index for this kind of query?
Thank you !
I'll assume that X1, X2 and X3 are constants and not table columns.
You won't be able to index C = X3 OR D = TRUE — OR is always a performance problem.
The condition B = X1 OR B = X2 should be rewritten to B IN (X1, X2).
Then this is the best index:
CREATE INDEX ON "table" (e, a, b);
If you always want to query for truth of a and e = 0, a partial index would be even better:
CREATE INDEX ON "table" (b) WHERE a AND e = 0;
If you need to index the conditions on c and d as well, and the table has a primary key, you can rewrite the query to:
SELECT * FROM "table"
WHERE a AND b IN (X1, X2) AND c = X3 AND e = 0
UNION
SELECT * FROM "table"
WHERE a AND b IN (X1, X2) AND d AND e = 0;
For this query, the following two indexes are commendable:
CREATE INDEX ON "table" (c, a, e, b);
CREATE INDEX ON "table" (e, a, d, b);
Again, you can move certain index columns into a WHERE condition if you always query for a certain value.
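As a rough check that the OR-to-UNION rewrite returns the same rows, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The table name t, the id primary key, the sample data, and the values chosen for X1, X2 and X3 are all assumptions made for illustration. It only checks logical equivalence and says nothing about index usage or performance.

```python
import sqlite3
from itertools import product

conn = sqlite3.connect(":memory:")
# Hypothetical table; the primary key guarantees distinct rows, so that
# UNION's duplicate removal cannot drop genuine results.
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER,"
    " c INTEGER, d INTEGER, e INTEGER)"
)
# Sample data: every combination of a few small values (booleans as 0/1).
rows = list(product([0, 1], [1, 2, 3], [5, 6], [0, 1], [0, 1]))
conn.executemany("INSERT INTO t (a, b, c, d, e) VALUES (?, ?, ?, ?, ?)", rows)

X1, X2, X3 = 1, 2, 5  # hypothetical constants

original = conn.execute(
    "SELECT * FROM t WHERE a AND (b = ? OR b = ?) AND (c = ? OR d) AND e = 0",
    (X1, X2, X3),
).fetchall()

rewritten = conn.execute(
    "SELECT * FROM t WHERE a AND b IN (?, ?) AND c = ? AND e = 0"
    " UNION "
    "SELECT * FROM t WHERE a AND b IN (?, ?) AND d AND e = 0",
    (X1, X2, X3, X1, X2),
).fetchall()

# Compare as sorted lists because neither query guarantees an order.
same_rows = sorted(original) == sorted(rewritten)
print(same_rows)
```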

Referring to a different column in PostgreSQL column

Consider the following image
If you want to get a result row containing all steps to get the length of the non-labeled sides, you can do the following:
SELECT
5 AS a, --side 1, triangle 1
7 AS b, --side 2, triangle 1
(5*5) AS a2, --a^2
(7*7) AS b2, --b^2
(5*5)+(7*7) AS c2, --a^2 + b^2 = c^2
SQRT((5*5)+(7*7)) AS c, --√c2 = c
19 AS d, --side 1, triangle 2
24 AS e, --side 2 triangle 2
(19*19) AS d2, --d^2
(24*24) AS e2, --e^2
(19*19)+(24*24) AS f2, --d^2 + e^2 = f^2
SQRT((19*19)+(24*24)) AS f, --√f2 = f
(5*5)+(7*7)+(19*19)+(24*24) AS g2, --c^2 + f^2 = g^2
SQRT((5*5)+(7*7)+(19*19)+(24*24)) AS g --√g2 = g
However, that is CLEARLY very ugly. I'd like to use column substitution, like:
SELECT
5 AS a, --side 1, triangle 1
7 AS b, --side 2, triangle 1
(a*a) AS a2, --a^2
(b*b) AS b2, --b^2
a2+b2 AS c2, --a^2 + b^2 = c^2
SQRT(c2) AS c, --√c2 = c
19 AS d, --side 1, triangle 2
24 AS e, --side 2 triangle 2
(d*d) AS d2, --d^2
(e*e) AS e2, --e^2
d2+e2 AS f2, --d^2 + e^2 = f^2
SQRT(f2) AS f, --√f2 = f
c2+f2 AS g2, --c^2 + f^2 = g^2
SQRT(g2) AS g --√g2 = g
Is there any easy way to do this?
PS Please don't explain how this is a ridiculous use of SQL, I know THAT! This was just the simplest way that I could reduce my problem to be understood. In my scenario, the calculations are much more complex, with variables coming from many joined tables, and the results need to be inserted into a summary table with a very rigid structure. Currently, I'm bringing the results out to Node, doing the calculations there and inserting the data, but that is very VERY slow, especially since I have to go through the network to get to the database server.
This can be done using common table expressions:
with base_vars (a, b, d, e) as (
  -- one row with four columns
  values (5, 7, 19, 24)
), var2 (a2, b2, d2, e2) as (
  select a*a, b*b, d*d, e*e
  from base_vars
), var3 (c2, c, f2, f) as (
  select a2+b2, sqrt(a2+b2), d2+e2, sqrt(d2+e2)
  from var2
), var4 (g2, g) as (
  select c2+f2, sqrt(c2+f2)
  from var3
)
select g
from var4;
I am not 100% sure I got all the variables right, but I think you get the idea.
Another option would be to put that into a PL/pgSQL function.
lateral is a bit shorter than CTEs since it is not necessary to refer to a previous CTE. Also, in PostgreSQL versions before 12 the planner cannot merge the CTEs and the main query into a single plan.
with t (a,b,d,e) as (values (5,7,19,24))
select c, f, sqrt(c2 + f * f)
from
t
cross join lateral
(select a * a, b * b, d * d, e * e) t1 (a2, b2, d2, e2)
cross join lateral
(select a2 + b2, d2 + e2) t2 (c2, f2)
cross join lateral
(select sqrt(c2), sqrt(f2)) t3 (c, f)
;
        c         |        f         |       sqrt
------------------+------------------+------------------
 8.60232526704263 | 30.6104557300279 | 31.7962261911693
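As a sanity check on the numbers above, the same chain of named intermediate values can be written out in a few lines of Python. This is only an illustrative sketch of the arithmetic, not a replacement for the SQL; the variable names follow the column aliases from the question.

```python
import math

# Base values (sides of the two triangles).
a, b, d, e = 5, 7, 19, 24

# Step 1: squares of the base values.
a2, b2, d2, e2 = a * a, b * b, d * d, e * e

# Step 2: squared hypotenuses, then the hypotenuses themselves.
c2, f2 = a2 + b2, d2 + e2
c, f = math.sqrt(c2), math.sqrt(f2)

# Step 3: combine both results.
g2 = c2 + f2
g = math.sqrt(g2)

print(c, f, g)
```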

Maple: specify variable over which to maximize

This is a very simple question, but I found surprisingly little about it online.
I want to find the minimizer of a function in Maple, but I am not sure how to indicate which variable is the one of interest. Let us take a very simple case: I want the symbolic minimizer of a quadratic expression in x, with parameters a, b and c.
If nothing is specified, it minimizes over all variables: a, b, c and x.
f4 := a+b*x+c*x^2
minimize(f4, location)
I tried to specify the variable in the function, but that did not work either:
f5 :=(x) ->a+b*x+c*x^2
minimize(f5, location)
How should I do this? And how would I do it if I wanted to minimize over two variables, x and y?
fxy := a+b*x+c*x^2 + d*y^2 +e*y
f4 := a+b*x+c*x^2:
extrema(f4, {}, x);
    /          2 \
    | 4 a c - b  |
   <  ----------  >
    |    4 c     |
    \            /
fxy := a+b*x+c*x^2 + d*y^2 +e*y:
extrema(fxy, {}, {x,y});
    /            2        2 \
    | 4 a c d - b  d - c e  |
   <  ---------------------  >
    |         4 c d         |
    \                       /
The nature of the extrema will depend upon the values of the parameters. For your first example above (quadratic in x) it will depend on the sign of c.
The command extrema accepts an optional fourth argument, such as an unassigned name (or an uneval-quoted name), to which it assigns the candidate solution points (as a side-effect of its calculation). E.g.,
restart;
f4 := a+b*x+c*x^2:
extrema(f4, {}, x, 'cand');
              2
     4 a c - b
    {----------}
        4 c
cand;
             b
    {{x = - ---}}
            2 c
fxy := a+b*x+c*x^2 + d*y^2 +e*y:
extrema(fxy, {}, {x,y}, 'cand');
                2        2
     4 a c d - b  d - c e
    {---------------------}
             4 c d
cand;
             b          e
    {{x = - ---, y = - ---}}
            2 c        2 d
Alternatively, you may set up the partial derivatives and solve them manually. Note that for each of these two examples solve returns just one result.
restart:
f4 := a+b*x+c*x^2:
solve({diff(f4,x)},{x});
            b
    {x = - ---}
           2 c
normal(eval(f4,%));
             2
    4 a c - b
    ----------
       4 c
fxy := a+b*x+c*x^2 + d*y^2 +e*y:
solve({diff(fxy,x),diff(fxy,y)},{x,y});
            b          e
    {x = - ---, y = - ---}
           2 c        2 d
normal(eval(fxy,%));
               2        2
    4 a c d - b  d - c e
    ---------------------
            4 c d
The code for the extrema command can be viewed by issuing the command showstat(extrema). There you can see how it accounts for the case of solve returning multiple results.
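The closed forms above can be spot-checked numerically outside Maple. The following is a minimal Python sketch for the one-variable case, using exact fractions; the parameter values a = 3, b = -4, c = 2 are arbitrary assumptions chosen so that c > 0 and the stationary point is a minimum.

```python
from fractions import Fraction

def f4(x, a, b, c):
    """The quadratic a + b*x + c*x^2 from the question."""
    return a + b * x + c * x ** 2

# Hypothetical parameter values with c > 0.
a, b, c = Fraction(3), Fraction(-4), Fraction(2)

# Candidate point and extremal value as reported by extrema / solve.
x_star = -b / (2 * c)
closed_form = (4 * a * c - b ** 2) / (4 * c)

# Evaluate the function at the candidate point.
value_at_x_star = f4(x_star, a, b, c)

# With c > 0, nearby points on either side should give larger values.
step = Fraction(1, 100)
is_local_min = all(
    f4(x_star + s, a, b, c) > value_at_x_star for s in (-step, step)
)
print(x_star, value_at_x_star, closed_form, is_local_min)
```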

Compute the change of basis matrix in Matlab

I have an assignment where I need to create a function which, given two bases (each represented as a matrix of vectors), returns the change of basis matrix from one basis to the other.
So far this is the function I came up with, based on the algorithm that I will explain next:
function C = cob(A, B)
% Returns C, which is the change of basis matrix from A to B,
% that is, given basis A and B, we represent B in terms of A.
% Assumes that A and B are square matrices
n = size(A, 1);
% Creates a square matrix full of zeros
% of the same size as the number of rows of A.
C = zeros(n);
for i=1:n
C(i, :) = (A\B(:, i))';
end
end
And here are my tests:
clc
clear out
S = eye(3);
B = [1 0 0; 0 1 0; 2 1 1];
D = B;
disp(cob(S, B)); % Returns cob matrix from S to B.
disp(cob(B, D));
disp(cob(S, D));
Here's the algorithm that I used, based on some notes. Basically, if I have two bases B = {b1, ... , bn} and D = {d1, ... , dn} for a certain vector space, and I want to represent basis D in terms of basis B, I need to find a change of basis matrix S. The vectors of these bases are related in the following form:
(d1 ... dn)^T = S * (b1, ... , bn)^T
Or, by splitting up all the rows:
d1 = s11 * b1 + s12 * b2 + ... + s1n * bn
d2 = s21 * b1 + s22 * b2 + ... + s2n * bn
...
dn = sn1 * b1 + sn2 * b2 + ... + snn * bn
Note that d1, b1, d2, b2, etc, are all column vectors. This can be further represented as
d1 = [b1 b2 ... bn] * [s11; s12; ... s1n];
d2 = [b1 b2 ... bn] * [s21; s22; ... s2n];
...
dn = [b1 b2 ... bn] * [sn1; sn2; ... snn];
Let's call the matrix [b1 b2 ... bn], whose columns are the vectors of B, A, so we have:
d1 = A * [s11; s12; ... s1n];
d2 = A * [s21; s22; ... s2n];
...
dn = A * [sn1; sn2; ... snn];
Note that what we need now to find are all the entries sij for i=1...n and j=1...n. We can do that by left-multiplying both sides by the inverse of A, i.e. by A^(-1).
So, S might look something like this
S = [s11 s12 ... s1n;
s21 s22 ... s2n;
...
sn1 sn2 ... snn;]
If this idea is correct, then finding the change of basis matrix S from B to D is exactly what I'm doing in the code.
Is my idea correct? If not, what's wrong? If yes, can I improve it?
Things become much easier when one has an intuitive understanding of the algorithm.
There are two key points to understand here:
1. C(B,B) is the identity matrix (i.e., do nothing to change from B to B)
2. C(E,D)C(B,E) = C(B,D), think of this as B -> E -> D = B -> D
A direct corollary of 1 and 2 is
C(E,D)C(D,E) = C(D,D), the identity matrix
in other words
C(E,D) = C(D,E)^(-1)
Summarizing.
Algorithm to calculate the matrix C(B,D) to change from B to D:
1. Define C(B,E) = [b1, ..., bn] (column vectors)
2. Define C(D,E) = [d1, ..., dn] (column vectors)
3. Compute C(E,D) as the inverse of C(D,E).
4. Compute C(B,D) as the product C(E,D)C(B,E).
Example
B = {(1,2), (3,4)}
D = {(1,1), (1,-1)}
C(B,E) = | 1  3 |
         | 2  4 |
C(D,E) = | 1  1 |
         | 1 -1 |
C(E,D) = | .5  .5 |
         | .5 -.5 |
C(B,D) = | .5  .5 | | 1  3 | = | 1.5  3.5 |
         | .5 -.5 | | 2  4 |   | -.5  -.5 |
Verification
1.5 d1 + -.5 d2 = 1.5(1,1) + -.5(1,-1) = (1,2) = b1
3.5 d1 + -.5 d2 = 3.5(1,1) + -.5(1,-1) = (3,4) = b2
which shows that the columns of C(B,D) are in fact the coordinates of b1 and b2 in the basis D.
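The algorithm and example above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name change_of_basis is made up, the matrices hold the basis vectors as columns, and instead of forming the inverse explicitly it row-reduces the augmented matrix [C(D,E) | C(B,E)], which yields the same product C(D,E)^(-1) C(B,E). Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def change_of_basis(B, D):
    """Return C(B,D) = C(D,E)^(-1) C(B,E).

    B and D are square matrices (lists of rows) whose columns are the
    basis vectors. Raises StopIteration if D is singular.
    """
    n = len(D)
    # Augmented matrix [D | B] with exact rational arithmetic.
    aug = [
        [Fraction(v) for v in D[i]] + [Fraction(v) for v in B[i]]
        for i in range(n)
    ]
    for col in range(n):
        # Find a row with a non-zero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so that the pivot becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    # The left half is now the identity; the right half is D^(-1) B.
    return [row[n:] for row in aug]

B = [[1, 3], [2, 4]]    # columns b1 = (1,2), b2 = (3,4)
D = [[1, 1], [1, -1]]   # columns d1 = (1,1), d2 = (1,-1)
C = change_of_basis(B, D)
for row in C:
    print([float(v) for v in row])
```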

NUL-byte between every other character in output

I'm using Ruby to read and then print a file to stdout, redirecting the output to a file in Windows PowerShell.
However, when I inspect the files, I get this for the input:
PS D:> head -n 1 .\inputfile
<text id="http://observer.guardian.co.uk/osm/story/0,,1009777,00.html"> <s> Hooligans NNS hooligan
, , , unbridled JJ unbridled passion NN passion
- : - and CC and no DT no executive JJ executiv
e boxes NNS box . SENT . </s>
... yet this for the output:
PS D:> head -n 1 .\outputfile
ÿ_< t e x t i d = " h t t p : / / o b s e r v e r . g u a r d i a n . c o . u k / o s m / s t o r y / 0 , , 1 0 0 9 7 7 7 , 0
0 . h t m l " > < s > H o o l i g a n s N N S h o o l i g a n , ,
, u n b r i d l e d J J u n b r i d l e d p a s s i o n N N p a s s i o n
- : - a n d C C a n d n o D T n o e x e c u t i v e J J
e x e c u t i v e b o x e s N N S b o x . S E N T . < / s >
How can this happen?
Edit: since my problem didn't have anything to do with Ruby, I've removed the Ruby-code, and included my usage of the Windows shell.
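In PowerShell, > is effectively the same as | Out-File, and in Windows PowerShell Out-File defaults to Unicode (UTF-16LE) encoding, which stores each ASCII character as two bytes, the second being NUL. Try this instead of using >: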
... | Out-File outputfile -encoding ASCII
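To see why UTF-16 output looks like ASCII text with a NUL after every character, here is a small Python sketch that encodes a sample string both ways. The sample string is an assumption for illustration; the byte order mark is added explicitly so that the result does not depend on the machine's byte order.

```python
import codecs

text = "<text id="  # hypothetical sample, ASCII only

ascii_bytes = text.encode("ascii")
# UTF-16 little-endian with a leading byte order mark (FF FE).
utf16_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(ascii_bytes)
print(utf16_bytes)

# Strip the 2-byte BOM, then split into low and high bytes of each unit.
payload = utf16_bytes[2:]
low_bytes = payload[0::2]    # the original ASCII characters
high_bytes = payload[1::2]   # a NUL byte after each character
```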