I want to get the nearby rows given one specific row. For example, given two dataframes:
User  time
B     2
A     3

User  time
A     1
B     2
A     3
D     6
E     7
G     10
D     11
The first one is the specific rows, and the second one is the whole table. Let us set the near window size as 1. Hence, the result should be the following:
User  time
A     1
B     2
A     3

User  time
B     2
A     3
D     6
But how can I get this? Thanks.
I have solved this problem. Actually it is easy. If anyone meets this problem, please use rowsBetween and window functions.
You can use lead/lag over a window. For this your dataframe needs to be ordered. Let's assume you have another column "X".
from pyspark.sql.functions import lag, lead
from pyspark.sql.window import Window

df.withColumn("c_1", lead("time").over(Window.partitionBy("user").orderBy("X")))
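For what it's worth, the neighbour-window idea from the question can be sketched in plain Python (toy data copied from the example above; the edge handling at the start of the table is an assumption):

```python
# Toy data from the question: the whole table and the specific rows.
whole = [("A", 1), ("B", 2), ("A", 3), ("D", 6), ("E", 7), ("G", 10), ("D", 11)]
specific = [("B", 2), ("A", 3)]
w = 1  # near-window size

results = []
for row in specific:
    i = whole.index(row)                       # position of the row in the whole table
    results.append(whole[max(0, i - w): i + w + 1])

print(results[0])  # [('A', 1), ('B', 2), ('A', 3)]
print(results[1])  # [('B', 2), ('A', 3), ('D', 6)]
```

In Spark itself this ±1 slice is exactly what a window with rowsBetween(-1, 1) expresses, as the self-answer notes.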
I am doing calculations on columns using summation. I want to manually change the first n entries in my calc column from float to NaN. Can someone please advise me how to do that?
For example, if my column in table t now is mycol:(1 2 3 4 5 6 7 8 9), I am trying to get a function that can replace the first n=4 entries with NaN, so my column in table t becomes mycol:(0N 0N 0N 0N 5 6 7 8 9)
Thank you so much!
Emily
We can use the amend functionality to replace the first n items with a null value. Additionally, it is better to use the appropriate null literal for each column based on its type. Something like this would work:
f:{nullDict:"ijfs"!(0Ni;0Nj;0Nf;`);@[x;til y;:;nullDict .Q.ty x]}
This will amend the first y items in the list x. .Q.ty gets the type character of the input so that we can pick the corresponding null value from the dictionary.
You can then use this for a single column, like so:
update mycol: f[mycol;4] from tbl
You can also do this in one go for multiple columns, with varying numbers of items to be replaced, using functional form:
![tbl;();0b;`mycol`mycol2!((f[;4];`mycol);(f[;3];`mycol2))]
Do take note that you will need to modify nullDict with whatever other types you need.
Update: Thanks to Jonathon McMurray for suggesting a better way to build up nullDict for all primitive types using the below code:
{x!first each x$\:()}.Q.t except " "
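For readers outside q, the same "amend the first n items to null" operation can be sketched in Python, using NaN as the float null (function name and data are illustrative):

```python
import math

def fill_first_n(xs, n):
    """Return xs with its first n items replaced by NaN (the float null)."""
    n = min(n, len(xs))
    return [math.nan] * n + xs[n:]

out = fill_first_n([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0], 4)
print(out)  # first four entries are nan, then 5.0 ... 9.0
```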
Is there a function in MATLAB that sums up values starting from the last row and substitutes the next row with the summed values? For example:
data   result
1      21
2      20
3      18
4      15
5      11
6      6
GameOfThrows is on the right track, but you need an additional flipud when you're done:
out = flipud(cumsum(flipud(data)));
The first flip ensures that we start summing from the last element instead of the first. We then perform our cumulative sum, but you also want the result back in the original order, so you have to call flipud one more time. However, to be absolutely safe, because we don't know whether your data is a row or column vector, I'm going to ensure that your data is a column vector before doing what you ask:
out = flipud(cumsum(flipud(data(:))));
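The same flip/cumsum/flip trick carries over to other languages; a quick sketch in Python with the question's data:

```python
from itertools import accumulate

data = [1, 2, 3, 4, 5, 6]
# Reverse, cumulative-sum, then reverse back, mirroring flipud(cumsum(flipud(data))).
out = list(accumulate(reversed(data)))[::-1]
print(out)  # [21, 20, 18, 15, 11, 6]
```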
I have a table that looks like:
id  aff1  aff2  aff3  value
1   a     x     b     5
2   b     c     x     4
3   a     b     g     1
I would like to aggregate the aff columns to calculate the sum of "value" for each aff. For example, the above gives:
aff sum
a 6
b 10
c 4
g 1
x 9
Ideally, I'd like to do this directly in tableau without remaking the table by unfolding it along all the aff columns.
You can use Tableau's built-in pivot feature, without reshaping at the source:
Ctrl-select all three dimensions you want to merge, and click Pivot.
You will get your reshaped data; delete the other columns.
Finally build your view.
I hope this answers your question. The other options for this result are a JOIN at the DB level, or creating multiple calculated fields for each attribute value, neither of which is scalable.
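Outside Tableau, the same pivot-then-aggregate reshape can be sketched with a pandas melt (data copied from the question; the column names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "id":    [1, 2, 3],
    "aff1":  ["a", "b", "a"],
    "aff2":  ["x", "c", "b"],
    "aff3":  ["b", "x", "g"],
    "value": [5, 4, 1],
})

# Unpivot the three aff columns into one long column, then sum "value" per aff.
melted = df.melt(id_vars=["id", "value"],
                 value_vars=["aff1", "aff2", "aff3"],
                 value_name="aff")
sums = melted.groupby("aff")["value"].sum()
print(sums.to_dict())  # {'a': 6, 'b': 10, 'c': 4, 'g': 1, 'x': 9}
```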
I need some help matching data and combining it. I currently have four columns of data in an Excel sheet, similar to the following:
Column: 1 2 3 4
U 3 A 0
W 6 B 0
R 1 C 0
T 9 D 0
... ... ... ...
Column two is a data value that corresponds to the letter in column one. What I need to do is compare column 3 with column 1 and whenever it matches copy the corresponding value from column 2 to column 4.
You might ask why I don't do this manually? I have a spreadsheet with around 100,000 rows, so this really isn't an option!
I do have access to MATLAB and have the information imported, if this would be more easily completed within that environment, please let me know.
As mentioned by @bla:
a formula similar to =IF(A1=C1,B1,0)
should serve (Excel).
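If the matching letters can sit on different rows (rather than lining up row by row), a lookup does the job instead: Excel's VLOOKUP, or MATLAB's ismember. A rough Python sketch of the idea, with made-up sample data:

```python
# Hypothetical sample columns (not the question's real data).
col1 = ["U", "W", "R", "T"]     # keys
col2 = [3, 6, 1, 9]             # values belonging to the keys
col3 = ["R", "B", "U", "T"]     # letters to look up

# Build a key -> value map from columns 1 and 2, then fill column 4
# with the looked-up value, defaulting to 0 when there is no match.
lookup = dict(zip(col1, col2))
col4 = [lookup.get(c, 0) for c in col3]
print(col4)  # [1, 0, 3, 9]
```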
I am using PostgreSQL 8.2, which is the main reason I'm asking this question. I want to get, in this version of PostgreSQL, a column (let's name it C) with the cumulative minimum of some other preordered column (let's name it B). So the n-th row of column C should hold the minimum of the values of B in rows 1 to n, for some ordering.
In example below column A gives order and column C contains cumulative minimum for column B in that order:
A B C
------------
1 5 5
2 4 4
3 6 4
4 5 4
5 3 3
6 1 1
Probably the easiest way to explain what I want is the query that, in later versions, does it:
SELECT A, B, min(B) OVER (ORDER BY A) AS C FROM T;
But version 8.2, of course, doesn't have window functions.
I've written some plpgsql functions that do this on arrays. But to use them I have to use the array_agg aggregate function, which I again wrote myself (there is no built-in array_agg in that version). This approach isn't very efficient, and while it worked well on smaller tables it is becoming almost unusable now that I need to use it on bigger ones.
So I would be very grateful for any suggestions of alternative, more efficient solutions of this problem.
Thank you!
Well, you can use this simple subselect:
SELECT a, b, (SELECT min(b) FROM t t1 WHERE t1.a <= t.a) AS c
FROM t
ORDER BY a;
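The subselect is standard SQL, so it can be tried on any engine; for instance, a quick check against the question's sample data on an in-memory SQLite database from Python:

```python
import sqlite3

# Question's sample data: column a gives the ordering, b the values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, 5), (2, 4), (3, 6), (4, 5), (5, 3), (6, 1)])

# Correlated subselect computing the cumulative minimum of b ordered by a.
rows = conn.execute("""
    SELECT a, b, (SELECT min(b) FROM t t1 WHERE t1.a <= t.a) AS c
    FROM t
    ORDER BY a
""").fetchall()
print([r[2] for r in rows])  # [5, 4, 4, 4, 3, 1]
```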
But I doubt it will be faster for big tables than a plpgsql function. Maybe you can show us your function. There might be room for improvement there.
For this to be fast you should have a multi-column index like:
CREATE INDEX t_a_b_idx ON t (a,b);
But really, you should upgrade to a more recent version of PostgreSQL. Version 8.2 reached end of life last year: no more security updates, and so many missing features ...