Complex SQL help to get the start time & end time from a datetime column - amazon-redshift

I need to pivot the datetime column in such a way that, while the Order column value keeps increasing, the lowest value is taken as the start time and the highest value as the end time; but once the counter resets, a new row should be created for the start & end time.
Sample data
computername currentuser datetime order
abc xyz 7/5/2022 20:04:51 1
abc xyz 7/5/2022 20:04:51 1
abc xyz 7/6/2022 6:45:51 1
abc xyz 7/6/2022 6:45:51 1
abc xyz 7/6/2022 7:06:45 2
abc xyz 7/6/2022 7:06:45 3
abc xyz 7/6/2022 7:07:00 4
abc xyz 7/6/2022 7:59:12 2
abc xyz 7/6/2022 7:59:12 3
abc xyz 7/6/2022 7:59:19 4
abc xyz 7/6/2022 7:59:21 5
abc xyz 7/6/2022 21:28:19 1
abc xyz 7/6/2022 21:28:19 1
abc xyz 7/6/2022 21:28:24 2
abc xyz 7/6/2022 21:28:24 3
abc xyz 7/6/2022 21:28:24 4
Expected Output
computername currentuser starttime endtime
abc xyz 7/5/2022 20:04:51 7/5/2022 20:04:51
abc xyz 7/6/2022 6:45:51 7/6/2022 7:07:00
abc xyz 7/6/2022 7:59:12 7/6/2022 7:59:21
abc xyz 7/6/2022 21:28:19 7/6/2022 21:28:24
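This is the classic gaps-and-islands problem: flag each row where the counter resets, turn a running sum of those flags into a group id, then take the min and max datetime per group. In Redshift that maps onto LAG() and SUM() OVER (); below is a minimal pandas sketch of the same grouping logic (frame and column names are illustrative, and only part of the sample data is shown). A row starts a new group when its counter fails to advance, unless it is an exact duplicate of the previous row:

import pandas as pd

# Illustrative subset of the sample data; "order" resets at each new session.
df = pd.DataFrame({
    "computername": ["abc"] * 8,
    "currentuser":  ["xyz"] * 8,
    "datetime": pd.to_datetime([
        "2022-07-05 20:04:51", "2022-07-05 20:04:51",
        "2022-07-06 06:45:51", "2022-07-06 07:06:45", "2022-07-06 07:07:00",
        "2022-07-06 07:59:12", "2022-07-06 07:59:19", "2022-07-06 07:59:21",
    ]),
    "order": [1, 1, 1, 2, 4, 2, 4, 5],
})

df = df.sort_values(["computername", "currentuser", "datetime", "order"]).reset_index(drop=True)
prev = df.shift()

# New group when the counter did not advance, except for exact duplicate rows.
is_dup = (df == prev).all(axis=1)
df["grp"] = ((df["order"] <= prev["order"]) & ~is_dup).cumsum()

out = (df.groupby(["computername", "currentuser", "grp"], as_index=False)
         .agg(starttime=("datetime", "min"), endtime=("datetime", "max"))
         .drop(columns="grp"))
print(out)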

I need to compare two files using pyspark

I'm new to PySpark and need to compare two files based on col1 alone and populate a new column at the end of file 1 based on matching conditions:
1 - Matching record
0 - Unmatched record
File1:
Col1 Col2 ... ColN
1    abc  ... Xxxx
2    abc  ... Xxxx
3    abc  ... Xxxx
File 2:
Col1 Col2 ... ColN
1    abc  ... Xxxx
2    abc  ... Xxxx
Expected output:
Col1 Col2 ... ColN Newcol
1    abc  ... Xxxx 1
2    abc  ... Xxxx 1
3    abc  ... Xxxx 0
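One way to populate Newcol is a left join from file 1 onto the distinct keys of file 2, defaulting the flag to 0 where nothing matched. A minimal PySpark sketch under that assumption (the tiny frames below are illustrative stand-ins for the two files):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-ins for file 1 and file 2; in practice these would come from spark.read.
file1_df = spark.createDataFrame([(1, "abc"), (2, "abc"), (3, "abc")], ["Col1", "Col2"])
file2_df = spark.createDataFrame([(1, "abc"), (2, "abc")], ["Col1", "Col2"])

# Keys present in file 2, tagged 1; the left join keeps every file-1 row.
keys = file2_df.select("Col1").distinct().withColumn("Newcol", F.lit(1))
result = file1_df.join(keys, on="Col1", how="left").fillna({"Newcol": 0})
result.show()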

Remove table duplicates under certain conditions

I have a table like below that shows me some pnl by instrument (code) for some shifts, maturity, etc.
Instrument 123 appears two times (2 sets of shift/booknumber/insmat but different pnl). I would like to clean the table to only keep the first set (the first 3 rows).
code shift pnl booknumber insmat
123 -20% 5 1234 2021.01.29
123 -0% 7 1234 2021.01.29
123 +20% 9 1234 2021.01.29
123 -20% 4 1234 2021.01.29
123 -0% 6 1234 2021.01.29
123 +20% 8 1234 2021.01.29
456 -20% 1 1234 2021.01.29
456 -0% 2 1234 2021.01.29
456 +20% 3 1234 2021.01.29
If there were no shifts involved I would do something like this:
select first code, first pnl, first booknumber, first insmat by code from t
Would love to hear if you have a solution!
Thanks!
If the shift pattern is consistently 3 shifts, you could use
q)select from t where 0=i mod 3
code shift pnl booknumber insmat
------------------------------------
123 -20 5 1234 2021.01.29
123 -20 4 1234 2021.01.29
456 -20 1 1234 2021.01.29
Alternative solution with an fby
q)select from t where shift=(first;shift)fby code
code shift pnl booknumber insmat
------------------------------------
123 -20 5 1234 2021.01.29
123 -20 4 1234 2021.01.29
456 -20 1 1234 2021.01.29
However, this will only work if the first shift value is unique within the shift pattern.
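For comparison, the question's "only keep the first set (the first 3 rows)" can also be expressed directly. A minimal pandas sketch of that logic (illustrative frame; like the q answer above, it assumes the 3-shift pattern is consistent):

import pandas as pd

# Illustrative copy of the table from the question.
df = pd.DataFrame({
    "code":       [123, 123, 123, 123, 123, 123, 456, 456, 456],
    "shift":      ["-20%", "-0%", "+20%", "-20%", "-0%", "+20%", "-20%", "-0%", "+20%"],
    "pnl":        [5, 7, 9, 4, 6, 8, 1, 2, 3],
    "booknumber": [1234] * 9,
    "insmat":     ["2021.01.29"] * 9,
})

# Keep only the first three rows of each code, i.e. the first shift set.
first_set = df.groupby("code", sort=False).head(3)
print(first_set)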

KDB/Q: how to join and fill null with 0

I am joining 2 tables. How do I replace NULL with 0 in a column from one of the tables?
My code to join
newTable: table1 lj xkey `date`sym xkey table2
I am aware that 0^ helps you to do this, but I don't know how to apply it here.
In future I recommend showing examples of the 2 tables you have and the expected outcome you would like, because it is slightly difficult to know exactly what you're after; but I think this might be what you want.
First, in your code you use xkey twice, so it will throw an error. lj needs only the right-hand table keyed, so change it to:
newTable: table1 lj `date`sym xkey table2
Then for the updating of null values with a column from another tbl you could do:
q)tbl:([]date:.z.d;sym:10?`abc`xyz;data:10?8 2 0n)
q)tbl
date sym data
-------------------
2020.12.10 xyz 8
2020.12.10 abc 8
2020.12.10 abc 8
2020.12.10 abc
2020.12.10 abc
2020.12.10 xyz 2
2020.12.10 abc 2
2020.12.10 xyz
2020.12.10 xyz
2020.12.10 abc 2
q)tbl2:([date:.z.d;sym:`abc`xyz];data2:2?100)
q)tbl2
date sym| data2
--------------| -----
2020.12.10 abc| 23
2020.12.10 xyz| 46
q)select date,sym,data:data2^data from tbl lj `date`sym xkey tbl2 //Replace null values of data with data2.
date sym data
-------------------
2020.12.10 xyz 8
2020.12.10 abc 8
2020.12.10 abc 8
2020.12.10 abc 23
2020.12.10 abc 23
2020.12.10 xyz 2
2020.12.10 abc 2
2020.12.10 xyz 46
2020.12.10 xyz 46
2020.12.10 abc 2
So, it's 0^column1. Use it within an update statement, for example:
q)newTable:([]column1:(1;0Nj;2;0Nj))
q)update 0^column1 from newTable
column1
-------
1
0
2
0
Or functional form:
q)newTable:([]column1:(1;0Nj;2;0Nj);column2:(1;2;3;0Nj))
q)parse"update 0^column1 from newTable"
!
`newTable
()
0b
(,`column1)!,(^;0;`column1)
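// parse shows the functional shape ![table; where-phrase; groupby; update-dictionary],
// so to fill several columns at once, build one column!expression entry per column: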
q)![newTable;();0b;raze{enlist[x]!enlist(^;0;x)}each `column1`column2]
column1 column2
---------------
1 1
0 2
2 3
0 0

KDB - Text parsing and cataloging text data

I have data made of varying periodic strings that are effectively a time-value list with a periodicity flag contained within. Unfortunately, each string can have a different number of elements, but no more than 7.
Example below: # and #/M at the end of a string mean these are monthly values (here starting at 8/2020), while #/Y means annual numbers, so we divide by 12 to get a monthly value. # at the beginning simply means continue from the prior period.
copied from CSV
ID,seg,strField
AAA,1,8/2020 2333 2456 2544 2632 2678 #/M
AAA,2,# 3333 3456 3544 3632 3678 #
AAA,3,# 4333 4456 4544 4632 4678 #/M
AAA,4,11/2021 5333 5456 #/M
AAA,5,# 6333 6456 6544 6632 6678 #/Y
t:("SSS";enlist",") 0:`:./Data/src/strField.csv; // read in csv data above
t:update result:count[t]#enlist`float$() from t; // initiate empty result column
I would normally tokenize and then pass each of the 7 columns to a function, but the limit is 8 arguments, and I would like to send other metadata in addition to these 7 arguments.
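// tokenise strField into columns tok1..tok7; .Q.fu applies the split once per distinct value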
t:#[t;`tok1`tok2`tok3`tok4`tok5`tok6`tok7;:;flip .Q.fu[{" " vs'x}]t `strField];
t: ungroup t;
//Desired result
ID seg iDate result
AAA 1 8/31/2020 2333
AAA 1 9/30/2020 2456
AAA 1 10/31/2020 2544
AAA 1 11/30/2020 2632
AAA 1 12/31/2020 2678
AAA 2 1/31/2021 3333
AAA 2 2/28/2021 3456
AAA 2 3/31/2021 3544
AAA 2 4/30/2021 3632
AAA 2 5/31/2021 3678
AAA 3 6/30/2021 4333
AAA 3 7/31/2021 4456
AAA 3 8/31/2021 4544
AAA 3 9/30/2021 4632
AAA 3 10/31/2021 4678
AAA 4 11/30/2021 5333
AAA 4 12/31/2021 5456
AAA 5 1/31/2022 527.75 <-- 6333/12
AAA 5 2/28/2022 527.75
AAA 5 3/31/2022 527.75
AAA 5 4/30/2022 527.75
AAA 5 5/31/2022 527.75
AAA 5 6/30/2022 527.75
AAA 5 7/31/2022 527.75
AAA 5 8/31/2022 527.75
AAA 5 9/30/2022 527.75
AAA 5 10/31/2022 527.75
AAA 5 11/30/2022 527.75
AAA 5 12/31/2022 527.75
AAA 5 1/31/2023 538.00 <--6456/12
AAA 5 2/28/2023 538.00
AAA 5 3/31/2023 538.00
AAA 5 4/30/2023 538.00
AAA 5 5/31/2023 538.00
AAA 5 6/30/2023 538.00
AAA 5 7/31/2023 538.00
AAA 5 8/31/2023 538.00
AAA 5 9/30/2023 538.00
AAA 5 10/31/2023 538.00
AAA 5 11/30/2023 538.00
AAA 5 12/31/2023 538.00
AAA 5 1/31/2024 etc..
AAA 5 2/29/2024
AAA 5 3/31/2024
AAA 5 4/30/2024
AAA 5 5/31/2024
AAA 5 6/30/2024
AAA 5 7/31/2024
ddonelly is correct that a dictionary or list gets around the 8-parameter limit for functions, but I think it is not the right approach here. The following achieves the desired output:
t:("SSS";enlist",") 0:`:so.csv;
// Process each distinct ID separately, as the date logic here would break with a BBB entry that starts the date over
{[t]
t:#[{[x;y] select from x where ID = y}[t;]';exec distinct ID from t];
raze {[t]
t:#[t;`strField;{" "vs string x}'];
t:ungroup update`$date from delete strField from #[t;`date`result`year;:;({first x}each t[`strField];"J"${-1_1_x}each t[`strField];
`Y =fills #[("#/Y";"#/M";"#")!`Y`M`;last each t[`strField]])];
delete year from ungroup update date:`$'string date from update result:?[year;result%12;result],
date:{x+til count x} each {max($[z;12#(x+12-x mod 12);1#x+1];y)}\[0;"M"$/:raze each reverse each
"/" vs/: string date;year] from t
} each t
}[t]
ID seg date result
AAA 1 2020.08 2333
AAA 1 2020.09 2456
AAA 1 2020.10 2544
AAA 1 2020.11 2632
AAA 1 2020.12 2678
AAA 2 2021.01 3333
AAA 2 2021.02 3456
AAA 2 2021.03 3544
AAA 2 2021.04 3632
AAA 2 2021.05 3678
AAA 3 2021.06 4333
AAA 3 2021.07 4456
AAA 3 2021.08 4544
AAA 3 2021.09 4632
AAA 3 2021.10 4678
AAA 4 2021.11 5333
AAA 4 2021.12 5456
AAA 5 2022.01 527.75
AAA 5 2022.02 527.75
AAA 5 2022.03 527.75
...
AAA 5 2023.01 538
AAA 5 2023.02 538
AAA 5 2023.03 538
AAA 5 2023.04 538
...
AAA 5 2024.01 545.3333
AAA 5 2024.02 545.3333
...
Below is a full breakdown of what's going on inside the nested function, should you need it for understanding.
// vs (vector from scalar) is useful for string manipulation to separate the strField column into a more manageable list of separate strings
t:#[t;`strField;{" "vs string x}'];
// split the strField out to more manageable columns
t:#[t;`date`result`year;:;
// date column from the first part of strField
({first x}each t[`strField];
// result for the actual value fields in the middle
"J"${-1_1_x}each t[`strField];
// year column which is a boolean to indicate special handling is needed.
// I also forward fill to account for rows which are continuation of
// the previous rows time period,
// e.g. if you had 2 or 3 lines in a row of continuous yearly data
`Y =fills #[("#/Y";"#/M";"#")!`Y`M`;last each t[`strField]])];
// ungroup to split each result into individual rows
t:ungroup update`$date from delete strField from t;
t:update
// divide yearly rows where necessary with a vector conditional
result:?[year;result%12;result],
// change year into a progressive month list
date:{x+til count x} each
// check if a month exists, if not take previous month + 1.
// If a year, previous month + 12 and convert to Jan
// create a list of Jans for the year which I convert to Jan->Dec above
{max($[z;12#(x+12-x mod 12);1#x+1];y)}\
// reformat date to kdb month to feed with year into the scan iterator above
[0;"M"$/:raze each reverse each "/" vs/: string date;year] from t;
// finally convert date to symbol again to ungroup year rows into individual rows
delete year from ungroup update date:`$'string date from t
Could you pass the columns into a dictionary and then pass the dictionary into the function? This would circumvent the issue of having a maximum of 8 arguments, since the dictionary can be as long as you require.

Flag data when value from one column is in another column

I'm trying to create a flag in my dataset based on 2 conditions. The first is simple: does CheckingCol = CheckingCol2?
The second is more complicated. I have a column called TranID and a column called RevID.
For any row, if its RevID appears in the TranID column AND CheckingCol = CheckingCol2, then the flag should return "Yes"; otherwise the flag should return "No".
My data looks like this:
TranID RevID CheckingCol CheckingCol2
1 2 ABC ABC
2 1 ABC ABC
3 6 ABCDE ABCDE
4 3 ABCDE ABC
5 7 ABCDE ABC
The expected result would be:
TranID RevID CheckingCol CheckingCol2 Flag
1 2 ABC ABC Yes
2 1 ABC ABC Yes
3 6 ABCDE ABCDE No
4 3 ABCDE ABC No
5 7 ABCDE ABC No
I've tried using:
df.withColumn("TotalMatch", when((col("RevID").contains(col("TranID"))) & (col("CheckingColumn") == col("CheckingColumn2")), "Yes").otherwise("No"))
But it didn't work, and I've not been able to find anything online about how to do this.
Any help would be great!
Obtain the unique values of the TranID column as an array, then check whether RevID is in that array using the isin() function:
from pyspark.sql import functions as sf
unique_values = df1.agg(sf.collect_set("TranID").alias("uniqueIDs"))
unique_values.show()
+---------------+
| uniqueIDs|
+---------------+
|[3, 1, 2, 5, 4]|
+---------------+
required_array = unique_values.take(1)[0].uniqueIDs
['3', '1', '2', '5', '4']
df2 = df1.withColumn("Flag", sf.when( (sf.col("RevID").isin(required_array) & (sf.col("CheckingCol") ==sf.col("CheckingCol2")) ) , "Yes").otherwise("No"))
Note: check for nulls and None values in both the RevID and TranID columns, since they will affect the results.
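If TranID has many distinct values, collecting them to the driver can get expensive. The same check can stay distributed as a left join; a sketch of that variant (the frame below just mirrors the question's data):

from pyspark.sql import SparkSession, functions as sf

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame(
    [("1", "2", "ABC", "ABC"), ("2", "1", "ABC", "ABC"), ("3", "6", "ABCDE", "ABCDE"),
     ("4", "3", "ABCDE", "ABC"), ("5", "7", "ABCDE", "ABC")],
    ["TranID", "RevID", "CheckingCol", "CheckingCol2"],
)

# Distinct TranIDs, renamed so the join key lines up with RevID.
ids = df1.select(sf.col("TranID").alias("RevID")).distinct().withColumn("matched", sf.lit(True))

# "matched" is null where RevID never appears as a TranID; when() treats
# null as false, so those rows fall through to "No".
flagged = (df1.join(ids, on="RevID", how="left")
              .withColumn("Flag", sf.when(sf.col("matched") &
                                          (sf.col("CheckingCol") == sf.col("CheckingCol2")), "Yes")
                                    .otherwise("No"))
              .drop("matched"))
flagged.show()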