Converting DataFrame structure in PySpark

I am new to Python and Spark programming.
I have data in format-1 given below, which captures values for different fields based on timestamp and trigger.
I need to convert this data into format-2, i.e., based on timestamp and key, group all the fields given in format-1 and create records as per format-2. In format-1, there are fields that do not have any key values (timestamp and trigger); these fields should be populated for all the records in format-2.
Can you please suggest the best approach to perform this in PySpark?
Format-1:
Event time (key-1) trig (key-2) data field_Name
------------------------------------------------------
2021-05-01T13:57:29Z 30Sec 10 A
2021-05-01T13:57:59Z 30Sec 11 A
2021-05-01T13:58:29Z 30Sec 12 A
2021-05-01T13:58:59Z 30Sec 13 A
2021-05-01T13:59:29Z 30Sec 14 A
2021-05-01T13:59:59Z 30Sec 15 A
2021-05-01T14:00:29Z 30Sec 16 A
2021-05-01T14:00:48Z OFF 17 A
2021-05-01T13:57:29Z 30Sec 110 B
2021-05-01T13:57:59Z 30Sec 111 B
2021-05-01T13:58:29Z 30Sec 112 B
2021-05-01T13:58:59Z 30Sec 113 B
2021-05-01T13:59:29Z 30Sec 114 B
2021-05-01T13:59:59Z 30Sec 115 B
2021-05-01T14:00:29Z 30Sec 116 B
2021-05-01T14:00:48Z OFF 117 B
2021-05-01T14:00:48Z OFF 21 C
2021-05-01T14:00:48Z OFF 31 D
Null Null 41 E
Null Null 51 F
Format-2:
Event Time Trig A B C D E F
--------------------------------------------------------------
2021-05-01T13:57:29Z 30Sec 10 110 Null Null 41 51
2021-05-01T13:57:59Z 30Sec 11 111 Null Null 41 51
2021-05-01T13:58:29Z 30Sec 12 112 Null Null 41 51
2021-05-01T13:58:59Z 30Sec 13 113 Null Null 41 51
2021-05-01T13:59:29Z 30Sec 14 114 Null Null 41 51
2021-05-01T13:59:59Z 30Sec 15 115 Null Null 41 51
2021-05-01T14:00:29Z 30Sec 16 116 Null Null 41 51
2021-05-01T14:00:48Z OFF 17 117 21 31 41 51
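One possible approach, sketched here in plain Python so the transformation itself is visible (in PySpark this would map roughly to `groupBy("Event time", "trig").pivot("field_Name").agg(first("data"))`, plus a join to broadcast the key-less fields; the variable names below are assumptions taken from the sample):

```python
from collections import defaultdict

# Hypothetical subset of the format-1 rows: (event_time, trig, data, field_name).
rows = [
    ("2021-05-01T13:57:29Z", "30Sec", 10, "A"),
    ("2021-05-01T14:00:48Z", "OFF", 17, "A"),
    ("2021-05-01T13:57:29Z", "30Sec", 110, "B"),
    ("2021-05-01T14:00:48Z", "OFF", 117, "B"),
    (None, None, 41, "E"),
]

fields = sorted({f for *_, f in rows})
# Fields with no key (timestamp/trig) apply to every output record.
keyless = {f: d for t, tr, d, f in rows if t is None}

# Pivot: one dict of {field: data} per (timestamp, trig) key.
pivoted = defaultdict(dict)
for t, trig, data, field in rows:
    if t is not None:
        pivoted[(t, trig)][field] = data

result = []
for (t, trig), vals in sorted(pivoted.items()):
    rec = {"Event Time": t, "Trig": trig}
    for f in fields:
        # Fall back to the key-less value, else null/None.
        rec[f] = vals.get(f, keyless.get(f))
    result.append(rec)
```

Each entry in `result` corresponds to one format-2 row, with the key-less fields (E, F in the sample) repeated on every record.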

Related

Address and smooth noise in sensor data

I have sensor data as below wherein, under the Data column, there are 6 rows containing the value 45 in between preceding and following rows containing the value 50. The requirement is to clean this data and impute it with 50 (the previous value) in the new_data column. Moreover, the number of noise records (shown as 45 in the table) may vary, as may their position among the rows.
Case 1 (sample data) :-
Sl.no  Timestamp         Data  New_data
1      1/1/2021 0:00:00  50    50
2      1/1/2021 0:15:00  50    50
3      1/1/2021 0:30:00  50    50
4      1/1/2021 0:45:00  50    50
5      1/1/2021 1:00:00  50    50
6      1/1/2021 1:15:00  50    50
7      1/1/2021 1:30:00  50    50
8      1/1/2021 1:45:00  50    50
9      1/1/2021 2:00:00  50    50
10     1/1/2021 2:15:00  50    50
11     1/1/2021 2:30:00  45    50
12     1/1/2021 2:45:00  45    50
13     1/1/2021 3:00:00  45    50
14     1/1/2021 3:15:00  45    50
15     1/1/2021 3:30:00  45    50
16     1/1/2021 3:45:00  45    50
17     1/1/2021 4:00:00  50    50
18     1/1/2021 4:15:00  50    50
19     1/1/2021 4:30:00  50    50
20     1/1/2021 4:45:00  50    50
21     1/1/2021 5:00:00  50    50
22     1/1/2021 5:15:00  50    50
23     1/1/2021 5:30:00  50    50
I am thinking of grouping this data ordered by timestamp ascending (like below), and then having a condition in place that checks group by group across the larger sample: if group 1 is the same as group 3, replace group 2 with group 1's values.
Sl.no  Timestamp         Data  New_data  group
1      1/1/2021 0:00:00  50    50        1
2      1/1/2021 0:15:00  50    50        1
3      1/1/2021 0:30:00  50    50        1
4      1/1/2021 0:45:00  50    50        1
5      1/1/2021 1:00:00  50    50        1
6      1/1/2021 1:15:00  50    50        1
7      1/1/2021 1:30:00  50    50        1
8      1/1/2021 1:45:00  50    50        1
9      1/1/2021 2:00:00  50    50        1
10     1/1/2021 2:15:00  50    50        1
11     1/1/2021 2:30:00  45    50        2
12     1/1/2021 2:45:00  45    50        2
13     1/1/2021 3:00:00  45    50        2
14     1/1/2021 3:15:00  45    50        2
15     1/1/2021 3:30:00  45    50        2
16     1/1/2021 3:45:00  45    50        2
17     1/1/2021 4:00:00  50    50        3
18     1/1/2021 4:15:00  50    50        3
19     1/1/2021 4:30:00  50    50        3
20     1/1/2021 4:45:00  50    50        3
21     1/1/2021 5:00:00  50    50        3
22     1/1/2021 5:15:00  50    50        3
23     1/1/2021 5:30:00  50    50        3
Moreover, there is also a need to add an exception: if the next group shows a similar pattern, do not change the data but retain it as is.
For example: if group 1 and group 3 are the same, impute group 2 with group 1's value.
But if group 2 and group 4 are the same, do not change group 3; retain the same data in New_data.
Case 2:-
Sl.no  Timestamp         Data  New_data  group
1      1/1/2021 0:00:00  50    50        1
2      1/1/2021 0:15:00  50    50        1
3      1/1/2021 0:30:00  50    50        1
4      1/1/2021 0:45:00  50    50        1
5      1/1/2021 1:00:00  50    50        1
6      1/1/2021 1:15:00  50    50        1
7      1/1/2021 1:30:00  50    50        1
8      1/1/2021 1:45:00  50    50        1
9      1/1/2021 2:00:00  50    50        1
10     1/1/2021 2:15:00  50    50        1
11     1/1/2021 2:30:00  45    50        2
12     1/1/2021 2:45:00  45    50        2
13     1/1/2021 3:00:00  45    50        2
14     1/1/2021 3:15:00  45    50        2
15     1/1/2021 3:30:00  45    50        2
16     1/1/2021 3:45:00  45    50        2
17     1/1/2021 4:00:00  50    50        3
18     1/1/2021 4:15:00  50    50        3
19     1/1/2021 4:30:00  50    50        3
20     1/1/2021 4:45:00  50    50        3
21     1/1/2021 5:00:00  50    50        3
22     1/1/2021 5:15:00  50    50        3
23     1/1/2021 5:30:00  50    50        3
24     1/1/2021 5:45:00  45    45        4
25     1/1/2021 6:00:00  45    45        4
26     1/1/2021 6:15:00  45    45        4
27     1/1/2021 6:30:00  45    45        4
28     1/1/2021 6:45:00  45    45        4
29     1/1/2021 7:00:00  45    45        4
30     1/1/2021 7:15:00  45    45        4
31     1/1/2021 7:30:00  45    45        4
I am reaching out for help coding this in PostgreSQL to address the above scenario. Please feel free to suggest alternative approaches to solve the problem.
The query below should answer the need.
The first query identifies the rows which correspond to a change of data.
The second query groups the rows between two successive changes of data and sets up the corresponding range of timestamps.
The third query is a recursive query which calculates the new_data in an iterative way according to the timestamp order.
The last query displays the expected result.
WITH RECURSIVE list As
(
SELECT no
, timestamp
, lag(data) OVER w AS previous
, data
, lead(data) OVER w AS next
, data IS DISTINCT FROM lag(data) OVER w AS first
, data IS DISTINCT FROM lead(data) OVER w AS last
FROM sensors
WINDOW w AS (ORDER BY timestamp ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
), range_list AS
(
SELECT tsrange(timestamp, lead(timestamp) OVER w, '[]') AS range
, previous
, data
, lead(next) OVER w AS next
, first
FROM list
WHERE first OR last
WINDOW w AS (ORDER BY timestamp ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
), rec_list (range, previous, data, next, new_data, arr) AS
(
SELECT range
, previous
, data
, next
, data
, array[range]
FROM range_list
WHERE previous IS NULL
UNION ALL
SELECT c.range
, p.data
, c.data
, c.next
, CASE
WHEN p.new_data IS NOT DISTINCT FROM c.next
THEN p.data
ELSE c.data
END
, p.arr || c.range
FROM rec_list AS p
INNER JOIN range_list AS c
ON lower(c.range) = upper(p.range) + interval '15 minutes'
WHERE NOT array[c.range] <# p.arr
AND first
)
SELECT s.*, r.new_data
FROM sensors AS s
INNER JOIN rec_list AS r
ON r.range #> s.timestamp
ORDER BY timestamp
See the test result in dbfiddle.
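For comparison, the same smoothing idea can be sketched outside SQL. The plain-Python illustration below (a hypothetical `smooth` helper, not part of the question's PostgreSQL setup) builds runs of consecutive equal values and imputes a run when its already-smoothed previous run matches the next run, which also honours the Case 2 exception:

```python
from itertools import groupby

def smooth(values):
    # Build (value, length) runs of consecutive equal readings
    # (the "groups" from the question).
    runs = [[k, len(list(g))] for k, g in groupby(values)]
    for i in range(1, len(runs) - 1):
        # A run sandwiched between two equal runs is noise; comparing against
        # runs[i-1][0] AFTER it may have been imputed mirrors the recursive
        # CTE's use of p.new_data, so a group whose neighbours differ
        # (Case 2, group 4 at the tail has no following run) is retained.
        if runs[i - 1][0] == runs[i + 1][0]:
            runs[i][0] = runs[i - 1][0]
    out = []
    for val, n in runs:
        out.extend([val] * n)
    return out
```

Running it on the Case 2 data imputes group 2 with 50 but leaves the trailing group 4 at 45, as required.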

Extracting all rows containing a specific datetime value (MATLAB)

I have a table which looks like this:
Entry number  Timestamp         Value1  Value2  Value3  Value4
5758          28-06-2018 16:30  34      63      34.2    60.9
5759          28-06-2018 17:00  33.5    58      34.9    58.4
5758          28-06-2018 16:30  34      63      34.2    60.9
5759          28-06-2018 17:00  33.5    58      34.9    58.4
5760          28-06-2018 17:30  33      53      35.2    58.5
5761          28-06-2018 18:00  33      63      35      57.9
5762          28-06-2018 18:30  33      61      34.6    58.9
5763          28-06-2018 19:00  33      59      34.1    59.4
5764          28-06-2018 19:30  28      89      33.5    64.2
5765          28-06-2018 20:00  28      89      33      66.1
5766          28-06-2018 20:30  28      83      32.5    67
5767          28-06-2018 21:00  29      89      32.2    68.4
Where '28-06-2018 16:30' is under one column. So I have 6 columns:
Entry number, Timestamp, Value1, Value2, Value3, Value4
I want to extract all rows that belong to '28-06-2018', i.e. all data pertaining to that day. My table is too large to fit more data here; however, the timestamps span a couple of months.
t=table([5758;5759],["28-06-2018 16:30";"29-06-2018 16:30"],[34;33.5],'VariableNames',{'Entry number','Timestamp','Value1'})
t =
2×3 table
Entry number Timestamp Value1
____________ __________________ ______
5758 "28-06-2018 16:30" 34
5759 "29-06-2018 16:30" 33.5
t(contains(t.('Timestamp'),"28-06"),:)
ans =
1×3 table
Entry number Timestamp Value1
____________ __________________ ______
5758 "28-06-2018 16:30" 34

Add null to the columns which are empty

I am trying to put null in the columns which are empty, using Perl or awk; to find the number of columns, the header's column count can be used. I tried a solution using Perl and some regex. The output looks very close to the desired output, but if you look carefully, row number one shows incorrect data.
Input data:
id name type foo-id zoo-id loo-id-1 moo-id-2
----- --------------- ----------- ------ ------ ------ ------
0 zoo123 soozoo 8 31 32
51 zoo213 soozoo 48 51
52 asz123 soozoo 47 52
53 asw122 soozoo 1003 53
54 fff123 soozoo 68 54
55 sss123 soozoo 75 55
56 ssd123 soozoo 76 56
Expected Output:
0 zoo123 soozoo 8 null 31 32
51 zoo213 soozoo 48 51 null null
52 asz123 soozoo 47 52 null null
53 asw122 soozoo 1003 53 null null
54 fff123 soozoo 68 54 null null
55 sss123 soozoo 75 55 null null
56 ssd123 soozoo 76 56 null null
Very close to solution but row-1 is showing incorrect data:
echo "$x"|grep -E '^[0-9]+' |perl -ne 'm/^([\d]+)(?:\s+([\w]+))?(?:\s+([-\w]+))?(?:\s+([\d]+))?(?:\s+([\d]+))?(?:\s+([\d]+))?(?:\s+([\d]+))?/;printf "%s %s %s %s %s %s %s\n", $1, $2//"null", $3//"null",$4//"null",$5//"null",$6//"null",$7//"null"' |column -t
0 zoo123 soozoo 8 31 32 null
51 zoo213 soozoo 48 51 null null
52 asz123 soozoo 47 52 null null
53 asw122 soozoo 1003 53 null null
54 fff123 soozoo 68 54 null null
55 sss123 soozoo 75 55 null null
56 ssd123 soozoo 76 56 null null
When you have a fixed-width string to parse, you'll find that unpack() is a better tool than regexes.
This should demonstrate how to do it. I'll leave it to you to convert it to a one-liner.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

while (<DATA>) {
    next if /^\D/;  # skip lines that don't start with a digit
    # The unpack() template was worked out by counting the column widths.
    my @data = map { /\S/ ? $_ : 'null' } unpack('A7A14A16A8A8A8A8');
    say join ' ', @data;
}
__DATA__
id name type foo-id zoo-id loo-id-1 moo-id-2
----- --------------- ----------- ------ ------ ------ ------
0 zoo123 soozoo 8 31 32
51 zoo213 soozoo 48 51
52 asz123 soozoo 47 52
53 asw122 soozoo 1003 53
54 fff123 soozoo 68 54
55 sss123 soozoo 75 55
56 ssd123 soozoo 76 56
Output:
$ perl unpack | column -t
0 zoo123 soozoo 8 null 31 32
51 zoo213 soozoo 48 51 null null
52 asz123 soozoo 47 52 null null
53 asw122 soozoo 1003 53 null null
54 fff123 soozoo 68 54 null null
55 sss123 soozoo 75 55 null null
56 ssd123 soozoo 76 56 null null
With GNU awk:
awk 'NR>2{                  # ignore first and second row (header lines)
  NF=7                      # fix number of columns
  for(i=1; i<=NF; i++)      # loop over all columns
    if($i ~ /^ *$/){        # if empty or only spaces
      $i="null"
    }
  print $0}' FIELDWIDTHS='7 14 16 8 8 10 8' OFS='|' file | column -s '|' -t
As one line:
awk 'NR>2{NF=7; for(i=1;i<=NF;i++) if($i ~ /^ *$/){$i="null"} print $0}' FIELDWIDTHS='7 14 16 8 8 10 8' OFS='|' file | column -s '|' -t
Output:
0 zoo123 soozoo 8 null 31 32
51 zoo213 soozoo 48 51 null null
52 asz123 soozoo 47 52 null null
53 asw122 soozoo 1003 53 null null
54 fff123 soozoo 68 54 null null
55 sss123 soozoo 75 55 null null
56 ssd123 soozoo 76 56 null null
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
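For comparison, the same fixed-width idea can be written in Python; the widths here are taken from the `unpack()`/`FIELDWIDTHS` templates above, and the `parse` helper is just an illustration:

```python
# Column widths assumed from the header of the sample data.
widths = [7, 14, 16, 8, 8, 10, 8]

def parse(line):
    """Slice a fixed-width line into cells, filling empties with 'null'."""
    out, pos = [], 0
    for w in widths:
        cell = line[pos:pos + w].strip()
        out.append(cell if cell else "null")
        pos += w
    return out
```

Like `unpack('A7...')` and gawk's `FIELDWIDTHS`, this keys on column position rather than whitespace, so an empty zoo-id cell cannot shift the remaining fields left.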

Combine 2 data frames with different columns in spark

I have 2 dataframes:
df1 :
Id purchase_count purchase_sim
12 100 1500
13 1020 1300
14 1010 1100
20 1090 1400
21 1300 1600
df2:
Id click_count click_sim
12 1030 2500
13 1020 1300
24 1010 1100
30 1090 1400
31 1300 1600
I need to get the combined data frame with results as :
Id click_count click_sim purchase_count purchase_sim
12 1030 2500 100 1500
13 1020 1300 1020 1300
14 null null 1010 1100
24 1010 1100 null null
30 1090 1400 null null
31 1300 1600 null null
20 null null 1090 1400
21 null null 1300 1600
I can't use union because of the different column names. Can someone suggest a better way to do this?
All you require is a full outer join on the Id column.
df1.join(df2, Seq("Id"), "full_outer")
// Since the Id column name is the same in both dataframes, a comparison like
// df1($"Id") === df2($"Id") would give you duplicate Id columns.
Please refer to the documentation below for future reference.
https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
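To see what the full outer join produces, here is a plain-Python sketch over a few of the sample rows; dicts keyed by Id stand in for the dataframes (in PySpark the equivalent call would be `df1.join(df2, on="Id", how="full_outer")`):

```python
# Hypothetical subsets of df1 (purchase_count, purchase_sim)
# and df2 (click_count, click_sim), keyed by Id.
df1 = {12: (100, 1500), 13: (1020, 1300), 14: (1010, 1100)}
df2 = {12: (1030, 2500), 13: (1020, 1300), 24: (1010, 1100)}

# Full outer join: keep every Id from either side, padding the
# missing side with None (null), click columns first as in the
# expected output.
joined = {
    k: df2.get(k, (None, None)) + df1.get(k, (None, None))
    for k in df1.keys() | df2.keys()
}
```

Ids present on only one side (14 here, or 24) come through with nulls for the other side's columns, which is exactly why union cannot express this.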

SQL Server: FAILING extra records

I have a tableA (ID int, Match varchar, code char, status char):
ID Match code Status
101 123 A
102 123 B
103 123 C
104 234 A
105 234 B
106 234 C
107 234 B
108 456 A
109 456 B
110 456 C
I want to populate status with 'FAIL' when:
for the same match, there exists a code different from (A, B or C),
or the code exists multiple times.
In other words, the code can only be (A, B, C) and it should exist only once for the same match; otherwise, fail. So the expected result would be:
ID Match code Status
101 123 A NULL
102 123 B NULL
103 123 C NULL
104 234 A NULL
105 234 B NULL
106 234 C NULL
107 234 B FAIL
108 456 A NULL
109 456 B NULL
110 456 C NULL
Thanks
No guarantees on efficiency here...
update tableA
set status = 'FAIL'
where ID not in (
    select min(ID)
    from tableA
    group by Match, code)
-- also fail any code outside (A, B, C), per the first condition
or code not in ('A', 'B', 'C')
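The intended rule can also be sanity-checked outside SQL. A plain-Python sketch over hypothetical rows: the lowest ID per (Match, code) pair survives, and anything duplicated, or carrying a code outside A/B/C, FAILs:

```python
# Hypothetical rows: (ID, Match, code).
rows = [
    (101, "123", "A"), (102, "123", "B"), (103, "123", "C"),
    (104, "234", "A"), (105, "234", "B"), (106, "234", "C"),
    (107, "234", "B"),
]

# Lowest ID per (Match, code) pair, mirroring min(ID) ... group by Match, code.
first_ids = {}
for rid, match, code in rows:
    key = (match, code)
    first_ids[key] = min(first_ids.get(key, rid), rid)

# FAIL a duplicate pair or an invalid code; otherwise status stays NULL/None.
status = {
    rid: "FAIL"
    if code not in ("A", "B", "C") or first_ids[(match, code)] != rid
    else None
    for rid, match, code in rows
}
```

Row 107 duplicates (234, B) held by row 105, so it is the only FAIL in this sample, matching the expected result.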