Add null to the columns which are empty - perl

I am trying to put null into the columns which are empty, using perl or awk; to find the number of columns, the header's column count can be used. I tried a solution using perl and some regex. The output looks very close to the desired output, but if you look carefully, row 1 shows incorrect data.
Input data:
id name type foo-id zoo-id loo-id-1 moo-id-2
----- --------------- ----------- ------ ------ ------ ------
0 zoo123 soozoo 8 31 32
51 zoo213 soozoo 48 51
52 asz123 soozoo 47 52
53 asw122 soozoo 1003 53
54 fff123 soozoo 68 54
55 sss123 soozoo 75 55
56 ssd123 soozoo 76 56
Expected Output:
0 zoo123 soozoo 8 null 31 32
51 zoo213 soozoo 48 51 null null
52 asz123 soozoo 47 52 null null
53 asw122 soozoo 1003 53 null null
54 fff123 soozoo 68 54 null null
55 sss123 soozoo 75 55 null null
56 ssd123 soozoo 76 56 null null
Very close to the solution, but row 1 shows incorrect data:
echo "$x"|grep -E '^[0-9]+' |perl -ne 'm/^([\d]+)(?:\s+([\w]+))?(?:\s+([-\w]+))?(?:\s+([\d]+))?(?:\s+([\d]+))?(?:\s+([\d]+))?(?:\s+([\d]+))?/;printf "%s %s %s %s %s %s %s\n", $1, $2//"null", $3//"null",$4//"null",$5//"null",$6//"null",$7//"null"' |column -t
0 zoo123 soozoo 8 31 32 null
51 zoo213 soozoo 48 51 null null
52 asz123 soozoo 47 52 null null
53 asw122 soozoo 1003 53 null null
54 fff123 soozoo 68 54 null null
55 sss123 soozoo 75 55 null null
56 ssd123 soozoo 76 56 null null

When you have a fixed-width string to parse, you'll find that unpack() is a better tool than regexes. (Your optional regex groups match left to right, so after a gap the remaining values shift left; that is why row 1 comes out wrong.)
This should demonstrate how to do it. I'll leave it to you to convert it to a one-liner.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

while (<DATA>) {
    next if /^\D/;  # Skip lines that don't start with a digit.
    # I worked out the unpack() template by counting columns;
    # with no second argument, unpack() parses $_.
    my @data = map { /\S/ ? $_ : 'null' } unpack('A7A14A16A8A8A8A8');
    say join ' ', @data;
}
__DATA__
id name type foo-id zoo-id loo-id-1 moo-id-2
----- --------------- ----------- ------ ------ ------ ------
0 zoo123 soozoo 8 31 32
51 zoo213 soozoo 48 51
52 asz123 soozoo 47 52
53 asw122 soozoo 1003 53
54 fff123 soozoo 68 54
55 sss123 soozoo 75 55
56 ssd123 soozoo 76 56
Output:
$ perl unpack | column -t
0 zoo123 soozoo 8 null 31 32
51 zoo213 soozoo 48 51 null null
52 asz123 soozoo 47 52 null null
53 asw122 soozoo 1003 53 null null
54 fff123 soozoo 68 54 null null
55 sss123 soozoo 75 55 null null
56 ssd123 soozoo 76 56 null null

With GNU awk:
awk 'NR>2{                    # skip the first and second row (header and ruler)
    NF=7                      # fix the number of columns
    for(i=1; i<=NF; i++)      # loop over all columns
        if($i ~ /^ *$/){      # if empty or only spaces
            $i="null"
        }
    print $0
}' FIELDWIDTHS='7 14 16 8 8 10 8' OFS='|' file | column -s '|' -t
As one line:
awk 'NR>2{NF=7; for(i=1;i<=NF;i++) if($i ~ /^ *$/){$i="null"} print $0}' FIELDWIDTHS='7 14 16 8 8 10 8' OFS='|' file | column -s '|' -t
Output:
0 zoo123 soozoo 8 null 31 32
51 zoo213 soozoo 48 51 null null
52 asz123 soozoo 47 52 null null
53 asw122 soozoo 1003 53 null null
54 fff123 soozoo 68 54 null null
55 sss123 soozoo 75 55 null null
56 ssd123 soozoo 76 56 null null
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
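In the spirit of the question's own hint about using the header to find the columns, the widths need not be hard-coded at all: the dashed ruler line gives the start of every column. A small Python sketch of that idea (not from the original answers; it reads the table on stdin):

import re
import sys

lines = sys.stdin.read().splitlines()
# Line 2 is the dashed ruler; each run of dashes marks a column start.
starts = [m.start() for m in re.finditer(r'-+', lines[1])] + [None]

for line in lines[2:]:
    cells = [line[starts[i]:starts[i + 1]].strip() or 'null'
             for i in range(len(starts) - 1)]
    print(' '.join(cells))

Run as, e.g., python3 fill_null.py < file | column -t (fill_null.py is a hypothetical name).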

Related

Extracting all rows containing a specific datetime value (MATLAB)

I have a table which looks like this:
Entry number   Timestamp          Value1   Value2   Value3   Value4
5758           28-06-2018 16:30   34       63       34.2     60.9
5759           28-06-2018 17:00   33.5     58       34.9     58.4
5760           28-06-2018 17:30   33       53       35.2     58.5
5761           28-06-2018 18:00   33       63       35       57.9
5762           28-06-2018 18:30   33       61       34.6     58.9
5763           28-06-2018 19:00   33       59       34.1     59.4
5764           28-06-2018 19:30   28       89       33.5     64.2
5765           28-06-2018 20:00   28       89       33       66.1
5766           28-06-2018 20:30   28       83       32.5     67
5767           28-06-2018 21:00   29       89       32.2     68.4
Here '28-06-2018 16:30' sits under a single column, so I have 6 columns:
Entry number, Timestamp, Value1, Value2, Value3, Value4
I want to extract all rows that belong to '28-06-2018', i.e. all data pertaining to that day. Since my table is too large I couldn't fit more data here, but the Timestamp entries span a couple of months.
Build a small example table, then filter on the Timestamp strings:
t=table([5758;5759],["28-06-2018 16:30";"29-06-2018 16:30"],[34;33.5],'VariableNames',{'Entry number','Timestamp','Value1'})
t =
2×3 table
Entry number Timestamp Value1
____________ __________________ ______
5758 "28-06-2018 16:30" 34
5759 "29-06-2018 16:30" 33.5
contains builds a logical row index over the Timestamp strings, selecting every row from that day:
t(contains(t.('Timestamp'),"28-06"),:)
ans =
1×3 table
Entry number Timestamp Value1
____________ __________________ ______
5758 "28-06-2018 16:30" 34

pyspark - converting DF Structure

I am new to Python and Spark programming.
I have data in format-1 given below, captured for different fields based on timestamp and trigger.
I need to convert this data into format-2, i.e. group all the fields given in format-1 by timestamp and key and create records as per format-2. In format-1 there are fields that have no key values (timestamp and trigger); these fields should be populated on every record in format-2.
Can you please suggest the best approach to perform this in pyspark?
Format-1:
Event time (key-1) trig (key-2) data field_Name
------------------------------------------------------
2021-05-01T13:57:29Z 30Sec 10 A
2021-05-01T13:57:59Z 30Sec 11 A
2021-05-01T13:58:29Z 30Sec 12 A
2021-05-01T13:58:59Z 30Sec 13 A
2021-05-01T13:59:29Z 30Sec 14 A
2021-05-01T13:59:59Z 30Sec 15 A
2021-05-01T14:00:29Z 30Sec 16 A
2021-05-01T14:00:48Z OFF 17 A
2021-05-01T13:57:29Z 30Sec 110 B
2021-05-01T13:57:59Z 30Sec 111 B
2021-05-01T13:58:29Z 30Sec 112 B
2021-05-01T13:58:59Z 30Sec 113 B
2021-05-01T13:59:29Z 30Sec 114 B
2021-05-01T13:59:59Z 30Sec 115 B
2021-05-01T14:00:29Z 30Sec 116 B
2021-05-01T14:00:48Z OFF 117 B
2021-05-01T14:00:48Z OFF 21 C
2021-05-01T14:00:48Z OFF 31 D
Null Null 41 E
Null Null 51 F
Format-2:
Event Time Trig A B C D E F
--------------------------------------------------------------
2021-05-01T13:57:29Z 30Sec 10 110 Null Null 41 51
2021-05-01T13:57:59Z 30Sec 11 111 Null Null 41 51
2021-05-01T13:58:29Z 30Sec 12 112 Null Null 41 51
2021-05-01T13:58:59Z 30Sec 13 113 Null Null 41 51
2021-05-01T13:59:29Z 30Sec 14 114 Null Null 41 51
2021-05-01T13:59:59Z 30Sec 15 115 Null Null 41 51
2021-05-01T14:00:29Z 30Sec 16 116 Null Null 41 51
2021-05-01T14:00:48Z OFF 17 117 21 31 41 51
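One way to approach this: pivot the keyed rows so each field_Name becomes a column, then attach the keyless fields (E, F) as constants on every row. A minimal pyspark sketch of that idea, using a subset of the sample data; the column names event_time, trig, data and field_name are assumptions, not from the original post:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Subset of the format-1 sample; column names are assumed.
rows = [
    ("2021-05-01T13:57:29Z", "30Sec", 10, "A"),
    ("2021-05-01T14:00:48Z", "OFF", 17, "A"),
    ("2021-05-01T13:57:29Z", "30Sec", 110, "B"),
    ("2021-05-01T14:00:48Z", "OFF", 117, "B"),
    ("2021-05-01T14:00:48Z", "OFF", 21, "C"),
    ("2021-05-01T14:00:48Z", "OFF", 31, "D"),
    (None, None, 41, "E"),
    (None, None, 51, "F"),
]
df = spark.createDataFrame(rows, ["event_time", "trig", "data", "field_name"])

# Pivot the rows that have keys: one output column per field_name.
keyed = df.where(F.col("event_time").isNotNull())
wide = (keyed.groupBy("event_time", "trig")
             .pivot("field_name")
             .agg(F.first("data")))

# Keyless fields (E, F) become constants repeated on every output row.
for r in df.where(F.col("event_time").isNull()).collect():
    wide = wide.withColumn(r["field_name"], F.lit(r["data"]))

wide.orderBy("event_time").show(truncate=False)

Missing combinations (e.g. C and D at the 30Sec timestamps) come out as null from the pivot, matching format-2.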

Add unique rows for each group when similar group repeats after certain rows

Can anyone help me get a unique group number?
I need to assign a unique number to each group, even when the same group repeats after some other groups.
I have the following data:
id version product startdate enddate
123 0 2443 2010/09/01 2011/01/02
123 1 131 2011/01/03 2011/03/09
123 2 131 2011/08/10 2012/09/10
123 3 3009 2012/09/11 2014/03/31
123 4 668 2014/04/01 2014/04/30
123 5 668 2014/05/01 2016/01/01
123 6 668 2016/01/02 2017/09/08
123 7 131 2017/09/09 2017/10/10
123 8 131 2018/10/11 2019/01/01
123 9 550 2019/01/02 2099/01/01
select *,
dense_rank()over(partition by id order by id,product)
from table
Expected results:
id version product startdate enddate count
123 0 2443 2010/09/01 2011/01/02 1
123 1 131 2011/01/03 2011/03/09 2
123 2 131 2011/08/10 2012/09/10 2
123 3 3009 2012/09/11 2014/03/31 3
123 4 668 2014/04/01 2014/04/30 4
123 5 668 2014/05/01 2016/01/01 4
123 6 668 2016/01/02 2017/09/08 4
123 7 131 2017/09/09 2017/10/10 5
123 8 131 2018/10/11 2019/01/01 5
123 9 550 2019/01/02 2099/01/01 6
Try the following gaps-and-islands approach: LAG flags each row where product differs from the previous row, and 1 plus the running SUM of those flags gives the group number.
SELECT
  id, version, product, startdate, enddate,
  1 + SUM(v) OVER (PARTITION BY id ORDER BY version) n
FROM
(
  SELECT
    *,
    IIF(LAG(product) OVER (PARTITION BY id ORDER BY version) <> product, 1, 0) v
  FROM TestTable
) q

Combine 2 data frames with different columns in spark

I have 2 dataframes:
df1 :
Id purchase_count purchase_sim
12 100 1500
13 1020 1300
14 1010 1100
20 1090 1400
21 1300 1600
df2:
Id click_count click_sim
12 1030 2500
13 1020 1300
24 1010 1100
30 1090 1400
31 1300 1600
I need to get the combined data frame with results as :
Id click_count click_sim purchase_count purchase_sim
12 1030 2500 100 1500
13 1020 1300 1020 1300
14 null null 1010 1100
24 1010 1100 null null
30 1090 1400 null null
31 1300 1600 null null
20 null null 1090 1400
21 null null 1300 1600
I can't use union because of the different column names. Can someone suggest a better way to do this?
All you require is a full outer join on the Id column.
df1.join(df2, Seq("Id"), "full_outer")
// Since the Id column name is the same in both dataframes, a comparison
// join like df1($"Id") === df2($"Id") would give you duplicate Id columns.
Please refer to the documentation below for future reference:
https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
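For pyspark users, the same join is (a sketch, assuming the two DataFrames are named df1 and df2 as above):

# Passing the join key as a list keeps a single Id column in the result.
combined = df1.join(df2, ["Id"], "full_outer")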

compare & merge files-unix

cat file1.txt
Id leng sal mon
25671 34343 56565 5565
44888 56565 45554 6868
23343 23423 26226 6224
77765 88688 87464 6848
66776 23343 63463 4534
cat file2.txt
Id sp He Ho
25671 33 45 35
34353 64 75 33
77765 56 56 67
cat output.txt
Id leng sal sp He Ho
25671 34343 56565 33 45 35
77765 88688 87464 56 56 67
Compare file1.txt and file2.txt; if column 1 is the same in both files, report only the matched rows, merged, in a separate output file (output.txt), ignoring the 4th column of file1.txt while merging.
I have tried cat file1.txt file2.txt | sort -u > output.txt, but it does not work. Any awk trick or a solution using join is appreciated.
# 1st pass (file1.txt): remember columns 2 and 3 for each Id (column 4 is dropped).
# 2nd pass (file2.txt): print the merged row when the Id matches.
awk 'NR==FNR{ s[$1] = $2 " " $3 }
NR!=FNR{ if( $1 in s ) print $1, s[$1], $2, $3, $4 }' file1.txt file2.txt
Or with join (the -o field list is given as a single comma-separated argument):
join -o 0,1.2,1.3,2.2,2.3,2.4 <(sort file1.txt) <(sort file2.txt) | sort -n | tr ' ' '\t'
This might work for you (GNU sed):
cat <<\! >file1.txt
> Id leng sal mon
> 25671 34343 56565 5565
> 44888 56565 45554 6868
> 23343 23423 26226 6224
> 77765 88688 87464 6848
> 66776 23343 63463 4534
> !
cat <<\! >file2.txt
> Id sp He Ho
> 25671 33 45 35
> 34353 64 75 33
> 77765 56 56 67
> !
First convert file2.txt into a sed script:
sed 's|^\(\S*\)\s*\(.*\)|/^\1/s/\\(\\(\\S*\\s*\\)\\{3\\}\\).*/\\1\2/p|' file2.txt
which produces:
/^Id/s/\(\(\S*\s*\)\{3\}\).*/\1sp He Ho/p
/^25671/s/\(\(\S*\s*\)\{3\}\).*/\133 45 35/p
/^34353/s/\(\(\S*\s*\)\{3\}\).*/\164 75 33/p
/^77765/s/\(\(\S*\s*\)\{3\}\).*/\156 56 67/p
Then feed that generated script to a second sed run over file1.txt:
sed 's|^\(\S*\)\s*\(.*\)|/^\1/s/\\(\\(\\S*\\s*\\)\\{3\\}\\).*/\\1\2/p|' file2.txt |
sed -nf - file1.txt
Id leng sal sp He Ho
25671 34343 56565 33 45 35
77765 88688 87464 56 56 67
Explanation:
Each line of file2.txt is converted into a sed substitution command keyed on its Id. Applying the generated script to file1.txt with sed -nf keeps the first three columns of every matching line (dropping mon) and appends that Id's columns from file2.txt.