Emulate SAS' data step FIRST variable using Linux command-line tools

Let's say I have the first column of the following dataset in a file, and I want to emulate the flag in the second column so that I export only the rows where flag = 1 (the dataset is pre-sorted by the target column):
1 1
1 0
1 0
2 1
2 0
2 0
I could run awk '!seen[$1]++ {print}' dataset, but I would run into a problem for very large files (seen keeps growing). Is there an alternative that handles this without tracking every single unique value of the target column (here column #1)? Thanks.

So you only have the first column, and would like to generate the second? I think a slightly different awk command could work:
awk '{if (last==$1) {flag=0} else {last=$1; flag=1}; print $0,flag}' file.txt
Basically you just check whether the first field matches the last one you've seen. Since the file is sorted, you don't have to keep track of everything you've seen, only the last value, to know whether the current one is different.
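If you actually want to export only the flag = 1 rows (the first row of each group) rather than print the flag, the same last-value trick filters directly; a minimal sketch, assuming the file is sorted on column 1 as stated:
awk '$1 != last { print; last = $1 }' dataset
Only the most recent value is kept in memory, so this stays constant-size no matter how many distinct groups the file contains.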

Seems like grep would be fine for this (anchoring the pattern to the end of the line, so values like 10 elsewhere on the line can't match):
$ grep ' 1$' dataset

Related

Filtering tshark output for .csv. Preventing errors from missing fields

I am trying to filter a pcap file in tshark with a lua script and ultimately output it to a .csv. I am most of the way there, but I am still running into a few issues.
This is what I have so far:
tshark -nr -V -X lua_script:wireshark_dissector.lua -r myfile.pcap -T fields -e frame.time_epoch -e Something_UDP.field1 -e Something_UDP.field2 -e Something_UDP.field3 -e Something_UDP.field4 -e Something_UDP.field5 -e Something_UDP.field6 -e Something_UDP.field15 -e Something_UDP.field16 -e Something_UDP.field18 -e Something_UDP.field22 -E separator=,
Here is an example of what the frames look like, sort of.
frame 1
time: 1626806198.437893000
Something_UDP.field1: 0
Something_UDP.field2: 1
Something_UDP.field3:1
Something_UDP.field5:1
Something_UDP.field6:1
frame 2
time: 1626806198.439970000
Something_UDP.field8: 1
Something_UDP.field9: 0
Something_UDP.field13: 0
Something_UDP.field14: 0
frame 3
time: 1626806198.440052000
Something_UDP.field15: 1
Something_UDP.field16: 0
Something_UDP.field18: 1
Something_UDP.field19:1
Something_UDP.field20:1
Something_UDP.field22: 0
Something_UDP.field24: 0
The output I am looking for would be
1626806198.437893000,0,1,1,,1,1,1,,,,,
1626806198.440052000,,,,,,,,,1,0,,1,1,1,,0,0,,,,
That is, if the frame contains one of the fields I am looking for, it will output its value followed by a comma, but if that field isn't there it will output just a comma. One issue is that not every frame contains info that I am interested in, and I don't want those frames to be output. Part of the issue with that is that one of the fields I need is the epoch time, which will be in every frame, but it is only important if the other fields are there. I could use awk or grep to do this, but I am wondering if it can all be done inside tshark. The other issue is that the fields being requested will come from a text file, and there may be fields in the text file that don't actually exist in the pcap file; if that happens I get a "tshark: Some fields aren't valid:" error.
In short, I have 2 issues.
1: I need to print data only if the field names match, but not if the only match is the epoch time.
2: I need it to work even if one of the fields being requested doesn't exist.
I need to print data only if the field names match, but not if the only match is the epoch time.
Try using a display filter that mentions all the field names in which you're interested, with an "or" separating them, such as
-Y "Something_UDP.field1 or Something_UDP.field2 or Something_UDP.field3 or Something_UDP.field4 or Something_UDP.field5 or Something_UDP.field6 or Something_UDP.field15 or Something_UDP.field16 or Something_UDP.field18 or Something_UDP.field22"
so that only packets containing at least one of those fields will be processed.
I need it to work even if one of the fields being requested doesn't exist.
Then you will need to construct the command line on the fly, avoiding field names that aren't valid.
One way, in a script, to test whether a field is valid is to use the dftest command:
dftest Something_UDP.field1 >/dev/null 2>&1
will exit with a status of 0 if there's a field named "Something_UDP.field1" and will exit with a status of 2 if there isn't; if the scripting language you're using can check the exit status of a command to see if it succeeds, you can use that.
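For example, a minimal POSIX-shell sketch combining both suggestions (fields.txt is a hypothetical file with one field name per line; everything else comes from the question):
args=""
filter=""
while read -r f; do
    # keep only the names dftest accepts as a valid display filter
    if dftest "$f" >/dev/null 2>&1; then
        args="$args -e $f"
        filter="$filter${filter:+ or }$f"
    fi
done < fields.txt
# $args is deliberately unquoted so it word-splits into separate -e options;
# $filter is quoted so the whole display filter stays one argument
tshark -r myfile.pcap -X lua_script:wireshark_dissector.lua -T fields \
    -e frame.time_epoch $args -Y "$filter" -E separator=,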

Is there any way to encode multiple columns in a CSV using base64 in shell?

I have a requirement to replace multiple columns of a CSV file with their base64-encoded values; the encoding should be applied to some columns of the file, but the first line must be kept unaffected, as it contains the header. I have tried it for one column as below, but because I told it to proceed only after skipping the first line of the file, the header is not printed:
gawk 'BEGIN { FS="|"; OFS="|" } NR >=2 { cmd="echo "$4" | base64 -w 0";cmd | getline x;close(cmd); print $1,$2,$3,x}' awktest
o/p:
12|A|B|Qw==
13|C|D|RQ==
36|Z|V|VQ==
Qs: It is not showing the header in the output. What should I do to produce the header in the output? Also, can I use a loop here to replace multiple columns?
input:
10|A|B|C|5|T|R
12|A|B|C|6|eee|ff
13|C|D|E|9|dr|xrdd
36|Z|V|U|7|xc|xd
Required output:
10|A|B|C|5|T|R
12|A|B|encodedvalue|6|encodedvalue|ff
13|C|D|encodedvalue|9|encodedvalue|xrdd
36|Z|V|encodedvalue|7|encodedvalue|xd
Is this possible? I have researched a lot but could not find a proper explanation. I am new to shell. Kindly help, many thanks!
It looks like you can just sequence conditionals. This may not be the best way of solving the header issue, but it's intuitive:
BEGIN { FS="|"; OFS="|" } NR == 1 { print } NR >= 2 { cmd = "echo " $4 " | base64 -w 0"; cmd | getline x; close(cmd); print $1, $2, $3, x }
As for using a loop to affect multiple columns: loops in bash are hard, but awk is its own language and does have C-style looping constructs, so you don't need a shell loop at all. If there's only a reasonable number of fields that need modifying, you can parameterize the existing command by the field index and loop over the indexes inside awk, as in the sketch below. Doing it all in a single pass of awk is also more performant than piping through several instances of the command.
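A minimal sketch of that idea in a single awk pass (the column list "4 6" matches the required output above; it assumes GNU awk and field values that are safe to interpolate into a shell command):
gawk 'BEGIN { FS = OFS = "|"; split("4 6", cols, " ") }
NR == 1 { print; next }                # header passes through untouched
{
    for (i in cols) {
        c = cols[i]
        # printf avoids encoding a trailing newline, unlike echo
        cmd = "printf %s " $c " | base64 -w 0"
        cmd | getline enc
        close(cmd)
        $c = enc                       # overwrite the field in place
    }
    print                              # record is rebuilt with OFS="|"
}' awktest
For the sample input this leaves the header and fields 1-3, 5, and 7 alone and replaces fields 4 and 6 with their encoded values.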

How to use the fish shell to add numbers to files

I have a very simple mp3 player, and the order in which it plays audio files is based on the file names; the rule is that there must be a 3-digit number at the beginning of each file name, such as:
001file.mp3
002file.mp3
003file.mp3
I want to write a fish shell function sortmp3 to add numbers to the files of a directory. Say directory myfiles contains the files:
aaa.mp3
bbb.mp3
ccc.mp3
When I run sortmp3 myfiles, the file names will be changed to:
001aaa.mp3
002bbb.mp3
003ccc.mp3
But my questions are:
How do I generate sequential numbers?
How do I make sure each number is exactly 3 digits?
I would write this, which makes no assumptions about how many files there are in a directory:
function sortmp3
    set -l files *
    set -l i
    for i in (seq (count $files))
        echo mv $files[$i] (printf "%03d%s" $i $files[$i])
    end
end
Remove the "echo" if you like how it works.
You can generate sequential numbers with the seq tool, an external program.
This will only take care of the first part; it won't pad to three characters.
To do that, there are a variety of choices:
printf '%s\n' 00(seq 0 99) | rev | cut -c 1-3 | rev
printf '%s\n' 00(seq 0 99) | sed 's/^.*\(...\)$/\1/'
The 00(seq 0 99) part will generate the numbers from "0" to "99" with two zeroes prepended, i.e. "000" through "0099". The later parts of the pipeline keep only the last three characters of each, removing the superfluous zero from the four-character entries.
Or with the next fish version, you can use the new string tool:
string sub -s -3 -- 00(seq 0 99)
Depending on your specific situation, you should use the "seq" command to generate sequential numbers or the "math" command to increment a counter. To format the number with a predictable number of leading zeros, use the "printf" command:
set idx 12
printf '%03d' $idx
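Putting the counter and the padding together, a rough sketch of the rename loop (hypothetical, and assumed to be run from inside the target directory):
set count 1
for f in *.mp3
    mv $f (printf '%03d' $count)$f
    set count (math $count + 1)
end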

Diff command - avoiding monolithic grouping of consecutive differing lines

Playing around with the standard Linux diff command, I could not find a way to avoid the following type of grouping in its output (the output listings here assume the unified format).
This question is aimed at the case where each line differs only a little from its counterpart in the other file, so it's more useful to see each line next to its counterpart.
I would like instead of having groups like this show up in the comparison output:
- line 1
- line 2
- line 3
+ line 1 modified
+ line 2 modified
+ line 3 modified
To get this:
- line 1
+ line 1 modified
- line 2
+ line 2 modified
- line 3
+ line 3 modified
Of course, this is a convenience question, as this can be accomplished by writing your own code to post-process the diff output, or by diverging from the LCS algorithm with your own algorithm. I don't think variants like wdiff would help much: the plain diff -U0 output format fits my needs very well except for this grouping property, whereas wdiff introduces other aspects that are not optimal for my case.
I'm looking for a command-line way, or a library that can be used in code, not a UI tool.
I was trying to solve this myself. The closest I got was this:
diff -y -W 10000 file1 file2 | grep '|' | sed 's/\s*|\s*/\n/g'
The one issue is that this assumes there are no whitespace differences at the beginning of the lines (or that you don't care about them).
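For the six-line example above, that pipeline pairs each line with its counterpart, though the -/+ markers are lost because the pairing is reconstructed from diff's side-by-side mode rather than the unified format:
line 1
line 1 modified
line 2
line 2 modified
line 3
line 3 modified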

Creating a hash with regex matches in Perl

Let's say I have a file like the one below, and I want to store all the decimal numbers in a hash:
hello world 10 20
world 10 10 10 10 hello 20
hello 30 20 10 world 10
I was looking at a similar question, and this worked fine:
> perl -lne 'push @a,/\d+/g;END{print "@a"}' temp
10 20 10 10 10 10 20 30 20 10 10
Then what I needed was to count the number of occurrences of each match.
For this I think it would be better to store all the matches in a hash and assign an incrementing value to each key.
So I tried:
perl -lne '$a{$1}++ for ($_=~/(\d+)/g);END{foreach(keys %a){print "$_.$a{$_}"}}' temp
which gives me an output of:
> perl -lne '$a{$1}++ for ($_=~/(\d+)/g);END{foreach(keys %a){print "$_.$a{$_}"}}' temp
10.4
20.7
Can anybody tell me where I went wrong?
The output I expect is:
10.7
20.3
30.1
Although I can do this in awk, I would like to do it only in Perl.
Also, the order of the output is not a concern for me.
$a{$1}++ for ($_=~/(\d+)/g);
This should be
$a{$_}++ for ($_=~/(\d+)/g);
and can be simplified to
$a{$_}++ for /\d+/g;
The reason for this is that /\d+/g creates a list of matches, which is then iterated over by for. The current element is in $_. I imagine $1 would contain whatever was left in there by the last match, but it's definitely not what you want to use in this case.
Another option would be this:
$a{$1}++ while ($_=~/(\d+)/g);
This does what I think you expected your code to do: loop over each successful match as the matches happen. Thus the $1 will be what you think it is.
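Putting that correction into the original one-liner reproduces the expected counts (the order of the keys will vary, since Perl hashes are unordered):
> perl -lne '$a{$_}++ for /\d+/g;END{foreach(keys %a){print "$_.$a{$_}"}}' temp
10.7
20.3
30.1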
Just to be clear about the difference:
The single-argument for loop in Perl means "do something for each element of a list":
for (@array)
{
    # do something to each array element
}
So in your code, a list of matches was built first, and only after the whole list of matches was found did you have the opportunity to do something with the results. $1 got reset on each match as the list was being built, but by the time your code was run, it was set to the last match on that line. That is why your results didn't make sense.
On the other hand, a while loop means "check if this condition is true each time, and keep going until the condition is false". Therefore, the code in a while loop will be executed on each match of a regex, and $1 has the value for that match.
Another time this difference is important in Perl is file processing. for (<FILE>) { ... } reads the entire file into memory first, which is wasteful. It is recommended to use while (<FILE>) instead, because then you go through the file line by line and keep only the information you want.
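A minimal sketch of the contrast (FILE stands for any open filehandle, and process() is a placeholder):
# builds a list of every line in the file before the loop starts
for my $line (<FILE>) {
    process($line);
}
# reads one line per iteration; only the current line is held in memory
while (my $line = <FILE>) {
    process($line);
}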