Looking for an awk (or sed) one-liner to remove lines from the output if the first field is a duplicate.
An example for removing duplicate lines I've seen is:
awk 'a !~ $0; {a=$0}'
I tried using it as a basis, with no luck (I thought changing the $0's to $1's would do the trick, but it didn't seem to work).
awk '{ if (a[$1]++ == 0) print $0; }' "$@"
This is a standard (very simple) use for associative arrays.
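For example, on some made-up whitespace-separated input (the data here is only for illustration):

printf '%s\n' 'k1 first' 'k2 other' 'k1 again' | awk '{ if (a[$1]++ == 0) print $0; }'
# prints:
# k1 first
# k2 other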
This is how to remove the duplicates:
awk '!_[$1]++' file
If you're open to using Perl:
perl -ane 'print if ! $a{$F[0]}++' file
-a autosplits the line into the @F array, which is indexed starting at 0
The %a hash remembers if the first field has already been seen
This related solution assumes your field separator is a comma, rather than whitespace
perl -F, -ane 'print if ! $a{$F[0]}++' file
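For instance, on a small invented comma-separated sample:

printf '%s\n' 'id1,foo' 'id2,bar' 'id1,baz' | perl -F, -ane 'print if ! $a{$F[0]}++'
# prints:
# id1,foo
# id2,bar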
This prints unique lines, as well as a single copy of each duplicated line:
awk '!a[$1]++' file_name
I have multiple lines:
QQQQl123
hsdhjhksd
QQQQl234
ajkdkjsdh
QQQQl564
I want to print all lines matching QQQQl[0-9]+, like:
QQQQl123
QQQQl234
QQQQl564
How can I do this using Perl?
I tried:
$ perl -0777pe '/QQQQl[0-9]+/' filename
It shows nothing.
perl -we 'while(<>){ next unless $_=~/QQQQl[0-9]+/; print $_; }' < filename
perl -ne 'print if /QQQQl[0-9]+/' filename
Or, if, for some reason, you insist on using -0777, you could do
perl -0777nE 'say for /QQQQl[0-9]+/g' filename
(or print "$_\n" instead of say)
Your code doesn't work because /QQQQl[0-9]+/ merely returns true (since $_ does indeed contain that pattern); you never ask Perl to do anything based on that return value.
-n is preferable to -p in that case, since you don't want to print every line but only some (-p automatically prints every line, and there is very little you can do about it).
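For illustration, here is the sample input from the question fed to the -n version (printf is used only to reproduce the lines):

printf '%s\n' QQQQl123 hsdhjhksd QQQQl234 ajkdkjsdh QQQQl564 |
  perl -ne 'print if /QQQQl[0-9]+/'
# prints:
# QQQQl123
# QQQQl234
# QQQQl564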
This question is very much the same as this except that I am looking to do this as fast as possible, doing only a single pass of the (unfortunately gzip compressed) file.
Given the pattern CAPTURE and input
1:.........
...........
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
...........
1000:......
Print:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
Can this be accomplished with a regular expression?
I vaguely remember that this kind of grammar cannot be captured by a regular expression, but I'm not quite sure, since regular expressions these days provide lookaheads, etc.
You can buffer the lines until you see a line that contains CAPTURE, treating the first occurrence of the pattern specially.
#!/usr/bin/env perl
use warnings;
use strict;
my $first = 1;
my @buf;
while ( my $line = <> ) {
    push @buf, $line unless $first;
    if ( $line =~ /CAPTURE/ ) {
        if ($first) {
            @buf = ($line);
            $first = 0;
        }
        print @buf;
        @buf = ();
    }
}
Feed the input into this program via zcat file.gz | perl script.pl.
Which can of course be jammed into a one-liner, if need be...
zcat file.gz | perl -ne '$x&&push@b,$_;if(/CAPTURE/){$x||=@b=$_;print@b;@b=()}'
Can this be accomplished with a regular expression?
You mean in a single pass, in a single regex? If you don't mind reading the entire file into memory, sure... but this is obviously not a good idea for large files.
zcat file.gz | perl -0777ne '/((^.*CAPTURE.*$)(?s:.*)(?2)(?:\z|\n))/m and print $1'
I would write
gunzip -c file.gz | sed -n '/CAPTURE/,$p' | tac | sed -n '/CAPTURE/,$p' | tac
Find the first CAPTURE and look back for the last one.
echo "/CAPTURE/,?CAPTURE? p" | ed -s <(gunzip -c inputfile.gz)
EDIT: Answer to comment and second (better?) solution.
When your input doesn't end with a newline, ed will complain, as shown by these tests.
# With newline
printf "1,$ p\n" | ed -s <(printf "%s\n" test)
# Without newline
printf "1,$ p\n" | ed -s <(printf "%s" test)
# message removed
printf "1,$ p\n" | ed -s <(printf "%s" test) 2> /dev/null
I do not know how much memory this will need for a large file, but you would probably prefer a streaming solution.
You can use sed for the next approach.
Keep reading lines until you find the first match. During this time only remember the last line read (by putting it in a Hold area).
Now change your tactics.
Append each line to the Hold area. You do not know when to flush until the next match.
When you have the next match, recall the Hold area and print this.
I needed some tweaking to prevent the second match from being printed twice. I solved this by reading the next line and replacing the Hold area with that line.
The complete solution is:
gunzip -c inputfile.gz | sed -n '1,/CAPTURE/{h;n};H;/CAPTURE/{x;p;n;h};'
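To illustrate, feeding the sample input from the question into the sed command (using printf here instead of gunzip, since the data is small) gives the expected block:

printf '%s\n' '1:.........' '...........' '100:CAPTURE' '...........' \
  '150:CAPTURE' '...........' '200:CAPTURE' '...........' '1000:......' |
  sed -n '1,/CAPTURE/{h;n};H;/CAPTURE/{x;p;n;h};'
# prints:
# 100:CAPTURE
# ...........
# 150:CAPTURE
# ...........
# 200:CAPTURE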
If you don't like the sed hold space, you can implement the same approach with awk:
gunzip -c inputfile.gz |
awk '/CAPTURE/{capt=1} capt==1{a[i++]=$0} /CAPTURE/{for(j=0;j<i;j++) print a[j]; i=0}'
I don't think regex will be faster than double scan...
Here is an awk solution (double scan); note that file{,} is brace-expanded by the shell to file file, so the same file is read twice:
$ awk '/pattern/ && NR==FNR {a[++f]=NR; next} a[1]<=FNR && FNR<=a[f]' file{,}
Alternatively if you have any a priori information on where the patterns appear on the file you can have heuristic approaches which will be faster on those special cases.
Here is one more example with a regex (the downside is that, if files are large, it will consume a lot of memory):
#!/usr/bin/perl
{
    local $/ = undef;
    open FILE, $ARGV[0] or die "Couldn't open file: $!";
    binmode FILE;
    $string = <FILE>;
    close FILE;
}
print $1 if $string =~ /([^\n]+(CAPTURE).*\2.*?)\n/s;
Or as a one-liner:
cat file.tmp | perl -ne '$/=undef; print $1 if <STDIN> =~ /([^\n]+(CAPTURE).*\2.*?)\n/s'
result:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
This might work for you (GNU sed):
sed '/CAPTURE/!d;:a;n;:b;//ba;$d;N;bb' file
Delete all lines until the first containing the required string. Print the line containing the required string. Replace the pattern space with the next line. If this line contains the required string, repeat the last two previous sentences. If it is the last line of the file, delete the pattern space. Otherwise, append the next line and repeat the last three previous sentences.
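For readability, here is the same GNU sed program written out with one command per line and a comment above each step; the behaviour is identical to the one-liner above:

sed '
  # delete every line until the first one containing CAPTURE
  /CAPTURE/!d
  :a
  # print the pattern space and read the next line
  n
  :b
  # the empty regex // reuses /CAPTURE/; if it matches, start a new block
  //ba
  # last line reached without another CAPTURE: discard what was collected
  $d
  # otherwise append the next line and test again
  N
  bb
' file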
Having studied the test files used for haukex's benchmark, it would seem that sed is not the right tool for extracting this file. Using a mixture of csplit, grep and sed presents a reasonably fast solution as follows:
lines=$(grep -nTA1 --no-group-separator CAPTURE oldFile |
sed '1s/\t.*//;1h;$!d;s/\t.*//;H;x;s/\n/ /')
csplit -s oldFile $lines && rm xx0{0,2} && mv xx01 newFile
Split the original file into three files: a file preceding the first occurrence of CAPTURE, a file from the first CAPTURE to the last CAPTURE, and a file containing the remainder. The first and third files are discarded and the second file is renamed.
csplit can use line numbers to split the original file. grep is extremely fast at filtering patterns and can return the line numbers of all patterns that match CAPTURE and the following context line. sed can manipulate the results of grep into two line numbers which are supplied to the csplit command.
When run against the test files (as above) I get timings around 10 seconds.
While posting this question, the problem I had at hand was that I had several huge gzip compressed log files generated by a java application.
The log lines were of the following format:
[Timestamp] (AppName) {EventId} [INFO]: Log text...
[Timestamp] (AppName) {EventId} [EXCEPTION]: Log text...
at com.application.class(Class.java:154)
caused by......
[Timestamp] (AppName) {EventId} [LogLevel]: Log text...
Given an EventId, I needed to extract all the lines corresponding to the event from these files. The problem could not be solved with a trivial grep for the EventId, because the exception lines can be of arbitrary length and do not contain the EventId.
Unfortunately, I forgot to consider the edge case where the last log line for an EventId could be the exception, in which case the answers posted here would not print the stack-trace lines. However, it wasn't hard to modify haukex's solution to cover these cases as well:
#!/usr/bin/env perl
use warnings;
use strict;
my $first = 1;
my @buf;
while ( my $line = <> ) {
    push @buf, $line unless $first;
    if ( $line =~ /EventId/ or ($first == 0 and $line !~ /\(AppName\)/) ) {
        if ($first) {
            @buf = ($line);
            $first = 0;
        }
        print @buf;
        @buf = ();
    }
    else {
        $first = 1;
    }
}
I am still wondering whether the faster solutions (mainly Walter's sed solution or haukex's in-memory Perl solution) could be modified to do the same.
I have the lines in the following format:
[image of sample lines; the data appears to be comma-separated values]
I need help extracting only the lines that contain more than 5 commas into a separate file.
perl has a tr (translate) operator that returns the number of translations that occurred. We can use this to count substrings in a string.
cat file.txt | perl -ne 'print if tr/,// > 5'
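For example (the two sample lines here are invented):

printf '%s\n' 'a,b,c,d' 'a,b,c,d,e,f,g' | perl -ne 'print if tr/,// > 5'
# prints only: a,b,c,d,e,f,g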
Using egrep:
egrep '([^,]*,){6,}'
Using awk:
awk -F, 'NF>5{print}'
Using a sed which has an "extended regular expression option" (I'll assume -r here, but it could be -E):
sed -n -r -e '/([^,]*,){6,}/p'
Of course you have to be careful what you ask for. For example, if you have a CSV file with commas embedded within "values", and if you only want lines with more than five "values", then things get a little trickier for tools that are not CSV-aware.
The text in the image looks like CSV.
Then, using awk:
awk -F'","' 'NF>5{print}'
similar to peak's answer above.
I think you already have answers to your raw question here. However, if what you're really asking is how to find rows with more than five CSV fields, then I think you need something like Perl's Text::CSV module.
An example of this is the following string:
one,two,three,four,"five,six,seven"
This has six commas but only five fields. Do you want to see this line, or do you want to skip it? If you want to see it (as an exception -- a line with more than five commas), then use one of the methods already suggested.
If you don't, then you really want a CSV parser, and Perl's is quite nice -- more lightweight and easier than most languages, in my opinion:
use strict;
use Text::CSV;

my $csv = Text::CSV->new( { binary => 1 } );
open my $IN, "<:encoding(utf8)", "file.csv" or die;
while (my $row = $csv->getline($IN)) {
    if (@$row > 5) {
        $csv->combine(@$row);
        print $csv->string(), "\n";
    }
}
close $IN;
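A quick way to try it (assuming the script above is saved as more_than_5_fields.pl; that file name is made up):

printf '%s\n' 'one,two,three,four,"five,six,seven"' 'a,b,c,d,e,f,g' > file.csv
perl more_than_5_fields.pl
# prints only: a,b,c,d,e,f,g
# the first line has six commas but only five fields, so the CSV-aware version skips it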
What I love about awk is that you can fetch all the lines from a file that satisfy a condition on some arbitrary field you specify.
For example,
awk '$3~/hi/' < test.txt # print all lines where the third field matches the pattern "hi"
or
awk '$2>=2' < test.txt # print all lines where the second field is greater or equal to 2
As a beginner who's learning about the power of unix, I am absolutely fascinated about this.
Now I am wondering if there is an easy way to perform regex substitutions only on some arbitrary field you specify. For example, I want to do a regex substitution on the third field only.
My current method is to "cut" the field I want, perform the substitution on it using perl or sed, and then "paste" it back into the original file. But I am wondering if there is a more efficient way to achieve this.
Thanks
Since you tagged this question with 'perl' (in addition to 'sed', 'awk', 'unix', and 'command-line'), I'll assume you're interested in answers that incorporate any of the above tools.
Perl has an auto-split command-line switch (-a):
perl -lane 'print if $F[2] =~ /some pattern/' filename
...or...
perl -lane 'print if $F[1] >= 42' filename
-a causes an auto-split into the @F array. -n causes Perl to iterate over the lines of the file you feed it. The rest is programming. ;)
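For example, on two made-up lines:

printf '%s\n' 'foo 1 hi' 'bar 3 bye' | perl -lane 'print if $F[1] >= 2'
# prints: bar 3 bye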
Now for substitution:
perl -i.bak -lane '$F[2] =~ s/match/subst/; print join q/ /, @F' filename
Or, a little shorter using the -p switch, which tells Perl to print each line as it appears in $_. That means if you alter @F, you'll have to copy it back into $_:
perl -i.bak -pale '$F[2] =~ s/match/subst/ && $_="@F"' filename
This might work for you:
echo -e 'Fred barney Wilma\nfoo bar baz' |
awk '$2 == "barney"{sub(/b/,"B",$2)};1'
Fred Barney Wilma
foo bar baz
You can use the sub or gsub commands, or in this case:
echo -e 'Fred barney Wilma\nfoo bar baz'|
awk '$2 == "barney"{$2="Barney"};1'
Fred Barney Wilma
foo bar baz
Just substitute the second field completely.
N.B. The 1 at the end of the line is shorthand for {print}.
Consider a simple example:
awk -F "," '{ OFS=","; sub ("1", "x", $3); print $0 }' file.txt > newfile.txt
newfile.txt will now contain:
1,2,3,4,5,6,7
8,9,x0,11,12,13,14
15,16,x7,18,19,20,21
Here, 1 was replaced with an x in the third column $3.
-F "," sets the delimiter of the input file.
OFS="," adds a comma to the output.
If you would like to make the substitution globally, consider using gsub instead of sub.
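For instance, on a made-up line where the third field contains more than one 1, gsub replaces every occurrence in that field:

echo '11,2,121,4' | awk -F, '{ OFS=","; gsub("1", "x", $3); print $0 }'
# prints: 11,2,x2x,4
# with sub instead of gsub, only the first 1 in the field is replaced: 11,2,x21,4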
HTH
I just had a task where I needed to replace the third value on every line of a tab-separated file with a fixed value. I guess it can be done in Perl on a Unix shell like so:
$ perl -a -n -i -F'/\t/' -e '$F[2]="THE FIXED VALUE";print join "\t", @F' bla.txt
I just wanted to know whether this is a "correct" way to do it, or whether there is a better way (for a currently lacking definition of "better").
I think your one-liner is reasonable and readable. There are many more ways to do it. I would stack the perlrun options and save a few keystrokes:
perl -F'\t' -i -ape'$F[2]="THE FIXED VALUE"; $_ = join "\t", @F' bla.txt
A shame that $, does not get populated with the argument of -F, so there's still a piece of repetition.
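As a quick sanity check of the field handling (run without -i and fed from printf so no file is modified; the data is made up):

printf 'a\tb\tc\td\n' | perl -F'\t' -ape'$F[2]="THE FIXED VALUE"; $_ = join "\t", @F'
# output: a<TAB>b<TAB>THE FIXED VALUE<TAB>d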