How to display lines in a file containing more than 5 commas using egrep or awk - perl

I have the lines in the following format:
[image of sample lines omitted]
Help is required to write only the lines containing more than 5 commas to a separate file.

perl has a tr (translate) operator that returns the number of translations that occurred. We can use this to count substrings in a string.
cat file.txt | perl -ne 'print if tr/,// > 5'

Using egrep:
egrep '([^,]*,){6,}'
Using awk:
awk -F, 'NF>5{print}'
Using a sed which has an "extended regular expression option" (I'll assume -r here, but it could be -E):
sed -n -r -e '/([^,]*,){6,}/p'
Of course you have to be careful what you ask for. For example, if you have a CSV file with commas embedded within "values", and if you only want lines with more than five "values", then things get a little trickier for tools that are not CSV-aware.
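To make the pitfall concrete with a made-up line: tr/,// counts the comma inside the quoted value too, reporting three commas even though a CSV-aware tool would see only three fields (two real separators).
$ echo 'a,"b,c",d' | perl -ne 'print tr/,//, "\n"'
3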

The text in the image looks like CSV.
In that case, using awk with the quoted-comma separator:
awk -F'","' 'NF>5{print}'
much like peak's answer above.

I think you already have answers to your raw question here. However, if what you're really asking is how to find rows with more than 5 CSV fields, then I think you need something like Perl's Text::CSV module.
An example of this is the following string:
one,two,three,four,five,"six,seven"
This has six commas but only five fields. Do you want to see this line, or do you want to skip it? If you want to see it (as an exception -- a line with more than five commas), then use one of the methods already suggested.
If you don't, then you really want a CSV parser, and Perl's is quite nice -- more lightweight and easier than most languages, in my opinion:
use strict;
use Text::CSV;
my $csv = Text::CSV->new( { binary => 1 } );
open my $IN, "<:encoding(utf8)", "file.csv" or die;
while (my $row = $csv->getline($IN)) {
    if (@$row > 5) {
        $csv->combine(@$row);
        print $csv->string(), "\n";
    }
}
close $IN;

Related

Extract everything between the first and last occurrence of the same pattern in a single iteration

This question is very much the same as this one, except that I am looking to do this as fast as possible, doing only a single pass over the (unfortunately gzip-compressed) file.
Given the pattern CAPTURE and input
1:.........
...........
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
...........
1000:......
Print:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
Can this be accomplished with a regular expression?
I vaguely remember that this kind of grammar cannot be captured by a regular expression, but I'm not quite sure, as regular expressions these days provide lookaheads, etc.
You can buffer the lines until you see a line that contains CAPTURE, treating the first occurrence of the pattern specially.
#!/usr/bin/env perl
use warnings;
use strict;
my $first=1;
my @buf;
while ( my $line = <> ) {
    push @buf, $line unless $first;
    if ( $line=~/CAPTURE/ ) {
        if ($first) {
            @buf = ($line);
            $first = 0;
        }
        print @buf;
        @buf = ();
    }
}
Feed the input into this program via zcat file.gz | perl script.pl.
Which can of course be jammed into a one-liner, if need be...
zcat file.gz | perl -ne '$x&&push@b,$_;if(/CAPTURE/){$x||=@b=$_;print@b;@b=()}'
Can this be accomplished with a regular expression?
You mean in a single pass, in a single regex? If you don't mind reading the entire file into memory, sure... but this is obviously not a good idea for large files.
zcat file.gz | perl -0777ne '/((^.*CAPTURE.*$)(?s:.*)(?2)(?:\z|\n))/m and print $1'
I would write
gunzip -c file.gz | sed -n '/CAPTURE/,$p' | tac | sed -n '/CAPTURE/,$p' | tac
Find the first CAPTURE and look back for the last one.
echo "/CAPTURE/,?CAPTURE? p" | ed -s <(gunzip -c inputfile.gz)
EDIT: Answer to comment and second (better?) solution.
When your input doesn't end with a newline, ed will complain, as shown by these tests.
# With newline
printf "1,$ p\n" | ed -s <(printf "%s\n" test)
# Without newline
printf "1,$ p\n" | ed -s <(printf "%s" test)
# message removed
printf "1,$ p\n" | ed -s <(printf "%s" test) 2> /dev/null
I do not know the memory implications of this for a large file, and you would probably prefer a streaming solution.
You can use sed for the next approach.
Keep reading lines until you find the first match. During this time only remember the last line read (by putting it in a Hold area).
Now change your tactics.
Append each line to the Hold area. You do not know when to flush until the next match.
When you have the next match, recall the Hold area and print this.
I needed some tweaking to prevent the second match from being printed twice. I solved this by reading the next line and replacing the Hold area with that line.
The total solution is
gunzip -c inputfile.gz | sed -n '1,/CAPTURE/{h;n};H;/CAPTURE/{x;p;n;h};'
If you don't like the sed hold space, you can implement the same approach with awk:
gunzip -c inputfile.gz |
awk '/CAPTURE/{capt=1} capt==1{a[i++]=$0} /CAPTURE/{for(j=0;j<i;j++) print a[j]; i=0}'
I don't think regex will be faster than double scan...
Here is an awk solution (double scan)
$ awk '/pattern/ && NR==FNR {a[++f]=NR; next} a[1]<=FNR && FNR<=a[f]' file{,}
Alternatively, if you have any a priori information on where the patterns appear in the file, you can use heuristic approaches which will be faster in those special cases.
Here is one more example with regex (the downside is that if the files are large, it will consume a lot of memory):
#!/usr/bin/perl
my $string;
{
    local $/ = undef;
    open FILE, $ARGV[0] or die "Couldn't open file: $!";
    binmode FILE;
    $string = <FILE>;
    close FILE;
}
print $1 if $string =~ /([^\n]+(CAPTURE).*\2.*?)\n/s;
Or as a one-liner:
perl -0777 -ne 'print $1 if /([^\n]+(CAPTURE).*\2.*?)\n/s' file.tmp
result:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
This might work for you (GNU sed):
sed '/CAPTURE/!d;:a;n;:b;//ba;$d;N;bb' file
Delete all lines until the first containing the required string. Print the line containing the required string. Replace the pattern space with the next line. If this line contains the required string, repeat the previous two sentences. If it is the last line of the file, delete the pattern space. Otherwise, append the next line and repeat the previous three sentences.
Having studied the test files used for haukex's benchmark, it would seem that sed is not the tool to extract this file. Using a mixture of csplit, grep and sed presents a reasonably fast solution as follows:
lines=$(grep -nTA1 --no-group-separator CAPTURE oldFile |
sed '1s/\t.*//;1h;$!d;s/\t.*//;H;x;s/\n/ /')
csplit -s oldFile $lines && rm xx0{0,2} && mv xx01 newFile
Split the original file into three files: one preceding the first occurrence of CAPTURE, one spanning the first CAPTURE to the last CAPTURE, and one containing the remainder. The first and third files are discarded and the second file renamed.
csplit can use line numbers to split the original file. grep is extremely fast at filtering patterns and can return the line numbers of all patterns that match CAPTURE and the following context line. sed can manipulate the results of grep into two line numbers which are supplied to the csplit command.
When run against the test files (as above) I get timings around 10 seconds.
While posting this question, the problem I had at hand was that I had several huge gzip compressed log files generated by a java application.
The log lines were of the following format:
[Timestamp] (AppName) {EventId} [INFO]: Log text...
[Timestamp] (AppName) {EventId} [EXCEPTION]: Log text...
at com.application.class(Class.java:154)
caused by......
[Timestamp] (AppName) {EventId} [LogLevel]: Log text...
Given an EventId, I needed to extract all the lines corresponding to the event from these files. The problem became unsolvable with a trivial grep for EventId just due to the fact that the exception lines could be of arbitrary length and do not contain the EventId.
Unfortunately, I forgot to consider the edge case where the last log line for an EventId could be the exception, in which case the answers posted here would not print the stacktrace lines. However, it wasn't hard to modify haukex's solution to cover these cases as well:
#!/usr/bin/env perl
use warnings;
use strict;
my $first=1;
my @buf;
while ( my $line = <> ) {
    push @buf, $line unless $first;
    if ( $line=~/EventId/ or ($first==0 and $line!~/\(AppName\)/)) {
        if ($first) {
            @buf = ($line);
            $first = 0;
        }
        print @buf;
        @buf = ();
    }
    else {
        $first = 1;
    }
}
I am still wondering whether the faster solutions (mainly Walter's sed solution or haukex's in-memory perl solution) could be modified to do the same.

Unix - filename and string result on same line

I need to search a directory that has hundreds or thousands of files, each containing XML with one or more instances of a specific string (begin/end tag with data).
I can get all the instances of the string by doing
grep -ho '<mytagname>..............<\/mytagname>' /home/xyzzy/mydata/*.XML > /home/mydata/tagvalues.txt
then a few sed commands to strip off the tags, so I wind up with a file just containing a list of values:
value001
value002
value003
(etc)
Ideally though, I'd like each line of the file to also include the filename, so I can import it into a database for analysis.
So my result would be something like this
fileAAA value001
fileAAA value002
fileAAA value003
fileBBB value004
Exact formatting of the above is flexible - could have spaces or other separator, it could even still include the begin/end tags.
The closest I've been able to get is with grep -o:
fileAAA:value001
value002
value003
fileBBB:value004
A perl one-liner would seem ideal, but I'm new enough to Perl that I have no clue how to begin.
Could be done using a one-liner like so:
perl -lne 'print "$ARGV $1" if /<mytagname>(.*?)<\/mytagname>/' *.xml
However, I'd strongly recommend that you use an actual XML parser like XML::Twig or XML::LibXML
use strict;
use warnings;
use XML::LibXML;

for my $file (</home/xyzzy/mydata/*.XML>) {
    my $doc = XML::LibXML->load_xml(location => $file);
    for my $node ($doc->findnodes("//mytagname")) {
        print "$file " . $node->textContent() . "\n";
    }
}
What about awk?
awk -F'</?mytagname>' '$2 {print FILENAME,$2}' /home/xyzzy/mydata/*.XML
Explanation:
-F regex - sets the field separator; the regex is enclosed in its own quotes so the shell passes it to awk intact
$2 - if second field has a value
{print FILENAME,$2} - print filename SPACE the value of second field

Sed: syntax error with unexpected "("

I've got file.txt which looks like this:
C00010018;1;17/10/2013;17:00;18;920;113;NONE
C00010019;1;18/10/2013;17:00;18;920;0;NONE
C00010020;1;19/10/2013;19:00;18;920;0;NONE
And I'm trying to do two things:
Select the lines that have $id_play as 2nd field.
Replace ; with - on those lines.
My attempt:
#!/usr/bin/perl
$id_play=3;
$input="./file.txt";
$result = `sed s#^\([^;]*\);$id_play;\([^;]*\);\([^;]*\);\([^;]*\);\([^;]*\);\([^;]*\)\$#\1-$id_play-\2-\3-\4-\5-\6#g $input`;
And I'm getting this error:
sh: 1: Syntax error: "(" unexpected
Why?
You have to escape the # characters, add two backslashes in some cases (thanks ysth!), put the sed script in single quotes, and make it also filter out the non-matching lines. So replace the line with this:
$result = `sed 's\#^\\([^;]*\\);$id_play;\\([^;]*\\);\\([^;]*\\);\\([^;]*\\);\\([^;]*\\);\\([^;]*\\);\\([^;]*\\)\$\#\\1-$id_play-\\2-\\3-\\4-\\5-\\6-\\7\#g;tx;d;:x' $input`;
PS. What you are trying to do can be achieved in a much cleaner way without calling sed, by using a split. For example:
#!/usr/bin/perl
use warnings;
use strict;
my $id_play=3;
my $input="file.txt";
open (my $IN,'<',$input);
while (<$IN>) {
    my @row=split/;/;
    print join('-',@row) if $row[1]==$id_play;
}
close $IN;
There's no need to ever call sed from perl, as the perl regex engine is already built in and much easier to use. The above answer is perfectly fine. With such a simple dataset, another simple way to do it a little more idiomatically (although maybe a little more obfuscated... then again, that sed command was a little complex in itself!) would be:
#!/usr/bin/perl
use warnings;
use strict;
my $id_play = 3;
my @result = map { s/;/-/g; $_ } grep { /^\w+;$id_play;/ } <DATA>;
print @result;
__DATA__
C00010018;1;17/10/2013;17:00;18;920;113;NONE
C00010019;1;18/10/2013;17:00;18;920;0;NONE
C00010020;1;19/10/2013;19:00;18;920;0;NONE
C00010020;3;19/10/2013;19:00;18;920;0;NONE
C00010019;3;18/10/2013;17:00;18;920;0;NONE
C00010020;4;19/10/2013;19:00;3;920;0;NONE
Assuming the file isn't too terribly large, you can just use grep with a regex to grab the lines you are looking for, then map with a substitution operator to convert the semicolons to hyphens, storing the results in a list that you can print out. I tested it with the DATA block below the code, but instead of reading from that block, you would read from your file as normal.
Edit: I also forgot to mention that in sed, '(' and ')' are treated as literal characters, not regex groupings. If you're dead set on sed for such things, use the -r option so that sed treats those characters in the regex sense.
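A quick illustration of the difference, on throwaway input:
$ echo '(a)' | sed 's/(a)/X/'      # without -r, parentheses match literally
X
$ echo 'aa' | sed -r 's/(a)\1/X/'  # with -r, (a) groups and \1 backreferences it
X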
$ cat file
C00010018;1;17/10/2013;17:00;18;920;113;NONE
C00010019;2;18/10/2013;17:00;18;920;0;NONE
C00010020;3;19/10/2013;19:00;18;920;0;NONE
$
$ id_play=2
$
$ awk -v id="$id_play" -F';' -v OFS='-' '$2==id{$1=$1}1' file
C00010018;1;17/10/2013;17:00;18;920;113;NONE
C00010019-2-18/10/2013-17:00-18-920-0-NONE
C00010020;3;19/10/2013;19:00;18;920;0;NONE

perl - help parsing out number values from many small text files

I have a number of files in a common directory (/home/test) with a common name:
ABC_1_20110508.out
ABC_1_20110509.out
ABC_1_20110510.out
..
Each text file has one record that looks like this:
(count, 553076)
I would like to strip out the numbers and just list them out in a file one at a time.
553076
1005
7778000
...
Can someone show me how to do this using perl?
use this regex:
/\(\w+, (\d+)\)/
you can also use the magic diamond operator to iterate over all of the files at once:
while (<>) {
    # extract the number (skip lines that don't match)
    next unless /\(\w+, (\d+)\)/;
    # print it out
    print $1, "\n";
}
And if your perl script is called myscript.pl, then you can call it like this:
$ myscript.pl /home/test/ABC_1_*.out
Sounds like a one-liner to me:
$ perl -wne '/(\d+)/ && print "$1\n"' *.out > out.txt
Easiest way is to use the <> operator. When invoking a perl program without arguments, <> acts just like <STDIN>. If you call it with arguments, <> will give you the contents of every file in @ARGV without you having to manually manage the filehandles.
Ex: ./your_script.pl /home/test/ABC_1_????????.out or cat /home/test/ABC_1_????????.out | ./your_script.pl. These would have the same effect.
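For instance, a minimal sketch of such a script (the $ARGV prefix is optional; that variable holds the name of the file <> is currently reading, in case you also want the filename):
#!/usr/bin/perl
use strict;
use warnings;
while (<>) {
    # <> iterates over every file named on the command line
    print "$ARGV: $1\n" if /\(\w+, (\d+)\)/;
}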

How can I filter out specific column from a CSV file in Perl?

I am just a beginner in Perl and need some help in filtering columns using a Perl script.
I have about 10 columns separated by commas in a file, and I need to keep 5 of those columns and get rid of every other column in the file. How do we achieve this?
Thanks a lot for anybody's assistance.
cheers,
Neel
Have a look at Text::CSV (or Text::CSV_XS) to parse CSV files in Perl. It's available on CPAN or you can probably get it through your package manager if you're using Linux or another Unix-like OS. In Ubuntu the package is called libtext-csv-perl.
It can handle cases like fields that are quoted because they contain a comma, something that a simple split command can't handle.
CSV is an ill-defined, complex format (weird issues with quoting, commas, and spaces). Look for a library that can handle the nuances for you and also give you conveniences like indexing by column names.
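As a rough sketch of that approach with Text::CSV, reading CSV on stdin and keeping a hard-coded set of columns (the indices in @keep are purely illustrative; adjust them to the five you want):
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 });
my @keep = (0, 2, 4, 6, 8);    # 0-based indices of the columns to keep (an assumption)
while (my $row = $csv->getline(\*STDIN)) {
    $csv->combine(@{$row}[@keep]);    # reassemble only the wanted fields
    print $csv->string(), "\n";
}
Quoted fields with embedded commas survive this round trip intact, which is the whole point of using the module.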
Of course, if you're just looking to split a text file by commas, look no further than @Pax's solution.
Use split to pull the line apart, then output the columns you want (say, every second column). Create the following xx.pl file:
while(<STDIN>) {
    chomp;
    @fields = split(",",$_);
    print "$fields[1],$fields[3],$fields[5],$fields[7],$fields[9]\n";
}
then execute:
$ echo 1,2,3,4,5,6,7,8,9,10 | perl xx.pl
2,4,6,8,10
If you are talking about CSV files on Windows (e.g., generated from Excel), you will need to be careful with fields that contain commas themselves but are enclosed by quotation marks.
In this case, a simple split won't work.
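For example, on a made-up line, split reports four fields where a CSV parser would report three:
$ perl -e '@f = split /,/, q{a,"b,c",d}; print scalar @f, "\n"'
4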
Alternatively, you could use Text::ParseWords, which is in the standard library. Add
use Text::ParseWords;
to the top of Pax's example above, and then substitute
my @fields = parse_line(q{,}, 0, $_);
for the split.
You can use some of Perl's built in runtime options to do this on the command line:
$ echo "1,2,3,4,5" | perl -a -F, -n -e 'print join(q{,}, $F[0], $F[3]).qq{\n}'
1,4
The above will -a(utosplit) using the -F(ield separator) of a comma. It will then join the fields you are interested in and print them back out (with a line separator). This assumes simple data without nested commas. I was doing this with an unprintable field separator (\x1d), so this wasn't an issue for me.
See http://perldoc.perl.org/perlrun.html#Command-Switches for more details.
I went looking and didn't find a nice CSV-compliant filter program flexible enough to be useful for more than a one-off, so I wrote one. Enjoy.
Basic usage is:
bash$ csvfilter [-r <columnTitle>]* [-quote] <csv.file>
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
use Text::CSV;

my $always_quote=0;
my @remove;
if ( ! GetOptions('remove:s'=> \@remove,
                  'quote-always'=>sub {$always_quote=1;}) ) {
    die "$0:invalid option (use --remove [--quote-always])";
}

my @cols2remove;

# return the fields minus the column indices listed in @cols2remove
sub filter(@)
{
    my @fields=@_;
    my @r;
    my $i=0;
    for my $c (@cols2remove) {
        my $p=$c-$i;    # number of fields to keep before this removed column
        if ( $p > 0 ) {
            push(@r, @fields[$i .. $c-1]);
        }
        $i=$c+1;
    }
    if ( scalar(@fields) > $i ) {
        push(@r, @fields[$i .. $#fields]);    # the fields after the last removed column
    }
    return @r;
}

# create just one of these
my $csvOut=Text::CSV->new({always_quote=>$always_quote});

sub printLine(@)
{
    my @fields=@_;
    my $combined=$csvOut->combine(filter(@fields));
    my $str=$csvOut->string();
    if ( length($str) ) {
        print "$str\n";
    }
}

my $csv = Text::CSV->new();
while (<>) {
    $csv->parse($_);
    if ( $. == 1 ) {
        # use the header row to map the requested column names to indices
        my @cols=$csv->fields;
        for my $rm (@remove) {
            for (my $c=0; $c < @cols; $c++) {
                push(@cols2remove, $c) if $cols[$c] eq $rm;
            }
        }
        @cols2remove=sort {$a <=> $b} @cols2remove;
    }
    printLine($csv->fields);
}
exit(0);
In addition to what people here said about processing comma-separated files, I'd like to note that one can extract the even (or odd) array elements using an array slice and/or map:
@myarray[map { $_ * 2 } (0 .. 4)]
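For example, applied to a ten-field row, this picks the columns at 0-based indices 0, 2, 4, 6 and 8:
$ perl -e '@row = split /,/, "1,2,3,4,5,6,7,8,9,10"; print join(",", @row[map { $_ * 2 } (0 .. 4)]), "\n"'
1,3,5,7,9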
Hope it helps.
My personal favorite way to do CSV is using the AnyData module. It seems to make things pretty simple, and removing a named column can be done rather easily. Take a look on CPAN.
This answers a much larger question, but seems like a good relevant bit of information.
The unix cut command can do what you want (and a whole lot more). It has been reimplemented in Perl.
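For example, assuming simple data with no quoted commas, keeping every second column (as in the split example above) is a one-liner:
$ echo 1,2,3,4,5,6,7,8,9,10 | cut -d, -f2,4,6,8,10
2,4,6,8,10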