How can I extract specific columns in Perl?

chr1 1 10 el1
chr1 13 20 el2
chr1 50 55 el3
I have this tab-delimited file and I want to extract the second and third columns using Perl. How can I do that?
I tried reading the file with a file handle and storing it in a string, then converting the string to an array, but it didn't get me anywhere.
My attempt is:
while (defined($line = <FILE_HANDLE>)) {
    my @tf1;
    @tf1 = split(/\t/, $line);
}

Simply autosplit on tab:
# ↓ index starts at 0
$ perl -F'\t' -lane'print join ",", @F[1,2]' inputfile
Output:
1,10
13,20
50,55
See perlrun.
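For reference, here is roughly what that one-liner expands to as a standalone script; this is a sketch based on the perlrun documentation, not a literal de-parse:
#!/usr/bin/perl
# A hand-rolled equivalent of: perl -F'\t' -lane'print join ",", @F[1,2]'
use strict;
use warnings;

while (my $line = <>) {            # -n: loop over the input lines
    chomp $line;                   # -l: strip the trailing newline
    my @F = split /\t/, $line;     # -a with -F'\t': autosplit on tab into @F
    print join(",", @F[1, 2]), "\n";  # columns 2 and 3 (indexes start at 0)
}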

use strict;
my $input=shift or die "must provide <input_file> as an argument\n";
open(my $in,"<",$input) or die "Cannot open $input for reading: $!";
while(<$in>)
{
    my @tf1 = split(/\t/, $_);
    print "$tf1[1]|$tf1[2]\n"; # $tf1[1] is the second column and $tf1[2] is the third column
}
close($in);

What problem are you having? Your code already does all the hard parts.
while (defined($line = <FILE_HANDLE>)) {
    my @tf1;
    @tf1 = split(/\t/, $line);
}
You have all three columns in your @tf1 array (by the way - your variable naming needs serious work!). All you need to do now is to print the second and third elements from the array (but remember that Perl array elements are numbered from zero).
print "$tf1[1] / $tf1[2]\n";
It's possible to simplify your code quite a lot by taking advantage of Perl's default behaviours.
while (<FILE_HANDLE>) {        # Store record in $_
    my @tf1 = split(/\t/);     # Declare and initialise on one line;
                               # split() works on $_ by default
    print "$tf1[1] / $tf1[2]\n";
}

Even more pithily than @daxim's, as a one-liner:
perl -aE 'say "@F[1,2]"' file
(On Perl 5.20 and later, -a implies -n; on older perls spell it -anE.)
See also: How to sort an array or table by column in perl?

Related

Extract preceding and trailing characters to a matched string from file in awk

I have a large string file seq.txt of letters, unwrapped, with over 200,000 characters. No spaces, numbers etc, just a-z.
I have a second file search.txt which has lines of 50 unique letters which will match once in seq.txt. There are 4000 patterns to match.
I want to be able to find each of the patterns (lines in file search.txt), and then get the 100 characters before and 100 characters after the pattern match.
I have a script which uses grep and works, but this runs very slowly, only does the first 100 characters, and is written out with echo. I am not knowledgeable enough in awk or perl to interpret scripts online that may be applicable, so I am hoping someone here is!
cat search.txt | while read p; do echo "grep -zoP '.{0,100}$p' seq.txt | sed G"; done > do.grep
Easier example with desired output:
>head seq.txt
abcdefghijklmnopqrstuvwxyz
>head search.txt
fgh
pqr
uvw
>head desiredoutput.txt
cdefghijk
mnopqrstu
rstuvwxyz
Best outcome would be a tab separated file of the 100 characters before \t matched pattern \t 100 characters after. Thank you in advance!
One way
use warnings;
use strict;
use feature 'say';
my $string;

# Read the submitted file line by line (or STDIN if @ARGV is empty)
while (<>) {
    chomp;
    $string = $_;
    last;  # just in case, as we need ONE line
}
# $string = q(abcdefghijklmnopqrstuvwxyz);  # test

my $padding = 3;  # for the given test sample
my @patterns = do {
    my $search_file = 'search.txt';
    open my $fh, '<', $search_file or die "Can't open $search_file: $!";
    <$fh>;
};
chomp @patterns;
# my @patterns = qw(bcd fgh pqr uvw);  # test

foreach my $patt (@patterns) {
    if ( $string =~ m/(.{0,$padding}) ($patt) (.{0,$padding})/x ) {
        say "$1\t$2\t$3";
        # or
        # printf "%-3s\t%3s%3s\n", $1, $2, $3;
    }
}
Run as program.pl seq.txt, or pipe the content of seq.txt to it.†
The pattern .{0,$padding} matches any character (.) up to $padding times (3 above); the 0 lower bound is there in case the pattern $patt is found at a position closer to the beginning of the string than $padding (like the first one, bcd, that I added to the example provided in the question). The same goes for the padding after the $patt.
For your problem, set $padding to 100. With the 100-wide "padding" before and after each pattern, a pattern found closer to the beginning than 100 can break the desired \t alignment, if its position falls short of 100 by more than the tab width (typically 8).
That's what the line with the formatted print (printf) is for: it fixes the width of each field regardless of the length of the string being printed. (It is commented out since we are told that no pattern ever gets into the first or last 100 chars.)
If there is indeed never a chance that a matched pattern breaches the first or the last 100 positions then the regex can be simplified to
/(.{$padding}) ($patt) (.{$padding})/x
Note that if a $patt is within the first/last $padding chars then this just won't match.
The program starts the regex engine for each of @patterns, which in principle may raise performance issues (not for one run with the modest number of 4000 patterns, but such requirements tend to change and generally grow). But this is by far the simplest way to go since
we have no clue how the patterns may be distributed in the string, and
one match may be inside the 100-char buffer of another (we aren't told otherwise)
If there is a performance problem with this approach please update.
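If performance did become a problem, one common approach (a sketch of an alternative, not part of the answer above) is to combine all patterns into a single alternation so the string is scanned in one pass:
# Sketch: one combined regex instead of one regex-engine start per pattern.
my $alt = join '|', map { quotemeta } @patterns;
while ( $string =~ /(.{0,$padding}) ($alt) (.{0,$padding})/xg ) {
    say "$1\t$2\t$3";
}
Note that this single pass will not report a match whose surrounding context overlaps a previous match, a caveat it shares with the grep -o pipeline in another answer.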
† The input (and output) of the program can be organized in a better way using named command-line arguments via Getopt::Long, for an invocation like
program.pl --sequence seq.txt --search search.txt --padding 100
where each argument may be optional, with defaults set in the file, and argument names may be shortened and/or given aliases, etc. Let me know if that is of interest.
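A minimal sketch of that interface, assuming the option names and defaults suggested above (untested; adjust to taste):
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;

# Defaults are assumptions; override them on the command line.
my %opt = (
    sequence => 'seq.txt',
    search   => 'search.txt',
    padding  => 100,
);
GetOptions(
    'sequence=s' => \$opt{sequence},
    'search=s'   => \$opt{search},
    'padding=i'  => \$opt{padding},
) or die "Usage: $0 [--sequence FILE] [--search FILE] [--padding N]\n";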
One in awk. -v b=3 is the before-context length, -v a=3 is the after-context length, and -v n=3 is the match length, which is always constant. It hashes all the substrings of seq.txt into memory, so memory use depends on the size of seq.txt and you may want to watch consumption with top. For example: abcdefghij -> s["def"]="abcdefghi", s["efg"]="bcdefghij" etc.
$ awk -v b=3 -v a=3 -v n=3 '
NR==FNR {
    e=length()-(n+a-1)
    for(i=1;i<=e;i++) {
        k=substr($0,(i+b),n)
        s[k]=s[k] (s[k]==""?"":ORS) substr($0,i,(b+n+a))
    }
    next
}
($0 in s) {
    print s[$0]
}' seq.txt search.txt
Output:
cdefghijk
mnopqrstu
rstuvwxyz
You can tell grep to search for all the patterns in one go.
sed 's/.*/.{0,100}&.{0,100}/' search.txt |
grep -zoEf - seq.txt |
sed G >do.grep
4000 patterns should be easy peasy, though if you get to hundreds of thousands, maybe you will want to optimize.
There is no Perl regex here, so I switched from the nonstandard grep -P to the POSIX-compatible and probably more efficient grep -E.
The surrounding context will consume any text it prints, so any match within 100 characters from the previous one will not be printed.
You can try the following approach to your problem:
load the string input data
load the patterns into an array
loop through each pattern and look for it in the string
form an array from the found matches
loop through the matches array and print the result
NOTE: the code is not tested due to the absence of input data
use strict;
use warnings;
use feature 'say';
my $fname_s = 'seq.txt';
my $fname_p = 'search.txt';
open my $fh, '<', $fname_s
or die "Couldn't open $fname_s";
my $data = do { local $/; <$fh> };
close $fh;
open my $fh2, '<', $fname_p
    or die "Couldn't open $fname_p";
my @patterns = <$fh2>;
close $fh2;
chomp #patterns;
for ( @patterns ) {
    my @found = $data =~ /(.{100}$_.{100})/g;
    s/(.{100})(.{50})(.{100})/$1 $2 $3/ && say for @found;
}
Test code for the provided test data (added later):
use strict;
use warnings;
use feature 'say';
my @pat = qw/fgh pqr uvw/;
my $data = do { local $/; <DATA> };

for ( @pat ) {
    say $1 if $data =~ /(.{3}$_.{3})/;
}
__DATA__
abcdefghijklmnopqrstuvwxyz
Output:
cdefghijk
mnopqrstu
rstuvwxyz

How to store value from cut command into a Perl array?

my @up = `cat abc.txt|head -2|tail -1|cut -d' ' -f1-3`;
Instead of storing the individual fields in the array, it's storing the entire output as a string in the first element.
This is the output I am getting:
$up[0] = 'xxx 12 234'
I want this:
@up = ('xxx', 12, 234)
It looks like you want the first three space-delimited fields of the second line of the file abc.txt.
The problem is that backticks return one line of output in each element of the array, and because cut prints all three fields on a single line, they appear as a single array element.
You could split the value again inside Perl, but when you have the whole of the Perl language available it's wasteful to use the shell for something so simple; you should do everything in Perl.
This program will do as you ask. I've used Data::Dump only so that you can verify that the contents of @up are as you wanted.
use strict;
use warnings 'all';
use Data::Dump;
my @up = do {
    open my $fh, '<', 'abc.txt' or die $!;
    <$fh>;                          # Skip one line
    (split ' ', <$fh>)[0 .. 2];
};

dd \@up;
Output:
["xxx", 12, 234]
You can either split the result by whitespace:
my @up = split(/\s+/, `cat abc.txt ...`);
Or you can first set the input record separator to a space. This is less flexible, however: it's a plain string, not a pattern, so two spaces in a row produce an empty field in the middle:
local $/ = " ";
my @up = `cat abc.txt ...`;
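A related detail worth knowing: split with a literal single-space string, split ' ', is a documented special case that splits on any run of whitespace and discards leading whitespace, so it avoids the empty-field problem entirely. A small sketch:
# split ' ' (a literal space) is Perl's special "awk-style" split:
# any run of whitespace is one separator, leading whitespace is ignored.
my $line = "xxx  12   234";     # note the repeated spaces
my @up = split ' ', $line;
# @up is now ('xxx', '12', '234') -- no empty fields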

Need to replace a specific column value of a file based on values in other columns of the file

Below is the content of my file, and below that is my requirement.
Based on the values in column 1 and column 5, I want to replace the value in column 7 with "1".
For example:
If column 1="change" and column 5="defer", then replace value in column 7 with "1".
If column 1="change" and column 5="defererence" then replace value in column 7 with "1".
Otherwise do not do anything with the row; keep it as it is.
Input file:
Change|sinmg|1234|ewfew|def|fdfd|JAMES|rewr|ROBERT|3|fe
Change|sinmg|2345|ewfew|defer|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|
noChange|sinmg|2323|ewfew|def|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|fe
Change|sinmg|3456|ewfew|defer|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|
Change|sinmg|2345|ewfew|defererence|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|
Above is just a sample to make it easier to explain. However, I want to pass values for column 1 and column 5 from a file to match against values in this file. Only if they match should column 7 be replaced with "1"; otherwise leave the row as it is.
I tried a couple of options and was not able to achieve the required results.
perl -F'\|' -i -lape 'if ($F[0] eq "change" && $F[4] eq "defer") { s/$F[6]/1/g;}' file_name
The above command replaces all values of 3 in the file irrespective of field. But I want to replace only the 7th column's value, based on the 1st and 5th columns, passing different values for them in a loop.
Adding more information:
As mentioned above, the example is just the simplest form of my problem. I have a file named "listfile" which holds the values for column 1 and column 5 to match. If the values in columns 1 and 5 of my source file match the values passed from "listfile", the solution should replace the value in column 7 with "1". Otherwise leave the source row as it is.
I tried the below, but was unable to achieve what's required.
#!/usr/bin/ksh
for line in $(cat dir/listfile)
do
VAR1=$(echo $line | nawk -F"|" '{print $1}')
VAR2=$(echo $line | nawk -F"|" '{print $2}')
nawk -F"|" 'BEGIN {OFS="|"} {if($1!="'"$VAR1"'" && $5!="'"$VAR2"'") {$8="1"; print $0;} else {print $0;}}' dir/sourcefile >> dir/sourcefile_REVISED
done
The number of records in the revised source file should match the original after replacing column 7. The only change is that, for all column 1 and column 5 values from listfile, the column 7 value is replaced by "1".
Thanks,
You can use awk to do this.
awk -F'|' 'BEGIN{OFS="|"}{if($1=="Change"&&$5=="defer"){$7=1}{print}}' file
I realize that you also need the 5th column to match "defererence". The following should work (note the parentheses, since && binds tighter than ||):
awk -F'|' 'BEGIN{OFS="|"}{if($1=="Change"&&($5=="defer"||$5=="defererence")){$7=1}{print}}' file
perl -F'\|' -i -lape '
    BEGIN{ $" = "|" }
    $F[6]=1, $_="@F" if $F[0] eq "Change" && $F[4] =~ /^defer(erence)?$/;
' file
The substitution operator s/// doesn't understand the concept of columns of data; it just operates on the given string. In this case, you are replacing whatever is in column 7 everywhere it occurs in the input line, because of the /g modifier.
The awk solution in the answer by @imauser is a good solution.
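If you want to stay in Perl, a column-aware sketch of the same edit is to split the line, assign to the field by index, and join it back, instead of reaching for s///; the -1 limit to split keeps the trailing empty fields that the sample data's trailing pipes produce:
#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <>) {
    chomp $line;
    my @F = split /\|/, $line, -1;   # limit -1 keeps trailing empty fields
    $F[6] = 1 if $F[0] eq 'Change'
              && $F[4] =~ /^defer(?:erence)?$/;
    print join('|', @F), "\n";
}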
With awk you would do:
$ awk '$1=="Change"&&$5~/^defer(erence)?$/{$7=1}1' FS='|' OFS='|' file
Change|sinmg|1234|ewfew|def|fdfd|JAMES|rewr|ROBERT|3|fe
Change|sinmg|2345|ewfew|defer|VENKTRAAAMMAMAMAMMAMA|1|rewr|BEAEY|3|
noChange|sinmg|2323|ewfew|def|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|fe
Change|sinmg|3456|ewfew|defer|VENKTRAAAMMAMAMAMMAMA|1|rewr|BEAEY|3|
Change|sinmg|2345|ewfew|defererence|VENKTRAAAMMAMAMAMMAMA|1|rewr|BEAEY|3|
You can still use a perl+regexp solution by "skipping" the first six fields before doing the replacement:
perl -F'\|' -i -lapE 'if ($F[0] eq "Change" && $F[4] =~ m{^defer(erence)?$}) { s{^(?:[^|]*\|){6}\K[^|]*}{1} }' file_name
An advantage over the awk solutions: you can still use the -i switch here.
You can achieve this using split and join in a Perl script as follows:
#!/usr/bin/perl
use warnings;
use strict;
my $infile = "dir/sourcefile";
my $listfile = "dir/listfile";
my $outfile = "dir/sourcefile_REVISED";
my @list;
open LFOPEN, $listfile or die $!;
while (<LFOPEN>) {
    chomp;
    my @col = split /\|/, $_;
    push @list, \@col;
}
close LFOPEN;

open IFOPEN, $infile or die $!;
open OFOPEN, '>', $outfile or die $!;
while (<IFOPEN>) {
    chomp;
    my @col = split /\|/, $_;
    foreach my $lref (@list) {
        $col[6] = '1' if ($col[0] eq $lref->[0] and $col[4] eq $lref->[1]);
    }
    print OFOPEN join('|', @col) . "\n";
}
close IFOPEN;
close OFOPEN;
Input (dir/sourcefile):
Change|sinmg|1234|ewfew|def|fdfd|JAMES|rewr|ROBERT|3|fe
Change|sinmg|2345|ewfew|defer|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|
noChange|sinmg|2323|ewfew|def|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|fe
Change|sinmg|3456|ewfew|defer|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|
Change|sinmg|2345|ewfew|defererence|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|
List (dir/listfile):
Change|defer
Change|defererence
Output (dir/sourcefile_REVISED):
Change|sinmg|1234|ewfew|def|fdfd|JAMES|rewr|ROBERT|3|fe
Change|sinmg|2345|ewfew|defer|VENKTRAAAMMAMAMAMMAMA|1|rewr|BEAEY|3
noChange|sinmg|2323|ewfew|def|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|fe
Change|sinmg|3456|ewfew|defer|VENKTRAAAMMAMAMAMMAMA|1|rewr|BEAEY|3
Change|sinmg|2345|ewfew|defererence|VENKTRAAAMMAMAMAMMAMA|1|rewr|BEAEY|3
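One design note on the script above: it rescans @list for every input row. If listfile grows, a hash keyed on the two columns gives a constant-time lookup instead; a sketch of that variant, reusing the file names above:
# Build the lookup once: each listfile line is already "col1|col5".
my %match;
open my $lfh, '<', $listfile or die $!;
while (my $line = <$lfh>) {
    chomp $line;
    $match{$line} = 1;
}
close $lfh;

# Then, inside the source-file loop, the foreach over @list becomes:
# $col[6] = '1' if $match{"$col[0]|$col[4]"};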

Perl - count number of columns per row in a csv file

I want to count the number of columns in a row for a CSV file.
row 1 10 columns
row 2 11 columns
etc.
I can print out the value of the last column, but I really just want a count per row.
perl -F, -lane "{print @keys[$_].$F[$_] foreach(-1)}" < testing.csv
I am on a Windows machine.
Thanks.
If you have a proper csv file, it can contain embedded delimiters (e.g. 1,"foo,bar",2), in which case a simple split will not be enough. You can use the Text::CSV module fairly easily with a one-liner like this:
Copy/paste version:
perl -MText::CSV -lwe"my $c=Text::CSV->new({sep_char=>','}); while($r=$c->getline(*STDIN)) { print scalar @$r }" < sorted.csv
Readable version:
perl -MText::CSV                       # use the Text::CSV module
-lwe                                   # add newline to print, use warnings
"my $c = Text::CSV->new();             # set up csv object
while( $r = $c->getline(*STDIN) ) {    # get rows from stdin
    print scalar @$r                   # print row size
}" < sorted.csv                        # input file to stdin
If your input can be erratic, Text::CSV->getline might choke on corrupted lines (which ends the while loop early), in which case it may be safer to use plain parse():
perl -MText::CSV -nlwe"
    BEGIN { $r = Text::CSV->new() };
    $r->parse($_);
    print scalar $r->fields
" comma.csv
Note that in this case we use a different input method. This is because while getline() requires a file handle, parse() does not. Since the diamond operator uses either ARGV or STDIN depending on your argument, I find it is better to be explicit.
If you don't have commas as part of the fields, you can split the line and count the number of fields:
#! /usr/bin/perl
use strict;
use warnings;
my @cols = split(',', $_);
my $n = @cols;
print "row $. $n columns\n";
You can call this as:
perl -n script.pl testing.csv
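If all you want is the count, a one-liner sketch along the same lines (double quotes for Windows, and still assuming no embedded commas as discussed above) would be:
perl -F, -lane "print 'row ', $., ' ', scalar(@F), ' columns'" testing.csv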

How can I correctly process this file containing tab separated values in Perl?

I am fairly new to Perl and know next to nothing about Perl's 'proper' syntax.
I have a text file that I use every day with a listing of names and other info for our users. This file changes daily; sometimes it has two rows in it (tab delimited), and other times it has 100+ rows.
The file also varies between 6 and 9 columns of data per row. I have put together a Perl script that uses the split function on tabs, but I run into an issue when, say, row a has 5 columns and the following row b has 6 fully populated columns.
I cannot figure out how to get Perl to see that row a only has 5 columns of data and to continue parsing the text file from that point forward. It does continue, but the output wraps lines strangely. How can I get around this issue? I hope that made sense.
You will have to post some code and possibly some sample data, but here's code that parses rows of different lengths without issue.
Script:
#!/usr/bin/perl
use strict;
while (<STDIN>)
{
    chomp;
    my @info = split("\t");
    print join(";", @info), "\n";
}
exit;
Test File:
jsmith 101 777-222-5555 Office 1 Building 1 Manager
aposse 104 777-222-5556 Office 2 Building 2 Stock Clerk
jbraza 105 777-222-5557 Office 3
mcuzui 102 777-222-5557 Office 3 Building 3 Cashier
ghines 107 777-222-5557 Office 3
Output:
%> test.pl < file.txt
jsmith;101;777-222-5555;Office 1;Building 1;Manager
aposse;104;777-222-5556;Office 2;Building 2;Stock Clerk
jbraza;105;777-222-5557;Office 3
mcuzui;102;777-222-5557;Office 3;Building 3;Cashier
ghines;107;777-222-5557;Office 3
You should post some sample data and code, and explain the desired behavior in terms of what the code currently does and what you want it to do. split will give you as many fields as there are in the input.
#!/usr/bin/perl
use strict; use warnings;
while ( my $row = <DATA> ) {
    last unless $row =~ /\S/;
    chomp $row;
    my @cells = split /\t/, $row;
    print "< @cells >\n";
}
__DATA__
1 2 3 4 5
a b c d e f
The Text::CSV module can be used for parsing tab-separated values as well. In fact, Text::CSV can parse values delimited by any character.
Relevant excerpt from its POD:
The module accepts either strings or files as input and can utilize any user-specified characters as delimiters, separators, and escapes so it is perhaps better called ASV (anything separated values) rather than just CSV.
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new( { 'sep_char' => "\t" } );
open my $fh, '<', 'data.tsv' or die "Unable to open: $!";
my @rows;
while ( my $row_ref = $csv->getline($fh) ) {
    push @rows, $row_ref;
}

$csv->sep_char('|');

for my $row_ref (@rows) {
    $csv->combine(@$row_ref);
    print $csv->string(), "\n";
}