Perl To Parse Whitespace Separated Columns

I have a large text file with three columns, each column separated by four spaces. I need a perl script to read this text file and output columns #1 and #2 to a new text file, with each column wrapped in quotation marks and the two separated by a comma in the output file.
The data in the text file looks like this:
9a2ba3c0580b5f3799ad9d6f487b2d3 /folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg HOST
What I would like the output to look like is:
"9a2ba3c0580b5f3799ad9d6f487b2d3","/folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg"

Easy as a one-liner:
perl -lane 'print join ",", map qq("$_"), @F[0, 1]'
-l handles newlines in print
-n reads the input line by line
-a splits each line on whitespace into the @F array
@F[0, 1] is an array slice; it extracts the first two elements of @F
map wraps each element in double quotes
join inserts the comma in between

Below is a script version for reference:
#!/usr/bin/perl
use strict;
use warnings;

my $defaultFileName = defined $ARGV[0] ? $ARGV[0] : "filename.txt";
die "Could not find file: $defaultFileName" unless -f $defaultFileName;
open my $fh, '<', $defaultFileName or die "Could not open $defaultFileName: $!";
while (my $line = <$fh>) {
    my @tmpData = split /\s+/, $line;
    printf "\"%s\",\"%s\"\n", $tmpData[0], $tmpData[1];
}
close $fh;

This can also be done with awk
>>cat test
9a2ba3c0580b5f3799ad9d6f487b2d3 /folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg HOST
9a2ba3c0580b5f3799ad9d6f487b2d3 /folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg HOST
9a2ba3c0580b5f3799ad9d6f487b2d3 /folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg HOST
9a2ba3c0580b5f3799ad9d6f487b2d3 /folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg HOST
OUTPUT:
>>awk '{print "\"" $1 "\",\"" $2 "\""}' test
"9a2ba3c0580b5f3799ad9d6f487b2d3","/folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg"
"9a2ba3c0580b5f3799ad9d6f487b2d3","/folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg"
"9a2ba3c0580b5f3799ad9d6f487b2d3","/folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg"
"9a2ba3c0580b5f3799ad9d6f487b2d3","/folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg"
>>awk '{print "\"" $1 "\",\"" $2 "\""}' test > output.txt
then output.txt will contain the desired output.

Related

get column list using sed/awk/perl

I have different files in the format below:
Scenario 1 :
File1
no,name
1,aaa
20,bbb
File2
no,name,address
5,aaa,ghi
7,ccc,mn
I would like to get the header that has more columns, provided the shorter header's columns appear at the start of it in the same order.
Expected output for scenario 1:
no,name,address
Scenario 2 :
File1
no,name
1,aaa
20,bbb
File2
no,age,name,address
5,2,aaa,ghi
7,3,ccc,mn
Expected results:
the message "Both file headers and positions are different"
I am interested in any short solution using bash / perl / sed / awk.
Perl solution:
perl -lne 'push @lines, $_;
close ARGV;
next if @lines < 2;
@lines = sort { length $a <=> length $b } @lines;
if (0 == index "$lines[1],", "$lines[0],") {
print $lines[1];
} else {
print "Both file headers and positions are different";
}' -- File1 File2
-n reads the input line by line and runs the code for each line
-l removes newlines from input and adds them to printed lines
closing the special file handle ARGV makes Perl open the next file and read from it instead of processing the rest of the currently opened file.
next makes Perl jump back to the top of the code for the next input line; execution continues past it only once both headers have been read.
sort sorts the lines by length so that we know the longer one is in the second element of the array.
index is used to check whether the shorter header is a prefix of the longer one (including the comma after the first header, so e.g. no,names is correctly rejected)

Extracting fasta ids after string match

I have a list of fasta sequences as following:
>Product_1_001:299:H377WBGXB:1:11101
TGATCATCTCACCTACTAATAGGACGATGACCCAGTGACGATGA
>Product_2_001:299:H377WBGXB:2:11101
CATCGATGATCATTGATAAGGGGCCCATACCCATCAAAACCGTT
The original fasta sequence is much longer than the subset posted here. I wanted to extract the 10 characters after the pattern "TCAT" into a separate file and did this
grep -oP "(?<=TCAT).{10}"
I do get the needed result as:
CTCACCTACT
TGATAAGGGG
I would like their corresponding fasta ids as one column and the extracted pattern as second column like:
>Product_1_001:299:H377WBGXB:1:11101 CTCACCTACT
>Product_2_001:299:H377WBGXB:2:11101 TGATAAGGGG
Try this one-liner
perl -lne ' /^[^>].+?(?<=TCAT)(.{10})/ and print $p,"\t",$1; $p=$_ ' file
with your given inputs
$ cat fasta.txt
>Product_1_001:299:H377WBGXB:1:11101
TGATCATCTCACCTACTAATAGGACGATGACCCAGTGACGATGA
>Product_2_001:299:H377WBGXB:2:11101
CATCGATGATCATTGATAAGGGGCCCATACCCATCAAAACCGTT
$ perl -lne ' /^[^>].+?(?<=TCAT)(.{10})/ and print $p,"\t",$1; $p=$_ ' fasta.txt
>Product_1_001:299:H377WBGXB:1:11101 CTCACCTACT
>Product_2_001:299:H377WBGXB:2:11101 TGATAAGGGG
$
Another way is to use an awk command like this:
cat <your_file>| awk -F"_" '/Product/{printf "%s", $0; next} 1'|awk -F"TCAT" '{ print substr($1,1,35) "\t" substr($2,1,10)}'
The output:
Product_1_001:299:H377WBGXB:1:11101 CTCACCTACT
Product_2_001:299:H377WBGXB:2:11101 TGATAAGGGG
Hope it helps.

Can "perl -a" somehow re-join @F using the original whitespace?

My input has a mix of tabs and spaces for readability. I want to modify a field using perl -a, then print out the line in its original form. (The data is from findup, showing me a count of duplicate files and the space they waste.) Input is:
2 * 4096 backup/photos/photo.jpg photos/photo.jpg
2 * 111276032 backup/books/book.pdf book.pdf
The output would convert field 3 to kilobytes, like this:
2 * 4 KB backup/photos/photo.jpg photos/photo.jpg
2 * 108668 KB backup/books/book.pdf book.pdf
In my dream world, this would be my code, since I could just will perl to automatically recombine @F and preserve the original whitespace:
perl -lanE '$F[2]=int($F[2]/1024)." KB"; print;'
In real life, joining with a single space seems like my only option:
perl -lanE '$F[2]=int($F[2]/1024)." KB"; print join(" ", @F);'
Is there any automatic variable which remembers the delimiters? If I had a magic array like that, the code would be:
perl -lanE 'BEGIN{use List::Util "reduce";} $F[2]=int($F[2]/1024)." KB"; print reduce { $a . shift(@magic) . $b } @F;'
No, there is no such magic object. You can do it by hand though
perl -wnE'@p = split /(\s+)/; $p[4] = int($p[4]/1024) . " KB"; print @p' input.txt
The capturing parens in split's pattern mean that the delimiters are also returned, so the exact whitespace is preserved. Since the whitespace runs are in the array, the size is now the fifth element.
As it turns out, -F has this same property. Thanks to Сухой27. Then
perl -F'(\s+)' -lanE'$F[4] = int($F[4]/1024) . " KB"; say @F' input.txt
Note: with 5.20.0 "-F now implies -a and -a implies -n". Thanks to ysth.
You could just find the correct part of the line and modify it:
perl -wpE's/^\s*+(?>\S+\s+){2}\K(\S+)/int($1\/1024) . " KB"/e'

Perl - "/" causing issues for splitting by comma

I'm trying to split a file by ",". It is a CSV file.
However, one "column" has values that includes "/" and spaces. And it seems to freak out with that column and does not print anything after that column but moves on to the next row.
My code is simply:
perl -lane '@values = split(",",$F[0]); print $values[0]."\t".$values[3];' basefile.txt > newfile.txt
The basefile.txt looks like:
"1","text","abc // 123 /// some more text // text","filename1"
"2","text","abc // 123 /// some more text // text","filename2"
"3","text","abc // 123 /// some more text // text","filename3"
My newfile.txt should have an output of:
"1","filename1"
"2","filename2"
"3","filename3"
Instead I get:
"1",
"2",
"3",
Thanks!
It's not the / that is confusing perl here, it's the spaces combined with the -a flag. Try:
perl -lne '@values = split(",",$_); print $values[0]."\t".$values[3]' basefile
Or, better yet, use Text::CSV_XS to do the splitting.
It's not the '/', it's the spaces.
The -a flag causes perl to split each line of input and put the fields into the variable @F. The delimiter for this split operation is whitespace, unless you override it with the -Fdelimiter option on the command line.
So for the input
"1","text","abc // 123 /// some more text // text","filename"
with the -lan flags specified, perl sets
$F[0] = '"1","text","abc';
$F[1] = '//';
$F[2] = '123';
$F[3] = '///';
$F[4] = 'some';
etc.
It seems like you just want to do your split operation on the whole line. In which case you should stop using the -a flag and just say
@values = split(",",$_); ...
or leverage the -a and -F... options and say
perl -F/,/ -lane '@values=@F; ...'

Sort a file with unordered columns of integers

I have an input file with two columns of integer values. I would like to chop the input file in this way
input file:
...
...
12312 565456
565456 12312
...
...
#
output file:
...
...
12312 565456
...
...
namely, if a pair of numbers occurs more than once (in either order), write a single line to the output file in which the first number is the smaller of the two.
How can this be done with sort or a perl script?
You can try:
perl -nale ' @F=reverse @F if($F[0]>$F[1]);
$x=$F[0]." ".$F[1]; if(!$h{$x}){print $x;$h{$x}=1;}'
You could combine perl and sort:
perl -lne 'BEGIN { $, = " " } print sort split' infile | sort -u
awk '$2<$1 {print $2, $1} $1<=$2 {print $1, $2}' file | sort -u
would also work (both branches rebuild the line with the same separator, so sort -u can deduplicate the normalised pairs).