Perl - Print first letter of column - perl

I'm trying to print the first letter of column2 of an input file as well as other columns of interest. I'm not sure why the following script, adapted from Matching first letter of word gives me an 'Use of uninitialized value $columns[2]' warning.
Input File Example:
ATOM 1 CAY GLY X 1 -0.124 0.401 -0.153 1.00 2.67 PEP
ATOM 2 HY1 GLY X 1 -0.648 0.043 -1.064 1.00 0.00 PEP
ATOM 3 HY2 GLY X 1 -0.208 1.509 -0.145 1.00 0.00 PEP
Output File Example:
1 C -0.124 0.401 -0.153 1.00 2.67
2 H -0.648 0.043 -1.064 1.00 0.00
3 H -0.208 1.509 -0.145 1.00 0.00
Script
open (my $input_fh, "<", $filename) or die $!;
while (my $data = <$input_fh>) {
chomp $data;
my #columns = split(/\t/, $data);
my ($firstletter) = ($columns[2] =~ m/^\d+(\w)/);
if (/CAY/../HT2/)
print $output_fh join ("\t", $columns[1], $firstletter, $columns[6], $columns[7], $columns[8]), "\n";
}
UPDATE The warning occurred due to the if (/CAY/../HT2/) statement for some reason -- but since the input files are identical, I don't really need this condition. Also, since there are no digits in column2 it is more appropriate to use the /^(\w)/ regex.

Is there some particular reason that you must split on tabs? Getting various kinds of white space in an arbitrary text file correctly can be picky. If not necessary, it seems fully fitting to just split by (any) space, then grab the first letter
my #cols = split '\s+', $data;
my ($firstletter) = $cols[1] =~ m/^(\w)/;
I am not sure what the rest does but you can easily pluck the columns you need.

Try to debug what you get after splitting:
my #columns = split(/\t/, $data);
local $" = "\n"; print "$data\nSplitted into:\n#columns";
As guess your file have double \t characters. I mean you probably have:
ATOM\t\t1 CAY GLY X... so second column is undef

It sounds to me like the code that gave that warning was not what you show but instead had something like
($columns[2]) = ($columns[2] =~ m/^\d+(\w)/);
And you are getting the warning because the regex is failing due to not finding a digit. Maybe you meant \d*?

For me, maybe i would like to use cut command and pipeline, then split command to get the exact info you want.

Related

Append adds to end of last line and not new line

So lets say I have text file that looks like this
new york
new jersey
florida
South Carolina
I'm going to append california to it, but when I do it ends up looking like this:
new york
new jersey
florida
South Carolinacalifornia
How can I make it so it appends to a new line? I thought about just making a new line before the appending of california but then every time I run the script it'll just have gaps between the lines because of the new line
Presumably your file is a list of states all on their own line. Your issue appears to be that the line: "South Carolina", is missing a newline character at the end, which is inconsistent with the rest of your data.
This is a data issue. With hidden characters shown your data looks like this:
new york\n
new jersey\n
florida\n
South Carolina
When it should look like this:
new york\n
new jersey\n
florida\n
South Carolina\n
In a scenario like this where your data is inconsistent the best way to programmatically fix it is to check if the file ends in a newline, and if not, append one.
The issue is that the final line of your original file isn't terminated with a newline. If its contents are within your control then the best solution is simply to ensure that every line printed to the file is properly terminated, but if you have to deal with malformed data then there are a few options
The first, as people have said, is to read the entire file into memory, remove any existing terminators with chomp, and print them back out to the file with the correct newline after all of them
If your file is of any significant size then this approach is wasteful. You may avoid the rewriting by reading the last character of the file and checking whether it is a newline as required. Then, when the file is opened for appending, you can first add a newline if it was originally missing, followed by the new data record. That would look like this
The seek call is used to move the read position to just before the last character, then <$fh> will read the final character which can be compared to "\n" to establish whether the file is properly terminated
Note that, if there is any chance that the file is completely empty or non-existent before your program runs, then you will have to code for the case where the open fails or the <$fh> returns undef
use strict;
use warnings 'all';
use autodie;
use Fcntl ':seek';
my ($file) = #ARGV;
my $terminated = do {
open my $fh, '<', $file;
seek $fh, -1, SEEK_END;
<$fh> eq "\n";
};
open my $fh, '>>', $file;
print $fh "\n" unless $terminated;
print $fh "california\n";
close $fh;
As mentioned by Dkwan33 this might be data issue. You can use "od" command to run and see if every line is ending with \n or not.
od -bc data.txt
0000000 156 145 167 040 171 157 162 153 012 156 145 167 040 152 145 162
n e w y o r k \n n e w j e r
0000020 163 145 171 012 146 154 157 162 151 144 141 012 123 157 165 164
s e y \n f l o r i d a \n S o u t
0000040 150 040 103 141 162 157 154 151 156 141 012
h C a r o l i n a \n
0000053
If you want to handle it via Perl then I would suggest you to first do the "chomp" on each line and then print that line with \n.

Extract blocks of text from a file and write each block to a new file using perl, but it is not working [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 5 years ago.
Improve this question
I have a large data file (text) that contains records from a database.
Each record is is delimited by "-" (146 of them)
I would like to write the text between each delimiter to a new file based on data from that block. The new file name should contain a piece of the data followed be the year (ie 2017)
For instance with this sample file:
1. 12/18/17
Company
PAGE 2
1:14 PM GET CURRENTLY SELECTED DEBTORS
------------------------------------------------------------------------------------------------------------------------
----------------------------- STATUS **Acct:1234** Disposition:9000 CANCEL Wait: 04/11/17
DEBTOR Name:Doe John
Ssn:123456789 Cbr: Ph:555-555-555
Rp:Doe John Ssn: Rp Ph:
Adr1:211 some road POE: Lgl: POE Ph:
City:anytown Cty: Canc:UNC Born:01/01/1937
St: VA Zip:54321 St: Zip: COF:N Sal:
Clnt:00248 Someplace, 93076 Org: 4
80.00
List:01/05/17 Srv:08/25/16 Pl95: Time:9 Calls:0 Con:0 Bal: 480.00
Co-Maker's Previous Address Spouse's Previous Address MULTIPLE
ACCOUNTS RM# Acct Name / Client Chk# / Lst Srv Lpy
Col Disp Bal Check Reason Drivers Licen se #
PRN INT LI3 LI4 AIN CC ATY MS1 PJI 1 142424* Doe John
93076/00248/Somewhere 01/05/17 08/25/16 0 9000
480.00
480.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2. ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
So from the above is need to extract this block and write it to a new file named 1234_2017.txt based on the Acct:1234 data for that record- then move to the next block. I have counted the number of records for the first file at 754 records, and I have a total of 10 files such as this.
#! / usr/bin/perl -w
use strict;
use warnings;
use POSIX 'strftime';
my $filename = "Facs_Data.txt";
my $outfile = "Acct_".strftime '%Y', localtime;
my $start = "'-' x 146";
open (INFILE, "<$filename"); open (OUTFILE, ">$outfile");
while ($start) {
print $outfile;
next
}
close (INFILE);
close (OUTFILE);
Any suggestions will be greatly appreciated.
If you're not going to put any effort into explaining what you have tried, I'm not going to put any effort into explaining my solution.
#!/usr/bin/perl
use strict;
use warnings;
local $/ = '>' . ('-' x 146) . "\n";
my $year = (localtime)[5] + 1900;
# <> reads from STDIN or a filename given as an argument
while (<>) {
next unless /\S/; # Ignore empty records (like the first one)
my ($acct) = /Acct:(\d+)/;
my $file = "${acct}_${year}.txt";
open my $fh, '>', $file
or die "Can't open $file: $!\n";
print $fh;
}

Manipulating digits

This is a program which grabs lines which contains the $position AND $amino value in the first two columns.
Code:
#!/usr/bin/perl
my $id = $ARGV[0];
my $position = $ARGV[1]; # POSITION OF THE RESIDUE
my $amino= $ARGV[2]; #THREE LETTER AMINO ACID CODE IN CAPITALS
my #grabbed;
open (FILE, $id.$amino.$position.".hb2");
#CREATES AN ARRAY WITH ONLY THE VALUES FROM THE HB2 FILE. REMOVES THE HEADER OF THE FILE.
while (<FILE>) {
if (/^-/) {
push #grabbed, $_;
while (<FILE>) {
last if /^$/;
push #grabbed, $_;
}
}
}
close (FILE);
for ( #grabbed ) {
my #f = split;
if (( $f[2] == "-"."00".$position."-".$amino ) or ($f[0] == "-"."00".$position."-".$amino)) {
push #line, $id.$amino.$position, " ",$_;
}
}
print #line;
Partial input data :
-0007-ARG NH2 -0009-GLN OE1 3.24 SS 2 6.00 143.3 2.38 105.9 95.8 1 #CASE 1
-0008-GLU N -0008-GLU OE1 2.62 MS 0 -1.00 120.8 1.96 102.3 103.4 2
-0011-ILE N -0117-ARG O 2.87 MM 106 4.90 144.0 2.00 127.5 139.0 3
-0117-ARG N -0011-ILE O 2.75 MM 106 4.90 160.4 1.79 153.2 148.6 4 #CASE 2
-0016-SER N -0012-THR O 2.89 MM 4 6.00 156.2 1.95 149.8 154.8 5 #CASE 3
-0017-ALA N -0013-LEU O 3.10 MM 4 6.24 152.8 2.17 143.4 149.7 6
-0018-GLU N -0014-ARG O 3.04 MM 4 6.24 154.1 2.11 147.2 154.2 7
-0019-ILE N -0015-GLY O 2.90 MM 4 6.16 155.8 1.96 150.7 156.2 8
-0016-SER OG -0188-THR OG1 2.72 SS 172 5.92 172.0 1.73 98.9 99.6 9
-0188-THR OG1 -0016-SER OG 2.72 SS 172 5.92 163.7 1.75 116.4 115.1 10
Question :
In order to generalize the program I made the match as :
( $f[2] == "-"."00".$position."-".$amino ) or ($f[0] == "-"."00".$position."-".$amino)
The format is always four digits after "-" before $amino (-0188-THR). I suddenly realized that my code wouldnt work if the $position input is "one digit(like CASE 1)" or "three digit (like CASE 2, column 1)". Since I hard coded it as format as "-" followed by two zeros and THEN position, it has to always be two digit input to work.
I am stumped to generalize this code so that I could put in 1/2/3 digits. The remaining digits would always be replaced by zeros.
You can format the string using sprintf:
my $mstring = sprintf("-%04d-%s", $position, $amino);
if ( ($f[2] eq $mstring) or ($f[0] eq $mstring) ) {
# ...
}
Here, %04d adds 0's to the left of position to make it 4 digits long.
First, == operator in perl used only for comparing arithmetic expressions
To compare strings you should use eq operator
Second, to format strings from digits you can use sprintf function.
if ($f[2] eq "-".sprintf("%04d", $position)."-".$amino ...

Perl+Selenium: chomp() fails

I'm using Selenium for work and I have extract some data from "//ul", unfortunately this data contains a newline, I tried to use chomp() function to remove this (because I need to write in a CSV's file) but it's not working, the portion of code is:
open (INFO, '>>file.csv') or die "$!";
print INFO ("codice\;descrizione\;prezzo\;URLFoto\n");
my $sel = Test::WWW::Selenium->new( host => "localhost",
port => 4444,
browser => "*chrome",
browser_url => "http://www.example.com/page.htm" );
$sel->open_ok("/page.htm");
$sel->click_ok("//table[2]/tbody/tr/td/a/img");
$sel->wait_for_page_to_load_ok("30000");
my $descrizione = $sel->get_text("//ul");
my $prezzo = $sel->get_text("//p/font");
my $codice = $sel->get_text("//p/font/b");
my $img = $sel->get_attribute ("//p/img/\#src");
chomp ($descrizione);
print INFO ("$codice\;$descrizione\;$prezzo\;$img\n");
$sel->go_back_ok();
# Close file
close (INFO);
but the output is:
Art. S500 Set Yoga "Siddhartha";Idea regalo ?SET YOGA Siddhartha? Elegante scatola in cartone lucido contenente:
2 mattoni in legno naturale mis. cm 20 x 12,5 x 7
1 cinghia in cotone mis. cm 4 x 235
1 stuoia in cotone mis. cm 70 x 170
1 manuale di introduzione allo yoga stampato
Tutto rigorosamente realizzato con materiali natural;€ 82,50;../images/S500%20(Custom).jpg
chomp removes the platform specific end-of-line character sequence from the end of a string or a set of strings.
In your case, you seem to have a single string with embedded newlines and/or carriage returns. Hence, you probably want to replace any sequence of possible line endings with something else, let's say a single space character. In that case, you'd do:
$descrizione =~ s/[\r\n]+/ /g;
If you want to replace all vertical whitespace, Perl has a special character class shortcut for that:
use v5.10;
$descrizione =~ s/\v+/ /g;
Use this to remove \r as well.
$descrizione =~ s#[\r\n]+\z##;
regards,

Loading Big files into Hashes in Perl (BLAST tables)

I'm a perl beginner, please help me out with my query... I'm trying to extract information from a blast table (a snippet of what it looks like is below):
It's a standard blast table input... I basically want to extract any information on a list of reads (Look at my second script below , to get an idea of what I want to do).... Anyhow this is precisely what I've done in the second script:
INPUTS:
1) the blast table:
38.1 0.53 59544 GH8NFLV01A02ED GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345 1 YP_003242370 Dynamin family protein [Paenibacillus sp. Y412MC10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590
34.3 7.5 123828 GH8NFLV01A03QJ GH8NFLV01A03QJ rank=0239249 x=305.0 y=1945.5 length=452 1 XP_002639994 Hypothetical protein CBG10824 [Caenorhabditis briggsae] 3 0 52.1739130434783 32.6086956521739 46 452 367 10.1769911504425 12.5340599455041 111 248 79 124
37.7 0.70 62716 GH8NFLV01A09B8 GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306
37.7 0.98 33114 GH8NFLV01A0H5C GH8NFLV01A0H5C rank=0066011 x=298.0 y=2638.5 length=573 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] -3 0 73.5294117647059 52.9411764705882 34 573 703 5.93368237347295 4.83641536273115 131 232 273 306
103 1e-020 65742 GH8NFLV01A0MXI GH8NFLV01A0MXI rank=0124865 x=300.5 y=644.0 length=475 1 ABZ08973 hypothetical protein ALOHA_HF4000APKG6B14ctg1g18 [uncultured marine crenarchaeote HF4000_APKG6B14] 2 0 77.9411764705882 77.9411764705882 68 475 151 14.3157894736842 45.0331125827815 2 205 1 68
41.6 0.053 36083 GH8NFLV01A0QKX GH8NFLV01A0QKX rank=0071366 x=301.0 y=1279.0 length=526 1 XP_766153 hypothetical protein [Theileria parva strain Muguga] -1 0 66.6666666666667 56.6666666666667 30 526 304 5.70342205323194 9.86842105263158 392 481 31 60
45.4 0.003 78246 GH8NFLV01A0Z29 GH8NFLV01A0Z29 rank=0148293 x=304.0 y=1315.0 length=432 1 ZP_04111769 hypothetical protein bthur0007_56280 [Bacillus thuringiensis serovar monterrey BGSC 4AJ1] 3 0 51.8518518518518 38.8888888888889 54 432 193 12.5 27.979274611399 48 209 97 150
71.6 4e-011 97250 GH8NFLV01A14MR GH8NFLV01A14MR rank=0184885 x=317.5 y=609.5 length=314 1 ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244] 1 0 92.5 92.5 40 314 311 12.7388535031847 12.8617363344051 193 312 13 52
58.2 5e-007 154555 GH8NFLV01A1KCH GH8NFLV01A1KCH rank=0309994 x=310.0 y=2991.0 length=267 1 ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244] 1 0 82.051282051282 82.051282051282 39 267 311 14.6067415730337 12.540192926045 142 258 1 39
2) The reads list:
GH8NFLV01A09B8
GH8NFLV01A02ED
etc
etc
3) the output I want:
37.7 0.70 62716 GH8NFLV01A09B8 GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306
38.1 0.53 59544 GH8NFLV01A02ED GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345 1 YP_003242370 Dynamin family protein [Paenibacillus sp. Y412MC10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590
I want a subset of the information in the first list, given a list of read names I want to extract (that is found in the 4th column)
Instead of hashing the reads list (only?) I want to hash the blast table itself, and use the information in Column 4 (of the blast table)as the keys to extract the values of each key, even when that key may have more than one value(i.e: each read name might actually have more than one hit , or associated blast result in the table), keeping in mind, that the value includes the WHOLE row with that key(readname) in it.
My greplist.pl script does this, but is very very slow, I think , ( and correct me if i'm wrong) that by loading the whole table in a hash, that this should speed things up tremendously ...
Thank you for your help.
My scripts:
The Broken one (mambo5.pl)
#!/usr/bin/perl -w
# purpose: extract blastX data from a list of readnames
use strict;
open (DATA,$ARGV[0]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
open (LIST,$ARGV[1]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
my %hash = <DATA>;
close (DATA);
my $filename=$ARGV[0];
open(OUT, "> $filename.bololom");
my $readName;
while ( <LIST> )
{
#########;
if(/^(.*?)$/)#
{
$readName=$1;#
chomp $readName;
if (exists $hash{$readName})
{
print "bingo!";
my $output =$hash{$readName};
print OUT "$output\n";
}
else
{
print "it aint workin\n";
#print %hash;
}
}
}
close (LIST);
The Slow and quick cheat (that works) and is very slow (my blast tables can be about 400MB to 2GB large, I'm sure you can see why it's so slow)
#!/usr/bin/perl -w
##
# This script finds a list of names in a blast table and outputs the result in a new file
# name must exist and list must be correctly formatted
# will not output anything using a "normal" blast file, must be a table blast
# if you have the standard blast output use blast2table script
use strict;
my $filein=$ARGV[0] or die ("usage: ./listgrep.pl readslist blast_table\n");
my $db=$ARGV[1] or die ("usage: ./listgrep.pl readslist blast_table\n");
#open the reads you want to grep
my $read;
my $line;
open(READSLIST,$filein);
while($line=<READSLIST>)
{
if ($line=~/^(.*)$/)
{
$read = $1;
print "$read\n";
system("grep \"$read\" $db >$read\_.out\n");
}
#system("grep $read $db >$read\_.out\n");
}
system("cat *\_.out >$filein\_greps.txt\n");
system("rm *.out\n");
I don't know how to define that 4th column as the key : maybe I could use the split function, but I've tried to find a way that does this for a table of more than 2 columns to no avail... Please help!
If there is an easy way out of this please let me know
Thanks !
I'd do the opposite i.e read the readslist file into a hash then walk thru the big blast file and print the desired lines.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
# Read the readslist file into a hash
open my $fh, '<', 'readslist' or die "Can't open 'readslist' for reading:$!";
my %readslist = map { chomp; $_ => 1 }<$fh>;
close $fh;
open my $fh_blast, '<', 'blastfile' or die "Can't open 'blastfile' for reading:$!";
# loop on all the blastfile lines
while (<$fh_blast>) {
chomp;
# retrieve the key (4th column)
my ($key) = (split/\s+/)[3];
# print the line if the key exists in the hash
say $_ if exists $readslist{$key};
}
close $fh_blast;
I suggest you build an index to turn your blasts file temporarily into an indexed-sequential file. Read through it and build a hash of addresses within the file where every record for each key starts.
After that it is just a matter of seeking to the correct places in the file to pick up the records required. This will certainly be faster than most simple solutions, as it entails read the big file only once. This example code demonstrates.
use strict;
use warnings;
use Fcntl qw/SEEK_SET/;
my %index;
open my $blast, '<', 'blast.txt' or die $!;
until (eof $blast) {
my $place = tell $blast;
my $line = <$blast>;
my $key = (split ' ', $line, 5)[3];
push #{$index{$key}}, $place;
}
open my $reads, '<', 'reads.txt' or die $!;
while (<$reads>) {
next unless my ($key) = /(\S+)/;
next unless my $places = $index{$key};
foreach my $place (#$places) {
seek $blast, $place, SEEK_SET;
my $line = <$blast>;
print $line;
}
}
Voila, 2 ways of doing this, one with nothing to do with perl :
awk 'BEGIN {while ( i = getline < "reads_list") ar[$i] = $1;} {if ($4 in ar) print $0;}' blast_table > new_blast_table
Mambo6.pl
#!/usr/bin/perl -w
# purpose: extract blastX data from a list of readnames. HINT: Make sure your list file only has unique names , that way you save time.
use strict;
open (DATA,$ARGV[0]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
open (LIST,$ARGV[1]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
my %hash;
my $val;
my $key;
while (<DATA>)
{
#chomp;
if(/((.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?))$/)
{
#print "$1\n";
$key= $5;#read
$val= $1;#whole row; notice the brackets around the whole match.
$hash{$key} .= exists $hash{$key} ? "$val\n" : $val;
}
else {
print "something wrong with format";
}
}
close (DATA);
open(OUT, "> $ARGV[1]\_out\.txt");
my $readName;
while ( <LIST> )
{
#########;
if(/^(.*?)$/)#
{
$readName=$1;#
chomp $readName;
if (exists $hash{$readName})
{
print "$readName\n";
my $output =$hash{$readName};
print OUT "$output";
}
else
{
#print "it aint workin\n";
}
}
}
close (LIST);
close (OUT);
The oneliner is faster, and probably better than my script, I'm sure some people can find easier ways to do it... I just thought I'd put this up since it does what I want.