So lets say I have text file that looks like this
new york
new jersey
florida
South Carolina
I'm going to append california to it, but when I do it ends up looking like this:
new york
new jersey
florida
South Carolinacalifornia
How can I make it so it appends to a new line? I thought about just making a new line before the appending of california but then every time I run the script it'll just have gaps between the lines because of the new line
Presumably your file is a list of states all on their own line. Your issue appears to be that the line: "South Carolina", is missing a newline character at the end, which is inconsistent with the rest of your data.
This is a data issue. With hidden characters shown your data looks like this:
new york\n
new jersey\n
florida\n
South Carolina
When it should look like this:
new york\n
new jersey\n
florida\n
South Carolina\n
In a scenario like this where your data is inconsistent the best way to programmatically fix it is to check if the file ends in a newline, and if not, append one.
The issue is that the final line of your original file isn't terminated with a newline. If its contents are within your control then the best solution is simply to ensure that every line printed to the file is properly terminated, but if you have to deal with malformed data then there are a few options
The first, as people have said, is to read the entire file into memory, remove any existing terminators with chomp, and print them back out to the file with the correct newline after all of them
If your file is of any significant size then this approach is wasteful. You may avoid the rewriting by reading the last character of the file and checking whether it is a newline as required. Then, when the file is opened for appending, you can first add a newline if it was originally missing, followed by the new data record. That would look like this
The seek call is used to move the read position to just before the last character, then <$fh> will read the final character which can be compared to "\n" to establish whether the file is properly terminated
Note that, if there is any chance that the file is completely empty or non-existent before your program runs, then you will have to code for the case where the open fails or the <$fh> returns undef
use strict;
use warnings 'all';
use autodie;
use Fcntl ':seek';
my ($file) = #ARGV;
my $terminated = do {
open my $fh, '<', $file;
seek $fh, -1, SEEK_END;
<$fh> eq "\n";
};
open my $fh, '>>', $file;
print $fh "\n" unless $terminated;
print $fh "california\n";
close $fh;
As mentioned by Dkwan33 this might be data issue. You can use "od" command to run and see if every line is ending with \n or not.
od -bc data.txt
0000000 156 145 167 040 171 157 162 153 012 156 145 167 040 152 145 162
n e w y o r k \n n e w j e r
0000020 163 145 171 012 146 154 157 162 151 144 141 012 123 157 165 164
s e y \n f l o r i d a \n S o u t
0000040 150 040 103 141 162 157 154 151 156 141 012
h C a r o l i n a \n
0000053
If you want to handle it via Perl then I would suggest you to first do the "chomp" on each line and then print that line with \n.
Related
I'm trying to print the first letter of column2 of an input file as well as other columns of interest. I'm not sure why the following script, adapted from Matching first letter of word gives me an 'Use of uninitialized value $columns[2]' warning.
Input File Example:
ATOM 1 CAY GLY X 1 -0.124 0.401 -0.153 1.00 2.67 PEP
ATOM 2 HY1 GLY X 1 -0.648 0.043 -1.064 1.00 0.00 PEP
ATOM 3 HY2 GLY X 1 -0.208 1.509 -0.145 1.00 0.00 PEP
Output File Example:
1 C -0.124 0.401 -0.153 1.00 2.67
2 H -0.648 0.043 -1.064 1.00 0.00
3 H -0.208 1.509 -0.145 1.00 0.00
Script
open (my $input_fh, "<", $filename) or die $!;
while (my $data = <$input_fh>) {
chomp $data;
my #columns = split(/\t/, $data);
my ($firstletter) = ($columns[2] =~ m/^\d+(\w)/);
if (/CAY/../HT2/)
print $output_fh join ("\t", $columns[1], $firstletter, $columns[6], $columns[7], $columns[8]), "\n";
}
UPDATE The warning occurred due to the if (/CAY/../HT2/) statement for some reason -- but since the input files are identical, I don't really need this condition. Also, since there are no digits in column2 it is more appropriate to use the /^(\w)/ regex.
Is there some particular reason that you must split on tabs? Getting various kinds of white space in an arbitrary text file correctly can be picky. If not necessary, it seems fully fitting to just split by (any) space, then grab the first letter
my #cols = split '\s+', $data;
my ($firstletter) = $cols[1] =~ m/^(\w)/;
I am not sure what the rest does but you can easily pluck the columns you need.
Try to debug what you get after splitting:
my #columns = split(/\t/, $data);
local $" = "\n"; print "$data\nSplitted into:\n#columns";
As guess your file have double \t characters. I mean you probably have:
ATOM\t\t1 CAY GLY X... so second column is undef
It sounds to me like the code that gave that warning was not what you show but instead had something like
($columns[2]) = ($columns[2] =~ m/^\d+(\w)/);
And you are getting the warning because the regex is failing due to not finding a digit. Maybe you meant \d*?
For me, maybe i would like to use cut command and pipeline, then split command to get the exact info you want.
Need help in merging/concatenating /combining /binding etc
I have several ascii files each defining one variable which I have converted to a single column array
I have such columnised data for many variables ,so I need to perform a column bind like R does and make it one single file.
I can do the same in R but there are too many files. Being able to do it with one single code will help save a lot of time.
Using the following code ,new to perl and need help with this.
#filenames = ("file1.txt","file2.txt");
open F2, ">file_combined.txt" or die;
for($j = 0; $j< scalar #filenames;$j++){
open F1, $filenames[$j] or die;
for($i=1;$i<=6;$i++){$line=<F1>;}
while($line=<F1>){
chomp $line;
#spl = split '\s+', $line;
for($i=0;$i<scalar #spl;$i++){
print F2 "$spl[$i]\n";
paste "file_bio1.txt","file_bio2.txt"> file_combined.txt;
}
}
close F1;
}
Input files here are Ascii text files of a raster.They look like this
32 12 34 21 32 21 22 23
12 21 32 43 21 32 21 12
The above mentioned code without the paste syntax converts these files into a single column
32
12
34
21
32
21
22
23
12
21
32
43
21
32
21
12
The output should look like this
12 21 32
32 23 23
32 21 32
12 34 12
43 32 32
32 23 23
32 34 21
21 32 23
Each column represents a different ascii file.
I need around 15 such ascii files into one dataframe.I can do the same in R but it consumes a lot of time as the number of files and regions of interest are too many and the files are a bit large too.
Let's step through what you have...
# files you want to open for reading..
#filenames = ("file1.txt","file2.txt");
# I would use the 3 arg lexical scoped open
# I think you want to open this for 'append' as well
# open($fh, ">>", "file_combined.txt") or die "cannot open";
open F2, ">file_combined.txt" or die;
# #filenames is best thought as a 'list'
# for my $file (#filenames) {
for($j = 0; $j< scalar #filenames;$j++){
# see above example of 'open'
# - $filenames[$j] + $file
open F1, $filenames[$j] or die;
# what are you trying to do here? You're overriding
# $line in the next 'while loop'
for($i=1;$i<=6;$i++){$line=<F1>;}
# while(<$fh1>) {
while($line=<F1>){
chomp $line;
# #spl is short for split?
# give '#spl' list a meaningful name
#spl = split '\s+', $line;
# again, #spl is a list...
# for my $word (#spl) {
for($i=0;$i<scalar #spl;$i++){
# this whole block is a bit confusing.
# 'F2' is 'file_combined.txt'. Then you try and merge
# ( and overwrite the file) with the paste afterwards...
print F2 "$spl[$i]\n";
# is this a 'system call'?
# Missing 'backticks' or 'system'
paste "file_bio1.txt","file_bio2.txt"> file_combined.txt;
}
}
# close $fh1
close F1;
}
# I'm assuming there's a 'close F2' somewhere here..
It looks like you're trying to do this:
#filenames = ("file1.txt","file2.txt");
$oufile = "combined_text.txt";
`paste $filenames[0] $filenames[1] > $outfile`;
I'm using Selenium for work and I have extract some data from "//ul", unfortunately this data contains a newline, I tried to use chomp() function to remove this (because I need to write in a CSV's file) but it's not working, the portion of code is:
open (INFO, '>>file.csv') or die "$!";
print INFO ("codice\;descrizione\;prezzo\;URLFoto\n");
my $sel = Test::WWW::Selenium->new( host => "localhost",
port => 4444,
browser => "*chrome",
browser_url => "http://www.example.com/page.htm" );
$sel->open_ok("/page.htm");
$sel->click_ok("//table[2]/tbody/tr/td/a/img");
$sel->wait_for_page_to_load_ok("30000");
my $descrizione = $sel->get_text("//ul");
my $prezzo = $sel->get_text("//p/font");
my $codice = $sel->get_text("//p/font/b");
my $img = $sel->get_attribute ("//p/img/\#src");
chomp ($descrizione);
print INFO ("$codice\;$descrizione\;$prezzo\;$img\n");
$sel->go_back_ok();
# Close file
close (INFO);
but the output is:
Art. S500 Set Yoga "Siddhartha";Idea regalo ?SET YOGA Siddhartha? Elegante scatola in cartone lucido contenente:
2 mattoni in legno naturale mis. cm 20 x 12,5 x 7
1 cinghia in cotone mis. cm 4 x 235
1 stuoia in cotone mis. cm 70 x 170
1 manuale di introduzione allo yoga stampato
Tutto rigorosamente realizzato con materiali natural;€ 82,50;../images/S500%20(Custom).jpg
chomp removes the platform specific end-of-line character sequence from the end of a string or a set of strings.
In your case, you seem to have a single string with embedded newlines and/or carriage returns. Hence, you probably want to replace any sequence of possible line endings with something else, let's say a single space character. In that case, you'd do:
$descrizione =~ s/[\r\n]+/ /g;
If you want to replace all vertical whitespace, Perl has a special character class shortcut for that:
use v5.10;
$descrizione =~ s/\v+/ /g;
Use this to remove \r as well.
$descrizione =~ s#[\r\n]+\z##;
regards,
I'm a perl beginner, please help me out with my query... I'm trying to extract information from a blast table (a snippet of what it looks like is below):
It's a standard blast table input... I basically want to extract any information on a list of reads (Look at my second script below , to get an idea of what I want to do).... Anyhow this is precisely what I've done in the second script:
INPUTS:
1) the blast table:
38.1 0.53 59544 GH8NFLV01A02ED GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345 1 YP_003242370 Dynamin family protein [Paenibacillus sp. Y412MC10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590
34.3 7.5 123828 GH8NFLV01A03QJ GH8NFLV01A03QJ rank=0239249 x=305.0 y=1945.5 length=452 1 XP_002639994 Hypothetical protein CBG10824 [Caenorhabditis briggsae] 3 0 52.1739130434783 32.6086956521739 46 452 367 10.1769911504425 12.5340599455041 111 248 79 124
37.7 0.70 62716 GH8NFLV01A09B8 GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306
37.7 0.98 33114 GH8NFLV01A0H5C GH8NFLV01A0H5C rank=0066011 x=298.0 y=2638.5 length=573 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] -3 0 73.5294117647059 52.9411764705882 34 573 703 5.93368237347295 4.83641536273115 131 232 273 306
103 1e-020 65742 GH8NFLV01A0MXI GH8NFLV01A0MXI rank=0124865 x=300.5 y=644.0 length=475 1 ABZ08973 hypothetical protein ALOHA_HF4000APKG6B14ctg1g18 [uncultured marine crenarchaeote HF4000_APKG6B14] 2 0 77.9411764705882 77.9411764705882 68 475 151 14.3157894736842 45.0331125827815 2 205 1 68
41.6 0.053 36083 GH8NFLV01A0QKX GH8NFLV01A0QKX rank=0071366 x=301.0 y=1279.0 length=526 1 XP_766153 hypothetical protein [Theileria parva strain Muguga] -1 0 66.6666666666667 56.6666666666667 30 526 304 5.70342205323194 9.86842105263158 392 481 31 60
45.4 0.003 78246 GH8NFLV01A0Z29 GH8NFLV01A0Z29 rank=0148293 x=304.0 y=1315.0 length=432 1 ZP_04111769 hypothetical protein bthur0007_56280 [Bacillus thuringiensis serovar monterrey BGSC 4AJ1] 3 0 51.8518518518518 38.8888888888889 54 432 193 12.5 27.979274611399 48 209 97 150
71.6 4e-011 97250 GH8NFLV01A14MR GH8NFLV01A14MR rank=0184885 x=317.5 y=609.5 length=314 1 ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244] 1 0 92.5 92.5 40 314 311 12.7388535031847 12.8617363344051 193 312 13 52
58.2 5e-007 154555 GH8NFLV01A1KCH GH8NFLV01A1KCH rank=0309994 x=310.0 y=2991.0 length=267 1 ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244] 1 0 82.051282051282 82.051282051282 39 267 311 14.6067415730337 12.540192926045 142 258 1 39
2) The reads list:
GH8NFLV01A09B8
GH8NFLV01A02ED
etc
etc
3) the output I want:
37.7 0.70 62716 GH8NFLV01A09B8 GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306
38.1 0.53 59544 GH8NFLV01A02ED GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345 1 YP_003242370 Dynamin family protein [Paenibacillus sp. Y412MC10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590
I want a subset of the information in the first list, given a list of read names I want to extract (that is found in the 4th column)
Instead of hashing the reads list (only?) I want to hash the blast table itself, and use the information in Column 4 (of the blast table)as the keys to extract the values of each key, even when that key may have more than one value(i.e: each read name might actually have more than one hit , or associated blast result in the table), keeping in mind, that the value includes the WHOLE row with that key(readname) in it.
My greplist.pl script does this, but is very very slow, I think , ( and correct me if i'm wrong) that by loading the whole table in a hash, that this should speed things up tremendously ...
Thank you for your help.
My scripts:
The Broken one (mambo5.pl)
#!/usr/bin/perl -w
# purpose: extract blastX data from a list of readnames
use strict;
open (DATA,$ARGV[0]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
open (LIST,$ARGV[1]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
my %hash = <DATA>;
close (DATA);
my $filename=$ARGV[0];
open(OUT, "> $filename.bololom");
my $readName;
while ( <LIST> )
{
#########;
if(/^(.*?)$/)#
{
$readName=$1;#
chomp $readName;
if (exists $hash{$readName})
{
print "bingo!";
my $output =$hash{$readName};
print OUT "$output\n";
}
else
{
print "it aint workin\n";
#print %hash;
}
}
}
close (LIST);
The Slow and quick cheat (that works) and is very slow (my blast tables can be about 400MB to 2GB large, I'm sure you can see why it's so slow)
#!/usr/bin/perl -w
##
# This script finds a list of names in a blast table and outputs the result in a new file
# name must exist and list must be correctly formatted
# will not output anything using a "normal" blast file, must be a table blast
# if you have the standard blast output use blast2table script
use strict;
my $filein=$ARGV[0] or die ("usage: ./listgrep.pl readslist blast_table\n");
my $db=$ARGV[1] or die ("usage: ./listgrep.pl readslist blast_table\n");
#open the reads you want to grep
my $read;
my $line;
open(READSLIST,$filein);
while($line=<READSLIST>)
{
if ($line=~/^(.*)$/)
{
$read = $1;
print "$read\n";
system("grep \"$read\" $db >$read\_.out\n");
}
#system("grep $read $db >$read\_.out\n");
}
system("cat *\_.out >$filein\_greps.txt\n");
system("rm *.out\n");
I don't know how to define that 4th column as the key : maybe I could use the split function, but I've tried to find a way that does this for a table of more than 2 columns to no avail... Please help!
If there is an easy way out of this please let me know
Thanks !
I'd do the opposite i.e read the readslist file into a hash then walk thru the big blast file and print the desired lines.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
# Read the readslist file into a hash
open my $fh, '<', 'readslist' or die "Can't open 'readslist' for reading:$!";
my %readslist = map { chomp; $_ => 1 }<$fh>;
close $fh;
open my $fh_blast, '<', 'blastfile' or die "Can't open 'blastfile' for reading:$!";
# loop on all the blastfile lines
while (<$fh_blast>) {
chomp;
# retrieve the key (4th column)
my ($key) = (split/\s+/)[3];
# print the line if the key exists in the hash
say $_ if exists $readslist{$key};
}
close $fh_blast;
I suggest you build an index to turn your blasts file temporarily into an indexed-sequential file. Read through it and build a hash of addresses within the file where every record for each key starts.
After that it is just a matter of seeking to the correct places in the file to pick up the records required. This will certainly be faster than most simple solutions, as it entails read the big file only once. This example code demonstrates.
use strict;
use warnings;
use Fcntl qw/SEEK_SET/;
my %index;
open my $blast, '<', 'blast.txt' or die $!;
until (eof $blast) {
my $place = tell $blast;
my $line = <$blast>;
my $key = (split ' ', $line, 5)[3];
push #{$index{$key}}, $place;
}
open my $reads, '<', 'reads.txt' or die $!;
while (<$reads>) {
next unless my ($key) = /(\S+)/;
next unless my $places = $index{$key};
foreach my $place (#$places) {
seek $blast, $place, SEEK_SET;
my $line = <$blast>;
print $line;
}
}
Voila, 2 ways of doing this, one with nothing to do with perl :
awk 'BEGIN {while ( i = getline < "reads_list") ar[$i] = $1;} {if ($4 in ar) print $0;}' blast_table > new_blast_table
Mambo6.pl
#!/usr/bin/perl -w
# purpose: extract blastX data from a list of readnames. HINT: Make sure your list file only has unique names , that way you save time.
use strict;
open (DATA,$ARGV[0]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
open (LIST,$ARGV[1]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
my %hash;
my $val;
my $key;
while (<DATA>)
{
#chomp;
if(/((.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?))$/)
{
#print "$1\n";
$key= $5;#read
$val= $1;#whole row; notice the brackets around the whole match.
$hash{$key} .= exists $hash{$key} ? "$val\n" : $val;
}
else {
print "something wrong with format";
}
}
close (DATA);
open(OUT, "> $ARGV[1]\_out\.txt");
my $readName;
while ( <LIST> )
{
#########;
if(/^(.*?)$/)#
{
$readName=$1;#
chomp $readName;
if (exists $hash{$readName})
{
print "$readName\n";
my $output =$hash{$readName};
print OUT "$output";
}
else
{
#print "it aint workin\n";
}
}
}
close (LIST);
close (OUT);
The oneliner is faster, and probably better than my script, I'm sure some people can find easier ways to do it... I just thought I'd put this up since it does what I want.
I'm running Windows XP, Eclipse 3.2 with EPIC and Cygwin for my Perl interpreter, and I get an unexpected result.
FYI... When I run it on my Ubuntu distribution (VMware, same pc) I get the expected results. Why?
############ CODE: #############
use warnings;
use strict;
my $test = "test";
my $input = <STDIN>;
print length $test, " ", length $input, "\n";
chomp $input;
print "|$test| |$input| \n"; #The bars indicate white space, new line, etc...
print length $test, " ", length $input, "\n";
if ($test eq $input) {
print "TIME TO QUIT";
}
Results on Windows XP:
test <-- My input
4 6 <-- Lengths printed before chomp
|test| |test <-- Print the variables after chomp
| <-- There is still a new line there
4 5 <-- Lengths after the initial chomp
Given that Windows XP figures in the problem, the difference must be due to CRLF (carriage return, line feed) handling. The chomp removes, it appears, the LF but not the CR; the print translates the CR into CR LF.
The Perl doc for chomp says that if you set the EOL correctly for Windows ($/ = "\r\n";), then chomp should do its stuff correctly:
$/ = "\r\n";
$test = "test\r\n";
print "<<$test>>\n";
chomp $test;
print "<<$test>>\n";
A hex dump of the output of that yields:
0x0000: 3C 3C 74 65 73 74 0D 0A 3E 3E 0A 3C 3C 74 65 73 <<test..>>.<<tes
0x0010: 74 3E 3E 0A t>>.
0x0014:
I'm not sure why $/ is not set automatically - it may be Cygwin confusing things (pretending too successfully it is running on Unix).
Based on the lengths, I'd say you're getting the input string as:
test<cr><lf>
where <cr> and <lf> are ASCII codes 0x13 and 0x10 respectively.
When you chomp it, it removes the <lf> but leaves the <cr> there.
It's almost certainly an interaction issue between Eclipse, Cygwin and Windows, disagreeing on what the end-of-line character sequence should be. I couldn't replicate your problem with just Perl/Cygwin or Perl/Windows but this command gives similar results (in Cygwin):
echo 'test^M' | perl qq.pl | sed 's/^M/\n/g'
(qq.pl is your script and "^M" is the actual CTRL-M). Here's the output in text form:
4 6
|test| |test
|
4 5
and octal dump:
0000000 2034 0a36 747c 7365 7c74 7c20 6574 7473
4 6 \n | t e s t | | t e s t
064 040 066 012 174 164 145 163 164 174 040 174 164 145 163 164
0000020 7c0a 340a 3520 000a
\n | \n 4 5 \n \0
012 174 012 064 040 065 012 000
0000027
So I'd say that your input is putting on both <cr> and <lf>, and the print is translating <cr> to <lf> (or just doing the same thing for both of them).
If you need a workaround for your environment, you can replace your chomp line with:
$input =~ s/\r?\n$//;
as in:
use warnings;
use strict;
my $test = "test";
my $input = <STDIN>;
print length $test ," ",length $input,"\n";
$input =~ s/\r?\n$//;
print "|$test| |$input|\n";
print length $test," ",length $input,"\n";
if ($test eq $input) {
print "TIME TO QUIT";
}
which works on Cygwin for the test data I used (check it for your own situation, of course), but you may find you can solve it better by using tools that all agree on the line end sequence (eg, Perl for Windows rather than the Cygwin one may do the trick for you).
Here is how to remove a trailing \r\n or \n (whichever is at the end):
$input =~ s#\r?\n\Z(?!\n)##;
Another option is to do a
binmode(STDIN, ':crlf')
before reading anything from STDIN. This would convert trailing \r\n to just a \n, which you can remove using chomp. This will also work even if your input contains only \n. See the documentation about PerlIO for more.