I have the following file:
#TWEETY:150:000000000-ACFKE:1:2104:27858:17965
AAATTAGCAAAAAACAATAACAAAACTGGGAAAATGCAATTTAACAACGAAAATTTTCCGAGAACTTGAAAGCGTACGAAAACGATACGCTCC
+
D1FFFB11FDG00EE0FFFA1110FAA1F/ABA0FGHEGDFEEFGDBGGGGFEHBFDDG/FE/EGH1#GF#F0AEEEEFHGGFEFFCEC/>EE
#TWEETY:150:000000000-ACFKE:1:1105:22044:20029
AAAAAATATTAAAACTACGAATGCATAAATTATTTCGTTCGAAATAAACTCACACTCGTAACATTGAACTACGCGCTCC
+
CCFDDDFGGGGGGGGGGHGGHHHHGHHHHHHHHHHHHHHHGHHGHHHHHHHHHHHHHGHGHGGHHHHHHGHHEGGGGGG
#TWEETY:150:000000000-ACFKE:1:2113:14793:7182
TATATAAAGCGAGAGTAGAAACTTTTTAATTGACGCGGCGAGAAAGTATATAGCAACAAGCGAGCACCCGCTCC
+
BBFFFFFGGGGFFGGFGHHHHHHHHHHHHHHHHHGGAEEEAFGGGHHFEGHHGHHHHHGHHGGGGFHHGG?EEG
#TWEETY:150:000000000-ACFKE:1:2109:5013:22093
AAAAAAATAATTCATATCGCCATATCGACTGACAGATAATCTATCTATAATCATAACTTTTCCCTCGCTCC
+
DAFAADDGF1EAGG3EG3A00ECGDFFAEGFCHHCAGHBGEAGBFDEDGGHBGHGFGHHFHHHBDG?/FA/
#TWEETY:150:000000000-ACFKE:1:2106:25318:19875

+
CCCCCCCCCCCCGGGGGGGGGGGGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
The lines are in groups of four (each time there is a name, starting with #TWEETY, a string of letters, a + character, and another string of letters).
The second and fourth lines should have the same number of characters.
But there are cases where the second line is empty, as in the last four lines.
In these cases, I would like to get rid of the whole block (the name line before the empty line, the empty line itself, and the two lines after it).
I have just started perl and have been trying to write a script for my problem, but am having a hard time. Does anyone have some feedback?
Thanks!
Keep an array buffer of the last four lines. When it's full, check the second line, print the lines or not, empty the buffer, repeat.
#!/usr/bin/perl
use warnings;
use strict;
my @buffer;

sub output {
    print @buffer unless 1 == length $buffer[1];
    @buffer = ();
}

while (<>) {
    if (4 == @buffer) {
        output();
    }
    push @buffer, $_;
}
output(); # Don't forget to process the last four lines.
Yes. Start by looking at $/ and set it so you can work on a chunk at a time. I would suggest treating # as the record separator in your example.
Then iterate your records using a while loop. E.g. while ( <> ) {
Use split on \n to turn the current chunk into an array of lines.
Perform your test on the appropriate lines, and either print - or not - depending on whether it passed.
If you get stuck with that, then I'm sure a specific question including your code and where you're having problems will be well received here.
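For what it's worth, a rough sketch of that outline might look like the code below (untested against your real data; it assumes every record starts with #TWEETY, as in your sample):
#!/usr/bin/perl
use strict;
use warnings;

local $/ = '#TWEETY';    # read one whole record (name, sequence, +, quality) at a time

while ( my $record = <> ) {
    chomp $record;                      # drop the trailing separator
    next unless $record =~ /\S/;        # the chunk before the first separator is empty

    my @lines = split /\n/, $record;    # $lines[0] is the rest of the name line, $lines[1] the sequence
    next unless defined $lines[1] && length $lines[1];    # skip blocks with an empty second line

    print "$/$record";                  # put the separator back on the front and print
}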
If you chunk the data correctly, this becomes almost trivial.
#!/usr/bin/perl
use strict;
use warnings;
# Use '#TWEETY' as the record separator to make it
# easy to chunk the data.
local $/ = '#TWEETY';
while (<DATA>) {
    chomp;    # remove the trailing '#TWEETY' separator from the record

    # The first entry will be empty (as the separator
    # is the first thing in the file). Skip that record.
    next unless /\S/;

    # Skip any records with two consecutive newlines
    # (as they will be the ones with the empty line 2)
    next if /\n\n/;

    # Print the remaining records
    # (with $/ stuck back on the front)
    print "$/$_";
}
__DATA__
#TWEETY:150:000000000-ACFKE:1:2104:27858:17965
AAATTAGCAAAAAACAATAACAAAACTGGGAAAATGCAATTTAACAACGAAAATTTTCCGAGAACTTGAAAGCGTACGAAAACGATACGCTCC
+
D1FFFB11FDG00EE0FFFA1110FAA1F/ABA0FGHEGDFEEFGDBGGGGFEHBFDDG/FE/EGH1#GF#F0AEEEEFHGGFEFFCEC/>EE
#TWEETY:150:000000000-ACFKE:1:1105:22044:20029
AAAAAATATTAAAACTACGAATGCATAAATTATTTCGTTCGAAATAAACTCACACTCGTAACATTGAACTACGCGCTCC
+
CCFDDDFGGGGGGGGGGHGGHHHHGHHHHHHHHHHHHHHHGHHGHHHHHHHHHHHHHGHGHGGHHHHHHGHHEGGGGGG
#TWEETY:150:000000000-ACFKE:1:2113:14793:7182
TATATAAAGCGAGAGTAGAAACTTTTTAATTGACGCGGCGAGAAAGTATATAGCAACAAGCGAGCACCCGCTCC
+
BBFFFFFGGGGFFGGFGHHHHHHHHHHHHHHHHHGGAEEEAFGGGHHFEGHHGHHHHHGHHGGGGFHHGG?EEG
#TWEETY:150:000000000-ACFKE:1:2109:5013:22093
AAAAAAATAATTCATATCGCCATATCGACTGACAGATAATCTATCTATAATCATAACTTTTCCCTCGCTCC
+
DAFAADDGF1EAGG3EG3A00ECGDFFAEGFCHHCAGHBGEAGBFDEDGGHBGHGFGHHFHHHBDG?/FA/
#TWEETY:150:000000000-ACFKE:1:2106:25318:19875

+
CCCCCCCCCCCCGGGGGGGGGGGGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
Thanks everyone for the feedback!
It was all really useful. Thanks to your suggestions, I explored all the options and learned the unless statement.
The easiest solution, given my existing code, was just to add an unless statement at the end.
### Write to output, but remove non-desired Gs
open OUT, ">$outfile";
my @accorder = @{$store0{"accorder"}};

foreach my $acc (@accorder) {
    # retrieve seq (2nd line) and qual (4th line)
    my $seq  = $store0{$acc}{"seq"};
    my $qual = $store0{$acc}{"qual"};

    # clean out polyG at end
    $seq =~ s/G{3,}.{0,1}$//;
    my $lenseq    = length($seq);
    my $lenqual   = length($qual);
    my $startqual = $lenqual - $lenseq;
    $qual = substr($qual, 0, $lenseq);
    # the above was in order to remove multiple G characters at the end of the
    # second line, which is what led to empty lines (lines that were made up of
    # only Gs got cut out)

    # print to output, unless sequence has become empty
    unless ($lenseq == 0) {    # this is the unless statement I added
        print OUT "\#$acc\n$seq\n+\n$qual\n";
    }
}
close(OUT);
Being a Perl newbie, please pardon me for asking this basic question.
I have a text file on server1 that contains a bunch of sentences (whitespace is the field separator) on many lines.
I needed to match lines with my keyword, remove the same lines, and extract only the last field, so I have tried with:
my @allmatchedlines;
open(output1, "ssh user1@server1 cat /tmp/myfile.txt |");
while (<output1>) {
    chomp;
    @allmatchedlines = $_ if /mysearch/;
}
close(output1);

my @uniqmatchedline = split(/ /, @allmatchedlines);
my $lastfield = "$uniqmatchedline[-1]\n";
print "$lastfield\n";
and it gives me the output showing:
1
I don't know why it's giving me just "1".
Could someone please explain why I'm getting "1" and how I can get the last field of the matched line correctly?
Thank you!
my @uniqmatchedline = split(/ /, @allmatchedlines);
You're getting "1" because split takes a scalar, not an array. An array in scalar context returns the number of elements.
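A tiny illustration of what happens (with a made-up matched line):
my @allmatchedlines = ('some matched sentence here');    # only one line survived the loop
my @uniqmatchedline = split(/ /, @allmatchedlines);
# @allmatchedlines is evaluated in scalar context, giving 1,
# so this is really split(/ /, "1") and @uniqmatchedline ends up as ('1')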
You need to split on each individual line. Something like this:
my @uniqmatchedline = map { split(/ /, $_) } @allmatchedlines;
There are two issues with your code:
split is expecting a scalar value (string) to split on; if you are passing an array, it will convert the array to scalar (which is just the array length)
You did not have a way to remove same lines
To address these, the following code should work (not tested as no data):
my @allmatchedlines;
open(output1, "ssh user1\@server1 cat /tmp/myfile.txt |");
while (<output1>) {
    chomp;
    push @allmatchedlines, $_ if /mysearch/;
}
close(output1);

my %existing;
my @uniqmatchedline = grep !$existing{$_}++, @allmatchedlines;    # this will return the unique lines
my @lastfields = map { ((split / /, $_)[-1]) . "\n" } @uniqmatchedline;    # this maps the last field in each line into an array
print for @lastfields;
Apart from two errors in the code, I find the statement "remove the same lines and extract only the last field" unclear. Once duplicate matching lines are removed, there may still be multiple distinct sentences with the pattern.
Until a clarification comes, here is code that picks the last field from the last such sentence.
use warnings 'all';
use strict;
use List::MoreUtils qw(uniq);

my $file = '/tmp/myfile.txt';
my $cmd  = "ssh user1\@server1 cat $file";

my @allmatchedlines;

open my $fh, '-|', $cmd or die "Error opening $cmd: $!";

while (<$fh>) {
    chomp;
    push @allmatchedlines, $_ if /mysearch/;
}
close $fh;

my @unique_matched_lines = uniq @allmatchedlines;

my $lastfield = ( split ' ', $unique_matched_lines[-1] )[-1];
print $lastfield, "\n";
I changed to the three-argument open, with error checking. Recall that open for a process involves a fork and returns the pid, so an "error" doesn't at all relate to what happened with the command itself. See open. Also note that @ inside "..." indicates an array and thus needs to be escaped.
The (default) pattern ' ' used in split splits on any amount of whitespace. The regex / / turns off this behavior and splits on a single space. You most likely want to use ' '.
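A quick illustration of the difference:
my $str = "  one  two three";
my @a = split ' ', $str;    # ('one', 'two', 'three')             -- leading and repeated whitespace skipped
my @b = split / /, $str;    # ('', '', 'one', '', 'two', 'three') -- every single space is a delimiter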
For more comments please see the original post below.
The statement @allmatchedlines = $_ if /mysearch/; on every iteration assigns to the array, overwriting whatever has been in it. So you end up with only the last line that matched mysearch. You want push @allmatchedlines, $_ ... to get all those lines.
Also, as shown in the answer by Justin Schell, split needs a scalar so it is taking the length of #allmatchedlines – which is 1 as explained above. You should have
my @words_in_matched_lines = map { split } @allmatchedlines;
When all this is straightened out, you'll have words in the array @uniqmatchedline and if that is the intention then its name is misleading.
To get unique elements of the array you can use the module List::MoreUtils
use List::MoreUtils qw(uniq);
my @unique_elems = uniq @whole_array;
I have happened upon a problem with a program that parses through a CSV file with a few million records: two fields in each line have comments that users have put in, and sometimes they use commas within their comments. If commas were entered, that field is wrapped in double quotes. I need to replace any commas found in those fields with a space. Here is one such line from the file to give you an idea -
1925,47365,2,650187016,1,1,"MADE FOR DRAWDOWNS, NEVER P/U",16,IFC 8112NP,Standalone-6,,,44,10/22/2015,91607,,B24W02651,,"PA-3, PURE",4/28/2015,1,0,,1,MAN,,CUST,,CUSTOM MATCH,0,TRUE,TRUE,O,C48A0D001EF449E3AB97F0B98C811B1B,POS.MISTINT.V0000.UP.Q,PROD_SMISA_BK,414D512050524F445F504F5331393235906F28561D2F0020,10/22/2015 9:29,10/22/2015 9:30
NOTE - I do not have the Text::CSV module available to me, nor will it be made available on the server I am using.
Here is part of my code for parsing this file. The first thing I do is concatenate the very first three fields and prepend that concatenated field to each line. Then I want to clear out the commas in @fields[7,19], then format the DATE in three fields and the DATETIME in two fields. The only line I can't figure out is clearing out those commas -
my @data;

# Read the lines one by one.
while ( $line = <$FH> ) {
    # split the fields, concatenate the first three fields,
    # and add it to the beginning of each line in the file
    chomp($line);
    my @fields = split(/,/, $line);
    unshift @fields, join '_', @fields[0..2];

    # remove user input commas in @fields[7,19]
    $_ = for @fields[7,19];

    # format DATE and DATETIME fields for MySQL/sqlbatch60
    $_ = join '-', (split /\//)[2,0,1] for @fields[14,20,23];
    $_ = Time::Piece->strptime($_,'%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M') for @fields[38,39];

    # write the parsed record back to the file
    push @data, \@fields;
}
If it is ONLY the eighth field that is troubling AND you know exactly how many fields there should be, you can do it this way
Suppose the total number of fields is always N
Split the line on commas ,
Separate and store the first seven fields
Separate and store the last n fields, where n is N-8
Rejoin what remains with commas ,. This now forms field 8
and then do whatever you like with it; for example, write it out to a proper CSV file (a rough sketch of these steps follows below)
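For illustration, here is a rough sketch of those steps (not from the answer itself; the total field count N is an arbitrary placeholder, the troublesome field is taken to be the eighth as described above, and it assumes that only this one field can contain embedded commas):
use strict;
use warnings;

my $N = 40;    # assumed total number of fields per record -- adjust to the real data

while ( my $line = <> ) {
    chomp $line;
    my @parts = split /,/, $line;    # embedded commas produce extra pieces

    my @first  = @parts[ 0 .. 6 ];              # fields 1-7
    my @last   = @parts[ -($N - 8) .. -1 ];     # fields 9-N
    my $field8 = join ',', @parts[ 7 .. $#parts - ($N - 8) ];    # whatever is left is field 8

    $field8 =~ tr/,/ /;    # replace the embedded commas with spaces, as required
    print join(',', @first, $field8, @last), "\n";
}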
Text::CSV_XS handles quoted commas just fine:
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS qw{ csv };
my $aoa = csv(in => 'file.csv'); # The file contains the sample line.
print $aoa->[0][6];
Note The two main versions below clean up one field. The most recent change in the question states that there are, in fact, two such fields. The third version, at the end, works with any number of bad fields.
All code has been tested with the supplied example and its variations.
Following clarifications, this deals with the case when the file needs to be processed by hand. A module is easily recommended for parsing .csv, but there is a problem here: reliance on the user to enter double quotes. If they end up not being there, we have a malformed file.
I take it that the number of fields in the file is known with certainty and ahead of time.
The two independent solutions below use either array or string processing.
(1) The file is being processed line by line anyway, the line being split already. If there are more fields than expected, join the extra array elements by space and then overwrite the array with correct fields. This is similar to what is outlined in the answer by vanHoesel.
use strict;
use warnings;
my $num_fields = 39;    # what should be, using the example
my $ibad       = 6;     # index of the malformed field-to-be
my @last = (-($num_fields-$ibad-1) .. -1);    # index-range, rest of fields

my $file = "file.csv";
open my $fh, '<', $file;

while (my $line = <$fh>) {    # chomp it if needed
    my @fields = split ',', $line;
    if (@fields != $num_fields) {
        # join extra elements by space
        my $fixed = join ' ', @fields[$ibad .. $ibad + @fields - $num_fields];
        # overwrite array by good fields
        @fields = (@fields[0 .. $ibad-1], $fixed, @fields[@last]);
    }
    # Process @fields normally
    print "@fields";
}
close $fh;
(2) Preprocess the file, only checking for malformed lines and fixing them as needed. Uses string manipulations. (Or, the method above can be used.) The $num_fields and $ibad are the same.
while (my $line = <$fh>) {
    # Number of fields: commas + 1 (tr|,|| counts number of ",")
    my $have_fields = $line =~ tr|,|| + 1;

    if ($have_fields != $num_fields) {
        # Get indices of the commas delimiting the bad field
        my ($beg, $end) = map {
            my $p = '[^,]*,' x $_;
            $line =~ /^$p/ and $+[0] - 1;
        } ($ibad, $ibad + 1 + $have_fields - $num_fields);

        # Replace extra commas and overwrite that part of the string
        my $bad_field = substr($line, $beg+1, $end-$beg-1);
        (my $fixed = $bad_field) =~ tr/,/ /;
        substr($line, $beg+1, $end-$beg-1) = $fixed;
    }

    # Perhaps write the line out, for a corrected .csv file
    print $line;
}
In the last line the bad part of $line is overwritten by assigning to substr, which this function allows. The new substring $fixed is constructed with commas changed (or removed, if desired), and used to overwrite the bad part of $line. See docs.
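For example, lvalue substr on its own behaves like this (a standalone illustration, not part of the answer's code):
my $s = "field1,BAD,PART,field3";
substr($s, 7, 8) = "B P";    # overwrite the 8 characters starting at offset 7
print "$s\n";                # prints: field1,B P,field3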
If quotes are known to be there a regex can be used. This works with any number of bad fields.
while (my $line = <$fh>) {
$line =~ s/."([^"]+)"/join ' ', split(',', $1)/eg; # "
# process the line. note that double quotes are removed
}
If the quotes are to be kept, move them inside the parentheses so that they are captured as well.
This one line is all that needs to be done after while (...) { to clean up the data.
The /e modifier makes the replacement side be evaluated as code, instead of being used as a double-quoted string. There the matched part of the line (between ") is split by comma and then joined by space, thus fixing the field. See the last item under "Search and replace" in perlretut.
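For instance, the quote-keeping variant mentioned above might look like this (an untested one-line sketch):
$line =~ s/("[^"]+")/join ' ', split(',', $1)/eg;    # the quotes are inside the capture, so they survive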
All code has been tested with multiple lines and multiple commas in the bad field.
I am dealing with very large amounts of data. Every now and then there is a slip-up. I want to identify each row with an error, under a condition of my choice. Along with that, I want the line number and the contents of each erroneous row. I will be running this script on a handful of files and I want to write the report to a single output file.
So here is my example data:
File_source,ID,Name,Number,Date,Last_name
1.csv,1,Jim,9876,2014-08-14,Johnson
1.csv,2,Jim,9876,2014-08-14,smith
1.csv,3,Jim,9876,2014-08-14,williams
1.csv,4,Jim,9876,not_a_date,jones
1.csv,5,Jim,9876,2014-08-14,dean
1.csv,6,Jim,9876,2014-08-14,Ruzyck
Desired output:
Row#5,4.csv,4,Jim,9876,not_a_date,jones (this is an erroneous row)
The condition I have chosen is print to output if anything in the date field is not a date.
As you can see, my desired output contains the line number where the error occurred, along with the data itself.
After I have my output that shows the lines within each file that are in error, I want to grab those lines from the untouched original CSV file to redo them (both the modified and original files contain the same number of rows). After I have a file of these redone rows, I can omit and clean up where needed to prevent interruption of an import.
Folder structure will contain:
Modified: 4.txt
Original: 4.csv
I have something started here, written in Perl, which by the logic will at least return the rows I need. However I believe my syntax is a little off and I do not know how to plug in the other subroutines.
Code:
$count = 1;
while (<>) {
unless ($F[4] =~ /\d+[-]\d+[-]\d+/)
print "Row#" . $count++ . "," . "$_";
}
The code above is supposed to give me my erroneous rows, but to be able to extract them from the originals is beyond me. The above code also contains some syntax errors.
This will do as you ask.
Please be certain that none of the fields in the data can ever contain a comma; otherwise you will need to use Text::CSV to process it instead of a simple split (a Text::CSV sketch is shown after the output below).
use strict;
use warnings;
use 5.010;
use autodie;
open my $fh, '<', 'example.csv';
<$fh>; # Skip header
while (<$fh>) {
    my @fields = split /,/;
    if ( $fields[4] !~ /^\d{4}-\d{2}-\d{2}$/ ) {
        print "Row#$.,$_";
    }
}
output
Row#5,4.csv,4,Jim,9876,not_a_date,jones
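As noted at the top of this answer, if a field could after all contain a comma, a Text::CSV-based variant of the same loop might look something like this (a sketch; it assumes the module is installed and is not part of the original answer):
use strict;
use warnings;
use autodie;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });

open my $fh, '<', 'example.csv';
<$fh>;    # Skip header

while ( my $row = $csv->getline($fh) ) {
    unless ( $row->[4] =~ /^\d{4}-\d{2}-\d{2}$/ ) {
        print join(',', "Row#$.", @$row), "\n";
    }
}
close $fh;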
Update
If you want to process a number of files then you need this instead.
The close ARGV at the end of the loop is there so that the line counter $. is reset to 1 at the start of each file. Without it, it just keeps counting upwards across all the files.
You would run this like
rob#Samurai-U:~$ perl findbad.pl *.csv
or you could list the files individually, separated by spaces.
For the test I have created files 1.csv and 2.csv which are identical to your example data except that the first field of each line is the name of the file containing the data.
You may not want the line in the output that announces each file name, in which case you should replace the entire first if block with just next if $. == 1.
use strict;
use warnings;
@ARGV = map { glob qq{"$_"} } @ARGV; # For Windows
while (<>) {
    if ($. == 1) {
        print "\n\nFile: $ARGV\n\n";
        next;
    }

    my @fields = split /,/;
    unless ( $fields[4] =~ /^\d{4}-\d{2}-\d{2}$/ ) {
        printf "Row#%d,%s", $., $_;
    }

    close ARGV if eof ARGV;
}
output
File: 1.csv
Row#5,1.csv,4,Jim,9876,not_a_date,jones
File: 2.csv
Row#5,2.csv,4,Jim,9876,not_a_date,jones
database.Win.txt is a file whose number of lines is a multiple of 3. The second of every three lines is a number. The code is supposed to print out the three lines (in a new order) on one line, separated by tabs, but only if the second line is 1.
Am I, by this code, actually getting the loop to create an array with three lines of database.Win.txt each time it runs through the loop? That's my goal, but I suspect this isn't what the code does, since I get an error saying that the int() function expects a numeric value, and doesn't find one.
while(<database.Win.txt>){
$new_entry[0] = <database.Win.txt>;
$new_entry[1] = <database.Win.txt>;
$new_entry[2] = <database.Win.txt>;
if(int($new_entry[1]) == 1) {
chomp($new_entry);
print "$new_entry[1], \t $new_entry[2], \t $new_entry[0], \n"
}
}
I am a total beginner with Perl. Please explain as simply as possible!
I think you've got a good start on the solution. However, your while condition reads (and throws away) one line right before the next three lines are read (if those were real <$filehandle> reads). int isn't necessary, but chomp is needed before you check the value of $new_entry[1]; otherwise there's still a record separator at the end.
Given this, consider the following:
use strict;
use warnings;
my @entries;

open my $fh, '<', 'database.Win.txt' or die $!;

while (1) {
    last if eof $fh;
    chomp( $entries[$_] = <$fh> ) for 0 .. 2;

    if ( $entries[1] == 1 ) {
        print +( join "\t", @entries ), "\n";
    }
}
close $fh;
Always start with use strict; use warnings. Next, open the file using the three-argument form of open. A while (1) is used here, so three lines at a time can be read within the while loop. Since it's an 'infinite' while loop, the last if eof $fh; gives a way out, viz., if the next file read produces an end of file, it's the last. Right below that is a for loop that effectively does what you did: assign a file line to an array position. Note that chomp is used to remove the record separator during the assignment. The last part is also similar to yours, as it checks whether the second of the three lines is 1, and then the line is printed if it is.
Hope this helps!
I'm trying to take a file INPUT and, if a line in that file contains a string, replace the line with something else (the entire line, including line breaks) or nothing at all (remove the line as if it wasn't there), writing all of this to a new file.
Here's that section of code...
while (<INPUT>) {
    if ($_ =~ / <openTag>/) {
        chomp;
        print OUTPUT "Some_Replacement_String";
    } elsif ($_ =~ / <\/closeTag>/) {
        chomp;
        print OUTPUT "";    # remove the line
    } else {
        chomp;
        print OUTPUT "$_\r\n";    # print the original line
    }
}
while(<INPUT>) should read one line at a time (if my understanding is correct) and store each line in the special variable $_
However, when I run the above code I get only the output of the very first if branch, Some_Replacement_String, and only once (1 line, out of a file with 1.3m lines, where I expect 600,000 replacements). This obviously isn't the behavior I expect. If I do something like while(<INPUT>){ print OUTPUT $_; } I get a copy of the entire file, every line, so I know the entire file is being read (expected behavior).
What I'm trying to do is get a line, test it, do something with it, and move on to the next one.
If it helps with troubleshooting at all, if I use print $.; anywhere in that while statement (or after it), I get 1 returned. I expected this to be the "Current line number for the last filehandle accessed.". So by the time my while statement loops through the entire file, it should be equal to the number of lines in the file, not 1.
I've tried a few other variations of this code, but I think this is the closest I've come. I assume there's a good reason I'm not getting the behavior I expect, can anyone tell me what it is?
The problem you are describing indicates that your input file only contains one line. This may be because of a great many different things, such as the following (a small diagnostic sketch follows the list):
You have changed the input record separator $/
Your input file does not contain the correct line endings
You are running your script with the -0777 switch
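As a quick diagnostic (a hypothetical sketch; input.txt is a placeholder name), you could count how many records Perl actually sees:
use strict;
use warnings;

open my $in, '<', 'input.txt' or die "Cannot open input.txt: $!";    # placeholder file name
my $first = <$in>;
printf "first record is %d characters long\n", length($first // '');
my $count = defined $first ? 1 : 0;
$count++ while <$in>;
print "records read: $count\n";
close $in;
# A single huge record strongly suggests a line-ending or $/ problem.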
Some notes on your code:
if ($_ =~ / <openTag>/){
chomp;
print OUTPUT "Some_Replacement_String";
No need to chomp a line you are not using.
} elsif ($_ =~ / <\/closeTag>/) {
chomp;
print OUTPUT "";
This is quite redundant. You don't need to print an empty string (ever, really), and chomp a value you're not using.
} else {
chomp;
print OUTPUT "$_\r\n"; #print the original line
No need to remove newlines and then put them back. Also, normally you would use \n as your line ending, even on Windows.
And, since you are chomping in every if-else clause, you might as well move that outside the entire if-block.
chomp;
if (....) {
But since you are never relying on line endings not being there, why bother using chomp at all?
When using the $_ variable, you can abbreviate some commands, such as you are doing with chomp. For example, a lone regex will be applied to $_:
} elsif (/ <\/closeTag>/) { # works splendidly
When, like above, you have a regex that contains slashes, you can choose another delimiter for your regex, so that you do not need to escape the slashes:
} elsif (m# </closeTag>#) {
But then you need to use the full notation of the m// operator, with the m in front.
So, in short
while (<INPUT>) {
    if (/ <openTag>/) {
        print OUTPUT "Some_Replacement_String";
    } elsif (m# </closeTag>#) {
        # do nothing
    } else {
        print OUTPUT $_;    # print the original line
    }
}
And of course, the last two can be combined into one, with some negation logic:
} elsif (not m# </closeTag>#) {
    print OUTPUT $_;
}