Iteration to Match Line Patterns from Text File and Then Parse out N Lines - perl

I have a text file that contains three columns. Using perl, I'm trying to loop through the text file and search for a particular pattern...
Logic: IF column2 = 00z24aug2016 & column3 = e01. When this pattern is matched I need to parse out the matched line and then the next 3 lines to new files.
Text File:
site1,00z24aug2016,e01
site1,00z24aug2016,e01
site1,00z24aug2016,e01
site1,00z24aug2016,e01
site2,00z24aug2016,e02
site2,00z24aug2016,e02
site2,00z24aug2016,e02
site2,00z24aug2016,e02
Desired Output...
New File 1:
site1,00z24aug2016,e01
site1,00z24aug2016,e01
site1,00z24aug2016,e01
site1,00z24aug2016,e01
New File 2:
site2,00z24aug2016,e02
site2,00z24aug2016,e02
site2,00z24aug2016,e02
site2,00z24aug2016,e02

Based on your comment in response to zdim and Borodin, it appears that you're asking for pointers on how to do this with Perl rather than actual working code, so I am answering on that basis.
What you describe in the "logic" portion of your question is extremely simple and straightforward to do in Perl - the actual code would be far shorter than this description of it:
Start your program with use strict; use warnings; - this will catch most common errors and make debugging vastly easier!
Open your input file for reading (open(my $fh, '<', $file_name) or die "Failed to open $file_name: $!")
Read in each line of the file (my $line = <$fh>;)
Optionally use chomp to remove line endings
Use split to break the line into fields (my @column = split /,/, $line;)
Check the values of the second and third fields (note that arrays start counting from 0, not from 1, so these will be $column[1] and $column[2] rather than $column[2] and $column[3])
If the field values match your criteria, set a counter to 4 (the total number of lines to output)
If the counter is greater than zero, output the original $line and decrement the counter
The logic mentions "new files" but does not specify when a new output file should be created and when output should continue to be sent to the same file. Since this was not specified, I have ignored it and described all output going to a single destination.
Note, however, that your sample desired output does not match the described logic. According to the specified logic, the output should include the first seven lines of your example data, but not the final line (because none of the three lines preceding it include "e01").
So. Take this information, along with whatever you may already know about Perl, and try to write a solution. If you reach a point where you can't figure out how to make any further progress, post a new question (or update this one) containing a copy of your code and input data, so that we can run it ourselves, and a description of how it fails to work properly, then we'll be much more able to help you with that information (and more people will be willing to help if you can show that you made an effort to do it yourself first).
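Once you have made your own attempt, the steps above translate almost line for line into code; for reference, here is a minimal sketch (assuming the input file is named data.txt and, per the note above, with everything going to a single output):

use strict;
use warnings;

my $file_name = 'data.txt';          # assumed input name; adjust to suit
open( my $fh, '<', $file_name ) or die "Failed to open $file_name: $!";

my $remaining = 0;                   # lines still owed to the output after a match
while ( my $line = <$fh> ) {
    chomp $line;
    my @column = split /,/, $line;
    if ( @column >= 3 && $column[1] eq '00z24aug2016' && $column[2] eq 'e01' ) {
        $remaining = 4;              # the matched line plus the next 3
    }
    if ( $remaining > 0 ) {
        print "$line\n";
        $remaining--;
    }
}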

Related

Using perl to split over multiple lines

I'm trying to write a perl script to process a log4net log file. The fields in the log file are separated by a semi-colon. My end goal is to capture each field and populate a mysql table.
Usually I have lines that look a little like this (all on a single line)
DEBUG;2017-06-13T03:56:38,316-05:00;2017-06-13 08:56:38,316;79ab0b95-7f58-44a8-a2c6-1f8feba1d72d;(null);WorkerStartup 1;"Starting services."
These are easy to process. I can simply split by semicolon to get the information I need.
However, occasionally the "message" field at the end may span several lines, especially if there is a stack trace. I would want to capture the entire message as a single column. I cannot use split by semicolon, because the next lines would typically look like:
at some.random.classname
at another.classname
...
Can someone give some tips how to solve this problem?
The following solution uses the fact that a complete record contains an even number of " characters: the test $p=~y/"//%2 is true while the count of " is odd, meaning a quoted field is still open. This condition may be changed to any other test that indicates the record is not yet complete.
The number of columns split is fixed at 7 (to allow ; in the last field); it may be changed, for example to @array = map {s/;$//r} $p=~/\G(?:"[^"]*"|[^;])*;/g; for a variable number of columns.
The file is read line by line, but a record is only processed (sub process) once it is complete; the $p variable stores the accumulated previous lines, and the last record is processed in the END block.
perl -ne '
    sub process {
        @array = split /;/, $p, 7;
        # do something with @array
        print ((join "\n---\n", @array), "\n");
    }
    if ($p =~ y/"// % 2) {
        $p .= $_;
        next;
    }
    process;
    $p = $_;
    END { process }
' < logfile.txt
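The same logic reads more easily as a standalone script; here is a sketch under the same assumptions (7 columns, and a record is incomplete while its count of " characters is odd):

#!/usr/bin/perl
use strict;
use warnings;

my $pending = '';                      # accumulates the physical lines of one record

sub process {
    my ($record) = @_;
    return unless length $record;
    my @field = split /;/, $record, 7;
    print join("\n---\n", @field), "\n";   # stand-in for the real per-record work
}

while (my $line = <>) {
    $pending .= $line;
    next if ($pending =~ tr/"//) % 2;  # odd quote count: the record continues
    process($pending);
    $pending = '';
}
process($pending);                     # flush the final record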

MATLAB simultaneous read and write the same file

I want to read and write the same file simultaneously. Here is a simplified code:
clc;
close all;
clearvars;
fd = fopen('abcd.txt','r+'); %opening file abcd.txt given below
while ~feof(fd)
    nline = fgetl(fd);
    find1 = strfind(nline,'abcd'); %searching for matching string
    chk1 = isempty(find1);
    if (chk1 == 0)
        write = '0000'; %in this case, matching pattern found,
                        %so replace that line by 0000
        fprintf(fd,'%s \n',write);
    else
        continue;
    end
end
File abcd.txt
abcde
abcd23
abcd2
abcd355
abcd65
I want to find the text abcd in each line and replace the entire line with 0000. However, there is no change in the text file abcd.txt. The program doesn't write anything to the text file.
Someone may say: read each line and write a separate text file line by line. However, there is a problem with this approach. In the original problem, instead of finding the matching text abcd, there is an array of strings with thousands of elements. In that case, I want to read the file, parse it to find a matching string, replace the string as per the condition, go to the next iteration to search for the next matching string, and so on. So in this approach, reading the original file line by line while simultaneously writing another file does not work.
Another approach could be reading the entire file into memory, replacing the strings, and iterating. But I am not very sure how that would work, and another issue is memory usage.
Any comments?
What you are attempting to do is not possible in an efficient way. Replacing abcde with 0000, which should be done for the first line, would require shifting all of the remaining text, because you remove one character.
Instead, solve this by reading from one file and writing to a second, then removing the original file and renaming the new one.
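A minimal sketch of that approach (the temporary file name abcd_tmp.txt is an arbitrary choice):

fin  = fopen('abcd.txt','r');
fout = fopen('abcd_tmp.txt','w');
while ~feof(fin)
    nline = fgetl(fin);
    if ~ischar(nline), break; end            % guard against end of file
    if ~isempty(strfind(nline,'abcd'))       % pattern found: replace whole line
        fprintf(fout,'%s\n','0000');
    else                                     % otherwise copy the line unchanged
        fprintf(fout,'%s\n',nline);
    end
end
fclose(fin);
fclose(fout);
delete('abcd.txt');                          % remove the original ...
movefile('abcd_tmp.txt','abcd.txt');         % ... and rename the new file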

Read from csv file and write output to a file

I am new to Perl and would like your help with the following scenario.
I have a CSV file with the following information, and I am trying to prepare key-value pairs from it.
Line 1: List,ID
Line 2: 1,2,3
Line 3: 4,5,6
Line 4: List,Name
Line 5: Tom, Peter, Joe
Line 6: Jim, Harry, Tim
I need to format the above CSV file to get an output in a new file like below:
Line 1: ID:1,2,3 4,5,6
Line 2: Name:Tom,Peter,Joe Jim, Harry, Tim
Can you please direct me on how I can use Perl functions for this scenario.
You're in luck, this is extremely easy in Perl.
There's a great library called Text::CSV which is available on CPAN, docs are here: https://metacpan.org/pod/Text::CSV
The synopsis at the top of the page gives a really good example which should let you do what you want with minor modifications.
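For example, the basic read loop from that synopsis looks roughly like this (a sketch; the file name input.csv is a placeholder):

use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', 'input.csv' or die "Failed to open input.csv: $!";
while (my $row = $csv->getline($fh)) {
    my @fields = @$row;              # one parsed record, quoting already handled
    print "first field: $fields[0]\n";
}
close $fh;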
I don't think the issue here is the CSV format so much as the fact that you have different lists broken up with header lines. I haven't tried this code yet, but I think you want something like the following:
while (<>) {                     # Loop over stdin one line at a time
    chomp;                       # Strip off trailing newline
    my ($listToken, $listName) = split(',');
    next unless $listToken;      # Skip over blank lines
    if ($listToken =~ /^List/) { # This is a header row
        print "\n$listName: ";   # End previous list, start new one
    } else {                     # The current list continues
        print "$_ ";             # Append the entire row to the output
    }
}
print "\n";                      # Terminate the last line
Note that this file format is a little dubious, as there is no way to have a data row where the first value is the literal "List". However, I'm assuming that either you have no choice in file format or you know that List is not a legal value.

How to "jump" to a line of a file, rather than read file line by line, using Perl

I am opening a file containing a single but very long column. I want to retrieve from it just a short segment, starting at a specified line and ending at another specified line. Currently, my script is reading the file line by line until the desired lines are found. I am using:
my ( $from, $to ) = ( some line number, some larger line number );
my $count = 1;
my @seq = ();
while ( <SEQUENCE> ) {
    print "$_ for $count\n";
    $count++;
    while ( $count >= $from && $count <= $to ) {
        push( @seq, $_ );
        last;
    }
}
print "seq is: @seq\n";
Input looks like:
A
G
T
C
A
G
T
C
.
.
.
How might I "jump" to where I want to be?
You'll need to use seek to move to the correct portion of the file. ref: http://perldoc.perl.org/functions/seek.html
This works on bytes, not on lines, so generally if you need to seek by line it's not an option. However, since you're working with fixed-length lines (2 or 3 bytes, depending on your platform's EOL encoding) you can multiply the line length by the line number you want (0-indexed) and you'll be at the correct location for reading.
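For example (a sketch; the file name sequence.txt, the 2-byte record length, and the line numbers are all assumptions for illustration):

use strict;
use warnings;

my ( $from, $to ) = ( 1000, 1010 );          # hypothetical line numbers
my $record_len = 2;                          # 1 base + 1 newline byte on Unix

open my $fh, '<', 'sequence.txt' or die "Can't open sequence.txt: $!";
seek $fh, ( $from - 1 ) * $record_len, 0;    # 0 = SEEK_SET: offset from file start

my @seq;
while ( my $line = <$fh> ) {
    chomp $line;
    push @seq, $line;
    last if @seq >= $to - $from + 1;         # stop after the requested range
}
print "seq is: @seq\n";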
If you happen to know that all the lines are of exactly the same length (accounting for line ending characters, generally 1 byte on Unix/Linux and 2 on Windows), you can use seek to go directly to a specified point in the file
The seek function lets you specify a file position in bytes/characters, not in lines. In the general case, the only way to go to a specified line number is to read from the beginning and skip that many lines (minus one).
Unless you have an index mapping line numbers to byte offsets; then you can look up the specified line number in the index and use seek to jump to that location. To do this, you have to build the index separately (a process that will require reading through the entire file) and make sure the index is always up to date. If the file changes frequently, this is likely to be impractical.
I'm not aware of any existing tools for building and using such an index, but I wouldn't be surprised if they exist. But it should be easy enough to roll your own.
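Rolling your own could be as simple as this (a sketch, assuming the data file is named sequence.txt):

use strict;
use warnings;

# One pass records the byte offset of every line; afterwards any line can
# be reached with a single seek instead of a scan.
open my $fh, '<', 'sequence.txt' or die "Can't open sequence.txt: $!";
my @offset = (0);                    # $offset[$n] is the byte offset of line $n + 1
push @offset, tell $fh while <$fh>;

seek $fh, $offset[499], 0;           # jump straight to line 500 (1-indexed)
my $line = <$fh>;
print $line if defined $line;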
But unless scanning the file to find the line number you want is a significant performance bottleneck, I wouldn't bother with the extra complexity.

Using a .fasta file to compute relative content of sequences

So me being the 'noob' that I am, being introduced to programming via Perl just recently, I'm still getting used to all of this. I have a .fasta file which I have to use, although I'm unsure if I'm able to open it, or if I have to work with it 'blindly', so to speak.
Anyway, the file that I have contains DNA sequences for three genes, written in this .fasta format.
Apparently it's something like this:
>label
sequence
>label
sequence
>label
sequence
My goal is to write a script to open and read the file, which I have gotten the hang of now, but then I have to read each sequence, compute the relative amounts of 'G' and 'C' within each sequence, and write to a TAB-delimited file the names of the genes and their respective 'G' and 'C' content.
Would anyone be able to provide some guidance? I'm unsure what a TAB-delimited file is, and I'm still trying to figure out how to open a .fasta file to actually see the content. So far I've worked with .txt files which I can easily open, but not .fasta.
I apologise for sounding completely bewildered. I'd appreciate your patience. I'm not like you pros out there!!
I get that it's confusing, but you really should try to limit your question to one concrete problem, see https://stackoverflow.com/faq#questions
I have no idea what a ".fasta" file or 'G' and 'C' are, but it probably doesn't matter.
Generally:
Open input file
Read and parse data. If it's in some strange format that you can't parse, go hunting on http://metacpan.org for a module to read it. If you're lucky someone has already done the hard part for you.
Compute whatever you're trying to compute
Print to screen (standard out) or another file.
A "TAB-delimite" file is a file with columns (think Excel) where each column is separated by the tab ("\t") character. As quick google or stackoverflow search would tell you..
Here is an approach using the awk utility, which can be used from the command line. The following program is executed by specifying its path: awk -f <path> <sequence file>
#NR>1 means only look at lines after line 1, because the sequence starts on line 2
NR>1{
    #this for-loop goes through all bases in the line; for each position
    #encountered, the variable "total" is increased by 1 for total bases
    for (i=1;i<=length;i++)
        total++
}
NR>1{
    for (i=1;i<=length;i++)
        #if the character at position i (a "substring" of length 1) is c or g,
        #upper or lower case (some bases are lowercase in some fasta files),
        #increment the c count; the else-branch does the same thing for g and G:
        if(substr($0,i,1)=="c" || substr($0,i,1)=="C")
            c++
        else if(substr($0,i,1)=="g" || substr($0,i,1)=="G")
            g++
}
END{
    #this "END-block" prints the gene name and C, G content in percentage, separated by tabs
    print "Gene name\tG content:\t"(100*g/total)"%\tC content:\t"(100*c/total)"%"
}