Split string from file in Perl

Split string from file in Perl - perl

So, i have a file to read like this
Some.Text~~~Some big text with spaces and numbers and something~~~Some.Text2~~~Again some big test, etc~~~Text~~~Big text~~~And so on
What I want is if $x matches with Some.Text for example, how can I get a variable with "Some big text with spaces and numbers and something" or if it matches with "Some.Text2" to get "Again some big test, etc".
open FILE, "<cats.txt" or die $!;
while (<FILE>) {
chomp;
my #values = split('~~~', $_);
foreach my $val (#values) {
print "$val\n" if ($val eq $x)
}
exit 0;
}
close FILE;
And from now on I don't know what to do. I just managed to print "Some.text" if it matches with my variable.

splice can be used to remove elements from #values in pairs:
while(my ($matcher, $printer) = splice(#values, 0, 2)) {
print $printer if $matcher eq $x;
}
Alternatively, if you need to leave #values intact you can use a c style loop:
for (my $i=0; $i<#values; $i+=2) {
print $values[$i+1] if $values[$i] eq $x;
}

Your best option is perhaps not to split, but to use a regex, like this:
use strict;
use warnings;
use feature 'say';
while (<DATA>) {
while (/Some.Text2?~~~(.+?)~~~/g) {
say $1;
}
}
__DATA__
Some.Text~~~Some big text with spaces and numbers and something~~~Some.Text2~~~Again some big test, etc~~~Text~~~Big text~~~And so on
Output:
Some big text with spaces and numbers and something
Again some big test, etc

Related

How to fix the error of "Use of unitialized value in addition..." in perl script?

Here is the script of user Suic for calculating molecular weight of fasta sequences (calculating molecular weight in perl),
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
for my $file (#ARGV) {
open my $fh, '<:encoding(UTF-8)', $file;
my $input = join q{}, <$fh>;
close $fh;
while ( $input =~ /^(>.*?)$([^>]*)/smxg ) {
my $name = $1;
my $seq = $2;
$seq =~ s/\n//smxg;
my $mass = calc_mass($seq);
print "$name has mass $mass\n";
}
}
sub calc_mass {
my $a = shift;
my #a = ();
my $x = length $a;
#a = split q{}, $a;
my $b = 0;
my %data = (
A=>71.09, R=>16.19, D=>114.11, N=>115.09,
C=>103.15, E=>129.12, Q=>128.14, G=>57.05,
H=>137.14, I=>113.16, L=>113.16, K=>128.17,
M=>131.19, F=>147.18, P=>97.12, S=>87.08,
T=>101.11, W=>186.12, Y=>163.18, V=>99.14
);
for my $i( #a ) {
$b += $data{$i};
}
my $c = $b - (18 * ($x - 1));
return $c;
}
and the protein.fasta file with n (here is 2) sequences:
seq_ID_1 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASGSDGASDGDSAHSHAS
SFASGDASGDSSDFDSFSDFSD
>seq_ID_2 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASG
When using: perl molecular_weight.pl protein.fasta > output.txt
in terminal, it will generate the correct results, however it also presents an error of "Use of unitialized value in addition (+) at molecular_weight.pl line36", which is just localized in line of "$b += $data{$i};" how to fix this bug ? Thanks in advance !

You probably have an errant SPACE somewhere in your data file. Just change
$seq =~ s/\n//smxg;
into
$seq =~ s/\s//smxg;
EDIT:
Besides whitespace, there may be some non-whitespace invisible characters in the data, like WORD JOINER (U+2060).
If you want to be sure to be thorough and you know all the legal symbols, you can delete everything apart from them:
$seq =~ s/[^ARDNCEQGHILKMFPSTWYV]//smxg;
Or, to make sure you won't miss any (even if you later change the symbols), you can populate a filter regex dynamically from the hash keys.
You'd need to make %Data and the filter regex global, so the filter is available in the main loop. As a beneficial side effect, you don't need to re-initialize the data hash every time you enter calc_mass().
use strict;
use warnings;
my %Data = (A=>71.09,...);
my $Filter_regex = eval { my $x = '[^' . join('', keys %Data) . ']'; qr/$x/; };
...
$seq =~ s/$Filter_regex//smxg;
(This filter works as long as the symbols are single character. For more complicated ones, it may be preferable to match for the symbols and collect them from the sequence, instead of removing unwanted characters.)

Perl: Printing out the file where a word occurs

I am trying to write a small program that takes from command line file(s) and prints out the number of occurrence of a word from all files and in which file it occurs. The first part, finding the number of occurrence of a word, seems to work well.
However, I am struggling with the second part, namely, finding in which file (i.e. file name) the word occurs. I am thinking of using an array that stores the word but don’t know if this is the best way, or what is the best way.
This is the code I have so far and seems to work well for the part that counts the number of times a word occurs in given file(s):
use strict;
use warnings;
my %count;
while (<>) {
my $casefoldstr = lc $_;
foreach my $str ($casefoldstr =~ /\w+/g) {
$count{$str}++;
}
}
foreach my $str (sort keys %count) {
printf "$str $count{$str}:\n";
}

The filename is accessible through $ARGV.
You can use this to build a nested hash with the filename and word as keys:
use strict;
use warnings;
use List::Util 'sum';
while (<>) {
$count{$word}{$ARGV}++ for map +lc, /\w+/g;
}
foreach my $word ( keys %count ) {
my #files = keys %$word; # All files containing lc $word
print "Total word count for '$word': ", sum( #{ $count{$word} }{#files} ), "\n";
for my $file ( #files ) {
print "$count{$word}{$file} counts of '$word' detected in '$file'\n";
}
}

Using an array seems reasonable, if you don't visit any file more than once - then you can always just check the last value stored in the array. Otherwise, use a hash.
#!/usr/bin/perl
use warnings;
use strict;
my %count;
my %in_file;
while (<>) {
my $casefoldstr = lc;
for my $str ($casefoldstr =~ /\w+/g) {
++$count{$str};
push #{ $in_file{$str} }, $ARGV
unless ref $in_file{$str} && $in_file{$str}[-1] eq $ARGV;
}
}
foreach my $str (sort keys %count) {
printf "$str $count{$str}: #{ $in_file{$str} }\n";
}

search a group of string in a file is present in another file or not

I"m writing to perl script where basically want to open a file having many strings(one string in one line) and compare each of these strings is present in another file(search file) and print each occurrence of it. I have written the below code for one particular string finding. How can i improve it for list of strings from a file.
open(DATA, "<filetosearch.txt") or die "Couldn't open file filetosearch.txt for reading: $!";
my $find = "word or string to find";
#open FILE, "<signatures.txt";
my #lines = <DATA>;
print "Lined that matched $find\n";
for (#lines) {
if ($_ =~ /$find/) {
print "$_\n";
}
}

I'd try something like this:
use strict;
use warnings;
use Tie::File;
tie my #lines, 'Tie::File', 'filetosearch.txt';
my #matched;
my #result;
tie my #patterns, 'Tie::File', 'patterns.txt';
foreach my $pattern (#patterns)
{
$pattern = quotemeta $pattern;
#matched = grep { /$pattern/ } #lines;
push #result, #matched;
}
I use Tie::File, because it is convenient (not especially in this case, but others), others (perhaps a lot of others?) would disagree, but it is of no importance here
grep is a core function, that is very good at what it does (In my experience)

Ok, something like this will be faster.
sub testmatch
{
my ($find, $linesref)= #_ ;
for ( #$linesref ) { if ( $_ =~ /$find/ ) { return 1 ; } }
return 0 ;
}
{
open(DATA, "<filetosearch.txt") or die "die" ;
my #lines = <DATA> ;
open(SRC, "tests.txt") ;
while (<SRC>)
{
if ( testmatch( $_, \#lines )) { print "a match\n" }
}
}
If its matching full line to full line, you can pack the one line in as keys to a hash and just test existance:
{
open(DATA, "<filetosearch.txt") or die "die" ;
my %lines ;
#lines{<DATA>}= undef ;
open(SRC, "tests.txt") ;
while (<SRC>)
{
if ($_ ~~ %lines) { print "a match\n" }
}
}

maybe something like this will do the job:
open FILE1, "filetosearch.txt";
my #arrFileToSearch = <FILE1>;
close FILE1;
open FILE2, "signatures.txt";
my #arrSignatures = <FILE2>;
close FILE2;
for(my $i = 0; defined($arrFileToSearch[$i]);$i++){
foreach my $signature(#arrSignatures){
chomp($signature);
$signature = quotemeta($signature);#to be sure you are escaping special characters
if($arrFileToSearch[$i] =~ /$signature/){
print $arrFileToSearch[$i-3];#or any other index that you want
}
}
}

Here's another option:
use strict;
use warnings;
my $searchFile = pop;
my #strings = map { chomp; "\Q$_\E" } <>;
my $regex = '(?:' . ( join '|', #strings ) . ')';
push #ARGV, $searchFile;
while (<>) {
print if /$regex/;
}
Usage: perl script.pl strings.txt searchFile.txt [>outFile.txt]
The last, optional parameter directs output to a file.
First, the search file's name is (implicitly) popped off #ARGV and saved for later. Then the strings' file is read (<>) and map is used to chomp each line, escape meta-characters (the \Q and \E, in case there may be regex chars, e.g., a '.' or '*' etc., in the string) then these lines are passed to an array. The array's elements are joined with the regex alternation character (|) to effectively form an OR statement of all the strings that will be matched against each of the search file's lines. Next, the search file's name is pushed onto #ARGV so its lines can be searched. Again, each line is chomped and printed if one of the strings are found on the line.
Hope this helps!

Reading the next line in the file and keeping counts separate

Another question for everyone. To reiterate I am very new to the Perl process and I apologize in advance for making silly mistakes
I am trying to calculate the GC content of different lengths of DNA sequence. The file is in this format:
>gene 1
DNA sequence of specific gene
>gene 2
DNA sequence of specific gene
...etc...
This is a small piece of the file
>env
ATGCTTCTCATCTCAAACCCGCGCCACCTGGGGCACCCGATGAGTCCTGGGAA
I have established the counter and to read each line of DNA sequence but at the moment it is do a running summation of the total across all lines. I want it to read each sequence, print the content after the sequence read then move onto the next one. Having individual base counts for each line.
This is what I have so far.
#!/usr/bin/perl
#necessary code to open and read a new file and create a new one.
use strict;
my $infile = "Lab1_seq.fasta";
open INFILE, $infile or die "$infile: $!";
my $outfile = "Lab1_seq_output.txt";
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";
#establishing the intial counts for each base
my $G = 0;
my $C = 0;
my $A = 0;
my $T = 0;
#initial loop created to read through each line
while ( my $line = <INFILE> ) {
chomp $line;
# reads file until the ">" character is encounterd and prints the line
if ($line =~ /^>/){
print OUTFILE "Gene: $line\n";
}
# otherwise count the content of the next line.
# my percent counts seem to be incorrect due to my Total length counts skewing the following line. I am currently unsure how to fix that
elsif ($line =~ /^[A-Z]/){
my #array = split //, $line;
my $array= (#array);
# reset the counts of each variable
$G = ();
$C = ();
$A = ();
$T = ();
foreach $array (#array){
#if statements asses which base is present and makes a running total of the bases.
if ($array eq 'G'){
++$G;
}
elsif ( $array eq 'C' ) {
++$C; }
elsif ( $array eq 'A' ) {
++$A; }
elsif ( $array eq 'T' ) {
++$T; }
}
# all is printed to the outfile
print OUTFILE "G:$G\n";
print OUTFILE "C:$C\n";
print OUTFILE "A:$A\n";
print OUTFILE "T:$T\n";
print OUTFILE "Total length:_", ($A+=$C+=$G+=$T), "_base pairs\n";
print OUTFILE "GC content is(percent):_", (($G+=$C)/($A+=$C+=$G+=$T)*100),"_%\n";
}
}
#close the outfile and the infile
close OUTFILE;
close INFILE;
Again I feel like I am on the right path, I am just missing some basic foundations. Any help would be greatly appreciated.
The final problem is in the final counts printed out. My percent values are wrong and give me the wrong value. I feel like the total is being calculated then that new value is incorporated into the total.

Several things:
1. use hash instead of declaring each element.
2. assignment such as $G = (0); is indeed working, but it is not the right way to assign scalar. What you did is declaring an array, which in scalar context $G = is returning the first array item. The correct way is $G = 0.
my %seen;
$seen{/^([A-Z])/}++ for (grep {/^\>/} <INFILE>);
foreach $gene (keys %seen) {
print "$gene: $seen{$gene}\n";
}

Just reset the counters when a new gene is found. Also, I'd use hashes for the counting:
use strict; use warnings;
my %counts;
while (<>) {
if (/^>/) {
# print counts for the prev gene if there are counts:
print_counts(\%counts) if keys %counts;
%counts = (); # reset the counts
print $_; # print the Fasta header
} else {
chomp;
$counts{$_}++ for split //;
}
}
print_counts(\%counts) if keys %counts; # print counts for last gene
sub print_counts {
my ($counts) = #_;
print "$_:=", ($counts->{$_} || 0), "\n" for qw/A C G T/;
}
Usage: $ perl count-bases.pl input.fasta.
Example output:
> gene 1
A:=3
C:=1
G:=5
T:=5
> gene 2
A:=1
C:=5
G:=0
T:=13
Style comments:
When opening a file, always use lexical filehandles (normal variables). Also, you should do a three-arg open. I'd also recommend the autodie pragma for automatic error handling (since perl v5.10.1).
use autodie;
open my $in, "<", $infile;
open my $out, ">", $outfile;
Note that I don't open files in my above script because I use the special ARGV filehandle for input, and print to STDOUT. The output can be redirected on the shell, like
$ perl count-bases.pl input.fasta >counts.txt
Declaring scalar variables with their values in parens like my $G = (0) is weird, but works fine. I think this is more confusing than helpful. → my $G = 0.
Your intendation is a bit weird. It is very unusual and visually confusing to put closing braces on the same line with another statement like
...
elsif ( $array eq 'C' ) {
++$C; }
I prefer cuddling elsif:
...
} elsif ($base eq 'C') {
$C++;
}
This statement my $array= (#array); puts the length of the array into $array. What for? Tip: You can declare variables right inside foreach-loops, like for my $base (#array) { ... }.

How to get last duplicate line from file using perl

I'd like to get the last entry of a duplicate line from a file.
The basis for duplicate checking would be the first element from a csv.
The duplicates may or may not be adjacent.
Input file:
971~11
972~12
973~11
974~11
972~11
Expected output:
971~11
973~11
974~11
972~11
I'm not looking for a perl one-liner as I intend to write this as
a subroutine.
Thanks!
PS:
I have modified this code from somewhere, but this just removes the duplicates
#!/usr/bin/perl -w
while (<STDIN>) { push (#lines, $_); }
print "-\n";
foreach my $i (#lines)
{
#newline = split(/\||~/, $i);
if (scalar(grep{ /$newline[0]/ } #lines) == 1)
{
print $i;
}
}

If the output order doesn't matter, the easiest way to do this is to use a hash to do the duplicate removal. Something like the following:
#!/usr/bin/perl -w
use strict;
sub printlast(#) {
my %dedup;
foreach my $line (#_) {
my $a = (split(/\||~/, $line))[0];
$dedup{$a} = $line;
}
print $dedup{$_} for keys %dedup; # or sort keys %dedup for prettier output
}
my #lines;
while (<STDIN>) { push (#lines, $_); }
print "-\n";
printlast(#lines);

When looking to dedup, it's almost always best to use a hash.
Here's something similar to the accepted answer (since #Mat beat me to it)
#!/usr/bin/env perl -lw
use Data::Dumper; $Data::Dumper::Indent = 1;
my %seen;
while (<DATA>) {
chomp;
my #fields = split('~');
$seen{$fields[0]} = $fields[1];
}
my #output;
while (my ($k,$v) = each %seen) {
push #output, join('~', $k, $v);
}
print Dumper \#output;
__DATA__
971~11
972~12
973~11
974~11
972~11

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Split string from file in Perl - perl

Related

How to fix the error of "Use of unitialized value in addition..." in perl script?

Perl: Printing out the file where a word occurs

search a group of string in a file is present in another file or not

Reading the next line in the file and keeping counts separate

How to get last duplicate line from file using perl

Categories

Resources