perl sequence extraction loop - perl

I have an existing perl one-liner (from the Edwards lab) that works wonderfully to read a text file (named ids.file) that contains one column of IDs and searches a second, specially formatted text file (named fasta.file in this example - in "fasta" format for those who know bioinformatics) and returns sequences that match the ID from the first file. I was hoping to expand this script to do two additional things:
The current perl one-liner only seems to work if the ids.file contains one column of data. I would like it to work on a file that contains two columns (separated by spaces), and act on the second column of data (well, really any column of data, but I assume that it will be obvious enough to adapt it if someone can give an example using a second column)
I would like to append the any results returned from the output of the search to a third column, instead of just to a new file.
If someone is kind enough to offer an example but only has time or inclination to work on one of these, I would prefer that you try to solve #2 - I have come close to solving #1 with a for loop that uses awk to only use the Perl code on the second column - I haven't gotten it yet, but am close, so #2 seems like the harder one to me.
The perl one liner is as follows:
perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if #ARGV' ids.file fasta.file
I appreciate any help you can give!

Not quite sure but will this do?
perl -ne 'chomp; s/^>(\S+).*/$c=$i{$1}/e; print if $c;
$i{(/^\S*\s(\S*)$/)[0]}="$_ " if #ARGV'
ids.file fasta.file

Related

Extract Data From Second Line of Output

I have a table that contains message_content. It looks like this:
message_content | WFUS54 ABNT 080344\r\r
| TORLCH\r\r
| TXC245-361-080415-\r
How would I extract only the 2nd line of that output(TORLCH)? I've tried to shorten the output to a certain number of characters but that ultimately doesn't provide what I want. I've also tried removing carriage returns and new lines. I am outputting my results to a CSV I could manipulate with Python, but was wondering if there's a way to do it in the query first.
Based on other examples, it seems like I could use a regular expression to maybe do this? Not sure where to start with learning that though.
you can split the line into an array, then take the second element:
(string_to_array(message_content, e'\r\r'))[2]
Online example: https://rextester.com/MDYLXB40812

How to perl convert xml (name with pattern) to json?

The next convert test.xml to json:
perl -MJSON::Any -MXML::Simple -le'print JSON::Any->new()->objToJson(XMLin("/tmp/test.xml "))'
but I need convert any xml (example test-1.xml test-2.xml test-3.xml test-4.xml etc) with pattern name /tmp/test-*.xml, but if I use:
perl -MJSON::Any -MXML::Simple -le'print JSON::Any->new()->objToJson(XMLin("/tmp/test-*.xml "))'
I have the next messages:
File does not exist: /tmp/test-*.xml at -e line 1
How I do it?
There's problems with what you're trying to do:
XML::Simple isn't simple. It's for simple XML. It'll mangle your XML and give inconsistent results. See: Why is XML::Simple "Discouraged"?
XML is fundamentally more complicated than JSON, so there's no linear transformation. You need to figure out what'd you'd do with attributes and duplicate elements for a start.
File does not exist: /tmp/test-*.xml at -e line 1 - means the file doesn't exist. So you're not going to get very far. But XMLin doesn't accept wildcards. You'll have to process one file at a time.
The first two points are solvable, provided you accept that this cannot be a generic solution - to give a moderately general solution, we'll need an example of your source XML. But it won't be a one liner.
You seem to be asking how to find files matching a file glob.
You could use
my #qfns = glob("/tmp/test-*.xml");
If you just want the first matching file, use
my ($qfn) = glob("/tmp/test-*.xml");
Do not use the following since glob acts an iterator in scalar context.
my $qfn = glob("/tmp/test-*.xml"); # XXX
You can try this using glob and map functions.
perl -MJSON::Any -MXML::Simple -le'local $,="\n"; print map { JSON::Any->new()->objToJson(XMLin($_)) } glob "/path/to/my/test*.xml"'

Perl : How to extract unique entries of a text file

I am totally a beginner in Perl. I have a large file (around 100 G) which looks like this:
domain, ip
"www.google.ac.",173.194.33.111
"www.google.ac.",173.194.33.119
"www.google.ac.",173.194.33.120
"www.google.ac.",173.194.33.127
"www.google.ac.",173.194.33.143
"apple.com., 173.194.33.143
"studio.com.", 173.194.33.143
"www.google.ac.",101.78.156.201
"www.google.ac.",101.78.156.201
So basically I have 1-duplicate lines, 2- one ip with different domains, 3- one domain with different ips. and I would like to remove the duplicate lines from the file (the ones with same domain,ip pair).
**I have already reviewed other answers in regards to the same question, none of them address my problem with large files .
Does anybody have a clue how can I do it in PERL? or any suggestion for more optimal language?
The easiest thing to do is read the file a line at a time and use each line as the key of a hash. You have to have memory to store each unique line once, though. There's no getting around that.
Here's a one-liner as run from the shell:
perl -ne '$lines{$_}++; END { print keys %lines }' filename

Substitute only one part of a string using perl

I have an array that have some symbols that I want to remove and even thought I find a solution, I will like to know if this is the right way because I'm afraid if I use it with array will remove the character that I might need on future arrays.
Here is an example item on my array:
$string1='22 | logging monitor informational';
so I try the following:
$string1=~ s/\s{6}\|(?=\s{6})//;
So my output is:
22 logging monitor informational
Is the other way that best match "|". I just want to remove the pipe character.
Thanks in advance
"I want to remove just the pipe character."
OK, then do this:
$string1 =~ s/\|//;
This will remove the first pipe character in the string. (You said in another comment that you don't want to remove any additional pipe characters.) If that's not what you want, then I'd suggest telling us exactly what you do want. We can't read minds, you know.
In the mean time, I'd also strongly recommend reading the Perl regular expressions tutorial.

Using a .fasta file to compute relative content of sequences

So me being the 'noob' that I am, being introduced to programming via Perl just recently, I'm still getting used to all of this. I have a .fasta file which I have to use, although I'm unsure if I'm able to open it, or if I have to work with it 'blindly', so to speak.
Anyway, the file that I have contains DNA sequences for three genes, written in this .fasta format.
Apparently it's something like this:
>label
sequence
>label
sequence
>label
sequence
My goal is to write a script to open and read the file, which I have gotten the hang of now, but I have to read each sequence, compute relative amounts of 'G' and 'C' within each sequence, and then I'm to write it to a TAB-delimited file the names of the genes, and their respective 'G' and 'C' content.
Would anyone be able to provide some guidance? I'm unsure what a TAB-delimited file is, and I'm still trying to figure out how to open a .fasta file to actually see the content. So far I've worked with .txt files which I can easily open, but not .fasta.
I apologise for sounding completely bewildered. I'd appreciate your patience. I'm not like you pros out there!!
I get that it's confusing, but you really should try to limit your question to one concrete problem, see https://stackoverflow.com/faq#questions
I have no idea what a ".fasta" file or 'G' and 'C' is.. but it probably doesn't matter.
Generally:
Open input file
Read and parse data. If it's in some strange format that you can't parse, go hunting on http://metacpan.org for a module to read it. If you're lucky someone has already done the hard part for you.
Compute whatever you're trying to compute
Print to screen (standard out) or another file.
A "TAB-delimite" file is a file with columns (think Excel) where each column is separated by the tab ("\t") character. As quick google or stackoverflow search would tell you..
Here is an approach using 'awk' utility which can be used from the command line. The following program is executed by specifying its path and using awk -f <path> <sequence file>
#NR>1 means only look at lines above 1 because you said the sequence starts on line 2
NR>1{
#this for-loop goes through all bases in the line and then performs operations below:
for (i=1;i<=length;i++)
#for each position encountered, the variable "total" is increased by 1 for total bases
total++
}
{
for (i=1;i<=length;i++)
#if the "substring" i.e. position in a line == c or g upper or lower (some bases are
#lowercase in some fasta files), it will carry out the following instructions:
if(substr($0,i,1)=="c" || substr($0,i,1)=="C")
#this increments the c count by one for every c or C encountered, the next if statement does
#the same thing for g and G:
c++; else
if(substr($0,i,1)=="g" || substr($0,i,1)=="G")
g++
}
END{
#this "END-block" prints the gene name and C, G content in percentage, separated by tabs
print "Gene name\tG content:\t"(100*g/total)"%\tC content:\t"(100*c/total)"%"
}