Perl: How to extract unique entries of a text file

I am totally a beginner in Perl. I have a large file (around 100 GB) which looks like this:
domain, ip
"www.google.ac.",173.194.33.111
"www.google.ac.",173.194.33.119
"www.google.ac.",173.194.33.120
"www.google.ac.",173.194.33.127
"www.google.ac.",173.194.33.143
"apple.com., 173.194.33.143
"studio.com.", 173.194.33.143
"www.google.ac.",101.78.156.201
"www.google.ac.",101.78.156.201
So basically I have (1) duplicate lines, (2) one IP with different domains, and (3) one domain with different IPs, and I would like to remove the duplicate lines from the file (the ones with the same domain,ip pair).
I have already reviewed other answers to similar questions, but none of them address my problem with large files.
Does anybody have a clue how I can do it in Perl? Or any suggestion for a more suitable language?

The easiest thing to do is read the file a line at a time and use each line as the key of a hash. You have to have memory to store each unique line once, though. There's no getting around that.
Here's a one-liner as run from the shell:
perl -ne '$lines{$_}++; END { print keys %lines }' filename
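If you would rather emit unique lines as they are first seen (preserving the original order) instead of collecting everything until the END block, the same hash trick can be written as follows; the filenames are placeholders, and memory use is unchanged, one hash entry per unique line:
perl -ne 'print unless $seen{$_}++' filename > output_filename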

Related

perl memory usage when processing a file inline

I have a CGI script that's used by our employees to fetch logs from servers that they don't have direct access to. For reasons I won't go into, after a recent update to our app some of these logs now have characters like linefeeds, tabs, backslashes, etc. translated into their text equivalents. As such, I've modified the CGI script to invoke the following to convert these back to their original values:
perl -i -pe 's/\\r/\r/g && s/\\n/\n/g && s/\\t/\t/g && s/\\\//\//g' $filename
I was just informed that some people are now getting out of memory errors when they try to fetch logs that are fairly large (a few hundred MB).
My question: How does perl manage memory when an inline command like this is invoked? Is it reading the whole file in, processing it, then writing it out, or is it creating a temporary file, processing the lines from the input file one at a time then replacing the file once complete?
This is using Perl 5.10.1 on a 64-bit Amazon Linux instance.
The -p switch creates a while(<>){...; print} loop to iterate on each “line” in your input file.
If all of your newlines have been converted into "\\n", then your file would just be a single very long line. Therefore, your command would be loading the entire file into memory to perform your fix.
To avoid that, you'll have to intentionally buffer the file using either sysread or $/.
It would probably be easiest to create an actual script instead of a one-liner to do the work. However, if you know that all of your newlines are converted, then one simple fix would be to set $/ = "\\n".
As a secondary note, your substitution chain is flawed. You're currently joining the s/// operations with the && short-circuit operator, so if one of the earlier substitutions doesn't match for a given line, none of the later ones will be attempted. You should instead separate them with simple semicolons:
's/\\r/\r/g; s/\\n/\n/g; s/\\t/\t/g; s|\\/|/|g'
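Putting both fixes together, a sketch of the adjusted one-liner (this assumes every newline in the file really has been turned into the literal two characters \n, so the BEGIN block makes Perl read the file in chunks ending at each literal \n rather than all at once):
perl -i -pe 'BEGIN { $/ = "\\n" } s/\\r/\r/g; s/\\n/\n/g; s/\\t/\t/g; s|\\/|/|g' $filename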

perl sequence extraction loop

I have an existing Perl one-liner (from the Edwards lab) that works wonderfully: it reads a text file (named ids.file) that contains one column of IDs, searches a second, specially formatted text file (named fasta.file in this example; in "fasta" format, for those who know bioinformatics), and returns sequences that match the IDs from the first file. I was hoping to expand this script to do two additional things:
The current Perl one-liner only seems to work if the ids.file contains one column of data. I would like it to work on a file that contains two columns (separated by spaces) and act on the second column of data (well, really any column, but I assume it will be obvious enough to adapt if someone can give an example using the second column).
I would like to append any results returned from the search to a third column, instead of just to a new file.
If someone is kind enough to offer an example but only has time or inclination to work on one of these, I would prefer that you try to solve #2. I have come close to solving #1 with a for loop that uses awk to apply the Perl code to only the second column; I haven't gotten it yet, but am close, so #2 seems like the harder one to me.
The perl one liner is as follows:
perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.file fasta.file
I appreciate any help you can give!
Not quite sure but will this do?
perl -ne 'chomp; s/^>(\S+).*/$c=$i{$1}/e; print if $c;
          $i{(/^\S*\s(\S*)$/)[0]}="$_ " if @ARGV' ids.file fasta.file
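For readability, here is a longer sketch of the same job as a standalone script, keying on the second whitespace-separated column of the ids file; the filenames and the choice of column are assumptions taken from the question:
#!/usr/bin/env perl
# Sketch: collect IDs from the second column of ids.file, then stream
# fasta.file and print every record whose ">" header matches one of them.
use strict;
use warnings;

my ($ids_file, $fasta_file) = @ARGV;   # e.g. ids.file fasta.file

my %want;
open my $ids, '<', $ids_file or die "Can't open $ids_file: $!";
while (<$ids>) {
    my @cols = split ' ';                  # whitespace-separated columns
    $want{ $cols[1] } = 1 if @cols >= 2;   # act on the second column
}
close $ids;

my $printing = 0;
open my $fa, '<', $fasta_file or die "Can't open $fasta_file: $!";
while (my $line = <$fa>) {
    # a header line decides whether the sequence lines that follow it print
    $printing = exists $want{$1} if $line =~ /^>(\S+)/;
    print $line if $printing;
}
close $fa;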

Pick random lines from file with fixed seed (pseudo-random)

I would like to randomly pick some lines (e.g. 20) from a file and print them into another file, but I want a fixed seed so that I get the same output when the input file is the same.
In the examples I've found that pick several lines, the output is different every time, e.g.:
perl -e '$f="inputfile";$_=`wc -l $f`;@l=split( );$r=int rand($l[0]);system("head -n$r $f|tail -20")' > outputfile
And the examples that talk about a fixed seed and pseudo-randomness just print numbers rather than extracting lines from files, or extract only a single line. Is there a Unix command or some code in Perl or similar? (sort -R, --random- and shuf didn't work; I'm using Mac OS X 10.5.8.)
You can set the seed via srand() (for example, srand(5)) to get a fixed sequence from rand.
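For instance, a minimal sketch that picks 20 distinct lines reproducibly; the seed value, the count, and both filenames are placeholders, and it reads the whole file into memory, which is fine for modest inputs:
perl -e '
    srand(42);                          # fixed seed: identical picks on every run
    open my $in, "<", "inputfile" or die $!;
    my @lines = <$in>;
    # remove one random remaining line per iteration and print it
    print splice(@lines, int rand @lines, 1) for 1 .. 20;
' > outputfile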

Using a .fasta file to compute relative content of sequences

So me being the 'noob' that I am, being introduced to programming via Perl just recently, I'm still getting used to all of this. I have a .fasta file which I have to use, although I'm unsure if I'm able to open it, or if I have to work with it 'blindly', so to speak.
Anyway, the file that I have contains DNA sequences for three genes, written in this .fasta format.
Apparently it's something like this:
>label
sequence
>label
sequence
>label
sequence
My goal is to write a script to open and read the file, which I have gotten the hang of now, but I have to read each sequence, compute the relative amounts of 'G' and 'C' within each sequence, and then write the names of the genes and their respective 'G' and 'C' content to a TAB-delimited file.
Would anyone be able to provide some guidance? I'm unsure what a TAB-delimited file is, and I'm still trying to figure out how to open a .fasta file to actually see the content. So far I've worked with .txt files which I can easily open, but not .fasta.
I apologise for sounding completely bewildered. I'd appreciate your patience. I'm not like you pros out there!!
I get that it's confusing, but you really should try to limit your question to one concrete problem, see https://stackoverflow.com/faq#questions
I have no idea what a ".fasta" file or 'G' and 'C' are, but it probably doesn't matter.
Generally:
Open input file
Read and parse data. If it's in some strange format that you can't parse, go hunting on http://metacpan.org for a module to read it. If you're lucky someone has already done the hard part for you.
Compute whatever you're trying to compute
Print to screen (standard out) or another file.
A "TAB-delimite" file is a file with columns (think Excel) where each column is separated by the tab ("\t") character. As quick google or stackoverflow search would tell you..
Here is an approach using the awk utility, which can be used from the command line. The following program is executed by specifying its path and running awk -f <path> <sequence file>:
# NR>1: only look at lines after line 1, since you said the sequence starts on line 2
NR > 1 {
    # this for-loop walks through every base position in the line
    for (i = 1; i <= length; i++) {
        # every position encountered adds 1 to the running total of bases
        total++
        # if the character at position i is c or g, upper- or lowercase
        # (some bases are lowercase in some fasta files), bump its counter
        base = substr($0, i, 1)
        if (base == "c" || base == "C")
            c++
        else if (base == "g" || base == "G")
            g++
    }
}
END {
    # the END block prints a label and the C and G content in percent, separated by tabs
    print "Gene name\tG content:\t" (100 * g / total) "%\tC content:\t" (100 * c / total) "%"
}

Perl: pattern match a string and then print next line/lines

I am using Net::Whois::Raw to query a list of domains from a text file and then parse through this to output relevant information for each domain.
It was all going well until I hit Nominet results, as the information I require is never on the same line as the text I am pattern matching.
For instance:
Name servers:
ns.mistral.co.uk 195.184.229.229
So what I need to do is pattern match for "Name servers:" and then display the next line or lines but I just can't manage it.
I have read through all of the answers on here but they either don't seem to work in my case or confuse me even further as I am a simple bear.
The code I am using is as follows:
while ($record = <DOMAINS>) {
    $domaininfo = whois($record);
    if ($domaininfo =~ m/Name servers:(.*?)\n/) {
        print "Nameserver: $1\n";
    }
}
I have tried an example from Stack Overflow where
<DOMAINS>;
will take the next line, but this didn't work for me, and I assume it is because we have already read the contents into $domaininfo.
EDIT: Forgot to say thanks!
how rude.
So, the $domaininfo string contains your domain?
What you probably need is the m modifier at the end of your regular expression. This treats your string as a multiline string (which is what it is). Then you can match on the \n character. This works for me:
my $domaininfo =<<DATA;
Name servers:
ns.mistral.co.uk 195.184.229.229
DATA
$domaininfo =~ m/Name servers:\n(\S+)\s+(\S+)/m;
print "Server name = $1\n";
print "IP Address = $2\n";
Now, I can match the \n at the end of the Name servers: line and capture the name and IP address which is on the next line.
This might have to be munged a bit to get it to work in your situation.
This is half a question and perhaps half an answer (the question's in here as I am not yet allowed to write comments...). Okay, here we go:
Name servers:
ns.mistral.co.uk 195.184.229.229
Is this what an entry in the file you're parsing looks like? What will follow immediately afterwards - more domain names and IP addresses? And will there be blank lines in between?
Anyway, I think your problem may (in part?) be related to your reading the file line by line. Once you get to the IP address line, the info about 'Name servers:' having been present will be gone. Multiline matching will not help if you're looking at your file line by line. Thus I'd recommend switching to paragraph mode:
{
    local $/ = '';   # one paragraph instead of one line constitutes a record
    while ($record = <DOMAINS>) {
        # $record will now contain all consecutive lines that were NOT
        # separated by blank lines; once there are >= 1 blank lines,
        # $record will take a new value
        # do stuff, e.g. pattern matching (see the example below)
    }
}
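With paragraph mode in effect, $record can span several lines, so a multi-line pattern like the one from the other answer can be applied to each record directly. A hedged example of the "do stuff" part:
    if ($record =~ /Name servers:\n\s*(\S+)\s+(\S+)/) {
        print "Nameserver: $1 ($2)\n";
    }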
But then you said
I have tried an example from Stack Overflow where
<DOMAINS>;
will take the next line, but this didn't work for me, and I assume it is because we have already read the contents into $domaininfo.
so maybe you've already tried what I have just suggested? An alternative would be to add another variable ($indicator or whatever) which you set to 1 once 'Name servers:' has been read; as long as it equals 1, all following lines will be treated as containing the data you need, as in the sketch below. Whether this is feasible, however, depends on you always knowing what else your data file contains.
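A minimal sketch of that flag approach, applied here to the lines of the whois output rather than the file itself (it assumes the name server block ends at the first blank line, which may not hold for every registry's output):
my $in_ns = 0;
for my $line (split /\n/, $domaininfo) {   # walk the whois output line by line
    if ($line =~ /^Name servers:/) {
        $in_ns = 1;                        # lines after this one are in the block
        next;
    }
    $in_ns = 0 if $line =~ /^\s*$/;        # a blank line ends the block
    print "Nameserver: $line\n" if $in_ns;
}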
I hope something in here has been helpful to you. If there are any questions, please ask :)