Pick random lines from file with fixed seed (pseudo-random) - perl

I would like to randomly pick some lines (e.g. 20) from a file and print them into another file, but I want the seed fixed so that I get the same output whenever the input file is the same.
The examples I've found that pick several lines produce different output every time, e.g.:
perl -e '$f="inputfile";$_=`wc -l $f`;@l=split( );$r=int rand($l[0]);system("head -n$r $f|tail -20")' > outputfile
And those that talk about fixed seeds and pseudo-randomness only print numbers; they don't extract lines from files, or they extract just a single line. Is there a Unix command or some code in Perl or similar? (sort -R, --random- and shuf didn't work; I'm on Mac OS X 10.5.8.)

You can set the seed via srand() (for example, srand(5)) to get a fixed seed for rand.
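For example, a minimal sketch (the seed value 42 and the slurp into memory are my choices; it assumes the file fits in memory). It splices out 20 random lines without replacement, and the same seed plus the same input always yields the same output:

perl -e 'srand(42); @l = <>; print splice(@l, int rand @l, 1) for 1..20' inputfile > outputfile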

Related

Why are no output files written in prinseq-lite perl loop?

I am completely new to this type of coding/command lines, so I am sorry if I am asking this question in the wrong way.
I want to loop over all files in a directory (I am quality-trimming DNA sequencing files in .fastq format).
I have written this loop:
for i in *.fastq; do
perl /apps/prinseqlite/0.20.4/prinseq-lite.pl -fastq $i -min_len 220 -max_len 240 -min_qual_mean 30 -ns_max_n 5 -trim_tail_right 15 -trim_tail_left 15 -out_good /proj/forhot/qfiltered/looptest/$i_filtered.fastq -out_bad null; done
The code itself seems to work: I can see in my terminal that it is picking up the right files and doing the trimming (it writes a summary log to the terminal as it goes), but no output files are generated, i.e. this one:
-out_good /proj/forhot/qfiltered/looptest/$i_filtered.fastq
If I run the code in a non-loop way, on just one file, it works (= the output is generated), like this example:
prinseq-lite.pl -fastq 60782_merged_rRNA.fastq -min_len 220 -max_len 240 -min_qual_mean 30 -ns_max_n 5 -trim_tail_right 15 -trim_tail_left 15 -out_good 60782_merged_rRNA_filt_codeTEST.fastq -out_bad null
Is there a simple reason/answer to this?
This problem has nothing to do with Perl at all.
/proj/forhot/qfiltered/looptest/$i_filtered.fastq is read by the shell as interpolating the contents of the variable i_filtered. There is no such shell variable, so this argument turns into /proj/forhot/qfiltered/looptest/.fastq ($i_filtered expands to nothing).
Therefore all of your prinseq-lite.pl executions place their output in the same file, which (because its name starts with a .) is "hidden": you need ls -a to see it, not just ls.
Fix
... -out_good /proj/forhot/qfiltered/looptest/${i}_filtered.fastq
Note that this would give you e.g. 60782_merged_rRNA.fastq_filtered.fastq for an input file of 60782_merged_rRNA.fastq. If you want to get rid of the duplicate .fastq part, you need something like:
... -out_good /proj/forhot/qfiltered/looptest/"${i%.fastq}"_filtered.fastq
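Putting it together, the corrected loop would read like this (a sketch reusing the exact flags from the question; quoting "$i" also guards against file names with spaces):

for i in *.fastq; do
  perl /apps/prinseqlite/0.20.4/prinseq-lite.pl -fastq "$i" \
    -min_len 220 -max_len 240 -min_qual_mean 30 -ns_max_n 5 \
    -trim_tail_right 15 -trim_tail_left 15 \
    -out_good /proj/forhot/qfiltered/looptest/"${i%.fastq}"_filtered.fastq \
    -out_bad null
done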

Perl: How to extract unique entries from a text file

I am a total beginner in Perl. I have a large file (around 100 GB) which looks like this:
domain, ip
"www.google.ac.",173.194.33.111
"www.google.ac.",173.194.33.119
"www.google.ac.",173.194.33.120
"www.google.ac.",173.194.33.127
"www.google.ac.",173.194.33.143
"apple.com., 173.194.33.143
"studio.com.", 173.194.33.143
"www.google.ac.",101.78.156.201
"www.google.ac.",101.78.156.201
So basically I have (1) duplicate lines, (2) one IP with different domains, and (3) one domain with different IPs, and I would like to remove the duplicate lines from the file (the ones with the same domain,IP pair).
I have already reviewed other answers to this same question; none of them address my problem with large files.
Does anybody have a clue how I can do it in Perl? Or any suggestion for a more suitable language?
The easiest thing to do is read the file a line at a time and use each line as the key of a hash. You have to have memory to store each unique line once, though. There's no getting around that.
Here's a one-liner as run from the shell:
perl -ne '$lines{$_}++; END { print keys %lines }' filename
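Note that keys %lines comes back in an unpredictable order, and only after the whole file has been read. A common variant (not from the original answer) prints each unique line the first time it is seen, streaming and preserving input order:

perl -ne 'print unless $seen{$_}++' filename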

perl sequence extraction loop

I have an existing perl one-liner (from the Edwards lab) that works wonderfully to read a text file (named ids.file) that contains one column of IDs and searches a second, specially formatted text file (named fasta.file in this example - in "fasta" format for those who know bioinformatics) and returns sequences that match the ID from the first file. I was hoping to expand this script to do two additional things:
The current perl one-liner only seems to work if the ids.file contains one column of data. I would like it to work on a file that contains two columns (separated by spaces), and act on the second column of data (well, really any column of data, but I assume that it will be obvious enough to adapt it if someone can give an example using a second column)
I would like to append any results returned from the search as a third column, instead of just writing them to a new file.
If someone is kind enough to offer an example but only has time or inclination to work on one of these, I would prefer that you try to solve #2. I have come close to solving #1 with a for loop that uses awk to apply the Perl code only to the second column; I haven't gotten it yet, but I am close, so #2 seems like the harder one to me.
The Perl one-liner is as follows:
perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.file fasta.file
I appreciate any help you can give!
Not quite sure but will this do?
perl -ne 'chomp; s/^>(\S+).*/$c=$i{$1}/e; print if $c;
          $i{(/^\S*\s(\S*)$/)[0]}="$_ " if @ARGV' ids.file fasta.file
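For reference, here is a longer but more readable sketch of the same idea; it is entirely my own illustration rather than anything from the original answer, and the script name is hypothetical. It loads fasta.file into a hash keyed by sequence ID, then appends each matching sequence (flattened to one line) as a third column of ids.file:

#!/usr/bin/perl
# usage: perl append_seqs.pl fasta.file ids.file
use strict;
use warnings;

my ($fasta_file, $ids_file) = @ARGV;

# slurp the FASTA sequences into %seq, keyed by the header ID
my (%seq, $id);
open my $fa, '<', $fasta_file or die "$fasta_file: $!";
while (<$fa>) {
    chomp;
    if (/^>(\S+)/) { $id = $1 }              # header line: remember the ID
    elsif (defined $id) { $seq{$id} .= $_ }  # sequence line: append to it
}
close $fa;

# walk ids.file and append the sequence for the ID in the second column
open my $ids, '<', $ids_file or die "$ids_file: $!";
while (<$ids>) {
    chomp;
    my @cols = split ' ';
    my $seq = defined $cols[1] && defined $seq{$cols[1]} ? $seq{$cols[1]} : '';
    print "$_ $seq\n";
}
close $ids;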

Using a .fasta file to compute relative content of sequences

So, me being the 'noob' that I am, having been introduced to programming via Perl just recently, I'm still getting used to all of this. I have a .fasta file which I have to use, although I'm unsure if I'm able to open it, or if I have to work with it 'blindly', so to speak.
Anyway, the file that I have contains DNA sequences for three genes, written in this .fasta format.
Apparently it's something like this:
>label
sequence
>label
sequence
>label
sequence
My goal is to write a script to open and read the file, which I have gotten the hang of now. But I also have to read each sequence, compute the relative amounts of 'G' and 'C' within each sequence, and then write the names of the genes and their respective 'G' and 'C' content to a TAB-delimited file.
Would anyone be able to provide some guidance? I'm unsure what a TAB-delimited file is, and I'm still trying to figure out how to open a .fasta file to actually see the content. So far I've worked with .txt files which I can easily open, but not .fasta.
I apologise for sounding completely bewildered. I'd appreciate your patience. I'm not like you pros out there!!
I get that it's confusing, but you really should try to limit your question to one concrete problem, see https://stackoverflow.com/faq#questions
I have no idea what a ".fasta" file is, or what 'G' and 'C' are, but it probably doesn't matter.
Generally:
Open input file
Read and parse data. If it's in some strange format that you can't parse, go hunting on http://metacpan.org for a module to read it. If you're lucky someone has already done the hard part for you.
Compute whatever you're trying to compute
Print to screen (standard out) or another file.
A "TAB-delimite" file is a file with columns (think Excel) where each column is separated by the tab ("\t") character. As quick google or stackoverflow search would tell you..
Here is an approach using the awk utility, which can be used from the command line. The program below is executed by specifying its path: awk -f <path> <sequence file>
# NR>1 means only look at lines after line 1, because the sequence starts on line 2
NR > 1 {
    # this for-loop goes through all bases in the line
    for (i = 1; i <= length; i++) {
        # every position encountered increases "total" by 1, counting total bases
        total++
        # if the position (a one-character substring) is c or g, upper or lower
        # (some bases are lowercase in some fasta files), bump that counter
        if (substr($0, i, 1) == "c" || substr($0, i, 1) == "C")
            c++
        else if (substr($0, i, 1) == "g" || substr($0, i, 1) == "G")
            g++
    }
}
END {
    # this END block prints the gene name and the G and C content
    # in percent, separated by tabs
    print "Gene name\tG content:\t" (100 * g / total) "%\tC content:\t" (100 * c / total) "%"
}

SIPp: generate random string in scenario

I have created a program that uses SIPp for SIP traffic generation. I would like to generate a random number for the destination at run time, without injecting it from an external CSV. Currently I am doing the same for the originator using the [service] command. Is there another command I can use from the command line? Can I generate a random number from inside the scenario?
I don't think it is possible to make SIPp directly generate a random number. But if you have access to common Unix utilities, you can provide one via the command line.
I am not sure exactly what you want to do, though.
If you want to perform one call, you can provide the random destination on the command line via the -set command-line parameter.
Example:
mydest=`n=8; rand -M $((10**${n})) | awk "{ printf(\"%0${n}u\", \\$1) }"`
sipp ... -set service_route mydest $mydest
(replace n=8 with the number of digits you want; if you don't want a fixed number of digits, just remove the awk part)
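If the rand utility is not installed (it is not standard everywhere), Perl can produce the same kind of value; a sketch with eight digits hard-coded for brevity:

mydest=`perl -e 'printf "%08u", int rand(10**8)'`

and then pass it to sipp with -set exactly as above.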
Then you declare your variable at the beginning of your SIPp script:
<Global variables="mydest" />
<Reference variables="mydest" />
Afterwards you can place it in SIP messages by using [$mydest].
But if you want to perform lots of calls from the same SIPp launch, you can generate a CSV file of random numbers on the fly.
Example:
n=8; echo "RANDOM" > zrandom; rand -e -N 1000000 -d "\n" -M $((10**${n})) | awk "{ printf(\"%0${n}u\n\", \$1) }" >> zrandom
sipp ... -inf zrandom
(same remark: replace n=8 with the number of digits you want; if you don't want a fixed number of digits, just remove the awk part)
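As above, if rand is unavailable, a Perl sketch writes an equivalent file (eight digits and one million values, both arbitrary choices):

perl -e 'print "RANDOM\n"; printf "%08u\n", int rand(10**8) for 1..1000000' > zrandom
sipp ... -inf zrandom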