Multiple file upload with Mojolicious fails on a large number of files - perl

I've hit a wall and my Google skills have failed me this time. I'm learning Mojolicious to build a front end for a series of Perl scripts that I use frequently. I've got a long way through it, but I'm stumped by multiple file uploads once the total number of files reaches 950.
Previously, I hit a problem where multiple file uploads would start but stop once the total size reached 16 MB. I fixed that by setting $ENV{MOJO_MAX_MESSAGE_SIZE} = 50000000000. This problem is different, however. To illustrate, this is the part of my script where I try to grab the uploaded files:
my $files = $self->req->every_upload('localfiles');
for my $file ( @{$files} ) {
    my $fileName = $file->filename =~ s/[^\w\d\.]+/_/gr;
    $file->move_to("temporary_uploads/$fileName");
    $self->app->log->debug("$fileName uploaded\n");
    push @fileNames, $fileName;
}
say "FILES: ".scalar(@fileNames);
I apologise if it's ugly. If I upload 949 files, my array @fileNames is populated correctly, but if I upload 950 files, the array ends up empty, and $files seems to be empty as well. If anyone has any ideas or pointers towards a solution, I would be extremely grateful!

If I attempt to upload 949 files, my array @fileNames is populated correctly, but if I try to upload 950 files, my array ends up empty, and it seems as though $files is empty also.
That means the process is running out of file descriptors. In particular, the default limit on Linux is 1024:
For example, the kernel default for maximum number of file descriptors (ulimit -n) was 1024/1024 (soft, hard), and has been raised to 1024/4096 in Linux 2.6.39.
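One way to check that theory from inside the application is to ask the process for its descriptor limit and count how many descriptors it currently holds. The following is only a minimal sketch: it assumes Linux (the /proc/self/fd directory is Linux-specific) and uses just the core POSIX module. Whether each upload really holds one open descriptor is an assumption to verify, not something this snippet proves.

```perl
use strict;
use warnings;
use POSIX qw(sysconf _SC_OPEN_MAX);

# Soft limit on open file descriptors for this process
# (the value `ulimit -n` reports in the shell that started it).
my $limit = sysconf(_SC_OPEN_MAX);

# Count the descriptors currently open. /proc/self/fd exists on Linux only,
# so on other systems $open simply stays 0.
my $open = 0;
if (opendir my $dh, '/proc/self/fd') {
    $open = grep { /^\d+$/ } readdir $dh;
    closedir $dh;
}

print "descriptor limit: $limit, currently open: $open\n";
```

If the limit turns out to be 1024, raising it with ulimit -n in the shell that starts the application (up to the hard limit) is one thing to try.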

Related

Perl: Run script on multiple files in multiple directories

I have a perl script that reads a .txt and a .bam file, and creates an output called output.txt.
I have a lot of files that are all in different folders, but are only slightly different in the filename and directory path.
All of my txt files are in different subfolders called PointMutation, with the full path being
/Volumes/Lab/Data/Darwin/Patient/[Plate 1/P1H10]/PointMutation
The text in brackets is the part that changes, but the Patient subfolder contains all of my txt files.
My .bam file is located in a subfolder named DNA with a full path of
/Volumes/Lab/Data/Darwin/Patient/[Plate 1/P1H10]/SequencingData/DNA
Currently I run the script from the terminal like this:
cd /Volumes/Lab/Data/Darwin/Patient/[Plate 1/P1H10]/PointMutation
perl ~/Desktop/Scripts/Perl.pl "/Volumes/Lab/Data/Darwin/Patient/[Plate 1/P1H10]/PointMutation/txtfile.txt" "/Volumes/Lab/Data/Darwin/Patient/[Plate 1/P1H10]/SequencingData/DNA/bamfile.bam"
With only one or two files that is fairly easy, but I would like to automate it as the number of files grows. Also, once I have run a folder I don't want to process it again, but I will keep getting more data for the same patient. Is there a way to exclude a folder from being read?
I would do something like:
# two path components vary below Patient/, e.g. "Plate 1/P1H10"
for my $dir (glob "/Volumes/Lab/Data/Darwin/Patient/*/*/"){
    # skip if not a directory
    if (! -d $dir) {
        next;
    }
    my $txt = "$dir/PointMutation/txtfile.txt";
    my $bam = "$dir/SequencingData/DNA/bamfile.bam";
    # ... your magical stuff here
}
This is assuming that all directories under /Volumes/Lab/Data/Darwin/Patient/ follow the convention.
That said, a more robust long-term way of organizing analyses with lots of different files all over the place is either 1) to put all files needed for each analysis under one directory, or 2) to create meta files (I'd use JSON or YAML) that list the necessary file names.
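As a rough illustration of the meta-file idea, here is a minimal sketch using the core JSON::PP module. The manifest layout, the "done" flag, and the second sample entry (P1H11) are all made up for this example. The flag is just one possible way to avoid re-running an analysis that has already been processed.

```perl
use strict;
use warnings;
use JSON::PP;    # ships with Perl since 5.14

# Hypothetical manifest: one entry per analysis, listing its input files.
# In practice this would live in a file next to the data.
my $manifest_json = <<'JSON';
[
  { "txt":  "Plate 1/P1H10/PointMutation/txtfile.txt",
    "bam":  "Plate 1/P1H10/SequencingData/DNA/bamfile.bam",
    "done": false },
  { "txt":  "Plate 1/P1H11/PointMutation/txtfile.txt",
    "bam":  "Plate 1/P1H11/SequencingData/DNA/bamfile.bam",
    "done": true }
]
JSON

my $entries = decode_json($manifest_json);

# Keep only the analyses that have not been run yet.
my @todo = grep { !$_->{done} } @$entries;

for my $entry (@todo) {
    print "would process: $entry->{txt} + $entry->{bam}\n";
}
```

After a successful run, the script could set "done" to true and write the manifest back, so the same folder is skipped next time.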

Reading part of a file into a Stream in Powershell

I have some files which are 'offset' Zip files, in that they have 4 extra bytes at the beginning which must be ignored when extracting them.
I've been using ReadAllBytes/WriteAllBytes (with an offset of 4). That works, but obviously I have to read, write, and then read the file again, which is slow.
I'd prefer to use System.IO.Compression.ZipArchive to read from a Stream loaded from the file (sans the first 4 bytes), but I cannot figure out the steps required to do that.
I tried 'Seek', but ZipArchive ignores the stream position.
I cannot seem to get byte arrays to pass into System.IO.Compression at all...
Ideas?
Finally!
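After trying all manner of hoop-jumping, it seems the simplest answer was the right one:
# Windows PowerShell does not load this assembly by default
Add-Type -AssemblyName System.IO.Compression
$bytes = [System.IO.File]::ReadAllBytes("file.zip4")
# wrap everything after the first 4 bytes in a stream, without writing to disk
$ms = New-Object System.IO.MemoryStream -ArgumentList $bytes,4,($bytes.Length-4)
$arch = New-Object System.IO.Compression.ZipArchive($ms)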
I can then process $arch.Entries and extract things just fine - reading the file once and processing it instead of reading it, writing 'most' of it back to disc, reading that file back again!!

Matlab publish - Want to use a custom file name to publish several pdf files

I have several data log files (here: 34) for which I have to calculate certain values. I wrote a separate function to publish the results of the calculation in a PDF file. But I can only publish one file after another, so it takes a while to publish all 34 files.
Now I want to automate that with a loop: import the data, calculate the values, and publish the results for every log file in a new PDF file. At the end I want one PDF per log file, 34 in total.
My problem is that I couldn't find a way to rename the PDF files during publishing. The PDF is always named after the script that calculates the values, so inside a loop it is overwritten each time. At the end everything is calculated, but I only have the PDF from the last log file.
There was this hacky solution to change the Matlab publish script, but since I don't have admin rights I can't use that:
"This is really hacky, but I would modify publish to accept a new option prefix. Replace line 93
[scriptDir,prefix] = fileparts(fullPathToScript);
with
if ~isfield(options, 'prefix')
    [scriptDir,prefix] = fileparts(fullPathToScript);
else
    [scriptDir,~] = fileparts(fullPathToScript);
    prefix = options.prefix;
end
Now you can set options.prefix to whatever filename you want. If you want to be really hardcore, make the appropriate modifications to supplyDefaultOptions and checkOptionFields as well."
Any suggestions?
Thanks in advance,
Martin
Here's one idea using movefile to rename the resultant published PDF on each iteration:
for i = 1:34
    file = publish(files(i)); % Replace with your own command(s)
    [pathStr,fileName,ext] = fileparts(file);
    newFile = [pathStr filesep() fileName '_' int2str(i) ext]; % Example: append _# to each
    [success,msg,msgid] = movefile(file,newFile);
    if ~success
        error(msgid,msg);
    end
end
Also used are fileparts and filesep. See this question for other ways to rename and move files.

Perl Copy to more than one Location

Hello, I'm attempting to copy a file to more than one location. I have written a script that uses File::Find and File::Copy together to copy files to one location, but I cannot get it to copy to a second or third location. I tried adding a second variable, $target2, so that I can also copy the JPG files to a second target location, but when I do this I get an error message. I want the copy to run in a loop that repeats every X seconds, so I added the sleep function as well. Can anyone explain why I can't copy to more than one location, or help me find a way to do this? Thank you.
while (1) {
    sleep(10);
    find(
        sub {
            if (-f && /\.jpg$/i) {
                print "$File::Find::name -> $target, $target2";
                copy($File::Find::name, $target, $target2)
                    or die(q{copy failed:} . $!);
            }
        },
        @source
    );
}
Error message: Bad Buffer size for copy: 0
The copy function can't copy to two destinations. Call it twice:
copy($File::Find::name, $target);
copy($File::Find::name, $target2);
Look here to see the parameters explanation: http://perldoc.perl.org/File/Copy.html#SYNOPSIS
As the docs say:
An optional third parameter can be used to specify the buffer size
used for copying. This is the number of bytes from the first file,
that will be held in memory at any given time, before being written to
the second file.
So when you passed a third parameter, Perl took it as a manually specified buffer size. But you gave it a string instead of a number, so the string was converted to the number 0, which produced the error:
Bad Buffer size for copy: 0
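If the number of destinations may grow, the "call it once per destination" advice can be written as a loop over a list of targets. This is a minimal sketch, not the asker's actual script: the temporary directories and the file name photo.jpg are placeholders so that the example runs on its own.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Temp qw(tempdir);
use File::Spec;

# Throwaway directories stand in for the real source and target locations.
my $source_dir = tempdir(CLEANUP => 1);
my @targets    = (tempdir(CLEANUP => 1), tempdir(CLEANUP => 1));

# Create a small stand-in file to copy.
my $file = File::Spec->catfile($source_dir, 'photo.jpg');
open my $fh, '>', $file or die "cannot create $file: $!";
print {$fh} "not really a JPEG\n";
close $fh or die "cannot close $file: $!";

# One copy() call per destination: each call gets exactly a source and a target.
# When the target is an existing directory, copy() keeps the source's base name.
for my $target (@targets) {
    copy($file, $target) or die "copy to $target failed: $!";
}

my @copied = grep { -f File::Spec->catfile($_, 'photo.jpg') } @targets;
print scalar(@copied), " copies made\n";
```

Inside the File::Find callback from the question, the same loop would replace the single three-argument copy call.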

Using a .fasta file to compute relative content of sequences

So me being the 'noob' that I am, being introduced to programming via Perl just recently, I'm still getting used to all of this. I have a .fasta file which I have to use, although I'm unsure if I'm able to open it, or if I have to work with it 'blindly', so to speak.
Anyway, the file that I have contains DNA sequences for three genes, written in this .fasta format.
Apparently it's something like this:
>label
sequence
>label
sequence
>label
sequence
My goal is to write a script that opens and reads the file, which I have gotten the hang of now. But I then have to read each sequence, compute the relative amounts of 'G' and 'C' within each sequence, and write the gene names and their respective 'G' and 'C' content to a TAB-delimited file.
Would anyone be able to provide some guidance? I'm unsure what a TAB-delimited file is, and I'm still trying to figure out how to open a .fasta file to actually see the content. So far I've worked with .txt files which I can easily open, but not .fasta.
I apologise for sounding completely bewildered. I'd appreciate your patience. I'm not like you pros out there!!
I get that it's confusing, but you really should try to limit your question to one concrete problem, see https://stackoverflow.com/faq#questions
I have no idea what a ".fasta" file or 'G' and 'C' are, but it probably doesn't matter.
Generally:
1) Open the input file.
2) Read and parse the data. If it's in some strange format that you can't parse, go hunting on http://metacpan.org for a module to read it. If you're lucky, someone has already done the hard part for you.
3) Compute whatever you're trying to compute.
4) Print to the screen (standard out) or to another file.
A "TAB-delimited" file is a file with columns (think Excel) where each column is separated by the tab ("\t") character, as a quick Google or Stack Overflow search would tell you.
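The general steps above can be sketched in Perl roughly as follows. This is a minimal illustration under some assumptions: the sample records (geneA, geneB) are made up and read from an in-memory string so the example is self-contained, and "relative content" is taken to mean the fraction of bases in each sequence that are G or C respectively. For a real file, open it by path instead of the string reference.

```perl
use strict;
use warnings;

# Made-up sample data in FASTA layout: a ">label" line, then sequence lines.
my $fasta = <<'FASTA';
>geneA
GGCC
ATAT
>geneB
acgt
FASTA

my (@order, %seq);
my $label;
open my $in, '<', \$fasta or die "cannot read sample data: $!";
while (my $line = <$in>) {
    chomp $line;
    if ($line =~ /^>(\S+)/) {    # header line: start a new record
        $label = $1;
        push @order, $label;
        $seq{$label} = '';
    }
    elsif (defined $label) {     # sequence line: append to the current record
        $seq{$label} .= $line;
    }
}
close $in;

my @rows;
for my $name (@order) {
    my $s   = uc $seq{$name};    # some files use lowercase bases
    my $len = length $s;
    my $g   = ($s =~ tr/G//);    # tr/// without replacement just counts
    my $c   = ($s =~ tr/C//);
    push @rows, join "\t", $name,
        $len ? sprintf('%.3f', $g / $len) : 'NA',
        $len ? sprintf('%.3f', $c / $len) : 'NA';
}

# TAB-delimited output: a header row, then one row per gene.
print join("\t", qw(gene G_content C_content)), "\n";
print "$_\n" for @rows;
```

Redirecting the output to a file (or printing to a filehandle opened for writing) gives the TAB-delimited file the question asks for.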
Here is an approach using the awk utility, which can be run from the command line. Save the following program to a file and run it with awk -f <path> <sequence file>. It assumes the file holds a single sequence whose label is on line 1.
# NR>1 skips line 1, the ">label" header, because you said the sequence starts on line 2
NR>1 {
    # this for-loop goes through all bases in the line
    for (i = 1; i <= length($0); i++) {
        # every position counts towards the total number of bases
        total++
        base = substr($0, i, 1)
        # some bases are lowercase in some fasta files, so match both cases
        if (base == "c" || base == "C")
            c++
        else if (base == "g" || base == "G")
            g++
    }
}
END {
    # this END block prints the G and C content as percentages, separated by tabs
    print "Gene name\tG content:\t" (100*g/total) "%\tC content:\t" (100*c/total) "%"
}