perl code taking values from three files and printing in another file - perl

I have one file named 1.txt having values like:
a
b
c
...
Second file named 2.txt like this:
a 123,
a 156,
a 899,
c 255,
Third file named 3.txt like this:
a 236,
a 890,
b 123,
How can I read the values from all three files above and write my results to a single file like the one below:
a 123 236,
- 156 890,
- 899 -,
b - 123,
The files do not have equal numbers of lines, and none is longer than about 10,000 lines. I have to use Perl for this.
I have to take the values from the first file.
Then, for each value from the first file, I have to take the corresponding values from the second column of the second file.
Similarly, I have to take the corresponding values from the third file.
Finally, I have to write my results to an output file with the values from the first file in the first column, all the corresponding values from the second file in the second column, and all the corresponding values from the third file in the third column.

Read the three files: the first can be read into a simple array, the other two into hashes where each hash value is a reference to an array.
Then read through the first array in sorted order.
For each value in the first array, look up the corresponding arrays from the second and third files (the hashes), sorting the array values into order. Handle the printing carefully: pad missing values in the second and third columns with '-', and print repeated values in the first column as '-' after the first row.

I would start by reading the contents of 2.txt and storing it in a hash of arrayrefs. For example, if 2.txt is this:
a 123,
a 156,
a 899,
c 255,
then the data structure would be this:
{ a => [123, 156, 899],
c => [255]
}
Then, do the same for 3.txt.
Only then does it make sense to read through 1.txt. For each line of 1.txt, you would look up the appropriate arrays in the above data structures and iterate over both in parallel (using something like for (my $i = 0; $i <= $#$two || $i <= $#$three; ++$i)), printing your results and substituting '-' wherever one of the arrays runs out.
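Since both answers only describe the hash-of-arrayrefs approach, here is a runnable sketch of the same idea, written in Python for illustration (the sample data from the question is inlined so the logic is easy to check):

```python
from collections import defaultdict

def read_pairs(lines):
    """Parse lines like 'a 123,' into {key: [values...]}."""
    data = defaultdict(list)
    for line in lines:
        key, value = line.rstrip(",\n").split()
        data[key].append(value)
    return data

def merge(keys, second, third):
    """Yield output rows, padding missing values with '-' and
    printing repeated keys as '-' after their first row."""
    for key in keys:
        a, b = second.get(key, []), third.get(key, [])
        for i in range(max(len(a), len(b), 1)):
            col1 = key if i == 0 else "-"
            col2 = a[i] if i < len(a) else "-"
            col3 = b[i] if i < len(b) else "-"
            yield f"{col1} {col2} {col3},"

second = read_pairs(["a 123,", "a 156,", "a 899,", "c 255,"])
third = read_pairs(["a 236,", "a 890,", "b 123,"])
for row in merge(["a", "b", "c"], second, third):
    print(row)
```

On the question's sample data this prints the a/b rows exactly as in the desired output, plus a trailing `c 255 -,` row for the key that appears only in 2.txt.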


Is there a way to save each HDF5 data set as a .csv column?

I'm struggling with an H5 file, trying to extract and save its data as a multi-column CSV. As shown in the picture, the structure of the h5 file consists of main groups (Genotypes, Positions, and taxa). The main group Genotypes contains more than 1500 subgroups (genotype partial names), and each subgroup contains sub-subgroups (complete names of genotypes). There are about 1 million datasets (named "calls"), each one inside one sub-subgroup, and I need each of them written to a separate column. The problem is that when I use h5py (the group.get function) I have to use the path of each calls dataset. I extracted all the paths ending in "calls", but I can't reach all 1 million calls to get them into a CSV file.
Could anybody help me extract the "calls" datasets, which are 8-bit integers, as separate columns in a CSV file?
By running the code in the first answer I get this error:
Traceback (most recent call last):
  File "path/file.py", line 32, in <module>
    h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!
  File "path/file.py", line 565, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py\h5o.pyx", line 355, in h5py.h5o.visit
  File "h5py\defs.pyx", line 1641, in h5py.defs.H5Ovisit_by_name
  File "h5py\h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "path/file.py", line 564, in proxy
    return func(name, self[name])
  File "path/file.py", line 10, in dump_calls2csv
    np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')
  File "<__array_function__ internals>", line 6, in savetxt
  File "path/file.py", line 1377, in savetxt
    open(fname, 'wt').close()
OSError: [Errno 22] Invalid argument: 'Genotypes_ArgentineFlintyComposite-C(1)-37-B-B-B2-1-B25-B2-B?-1-B:100000977_calls.csv'
16-May-2020 Update:
Added a second example that reads and exports using PyTables (aka tables) with .walk_nodes(). I prefer this method over h5py's .visititems().
For clarity, I separated the code that creates the example file from the
2 examples that read and export the CSV data.
Enclosed below are 2 simple examples that show how to recursively loop on all top level objects. For completeness, the code to create the test file is at the end of this post.
Example 1: with h5py
This example uses the .visititems() method with a callable function (dump_calls2csv).
Summary of this procedure:
1) Checks for dataset objects with calls in the name.
2) When it finds a matching object it does the following:
a) reads the data into a Numpy array,
b) creates a unique file name (using string substitution on the H5 group/dataset path name to ensure uniqueness),
c) writes the data to the file with numpy.savetxt().
import h5py
import numpy as np

def dump_calls2csv(name, node):
    if isinstance(node, h5py.Dataset) and 'calls' in node.name:
        print('visiting object:', node.name, ', exporting data to CSV')
        csvfname = node.name[1:].replace('/', '_') + '.csv'
        arr = node[:]
        np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')

##########################
with h5py.File('SO_61725716.h5', 'r') as h5r:
    h5r.visititems(dump_calls2csv)  # NOTE: function name is NOT a string!
If you want to get fancy, you can replace arr in np.savetxt() with node[:].
Also, if you want headers in your CSV, extract and reference the dtype field names from the dataset (I did not create any in this example).
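As a side note on the OSError in the question: [Errno 22] comes from the dataset path containing characters such as ? and : that Windows does not allow in file names. A minimal sketch of sanitizing the name before calling np.savetxt (the character set below is my assumption based on the Windows file-name rules, not part of the original answer):

```python
import re

def safe_csv_name(h5_path):
    """Build a CSV file name from an HDF5 path, replacing characters
    that are illegal in Windows file names (e.g. ? : * " < > |)."""
    name = h5_path.lstrip("/").replace("/", "_")
    return re.sub(r'[\\?%*:|"<>]', "-", name) + ".csv"

print(safe_csv_name("/Genotypes/Foo/B?-1-B:100_calls"))
```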
Example 2: with PyTables (tables)
This example uses the .walk_nodes() method with a filter: classname='Leaf'. In PyTables, a leaf can be any of the storage classes (the Array types and Table).
The procedure is similar to the method above. walk_nodes() simplifies the process to find datasets and does NOT require a call to a separate function.
import tables as tb
import numpy as np

with tb.File('SO_61725716.h5', 'r') as h5r:
    for node in h5r.walk_nodes('/', classname='Leaf'):
        print('visiting object:', node._v_pathname, ', exporting data to CSV')
        csvfname = node._v_pathname[1:].replace('/', '_') + '.csv'
        np.savetxt(csvfname, node.read(), fmt='%d', delimiter=',')
For completeness, use the code below to create the test file used in the examples.
import h5py
import numpy as np

ngrps = 2
nsgrps = 3
nds = 4
nrows = 10
ncols = 2

with h5py.File('SO_61725716.h5', 'w') as h5w:
    for gcnt in range(ngrps):
        grp1 = h5w.create_group('Group_' + str(gcnt))
        for scnt in range(nsgrps):
            grp2 = grp1.create_group('SubGroup_' + str(scnt))
            for dcnt in range(nds):
                i_arr = np.random.randint(1, 100, (nrows, ncols))
                ds = grp2.create_dataset('calls_' + str(dcnt), data=i_arr)

Matlab: Read a certain file and make a calculation on columns

I have files named 2.txt, 4.txt, 8.txt, 12.txt, and 14.txt, and each file has the same column structure (the original post showed a screenshot; the columns include A, I, Angle, and FWHM).
I want to read a designated file and do some calculations with designated columns; for instance, after loading 2.txt I want to calculate
column(A) + column(I)
The questions
How can I open a specific file by its name?
How can I do calculations with the columns of that file?
Here is my code
function [t] = ad(x)
    folderName = 'C:\Users\zeldagrey6\Desktop\AD';
    fileinfo = dir([folderName filesep '**/*.txt']);
    filename = {fileinfo.name};
    fullFileName = [folderName filesep filename{x}];
    d = readtable(fullFileName, 'ReadVariableNames', true);
    t = d.A + d.I;
end
The problems of the code
When I call ad(2) it reads 4.txt instead of 2.txt. I guess it does not go by the file names but simply reads them in the order dir returns them.
Is there any way to refer to the columns as var1, var2, etc. and do calculations with var1 + var2 instead of d.A + d.I?
Yes, you can refer to table contents with curly brackets like this:
A = (30.1:0.1:30.5)';
I = (324:328)';
Angle = (35:5:55)';
FWHM = (0.2:0.05:0.4)';
d = table(A, I, Angle, FWHM);
t1 = d.A + d.I;
t2 = d{:,1} + d{:,2};
See that t1 and t2 are equal
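The underlying fix for the ad(2) problem is to build the file name from the number (in MATLAB, e.g. sprintf('%d.txt', x)) instead of indexing the sorted output of dir. The same two ideas, selecting a file by constructed name and selecting columns by header name, sketched in Python for illustration with a hypothetical in-memory file:

```python
import csv
import io

def column_sum(file_like, col1="A", col2="I"):
    """Read a delimited file with a header row and return the
    element-wise sum of two columns selected by name."""
    reader = csv.DictReader(file_like)
    return [float(row[col1]) + float(row[col2]) for row in reader]

# Build the file name from the number instead of indexing a
# directory listing (the path mirrors the asker's naming scheme):
# path = f"C:/Users/zeldagrey6/Desktop/AD/{n}.txt"
sample = io.StringIO("A,I\n30.1,324\n30.2,325\n")
print(column_sum(sample))
```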

Converting a .txt file with 1 million digits of "e" into a vector in matlab

I have a text file with 1 million decimal digits of the number e, with 80 digits on each line except the first and last lines, which have 76 and 4 digits; the file has 12501 lines. I want to convert it into a vector in MATLAB with one digit per row. I tried the num2str function, but the problem is that each line gets converted into something like '7.1828e79' (13 characters). What can I do?
P.S.1: The first two lines of the text file (76 and 80 digits) are:
7182818284590452353602874713526624977572470936999595749669676277240766303535 47594571382178525166427427466391932003059921817413596629043572900334295260595630
P.S.2: I used "dlmread" and got a 12501x1 vector, with the first and second row of 7.18281828459045e+75 and 4.75945713821785e+79 and the problem is that when I use num2str for example for the first row value, I get: '7.182818284590453e+75' as a string and not the whole 76 digits. My aim was to do something like this:
e1 = dlmread('e.txt');
es1 = num2str(e1);
for i = 1:12501
    for j = 1:length(es1(1,:))
        a1((i-1)*length(es1(1,:))+j) = es1(i,j);
    end
end
e_digits = a1.';
but I get a string like this:
a1='7.182818284590453e+754.759457138217852e+797.381323286279435e+799.244761460668082e+796.133138458300076e+791.416928368190255e+79 5...'
with 262521 characters instead of 1 million digits.
P.S.3: I think the problem might be solved if I can manipulate the text file in a way that I have one digit on each line and simply use dlmread.
Well, this is not hard; there are many ways to do it.
First, load your file as a char array (a char array makes it easy to strip out the line breaks) using something simple like:
C = fileread('yourfile.txt'); %loads file as Char Array
D = C(~isspace(C)); %Removes SPACES which are line-breaks
Next, append a SPACE after each character (this is because you want to use the str2num transform, and MATLAB needs to see the separator); you can do this using a RESHAPE, a STRTRIM, or simply a REGEX:
E = strtrim(regexprep(D, '.{1}', '$0 ')); %Add a SPACE after each Numeric Digit
Now you can transform it using str2num into a Vector:
e_digits = str2num(E)'; %Converts the Char Array to a numeric Vector
which will give you a single digit each row.
Using your example, I get a vector of 156 x 1, with 1 digit each row as you required.
You can get a digit per row like this
fid = fopen('e.txt','r');
c = textscan(fid,'%s');
c=cat(1,c{:});
c = cellfun(@(x) str2num(reshape(x,[],1)), c, 'un', 0);
c=cat(1,c{:});
And it is not the only possible way.
Could you please tell what is the final task, how do you plan using the array of e digits?
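For comparison, the digit-per-row conversion is essentially a one-liner in most languages; a Python sketch of the same idea (keep every digit character, drop whitespace and line breaks):

```python
def digits_vector(text):
    """Flatten a multi-line block of digits into a list with one
    digit per entry, ignoring whitespace and line breaks."""
    return [int(ch) for ch in text if ch.isdigit()]

sample = "7182818284\n5904523536\n"  # first 20 fractional digits of e
v = digits_vector(sample)
print(len(v), v[:5])
```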

Extracting and joining exons from multiple sequence alignments

Using my (fairly) basic coding skills, I have put together a script that will parse an aligned multi-fasta file (a multiple sequence alignment) and extract all the data between two specified columns.
use Bio::SimpleAlign;
use Bio::AlignIO;
$str = Bio::AlignIO->new(-file => $inputfilename, -format => 'fasta');
$aln = $str->next_aln();
$mini = $aln->slice($array[0], $array[1]);
$out = Bio::AlignIO->new(-file => $array[3], -format => 'fasta');
$out->write_aln($mini);
The problem I have is that I want to be able to slice multiple regions from the same alignment and then join these regions prior to writing to an outfile. The complication is that I want to supply a file with a list of co-ordinates where each line contains two or more co-ordinates between which data should be extracted and joined.
Here is an example co-ordinate file
ORF1, 10, 50, exon1 # The above line should produce a slice between columns 10-50 and write to an outfile
ORF2, 70, 140, exon1
ORF2, 190, 270, exon2
ORF2, 500, 800, exon3 # Data should be extracted between the ranges specified here and in the above two lines and then joined (side by side) to produce the outfile.
ORF3, 1200, 1210, exon1
etc etc
And here is an (small) example of an aligned fasta file
\>Sample1
ATGGCGACCGTGCACTACTCCCGCCGACCTGGGACCCCGCCGGTCACCCTCACGTCGTCC
CCCAGCATGGATGACGTTGCGACCCCCATCCCCTACCTACCCACATACGCCGAGGCCGTG
GCAGACGCGCCCCCCCCTTACAGAAGCCGCGAGAGTCTGGTGTTCTCCCCGCCTCTTTTT
CCTCACGTGGAGAATGGCACCACCCAACAGTCTTACGATTGCCTAGACTGCGCTTATGAT
GGAATCCACAGACTTCAGCTGGCTTTTCTAAGAATTCGCAAATGCTGTGTACCGGCTTTT
TTAATTCTTTTTGGTATTCTCACCCTTACTGCTGTCGTGGTCGCCATTGTTGCCGTTTTT
CCCGAGGAACCTCCCAACTCAACTACATGA
\>Sample2
ATGGCGACCGTGCACTACTCCCGCCGACCTGGGACCCCGCCGGTCACCCTCACGTCGTCC
CCCAGCATGGATGACGTTGCGACCCCCATCCCCTACCTACCCACATACGCCGAGGCCGTG
GCAGACGCGCCCCCCCCTTACAGAAGCCGCGAGAGTCTGGTGTTCTCCCCGCCTCTTTTT
CCTCACGTGGAGAATGGCACCACCCAACAGTCTTACGATTGCCTAGACTGCGCTTATGAT
GGAATCCACAGACTTCAGCTGGCTTTTCTAAGAATTCGCAAATGCTGTGTACCGGCTTTT
TTAATTCTTTTTGGTATTCTCACCCTTACTGCTGTCGTGGTCGCCATTGTTGCCGTTTTT
CCCGAGGAACCTCCCAACTCAACTACATGA
I think there should be a fairly simple way to solve this problem, potentially using the information in the first column, paired with the exon number, but I can't for the life of me figure out how this can be done.
Can anyone help me out?
The aligned fasta file you posted -- at least as it appears on the stackoverflow web page -- did not compile. According to https://en.wikipedia.org/wiki/FASTA_format, the description lines should begin with a >, not with \>.
Be sure to run all Perl programs with use strict; use warnings;. This will facilitate debugging.
You have not populated @array. Consequently, you can expect to get errors like these:
Use of uninitialized value $start in pattern match (m//) at perl-5.24.0/lib/site_perl/5.24.0/Bio/SimpleAlign.pm line 1086, <GEN0> line 16.
Use of uninitialized value $start in concatenation (.) or string at perl-5.24.0/lib/site_perl/5.24.0/Bio/SimpleAlign.pm line 1086, <GEN0> line 16.
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Slice start has to be a positive integer, not []
STACK: Error::throw
STACK: Bio::Root::Root::throw perl-5.24.0/lib/site_perl/5.24.0/Bio/Root/Root.pm:444
STACK: Bio::SimpleAlign::slice perl-5.24.0/lib/site_perl/5.24.0/Bio/SimpleAlign.pm:1086
STACK: fasta.pl:26
Once you assign plausible values, e.g.,
@array = (1,17);
... you will get more plausible results:
$ perl fasta.pl
>Sample1/1-17
ATGGCGACCGTGCACTA
>Sample2/1-17
ATGGCGACCGTGCACTA
HTH!
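The slice-and-join logic itself is language-agnostic; here is a minimal Python sketch of it, assuming the alignment is already in memory as a name-to-sequence dict and using 1-based inclusive coordinates to match Bio::SimpleAlign->slice (the helper name join_exons is hypothetical):

```python
from collections import defaultdict

def join_exons(sequences, coords):
    """Slice each aligned sequence at every (start, end) pair of an
    ORF (1-based, inclusive) and join the pieces side by side."""
    ranges = defaultdict(list)
    for orf, start, end, _exon in coords:
        ranges[orf].append((start, end))
    return {
        orf: {name: "".join(seq[s - 1:e] for s, e in pairs)
              for name, seq in sequences.items()}
        for orf, pairs in ranges.items()
    }

seqs = {"Sample1": "ATGGCGACCGTGCACTACTCC"}
coords = [("ORF1", 1, 6, "exon1"), ("ORF1", 10, 12, "exon2")]
print(join_exons(seqs, coords))
```

Grouping the coordinate rows by the first column (the ORF name) is exactly the pairing the question asks about; each ORF's slices are then concatenated per sample before writing out.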

Matlab: finding/writing data from mutliple files by column header text

I have an array which I read into Matlab with importdata. It has 5 header lines
file = 'aoao.csv';
s = importdata(file,',', 5);
Matlab automatically treats the last line as the column header. I can then call up the column number that I want with
s.data(:,n); %n is desired column number
I want to load many similar files at once and then pull out, from each file, the column whose header matches a given name (which is not necessarily the same column number in every file). I then want to write all of these columns together into a new matrix, preferably with each column labelled with its file name.
What should I do?
samp = 'len-c.mp3'; %# define desired sample/column header name
file = dir('*.csv');
Have the directory ready as the current folder; dir creates a detailed listing of the files.
for i = 1:length(file)
    set(i) = importdata(file(i).name, ',', 5);
end
This imports the data from each of the files (comma-delimited, 5 header lines) into a struct array called set.
for k = 1:14
    for i = 1:length(set(k).colheaders)
        TF = strcmp(set(k).colheaders(i), samp); %compare strings for a match
        if TF == 1 %if the match is true
            group(:,k) = set(k).data(:,i); %save the matching column to 'group'
        end
    end
end
This retrieves the data under the named column header from each file.
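The look-up-by-header idea translates directly to other tools; a Python sketch for illustration (the 4 junk lines and the column name len-c.mp3 mirror the question's 5-header-line layout; the helper name is hypothetical):

```python
import csv
import io

def column_by_header(file_like, header, skip=4):
    """Return the named column from a file where 4 junk lines precede
    the column-name row, mirroring importdata(file, ',', 5)."""
    for _ in range(skip):
        next(file_like)
    reader = csv.DictReader(file_like)
    return [row[header] for row in reader]

sample = io.StringIO("j1\nj2\nj3\nj4\nlen-c.mp3,other\n1,9\n2,8\n")
print(column_by_header(sample, "len-c.mp3"))
```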