Genomic Ranges - Merge Overlaps in Single File (RStudio)

I would like to find overlapping regions in a file and merge them, keeping the earlier start and the later stop (merging 2 regions into 1).
I meant to use GenomicRanges but I am not sure how to write the script.
This is what the file fileA.txt contains:
chr start end value
chr1 58708485 58708713 1
chr1 58709084 58710538 2
chr1 98766295 98766639 3
chr1 98766902 98770338 4
Script:
library(GenomicRanges)
query = with(fileA.txt, GRanges(chr, IRanges(start=start, end=end)))
subject = with(fileA.txt, GRanges(chr, IRanges(start=start, end=end)))
hits = findOverlaps(gr1)
ranges(query)[queryHits(hits)] = ranges(subject)[subjectHits(hits)]
I am not sure how to set query and subject for a single file. Also, since the object is a file, does it need quotes ("") or a specific format (are bedGraph and txt fine?) in order to be recognised in the script?
Thank you a lot in advance for your help!
K.
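For reference, the merge described above (overlapping regions collapsed to the earliest start and the latest end) is, as far as I know, what reduce() does on a single GRanges object in GenomicRanges, so no query/subject pair should be needed. The logic itself can be sketched in Python rather than R. The function name merge_overlaps and the tuple layout are illustrative assumptions, the value column is ignored, and one made-up row is added because the four sample rows do not actually overlap.

```python
from collections import defaultdict


def merge_overlaps(regions):
    """Merge overlapping (chrom, start, end) regions, chromosome by chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        # Sorting by start means an overlap can only involve the region being built.
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlap: keep the earlier start, extend to the later end.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        if current_start is not None:
            merged.append((chrom, current_start, current_end))
    return merged


regions = [
    ("chr1", 58708485, 58708713),
    ("chr1", 58709084, 58710538),
    ("chr1", 58710000, 58711000),  # hypothetical row, added so that something overlaps
    ("chr1", 98766295, 98766639),
]
result = merge_overlaps(regions)
print(result)
```

The second and third rows are collapsed into one region; the others pass through unchanged.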

How to pass multiple comment style to skip the header of a text file?

I am trying to read hundreds of .dat files while skipping the header lines (I do not know beforehand how many of them I need to skip). The number of header lines varies from 1 to 20, and each header line begins with either "$" or "!".
A sample data (left column - node, right column - microstructure) has always two columns and looks like the following:
!===
!Comment
$Material
1 1.452E-001
2 1.446E-001
3 1.459E-001
I tried the following code, assuming I know beforehand that there are 3 lines in the header:
fid = fopen('Graphite_Node_Test.dat') ;
data = textscan(fid,'%f %f','HeaderLines',3) ;
fclose(fid);
This solution works if the number of header lines is known. How can I change the code so that it can read the .dat file without knowing the number of header lines beginning with either "$" or "!" sign?
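One way to think about it is to stop counting header lines altogether and instead drop every line whose first character is "$" or "!". A minimal sketch of that idea in Python (not MATLAB), assuming the file always has the two-column layout shown above; the function name read_two_columns is made up for illustration, and the sample is read from an in-memory string rather than from Graphite_Node_Test.dat:

```python
import io


def read_two_columns(lines, comment_chars=("$", "!")):
    """Parse 'node value' rows, ignoring blank lines and comment-style header lines."""
    rows = []
    for line in lines:
        stripped = line.strip()
        # str.startswith accepts a tuple, so both comment styles are checked at once.
        if not stripped or stripped.startswith(comment_chars):
            continue
        node, value = stripped.split()
        rows.append((int(node), float(value)))
    return rows


sample = io.StringIO(
    "!===\n!Comment\n$Material\n1 1.452E-001\n2 1.446E-001\n3 1.459E-001\n"
)
data = read_two_columns(sample)
print(data)
```

Because the filter is applied per line, the number of header lines does not need to be known in advance.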

How to import dates correctly from this .csv file into Matlab?

I have a .csv file with the first column containing dates, a snippet of which looks like the following:
date,values
03/11/2020,1
03/12/2020,2
3/14/20,3
3/15/20,4
3/16/20,5
04/01/2020,6
I would like to import this data into Matlab (I think the best way would probably be using the readtable() function). My goal is to bring the dates into Matlab as a datetime array. As you can see above, the problem is that the dates in the original .csv file are not consistently formatted. Some of them are in the format mm/dd/yyyy and some of them are mm/dd/yy.
Simply calling data = readtable('myfile.csv') on the .csv file results in the following, which is not correct:
'03/11/2020' 1
'03/12/2020' 2
'03/14/0020' 3
'03/15/0020' 4
'03/16/0020' 5
'04/01/2020' 6
Does anyone know a way to automatically account for this type of data in the import?
Thank you!
My version: Matlab R2017a
EDIT ---------------------------------------
Following the suggestion of Max, I have tried specifying some of the input options for the read command using the following:
T = readtable('example.csv',...
'Format','%{dd/MM/yyyy}D %d',...
'Delimiter', ',',...
'HeaderLines', 0,...
'ReadVariableNames', true)
which results in:
date values
__________ ______
03/11/2020 1
03/12/2020 2
NaT 3
NaT 4
NaT 5
04/01/2020 6
and you can see that this is not working either.
If you are sure all the dates involved do not go back more than 100 years, you can easily apply the pivot method which was in use in the last century (before the Y2K bug warned the world of the danger of the method).
They used to code dates in 2 digits only, knowing that 87 actually meant 1987. A user (or a computer) would add the missing years automatically.
In your case, you can read the full table, parse the dates, then it is easy to detect which dates are inconsistent. Identify them, correct them, and you are good to go.
With your example:
a = readtable(tfile) ; % read the file
dates = datetime(a.date) ; % extract first column and convert to [datetime]
idx2change = dates.Year < 2000 ; % Find which dates were in short format
dates.Year(idx2change) = dates.Year(idx2change) + 2000 ; % Correct truncated years
a.date = dates % reinject corrected [datetime] array into the table
yields:
a =
date values
___________ ______
11-Mar-2020 1
12-Mar-2020 2
14-Mar-2020 3
15-Mar-2020 4
16-Mar-2020 5
01-Apr-2020 6
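The same pivot idea can be sketched outside Matlab. The Python version below is only an illustration: it splits each date string on "/" and adds a fixed century to any two-digit year. The function name parse_us_date and the pivot_century parameter are assumptions of this sketch, and it presumes, as above, that every short year belongs to the 2000s.

```python
from datetime import date


def parse_us_date(text, pivot_century=2000):
    """Parse mm/dd/yyyy or mm/dd/yy, expanding two-digit years with a fixed century."""
    month, day, year = (int(part) for part in text.split("/"))
    if year < 100:
        # Short format detected: restore the truncated century.
        year += pivot_century
    return date(year, month, day)


raw = ["03/11/2020", "03/12/2020", "3/14/20", "3/15/20", "3/16/20", "04/01/2020"]
dates = [parse_us_date(item) for item in raw]
print(dates)
```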
Instead of specifying the format explicitly (as I also suggested before), one should use import options and, in the case of a csv file, the delimitedTextImportOptions:
opts = delimitedTextImportOptions('NumVariables',2,...% how many variables per row?
'VariableNamesLine',1,... % is there a header? If yes, in which line are the variable names?
'DataLines',2,... % in which line does the actual data start?
'VariableTypes',{'datetime','double'})% as what data types should the variables be read
readtable('myfile.csv',opts)
because the neat little feature recognizes the format of the datetime automatically, as it knows that it must be a datetime-object =)

How to batch rename files to 3-digit numbers?

I apologize in advance that this question is not specific. But my goal is to take a bunch of image files, which are currently named as: 0.tif, 1.tif, 2.tif, etc... and rename them just as numbers to 000.tif, 001.tif, 002.tif, ... , 010.tif, etc...
The reason I want to do this is because I am trying to load the images into matlab for batch processing but matlab does not order them correctly. I use the dir command as dir('*.tif') to get all the images and load them into an array of files that I can iterate over and process, but in this array element 1 is 0.tif, element 2 is 1.tif, element 3 is 10.tif, element 4 is 100.tif, and so on.
I want to keep the ordering of the elements as I process them. However, I do not care if I have to change the order of the elements BEFORE processing them (i.e. I can make it work to rename, for example, 2.tif to 10.tif if I had to) but I am looking for a way to convert the file names the way I initially described.
If there is a better way to get matlab to properly order the files when it loads them into the array using dir please let me know because that would be much easier.
Thanks!!
You can do this without having to rename the files, if you want. When you grab the files using dir, you'll have a list of files like so:
files =
'0.tif'
'1.tif'
'10.tif'
...
You can grab just the numeric part using regexp:
nums = regexp(files,'\d+','match');
nums = str2double([nums{:}]);
nums =
0 1 10 11 12 ...
regexp returns its matches as a cell array; the second line converts them back to actual numbers.
We can now get an actual numeric order by sorting the resulting array:
[~,order] = sort(nums);
and then put the files in the correct order:
files = files(order);
This should (I haven't tested it, I don't have a folder full of numerically labelled files handy) produce a list of files like so:
files=
'0.tif'
'1.tif'
'2.tif'
'3.tif'
...
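For comparison, the same "sort by the numeric part" idea is sketched below in Python. It is not the Matlab code above, just the same technique: extract the digits from each name and use them as the sort key. The helper name numeric_key and the sample list are illustrative assumptions.

```python
import re


def numeric_key(filename):
    """Return the first run of digits in the name as an integer (-1 if there is none)."""
    match = re.search(r"\d+", filename)
    return int(match.group()) if match else -1


# Names in the lexicographic order a directory listing typically gives.
files = ["0.tif", "1.tif", "10.tif", "100.tif", "2.tif", "3.tif"]
ordered = sorted(files, key=numeric_key)
print(ordered)
```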
This is partially dependent on the version of matlab you have. If you have a version with findstr, this should work well:
num_files_to_rename = numel(name_array);
for ii=1:num_files_to_rename
%in my test i used cells to store my strings you may need to
%change the bracket type for your application
curr_file = name_array{ii};
%locates the period in the file name (assume there is only one)
period_idx = findstr(curr_file ,'.');
%takes everything to the left of the period (excluding the period)
file_name = str2num(curr_file(1:period_idx-1));
%zeropads the file name to 3 spaces using a 0
new_file_name = sprintf('%03d.tif',file_name)
%you can uncomment this after you are sure it works as you planned
%movefile(curr_file, new_file_name);
end
The actual rename operation, movefile, is commented out for now. Make sure the output names are as you expect before uncommenting it and renaming all the files.
EDIT there is no real error checking in this code, it just assumes every file name has one and only one period, and an actual number as the name
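The zero-padding step can also be sketched in Python, with the same "print first, rename later" caution. The names padded_name and rename_plan are hypothetical, and, like the code above, this assumes every file name is a plain number followed by a single extension.

```python
import os


def padded_name(filename, width=3):
    """Zero-pad the numeric stem of a file name, keeping its extension."""
    stem, ext = os.path.splitext(filename)
    return f"{int(stem):0{width}d}{ext}"


def rename_plan(filenames, width=3):
    """Return (old, new) name pairs without touching the file system."""
    return [(name, padded_name(name, width)) for name in filenames]


plan = rename_plan(["0.tif", "1.tif", "10.tif", "100.tif"])
for old, new in plan:
    print(old, "->", new)
    # os.rename(old, new)  # uncomment once the printed plan looks right
```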
The Batch file below does the renaming you want:
@echo off
setlocal EnableDelayedExpansion
for /F "delims=" %%f in ('dir /B *.tif') do (
set "name=00%%~Nf"
ren "%%f" "!name:~-3!.tif"
)
Note that this solution preserves the order of your original files, even if there are missing numbers in the sequence.

Scheme read specific data from file

I have a txt file that looks like this:
1 17.3
2 18.2
3 18.6
I would like to make a variable (for example temp) which would store the first value (17.3). I would then compare this value with something else (< temp 20). The next step would be to store the second value in temp (18.2), so I could again compare values.
Any help would be appreciated!
In Matlab it would look like this:
A=importdata(...)
i=1;
while i<=size(A,1)
temp=A(i,2)
i=i+1;
if temp < 20
...
end
end
There are several ways to skin this cat in R6RS:
You can use read. read will read any Scheme datum, and since these are all numbers, each call to read will return the next number.
You can make your own parser. You read one char at a time, and when you hit a space or linefeed you pass the list of chars you have collected through list->string to get a string, and then through string->number. This can also be done in two parts, reading lines and then parsing each line, or you can slurp the whole file first and then process the string.
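To make the second approach concrete, here is a sketch of the character-by-character parser, written in Python rather than Scheme. The function name read_numbers is an assumption, and the input comes from an in-memory string instead of the txt file. Characters are collected until whitespace is hit, then joined and converted, which mirrors the list->string and string->number steps.

```python
import io


def read_numbers(stream):
    """Read one character at a time and emit a number at each whitespace boundary."""
    numbers, chars = [], []
    while True:
        ch = stream.read(1)
        if ch and not ch.isspace():
            chars.append(ch)
            continue
        if chars:
            # Equivalent of list->string followed by string->number.
            numbers.append(float("".join(chars)))
            chars = []
        if not ch:  # empty string means end of input
            break
    return numbers


values = read_numbers(io.StringIO("1 17.3\n2 18.2\n3 18.6\n"))
temps = values[1::2]  # every second number is the temperature column
below_20 = [temp < 20 for temp in temps]
print(temps, below_20)
```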

Read and parse more than one text file matlab

I have four .txt files. Each one has 250 lines, and each line has 4 values separated by commas. Below are the first 5 lines of one of the files; all of them have the same structure:
NaN,NaN,NaN,-1
792.98,419.48,333.35,245.63
787.13,408.59,345.05,251.48
798.3,414.17,333.36,245.63
803.61,414.43,333.35,239.78
One of the four files is the reference file, named groundtruth.txt. I want to read each line from the other three files and compare it with the values on the same line number in groundtruth.txt, then save the differences to a file for further processing. The result will be 3 new files holding the differences, each with 250 lines. For example, the first line of the result file holding the difference between the ground truth and the first file would look like this: 79.8,9.42,22.35,10.63
So if anyone could please advise.
If I understand correctly, this should be the thing you are after:
groundtruth = dlmread('groundtruth.txt');
file1 = dlmread('file_01.txt');
file2 = dlmread('file_02.txt');
file3 = dlmread('file_03.txt');
dlmwrite('diff_01.txt', file1 - groundtruth);
dlmwrite('diff_02.txt', file2 - groundtruth);
dlmwrite('diff_03.txt', file3 - groundtruth);
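If it helps to see what the subtraction does row by row, here is a small Python sketch of the same computation (not the Matlab code above). The helper names read_rows and row_differences are illustrative, the second data set is made up, and the input is read from in-memory strings rather than the actual files. Note that a NaN in either input stays NaN in the difference.

```python
import csv
import io


def read_rows(stream):
    """Parse comma-separated numeric rows; 'NaN' cells become float NaN."""
    return [[float(cell) for cell in row] for row in csv.reader(stream) if row]


def row_differences(rows, reference):
    """Subtract the reference from the data, element by element, line by line."""
    return [
        [value - ref_value for value, ref_value in zip(row, ref_row)]
        for row, ref_row in zip(rows, reference)
    ]


groundtruth = read_rows(io.StringIO("NaN,NaN,NaN,-1\n792.98,419.48,333.35,245.63\n"))
# Made-up values standing in for one of the three files.
tracked = read_rows(io.StringIO("NaN,NaN,NaN,-1\n793.98,420.48,334.35,246.63\n"))
diffs = row_differences(tracked, groundtruth)
print(diffs)
```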