Reading a comma in the content of a CSV file in MATLAB

I have a CSV file that contains commas in its content.
% with dot
15.12.2012 11:27; 0.9884753
11.12.2012 11:12; 10.670.642
11.12.2012 10:57; 114.455.145
Gdata= textscan(fid, '%s %f')
It works well.
% but what to do with a comma as the decimal mark?
15.12.2012 11:27; 0,9884753
11.12.2012 11:12; 10,670.642
11.12.2012 10:57; 114,455.145
How can I read it?
regards,

This may handle the inconsistency caused by the presence of both ',' and '.':
fid = fopen('data.d','r');
Gdata= textscan(fid, '%s %s','delimiter', ';' )
% Remove '.' (thousands separator), then turn ',' into the decimal point
f = @(i) str2double(regexprep(regexprep(i,'\.',''),',','.'));
Num = cellfun(f,Gdata(2),'UniformOutput' , false);
Num{:}
ans =
0.9885
10.6706
114.4551

Unfortunately, textscan doesn't respect locale settings, so there's no way to make it interpret the comma as a decimal point by modifying the current locale. As a workaround, you could read the entire line in, replace the comma with a dot and then use textscan to parse the line.
tline = fgetl( fid );
tline = strrep( tline, ',', '.' );
Gdata = textscan( tline, '%s %f', 'Delimiter', ';' );
You may have to resort to regexp or something else fancier than a simple strrep if the line may contain commas that you don't want replaced.
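As a minimal sketch of that idea, the snippet below converts only the number field of each line. It assumes every line has the form "date time; number", that ',' is always the decimal mark, and that any '.' in the number field is a thousands separator. The sample lines and variable names are illustrative and not taken from the original file; each line returned by fgetl could be handled the same way.

```matlab
% Sample lines in the question's format (illustrative data, not read from disk)
rawLines = {'15.12.2012 11:27; 0,9884753', ...
            '11.12.2012 11:12; 10,670642', ...
            '11.12.2012 10:57; 1.114,455'};

stamps = cell(numel(rawLines), 1);
values = zeros(numel(rawLines), 1);
for k = 1:numel(rawLines)
    % Split at the first ';' so the dots in the date are left untouched
    [stamps{k}, rest] = strtok(rawLines{k}, ';');
    numStr = strtrim(rest(2:end));       % drop the ';' and surrounding blanks
    numStr = strrep(numStr, '.', '');    % assumption: '.' is a thousands separator
    numStr = strrep(numStr, ',', '.');   % decimal comma -> decimal point
    values(k) = str2double(numStr);
end
```

Splitting off the date first keeps the replacement from touching the dots in "15.12.2012", which a plain strrep on the whole line would not guarantee.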

Related

Escape characters in Matlab

I am reading a file using fileread(), which returns the entire file as one string. Now I need to go through it line by line and process the data. How can I detect the newline character in MATLAB? I tried '\n' and '\r\n', but neither works.
Thanks in advance
For special characters, either use the char function with the character code (http://www.asciitable.com/) or sprintf (my preferred way, for better readability).
For example, you are looking for sprintf('\n') or sprintf('\r\n').
char(13) is carriage return \r
char(10) is new line \n
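To split text that has already been loaded in one piece, a regular expression can do the newline detection. The following is a minimal sketch: the variable txt stands in for the output of fileread() and its content is made up for illustration.

```matlab
% Stand-in for the output of fileread(): three lines with mixed line endings
txt = sprintf('first line\r\nsecond line\nthird line');

% '\r?\n' matches both Windows (CR LF) and Unix (LF) line endings
lines = regexp(txt, '\r?\n', 'split');

for k = 1:numel(lines)
    fprintf('%d: %s\n', k, lines{k});
end
```

Unlike a literal '\n' in single quotes, which MATLAB treats as the two characters backslash and n, the regexp pattern interprets the escape sequences itself.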
You can read the file line by line (see fgetl):
fid = fopen('file', 'r');
% Check that it opened okay
if fid ~= -1
    while true
        tline = fgetl(fid);
        % fgetl returns -1 (not a char array) at end of file
        if ~ischar(tline); break; end
        % Do stuff with tline
    end
    fclose(fid);
end

Add additional string to printing

Hi, I would like to print a string once and then append dots to the end of it, rather than reprinting the whole string every time. I want only the dots to be added to the string that has already been printed.
reboot = '### rebooting the mmp';
display(reboot)
for i = 1 : 15
reboot = strcat(reboot,'.')
pause(1);
end
How would I do this?
Rather than printing out the entire string every time, you can just print out a new dot each time through the loop.
To make this work, use fprintf rather than disp to print the dot: disp automatically appends a newline, whereas fprintf does not, so all of the dots end up on the same line.
% Print the initial message without a trailing newline
fprintf('### rebooting the mmp');
% Print 5 dots all on the same line with a 1-second pause
for k = 1:5
fprintf('.')
pause(1)
end
% We DO want to print a newline after we're all done
fprintf('\n')
% Alternatively, reuse the reboot variable from the question
fprintf('%s', reboot)
for i=1:15
fprintf('.')
pause(1)
end

Write unicode strings to a file in Matlab

I have a string containing Urdu characters, like 'بجلی'; this is a 1x4 array. I want to save it to a file that will be viewed externally. The string doesn't display in the main Command Window, but the variable 'str' does hold it. When I save it using fprintf(fid, str) and open the file in Notepad, 'arrows' appear instead of the original characters. I can easily paste my characters into Notepad manually. Where is the problem?
You need to use fwrite() not fprintf():
fid = fopen('temp.txt', 'w');
str = char([1576, 1580, 1604, 1740, 10]);
encoded_str = unicode2native(str, 'UTF-8');
fwrite(fid, encoded_str, 'uint8');
fclose(fid);
Verified with:
perl -E "open my $fh, q{<:utf8}, q{temp.txt}; while (<$fh>) {while (m/(.)/g) {say ord $1}}"
1576
1580
1604
1740
It's not really necessary to avoid fprintf in order to write UTF-8 strings to a file. The idea is to open the file with the correct encoding:
f = fopen('temp.txt', 'w', 'native', 'UTF-8');
s = char([1576, 1580, 1604, 1740]);
fprintf(f, 'This is written as UTF-8: %s.\n', s);
fclose(f);
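The result can also be checked from within MATLAB instead of an external tool. The sketch below is one possible round trip, assuming a MATLAB session where char holds Unicode code points: it writes the string with the encoding-aware fopen, reads the raw bytes back, and decodes them with native2unicode. The temporary file name comes from tempname and is not part of the original answers.

```matlab
fname = [tempname, '.txt'];                 % temporary file (illustrative)
s = char([1576, 1580, 1604, 1740]);

% Write the string through a file handle opened as UTF-8
f = fopen(fname, 'w', 'native', 'UTF-8');
fprintf(f, '%s', s);
fclose(f);

% Read the raw bytes back and decode them explicitly
f = fopen(fname, 'r');
bytes = fread(f, Inf, 'uint8=>uint8').';    % row vector of bytes
fclose(f);
delete(fname);

decoded = native2unicode(bytes, 'UTF-8');
disp(double(decoded));                      % code points of the decoded text
```

If the decoded code points match the ones passed to char, the file on disk holds valid UTF-8 and any remaining display problem lies with the viewer.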
Looking up every character in a character map may seem tedious. The code can be modified as follows:
fid = fopen('temp.txt', 'w');
str = char(['س','ل','ا','م']);
encoded_str = unicode2native(str, 'UTF-8');
fwrite(fid, encoded_str, 'uint8');
fclose(fid);
This seems easier; the only drawback is that it requires you to have Arabic/Persian/Urdu, etc. installed.

Renaming names in a file using another file without using loops

I have two files:
(one.txt) looks like this:
>ENST001
(((....)))
(((...)))
>ENST002
(((((((.......))))))
((((...)))
There are about 10,000 more ENST entries.
(two.txt) looks like this:
>ENST001 110
>ENST002 59
and so on for the rest of all ENSTs
Basically, I would like to replace each ENST header in (one.txt) with the combination of the two fields from (two.txt), so that the result looks like this:
>ENST001_110
(((....)))
(((...)))
>ENST002_59
(((((((.......))))))
((((...)))
I wrote a MATLAB script to do this, but since it loops over all lines in (two.txt) for every ID, it takes about 6 hours to finish. I think awk, sed, grep, or even Perl could produce the result in a few minutes. This is what I did in MATLAB:
frf = fopen('one.txt', 'r');
frp = fopen('two.txt', 'r');
fw = fopen('result.txt', 'w');
while feof(frf) == 0
line = fgetl(frf);
first_char = line(1);
if strcmp(first_char, '>') == 1 % if the line in one.txt start by > it is the ID
id_fold = strrep(line, '>', ''); % Remove the > symbol
frewind(frp) % Rewind two.txt file after each loop
while feof(frp) == 0
raw = fgetl(frp);
scan = textscan(raw, '%s%s');
id_pos = scan{1}{1};
pos = scan{2}{1};
if strcmp(id_fold, id_pos) == 1 % if both ids are the same
id_new = ['>', id_fold, '_', pos];
fprintf(fw, '%s\n', id_new);
end
end
else
fprintf(fw, '%s\n', line); % if the line doesn't start by > print it to results
end
end
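One way using awk: the FNR == NR condition is true only while the first file on the command line (two.txt) is being read, and that block stores each number keyed by its header. The second condition applies to the second file (one.txt): when the first field matches a key in the array, the line is rewritten with the number appended.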
awk '
FNR == NR {
data[ $1 ] = $2;
next
}
FNR < NR && data[ $1 ] {
$0 = $1 "_" data[ $1 ]
}
{ print }
' two.txt one.txt
Output:
>ENST001_110
(((....)))
(((...)))
>ENST002_59
(((((((.......))))))
((((...)))
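The same lookup-table idea can be sketched in MATLAB, the language of the original script. This is an illustration rather than part of the awk answer: it uses containers.Map as the hash table, and the cell arrays twoLines and oneLines are small in-memory stand-ins for the contents of two.txt and one.txt.

```matlab
% In-memory stand-ins for the lines of two.txt and one.txt (illustrative)
twoLines = {'>ENST001 110', '>ENST002 59'};
oneLines = {'>ENST001', '(((....)))', '>ENST002', '((((...)))'};

% Pass 1: build the lookup table, header -> position
lookup = containers.Map('KeyType', 'char', 'ValueType', 'char');
for k = 1:numel(twoLines)
    parts = strsplit(strtrim(twoLines{k}));   % split on whitespace
    lookup(parts{1}) = parts{2};
end

% Pass 2: rewrite only the header lines that have an entry in the table
result = oneLines;
for k = 1:numel(oneLines)
    key = strtrim(oneLines{k});
    if isKey(lookup, key)
        result{k} = [key, '_', lookup(key)];
    end
end

fprintf('%s\n', result{:});
```

Each input is traversed once, so the work grows with the total number of lines instead of with the product of the two file sizes, which is what makes the nested-loop script slow.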
With sed, you can first run a command on two.txt alone that generates the sed substitution commands you need, and then apply those commands to one.txt:
First way
sed "$(sed -n '/>ENST/{s=.*\(ENST[0-9]\+\)\s\+\([0-9]\+\).*=s/\1/\1_\2/;=;p}' two.txt)" one.txt
Second way
If the files are huge, the first way fails with an "Argument list too long" error. To avoid that, write the generated commands to a script file instead. Execute these three commands one by one:
sed -n '1i#!/bin/sed -f
/>ENST/{s=.*\(ENST[0-9]\+\)\s\+\([0-9]\+\).*=s/\1/\1_\2/;=;p}' two.txt > script.sed
chmod +x script.sed
./script.sed one.txt
The first command generates a sed script that modifies one.txt as required. chmod makes the new script executable, and the last command runs it. Each file is read only once, and there are no explicit loops.
Note that the first command spans two lines but is still a single command. Deleting the newline character breaks the script, because of how the i command in sed works. See the sed man page for details.
This Perl solution sends the modified one.txt file to STDOUT.
use strict;
use warnings;
open my $f2, '<', 'two.txt' or die $!;
my %ids;
while (<$f2>) {
$ids{$1} = "$1_$2" if /^>(\S+)\s+(\d+)/;
}
open my $f1, '<', 'one.txt' or die $!;
while (<$f1>) {
s/^>(\S+)\s*$/>$ids{$1}/;
print;
}
Turn the problem on its head. In perl I would do something like this:
#!/usr/bin/perl
open(FH1, "one.txt");
open(FH2, "two.txt");
open(RESULT, ">result.txt");
my %data;
while (my $line = <FH2>)
{
chomp($line);
# Delete leading angle bracket
$line =~ s/^>//;
# split enst and pos
my ($enst, $pos) = split(/\s+/, $line);
# Store POS with ENST as key
$data{$enst} = $pos;
}
close(FH2);
while (my $line = <FH1>)
{
# Check line for ENST
if ($line =~ m/^>(ENST\d+)/)
{
my $enst = $1;
# Get pos for ENST
my $pos = $data{$enst};
# make new line
$line = '>' . $enst . '_' . $pos . "\n";
}
print RESULT $line;
}
close(FH1);
close(RESULT);
This might work for you (GNU sed):
sed -n '/^$/!s|^\(\S*\)\s*\(\S*\).*|s/^\1.*/\1_\2/|p' two.txt | sed -f - one.txt
Try this MATLAB solution (no loops):
%# read files as cell array of lines
fid = fopen('one.txt','rt');
C = textscan(fid, '%s', 'Delimiter','\n');
C1 = C{1};
fclose(fid);
fid = fopen('two.txt','rt');
C = textscan(fid, '%s', 'Delimiter','\n');
C2 = C{1};
fclose(fid);
%# use regexp to extract ENST numbers from both files
num = regexp(C1, '>ENST(\d+)', 'tokens', 'once');
idx1 = find(~cellfun(@isempty, num)); %# location of >ENST line
val1 = str2double([num{:}]); %# ENST numbers
num = regexp(C2, '>ENST(\d+)', 'tokens', 'once');
idx2 = find(~cellfun(@isempty, num));
val2 = str2double([num{:}]);
%# construct new header lines from file2
C2(idx2) = regexprep(C2(idx2), ' +','_');
%# replace headers lines in file1 with the new headers
[tf,loc] = ismember(val2,val1);
C1( idx1(loc(tf)) ) = C2( idx2(tf) );
%# write result
fid = fopen('three.txt','wt');
fprintf(fid, '%s\n',C1{:});
fclose(fid);

Matlab: Remove chars from string with unicode chars

I have a long string that looks like:
その他,-9999.00
その他,-9999.00
その他,-9999.00
その他,-9999.00
and so forth. I'd like to split at linebreak and remove everything up to a comma, and just keep the floats. So my output should be something like:
A =
[-9999.00 -9999.00 -9999.00 -9999.00]
Any idea how to do that relatively quickly (a few seconds at most)? There are close to a million lines in that string.
Thanks!
I think the best way to do this is with textscan:
out = textscan(str, '%*s%f', 'delimiter', ',');
out = out{1};
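As a small worked example of this call, the snippet below builds a three-line string in the question's format and parses it. The sample values are made up for illustration, and the final transpose is only there to get the row vector the question asks for.

```matlab
% Illustrative input: label, comma, number on each line
str = sprintf('その他,-9999.00\nその他,12.50\nその他,3.00\n');

% '%*s' reads and discards the label, '%f' keeps the number
out = textscan(str, '%*s%f', 'Delimiter', ',');
A = out{1}.';        % 1-by-N row vector of doubles

disp(A);
```

Because the label field is skipped rather than stored, no intermediate cell array of strings is created for it.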
I'm assuming the input is in a file. And I'm also assuming that the file is UTF-8 encoded, otherwise this won't work.
My solution is a simple Perl script. No doubt it can be done with MATLAB, but different tools have different strengths. I wouldn't attempt numerical analysis with Perl, that's for sure.
convert.pl
print "A = \n [ ";
while (<>) {
chomp;
s/.*,//;
print " ";
print;
}
print " ]";
input.txt
その他,-9999.00
その他,-9999.00
その他,-9999.00
その他,-9999.00
Command line
perl convert.pl < input.txt > output.txt
output.txt
A =
[ -9999.00 -9999.00 -9999.00 -9999.00 ]
Partial answer since I don't have access to matlab from home
The following splits a string on tabs; the same approach with '\n' in place of '\t' splits on newlines.
s=sprintf('one\ttwo three\tfour');
r=regexp(s,'\t','split')
% r = 'one' 'two three' 'four'
help strtok might be helpful as well
Here's how to use regexp with Matlab for your problem (with str containing your string):
out = regexp(str,[',([^,',char(10),']+)',char(10)],'tokens')
out = cat(1,out{:});
str2double(out)
out =
-9999
-9999
-9999
-9999
One simple way to extract the numeric parts and convert them to doubles is to use the functions ISMEMBER and STR2NUM:
A = str2num(str(ismember(str,',.e-0123456789')));