MATLAB: Remove chars from string with Unicode chars

I have a long string that looks like:
その他,-9999.00
その他,-9999.00
その他,-9999.00
その他,-9999.00
and so forth. I'd like to split at each line break, remove everything up to and including the comma, and keep just the floats. So my output should be something like:
A =
[-9999.00 -9999.00 -9999.00 -9999.00]
Any idea how to do that relatively quickly (a few seconds at most)? There are close to a million lines in that string.
Thanks!

I think the best way to do this is with textscan:
out = textscan(str, '%*s%f', 'delimiter', ',');  % '%*s' skips the text before each comma, '%f' reads the float
out = out{1};                                    % textscan returns a cell array; take the numeric column

I'm assuming the input is in a file. And I'm also assuming that the file is UTF-8 encoded, otherwise this won't work.
My solution is a simple Perl script. No doubt it can be done with MATLAB, but different tools have different strengths. I wouldn't attempt numerical analysis with Perl, that's for sure.
convert.pl
print "A = \n [ ";
while (<>) {
chomp;
s/.*,//;
print " ";
print;
}
print " ]";
input.txt
その他,-9999.00
その他,-9999.00
その他,-9999.00
その他,-9999.00
Command line
perl convert.pl < input.txt > output.txt
output.txt
A =
[ -9999.00 -9999.00 -9999.00 -9999.00 ]

Partial answer, since I don't have access to MATLAB from home.
The following can be used to split on tabs; the same approach works for splitting on newlines:
s=sprintf('one\ttwo three\tfour');
r=regexp(s,'\t','split')
% r = 'one' 'two three' 'four'
help strtok might be helpful as well

Here's how to use regexp with Matlab for your problem (with str containing your string):
out = regexp(str,[',([^,',char(10),']+)',char(10)],'tokens')
out = cat(1,out{:});
out = str2double(out)
out =
-9999
-9999
-9999
-9999

One simple way to extract the numeric parts and convert them to doubles is to use the functions ISMEMBER and STR2NUM:
A = str2num(str(ismember(str,',.e-0123456789')));

Related

Perl: how to format a string containing a tilde character "~"

I have run into an issue where a Perl script we use to parse a text file is omitting lines containing the tilde (~) character, and I can't figure out why.
The sample below illustrates what I mean:
#!/usr/bin/perl
use warnings;
formline " testing1\n";
formline " ~testing2\n";
formline " testing3\n";
my $body_text = $^A;
$^A = "";
print $body_text
The output of this example is:
testing1
testing3
The line containing the tilde is dropped entirely from the accumulator. This happens whether there is any text preceding the character or not.
Is there any way to print the line with the tilde treated as a literal part of the string?
~ is special in forms (see perlform) and there's no way to escape it. But you can create a field for it and populate it with a tilde:
formline " \#testing2\n", '~';
The first argument to formline is the "picture" (template). That picture uses various characters to mean particular things. The ~ means to suppress output if the fields are blank. Since you supply no fields in your call to formline, your fields are blank and output is suppressed.
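A minimal runnable sketch of that fix (the testing lines are the ones from the question):

#!/usr/bin/perl
use strict;
use warnings;

# The lone @ in the picture is a one-character field; the tilde arrives as data,
# so it is printed literally instead of acting as a suppression marker.
formline " testing1\n";
formline " \@testing2\n", '~';
formline " testing3\n";
print $^A;

This prints all three lines, with the tilde intact in the second one.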
my @lines = ( '', 'x y z', 'x~y~z' );
foreach $line ( @lines ) { # forms don't use lexicals, so no my on the control variable
write;
}
format STDOUT =
~ ID: @*
$line
.
The output doesn't have a line for the blank field because the ~ in the picture told it to suppress output when $line doesn't have anything:
ID: x y z
ID: x~y~z
Note that tildes coming from the data are just fine; they are like any other character.
Here's probably something closer to what you meant. Create a picture, @* (variable-width multiline text), and supply it with values to fill it:
while( <DATA> ) {
local $^A;
formline '@*', $_;
print $^A, "\n";
}
__DATA__
testing1
~testing2
testing3
The output shows the field with the ~:
testing1
~testing2
testing3
However, the question is odd, because what you appear to be doing doesn't seem to be what formats are designed for. Perhaps you have some tricky thing where you're trying to take the picture from input data. But if you aren't going to give it any values, what are you really formatting? Consider that you may not actually want formats.
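If it turns out you don't want formats, here is a hedged sketch of the plain-print alternative, using the same sample data; since no picture language is involved, a tilde needs no special handling at all:

#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <DATA>) {
    chomp $line;
    print " $line\n";   # a tilde is an ordinary character here
}

__DATA__
testing1
~testing2
testing3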

Append string in the beginning and the end of a line containing certain string

Hi all,
I want to know how to append strings at the beginning and the end of a line containing a certain string, using Perl.
So for example, my line contains:
%abc %efd;
and I want to append 123 at the beginning of the line and 456 at the end of the line, so it would look like this:
123 %abc %efd 456
8/30/16 UPDATE--------------------------------
So far I have done something like this:
foreach file (`find . -type f`)
    perl -ne 's/^\%abc\s+(\S*)/**\%abc $1/; print;' $file > tmp; mv tmp $file
end
foreach file (`find . -type f`)
    perl -ne 's/$\%def\;\s+(\S*)/\%def\;**\n $1/; print;' $file > tmp; mv tmp $file
end
This works fairly well, except when %abc and %def are not on the same line.
for example:
%abc
something something something
%def
this would turn out to be
%abc
something something something
%def;
which is not what I want.
Thank you
In your case, you want to append strings when a line of the file matches a certain string; in other words, match and replace.
Firstly, read each line of your input file.
Secondly, check whether it matches the string you want to wrap with the beginning and end strings.
Then replace the matched string with a new string containing the additional beginning string, the matched string, and the additional end string.
my $input_file = 'your file name here';
my $search_string = '%abc %efd';
my $add_begin = '123';
my $add_end = '456';

# Read file
open(my $IN, '<', $input_file) or die "cannot open file $input_file";

# Check each line of file
while (my $row = <$IN>) {
    chomp $row;
    $row =~ s/^($search_string)$/$add_begin $1 $add_end/g;
    print $row . "\n";
}
Try it with an input file like the one below:
%abc %efd
asdahsd
234234
%abc
%efd
%abc%efd
You will receive the expected result:
123 %abc %efd 456
asdahsd
234234
%abc
%efd
%abc%efd
Modify the code to suit your requirements, and let me know if there's any issue.
Use the /m modifier so that ^ and $ match at the beginning and end of each line:
s/^\%abc/123 $&/mg;
s/\%def$/$& 456/mg;
Used together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string. (from perldoc perlre)
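Putting those two substitutions together, a small self-contained sketch (the sample lines are the ones from the question's update):

#!/usr/bin/perl
use strict;
use warnings;

my $text = "%abc\nsomething something something\n%def\n";

# /m makes ^ and $ match at each line boundary, not just the ends of the string
$text =~ s/^\%abc/123 $&/mg;
$text =~ s/\%def$/$& 456/mg;

print $text;

This prints "123 %abc", the middle line unchanged, and "%def 456".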
Welcome to StackOverflow. We strive to help people solve problems in their existing code and learn languages, rather than simply answer one-off questions, the solutions to which can be easily found in 101 tutorials and documentation. The type of question you've posted doesn't leave a lot of room for learning, and doesn't do much to help future learners. It would help us greatly if you could post a more complete example, including what you've tried so far to get it working.
All that being said, there are two main ways to prepend and append to a string in Perl: the concatenation operator (.) and string interpolation.
Concatenation
Use a . to join two strings together. You can chain operations together to compose a longer string.
my $str = '%abc %efd';
$str = '123 ' . $str . ' 456';
say $str; # prints "123 %abc %efd 456" with a trailing newline
Interpolation
Enclose a string in double quotes to instruct Perl to interpolate (i.e. find and evaluate) any Perl-style variables enclosed within the string.
my $str = '%abc %efd';
$str = "123 $str 456";
say $str; # prints "123 %abc %efd 456" with a trailing newline
You'll notice that in both examples we prepended and appended to the existing string. You can also create new variable(s) to hold the result(s) of these operations. Other methods of manipulating and building strings include the printf and sprintf functions, the substr function, the join function, and regular expressions, all of which you will encounter as you continue learning Perl.
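For illustration, here is a hedged sketch of two of those alternatives, sprintf and join, applied to the same sample string:

#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my $str = '%abc %efd';

# sprintf builds a new string from a format template
say sprintf '%s %s %s', '123', $str, '456';   # 123 %abc %efd 456

# join concatenates a list with a separator
say join ' ', '123', $str, '456';             # 123 %abc %efd 456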
As far as looking to see if a string contains a certain substring before performing the operation, you can use the index function or a regular expression:
if (index($str, '%abc %efd') >= 0) { ... }
# or, with a regular expression:
if ($str =~ /%abc %efd/) { ... }
Remember to use strict; at the top of your Perl scripts and always (at least while you're learning) declare variables with my. If you're having trouble with the say function, you may need to add the statement use feature 'say'; to the top of your script.
You can find an index of excellent Perl tutorials at learn.perl.org. Good luck and have fun!
UPDATE Here is (I believe) a complete answer to your revised question:
find . -type f -exec perl -i.bak -pe 's/^(%abc)\s+(\S*)\s+(%def;)$/**$1 $2 $3**/' {} \;
This will modify the files in place and create backup files with the extension .bak. Keep in mind that the expression \S* will only match non-whitespace characters; if you need to match strings that contain whitespace, you will need to update this expression (something like .*? might be workable for you).
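If you also need the multi-line case from your example (where %abc and %def; sit on separate lines), one possible sketch, assuming each file is small enough to slurp whole, is to add -0777 so the pattern can span newlines:

find . -type f -exec perl -i.bak -0777 -pe 's/^(%abc.*?%def;)\s*$/**$1**/gms' {} \;

Here /s lets . cross newlines, /m keeps ^ and $ anchored to line boundaries, and the non-greedy .*? stops at the first %def;.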

convert ASCII code to number

I have a file which contains 12 columns (a BAM file); the eleventh column contains ASCII codes. In one file I have more than one ASCII code, and I would like to convert them to numbers.
I tried this code:
perl -e '$a=ord("ALL_ASCII_CODES_FROM-FILE"); print "$a\t"'
And I would like to create a for loop that reads all the ASCII codes in the eleventh column, converts them to numbers, and combines the results into one number.
You need to split the string into individual characters, loop over every character, and call ord in the loop.
my @codes = map ord, split //, $str;
say join '.', map { sprintf("%02X", $_) } @codes;
Conveniently, unpack 'C*' does all of that.
my @codes = unpack 'C*', $str;
say join '.', map { sprintf("%02X", $_) } @codes;
If you do intend to print it out in hex, you can use the v modifier in a printf.
say sprintf("%v02X", $str);
The natural tool to convert a string of characters into a list of the corresponding ASCII codes would be unpack:
my @codes = unpack "C*", $string;
In particular, assuming that you're parsing the QUAL column of a SAM file (or, more generally, any FASTQ-style quality string), I believe the correct conversion would be:
my @qual = map { $_ - 33 } unpack "C*", $string;
P.S. From your mention of "columns", I'm assuming you're actually parsing a SAM file, not a BAM file. If I'm reading the spec correctly, the BAM format doesn't seem to use the +33 offset for quality values, so if you are parsing BAM files, you'd simply use the first example above for that.
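Since the question mentions combining the codes into one number, here is a hedged sketch that sums the Phred+33 quality values in the eleventh column of each SAM line; reading a SAM file and summing per line are assumptions based on the discussion above:

#!/usr/bin/perl
use strict;
use warnings;

while (<>) {
    next if /^@/;                             # skip SAM header lines
    my @fields = split /\t/;
    my $qual = $fields[10];                   # eleventh column, 0-based index 10
    my $sum = 0;
    $sum += $_ - 33 for unpack 'C*', $qual;   # decode Phred+33 and accumulate
    print "line $.: quality sum = $sum\n";
}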

Split a perl string with a substring and a space

local_addr = sjcapp [value2]
How do you split this string so that I get two values in my array, i.e.
array[0] = sjcapp and array[1] = value2?
If I do this
@array = split('local_addr =', $input)
then my array[0] has sjcapp [value2]. I want to be able to separate it into two in my split function itself.
I was trying something like this but it didn't work:
split(/local_addr= \s/, $input)
Untested, but maybe something like this?
@array = ($input =~ /local_addr = (\S+)\s\[(\S+)\]/);
Rather than split, this uses a regex match in list context, which gives you an array of the parts captured in parentheses.
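A self-contained check of that match, using the line from the question:

#!/usr/bin/perl
use strict;
use warnings;

my $input = 'local_addr = sjcapp [value2]';

# In list context a match returns its captured groups
my @array = ($input =~ /local_addr = (\S+)\s\[(\S+)\]/);

print "array[0] = $array[0]\n";   # sjcapp
print "array[1] = $array[1]\n";   # value2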
~/ cat data.txt
local_addr = sjcapp [value2]
other_addr = superman [value1492]
euro_addr = overseas [value0]
If the data really is as regularly structured as that, then you can just split on the whitespace. On the command line (see the perlrun(1) manual page) this is easiest with "autosplit" (-a), which magically creates an array of fields called @F from the input:
perl -lane 'print "$F[2] $F[3]" ' data.txt
sjcapp [value2]
superman [value1492]
overseas [value0]
In your script you can change the name of the array, and the position of the elements within it, by shift-ing or splice-ing (possibly in a more elegant way than this), but it works:
perl -lane 'my @array = ($F[2],$F[3]); print "$array[0], $array[1]"' data.txt
Or, without using autosplit, as follows:
perl -lne 'my @arr = split(" "); splice(@arr,0,2); print "$arr[0] $arr[1]"' data.txt
Try:
if ( $input =~ /(=)(.+)(\[)(.+)(\])/ ) {
    @array = ($2, $4);
}
I would use a regexp rather than a split, since this is clearly a standard format config file line. How you construct your regexp will likely depend on the full line syntax and how flexible you want to be.
if ( $input =~ /(\S+)\s*=\s*(\S+)\s*\[\s*(\S+)\s*\]/ ) {
    @array = ($2, $3);
}

How can I quickly parse large (>10GB) files?

I have to process text files 10-20GB in size of the format:
field1 field2 field3 field4 field5
I would like to parse the data from field2 of each line into one of several files; which file a line gets pushed into is determined by the value in field4. There are 25 different possible values in field4, and hence 25 different files the data can get parsed into.
I have tried using Perl (slow) and awk (faster but still slow) - does anyone have any suggestions or pointers toward alternative approaches?
FYI here is the awk code I was trying to use; note I had to revert to going through the large file 25 times because I wasn't able to keep 25 files open at once in awk:
chromosomes=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25)
for chr in ${chromosomes[@]}
do
awk < my_in_file_here -v pat="$chr" '{if ($4 == pat) for (i = $2; i <= $2+52; i++) print i}' >> my_out_file_"$chr".query
done
With Perl, open the files during initialization and then match the output for each line to the appropriate file:
#! /usr/bin/perl
use warnings;
use strict;

my @values = (1..25);

my %fh;
foreach my $chr (@values) {
    my $path = "my_out_file_$chr.query";
    open my $fh, ">", $path
        or die "$0: open $path: $!";
    $fh{$chr} = $fh;
}

while (<>) {
    chomp;
    my ($a, $b, $c, $d, $e) = split " ", $_, 5;
    print { $fh{$d} } "$_\n"
        for $b .. $b+52;
}
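Usage would be, for example (the script name split_by_chr.pl is hypothetical; the input filename is the one from the question's awk snippet):

perl split_by_chr.pl my_in_file_here

Because the script reads from <>, you can also pipe data into it or pass several input files at once.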
Here is a solution in Python. I have tested it on a small fake file I made up. I think this will be acceptably fast even for a large file, because most of the work will be done by C code inside of Python. And I think this is a pleasant and easy-to-understand program; I prefer Python to Perl.
import sys
s_usage = """\
Usage: csplit <filename>
Splits input file by columns, writes column 2 to file based on chromosome from column 4."""
if len(sys.argv) != 2 or sys.argv[1] in ("-h", "--help", "/?"):
    sys.stderr.write(s_usage + "\n")
    sys.exit(1)

# replace these with the actual patterns, of course
lst_pat = [
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
    'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
    'u', 'v', 'w', 'x', 'y'
]

d = {}
for s_pat in lst_pat:
    # build a dictionary mapping each pattern to an open output file
    d[s_pat] = open("my_out_file_" + s_pat, "wt")

if False:
    # if the patterns are unsuitable for filenames (contain '*', '?', etc.) use this:
    for i, s_pat in enumerate(lst_pat):
        # build a dictionary mapping each pattern to an output file
        d[s_pat] = open("my_out_file_" + str(i), "wt")

for line in open(sys.argv[1]):
    # split a line into words, and unpack into variables.
    # use '_' for a variable name to indicate data we don't care about.
    # s_data is the data we want, and s_pat is the pattern controlling the output
    _, s_data, _, s_pat, _ = line.split()
    # use s_pat to get to the file handle of the appropriate output file, and write data.
    d[s_pat].write(s_data + "\n")

# close all the output file handles.
for key in d:
    d[key].close()
EDIT: Here's a little more information about this program, since it seems you will be using it.
All of the error handling is implicit. If an error happens, Python will "raise an exception" which will terminate processing. For example, if one of the files fails to open, this program will stop executing and Python will print a backtrace showing which line of code caused the exception. I could have wrapped the critical parts with a "try/except" block, to catch errors, but for a program this simple, I didn't see any point.
It's subtle, but there is a check to see if there are exactly five words on each line of the input file. When this code unpacks a line, it does so into five variables. (The variable name "_" is a legal variable name, but there is a convention in the Python community to use it for variables you don't actually care about.) Python will raise an exception if there are not exactly five words on the input line to unpack into the five variables. If your input file can sometimes have four words on a line, or six or more, you could modify the program to not raise an exception; change the main loop to this:
for line in open(sys.argv[1]):
    lst = line.split()
    d[lst[3]].write(lst[1] + "\n")
This splits the line into words, and then just assigns the whole list of words into a single variable, lst. So that line of code doesn't care how many words are on the line. Then the next line indexes into the list to get the values out. Since Python indexes a list using 0 to start, the second word is lst[1] and the fourth word is lst[3]. As long as there are at least four words in the list, that line of code won't raise an exception either.
And of course, if the fourth word on the line is not in the dictionary of file handles, Python will raise an exception for that too. That would stop processing. Here is some example code for how to use a "try/except" block to handle this:
for line in open(sys.argv[1]):
    lst = line.split()
    try:
        d[lst[3]].write(lst[1] + "\n")
    except KeyError:
        sys.stderr.write("Warning: illegal line seen: " + line)
Good luck with your project.
EDIT: @larelogio pointed out that this code doesn't match the AWK code. The AWK code has an extra for loop that I do not understand. Here is Python code to do the same thing:
for line in open(sys.argv[1]):
    lst = line.split()
    n = int(lst[1])
    for i in range(n, n+53):
        d[lst[3]].write(str(i) + "\n")
And here is another way to do it. This might be a little faster, but I have not tested it so I am not certain.
for line in open(sys.argv[1]):
    lst = line.split()
    n = int(lst[1])
    s = "\n".join(str(i) for i in range(n, n+53))
    d[lst[3]].write(s + "\n")
This builds a single string with all the numbers to write, then writes them in one chunk. This may save time compared to calling .write() 53 times.
Do you know why it's slow? It's because you are processing that big file 25 times with the outer shell for loop!
awk '
$4 <= 25 {
    for (i = $2; i <= $2+52; i++) {
        print i >> ("my_out_file_" $4 ".query")
    }
}' bigfile
There are times when awk is not the answer.
There are also times when scripting languages are not the answer, when you are just plain better off biting the bullet and dragging down your copy of K&R and hacking some C code.
If your operating system implements pipes using concurrent processes and interprocess communication, as opposed to big temp files, what you might be able to do is write an awk script that reformats each line to put the selector field at the beginning, in a format easily readable with scanf(), write a C program that opens the 25 files and distributes lines among them, and then pipe the awk script's output into the C program.
Sounds like you're on your way, but I just wanted to mention Memory Mapped I/O as being a huge help when working with gigantic files. There was a time when I had to parse a .5GB binary file with Visual Basic 5 (yes)... importing the CreateFileMapping API allowed me to parse the file (and create several-gig "human-readable" file) in minutes. And it only took a half hour or so to implement.
Here's a link describing the API on Microsoft platforms, though I'm sure MMIO should be on just about any platform: MSDN
Good luck!
There are some precalculations that may help.
For example, you can precalculate the output block for each value of your field2, assuming there are 25 of them, like field4:
my %tx = map {my $tx=''; for my $tx1 ($_ .. $_+52) {$tx.="$tx1\n"}; $_=>$tx} (1..25);
Later, when writing, you can do print {$fh{$pat}} $tx{$base};
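A hedged sketch of how this precalculation could plug into the file-handle hash from the Perl answer above; the 1..25 range for field2 follows this answer's own assumption, and the whole script is a sketch rather than a definitive implementation:

#!/usr/bin/perl
use strict;
use warnings;

# Precompute the 53-line output block for every assumed field2 value (1..25)
my %tx = map {
    my $block = join '', map { "$_\n" } $_ .. $_ + 52;
    $_ => $block;
} 1 .. 25;

# One output file per field4 value, as in the earlier Perl answer
my %fh;
for my $chr (1 .. 25) {
    open $fh{$chr}, '>', "my_out_file_$chr.query"
        or die "cannot open my_out_file_$chr.query: $!";
}

while (<>) {
    my (undef, $f2, undef, $f4) = split ' ';
    print { $fh{$f4} } $tx{$f2};   # one precomputed write instead of 53 print calls
}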