How to access feature values that are not primitive types in Ruta script? - uima

I can access features that are defined as primitive types using Ruta script.
For example, posTag is a string feature of TokenAnnotation. The following script works.
STRING str1, str2;
TokenAnnotation{->GETFEATURE("posTag", str1), LOG("posTag=" + str1)};
However, I can't read a feature that is defined as another Annotation type.
TokenAnnotation inherits a feature called lemma, which is of type Lemma.
Lemma has its own features; "key" is one of them.
How do I access the "key" feature value of a Lemma through a given TokenAnnotation in Ruta script?
I tried a type variable, but I don't know what I can do with it after assigning the lemma feature to it.
It would be great if someone could show me some examples of type variable usage.
The following are my type descriptions and my CAS.
Thanks in advance.
<typeDescription>
<name>uima.tt.TokenLikeAnnotation</name>
<description>Base type for token annotation types</description>
<supertypeName>uima.tt.LexicalAnnotation</supertypeName>
<features>
<featureDescription>
<name>lemma</name>
<description>The best probable entry containing all morphological information for the token</description>
<rangeTypeName>uima.tt.Lemma</rangeTypeName>
</featureDescription>
<featureDescription>
<name>lemmaEntries</name>
<description>List of lemma entries containing all morphological information for the token</description>
<rangeTypeName>uima.cas.FSArray</rangeTypeName>
</featureDescription>
<featureDescription>
<name>dictionaryMatch</name>
<description>A flag indicating whether or not the token matches a dictionary entry</description>
<rangeTypeName>uima.cas.Boolean</rangeTypeName>
</featureDescription>
</features>
</typeDescription>
<typeDescription>
<name>uima.tt.TokenAnnotation</name>
<description>General token annotation type. It is also the base type for the special token types</description>
<supertypeName>uima.tt.TokenLikeAnnotation</supertypeName>
<features>
<featureDescription>
<name>posTag</name>
<description>Part-of-Speech tag</description>
<rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>
</features>
</typeDescription>
<typeDescription>
<name>uima.tt.KeyStringEntry</name>
<description>Base type for types defining key/value feature (e.g. uima.tt.Lemma type)</description>
<supertypeName>uima.cas.TOP</supertypeName>
<features>
<featureDescription>
<name>key</name>
<description>A key/value feature (e.g. lemma string in uima.tt.Lemma type)</description>
<rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>
</features>
</typeDescription>
<typeDescription>
<name>uima.tt.Lemma</name>
<description>Morphological information retrieved from a lexical dictionary entry</description>
<supertypeName>uima.tt.KeyStringEntry</supertypeName>
<features>
<featureDescription>
<name>partOfSpeech</name>
<description>An integral encoding representing the part-of-speech for the lemma</description>
<rangeTypeName>uima.cas.Integer</rangeTypeName>
</featureDescription>
<featureDescription>
<name>frost_ExtendedPOS</name>
<description>An integer representing additional information related to the part-of-speech</description>
<rangeTypeName>uima.cas.Integer</rangeTypeName>
</featureDescription>
<featureDescription>
<name>isStopword</name>
<description/>
<rangeTypeName>uima.cas.Boolean</rangeTypeName>
</featureDescription>
</features>
</typeDescription>
<cas:NULL xmi:id="0"/>
<tcas:DocumentAnnotation xmi:id="8" sofa="1" begin="0" end="70" language="en"/>
<tcas:DocumentAnnotation xmi:id="21" sofa="14" begin="0" end="22" language="en"/>
<ontology:Column xmi:id="27" sofa="14" begin="0" end="11"/>
<ontology:Column xmi:id="31" sofa="14" begin="12" end="22"/>
<ontology:Table xmi:id="35" sofa="14" begin="0" end="22"/>
<uimatypes:TitlecaseAlphabetic xmi:id="42" sofa="14" begin="0" end="8" lemma="74" lemmaEntries="74" dictionaryMatch="true" posTag="NN"/>
<uimatypes:TitlecaseAlphabetic xmi:id="58" sofa="14" begin="12" end="17" lemma="90" lemmaEntries="90" dictionaryMatch="true" posTag="NN"/>
<uimatypes:TitlecaseAlphabetic xmi:id="66" sofa="14" begin="18" end="22" lemma="98" lemmaEntries="98" dictionaryMatch="true" posTag="NN"/>
<uimatypes:UppercaseAlphabetic xmi:id="50" sofa="14" begin="9" end="11" lemma="82" lemmaEntries="82" dictionaryMatch="true" posTag="NN"/>
<tt:SentenceAnnotation xmi:id="106" sofa="14" begin="0" end="22" sentenceNumber="1"/>
<tt:ParagraphAnnotation xmi:id="111" sofa="14" begin="0" end="22" paragraphNumber="1"/>
<type:CW xmi:id="116" sofa="14" begin="0" end="8"/>
<type:CW xmi:id="132" sofa="14" begin="12" end="17"/>
<type:CW xmi:id="140" sofa="14" begin="18" end="22"/>
<type:SPACE xmi:id="120" sofa="14" begin="8" end="9"/>
<type:SPACE xmi:id="128" sofa="14" begin="11" end="12"/>
<type:SPACE xmi:id="136" sofa="14" begin="17" end="18"/>
<type:CAP xmi:id="124" sofa="14" begin="9" end="11"/>
<type:RutaBasic xmi:id="144" sofa="14" begin="0" end="8"/>
<type:RutaBasic xmi:id="149" sofa="14" begin="8" end="9"/>
<type:RutaBasic xmi:id="154" sofa="14" begin="9" end="11"/>
<type:RutaBasic xmi:id="159" sofa="14" begin="11" end="12"/>
<type:RutaBasic xmi:id="164" sofa="14" begin="12" end="17"/>
<type:RutaBasic xmi:id="169" sofa="14" begin="17" end="18"/>
<type:RutaBasic xmi:id="174" sofa="14" begin="18" end="22"/>
<demo:Identifier xmi:id="187" sofa="14" begin="0" end="11"/>
<cas:Sofa xmi:id="1" sofaNum="2" sofaID="global1" mimeType="text" sofaString="&lt;Table&gt;&lt;Column&gt;Employee ID&lt;/Column&gt;&lt;Column&gt;Birth Date&lt;/Column&gt;&lt;/Table&gt;"/>
<cas:Sofa xmi:id="14" sofaNum="3" sofaID="global2" mimeType="text" sofaString="Employee ID Birth Date"/>
<tt:Lemma xmi:id="74" key="employee" partOfSpeech="3" frost_ExtendedPOS="0" isStopword="false"/>
<tt:Lemma xmi:id="90" key="birth" partOfSpeech="3" frost_ExtendedPOS="0" isStopword="false"/>
<tt:Lemma xmi:id="98" key="date" partOfSpeech="3" frost_ExtendedPOS="0" isStopword="false"/>
<tt:Lemma xmi:id="82" key="id" partOfSpeech="3" frost_ExtendedPOS="0" isStopword="false"/>
<cas:View sofa="1" members="8"/>
<cas:View sofa="14" members="21 27 31 35 42 58 66 50 106 111 116 132 140 120 128 136 124 144 149 154 159 164 169 174 187"/>

Yes, you can do that, but support for TOP feature structures is limited in UIMA Ruta 2.4.0 (the current release): only annotation types are supported.
This works best with feature expressions in UIMA Ruta. You can simply use a dot to refer to a feature of the matched annotation, and such expressions can be stacked.
In your example, this could look like:
TokenAnnotation.lemma.key=="birth"{-> T1};
TokenAnnotation{TokenAnnotation.posTag=="NN" -> T1};
... but you probably need to replace uima.cas.TOP with uima.tcas.Annotation in your type system.
A type variable won't help to solve this. The new annotation variables could help here, but they are not really needed.
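For completeness, if you also want to see the value, the same stacked feature expression should work inside a string expression, e.g. in a LOG action (a sketch; I am assuming your Ruta version supports concatenating a string-valued feature expression):
// log the lemma key of each matched token
TokenAnnotation{-> LOG("lemma.key = " + TokenAnnotation.lemma.key)};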
DISCLAIMER: I am a developer of UIMA Ruta

Related

Perl: using Digest::SHA3 with a basic example from online, the bit length of the output appears to be 160, which is said to be a weak hash length(?)

I am using the following code to learn/familiarize myself with one-way password hashing, salting, and using them to verify a user on login.
It works: I store the hashed password and the salt value in my database, and I can retrieve both and compare against the plain-text password, no problem.
My question is about the output: how secure it is, and so on.
use Digest::SHA3;
$plaintextpassword = 'cheeseburgerandfries';
# 62-character alphabet; rand(62) below picks an index into it, so it must include "0"
$salts = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
$s1a = int(rand(62));
$s1b = int(rand(62));
$s1c = int(rand(62));
$salt = substr($salts, $s1a, 1) . substr($salts, $s1b, 1) . substr($salts, $s1c, 1);
$sha3 = Digest::SHA3->new;
$sha3->add($salt . $plaintextpassword);
$encpw = $sha3->hexdigest;
which gives an output similar to
$encpw='7fd7d6e9b574fe6306be6c709d23050b5ad28f07e094403469229b6d'
When I take that value and run it through an online text-to-bytes converter, I get
00110111 01100110 01100100 00110111 01100100 00110110 01100101 00111001 01100010 00110101
00110111 00110100 01100110 01100101 00110110 00110011 00110000 00110110 01100010 01100101
00110110 01100011 00110111 00110000 00111001 01100100 00110010 00110011 00110000 00110101
00110000 01100010 00110101 01100001 01100100 00110010 00111000 01100110 00110000 00110111
01100101 00110000 00111001 00110100 00110100 00110000 00110011 00110100 00110110 00111001
00110010 00110010 00111001 01100010 00110110 01100100
which I believe is 160 bits. As I'm really new to hashes and bits, I'm confused.
My thinking is that SHA3 is 256-bit and up, so why is the output 160 bits? I may even be misinterpreting the data, or the information I'm gathering from research, so forgive me.
Also, I'm certain there are easier/better/stronger ways to accomplish my goals, but I think my question is more along the lines of understanding bit length, etc.
Also, I was reading that it may be best to use a salt length equal to the output character length, meaning my salt would be 56 characters, just like my SHA3 output above? I was thinking of using something rudimentary such as
$salts = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
$sv = 0;
while ($sv < 56) {
    $s1 = int(rand(62));
    $newsalt = $newsalt . substr($salts, $s1, 1);
    $sv++;
}
$salt = $newsalt;
I did read about some modules that would give me truly random salt values, and I am interested in those, but my while loop seems to be doing the task, however unnecessary a 56-character salt may be.
Any help and guidance would be sweet. Thanks!
The hash you were provided is 224 bits in size (not 160): the output is 56 hexadecimal characters, and each hex character encodes 4 bits, so 56 × 4 = 224. (The online converter gave you the bits of the ASCII text of the hex string, which is why your count came out differently.)
The module's abstract says
The module gives Perl programmers a convenient way to calculate SHA3-224, SHA3-256, SHA3-384, and SHA3-512 message digests, as well as variable-length hashes using SHAKE128 and SHAKE256.
Wikipedia confirms that these (224, 256, 384 and 512) are the standard sizes.
If you wish to get a specific size, use
use Digest::SHA3 qw( );
my $sha3 = Digest::SHA3->new(XXX);
$sha3->add(...);
my $hash = $sha3->hexdigest;
or
use Digest::SHA3 qw( sha3_XXX_hex );
my $hash = sha3_XXX_hex(...);
Use an appropriate number of bits instead of XXX.
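For example, to get a 256-bit digest explicitly (a quick sketch; the input string is just illustrative):
use strict;
use warnings;
use Digest::SHA3 qw( sha3_256_hex );

my $hash = sha3_256_hex('cheeseburgerandfries');
print length($hash), " hex chars = ", length($hash) * 4, " bits\n";  # 64 hex chars = 256 bits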

KDB split by fixed delimiter

I have a column of XML strings:
<Options TE="2017/09/01, 16:45:00.000" ST="2017/09/01, 09:00:00.000" TT="2017/09/01, 16:45:00.000"/>
<Options TE="2017/09/01, 16:45:00.000" ST="2017/09/01, 09:00:00.000" TT="2017/09/01, 16:45:00.000"/>
<Options TE="2017/09/04, 16:45:00.000" ST="2017/09/04, 09:00:00.000" TT="2017/09/04, 16:45:00.000"/>
that I am trying to split into columns
TE, ST, TT
The type of the column is C (each cell is a string).
Not being very familiar with kdb/q, I tried to go the very manual way. First I removed the start and end tags:
x:update `$ssr[;"<Options";""] each tags from x
x:update `$ssr[;"/>";""] each string tags from x
leaving me with rows like
TE="2017/09/01, 16:45:00.000" ST="2017/09/01, 09:00:00.000" TT="2017/09/01, 16:45:00.000"
Then, splitting the string
select `$"\"" vs' string tags from x
gives me a list where the odd entries are my times. I just can't figure out how to take that list and split it into separate columns. Any ideas?
I've taken a slightly different approach but the following should do what you want:
//Clean the tags up for separation
//(get rid of open/close tags, change ", " to "," for ease of parsing and remove quote marks)
x:update tags:{ssr/[x;("<Options ";"/>";", ";"\"");("";"";",";"")]} each tags from x
//Parse the various tags using 0:, put the result into a dictionary,
//exec out to table form and add to x
x:x,'exec (!) ./: ("S= " 0:/: tags) from x
For reference here's the table I used:
x:([] tags:("<Options TE=\"2017/09/01, 16:45:00.000\" ST=\"2017/09/01, 09:00:00.000\" TT=\"2017/09/01, 16:45:00.000\"/>";
"<Options TE=\"2017/09/01, 16:45:00.000\" ST=\"2017/09/01, 09:00:00.000\" TT=\"2017/09/01, 16:45:00.000\"/>";
"<Options TE=\"2017/09/04, 16:45:00.000\" ST=\"2017/09/04, 09:00:00.000\" TT=\"2017/09/04, 16:45:00.000\"/>"))
Crazy thought: is your XML data really that regular, so that one can select "columns" via indexing? If so, supposing the data above were in a 3-element list of strings, could you not apply some function foo to:
foo xmllist[;ind]
where ind selects the characters required? The function foo would do the necessary conversion to the timestamp datatype, for example by using (types;delimiter) 0: ...
See if you can export the XML file to a JSON file.
kdb+/q has a JSON parser which does all the dirty work for you:
.j.k and .j.j.
Reference: http://code.kx.com/q/cookbook/websockets/#json
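For instance, a quick sketch of what .j.k returns (the JSON literal here is illustrative):
q).j.k "{\"TE\":\"2017/09/01, 16:45:00.000\",\"ST\":\"2017/09/01, 09:00:00.000\"}"
TE| "2017/09/01, 16:45:00.000"
ST| "2017/09/01, 09:00:00.000"
Each call returns a dictionary, and a list of dictionaries with identical keys is already a table in q.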

How to match Unicode vowels?

What character class or Unicode property will match any Unicode vowel in Perl?
Wrong answer: [aeiouAEIOU]. (sermon here, item #24 in the laundry list)
perluniprops mentions vowels only for Hangul and Indic scripts.
Let's set aside the question of what a vowel is. Yes, i may not be a vowel in some contexts. So, any character that can be a vowel will do.
There's no such property.
$ uniprops --all a
U+0061 <a> \N{LATIN SMALL LETTER A}
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
AHex POSIX_XDigit All Alnum X_POSIX_Alnum Alpha X_POSIX_Alpha Alphabetic Any ASCII
ASCII_Hex_Digit Assigned Basic_Latin ID_Continue Is_IDC Cased Cased_Letter LC
Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L
Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Hex X_POSIX_XDigit Hex_Digit IDC ID_Start
IDS Letter L_ Latin Latn Lowercase_Letter Lower X_POSIX_Lower Lowercase PerlWord POSIX_Word
POSIX_Alnum POSIX_Alpha POSIX_Graph POSIX_Lower POSIX_Print Print X_POSIX_Print Unicode Word
X_POSIX_Word XDigit XID_Continue XIDC XID_Start XIDS
Age=1.1 Age=V1_1 Block=Basic_Latin Bidi_Class=L Bidi_Class=Left_To_Right BC=L
Bidi_Paired_Bracket_Type=None Block=ASCII BLK=ASCII Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR
Decomposition_Type=None DT=None East_Asian_Width=Na East_Asian_Width=Narrow EA=Na
Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
Hangul_Syllable_Type=Not_Applicable HST=NA Indic_Positional_Category=NA InPC=NA
Indic_Syllabic_Category=Other InSC=Other Joining_Group=No_Joining_Group JG=NoJoiningGroup
Joining_Type=Non_Joining JT=U Joining_Type=U Script=Latin Line_Break=AL
Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN
Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0
Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1
Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0
Present_In=6.1 IN=6.1 Present_In=6.2 IN=6.2 Present_In=6.3 IN=6.3 Present_In=7.0 IN=7.0
Present_In=8.0 IN=8.0 SC=Latn Script=Latn Script_Extensions=Latin Scx=Latn
Script_Extensions=Latn Sentence_Break=LO Sentence_Break=Lower SB=LO Word_Break=ALetter WB=LE
Word_Break=LE
The most important thing when dealing with i18n is to think about what you actually need, yet you didn't even mention what you are trying to accomplish.
Find vowels? That can't be what you are actually trying to do. I could see a use for identifying vowel sounds in a word, but those are often formed from multiple letters (such as "oo" in English, and "in", "an"/"en", "ou", "ai", "au"/"eau", "eu" in French), and it would be language-specific.
As it stands, you're asking for a global solution but you're defining the problem in local terms. You first need to start by defining the actual problem you are trying to solve.
Setting aside the definition of a vowel and the obvious problem that different languages share symbols but use them differently, there's a way that you can define your own property for use in a Perl pattern.
Define a subroutine whose name starts with In or Is and specify the characters that belong to it. The simplest form is one code number per line, or a range of code numbers separated by horizontal whitespace:
#!perl
use v5.10;
use utf8;
use open qw(:std :utf8);
# user-defined property: matches U+00A7, U+00B6, and the range U+2295..U+229C
sub InSpecial {
return <<"HERE";
00A7
00B6
2295\t229C
HERE
}
$_ = "ABC\x{00A7}";
say $_;
say /\p{InSpecial}/ ? 'Matched' : 'Missed';
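Applied to the vowel question, a minimal sketch that enumerates only the ten basic Latin vowel letters (precomposed accented vowels would have to be listed the same way, or the input decomposed first):
#!perl
use v5.10;

# user-defined property: aeiou in both cases, by code point
sub InBasicVowel {
return <<"HERE";
0041
0045
0049
004F
0055
0061
0065
0069
006F
0075
HERE
}

say "strength" =~ /\p{InBasicVowel}/ ? 'Matched' : 'Missed';  # Matched (the "e")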
First of all, not all written languages have "vowels". For one example, 中文 (Zhōngwén) (written Chinese) does not, as it is ideogrammatic instead of phonetic. For another example, Japanese mostly doesn't; it uses mostly consonant+vowel hiragana or katakana syllabics such as "ga", "wa", "tsu" instead.
And some written languages (for example, Hindi, Bangla, Greek, Russian) do have vowels, but use characters which are not easily mappable to aeiou. For such languages you'd have to find (search MetaCPAN?) or make look-up tables specifying which letters are "vowels".
But if you're dealing with any written language based even loosely on the Latin alphabet (abcdefghijklmnopqrstuvwxyz), even if the language uses tons of diacritics (called "combining marks" in Perl and Unicode circles), e.g. Vietnamese, you can easily map those to "vowel" or "not-vowel", yes. The way is to normalize to fully-decomposed form, then strip out all the combining marks, then fold case, then compare each letter to the regex /[aeiou]/. The following Perl script will find most-or-all "vowels" in any language using a Latin-based alphabet:
#!/usr/bin/perl -CSDA
# vowel-count.pl
use v5.20;
use Unicode::Normalize 'NFD';
my $vcount;
while (<>)
{
   $_ =~ s/[\r\n]+$//;
   say "\nRaw string: $_";
   my $decomposed = NFD $_;                    # normalize to fully-decomposed form
   my $stripped = ($decomposed =~ s/\pM//gr);  # strip all combining marks
   say "Stripped string: $stripped";
   my $folded = fc $stripped;                  # fold case
   my @base_letters = split //, $folded;
   $vcount = 0;
   /[aeiou]/ and ++$vcount for @base_letters;
   say "# of vowels: $vcount";
}

Parse a PC-Axis (.px) file in Matlab

Background: PC-Axis is a file format used for the dissemination of statistical information. The format is used by a number of national statistical organisations to disseminate official statistics.
A PC-Axis file looks a little like this, although they're usually a lot longer:
CHARSET="ANSI";
MATRIX="BE001";
SUBJECT-CODE="BE";
SUBJECT-AREA="Population";
TITLE="Population by region, time, marital status and sex.";
Data=
".." ".." ".." ".." ".."
".." ".." ".." ".." ".."
".." 24.80 34.20 52.00 23.00
".." 32.10 40.30 50.70 1.00
".." 31.60 35.00 49.10 2.30
41.20 43.00 50.80 60.10 0.00
50.90 52.00 53.90 65.90 0.00
28.90 31.80 39.60 51.00 0.00;
More details about PC-Axis files can be found at the Statistics Sweden website, but the basic gist is that the metadata is positioned at the top of the file and the actual data comes after "Data=". It's also worth noting that the data is organized more like a data table than in columns.
The Problem: I'd like to parse a PC-Axis file using Matlab, but I'm a little stumped as to how to go about doing it. Does anyone know how to parse one of these files in Matlab? Would it be easier to parse this type of file using some other language, like Perl, and then import the data into Matlab, or, would Matlab be a suitable enough tool for the job? Note that the plan would be to analyze the data in Matlab after the text processing stage.
I've tried using Matlab's text processing tools such as fgetl, textscan, fscanf, and a few others, but it's terribly tricky. Does anyone have any pointers on how to go about doing it?
Essentially, I'd like to store each of the keywords (CHARSET, MATRIX, etc.) and their corresponding values (ANSI, BE001, etc.) as metadata in Matlab - as a structure, perhaps. I'd like to have the data stored in Matlab also - as a matrix, for example.
Note: I'm aware of the pxR package (CRAN) in R, which works a treat for reading .px files into the workspace as a data.frame object. There's also a Perl module called Data::PcAxis (CPAN) that is also very good, but I'm specifically wanting to know how to parse a .px file using Matlab.
UPDATE: I should have mentioned that in addition to metadata and data, there are also variables. This is best explained by an example. The example PC-Axis file below is the same as the one above except I've added two variables. They're named VALUES("Month") and VALUES("region") and are positioned after the metadata and before the data.
CHARSET="ANSI";
MATRIX="BE001";
SUBJECT-CODE="BE";
SUBJECT-AREA="Population";
TITLE="Population by region, time, marital status and sex.";
VALUES("Month")="1976M01","1976M02","1976M03","1976M04",
"1976M05","1976M06","1976M07","1976M08",
"1976M09","1976M10","1976M11","1976M12";
VALUES("region")="Sweden","Germany","France",
"Ireland","Finland";
Data=
".." ".." ".." ".." ".."
".." ".." ".." ".." ".."
".." 24.80 34.20 52.00 23.00
".." 32.10 40.30 50.70 1.00
".." 31.60 35.00 49.10 2.30
41.20 43.00 50.80 60.10 0.00
50.90 52.00 53.90 65.90 0.00
28.90 31.80 39.60 51.00 0.00;
Textscan works a treat when reading in each line of the text file as a string (into a cell array). However, the elements after the "=" sign for both of the variables (i.e. VALUES("Month") and VALUES("region")) span more than one line. It seems that using textscan in this case means that some strings would have to be concatenated, for example in order to collect the list of months (1976M01 to 1976M12).
Question: What's the best way to collect the variables' data? Read the text file as a single string and then use strtok twice to extract the substring of dates? Perhaps there's a better (more systematic) way?
Usually textscan and regexp is the way to go when parsing string fields (as shown here):
Read the input lines as strings with textscan:
fid = fopen('input.px', 'r');
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
Parse the header field names and values using regexp. Picking the right regular expression should do the trick!
X = regexp(C{:}, '^\s*([^=\(\)]+)\s*=\s*"([^"]+)"\s*', 'tokens');
X = [X{:}]; %// Flatten the cell array
X = reshape([X{:}], 2, []); %// Reshape into name-value pairs
The "VALUE" fields may span over multiple lines, so they need to be concatenated first:
idx_data = find(~cellfun('isempty', regexp(C{:}, '^\s*Data')), 1);
idx_values = find(~cellfun('isempty', regexp(C{:}, '^\s*VALUES')));
Y = arrayfun(@(m, n){[C{:}{m:m + n - 1}]}, ...
idx_values(idx_values < idx_data), diff([idx_values; idx_data]));
... and then tokenized:
Y = regexp(Y, '"([^,"]+)"', 'tokens'); %// Tokenize values
Y = cellfun(@(x){{x{1}{1}, {[x{2:end}]}}}, Y); %// Group values in one array
Y = reshape([Y{:}], 2, []); %// Reshape into name-value pairs
Make sure the field names are legal (I've decided to convert everything to lowercase and replace hyphens and any whitespace with underscores), and plug them into a struct:
X = [X, Y]; %// Store all fields in one array
X(1, :) = lower(regexprep(X(1, :), '-+|\s+', '_'));
S = struct(X{:});
Here's what I get for your input file (only the header fields):
S =
charset: 'ANSI'
matrix: 'BE001'
subject_code: 'BE'
subject_area: 'Population'
title: 'Population by region, time, marital status and sex.'
month: {1x12 cell}
region: {1x5 cell}
As for the data itself, it needs to be handled separately:
Extract data lines after the "Data" field and replace all ".." values with default values (say, NaN):
D = strrep(C{:}(idx_data + 1:end), '".."', 'NaN');
Obviously this assumes that there are only numerical data after the "Data" field. However, this can easily be modified if this is not the case.
Convert the data to a numerical matrix and add it to the structure:
D = cellfun(@str2num, D, 'UniformOutput', false);
S.data = vertcat(D{:})
And here's S.data for your input file:
S.data =
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN 24.80000 34.20000 52.00000 23.00000
NaN 32.10000 40.30000 50.70000 1.00000
NaN 31.60000 35.00000 49.10000 2.30000
41.20000 43.00000 50.80000 60.10000 0.00000
50.90000 52.00000 53.90000 65.90000 0.00000
28.90000 31.80000 39.60000 51.00000 0.00000
Hope this helps!
I'm not personally familiar with PC-Axis files, but here are my thoughts.
Parse the header first. If the header is of fixed size, you can read in that many lines and parse out the values you want. The regexp method may be useful for this.
The data appear to be both string and numeric. I would change the ".." values to NaN (make an original backup first, of course), and then scan the matrix using textscan. Textscan can be tricky, so make sure the file parses completely. If textscan encounters a line that does not match the format string, it will stop parsing. You can check the position of the file handle (using ftell) to see if it matches the end of the file (you can fseek to the end of the file to find what that value should be). The length of the cell arrays returned by textscan should all be the same. If not, the length will tell you what line they failed on - you can check this line with a text editor to see what violated the format.
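A hedged sketch of that completeness check (the file name and format string are made up for illustration):
% did textscan consume the whole file?
fid = fopen('input.px', 'r');
C = textscan(fid, '%f %f %f %f %f');
stoppedAt = ftell(fid);      % where parsing stopped
fseek(fid, 0, 'eof');        % jump to the end of the file
if stoppedAt ~= ftell(fid)
    warning('textscan stopped early at byte %d', stoppedAt);
end
fclose(fid);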
You can assign and access fields in Matlab structs using string arguments. For example:
foo.('a') = 1;
foo.a
ans =
1
So, the workflow I suggest is to parse the header lines, assigning each attribute/value pair as field/value pairs in a struct, then parse the matrix (after some brief text preprocessing to make sure all the data are numeric).
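A hedged sketch of that header step (the regular expression and the field-name cleanup are illustrative, not definitive):
% turn one 'KEY="value";' header line into a struct field
S = struct();
line = 'TITLE="Population by region, time, marital status and sex.";';
tok = regexp(line, '^([^=]+)="([^"]*)"', 'tokens', 'once');
if ~isempty(tok)
    name = lower(strrep(strtrim(tok{1}), '-', '_'));
    S.(name) = tok{2};  % S.title = 'Population by region, ...'
end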

Haskell IO with non English characters

Look at this; I am trying:
appendFile "out" $ show 'д'
'д' is a character from the Russian alphabet.
After that, the "out" file contains:
'\1076'
As I understand it, that is the Unicode numeric code of the character 'д'. Why does this happen? And how can I get the normal representation of my character?
For additional information, this works fine:
appendFile "out" "д"
Thanks.
show escapes all characters outside the ASCII range (and some inside the ASCII range), so don't use show.
Since "д" works fine, just use that. If you can't because the д is actually inside a variable, you can use [c] (where c is the variable containing the character). If you need to surround it with single quotes (like show does), you can use ['\'', c, '\''].
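A minimal sketch of both variants (the variable name c and the file name "out" are just carried over from the question):
main :: IO ()
main = do
  let c = 'д'
  appendFile "out" [c]              -- the character itself, no escaping
  appendFile "out" ['\'', c, '\'']  -- wrapped in single quotes, like show would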
After reading your reply to my comment, I think your situation is that you have some data structure, maybe with type [(String,String)], and you'd like to output it for debugging purposes. Using show would be convenient, but it escapes non-ASCII characters.
The problem here isn't with Unicode; you need a function that will properly format your data for display. I don't think show is the right choice, in part because of the problems with escaping some characters. What you need is a type class like Show, but one that displays data for reading instead of escaping characters. That is, you need a pretty-printer: a library that provides functions to format data for display. There are several pretty-printers available on Hackage; I'd look at uulib or wl-pprint to start. I think either would be suitable without too much work.
Here's an example with the uulib tools. The Pretty type class is used instead of Show; the library comes with many useful instances.
import UU.PPrint
-- | Write each item to StdOut
logger :: Pretty a => a -> IO ()
logger x = putDoc $ pretty x <+> line
Running this in GHCi:
Prelude UU.PPrint> logger 'Д'
Д
Prelude UU.PPrint> logger ('Д', "other text", 54)
(Д,other text,54)
Prelude UU.PPrint>
If you want to output to a file instead of the console, you can use the hPutDoc function to output to a handle. You could also call renderSimple to produce a SimpleDoc, then pattern match on the constructors to process output, but that's probably more trouble. Whatever you do, avoid show:
Prelude UU.PPrint> show $ pretty 'Д'
"\1044"
You could also write your own type class similar to show but formatted as you like it. The Text.Printf module can be helpful if you go this route.
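For instance, a minimal sketch of such a class (Display and display are made-up names, not from any library):
-- a Show-like class that formats data for reading instead of escaping it
class Display a where
  display :: a -> String

instance Display Char where
  display c = [c]          -- the character itself, no quotes or escapes

instance Display a => Display [a] where
  display = concatMap display

main :: IO ()
main = putStrLn (display "Chleb z masłem")   -- prints: Chleb z masłem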
Use Data.Text. It provides IO with locale-awareness and encoding support.
A quick web search for "UTF Haskell" should give you good links. Probably the most recommended package is the text package.
import Data.Text.IO as UTF
import Data.Text as T
main = UTF.appendFile "out" (T.pack "д")
To make show display national characters, put this in your code:
{-# LANGUAGE FlexibleInstances #-}
instance {-# OVERLAPPING #-} Show String where
show = id
You can then try:
*Main> show "ł"
ł
*Main> show "ą"
ą
*Main> show "ę"
ę
*Main> show ['ę']
ę
*Main> show ["chleb", "masło"]
[chleb,masło]
*Main> data T = T String deriving (Show)
*Main> t = T "Chleb z masłem"
*Main> t
T Chleb z masłem
*Main> show t
T Chleb z masłem
There were no quotes in my previous solution. In addition, I have now put the code in a module, and the module must be imported into your program.
{-# LANGUAGE FlexibleInstances #-}
module M where
instance {-# OVERLAPPING #-} Show String where
show x = ['"'] ++ x ++ ['"']
Information for beginners: remember that show does not display anything; it converts data to a string with additional formatting characters.
We can try in WinGHCi:
automatically by WinGHCi
*M> "ł"
"ł"
*M> "ą"
"ą"
*M> "ę"
"ę"
*M> ['ę']
"ę"
*M> ["chleb", "masło"]
["chleb","masło"]
*M> data T = T String deriving (Show)
*M> t = T "Chleb z masłem"
or manually
*M> (putStrLn . show) "ł"
"ł"
*M> (putStrLn . show) "ą"
"ą"
*M> (putStrLn . show) "ę"
"ę"
*M> (putStrLn . show) ['ę']
"ę"
*M> (putStrLn . show) ["chleb", "masło"]
["chleb","masło"]
*M> data T = T String deriving (Show)
*M> t = T "Chleb z masłem"
*M> (putStrLn . show) t
T "Chleb z masłem"
In code to display:
putStrLn "ł"
putStrLn "ą"
putStrLn "ę"
putStrLn "masło"
(putStrLn . show) ['ę']
(putStrLn . show) ["chleb", "masło"]
data T = T String deriving (Show)
t = T "Chleb z masłem"
(putStrLn . show) t
I'm adding the tag "polskie znaki haskell" ("Polish characters haskell") for Google.