Understanding UTF-8 encoding in emails - email

I am trying to understand how UTF-8 encoding in emails works. I thought looking at what Thunderbird does would help, so I wrote an email with the characters "äöü" and then looked at the source in Thunderbird's "Sent" folder.
The mail header says:
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Those characters are shown as "Ã¤Ã¶Ã¼" (Unicode positions 195 164 195 182 195 188). I then followed the rules (that I found here) and was able to get back my original letters. Using 195 164 as an example: the bytes are 1 1 0 0 0 0 1 1 * 1 0 1 0 0 1 0 0. The leading bits 110 of the first byte indicate a two-byte sequence. I then drop those 3 leading bits from the first byte and the 2 leading bits (10) from the second byte, which gives me 00011100100 or 228, the Unicode position of "ä".
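The arithmetic above can be checked with a short Python sketch. It assumes the standard two-byte UTF-8 layout 110xxxxx 10yyyyyy (five payload bits from the first byte, six from the second); the function name decode_two_byte is made up for illustration.

```python
def decode_two_byte(b1, b2):
    """Decode one two-byte UTF-8 sequence into a Unicode code point."""
    # The first byte must start with 110, the second with 10.
    if b1 >> 5 != 0b110 or b2 >> 6 != 0b10:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep the low 5 bits of the first byte and the low 6 bits of the second.
    return ((b1 & 0b00011111) << 6) | (b2 & 0b00111111)


code_point = decode_two_byte(195, 164)
print(code_point, chr(code_point))

# Cross-check against Python's built-in decoder.
print(bytes([195, 164]).decode("utf-8"))
```

The same function applied to 195 182 and 195 188 yields the code points of "ö" and "ü".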
But then I extended that string to "äöüÄÖÜß" and saw "Ã¤Ã¶Ã¼Ã„Ã–ÃœÃŸ", whose characters I find in Unicode at positions 195 164 195 182 195 188 195 8222 195 8211 195 339 195 376. Eh... 8222, 8211, 339, 376? All of these values are greater than 255, and I thought 8bit encoding couldn't go beyond 255.
How can I decode that text correctly?
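One possible explanation, sketched in Python under the assumption that the viewer interpreted the raw UTF-8 bytes as Windows-1252 (cp1252): in that code page the bytes 0x84, 0x96, 0x9C and 0x9F map to characters whose Unicode positions are 8222, 8211, 339 and 376, which matches the numbers in the question. The variable names are illustrative only.

```python
original = "äöüÄÖÜß"

# The bytes that actually travel in the 8bit mail body (all values <= 255).
raw = original.encode("utf-8")
print(list(raw))

# Misreading those bytes as Windows-1252 yields one character per byte.
misread = raw.decode("cp1252")
positions = [ord(ch) for ch in misread]
print(positions)

# Reversing the misreading recovers the original text.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired)
```

If this assumption holds, the values above 255 are not bytes in the message at all; they are the Unicode positions of the characters that cp1252 assigns to single bytes in the range 128 to 159.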

Can we discard a numerical variable based on the t-test when our target variable is categorical?

I have a numeric variable within my data.
sample(d$timedelta, 20)
[1] 601 561 44 162 554 443 604 68 140 446 178 506 348 402 401 700 127 717 669 68
My target is a binary variable (Popularity = 1/0)
I want to drop the variable if there is no statistically significant difference in timedelta between the two groups:
pop.1.time = d$timedelta[d$Popularity==1]
pop.0.time = d$timedelta[d$Popularity==0]
t.test(pop.1.time, pop.0.time, var.equal = FALSE, paired = FALSE)
Can I drop timedelta altogether if the above test shows that there is no difference between the two groups?
Is that a valid approach? Or am I misinterpreting the meaning of a t-test?

Data type changes when saving a binary image

I converted a grayscale image to a binary image as shown in the script below:
D = '/folder-path/';
S = dir(fullfile(D,'*.jpg'));
for k = 1:numel(S)
    F = fullfile(D,S(k).name);
    I = imread(F);
    I2 = im2bw(I);
    imwrite(I2,F);
end
The issue is that when I read back any of the images that were converted to binary and saved to the hard drive, the returned type is uint8!
I thought the image would at least contain only two values, such as 0 and 255, but running unique(I) on one image gave the following:
75×1 uint8 column vector
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29 30 31 32 35 36 37 38
217 218 220 221 222 223 224 225 226 227 228 229 230 231 232 233
234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249
250 251 252 253 254 255
Why do you think this is happening? How can I read the saved images as binary and not uint8?
Thanks.
Do not write your binary image to a JPEG file: JPEG compression is lossy, so you certainly lose the exact values in the process.
In addition, overwriting the source file looks like bad practice.
A solution would be to save your binary image in a PNG file with the same base name. For instance:
imwrite(I2, [D S(k).name(1:end-3) 'png']);
In this case the PNG contains only zeros and ones. To make the binary image easier to inspect in a viewer, it may be better to store 0s and 255s:
imwrite(uint8(I2)*255, [D S(k).name(1:end-3) 'png']);

tesseract OCR output words bounds

How can I output word bounds using the tesseract command line with a config file?
So far I have been able to output character bounds using
tesseract image.png myBox makebox
This created a myBox.box file that looks like this:
N 51 1844 75 1874 0
o 80 1843 100 1867 0
S 113 1843 136 1875 0
I 140 1844 145 1874 0
M 151 1844 181 1874 0
c 197 1843 216 1867 0
a 219 1843 238 1867 0
r 243 1844 254 1867 0
d 256 1843 275 1876 0
However, those are only characters and I need words, so I combined it with the standard output
tesseract image.png myBox
This creates a file like this:
no simcard
Combining those two outputs, I can get word bounds. However, I would prefer a method that does not require examining the same image twice. Please help.
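For reference, the two-pass combination described above could look roughly like the following Python sketch. It assumes each box line has the form "char left bottom right top page", that every box holds exactly one character, and that the character counts of the words in the text output line up with the box file. The helper names parse_box_lines and word_bounds are hypothetical. Because the text output here ("no simcard") differs in case from the box characters, the grouping relies on word lengths only.

```python
BOX_TEXT = """\
N 51 1844 75 1874 0
o 80 1843 100 1867 0
S 113 1843 136 1875 0
I 140 1844 145 1874 0
M 151 1844 181 1874 0
c 197 1843 216 1867 0
a 219 1843 238 1867 0
r 243 1844 254 1867 0
d 256 1843 275 1876 0
"""


def parse_box_lines(box_text):
    """Parse box lines into (char, left, bottom, right, top) tuples."""
    boxes = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skip lines that do not match the assumed layout
        left, bottom, right, top = map(int, parts[1:5])
        boxes.append((parts[0], left, bottom, right, top))
    return boxes


def word_bounds(boxes, text):
    """Group character boxes by the word lengths found in text."""
    result = []
    start = 0
    for word in text.split():
        chunk = boxes[start:start + len(word)]
        start += len(word)
        if not chunk:
            break  # text is longer than the box file
        result.append((
            "".join(c[0] for c in chunk),
            min(c[1] for c in chunk),  # left
            min(c[2] for c in chunk),  # bottom
            max(c[3] for c in chunk),  # right
            max(c[4] for c in chunk),  # top
        ))
    return result


words = word_bounds(parse_box_lines(BOX_TEXT), "no simcard")
for w in words:
    print(w)
```

This still needs both tesseract outputs, so it only illustrates the workaround, not a single-pass solution.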

error reading a text file in octave

I have a text file named xMat.txt which has 200 space-separated elements per line and some 767 lines.
This is how xMat.txt looks.
386.0 386.0 388.0 394.0 402.0 413.0 ... .0 800.0 799.0 796
801.0 799.0 799.0 802.0 802.0 80 ... 399.0 397.0 394.0 391
.
.
.
When I try to read the file in octave using X = dlmread('xMat.txt',' ') I get a matrix of size 767 X 610. I am expecting a matrix of size 767 X 200 since there are 200 elements in one row. How can I solve this problem?
Edit - This is my file
Your uploaded file https://bpaste.net/raw/96cf21aa21b8 has an inconsistent number of columns per row.
$ awk "{print NF}" tmp | sort | uniq -c
2 200
754 201
1 206
1 217
1 223
1 234
1 237
1 238
1 269
1 273
1 390
1 420
1 610
So most rows have 201 columns, but one has 420 columns and one even has 610 columns. This is the reason you get a 767x610 matrix from dlmread.
Let's see which lines have more than 201 columns:
$ awk "{if (NF>201) print NR, NF}" tmp
68 217
580 206
613 390
615 234
657 273
676 610
679 237
720 269
722 238
743 223
762 420
The first column shows the line number, the second the number of columns.
So your line with 610 columns is line number 676. I also printed line 676:
so you can see it really contains data, not runs of multiple spaces that get filled with zeros.
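The same check can be sketched in Python for readers without awk at hand. This is only an illustration: the function names column_counts and long_rows are made up, fields are assumed to be separated by whitespace, and the sample lines stand in for the real file.

```python
from collections import Counter


def column_counts(lines):
    """Count how many lines have each number of whitespace-separated fields."""
    return Counter(len(line.split()) for line in lines)


def long_rows(lines, expected):
    """Return (line number, field count) for lines with more than expected fields."""
    rows = []
    for number, line in enumerate(lines, start=1):
        fields = len(line.split())
        if fields > expected:
            rows.append((number, fields))
    return rows


# Stand-in data; for the real file one would read the lines with
# open("xMat.txt") and pass them to the two functions instead.
sample = [
    "386.0 386.0 388.0",
    "801.0 799.0 799.0",
    "1.0 2.0 3.0 4.0 5.0",
]
counts = column_counts(sample)
too_long = long_rows(sample, expected=3)
print(counts)
print(too_long)
```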

MATLAB accessing conditional values and performing operation in single column

Just started MATLAB 2 days ago and I can't figure out a non-loop method (since I read they were slow/inefficient and MATLAB has better alternatives) to perform a simple task.
I have a matrix of 5 columns and 270 rows. What I want to do is:
if the value of an element in column 5 of matrix goodM is below 90, I want to take that element and subtract it from 90.
So far I tried:
test = goodM(:,5) <= 90;
goodM(test) = 999;
It changes the matching goodM values in column 1, not column 5, into 999. In addition, this method doesn't allow me to perform operations on the elements below 90 in column 5. Is there an elegant solution for doing this?
edit:: goodM(:,5)(test) = 999; doesn't seem to work either, so I have no idea how to specify the target column.
I am assuming you want to operate on elements with values below 90, as the text of your question says, rather than 'below or equal to' as the '<=' in your code suggests. So try this:
ind = find(goodM(:,5) < 90); % row indices whose column 5 value is less than 90
goodM(ind,5) = 90 - goodM(ind,5); % replace those elements with 90 minus their value
Try this code:
b=90-a(a(:,5)<90,5);
For example:
a =
265 104 479 13 176
26 110 447 208 144
379 163 179 366 464
301 48 274 391 26
429 374 174 184 297
495 375 312 373 82
465 272 399 447 420
205 170 373 122 84
1 417 63 65 252
271 277 412 113 500
then,
b=90-a(a(:,5)<90,5);
b =
64
8
6
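For comparison, the example above can be reproduced in plain Python. This is a rough sketch rather than a MATLAB replacement: it uses 0-based indexing (so column 5 is index 4), an explicit loop instead of logical indexing, and a hypothetical helper name subtract_from_90.

```python
a = [
    [265, 104, 479, 13, 176],
    [26, 110, 447, 208, 144],
    [379, 163, 179, 366, 464],
    [301, 48, 274, 391, 26],
    [429, 374, 174, 184, 297],
    [495, 375, 312, 373, 82],
    [465, 272, 399, 447, 420],
    [205, 170, 373, 122, 84],
    [1, 417, 63, 65, 252],
    [271, 277, 412, 113, 500],
]
COL = 4  # column 5 in MATLAB's 1-based numbering


def subtract_from_90(rows, col=COL, limit=90):
    """Return a copy of rows where values below limit in col become limit - value."""
    out = [list(row) for row in rows]
    for row in out:
        if row[col] < limit:
            row[col] = limit - row[col]
    return out


# Equivalent of b = 90 - a(a(:,5)<90, 5): only the selected elements.
b = [90 - row[COL] for row in a if row[COL] < 90]
print(b)

# Equivalent of the in-place update on column 5.
updated = subtract_from_90(a)
print([row[COL] for row in updated])
```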