Show Unicode characters in PostScript - unicode

How do I get my PostScript program to show the G clef character from the Bravura font? According to this SMuFL document, the Unicode code point for a G (treble) clef in Bravura is U+E050 (see page 48, Clefs (U+E050–U+E07F)). The PostScript glyph name might be gClef (not sure).
Here is my best attempt so far to get the Unicode characters on the page. I am using Ghostscript 9.25 to produce a PDF. This is the output from Ghostscript:
GPL Ghostscript 9.25 (2018-09-13)
Copyright (C) 2018 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Scanning C:/Windows/Fonts for fonts... 550 files, 358 scanned, 337 new fonts.
Can't find (or can't open) font file %rom%Resource/Font/Calibri.
Can't find (or can't open) font file Calibri.
Loading Calibri font from C:/Windows/Fonts/calibri.ttf... 8525920 7081126 4118548 2767358 1 done.
Can't find (or can't open) font file %rom%Resource/Font/BravuraText.
Can't find (or can't open) font file BravuraText.
Loading BravuraText font from C:/Windows/Fonts/BravuraText.otf... 9545496 7985907 8185868 6762307 1 done.
GPL Ghostscript 9.25: Can't embed the complete font BravuraText as it is too large, embedding a subset.
Main
And this is the minimal PostScript program:
%!PS-Adobe-3.0
%%Title: unicode.ps
%%LanguageLevel: 3
%%EndComments
%%BeginProlog
userdict begin
%%EndProlog
%%BeginSetup
/mm { 25.4 div 72 mul } bind def
/A4Landscape [297 mm 210 mm] def
/PageSize //A4Landscape def
<< /PageSize PageSize >> setpagedevice
% ‘‘ReEncodeSmall’’ generates a new re-encoded font. It
% takes 3 arguments: the name of the font to be
% re-encoded, a new name, and an array of new character
% encoding and character name pairs (see the definition of
% the ‘‘scandvec’’ array below for the format of this
% array). This method has the advantage that it allows the
% user to make changes to an existing encoding vector
% without having to specify an entire new encoding
% vector. It also saves space when the character encoding
% and name pairs array is smaller than an entire encoding
% vector.
% Usage: /Times-Roman /Times-Roman-Scand scandvec new-font-encoding
/new-font-encoding { <<>> begin
/newcodesandnames exch def
/newfontname exch def
/basefontname exch def
/basefontdict basefontname findfont def % Get the font dictionary on which to base the re-encoded version.
/newfont basefontdict maxlength dict def % Create a dictionary to hold the description for the re-encoded font.
basefontdict
{ exch dup /FID ne % Copy all the entries in the base font dictionary to the new dictionary except for the FID field.
{ dup /Encoding eq
{ exch dup length array copy % Make a copy of the Encoding field.
newfont 3 1 roll put }
{ exch newfont 3 1 roll put }
ifelse
}
{ pop pop } % Ignore the FID pair.
ifelse
} forall
newfont /FontName newfontname put % Install the new name.
newcodesandnames aload pop % Modify the encoding vector. First load the new encoding and name pairs onto the operand stack.
newcodesandnames length 2 idiv
{ newfont /Encoding get 3 1 roll put}
repeat % For each pair on the stack, put the new name into the designated position in the encoding vector.
newfontname newfont definefont pop % Now make the re-encoded font description into a POSTSCRIPT font. Ignore the modified dictionary returned on the operand stack by the definefont operator.
end} def
/Calibri /TextFont [
16#41 /Scaron % A (/Scaron Š U+0160)
16#42 /quarternote % B U+2669
16#43 /musicalnote % C
16#44 /eighthnotebeamed % D
16#45 /musicalnotedbl % E
16#46 /beamedsixteenthnotes % F
16#47 /musicflatsign % G
16#47 /musicsharpsign % H U+266F
] new-font-encoding
% https://github.com/steinbergmedia/bravura
% The Unicode code point for a G (treble) clef in Bravura Text is U+E050
% http://www.smufl.org/files/smufl-0.9.pdf
% p48 Clefs (U+E050–U+E07F)
% U+E050 (and U+1D11E) gClef G clef
% http://www.jdawiseman.com/papers/trivia/character-entities.html
/Bravura /MusicFont [
16#41 /gClef % A
16#42 /quarternote % B U+2669
16#43 /musicalnote % C
16#44 /eighthnotebeamed % D
16#45 /musicalnotedbl % E
16#46 /beamedsixteenthnotes % F
16#47 /musicflatsign % G
16#47 /musicsharpsign % H U+266F
] new-font-encoding
/MusicFont findfont 48 scalefont setfont
%%EndSetup
%%BeginScript
%% Main
(Main\n) print
<<>>begin
/TextFont findfont 48 scalefont setfont
0 setgray
72 72 moveto
(#ABCDEFGHIJKL) show
0 72 translate
/MusicFont findfont 48 scalefont setfont
0 setgray
72 72 moveto
(#ABCDEFGHIJKL) show
end
showpage
%%EndScript
%%Trailer
%%EOF

The first question is how you are defining Bravura and Calibri. These fonts are not part of the standard Ghostscript installation, so they must be added in some fashion, possibly via fontconfig (on Linux), but I see from the path name that you are using Windows. How have you added the fonts?
Now you are (again from the back channel messages) loading TrueType fonts and using them as substitutes for missing PostScript fonts. That's a non-standard feature, so Ghostscript has to do a lot of guessing in order to try and create a Type 42 font (PostScript font with TrueType outlines) from a TrueType font. There's no guarantee it'll get it right, though it is pretty good these days.
By the way, this is nothing to do with Unicode :-)
In PostScript you use a character code for each character you want to display. In your case you have used 0x23 (#) followed by 0x41 (A) to 0x4C (L). When rendering the glyph, the interpreter takes the character code and looks up the Encoding at that position. Note that your re-encoding only replaces the entries from 0x41 to 0x47, so codes 0x48 to 0x4C keep whatever glyph names the base font's Encoding already held.
Let's think about your 'TextFont', which is Calibri. At position 0x41 in the Encoding you have the glyph name 'Scaron'. So the interpreter then consults the CharStrings dictionary of the font. The CharStrings dictionary contains key/value pairs; the key (in this case) is a name, and the value is an executable program which defines how to render the glyph.
So the interpreter looks for a key called /Scaron in the CharStrings dictionary, and then executes the program associated with it. If it can't find the key /Scaron, then it looks up the key /.notdef (all fonts are required to have a .notdef) and executes that instead.
You haven't actually said what you're getting out. I'm assuming there's a problem, because you've posted a question (which doesn't seem to contain any actual questions...) but you haven't said what it is. If you are getting hollow rectangles instead of the expected glyph, then that's because the interpreter is executing the /.notdef, which for TrueType fonts is often a rectangle (PostScript fonts often have a completely blank .notdef, but both font types can have anything they want).
In which case the problem is that you are using a glyph name (e.g. /musicalnote) which doesn't exist in the CharStrings dictionary. Unless the TrueType font had a POST table (most do not) then that's not surprising, because /musicalnote is a very non-standard name for a glyph.
If I add Calibri to fontmap.GS and then do:
%!
/Calibri findfont /CharStrings get {== ==} forall
Then I see many entries of the form:
0
/_6756
0
/_6689
These map the names (e.g. /_6756) to the TrueType GID. When using a TrueType font, Ghostscript needs the GID so that it can find the glyph program in the font's GLYF table. When defining a TrueType font for use as a Type 42, this is something Ghostscript has to try to create for itself (a real Type 42 font is defined with this dictionary as part of the font). How it achieves this is heuristic, i.e. it guesses a lot.
In this case the GID is 0, which is the TrueType reserved GID for the .notdef glyph, so these names will all map to the .notdef.
I also see a number of entries like:
4
/A
These (obviously) are the glyphs that you can use; in this case the name /A maps to GID 4. Checking the output, there are no names 'quarternote', 'musicalnote', etc. There is an Scaron, so I expect that your 'A' character (code 0x41) will render as a capital S with a caron accent. The remaining glyphs will render as empty squares, or nothing at all. Testing here shows (interestingly) a rectangle with a question mark in it.
Now it may be that the Calibri font contains the glyphs you want. If it does, then I'm afraid the only way to access them (from PostScript) is to identify the name that Ghostscript associated with the glyph. The same is true of the Bravura font.
A little PostScript programming (seems like you're more than competent to write this) would allow you to retrieve the CharStrings dictionary from the font, iterate through it, and build an array of all the names which have a non-zero value. You could then print a page (probably many pages) where you print a named glyph from the font, and under it print the name associated with that glyph. There's your map, now you can build an Encoding which maps the glyph name to the character code you want to use in your PostScript program to draw that glyph.
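The filtering and mapping step can be sketched as follows, in Python rather than PostScript to keep it short. The CharStrings contents below are hypothetical (the GID for Scaron in particular is invented), and the helper names usable_glyph_names and build_encoding are not part of any library.

```python
# Hypothetical CharStrings contents: glyph name -> TrueType GID.
# The values are made up; a real font would have thousands of entries.
charstrings = {"_6756": 0, "_6689": 0, "A": 4, "Scaron": 229, ".notdef": 0}

def usable_glyph_names(charstrings):
    """Return the sorted names whose GID is non-zero (GID 0 is .notdef)."""
    return sorted(name for name, gid in charstrings.items() if gid != 0)

def build_encoding(names, first_code=0x41):
    """Assign consecutive character codes to the usable glyph names."""
    return {first_code + i: name for i, name in enumerate(names)}

names = usable_glyph_names(charstrings)
print(names)
print(build_encoding(names))
```

The resulting code-to-name pairs are the kind of data that the array passed to new-font-encoding in the question expects.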
FWIW, when I try to use Bravura (which is an OpenType font, not a TrueType font) I get a syntax error while loading the font. Same for BravuraText.

Related

Transform a matrix of integers (0 to 30) to a matrix of emojis

I am working on transforming an image into a set of emojis, depending on how many colors there are. The math part is done. I have the matrix of numbers from 0 to 30, but I specifically need to convert the numbers into symbols, and I was thinking about emojis since they are so widely used nowadays.
My question is how am I supposed to read a matrix of integers from a file, transform the matrix of integers into a matrix with different emojis (eventually, from a list of my choice) and put the output in another text file, this time containing the emojis? Is that possible? I guess it should be, but how do I do that? Does anyone have any suggestions?
The problem that I face is actually with the emojis unicode, I don't seem to have success when it comes to receiving messages on the console in their case. I just get "? ?" instead of a smiley face. But that thing happens only for them, the ASCII characters seem to work a bit better. The problem with ASCII characters is that I need, again, expressive images instead of numbers or random pipe shapes.
Here is the code:
%make sure you have the "1234567.jpg" in the same location as the .m file
imdata = imread('1234567.jpg');
[X_no_dither,map] = rgb2ind(imdata,30,'nodither');
imshow(X_no_dither,map)
% and there I try to put the output in a text file
dlmwrite('result.txt',X_no_dither,'delimiter',"\t");
Ok, and the output in the text file is:
0 0 0 0 26 26 ... etc.
And I wonder how am I supposed to write the code in such a way that I will get emojis instead of numbers.
🤔 🤔 🤔 🤔 💖 💖 ... etc.
That's how I'd want the output to be like. But, from what I tried yesterday, I cannot print them without getting warnings/errors.
What you need to do is create a table with your 30 emojis (this documentation page might be helpful), then index into that table. I'm using the compose function as indicated in the page above, it should also be possible to copy-paste emojis into your M-file. If you don't see the emojis in MATLAB's console, change the font you're using.
For example:
table = [compose("\xD83D\xDE0E"),
"B", % find the utf16 encoding for your emojis, or copy-paste them in
"C",
"D",
...
];
output = table(X_no_dither + 1);
f = fopen('result.txt', 'wt');
for ii = 1:size(output, 1)
fprintf(f, '%s', output(ii, :));
fprintf(f, '\n');
end
fclose(f);
This will write the file out in UTF-16 format, which is what MATLAB uses. If you're on Windows this might work well for you. On other platforms you might want to save as UTF-8 instead, which can be accomplished by opening the file in UTF-8 mode:
f = fopen('result.txt', 'wt', 'native', 'UTF-8');
Note that, even if you don't manage to get the emojis shown in the MATLAB command window, opening the text file in an editor will show the emojis correctly.
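For comparison, the same table-lookup idea can be sketched in Python. This is an illustration under assumptions, not a translation of the MATLAB answer: the three-entry EMOJI_TABLE, the sample data and the output file name are all hypothetical, and a real table would need one entry for every index from 0 to 30.

```python
import os
import tempfile

# Hypothetical lookup table: position k holds the emoji for palette index k.
EMOJI_TABLE = [
    "\U0001F914",  # thinking face
    "\U0001F496",  # sparkling heart
    "\U0001F60E",  # smiling face with sunglasses
]

def parse_matrix(text):
    """Parse tab-delimited integers, one matrix row per line."""
    return [[int(cell) for cell in line.split("\t")]
            for line in text.splitlines() if line.strip()]

def to_emoji(rows, table):
    """Replace each integer with the table entry at that index."""
    return [[table[value] for value in row] for row in rows]

def write_rows(path, rows):
    """Write one space-separated line per row, explicitly encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(" ".join(row) + "\n")

sample = "0\t0\t1\n2\t1\t0\n"  # stands in for the contents of result.txt
emoji_rows = to_emoji(parse_matrix(sample), EMOJI_TABLE)
out_path = os.path.join(tempfile.gettempdir(), "emoji_result.txt")
write_rows(out_path, emoji_rows)
```

Python lists are indexed from 0, so the + 1 offset used in the MATLAB version is not needed here.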

Use of strtok function

Based on MATLAB's code for strtok (see end):
"Here’s a more advanced example that finds the first token in a character string. A token is a set of characters delimited by whitespace or some other character. Given one input, the function assumes a default delimiter of whitespace; given two, it lets you specify another delimiter if desired. It also allows for two possible output argument lists"
I have a few questions:
1) Is a delimiter specified at the beginning or end of a token?
So for example, if I wanted to find the section of a text which gave me a certain date and the whole text was: "I like the date april 10 because it is close to May Day". I imagine the token is "april 10" but the starting delimiter would be "a" and the ending delimiter would be a digit?
You see I am confused as to what a "delimiter" is exactly in context. In MATLAB I would normally probably write the token as (\w*\s\d*) in order to locate the date (april 10) in the text since I do not know what date it would be (what letter it starts with or the digits after it). But is a delimiter that whole "april 10" or just an "a" at the beginning? How would this help if I do not know what month it is (april, may, june, etc) or does it basically just work as a "find" command?
I ran the program shown in the picture and tried it with 'hello my friend' as the string and 'o' as the delimiter and it gives:
token=hell
remainder=o my friend
So basically I am getting the impression that delimiters are usually used at the end of fields or regions in order to specify where the new field/section (the remainder) begins. Basically, a delimiter is commonly a simple device of one (or maybe more) characters that indicates the start of a new field or datum, whereas a pattern such as (\d\w*... etc.) is used for more specific extractions like dates, where there is no comma or other specific indicator in front of it. Are these two observations correct?
BUT then when I run it using "hello my fri" as the delimiter (see "running it with delimiter"), it seems to arbitrarily select "I want to say hello my friend good man" as the remainder and "nd" as the token, which makes no sense, so I am wondering if there is a bug in this program or if it's just not set up to handle a delimiter that appears twice.
Also,
2) Can someone please explain why [9:13 32] is made the default for one input argument? If we're assuming whitespace is the delimiter, then what does that [9:13 32] mean?
3) Is there any purpose to using "any", since it is run inside a loop? Wouldn't it check each iteration anyway?
function [token, remainder] = strtok(string, delimiters)
%STRTOK Find token in string.
% TOKEN = STRTOK(STR) returns the first token in the string STR delimited
% by white-space characters. STRTOK ignores any leading white space.
% If STR is a cell array of strings, TOKEN is a cell array of tokens.
%
% TOKEN = STRTOK(STR,DELIM) returns the first token delimited by one of
% the characters in DELIM. STRTOK ignores any leading delimiters.
% Do not use escape sequences as delimiters. For example, use char(9)
% rather than '\t' for tab.
%
% [TOKEN,REMAIN] = STRTOK(...) returns the remainder of the original
% string.
%
% If the body of the input string does not contain any delimiter
% characters, STRTOK returns the entire string in TOKEN (excluding any
% leading delimiter characters), and REMAIN contains an empty string.
%
% Example:
%
% s = ' This is a simple example.';
% [token, remain] = strtok(s)
%
% returns
%
% token =
% This
% remain =
% is a simple example.
%
% See also ISSPACE, STRFIND, STRNCMP, STRCMP, TEXTSCAN.
% Copyright 1984-2009 The MathWorks, Inc.
if nargin<1
error(message('MATLAB:strtok:NrInputArguments'));
end
token = ''; remainder = '';
len = length(string);
if len == 0
return
end
if (nargin == 1)
delimiters = [9:13 32]; % White space characters
end
i = 1;
while (any(string(i) == delimiters))
i = i + 1;
if (i > len),
return,
end
end
start = i;
while (~any(string(i) == delimiters))
i = i + 1;
if (i > len),
break,
end
end
finish = i - 1;
token = string(start:finish);
if (nargout == 2)
remainder = string(finish + 1:length(string));
end
EDIT: I was not aware that strtok was a built in function. I was under the assumption it was a UDF the textbook was building as an example. This is why there are many ambiguities since the book does not specify clearly what the function does.
This, for example, was not specified in the text which only stated the function found the first token in a character string. --> token = strtok(str) parses input character vector str from left to right, returning part or all of that character vector in token. Using the white-space character as a delimiter, the token output begins at the start of str, skipping any delimiters that might appear at the start, and includes all characters up to either the next delimiter or the end of the character vector. White-space characters include space (ASCII 32), tab (ASCII 9), and carriage return (ASCII 13).
strtok is very much not going to help you here so I'm not going to answer your main question. I think you should use regular expression for this but I don't speak regex so I'll leave that to someone else.
[9:13 32]
Why is the default delimiter set to [9:13 32]? From the comments, MATLAB is claiming that those are all the white space characters. In other words, the numbers 9, 10, 11, 12, 13 and 32 are the ASCII values for white space characters. For example, 32 is the value of a space. Prove this to yourself by casting one to an integer:
uint8(' ') % or even ' ' + 0
I don't know what all the others are but I'm pretty sure one must be the tab character. To check the ASCII value of a tab character you can do
uint8(sprintf('\t'))
which returns 9 which is indeed in the list.
So [9:13 32] is a list of all the white space characters, as the comment implies.
Actually there are many more white space characters that this doesn't cover: https://en.wikipedia.org/wiki/Whitespace_character
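As a quick cross-check outside MATLAB, the short Python sketch below lists the characters behind the codes 9 to 13 and 32. It only inspects the code points themselves and says nothing about how MATLAB's strtok is implemented.

```python
# The default delimiter codes from strtok: 9, 10, 11, 12, 13 and 32.
codes = list(range(9, 14)) + [32]

for code in codes:
    # repr() shows the escape sequence: tab, line feed, vertical tab,
    # form feed, carriage return and space, in that order.
    print(code, repr(chr(code)), chr(code).isspace())
```

Codes 10, 11 and 12 are line feed, vertical tab and form feed respectively, which fills in the entries not identified above.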
any
When you say any, I'm assuming you mean in lines like this: any(string(i) == delimiters). So yes, the loop ensures that only one character of string is compared at a time. However, there can be multiple values in delimiters, for example all the white space characters as mentioned above, or maybe you called strtok like this:
strtok('I like the date...', 'ad')
now both 'a' and 'd' are used as delimiters and so it returns
'I like the '
because it hit a 'd' first.
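To make the role of any concrete, here is a minimal Python re-implementation of the same scanning logic. It is a sketch that mirrors the loop structure of the listing in the question, not MathWorks' code; the membership test text[i] in delimiters plays the role of any(string(i) == delimiters), and every character of the delimiter string counts as a separate delimiter.

```python
def strtok(text, delimiters=" \t\n\v\f\r"):
    """Return (token, remainder), treating each character of
    `delimiters` as an individual delimiter."""
    i = 0
    n = len(text)
    # Skip leading delimiters.
    while i < n and text[i] in delimiters:
        i += 1
    start = i
    # Advance until the next delimiter or the end of the text.
    while i < n and text[i] not in delimiters:
        i += 1
    # The remainder starts at the delimiter that ended the token.
    return text[start:i], text[i:]

print(strtok("hello my friend", "o"))
print(strtok("I like the date...", "ad"))
```

Passing a multi-character string such as "hello my fri" therefore does not search for that phrase; it splits at the first occurrence of any one of its characters.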

Matlab not recognising comment characters

I downloaded a Matlab code sample online and the comments are showing as strange characters. as shown below
% Ïðîãðàììà ïîèñêà íîìåðîâ àâòîìîáèëåé è ðàñïîçíàâàíèÿ
% áóêâ è öèôð íîìåðà ïðè èñïîëüçîâàíèè íåéðîííûõ ñåòåé
% Íîìåð îòîáðàæàåòñÿ ïîñëå ãîëîñîâàíèÿ
function Detection_Recognition1()
clear
clc
% Îòêðûòèå ôàéëà
video=mmreader('car10.avi'); % 2 4 5 6" 7 8 9 10 11 12
% Íåêîòîðûå ñâîéñòâà âèäåî
width=video.Width; % Øèðèíà êàäðà
height=video.Height; % Âûñîòà êàäðà
frameRate=video.FrameRate; % Ñêîðîñòü êàäðîâ â ñåê.
numOfFrames=video.NumberOfFrames; % Êîëè÷åñòâî êàäðîâ â âèäåî ôàéëå
% ×òåíèå äèàïàçîíà êàäðîâ (íóìåðóþòñÿ ñ 1)
Range=[1 numOfFrames]; % Äèàïàçîí êàäðîâ
frames=read(video,Range);
sizFrames=size(frames);
I tried opening it on Windows and Linux too and the same gibberish comes out. What could cause this and how can it be converted to ASCII?
MATLAB always encodes source files in the operating system's default character set; in this case it was Cyrillic (Windows).
If you speak Russian (or whatever language it is) and intend to read the comments, you may write a small script based on this to change the encoding to Unicode. The same is possible using a batch file.
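A minimal sketch of such a conversion in Python, assuming the file was saved as Windows-1251 (Cyrillic) and the garbled text seen in the question is those bytes displayed as Latin-1. Both encodings are assumptions about this particular file, and fix_mojibake is a hypothetical helper name.

```python
def fix_mojibake(text, shown_as="latin-1", really="cp1251"):
    """Recover text whose bytes were decoded with the wrong character set.

    Re-encoding with the wrong charset restores the original bytes,
    which are then decoded with the charset they were written in.
    """
    return text.encode(shown_as).decode(really)

# First word of the first comment line in the question.
fixed = fix_mojibake("\u00cf\u00f0\u00ee\u00e3\u00f0\u00e0\u00ec\u00ec\u00e0")
# fixed is now the Russian word for "program".
```

For a whole file, reading it with open(path, encoding="cp1251") and writing it back with encoding="utf-8" achieves the same thing without the intermediate garbled form.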

Char (non ascii) in Matlab

I have three characters (bigger than 127) and I need to write it in a binary file.
For some reason, MATLAB and PHP/Python tends to write a different characters.
For Python, I have:
s = chr(143)+chr(136);
f = open('pythonOut.txt', 'wb');
f.write(s);
f.close();
For MATLAB, I have:
s = strcat(char(143),char(136));
fid = fopen('matlabOut.txt');
fwrite(fid, s, 'char');
fclose(fid);
When I compare these two files, they're different. (using diff and/or cmp command).
Moreover, when I do
disp(char(hex2dec('88'))) //MATLAB prints
print chr(0x88) //PYTHON prints ˆ
Both outputs are different. I want to make my MATLAB code same as Python. What's wrong with MATLAB code?
You're trying to display extended ASCII characters, i.e. symbols with a character code greater than 127. MATLAB does not use extended ASCII internally; it uses 16-bit Unicode instead.
If you want to write the same values as in the Python script, use native2unicode to obtain the characters you want. For example, native2unicode(136) returns ^.
The fact that the two files are different seems obvious; chr(134) is obviously different from char(136) :)
In MATLAB, only the first 128 characters (codes 0 to 127) correspond to (non-extended) ASCII; anything after that is UTF-16.
In Python, the first 256 characters (codes 0 to 255) correspond to (extended) ASCII (use unichr() for Unicode).
However, unicode 0x88 is the same as the extended ASCII 0x88 (as are most of the others). The reason Matlab does not display it correctly, is due to the fact that the Matlab command window does not treat Unicode very well by default, while Python (running in a terminal or so I presume) usually does a better job.
Try changing the font in Matlab's command window, or starting Matlab in a terminal and print the 0x88 character; it should be the same.
In any case, the output of the characters to file should not result in any difference; it is just a display issue.
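The distinction between the bytes in the file and the way they are displayed can be sketched in Python 3 (the question uses Python 2 syntax). The file name and the write_bytes helper are hypothetical; the point is only that the stored values are 143 and 136 regardless of which character set is later used to show them.

```python
import os
import tempfile

def write_bytes(path, values):
    """Write integer byte values (0 to 255) to a file in binary mode."""
    with open(path, "wb") as f:
        f.write(bytes(values))

path = os.path.join(tempfile.gettempdir(), "pythonOut.bin")
write_bytes(path, [143, 136])

with open(path, "rb") as f:
    data = f.read()

print(list(data))                          # the raw byte values
# The same byte 0x88 maps to different characters in different charsets.
print(ascii(data[1:2].decode("latin-1")))  # a C1 control character
print(ascii(data[1:2].decode("cp1252")))   # modifier letter circumflex accent
```

Comparing the files byte by byte (for example with cmp) is therefore a more reliable check than comparing what a console prints.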

Using Greek alphabet (or any non-ANSI alphabet) as variable names in MATLAB

Is it possible using Greek alphabet to represent variables in MATLAB?
For example, I'd like to use the Greek character epsilon as a variable in MATLAB. I tried to insert \epsilon but I received an error.
It is not possible.
I refer to the following part of Matlab documentation:
Valid Names
A valid variable name starts with a letter, followed by letters,
digits, or underscores. MATLAB is case sensitive, so A and a are not
the same variable. The maximum length of a variable name is the value
that the namelengthmax command returns.
Letter is defined as ANSI character between a-z and A-Z.
For example, the following call with the Hebrew letter Aleph returns false (in MATLAB R2018a it returns true):
isletter('א')
By the way, you can always check whether your variable name is fine by using genvarname.
genvarname('א')
ans =
x0x1A
While Andrey's answer is true for variable names it's a different story for figures.
title('\epsilon\omega') will actually work and generate an epsilon and an omega as the title (although the MATLAB font replaces them with different symbols). If you export the figure as an EPS or PDF file you will see that the title really is epsilon omega. This works because text objects interpret a subset of TeX markup by default, so the TeX names of the Greek letters are available.
Same is true for all the figure text objects such as legends and axis labels.