Unicode Character in flex? - unicode

I have a simple question about two Unicode characters which I want to use in my programming language. For an assignment I want to use the old APL symbols ← and →.
My flex-file (snazzle.l) looks like the following:
/** phi#gress.ly 2017 **/
/** parser for omni programming language. **/
%{
#include <iostream>
using namespace std;
#define YY_DECL extern "C" int yylex()
int linenum = 0;
%}
%%
[\n] {++linenum;}
[ \t] ;
[0-9]+\.[0-9]+([eE][+-]?[0-9]+)? { cout << linenum << ". Found a floating-point number: " << yytext << endl; }
\"[^\"]*\" { cout << linenum << ". Found string: " << yytext << endl; }
[0-9]+ { cout << linenum << ". Found an integer: " << yytext << endl; }
[a-zA-Z0-9]+ { cout << linenum << ". Found an identifier: " << yytext << endl; }
([\←])|([\→])|(:=)|(=:) { cout << linenum << ". Found assignment operator: " << yytext <<endl; }
[\;] { cout << linenum << ". Found statement delimiter: " << yytext <<endl; }
[\[\]\(\)\{\}] { cout << linenum << ". Found parentheses: " << yytext << endl; }
%%
int main() {
    // lex through the input:
    yylex();
}
When I "snazzle" the following input:
x → y;
I get the assignment character a) wrong and b) three (3) times:
0. Found an identifier: x
0. Found assignment operator: �
0. Found assignment operator: �
0. Found assignment operator: �
0. Found an identifier: y
0. Found statement delimiter: ;
How can I add ← and → as possible flex-characters?

Flex produces eight-bit clean scanners; that is, it can handle any input consisting of arbitrary octets. It knows nothing about UTF-8 or Unicode codepoints, but that doesn't stop it from recognizing a Unicode input character as a sequence of octets (not a single character). Which sequence it will be depends on which Unicode encoding you are using, but assuming that your files are UTF-8, → will be the three bytes e2 86 92 and ← will be e2 86 90.
You don't actually have to know that, however; you can just put the UTF-8 sequence into your flex pattern. You don't even need to quote it, although it is probably a good idea because it will prove less confusing if you end up using regular expression operators. Here I really mean quote it, as in "←". \← will not do what you expect, because the \ only applies to the next octet (as I said, flex knows nothing about Unicode encodings), which is only the first of the three bytes in that symbol. In other words, "←"? really means "an optional left-arrow", while \←? means "the two octets \xE2 \x86 optionally followed by \x90". I hope that's clear.
Flex character classes are not useful for Unicode sequences (or any other multi-character sequence) because a character class is a set of octets. So if you write [←], flex will interpret that as "one of the octets \xE2, \x86 or \x90". [Note 1]
Notes
It is rarely necessary to backslash-escape characters inside flex character classes; the only character which must be backslash-escaped is the backslash itself. It is not an error to escape characters which don't need escaping, so flex won't complain about it, but it makes the character classes hard for humans to read (at least, for this human to read). So [\←] means exactly the same as [←] and you could write [\[\]\(\)\{\}] as [][)(}{]. (] does not close a character class if it is the first character in the class, so it is conventional to write parentheses "face-to-face").
It is also not necessary to parenthesize character sequences inside alternatives, so you could write ([\←])|([\→])|(:=)|(=:) as ←|→|:=|=:. Or, if you prefer, "←"|"→"|":="|"=:". Of course, you wouldn't usually do that, since the scanner normally informs the parser about each individual operator. If your intention is to make ← a synonym of :=, then you would probably end up with:
←|:= { return LEFT_ARROW; }
→|=: { return RIGHT_ARROW; }
Rather than inserting printf actions in your scanner specification, you would be better off asking flex to put your scanner in debug mode. That is as simple as adding -d to the flex command line when you are building your scanner. See the flex manual section on debugging for more details.

Related

How to add / remove 1 Byte of control character(s) STX & EXT in swift

I am writing code where I want to add 1 byte of STX at the start and 1 byte of ETX at the end of a Swift string.
Not sure how to do it.
Example:
<1B>---<3B>--<1B>-<1B>---<3B>--<1B>
<STX><String><ETX><STX><String><ETX>
Where 1B = 1 Byte & 3B = 3 Byte
STX= Start of Text
ETX= End of Text
You could just use the special characters in string literals. Since the ASCII control codes STX and ETX have no dedicated escape sequence, you can use their Unicode scalar values 0x02 and 0x03.
Directly in the string literal
You can construct the string using a string literal, if needed with interpolation:
let message = "\u{02}...\u{03}\u{02}xyz\u{03}"
You can cross-check by printing the numeric value of each byte:
for c in message.utf8 {
    print(c)
}
Concatenating the string
You can also define some string constants:
let STX = "\u{02}"
let ETX = "\u{03}"
And build the string by concatenating parts. In this example, field1 and field2 are string variables of arbitrary length that are padded or truncated to exactly 3 characters:
let message2 = STX + field1.padding(toLength: 3, withPad: " ", startingAt: 0) + ETX + STX + field2.padding(toLength: 3, withPad: " ", startingAt: 0) + ETX
print(message2)
for c in message2.unicodeScalars {
    print(c, c.value)
}

Garbage characters printed by vscode [duplicate]

Every time I use the terminal to print a string or any kind of character, it automatically prints a "%" at the end of each line. This happens every time I try to print something from C++ or PHP; I haven't tried other languages yet. I think it might be something with vscode, and I have no idea how it came about or how to fix it.
#include <iostream>
using namespace std;
int test = 2;
int main()
{
    if (test < 9999) {
        test = 1;
    }
    cout << test;
}
Output:
musti#my-mbp clus % g++ main.cpp -o tests && ./tests
1%
Also, changing the cout from cout << test; to cout << test << endl; removes the % from the output.
Are you using zsh? A line without endl is considered a "partial line", so zsh shows a color-inverted % then goes to the next line.
When a partial line is preserved, by default you will see an inverse+bold character at the end of the partial line: a ‘%’ for a normal user or a ‘#’ for root. If set, the shell parameter PROMPT_EOL_MARK can be used to customize how the end of partial lines are shown.
More information is available in their docs.

Create Unicode from a hex number in C++

My objective is to take a character which represents the UK pound symbol and convert it to its Unicode equivalent in a string.
Here's my code and output so far from my test program:
#include <iostream>
#include <stdio.h>
int main()
{
    char x = 163;
    unsigned char ux = x;
    const char *str = "\u00A3";
    printf("x: %d\n", x);
    printf("ux: %d %x\n", ux, ux);
    printf("str: %s\n", str);
    return 0;
}
Output
$ ./pound
x: -93
ux: 163 a3
str: £
My goal is to take the unsigned char 0xA3 and put it into a string representing the unicode UK pound representation: "\u00A3"
What exactly is your question? Anyway, you say you're writing C++, but you're using char*, printf and stdio.h, so you're really writing C, and plain C has no built-in notion of Unicode. Remember that a char in C is not a "character", it's just a byte, and a char* is not an array of characters, it's an array of bytes. When you write "\u00A3" in your source, the compiler translates that universal character name into the execution character set at compile time; on a typical UTF-8 system the string ends up containing the two bytes 0xC2 0xA3. When you printf that string, you are simply writing those bytes out, and your terminal decodes them and displays the £ character. You can see this for yourself: printing str[0] on its own emits only the first byte of the sequence, which your terminal cannot display as a complete character.
If you want to handle Unicode properly in C you'll need to use a library. There are many to choose from and I haven't used any of them enough to recommend one. Or you'll need to use C++11 or newer and use std::wstring and friends. But what you are doing is not real Unicode handling and will not work as you expect in the long run.

lex program to count the Number of Words

I made the following lex program to count the number of words in a text file. A 'word' for me is any string that starts with a letter and is followed by zero or more letters, digits, or underscores.
%{
int words;
%}
%%
[a-zA-Z][a-zA-Z0-9_]* {words++; printf("%s %d\n",yytext,words);}
. ;
%%
int main(int argc, char* argv[])
{
    if (argc == 2)
    {
        yyin = fopen(argv[1], "r");
        yylex();
        printf("No. of Words : %d\n", words);
        fclose(yyin);
    }
    else
        printf("Invalid No. of Arguments\n");
    return 0;
}
The problem is that for the following text file, I am getting "No. of Words : 13". I tried printing yytext and it shows that it is taking 'manav' from '9manav' as a word even though it does not match my definition of a word.
I also tried including [0-9][a-zA-Z0-9_]* ; in my code but it still shows the same output. I want to know why this is happening and possible ways to avoid it.
Textfile : -
the quick brown fox jumps right over the lazy dog cout for
9manav
-99-7-5 32 69 99 +1
First, manav perfectly matches your definition of a word. The 9 in front of it is matched by the . rule. Remember that whitespace is not special in lex.
You had the right idea by adding the rule [0-9][a-zA-Z0-9_]* ;. Note that (f)lex always takes the longest possible match at each point, so that rule will swallow all of 9manav as a single token no matter where you place it; rule order only acts as a tie-breaker when two patterns match input of the same length. If adding the rule really left the output unchanged, double-check that it sits in the rules section between the two %% lines and that you regenerated and recompiled the scanner.

format specifier for long double (I want to truncate the 0's after decimal)

I have a 15-digit floating-point number and I need to truncate the trailing zeros after the decimal point. Is there a format specifier for that?
%Lg is probably what you want: see http://developer.apple.com/library/ios/#DOCUMENTATION/System/Conceptual/ManPages_iPhoneOS/man3/printf.3.html.
Unfortunately in C there is no format specifier that seems to meet all your requirements. %Lg is the closest, but as you noted it switches to scientific notation at its discretion. %Lf won't work by itself because it won't remove the trailing zeroes.
What you're going to have to do is print the fixed format number to a buffer and then manually remove the zeroes with string editing (which can STILL be tricky if you have rounding errors and numbers like 123.100000009781).
Is this what you want:
#include <iostream>
#include <iomanip>
int main()
{
    double doubleValue = 78998.9878000000000;
    std::cout << std::setprecision(15) << doubleValue << std::endl;
}
Output:
78998.9878
Note that trailing zeros after the decimal point are truncated!
Online Demo : http://www.ideone.com/vRFlQ
You could print the format specifier as a string, filling in the appropriate amount of digits if you can determine how many:
sprintf(fmt, "%%.%dlf", digits);
printf(fmt, number);
or, just checking trailing 0 characters:
char buf[64];
sprintf(buf, "%.15Lf", 2.123L);    /* a long double wants the L length modifier */
truncate(buf);
printf("%s", buf);

void truncate(char *s) {
    if (strchr(s, '.') == NULL)        /* no decimal point: nothing to trim */
        return;
    int i = strlen(s);
    while (s[--i] == '0' && i != 0);   /* walk back over trailing zeros */
    if (s[i] == '.') --i;              /* drop a bare trailing point too */
    s[i + 1] = '\0';
}
%.15g — the 15 being the maximum number of significant digits required in the string (not the number of decimal places)
1.012345678900000 => 1.0123456789
12.012345678900000 => 12.0123456789
123.012345678900000 => 123.0123456789
1234.012345678900000 => 1234.0123456789
12345.012345678900000 => 12345.0123456789
123456.012345678900000 => 123456.012345679