How to insert 0xFF into scanf("%s")

Is it possible to insert a byte with value 0xFF into a string read through scanf()?
I tried Ctrl+Alt+U followed by FF, but that gives the ÿ (yuml) symbol, i.e. Unicode U+00FF. When I use GDB to inspect the inserted byte, I don't get 0xFF, and the symbol takes up two bytes. (In a UTF-8 terminal, U+00FF is encoded as the two bytes 0xC3 0xBF.)
I also know that I can pipe the output of printf("\xFF") into my program, as shown in this post.
However, I would like to see if I can enter that byte manually.
I am using Linux with a console interface.
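
For reference, a minimal C sketch (the buffer name and size are illustrative) that dumps the bytes scanf() actually received, so you can check exactly what the terminal delivered:

#include <stdio.h>

int main(void) {
    char buf[64];
    /* Read one whitespace-delimited token, as in the question. */
    if (scanf("%63s", buf) == 1) {
        /* Dump each byte in hex to see exactly what arrived. */
        for (unsigned char *p = (unsigned char *)buf; *p; p++)
            printf("%02x ", *p);
        putchar('\n');
    }
    return 0;
}

Piped input delivers the raw byte: printf '\xff' | ./a.out prints ff. Typed input goes through the terminal's encoding, so entering ÿ (U+00FF) in a UTF-8 terminal delivers the two bytes c3 bf, which matches the two bytes observed in GDB.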

Related

How to write Chinese / multi-byte characters in ESC/POS?

I would like to know how to write Chinese / multi-byte characters in ESC/POS.
There is a reference table here:
https://reference.epson-biz.com/modules/ref_charcode_en/index.php?content_id=110
And a guide to how to read the table:
https://reference.epson-biz.com/modules/ref_charcode_en/index.php?content_id=4
which includes a diagram accompanied by this text:
The first column shows the character code of the first character in the row. The first row shows the value to be added for the character code in the column.
However, I am struggling to understand how to read this table, even with this diagram.
For example, how do I write the symbol 唖?
It can be found here:
https://reference.epson-biz.com/modules/ref_charcode_en/index.php?content_id=111
in the first row, second column.
The JIS code is 30-20,
the S-JIS code is 88-9E,
and the additional value is 2.
So what is the byte value of the character?
Multibyte character codes are 16-bit big-endian values.
In JIS, the code for 唖 is 0x3020 + 0x0002, which is 0x3022, and in ShiftJIS, 0x889E + 0x0002, which is 0x88A0.
The byte array is 0x30, 0x22 in JIS and 0x88, 0xA0 in Shift JIS, respectively.
By the way, the table and character code are Japanese, not Chinese.
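As a small C sketch of the arithmetic above (the base codes are the ones quoted from the Epson tables):

#include <stdio.h>

int main(void) {
    /* Row base code plus the column's additional value gives the
       16-bit character code; split it big-endian into two bytes. */
    unsigned int jis  = 0x3020 + 0x0002;   /* 0x3022 */
    unsigned int sjis = 0x889E + 0x0002;   /* 0x88A0 */
    printf("JIS bytes:       %02X %02X\n", jis >> 8, jis & 0xFF);    /* 30 22 */
    printf("Shift JIS bytes: %02X %02X\n", sjis >> 8, sjis & 0xFF);  /* 88 A0 */
    return 0;
}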

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-5: truncated \UXXXXXXXX escape

Using AutoKey 95.8 (Python 3 version) on Linux Mint 19.3, I have a series of keyboard macros which generate Unicode characters. This example works:
# alt+shift+a = á
import sys
char = "\u00E1"
keyboard.send_keys(char)
sys.exit()
But the attempt to print an em dash [—] generates the following error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-5: truncated \UXXXXXXXX escape
# alt+shift+- = —
import sys
char = "\u2014"
keyboard.send_keys(char)
sys.exit()
Any idea how to overcome this problem in Autokey is greatly appreciated.
The code you posted above would not generate the error you are getting: "truncated \UXXXXXXXX" points to an uppercase \U, which requires exactly 8 hex digits. If you put char = "\U2014" in your Python source, you will get that error message (and you probably did get it while experimenting with the file in this way).
The sequence char = "\u2014" will create an em dash Unicode character on the Python side, but that does not mean it is possible to send it as a keyboard symbol via AutoKey to the windowing system. That is the point where your program is likely failing (and since there is no programming error, you won't get a Python error message; it simply won't work, although AutoKey might be nice and print an appropriate error message in this case).
You'd have to look up how to type an arbitrary Unicode character in your OS configuration (on Linux Mint it should be in the docs for "wayland", I guess) and send that character-composing sequence through AutoKey instead. If there is no such sequence, find a way to copy the desired character to the desktop environment's clipboard and then send AutoKey the "paste" sequence (usually Ctrl+V, though it varies by app; terminal emulators use Ctrl+Shift+V, for example).
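For what it's worth, the digit rule behind that error is not Python-specific; C and C++ string literals follow the same convention, as this small C sketch shows:

#include <stdio.h>

int main(void) {
    /* \u takes exactly 4 hex digits, \U exactly 8 -- the same rule
       Python applies when it reports "truncated \UXXXXXXXX escape". */
    const char *em1 = "\u2014";      /* em dash, U+2014 */
    const char *em2 = "\U00002014";  /* the same character, long form */
    /* const char *bad = "\U2014"; -- will not compile: \U needs 8 digits */
    printf("%s %s\n", em1, em2);
    return 0;
}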
When you need to emit characters outside the US English layout in AutoKey, you have two choices. The simplest is to put them into the clipboard with clipboard.fill_clipboard(your characters) and paste them into the window using keyboard.send_keys("<ctrl>+v"). This almost always works.
If you need to define a phrase with multibyte characters in it, select the Paste using Clipboard (Ctrl+V) option. (I'm trying to get that to be the default option in a future release.)
The other choice, which I'm still not quite sure of, is to send the Unicode escape sequence directly to the window, letting it convert the sequence into the actual Unicode character, something like keyboard.send_keys("\U2014"). Assigning that to a variable first, as in the question, creates the actual Unicode character, which that API call can't handle correctly.
The problem is that the underlying code for keyboard.send_keys() wants to send keycodes that actually exist on your keyboard, or ones it can attach to an unused key in your layout. Most of the time that doesn't work for anything multibyte.

Change of char encoding in Eclipse

I am working on an assignment where I need to XOR the bits of each char of a given text, for example weird characters like '��'.
When trying to save, Eclipse prompts that "Some characters cannot be mapped with Cp1252...", after which I can choose to save as UTF-8.
My knowledge of character encoding is quite fuzzy; wouldn't saving to UTF-8 change the bits? If so, how may I instead work with the original message (original bits) to XOR them and do my assignment?
Thanks!
I am assuming you are using Java in this answer.
The file encoding only changes how the data is represented in the file. When you read the file again (using the correct encoding), it will be converted back to Unicode in your String, so the program will see the same bits.
The Cp1252 encoding can only represent a small number of characters (fewer than 256), compared to the 113,021 characters in Unicode 7, all of which can be encoded with UTF-8.
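As a small C sketch of what "same character, different file bytes" means here, using é (U+00E9) as the example (the byte values are the standard Cp1252 and UTF-8 encodings of that character):

#include <stdio.h>

int main(void) {
    /* The same character, U+00E9 'é', as stored under two encodings. */
    unsigned char cp1252[] = { 0xE9 };        /* one byte in Cp1252 */
    unsigned char utf8[]   = { 0xC3, 0xA9 };  /* two bytes in UTF-8 */

    /* Decoding each sequence with its own encoding recovers the same
       codepoint, so a program reading the file back with the correct
       charset sees identical character data either way. */
    unsigned int a = cp1252[0];                                    /* Cp1252: 0xE9 -> U+00E9 */
    unsigned int b = ((utf8[0] & 0x1Fu) << 6) | (utf8[1] & 0x3Fu); /* UTF-8 two-byte decode */

    printf("Cp1252 -> U+%04X, UTF-8 -> U+%04X\n", a, b);
    return 0;
}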

How to print escaped hexadecimal in a string in C++?

I have questions related to Unicode, printing escaped hexadecimal values in const char*.
From what I have understood, UTF-8 includes 2-, 3-, and 4-byte characters, ranging from the pound symbol to kanji characters. Within string literals these are represented as hexadecimal values using the \u escape sequence. I have also understood that when a hexadecimal escape appears in a string, every following character that can be part of the escape is consumed by it; for example, in "abc\x0f0dab" the whole run 0f0dab is treated as the \x value, even if you only wanted 0f0d.
Now suppose I want to write the Unicode string "abc𤭢def₤ghi", where the codepoint of 𤭢 is 0x24B62 and that of ₤ is 0x00A3. I would have to compose the string as "abc0x24B62def0x00A3ghi", and the 0x would consume every character that could belong to it. So if I want to print "abc𤭢62", the string becomes "abc0x24B6262": won't the whole run be taken as the single value 0x24B6262? How do I solve this and print "abc𤭢62" rather than abc(0x24B6262)?
I also have a string const char* tmp = "abc\x0fdef";. When I print it with printf("\n string = %s", tmp);, it prints abcdef. Where did the 0f go? I know the value of \x0f, i.e. decimal 15, is stored in the string, so shouldn't 15 be printed? That is, shouldn't the output be "abc15def" rather than "abcdef"?
I think you may be unfamiliar with the concept of encodings, from reading your post.
For instance, you say "unicode of ... ₤ is 0x00A3". That is true: Unicode codepoint U+00A3 is the pound sign. But 0x00A3 is not how you represent the pound sign in, for example, UTF-8 (a particularly common encoding of Unicode). Take a look here to see what I mean. As you can see, the UTF-8 encoding of U+00A3 is the two bytes 0xc2, 0xa3 (in that order).
There are several things that happen between your call to printf() and when something appears on your screen.
First, your program runs the code printf("abc\x0fdef"), which means the following bytes, in this order, are written to your program's stdout:
0x61, 0x62, 0x63, 0x0f, 0x64, 0x65, 0x66
Note: I'm assuming your source code is ASCII (or UTF-8), which is very common. Technically, the interpretation of your source code's character set is implementation-defined, I believe.
Now, in order to see output, you will typically be running this program inside some kind of shell, and it has to eventually transform those bytes into visual characters. It does this by using an encoding. Again, something ASCII-compatible is common, such as UTF-8. On Windows, CP1252 is common.
And if that is the case, you get the following mapping:
0x61 - a
0x62 - b
0x63 - c
0x0f - the 'shift in' ASCII control code
0x64 - d
0x65 - e
0x66 - f
This prints out as "abcdef" because the 'shift in' control code is a non-printing character.
Note: The above can change depending on what exact character sets are involved, but ASCII or UTF-8 is very likely what you're dealing with unless you have an exotic setup.
If you have a UTF-8 compatible terminal, the following should print out "abc₤def", just as an example to get you started:
printf("abc\xc2\xa3def");
Make sense?
Update: To answer the question from your comment: you need to distinguish between a codepoint and the byte values for an encoding of that codepoint.
The Unicode standard defines 'codepoints', which are numerical values for characters. These are commonly written as U+XYZ, where XYZ is a hexadecimal value.
For instance, the character U+219e is LEFTWARDS TWO HEADED ARROW.
This might also be written 0x219e. You would know from context that the writer is talking about a codepoint.
When you need to encode that codepoint (to print, or save to file, etc), you use an encoding, such as UTF-8. Note, if you used, for example, the UTF-32 encoding, every codepoint corresponds exactly to the encoded value. So in UTF-32, the codepoint U+219e would indeed be encoded simply as 0x219e. But other encodings will do things differently. UTF-8 will encode U+219e as the three bytes 0xE2 0x86 0x9E.
Lastly, the \x notation is simply how you write arbitrary byte values inside a C/C++ quoted string. If I write, in C source code, "\xff", then that string in memory will be the two bytes 0xff 0x00 (since it automatically gets a null terminator).
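To make the codepoint-versus-encoding distinction concrete, here is a sketch of the standard UTF-8 encoding algorithm in C (utf8_encode is our own helper, not a library function):

#include <stdio.h>

/* Encode one codepoint as UTF-8; returns the byte count (0 if out of range). */
static int utf8_encode(unsigned long cp, unsigned char out[4]) {
    if (cp < 0x80)     { out[0] = cp; return 1; }
    if (cp < 0x800)    { out[0] = 0xC0 | (cp >> 6);
                         out[1] = 0x80 | (cp & 0x3F); return 2; }
    if (cp < 0x10000)  { out[0] = 0xE0 | (cp >> 12);
                         out[1] = 0x80 | ((cp >> 6) & 0x3F);
                         out[2] = 0x80 | (cp & 0x3F); return 3; }
    if (cp < 0x110000) { out[0] = 0xF0 | (cp >> 18);
                         out[1] = 0x80 | ((cp >> 12) & 0x3F);
                         out[2] = 0x80 | ((cp >> 6) & 0x3F);
                         out[3] = 0x80 | (cp & 0x3F); return 4; }
    return 0;
}

int main(void) {
    unsigned char buf[4];
    int n = utf8_encode(0x219E, buf);  /* LEFTWARDS TWO HEADED ARROW */
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);       /* prints: E2 86 9E */
    putchar('\n');
    return 0;
}

This also resolves the "abc𤭢62" question from earlier: encode U+24B62 to its UTF-8 bytes (0xF0 0xA4 0xAD 0xA2), and keep the literal "62" out of the hex escape with adjacent-string concatenation, as in printf("abc\xF0\xA4\xAD\xA2" "62");.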

use wcstombs() to convert wchar_t* (containing Unicode) to MBCS (char*) dependent on locale

My input is a wide string of Unicode characters, e.g. "(U+00DB) (U+0081)" (wchar_t*). I use wcstombs() to convert this wide-character string to char* (MBCS). Since the Unicode is already encoded in UTF-8, I am expecting it to return a byte-by-byte copy of the Unicode values, DB 81, as char*. But instead I get c3 9b. This happens on Linux; on Windows I get "DB 81" only.
I need to open a file whose name is DB 81 (as shown in a hexdump), but fopen() takes a char* filename, so I have to convert this wchar_t* to MBCS. Please help!
No, what you want to do is not what you think you should do.
fopen() cannot handle all possible filenames on your system under all circumstances, because it lacks Unicode support.
Please refer to http://www.utf8everywhere.org to see how to do it with _wfopen().
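
To see the locale dependence the question describes, here is a minimal C sketch (the locale name is an assumption and must be installed on the machine):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* wcstombs() converts using the current locale's encoding,
       so the result depends entirely on this call. */
    if (setlocale(LC_ALL, "en_US.UTF-8") == NULL) {
        fprintf(stderr, "locale not available\n");
        return 1;
    }

    const wchar_t wide[] = { 0x00DB, 0x0081, 0 };  /* U+00DB, U+0081 */
    char out[16];
    size_t n = wcstombs(out, wide, sizeof out);
    if (n == (size_t)-1) {
        fprintf(stderr, "unconvertible character\n");
        return 1;
    }

    /* In a UTF-8 locale this prints "c3 9b c2 81": the UTF-8 encoding
       of the two codepoints, not the raw values DB 81. */
    for (size_t i = 0; i < n; i++)
        printf("%02x ", (unsigned char)out[i]);
    putchar('\n');
    return 0;
}

Under a latin-1 locale (e.g. en_US.ISO-8859-1) the same call would produce the bytes db 81, which is what the question expected; that is the sense in which the result depends on the locale.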