Windows 1250 or UTF-8 in MS Visual Studio? - encoding

Here I have a simple example program in MS Visual Studio:
#include <string>
#include <iostream>
using namespace std;

int main()
{
    cout << static_cast<int>('ą') << endl; // -71
    return 0;
}
The question is: why does this cout print -71, as if MS Visual Studio were using Windows-1250, when as far as I know it uses UTF-8?

Your source file is saved in Windows-1250, not UTF-8, so the byte stored between the two single quotes is 0xB9 (see Windows-1250 table). 0xB9 taken as a signed 8-bit value is -71.
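To see that sign extension in isolation, here is a minimal sketch, assuming char is a signed 8-bit type (as it is by default with MSVC):
#include <iostream>
using namespace std;

int main()
{
    unsigned char raw = 0xB9;            // the Windows-1250 code for 'ą'
    char c = static_cast<char>(raw);     // char is signed 8-bit here
    cout << static_cast<int>(c) << endl; // prints -71 (185 - 256)
    return 0;
}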
Save your file in UTF-8 encoding and you'll get a different answer: I get 50309, which is 0xC485. Since UTF-8 is a multibyte encoding, it would be better to use modern C++ to output the bytes of an explicit UTF-8 string, use UTF-8 source encoding, and tell the compiler explicitly that the source encoding is UTF-8:
test.cpp - saved in UTF-8 encoding and compiled with the /utf-8 switch in MSVS:
#include <string>
#include <iostream>
#include <cstdint>
using namespace std;

int main()
{
    string s {u8"ą马"};
    for (auto c : s)
        cout << hex << static_cast<int>(static_cast<uint8_t>(c)) << endl;
    return 0;
}
Output:
c4
85
e9
a9
ac
Note that C4 85 are the correct UTF-8 bytes for ą, and E9 A9 AC are correct for the Chinese 马 (horse).


STM32 SPI data is sent the reverse way

I've been experimenting with writing to an external EEPROM over SPI and I've had mixed success. The data does get shifted out, but in the opposite order. The EEPROM requires a start bit and then an opcode, which is essentially a 2-bit code for read, write and erase; the start bit and the opcode are combined into one byte. I'm creating a 32-bit unsigned int and then bit-shifting the values into it. When I transmit this I see the actual data first, then the SB+opcode, and then the memory address. How do I reverse this so that the opcode is seen first, then the memory address, and then the actual data? As seen in the image below, the data is BCDE, the SB+opcode is 07 and the memory address is 3F. The correct sequence should be 07, 3F and then BCDE (I think!).
Here is the code:
uint8_t mem_addr = 0x3F;
uint16_t data = 0xBCDE;
uint32_t write_package = (ERASE << 24 | mem_addr << 16 | data);

while (1)
{
    /* USER CODE END WHILE */
    /* USER CODE BEGIN 3 */
    HAL_SPI_Transmit(&hspi1, &write_package, 2, HAL_MAX_DELAY);
    HAL_Delay(10);
}
/* USER CODE END 3 */
It looks like your SPI interface is set up to process 16-bit halfwords at a time. Therefore it would make sense to break the data to be sent into 16-bit halfwords too. That would take care of the ordering.
uint8_t mem_addr = 0x3F;
uint16_t data = 0xBCDE;
uint16_t write_package[2] = {
    (ERASE << 8) | mem_addr,
    data
};
HAL_SPI_Transmit(&hspi1, (uint8_t *)write_package, 2, HAL_MAX_DELAY);
EDIT
Added an explicit cast. As noted in the comments, without the explicit cast it wouldn't compile as C++ code, and would cause some warnings as C code.
You're packing your information into a 32-bit integer. On line 3 of your code you decide which bits of data are placed where in the word. To change the order you can replace that line with:
uint32_t write_package = ((data << 16) | (mem_addr << 8) | (ERASE));
That shifts data left by 16 bits into the most significant 16 bits of the word, shifts mem_addr up by 8 bits and ORs it in, and then ORs ERASE into the least significant bits.
Your problem is endianness.
By default the STM32 uses little-endian byte order, so the lowest byte of the uint32_t is stored at the first address.
If I'm right, this is the declaration of the transmit function you are using:
HAL_StatusTypeDef HAL_SPI_Transmit(SPI_HandleTypeDef *hspi, uint8_t *pData, uint16_t Size, uint32_t Timeout)
It requires a pointer to uint8_t as data (and not a uint32_t), so you should get at least a warning when you compile your code.
If you want to write code that is independent of the endianness used, you should store your data in an array instead of one "big" variable.
uint8_t write_package[4];
write_package[0] = ERASE;
write_package[1] = mem_addr;
write_package[2] = (data >> 8) & 0xFF;
write_package[3] = (data & 0xFF);
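To illustrate the point, here is a small sketch (plain desktop C++, not STM32/HAL code; ERASE is assumed to be 0x07, the SB+opcode value from the question) that prints the bytes of the packed uint32_t in memory order, i.e. in the order a byte-oriented transmit would fetch them:
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    const uint32_t ERASE = 0x07;  // assumed SB+opcode value from the question
    uint8_t  mem_addr = 0x3F;
    uint16_t data     = 0xBCDE;
    uint32_t write_package = (ERASE << 24) | (mem_addr << 16) | data; // 0x073FBCDE

    uint8_t bytes[sizeof(write_package)];
    std::memcpy(bytes, &write_package, sizeof(bytes));

    // On a little-endian machine this prints DE BC 3F 07: the data bytes sit at
    // the lowest addresses and the opcode at the highest, so a transmit that
    // walks memory from the first address cannot send the opcode first.
    for (uint8_t b : bytes)
        std::printf("%02X ", b);
    std::printf("\n");
    return 0;
}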

Segfault when running hello world shellcode in C program

Sorry if this question sounds dumb, but I am very new to shellcoding and I was trying to get a hello world example to work on a 32-bit Linux machine.
As this is shellcoding, I used a few tricks to remove null bytes and shorten the code. Here it is:
section .data
section .text
global _start

_start:
    ; Instead of xor eax,eax
    ; mov al,0x4
    push byte 0x4
    pop eax
    ; xor ebx,ebx
    push byte 0x1
    pop ebx
    ; xor ecx,ecx
    cdq                 ; instead of xor edx,edx
    ; mov al, 0x4
    ; mov bl, 0x1
    mov dl, 0x8
    push 0x65726568
    push 0x74206948
    ; mov ecx, esp
    push esp
    pop ecx
    int 0x80
    mov al, 0x1
    xor ebx,ebx
    int 0x80
This code works fine when I compile and link it with the following commands:
$ nasm -f elf print4.asm
$ ld -o print4 -m elf_i386 print4.o
However, I tried running it within the following C code:
$ cat shellcodetest.c
#include
#include
char *shellcode = "\x04\x6a\x58\x66\x01\x6a\x5b\x66\x99\x66\x08\xb2\x68\x68\x68\x65\x69\x48\x54\x66\x59\x66\x80\xcd\x01\xb0\x31\x66\xcd\xdb\x80";
int main(void) {
( *( void(*)() ) shellcode)();
}
$ gcc shellcodetest.c -m32 -z execstack -o shellcodetest
$ ./shellcodetest
Segmentation fault (core dumped)
Could someone please explain what is happening here? I tried running the code in gdb and noticed something weird happening with esp. But as I said before, I still lack the experience to really understand what is going on.
Thanks in advance!
Your shellcode does not work because it is not entered in the correct byte order. You did not state how you extracted the bytes from the file print4, but both objdump and xxd give the bytes in the correct order.
$ xxd print4 | grep -A1 here
0000060: 6a04 586a 015b 99b2 0868 6865 7265 6848 j.Xj.[...hherehH
0000070: 6920 7454 59cd 80b0 0131 dbcd 8000 2e73 i tTY....1.....s
$ objdump -d print4
print4: file format elf32-i386
Disassembly of section .text:
08048060 <_start>:
8048060: 6a 04 push $0x4
8048062: 58 pop %eax
8048063: 6a 01 push $0x1
...
The change you need to make is to swap the byte order: '\x04\x6a' -> '\x6a\x04', and so on.
When I run your code with this change, it works!
$ cat shellcodetest.c
char *shellcode = "\x6a\x04\x58\x6a\x01\x5b\x99\xb2\x08\x68\x68\x65\x72\x65\x68\x48\x69\x20\x74\x54\x59\xcd\x80\xb0\x01\x31\xdb\xcd\x80";
int main(void) {
( *( void(*)() ) shellcode)();
}
$ gcc shellcodetest.c -m32 -z execstack -o shellcodetest
$ ./shellcodetest
Hi there$

Invalid CRC32 Hash Generation

I'm creating SHA1 and CRC32 hashes from plain text using the Crypto++ library as follows:
#include <cryptopp/filters.h>
#include <cryptopp/hex.h>
#include <cryptopp/sha.h>
#include <cryptopp/crc.h>
#include <string>
#include <iostream>

int main()
{
    // Calculate SHA1
    std::string data = "Hello World";
    std::string base_encoded_string;
    byte sha_hash[CryptoPP::SHA::DIGESTSIZE];
    CryptoPP::SHA().CalculateDigest(sha_hash, (byte*)data.data(), data.size());
    CryptoPP::StringSource ss1( std::string(sha_hash, sha_hash + CryptoPP::SHA::DIGESTSIZE), true,
        new CryptoPP::HexEncoder( new CryptoPP::StringSink( base_encoded_string ) ));
    std::cout << base_encoded_string << std::endl;
    base_encoded_string.clear();

    // Calculate CRC32
    byte crc32_hash[CryptoPP::CRC32::DIGESTSIZE];
    CryptoPP::CRC32().CalculateDigest(crc32_hash, (byte*)data.data(), data.size());
    CryptoPP::StringSource ss2( std::string(crc32_hash, crc32_hash + CryptoPP::CRC32::DIGESTSIZE), true,
        new CryptoPP::HexEncoder( new CryptoPP::StringSink( base_encoded_string ) ));
    std::cout << base_encoded_string << std::endl;
    base_encoded_string.clear();
}
The output I get is:
0A4D55A8D778E5022FAB701977C5D840BBC486D0
56B1174A
Press any key to continue . . .
Out of these, I confirmed that the CRC32 is incorrect according to various online resources, such as this one: http://www.fileformat.info/tool/hash.htm?text=Hello+World
I have no idea why, because I'm creating the CRC32 hash by following the same procedure I followed for SHA1. Is there really a different way, or am I doing something wrong here?
byte crc32_hash[CryptoPP::CRC32::DIGESTSIZE];
I believe you have a bad endian interaction. Treat the CRC32 value as an integer, not a byte array.
So try this:
int32_t crc = (crc32_hash[0] << 0) | (crc32_hash[1] << 8) |
(crc32_hash[2] << 16) | (crc32_hash[3] << 24);
If crc32_hash is integer aligned, then you can:
int32_t crc = ntohl(*(int32_t*)crc32_hash);
Or, this might be easier:
int32_t crc32_hash;
CryptoPP::CRC32().CalculateDigest((byte*)&crc32_hash, (byte*)data.data(), data.size());
I might be wrong about int32_t, it might be uint32_t (I did not look at the standard).
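Putting the first suggestion together as a complete program (a sketch, assuming the digest bytes are stored least-significant byte first, which is what the output above suggests):
#include <cryptopp/crc.h>
#include <cstdint>
#include <iostream>
#include <string>

int main()
{
    std::string data = "Hello World";

    // byte is a typedef for unsigned char in Crypto++
    unsigned char crc32_hash[CryptoPP::CRC32::DIGESTSIZE];
    CryptoPP::CRC32().CalculateDigest(crc32_hash,
        (const unsigned char*)data.data(), data.size());

    // Reassemble the digest bytes as a little-endian integer to get the
    // conventional CRC32 value.
    uint32_t crc = (uint32_t)crc32_hash[0]
                 | ((uint32_t)crc32_hash[1] << 8)
                 | ((uint32_t)crc32_hash[2] << 16)
                 | ((uint32_t)crc32_hash[3] << 24);

    // Should print 4A17B156, matching the online calculators.
    std::cout << std::hex << std::uppercase << crc << std::endl;
    return 0;
}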

Qt equivalent of Perl pack/unpack

The following is a list of hex data and the float numbers it represents:
e77ed8f8201a5440 = 78.4083
0000000000005540 = 82
4c541773e2185040 = 62.3888
0000000000005640 = 86
The following Perl code uses pack/unpack to get the conversion almost right (out by exactly 2):
use strict;
use warnings;

while (<DATA>)
{
    chomp;
    my $dat  = $_;
    my $hval = pack "H*", $dat;
    my $fval = unpack "F", $hval;
    print "$dat .. $fval \n";
}
__DATA__
e77ed8f8201a5440
0000000000005540
4c541773e2185040
0000000000005640
Output:
e77ed8f8201a5440 .. 80.408262454435
0000000000005540 .. 84
4c541773e2185040 .. 64.3888213851762
0000000000005640 .. 88
What is the Qt/C equivalent of this pack/unpack, or what is the algorithm it uses to "convert" the hex to float, so I can code that up instead?
What is the Qt/C equivalent of this pack/unpack,
#include <QString>
#include <QByteArray>
#include <algorithm>
#include <cstring>
...
QString value = "e77ed8f8201a5440";
// convert the hex string to an array of bytes
QByteArray arr = QByteArray::fromHex(value.toLatin1());
// reverse if necessary
#if Q_BYTE_ORDER == Q_BIG_ENDIAN
std::reverse(arr.begin(), arr.end());
#endif
// if everything went right, copy the bytes into a double
if (arr.size() == sizeof(double))
{
    double out;
    std::memcpy((void *)&out, (void *)arr.data(), sizeof(double));
    // ...
}
Maybe you could also get away with QtEndian (instead of conditionally calling std::reverse over arr), but it's not clear if those functions can be called on anything but integral types.
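Here is a sketch of that QtEndian route, assuming the integer overload of qFromLittleEndian and then copying the bit pattern into a double (not tested across Qt versions, so treat it as a sketch):
#include <QByteArray>
#include <QString>
#include <QtEndian>
#include <cstring>

int main()
{
    QString value = "e77ed8f8201a5440";
    QByteArray arr = QByteArray::fromHex(value.toLatin1());

    double out = 0.0;
    if (arr.size() == sizeof(double))
    {
        // Interpret the 8 bytes as a little-endian 64-bit integer first,
        // then copy that bit pattern into a double.
        quint64 bits = qFromLittleEndian<quint64>(
            reinterpret_cast<const uchar *>(arr.constData()));
        std::memcpy(&out, &bits, sizeof(out));
    }
    // out should be roughly 80.4083 here
    return 0;
}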
or what is the algorithm it uses to "convert" the hex to float so I can code that up instead?
The data you have is just the dump of the raw content of a little-endian IEEE-754 double; the "algorithm" is simply decoding the hexadecimal to the bytes it represents and copying them to a double variable (reversing the byte order if we are on a big-endian machine).
pack 'H*' is the conversion from hex character pairs to the corresponding bytes.
unpack 'F' is a cast to a double. This can be done using memcpy (to avoid alignment issues), as shown below:
#include <stdio.h>
#include <string.h>

int main() {
    char bytes[] = { 0xE7, 0x7e, 0xd8, 0xf8, 0x20, 0x1a, 0x54, 0x40 };
    double d;
    memcpy(&d, bytes, sizeof(bytes));
    printf("%lf\n", d);
    return 0;
}
Output:
$ gcc -o a a.c && a
80.408262
Note that this will fail on big-endian machines. You'll have to reverse the bytes before copying them on those machines.
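If you want a version that does not depend on the host byte order, here is a sketch that assembles the 64-bit pattern explicitly from the little-endian dump before copying it into the double (assuming an IEEE-754 double whose byte order matches the integer byte order, which holds on common platforms):
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    // The dump is little-endian, so assemble a 64-bit integer explicitly
    // from the bytes; this works on both little- and big-endian hosts.
    const unsigned char bytes[] = { 0xE7, 0x7E, 0xD8, 0xF8, 0x20, 0x1A, 0x54, 0x40 };

    uint64_t bits = 0;
    for (int i = 7; i >= 0; --i)
        bits = (bits << 8) | bytes[i];   // bytes[0] is the least significant byte

    double d;
    std::memcpy(&d, &bits, sizeof(d));   // reinterpret the bit pattern as a double
    std::printf("%f\n", d);              // 80.408262
    return 0;
}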

UTF-8 & Unicode, what's with 0xC0 and 0x80?

I've been reading about Unicode and UTF-8 in the last couple of days and I often come across a bitwise comparison similar to this:
int strlen_utf8(char *s)
{
    int i = 0, j = 0;
    while (s[i])
    {
        if ((s[i] & 0xc0) != 0x80) j++;
        i++;
    }
    return j;
}
Can someone clarify the comparison with 0xc0 and checking if it's the most significant bit?
Thank you!
EDIT: ANDed, not comparison, used the wrong word ;)
It's not a comparison with 0xc0, it's a bitwise AND operation with 0xc0.
The bit mask 0xc0 is 11 00 00 00 so what the AND is doing is extracting only the top two bits:
    ab cd ef gh
AND 11 00 00 00
    -- -- -- --
  = ab 00 00 00
This is then compared to 0x80 (binary 10 00 00 00). In other words, the if statement is checking to see if the top two bits of the value are not equal to 10.
"Why?", I hear you ask. Well, that's a good question. The answer is that, in UTF-8, all bytes that begin with the bit pattern 10 are subsequent bytes of a multi-byte sequence:
UTF-8
Range              Encoding  Binary value
-----------------  --------  --------------------------
U+000000-U+00007f  0xxxxxxx  0xxxxxxx
U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                   10xxxxxx
U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                   10yyyyxx
                   10xxxxxx
U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                   10zzyyyy
                   10yyyyxx
                   10xxxxxx
So, what this little snippet is doing is going through every byte of your UTF-8 string and counting up all the bytes that aren't continuation bytes (i.e., it's getting the length of the string, as advertised). See the Wikipedia article on UTF-8 for more detail and Joel Spolsky's excellent article on Unicode for a primer.
An interesting aside, by the way: you can classify bytes in a UTF-8 stream as follows (a short sketch after the list puts these rules into code).
With the high bit set to 0, it's a single byte value.
With the two high bits set to 10, it's a continuation byte.
Otherwise, it's the first byte of a multi-byte sequence and the number of leading 1 bits indicates how many bytes there are in total for this sequence (110... means two bytes, 1110... means three bytes, etc).
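To make the classification concrete, here is a small sketch, assuming the UTF-8 bytes C4 85 E9 A9 AC (the encoding of ą马), that labels each byte with exactly those masks and reproduces the count strlen_utf8 gives:
#include <cstdio>

int main()
{
    // "ą马" encoded in UTF-8: C4 85 E9 A9 AC
    const unsigned char s[] = { 0xC4, 0x85, 0xE9, 0xA9, 0xAC, 0x00 };

    int chars = 0;
    for (int i = 0; s[i]; ++i)
    {
        if ((s[i] & 0x80) == 0x00)       // 0xxxxxxx: single-byte (ASCII) character
            std::printf("%02X single\n", s[i]);
        else if ((s[i] & 0xC0) == 0x80)  // 10xxxxxx: continuation byte
            std::printf("%02X continuation\n", s[i]);
        else                             // 11xxxxxx: lead byte of a multi-byte sequence
            std::printf("%02X lead\n", s[i]);

        if ((s[i] & 0xC0) != 0x80)       // same test as strlen_utf8 above
            ++chars;
    }
    std::printf("characters: %d\n", chars); // prints 2
    return 0;
}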