Parser is not encoding string correctly

Parser is not encoding string correctly - encoding

Text I'm trying to get:
przełącznica
This is what I actually have (browser might now view it properly - there are two squares instead of "łą"):
przecznica
BLOB:
70 72 7A 65 C5 82 C4 85 63 7A 6E 69 63 61
EDIT: This is what I get from parser
70 72 7A 65 1A 1A 63 7A 6E 69 63 61
ESQL used to parse BLOB:
DECLARE blobMsg BLOB InputRoot.BLOB.BLOB ;
CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg DOMAIN ('XMLNSC') NAME 'XMLNSC';
CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg.XMLNSC PARSE(blobMsg OPTIONS FolderBitStream CCSID 1208 FORMAT 'XMLNSC');
I have tried CCSIDs: 1208 (UTF8), 912 (ISO-8859-2), 1200(UTF16 I guess):
https://www.ibm.com/support/knowledgecenter/ssw_ibm_i_71/nls/rbagsccsidcdepgscharsets.htm
EDIT: Working code:
DECLARE blobMsg BLOB InputRoot.BLOB.BLOB;
DECLARE remove BLOB X'EFBBBF';
DECLARE message BLOB REPLACE(InputRoot.BLOB.BLOB, remove, CAST('' AS BLOB));
CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg DOMAIN ('XMLNSC') NAME 'XMLNSC';
CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg.XMLNSC PARSE(message OPTIONS FolderBitStream CCSID 05348 FORMAT 'XMLNSC');

Firstly przełącznica by itself is not valid XML and so you'll get an exception when you try to invoke the XMLNSC parser using the code you have outlined. You need to do a CAST instead.
I generated a little test Application/MsgFlow in IIB 10 to illustrate CASTing the BLOB.
The code in ConvertAndParse is
CREATE COMPUTE MODULE ConvertAndParse
CREATE FUNCTION Main() RETURNS BOOLEAN
BEGIN
DECLARE blobMsg BLOB X'70727A65C582C485637A6E696361';
CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg DOMAIN 'XMLNSC';
CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg.XMLNSC NAME 'AsUtf8' VALUE CAST(blobMsg AS CHAR CCSID 1208);
CREATE LASTCHILD OF OutputRoot DOMAIN 'XMLNSC';
CREATE LASTCHILD OF OutputRoot.XMLNSC.EncodingResponse NAME 'AsUtf8InTag' VALUE CAST(blobMsg AS CHAR CCSID 1208);
CREATE LASTCHILD OF OutputRoot.XMLNSC.EncodingResponse NAME CAST(blobMsg AS CHAR CCSID 1208) VALUE 'As a tag name';
RETURN TRUE;
END;
END MODULE;
When I run a debug session the value put into the LocalEnvironment tree looks like.
And the result of invoking the flow from a browser.
Now let's deal with the which encoding we are looking at. Looking at what I assume is the input BLOB let's see if the BLOB matches up with UTF-8.
70 72 7A 65 C5 82 C4 85 63 7A 6E 69 63 61
UTF-8 is a variable width character encoding that sets the high order bit to indicate two or more bytes. We also want a page that shows the common code points for UTF-8 Complete Character List for UTF-8. Note it's not actually complete.
Looking at the first 4 bytes none of them have the high order bit on
70 72 7A 65
And the aforementioned Character List says that's prze, so far so good.
Then we hit C8 which has the high order bit on. Doing a bit of visual parsing we get two sets of probable two byte character pairs
C5 82
C4 85
Referring to the Character List our two candidate pairs do in fact match the two characters we want and the next six characters which do not have their high order bits on translate to cznica. Looking really good.
Now to eliminate the other candidate encodings, if we can.
UTF-16 uses 2 or 4 bytes to represent each character depending on the Byte Order Mark with prze encoded as
UTF-16BE - CP 1200 - 00 70 00 72 00 7A 00 65
UTF-16LE - CP 1202 - 70 00 72 00 7A 00 65 00
Given that there are not lots and lots of null characters 00 it is reasonable to discount UTF-16.
ISO-8859-2 - CP 912 is a single byte character set and the C5 and C4 code points do not match the two desired characters and thus we can eliminate it.

Related

Snort rules for byte code

I just started to learn how to use Snort today.
However, I need a bit of help with my rules setup.
I am trying to look for the following code on the network sent to a machine. This machine has snort installed on it (as I installed it now).
The code I want to analyze on the network is in bytes.
\xAA\x00\x00\x00\x00\x00\x00\x0F\x00\x00\x02\x74\x00\x00' (total of 14 bytes)
Now, I am looking at wanting to analyze the first 7 bytes of the code. For me if the 1st byte is (AA) and 7th byte is (0F). Then I want snort to set off an alarm.
So far my rules are:
alert tcp any any -> any any \
(content:"|aa 00 00 00 00 00 00 0f|"; msg:"break in attempt"; sid:10; rev:1; \
classtype:shellcode-detect; rawbytes;)
byte_test:1, =, aa, 0, relative;
byte_test:7 =, 0f, 7, relative;
I'm guessing I obviously have made a mistake somewhere. Maybe someone that is familair with snort could help me out?
Thanks.

Congrats on deciding to learn snort.
Assuming the bytes are going to be found in the payload of a TCP packet your rule header should be fine:
alert tcp any any -> any any
We can then specify the content match using pipes (||) to let snort know that these characters should be interpreted as hex bytes and not ascii:
content:"|AA 00 00 00 00 00 00 0F|"; depth:8;
And since we only want the rule to match if these bytes are found in the first 8 bytes of the packet or buffer we can add "depth". The "depth" keyword modifier tells snort to check where in the packet or buffer the content match was found. For the above content match to return true all eight bytes must be found within the first eight bytes of the packet or buffer.
"rawbytes" is not necessary here and should only ever be used for one specific purpose; to match on telnet control characters. "byte_test" isn't needed either since we've already verified that bytes 1 and 8 are "AA" and "0F" respectively using a content match.
So, the final rule becomes:
alert tcp any any -> any any ( \
msg:"SHELLCODE Break in attempt"; \
content:"|AA 00 00 00 00 00 00 0F|"; depth:8; \
classtype:shellcode-detect; sid:10;)
If you decide that this should only match inside a file you can use the "sticky" buffer "file_data" like so:
alert tcp any any -> any any ( \
msg:"SHELLCODE Break in attempt"; file_data; \
content:"|AA 00 00 00 00 00 00 0F|"; depth:8; \
classtype:shellcode-detect; sid:10;)
This will alert if the shellcode is found inside the alternate data (file data) buffer.
If you'd like for your rule to only look inside certain file types for this shellcode you can use "flowbits" like so:
alert tcp any any -> any any ( \
msg:"SHELLCODE Break in attempt"; \
flowbits:isset,file.pdf; file_data; \
content:"|AA 00 00 00 00 00 00 0F|"; depth:8; \
classtype:shellcode-detect; sid:10;)
This will alert if these bytes are found when the file.pdf flowbit is set. You will need the rule enabled that sets the pdf flowbit. Rules that set file flowbits and other good examples can be found in the community ruleset available for free here https://www.snort.org/snort-rules.

Base64 Encoding and Decoding

I would appreciate if someone could please explain this to me.
I came across this post (not important just reference) and saw a token encoded with base64 where the guy decoded it.
EYl0htUzhivYzcIo+zrIyEFQUE1PQkk= -> t3+(:APPMOBI
I then tried to encode t3+(:APPMOBI again using base64 to see if I would get the same result, but was very surprised to get:
t3+(:APPMOBI - > dDMrKDpBUFBNT0JJ
Completly different token.
I then tried to decode the original token EYl0htUzhivYzcIo+zrIyEFQUE1PQkk= and got t3+(:APPMOBI with random characters between it. (I got ◄ëtå╒3å+╪═┬(√:╚╚APPMOBI could be wrong, I quickly did it off the top off my head)
What is the reason for the difference in tokens were they not supposed to be the same?

The whole purpose of base64 encoding is to encode binary data into text representation so that they can be transmitted over the network or displayed without corruption. But it ironically happened with the original post you were referring to,
EYl0htUzhivYzcIo+zrIyEFQUE1PQkk= does NOT decode to t3+(:APPMOBI
instead, it contains some binary bytes(not random btw) that you correctly showed. So the problem was due to the original post where either the author or the tool/browser that she/he used "cleaned up", or rather corrupted the decoded binary data.
There is always one-to-one relationship between encoded and decoded data (provided the same "base" is used, i.e. the same set of characters are used for encoded text.)
t3+(:APPMOBI indeed will be encoded into dDMrKDpBUFBNT0JJ

The problem is in the encoding that displayed the output to you, or in the encoding that you used to input the data to base64. This is actually the problem that base64 encoding was invented to help solve.
Instead of trying to copy and paste the non-ASCII characters, save the output as a binary file, then examine it. Then, encode the binary file. You'll see the same base64 string.
c:\TEMP>type b.txt
EYl0htUzhivYzcIo+zrIyEFQUE1PQkk=
c:\TEMP>base64 -d b.txt > b.bin
c:\TEMP>od -t x1 b.bin
0000000 11 89 74 86 d5 33 86 2b d8 cd c2 28 fb 3a c8 c8
0000020 41 50 50 4d 4f 42 49
c:\TEMP>base64 -e b.bin
EYl0htUzhivYzcIo+zrIyEFQUE1PQkk=
od is a tool (octal dump) that outputs binary data using hexadecimal notation, and shows each of the bytes.
EDIT:
You asked about a different string in your comments, dDMrKDpBUFBNT0JJ, and why does that decode to the same thing? Well, it doesn't decode to the same thing. It decodes to this string of bytes: 74 33 2b 28 3a 41 50 50 4d 4f 42 49. Your original string decoded to this string of bytes: 11 89 74 86 d5 33 86 2b d8 cd c2 28 fb 3a c8 c8 41 50 50 4d 4f 42 49.
Notice the differences: your original string decoded to 23 bytes, your second string decoded to only 12 bytes. The original string included non-ASCII bytes like 11, d5, d8, cd, c2, fb, c8, c8. These bytes don't print the same way on every system. You referred to them as "random bytes", but they're not. They're part of the data, and base64 is designed to make sure they can be transmitted.
I think to understand why these strings are different, you need to first understand the nature of character data, what base64 is, and why it exists. Remember that computers work only on numbers, but people need to work with familiar concepts like letters and digits. So ASCII was created as an "encoding" standard that represents a little number (we call this little number a "byte") as a letter or a digit, so that we humans can read it. If we line up a group of bytes, we can spell out a message. 41 50 50 4d 4f 42 49 are the bytes that represent the word APPMOBI. We call a group of bytes like this a "string".
Every letter from A-Z and every digit from 0-9 has a number specified in ASCII that represents it. But there are many extra numbers that are not in the standard, and not all of those represent visible or sensible letters or digits. We say they're non-printable. Your longer message includes many bytes that aren't printable (you called them random.)
When a computer program like email is dealing with a string, if the bytes are printable ASCII characters, it's easy. The email program knows what to do with them. But if your bytes instead represent a picture, the bytes could have values that aren't ASCII, and various email programs won't know what to do with them. Base64 was created to take all kinds of bytes, both printable and non-printable bytes, and translate them into a string of bytes representing only printable letters. Because they're all printable, a program like email or a web server can easily handle them, even if it doesn't know that they actually contain a picture.
Here's the decode of your new string:
c:\TEMP>type c.txt
dDMrKDpBUFBNT0JJ
c:\TEMP>base64 -d c.txt
t3+(:APPMOBI
c:\TEMP>base64 -d c.txt > c.bin
c:\TEMP>od -t x1 c.bin
0000000 74 33 2b 28 3a 41 50 50 4d 4f 42 49
0000014
c:\TEMP>type c.bin
t3+(:APPMOBI
c:\TEMP>

What can explain this bad character-encoding?

What "stack" of bad encoding would produce the following bytes of weirdness for the string "cinéma télédiffusion"? (I left out the space character, hex: 20)
cinÃ%ma
in HEX: 63 69 6E C3 83 25 6D 61
mapped: c i n ---�---- m a
tÃclÃcdiffusion
in HEX: 74 C3 83 63 6C C3 83 63 64 69 66 66 75 73 69 6F 6E
mapped: t ---�---- l ---�---- d i f f u s i o n
The ---�---- parts represent the bytes that aren't right.
I considered the idea "What if it was a messed up transcoding? How about a double encoding?", but, looking at http://www.fileformat.info/info/unicode/char/00e9/charset_support.htm (and the code page edition, too), I noted that there no encodings that could possibly end é with the hex bytes %25 or %63. It doesn't even look like double-UTF8 encoding at this point, because, http://en.wikipedia.org/wiki/UTF-8 clarified that bytes following a %C3 would need to be have the first bits set to 10xxxxxx.
How could some program have turned the accented é into an "Ã followed by %" as well as "Ã followed by c"? I want to trace back the history of the misencoding so that I can try to come up with something that can take steps at repairing the mangled strings.
There also exists the possibility that the é weren't ever é to begin with, but I can't fathom what kind of typo someone could have made in the same phrase to get two different versions of é that eventually get misencoded into two completely different sets of bytes.
Extra context details: I find these mangled strings inside of an XML file. The file has no <?xml version="1.0"?> header, so it's presumed to be UTF-8. There exists nodes containing phrases that have perfectly good é characters in them at the same time that there exists nodes containing phrases with mangled é characters.
iconv-and-family don't do anything at all to help this situation, as far as I've attempted.
A couple of trailing considerations that I now hold are: Should I suspect MySQL and its infamously lazy character set transcodings? Could it be somebody's really badly written custom encoding function as they exported the XML?

The encoding looks a bit strange:
Taken the é from cinéma results in for utf-8 encoding:
é = C3 A9
where you got:
C3 83 25
So when it will be double encoded the following should happen:
c3: Ã -> c3 83
a9: © -> c2 a9
But this will not explain the 25 within your result.
25: %
So the question is if this is encode once, then unknown characters like © will be replaced by % and then it's encoded a second time?

How does 1-wire decide what address to use?

I've been following this simple tutorial to get a temperature reading from a raspberrypi,
http://blog.vokiel.com/raspberry-pi-odczyt-temperatury-przez-nodejs/?lang=en
Under w1/devices, what I'm calling the address is the file where the value of the 1-wire bus is stored.
For example, the tutorial says
/sys/bus/w1/devices/28-00000249bf39 $ cat w1_slave
c3 01 4b 46 7f ff 0d 10 2f : crc=2f YES
c3 01 4b 46 7f ff 0d 10 2f t=28187
Where the address is 28-00000249bf39. On my device, the address is 28-000004acb882.
How are these addresses set? Is it possible to define your own?

As the docs says:
Each Device has a Unique 64-Bit Serial Code
Stored in an On-Board ROM
So you can't set your own.

To read the temperature from youre devide just type
cat 28.00000249bf39/temperature

Base64 decoding gives different result

I'm working on a little Streamserve project (google it :P) where I get some Base64 encoded content. I've tried to decode the base64 string with multiple decoders and all return the correct result.. except the Base64DecodeString method in Streamserve.
The encoded string is: 'VABlAHMAdABpAG4AZwAgAGIAYQBzAGUANgA0AA==' The expected result is: 'Testing base64'
However within Streamserve the result is: 'Tsig ae4'
It simply skips every other letter. Now I know most people dont know Streamserve, but I have a hunch that this might be a character encoding problem.. problem and was hoping someone has a clue what might be happening here.
I can without any problem encode/decode strings within streamserve.. just not strings I get as input

The issue is that you're encoding in UTF-16 and decoding back to ASCII or UTF8. Change your string encoding to UTF8 before encoding the string to base64 and it should work fine.
Here's the hex dump of that base64 blob:
54 00 65 00 73 00 74 00 69 00 6e 00 67 00 20 00 62 00 61 00 73 00 65 00 36 00 34 00
If you remove the null bytes, you get this:
54 65 73 74 69 6e 67 20 62 61 73 65 36 34
Which translates to the following ASCII text:
Testing base64

The result of decoding base64 is binary data (and likewise the input when encoding it is binary data). To go from binary data to a string or vice versa, you need to apply an encoding such as UTF-8 or UTF-16. You need to find out which encoding Streamserve is using, and use the same encoding when you convert your text data to binary data to start with, before base64-encoding it.
It sounds like you might want to use UTF-16 to encode your text to start with, although in that case I'm surprised you're not just getting garbage out... it looks like it's actually ignoring every other byte in the decoded base64, rather than taking it as the high byte in a UTF-16 code unit.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse