What kind of encoding is this and what does it translate to? }(]&[([%!"=))%%!")"

}(]&[([%!"=))%%!")"}"{"={}]+&"{*="!&"&&]!+}"{])"=]#)!!)]"][}{*/[(#{*%*[#=&)}""]}
{]]%/)(="{![{)=&%{}&+{#)%==#"(*})+%%)(+{)(%*{}!&"=&[#]&*)%+/})+/)!!#{%)+)()+!=[(
)=}([={[!{/)+)&/"]!/=!+*%(&/)#")!*[#(#(="][*=+(*&/!)!()+#)#}[%]*"#")*(#]{&*%%)}#
%*({"+)/#&""=&=/})={)}")"%}]%+&*#={)//=+(}/"!{%!!{=%/}}!}](}*"/]&&%=}*[*(&["={%{
}#&){#%[%[)%)+%}/&#%(/=((][}}["]=!&))!/[]#&"=]=[*+#*{])="]"/[#]]*!"}![)%})(&"/*#
...
I've never seen an encoding like this before. I've tried all sorts of online automatic decoders to no avail, and googling "code with curly braces and parenthesis" isn't doing me any good. Above are the first few of exactly 1000 lines.
Here's a pastebin link, since the full text wouldn't fit into the 30,000-character body limit. https://pastebin.com/4LfEvd4b

Related

Is it correct to use Word Joiner (U+2060) in the same word?

In Bangla, Hosonto (U+09CD) is used to create a ligature, which joins adjacent letters. For example, ক্ক is created using ক + ্ + ক. But sometimes we need a non-joining Hosonto (ক্‌ক). To make that possible, we traditionally use a Zero-width non-joiner (U+200C).
The problem with ZWNJ is that, when the line is too long and line wrapping occurs, the word is broken into two lines. To keep the word whole, I need something like a "Zero-width non-breaking non-joiner", but I don't see such a character in Unicode. So I think Word Joiner (U+2060) is the best option.
To me, "Word Joiner" sounds like it joins two words. But in my case, I need to join two parts of a single word. So the question is: is it correct to use Word Joiner here?
U+200C ZERO WIDTH NON-JOINER has no effect on line breaking. Its absence or presence does not change where line wrapping can occur. If inserting a ZWNJ within a word causes that word to be broken across lines, then whatever application you are using to view your text does not implement the standard correctly.
ZWNJ is the only correct character for your purposes. More than that, using U+2060 WORD JOINER could in fact lead to inconsistent results. Much like ZWNJ does not affect line breaks, WJ is not supposed to affect joining behaviour (it is defined as “transparent” in that regard). While the standard doesn’t explicitly mention cases like this to the best of my knowledge, one could reasonably argue that inserting a WJ between the two letters in your example should not change the way they are displayed.
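For concreteness, here is the sequence from the question spelled out as escaped code points, in a small Java sketch (the rendering depends on the font and shaping engine, so treat the printed output as illustrative only):

    public class HosontoDemo {
        public static void main(String[] args) {
            // Joining form: KA + HOSONTO (U+09CD, the virama) + KA -> the ligature
            String joining = "\u0995\u09CD\u0995";
            // Non-joining form: ZWNJ (U+200C) after the virama keeps the letters apart
            String nonJoining = "\u0995\u09CD\u200C\u0995";
            System.out.println(joining + "  vs  " + nonJoining);
        }
    }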

In RTF, what is the meaning of \u-3913?

I'm parsing RTF data in a .MSG file created by Outlook. One thing I stumbled upon is:
mso-level-text:\u-3913 ?
mso-level-text:\u-3929 ?
"\u-3913 ?" denotes a middot character, but I couldn't identify its Unicode code point. I also couldn't find the meaning of "\u-XXXX ?" strings at all: neither decimal 3913 nor hex 3913 denotes a middot, and the UTF-8 value for middot isn't even close. Googling explains the meaning of "\uXXXX ?" only, not "\u-XXXX ?". The same goes for the 3929 code.
How do I get the right characters from the above?
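For what it's worth, the RTF specification defines \uN as carrying a signed 16-bit decimal value, so code points above 0x7FFF show up as negative numbers; adding 65536 recovers the code point, and the character following the control word (here "?") is just the fallback for readers that don't understand \u. A minimal Java sketch of the conversion:

    public class RtfU {
        // RTF's unicode control word (backslash-u N) carries a signed
        // 16-bit value; negative values stand for code points above 0x7FFF.
        static int rtfUToCodePoint(int n) {
            return n < 0 ? n + 65536 : n;
        }

        public static void main(String[] args) {
            System.out.printf("U+%04X%n", rtfUToCodePoint(-3913)); // U+F0B7
            System.out.printf("U+%04X%n", rtfUToCodePoint(-3929)); // U+F0A7
        }
    }

That puts \u-3913 at U+F0B7 and \u-3929 at U+F0A7, both in the Private Use Area. Office typically pairs such values with symbol fonts, the low byte being the glyph index (0xB7 in the Symbol font is the middot-style list bullet, and 0xA7 in Wingdings is a small square bullet), which would explain what you're seeing.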

JMeter CSV Data Set is corrupting Japanese strings stored as proper UTF-8; I get question marks instead

I read in search terms from a simple text file to send to a search engine.
It works fine in English, but gives me ???? for any Japanese text.
Text with mixed English and Japanese does show the English text, so I know it's reading it.
What I'm seeing:
Input text:
Snow Leopard をインストールする場合、新しい
Turns into:
Snow Leopard ???????????????
This is in the POST field of an HTTP request.
If I set JMeter to encode the data, it just puts in the percent-encoded sequence for question marks.
About the data:
- The CSV file is very simple in structure: there's only one field / one column, which I name TERM and later use as ${TERM}.
- I don't really need full CSV because it's only one string per line; there are no commas or quotes.
- The file is UTF-8; the Unix "file" command says "UTF-8 text", and I've verified UTF-8 in command-line and graphical tools on two machines.
An interesting note: if there are 15 Japanese characters, then I get 15 question marks, so at some point the text is being seen as full characters and not just bytes.
JMeter CSV Dataset Config:
Filename: japanese-searches.csv
File encoding: UTF-8 (also tried without)
Variable names: TERM
Delimiter: ,
Allow Quoted Data: False (also tried True; the output was different, but still wrong)
Recycle at EOF: True
Stop at EOF: False
Sharing mode: All threads
A few things I've tried:
- Tried Allow Quoted Data; it changed the output to other strange characters.
- Added -Dfile.encoding=UTF-8
- Tried encoding the POST stage, but it just turned into a bunch of %nn for question marks
And I'm not sure how to debug just after each line of the CSV is read in. I think it's corrupted right away, but I'm not sure.
If it's only mangled when I reference it, then instead of ${TERM} perhaps there's some other "to bytes" function call. I'll start checking into that; I haven't done anything with the JMeter functions yet.
Edited Dec 24:
- Changed formatting and added bullet points for more clarity.
- Clarified that the file is UTF-8, and that I have verified that.
A new theory:
Is it possible that the Japanese characters are making it through, and the issue is that EVERY SINGLE place that shows them maps them to a "?" at DISPLAY TIME only? So even though I've checked in a bunch of places, maybe they all have a display issue just in the UI?
Is there a way in JMeter to see the numeric value of a character or string? Actually, to tell JMeter to display the list of Unicode code points?
I'll look at my last log files... although I suppose even the server logs could have mis-mapped the characters.
Also, perhaps when doing variable expansion inside the text field that I POST, where I reference ${TERM}, the mapping to question marks happens at that later point. If that happened, AND it was mis-displayed in the UI, it might lead to a false conclusion.
What I'd really like to do is pause JMeter after the first CSV record, just after that line is loaded, and look at it with a "data scope" or byte editor or something. Not sure if this is possible.
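One way to get that kind of "data scope" outside JMeter is to read the first record yourself and dump its code points. The sketch below is plain Java, using the file name from the question; if it prints real code points such as U+30A4 rather than a run of U+003F, the file and its UTF-8 decoding are fine and the corruption happens later:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class CodePointDump {
        public static void main(String[] args) throws Exception {
            // Read the first line of the CSV explicitly as UTF-8.
            String line = Files.readAllLines(Paths.get("japanese-searches.csv"),
                    StandardCharsets.UTF_8).get(0);
            // Print each code point as U+XXXX followed by the character itself.
            line.codePoints().forEach(cp ->
                    System.out.printf("U+%04X %s%n", cp, new String(Character.toChars(cp))));
        }
    }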
Found the issue: there was another place the UTF-8 had to be specified.
In the HTTP Request sampler, to the right of the Method, you also have to set Content Encoding to UTF-8.
Yes, in hindsight, this seems obvious, but there were a number of reasons I didn't think this was needed. Some of my incorrect assumptions might be helpful for others who are debugging, so here goes - I would have thought that:
1: Once text has made it into Java as Unicode, it stays as Unicode and goes in and out as UTF-8. Obviously not in this case.
2: I sort of thought HTTP defaulted to UTF-8 unless you say otherwise, but maybe I'm just used to XML. It's probably not good practice to assume that: HTTP may default to ISO-Latin-1 or something, and even if there's a spec, maybe folks don't follow it.
3: And if I don't specify it, I'd think the "do no harm" approach would be to pass the characters on and let the receiver on the other end deal with it. Wrong again!
(OK, so points 1, 2 and 3 overlap a bit.)
4: Even though my HTTP Request is a POST, I did still try the Encode checkbox. I certainly thought that would have encoded it, but all I got was the repeating % hex for question marks, so it seemed to me the data was already corrupted at that point. Wrong again. I suspect that WITHIN the HTTP phase there are TWO character transitions: first from Unicode to whatever encoding it thinks you have, and THEN a second encoding into the % signs. My data was mis-encoded at the first step.
5: And I would have thought JMeter would say something or warn, but from my reading, apparently it's not helpful in that respect. You can do logging or whatever.
And the "?" is Java's way of reporting a problem BY DEFAULT; this started in the Java 1.4 timeframe. In my own Java code I prefer to make encoding errors throw an exception, but again, that's not the default, and it's not what JMeter does.
So I learned my lesson.
The HINT that the Unicode was at least starting out OK was that the number of question marks equaled the number of Japanese characters, instead of there being 2 or 3 times as many question marks. If the length of "???" matches your Japanese (or Chinese) string, then Java DID see actual Unicode characters at some point along the journey. Whereas if you see 3 times as many ?'s as input characters, then Java always saw them as bytes or ints, and NEVER as valid code points.
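To illustrate that default in plain Java (this is generic Java behaviour, not JMeter's actual code): String.getBytes() silently substitutes '?' for characters the target charset can't represent, whereas an explicit encoder set to REPORT throws instead:

    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class EncodeDemo {
        public static void main(String[] args) {
            String japanese = "\u30C6\u30B9\u30C8"; // "test" in katakana
            // Default: unmappable characters silently become '?'.
            byte[] silent = japanese.getBytes(StandardCharsets.ISO_8859_1);
            System.out.println(new String(silent, StandardCharsets.ISO_8859_1)); // ???

            // Strict: ask the encoder to report the problem as an exception.
            try {
                StandardCharsets.ISO_8859_1.newEncoder()
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .encode(CharBuffer.wrap(japanese));
            } catch (CharacterCodingException e) {
                System.out.println("Unmappable character detected: " + e);
            }
        }
    }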
I came across this topic when searching for a solution for using parameters from a CSV file that contained some columns written in Hebrew.
I used Excel 2007 to create 1000 lines of data for user registrations. The first and last names had to be in Hebrew.
I exported the file to a "Unicode text" file. It became tab-delimited.
"Unicode Text" saves in UTF-16 LE (Little Endian), not in UTF-8. That is important.
I opened the result in Notepad++ and could see the Hebrew letters properly. Notepad++ has an "Encoding" menu, where you can check the encoding or change it, so I converted the file from UTF-16 LE to UTF-8.
Then I replaced tabs with commas (I just selected a tab and pasted it into the Find box).
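If you'd rather script those two manual steps (UTF-16 LE to UTF-8, tabs to commas), here is a minimal Java sketch; the file names are placeholders:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ConvertExport {
        public static void main(String[] args) throws Exception {
            // Excel's "Unicode Text" export is UTF-16 LE and tab-delimited.
            String text = new String(Files.readAllBytes(Paths.get("users.txt")),
                    StandardCharsets.UTF_16LE);
            // Strip a leading BOM if present, then turn tabs into commas.
            if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
                text = text.substring(1);
            }
            Files.write(Paths.get("users.csv"),
                    text.replace('\t', ',').getBytes(StandardCharsets.UTF_8));
        }
    }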
The parameters were substituted ok, but after running the script I saw the following:
In the "View Results Tree" listener I opened the "Result" tab of the "Http Request".
The parameters were substituted, but the HTTP view tab (on the bottom) of the Request showed me some gibberish.
But when I looked at the Raw view, I saw that the request parameters actually contained strings like %D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94, which, taken in pairs (%D7 %A9), corresponded properly to Hebrew letters.
To my mind, JMeter has a bug and cannot properly display the Unicode characters, but it sends (POSTs) them out fine.
Hope I am right, and hope it will help someone.
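You can verify that observation directly: percent-decoding the raw parameter as UTF-8 should give back readable Hebrew. A small Java check, using the string from the request above:

    import java.net.URLDecoder;

    public class CheckPercentEncoding {
        public static void main(String[] args) throws Exception {
            // The percent-encoded bytes seen in the Raw view of the request.
            String raw = "%D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94";
            // If the POST was correct UTF-8, this prints the original Hebrew word.
            System.out.println(URLDecoder.decode(raw, "UTF-8"));
        }
    }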
You can try using "SHIFT-JIS" in Content encoding (it's near the Method selection). Then you should uncheck "Encode?" for the parameter that includes Japanese.
Hope it works for you.

Goofy Unicode problem: mï ¿ ½

I have some text coming into a database that apparently has some sort of Unicode issue. The literal text coming in is "5 mï ¿ ½ in area", which appears to be some sort of unit of measure, but I can't sort out what the meaning is in context. Searching Google shows many similar results, so this is apparently a common set of symbols.
It's the Unicode replacement character, 0xFFFD (�); see also How to replace � in a string
So I guess the text used to be 5m² in area, and the ² was garbled into � before it arrived in your database.
It's probably supposed to be ² to indicate "meters squared", but you clearly have an encoding problem. I don't know exactly what the problem is, because you didn't paste any code or give any details for context.
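One plausible chain of events, sketched in plain Java: the ² was replaced by U+FFFD somewhere upstream, and the three UTF-8 bytes of U+FFFD (EF BF BD) were then decoded as Latin-1, which yields exactly "ï¿½":

    import java.nio.charset.StandardCharsets;

    public class MojibakeDemo {
        public static void main(String[] args) {
            // U+FFFD encoded as UTF-8 is the byte sequence EF BF BD.
            byte[] utf8 = "\uFFFD".getBytes(StandardCharsets.UTF_8);
            // Misread those bytes as ISO-8859-1 and you get "ï¿½".
            System.out.println("5 m" + new String(utf8, StandardCharsets.ISO_8859_1) + " in area");
        }
    }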

How can I figure out what code page I am looking at?

I have a device with some documentation on how to send it text. It uses 0x00-0x7F to send 'special' characters like accented characters, euro signs, ...
I am guessing they copied an existing code page and made some changes, but I have no idea how to figure out what code page is closest to the one in my documentation.
In theory, this should be easy to do. For example, they map Á to 0x41, so if I could find some way to go through all code pages and find the ones that have this character on that position, it would be a piece of cake.
However, all I can find on the internet are links to code page dumps just like the one I'm looking at, or software that uses heuristics to read text and guess the most likely code page. Surely someone out there has made it possible to look up which code page one is looking at?
If it uses 0x00 to 0x7F for the "special" characters, how does it encode the regular ASCII characters?
In most of the charsets that support the character Á, its codepoint is 193 (0xC1). If you subtract 128 from that, you get 65 (0x41). Maybe your "codepage" is just the upper half of one of the standard charsets like ISO-8859-1 or windows-1252, with the high-order bit set to zero instead of one (that is, subtracting 128 from each one).
If that's the case, I would expect to find a flag you can set to tell it whether the next bunch of code points should be converted using the "upper" or "lower" encoding. I don't know of any system that uses that scheme, but it's the most sensible explanation I can come up with for the situation you describe.
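If you want to test that theory mechanically, you can scan every charset the JVM knows and ask which ones put Á at a given byte value. A rough sketch; 0xC1 is just the question's one data point (0x41) plus the subtract-128 guess:

    import java.nio.charset.Charset;
    import java.util.Map;

    public class CharsetScan {
        public static void main(String[] args) {
            byte[] candidate = { (byte) (0x41 + 0x80) }; // 0xC1
            for (Map.Entry<String, Charset> e : Charset.availableCharsets().entrySet()) {
                try {
                    if (new String(candidate, e.getValue()).equals("\u00C1")) { // Á
                        System.out.println(e.getKey() + " maps 0xC1 to Á");
                    }
                } catch (UnsupportedOperationException ex) {
                    // Defensive: a few charsets may not support decoding.
                }
            }
        }
    }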
There is no way to auto-detect the codepage without additional information. Below the display layer it's just bytes, and all bytes are created equal. There's no way to say "I'm a 0x41 from this and that codepage"; there's only "I'm 0x41. Display me!"
What endianness does the system use? Perhaps you're flipping byte order?
In most codepages, 0x41 is just the normal "A"; I don't think any standard codepage has "Á" in that position. There could be a control character somewhere before the A that adds the accent, or the device may use a non-standard codepage.
I don't see any use in knowing the "closest codepage"; you just need to use the docs you got with the device.
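On the accent idea: in Unicode, at least, the accent would be a combining mark that follows the base letter rather than a control character before it. A quick Java illustration:

    import java.text.Normalizer;

    public class CombiningDemo {
        public static void main(String[] args) {
            // "A" followed by U+0301 COMBINING ACUTE ACCENT displays as Á.
            String decomposed = "A\u0301";
            // NFC normalization composes it into the single code point U+00C1.
            String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
            System.out.printf("%s -> U+%04X%n", composed, (int) composed.charAt(0));
        }
    }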
Your last sentence is puzzling; what do you mean by "possible to look up what code page one is looking at"?
If you include your whole codepage, people here on SO could be more helpful and give you more insight about this issue, having one data point 0x41=Á doesn't help much.
Somewhat random idea, but if you can replicate a significant amount of the text off the device, you could try running it through something like the detect function in http://chardet.feedparser.org/.