Random symbols in Source window instead of Russian characters in RStudio - encoding

I have been googling and stackoverflowing (yes, that is the word now) on how to fix the problem with wrong encoding. However, I could not find the solution.
I am trying to load an .Rmd file with UTF-8 encoding that contains Russian characters. They are not displayed properly. Instead, the code lines in the Source window show random symbols.
Initially, I created this .Rmd file long ago on my previous laptop. Now, I am using another one and I cannot spot the issue here.
I have already tried to use some Sys.setlocale() commands with no success whatsoever.
I run RStudio on Windows 10.
Edited
This is the output of readBin('raw[1].Rmd', raw(), 10000). Slice from 2075 to 2211:
[2075] 64 31 32 2c 20 71 68 35 20 3d 3d 20 22 d0 a0 d1 9a d0 a0 d0 88 d0 a0 e2 80 93 d0 a0 d0 8e d0 a0 d1 99
[2109] d0 a0 d1 9b d0 a0 e2 84 a2 22 29 3b 20 64 31 32 6d 24 71 68 35 20 3d 20 4e 55 4c 4c 0d 0a 64 31 35 6d
[2143] 20 3d 20 66 69 6c 74 65 72 28 64 31 35 2c 20 74 68 35 20 3d 3d 20 22 d0 a0 d1 9a d0 a0 d0 88 d0 a0 e2
[2177] 80 93 d0 a0 d0 8e d0 a0 d1 99 d0 a0 d1 9b d0 a0 e2 84 a2 22 29 3b 20 64 31 35 6d 24 74 68 35 20 3d 20
Thank you.

Windows doesn't have very good support for UTF-8. Likely your local encoding is something else.
RStudio normally reads files using the system encoding. If that is wrong, you can use "File | Reopen with encoding..." to re-open the file using a different encoding.
Edited to add:
The first line of the sample output looks like UTF-8 encoding with some Cyrillic letters, but not Russian-language text. I decode it as "d12, qh5 == \"РњРЈР–РЎРљ". Is that what RStudio gave you when you re-opened the file, declaring it as UTF-8?
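One possible explanation, offered as a guess rather than a diagnosis: the bytes look like UTF-8 text that was at some point misread as Windows-1251 and then saved again as UTF-8. Here is a minimal Python sketch of undoing that, using the bytes between the double quotes in the dump. The cp1251 step is my assumption and is not confirmed anywhere in the thread.

```python
# Bytes found between the double quotes in the readBin() output.
raw = bytes.fromhex(
    "d0a0d19a d0a0d088 d0a0e28093 d0a0d08e "
    "d0a0d199 d0a0d19b d0a0e284a2"
)

# Read as UTF-8, this gives the odd-looking Cyrillic text.
garbled = raw.decode("utf-8")

# Assumption: the file was once opened as Windows-1251 and re-saved as UTF-8.
# Undo that by mapping each character back to its cp1251 byte, then decoding
# the resulting bytes as UTF-8.
repaired = garbled.encode("cp1251").decode("utf-8")

print(garbled)
print(repaired)
```

If the assumption holds, the repaired string is the Russian word МУЖСКОЙ, which would mean the file itself was damaged by an earlier save and that no locale setting alone will fix the display.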

Related

Unknown data between h264 NAL units in an AVC file

I want to understand a weird observation I made while working with H.264-encoded AVC files. In such files, each NAL unit is preceded by 1, 2, or 4 bytes that encode the size of the NAL unit (not counting the size header itself). However, there have been some cases where the end of one NAL unit does not lead to another NAL unit. Instead, it leads to a run of unknown data, and only after that data does the next NAL unit start.
For example, starting at 01ADF399, we have:
*00 00 35 99 41* 9A 12 25 83 A5 F0 7A 08 41 0C 1E 02 50 20 03 80 A4 12 30 B6 44 90
0C E1 CD A2 68 9F 9F 2E C0 2E 1C 18 A2 28 8A 85 65 AC 0B 7D F1 DD 0F ...
That NAL unit ends at 01AE2936, where the following bytes appear:
21 1A 54 6D FC 34 3B 32 FA AA D6 71 8A BC 92 F9 95 79 75 8A E6 B5 A9 77 24 4A AC
1C E3 EF A2 9D 97 30 51 D1 7B EB 75 FD B2 8D 8A A7 B9 47 8A C6 59 1A 32 FB 9E 77
03 8E CA 67 23 B7 52 EE 2E A4 BA 43 CE F9 CD 46 48 C5 C4 41 35 32 F3 D6 5B CD BE
DA B8 B3 3E 1B 33 87 AE 65 A0 45 74 DF EB 37 96 2F DA 9C ...
This is clearly not the start of a NAL unit, since the byte FC, which sits where the NAL header would be, has the forbidden_zero_bit set to 1.
However, at 01AE7535, we have the following:
00 00 27 EA 41 9A 14 29 81 29 7C 80 41 04 18 98 44 64 01 C6 54 00 0D 9F 34 58 71
E5 0A A6 CD B0 4B 38 60 7F E6 1F C8 00 24 7A 06 E5 9B 21 99 F0 51 24 9B ...
This is the start of a NAL unit. I verified that those two NAL units are consecutive: filtering the file to the Annex B H.264 format removes the unknown data in between and places those two exact NAL units right next to each other.
I tried looking at ISO 14496 part 15 but it doesn't mention anything about this.
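For reference, the length-prefixed layout described above can be walked with a short script. This is only a sketch in Python, assuming a 4-byte big-endian length field by default (the size may also be 1 or 2 bytes, as noted above). The function names are made up for illustration.

```python
def walk_nal_units(buf: bytes, offset: int = 0, length_size: int = 4):
    """Yield (offset, size, header_byte) for each length-prefixed NAL unit."""
    while offset + length_size <= len(buf):
        size = int.from_bytes(buf[offset:offset + length_size], "big")
        payload = offset + length_size
        # Stop when the length field cannot describe a unit inside the buffer.
        if size == 0 or payload + size > len(buf):
            break
        yield offset, size, buf[payload]
        offset = payload + size


def describe_header(header: int) -> dict:
    """Split the one-byte H.264 NAL header into its three fields."""
    return {
        "forbidden_zero_bit": header >> 7,
        "nal_ref_idc": (header >> 5) & 0x03,
        "nal_unit_type": header & 0x1F,
    }


# Offsets taken from the dumps above: length field at 0x01ADF399, size 0x3599.
end_of_first_unit = 0x01ADF399 + 4 + 0x3599
print(hex(end_of_first_unit))
print(describe_header(0x41), describe_header(0xFC))
```

With a 4-byte length field, the first unit ends exactly at 0x01AE2936, which matches the offset quoted above, so the unknown data really does start right where the next length field would be expected.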

Inspecting binary over sockets

I'm using Wireshark to inspect data sent and received over a WebSocket, but all I see is nonsense.
0000 1c 74 0d 7d 42 24 d8 5d e2 26 c1 7d 08 00 45 00 .t.}B$.].&.}..E.
0010 00 3c 75 4e 40 00 80 06 22 eb c0 a8 01 c0 4f 89 .<uN@...".....O.
0020 50 91 c4 f1 0f 78 72 e5 d0 f4 ea 5e 6e e2 50 18 P....xr....^n.P.
0030 00 40 91 b3 00 00 c2 8e 6d 06 87 95 7f 76 78 62 .@......m....vxb
0040 92 f9 54 2a 92 f9 b4 95 6c 06 ..T*....l.
I've seen this type of output before. The left is a line of binary, and the right is the decoded string (ASCII), right?
Is this data obfuscated/encrypted?
Is it possible to get cogent information from my socket?
Also, what do the [FIN] and [MASKED] flags mean?

If you copy and paste the data you supplied into a text file and append a line beginning with 0050 with nothing following it, you can then run text2pcap -a infile.txt outfile.pcap to convert the data to a pcap file that Wireshark can read and decode for you.
See the text2pcap man page for more information about this tool.
I have done this and the packet appears to just be a simple TCP segment. There is no [FIN] or [MASKED] flag, only PSH and ACK. For information about these TCP flags, refer to RFC 793, section 3.1, as well as the other RFCs mentioned at the top, which update this one.
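The PSH and ACK reading can be double-checked without Wireshark. Below is a rough Python sketch that parses the Ethernet, IPv4 and TCP headers of the dump by hand. It assumes an untagged Ethernet II frame carrying IPv4 with TCP, and the function name is made up for illustration.

```python
import struct

# Hex column of the dump from the question, without offsets or ASCII column.
FRAME_HEX = """
1c 74 0d 7d 42 24 d8 5d e2 26 c1 7d 08 00 45 00
00 3c 75 4e 40 00 80 06 22 eb c0 a8 01 c0 4f 89
50 91 c4 f1 0f 78 72 e5 d0 f4 ea 5e 6e e2 50 18
00 40 91 b3 00 00 c2 8e 6d 06 87 95 7f 76 78 62
92 f9 54 2a 92 f9 b4 95 6c 06
"""

# TCP flag names ordered from the least significant bit of the flags byte.
TCP_FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG", "ECE", "CWR"]


def parse_frame(frame: bytes) -> dict:
    """Decode just enough of Ethernet/IPv4/TCP to list the TCP flags."""
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != 0x0800:
        raise ValueError("not an IPv4 frame")
    ip = frame[14:]
    ip_header_len = (ip[0] & 0x0F) * 4
    (total_length,) = struct.unpack("!H", ip[2:4])
    if ip[9] != 6:
        raise ValueError("not a TCP segment")
    tcp = ip[ip_header_len:total_length]
    src_port, dst_port = struct.unpack("!HH", tcp[0:4])
    tcp_header_len = (tcp[12] >> 4) * 4
    flag_bits = tcp[13]
    flags = [name for bit, name in enumerate(TCP_FLAG_NAMES) if flag_bits & (1 << bit)]
    return {
        "src": ".".join(str(b) for b in ip[12:16]),
        "dst": ".".join(str(b) for b in ip[16:20]),
        "src_port": src_port,
        "dst_port": dst_port,
        "flags": flags,
        "payload": tcp[tcp_header_len:],
    }


info = parse_frame(bytes.fromhex("".join(FRAME_HEX.split())))
print(info["flags"], len(info["payload"]))
```

For this frame the flags byte is 0x18, which is PSH plus ACK, and 20 bytes of TCP payload remain after the headers.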

Play WS, how to transmit non English symbols?

I make some requests with Scala and Play WS.
One of the requests has non-English characters (Russian in my case).
The query looks like this:
val data = "query=" + "Россия, Москва, шоссе Энтузиастов, 21с1" + "&sr=" + Json.obj("wkid" -> 4326)
val futureResponse: Future[WSResponse] = WS.url("http://someservise.com/").post(data)
The query doesn't work, although it looks like it should.
So I captured the packets with Wireshark, and it shows me the request like this:
query=, , , 211&sr={"wkid":4326}
and in hex:
0000 71 75 65 72 79 3d d0 a0 d0 be d1 81 d1 81 d0 b8 query=..........
0010 d1 8f 2c 20 d0 9c d0 be d1 81 d0 ba d0 b2 d0 b0 .., ............
0020 2c 20 d1 88 d0 be d1 81 d1 81 d0 b5 20 d0 ad d0 , .......... ...
0030 bd d1 82 d1 83 d0 b7 d0 b8 d0 b0 d1 81 d1 82 d0 ................
0040 be d0 b2 2c 20 32 31 d1 81 31 26 73 72 3d 7b 22 ..., 21..1&sr={"
0050 77 6b 69 64 22 3a 34 33 32 36 7d wkid":4326}
What do I need to do to transmit this query correctly?
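The capture shows the Cyrillic text going out as raw UTF-8 bytes. One thing worth checking, assuming the service expects an application/x-www-form-urlencoded body (the question does not say), is whether the values need to be percent-encoded. Below is a sketch of what that encoding looks like, written in Python purely for illustration rather than in Play WS. I believe Play WS applies this encoding itself when post is given a Map[String, Seq[String]] instead of a String, but treat that as an assumption to verify against the Play documentation.

```python
from urllib.parse import parse_qs, urlencode

# Same two parameters as in the question.
params = {
    "query": "Россия, Москва, шоссе Энтузиастов, 21с1",
    "sr": '{"wkid":4326}',
}

# Percent-encode the UTF-8 bytes of each value, so the body is plain ASCII.
body = urlencode(params, encoding="utf-8")
print(body)

# The raw bytes seen in the capture are simply the UTF-8 form of the text.
print("Россия".encode("utf-8").hex())
```

Decoding the body with parse_qs gives back the original strings, which is roughly what a server expecting form data would do on its side.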

Identifying an unknown data encoding?

I'm trying to understand an undocumented API I have discovered, and I can't get past the format of the data that is being returned.
Here is an example of what I get back when I perform a GET on the url I'm looking at:
A+uZL4258wXdnWztlEPJNXtdl3Tu4hRITtW2AUwQHUK5c6BATSBU/XsQEVIttCpI7wrW/oXWiBloT8+cdtUWBag3mzk3cLohKPvi7PWpf7jqCSbjNGh+5Iv5Gb8by2k31kp62sfwZ+i8r/3TA6nGrnJb6edOB7d0c6F34RTFRrrZSeJtiWYXAJ5JeD3yJY+C
At first I thought this was base64 encoded, but that just gives me back gibberish:
echo -n "<above snippet>" | base64 -D
?/???ݝl?C?5Vy??????,?8?s?#M T?{R-?*H?
???ֈhOϜv??7?97p?!(??????? &?4h~???i7?Jz???g輯???Ʈr[??N?ts?w??F??I?m?f?Ix=?%?
When I strip the URL down to just the domain, I get a website with Cyrillic text. Maybe the data could be converted to Cyrillic somehow?
Does this data format look familiar to you?
I'll continue to keep trying and report back if I make any progress.

This is definitely base64, because of the / and + characters.
When you decode that string using base64, you get this hexdump:
00000000 03 eb 99 2f 8d b9 f3 05 dd 9d 6c ed 94 43 c9 35 |.../......l..C.5|
00000010 7b 5d 97 74 ee e2 14 48 4e d5 b6 01 4c 10 1d 42 |{].t...HN...L..B|
00000020 b9 73 a0 40 4d 20 54 fd 7b 10 11 52 2d b4 2a 48 |.s.@M T.{..R-.*H|
00000030 ef 0a d6 fe 85 d6 88 19 68 4f cf 9c 76 d5 16 05 |........hO..v...|
00000040 a8 37 9b 39 37 70 ba 21 28 fb e2 ec f5 a9 7f b8 |.7.97p.!(.......|
00000050 ea 09 26 e3 34 68 7e e4 8b f9 19 bf 1b cb 69 37 |..&.4h~.......i7|
00000060 d6 4a 7a da c7 f0 67 e8 bc af fd d3 03 a9 c6 ae |.Jz...g.........|
00000070 72 5b e9 e7 4e 07 b7 74 73 a1 77 e1 14 c5 46 ba |r[..N..ts.w...F.|
00000080 d9 49 e2 6d 89 66 17 00 9e 49 78 3d f2 25 8f 82 |.I.m.f...Ix=.%..|
This just looks like 144 bytes of random data. And whenever you call this API URL again, you get a different string, although it starts with the same few characters.
Perhaps you should ask the maintainers of that website how to use their API. Maybe this string is some session ID that you should use in further calls.
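The decoding step above can be reproduced with a few lines of Python. This is a minimal sketch: the helper names are invented for illustration, and the alphabet check is only a quick plausibility test, not proof that a string is base64.

```python
import base64
import string

BASE64_ALPHABET = set(string.ascii_letters + string.digits + "+/=")


def looks_like_base64(text: str) -> bool:
    """Cheap plausibility check: right alphabet and a length divisible by 4."""
    return len(text) % 4 == 0 and set(text) <= BASE64_ALPHABET


def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex column and printable-ASCII column."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x} {hex_part:<{width * 3}}|{text_part}|")
    return "\n".join(lines)


# The response body from the question, split only to keep the lines short.
response = (
    "A+uZL4258wXdnWztlEPJNXtdl3Tu4hRITtW2AUwQHUK5c6BA"
    "TSBU/XsQEVIttCpI7wrW/oXWiBloT8+cdtUWBag3mzk3cLoh"
    "KPvi7PWpf7jqCSbjNGh+5Iv5Gb8by2k31kp62sfwZ+i8r/3T"
    "A6nGrnJb6edOB7d0c6F34RTFRrrZSeJtiWYXAJ5JeD3yJY+C"
)

decoded = base64.b64decode(response, validate=True)
print(looks_like_base64(response), len(decoded))
print(hexdump(decoded))
```

The string is 192 characters with no padding, so it decodes to 144 bytes, which matches the nine full rows of the hexdump above.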

Decoding an arbitrary block of NSData?

If I have an arbitrary block of NSData as a hex value, is there a way to determine what the object might have been before it was archived or serialized? I don't mind a few guess and check methods, but I need some pointers in the right direction.
I have an NSData object with some hex in it. What methods of NSData should I look at? Are there other classes to try as well?
Don't want to scare people away from answering, but I have a file of game data which was likely encoded using a Cocoa Touch class. The data, when viewed in a hex editor, shows gibberish and a username, which leads me to suspect that it's an archived or encoded object of sorts. I have copied the hex from the hex editor into a sample project which I am using to try and unarchive the data.
I don't believe this is related to the 3d format, the file extension is arbitrary.
Here's the data. I'm hoping it doesn't get lost in translation:
'µköXN[ÎÀü÷h/F9ó9Vìñ°ceE¸z¶=Hmoshbermú«ó¼Ppù#ÝVÔ=4â®L,K;Êç;ASÀ&Ë÷ëÓ%È;Úf¬G}tmQ;µéüø_87´y©ã©!߶óQòAçÛl©âSG4S½3ýJת9äô¡wxiD²M¼ÏB]39øþ:óñ7ª¾÷躣È3Ï¢ÍEFÍ¢ª»r]BmÁ'Ò+åygÞÅQ?luó>÷ú¼è6¸|}[¼[¶Ñ¦g!\OÎÒJSE..pSß&_ÈEäø)6òëó¨¼2¶ð°æà`ï7Ë=Ã¥:cƧ=L4qG-"µ(ÐÝïß ÓãXkÀ4fzæ·p\ññT<tu¥Æ©;Ìn4£³Ï¢ÌFåG´
And the corresponding hex:
27 B5 6B F6 01 00 00 00 58 4E 5B CE C0 FC F7 68 2F 46 86 87 83 39 F3 39 9E 56 EC F1 B0 63 9E 65 45 B8 7A B6 3D 07 99 48 6D 6F 73 68 62 65 72 6D 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 90 86 FA 03 0E AB F3 BC 0B 50 70 F9 23 DD 87 56 03 D4 3D 34 90 E2 AE 4C 2C 94 9E 8E 15 4B 0C 83 8C 3B 03 CA E7 3B 1B 41 53 C0 26 04 CB F7 EB D3 25 C8 3B DA 66 8A AC 47 7D 8A 7F 74 6D 51 3B B5 19 E9 FC F8 5F 38 37 B4 11 0C 79 A9 12 E3 A9 21 DF B6 F3 51 F2 41 E7 DB 85 02 9F 6C A9 E2 53 47 1F 34 86 53 BD 33 FD 4A D7 AA 39 C3 A4 F4 A1 77 78 69 44 B2 4D BC CF 42 5D 33 39 F8 FE 97 3A 81 F3 F1 10 37 AA BE 86 91 F7 1F E8 83 BA A3 C8 33 CF 1D A2 CD 45 7F 46 1F CD A2 AA BB 1A 72 5D 42 02 6D C1 0F 27 D2 2B E5 0B 79 67 DE C5 1A 51 3F 14 6C 75 F3 3E F7 FA BC E8 36 8E B8 7C 02 1C 7D 01 00 92 8C 19 5B BC 5B B6 D1 A6 67 7F 21 5C 84 13 4F CE 0C D2 4A 53 19 82 45 1B 2E 2E 96 70 53 DF 26 5F C8 1C 45 8F E4 F8 29 36 F2 EB 9D 95 F3 A8 BC 32 B6 F0 B0 E6 91 98 1A E0 99 60 EF 37 CB 3D C3 A5 3A 63 0C C6 A7 3D 4C 34 71 47 2D 22 B5 28 D0 DD EF DF 09 D3 E3 58 6B C0 17 34 66 7A E6 B7 70 5C F1 F1 54 3C 74 94 75 A5 C6 15 A9 9E 14 3B CC 15 10 83 6E 34 A3 B3 CF 0F A2 9C CC 8E 46 8C E5 00 00 47 B4 17 05 00 00 00 00
If anyone cares to help figure this out it would be much appreciated.

> If I have an arbitrary block of NSData as a hex value, is there a way to determine what the object might have been before it was archived or serialized?
Not really. That's about as 'trivial' as reading arbitrary files correctly without the use of a UTI, extension, or MIME type. Of course, your program would also need to support reading of all those files/formats.
> I don't mind a few guess and check methods, but I need some pointers in the right direction.
You need to narrow your problem/inputs down, if you don't want an impossibly difficult task.
> I have an NSData object with some hex in it. What methods of NSData should I look at?
It's just a data blob of length bytes. It could represent anything -- if you don't know where it came from.
> Are there other classes to try as well?
Perhaps you would start by saving all your data via NSCoder or another serializer/archiver which offers some introspection and support for you to enter your own information (which would be comparable to a UTI or MIME type).
Edit:
> Don't want to scare people away from answering, but I have a file of game data which was likely encoded using a Cocoa Touch class. The data, when viewed in a hex editor, shows gibberish and a username, which leads me to suspect that it's an archived or encoded object of sorts. I have copied the hex from the hex editor into a sample project which I am using to try and unarchive the data.
Using these APIs, the data may be represented multiple ways. You're probably facing something in the range from 1) a proprietary file format to 2) a keyed archive.
The latter is easier for nontrivial data representations. You would need to define any Objective-C classes you do not have available when unarchiving. In that case, a few sample representations would offer a rough outline of the data structures you will need (under conventional implementations). It could also be an archive similar to an NSDictionary, if the unarchiver is capable of opening it. This is a problem which is easier than with other languages, since archiving often falls back on keys and values mapped to members in Cocoa.
Edit2:
> The file came from the Draw Something directory. It's called gamedata.i3d
(shrug)
Try using NSKeyedUnarchiver to read it. It's not uncommon to use just the standard Foundation containers like NSArray, NSDictionary, and NSString to store data, so you might get lucky. That obviously won't work if custom classes were involved, but it might be worth 15 minutes of your time to try it.
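Before spending those 15 minutes, a quick check outside Cocoa may help. Keyed archives written by NSKeyedArchiver are normally property lists, and the binary form starts with the bytes bplist00. The Python sketch below tests for that and, if the header matches, tries to load the plist and look for the "$archiver" key. The function names are made up for illustration, and it assumes a Python version whose plistlib can read the UID values used by keyed archives (3.8 or later).

```python
import plistlib


def looks_like_binary_plist(data: bytes) -> bool:
    """Binary property lists start with the magic bytes 'bplist00'."""
    return data.startswith(b"bplist00")


def try_keyed_archive(data: bytes):
    """Return the plist dict if data looks like a keyed archive, else None."""
    if not looks_like_binary_plist(data):
        return None
    try:
        plist = plistlib.loads(data)
    except Exception:
        # Malformed or unsupported plist: treat it as "not a keyed archive".
        return None
    if isinstance(plist, dict) and plist.get("$archiver") == "NSKeyedArchiver":
        return plist
    return None


# First bytes of the hex dump posted in the question.
header = bytes.fromhex("27B56BF601000000584E5BCE")
print(looks_like_binary_plist(header))
print(try_keyed_archive(header))
```

The posted data starts with 27 B5 6B F6 rather than bplist00, which suggests it is probably not a plain keyed archive, although it could still be one that has been wrapped, compressed or encrypted.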