Encoding error in the Content-Disposition header (file name contains Cyrillic characters) in UWP

When I get the file name from the link, I receive "Ð\u0097акаÑ\u0082.jpg" instead of "Закат.jpg".
I determined that the file name arrives in the "windows-1251" encoding. When I try to convert it to UTF-8 or UTF-16, the file name is still displayed incorrectly. Has anyone come across something like this?
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding wind1252 = Encoding.GetEncoding("windows-1252");
Encoding utf8 = Encoding.UTF8;
byte[] wind1252Bytes = wind1252.GetBytes(FileName);
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);
string utf8String = Encoding.UTF8.GetString(utf8Bytes);
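One possible reading of the sample above, sketched in Python: "Ð\u0097" and "Ñ\u0082" are exactly what the UTF-8 bytes of "З" (D0 97) and "т" (D1 82) look like when each byte is read as an ISO-8859-1 (Latin-1) character. That the header was decoded as Latin-1 is an assumption drawn only from this sample, not something confirmed for the actual server or client.

```python
# Assumption: the header value was UTF-8 on the wire but was decoded as Latin-1.
expected = "Закат.jpg"

# Reproduce the damage: every UTF-8 byte becomes one Latin-1 character.
garbled = expected.encode("utf-8").decode("latin-1")
print(repr(garbled))

# Undo it: map each character back to its byte, then decode the bytes as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

Latin-1 is used for the reverse step because it maps every character from U+0000 to U+00FF to the byte with the same value, so no byte is lost on the way back.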

Emoji and accent encoding in dart/flutter

I get the following string from my API:
"à é í ó ú ü ñ \uD83D\uDE00\uD83D\uDE03\uD83D\uDE04\uD83D\uDE01\uD83D\uDE06\uD83D\uDE05"
It comes from a response in JSON format:
{
'apiText': "à é í ó ú ü ñ \uD83D\uDE00\uD83D\uDE03\uD83D\uDE04\uD83D\uDE01\uD83D\uDE06\uD83D\uDE05",
'otherInfo': 'etc.',
.
.
.
}
It contains accented characters (à é í ó ú ü ñ) that are not decoded correctly, and it contains emojis (\uD83D\uDE00\uD83D\uDE03\uD83D\uDE04\uD83D\uDE01\uD83D\uDE06\uD83D\uDE05).
So far I have tried:
var json = jsonDecode(response.body);
String apiText = json['apiText'];
List<int> bytes = apiText.codeUnits;
comentario = utf8.decode(bytes);
but this produces:
[ERROR:flutter/lib/ui/ui_dart_state.cc(166)] Unhandled Exception: FormatException: Invalid UTF-8 byte (at offset 21)
How can I get the correct text with accents and emoji?
Based on the fact that you called response.body, I assume you are using the http package, whose Response objects have a body property.
You should note the following detail in the documentation:
This is converted from bodyBytes using the charset parameter of the Content-Type header field, if available. If it's unavailable or if the encoding name is unknown, latin1 is used by default, as per RFC 2616.
Well, it seems rather likely that it cannot figure out the charset and therefore defaults to latin1 which explains how your response got messed up.
A solution for this is to use response.bodyBytes instead, which contains the raw bytes of the response. You can then decode these manually, e.g. with utf8.decode(response.bodyBytes), if you are sure the response should be parsed as UTF-8.
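The same effect can be reproduced outside Dart. The Python sketch below uses a made-up response body (the bytes and the shortened emoji list are assumptions for illustration) to show why decoding the raw bytes as UTF-8 before parsing the JSON works, while re-encoding the already-parsed string does not.

```python
import json

# Hypothetical raw body: UTF-8 bytes, with the emoji written as a JSON \u escape pair.
body_bytes = '{"apiText": "à é í ó ú ü ñ \\uD83D\\uDE00"}'.encode("utf-8")

# A Latin-1 default turns every UTF-8 byte into its own character.
wrong = json.loads(body_bytes.decode("latin-1"))["apiText"]

# Decoding the raw bytes as UTF-8 first gives the intended text.
right = json.loads(body_bytes.decode("utf-8"))["apiText"]

# Trying to repair the parsed string afterwards fails: by then the JSON parser
# has already turned the escape pair into a real emoji, which is not a single byte.
try:
    wrong.encode("latin-1")
    reencode_failed = False
except UnicodeEncodeError:
    reencode_failed = True

print(repr(wrong))
print(right)
print(reencode_failed, wrong.index("\U0001F600"))
```

In this sketch the emoji sits at index 21 of the wrongly decoded string, which is consistent with the "offset 21" in the error message from the question.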
Rewrite your function as:
String utf8convert(String text) {
  var bytes = text.codeUnits;
  String decodedCode = utf8.decode(bytes, allowMalformed: true);
  if (decodedCode.contains("�")) {
    return text;
  }
  return decodedCode;
}

Norwegian characters shown as '?' in ICS (C#)

I have been trying to send a calendar event (.ics) as a mail attachment, but the summary and description show Norwegian characters such as 'ø' as '?'.
Please help me as I am new to the calendar events in ASP.Net MVC.
System.Text.StringBuilder str = new StringBuilder();
str.AppendLine("BEGIN:VCALENDAR");
str.AppendLine("PRODID:-//Schedule a Meeting");
str.AppendLine("VERSION:2.0");
str.AppendLine("METHOD:PUBLISH");
str.AppendLine("BEGIN:VEVENT");
str.AppendLine(string.Format("DTSTART:{0:yyyyMMddTHHmmssZ}",model.Startdate));
str.AppendLine(string.Format("DTSTAMP:{0:yyyyMMddTHHmmssZ}", DateTime.UtcNow));
str.AppendLine(string.Format("DTEND:{0:yyyyMMddTHHmmssZ}", model.EndDate));
str.AppendLine("LOCATION: " + model.Location);
str.AppendLine(string.Format("UID:{0}", Guid.NewGuid()));
str.AppendLine(string.Format("DESCRIPTION:{0}", model.desc));
str.AppendLine(string.Format("SUMMARY:{0}", model.Name));
str.AppendLine(string.Format("ORGANIZER:MAILTO:{0}", model.Email));
str.AppendLine("BEGIN:VALARM");
str.AppendLine("TRIGGER:-PT15M");
str.AppendLine("ACTION:DISPLAY");
str.AppendLine("DESCRIPTION:Reminder");
str.AppendLine("END:VALARM");
str.AppendLine("END:VEVENT");
str.AppendLine("END:VCALENDAR");
byte[] byteArray = Encoding.ASCII.GetBytes(str.ToString());
MemoryStream stream = new MemoryStream(byteArray);
Attachment attach = new Attachment(stream, "Invitation.ics");
The problem here is that you're losing the special characters when using the ASCII encoding. Use some other encoding, e.g. UTF-8, which is a variable-width multi-byte encoding that can cover all characters.
The attached link shows how to specify the used encoding in the ics file:
https://theeventscalendar.com/support/forums/topic/ical-text-encoding/
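To illustrate the point about ASCII, here is a small Python sketch. The event title is made up, and errors="replace" is used as a stand-in for the '?' substitution seen in the question.

```python
# Hypothetical SUMMARY line containing Norwegian characters.
summary = "SUMMARY:Møte på kontoret"

# ASCII has no code for 'ø' or 'å', so each one is replaced by '?'.
ascii_bytes = summary.encode("ascii", errors="replace")
print(ascii_bytes)

# UTF-8 keeps every character and decodes back to the original text.
utf8_bytes = summary.encode("utf-8")
print(utf8_bytes.decode("utf-8"))
```

The C# counterpart of the second half would be swapping Encoding.ASCII for Encoding.UTF8 when building the byte array.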

Convert Persian Unicode to ASCII

I need to get the ASCII code of a Persian string to use it in a program, but the method below returns question marks: "??? ????"
public string PerisanAscii()
{
    // Persian string
    string unicodeString = "صبح بخیر";
    // Create two different encodings.
    Encoding ascii = Encoding.ASCII;
    Encoding unicode = Encoding.Unicode;
    // Convert the string into a byte array.
    byte[] unicodeBytes = unicode.GetBytes(unicodeString);
    // Perform the conversion from one encoding to the other.
    byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes);
    // Convert the new byte[] into a char[] and then into a string.
    char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
    ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
    string asciiString = new string(asciiChars);
    return asciiString;
}
Can you help me?
Best regards,
Mohsen
You can convert Persian UTF8 data to Windows-1256 (Arabic Windows):
var enc1256 = Encoding.GetEncoding("windows-1256");
var data = enc1256.GetBytes(unicodeString);
System.IO.File.WriteAllBytes(path, data);
ASCII does not support Persian. You may need the old Iran System encoding standard; which one is required is determined by your AutoCAD application. I don't know whether Windows provides an Encoding for it directly, but you can also convert the characters manually. It's a simple mapping.
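A short Python sketch of both points, using the string from the question. Only the first word is encoded to Windows-1256 here; whether every Persian letter is covered by a given code page should be checked rather than assumed.

```python
text = "صبح بخیر"

# ASCII cannot represent any of these letters, so each becomes '?'.
as_ascii = text.encode("ascii", errors="replace").decode("ascii")
print(as_ascii)

# A single-byte Arabic code page stores one byte per letter instead.
first_word = text.split()[0]
cp1256_bytes = first_word.encode("cp1256")
print(len(cp1256_bytes), cp1256_bytes.decode("cp1256") == first_word)
```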

What is the character encoding?

I have several characters that aren't recognized properly.
Characters like:
º
á
ó
(etc..)
This means that the character encoding is not UTF-8, right?
So, can you please tell me what character encoding it could be?
We don't have nearly enough information to really answer this, but the gist of it is: you shouldn't just guess. You need to work out where the data is coming from, and find out what the encoding is. You haven't told us anything about the data source, so we're completely in the dark. You might want to try Encoding.Default if these are files saved with something like Notepad.
If you know what the characters are meant to be and how they're represented in binary, that should suggest an encoding... but again, we'd need to know more information.
Read this first: http://www.joelonsoftware.com/articles/Unicode.html
There are two encodings involved: the one used to encode the string and the one used to decode it. They must be the same to get the expected result. If they differ, some characters will be displayed incorrectly. We can try to guess if you post the actual and expected results.
I wrote a couple of methods to narrow down the possibilities a while back for situations just like this.
using System.Collections.Generic;
using System.Text;

static void Main(string[] args)
{
    Encoding[] matches = FindEncodingTable('Ÿ');
    Encoding[] enc2 = FindEncodingTable(159, 'Ÿ');
}

// Locates all encodings that decode the given byte value to the given character.
// CharacterPosition: decimal position of the character in the unknown encoding table, e.g. 159 in the extended ASCII table.
// character: the character to locate in the encoding table, e.g. 'Ÿ' in the extended ASCII table.
static Encoding[] FindEncodingTable(int CharacterPosition, char character)
{
    List<Encoding> matches = new List<Encoding>();
    byte myByte = (byte)CharacterPosition;
    byte[] bytes = { myByte };
    foreach (EncodingInfo encInfo in Encoding.GetEncodings())
    {
        Encoding thisEnc = Encoding.GetEncoding(encInfo.CodePage);
        char[] chars = thisEnc.GetChars(bytes);
        // Multi-byte encodings may return no characters for a single byte.
        if (chars.Length > 0 && chars[0] == character)
        {
            // No break here, so that all matching encodings are collected.
            matches.Add(thisEnc);
        }
    }
    return matches.ToArray();
}

// Locates all encodings that can represent the specified character.
static Encoding[] FindEncodingTable(char character)
{
    List<Encoding> matches = new List<Encoding>();
    foreach (EncodingInfo encInfo in Encoding.GetEncodings())
    {
        Encoding thisEnc = Encoding.GetEncoding(encInfo.CodePage);
        char[] chars = { character };
        byte[] temp = thisEnc.GetBytes(chars);
        // GetBytes never returns null; unsupported characters come back as a
        // fallback such as '?', so check that the bytes decode to the same character.
        if (thisEnc.GetString(temp) == character.ToString())
            matches.Add(thisEnc);
    }
    return matches.ToArray();
}
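The first lookup above (which encodings put a given character at a given byte position) can be sketched in Python as well. The candidate list is a hand-picked assumption, and codecs_mapping_byte is a hypothetical helper name, not a library function.

```python
# Hand-picked single-byte codecs to test (an assumption; extend as needed).
candidates = ["cp1250", "cp1251", "cp1252", "latin-1", "cp437"]

def codecs_mapping_byte(byte_value, character, names):
    """Return the codec names that decode the single byte to the character."""
    matches = []
    for name in names:
        try:
            decoded = bytes([byte_value]).decode(name)
        except UnicodeDecodeError:
            continue  # this codec has no character at that position
        if decoded == character:
            matches.append(name)
    return matches

# Byte 159 (0x9F) is 'Ÿ' in Windows-1252 but a control character in Latin-1.
matches = codecs_mapping_byte(159, "Ÿ", candidates)
print(matches)
```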
Encoding is a way of transforming existing content so that it can be parsed by the destination protocol.
An example of encoding can be seen when browsing the internet:
The URL you visit: www.example.com, may have the search facility to run custom searches via the URL address:
www.example.com?search=...
The variables in the URL's query string require URL encoding. If you were to write:
www.example.com?search=cat food cheap
the browser would not understand your request, because it contains an invalid character: ' ' (a space).
To correct this encoding error, replace each ' ' with '%20' to form this URL:
www.example.com?search=cat%20food%20cheap
Different systems use different forms of encoding; in this example I have used standard URL percent-encoding (hexadecimal). In other applications you may need other types of encoding.
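For reference, the same replacement can be done with Python's standard library instead of by hand; this sketch only reproduces the example URL above.

```python
from urllib.parse import quote, unquote

query = "cat food cheap"

# quote() percent-encodes characters that are not safe in a URL; a space becomes %20.
encoded = quote(query)
url = "www.example.com?search=" + encoded
print(url)

# unquote() reverses the encoding.
print(unquote(encoded))
```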
Good Luck!

How to recover text from a wrong encoding?

I have some files created on Asian operating systems (Chinese and Japanese Windows XP).
The file names are garbled, for example:
иè+¾«Ñ¡Õä²ØºÏ¼­
How can I recover the original text?
I tried this in C#:
Encoding unicode = Encoding.Unicode;
Encoding chinese = Encoding.GetEncoding(936);
byte[] chineseBytes = chinese.GetBytes(garbledString);
byte[] unicodeBytes = Encoding.Convert(unicode, chinese, chineseBytes);
//(Then convert byte in string)
I also tried changing unicode to windows-1252, but with no luck.
It's double-encoded text. The original is in code page 936; then some application assumed the text was ISO-8859-1 and encoded the result as UTF-8. Here is an example of how to decode it in Python:
>>> print 'иè+¾«Ñ¡Õä²ØºÏ¼­'.decode('utf8').encode('latin1').decode('cp936')
新歌+精选珍藏合辑
I'm sure you can do something similar in C#.
Encoding unicode = Encoding.Unicode;
That's not what you want. “Unicode” is Microsoft's totally misleading name for what is really the UTF-16LE encoding. UTF-16LE plays no part here; what you have is a simple case where a code page 936 string has been misdecoded as 1252.
Windows code page 1252 is similar to, but not the same as, ISO-8859-1. There is no way to tell which was used for the example string, as it does not contain any of the bytes 0x80-0x9F that differ between the two encodings, but I'm assuming 1252 because that's the standard code page on a western Windows install.
Encoding latin = Encoding.GetEncoding(1252);
Encoding chinese = Encoding.GetEncoding(936);
string recovered = chinese.GetString(latin.GetBytes(s));
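The same round trip in Python 3, as a minimal sketch. The garbled string is rebuilt here from the expected text rather than pasted in, to avoid depending on invisible characters surviving a copy and paste.

```python
expected = "新歌+精选珍藏合辑"

# Reproduce the damage: code page 936 bytes misread as Windows-1252 text.
garbled = expected.encode("cp936").decode("cp1252")
print(repr(garbled))

# Reverse it: turn the characters back into bytes, then decode with the right code page.
recovered = garbled.encode("cp1252").decode("cp936")
print(recovered)
```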
The first argument to Encoding.Convert is the source encoding. Shouldn't that be chinese in your case? So
Encoding.Convert(chinese, unicode, chineseBytes);
might actually work. Because, after all, you want to convert CP-936 to Unicode and not vice-versa. And I'd suggest you don't even try bothering with CP-1252 since your text there is very likely not Latin.
This is an old question, but I just ran into the same situation while trying to migrate WordPress upload files off of an old Windows Server 2008 R2 server. bobince's answer set me on the right track, but I had to search for the right encoding/decoding pair.
With the following C#, I found the relevant encoding/decoding pair:
using System;
using System.Text;
public class Program
{
public static void Main()
{
// garbled
string s = "2020竹慶本樂ä»æ³¢åˆ‡äºžæ´²æ³•ç­µ-Intro-2-1024x643.jpg";
// expected
string t = "2020竹慶本樂仁波切亞洲法筵-Intro-2-1024x643.jpg";
foreach( EncodingInfo ei in Encoding.GetEncodings() ) {
Encoding e = ei.GetEncoding();
foreach( EncodingInfo ei2 in Encoding.GetEncodings() ) {
Encoding e2 = ei2.GetEncoding();
var s2 = e2.GetString(e.GetBytes(s));
if (s2 == t) {
var x = ei.CodePage;
Console.WriteLine($"e1={ei.DisplayName} (CP {ei.CodePage}), e2={ei2.DisplayName} (CP {ei2.CodePage})");
Console.WriteLine(t);
Console.WriteLine(s2);
}
}
}
Console.WriteLine("-----------");
Console.WriteLine(t);
Console.WriteLine(Encoding.GetEncoding(65001).GetString(Encoding.GetEncoding(1252).GetBytes(s)));
}
}
It turned out that the correct encoding/decoding pair in my case was:
e1=Western European (Windows) (CP 1252), e2=Unicode (UTF-8) (CP 65001)
So the last line of code is a one-liner for the correct conversion Console.WriteLine(Encoding.GetEncoding(65001).GetString(Encoding.GetEncoding(1252).GetBytes(s)));.
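The same brute-force search can be sketched in Python. Python has no built-in list of installed code pages comparable to Encoding.GetEncodings(), so the candidate list below is a hand-picked assumption. The sample is a four-character part of the expected file name, and find_pairs is a hypothetical helper name.

```python
from itertools import product

# Sample: part of the expected name, garbled by reading its UTF-8 bytes as Windows-1252.
expected = "亞洲法筵"
garbled = expected.encode("utf-8").decode("cp1252")

# Hand-picked candidates (an assumption; extend the list as needed).
candidates = ["cp1252", "latin-1", "cp936", "cp1251", "utf-8", "utf-16-le"]

def find_pairs(garbled_text, expected_text, names):
    """Try every (wrong, right) codec pair and keep those that restore the text."""
    pairs = []
    for wrong, right in product(names, repeat=2):
        try:
            attempt = garbled_text.encode(wrong).decode(right)
        except UnicodeError:
            continue  # this pair cannot even process the text
        if attempt == expected_text:
            pairs.append((wrong, right))
    return pairs

pairs = find_pairs(garbled, expected, candidates)
print(pairs)
```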