SQL db from ASCII to Unicode

SQL db from ASCII to Unicode - unicode

My SQL Server 2008 R2 database has string columns (nvarchar). And some of the old data is showing ASCII. I need to show it to the user in my site and I prefer to convert the data in the database to Unicode. Is there a quick way to do this? Are there downsides that I should be aware of?
Examples to my issue:
In the database, I see special chars instead of regulars chars. I have a name of a user which is supposed to be Amédée, and instead it shows Am?d??.
In other cases I see " instead of Quotation mark ("), or the chars &# instead of the word "and".

Well if you have accented characters in your database, it's definitely not ASCII. Find out first what codepage you/they were using in the old DB, and convert that to UTF-8, and save to a new database.

I built the following function:
public static string FixString(string textString)
{
textString = Regex.Replace(textString, "[\t\r\n]*", String.Empty);
textString = Regex.Replace(textString, "(\\ )+", " ").Replace(" ", " ").Trim(); // .Replace("-", " ");
textString = Regex.Replace(textString, "\\(.*?\\)", String.Empty);
textString = HttpUtility.HtmlDecode(textString).Trim();
textString = Regex.Replace(textString, "<.*?>", String.Empty);
return textString;
}
and that did the trick!!!

Related

Flutter Unicode Apostrophe In String

I'm hoping this is an easy question, and that I'm just not seeing the forest due to all the trees.
I have a string in flutter than came from a REST API that looks like this:
"What\u0027s this?"
The \u is causing a problem.
I can't do a string.replaceAll("\", "\") on it as the single slash means it's looking for a character after it, which is not what I need.
I tried doing a string.replaceAll(String.fromCharCode(0x92), "") to remove it - That didn't work.
I then tried using a regex to remove it like string.replaceAll("/(?:\)/", "") and the same single slash remains.
So, the question is how to remove that single slash, so I can add in a double slash, or replace it with a double slash?
Cheers
Jase

I found the issue. I was looking for hex 92 (0x92) and it should have been decimal 92.
I ended up solving the issue like this...
String removeUnicodeApostrophes(String strInput) {
// First remove the single slash.
String strModified = strInput.replaceAll(String.fromCharCode(92), "");
// Now, we can replace the rest of the unicode with a proper apostrophe.
return strModified.replaceAll("u0027", "\'");
}

When the string is read, I assume what's happening is that it's being interpreted as literal rather than as what it should be (code points) i.e. each character of \0027 is a separate character. You may actually be able to fix this depending on how you access the API - see the dart convert library. If you use utf8.decode on the raw data you may be able to avoid this entire problem.
However, if that's not an option there's an easy enough solution for you.
What's happening when you're writing out your regex or replace is that you're not escaping the backslash, so it's essentially becoming nothing. If you use a double slash, that solve the problem as it escapes the escape character. "\\" => "\".
The other option is to use a raw string like r"\" which ignores the escape character.
Paste this into https://dartpad.dartlang.org:
String withapostraphe = "What\u0027s this?";
String withapostraphe1 = withapostraphe.replaceAll('\u0027', '');
String withapostraphe2 = withapostraphe.replaceAll(String.fromCharCode(0x27), '');
print("Original encoded properly: $withapostraphe");
print("Replaced with nothing: $withapostraphe1");
print("Using char code for ': $withapostraphe2");
String unicodeNotDecoded = "What\\u0027s this?";
String unicodeWithApostraphe = unicodeNotDecoded.replaceAll('\\u0027', '\'');
String unicodeNoApostraphe = unicodeNotDecoded.replaceAll('\\u0027', '');
String unicodeRaw = unicodeNotDecoded.replaceAll(r"\u0027", "'");
print("Data as read with escaped unicode: $unicodeNotDecoded");
print("Data replaced with apostraphe: $unicodeWithApostraphe");
print("Data replaced with nothing: $unicodeNoApostraphe");
print("Data replaced using raw string: $unicodeRaw");
To see the result:
Original encoded properly: What's this?
Replaced with nothing: Whats this?
Using char code for ': Whats this?
Data as read with escaped unicode: What\u0027s this?
Data replaced with apostraphe: What's this?
Data replaced with nothing: Whats this?
Data replaced using raw string: What's this?

Can't unescape escaped string with ABAP

I want to escape this string in SAPUI5 like this.
var escapedLongText = escape(unescapedLongText);
String (UTF-8 quote, space, Unicode quote)
" “
Escaped string
%22%20%u201C
I want to unescape it with this method, but it returns empty. Any ideas?
DATA: LV_STRING TYPE STRING.
LV_STRING = '%22%20%u201C'.
CALL METHOD CL_HTTP_UTILITY=>UNESCAPE_URL
EXPORTING
ESCAPED = LV_STRING
RECEIVING
UNESCAPED = LV_STRING.

I changed the code in SAPUI5 to the following:
var escapedLongText = encodeURI(unescapedLongText);
This results in: (like andreas mentioned)
%22%20%e2%80%9c
If I want to decode it later in SAPUI5, it can be done like this:
var unescapedLongText = unescape(decodeURI(escapedLongText));
The unescape needs to be done, because commas (for example) don't seem to be decoded automatically.

Encoding URLs containing unicode characters

Is there an Android class that (correctly) encodes URLs containing unicode characters? For example:
Blue Öyster Cult
Is converted to the following using java.net.URI:
uri.toString()
(java.lang.String) Blue%20Öyster%20Cult
The Ö character is not encoded. Using URLEncoder:
URLEncoder.encode("Blue Öyster Cult", "UTF-8").toString()
(java.lang.String) Blue+%C3%96yster+Cult
It encodes too much (i.e. spaces become "+" and path separators "/" become %2F). If I click on a link containing unicode characters with the Dolphin web browser it works correctly, so obviously this can be done. But if I try to open an HttpURLConnection using any of the above strings, I get an HTTP 404 Not Found exception.

I ended up hacking together a solution that seems to work for this, but is probably not the most robust:
url = new URL(userSuppliedPath);
String context = url.getProtocol();
String hostname = url.getHost();
String thePath = url.getPath();
int port = url.getPort();
thePath = thePath.replaceAll("(^/|/$)", ""); // removes beginning/end slash
String encodedPath = URLEncoder.encode(thePath, "UTF-8"); // encodes unicode characters
encodedPath = encodedPath.replace("+", "%20"); // change + to %20 (space)
encodedPath = encodedPath.replace("%2F", "/"); // change %2F back to slash
urlString = context + "://" + hostname + ":" + port + "/" + encodedPath;

URLEncoder is designed to be used to encode form content, not whole URI's. Encoding / as %2F is intentional to prevent user input from being interpreted as a directory, and + is valid encoding for form data. (form data == part of the URI following the ?)
Ideally, you would encode "Blue Öyster Cult" before appending it to your base URI, instead of encoding the whole string. And if "Blue Öyster Cult" is part of the path instead of part of the query string, you have to replace + with %20 yourself. With these restrictions, URLEncoder works fine.

Storing Special Characters in Windows Azure Blob Metadata

I have an app that is storing images in a Windows Azure Block Blob. I'm adding meta data to each blob that gets uploaded. The metadata may include some special characters. For instance, the registered trademark symbol (®). How do I add this value to meta data in Windows Azure?
Currently, when I try, I get a 400 (Bad Request) error anytime I try to upload a file that uses a special character like this.
Thank you!

You might use HttpUtility to encode/decode the string:
blob.Metadata["Description"] = HttpUtility.HtmlEncode(model.Description);
Description = HttpUtility.HtmlDecode(blob.Metadata["Description"]);
http://lvbernal.blogspot.com/2013/02/metadatos-de-azure-vs-caracteres.html

The supported characters in the blob metadata must be ASCII characters. To work around this you can either escape the string ( percent encode), base64 encode etc.
joe

HttpUtility.HtmlEncode may not work; if Unicode characters are in your string (i.e. &#8217), it will fail. So far, I have found Uri.EscapeDataString does handle this edge case and others. However, there are a number of characters that get encoded unnecessarily, such as space (' '=chr(32)=%20).
I mapped the illegal ascii characters metadata will not accept and built this to restore the characters:
static List<string> illegals = new List<string> { "%1", "%2", "%3", "%4", "%5", "%6", "%7", "%8", "%A", "%B", "%C", "%D", "%E", "%F", "%10", "%11", "%12", "%13", "%14", "%15", "%16", "%17", "%18", "%19", "%1A", "%1B", "%1C", "%1D", "%1E", "%1F", "%7F", "%80", "%81", "%82", "%83", "%84", "%85", "%86", "%87", "%88", "%89", "%8A", "%8B", "%8C", "%8D", "%8E", "%8F", "%90", "%91", "%92", "%93", "%94", "%95", "%96", "%97", "%98", "%99", "%9A", "%9B", "%9C", "%9D", "%9E", "%9F", "%A0", "%A1", "%A2", "%A3", "%A4", "%A5", "%A6", "%A7", "%A8", "%A9", "%AA", "%AB", "%AC", "%AD", "%AE", "%AF", "%B0", "%B1", "%B2", "%B3", "%B4", "%B5", "%B6", "%B7", "%B8", "%B9", "%BA", "%BB", "%BC", "%BD", "%BE", "%BF", "%C0", "%C1", "%C2", "%C3", "%C4", "%C5", "%C6", "%C7", "%C8", "%C9", "%CA", "%CB", "%CC", "%CD", "%CE", "%CF", "%D0", "%D1", "%D2", "%D3", "%D4", "%D5", "%D6", "%D7", "%D8", "%D9", "%DA", "%DB", "%DC", "%DD", "%DE", "%DF", "%E0", "%E1", "%E2", "%E3", "%E4", "%E5", "%E6", "%E7", "%E8", "%E9", "%EA", "%EB", "%EC", "%ED", "%EE", "%EF", "%F0", "%F1", "%F2", "%F3", "%F4", "%F5", "%F6", "%F7", "%F8", "%F9", "%FA", "%FB", "%FC", "%FD", "%FE" };
private static string MetaDataEscape(string value)
{
//CDC%20Guideline%20for%20Prescribing%20Opioids%20Module%206%3A%20%0Ahttps%3A%2F%2Fwww.cdc.gov%2Fdrugoverdose%2Ftraining%2Fdosing%2F
var x = HttpUtility.HtmlEncode(value);
var sz = value.Trim();
sz = Uri.EscapeDataString(sz);
for (int i = 1; i < 255; i++)
{
var hex = "%" + i.ToString("X");
if (!illegals.Contains(hex))
{
sz = sz.Replace(hex, Uri.UnescapeDataString(hex));
}
}
return sz;
}
The result is:
Before ==> "1080x1080 Facebook Images"
Uri.EscapeDataString =>
"1080x1080%20Facebook%20Images"
After => "1080x1080 Facebook
Images"
I am sure there is a more efficient way, but the hit seems negligible for my needs.

What is the character encoding?

I have several characters that aren't recognized properly.
Characters like:
º
á
ó
(etc..)
This means that the characters encoding is not utf-8 right?
So, can you tell me what character encoding could it be please.

We don't have nearly enough information to really answer this, but the gist of it is: you shouldn't just guess. You need to work out where the data is coming from, and find out what the encoding is. You haven't told us anything about the data source, so we're completely in the dark. You might want to try Encoding.Default if these are files saved with something like Notepad.
If you know what the characters are meant to be and how they're represented in binary, that should suggest an encoding... but again, we'd need to know more information.

read this first http://www.joelonsoftware.com/articles/Unicode.html
There are two encodings: the one that was used to encode string and one that is used to decode string. They must be the same to get expected result. If they are different then some characters will be displayed incorrectly. we can try to guess if you post actual and expected results.

I wrote a couple of methods to narrow down the possibilities a while back for situations just like this.
static void Main(string[] args)
{
Encoding[] matches = FindEncodingTable('Ÿ');
Encoding[] enc2 = FindEncodingTable(159, 'Ÿ');
}
// Locates all Encodings with the specified Character and position
// "CharacterPosition": Decimal position of the character on the unknown encoding table. E.G. 159 on the extended ASCII table
//"character": The character to locate in the encoding table. E.G. 'Ÿ' on the extended ASCII table
static Encoding[] FindEncodingTable(int CharacterPosition, char character)
{
List matches = new List();
byte myByte = (byte)CharacterPosition;
byte[] bytes = { myByte };
foreach (EncodingInfo encInfo in Encoding.GetEncodings())
{
Encoding thisEnc = Encoding.GetEncoding(encInfo.CodePage);
char[] chars = thisEnc.GetChars(bytes);
if (chars[0] == character)
{
matches.Add(thisEnc);
break;
}
}
return matches.ToArray();
}
// Locates all Encodings that contain the specified character
static Encoding[] FindEncodingTable(char character)
{
List matches = new List();
foreach (EncodingInfo encInfo in Encoding.GetEncodings())
{
Encoding thisEnc = Encoding.GetEncoding(encInfo.CodePage);
char[] chars = { character };
byte[] temp = thisEnc.GetBytes(chars);
if (temp != null)
matches.Add(thisEnc);
}
return matches.ToArray();
}

Encoding is the form of modifying some existing content; thus allowing it to be parsed by the required destination protocols.
An example of encoding can be seen when browsing the internet:
The URL you visit: www.example.com, may have the search facility to run custom searches via the URL address:
www.example.com?search=...
The following variables on the URL require URL encoding. If you was to write:
www.example.com?search=cat food cheap
The browser wouldn't understand your request as you have used an invalid character of ' ' (a white space)
To correct this encoding error you should exchange the ' ' with '%20' to form this URL:
www.example.com?search=cat%20food%20cheap
Different systems use different forms of encoding, in this example I have used standard Hex encoding for a URL. In other applications and instances you may find the need to use other types of encoding.
Good Luck!

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

SQL db from ASCII to Unicode - unicode

Well if you have accented characters in your database, it's definitely not ASCII. Find out first what codepage you/they were using in the old DB, and convert that to UTF-8, and save to a new database.

Related

Flutter Unicode Apostrophe In String

Can't unescape escaped string with ABAP

Encoding URLs containing unicode characters

Storing Special Characters in Windows Azure Blob Metadata

What is the character encoding?

Categories

Resources