Decode a string with both Unicode and Utf-8 codes in Python 2.x - unicode

Say we have a string:
s = '\xe5\xaf\x92\xe5\x81\x87\\u2014\\u2014\xe5\x8e\xa6\xe9\x97\xa8'
Somehow two symbols, '—', whose Unicode is \u2014 was not correctly encoded as '\xe2\x80\x94' in UTF-8. Is there an easy way to decode this string? It should be decoded as 寒假——厦门
Manually using the replace function is OK:
t = u'\u2014'
s.replace('\u2014', t.encode('utf-8')
print s
However, it is not automatic. If we extract the Unicode,
index = s.find('\u')
t = s[index : index+6]
then t = '\\u2014'. How to convert it to UTF-8 code?

You're missing extra slashes in your replace()
It should be:
s.replace("\\u2014", u'\u2014'.encode("utf-8") )
Check my warning in the comments of the question. You should not end up in this situation.

Related

How to convert utf8 to text in dart

I want to convert special character utf8 value to it's text.
For example, if input is %20, the output will be whitespace
if input is %23, the output will be #
void main() {
var raw = 'Hello%20Bebop%23yahoo';
var parsed = Uri.decodeComponent(raw);
print(parsed);
}
Result:
Hello Bebop#yahoo
Looks like you converting to an ASCCI like encoding. What function you are using for that result?
Try finding the encoding table to your %20 and %23 output, so you will see where you heading at the moment.

Replace emdash with double dash

I want to replace ― back into --
I tried with the utf8 encodings but that doesn't work
string = "blablabla -- blablabla ―"
I want to replace the long dash (if there is one) with double hyphens. I tried it the simple way but that didn't work:
string= string.replace ("―", "--")
I also tried to encode it with utf8 and use the codes of the special characters
stringutf8= string.encode("utf-8")
emdash= u"\u2014"
hyphen= u"\u002D"
if emdash in stringutf8:
stringutf8.replace(emdash, 2*hyphen)
Any suggestions?
I am working with text files in which sometimes apparently the two hyphens are replaced automatically with a long dash...
thanks a lot!
You are dealing with strings here. Strings are lists of characters. Replace the character, leave the encoding out of the equation.
string = 'blablabla -- blablabla \u2014'
emdash = '\u2014'
hyphen = '\u002D'
string2 = string.replace(emdash, 2*hyphen)

How to convert string in UTF-8 to ASCII ignoring errors and removing non ASCII characters

I am new to Scala.
Please advise how to convert strings in UTF-8 to ASCII ignoring errors and removing non ASCII characters in output.
For example, how to remove non ASCII character \uc382 from result string: "hello���", so that "hello" is printed in output.
scala.io.Source.fromBytes("hello\uc382".getBytes ("UTF-8"), "US-ASCII").mkString
val str = "hello\uc382"
str.filter(_ <= 0x7f) // keep only valid ASCII characters
If you had text in UTF-8 as bytes that is now in a String then it was converted.
If you have text in a String and you want it in ASCII as bytes, you can convert it later.
It seems that you just want to filter for only the UTF-16 code units for the C0 Controls and Basic Latin codepoints. Fortunately, such codepoints take only one code unit so we can filter them directly without converting them to codepoints.
"hello\uC382"
.filter(Character.UnicodeBlock.of(_) == Character.UnicodeBlock.BASIC_LATIN)
.getBytes(StandardCharsets.US_ASCII)
.foreach {
println }
With the question generalized to an arbitrary, known character encoding, filtering doesn't do the job. Instead, the feature of the encoder to ignore characters that are not present in the target Charset can be used. An Encoder requires a bit more wrapping and unwrapping. (The API design is based on streaming and reusing the buffer within the same stream and even other streams.) So, with ISO_8859_1 as an example:
val encoder = StandardCharsets.ISO_8859_1
.newEncoder()
.onMalformedInput(CodingErrorAction.IGNORE)
.onUnmappableCharacter(CodingErrorAction.IGNORE)
val string = "ñhello\uc382"
println(string)
val chars = CharBuffer.allocate(string.length())
.put(string)
chars.rewind()
val buffer = encoder.encode(chars)
val bytes = Array.ofDim[Byte](buffer.remaining())
buffer.get(bytes)
println(bytes)
bytes
.foreach {
println }

ColdFusion Hash

I'm trying to create a password digest with this formula to get the following variables and my code is just not matching. Not sure what I'm doing wrong, but I'll admit when I need help. Hopefully someone is out there who can help.
Formula from documentation: Base64(SHA1(NONCE + TIMESTAMP + SHA1(PASSWORD)))
Correct Password Digest Answer: +LzcaRc+ndGAcZIXmq/N7xGes+k=
ColdFusion Code:
<cfSet PW = "AMADEUS">
<cfSet TS = "2015-09-30T14:12:15Z">
<cfSet NONCE = "secretnonce10111">
<cfDump var="#ToBase64(Hash(NONCE & TS & Hash(PW,'SHA-1'),'SHA-1'))#">
My code outputs:
Njk0MEY3MDc0NUYyOEE1MDMwRURGRkNGNTVGOTcyMUI4OUMxM0U0Qg==
I'm clearly doing something wrong, but for the life of me cannot figure out what. Anyone? Bueller?
The fun thing about hashing is that even if you start with the right string(s), the result can still be completely wrong, if those strings are combined/encoded/decoded incorrectly.
The biggest gotcha is that most of these functions actually work with the binary representation of the input strings. So how those strings are decoded makes a big difference. Notice the same string produces totally different binary when decoded as UTF-8 versus Hex? That means the results of Hash, ToBase64, etcetera will be totally different as well.
// Result: UTF-8: 65-65-68-69
writeOutput("<br>UTF-8: "& arrayToList(charsetDecode("AADE", "UTF-8"), "-"));
// Result: HEX: -86--34
writeOutput("<br>HEX: "& arrayToList(binaryDecode("AADE", "HEX"), "-"));
Possible Solution:
The problem with the current code is that ToBase64 assumes the input string is encoded as UTF-8. Whereas Hash() actually returns a hexadecimal string. So ToBase64() decodes it incorrectly. Instead, use binaryDecode and binaryEncode to convert the hash from hex to base64:
resultAsHex = Hash( NONCE & TS & Hash(PW,"SHA-1"), "SHA-1");
resultAsBase64 = binaryEncode(binaryDecode(resultAsHex, "HEX"), "base64");
writeDump(resultAsBase64);
More Robust Solution:
Having said that, be very careful with string concatenation and hashing. As it does not always yield the expected results. Without knowing more about this specific API, I cannot be completely certain what it expects. However, it is usually safer to only work with the binary values. Unfortunately, CF's ArrayAppend() function lacks support for binary arrays, but you can easily use Apache's ArrayUtils class, which is bundled with CF.
ArrayUtils = createObject("java", "org.apache.commons.lang.ArrayUtils");
// Combine binary of NONCE + TS
nonceBytes = charsetDecode(NONCE, "UTF-8");
timeBytes = charsetDecode(TS, "UTF-8");
combinedBytes = ArrayUtils.addAll(nonceBytes, timeBytes);
// Combine with binary of SECRET
secretBytes = binaryDecode( Hash(PW,"SHA-1"), "HEX");
combinedBytes = ArrayUtils.addAll(combinedBytes, secretBytes);
// Finally, HASH the binary and convert to base64
resultAsHex = hash(combinedBytes, "SHA-1");
resultAsBase64 = binaryEncode(binaryDecode(resultAsHex, "hex"), "base64");
writeDump(resultAsBase64);

How to encode the Numeric code in iPhone

I am having some numerical code and i want to encode the "Numerical Code". So how can i encode the string?. I have tried with NSASCIIStringEncoding and NSUTF8StringEncoding, but it doesn't encoded the string. So please help me out.
Eg :
İ -> İ
ı -> ı
Thanks!
What you have are Unicode code points, not strings. You don't need to specify a string encoding, because what you are dealing with aren't strings at all; they're just single characters. And an NSString does not have an "encoding" in this sense.
To get those characters into a string, you need to use:
[NSString stringWithCharacters: length];
For example: you don't want to be creating a string with the contents "304"; that's just a string of numbers. Instead, create a unichar with the value of 304:
unichar iWithDot = 304;
"Unichar" is just an unsigned short, so no pointer and no quotes; you are just assigning the code point to a numerical value. Bundle all of the characters you need into a C string and pass the pointer to stringWithCharacters.