I want to convert a special character's UTF-8 value to its text.
For example, if the input is %20, the output will be a space;
if the input is %23, the output will be #.
void main() {
  var raw = 'Hello%20Bebop%23yahoo';
  var parsed = Uri.decodeComponent(raw);
  print(parsed);
}
Result:
Hello Bebop#yahoo
It looks like you are converting to an ASCII-like encoding. What function are you using to get that result?
Try looking up %20 and %23 in the corresponding encoding table, so you can see where you are heading at the moment.
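For comparison, here is a minimal Python sketch of the same percent-decoding, assuming the input is a URL-encoded string like the one in the question. It uses the standard library's urllib.parse.unquote, which decodes %XX escapes as UTF-8 by default; the variable names are only illustrative.

```python
from urllib.parse import unquote

raw = "Hello%20Bebop%23yahoo"

# %20 is the space character (0x20) and %23 is '#' (0x23)
decoded = unquote(raw)
print(decoded)
```

Multi-byte sequences work the same way, so "%C3%A9" decodes to "é".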
Related
I have the following string:
{\"Id\":\"135\",\"Type\":0}
The number in the Id field will vary, but will always be an integer with no comma separator. I'm not sure how to get just that value from the string, given that it's a string data type and not real "XML". I was toying with the replace() function, but the special characters are making it more complex than it seems it needs to be.
Is there a way to convert that to XML, or to something that lets me reference the Id value directly?
Maybe use a regular expression, e.g.
import re
txt = "{\"Id\":\"135\",\"Type\":0}"
x = re.search(r'"Id":"([0-9]+)"', txt)
if x:
    print(x.group(1))
gives
135
It is assumed here that the ids are numeric and consist of at least one digit.
A non-regex answer, as you asked:
\" is an escape sequence in Python.
So if {\"Id\":\"135\",\"Type\":0} is the literal text in your source code and you assign it to a Python variable like
a = '{\"Id\":\"135\",\"Type\":0}'
gives
>>> a
'{"Id":"135","Type":0}'
OR
If the string already contains literal backslashes (that is, the \" was already escaped), then do a = a.replace("\\","") to get the string without the \.
Now just load this string into a dict and access the Id element as below.
import json
d = json.loads(a)
print(d['Id'])
Output :
135
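Putting both cases together, a minimal sketch could look like the following. The helper name extract_id is hypothetical, and it assumes the only backslashes in the input are the ones escaping the double quotes.

```python
import json


def extract_id(text: str) -> int:
    """Return the Id field as an int (hypothetical helper for illustration)."""
    # Turn literal \" sequences into plain quotes; a no-op if there are none
    cleaned = text.replace('\\"', '"')
    return int(json.loads(cleaned)["Id"])


escaped = r'{\"Id\":\"135\",\"Type\":0}'  # backslashes are really in the data
plain = '{"Id":"135","Type":0}'           # already valid JSON

print(extract_id(escaped))
print(extract_id(plain))
```

Both calls give the integer 135, so the rest of the code does not need to care which form the input arrived in.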
I want to replace ― back into --
I tried with the utf8 encodings but that doesn't work
string = "blablabla -- blablabla ―"
I want to replace the long dash (if there is one) with double hyphens. I tried it the simple way but that didn't work:
string= string.replace ("―", "--")
I also tried to encode it with utf8 and use the codes of the special characters
stringutf8= string.encode("utf-8")
emdash= u"\u2014"
hyphen= u"\u002D"
if emdash in stringutf8:
stringutf8.replace(emdash, 2*hyphen)
Any suggestions?
I am working with text files in which sometimes apparently the two hyphens are replaced automatically with a long dash...
thanks a lot!
You are dealing with strings here. Strings are sequences of characters. Replace the character and leave the encoding out of the equation.
string = 'blablabla -- blablabla \u2014'
emdash = '\u2014'
hyphen = '\u002D'
string2 = string.replace(emdash, 2*hyphen)
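If the files may contain more than one kind of long dash, a small translation table handles them in one pass. This is only a sketch: the names DASHES and normalize_dashes are made up for illustration, and including the horizontal bar (U+2015) alongside the em dash (U+2014) is an assumption about what the text files contain.

```python
# Map each long-dash character to a double hyphen
DASHES = {
    "\u2014": "--",  # em dash
    "\u2015": "--",  # horizontal bar (assumed to occur as well)
}


def normalize_dashes(text: str) -> str:
    # str.translate returns a new string; the original is left unchanged
    return text.translate(str.maketrans(DASHES))


string = "blablabla -- blablabla \u2014"
print(normalize_dashes(string))
```

Strings that contain no long dash come back unchanged, so there is no need to test for the character first.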
I am new to Scala.
Please advise how to convert strings in UTF-8 to ASCII, ignoring errors and removing non-ASCII characters from the output.
For example, how can I remove the non-ASCII character \uc382 from the result string "hello���", so that "hello" is printed in the output?
scala.io.Source.fromBytes("hello\uc382".getBytes ("UTF-8"), "US-ASCII").mkString
val str = "hello\uc382"
str.filter(_ <= 0x7f) // keep only valid ASCII characters
If you had UTF-8 text as bytes and it is now in a String, then it has already been converted.
If you have text in a String and you want it as ASCII bytes, you can convert it later.
It seems that you just want to keep only the UTF-16 code units for the C0 Controls and Basic Latin code points. Fortunately, such code points take only one code unit, so we can filter them directly without converting them to code points.
import java.nio.charset.StandardCharsets

"hello\uC382"
  .filter(Character.UnicodeBlock.of(_) == Character.UnicodeBlock.BASIC_LATIN)
  .getBytes(StandardCharsets.US_ASCII)
  .foreach(println)
With the question generalized to an arbitrary, known character encoding, filtering doesn't do the job. Instead, the feature of the encoder to ignore characters that are not present in the target Charset can be used. An Encoder requires a bit more wrapping and unwrapping. (The API design is based on streaming and reusing the buffer within the same stream and even other streams.) So, with ISO_8859_1 as an example:
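import java.nio.CharBuffer
import java.nio.charset.{CodingErrorAction, StandardCharsets}

// Encoder that silently drops anything it cannot represent in ISO-8859-1
val encoder = StandardCharsets.ISO_8859_1
  .newEncoder()
  .onMalformedInput(CodingErrorAction.IGNORE)
  .onUnmappableCharacter(CodingErrorAction.IGNORE)
val string = "ñhello\uc382"
println(string)
val chars = CharBuffer.allocate(string.length())
  .put(string)
chars.rewind()
val buffer = encoder.encode(chars)
// Copy the encoded bytes out of the ByteBuffer
val bytes = Array.ofDim[Byte](buffer.remaining())
buffer.get(bytes)
println(bytes)
bytes
  .foreach(println)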
Say we have a string:
s = '\xe5\xaf\x92\xe5\x81\x87\\u2014\\u2014\xe5\x8e\xa6\xe9\x97\xa8'
Somehow the two '—' symbols, whose Unicode code point is \u2014, were not correctly encoded as '\xe2\x80\x94' in UTF-8. Is there an easy way to decode this string? It should be decoded as 寒假——厦门
Manually using the replace function is OK:
t = u'\u2014'
s = s.replace('\u2014', t.encode('utf-8'))
print s
However, it is not automatic. If we extract the Unicode,
index = s.find('\u')
t = s[index : index+6]
then t = '\\u2014'. How can I convert it to its UTF-8 bytes?
You're missing extra slashes in your replace()
It should be:
s.replace("\\u2014", u'\u2014'.encode("utf-8") )
Check my warning in the comments of the question. You should not end up in this situation.
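To do this automatically for any escape rather than just \u2014, a regular expression can find each literal \uXXXX sequence and swap in the UTF-8 bytes of that code point. Below is a minimal Python 3 sketch; the helper name decode_mixed is hypothetical, and it assumes the data is a bytes object, every escape has exactly four hex digits, and no escape is half of a surrogate pair.

```python
import re

# A literal backslash, 'u', then exactly four hex digits
ESCAPE = re.compile(rb'\\u([0-9a-fA-F]{4})')


def decode_mixed(raw: bytes) -> str:
    """Decode UTF-8 bytes that also contain literal \\uXXXX escapes."""
    # Replace each escape with the UTF-8 encoding of its code point
    fixed = ESCAPE.sub(
        lambda m: chr(int(m.group(1), 16)).encode('utf-8'),
        raw,
    )
    return fixed.decode('utf-8')


s = b'\xe5\xaf\x92\xe5\x81\x87\\u2014\\u2014\xe5\x8e\xa6\xe9\x97\xa8'
print(decode_mixed(s))
```

For the string in the question this prints 寒假——厦门. It still only repairs the symptom, so fixing whatever produced the mixed data remains the better option.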
I'm parsing long strings in Matlab, and whenever I use str2num with an int it doesn't work: it outputs a weird Chinese or Greek symbol instead.
satrec.satnum = str2num(longstr1(3:7));
I checked by outputting it as a string, and that works properly, but I won't be able to use it in my calculations later on if I don't manage to change it to an int. Characters 3 to 7 of my string are ints (e.g. 8188). As it appears to work if my strings are doubles, I tried this:
satrec.satnum = longstr1(3:7);
satrec.satnum = strcat(satrec.satnum,'.0');
satrec.satnum = str2num(satrec.satnum);
fprintf('satellite number : %s\n',satrec.satnum);
But it outputs the same weird symbol. Does anyone know what I can do?
This looks like NORAD 2-line element data. In that case the file encoding is US-ASCII or effectively UTF-8 since no non-ASCII characters should be present.
Your problem appears to be in this line:
fprintf('satellite number : %s\n',satrec.satnum);
satrec.satnum is a number, but you are printing it with the %s conversion in the format string, so Matlab is interpreting it as a character code. Replace this with
fprintf('satellite number : %d\n',satrec.satnum);
and you get the correct result.
Edited to add
Matlab has in fact converted the string to an int correctly!
I tried running the code you provided along with your example, and I am unable to reproduce the problem you described:
longstr1='1 28895U 05043F 14195.24580016 .00000503 00000-0 10925-3 0 8188';
satrec.satnum = str2num(longstr1(3:7))
satrec =
satnum: 28895
In any case, I'd suggest using something like textscan or dlmread:
Data = textscan(longstr1,'%u8 %u16 %c %u16 %c %f %f %u16-%u8 %u16-%u8 %u8 %u16', 'delimiter', '')
Data =
Columns 1 through 9
[1] [28895] 'U' [5043] 'F' [1.4195e+04] [5.0300e-06] [0] [0]
Columns 10 through 13
[10925] [3] [0] [8188]
In the above example I guessed some of the data-types so you should update them for your use.
As you can see, this code works on a string. If, however, you provide it with a fileID, it will read all the lines in the file using this template (see the documentation for textscan).
On a side note: I noticed that char(28895) outputs a Chinese character.