Scala Random.nextString(int) returning question marks - scala

Whenever I use Random.nextString(int), I get a String of question marks (??????). I've tried creating an instance of Random and using a seed, but nothing works. I am using Scala 2.10.5. Does anyone know what the issue is?

In most terminals, when a character is not displayable (there are a great many characters, and you cannot hope to have them all in the font used by your terminal), a question mark is printed instead.
Because the string is random, the vast majority of its characters are very likely to be non-displayable (and thus rendered as a sequence of question marks).
So the strings are valid, they are indeed random (not just a series of question marks), and this is all just a rendering issue. You can easily check that their content really is different each time by displaying the character codes (something like println(myString.map(_.toInt)) will do).
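To see the same effect without Scala at hand, here is a minimal Python sketch. The helper names random_string and as_limited_terminal are made up for illustration, and the ASCII-only "terminal" is an assumption standing in for whatever font your console really uses.

```python
import random


def random_string(length, rng):
    """Build a string of random code points below the surrogate range."""
    return "".join(chr(rng.randrange(0x0001, 0xD800)) for _ in range(length))


def as_limited_terminal(text):
    """Simulate a display that can only draw ASCII: anything else becomes '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


rng = random.Random(42)
a = random_string(10, rng)
b = random_string(10, rng)

# What a limited terminal would show: mostly (or only) question marks.
print(as_limited_terminal(a))
print(as_limited_terminal(b))

# The character codes show that the two strings really are different.
print([ord(c) for c in a])
print([ord(c) for c in b])
```

The two rendered lines look alike, while the two lists of character codes do not.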

Related

In Python (or any language) what does an "upper" function do to Hindi, Amharic and other non-Latin character sets?

Subject says it all. Been looking for an answer, but cannot seem to find it.
I am writing a web app that will store data in a database and also have language files translated into a wide variety of character sets. At various moments, the text will be presented. I want to control presentation such as spurious blank spaces at the beginning and end of strings. Also I want to ensure some letters are upper or lower case.
My question is: what happens in upper/lower case functions when the character set only has one case?
EDIT Sub question: Are there any unexpected side effects to be aware of?
My guess is that you simply get back the one and only character.
EDIT - Added Description
The main reason for asking this question is that I am writing a webapp that will be distributed and run on machines in remote areas with little or no chance to fix "on-the-spot" bugs. It's not a complicated webapp, but will run with many different language char sets. I want to be certain of my footing before releasing the server.
First of all, the upper() and lower() methods in Python can be applied to Hindi, Amharic and other non-Latin character sets.
For instance, upper() converts a lowercase character only if an uppercase equivalent of that character exists. If not, the character is left as it is.
Or, better said: if there is nothing to convert, the string stays the same.
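A quick check in Python 3 illustrates this. The sample words are my own, not from the question. The German word is included because it shows one side effect worth knowing about: upper() can change the length of a string.

```python
samples = {
    "Hindi": "हिन्दी",
    "Amharic": "አማርኛ",
    "English": "hello",
    "German": "straße",
}

for name, text in samples.items():
    upper = text.upper()
    # Scripts without letter case come back unchanged.
    print(name, "unchanged:", upper == text, "length:", len(text), "->", len(upper))

# strip() removes leading and trailing whitespace regardless of script.
padded = "  हिन्दी \n"
trimmed = padded.strip()
print(repr(trimmed))
```

Here "straße" becomes "STRASSE", one character longer, so it is safer not to assume that case conversion preserves length (for example when a database column has a fixed width).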

CSV in bad Encoding

We uploaded a file with a bad encoding, and now when we download it again all the "strange" French characters are mixed up.
Example of the bad text:
R�union
When opening the CSV with OpenOffice we tried all of the encodings in the dropdown, but none of them seem to work.
Does anyone have a way to fix the encoding so that we can view the characters?
Links to file https://drive.google.com/file/d/0BwgeuQK3LAFRWkJuNHd2TlF2WjQ/view?usp=sharing
Kr.
Sadly there is no way to automatically fix the linked file. Consider the two words afectación and sécurité. In the file they have been converted incorrectly to afectaci?n and s?curit?. There is no way to convert the question marks back because sometimes they're ó and other times é.
(Actually, instead of question marks the file uses the Unicode replacement character, U+FFFD, but that doesn't change the problem.)
Hopefully you have an earlier version of the file that has not been converted incorrectly.
Next time try to use a consistent encoding. This question gives some suggestions for how to do this.
If the original data cannot be obtained, there is one thing that could be done outside of retyping the whole thing. It is possible to use dictionary lookups to guess the missing words. However this would be a difficult project, and there would be mistakes where incorrect guesses were made. It's probably not worth it.
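The loss of information can be reproduced in a few lines of Python. This is only a sketch: I am assuming the damage came from Latin-1 bytes being decoded as UTF-8 with replacement, which may not be what actually happened to the linked file, and the function name mangle is made up.

```python
def mangle(text):
    """Assumed failure mode: Latin-1 bytes decoded as UTF-8 with replacement."""
    return text.encode("latin-1").decode("utf-8", errors="replace")


for word in ["afectación", "sécurité", "Réunion"]:
    damaged = mangle(word)
    print(word, "->", damaged, [hex(ord(c)) for c in damaged if ord(c) > 127])
```

Both ó and é end up as the same U+FFFD, so nothing in the damaged text says which letter was there originally.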

dart, total available string characters?

I'm not familiar with character sets, or with whether languages pick them up from their environments or have them baked into the language itself. I wanted to make a simple number system in Dart with the largest possible base. Just as hex has 0-9a-f, I would use every single character in some specified ascending order, with lower case and upper case having different values, to get the largest possible base for my number system. I want to do this so I can send numbers as strings with as few characters as possible. So my question is: does Dart have a standard baked-in character set that I can be certain will exist in every environment it runs in?
You should be able to use every value even if no concrete character is assigned to a code.
This would only be a problem when you try to display the character.
Some codes are control characters with special meaning (like 0x0000), which you should avoid.
More info here: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.
If you want to transport the result over the internet using text protocols you may be limited to ASCII. In this case I suggest Base64 encoding.
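As a rough illustration of the Base64 suggestion, here is a sketch in Python rather than Dart. The function names are hypothetical. It handles non-negative integers only, and it uses the URL-safe Base64 alphabet with the padding stripped, which is my own choice to keep the strings short.

```python
import base64


def encode_number(n):
    """Encode a non-negative integer as a short ASCII-only string."""
    raw = n.to_bytes(max(1, (n.bit_length() + 7) // 8), "big")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def decode_number(s):
    """Inverse of encode_number: restore the padding, then decode."""
    padded = s + "=" * (-len(s) % 4)
    return int.from_bytes(base64.urlsafe_b64decode(padded), "big")


for n in [0, 255, 65536, 2**64 - 1]:
    text = encode_number(n)
    print(n, "->", text, "->", decode_number(text))
```

Each output character carries 6 bits, compared with 4 bits per hex digit, and every character is plain ASCII.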

Getting first symbol from a glyph

Related (in fact, perhaps a duplicate of): how to extract characters from a Korean string in VBA
The linked question doesn't give me satisfactory answers and it's 2 years old so I'm making a new question.
I want to find the first symbol in a Korean glyph, i.e. "한" -> "ㅎ" or "가" -> "ㄱ". I also want to recognize inputs that are already single symbols, such as "ㄱ".
I'm working with NSString, which I believe uses UTF-8. Do I have to convert the string to EUC-KR, then start reading bytes, or what?
As a disclaimer, I have no experience in working with iPhone or NSString, except for what I've read in the documentation in order to answer this question. I'm addressing the question mainly as a Unicode problem.
In order to find the first symbol (jamo) from a Korean glyph, you have to perform a decomposition as described in my answer to how to extract characters from a Korean string in VBA (it's a new answer so you didn't see it when you posted your question). To apply my answer (which is derived directly from the Unicode standard), you have to work with the Unicode code points (numerical values) of the Korean syllables. It looks like calling the method dataUsingEncoding passing NSUnicodeStringEncoding as a parameter should do the trick.
In order to identify single symbols, you have to check whether the Unicode code point of the character you are checking is in any of the following ranges:
1100-11FF (Hangul Jamo). I think this should cover most of the real life cases.
A960-A97F (Hangul Jamo Extended-A)
D7B0-D7FF (Hangul Jamo Extended-B)
3130-318F (Hangul Compatibility Jamo)
FFA0-FFDC (Halfwidth Jamo)
Check the Unicode Code Charts for a complete reference.
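The decomposition and the range check can be sketched as follows, in Python rather than Objective-C. The constants follow the Hangul syllable arithmetic in the Unicode standard. The INITIALS table, which maps the leading-consonant index to a compatibility jamo so that the result matches the examples in the question, and the function name first_jamo are my own additions.

```python
S_BASE = 0xAC00   # first precomposed Hangul syllable
S_COUNT = 11172   # number of precomposed syllables
N_COUNT = 588     # 21 vowels * 28 trailing consonants per leading consonant

# The 19 leading consonants, written as compatibility jamo (assumed mapping).
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"

# Ranges listed in the answer above for characters that are already jamo.
JAMO_RANGES = [
    (0x1100, 0x11FF),
    (0xA960, 0xA97F),
    (0xD7B0, 0xD7FF),
    (0x3130, 0x318F),
    (0xFFA0, 0xFFDC),
]


def first_jamo(ch):
    """Return the leading consonant of a syllable, the jamo itself, or None."""
    code = ord(ch)
    if any(lo <= code <= hi for lo, hi in JAMO_RANGES):
        return ch  # already a single symbol
    index = code - S_BASE
    if 0 <= index < S_COUNT:
        return INITIALS[index // N_COUNT]
    return None  # not Hangul


for ch in ["한", "가", "ㄱ", "A"]:
    print(ch, "->", first_jamo(ch))
```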

JMeter CSV Data Set is corrupting Japanese strings stored as proper UTF-8, I get Question Marks instead

I read in search terms from a simple text file to send to a search engine.
It works fine in English, but gives me ???? for any Japanese text.
Text with mixed English and Japanese does show the English text, so I know it's reading it.
What I'm seeing:
Input text:
Snow Leopard をインストールする場合、新しい
Turns into:
Snow Leopard ???????????????
This is in the POST field of my HTTP request.
If I set JMeter to encode the data, it just puts in the percent sequence for question marks.
About the Data:
The CSV file is very simple in structure.
There's only one field / one column, which I name TERM, and later use as ${TERM}
I don't really need full CSV because it's only one string per line.
There are no commas or quotes.
It's UTF-8 and when I run the Unix "file" command on the file, it says UTF-8 text.
I've also verified UTF-8 in command line and graphical mode on two machines.
Interesting note:
An interesting coincidence that I noticed: if there are 15 Japanese characters then I get 15 question marks, so at some point it's being seen as full characters and not just bytes.
JMeter CSV Dataset Config:
Filename: japanese-searches.csv
File encoding: UTF-8 (also tried without)
Variable names: TERM
Delimiter: ,
Allow Quoted Data: False (I also tried True, different, but still wrong)
Recycle at EOF: True
Stop at EOF: False
Sharing mode: All threads
A few things I've tried:
- Tried Allow quoted Data. It changed to other strange characters.
- Added -Dfile.encoding=UTF-8
- Tried encoding the POST stage, but it just turned into a bunch of %nn for question marks
And I'm not sure how to debug just after each line of the CSV is read in. I think it's corrupted right away, but I'm not sure.
If it's only mangled when I reference it, then instead of ${TERM} perhaps there's some other "to bytes" function call. I'll start checking into that. I haven't done anything with the JMeter functions yet.
Edited Dec 24:
Tweaks:
Changed formatting and added bullet points for more clarity.
Clarified that the file is UTF-8, and have verified that.
A new theory:
Is it possible that the Japanese characters are making it through, and the issue is that EVERY SINGLE place that shows them maps them to a "?" at DISPLAY TIME only? So even though I've checked in a bunch of places, they all have a display issue just in the UI?
Is there a way in JMeter to see the numeric value of a character or string? Actually, to tell JMeter to display the list of Unicode code points?
I'll look at my last log files... although I suppose even the server logs could have mis-mapped the characters.
Also, perhaps when doing variable expansion inside of the text field that I POST, where I reference the ${TERM}, maybe at that point it also maps to question marks, but that the corruption happens at that later point. If that happened, AND it was mis-displayed in the UI, then it might lead to a false conclusion.
What I'd really like to do is pause JMeter after the first CSV record, just after that line is loaded, and look at it with a "data scope" or byte editor or something. Not sure if this is possible.
Found the issue, there was another place the UTF-8 had to be specified.
In the HTTP Request, to the right of the Method, you have to also set Content Encoding to UTF-8
Yes, in hindsight, this seems obvious, but there were a number of reasons I didn't think this was needed. Some of my incorrect assumptions might be helpful for others who are debugging, so here goes - I would have thought that:
1: Once text has made it into Java as Unicode, it stays as Unicode, and goes in and out by UTF-8. Obviously not in this case.
2: I sort of thought HTTP defaulted to UTF-8 unless you say otherwise, but maybe I'm just used to XML. It's probably not good practice to assume that; maybe HTTP defaults to ISO-Latin1 or something, or even if there's a spec, maybe folks don't follow it.
3: And if I don't specify it, I'd think the "do no harm" approach would be to pass the characters on, and let the receiver on the other end deal with it. Wrong again!
(OK, so points 1, 2 and 3 overlap a bit)
4: Even though my HTTP Request is a POST, I did still try the Encode checkbox. I certainly thought that would have encoded it, but all I got was the repeating % hex for question marks, so it seemed to me that the data was already corrupted at that point. Wrong again. I suspect that WITHIN the HTTP phase there are TWO character transitions: first from Unicode to whatever encoding it thinks you have, and THEN a second encoding into the % signs, and my data was mis-encoded at the first step.
5: And I would have thought JMeter would say something or warn, but from my reading, apparently it's not helpful in that respect. You can do logging or whatever.
And the "?" is Java's way of reporting a problem by default; this started in the Java 1.4x timeframe. In my Java code I prefer to set encoding errors to report as an exception, but again, that's not the default, and not what JMeter does.
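The two-step theory from point 4 is easy to reproduce outside JMeter. This Python sketch only imitates the behaviour; it is not JMeter's code, and using ISO-8859-1 as the wrong charset is my assumption.

```python
from urllib.parse import quote

term = "を"

# Step 1 done with the right charset, step 2 percent-encodes the UTF-8 bytes.
good = quote(term, encoding="utf-8")

# Step 1 done with a charset that cannot represent the character: it is
# replaced by "?" first, and only then is the "?" percent-encoded.
bad = quote(term, encoding="iso-8859-1", errors="replace")

print(good)
print(bad)
```

With the wrong charset the output is %3F, the percent sequence for a question mark, which is what showed up in the POST data.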
So I learned my lesson.
The HINT that the Unicode was at least starting out OK was that the number of question marks equaled the number of Japanese characters, instead of having 2 or 3 times as many question marks. If the length of "???" matches your Japanese (or Chinese) string, then Java DID see actual Unicode characters at some point along the journey. Whereas if you see 3 times as many ?'s as input text, then Java always saw them as bytes or ints or whatever, and NEVER as valid codepoints.
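The counting hint can be checked with a short Python sketch. Python marks undecodable bytes with U+FFFD rather than "?", but the counts behave the same way. The sample word and the ISO-8859-1 target are my own choices.

```python
text = "インストール"  # 6 characters, 18 bytes in UTF-8

# Case 1: real characters were seen, then encoded to a charset that lacks
# them. One replacement per CHARACTER.
per_char = text.encode("iso-8859-1", errors="replace").decode("iso-8859-1")

# Case 2: the UTF-8 bytes were never decoded properly. One replacement per BYTE.
per_byte = text.encode("utf-8").decode("ascii", errors="replace")

print(len(text), len(per_char), len(per_byte))
print(per_char)
```

The lengths are 6, 6 and 18: if the number of marks equals the number of characters, the text was valid Unicode at some point before it was damaged.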
Came across this topic when searching for a solution for using parameters from a CSV file that contained some columns written in Hebrew.
I used Excel 2007 to create 1000 lines of data for user registrations. The first and last names had to be in Hebrew.
I exported the file to "Unicode text" file. It became tab delimited.
"Unicode Text" saves in UTF-16 LE (Little Endian), not in UTF-8. That is important.
I opened the result in Notepad++. I could see the Hebrew letters properly. Notepad++ has an "Encoding" menu item, where you can check the encoding or change it. So I changed the Little Endian to UTF-8.
Then I replaced tabs with commas (just selected the tab and pasted it into the Find box).
The parameters were substituted ok, but after running the script I saw the following:
In the "View Results Tree" listener I opened the "Result" tab of the "Http Request".
The parameters were substituted, but the HTTP view tab (on the bottom) of the Request showed me some gibberish.
But when I looked at the Raw view, I saw that the request parameters actually contained strings like %D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94 that, when taken in pairs (%D7 %A9), corresponded properly to Hebrew letters.
To my mind, JMeter has a bug and cannot properly display the Unicode chars. But it sends (POSTs) them out ok.
Hope I am right and hope it will help someone.
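The percent-encoded string from the Raw view can be checked independently of JMeter. A small Python sketch, assuming the bytes are UTF-8:

```python
from urllib.parse import unquote

raw = "%D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94"

# Undo the percent-encoding and interpret the resulting bytes as UTF-8.
decoded = unquote(raw, encoding="utf-8")

print(decoded)
print([f"U+{ord(c):04X}" for c in decoded])
```

Each pair of bytes decodes to one code point in the Hebrew block (U+0590 to U+05FF), which supports the conclusion that the request itself was sent correctly.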
You can try to use "SHIFT-JIS" in Content encoding (it's next to the Method selection). Then you should uncheck "Encode?" for any parameter that includes Japanese.
Hope it works for you.