Can't convert unicode symbols to cyrillic - scala

I have a bunch of documents persisted in Apache Lucene with some names in Russian, and when I try to print them out they look like this: "\u0410\u0441\u043f\u0430\u0440", rather than Cyrillic characters. The project is in Scala. I've tried to fix this with the Apache Commons unescapeJava method, but it didn't help. Are there any other options?
Updated:
The project is written with the Spray framework and returns JSON like this:
{
"id" : 0,
"name" : "\u0410\u0441\u043f\u0430\u0440"
}

I'm going to try to infer exactly what you are doing.
You are using Spray, so I gather that you are using its json library, "spray-json".
So I suppose that you have some instance of spray.json.JsObject, and that what you posted in your question is what you get as the output when printing this instance.
Your json object is correct: the value of the name field has no embedded escaping. It is actually the conversion to string that escapes some unicode characters.
See the definition of printString here:
https://github.com/spray/spray-json/blob/master/src/main/scala/spray/json/JsonPrinter.scala
I will also assume that when you tried to use unescapeJava, you applied it to the value of the name field, creating a new spray.json.JsObject instance that you then printed as before. Given that your json object does not actually have any escaping, this did absolutely nothing, and then when printing it the printer does the escaping as before, and you're back to square one.
As a side note, it's worth mentioning that the json spec does not mandate how characters are encoded: they can either be stored as their literal value, or as a unicode escape. For example, the string "abc" could be written as just "abc", or as "\u0061\u0062\u0063". Either form is correct. It just happens that the author of spray-json decided to use the latter form for all non-ascii characters.
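To make the "escape every non-ascii character" behaviour concrete, here is a minimal sketch in plain Scala. The object NonAsciiEscaping and the helper escapeNonAscii are hypothetical names made up for illustration; this is not spray-json's actual implementation, only an approximation of the output form described above.

```scala
object NonAsciiEscaping {
  // Hypothetical helper, not spray-json's code: writes every non-ASCII
  // UTF-16 code unit as a \uXXXX escape and leaves ASCII characters untouched.
  def escapeNonAscii(s: String): String =
    s.map { c =>
      if (c < 0x80) c.toString
      else "\\u" + "%04x".format(c.toInt)
    }.mkString
}

// NonAsciiEscaping.escapeNonAscii("abc")   leaves the input unchanged
// NonAsciiEscaping.escapeNonAscii("Аспар") produces the escaped form shown in the question
```

Both the escaped and the literal form denote the same string once a json parser reads them back, which is why the output in the question is still a correct document.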
So now you ask, what can I do to work around this? You could ask the spray-json author to add an option that lets you specify that you don't want any unicode escaping.
But I imagine that you want a solution right now.
The simplest thing to do is to just convert your object to a string (via JsValue.toString or JsValue.compactPrint or JsValue.prettyPrint), and then pass the result to unescapeJava. At least this will give you back your original Cyrillic characters.
But this is a bit gross, and actually quite dangerous, as some characters are not safe to unescape inside a string literal. For example, \n will be unescaped to an actual newline, and \u0022 will be unescaped to ". You can easily see how that would break your json document.
But at the very least it will allow you to confirm my theory (remember that I have been making assumptions about what exactly you are doing).
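If you do want to post-process the printed string, a less dangerous variant is to decode only the escapes that denote non-ascii characters and copy every other escape through unchanged. The following is a rough sketch under that assumption; SafeUnescape and nonAsciiOnly are hypothetical names, and the code only looks at UTF-16 code units, so treat it as a stopgap rather than a replacement for a printer that does not escape in the first place.

```scala
import java.util.regex.Matcher.quoteReplacement

object SafeUnescape {
  // Any JSON escape: a backslash followed by uXXXX (captured) or by one character.
  // Matching whole escapes keeps an escaped backslash from being misread.
  private val AnyEscape = """\\(?:u([0-9a-fA-F]{4})|.)""".r

  // Decodes \uXXXX only when it denotes a non-ASCII code unit; every other
  // escape (\n, \", \\, \u0022, ...) is copied through unchanged.
  def nonAsciiOnly(json: String): String =
    AnyEscape.replaceAllIn(json, m => {
      val hex = m.group(1) // null when the escape is not of the \uXXXX kind
      val out =
        if (hex == null) m.matched
        else {
          val c = Integer.parseInt(hex, 16).toChar
          if (c >= 0x80) c.toString else m.matched
        }
      quoteReplacement(out)
    })
}
```

Because \u0022 and \n stay escaped, the string delimiters of the document are not disturbed, which is exactly what the blanket unescapeJava call cannot guarantee.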
Now for a proper fix: you could simply extend JsonPrinter and override its printString method to remove the unicode escaping. Something like this (untested):
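import scala.annotation.tailrec
// Assumption: spray-json's JsonPrinter is written against java.lang.StringBuilder,
// so the same type must be in scope for the override to match.
import java.lang.StringBuilder
import spray.json._
trait NoUnicodeEscJsonPrinter extends JsonPrinter {
override protected def printString(s: String, sb: StringBuilder) {
@tailrec
def printEscaped(s: String, ix: Int) {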
if (ix < s.length) {
s.charAt(ix) match {
case '"' => sb.append("\\\"")
case '\\' => sb.append("\\\\")
case x if 0x20 <= x && x < 0x7F => sb.append(x)
case '\b' => sb.append("\\b")
case '\f' => sb.append("\\f")
case '\n' => sb.append("\\n")
case '\r' => sb.append("\\r")
case '\t' => sb.append("\\t")
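case x if x < 0x20 => sb.append("\\u%04x".format(x.toInt)) // other control characters must stay escaped in json
case x => sb.append(x) // everything else, including non-ascii, is written literally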
}
printEscaped(s, ix + 1)
}
}
sb.append('"')
printEscaped(s, 0)
sb.append('"')
}
}
trait NoUnicodeEscPrettyPrinter extends PrettyPrinter with NoUnicodeEscJsonPrinter
object NoUnicodeEscPrettyPrinter extends NoUnicodeEscPrettyPrinter
trait NoUnicodeEscCompactPrinter extends CompactPrinter with NoUnicodeEscJsonPrinter
object NoUnicodeEscCompactPrinter extends NoUnicodeEscCompactPrinter
Then you can do:
val json: JsValue = ...
val jsonString: String = NoUnicodeEscPrettyPrinter( json )
jsonString will contain your json document in pretty-print format and without any unicode escaping.

This problem appears to be corrected in spray-json 1.3.2: https://github.com/spray/spray-json/issues/46
I ran into a similar problem with Arabic characters using Akka HTTP 1.0, which depends on 1.3.1. By upgrading to 1.3.2, my problem was resolved.

Related

Idiomatic handling of JSON null in scala upickle / ujson

I am new to Scala and would like to learn the idiomatic way to solve common problems, as in pythonic for Python. My question regards reading JSON data with upickle, where the JSON value contains a string when present, and null when not present. I want to use a custom value to replace null. A simple example:
import upickle.default._
val jsonString = """[{"always": "foo", "sometimes": "bar"}, {"always": "baz", "sometimes": null}]"""
val jsonData = ujson.read(jsonString)
for (m <- jsonData.arr) {
println(m("always").str.length) // this will work
println(m("sometimes").str.length) // this will fail, Exception in thread "main" ujson.Value$InvalidData: Expected ujson.Str (data: null)
}
The issue is with the field "sometimes": when null, we cannot apply .str (or any other function mapping to a static type other than null). I am looking for something like m("sometimes").str("DEFAULT").length, where "DEFAULT" is the replacement for null.
Idea 1
Using pattern matching, the following works:
val sometimes = m("sometimes") match {
case s: ujson.Str => s.str
case _ => "DEFAULT"
}
println(sometimes.length)
Given Scala's concise syntax, this looks a bit complicated and will be repetitive when done for a number of values.
Idea 2
Answers to a related question mention creating a case class with default values. For my problem, the creation of a case class seems inflexible to me when different replacement values are needed depending on context.
Idea 3
Answers to another question (not specific to upickle) discuss using Try().getOrElse(), i.e.:
import scala.util.Try
// ...
println(Try(m("sometimes").str).getOrElse("DEFAULT").length)
However, the discussion mentions that throwing an exception for a regular program path is expensive.
What are idiomatic, yet concise ways to solve this?
The idiomatic Scala way to do this is to use Scala's Option.
Fortunately, upickle's Value type offers accessors that return an Option. See the strOpt method in the source code.
The problem in your code is the str calls in m("always").str and m("sometimes").str.
With this code, you are prematurely assuming that all the values are strings. That's where the strOpt method comes in. It returns Some(string) if the value is a string, and None if it is not. We can combine it with getOrElse to decide what to return when the value is None.
The following is a clean way to handle this.
val jsonString = """[{"always": "foo", "sometimes": "bar"}, {"always": "baz", "sometimes": null}]"""
val jsonData = ujson.read(jsonString)
for (m <- jsonData.arr) {
println(m("always").strOpt.getOrElse(""))
println(m("sometimes").strOpt.getOrElse("").length)
}
Output:
3
3
3
0
Here, if we get any value other than a string (null, float, int), the code will treat it as an empty string, and its length will be calculated as 0.
Basically, this is similar to your "Idea 1" approach, but this is the Scala way. Instead of "DEFAULT", I am returning an empty string because you wouldn't want the length of a null value to be 7 (the length of the string "DEFAULT").
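If different replacement values are needed in different places, as the question asks, the same strOpt.getOrElse pattern can be wrapped in a small extension method. This is only a sketch: UjsonDefaults, ValueOps, strOr and sometimesLengths are hypothetical names invented here and are not part of ujson; the only library calls assumed are ujson.read, arr, apply and strOpt, and the ujson dependency must be on the classpath.

```scala
object UjsonDefaults {
  // Hypothetical extension method; not provided by ujson itself.
  implicit class ValueOps(v: ujson.Value) {
    // Returns the string content, or `default` for null and any non-string value.
    def strOr(default: String): String = v.strOpt.getOrElse(default)
  }

  // Example use: length of the "sometimes" field, with a caller-chosen default.
  def sometimesLengths(json: String, default: String): List[Int] =
    ujson.read(json).arr.map(m => m("sometimes").strOr(default).length).toList
}
```

With the sample document from the question and "DEFAULT" as the replacement, sometimesLengths gives 3 for "bar" and 7 for the null entry; passing "" instead reproduces the behaviour of the answer above.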

Scala: Swapping the case of each character in a string

Using Scala, I want to take a substring of an initial string and swap each character's case, so that upper-case letters become lower case and lower-case letters become upper case.
var swapCase = buffer.substring(lwr, upr).to?OTHER?Case
I have used the .toUpperCase and .toLowerCase commands in the past and was wondering if there is a similar command for just swapping case without having to iterate through each character within a loop and evaluating which operation needs to be performed on each character i.e:
if(char(x).isUpperCase){char(x).toLowerCase}
else if(char(x).isLowerCase){char(x).toUpperCase}
In short, is there a really quick way to do this with a "." command instead of writing multiple lines.
This is about as good as you are going to get:
def swapCase(s: String): String =
s.map(ch => if (ch.isLower) ch.toUpper else ch.toLower)
An alternative to Tim's one-liner could be:
def swapCharCase(ch: Char) = if (ch.isLower) ch.toUpper else ch.toLower
def swapCase(s: String): String = s.map(swapCharCase)
I find it a tiny bit more readable - and perhaps swapCharCase may become handy anyway.
To use it as .swapCase as requested, use an implicit class instead to provide the extension method:
implicit class CaseStringOps(s: String) {
def swapCase: String = s.map(swapCharCase)
}
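Putting the pieces from the answers above together, a self-contained version might look like the sketch below. The wrapper object SwapCaseDemo is a name chosen here for illustration; the logic is the same per-character swap, so characters without a case (digits, punctuation, spaces) pass through unchanged.

```scala
object SwapCaseDemo {
  // Swaps the case of a single character; non-letters are returned as they are.
  def swapCharCase(ch: Char): Char =
    if (ch.isLower) ch.toUpper else ch.toLower

  // Extension method so that the call reads as someString.swapCase.
  implicit class CaseStringOps(s: String) {
    def swapCase: String = s.map(swapCharCase)
  }
}

// Usage:
//   import SwapCaseDemo._
//   "Hello, World 42".swapCase   // "hELLO, wORLD 42"
```

For the original use case this becomes buffer.substring(lwr, upr).swapCase once the implicit class is imported.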

Scala, PlayFramework - How to avoid auto escaping character while converting to Json?

I have a String:
val x = """{foo:"value1", bar:"value2"}"""
and I want to convert it into JsString.
val converted = JsString(x)
Now, if I print converted, the following result is printed:
"{foo:\"value1\", bar:\"value2\"}"
However, I don't want the \ added to the string. Is there any other way to avoid this auto escaping without using string.replace?
Try
println(JsString("""{foo:"value1", bar:"value2"}""").value)
"{foo:"value1", bar:"value2"}"
is not valid JSON, which is why the quotes are escaped by JsString. Indeed, how would a JSON parser interpret this if the internal quotes were not escaped?
If you want a (JVM) String with the JSON object inside, you already have it. If you want a JSON string, representing a JSON object, you MUST have escaping characters.
If you want the JSON object, you can always parse the string. Note that the keys must be quoted for the input to be valid JSON:
val obj: JsValue = Json.parse("""{"foo":"value1", "bar":"value2"}""")

URL decoding with unfiltered

I'm working with Unfiltered 0.6.8 (using the Jetty connector) and have encountered an odd behaviour: path segments aren't URL decoded.
The following bit of code is my minimal test case:
import unfiltered.request._
import unfiltered.response._
object Test extends App with unfiltered.filter.Plan {
def intent = {
case reg @ Path(Seg(test :: Nil)) =>
println(test)
ResponseString(test)
}
unfiltered.jetty.Http.local(8080).filter(Test).run()
}
Querying http://localhost:8080/some_string yields the expected result: some_string, both on the client and server side.
On the other hand, http://localhost:8080/some%20string yields some%20string on both client and server, rather than the some string I was expecting.
Working around the issue is trivial (java.net.URLDecoder#decode(String, String)), but I'd like to know if:
I'm forgetting something trivial and making a fool of myself.
unfiltered has a kit to deal with the hassle automatically.
if none of the above, is there a particular reason for this behaviour, or should I file a bug report?
As a side note, the unfiltered tag doesn't exist and I do not have enough reputation to create it, which is why I defaulted to scala.
Strange, I'm seeing this behavior too. There is nothing in the Seg object that does any sort of url decoding prior to splitting up the path segments and I don't see anything else in the framework for this either. I ran into a post that detailed a solution using a custom extractor as follows:
object Decode {
import java.net.URLDecoder
import java.nio.charset.Charset
import scala.util.Try
trait Extract {
def charset: Charset
def unapply(raw: String) =
Try(URLDecoder.decode(raw, charset.name())).toOption
}
object utf8 extends Extract {
val charset = Charset.forName("utf8")
}
}
Then you could use it like this:
case reg @ Path(Seg(Decode.utf8(test) :: Nil)) =>
println(test)
ResponseString(test)
Or like this if you wanted the entire path decoded:
case reg @ Path(Decode.utf8(Seg(test :: Nil))) =>
println(test)
ResponseString(test)
Thankfully the framework is flexible and open to extension like this, so you certainly have options.
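The extractor idea can be tried out without Unfiltered at all, since it is just an unapply around java.net.URLDecoder. The sketch below uses hypothetical names (DecodeSketch, Utf8, describe) and matches on a plain String instead of a request, so it only demonstrates the decoding step, not the routing.

```scala
import java.net.URLDecoder
import java.nio.charset.StandardCharsets
import scala.util.Try

object DecodeSketch {
  // Extractor that succeeds with the decoded text, or fails to match
  // when the input contains a malformed percent escape.
  object Utf8 {
    def unapply(raw: String): Option[String] =
      Try(URLDecoder.decode(raw, StandardCharsets.UTF_8.name())).toOption
  }

  // Stand-in for the body of an intent: returns the decoded segment.
  def describe(segment: String): String = segment match {
    case Utf8(decoded) => decoded
    case _             => "<undecodable>"
  }
}

// DecodeSketch.describe("some%20string")   // "some string"
```

One thing to keep in mind: URLDecoder implements form decoding, so it also turns a literal + into a space. That may or may not be what you want for path segments.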

How to convert a scala.xml.Elem into a JsExp in Lift?

My latest problem is one that I already have a solution for, it just
feels like there should be a better way.
The problem:
I want to send a PartialUpdate to a comet service, and I need to XML
escape the string, so that when it is used on the client it gets the
correct results. I currently have:
override def lowPriority = {
case v: List[TaskOwner] => {
partialUpdate(
taskOwners.foldLeft(JsCrVar("table", Call("$", Str("table#userTable"))) &
Call("table.dataTable().fnClearTable"))((r, c) => {
r & Call("table.dataTable().fnAddData",
JsArray(Str(Text(c.name).toString),
Str(Text(c.initials).toString),
Str(makeDeleteButton(c).toString)),
Num(0))
}) & Call("table.dataTable().fnDraw"))
}
}
And this works fine, however the Str(Text(c.name).toString) feels
quite wordy to me. Now, I can, of course, create a pair of implicit
conversion functions for this, but it seems like this should have
already been done somewhere, I just don't know where. And so, in the
spirit of reducing the code that I have written, I ask if anyone knows
a better way to do this, or if the implicit conversion already exists
somewhere?
I have seen reference to a solution here. However the code is summarized as:
def xmlToJson(xml: Elem): JsExp = {
// code to map XML responses to JSON responses. Handles tricky things like always returning
// js arrays for some fields even if only 1 element appears in the XML
}
A possibly better way of escaping the names is, instead of:
JsArray(Str(Text(c.name).toString),
Str(Text(c.initials).toString),
Str(makeDeleteButton(c).toString))
to use
JsArray(Str(c.name.asHtml.toString),
Str(c.initials.asHtml.toString),
Str(makeDeleteButton(c).toString))
This can be further reduced by using an implicit within the class like:
implicit def elemToJsExp(elem: NodeSeq): JsExp = Str(elem.toString)
…
JsArray(c.name.asHtml,
c.initials.asHtml,
makeDeleteButton (c))
I don't know what Str does, but maybe you mean Str(xml.Utility.escape(c.name))?
Well, how about:
def JsStrArray(strings: String*) = JsArray(strings map xml.Utility.escape map Str : _*)
And then just use
JsStrArray(c.name, c.initials, makeDeleteButton(c).toString)
Mmmmm. It might incorrectly escape the result of makeDeleteButton. Anyway, you can play with it and see what looks good.