Scala / Special character handling / How to turn m�dchen into mädchen?

I've got a Scala Akka app where I execute Python scripts inside Futures with ProcessBuilder.
Unfortunately, special characters are not displayed correctly, so instead of mädchen I get m�dchen
(äöü -> �)
If I execute the Python script from the command line I get the correct output "mädchen", so I assume it has nothing to do with the Python script and is instead somehow related to how my Scala code reads the process output.
Python Spider:
print("mädchen")
Scala:
val proc = Process("scrapy runspider spider.py")
var output: String = ""
val exitValue = proc ! ProcessLogger(
  (out) => if (out.trim.length > 0)
    output += out.trim,
  (err) =>
    System.err.printf("e:%s\n", err)
)
println(exitValue) // 0 -> succ.
println(output) // m�dchen -> should be mädchen
I have already tried many things and also read that Strings are UTF-8 by default, so I am not sure why I get those replacement characters.
I also tried the following, with no success:
var byteBuffer : ByteBuffer = StandardCharsets.UTF_8.encode(output.toString())
val str = new String(output.toString().getBytes(), "UTF-8")
Update:
It seems to be a Windows-related issue; the following instructions solve the problem: Using UTF-8 Encoding (CHCP 65001) in Command Prompt / Windows PowerShell (Windows 10)
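If changing the console code page is not an option, an alternative (a minimal sketch, not part of the original solution) is to skip ProcessLogger, which decodes the child's output with the JVM's default charset, and read the streams yourself with an explicit UTF-8 decoder via ProcessIO. This only helps if the Python process really writes UTF-8 bytes:
import java.io.InputStream
import scala.io.{Codec, Source}
import scala.sys.process._

val buffer = new StringBuilder
val io = new ProcessIO(
  _.close(),                       // stdin: nothing to send to the spider
  (out: InputStream) => {          // stdout: decode explicitly as UTF-8
    Source.fromInputStream(out)(Codec.UTF8).getLines().foreach { line =>
      if (line.trim.nonEmpty) buffer ++= line.trim
    }
    out.close()
  },
  (err: InputStream) => {          // stderr: forward to our own stderr, also as UTF-8
    Source.fromInputStream(err)(Codec.UTF8).getLines().foreach(l => System.err.println(s"e:$l"))
    err.close()
  }
)
val exitValue = Process("scrapy runspider spider.py").run(io).exitValue()
println(exitValue)        // 0 -> success
println(buffer.toString)  // mädchen, if the child really emits UTF-8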

Related

How do I work with a Scala process interactively?

I'm writing a bot in Scala for a game that uses text input and output. So I want to work with a process interactively - that is, my code receives output from the process, works with it, and only then sends its next input to the process. So I want to give a function access to the inputStreams and the outputStream simultaneously.
This doesn't seem to fit into any of the factories in scala.sys.process.BasicIO or the constructor for scala.sys.process.ProcessIO (three functions, each of which has access to only one stream).
Here's how I'm doing it at the moment.
private var rogue_input: OutputStream = _
private var rogue_output: InputStream = _
private var rogue_error: InputStream = _

Process("python3 /home/robin/IdeaProjects/Rogomatic/python/rogue.py --rogomatic").run(
  new ProcessIO(rogue_input = _, rogue_output = _, rogue_error = _)
)

try {
  private val rogue_scanner = new Scanner(rogue_output)
  private val rogue_writer = new PrintWriter(rogue_input, true)
  // Play the game
} finally {
  rogue_input.close()
  rogue_output.close()
  rogue_error.close()
}
This works, but it doesn't feel very Scala-like. Is there a more idiomatic way to do this?
So I want to work with a process interactively - that is, my code receives output from the process, works with it, and only then sends its next input to the process.
In general, this is traditionally solved by expect. There exist libraries and tools inspired by expect for various languages, including for Scala: https://github.com/Lasering/scala-expect.
The README of the project gives various examples. While I don't know exactly what your rogue.py expects in terms of stdin/stdout interactions, here's a quick "hello world" example showing how you could interact with a Python interpreter (using the Ammonite REPL, which has convenient library-importing capabilities):
import $ivy.`work.martins.simon::scala-expect:6.0.0`

import work.martins.simon.expect.core._
import work.martins.simon.expect.core.actions._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val timeout = 5 seconds
val e = new Expect("python3 -i -", defaultValue = "?")(
  new ExpectBlock(
    new StringWhen(">>> ")(
      Sendln("""print("hello, world")""")
    )
  ),
  new ExpectBlock(
    new RegexWhen("""(.*)\n>>> """.r)(
      ReturningWithRegex(_.group(1).toString)
    )
  )
)
e.run(timeout).onComplete(println)
The code above "expects" >>> to be sent to stdout, and when it finds that, it sends print("hello, world") followed by a newline. From then on, it reads and returns everything until the next prompt (>>>) using a regex.
Amongst other debug information, the above should result in Success(hello, world) being printed to your console.
The library has various other styles, and there may also exist other similar libraries out there. My main point is that an expect-inspired library is likely what you're looking for.
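If you would rather avoid an extra dependency, a slightly more idiomatic variant of the approach from the question (a sketch, not part of the original answer) is to hand the streams out of the ProcessIO callbacks through Promises instead of mutable vars:
import java.io.{InputStream, OutputStream, PrintWriter}
import java.util.Scanner
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._
import scala.sys.process._

// Each ProcessIO callback runs on its own thread and receives exactly one stream;
// completing a Promise from the callback lets the driving thread pick the stream up safely.
val stdinP  = Promise[OutputStream]()
val stdoutP = Promise[InputStream]()
val stderrP = Promise[InputStream]()

val process = Process("python3 rogue.py --rogomatic").run(
  new ProcessIO(in => stdinP.success(in), out => stdoutP.success(out), err => stderrP.success(err))
)

val writer  = new PrintWriter(Await.result(stdinP.future, 5.seconds), true)
val scanner = new Scanner(Await.result(stdoutP.future, 5.seconds))
try {
  writer.println("some command")   // send a line to the game...
  val reply = scanner.nextLine()   // ...then read its next line of output
  println(reply)
} finally {
  writer.close()
  scanner.close()
  process.destroy()
}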

Scala - how to use variables in a multi-line string literal

I want to use the value of the 'myActionID' variable. How do I do that?
If I pass a static value like "actionId": 1368201 instead of myActionID it works, but if I use "actionId" : ${actionIdd} it gives an error.
Here's the relevant code:
class LaunchWorkflow_Act extends Simulation {
  val scenarioRepeatCount = 1
  val userCount = 1
  val myActionID = "13682002351"

  val scn = scenario("LaunchMyFile")
    .repeat(scenarioRepeatCount) {
      exec(session => session.set("counter", (globalVar.getAndIncrement + " " + timeStamp.toString())))
        .exec(http("LaunchRequest")
          .post("""/api/test""")
          .headers(headers_0)
          .body(StringBody(
            """{ "actionId": ${myActionID} ,
"jConfig": "{\"wflow\":[{\"Wflow\":{\"id\": \"13500145349\"},\"inherit-variables\": true,\"workflow-context-variable\": [{\"variable-name\": \"externalFilePath\",\"variable-value\": \"/var/nem/nem/media/mount/assets/Test.mp4\"},{\"variable-name\": \"Name\",\"variable-value\": \"${counter}\"}]}]}"
}""")))
        .pause(pause)
    }

  setUp(scn.inject(atOnceUsers(userCount))).protocols(httpProtocol)
}
Everything works fine if I put the value 13682002351 instead of myActionID. While executing this script in Gatling I am getting this error:
ERROR i.g.http.action.HttpRequestAction - 'httpRequest-3' failed to
execute: No attribute named 'myActionID' is defined
Scala has various mechanisms for String Interpolation (see docs), which can be used to embed variables in strings. All of them can be used in conjunction with the triple quotes """ used to create multi-line strings.
In this case, you can use:
val counter = 12
val myActionID = "13682002351"
val str = s"""{
"actionId": $myActionID ,
"jConfig": "{\"wflow\":[{\"Wflow\":{\"id\": \"13500145349\"},\"inherit-variables\": true,\"workflow-context-variable\": [{\"variable-name\": \"externalFilePath\",\"variable-value\": \"/var/nem/nem/media/mount/assets/Test.mp4\"},{\"variable-name\": \"Name\",\"variable-value\": \"${counter}\"}]}]}"
}"""
Notice the s prepended to the string literal, and the dollar sign prepended to the variable names.
Using an s-interpolated string we can do this easily:
s"""Hello World, Welcome Back!
How are you doing ${userName}"""
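As a quick illustration (not from the original answers), the $name and ${name} forms are equivalent; the braces are only required when the variable name would otherwise run into the characters that follow it:
val userName = "Robin"
println(s"Hello $userName!")          // Hello Robin!
println(s"Hello ${userName}!")        // same thing
println(s"${userName}_log.txt")       // braces required: $userName_log would look up a variable named userName_log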

Spark: run an external process in parallel

Is it possible with Spark to "wrap" and run an external process managing its input and output?
The process is represented by a normal C/C++ application that usually runs from the command line. It accepts a plain text file as input and generates another plain text file as output. As I need to integrate the flow of this application with something bigger (still in Spark), I was wondering if there is a way to do this.
The process can easily be run in parallel (at the moment I use GNU Parallel) just by splitting its input into (for example) 10 part files, running 10 instances of it in memory, and re-joining the final 10 output part files into one file.
The simplest thing you can do is to write a simple wrapper which takes data from standard input, writes it to a file, executes the external program, and sends the results to standard output. After that, all you have to do is use the pipe method:
rdd.pipe("your_wrapper")
The only serious consideration is IO performance. If possible, it would be better to adjust the program you want to call so it can read and write data directly without going through disk.
Alternatively, you can use mapPartitions combined with process and the standard IO tools to write to a local file, call your program, and read the output; a sketch of that approach follows.
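For the mapPartitions variant, a minimal sketch could look like the following. The program path (/usr/local/bin/legacy_tool) and its file-in/file-out calling convention are hypothetical placeholders for the OP's application, and rdd is assumed to be an RDD[String] as in the pipe example above:
import java.nio.file.Files
import scala.collection.JavaConverters._
import scala.sys.process._

val result = rdd.mapPartitionsWithIndex { (idx, lines) =>
  // Dump the partition to a local temp file, run the external program on it,
  // then read its output file back as the new partition contents.
  val in  = Files.createTempFile(s"part-$idx-", ".in")
  val out = Files.createTempFile(s"part-$idx-", ".out")
  Files.write(in, lines.toSeq.asJava)
  val exit = Seq("/usr/local/bin/legacy_tool", in.toString, out.toString).!  // hypothetical binary
  require(exit == 0, s"legacy_tool failed on partition $idx with exit code $exit")
  val output = Files.readAllLines(out).asScala.toList
  Files.delete(in)
  Files.delete(out)
  output.iterator
}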
If you end up here based on the question title from a Google search, but you don't have the OP restriction that the external program needs to read from a file--i.e., if your external program can read from stdin--here is a solution. For my use case, I needed to call an external decryption program for each input file.
import org.apache.commons.io.IOUtils
import sys.process._
import scala.collection.mutable.ArrayBuffer

val showSampleRows = true
val bfRdd = sc.binaryFiles("/some/files/*,/more/files/*")
val rdd = bfRdd.flatMap{ case(file, pds) => { // pds is a PortableDataStream
  val rows = new ArrayBuffer[Array[String]]()
  var errors = List[String]()
  val io = new ProcessIO (
    in => { // "in" is an OutputStream; write the encrypted contents of the
            // input file (pds) to this stream
      IOUtils.copy(pds.open(), in) // open() returns a DataInputStream
      in.close
    },
    out => { // "out" is an InputStream; read the decrypted data off this stream.
             // Even though this runs in another thread, we can write to rows, since it
             // is part of the closure for this function
      for(line <- scala.io.Source.fromInputStream(out).getLines) {
        // ...decode line here... for my data, it was pipe-delimited
        rows += line.split('|')
      }
      out.close
    },
    err => { // "err" is an InputStream; read any errors off this stream
             // errors is part of the closure for this function
      errors = scala.io.Source.fromInputStream(err).getLines.toList
      err.close
    }
  )
  val cmd = List("/my/decryption/program", "--decrypt")
  val exitValue = cmd.run(io).exitValue // blocks until subprocess finishes
  println(s"-- Results for file $file:")
  if (exitValue != 0) {
    // TBD write to string accumulator instead, so driver can output errors
    // string accumulator from #zero323: https://stackoverflow.com/a/31496694/215945
    println(s"exit code: $exitValue")
    errors.foreach(println)
  } else {
    // TBD, you'll probably want to move this code to the driver, otherwise
    // unless you're using the shell, you won't see this output
    // because it will be sent to stdout of the executor
    println(s"row count: ${rows.size}")
    if (showSampleRows) {
      println("6 sample rows:")
      rows.slice(0,6).foreach(row => println(" " + row.mkString("|")))
    }
  }
  rows
}}
scala> :paste "test.scala"
Loading test.scala...
...
rdd: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[62] at flatMap at <console>:294
scala> rdd.count // action, causes Spark code to actually run
-- Results for file hdfs://path/to/encrypted/file1: // this file had errors
exit code: 255
ERROR: Error decrypting
my_decryption_program: Bad header data[0]
-- Results for file hdfs://path/to/encrypted/file2:
row count: 416638
sample rows:
<...first row shown here ...>
...
<...sixth row shown here ...>
...
res43: Long = 843039
References:
https://www.scala-lang.org/api/current/scala/sys/process/ProcessIO.html
https://alvinalexander.com/scala/how-to-use-closures-in-scala-fp-examples#using-closures-with-other-data-types

cgi.parse_multipart function throws TypeError in Python 3

I'm trying to do an exercise from Udacity's Full Stack Foundations course. I have the do_POST method inside my subclass of BaseHTTPRequestHandler; basically I want to get a post value named message submitted with a multipart form. This is the code for the method:
def do_POST(self):
    try:
        if self.path.endswith("/Hello"):
            self.send_response(200)
            self.send_header('Content-type', 'text/html')
            self.end_headers()
            ctype, pdict = cgi.parse_header(self.headers['content-type'])
            if ctype == 'multipart/form-data':
                fields = cgi.parse_multipart(self.rfile, pdict)
                messagecontent = fields.get('message')
            output = ""
            output += "<html><body>"
            output += "<h2>Ok, how about this?</h2>"
            output += "<h1>{}</h1>".format(messagecontent)
            output += "<form method='POST' enctype='multipart/form-data' action='/Hello'>"
            output += "<h2>What would you like to say?</h2>"
            output += "<input name='message' type='text'/><br/><input type='submit' value='Submit'/>"
            output += "</form></body></html>"
            self.wfile.write(output.encode('utf-8'))
            print(output)
            return
    except:
        self.send_error(404, "{}".format(sys.exc_info()[0]))
        print(sys.exc_info())
The problem is that cgi.parse_multipart(self.rfile, pdict) throws an exception: TypeError: can't concat bytes to str. The implementation was provided in the videos for the course, but they're using Python 2.7 and I'm using Python 3. I've looked for a solution all afternoon but could not find anything useful. What would be the correct way to read data passed from a multipart form in Python 3?
I came across this page trying to solve the same problem you have.
I found a silly solution for it.
I just convert the 'boundary' item in the dictionary from string to bytes with an encoding option.
ctype, pdict = cgi.parse_header(self.headers['content-type'])
pdict['boundary'] = bytes(pdict['boundary'], "utf-8")
if ctype == 'multipart/form-data':
    fields = cgi.parse_multipart(self.rfile, pdict)
In my case, it seems to work properly.
To change the tutor's code to work for Python 3 there are three error messages you'll have to combat:
If you get these error messages
c_type, p_dict = cgi.parse_header(self.headers.getheader('Content-Type'))
AttributeError: 'HTTPMessage' object has no attribute 'getheader'
or
boundary = pdict['boundary'].decode('ascii')
AttributeError: 'str' object has no attribute 'decode'
or
headers['Content-Length'] = pdict['CONTENT-LENGTH']
KeyError: 'CONTENT-LENGTH'
when running
c_type, p_dict = cgi.parse_header(self.headers.getheader('Content-Type'))
if c_type == 'multipart/form-data':
    fields = cgi.parse_multipart(self.rfile, p_dict)
    message_content = fields.get('message')
this applies to you.
Solution
First of all change the first line to accommodate Python 3:
- c_type, p_dict = cgi.parse_header(self.headers.getheader('Content-Type'))
+ c_type, p_dict = cgi.parse_header(self.headers.get('Content-Type'))
Secondly, to fix the error of the 'str' object not having the attribute 'decode': it occurs because in Python 3 strings are Unicode strings, instead of being equivalent to byte strings as in Python 2, so add this line just under the one above:
p_dict['boundary'] = bytes(p_dict['boundary'], "utf-8")
Thirdly, to fix the error of not having 'CONTENT-LENGTH' in pdict, just add these lines before the if statement:
content_len = int(self.headers.get('Content-length'))
p_dict['CONTENT-LENGTH'] = content_len
Full solution on my Github:
https://github.com/rSkogeby/web-server
I am doing the same course and ran into the same problem. Instead of getting it to work with cgi, I am now using the parse library. This was shown in the same course just a few lessons earlier.
from urllib.parse import parse_qs
length = int(self.headers.get('Content-length', 0))
body = self.rfile.read(length).decode()
params = parse_qs(body)
messagecontent = params["message"][0]
And you have to get rid of the enctype='multipart/form-data' in your form.
In my case I used cgi.FieldStorage to extract the file and name instead of cgi.parse_multipart:
form = cgi.FieldStorage(
    fp=self.rfile,
    headers=self.headers,
    environ={'REQUEST_METHOD': 'POST',
             'CONTENT_TYPE': self.headers['Content-Type'],
             })
print('File', form['file'].file.read())
print('Name', form['name'].value)
Another hack solution is to edit the source of the cgi module.
At the very beginning of parse_multipart (around line 226), change the usage of the boundary to str(boundary):
...
boundary = b""
if 'boundary' in pdict:
    boundary = pdict['boundary']
if not valid_boundary(boundary):
    raise ValueError('Invalid boundary in multipart form: %r'
                     % (boundary,))

nextpart = b"--" + str(boundary)
lastpart = b"--" + str(boundary) + b"--"
...

Reading sql file using getResources in scala

I'm trying to read and execute a SQL file in Spark SQL.
sqlContext.sql(scala.io.Source.fromInputStream(getClass.getResourceAsStream("/" + "dq.sql")).getLines.mkString(" ").stripMargin).take(1)
My SQL is very long. When I run it directly in the spark shell, it runs fine. When I try to read it using getResourceAsStream, I'm hitting
java.lang.RuntimeException: [1.10930] failure: end of input
A simple solution could be to read the SQL at the driver (using any file utility) and pass the variable, like ssc.sql(sqlvar):
val stream: InputStream = getClass.getResourceAsStream("/filename.txt")
val readFile = scala.io.Source.fromInputStream(stream).getLines

val spa = readFile.map(line => " " + line)
val spl = spa.mkString.split(";")

for (m1 <- spl) {
  sqlContext.sql(m1)
}
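As a small variant of the above (a sketch under the same assumptions: the statements live in a classpath resource and are separated by ';'), you can also drop empty fragments and close the stream explicitly:
import java.io.InputStream
import scala.io.Source

val stream: InputStream = getClass.getResourceAsStream("/dq.sql")
val statements =
  try Source.fromInputStream(stream).getLines.mkString(" ").split(";").map(_.trim).filter(_.nonEmpty)
  finally stream.close()

statements.foreach(stmt => sqlContext.sql(stmt))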