Getting NPE on simple Regex Replacing (Scala on Spark) - scala

I wrote some simple code to parse a large XML file (extract lines, clean the text, and remove any HTML tags) using Apache Spark.
I'm seeing a NullPointerException when calling .replaceAllIn on a string that is non-null.
The strange thing is that I get no errors when I run the code locally, using input from disk, but I get a NullPointerException when I run the same code on AWS EMR, loading the input file from S3.
Here is the relevant code:
val HTML_TAGS_PATTERN = """<[^>]+>""".r
// other code here...
spark
  .sparkContext
  .textFile(pathToInputFile, numPartitions)
  .filter { str => str.startsWith(" <row ") }
  .toDS()
  .map { str =>
    Locale.setDefault(new Locale("en", "US"))
    val parts = str.split(""""""")
    var title: String = ""
    var body: String = ""
    // some code omitted here
    title = StringEscapeUtils.unescapeXml(title).toLowerCase.trim
    body = StringEscapeUtils.unescapeXml(body).toLowerCase // decode XML entities
    println("before replacing, body is: " + body)
    // NEXT LINE TRIGGERS NPE
    body = HTML_TAGS_PATTERN.replaceAllIn(body, " ") // take out HTML tags
  }
Things I've tried:
Printing the string just before calling replaceAllIn, to make sure it's not null.
Making sure the Locale is not null.
Printing out the exception message and stacktrace: it just tells me that that line is where the NullPointerException occurs, nothing more.
Things that are different between my local setup and AWS EMR:
In my local setup I load the input file from disk; on EMR I load it from S3.
In my local setup I run Spark in standalone mode; on EMR it runs in cluster mode.
Everything else is the same on my machine and on AWS EMR: Scala version, Spark version, Java version, Cluster configs...
I have been trying to figure this out for some hours and I can't think of anything else to try.
EDIT
I've moved the call to r() to within the map{} body, like this:
val HTML_TAGS_PATTERN = """<[^>]+>"""
// code omitted
.map { str =>
  // ...
  body = HTML_TAGS_PATTERN.r.replaceAllIn(body, " ")
}
This also produces an NPE, with the following stacktrace:
java.lang.NullPointerException
at java.util.regex.Pattern.<init>(Pattern.java:1350)
at java.util.regex.Pattern.compile(Pattern.java:1028)
at scala.util.matching.Regex.<init>(Regex.scala:191)
at scala.collection.immutable.StringLike$class.r(StringLike.scala:255)
at scala.collection.immutable.StringOps.r(StringOps.scala:29)
at scala.collection.immutable.StringLike$class.r(StringLike.scala:244)
at scala.collection.immutable.StringOps.r(StringOps.scala:29)
at ReadSOStanfordTokenize$$anonfun$2.apply(ReadSOStanfordTokenize.scala:102)
at ReadSOStanfordTokenize$$anonfun$2.apply(ReadSOStanfordTokenize.scala:72)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spar

I think you should try putting the regex inline, like below.
This is a bit of a lame solution; you should be able to define a constant, maybe put it in a global object or something. I'm not sure where you are defining it such that it would be a problem. But remember Spark serialises the code and runs it on distributed workers, so something could be going wrong with that.
rdd.map { _ =>
  ...
  body = """<[^>]+>""".r.replaceAllIn(body, " ")
}
I get a very similar error when I run .r on a null String.
val x: String = null
x.r
java.lang.NullPointerException
java.util.regex.Pattern.<init>(Pattern.java:1350)
java.util.regex.Pattern.compile(Pattern.java:1028)
scala.util.matching.Regex.<init>(Regex.scala:223)
scala.collection.immutable.StringLike.r(StringLike.scala:281)
scala.collection.immutable.StringLike.r$(StringLike.scala:281)
scala.collection.immutable.StringOps.r(StringOps.scala:29)
scala.collection.immutable.StringLike.r(StringLike.scala:270)
scala.collection.immutable.StringLike.r$(StringLike.scala:270)
scala.collection.immutable.StringOps.r(StringOps.scala:29)
That error has slightly different line numbers, I think because of the Scala version. I'm on 2.12.2.
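If you do want to keep a named constant, here is a minimal sketch of one approach (the object and dataset names below are illustrative, not from the original code): put the compiled pattern in a top-level object, which is not captured in the task closure; each executor JVM initializes its own copy when it is first referenced.
object Patterns {
  val HtmlTags = """<[^>]+>""".r
}

ds.map { str =>
  Patterns.HtmlTags.replaceAllIn(str, " ")
}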

Thanks to Stephen's answer I found out why I was getting an NPE in my UDF. I went this way (finding a match, in my case):
def findMatch(word: String): String => Boolean = { s =>
  Option(s) match {
    case Some(validText) => word.toLowerCase.r.findAllIn(validText.toLowerCase).nonEmpty
    case None            => false
  }
}
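For context, a hypothetical way to plug this into Spark SQL as a UDF (the DataFrame df, the column name "body", and the search word are assumptions, not from the original post):
import org.apache.spark.sql.functions.{col, udf}

val containsWord = udf(findMatch("spark")) // String => Boolean lifted to a UDF
val filtered = df.filter(containsWord(col("body")))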

"<[^>]+>" was great, but I have one type of things in my HTML. it consists of a name of style and then parameters in between curly braces:
p { margin-top: 0px;margin-bottom: 0px;line-height: 1.15; }
body { font-family: 'Arial';font-style: Normal;font-weight: normal;font-size: 14.6666666666667px; }.Normal { telerik-style-type: paragraph;telerik-style-name: Normal;border-collapse: collapse; }.TableNormal { telerik-style-type: table;telerik-style-name: TableNormal;border-collapse: collapse; }.s_4C87DD5E { telerik-style-type: local;font-family: 'Arial';font-size: 14.6666666666667px;color: #000000; }.s_8D20FCAB { telerik-style-type: local;font-family: 'Arial';font-size: 14.6666666666667px;color: #000000;text-decoration: underline; }.p_53E06EE5 { telerik-style-type: local;margin-left: 0px; }
I tried to extract them using the following, but it didn't work:
"\{[^\}]+\}"

Related

file merge logic: scala

This might be a silly question for Scala experts, but as a beginner I'm having a hard time identifying the solution. Any pointers would help.
I have a set of 3 files in an HDFS location with the names:
fileFirst.dat
fileSecond.dat
fileThird.dat
They won't necessarily be stored in that order; fileFirst.dat could be created last, so an ls shows a different ordering of the files each time.
My task is to combine all the files into a single file in this order:
fileFirst contents, then fileSecond contents, and finally fileThird contents, with a newline as the separator and no spaces.
I tried a few ideas but couldn't come up with anything working; every time, the order of the combination gets messed up.
Below is my function to merge whatever comes in:
def writeFile(): Unit = {
  val in: InputStream = fs.open(files(i).getPath)
  try {
    IOUtils.copyBytes(in, out, conf, false)
    if (addString != null) out.write(addString.getBytes("UTF-8"))
  } finally in.close()
}
files is defined like this:
val files: Array[FileStatus] = fs.listStatus(srcPath)
This is part of a bigger function where I'm passing in all the arguments used in this method. After everything is done, I call out.close() to close the output stream.
Any ideas are welcome, even if they go against the file-write logic I'm trying to use; just understand that I'm not that good at Scala, for now :)
If you can enumerate your Paths directly, you don't really need to use listStatus. You could try something like this (untested):
val relativePaths = Array("fileFirst.dat", "fileSecond.dat", "fileThird.dat")
val paths = relativePaths.map(new Path(srcDirectory, _))
val output = fs.create(destinationFile) // declared outside the try so the finally block can close it
try {
  for (path <- paths) {
    val input = fs.open(path)
    try {
      IOUtils.copyBytes(input, output, conf, false)
    } catch {
      case ex: Throwable => throw ex // Feel free to do some error handling here
    } finally {
      input.close()
    }
  }
} catch {
  case ex: Throwable => throw ex // Feel free to do some error handling here
} finally {
  output.close()
}
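Because paths is built from an explicitly ordered array rather than from listStatus, the concatenation order is deterministic regardless of how HDFS lists the directory. If you still need the newline separator between files, a small addition after each copy would do it (a sketch reusing the names above, playing the role addString had in the original writeFile):
IOUtils.copyBytes(input, output, conf, false)
output.write("\n".getBytes("UTF-8")) // newline between files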

How to stop the execution or throw an error in scala play framework

I'm developing a Scala application with the Play framework, and I've hit something strange.
I can't stop the execution or throw an error in order to send a response to the client; it always continues through the code and always returns OK. I made a dummy function that should return a bad request, but unfortunately it returns OK. Here is what I wrote; any help will be appreciated.
def foo(locale: String, orderId: Int) = Action { implicit request => {
  val x = 4 + 7;
  if (x == 11) {
    BadRequest(JsonHelper.convertToJson("Bad bad it is really bad "))
  }
  OK(JsonHelper.convertToJson("Well Done"))
}
}
The above code returns OK "Well Done".
To make your code return a BadRequest, add an else:
def foo(locale: String, orderId: Int) = Action { implicit request => {
  val x = 4 + 7;
  if (x == 11)
    BadRequest(JsonHelper.convertToJson("Bad bad it is really bad "))
  else // <---
    OK(JsonHelper.convertToJson("Well Done"))
}}
Your problem is your if without curly braces.
Refer to this link: https://docs.scala-lang.org/style/control-structures.html#curly-braces
if - Omit braces if you have an else clause. Otherwise, surround the
contents with curly braces even if the contents are only a single
line.
So you can just surround the BadRequest with curly braces, or add an else between your BadRequest and OK instructions. Be careful, don't forget the indentation!
Edit 20/12/2017: In Scala the last expression is implicitly returned. Your last expression is OK, so it returns OK.
Add an explicit return statement in your if block, or add an else statement.
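To see why, a minimal illustration in plain Scala outside Play (names made up):
def classify(x: Int): String = {
  if (x == 11) {
    "bad" // evaluated, but discarded: it is not the last expression of the block
  }
  "ok" // the block's last expression is what the method returns
}

def classifyFixed(x: Int): String =
  if (x == 11) "bad" else "ok" // with an else, the if itself is the returned expression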

Spark: run an external process in parallel

Is it possible with Spark to "wrap" and run an external process managing its input and output?
The process is represented by a normal C/C++ application that usually runs from the command line. It accepts a plain text file as input and generates another plain text file as output. As I need to integrate the flow of this application with something bigger (still in Spark), I was wondering if there is a way to do this.
The process can easily be run in parallel (at the moment I use GNU Parallel) just by splitting its input into (for example) 10 part files, running 10 instances of it in memory, and re-joining the final 10 output part files into one file.
The simplest thing you can do is to write a simple wrapper which takes data from standard input, writes it to a file, executes the external program, and sends the results to standard output. After that all you have to do is use the pipe method:
rdd.pipe("your_wrapper")
The only serious consideration is IO performance. If possible, it would be better to adjust the program you want to call so it can read and write data directly, without going through disk.
Alternatively, you can use mapPartitions combined with process and standard IO tools to write to a local file, call your program, and read the output.
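A minimal sketch of that mapPartitions variant, assuming the RDD (rdd below) holds lines of text and a hypothetical command-line tool at /usr/local/bin/mytool that takes an input path and an output path as arguments:
import java.io.File
import java.nio.file.Files
import scala.collection.JavaConverters._
import scala.sys.process._

val processed = rdd.mapPartitions { lines =>
  // write this partition to a local temp file on the executor
  val inFile  = File.createTempFile("part-in-", ".txt")
  val outFile = File.createTempFile("part-out-", ".txt")
  inFile.deleteOnExit(); outFile.deleteOnExit()
  Files.write(inFile.toPath, lines.toSeq.asJava)

  // run the external program against the temp file
  val exitCode = Seq("/usr/local/bin/mytool", inFile.getAbsolutePath, outFile.getAbsolutePath).!
  if (exitCode != 0) sys.error(s"mytool failed with exit code $exitCode")

  // stream its output back as the partition's new contents
  Files.readAllLines(outFile.toPath).asScala.iterator
}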
If you end up here based on the question title from a Google search, but you don't have the OP's restriction that the external program must read from a file (i.e., if your external program can read from stdin), here is a solution. For my use case, I needed to call an external decryption program for each input file.
import org.apache.commons.io.IOUtils
import sys.process._
import scala.collection.mutable.ArrayBuffer
val showSampleRows = true
val bfRdd = sc.binaryFiles("/some/files/*,/more/files/*")
val rdd = bfRdd.flatMap { case (file, pds) => { // pds is a PortableDataStream
  val rows = new ArrayBuffer[Array[String]]()
  var errors = List[String]()
  val io = new ProcessIO(
    in => { // "in" is an OutputStream; write the encrypted contents of the
            // input file (pds) to this stream
      IOUtils.copy(pds.open(), in) // open() returns a DataInputStream
      in.close
    },
    out => { // "out" is an InputStream; read the decrypted data off this stream.
             // Even though this runs in another thread, we can write to rows, since it
             // is part of the closure for this function
      for (line <- scala.io.Source.fromInputStream(out).getLines) {
        // ...decode line here... for my data, it was pipe-delimited
        rows += line.split('|')
      }
      out.close
    },
    err => { // "err" is an InputStream; read any errors off this stream
             // errors is part of the closure for this function
      errors = scala.io.Source.fromInputStream(err).getLines.toList
      err.close
    }
  )
  val cmd = List("/my/decryption/program", "--decrypt")
  val exitValue = cmd.run(io).exitValue // blocks until subprocess finishes
  println(s"-- Results for file $file:")
  if (exitValue != 0) {
    // TBD write to string accumulator instead, so driver can output errors
    // string accumulator from #zero323: https://stackoverflow.com/a/31496694/215945
    println(s"exit code: $exitValue")
    errors.foreach(println)
  } else {
    // TBD, you'll probably want to move this code to the driver, otherwise
    // unless you're using the shell, you won't see this output
    // because it will be sent to stdout of the executor
    println(s"row count: ${rows.size}")
    if (showSampleRows) {
      println("6 sample rows:")
      rows.slice(0,6).foreach(row => println(" " + row.mkString("|")))
    }
  }
  rows
}}
scala> :paste "test.scala"
Loading test.scala...
...
rdd: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[62] at flatMap at <console>:294
scala> rdd.count // action, causes Spark code to actually run
-- Results for file hdfs://path/to/encrypted/file1: // this file had errors
exit code: 255
ERROR: Error decrypting
my_decryption_program: Bad header data[0]
-- Results for file hdfs://path/to/encrypted/file2:
row count: 416638
sample rows:
<...first row shown here ...>
...
<...sixth row shown here ...>
...
res43: Long = 843039
References:
https://www.scala-lang.org/api/current/scala/sys/process/ProcessIO.html
https://alvinalexander.com/scala/how-to-use-closures-in-scala-fp-examples#using-closures-with-other-data-types

Create a weighted feeder in Gatling

I have a few .csv files I want to use for the same data in Gatling. Each of these files has a certain number of IDs that I want to be accessed fairly equally. I don't want to put them all in the same file because the .csv files are generated from SQL queries and, while I may have a lot of IDs in one file, I only have a few in another. What's important to me is that I have a random sample from each of my files and a way to specify the distribution.
I found an example of how to do this, but I'm having trouble applying it to my case. Here is the code I have so far. I try to both 1) print out the value from the feeder in the session and 2) use the value from the feeder in a GET request. Both attempts fail with various errors, which I detail below:
import scala.concurrent.duration._
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import io.gatling.jdbc.Predef._
import util.Random

class FeederTest extends Simulation {

  //headers...

  val userCreds = csv("user_creds.csv")
  val sample1 = csv("sample1.csv")
  val sample2 = csv("sample2.csv")

  def randFeed(): String = {
    val foo = Random.nextInt(2)
    var retval = ""
    if (foo == 0) retval = "file1"
    if (foo == 1) retval = "file2"
    return retval
  }

  val scn = scenario("feeder test")
    .repeat(1) {
      feed(userCreds)
        .doSwitch(randFeed)(
          "file1" -> feed(sample1),
          "file2" -> feed(sample2)
        )
        .exec(http("request - login")
          .post("<URL>")
          .headers(headers_login)
          .formParam("email", "${username}")
          .formParam("password", "<not telling>"))
        .exec(session => {
          println(session)
          println(session("first").as[String])
          session
        })
        .exec(http("goto_url")
          .get("<my url>/${first}"))
    }

  setUp(scn.inject(atOnceUsers(1))).protocols(httpProtocol)
}
This is the error I get when I attempt to print out the feeder value in the session (as in the above code, using session(<value>).as[String]):
[ERROR] [03/13/2015 10:22:38.221] [GatlingSystem-akka.actor.default-dispatcher-8] [akka://GatlingSystem/user/sessionHook-2] key not found: first
java.util.NoSuchElementException: key not found: first
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at io.gatling.core.session.SessionAttribute.as(Session.scala:40)
at FeederTest$$anonfun$2.apply(feeder_test.scala:81)
at FeederTest$$anonfun$2.apply(feeder_test.scala:79)
at io.gatling.core.action.SessionHook.executeOrFail(SessionHook.scala:35)
at io.gatling.core.action.Failable$class.execute(Actions.scala:71)
at io.gatling.core.action.SessionHook.execute(SessionHook.scala:28)
at io.gatling.core.action.Action$$anonfun$receive$1.applyOrElse(Actions.scala:29)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at io.gatling.core.akka.BaseActor.aroundReceive(BaseActor.scala:22)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I have also tried just using the EL expression ${first} in the session. All this does is print out the string "${first}". Similarly, on the last .get line I get an error saying "No attribute named 'first' is defined".
The CSV files I'm currently using just have two columns of first and last in them like so:
sample1.csv:
first, last
george, bush
bill, clinton
barak, obama
sample2.csv:
first, last
super, man
aqua, man
bat, man
I'm using Gatling 2.1.4.
doSwitch takes an Expression[Any], which is a type alias for Session => Validation[Any]. Gatling has an implicit conversion that lets you pass a static value instead; see the documentation.
Which is exactly what you do: even if randFeed is a def, it still doesn't return a function, but a String.
As you want randFeed to be called every time a virtual user passes through this step, you have to wrap randFeed inside a function, even if you don't use the Session input parameter.
doSwitch(_ => randFeed)
Then, your randFeed is both ugly (no offense) and inefficient (Random is synchronized):
import scala.concurrent.forkjoin.ThreadLocalRandom

def randFeed(): String =
  ThreadLocalRandom.current().nextInt(2) match {
    case 0 => "file1"
    case 1 => "file2"
  }
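Since the question title asks about weighting, a minimal sketch of a weighted variant along the same lines (the 70/30 split below is just an example, not from the question):
def weightedFeed(): String =
  if (ThreadLocalRandom.current().nextDouble() < 0.7) "file1" else "file2"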
I've never used feeders, but from the documentation I see two possible problems:
randFeed returns either "foo" or "bar", but you're mapping with "file1" and "file2", so is anything being loaded? (The documentation says "If no switch is selected, the switch is bypassed.")
The examples for feeders I see don't show accessing fed data with session(varname), but rather with "${varname}".
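For illustration, a sketch of both access styles (the URL and attribute name are assumptions; asOption is, if I recall correctly, the non-throwing accessor in the Gatling 2.x session API):
// EL style: Gatling resolves ${first} when building the request
.exec(http("goto_url").get("/users/${first}"))
// Programmatic style: look the attribute up without throwing if it is missing
.exec(session => {
  println(session("first").asOption[String].getOrElse("<no 'first' attribute>"))
  session
})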

Play! + Scala: Split string by commas then foreach loop

I have a long string similar to this:
"tag1, tag2, tag3, tag4"
Now in my play template I would like to create a foreach loop like this:
#posts.foreach { post =>
  #for(tag <- #post.tags.split(",")) {
    <span>#tag</span>
  }
}
With this, I'm getting this error: ')' expected but '}' found.
I switched the ) for a } and it just throws back more errors.
How would I do this in Play! using Scala?
Thanks in advance.
With the help of #Xyzk, here's the answer: stackoverflow.com/questions/13860227/split-string-assignment
Posting this because the answer marked correct isn't necessarily true, as pointed out in my comment. There are only two things wrong with the original code. One, the foreach returns Unit, so it has no output. The code should actually run, but nothing would get printed to the page. Two, you don't need the magic # symbol within #for(...).
This will work:
#for(post <- posts) {
  #for(tag <- post.tags.split(",")) {
    <span>#tag</span>
  }
}
There is in fact nothing wrong with using other functions in play templates.
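One small follow-up: since values like "tag1, tag2" keep a leading space after splitting on ",", trimming each piece avoids stray spaces in the output (a sketch using the same template syntax as above):
#for(post <- posts) {
  #for(tag <- post.tags.split(",").map(_.trim)) {
    <span>#tag</span>
  }
}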
This should be the problem
#for(tag <- post.tags.split(",")) {
  <span>#tag</span>
}