Akka Streams flow Input (`In`) as Output (`Out`) - Scala

I am trying to write a piece of code which does the following:
Reads a large CSV file from a remote source like S3.
Processes the file record by record.
Sends a notification to the user.
Writes the output to a remote location.
Sample records in the input CSV:
recordId,name,salary
1,Aiden,20000
2,Tom,18000
3,Jack,25000
My input case class, which represents a record in the input CSV:
case class InputRecord(recordId: String, name: String, salary: Long)
Sample records in the output CSV (that needs to be written):
recordId,name,designation
1,Aiden,Programmer
2,Tom,Web Developer
3,Jack,Manager
My output case class, which represents a record in the output CSV:
case class OutputRecord(recordId: String, name: String, designation: String)
Reading records using Akka Streams CSV (uses the Alpakka S3 connector, https://doc.akka.io/docs/alpakka/current/s3.html):
def readAsCSV: Future[Source[Map[String, ByteString], NotUsed]] =
  S3.download(s3Object.bucket, s3Object.path)
    .runWith(Sink.head)
    // This is then converted to CSV
Now I have a function to process the records:
def process(input: InputRecord): OutputRecord =
//if salary > avg(salary) then Manager
//else Programmer
Function to write the OutputRecord as csv
def writeOutput: Sink[ByteString, Future[MultipartUploadResult]] =
  S3.multipartUpload(s3Object.bucket,
    s3Object.path,
    metaHeaders = MetaHeaders(Map()))
Function to send email notification:
def notify : Flow[OutputRecord, PushResult, NotUsed]
//if notification is sent successfully PushResult has some additional info
Stitching it all together
readAsCSV.flatMap { recordSource =>
  recordSource.map { record =>
    val outputRecord = process(record)
    outputRecord
  }
  .via(notify)     //Error: Line 15
  .to(writeOutput) //Error: Line 16
  .run()
}
On Line 15 & 16 I am getting an error. I am able to add either Line 15 or Line 16 but not both, since both notify & writeOutput need the outputRecord. Once notify is called I lose my outputRecord.
Is there a way I can add both notify and writeOutput to same graph?
I am not looking for parallel execution, as I want to call notify first and only then writeOutput, so this is not helpful: https://doc.akka.io/docs/akka/current/stream/stream-parallelism.html#parallel-processing
The use case seems very simple to me, but somehow I am not able to find a clean solution.

The output of notify is a PushResult, but the input of writeOutput is ByteString. Once you align those types it will compile. If the sink needs ByteString, derive it from the OutputRecord after the notification step.
By the way, in the sample code you have provided, a similar type mismatch exists between readAsCSV and process.
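For illustration, a minimal sketch of that shape. It assumes a hypothetical pushNotification returning Future[PushResult], a hypothetical toInputRecord converter for the parsed CSV maps, and a simple toCsvLine formatter (none of these helpers are in the original post), plus an implicit ExecutionContext and Akka materializer/ActorSystem in scope:
// Notify first, but keep the OutputRecord so it can still be written out.
val notifyAndKeep: Flow[OutputRecord, OutputRecord, NotUsed] =
  Flow[OutputRecord].mapAsync(parallelism = 1) { record =>
    pushNotification(record).map(_ => record) // hypothetical async notify returning Future[PushResult]
  }

// Format an OutputRecord as a CSV line for the ByteString sink.
def toCsvLine(r: OutputRecord): ByteString =
  ByteString(s"${r.recordId},${r.name},${r.designation}\n")

readAsCSV.flatMap { recordSource =>
  recordSource
    .map(toInputRecord)  // hypothetical Map[String, ByteString] => InputRecord
    .map(process)
    .via(notifyAndKeep)  // notification happens strictly before the write
    .map(toCsvLine)
    .runWith(writeOutput)
}
Because notifyAndKeep re-emits the OutputRecord it was given, both the notification and the write see the same element, in that order.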

Related

Is there any way to use groupBy on a stream and send each substream to a different file?

For example, if I am parsing a log that starts with a server name and I want to split it into a file for each server, is there a way of doing that without knowing how many servers there are?
FileIO.fromPath(Paths.get("in.log"))
  .via(Framing.delimiter(ByteString("\n".getBytes), maximumFrameLength = 4000))
  .map(_.utf8String)
  .map(_.span(_ != ' '))   // (serverName, restOfLine)
  .groupBy(100, _._1)
This would result in substreams of (serverName, restOfLine) pairs, but I don't know if it's possible to connect each substream to a separate sink.
Do you need to further process the log lines? If not, you can just use a custom sink.
def writeLineByServerName(line: String): Unit = {
  val name = getServerNameFromLine(line)
  // either create a new output stream or get an existing one from
  // a pool. you may need to manage your resources to limit open
  // buffers
  val outputStream = getOutputStream(name)
  outputStream.write(line)
}

FileIO.fromPath(Paths.get("in.log"))
  .via(Framing.delimiter(ByteString("\n".getBytes), maximumFrameLength = 4000))
  .map(_.utf8String)
  // no span/groupBy needed here: writeLineByServerName extracts the name itself
  .to(Sink.foreach[String](writeLineByServerName))
Otherwise, you can read the file twice. The first read will figure out how many servers there are.
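If you want to stay inside Akka Streams, here is a hedged sketch of that two-pass idea. It assumes the server name is the first space-delimited token on each line, that an implicit ActorSystem/materializer and ExecutionContext are in scope, and that the file names are placeholders:
import java.nio.file.Paths
import akka.stream.scaladsl.{FileIO, Flow, Framing, Sink}
import akka.util.ByteString

val readLines =
  FileIO.fromPath(Paths.get("in.log"))
    .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 4000, allowTruncation = true))
    .map(_.utf8String)

// Pass 1: collect the distinct server names.
val serverNames =
  readLines
    .map(_.takeWhile(_ != ' '))
    .runWith(Sink.fold(Set.empty[String])(_ + _))

// Pass 2: a single read fanned out to one filtered file sink per server name.
serverNames.foreach { names =>
  names
    .foldLeft(readLines) { (src, name) =>
      src.alsoTo(
        Flow[String]
          .filter(_.startsWith(name + " "))
          .map(line => ByteString(line + "\n"))
          .to(FileIO.toPath(Paths.get(s"$name.log"))))
    }
    .runWith(Sink.ignore)
}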

SparkSQL performance issue with collect method

We are currently facing a performance issue in a Spark SQL application written in Scala. The application flow is described below.
The Spark application reads a text file from an input HDFS directory.
It creates a DataFrame on top of the file by programmatically specifying the schema. This DataFrame is an exact in-memory replica of the input file and has around 18 columns.
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema)
It creates a filtered DataFrame from the DataFrame constructed in step 2; this one contains only the unique account numbers, obtained with distinct.
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
Using the two DataFrames constructed in steps 2 & 3, we get all the records which belong to one account number and run some JSON parsing logic on top of the filtered data.
var filtrEqpDF =
eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
Finally, the JSON-parsed data is put into an HBase table.
Here we are facing performance issues when calling the collect method on the DataFrames, because collect fetches all the data to a single node and then does the processing, losing the benefit of parallel processing.
Also, in the real scenario we can expect around 10 billion records, so collecting all those records onto the driver node might crash the program itself due to memory or disk space limitations.
I don't think the take method, which fetches a limited number of records at a time, can be used in our case: we have to get all the unique account numbers from the whole data set.
I'd appreciate any help on avoiding the collect calls and on other best practices to follow. Code snippets/suggestions/git links will be very helpful if anyone has faced similar issues.
Code snippet
val eqpSchemaString = "accountnumber ....."
val eqpSchema = StructType(eqpSchemaString.split(" ").map(fieldName =>
  StructField(fieldName, StringType, true)))
val eqpRdd = sc.textFile(inputPath)
val eqpRowRdd = eqpRdd.map(_.split(",")).map(eqpRow => Row(eqpRow(0).trim, eqpRow(1).trim, ....))
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema)
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
distAccNrsDF.foreach { data =>
  var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
  var result = new JSONObject()
  result.put("jsonSchemaVersion", "1.0")
  val firstRowAcc = filtrEqpDF(0)
  //Json parsing logic
  {
    .....
    .....
  }
}
The approach usually taken in this kind of situation is:
Instead of collect, invoke foreachPartition: foreachPartition applies a function to each partition (represented by an Iterator[Row]) of the underlying DataFrame separately (the partition being Spark's atomic unit of parallelism).
The function opens a connection to HBase (thus one per partition) and sends all the contained values through this connection.
This means that every executor opens a connection (which is not serializable, but lives within the boundaries of the function and therefore never needs to be sent across the network) and independently sends its contents to HBase, without any need to collect all the data on the driver (or on any one node, for that matter).
It looks like you are reading a CSV file, so probably something like the following will do the trick:
spark.read.csv(inputPath). // Using DataFrameReader but your way works too
foreachPartition { rows =>
val conn = ??? // Create HBase connection
for (row <- rows) { // Loop over the iterator
val data = parseJson(row) // Your parsing logic
??? // Use 'conn' to save 'data'
}
}
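To make the HBase side of that sketch more concrete, here is a hedged version using the HBase 1.x+ client API; the table name, column family, row key choice, and the buildJson helper are all placeholders, not from the original post:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

eqpDF.rdd.foreachPartition { rows =>
  // One HBase connection per partition, created on the executor.
  val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("eqp_table")) // placeholder table name
  try {
    rows.foreach { row =>
      val accountNumber = row.getString(0)
      val json = buildJson(row) // placeholder for your JSON parsing logic
      val put = new Put(Bytes.toBytes(accountNumber)) // row key: account number (assumption)
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("json"), Bytes.toBytes(json))
      table.put(put)
    }
  } finally {
    table.close()
    conn.close()
  }
}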
You should avoid collect in your code if you have a large data set.
collect returns all the elements of the dataset as an array to the driver program. This is usually useful only after a filter or another operation that returns a sufficiently small subset of the data.
It can also cause the driver to run out of memory, because collect() fetches the entire RDD/DataFrame onto a single machine.
I have just edited your code, which should work for you:
var distAccNrsDF = eqpDF.select("accountnumber").distinct()
distAccNrsDF.foreach { data =>
  var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'")
  var result = new JSONObject()
  result.put("jsonSchemaVersion", "1.0")
  val firstRowAcc = filtrEqpDF(0)
  //Json parsing logic
  {
    .....
    .....
  }
}
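One caveat with referencing eqpDF inside that foreach is that the closure runs on the executors, where a second DataFrame cannot be used. A hedged alternative, assuming Spark 2.x (only the accountnumber column comes from the question; the rest is illustrative), is to group the rows per account number once and process each group on the executors:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, collect_list, struct}

// One shuffle groups every record for an account together, instead of
// filtering the full DataFrame once per account number.
val grouped = eqpDF
  .groupBy("accountnumber")
  .agg(collect_list(struct(eqpDF.columns.map(col): _*)).as("rows"))

grouped.rdd.foreachPartition { partition =>
  // open the HBase connection here, once per partition
  partition.foreach { accountRow =>
    val accountNumber = accountRow.getString(0)
    val rowsForAccount = accountRow.getSeq[Row](1)
    // ...JSON parsing logic over rowsForAccount, then write to HBase...
  }
}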

Using Gatling to loop through line by line in a file and send one message at a time to Kafka

I have a file that contains text something like this:
{"content_type":"Twitter","id":"77f985b0-a30a-11e5-8791-80000bc51f65","source_id":"676656486307639298","date":"2015-12-15T06:54:12.000Z","text":"RT #kokodeikku_bot: ?????: ??,}
{"content_type":"Twitter","id":"7837a020-a30a-11e5-8791-80000bc51f65","source_id":"676656494700568576",}
{"content_type":"Twitter","id":"7838d8a0-a30a-11e5-8791-80000bc51f65","source_id":"676656507266703360",}
I'm unable to read one line at a time and send it as a String to a Kafka topic within the scenario, since I can't iterate within a scenario in Gatling.
Here is my code
class KafkaSimulation extends Simulation {
  // one way: read the whole file at once into a single String
  val line = Source.fromFile(<passing locn of file>)("UTF-8").getLines.mkString("\n")

  // another way: read line by line with a BufferedReader
  val br = new BufferedReader(new FileReader("<passing locn of file>"))
  var currentLine: String = ""
  while ({ currentLine = br.readLine(); currentLine != null }) {
    // In this while loop I can print line by line, but I can't use a while loop within the scenario below
    println(currentLine)
  }
  val kafkaConf = kafka
    // Kafka topic name
    .topic("test")
    // Kafka producer configs
    .properties(
      Map(
        ProducerConfig.ACKS_CONFIG -> "1",
        // list of Kafka broker hostname and port pairs
        ProducerConfig.BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092",
        // Required since Apache Kafka 0.8.2.0
        ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG ->
          "org.apache.kafka.common.serialization.ByteArraySerializer",
        ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG ->
          "org.apache.kafka.common.serialization.ByteArraySerializer"))
  val scn = scenario("Kafka Test")
    .exec(kafka("request")
      // message to send
      .send(line.toString())) // If I use line.toString() here, it doesn't read line by line; it posts all 3 lines as one message

  setUp(
    scn.inject(constantUsersPerSec(10) during (1 seconds)))
    .protocols(kafkaConf)
}
Any tips for how I can iterate over the file and read it line by line in the scenario?
Turn your file into a one column CSV Feeder and use the standard Gatling way: feed a record, send your request, and repeat as much as you want.
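A hedged sketch of that idea; instead of an on-disk CSV you can also build the feeder in code, which avoids quoting the commas inside the JSON. The file path is a placeholder, and the kafka(...).send(...) call simply mirrors the plugin usage from the question:
import scala.io.Source

// One feeder record per line of the file (path is a placeholder).
val lineFeeder = Source.fromFile("tweets.txt", "UTF-8").getLines()
  .map(line => Map("line" -> line))
  .toArray
  .circular

val scn = scenario("Kafka Test")
  .feed(lineFeeder)
  .exec(kafka("request").send("${line}")) // note: a String payload needs a StringSerializer rather than ByteArraySerializer

setUp(scn.inject(constantUsersPerSec(10) during (1 seconds)))
  .protocols(kafkaConf)
Each virtual user then feeds one record and sends one message; inject as many users as you have lines, or wrap the feed/exec pair in a repeat loop.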
For this objective, the only thing you really need is to open the file and iterate over it line by line. @stephane's comment was a bit terse, but what he meant is:
Source
  .fromFile("files/yourtargetfile.txt")
  .getLines
  .map { line =>
    //do your stuff
  }
  .foreach(println)
Or, something simpler if you don't want to edit the contents of the file:
Source
  .fromFile("files/ChargeNames")
  .getLines
  .foreach { line =>
    //do your stuff
  }
I hope this helps,
Cheers.

Spark: run an external process in parallel

Is it possible with Spark to "wrap" and run an external process, managing its input and output?
The process is a normal C/C++ application that usually runs from the command line. It accepts a plain text file as input and generates another plain text file as output. As I need to integrate the flow of this application with something bigger (still in Spark), I was wondering if there is a way to do this.
The process can easily be run in parallel (at the moment I use GNU Parallel) just by splitting its input into (for example) 10 part files, running 10 instances of it in memory, and re-joining the final 10 output part files into one file.
The simplest thing you can do is to write a simple wrapper which takes data from standard input, writes it to a file, executes the external program, and outputs the results to standard output. After that, all you have to do is use the pipe method:
rdd.pipe("your_wrapper")
The only serious consideration is IO performance. If possible, it would be better to adjust the program you want to call so that it can read and write data directly without going through disk.
Alternatively, you can use mapPartitions combined with process and standard IO tools to write to a local file, call your program, and read the output.
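For illustration, a minimal sketch of the pipe route, assuming a hypothetical wrapper script /usr/local/bin/my_wrapper.sh is installed on every worker (it reads stdin, writes a temp file, runs the C/C++ tool on it, and prints the tool's output file to stdout); the HDFS paths are placeholders:
// Each partition is streamed to one instance of the wrapper via stdin,
// and every line the wrapper prints becomes an element of the result RDD.
val input  = sc.textFile("hdfs:///data/input.txt")
val output = input.pipe("/usr/local/bin/my_wrapper.sh")
output.saveAsTextFile("hdfs:///data/output")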
If you end up here based on the question title from a Google search, but you don't have the OP restriction that the external program needs to read from a file--i.e., if your external program can read from stdin--here is a solution. For my use case, I needed to call an external decryption program for each input file.
import org.apache.commons.io.IOUtils
import sys.process._
import scala.collection.mutable.ArrayBuffer
val showSampleRows = true
val bfRdd = sc.binaryFiles("/some/files/*,/more/files/*")
val rdd = bfRdd.flatMap{ case(file, pds) => { // pds is a PortableDataStream
val rows = new ArrayBuffer[Array[String]]()
var errors = List[String]()
val io = new ProcessIO (
in => { // "in" is an OutputStream; write the encrypted contents of the
// input file (pds) to this stream
IOUtils.copy(pds.open(), in) // open() returns a DataInputStream
in.close
},
out => { // "out" is an InputStream; read the decrypted data off this stream.
// Even though this runs in another thread, we can write to rows, since it
// is part of the closure for this function
for(line <- scala.io.Source.fromInputStream(out).getLines) {
// ...decode line here... for my data, it was pipe-delimited
rows += line.split('|')
}
out.close
},
err => { // "err" is an InputStream; read any errors off this stream
// errors is part of the closure for this function
errors = scala.io.Source.fromInputStream(err).getLines.toList
err.close
}
)
val cmd = List("/my/decryption/program", "--decrypt")
val exitValue = cmd.run(io).exitValue // blocks until subprocess finishes
println(s"-- Results for file $file:")
if (exitValue != 0) {
// TBD write to string accumulator instead, so driver can output errors
// string accumulator from #zero323: https://stackoverflow.com/a/31496694/215945
println(s"exit code: $exitValue")
errors.foreach(println)
} else {
// TBD, you'll probably want to move this code to the driver, otherwise
// unless you're using the shell, you won't see this output
// because it will be sent to stdout of the executor
println(s"row count: ${rows.size}")
if (showSampleRows) {
println("6 sample rows:")
rows.slice(0,6).foreach(row => println(" " + row.mkString("|")))
}
}
rows
}}
scala> :paste "test.scala"
Loading test.scala...
...
rdd: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[62] at flatMap at <console>:294
scala> rdd.count // action, causes Spark code to actually run
-- Results for file hdfs://path/to/encrypted/file1: // this file had errors
exit code: 255
ERROR: Error decrypting
my_decryption_program: Bad header data[0]
-- Results for file hdfs://path/to/encrypted/file2:
row count: 416638
sample rows:
<...first row shown here ...>
...
<...sixth row shown here ...>
...
res43: Long = 843039
References:
https://www.scala-lang.org/api/current/scala/sys/process/ProcessIO.html
https://alvinalexander.com/scala/how-to-use-closures-in-scala-fp-examples#using-closures-with-other-data-types

store (binary) file - play framework using scala in heroku

I'm trying to store user-uploaded images in my application, which is written in Scala with Play Framework 2.2.x.
I've deployed my app on Heroku.
Heroku does not allow me to save files on its file system, so I've tried to store them in the database.
Here is the code I use to store an image:
def updateImage(id: Long, image: Array[Byte]) = {
  val selected = getById(id)
  DB.withConnection { implicit c =>
    SQL("update subcategory set image={image} where id = {id}").on('id -> id, 'image -> image).executeUpdate()
  }
  selected
}
And here is the code I use to retrieve the image:
def getImageById(id: Long): Array[Byte] = DB.withConnection { implicit c =>
  val all = SQL("select image from subcategory where id = {id}").on('id -> id)().map {
    case Row(image: Array[Byte]) => image
    case Row(Some(image: Array[Byte])) => image
    case Row(image: java.sql.Blob) => image.getBytes(1, image.length().toInt) // Blob offsets are 1-based
  }
  all.head
}
The problem is: when I use an H2 database with a blob column, I get a MatchError exception.
When I use PostgreSQL with a bytea column, I get no error, but when I retrieve the image it's in hex format and some of the bytes at the beginning of the array are missing.
According to the PostgreSQL documentation, bytea stores the length of the array in the four bytes at the beginning of the array. These are stripped when you read the row, so that's why they seem to be "missing" when you compare the data in Scala with the data in the DB.
You will have to set the response's content-type to the appropriate value if you want the web browser to display the image correctly, as otherwise it does not know it is receiving image data. The Ok.sendFile helper does it for you. Otherwise you will have to do it by hand:
def getPicture = Action {
  SimpleResult(
    header = ResponseHeader(200),
    body = Enumerator(pictureByteArray))
    .as(pictureContentType)
}
In the example above, pictureByteArray is the Array[Byte] containing the picture data from your database, and pictureContentType is a string with the appropriate content type (for example, image/jpeg).
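A shorter equivalent in Play 2.2, assuming the same pictureByteArray and pictureContentType, is to rely on the built-in Writeable for Array[Byte]:
def getPicture = Action {
  // Play already knows how to write an Array[Byte] body; just set the content type.
  Ok(pictureByteArray).as(pictureContentType)
}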
This is all quite well explained in the Play documentation.