I'm coding a small Akka Streams sample where I want to write elements of a List to a local TXT file
implicit val ec = context.dispatcher
implicit val actorSystem = context.system
implicit val materializer = ActorMaterializer()
val source = Source(List("a", "b", "c"))
.map(char => ByteString(s"${char} \n"))
val runnableGraph = source.toMat(FileIO.toPath(Paths.get("~/Downloads/results.txt")))(Keep.right)
runnableGraph.run()
The file already exists at the location I set in the code.
I do not terminate the actor system, so it definitely has enough time to write all of the List's elements to the file.
But unfortunately, nothing happens.
Use the expanded path to your home directory instead of the tilde (~). For example:
val runnableGraph =
source.toMat(
FileIO.toPath(Paths.get("/home/YourUserName/Downloads/results.txt")))(Keep.right)
runnableGraph.run()
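Alternatively, here is a sketch (assuming the user.home system property points at your home directory) that resolves the path programmatically and inspects the materialized IOResult to confirm the write:

val home = System.getProperty("user.home")
val runnableGraph =
  source.toMat(FileIO.toPath(Paths.get(home, "Downloads", "results.txt")))(Keep.right)

// The materialized Future[IOResult] reports how many bytes were written.
// onComplete uses the implicit ExecutionContext from your snippet.
runnableGraph.run().onComplete {
  case scala.util.Success(result) => println(s"Wrote ${result.count} bytes")
  case scala.util.Failure(ex)     => println(s"Writing failed: $ex")
}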
I am trying to save my pair RDD in Spark Streaming but am getting an error at the final save step.
Here is my sample code:
def main(args: Array[String]) {
  val inputPath = args(0)
  val output = args(1)
  val noOfHashPartitioner = args(2).toInt

  println("IN Streaming ")
  val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val hadoopConf = sc.hadoopConfiguration
  //hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  val ssc = new org.apache.spark.streaming.StreamingContext(sc, Seconds(60))
  val input = ssc.textFileStream(inputPath)

  val pairedRDD = input.map(row => {
    val split = row.split("\\|")
    val fileName = split(0)
    val fileContent = split(1)
    (fileName, fileContent)
  })

  import org.apache.hadoop.io.NullWritable
  import org.apache.spark.HashPartitioner
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

  class RddMultiTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
    override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
    override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.asInstanceOf[String]
  }

  //print(pairedRDD)
  pairedRDD.partitionBy(new HashPartitioner(noOfHashPartitioner))
    .saveAsHadoopFile(output, classOf[String], classOf[String], classOf[RddMultiTextOutputFormat], classOf[GzipCodec])

  ssc.start()            // Start the computation
  ssc.awaitTermination() // Wait for the computation to terminate
}
I am getting this error at the last step, while saving. I am new to Spark Streaming, so I must be missing something here.
The error is:
value partitionBy is not a member of
org.apache.spark.streaming.dstream.DStream[(String, String)]
Please help
pairedRDD is of type DStream[(String, String)] not RDD[(String,String)]. The method partitionBy is not available on DStreams.
Maybe look into foreachRDD which should be available on DStreams.
EDIT: A bit more context: textFileStream sets up a directory watch on the specified path and streams the contents of any new files that appear there; that is where the streaming aspect comes from. Is that what you want, or do you just want to read the contents of the directory "as is" once? In that case something like SparkContext.wholeTextFiles returns a non-streaming RDD of (fileName, fileContent) pairs instead.
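For example, a rough sketch of the foreachRDD approach, reusing pairedRDD, noOfHashPartitioner, output and RddMultiTextOutputFormat from your code (the per-batch output directory is just an illustration):

pairedRDD.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // partitionBy and saveAsHadoopFile are available on the underlying RDD of each micro-batch
    rdd.partitionBy(new HashPartitioner(noOfHashPartitioner))
      .saveAsHadoopFile(
        output + "/" + System.currentTimeMillis(), // illustrative: one directory per batch
        classOf[String],
        classOf[String],
        classOf[RddMultiTextOutputFormat],
        classOf[GzipCodec])
  }
}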
I have the following snippet that reads a CSV file and just prints something to the console:
def readUsingAkkaStreams = {
  import java.io.File
  import akka.stream.scaladsl._
  import akka.actor.ActorSystem
  import akka.stream.ActorMaterializer
  import java.security.MessageDigest

  implicit val system = ActorSystem("Sys")
  implicit val materializer = ActorMaterializer()

  val file = new File("/path/to/csv/file.csv")
  val fileSource = FileIO.fromFile(file, 65536)
  val flow = fileSource.map(chunk => chunk.utf8String)
  flow.to(Sink.foreach(println(_))).run
}
I now have some questions around this:
The chunk size is the size in bytes. How is it handled internally? I mean, could I end up in a situation where a chunk contains only part of a line?
How does this stream terminate? Right now it does not! I want to know when it has read the file completely so that it can trigger a stop signal. Is there a mechanism to do this?
EDIT 1: After suggestions from the post below, I get an error message as shown in the screenshot!
EDIT 2:
Managed to get rid of the error by setting maximumFrameLength to match the maximum chunk size, which is 65536:
val file = new File("/path/to/csf/file.csv")
val chunkSize = 65536
val fileSource = FileIO.fromFile(file, chunkSize).via(Framing.delimiter(
ByteString("\n"),
maximumFrameLength = chunkSize,
allowTruncation = true))
1. As per the docs:
Emitted elements are chunkSize sized ByteString elements, except the final element, which will be up to chunkSize in size.
The FileIO source treats newlines like any other character. So yes, you could see the first part of a CSV line in one chunk and the second part in another chunk. If this is not what you want, you can restructure how your ByteString flow is chunked by using Framing.delimiter (see the docs for more info).
As a side note, FileIO.fromFile has been deprecated; use FileIO.fromPath instead.
An example would be:
val fileSource = FileIO.fromPath(...)
.via(Framing.delimiter(
ByteString("\n"),
maximumFrameLength = 256,
allowTruncation = true))
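Adding a .map(_.utf8String) after the framing stage then gives you one complete line per element. A sketch, reusing the file and chunkSize values from your edit:

val lineSource = FileIO.fromPath(file.toPath, chunkSize)
  .via(Framing.delimiter(
    ByteString("\n"),
    maximumFrameLength = chunkSize,
    allowTruncation = true))
  .map(_.utf8String) // each element is now one complete line as a String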
2. The Sink.foreach sink materializes to a Future[Done] that you can hook into to do something when the stream terminates:
val result: Future[Done] = flow.runWith(Sink.foreach(println(_)))
result.onComplete(...)
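For example, one possible callback (assuming you simply want to shut down the actor system once the file has been read):

import system.dispatcher // ExecutionContext for the callback

result.onComplete { _ =>
  system.terminate()
}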
Is it possible to write tweets in real time into a relational database (like MySQL or PostgreSQL) in Scala without using Spark? Instead, I would like to use the streaming module of Akka.
I would really appreciate some resources and advice on this topic.
EDIT: I have tried to connect to the Twitter API, and I used Twitter4J to get some tweets and filter them. Now I am at the stage where I want to write the extracted tweets into the database instead of to the console.
class Counter extends StatusAdapter {
  implicit val system = ActorSystem("EmojiTrends")
  implicit val materializer = ActorMaterializer()
  implicit val executionContext = system.dispatcher
  implicit val loggingAdapter = Logging(system, classOf[Counter])

  val overflowStrategy = OverflowStrategy.backpressure
  val bufferSize = 1000

  val statusSource = Source.queue[Status](bufferSize, overflowStrategy)
  val sink: Sink[Any, Future[Done]] = Sink.foreach(println)
  val flow: Flow[Status, String, NotUsed] = Flow[Status].map(status => status.getText)

  val graph = statusSource via flow to sink
  val queue = graph.run()

  override def onStatus(status: Status) =
    Await.result(queue.offer(status), Duration.Inf)
}
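One direction I am considering is replacing the println sink with a sink that issues an INSERT per tweet, roughly like this (just a sketch; the JDBC URL, credentials, and tweets table are placeholders, and opening a connection per element is not production-grade):

import java.sql.DriverManager

val dbSink: Sink[String, Future[Done]] = Sink.foreach[String] { text =>
  // Placeholder connection details; a pooled DataSource would be better in real code.
  val conn = DriverManager.getConnection("jdbc:postgresql://localhost/twitter", "user", "password")
  try {
    val stmt = conn.prepareStatement("INSERT INTO tweets (text) VALUES (?)")
    stmt.setString(1, text)
    stmt.executeUpdate()
  } finally conn.close()
}

val graph = statusSource via flow to dbSink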
I have the following code sample that works fine. I want to make some changes to preserve the relation between request and response. How can I achieve that?
The REST API flow's materialized value is NotUsed. Is it possible to somehow use Keep.both for that?
// this flow is provided by some third party library that I can't change in place
val someRestApiFlow: Flow[Int, Int, NotUsed] = Flow[Int].mapAsync(10)(x => Future(x + 1))
val digits: Source[Int, NotUsed] = Source(List(1, 2, 3))
val r = digits.via(someRestApiFlow).runForeach(println)
The result is:
2
3
4
I want the result to be like:
1 -> 2
2 -> 3
3 -> 4
You can use a broadcast element to create 2 separate flows. The first output of broadcast goes through someRestApiFlow, the second output of the broadcast goes unmodified. You then zip the output of the someRestApiFlow with the second output of the broadcast flow. Doing that, you have both the input element and the result of its transformation through someRestApiFlow.
digits ---> broadcast --> someRestApiFlow ---> zip --> result
\----------------------/
I have also encountered this kind of case a couple of times. The only solution I have found is creating a graph with the GraphDSL and making use of broadcast and zip stages.
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, RunnableGraph, Sink, Source, Zip}
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.{Done, NotUsed}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
object Main extends App {
  implicit val system: ActorSystem = ActorSystem("my-system")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
    import GraphDSL.Implicits._

    val src: Source[Int, NotUsed] = Source(List(1, 2, 3))
    val someRestApiFlow: Flow[Int, Int, NotUsed] = Flow[Int].mapAsync(10)(x => Future(x + 1))
    val out: Sink[(Int, Int), Future[Done]] = Sink.foreach[(Int, Int)](println)

    val bcast = builder.add(Broadcast[Int](2))
    val zip = builder.add(Zip[Int, Int]())

    src ~> bcast ~> zip.in0
           bcast ~> someRestApiFlow ~> zip.in1
    zip.out ~> out

    ClosedShape
  })

  graph.run()
}
What is being done here is that we broadcast the input to both the zip stage and the custom flow; the zip waits for the result of the custom flow and finally combines each input with its result before sending the pair to the sink.
I'm reading a csv file. I am using Akka Streams to do this so that I can create a graph of actions to perform on each line. I've got the following toy example up and running.
def main(args: Array[String]): Unit = {
  implicit val system = ActorSystem("MyAkkaSystem")
  implicit val materializer = ActorMaterializer()

  val source = akka.stream.scaladsl.Source.fromIterator(() => Source.fromFile("a.csv").getLines)
  val sink = Sink.foreach(println)
  source.runWith(sink)
}
The two Source types don't sit easily with me. Is this idiomatic, or is there a better way to write this?
Actually, Akka Streams provides a way to read directly from a file:
FileIO.fromPath(Paths.get("a.csv"))
.via(Framing.delimiter(ByteString("\n"), 256, true).map(_.utf8String))
.runForeach(println)
Here, the runForeach method prints the lines. If you have a proper Sink to process these lines, use it instead. For example, if you want to split each line by ',' and print the number of fields in it:
val sink: Sink[String, Future[Done]] = Sink.foreach[String](x => println(x.split(",").size))
FileIO.fromPath(Paths.get("a.csv"))
.via(Framing.delimiter(ByteString("\n"), 256, true).map(_.utf8String))
.to(sink)
.run()
The idiomatic way to read a CSV file with Akka Streams is to use the Alpakka CSV connector. The following example reads a CSV file, converts each line into a map from column names (taken from the first line of the file) to ByteString values, transforms the ByteString values to String values, and prints each line:
FileIO.fromPath(Paths.get("a.csv"))
.via(CsvParsing.lineScanner())
.via(CsvToMap.toMap())
.map(_.mapValues(_.utf8String))
.runForeach(println)
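This assumes the Alpakka CSV connector (the akka-stream-alpakka-csv artifact) is on the classpath, together with imports along these lines:

import akka.stream.alpakka.csv.scaladsl.{CsvParsing, CsvToMap}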
Try this:
import java.nio.file.Paths
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl._
import akka.util.ByteString
import scala.concurrent.Await
import scala.concurrent.duration._
object ReadStreamApp extends App {
  implicit val actorSystem = ActorSystem()
  import actorSystem.dispatcher
  implicit val flowMaterializer = ActorMaterializer()

  val logFile = Paths.get("src/main/resources/a.csv")

  val source = FileIO.fromPath(logFile)
  val flow = Framing
    .delimiter(ByteString(System.lineSeparator()), maximumFrameLength = 512, allowTruncation = true)
    .map(_.utf8String)
  val sink = Sink.foreach(println)

  source
    .via(flow)
    .runWith(sink)
    .andThen {
      case _ =>
        actorSystem.terminate()
        Await.ready(actorSystem.whenTerminated, 1 minute)
    }
}
Yeah, it's OK because these are different Sources. But if you don't like scala.io.Source, you can read the file yourself (which is sometimes necessary, e.g. when the source CSV file is zipped) and then parse it from the resulting InputStream, like this:
StreamConverters.fromInputStream(() => input)
.via(Framing.delimiter(ByteString("\n"), 4096))
.map(_.utf8String)
.collect {
  case line => line // plug per-line processing in here
}
Having said that, consider using Apache Commons CSV together with akka-stream. You may end up writing less code :)
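For example, a rough sketch of that combination (assuming the commons-csv dependency is available; the column index is just an illustration):

import java.nio.charset.StandardCharsets
import java.nio.file.Paths
import org.apache.commons.csv.{CSVFormat, CSVParser}
import scala.collection.JavaConverters._

val parser = CSVParser.parse(
  Paths.get("a.csv").toFile,
  StandardCharsets.UTF_8,
  CSVFormat.DEFAULT)

// Feed the parsed records into an Akka Streams Source (implicit materializer assumed).
akka.stream.scaladsl.Source(parser.getRecords.asScala.toList)
  .map(record => record.get(0)) // e.g. take the first column of each record
  .runForeach(println)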