How can I print the values of a source to the console?
val someSource = Source.single(ByteString("SomeValue"))
I want to print the String "SomeValue" from this source. I tried:
someSource.to(Sink.foreach(println)) // This one prints a RunnableGraph object
someSource.map(each => {
val pqr = each.decodeString(ByteString.UTF_8)
print(pqr)
}) // This one prints res3: someSource.Repr[Unit] = Source(SourceShape(Map.out(169373838)))
How do I print the original string that was used to create this single-element Source?
From what is written in the question, I think you are probably using the Scala console (REPL) or a Scala worksheet.
The Scala console or worksheet prints a representation of the value created by the current statement. For example,
scala> val i = 5
val i: Int = 5
scala> val s = "ssfdf"
val s: String = ssfdf
But, what happens when you use something like a println here,
scala> val u = println("dfsd")
dfsd
val u: Unit = ()
It also executes the println, and afterwards prints that the value u created by that statement is actually a Unit.
And that is probably where your confusion is coming from, because your println inside Sink.foreach does not run in this case.
That is because this case is more like the following, where you are actually defining a function.
scala> val f1 = (s: String) => println(s)
val f1: String => Unit = $Lambda$1062/0x0000000800689840@1796b2d4
You are not calling println here; you are just defining a function (an instance of String => Unit, i.e. Function1[String, Unit]) that will use println.
So the console just prints that the value f1 created here is of type String => Unit.
You will need to call this function to actually execute that println,
scala> f1.apply("dsfsd")
dsfsd
Similarly, someSource.to(Sink.foreach(println)) will create a value of type RunnableGraph, hence the Scala console will print something like val res0: RunnableGraph....
You now need to run this graph to actually execute it.
But unlike the function example earlier, the execution of the graph happens asynchronously on a thread pool, which means that it might not work in some versions of the Scala console or worksheet (depending on how the thread pool lifecycle is managed). So, if you just do,
scala> val someSource = Source.single(ByteString("SomeValue"))
val someSource: akka.stream.scaladsl.Source[akka.util.ByteString,akka.NotUsed] = Source(SourceShape(single.out(369296388)))
scala> val runnableGraph = someSource.to(Sink.foreach(println))
val runnableGraph: akka.stream.scaladsl.RunnableGraph[akka.NotUsed] = RunnableGraph
scala> runnableGraph.run()
If it works, then you will see the following:
scala> runnableGraph.run()
val res0: akka.NotUsed = NotUsed
ByteString(83, 111, 109, 101, 86, 97, 108, 117, 101)
But chances are that you will just see some errors related to the console failing to complete the graph run for some reason.
You will actually need to materialize the Sink, which will result in a Future[Done] when the graph is run. Then you will have to wait on that Future[Done] using Await.
You will have to put all this into a normal Scala file and execute it as a Scala application.
import akka.Done
import akka.actor.typed.ActorSystem
import akka.actor.typed.scaladsl.Behaviors
import akka.stream.scaladsl.{Keep, Sink, Source}
import akka.util.ByteString

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

object TestAkkaStream extends App {
  val actorSystem = ActorSystem(Behaviors.empty, "test-stream-system")
  // Akka Streams needs a materializer; making the classic system implicit provides one.
  implicit val classicActorSystem = actorSystem.classicSystem

  val someSource = Source.single(ByteString("SomeValue"))

  // Keep.right keeps the Sink's materialized value (a Future[Done]) instead of the Source's NotUsed.
  val runnableGraph = someSource.toMat(Sink.foreach(println))(Keep.right)

  val graphRunDoneFuture: Future[Done] = runnableGraph.run()
  // Block until the stream has finished printing.
  Await.result(graphRunDoneFuture, Duration.Inf)
}
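As a side note, if you do not need to hold on to the RunnableGraph value, a slightly shorter variant (just a sketch under the same setup as the app above) is to use runWith, which runs the stream and keeps the Sink's materialized Future[Done] directly. Decoding the ByteString there also prints the original "SomeValue" instead of the raw byte listing:
// runWith materializes the Sink's value, so no toMat/Keep.right is needed.
val done: Future[Done] = someSource.runWith(Sink.foreach(bs => println(bs.utf8String)))
Await.result(done, Duration.Inf)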
Related
I am new to Scala and I have some questions about how it works.
I want to do the following: given a list of values, I want to construct some imitation of a dictionary in parallel, something like this: (1,2,3,4) -> ((1,1), (2,2), (3,3), (4,4)). I know that if we deal with parallelized collections we should use accumulators. So here is my attempt:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable.ListBuffer
class DictAccumulatorV2 extends AccumulatorV2[Int, ListBuffer[(Int, Int)]] {
  private var dict: ListBuffer[(Int, Int)] = new ListBuffer[(Int, Int)]

  def reset(): Unit = {
    dict.clear()
  }

  def add(v: Int): Unit = {
    dict.append((v, v))
  }

  def value(): ListBuffer[(Int, Int)] = {
    return dict
  }

  def isZero(): Boolean = {
    return dict.isEmpty
  }

  def copy(): AccumulatorV2[Int, ListBuffer[(Int, Int)]] = {
    // I do not understand how to code it correctly
    return new DictAccumulatorV2
  }

  def merge(other: AccumulatorV2[Int, ListBuffer[(Int, Int)]]): Unit = {
    // I do not understand how to code it correctly without reinitializing dict from val to var
    dict = dict ++ other.value
  }
}

object FirstSparkApplication {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyFirstApp").setMaster("local")
    val sc = new SparkContext(conf)

    val accum = new DictAccumulatorV2()
    sc.register(accum, "mydictacc")

    val data = Array(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data)
    var res = distData.map(x => accum.add(x))
    res.count()
    println(accum)
  }
}
So I wonder whether I am doing it right or whether there are any mistakes.
In general, I also have questions about how sc.parallelize works. Does it actually parallelize the job on my machine, or is it just a fictional line of code? What should I put instead of "local" in setMaster? How can I see which nodes the task is being performed on? Is the task performed on all of the nodes at the same time, or is there some sequence?
(1,2,3,4) -> ((1,1), (2,2), (3,3), (4,4) )
You can do this in Scala by doing
val list = List(1,2,3,4)
val dict = list.map(i => (i,i))
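which gives dict == List((1,1), (2,2), (3,3), (4,4)), i.e. exactly the (1,2,3,4) -> ((1,1), (2,2), (3,3), (4,4)) shape you described.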
Spark accumulators are used as a means of communication from the Spark executors to the driver.
If you want to do the above in parallel, you would construct an RDD out of this list and apply a map transformation to it, as shown above.
In spark shell it would look like
val list = List(1,2,3,4)
val listRDD = sc.parallelize(list)
val dictRDD = listRDD.map(i => (i,i))
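If you then want the result back on the driver as a local collection, collect it (a usage sketch of the standard RDD API):
val dict = dictRDD.collect() // Array((1,1), (2,2), (3,3), (4,4))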
how sc.parallelize works
It creates a distributed dataset (an RDD, in Spark terms) from the collection that you pass in to the function (see the Spark documentation on parallelized collections for more information).
It does parallelize your job.
If you are submitting your Spark job to a cluster, then you should be able to see a YARN application ID or URL after running the spark-submit command. You can visit the YARN application URL and see how many executors are processing that distributed dataset and in what sequence the tasks are performed.
What should I put instead of "local" in setMaster
From the Spark documentation -
The master URL to connect to, such as "local" to run locally with one thread, "local[4]" to run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.
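As an illustration, here is the same SparkConf from your code with a multi-threaded local master (the 4 is just an example value):
// Run locally with 4 threads; swap in something like "spark://host:7077" for a standalone cluster.
val conf = new SparkConf()
  .setAppName("MyFirstApp")
  .setMaster("local[4]")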
I'm trying to wrap some blocking calls in a Future. The return type is Seq[User], where User is a case class. The following just wouldn't compile, with complaints about various overloaded versions being present. Any suggestions? I tried almost all the variations of Source.apply without any luck.
// All I want is Seq[User] => Future[Seq[User]]
def findByFirstName(firstName: String) = {
  val users: Seq[User] = userRepository.findByFirstName(firstName)
  val sink = Sink.fold[User, User](null)((_, elem) => elem)
  val src = Source(users) // doesn't compile
  src.runWith(sink)
}
First of all, I assume that you are using version 1.0 of akka-http-experimental, since the API may have changed from previous releases.
The reason your code does not compile is that akka.stream.scaladsl.Source$.apply() requires a
scala.collection.immutable.Seq instead of a scala.collection.mutable.Seq.
Therefore you have to convert the mutable sequence to an immutable one using the to[T] method.
Document: akka.stream.scaladsl.Source
Additionally, as you can see in the documentation, Source$.apply() also accepts () => Iterator[T], so you can pass () => users.iterator as an argument instead.
Since Sink.fold(...) returns the last evaluated expression, you can give an empty Seq() as the first argument, iterate over the users appending each element to the sequence, and finally get the result.
However, there might be a better solution that creates a Sink which puts each element into a Seq directly, but I could not find it.
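(For what it's worth, later Akka Streams releases appear to provide exactly such a sink, Sink.seq, which materializes a Future[Seq[T]], but I will stick to Sink.fold here.)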
The following code works.
import akka.actor._
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Source, Sink}

import scala.concurrent.ExecutionContext.Implicits.global

case class User(name: String)

object Main extends App {
  implicit val system = ActorSystem("MyActorSystem")
  implicit val materializer = ActorMaterializer()

  val users = Seq(User("alice"), User("bob"), User("charlie"))

  val sink = Sink.fold[Seq[User], User](Seq())(
    (seq, elem) =>
      { println(s"elem => ${elem} \t| seq => ${seq}"); seq :+ elem })

  val src = Source(users.to[scala.collection.immutable.Seq])
  // val src = Source(() => users.iterator) // this also works

  val fut = src.runWith(sink) // Future[Seq[User]]

  fut.onSuccess({
    case x => {
      println(s"result => ${x}")
    }
  })
}
The output of the code above is
elem => User(alice) | seq => List()
elem => User(bob) | seq => List(User(alice))
elem => User(charlie) | seq => List(User(alice), User(bob))
result => List(User(alice), User(bob), User(charlie))
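Applied back to your findByFirstName, it would look roughly like this (just a sketch; it assumes your userRepository plus the implicit ActorSystem/ActorMaterializer from the example above and scala.concurrent.Future in scope):
// Streams the repository result through the fold sink and yields a Future[Seq[User]].
def findByFirstName(firstName: String): Future[Seq[User]] = {
  val users: Seq[User] = userRepository.findByFirstName(firstName)
  Source(users.to[scala.collection.immutable.Seq])
    .runWith(Sink.fold[Seq[User], User](Seq())((seq, elem) => seq :+ elem))
}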
If you just need a Future[Seq[User]], don't use Akka Streams but plain futures:
import scala.concurrent._
import ExecutionContext.Implicits.global
val session = socialNetwork.createSessionFor("user", credentials)
val f: Future[List[Friend]] = Future {
  session.getFriends()
}
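Adapted to your findByFirstName, that would be something like the following (a sketch assuming your userRepository and the global ExecutionContext):
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Runs the blocking repository call on the execution context's thread pool.
def findByFirstName(firstName: String): Future[Seq[User]] =
  Future { userRepository.findByFirstName(firstName) }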
I would like to be able to send data from a scalaz stream into an external program and then get the result of that item back in about 100ms in the future. Although I was able to do this with the code below by zipping the output stream Sink with the input stream Process and then throwing away the Sink side effect, I feel like this solution may be very brittle.
If the external program has an error for one of the input items everything will be out of sync. I feel like the best bet would be to send some sort of incremental ID into the external program which it can echo back out in the future so that if an error occurs we can resync.
The main trouble I am having is joining together the result of sending data into the external program (Process[Task, Unit]) with the output of the program (Process[Task, String]). I feel like I should be using something from wye, but I am not really sure.
import java.io.PrintStream

import scalaz._
import scalaz.concurrent.Task
import scalaz.stream.Process._
import scalaz.stream._

object Main extends App {
  /*
  # echo.sh just prints to stdout what it gets on stdin
  while read line; do
    sleep 0.1
    echo $line
  done
  */
  val p: java.lang.Process = Runtime.getRuntime.exec("/path/to/echo.sh")

  val source: Process[Task, String] = Process.repeatEval(Task {
    Thread.sleep(1000)
    System.currentTimeMillis().toString
  })

  val linesR: stream.Process[Task, String] = stream.io.linesR(p.getInputStream)
  val printLines: Sink[Task, String] = stream.io.printLines(new PrintStream(p.getOutputStream))

  val in: Process[Task, Unit] = source to printLines
  val zip: Process[Task, (Unit, String)] = in.zip(linesR)
  val out: Process[Task, String] = zip.map(_._2) observe stream.io.stdOutLines

  out.run.run
}
After delving a little deeper into the more advanced types, it looks like Exchange does exactly what I want.
import java.io.PrintStream

import scalaz._
import scalaz.concurrent.Task
import scalaz.stream._
import scalaz.stream.io._

object Main extends App {
  /*
  # echo.sh just prints to stdout what it gets on stdin
  while read line; do
    sleep 0.1
    echo $line
  done
  */
  val program: java.lang.Process = Runtime.getRuntime.exec("./echo.sh")

  val source: Process[Task, String] = Process.repeatEval(Task {
    Thread.sleep(100)
    System.currentTimeMillis().toString
  })

  val read: stream.Process[Task, String] = linesR(program.getInputStream)
  val write: Sink[Task, String] = printLines(new PrintStream(program.getOutputStream))

  val exchange: Exchange[String, String] = Exchange(read, write)

  println(exchange.run(source).take(10).runLog.run)
}
I think I understand the rules of implicit returns but I can't figure out why splithead is not being set. This code is run via
val m = new TaxiModel(sc, file)
and then I expect
m.splithead
to give me an array of strings. Note that head is an array of strings.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class TaxiModel(sc: SparkContext, dat: String) {
  val rawData = sc.textFile(dat)
  val head = rawData.take(10)
  val splithead = head.slice(1, 11).foreach(splitData)

  def splitData(dat: String): Array[String] = {
    val splits = dat.split("\",\"")
    val split0 = splits(0).substring(1, splits(0).length)
    val split8 = splits(8).substring(0, splits(8).length - 1)
    Array(split0).union(splits.slice(1, 8)).union(Array(split8))
  }
}
foreach just evaluates the expression for its side effects and does not collect any results while iterating. You probably need map or flatMap (see the Scala collection docs):
head.slice(1,11).map(splitData) // gives you Array[Array[String]]
head.slice(1,11).flatMap(splitData) // gives you Array[String]
Consider also a for comprehension (which in this case desugars into map),
for (s <- head.slice(1,11)) yield splitData(s)
Note also that Scala strings are equipped with ordered collection methods, thus
splits(0).substring(1, splits(0).length)
proves equivalent to any of the following
splits(0).drop(1)
splits(0).tail
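Putting those together, splitData could be written more concisely as follows (just a sketch of the same logic; dropRight trims the trailing character the same way drop trims the leading one):
def splitData(dat: String): Array[String] = {
  val splits = dat.split("\",\"")
  // Drop the leading quote of the first field and the trailing quote of the last field.
  Array(splits(0).drop(1)) ++ splits.slice(1, 8) ++ Array(splits(8).dropRight(1))
}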
I am attempting to asynchronously write messages to Amazon Kinesis using Scala Futures so I can load test an application.
This code works, and I can see data moving down my pipeline as well as the output printing to the console.
import com.amazonaws.services.kinesis.AmazonKinesisClient

import java.nio.CharBuffer
import java.nio.charset.Charset
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

object KinesisDummyDataProducer extends App {
  val kinesis = new AmazonKinesisClient(PipelineConfig.awsCredentials)
  println("Connected")

  lazy val encoder = Charset.forName("UTF-8").newEncoder()
  lazy val tz = TimeZone.getTimeZone("UTC")
  lazy val df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'Z")
  df.setTimeZone(tz)

  (1 to args(0).toInt).map(int => send(int)).map(msg => println(msg))

  private def send(int: Int) = {
    val msg = "{\"event_name\":\"test\",\"timestamp\":\"%s\",\"int\":%s}".format(df.format(new Date()), int.toString)
    val bytes = encoder.encode(CharBuffer.wrap(msg))
    encoder.flush(bytes)
    kinesis.putRecord("PrimaryEventStream", bytes, "123")
    msg
  }
}
This code works with Scala Futures.
import scala.concurrent.future
import scala.concurrent.ExecutionContext.Implicits.global
def doIt(x: Int) = {Thread.sleep(1000); x + 1}
(1 to 10).map(x => future{doIt(x)}).map(y => y.onSuccess({case x => println(x)}))
You'll note that the syntax is nearly identical for the mapping of sequences. However, the following does not work (i.e., it neither prints to the console nor sends data down my pipeline).
import com.amazonaws.services.kinesis.AmazonKinesisClient

import java.nio.CharBuffer
import java.nio.charset.Charset
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

import scala.concurrent.future
import scala.concurrent.ExecutionContext.Implicits.global

object KinesisDummyDataProducer extends App {
  val kinesis = new AmazonKinesisClient(PipelineConfig.awsCredentials)
  println("Connected")

  lazy val encoder = Charset.forName("UTF-8").newEncoder()
  lazy val tz = TimeZone.getTimeZone("UTC")
  lazy val df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'Z")
  df.setTimeZone(tz)

  (1 to args(0).toInt).map(int => future { send(int) }).map(f => f.onSuccess({ case msg => println(msg) }))

  private def send(int: Int) = {
    val msg = "{\"event_name\":\"test\",\"timestamp\":\"%s\",\"int\":%s}".format(df.format(new Date()), int.toString)
    val bytes = encoder.encode(CharBuffer.wrap(msg))
    encoder.flush(bytes)
    kinesis.putRecord("PrimaryEventStream", bytes, "123")
    msg
  }
}
Some more notes about this project: I am using Maven to do the build (from the command line), and running all of the above code (also from the command line) works just dandy.
My question is: why, when using the same syntax, does my function send appear not to be executing?