With the code below I read and print the contents of a file using Akka Streams:
package playground
import java.nio.file.Paths
import akka.actor.ActorSystem
import akka.stream.scaladsl.{FileIO, Framing, Sink, Source}
import akka.util.ByteString
import akka.stream.ActorMaterializer
object Greeter extends App {
implicit val system = ActorSystem("map-management-service")
implicit val materializer = ActorMaterializer()
FileIO.fromPath(Paths.get("a.csv"))
.via(Framing.delimiter(ByteString("\n"), 256, true).map(_.utf8String)).runForeach(println)
}
My understanding of Akka Streams is that if the file changes or is updated, the processing code (in this case println) fires, so each time the file is updated the entire file is re-read. But this is not occurring: the file is read only once.
How should this be modified so that each time the file a.csv is updated, the file is re-read and the println code is re-executed?
Alpakka's DirectoryChangesSource could fit your use case. For example:
import akka.stream.alpakka.file.DirectoryChange
import akka.stream.alpakka.file.scaladsl.DirectoryChangesSource
implicit val system = ActorSystem("map-management-service")
implicit val materializer = ActorMaterializer()
val myFile = Paths.get("a.csv")
val changes = DirectoryChangesSource(Paths.get("."), pollInterval = 3.seconds, maxBufferSize = 1000)
changes
.filter {
case (path, dirChange) =>
path.endsWith(myFile) && (dirChange == DirectoryChange.Creation || dirChange == DirectoryChange.Modification)
}
.flatMapConcat(_ => FileIO.fromPath(myFile).via(Framing.delimiter(ByteString("\n"), 256, true)))
.map(_.utf8String)
.runForeach(println)
The above snippet prints the file contents when the file is created and whenever it is modified, polling at three-second intervals.
I'd like to expand on Jeffrey's answer with a fully runnable Ammonite script:
import $ivy.`com.lightbend.akka::akka-stream-alpakka-file:1.1.1`
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{ FileIO, Framing }
import akka.stream.alpakka.file.DirectoryChange
import akka.stream.alpakka.file.scaladsl.DirectoryChangesSource
import akka.util.ByteString
import java.nio.file.Paths
import scala.concurrent.duration._
implicit val system = ActorSystem("map-management-service")
implicit val materializer = ActorMaterializer()
val myFile = Paths.get("a.csv")
val changes = DirectoryChangesSource(Paths.get("."), pollInterval = 3.seconds, maxBufferSize = 1000)
changes
.filter {
case (path, dirChange) =>
path.endsWith(myFile) && (dirChange == DirectoryChange.Creation || dirChange == DirectoryChange.Modification)
}
.flatMapConcat {
case (path, _) => FileIO.fromPath(path).via(Framing.delimiter(ByteString("\n"), 256, true))
}
.map(_.utf8String)
.runForeach(println)
Please direct upvotes to his answer for the original idea.
Related
The path given to the text file is correct, yet I am still getting the error "Input path does not exist: file:/C:/Users/cmpil/Downloads/hunger_games.txt". Why is this happening?
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object WordCountDataSet {
case class Book(value:String)
def main(args:Array[String]): Unit ={
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.appName("WordCount")
.master("local[*]")
.getOrCreate()
import spark.implicits._
//Another way of doing it
val bookRDD = spark.sparkContext.textFile("C:/Users/cmpil/Downloads/hunger_games.txt")
val wordsRDD = bookRDD.flatMap(x => x.split("\\W+"))
val wordsDS = wordsRDD.toDS()
val lowercaseWordsDS = wordsDS.select(lower($"value").alias("word"))
val wordCountsDS = lowercaseWordsDS.groupBy("word").count()
val wordCountsSortedDS = wordCountsDS.sort("count")
wordCountsSortedDS.show(wordCountsSortedDS.count().toInt)
}
}
On Windows you have to use '\\' in place of '/'.
Try using "C:\\Users\\cmpil\\Downloads\\hunger_games.txt".
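For reference, here is a minimal, self-contained sketch of that change (the object name PathCheck is just for illustration; the only difference from the question's textFile call is the escaped-backslash path):
import org.apache.spark.sql.SparkSession
object PathCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").master("local[*]").getOrCreate()
    // Same call as in the question, but with the Windows path written using escaped backslashes
    val bookRDD = spark.sparkContext.textFile("C:\\Users\\cmpil\\Downloads\\hunger_games.txt")
    bookRDD.take(5).foreach(println)
    spark.stop()
  }
}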
I have a Scala project that uses Akka. I want the execution context to be available throughout the project, so I've created a package object like this:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import com.datastax.driver.core.Cluster
package object connector {
implicit val system = ActorSystem()
implicit val mat = ActorMaterializer()
implicit val executionContext = executionContext
implicit val session = Cluster
.builder
.addContactPoints("localhost")
.withPort(9042)
.build()
.connect()
}
In the same package I have this file:
import akka.stream.alpakka.cassandra.scaladsl.CassandraSource
import akka.stream.scaladsl.Sink
import com.datastax.driver.core.{Row, Session, SimpleStatement}
import scala.collection.immutable
import scala.concurrent.Future
object CassandraService {
def selectFromCassandra()() = {
val statement = new SimpleStatement(s"SELECT * FROM animals.alpakka").setFetchSize(20)
val rows: Future[immutable.Seq[Row]] = CassandraSource(statement).runWith(Sink.seq)
rows.map{item =>
print(item)
}
}
}
However, I am getting a compiler error saying that no execution context or session can be found. My understanding of the package object was that everything in it would be available throughout the package, but that does not seem to work. I'd be grateful if this could be explained to me!
Your implementation should look something like this; I hope it helps.
package.scala
package com.app.akka
package object connector {
// Do some codes here..
}
CassandraService.scala
package com.app.akka
import com.app.akka.connector._
object CassandraService {
def selectFromCassandra() = {
// Do some codes here..
}
}
You have two issues with your current code.
When you compile your package object connector, it throws the error below:
Error:(14, 35) recursive value executionContext needs type
implicit val executionContext = executionContext
The issue is with the implicit val executionContext = executionContext line.
The solution for this issue would be as below:
implicit val executionContext = ExecutionContext
When we compile CassandraService, it throws the error below:
Error:(17, 13) Cannot find an implicit ExecutionContext. You might pass
an (implicit ec: ExecutionContext) parameter to your method
or import scala.concurrent.ExecutionContext.Implicits.global.
rows.map{item =>
The error clearly says that we either need to pass an ExecutionContext as an implicit parameter or import scala.concurrent.ExecutionContext.Implicits.global. On my system both issues are resolved and it compiles successfully. I have attached the code for your reference.
package com.apache.scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import com.datastax.driver.core.Cluster
import scala.concurrent.ExecutionContext
package object connector {
implicit val system = ActorSystem()
implicit val mat = ActorMaterializer()
implicit val executionContext = ExecutionContext
implicit val session = Cluster
.builder
.addContactPoints("localhost")
.withPort(9042)
.build()
.connect()
}
package com.apache.scala.connector
import akka.stream.alpakka.cassandra.scaladsl.CassandraSource
import akka.stream.scaladsl.Sink
import com.datastax.driver.core.{Row, SimpleStatement}
import scala.collection.immutable
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
object CassandraService {
def selectFromCassandra() = {
val statement = new SimpleStatement(s"SELECT * FROM animals.alpakka").setFetchSize(20)
val rows: Future[immutable.Seq[Row]] = CassandraSource(statement).runWith(Sink.seq)
rows.map{item =>
print(item)
}
}
}
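For completeness, here is a sketch of the other option the error message mentions: passing the ExecutionContext as an implicit parameter instead of importing the global one. It assumes the same package layout as above, and the caller would supply something like system.dispatcher:
package com.apache.scala.connector
import akka.stream.alpakka.cassandra.scaladsl.CassandraSource
import akka.stream.scaladsl.Sink
import com.datastax.driver.core.{Row, SimpleStatement}
import scala.collection.immutable
import scala.concurrent.{ExecutionContext, Future}
object CassandraService {
  // The ExecutionContext is supplied by the caller instead of being imported globally;
  // the implicit session and materializer still come from the connector package object
  def selectFromCassandra()(implicit ec: ExecutionContext): Future[Unit] = {
    val statement = new SimpleStatement("SELECT * FROM animals.alpakka").setFetchSize(20)
    val rows: Future[immutable.Seq[Row]] = CassandraSource(statement).runWith(Sink.seq)
    rows.map(item => print(item))
  }
}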
I'm using Alpakka's UDP.bindFlow to forward incoming UDP datagrams to a Kafka broker. The legacy application that sends these datagrams requires a UDP response from the same port the message was sent to. I am struggling to model this behaviour, as it would require me to connect the output of the flow to its input.
I tried this solution, but it does not work because the response datagram is sent from a different source port:
import java.net.InetSocketAddress
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.ActorMaterializer
import akka.stream.alpakka.udp.Datagram
import akka.stream.alpakka.udp.scaladsl.Udp
import akka.stream.scaladsl.{Flow, Source}
import akka.util.ByteString
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer
object UdpInput extends App {
implicit val system: ActorSystem = ActorSystem()
implicit val materializer: ActorMaterializer = ActorMaterializer()
val socket = new InetSocketAddress("0.0.0.0", 40000)
val udpBindFlow = Udp.bindFlow(socket)
val producerSettings = ProducerSettings(system, new StringSerializer, new StringSerializer)
val kafkaSink = Flow[Datagram].map(toProducerRecord).to(Producer.plainSink(producerSettings))
def toProducerRecord(datagram: Datagram) = new ProducerRecord[String, String]("udp", datagram.data.utf8String)
def toResponseDatagram(datagram: Datagram) = Datagram(ByteString("OK"), datagram.remote)
// Does not model the behaviour I'm looking for because
// the response datagram is sent from a different source port
Source.asSubscriber
.via(udpBindFlow)
.alsoTo(kafkaSink)
.map(toResponseDatagram)
.to(Udp.sendSink)
.run
}
I ended up using GraphDSL to implement a cyclic graph. Thanks to dvim for pointing me in the right direction!
import java.net.InetSocketAddress
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.alpakka.udp.Datagram
import akka.stream.alpakka.udp.scaladsl.Udp
import akka.stream.scaladsl.GraphDSL.Implicits._
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, MergePreferred, RunnableGraph, Source}
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.util.ByteString
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer
object UdpInput extends App {
implicit val system: ActorSystem = ActorSystem()
implicit val materializer: ActorMaterializer = ActorMaterializer()
val producerSettings = ProducerSettings(system, new StringSerializer, new StringSerializer)
val socket = new InetSocketAddress("0.0.0.0", 40000)
val udpBindFlow = Udp.bindFlow(socket)
val udpResponseFlow = Flow[Datagram].map(toResponseDatagram)
val kafkaSink = Flow[Datagram].map(toProducerRecord).to(Producer.plainSink(producerSettings))
def toProducerRecord(datagram: Datagram) = new ProducerRecord[String, String]("udp", datagram.data.utf8String)
def toResponseDatagram(datagram: Datagram) = Datagram(ByteString("OK"), datagram.remote)
RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
val merge = b.add(MergePreferred[Datagram](1))
val bcast = b.add(Broadcast[Datagram](2))
Source.asSubscriber ~> merge ~> udpBindFlow ~> bcast ~> kafkaSink
merge.preferred <~ udpResponseFlow <~ bcast
ClosedShape
}).run
}
I'm working with Spark Streaming using Scala. I need to read a .csv file dynamically from an HDFS directory with this line:
val lines = ssc.textFileStream("/user/root/")
I use the following command line to put the file into HDFS:
hdfs dfs -put ./head40k.csv
It works fine with a relatively small file.
When I try with a larger one, I get this error:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/root/head800k.csv._COPYING
I can understand why, but I don't know how to fix it. I've tried this solution too:
hdfs dfs -put ./head800k.csv /user
hdfs dfs -mv /usr/head800k.csv /user/root
but my program doesn't read the file.
Any ideas?
Thanks in advance
PROGRAM:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.rdd.RDDFunctions._
import scala.sys.process._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import java.util.HashMap
import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf
import StreamingContext._
object Traccia2014{
def main(args: Array[String]){
if (args.length < 2) {
System.err.println(s"""
|Usage: DirectKafkaWordCount <brokers> <test><topicRisultato>
| <brokers> is a list of one or more Kafka brokers
| <topics> is a list of one or more kafka topics to consume from
|
""".stripMargin)
System.exit(1)
}
val Array(brokers,risultato) = args
val sparkConf = new SparkConf().setAppName("Traccia2014")
val ssc = new StreamingContext(sparkConf, Seconds(5))
val lines = ssc.textFileStream("/user/root/")
//val lines= ssc.fileStream[LongWritable, Text, TextInputFormat](directory="/user/root/",
// filter = (path: org.apache.hadoop.fs.Path) => //(!path.getName.endsWith("._COPYING")),newFilesOnly = true)
//********** Definizioni Producer***********
val props = new HashMap[String, Object]()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
val slice=30
lines.foreachRDD( rdd => {
if(!rdd.isEmpty){
val min=rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
if(!min.isEmpty){
val ipDst= rdd.map(x => (((x.split(",")(0).toInt - min.toInt).toLong/slice).round*slice+" "+(x.split(",")(2)),1)).reduceByKey(_ + _)
if(!ipDst.isEmpty){
val ipSrc=rdd.map(x => (((x.split(",")(0).toInt - min.toInt).toLong/slice).round*slice+" "+(x.split(",")(1)),1)).reduceByKey(_ + _)
if(!ipSrc.isEmpty){
val Rapporto=ipSrc.leftOuterJoin(ipDst).mapValues{case (x,y) => x.asInstanceOf[Int] / y.getOrElse(1) }
val RapportoFiltrato=Rapporto.filter{case (key, value) => value > 100 }
println("###(ConsumerScala) CalcoloRapporti: ###")
Rapporto.collect().foreach(println)
val str = Rapporto.collect().mkString("\n")
println(s"###(ConsumerScala) Produco Risultato : ${str}")
val message = new ProducerRecord[String, String](risultato, null, str)
producer.send(message)
Thread.sleep(1000)
}else{
println("src vuoto")
}
}else{
println("dst vuoto")
}
}else{
println("min vuoto")
}
}else
{
println("rdd vuoto")
}
})//foreach
ssc.start()
ssc.awaitTermination()
} }
/user/root/head800k.csv._COPYING is a transient file that is created while the copy process is ongoing. Wait for the copy process to complete and you will have a file without the _COPYING suffix, i.e. /user/root/head800k.csv.
To filter out these transient files in your Spark Streaming job, you can use the fileStream method documented here, as shown below for example:
ssc.fileStream[LongWritable, Text, TextInputFormat](
directory="/user/root/",
filter = (path: org.apache.hadoop.fs.Path) => (!path.getName.endsWith("_COPYING")), // add other filters like files starting with dot etc
newFilesOnly = true)
EDIT
Since you are moving your file from the local filesystem to HDFS, the best solution is to move it to a temporary staging location in HDFS and then move it to your target directory. Copying or moving within the HDFS filesystem should avoid the transient files.
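A hedged sketch of that staging idea using the Hadoop FileSystem API (the /user/root/staging directory and the StageAndMove object name are assumptions for illustration; a rename within HDFS never exposes a partially written file to the streaming job):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
object StageAndMove {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration()) // assumes fs.defaultFS points at your HDFS
    val staging = new Path("/user/root/staging/head800k.csv")
    val target = new Path("/user/root/head800k.csv")
    // Upload into the staging directory first, where the streaming job is not looking,
    // then move the complete file into the watched directory in a single rename
    fs.copyFromLocalFile(new Path("./head800k.csv"), staging)
    fs.rename(staging, target)
    fs.close()
  }
}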
I am new to Scala. How can I read a file from HDFS using Scala (not using Spark)?
When I googled it, I only found examples of writing to HDFS.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.io.PrintWriter
/**
* #author ${user.name}
*/
object App {
//def foo(x : Array[String]) = x.foldLeft("")((a,b) => a + b)
def main(args : Array[String]) {
println( "Trying to write to HDFS..." )
val conf = new Configuration()
//conf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020")
conf.set("fs.defaultFS", "hdfs://192.168.30.147:8020")
val fs= FileSystem.get(conf)
val output = fs.create(new Path("/tmp/mySample.txt"))
val writer = new PrintWriter(output)
try {
writer.write("this is a test")
writer.write("\n")
}
finally {
writer.close()
println("Closed!")
}
println("Done!")
}
}
Please help me. How can I read or load a file from HDFS using Scala?
One of the ways (in a somewhat functional style) could be like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI
import scala.collection.immutable.Stream
val hdfs = FileSystem.get(new URI("hdfs://yourUrl:port/"), new Configuration())
val path = new Path("/path/to/file/")
val stream = hdfs.open(path)
def readLines = Stream.cons(stream.readLine, Stream.continually( stream.readLine))
//This example checks each line for null and prints every existing line sequentially
readLines.takeWhile(_ != null).foreach(line => println(line))
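For a slightly more conventional variant, the same handle can be wrapped in scala.io.Source and closed explicitly (a sketch that assumes the hdfs and path values defined above):
import scala.io.Source
val in = hdfs.open(path)
try {
  // getLines stops at end of stream, so no null check is needed here
  Source.fromInputStream(in).getLines().foreach(println)
} finally {
  in.close()
}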
You could also take a look at this article, or here and here; those questions look related to yours and contain working (but more Java-like) code examples, if you're interested.