I have the following snippet that reads a CSV file and just prints something to the console:
def readUsingAkkaStreams = {
  import java.io.File
  import akka.stream.scaladsl._
  import akka.actor.ActorSystem
  import akka.stream.ActorMaterializer
  import java.security.MessageDigest

  implicit val system = ActorSystem("Sys")
  implicit val materializer = ActorMaterializer()

  val file = new File("/path/to/csv/file.csv")
  val fileSource = FileIO.fromFile(file, 65536)
  val flow = fileSource.map(chunk => chunk.utf8String)
  flow.to(Sink.foreach(println(_))).run
}
I now have some questions around this:
The chunkSize is the size in bytes. How is it handled internally? I mean, could I end up in a situation where a chunk contains only part of a line?
How does this stream terminate? Right now it does not! I want to know when it has read the file completely so that a stop signal can be triggered. Is there a mechanism to do this?
EDIT 1: After following the suggestions from the answer below, I get an error message (shown in the screenshot).
EDIT 2:
Managed to get rid of the error by setting maximumFrameLength to match the maximum chunk size, which is 65536:
val file = new File("/path/to/csv/file.csv")
val chunkSize = 65536
val fileSource = FileIO.fromFile(file, chunkSize).via(Framing.delimiter(
  ByteString("\n"),
  maximumFrameLength = chunkSize,
  allowTruncation = true))
1. As per the docs:
Emitted elements are chunkSize sized ByteString elements, except the final element, which will be up to chunkSize in size.
The FileIO source treats newlines like any other character. So yes, you could potentially see the first part of a CSV line in one chunk and the second part in another chunk. If this is not what you want, you can restructure how your ByteString flow is chunked by using Framing.delimiter (see the docs for more info).
As a side note, FileIO.fromFile has been deprecated; better use FileIO.fromPath.
An example would be:
val fileSource = FileIO.fromPath(...)
  .via(Framing.delimiter(
    ByteString("\n"),
    maximumFrameLength = 256,
    allowTruncation = true))
2. The sink materializes into a Future that completes when the stream terminates, so you can hook onto it:
val result: Future[Done] = flow.runWith(Sink.foreach(println(_)))
result.onComplete(...)
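If you also want the IOResult produced by the FileIO source (bytes read and completion status), keep the source's materialized value instead of the sink's. A minimal sketch, reusing the system and flow values from the question and assuming Akka 2.4+:
import akka.stream.IOResult
import akka.stream.scaladsl.{Keep, Sink}
import scala.concurrent.Future

implicit val ec = system.dispatcher

// Keep the FileIO source's materialized Future[IOResult] rather than the sink's Future[Done]
val ioResult: Future[IOResult] =
  flow.toMat(Sink.foreach(println(_)))(Keep.left).run()

ioResult.onComplete { res =>
  println(s"File read completed: $res")
  system.terminate() // shut down the ActorSystem once the file has been fully read
}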
Related
I'm coding a small Akka Streams sample where I want to write the elements of a List to a local TXT file:
implicit val ec = context.dispatcher
implicit val actorSystem = context.system
implicit val materializer = ActorMaterializer()
val source = Source(List("a", "b", "c"))
.map(char => ByteString(s"${char} \n"))
val runnableGraph = source.toMat(FileIO.toPath(Paths.get("~/Downloads/results.txt")))(Keep.right)
runnableGraph.run()
The file is created at the location I set in the code.
I do not terminate the actor system, so it definitely has enough time to write all of the List's elements to the file.
But unfortunately, nothing happens.
Use the expanded path to your home directory instead of the tilde (~). For example:
val runnableGraph =
source.toMat(
FileIO.toPath(Paths.get("/home/YourUserName/Downloads/results.txt")))(Keep.right)
runnableGraph.run()
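Alternatively, you can build the path from the user.home system property so the user name is not hard-coded. A minimal sketch under that assumption, reusing the source from the question:
import java.nio.file.Paths

// java.nio does not expand "~", so resolve the home directory explicitly
val resultsPath = Paths.get(System.getProperty("user.home"), "Downloads", "results.txt")

val runnableGraph =
  source.toMat(FileIO.toPath(resultsPath))(Keep.right)
runnableGraph.run()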
My problem is that when I change my code into streaming mode and put my DataFrame into the foreachRDD loop, the DataFrame shows an empty table! It doesn't fill! I also cannot pass it into assembler.transform(). The error is:
Error:(38, 40) not enough arguments for method map: (mapFunc: String => U)(implicit evidence$2: scala.reflect.ClassTag[U])org.apache.spark.streaming.dstream.DStream[U].
Unspecified value parameter mapFunc.
val dataFrame = Train_DStream.map()
My train.csv file is like below:
Please help me.
Here is my code:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try
/**
* Created by saeedtkh on 5/22/17.
*/
object ML_Test {

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount")
    val sc = new SparkContext(sparkConf)
    // Create the context
    val ssc = new StreamingContext(sc, Seconds(10))
    val sqlContext = new SQLContext(sc)

    val customSchema = StructType(Array(
      StructField("column0", StringType, true),
      StructField("column1", StringType, true),
      StructField("column2", StringType, true)))

    //val Test_DStream = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed/train.csv").map(LabeledPoint.parse)
    val Train_DStream = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed/train.csv")
    val DStream = Train_DStream.map(line => line.split(">")).map(array => {
      val first = Try(array(0).trim.split(" ")(0)) getOrElse ""
      val second = Try(array(1).trim.split(" ")(6)) getOrElse ""
      val third = Try(array(2).trim.split(" ")(0).replace(":", "")) getOrElse ""
      Row.fromSeq(Seq(first, second, third))
    })

    DStream.foreachRDD { Test_DStream =>
      val dataFrame = sqlContext.createDataFrame(Test_DStream, customSchema)
      dataFrame.groupBy("column1", "column2").count().show()

      val numFeatures = 3
      val model = new StreamingLinearRegressionWithSGD()
        .setInitialWeights(Vectors.zeros(numFeatures))
      val featureCol = Array("column1", "column2")
      val assembler = new VectorAssembler().setInputCols(featureCol).setOutputCol("features")
      dataFrame.show()
      val df_new = assembler.transform(dataFrame)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
My guess is that all the files under the /Users/saeedtkh/Desktop/sharedsaeed/train.csv directory have already been processed, so there are no new files left to pick up and hence the DataFrame is empty.
Please note that the sole input parameter for StreamingContext.textFileStream is a directory, not a file.
textFileStream(directory: String): DStream[String] Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files
Please also note that once a file has ever been processed in a Spark Streaming application, this file should not be changed (or appended to) since the file has already been marked as processed and Spark Streaming will ignore any modifications.
Quoting the official documentation of Spark Streaming in Basic Sources:
Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories not supported).
Note that
The files must have the same data format.
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.
For simple text files, there is an easier method streamingContext.textFileStream(dataDirectory). And file streams do not require running a receiver, hence does not require allocating cores.
Please also replace setMaster("local") with setMaster("local[*]") to make sure your Spark Streaming application will have enough threads to process incoming data (you have to have at least 2 threads).
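Putting those points together, a minimal sketch of the corrected setup (the input directory name here is hypothetical, and new CSV files must be moved into it atomically):
val sparkConf = new SparkConf()
  .setMaster("local[*]") // at least 2 threads so Spark Streaming can receive and process data
  .setAppName("HdfsWordCount")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(10))

// textFileStream monitors a *directory*; drop new files into it by moving/renaming
// them in (an atomic operation), never by appending to files already in there
val Train_DStream = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed/input")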
I have an RDD that has the signature
org.apache.spark.rdd.RDD[java.io.ByteArrayOutputStream]
In this RDD, each row has its own partition.
This ByteArrayOutputStream is zip output. I am applying some processing to the data in each partition, and I want to export the processed data from each partition as a single zip file. What is the best way to export each Row in the final RDD as one file per row on HDFS?
In case you are interested, here is how I ended up with such an RDD:
val npyData = transformedTopData.select("tokenIDF", "topLevelId").rdd.repartition(2).mapPartitions(x => {
  val vectors = for {
    row <- x
  } yield {
    row.getAs[Vector](0)
  }
  Seq(ml2npyCSR(vectors.toSeq).zipOut)
}.iterator)
EDIT: Count works perfectly fine
scala> npyData.count()
res9: Long = 2
Spark has very little support for file system operations. You'll need to use the Hadoop FileSystem API to create individual files:
import java.io.ByteArrayOutputStream

// This method is needed as the Hadoop conf object is not serializable
def createFileStream(pathStr: String) = {
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  val hadoopconf = new Configuration()
  val fs = FileSystem.get(hadoopconf)
  val outFileStream = fs.create(new Path(pathStr))
  outFileStream
}

// Method writes each item to an individual file.
// Needs a unique id along with the object for output file naming.
def writeToFile(x: (ByteArrayOutputStream, Long)): Unit = {
  val (dataStream, id) = x
  val output_dir = "/tmp/del_a/"
  val outFileStream = createFileStream(output_dir + id)
  dataStream.writeTo(outFileStream)
  outFileStream.close()
}
// zipWithIndex used for creating unique id for each item in rdd
npyData.zipWithIndex().foreach(writeToFile)
Reference:
Hadoop FileSystem example
ByteArrayOutputStream.writeTo(java.io.OutputStream)
I figured out that I should represent my data as a PairRDD and implement a custom FileOutputFormat. I looked into the implementation of SequenceFileOutputFormat for inspiration and managed to write my own version based on that.
My custom FileOutputFormat is available here
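For reference, here is a rough sketch of what such a format can look like; the class name and the saveAsHadoopFile wiring are my own illustration (not the linked implementation), and it simply writes each BytesWritable value verbatim into the task's output file:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, RecordWriter, Reporter}
import org.apache.hadoop.util.Progressable

// Hypothetical name; writes raw bytes, one output file per task/partition
// (here: per row, since each row sits in its own partition)
class RawBytesOutputFormat extends FileOutputFormat[NullWritable, BytesWritable] {
  override def getRecordWriter(ignored: FileSystem, job: JobConf, name: String,
                               progress: Progressable): RecordWriter[NullWritable, BytesWritable] = {
    val file = FileOutputFormat.getTaskOutputPath(job, name)
    val out = file.getFileSystem(job).create(file, progress)
    new RecordWriter[NullWritable, BytesWritable] {
      def write(key: NullWritable, value: BytesWritable): Unit =
        out.write(value.getBytes, 0, value.getLength)
      def close(reporter: Reporter): Unit = out.close()
    }
  }
}

// Usage: pair each zip with a NullWritable key and save via the old Hadoop API
// (output path is a placeholder)
npyData
  .map(zip => (NullWritable.get(), new BytesWritable(zip.toByteArray)))
  .saveAsHadoopFile("/output/path", classOf[NullWritable], classOf[BytesWritable],
    classOf[RawBytesOutputFormat])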
I'm reading a csv file. I am using Akka Streams to do this so that I can create a graph of actions to perform on each line. I've got the following toy example up and running.
def main(args: Array[String]): Unit = {
  implicit val system = ActorSystem("MyAkkaSystem")
  implicit val materializer = ActorMaterializer()

  val source = akka.stream.scaladsl.Source.fromIterator(Source.fromFile("a.csv").getLines)
  val sink = Sink.foreach(println)
  source.runWith(sink)
}
The two Source types don't sit easy with me. Is this idiomatic or is there is a better way to write this?
Actually, akka-streams provides a function to directly read from a file.
FileIO.fromPath(Paths.get("a.csv"))
.via(Framing.delimiter(ByteString("\n"), 256, true).map(_.utf8String))
.runForeach(println)
Here, the runForeach method is used to print the lines. If you have a proper Sink to process these lines, use it instead. For example, if you want to split each line by ',' and print the number of fields in it:
val sink = Sink.foreach[String](x => println(x.split(",").size))
FileIO.fromPath(Paths.get("a.csv"))
.via(Framing.delimiter(ByteString("\n"), 256, true).map(_.utf8String))
.to(sink)
.run()
The idiomatic way to read a CSV file with Akka Streams is to use the Alpakka CSV connector. The following example reads a CSV file, converts it to a map of column names (assumed to be the first line in the file) and ByteString values, transforms the ByteString values to String values, and prints each line:
FileIO.fromPath(Paths.get("a.csv"))
.via(CsvParsing.lineScanner())
.via(CsvToMap.toMap())
.map(_.mapValues(_.utf8String))
.runForeach(println)
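Note that the example above assumes the Alpakka CSV module is on the classpath, roughly with the following dependency and imports (the version is a placeholder; pick the one matching your Akka release):
// build.sbt
// libraryDependencies += "com.lightbend.akka" %% "akka-stream-alpakka-csv" % "<version>"

import java.nio.file.Paths

import akka.stream.alpakka.csv.scaladsl.{CsvParsing, CsvToMap}
import akka.stream.scaladsl.FileIO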
Try this:
import java.nio.file.Paths

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl._
import akka.util.ByteString

import scala.concurrent.Await
import scala.concurrent.duration._

object ReadStreamApp extends App {
  implicit val actorSystem = ActorSystem()
  import actorSystem.dispatcher
  implicit val flowMaterializer = ActorMaterializer()

  val logFile = Paths.get("src/main/resources/a.csv")

  val source = FileIO.fromPath(logFile)

  val flow = Framing
    .delimiter(ByteString(System.lineSeparator()), maximumFrameLength = 512, allowTruncation = true)
    .map(_.utf8String)

  val sink = Sink.foreach(println)

  source
    .via(flow)
    .runWith(sink)
    .andThen {
      case _ =>
        actorSystem.terminate()
        Await.ready(actorSystem.whenTerminated, 1 minute)
    }
}
Yeah, it's OK because these are different Sources. But if you don't like scala.io.Source you can read the file yourself (which sometimes we have to do, e.g. when the source CSV file is zipped) and then parse the resulting InputStream like this:
StreamConverters.fromInputStream(() => input)
  .via(Framing.delimiter(ByteString("\n"), 4096))
  .map(_.utf8String)
  .collect { case line =>
    line // place per-line processing here
  }
Having said that, consider using Apache Commons CSV together with akka-stream. You may end up writing less code :)
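For illustration, a minimal sketch of that combination, assuming commons-csv is on the classpath and the file is a plain (unzipped) CSV:
import java.io.FileReader

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Source
import org.apache.commons.csv.{CSVFormat, CSVRecord}

import scala.collection.JavaConverters._

implicit val system = ActorSystem()
implicit val materializer = ActorMaterializer()

// Let Commons CSV do the parsing and feed the records into an Akka Streams Source
val records: Iterator[CSVRecord] =
  CSVFormat.DEFAULT.parse(new FileReader("a.csv")).iterator().asScala

Source.fromIterator(() => records)
  .map(record => record.asScala.mkString("|")) // columns of one record, re-joined for printing
  .runForeach(println)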
How do you write RDD[Array[Byte]] to a file using Apache Spark and read it back again?
Common problems seem to be getting a weird cannot-cast exception from BytesWritable to NullWritable. Another common problem is that BytesWritable.getBytes doesn't really get your bytes at all: it returns the padded backing array, i.e. your bytes with a ton of zeros added on the end. You have to use copyBytes instead:
val rdd: RDD[Array[Byte]] = ???
// To write
rdd.map(bytesArray => (NullWritable.get(), new BytesWritable(bytesArray)))
.saveAsSequenceFile("/output/path", codecOpt)
// To read
val rdd: RDD[Array[Byte]] = sc.sequenceFile[NullWritable, BytesWritable]("/input/path")
.map(_._2.copyBytes())
Here is a snippet with all required imports that you can run from spark-shell, as requested by @Choix:
import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.io.NullWritable
val path = "/tmp/path"
val rdd = sc.parallelize(List("foo"))
val bytesRdd = rdd.map{str => (NullWritable.get, new BytesWritable(str.getBytes) ) }
bytesRdd.saveAsSequenceFile(path)
val recovered = sc.sequenceFile[NullWritable, BytesWritable]("/tmp/path").map(_._2.copyBytes())
val recoveredAsString = recovered.map( new String(_) )
recoveredAsString.collect()
// result is: Array[String] = Array(foo)