Find size of data stored in rdd from a text file in apache spark - scala

I am new to Apache Spark (version 1.4.1). I wrote a small code to read a text file and stored its data in Rdd .
Is there a way by which I can get the size of data in rdd .
This is my code :
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator
import org.apache.spark.sql.Row
object RddSize {
def main(args: Array[String]) {
val sc = new SparkContext("local", "data size")
val FILE_LOCATION = "src/main/resources/employees.csv"
val peopleRdd = sc.textFile(FILE_LOCATION)
val newRdd = peopleRdd.filter(str => str.contains(",M,"))
//Here I want to find whats the size remaining data
}
}
I want to get size of data before filter Transformation (peopleRdd) and after it (newRdd).

There are multiple way to get the RDD size
1.Add the spark listener in your spark context
SparkDriver.getContext.addSparkListener(new SparkListener() {
override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
val map = stageCompleted.stageInfo.rddInfos
map.foreach(row => {
println("rdd memSize " + row.memSize)
println("rdd diskSize " + row.diskSize)
})
}})
2. Save you rdd as text file.
myRDD.saveAsTextFile("person.txt")
and call Apache Spark REST API.
/applications/[app-id]/stages
3. You can also try SizeEstimater
val rddSize = SizeEstimator.estimate(myRDD)

I'm not sure you need to do this. You could cache the rdd and check the size in the Spark UI. But lets say that you do want to do this programmatically, here is a solution.
def calcRDDSize(rdd: RDD[String]): Long = {
//map to the size of each string, UTF-8 is the default
rdd.map(_.getBytes("UTF-8").length.toLong)
.reduce(_+_) //add the sizes together
}
You can then call this function for your two RDDs:
println(s"peopleRdd is [${calcRDDSize(peopleRdd)}] bytes in size")
println(s"newRdd is [${calcRDDSize(newRdd)}] bytes in size")
This solution should work even if the file size is larger than the memory available in the cluster.

The Spark API doc says that:
You can get info about your RDDs from the Spark context: sc.getRDDStorageInfo
The RDD info includes memory and disk size: RDDInfo doc

Related

Why does foreachRDD not populate DataFrame with new content using StreamingContext.textFileStream?

My problem is that, as I change my code into streaming mode and put my data frame into the foreach loop, the data frame shows empty table! I does't fill! I also can not put it into assembler.transform(). The error is:
Error:(38, 40) not enough arguments for method map: (mapFunc: String => U)(implicit evidence$2: scala.reflect.ClassTag[U])org.apache.spark.streaming.dstream.DStream[U].
Unspecified value parameter mapFunc.
val dataFrame = Train_DStream.map()
My train.csv file is like below:
Please help me.
Here is my code:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try
/**
* Created by saeedtkh on 5/22/17.
*/
object ML_Test {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount")
val sc = new SparkContext(sparkConf)
// Create the context
val ssc = new StreamingContext(sc, Seconds(10))
val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
StructField("column0", StringType, true),
StructField("column1", StringType, true),
StructField("column2", StringType, true)))
//val Test_DStream = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed/train.csv").map(LabeledPoint.parse)
val Train_DStream = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed/train.csv")
val DStream =Train_DStream.map(line => line.split(">")).map(array => {
val first = Try(array(0).trim.split(" ")(0)) getOrElse ""
val second = Try(array(1).trim.split(" ")(6)) getOrElse ""
val third = Try(array(2).trim.split(" ")(0).replace(":", "")) getOrElse ""
Row.fromSeq(Seq(first, second, third))
})
DStream.foreachRDD { Test_DStream =>
val dataFrame = sqlContext.createDataFrame(Test_DStream, customSchema)
dataFrame.groupBy("column1", "column2").count().show()
val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
.setInitialWeights(Vectors.zeros(numFeatures))
val featureCol = Array("column1", "column2")
val assembler=new VectorAssembler().setInputCols(featureCol).setOutputCol("features")
dataFrame.show()
val df_new=assembler.transform(dataFrame)
}
ssc.start()
ssc.awaitTermination()
}
}
My guess is that all the files under /Users/saeedtkh/Desktop/sharedsaeed/train.csv directory have already been processed and so there are no files left and hence the DataFrame is empty.
Please note that the sole input parameter for StreamingContext.textFileStream is a directory not a file.
textFileStream(directory: String): DStream[String] Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files
Please also note that once a file has ever been processed in a Spark Streaming application, this file should not be changed (or appended to) since the file has already been marked as processed and Spark Streaming will ignore any modifications.
Quoting the official documentation of Spark Streaming in Basic Sources:
Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories not supported).
Note that
The files must have the same data format.
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.
For simple text files, there is an easier method streamingContext.textFileStream(dataDirectory). And file streams do not require running a receiver, hence does not require allocating cores.
Please also replace setMaster("local") with setMaster("local[*]") to make sure your Spark Streaming application will have enough threads to process incoming data (you have to have at least 2 threads).

Spark: How to write org.apache.spark.rdd.RDD[java.io.ByteArrayOutputStream]

I have an RDD that has the signature
org.apache.spark.rdd.RDD[java.io.ByteArrayOutputStream]
In this RDD, each row has its own partition.
This ByteArrayOutputStream is zip output. I am applying some processing on the data in each partition and i want to export the processed data from each partition as a single zip file. What is the best way to export each Row in the final RDD as one file per row on hdfs?
If you are interested in knowing how I ended up with such an Rdd.
val npyData = transformedTopData.select("tokenIDF", "topLevelId").rdd.repartition(2).mapPartitions(x => {
val vectors = for {
row <- x
} yield {
row.getAs[Vector](0)
}
Seq(ml2npyCSR(vectors.toSeq).zipOut)
}.iterator)
EDIT: Count works perfectly fine
scala> npyData.count()
res9: Long = 2
Spark has very little support for file system operations. You'll need to Hadoop FileSystem API to create individual files
// This method is needed as Hadoop conf object is not serializable
def createFileStream(pathStr:String) = {
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
val hadoopconf = new Configuration();
val fs = FileSystem.get(hadoopconf);
val outFileStream = fs.create(new Path(pathStr));
outFileStream
}
// Method writes to individual files.
// Needs a unique id along with object for output file naming
def writeToFile( x:(Char, Long) ) : Unit = {
val (dataStream, id) = x
val output_dir = "/tmp/del_a/"
val outFileStream = createFileStream(output_dir+id)
dataStream.writeTo(outFileStream)
outFileStream.close()
}
// zipWithIndex used for creating unique id for each item in rdd
npyData.zipWithIndex().foreach(writeToFile)
Reference:
Hadoop FileSystem example
ByteArrayOutputStream.writeTo(java.io.OutputStream)
I figured out that I should represent my data as PairRDD and implement a custom FileOutputFormat. I looked in to the implementation of SequenceFileOutputFormat for inspiration and managed to write my own version based on that.
My custom FileOutputFormat is available here

spark streaming wordcount is not printing the results

I am trying to run this app in spark streaming the code is from a book I am reading but unfortunately I am not getting the expected results. There is a java class in which I open a socket and wait for an input. I run the socket code and connect it properly with the spark job. Then I submit the following job and I get a message that I connected successfully. When I type something in the socket I want to get a wordcount result printed in the terminal instead I am getting this message:
INFO BlockManagerInfo: Added input-0-1480077969600 in memory on 192.168.1.4:38818 (size: 7.0 B, free: 265.1 MB)
where is the problem? See the code bellow, thanks in advance
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming._
import org.apache.spark.storage.StorageLevel._
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.dstream.ForEachDStream
object ScalaFirstStreamingExample {
def main(args: Array[String]){
println("Creating Spark Configuration") //Create an Object of Spark Configuration
val conf = new SparkConf() //Set the logical and user defined Name of this Application
conf.setAppName("My First Spark Streaming Application")
println("Retreiving Streaming Context from Spark Conf") //Retrieving Streaming Context from SparkConf Object.
//Second parameter is the time interval at which
//streaming data will be divided into batches
val streamCtx = new StreamingContext(conf, Seconds(2)) //Define the type of Stream. Here we are using TCP
//Socket as textstream,
//It will keep watching for the incoming data from a
//specific machine (localhost) and port (9087)
//Once the data is retrieved it will be saved in the
//memory and in case memory
//is not sufficient, then it will store it on the Disk
//It will further read the Data and convert it into DStream
val lines = streamCtx.socketTextStream("localhost", 9087, MEMORY_AND_DISK_SER_2) //Apply the Split() function to all elements of DStream
//which will further generate multiple new records from
//each record in Source Stream
//And then use flatmap to consolidate all records and
//create a new DStream.
val words = lines.flatMap(x => x.split(" ")) //Now, we will count these words by applying a using map()
//map() helps in applying a given function to each
//element in an RDD.
val pairs = words.map(word => (word, 1)) //Further we will aggregate the value of each key by
//using/applying the given function.
val wordCounts = pairs.reduceByKey(_ + _) //Lastly we will print all Values
//wordCounts.print(20)
myPrint(wordCounts,streamCtx)
//Most important statement which will initiate the
//Streaming Context
streamCtx.start();
//Wait till the execution is completed.
streamCtx.awaitTermination();
}
def myPrint(stream:DStream[(String,Int)],streamCtx: StreamingContext){
stream.foreachRDD(foreachFunc)
def foreachFunc = (rdd: RDD[(String,Int)]) => {
val array = rdd.collect()
println("---------Start Printing Results----------")
for(res<-array){
println(res)
}
println("---------Finished Printing Results----------")
}
}
}

How do i pass Spark context to a function from foreach

I need to pass SparkContext to my function and please suggest me how to do that for below scenario.
I have a Sequence, each element refers to specific data source from which we gets RDD and process them. I have defined a function which takes spark context and the data source and does the necessary things. I am curretly using while loop. But, i would like to do it with foreach or map, so that i can imply parallel processing. I need to spark context for the function, but how can i pass it from the foreach.?
Just a SAMPLE code, as i cannot present the actual code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
object RoughWork {
def main(args: Array[String]) {
val str = "Hello,hw:How,sr:are,ws:You,re";
val conf = new SparkConf
conf.setMaster("local");
conf.setAppName("app1");
val sc = new SparkContext(conf);
val sqlContext = new SQLContext(sc);
val rdd = sc.parallelize(str.split(":"))
rdd.map(x => {println("==>"+x);passTest(sc, x)}).collect();
}
def passTest(context: SparkContext, input: String) {
val rdd1 = context.parallelize(input.split(","));
rdd1.foreach(println)
}
}
You cannot pass the SparkContext around like that. passTest will be run on an/the executor(s), while the SparkContext runs on the driver.
If I would have to do a double split like that, one approach would be to use flatMap:
rdd
.zipWithIndex
.flatMap(l => {
val parts = l._1.split(",");
List.fill(parts.length)(l._2) zip parts})
.countByKey
There may be prettier ways, but basically the idea is that you can use zipWithIndex to keep track which line an item came from and then use key-value pair RDD methods to work on your data.
If you have more than one key, or just more structured data in general, you can look into using Spark SQL with DataFrames (or DataSets in latest version) and explode instead of flatMap.

Can I convert an incoming stream of data into an array?

I'm trying to learn streaming data and manipulating it with the telecom churn dataset provided here. I've written a method to calculate this in batch:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD, LogisticRegressionWithLBFGS, LogisticRegressionModel, NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
object batchChurn{
def main(args: Array[String]): Unit = {
//setting spark context
val conf = new SparkConf().setAppName("churn")
val sc = new SparkContext(conf)
//loading and mapping data into RDD
val csv = sc.textFile("file://filename.csv")
val data = csv.map {line =>
val parts = line.split(",").map(_.trim)
val stringvec = Array(parts(1)) ++ parts.slice(4,20)
val label = parts(20).toDouble
val vec = stringvec.map(_.toDouble)
LabeledPoint(label, Vectors.dense(vec))
}
val splits = data.randomSplit(Array(0.7,0.3))
val (training, testing) = (splits(0),splits(1))
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 6
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 7
val maxBins = 32
val model = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo,numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
val labelAndPreds = testing.map {point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
}
}
I've had no problems with this. Now, I looked at the NetworkWordCount example provided on the spark website, and changed the code slightly to see how it would behave.
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("127.0.0.1", 9999)
val data = lines.flatMap(_.split(","))
My question is: is it possible to convert this DStream to an array which I can input into my analysis code? Currently when I try to convert to Array using val data = lines.flatMap(_.split(",")), it clearly says that:error: value toArray is not a member of org.apache.spark.streaming.dstream.DStream[String]
Your DStream contains many RDDs you can get access to the RDDs using foreachRDD function.
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/streaming/dstream/DStream.html#foreachRDD(scala.Function1)
then each RDD can be converted to array using collect function.
this has already been shown here
For each RDD in a DStream how do I convert this to an array or some other typical Java data type?
DStream.foreachRDD gives you an RDD[String] for each interval of
course, you could collect in an array
val arr = new ArrayBuffer[String]();
data.foreachRDD {
arr ++= _.collect()
}
Also keep in mind you could end up having way more data than you want in your driver since a DStream can be huge.
To limit the data for your analysis , I would do this way
data.slice(new Time(fromMillis), new Time(toMillis)).flatMap(_.collect()).toSet
You cannot put all the elements of a DStream in an array because those elements will keep being read over the wire, and your array would have to be indefinitely extensible.
The adaptation of this decision tree model to a streaming mode, where training and testing data arrives continuously, is not trivial for algorithmical reasons — while the answers mentioning collect are technically correct, they're not the appropriate solution to what you're trying to do.
If you want to run decision trees on a Stream in Spark, you may want to look at Hoeffding trees.