Why does foreachRDD not populate DataFrame with new content using StreamingContext.textFileStream? - scala

My problem is that, as I change my code into streaming mode and put my data frame into the foreach loop, the data frame shows empty table! I does't fill! I also can not put it into assembler.transform(). The error is:
Error:(38, 40) not enough arguments for method map: (mapFunc: String => U)(implicit evidence$2: scala.reflect.ClassTag[U])org.apache.spark.streaming.dstream.DStream[U].
Unspecified value parameter mapFunc.
val dataFrame = Train_DStream.map()
My train.csv file is like below:
Please help me.
Here is my code:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try
/**
* Created by saeedtkh on 5/22/17.
*/
object ML_Test {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount")
val sc = new SparkContext(sparkConf)
// Create the context
val ssc = new StreamingContext(sc, Seconds(10))
val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
StructField("column0", StringType, true),
StructField("column1", StringType, true),
StructField("column2", StringType, true)))
//val Test_DStream = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed/train.csv").map(LabeledPoint.parse)
val Train_DStream = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed/train.csv")
val DStream =Train_DStream.map(line => line.split(">")).map(array => {
val first = Try(array(0).trim.split(" ")(0)) getOrElse ""
val second = Try(array(1).trim.split(" ")(6)) getOrElse ""
val third = Try(array(2).trim.split(" ")(0).replace(":", "")) getOrElse ""
Row.fromSeq(Seq(first, second, third))
})
DStream.foreachRDD { Test_DStream =>
val dataFrame = sqlContext.createDataFrame(Test_DStream, customSchema)
dataFrame.groupBy("column1", "column2").count().show()
val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
.setInitialWeights(Vectors.zeros(numFeatures))
val featureCol = Array("column1", "column2")
val assembler=new VectorAssembler().setInputCols(featureCol).setOutputCol("features")
dataFrame.show()
val df_new=assembler.transform(dataFrame)
}
ssc.start()
ssc.awaitTermination()
}
}

My guess is that all the files under /Users/saeedtkh/Desktop/sharedsaeed/train.csv directory have already been processed and so there are no files left and hence the DataFrame is empty.
Please note that the sole input parameter for StreamingContext.textFileStream is a directory not a file.
textFileStream(directory: String): DStream[String] Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files
Please also note that once a file has ever been processed in a Spark Streaming application, this file should not be changed (or appended to) since the file has already been marked as processed and Spark Streaming will ignore any modifications.
Quoting the official documentation of Spark Streaming in Basic Sources:
Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories not supported).
Note that
The files must have the same data format.
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.
For simple text files, there is an easier method streamingContext.textFileStream(dataDirectory). And file streams do not require running a receiver, hence does not require allocating cores.
Please also replace setMaster("local") with setMaster("local[*]") to make sure your Spark Streaming application will have enough threads to process incoming data (you have to have at least 2 threads).

Related

TextFileStreaming in spark scala

I have many text file in local directory. Spark Program to read all the files and store it into database. For the moment, trying to read the files using text file stream not working.
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.DStream
/**
* Main Program
*/
object SparkMain extends App {
// Create a SparkContext to initialize Spark
val sparkConf: SparkConf =
new SparkConf()
.setMaster("local")
.setAppName("TestProgram")
// Create a spark streaming context with windows period 2 sec
val ssc: StreamingContext =
new StreamingContext(sparkConf, Seconds(2))
// Create text file stream
val sourceDir: String = "D:\\tmpDir"
val stream: DStream[String] = ssc.textFileStream(sourceDir)
case class TextLine(line: String)
val lineRdd: DStream[TextLine] = stream.map(TextLine)
lineRdd.foreachRDD(rdd => {
rdd.foreach(println)
})
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
}
Input:
//1.txt
Hello World
Nothing print when stream the streaming. What is wrong in it?
TextFileStreaming does not read the file that is already present in the directory. Start the program and create a new file or move the file from any other directory. The following program is simple word count for text file streaming
val sourceDir: String = "path to streaming directory"
val stream: DStream[String] = streamingContext.textFileStream(sourceDir)
case class TextLine(line: String)
val lineRdd: DStream[TextLine] = stream.map(TextLine)
lineRdd.foreachRDD(rdd => {
val words = rdd.flatMap(rdd => rdd.line.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
println("=====================")
wordCounts.foreach(println)
println("=====================" + rdd.count())
})
The ouput should be something like this
+++++++++++++++++++++++
=====================0
+++++++++++++++++++++++
(are,1)
(you,1)
(how,1)
(hello,1)
(doing,1)
=====================5
+++++++++++++++++++++++
=====================0
I hope this helps!

Using iterated writing in HDFS file by using Spark/Scala

I am learning how to read and write from files in HDFS by using Spark/Scala.
I am unable to write in HDFS file, the file is created, but it's empty.
I don't know how to create a loop for writing in a file.
The code is:
import scala.collection.immutable.Map
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
// Read the adult CSV file
val logFile = "hdfs://zobbi01:9000/input/adult.csv"
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
//val logFile = sc.textFile("hdfs://zobbi01:9000/input/adult.csv")
val headerAndRows = logData.map(line => line.split(",").map(_.trim))
val header = headerAndRows.first
val data = headerAndRows.filter(_(0) != header(0))
val maps = data.map(splits => header.zip(splits).toMap)
val result = maps.filter(map => map("AGE") != "23")
result.foreach{
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
}
If I replace:
result.foreach{println}
Then it works!
but when using the method of (saveAsTextFile), then an error message is thrown as
<console>:76: error: type mismatch;
found : Unit
required: scala.collection.immutable.Map[String,String] => Unit
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
Any help please.
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
This is all what you need to do. You don't need to loop through all the rows.
Hope this helps!
What this does!!!
result.foreach{
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
}
RDD action cannot be triggered from RDD transformations unless special conf set.
Just use result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt") to save to HDFS.
I f you need other formats in the file to be written, change in rdd itself before writing.

Saving data in to a different host in spark scala application

I'm trying to persist a data frame created out of a kafka topic data in to a different host.
The code i've used:
val topicMaps = Map("topic" -> 2)
val conf = new Configuration()
conf.set("fs.defaultFS","maprfs://host-2:7222")
val fs =FileSystem.get(conf)
val messages = KafkaUtils.createStream[String, String,StringDecoder,StringDecoder](ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)
messages.foreachRDD(rdd=>
{
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val dataframe =sqlContext.read.json(rdd.map(_._2))
val myDF =dataframe.toDF()
import org.apache.spark.sql.SaveMode
myDF.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("maprfs://host-2:7222/hdfs/path")
})
The above code has created a path in the host directory, but the data is not being written whatsoever.
Any help is appreciated.

Does spark supports multiple output file with parquet format

The business case is that we'd like to split a big parquet file into small ones by a column as partition. We've tested using dataframe.partition("xxx").write(...). It took about 1hr with 100K entries of records. So, we are going to use map reduce to generate different parquet file in different folder. Sample code:
import org.apache.hadoop.io.NullWritable
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
key.asInstanceOf[String]+"/aa"
}
object Split {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("SplitTest")
val sc = new SparkContext(conf)
sc.parallelize(List(("w", "www"), ("b", "blog"), ("c", "com"), ("w", "bt")))
.map(value => (value._1, value._2 + "Test"))
.partitionBy(new HashPartitioner(3))//.saveAsNewAPIHadoopFile(path, keyClass, valueClass, outputFormatClass, conf)
.saveAsHadoopFile(args(0), classOf[String], classOf[String],
classOf[RDDMultipleTextOutputFormat])
sc.stop()
}
}
The sample above just generates a text file, how to generate a parquet file with multipleoutputformat?
Spark supports Parquet partitioning since 1.4.0 (1.5+ syntax):
df.write.partitionBy("some")
and bucketing since (2.0.0):
df.write.bucketBy("some")
with optional sortBy clause.

Save MongoDB data to parquet file format using Apache Spark

I am a newbie with Apache spark as well with Scala programming language.
What I am trying to achieve is to extract the data from my local mongoDB database for then to save it in a parquet format using Apache Spark with the hadoop-connector
This is my code so far:
package com.examples
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.{MongoInputFormat, BSONFileInputFormat}
import org.apache.spark.sql
import org.apache.spark.sql.SQLContext
object DataMigrator {
def main(args: Array[String])
{
val conf = new SparkConf().setAppName("Migration App").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
// Import statement to implicitly convert an RDD to a DataFrame
import sqlContext.implicits._
val mongoConfig = new Configuration()
mongoConfig.set("mongo.input.uri", "mongodb://localhost:27017/mongosails4.case")
val mongoRDD = sc.newAPIHadoopRDD(mongoConfig, classOf[MongoInputFormat], classOf[Object], classOf[BSONObject]);
val count = countsRDD.count()
// the count value is aprox 100,000
println("================ PRINTING =====================")
println(s"ROW COUNT IS $count")
println("================ PRINTING =====================")
}
}
The thing is that in order to save data to a parquet file format first its necessary to convert the mongoRDD variable to Spark DataFrame. I have tried something like this:
// convert RDD to DataFrame
val myDf = mongoRDD.toDF() // this lines throws an error
myDF.write.save("my/path/myData.parquet")
and the error I get is this:
Exception in thread "main" scala.MatchError: java.lang.Object (of class scala.reflect.internal.Types.$TypeRef$$anon$6)
do you guys have any other idea how could I convert the RDD to a DataFrame so that I can save data in parquet format?
Here's the structure of one Document in the mongoDB collection : https://gist.github.com/kingtrocko/83a94238304c2d654fe4
Create a Case class representing the data stored in your DBObject.
case class Data(x: Int, s: String)
Then, map the values of your rdd to instances of your case class.
val dataRDD = mongoRDD.values.map { obj => Data(obj.get("x"), obj.get("s")) }
Now with your RDD[Data], you can create a DataFrame with the sqlContext
val myDF = sqlContext.createDataFrame(dataRDD)
That should get you going. I can explain more later if needed.