Saving DataStream data into MongoDB / converting DS to DF - mongodb

I am able to save a DataFrame to MongoDB, but my Spark Streaming program gives me a DStream (kafkaStream) and I am not able to save it to MongoDB, nor am I able to convert this DStream to a DataFrame. Is there any library or method to do this? Any inputs are highly appreciated.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaSparkStream {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaReceiver")
    val ssc = new StreamingContext(conf, Seconds(10))
    val kafkaStream = KafkaUtils.createStream(ssc,
      "localhost:2181", "spark-streaming-consumer-group", Map("topic" -> 25))
    kafkaStream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Save a DF to mongodb - SUCCESS
val mongoDbFormat = "com.stratio.datasource.mongodb"
val mongoDbDatabase = "mongodatabase"
val mongoDbCollection = "mongodf"

val mongoDbOptions = Map(
  MongodbConfig.Host -> "localhost:27017",
  MongodbConfig.Database -> mongoDbDatabase,
  MongodbConfig.Collection -> mongoDbCollection
)

// with DataFrame methods
dataFrame.write
  .format(mongoDbFormat)
  .mode(SaveMode.Append)
  .options(mongoDbOptions)
  .save()

Access the underlying RDDs of the DStream using foreachRDD, transform each one to a DataFrame, and use your DataFrame write logic on it.
The easiest way to transform an RDD into a DataFrame is to first map the data into a schema, represented in Scala by a case class:
case class Element(...)

val elementDStream = kafkaDStream.map(entry => Element(entry, ...))
elementDStream.foreachRDD { rdd =>
  val df = rdd.toDF
  df.write(...)
}
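For completeness, here is a minimal end-to-end sketch that combines this pattern with the Stratio format and options from the working DataFrame example above. The Element fields and the key/value mapping of the Kafka stream are assumptions for illustration, not code from the original answer:
import org.apache.spark.sql.{SQLContext, SaveMode}

// hypothetical schema for the Kafka messages
case class Element(key: String, value: String)

// KafkaUtils.createStream yields a DStream[(String, String)] of (key, message)
val elementDStream = kafkaStream.map { case (key, value) => Element(key, value) }

elementDStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  rdd.toDF()
    .write
    .format(mongoDbFormat)      // "com.stratio.datasource.mongodb", as above
    .mode(SaveMode.Append)
    .options(mongoDbOptions)    // host/database/collection map, as above
    .save()
}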
Also, watch out for Spark 2.0, where this process will change completely with the introduction of Structured Streaming, in which a MongoDB connection will become a sink.

Related

How to perform Unit testing on Spark Structured Streaming?

I would like to know about the unit-testing side of Spark Structured Streaming. My scenario is: I am getting data from Kafka, consuming it with Spark Structured Streaming, and applying some transformations on top of the data.
I am not sure how I can test this using Scala and Spark. Can someone tell me how to do unit testing in Structured Streaming using Scala? I am new to streaming.
tl;dr Use MemoryStream to add events and a memory sink for the output.
The following code should help you get started:
import org.apache.spark.sql.execution.streaming.MemoryStream

implicit val sqlCtx = spark.sqlContext
import spark.implicits._

val events = MemoryStream[Event]
val sessions = events.toDS
assert(sessions.isStreaming, "sessions must be a streaming Dataset")

// use the sessions event stream to apply the required transformations
val transformedSessions = ...

val streamingQuery = transformedSessions
  .writeStream
  .format("memory")
  .queryName(queryName)
  .option("checkpointLocation", checkpointLocation)
  .outputMode(queryOutputMode)
  .start

// Add events to MemoryStream as if they came from Kafka
val batch = Seq(
  eventGen.generate(userId = 1, offset = 1.second),
  eventGen.generate(userId = 2, offset = 2.seconds))
val currentOffset = events.addData(batch)
streamingQuery.processAllAvailable()
events.commit(currentOffset.asInstanceOf[LongOffset])

// check the output
// The output is in the queryName table
// The following code simply shows the result
spark
  .table(queryName)
  .show(truncate = false)
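Note that this snippet assumes an Event case class, an eventGen helper, and the queryName, checkpointLocation and queryOutputMode values, none of which are defined in the answer itself. A hypothetical sketch of what they could look like, only to make the example self-contained:
import java.sql.Timestamp
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.OutputMode

// hypothetical event schema and generator for the snippet above
case class Event(userId: Int, time: Timestamp)

object eventGen {
  private val start = System.currentTimeMillis()
  def generate(userId: Int, offset: FiniteDuration): Event =
    Event(userId, new Timestamp(start + offset.toMillis))
}

val queryName = "testQuery"                      // name of the in-memory result table
val checkpointLocation = "/tmp/test-checkpoint"  // example path
val queryOutputMode = OutputMode.Append()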
So, I tried to implement the answer from @Jacek, but I couldn't figure out how to create the eventGen object, and I also wanted to test a small streaming application that writes data to the console. I am also using MemoryStream, and here I show a small working example.
The class that I am testing is:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.{DataFrame, SparkSession, functions}

object StreamingDataFrames {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName(StreamingDataFrames.getClass.getSimpleName)
      .master("local[2]")
      .getOrCreate()
    val lines = readData(spark, "socket")
    val streamingQuery = writeData(lines)
    streamingQuery.awaitTermination()
  }

  def readData(spark: SparkSession, source: String = "socket"): DataFrame = {
    val lines: DataFrame = spark.readStream
      .format(source)
      .option("host", "localhost")
      .option("port", 12345)
      .load()
    lines
  }

  def writeData(df: DataFrame,
                sink: String = "console",
                queryName: String = "calleventaggs",
                outputMode: String = "append"): StreamingQuery = {
    println(s"Is this a streaming data frame: ${df.isStreaming}")

    val shortLines: DataFrame = df.filter(functions.length(col("value")) >= 3)

    val query = shortLines.writeStream
      .format(sink)
      .queryName(queryName)
      .outputMode(outputMode)
      .start()
    query
  }
}
I test only the writeData method. That is why I split the query into two methods.
Here is the Spec to test the class. I use a SharedSparkSession class to facilitate opening and closing the Spark context, as shown here.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}
import org.github.explore.spark.SharedSparkSession
import org.scalatest.funsuite.AnyFunSuite

class StreamingDataFramesSpec extends AnyFunSuite with SharedSparkSession {

  test("spark structured streaming can read from memory socket") {
    // We can import sql implicits
    implicit val sqlCtx = sparkSession.sqlContext
    import sqlImplicits._

    val events = MemoryStream[String]
    val queryName: String = "calleventaggs"

    // Add events to MemoryStream as if they came from Kafka
    val batch = Seq(
      "this is a value to read",
      "and this is another value"
    )
    val currentOffset = events.addData(batch)

    val streamingQuery = StreamingDataFrames.writeData(events.toDF(), "memory", queryName)
    streamingQuery.processAllAvailable()
    events.commit(currentOffset.asInstanceOf[LongOffset])

    val result: DataFrame = sparkSession.table(queryName)
    result.show

    streamingQuery.awaitTermination(1000L)
    assertResult(batch.size)(result.count)

    val values = result.take(2)
    assertResult(batch(0))(values(0).getString(0))
    assertResult(batch(1))(values(1).getString(0))
  }
}

How to convert RDD of Avro's GenericData.Record to DataFrame?

Perhaps this question may seem a bit abstract; here it is:
val originalAvroSchema : Schema = // read from a file
val rdd : RDD[GenericData.Record] = // From some streaming source
// Looking for a handy:
val df: DataFrame = rdd.toDF(schema)
I explored spark-avro, but it only supports reading from a file, not from an existing RDD.
import com.databricks.spark.avro._
val sqlContext = new SQLContext(sc)
val rdd : RDD[MyAvroRecord] = ...
val df = rdd.toAvroDF(sqlContext)
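There is no built-in toAvroDF for RDDs, but one workable approach is to build the Spark SQL schema from the Avro schema and map each GenericData.Record to a Row by hand. The following is a rough sketch, not from the original question or from spark-avro's API; it assumes a flat record whose fields are primitives or strings, and nested records, unions and logical types would need extra handling:
import scala.collection.JavaConverters._
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
import org.apache.avro.util.Utf8
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types._

def avroRddToDF(sqlContext: SQLContext,
                rdd: RDD[GenericData.Record],
                avroSchema: Schema): DataFrame = {

  // translate flat primitive Avro types to Spark SQL types
  def toSqlType(s: Schema): DataType = s.getType match {
    case Schema.Type.STRING  => StringType
    case Schema.Type.INT     => IntegerType
    case Schema.Type.LONG    => LongType
    case Schema.Type.FLOAT   => FloatType
    case Schema.Type.DOUBLE  => DoubleType
    case Schema.Type.BOOLEAN => BooleanType
    case other               => throw new IllegalArgumentException(s"unsupported type $other")
  }

  val fields = avroSchema.getFields.asScala.map(f => StructField(f.name, toSqlType(f.schema), nullable = true))
  val sqlSchema = StructType(fields)
  val fieldNames = fields.map(_.name).toList   // only this (serializable) list is captured below

  val rowRdd = rdd.map { record =>
    Row.fromSeq(fieldNames.map { name =>
      record.get(name) match {
        case utf8: Utf8 => utf8.toString       // Avro strings arrive as Utf8
        case other      => other
      }
    })
  }
  sqlContext.createDataFrame(rowRdd, sqlSchema)
}
With that in place, the desired one-liner becomes roughly val df = avroRddToDF(sqlContext, rdd, originalAvroSchema).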

Can I convert an incoming stream of data into an array?

I'm trying to learn streaming data and how to manipulate it using the telecom churn dataset provided here. I've written a method to do the analysis in batch first:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD, LogisticRegressionWithLBFGS, LogisticRegressionModel, NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

object batchChurn {

  def main(args: Array[String]): Unit = {

    // setting spark context
    val conf = new SparkConf().setAppName("churn")
    val sc = new SparkContext(conf)

    // loading and mapping data into RDD
    val csv = sc.textFile("file://filename.csv")
    val data = csv.map { line =>
      val parts = line.split(",").map(_.trim)
      val stringvec = Array(parts(1)) ++ parts.slice(4, 20)
      val label = parts(20).toDouble
      val vec = stringvec.map(_.toDouble)
      LabeledPoint(label, Vectors.dense(vec))
    }

    val splits = data.randomSplit(Array(0.7, 0.3))
    val (training, testing) = (splits(0), splits(1))

    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val numTrees = 6
    val featureSubsetStrategy = "auto"
    val impurity = "gini"
    val maxDepth = 7
    val maxBins = 32

    val model = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo,
      numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

    val labelAndPreds = testing.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
  }
}
I've had no problems with this. Now, I looked at the NetworkWordCount example provided on the spark website, and changed the code slightly to see how it would behave.
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("127.0.0.1", 9999)
val data = lines.flatMap(_.split(","))
My question is: is it possible to convert this DStream into an array which I can feed into my analysis code? Currently, when I try to convert it to an Array after val data = lines.flatMap(_.split(",")), it clearly says: error: value toArray is not a member of org.apache.spark.streaming.dstream.DStream[String]
Your DStream contains many RDDs; you can get access to them using the foreachRDD function.
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/streaming/dstream/DStream.html#foreachRDD(scala.Function1)
Each RDD can then be converted to an array using the collect function.
This has already been shown here:
For each RDD in a DStream how do I convert this to an array or some other typical Java data type?
DStream.foreachRDD gives you an RDD[String] for each interval. Of course, you could collect everything in an array:
import scala.collection.mutable.ArrayBuffer

val arr = new ArrayBuffer[String]()
data.foreachRDD {
  arr ++= _.collect()
}
Also keep in mind that you could end up with far more data than you want in your driver, since a DStream can be huge.
To limit the data for your analysis, I would do it this way:
data.slice(new Time(fromMillis), new Time(toMillis)).flatMap(_.collect()).toSet
You cannot put all the elements of a DStream in an array, because those elements keep being read over the wire, and your array would have to be indefinitely extensible.
Adapting this decision tree model to a streaming mode, where training and testing data arrive continuously, is not trivial for algorithmic reasons. While the answers mentioning collect are technically correct, they are not the appropriate solution to what you're trying to do.
If you want to run decision trees on a stream in Spark, you may want to look at Hoeffding trees.

Save MongoDB data to parquet file format using Apache Spark

I am a newbie with Apache Spark as well as with the Scala programming language.
What I am trying to achieve is to extract the data from my local MongoDB database and then save it in Parquet format using Apache Spark with the hadoop-connector.
This is my code so far:
package com.examples

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.{MongoInputFormat, BSONFileInputFormat}
import org.apache.spark.sql
import org.apache.spark.sql.SQLContext

object DataMigrator {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Migration App").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Import statement to implicitly convert an RDD to a DataFrame
    import sqlContext.implicits._

    val mongoConfig = new Configuration()
    mongoConfig.set("mongo.input.uri", "mongodb://localhost:27017/mongosails4.case")

    val mongoRDD = sc.newAPIHadoopRDD(mongoConfig, classOf[MongoInputFormat], classOf[Object], classOf[BSONObject])

    val count = mongoRDD.count()
    // the count value is approx. 100,000
    println("================ PRINTING =====================")
    println(s"ROW COUNT IS $count")
    println("================ PRINTING =====================")
  }
}
The thing is that, in order to save data in Parquet file format, it is first necessary to convert the mongoRDD variable to a Spark DataFrame. I have tried something like this:
// convert RDD to DataFrame
val myDF = mongoRDD.toDF() // this line throws an error
myDF.write.save("my/path/myData.parquet")
and the error I get is this:
Exception in thread "main" scala.MatchError: java.lang.Object (of class scala.reflect.internal.Types.$TypeRef$$anon$6)
Do you guys have any other idea how I could convert the RDD to a DataFrame so that I can save the data in Parquet format?
Here's the structure of one Document in the mongoDB collection : https://gist.github.com/kingtrocko/83a94238304c2d654fe4
Create a case class representing the data stored in your DBObject:
case class Data(x: Int, s: String)
Then map the values of your RDD to instances of your case class. Note that BSONObject.get returns Object, so the values need to be cast to the expected types:
val dataRDD = mongoRDD.values.map { obj =>
  Data(obj.get("x").asInstanceOf[Int], obj.get("s").asInstanceOf[String])
}
Now, with your RDD[Data], you can create a DataFrame with the sqlContext:
val myDF = sqlContext.createDataFrame(dataRDD)
That should get you going. I can explain more later if needed.
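From there, writing the result out in Parquet format (the original goal) is just one more call; a minimal sketch, with the output path being a placeholder:
// Hedged follow-up: persist the DataFrame built above as Parquet
// ("my/path/myData.parquet" is just an example path).
myDF.write.mode("overwrite").parquet("my/path/myData.parquet")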

Saving twitter streams into a single file with spark streaming, scala

So, after help from this answer (Spark Streaming: Join Dstream batches into single output Folder), I was able to create a single file for my Twitter streams. However, now I don't see any tweets being saved in this file. Please find below my code snippet. What am I doing wrong?
val ssc = new StreamingContext(sparkConf, Seconds(5))
val stream = TwitterUtils.createStream(ssc, None, filters)

val tweets = stream.map(r => r.getText)
tweets.foreachRDD { rdd =>
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  import sqlContext.implicits._
  val df = rdd.map(t => Record(t)).toDF()
  df.save("com.databricks.spark.csv", SaveMode.Append, Map("path" -> "tweetstream.csv"))
}

ssc.start()
ssc.awaitTermination()
}