Save MongoDB data to parquet file format using Apache Spark - scala

I am a newbie with Apache Spark as well as with the Scala programming language.
What I am trying to achieve is to extract the data from my local MongoDB database and then save it in Parquet format using Apache Spark with the hadoop-connector.
This is my code so far:
package com.examples

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.{MongoInputFormat, BSONFileInputFormat}
import org.apache.spark.sql
import org.apache.spark.sql.SQLContext

object DataMigrator {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Migration App").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Import statement to implicitly convert an RDD to a DataFrame
    import sqlContext.implicits._

    val mongoConfig = new Configuration()
    mongoConfig.set("mongo.input.uri", "mongodb://localhost:27017/mongosails4.case")

    val mongoRDD = sc.newAPIHadoopRDD(mongoConfig, classOf[MongoInputFormat], classOf[Object], classOf[BSONObject])

    val count = mongoRDD.count()
    // the count value is approx 100,000
    println("================ PRINTING =====================")
    println(s"ROW COUNT IS $count")
    println("================ PRINTING =====================")
  }
}
The thing is that in order to save the data in Parquet format, it is first necessary to convert the mongoRDD variable to a Spark DataFrame. I have tried something like this:
// convert RDD to DataFrame
val myDF = mongoRDD.toDF() // this line throws an error
myDF.write.save("my/path/myData.parquet")
and the error I get is this:
Exception in thread "main" scala.MatchError: java.lang.Object (of class scala.reflect.internal.Types.$TypeRef$$anon$6)
Do you guys have any other idea how I could convert the RDD to a DataFrame so that I can save the data in Parquet format?
Here's the structure of one document in the mongoDB collection: https://gist.github.com/kingtrocko/83a94238304c2d654fe4

Create a Case class representing the data stored in your DBObject.
case class Data(x: Int, s: String)
Then, map the values of your RDD to instances of your case class (the values are BSON objects, so the fields need to be cast to the expected types).
val dataRDD = mongoRDD.values.map { obj => Data(obj.get("x").asInstanceOf[Int], obj.get("s").asInstanceOf[String]) }
Now with your RDD[Data], you can create a DataFrame with the sqlContext:
val myDF = sqlContext.createDataFrame(dataRDD)
That should get you going. I can explain more later if needed.
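For completeness, here is a minimal sketch of the last step towards the original goal, assuming the myDF built above; the output path is a placeholder to adapt.
// write the DataFrame out in Parquet format
myDF.write.parquet("my/path/myData.parquet") // placeholder output path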

Related

How to declare an empty dataset in Spark?

I am new to Spark and Spark Datasets. I was trying to declare an empty Dataset using emptyDataset, but it was asking for org.apache.spark.sql.Encoder. The data type I am using for the Dataset is an object of case class Tp(s1: String, s2: String, s3: String).
All you need is to import the implicit encoders from the SparkSession instance before you create the empty Dataset: import spark.implicits._
See the full example below.
EmptyDataFrame
package com.examples.sparksql

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object EmptyDataFrame {

  def main(args: Array[String]) {
    // Create Spark Conf
    val sparkConf = new SparkConf().setAppName("Empty-Data-Frame").setMaster("local")

    // Create Spark Context - sc
    val sc = new SparkContext(sparkConf)

    // Create Sql Context
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Import Sql Implicit conversions
    import sqlContext.implicits._
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType}

    // Create the schema
    val schema_string = "name,id,dept"
    val schema_rdd = StructType(schema_string.split(",").map(fieldName => StructField(fieldName, StringType, true)))

    // Create an empty DataFrame
    val empty_df = sqlContext.createDataFrame(sc.emptyRDD[Row], schema_rdd)

    // Some operations on the empty DataFrame
    empty_df.show()
    println(empty_df.count())

    // You can register a table on the empty DataFrame; it's an empty table though
    empty_df.registerTempTable("empty_table")

    // let's check it ;)
    val res = sqlContext.sql("select * from empty_table")
    res.show
  }
}
Alternatively you can convert an empty list into a Dataset:
import sparkSession.implicits._
case class Employee(name: String, id: Int)
val ds: Dataset[Employee] = List.empty[Employee].toDS()
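And if you want to use emptyDataset directly, as in the question, importing the implicits from the SparkSession makes the Encoder for the case class available; here is a minimal sketch (the session name and app settings are assumptions):
import org.apache.spark.sql.{Dataset, SparkSession}

case class Tp(s1: String, s2: String, s3: String)

val spark = SparkSession.builder().appName("empty-ds").master("local[*]").getOrCreate()
import spark.implicits._ // brings an implicit Encoder[Tp] into scope

val emptyTp: Dataset[Tp] = spark.emptyDataset[Tp] // no explicit Encoder argument needed
emptyTp.show()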

Converting error with RDD operation in Scala

I am new to Scala and I ran into an error while doing some practice.
I tried to convert an RDD into a DataFrame and the following is my code.
package com.sclee.examples

import com.sun.org.apache.xalan.internal.xsltc.compiler.util.IntType
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("examples").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    case class Person(name: String, age: Long)

    val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
    val df = personRDD.map({
      case Row(val1: String, val2: Long) => Person(val1, val2)
    }).toDS()
    // val ds = personRDD.toDS()
  }
}
I followed the instructions in the Spark documentation and also referenced some blogs showing me how to convert an RDD into a DataFrame, but I got the error below.
Error:(20, 27) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
val df = personRDD.map({
I tried to fix the problem by myself but failed. Any help would be appreciated.
The following code works:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

object SparkTest {
  def main(args: Array[String]): Unit = {
    // use the SparkSession of Spark 2
    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()
    import spark.implicits._

    // this is your RDD - just a sample of how to create an RDD
    val personRDD: RDD[Person] = spark.sparkContext.parallelize(Seq(Person("A", 10), Person("B", 20)))

    // the SparkSession has a method to convert an RDD to a Dataset
    val ds = spark.createDataset(personRDD)
    println(ds.count())
  }
}
I made the following changes:
- use SparkSession instead of SparkContext and SQLContext
- move the Person class out of the App (I'm not sure why I had to do this)
- use createDataset for the conversion
However, I guess it's pretty uncommon to do this conversion, and you probably want to read your input directly into a Dataset using the read method.
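For example, here is a minimal sketch of that approach, assuming a JSON input file whose records carry name and age fields (the path is a placeholder):
import spark.implicits._

// read the file as a DataFrame and convert it into a typed Dataset[Person]
val peopleDS = spark.read
  .json("path/to/people.json") // placeholder path
  .as[Person]

peopleDS.show()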

Spark dataframe join is failing if key column contains a period(".") in the end

I am getting the exception below when I join two dataframes in Spark (version 1.5, Scala 2.10).
Exception in thread "main" org.apache.spark.sql.AnalysisException: syntax error in attribute name: col1.;
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:118)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:182)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:653)
at com.nielsen.buy.integration.commons.Demo$.main(Demo.scala:62)
at com.nielsen.buy.integration.commons.Demo.main(Demo.scala)
The code works fine if the column in the dataframe does not contain a period. Please do help me out.
Here is the code that I am using:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import com.google.gson.Gson
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row

object Demo {
  lazy val sc: SparkContext = {
    val conf = new SparkConf().setMaster("local")
      .setAppName("demooo")
      .set("spark.driver.allowMultipleContexts", "true")
    new SparkContext(conf)
  }
  sc.setLogLevel("ERROR")
  lazy val sqlcontext = new SQLContext(sc)

  val data = List(Row("a", "b"), Row("v", "b"))
  val dataRdd = sc.parallelize(data)
  val schema = new StructType(Array(StructField("col.1", StringType, true), StructField("col2", StringType, true)))
  val df1 = sqlcontext.createDataFrame(dataRdd, schema)

  val data2 = List(Row("a", "b"), Row("v", "b"))
  val dataRdd2 = sc.parallelize(data2)
  val schema2 = new StructType(Array(StructField("col3", StringType, true), StructField("col4", StringType, true)))
  val df2 = sqlcontext.createDataFrame(dataRdd2, schema2)

  val val1 = "col.1"
  val df3 = df1.join(df2, df1.col(val1).equalTo(df2.col("col3")), "outer").show
}
In general, a period is used to access members of a struct field.
The spark version you are using (1.5) is relatively old. Several such issues were fixed in later versions so if you upgrade it might just solve the issue.
That said, you can simply use withColumnRenamed to rename the column to something which does not have a period before the join.
So you basically do something like this:
val dfTmp = df1.withColumnRenamed(val1, "JOIN_COL")
val df3= dfTmp.join(df2,dfTmp.col("JOIN_COL").equalTo(df2.col("col3")),"outer").withColumnRenamed("JOIN_COL", val1)
df3.show
By the way, show returns Unit, so you probably meant df3 to be the expression without it and then call df3.show separately.
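Another option that may avoid the rename altogether is to escape the column name with backticks when referring to it, which Spark's attribute parser treats as a quoted name (worth verifying on your Spark version):
// refer to the dotted column by quoting it with backticks
val df3 = df1.join(df2, df1.col("`col.1`").equalTo(df2.col("col3")), "outer")
df3.show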

Does spark supports multiple output file with parquet format

The business case is that we'd like to split a big Parquet file into small ones, using one column as the partition. We've tested using dataframe.partition("xxx").write(...). It took about 1 hour for 100K records. So we are going to use map reduce to generate different Parquet files in different folders. Sample code:
import org.apache.hadoop.io.NullWritable
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String] + "/aa"
}

object Split {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SplitTest")
    val sc = new SparkContext(conf)
    sc.parallelize(List(("w", "www"), ("b", "blog"), ("c", "com"), ("w", "bt")))
      .map(value => (value._1, value._2 + "Test"))
      .partitionBy(new HashPartitioner(3)) //.saveAsNewAPIHadoopFile(path, keyClass, valueClass, outputFormatClass, conf)
      .saveAsHadoopFile(args(0), classOf[String], classOf[String], classOf[RDDMultipleTextOutputFormat])
    sc.stop()
  }
}
The sample above just generates text files; how can I generate Parquet files with MultipleOutputFormat?
Spark supports Parquet partitioning since 1.4.0 (1.5+ syntax):
df.write.partitionBy("some")
and bucketing since 2.0.0 (the first argument to bucketBy is the number of buckets):
df.write.bucketBy(42, "some")
with an optional sortBy clause.
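To address the original goal of one folder per column value, here is a minimal sketch using the DataFrame writer; the partition column name and output path are placeholders. partitionBy creates one subdirectory per distinct value, e.g. /path/to/output/xxx=w/, each containing Parquet part files:
// write one Parquet subdirectory per distinct value of the partition column
df.write
  .partitionBy("xxx")          // placeholder partition column
  .parquet("/path/to/output")  // placeholder output path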

Saving DataStream data into MongoDB / converting DS to DF

I am able to save a DataFrame to MongoDB, but my Spark Streaming program gives a DStream (kafkaStream) and I am not able to save it to MongoDB, nor am I able to convert this DStream to a DataFrame. Is there any library or method to do this? Any input is highly appreciated.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaSparkStream {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaReceiver")
    val ssc = new StreamingContext(conf, Seconds(10))
    val kafkaStream = KafkaUtils.createStream(ssc,
      "localhost:2181", "spark-streaming-consumer-group", Map("topic" -> 25))
    kafkaStream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Save a DF to mongodb - SUCCESS
val mongoDbFormat = "com.stratio.datasource.mongodb"
val mongoDbDatabase = "mongodatabase"
val mongoDbCollection = "mongodf"

val mongoDbOptions = Map(
  MongodbConfig.Host -> "localhost:27017",
  MongodbConfig.Database -> mongoDbDatabase,
  MongodbConfig.Collection -> mongoDbCollection
)

// with DataFrame methods
dataFrame.write
  .format(mongoDbFormat)
  .mode(SaveMode.Append)
  .options(mongoDbOptions)
  .save()
Access the underlying RDDs of the DStream using foreachRDD, transform each one to a DataFrame, and use your DataFrame code on it.
The easiest way to transform an RDD into a DataFrame is to first map the data into a schema, represented in Scala by a case class:
case class Element(...)

val elementDStream = kafkaDStream.map(entry => Element(entry, ...))
elementDStream.foreachRDD { rdd =>
  val df = rdd.toDF
  df.write(...)
}
Also, watch out for Spark 2.0, where this process will change completely with the introduction of Structured Streaming, and where a MongoDB connection will become a sink.
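Putting the two pieces together, here is a minimal sketch of writing each micro-batch to MongoDB with the Stratio datasource shown in the question; the Element schema, the mapping from the Kafka messages, and the way the SQLContext is obtained are assumptions you will need to adapt:
// MongodbConfig is imported the same way as in the DataFrame example above
import org.apache.spark.sql.{SQLContext, SaveMode}

case class Element(value: String) // hypothetical schema for the Kafka messages

// kafkaStream yields (key, message) pairs; keep only the message payload
val elementDStream = kafkaStream.map { case (_, msg) => Element(msg) }

elementDStream.foreachRDD { rdd =>
  // reuse (or lazily create) an SQLContext for this batch
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  rdd.toDF()
    .write
    .format("com.stratio.datasource.mongodb")
    .mode(SaveMode.Append)
    .options(Map(
      MongodbConfig.Host -> "localhost:27017",
      MongodbConfig.Database -> "mongodatabase",
      MongodbConfig.Collection -> "mongodf"
    ))
    .save()
}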