How to use SQLContext and SparkContext inside foreachPartition - scala

I want to use SparkContext and SQLContext inside foreachPartition, but unable to do it due to serialization error. I know that both objects are not serializable, but I thought that foreachPartition is executed on the master, where both Spark Context and SQLContext are available..
Notation:
`msg -> Map[String,String]`
`result -> Iterable[Seq[Row]]`
This is my current code (UtilsDM is an object that extends Serializable). The part of code that fails starts from val schema =..., where I want to write result to the DataFrame and then save it to Parquet. Maybe the way I organized the code is inefficient, then I'd like to here your recommendations. Thanks.
// Here I am creating df from parquet file on S3
val exists = FileSystem.get(new URI("s3n://" + bucketNameCode), sc.hadoopConfiguration).exists(new Path("s3n://" + bucketNameCode + "/" + pathToSentMessages))
var df: DataFrame = null
if (exists) {
df = sqlContext
.read.parquet("s3n://bucket/pathToParquetFile")
}
UtilsDM.setDF(df)
// Here I process myDStream
myDStream.foreachRDD(rdd => {
rdd.foreachPartition{iter =>
val r = new RedisClient(UtilsDM.getHost, UtilsDM.getPort)
val producer = UtilsDM.createProducer
var df = UtilsDM.getDF
val result = iter.map{ msg =>
// ...
Seq(msg("key"),msg("value"))
}
// HERE I WANT TO WRITE result TO S3, BUT IT FAILS
val schema = StructType(
StructField("key", StringType, true) ::
StructField("value", StringType, true)
result.foreach { row =>
val rdd = sc.makeRDD(row)
val df2 = sqlContext.createDataFrame(rdd, schema)
// If the parquet file is not created, then create it
var df_final: DataFrame = null
if (df != null) {
df_final = df.unionAll(df2)
} else {
df_final = df2
}
df_final.write.parquet("s3n://bucket/pathToSentMessages)
}
}
})
EDIT:
I am using Spark 1.6.2 and Scala 2.10.6.

It is not possible. SparkContext, SQLContext and SparkSession can be used only on the driver. You can use sqlContext in the top level of foreachRDD:
myDStream.foreachRDD(rdd => {
val df = sqlContext.createDataFrame(rdd, schema)
...
})
You cannot use it in transformation / action:
myDStream.foreachRDD(rdd => {
rdd.foreach {
val df = sqlContext.createDataFrame(...)
...
}
})
You probably want equivalent of:
myDStream.foreachRDD(rdd => {
val foo = rdd.mapPartitions(iter => doSomethingWithRedisClient(iter))
val df = sqlContext.createDataFrame(foo, schema)
df.write.parquet("s3n://bucket/pathToSentMessages)
})

I found out that using an existing SparkContext (assume I have created a sparkContext sc beforehand) inside a loop works i.e.
// this works
stream.foreachRDD( _ => {
// update rdd
.... = SparkContext.getOrCreate().parallelize(...)
})
// this doesn't work - throws a SparkContext not serializable error
stream.foreachRDD( _ => {
// update rdd
.... = sc.parallelize(...)
})

Related

Type Mismatch Spark Scala

I am trying to create an empty dataframe an using it on a function but I am having the following error all time:
Required: DataFrame
Found: Dataset[DataFrame]
This is how I am doing it:
//Create empty DataFrame
val schema = StructType(
StructField("g", StringType, true) ::
StructField("tg", StringType, true) :: Nil)
var df1 = spark.createDataFrame(spark.sparkContext
.emptyRDD[Row], schema)
//or
var df1 = spark.emptyDataFrame
Then I try to use it calling a functions as you can see here:
df1 = kvrdd1_toDF.map(x => function1(x, df1))
And this is the function:
def function1(input: org.apache.spark.sql.Row, df: DataFrame): DataFrame = {
val v1 = spark.sparkContext.parallelize(Seq("g","tg"))
var df3 = v1.toDF("g","tg")
if (df.take(1).isEmpty){
df3 = Seq((input.get(2), "nn")).toDF("g", "tg")
} else {
df3 = df3.union(df)
}
df3
}
What am I doing wrong?
You have a DataFrame which is an alias for Dataset[Row]. You map that Row to a DataFrame so that's how you end up with a Dataset[DataFrame]. I don't know what you are trying to do but it will never work. The functions (and all its dependencies) you use to map the contents of a Dataset are serialized and distributed over your spark cluster. You can't use another DataFrame or a SparkSession or SparkContext in such a function.

Load External RDD Periodically during Spark Streaming [duplicate]

I want to use SparkContext and SQLContext inside foreachPartition, but unable to do it due to serialization error. I know that both objects are not serializable, but I thought that foreachPartition is executed on the master, where both Spark Context and SQLContext are available..
Notation:
`msg -> Map[String,String]`
`result -> Iterable[Seq[Row]]`
This is my current code (UtilsDM is an object that extends Serializable). The part of code that fails starts from val schema =..., where I want to write result to the DataFrame and then save it to Parquet. Maybe the way I organized the code is inefficient, then I'd like to here your recommendations. Thanks.
// Here I am creating df from parquet file on S3
val exists = FileSystem.get(new URI("s3n://" + bucketNameCode), sc.hadoopConfiguration).exists(new Path("s3n://" + bucketNameCode + "/" + pathToSentMessages))
var df: DataFrame = null
if (exists) {
df = sqlContext
.read.parquet("s3n://bucket/pathToParquetFile")
}
UtilsDM.setDF(df)
// Here I process myDStream
myDStream.foreachRDD(rdd => {
rdd.foreachPartition{iter =>
val r = new RedisClient(UtilsDM.getHost, UtilsDM.getPort)
val producer = UtilsDM.createProducer
var df = UtilsDM.getDF
val result = iter.map{ msg =>
// ...
Seq(msg("key"),msg("value"))
}
// HERE I WANT TO WRITE result TO S3, BUT IT FAILS
val schema = StructType(
StructField("key", StringType, true) ::
StructField("value", StringType, true)
result.foreach { row =>
val rdd = sc.makeRDD(row)
val df2 = sqlContext.createDataFrame(rdd, schema)
// If the parquet file is not created, then create it
var df_final: DataFrame = null
if (df != null) {
df_final = df.unionAll(df2)
} else {
df_final = df2
}
df_final.write.parquet("s3n://bucket/pathToSentMessages)
}
}
})
EDIT:
I am using Spark 1.6.2 and Scala 2.10.6.
It is not possible. SparkContext, SQLContext and SparkSession can be used only on the driver. You can use sqlContext in the top level of foreachRDD:
myDStream.foreachRDD(rdd => {
val df = sqlContext.createDataFrame(rdd, schema)
...
})
You cannot use it in transformation / action:
myDStream.foreachRDD(rdd => {
rdd.foreach {
val df = sqlContext.createDataFrame(...)
...
}
})
You probably want equivalent of:
myDStream.foreachRDD(rdd => {
val foo = rdd.mapPartitions(iter => doSomethingWithRedisClient(iter))
val df = sqlContext.createDataFrame(foo, schema)
df.write.parquet("s3n://bucket/pathToSentMessages)
})
I found out that using an existing SparkContext (assume I have created a sparkContext sc beforehand) inside a loop works i.e.
// this works
stream.foreachRDD( _ => {
// update rdd
.... = SparkContext.getOrCreate().parallelize(...)
})
// this doesn't work - throws a SparkContext not serializable error
stream.foreachRDD( _ => {
// update rdd
.... = sc.parallelize(...)
})

how to get Dataframe from list spark scala

I have a list of RDD. I have iterated the rdd and for each elemet of rdd I am doing some parsing logic. Finally I getting
val mRdd = nRdd.map {
ele => //parsing logic, I have the below field
colum = Array[String] // example ['id','name','dept']<br>
c_type = Array[String] // example ['Int','String','String']<br>
value = ArrayBuffer[String] // [1,lucy,it][2,denis,cs]<br>
}
How I can get the list of dataframe in mRdd
I tried a logic to create dataframe, in this case I have to rdd first. But I can't create rdd inside rdd.
I am new in spark. I am using spark 1.6.3
Please help me
In order to convert an RDD into a Dataframe, you would need to do the following:
Approach 1 - Use createDaframe function:
val mRdd: Seq[DataFrame] = nRdd.map {ele =>
val parsedRDD = ele //apply parse logic here
val schema = StructType(Seq(
StructField("id", IntegerType),
StructField("name", StringType),
StructField("dept", StringType)
))
createDataframe(parsedRDD, schema)
}
Read more about this approach here: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
Approach 2 - Use toDF implicit function:
import sqlContext.implicits._
val mRdd: Seq[DataFrame] = nRdd.map {ele =>
val parsedRDD = ele //apply parse logic here
val columns = Seq("id", "name", "dept")
parsedRDD.toDF(columns: _*)
}

converting textFile to dataFrame dynamically

I am trying to convert input from a text file to dataframe using a schema file which is read at run time.
My input text file looks like this:
John,23
Charles,34
The schema file looks like this:
name:string
age:integer
This is what I tried:
object DynamicSchema {
def main(args: Array[String]) {
val inputFile = args(0)
val schemaFile = args(1)
val schemaLines = Source.fromFile(schemaFile, "UTF-8").getLines().map(_.split(":")).map(l => l(0) -> l(1)).toMap
val spark = SparkSession.builder()
.master("local[*]")
.appName("Dynamic Schema")
.getOrCreate()
import spark.implicits._
val input = spark.sparkContext.textFile(args(0))
val schema = spark.sparkContext.broadcast(schemaLines)
val nameToType = {
Seq(IntegerType,StringType)
.map(t => t.typeName -> t).toMap
}
println(nameToType)
val fields = schema.value
.map(field => StructField(field._1, nameToType(field._2), nullable = true)).toSeq
val schemaStruct = StructType(fields)
val rowRDD = input
.map(_.split(","))
.map(attributes => Row.fromSeq(attributes))
val peopleDF = spark.createDataFrame(rowRDD, schemaStruct)
peopleDF.printSchema()
// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")
// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")
results.show()
}
}
Though the printSchema gives the desired result, result.show errors out. I think the age field actually needs to be converted using toInt. Is there a way to achieve the same when the schema is only available at runtime?
Replace
val input = spark.sparkContext.textFile(args(0))
with
val input = spark.read.schema(schemaStruct).csv(args(0))
and move it after schema definition.

Spark: Save Dataframe in ORC format

In the previous version, we used to have a 'saveAsOrcFile()' method on RDD. This is now gone! How do I save data in DataFrame in ORC File format?
def main(args: Array[String]) {
println("Creating Orc File!")
val sparkConf = new SparkConf().setAppName("orcfile")
val sc = new SparkContext(sparkConf)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val people = sc.textFile("/apps/testdata/people.txt")
val schemaString = "name age"
val schema = StructType(schemaString.split(" ").map(fieldName => {if(fieldName == "name") StructField(fieldName, StringType, true) else StructField(fieldName, IntegerType, true)}))
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), new Integer(p(1).trim)))
//# Infer table schema from RDD**
val peopleSchemaRDD = hiveContext.createDataFrame(rowRDD, schema)
//# Create a table from schema**
peopleSchemaRDD.registerTempTable("people")
val results = hiveContext.sql("SELECT * FROM people")
results.map(t => "Name: " + t.toString).collect().foreach(println)
// Now I want to save this Dataframe(peopleSchemaRDD) in ORC Format. How do I do that?
}
Since Spark 1.4 you can simply use DataFrameWriter and set format to orc:
peopleSchemaRDD.write.format("orc").save("people")
or
peopleSchemaRDD.write.orc("people")