I have avro message and .avsc file. I have generated the java class from .avsc file. Now I want to convert the avro(json) message into data frame. I read the message. Successfully decoded the message and I got RDD[Product] but I am unable to convert RDD[Product] into dataframe. I need to save the message in .avro format.
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("test").setMaster("local[*]")
val spark = SparkSession.builder().config(conf).getOrCreate()
import spark.implicits._
val rdd = spark.read.textFile("/Users/lucy/product_avro.json").rdd
val rdd1 = rdd.map(string => toProduct(string))
spark.createDataFrame(rdd1, classOf[Product]) // not working
}
def toProduct(input: String): Product = {
return new SpecificDatumReader[Product](Product.SCHEMA$)
.read(null, DecoderFactory.get().jsonDecoder(Product.SCHEMA$, input))
}
Error: java.lang.UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class org.apache.avro.Schema
Related
This question already has an answer here:
Custom partiotioning of JavaDStreamPairRDD
(1 answer)
Closed 4 years ago.
I am trying to save my Pair Rdd in spark streaming but getting error while saving at last step .
Here is my sample code
def main(args: Array[String]) {
val inputPath = args(0)
val output = args(1)
val noOfHashPartitioner = args(2).toInt
println("IN Streaming ")
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val sc = new SparkContext(conf)
val hadoopConf = sc.hadoopConfiguration;
//hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
val ssc = new org.apache.spark.streaming.StreamingContext(sc, Seconds(60))
val input = ssc.textFileStream(inputPath)
val pairedRDD = input.map(row => {
val split = row.split("\\|")
val fileName = split(0)
val fileContent = split(1)
(fileName, fileContent)
})
import org.apache.hadoop.io.NullWritable
import org.apache.spark.HashPartitioner
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
class RddMultiTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.asInstanceOf[String]
}
//print(pairedRDD)
pairedRDD.partitionBy(new HashPartitioner(noOfHashPartitioner)).saveAsHadoopFile(output, classOf[String], classOf[String], classOf[RddMultiTextOutputFormat], classOf[GzipCodec])
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
}
I am getting at last step while saving .I am new to spark streaming so must be missing something here .
Getting error like
value partitionBy is not a member of
org.apache.spark.streaming.dstream.DStream[(String, String)]
Please help
pairedRDD is of type DStream[(String, String)] not RDD[(String,String)]. The method partitionBy is not available on DStreams.
Maybe look into foreachRDD which should be available on DStreams.
EDIT: A bit more context explanation textFileStream will set up a directory watch on the specified path and whenever there are new files will stream the content. so that's where the stream aspect comes from. Is that what you want? or do you just want to read the content of the directory "as is" once? Then there's readTextFiles which will return a non-stream container.
I want to tail Mongo oplog and stream it through Kafka. So I found debezium Kafka CDC connector which tails the Mongo oplog with their in-build serialisation technique.
Schema registry uses below convertor for the serialization,
'key.converter=io.confluent.connect.avro.AvroConverter' and
'value.converter=io.confluent.connect.avro.AvroConverter'
Below are the library dependencies I'm using in the project
libraryDependencies += "io.confluent" % "kafka-avro-serializer" % "3.1.2"
libraryDependencies += "org.apache.kafka" % "kafka-streams" % "0.10.2.0
Below is the streaming code which deserialize Avro data
import org.apache.spark.sql.{Dataset, SparkSession}
import io.confluent.kafka.schemaregistry.client.rest.RestService
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import org.apache.avro.Schema
import scala.collection.JavaConverters._
object KafkaStream{
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession
.builder
.master("local")
.appName("kafka")
.getOrCreate()
sparkSession.sparkContext.setLogLevel("ERROR")
import sparkSession.implicits._
case class DeserializedFromKafkaRecord(key: String, value: String)
val schemaRegistryURL = "http://127.0.0.1:8081"
val topicName = "productCollection.inventory.Product"
val subjectValueName = topicName + "-value"
//create RestService object
val restService = new RestService(schemaRegistryURL)
//.getLatestVersion returns io.confluent.kafka.schemaregistry.client.rest.entities.Schema object.
val valueRestResponseSchema = restService.getLatestVersion(subjectValueName)
//Use Avro parsing classes to get Avro Schema
val parser = new Schema.Parser
val topicValueAvroSchema: Schema = parser.parse(valueRestResponseSchema.getSchema)
//key schema is typically just string but you can do the same process for the key as the value
val keySchemaString = "\"string\""
val keySchema = parser.parse(keySchemaString)
//Create a map with the Schema registry url.
//This is the only Required configuration for Confluent's KafkaAvroDeserializer.
val props = Map("schema.registry.url" -> schemaRegistryURL)
//Declare SerDe vars before using Spark structured streaming map. Avoids non serializable class exception.
var keyDeserializer: KafkaAvroDeserializer = null
var valueDeserializer: KafkaAvroDeserializer = null
//Create structured streaming DF to read from the topic.
val rawTopicMessageDF = sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", topicName)
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", 20) //remove for prod
.load()
rawTopicMessageDF.printSchema()
//instantiate the SerDe classes if not already, then deserialize!
val deserializedTopicMessageDS = rawTopicMessageDF.map{
row =>
if (keyDeserializer == null) {
keyDeserializer = new KafkaAvroDeserializer
keyDeserializer.configure(props.asJava, true) //isKey = true
}
if (valueDeserializer == null) {
valueDeserializer = new KafkaAvroDeserializer
valueDeserializer.configure(props.asJava, false) //isKey = false
}
//Pass the Avro schema.
val deserializedKeyString = keyDeserializer.deserialize(topicName, row.getAs[Array[Byte]]("key"), keySchema).toString //topic name is actually unused in the source code, just required by the signature. Weird right?
val deserializedValueJsonString = valueDeserializer.deserialize(topicName, row.getAs[Array[Byte]]("value"), topicValueAvroSchema).toString
DeserializedFromKafkaRecord(deserializedKeyString, deserializedValueJsonString)
}
val deserializedDSOutputStream = deserializedTopicMessageDS.writeStream
.outputMode("append")
.format("console")
.option("truncate", false)
.start()
Kafka consumer running fine I can see the data tailing from the oplog however when I run above code I'm getting below errors,
Error:(70, 59) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val deserializedTopicMessageDS = rawTopicMessageDF.map{
and
Error:(70, 59) not enough arguments for method map: (implicit evidence$7: org.apache.spark.sql.Encoder[DeserializedFromKafkaRecord])org.apache.spark.sql.Dataset[DeserializedFromKafkaRecord].
Unspecified value parameter evidence$7.
val deserializedTopicMessageDS = rawTopicMessageDF.map{
Please suggest what I'm missing here.
Thanks in advance.
Just declare your case class DeserializedFromKafkaRecord outside of the main method.
I imagine that when the case class is defined inside main, Spark magic with implicit encoders does not work properly, since the case class does not exist before the execution of main method.
The problem can be reproduced with a simpler example (without Kafka) :
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
object SimpleTest {
// declare CaseClass outside of main method
case class CaseClass(value: Int)
def main(args: Array[String]): Unit = {
// when case class is declared here instead
// of outside main, the program does not compile
// case class CaseClass(value: Int)
val sparkSession = SparkSession
.builder
.master("local")
.appName("simpletest")
.getOrCreate()
import sparkSession.implicits._
val df: DataFrame = sparkSession.sparkContext.parallelize(1 to 10).toDF()
val ds: Dataset[CaseClass] = df.map { row =>
CaseClass(row.getInt(0))
}
ds.show()
}
}
I am trying to convert a Spark RDD to a Spark SQL dataframe with toDF(). I have used this function successfully many times, but in this case I'm getting a compiler error:
error: value toDF is not a member of org.apache.spark.rdd.RDD[com.example.protobuf.SensorData]
Here is my code below:
// SensorData is an auto-generated class
import com.example.protobuf.SensorData
def loadSensorDataToRdd : RDD[SensorData] = ???
object MyApplication {
def main(argv: Array[String]): Unit = {
val conf = new SparkConf()
conf.setAppName("My application")
conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val sensorDataRdd = loadSensorDataToRdd()
val sensorDataDf = sensorDataRdd.toDF() // <-- CAUSES COMPILER ERROR
}
}
I am guessing that the problem is with the SensorData class, which is a Java class that was auto-generated from a Protocol Buffer. What can I do in order to convert the RDD to a dataframe?
The reason for the compilation error is that there's no Encoder in scope to convert a RDD with com.example.protobuf.SensorData to a Dataset of com.example.protobuf.SensorData.
Encoders (ExpressionEncoders to be exact) are used to convert InternalRow objects into JVM objects according to the schema (usually a case class or a Java bean).
There's a hope you can create an Encoder for the custom Java class using org.apache.spark.sql.Encoders object's bean method.
Creates an encoder for Java Bean of type T.
Something like the following:
import org.apache.spark.sql.Encoders
implicit val SensorDataEncoder = Encoders.bean(classOf[com.example.protobuf.SensorData])
If SensorData uses unsupported types you'll have to map the RDD[SensorData] to an RDD of some simpler type(s), e.g. a tuple of the fields, and only then expect toDF work.
I am a newbie with Apache spark as well with Scala programming language.
What I am trying to achieve is to extract the data from my local mongoDB database for then to save it in a parquet format using Apache Spark with the hadoop-connector
This is my code so far:
package com.examples
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.{MongoInputFormat, BSONFileInputFormat}
import org.apache.spark.sql
import org.apache.spark.sql.SQLContext
object DataMigrator {
def main(args: Array[String])
{
val conf = new SparkConf().setAppName("Migration App").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
// Import statement to implicitly convert an RDD to a DataFrame
import sqlContext.implicits._
val mongoConfig = new Configuration()
mongoConfig.set("mongo.input.uri", "mongodb://localhost:27017/mongosails4.case")
val mongoRDD = sc.newAPIHadoopRDD(mongoConfig, classOf[MongoInputFormat], classOf[Object], classOf[BSONObject]);
val count = countsRDD.count()
// the count value is aprox 100,000
println("================ PRINTING =====================")
println(s"ROW COUNT IS $count")
println("================ PRINTING =====================")
}
}
The thing is that in order to save data to a parquet file format first its necessary to convert the mongoRDD variable to Spark DataFrame. I have tried something like this:
// convert RDD to DataFrame
val myDf = mongoRDD.toDF() // this lines throws an error
myDF.write.save("my/path/myData.parquet")
and the error I get is this:
Exception in thread "main" scala.MatchError: java.lang.Object (of class scala.reflect.internal.Types.$TypeRef$$anon$6)
do you guys have any other idea how could I convert the RDD to a DataFrame so that I can save data in parquet format?
Here's the structure of one Document in the mongoDB collection : https://gist.github.com/kingtrocko/83a94238304c2d654fe4
Create a Case class representing the data stored in your DBObject.
case class Data(x: Int, s: String)
Then, map the values of your rdd to instances of your case class.
val dataRDD = mongoRDD.values.map { obj => Data(obj.get("x"), obj.get("s")) }
Now with your RDD[Data], you can create a DataFrame with the sqlContext
val myDF = sqlContext.createDataFrame(dataRDD)
That should get you going. I can explain more later if needed.
I get the following error
Exception in thread "main" java.io.IOException: Not a file: hdfs://quickstart.cloudera:8020/user/cloudera/linkage/out1
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:320)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:180)
when launching the following command
spark-submit --class spark00.DataAnalysis1 --master local sproject1.jar linkage linkage/out1
The two last arguments (linkage and linkage/out1) are HDFS directories, the first contains several CSV files, the second doesn't exist, I assume that it will be automatically created.
The following code has been tested successfully with REPL (Spark 1.1, Scala 2.10.4), except of course the saveAsTextFile() part. I've followed the step-by-step method explained in O'Reilly's "Advanced Analytics with Spark" book.
Since it worked on REPL, I wanted to transpose this into a JAR file using Eclipse Juno, with the following code.
package spark00
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object DataAnalysis1 {
case class MatchData(id1: Int, id2: Int, scores: Array[Double], matched: Boolean)
def isHeader(line:String) = line.contains("id_1")
def toDouble(s:String) = {
if ("?".equals(s)) Double.NaN else s.toDouble
}
def parse(line:String) = {
val pieces = line.split(",")
val id1 = pieces(0).toInt
val id2 = pieces(1).toInt
val scores = pieces.slice(2, 11).map(toDouble)
val matched = pieces(11).toBoolean
MatchData(id1, id2, scores, matched)
}
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setMaster("local").setAppName("DataAnalysis1")
val sc = new SparkContext(conf)
// Load our input data.
val rawblocks = sc.textFile(args(0))
// CLEAN-UP
// a. calling !isHeader(): suppress header
val noheader = rawblocks.filter(!isHeader(_))
// b. calling parse(): setting feature types and renaming headers
val parsed = noheader.map(line => parse(line))
// EXPORT CLEAN FILE
parsed.coalesce(1,true).saveAsTextFile(args(1))
}
}
As you can see args(0) should be "linkage" directory, and args(1) is actually the output HDFS directory linkage/out1 based on my spark-submit command above.
I've also tried the last line without coalesce(1,true)
Here's the official RDD type for parsed
parsed: org.apache.spark.rdd.RDD[(Int, Int, Array[Double], Boolean)] = MappedRDD[3] at map at <console>:34
Thank you in advance for your support
Nov 20th: I'm adding this simple Wordcount code that works well when running the spark-submit command the same way as for the code above. Thus, my question will be: why the saveAsTextFile() worked for this one and not the for other code ?
object SpWordCount {
def main(args: Array[String]) {
// Create a Scala Spark Context.
val conf = new SparkConf().setMaster("local").setAppName("wordCount")
val sc = new SparkContext(conf)
// Load our input data.
val input = sc.textFile(args(0))
// Split it up into words.
val words = input.flatMap(line => line.split(" "))
// Transform into word and count.
val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}
// Save the word count back out to a text file, causing evaluation.
counts.saveAsTextFile(args(1))
}
}