How to convert RDD of Avro's GenericData.Record to DataFrame? - scala

Perhaps this question may seem a bit abstract, here it is:
val originalAvroSchema : Schema = // read from a file
val rdd : RDD[GenericData.Record] = // From some streaming source
// Looking for a handy:
val df: DataFrame = rdd.toDF(schema)
I explore spark-avro but it has support only to read from a file, not from existing RDD.

import com.databricks.spark.avro._
val sqlContext = new SQLContext(sc)
val rdd : RDD[MyAvroRecord] = ...
val df = rdd.toAvroDF(sqlContext)

Related

how to create DataFrame if I can not use the SparkContext?

def predict(model: LRModel,query: Query): PredictedResult = {
val categorical_val = Array[String]("Type","Month","Dept","Size","IsHoliday")
val ordinary_val = Array[String]()
val sc = new SparkContext()
val sqlContext = new SQLContext(sc)
val query_seq = sc.parallelize(Seq(query))
val df = sqlContext.createDataFrame(query_seq).toDF("Type","Month","Dept","Size","IsHoliday")
val features = process_Data(df = df,categorical_val = categorical_val,ordinary_val = ordinary_val)
val label = model.linear.predict(Vectors.dense(features))
new PredictedResult(label) }
Im trying to convert Seq to DataFrame, but I find there are many methods using SparkContext to create online. The problem is that I do not have the para SparkContext,So I want to ask if there are some other ways to create DataFrame. Im new to Scala and Spark!
The SparkContext the main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. DataFrame is a distributed collection of data organized into named columns. You can checkout the documentation here: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
You can create a DataFrame from Seq as given below:
import sqlContext.implicits._
val df = Seq(("A1", "B1", "C1", "D1", "E1"), ("A2", "B2", "C2", "D2", "E2")).toDF("Type","Month","Dept","Size","IsHoliday")

How to convert RDD[Row] to RDD[String]

I have a DataFrame called source, a table from mysql
val source = sqlContext.read.jdbc(jdbcUrl, "source", connectionProperties)
I have converted it to rdd by
val sourceRdd = source.rdd
but its RDD[Row] I need RDD[String]
to do transformations like
source.map(rec => (rec.split(",")(0).toInt, rec)), .subtractByKey(), etc..
Thank you
You can use Row. mkString(sep: String): String method in a map call like this :
val sourceRdd = source.rdd.map(_.mkString(","))
You can change the "," parameter by whatever you want.
Hope this help you, Best Regards.
What is your schema?
If it's just a String, you can use:
import spark.implicits._
val sourceDS = source.as[String]
val sourceRdd = sourceDS.rdd // will give RDD[String]
Note: use sqlContext instead of spark in Spark 1.6 - spark is a SparkSession, which is a new class in Spark 2.0 and is a new entry point to SQL functionality. It should be used instead of SQLContext in Spark 2.x
You can also create own case classes.
Also you can map rows - here source is of type DataFrame, we use partial function in map function:
val sourceRdd = source.rdd.map { case x : Row => x(0).asInstanceOf[String] }.map(s => s.split(","))

OneHotEncoder in Spark Dataframe in Pipeline

I've been trying to get an example running in Spark and Scala with the adult dataset .
Using Scala 2.11.8 and Spark 1.6.1.
The problem (for now) lies in the amount of categorical features in that dataset that all need to be encoded to numbers before a Spark ML algorithm can do its job..
So far I have this:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
object Adult {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("Adult example").setMaster("local[*]")
val sparkContext = new SparkContext(conf)
val sqlContext = new SQLContext(sparkContext)
val data = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("src/main/resources/adult.data")
val categoricals = data.dtypes filter (_._2 == "StringType")
val encoders = categoricals map (cat => new OneHotEncoder().setInputCol(cat._1).setOutputCol(cat._1 + "_encoded"))
val features = data.dtypes filterNot (_._1 == "label") map (tuple => if(tuple._2 == "StringType") tuple._1 + "_encoded" else tuple._1)
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.01)
val pipeline = new Pipeline()
.setStages(encoders ++ Array(lr))
val model = pipeline.fit(training)
}
}
However, this doesn't work. Calling pipeline.fit still contains the original string features and thus throws an exception.
How can I remove these "StringType" columns in a pipeline?
Or maybe I'm doing it completely wrong, so if someone has a different suggestion I'm happy to all input :).
The reason why I choose to follow this flow is because I have an extensive background in Python and Pandas, but am trying to learn both Scala and Spark.
There is one thing that can be rather confusing here if you're used to higher level frameworks. You have to index the features before you can use encoder. As it is explained in the API docs:
one-hot encoder (...) maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}
val df = Seq((1L, "foo"), (2L, "bar")).toDF("id", "x")
val categoricals = df.dtypes.filter (_._2 == "StringType") map (_._1)
val indexers = categoricals.map (
c => new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
)
val encoders = categoricals.map (
c => new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_enc")
)
val pipeline = new Pipeline().setStages(indexers ++ encoders)
val transformed = pipeline.fit(df).transform(df)
transformed.show
// +---+---+-----+-------------+
// | id| x|x_idx| x_enc|
// +---+---+-----+-------------+
// | 1|foo| 1.0| (1,[],[])|
// | 2|bar| 0.0|(1,[0],[1.0])|
// +---+---+-----+-------------+
As you can see there is no need to drop string columns from the pipeline. In practice OneHotEncoder will accept numeric column with NominalAttribute, BinaryAttribute or missing type attribute.

Saving DataStream data into MongoDB / converting DS to DF

I am able to save a Data Frame to mongoDB but my program in spark streaming gives a datastream ( kafkaStream ) and I am not able to save it in mongodb neither i am able to convert this datastream to a dataframe. Is there any library or method to do this? Any inputs are highly appreciated.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka.KafkaUtils
object KafkaSparkStream {
def main(args: Array[String]){
val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaReceiver")
val ssc = new StreamingContext(conf, Seconds(10))
val kafkaStream = KafkaUtils.createStream(ssc,
"localhost:2181","spark-streaming-consumer-group", Map("topic" -> 25))
kafkaStream.print()
ssc.start()
ssc.awaitTermination()
}
}
Save a DF to mongodb - SUCCESS
val mongoDbFormat = "com.stratio.datasource.mongodb"
val mongoDbDatabase = "mongodatabase"
val mongoDbCollection = "mongodf"
val mongoDbOptions = Map(
MongodbConfig.Host -> "localhost:27017",
MongodbConfig.Database -> mongoDbDatabase,
MongodbConfig.Collection -> mongoDbCollection
)
//with DataFrame methods
dataFrame.write
.format(mongoDbFormat)
.mode(SaveMode.Append)
.options(mongoDbOptions)
.save()
Access the underlying RDD from the DStream using foreachRDD, transform it to a DataFrame and use your DF function on it.
The easiest way to transform an RDD to a DataFrame is by first transforming the data into a schema, represented in Scala by a case class
case class Element(...)
val elementDStream = kafkaDStream.map(entry => Element(entry, ...))
elementDStream.foreachRDD{rdd =>
val df = rdd.toDF
df.write(...)
}
Also, watch out for Spark 2.0 where this process will completely change with the introduction of Structured Streaming, where a MongoDB connection will become a sink.

How to create a DataFrame from a text file in Spark

I have a text file on HDFS and I want to convert it to a Data Frame in Spark.
I am using the Spark Context to load the file and then try to generate individual columns from that file.
val myFile = sc.textFile("file.txt")
val myFile1 = myFile.map(x=>x.split(";"))
After doing this, I am trying the following operation.
myFile1.toDF()
I am getting an issues since the elements in myFile1 RDD are now array type.
How can I solve this issue?
Update - as of Spark 1.6, you can simply use the built-in csv data source:
spark: SparkSession = // create the Spark Session
val df = spark.read.csv("file.txt")
You can also use various options to control the CSV parsing, e.g.:
val df = spark.read.option("header", "false").csv("file.txt")
For Spark version < 1.6:
The easiest way is to use spark-csv - include it in your dependencies and follow the README, it allows setting a custom delimiter (;), can read CSV headers (if you have them), and it can infer the schema types (with the cost of an extra scan of the data).
Alternatively, if you know the schema you can create a case-class that represents it and map your RDD elements into instances of this class before transforming into a DataFrame, e.g.:
case class Record(id: Int, name: String)
val myFile1 = myFile.map(x=>x.split(";")).map {
case Array(id, name) => Record(id.toInt, name)
}
myFile1.toDF() // DataFrame will have columns "id" and "name"
I have given different ways to create DataFrame from text file
val conf = new SparkConf().setAppName(appName).setMaster("local")
val sc = SparkContext(conf)
raw text file
val file = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
val fileToDf = file.map(_.split(",")).map{case Array(a,b,c) =>
(a,b.toInt,c)}.toDF("name","age","city")
fileToDf.foreach(println(_))
spark session without schema
import org.apache.spark.sql.SparkSession
val sparkSess =
SparkSession.builder().appName("SparkSessionZipsExample")
.config(conf).getOrCreate()
val df = sparkSess.read.option("header",
"false").csv("C:\\vikas\\spark\\Interview\\text.txt")
df.show()
spark session with schema
import org.apache.spark.sql.types._
val schemaString = "name age city"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName,
StringType, nullable=true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header",
"false").schema(schema).csv("C:\\vikas\\spark\\Interview\\text.txt")
dfWithSchema.show()
using sql context
import org.apache.spark.sql.SQLContext
val fileRdd =
sc.textFile("C:\\vikas\\spark\\Interview\\text.txt").map(_.split(",")).map{x
=> org.apache.spark.sql.Row(x:_*)}
val sqlDf = sqlCtx.createDataFrame(fileRdd,schema)
sqlDf.show()
If you want to use the toDF method, you have to convert your RDD of Array[String] into a RDD of a case class. For example, you have to do:
case class Test(id:String,filed2:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
You will not able to convert it into data frame until you use implicit conversion.
val sqlContext = new SqlContext(new SparkContext())
import sqlContext.implicits._
After this only you can convert this to data frame
case class Test(id:String,filed2:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
val df = spark.read.textFile("abc.txt")
case class Abc (amount:Int, types: String, id:Int) //columns and data types
val df2 = df.map(rec=>Amount(rec(0).toInt, rec(1), rec(2).toInt))
rdd2.printSchema
root
|-- amount: integer (nullable = true)
|-- types: string (nullable = true)
|-- id: integer (nullable = true)
A txt File with PIPE (|) delimited file can be read as :
df = spark.read.option("sep", "|").option("header", "true").csv("s3://bucket_name/folder_path/file_name.txt")
I know I am quite late to answer this but I have come up with a different answer:
val rdd = sc.textFile("/home/training/mydata/file.txt")
val text = rdd.map(lines=lines.split(",")).map(arrays=>(ararys(0),arrays(1))).toDF("id","name").show
You can read a file to have an RDD and then assign schema to it. Two common ways to creating schema are either using a case class or a Schema object [my preferred one]. Follows the quick snippets of code that you may use.
Case Class approach
case class Test(id:String,name:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
Schema Approach
import org.apache.spark.sql.types._
val schemaString = "id name"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header","false").schema(schema).csv("file.txt")
dfWithSchema.show()
The second one is my preferred approach since case class has a limitation of max 22 fields and this will be a problem if your file has more than 22 fields!