Spark Structured Streaming MemoryStream + Row + Encoders issue - scala

I am trying to run some tests on my local machine with spark structured streaming.
In batch mode, here are the Rows that I am dealing with:
val recordSchema = StructType(List(StructField("Record", MapType(StringType, StringType), false)))
val rows = List(
  Row(
    Map("ID" -> "1",
      "STRUCTUREID" -> "MFCD00869853",
      "MOLFILE" -> "The MOL Data",
      "MOLWEIGHT" -> "803.482",
      "FORMULA" -> "C44H69NO12",
      "NAME" -> "Tacrolimus",
      "HASH" -> "52b966c551cfe0fa7d526bac16abcb7be8b8867d",
      "SMILES" -> """[H][C@]12O[C@](O)([C@H](C)C[C@@H]1OC)""",
      "METABOLISM" -> "The metabolism 500"
    )),
  Row(
    Map("ID" -> "2",
      "STRUCTUREID" -> "MFCD00869854",
      "MOLFILE" -> "The MOL Data",
      "MOLWEIGHT" -> "603.482",
      "FORMULA" -> "",
      "NAME" -> "Tacrolimus2",
      "HASH" -> "52b966c551cfe0fa7d526bac16abcb7be8b8867d",
      "SMILES" -> """[H][C@]12O[C@](O)([C@H](C)C[C@@H]1OC)""",
      "METABOLISM" -> "The metabolism 500"
    ))
)
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), recordSchema)
Working with that in batch mode works like a charm, no issue.
Now I'm trying to move to streaming mode using MemoryStream for testing. I added the following:
implicit val ctx = spark.sqlContext
val intsInput = MemoryStream[Row]
But the compiler complains as follows:
No implicits found for parameter evidence$1: Encoder[Row]
Hence my question: what should I do here to get that working?
Also, I saw that if I add the following import, the error goes away:
import spark.implicits._
Actually, I then get the following warning instead of an error:
Ambiguous implicits for parameter evidence$1: Encoder[Row]
I do not understand the encoder mechanism well and would appreciate it if someone could explain to me how to avoid using those implicits. The reason is that I read the following in a book about the creation of a DataFrame from Rows.
Recommended approach:
val myManualSchema = new StructType(Array(
  new StructField("some", StringType, true),
  new StructField("col", StringType, true),
  new StructField("names", LongType, false)))
val myRows = Seq(Row("Hello", null, 1L))
val myRDD = spark.sparkContext.parallelize(myRows)
val myDf = spark.createDataFrame(myRDD, myManualSchema)
myDf.show()
And then the author goes on with this:
In Scala, we can also take advantage of Spark’s implicits in the
console (and if you import them in your JAR code) by running toDF on a
Seq type. This does not play well with null types, so it’s not
necessarily recommended for production use cases.
val myDF = Seq(("Hello", 2, 1L)).toDF("col1", "col2", "col3")
Could someone take the time to explain what is happening in my scenario when I use the implicits, whether it is safe to do so, and whether there is a way to do this more explicitly without importing the implicits?
Finally, if someone could point me to good documentation on Encoders and Spark type mapping, that would be great.
EDIT1
I finally got it to work with
implicit val ctx = spark.sqlContext
import spark.implicits._
val rows = MemoryStream[Map[String,String]]
val df = rows.toDF()
My problem here is that I am not confident about what I am doing. It seems that in some situations I need to create a Dataset to be able to convert it to a DataFrame of Rows via toDF. I understood that working with a Dataset is type-safe but slower than with a DataFrame. So why this intermediate step through a Dataset? This is not the first time I have seen this in Spark Structured Streaming. Again, if someone could help me with these questions, that would be great.
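For reference, here is a minimal sketch of how that intermediate Dataset can be shaped back into the single Record column of the batch schema (this assumes the default value column name that toDF() gives a Dataset[Map[String, String]]):
implicit val ctx = spark.sqlContext
import spark.implicits._

val recordStream = MemoryStream[Map[String, String]]
// toDF() produces a single column named "value"; rename it to match the batch schema
val recordDF = recordStream.toDF().withColumnRenamed("value", "Record")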

I encourage you to use Scala's case classes for data modeling.
final case class Product(name: String, catalogNumber: String, cas: String, formula: String, weight: Double, mld: String)
Now you can have a List of Product in memory:
val inMemoryRecords: List[Product] = List(
Product("Cyclohexanecarboxylic acid", " D19706", "1148027-03-5", "C(11)H(13)Cl(2)NO(5)", 310.131, "MFCD11226417"),
Product("Tacrolimus", "G51159", "104987-11-3", "C(44)H(69)NO(12)", 804.018, "MFCD00869853"),
Product("Methanol", "T57494", "173310-45-7", "C(8)H(8)Cl(2)O", 191.055, "MFCD27756662")
)
The structured streaming API makes it easy to reason about stream processing by using the widely known Dataset[T] abstraction. Roughly speaking, you just have to worry about three things:
Source: a source can generate an input data stream which we can represent as a Dataset[Input]. Every new data item Input that arrives is going to be appended into this unbounded dataset. You can manipulate the data as you wish (e.g. Dataset[Input] => Dataset[Output]).
StreamingQueries and Sink: a query generates a result table that's updated from the Source every trigger interval. Changes are written into external storage called a Sink.
Output modes: there are different modes in which you can write data into the Sink: complete mode, append mode, and update mode (see the short sketch after this list).
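To make the output modes concrete, here is a rough sketch (not part of the original example; someStreamingDS stands for any streaming Dataset that has a name column). "append" is the default for plain transformations, while "complete" is only accepted for aggregating queries:
// append mode: only newly arrived rows are written at each trigger
someStreamingDS.writeStream
  .outputMode("append")
  .format("console")
  .start()

// complete mode: the whole result table is rewritten; allowed only for aggregations
someStreamingDS.groupBy("name").count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()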
Let's assume that you want to know the products that have a molecular weight greater than 200 units.
As you said, using the batch API is fairly simple and straightforward:
// Create a static dataset using the in-memory data
val staticData: Dataset[Product] = spark.createDataset(inMemoryRecords)
// Processing...
val result: Dataset[Product] = staticData.filter(_.weight > 200)
// Print results!
result.show()
When using the Streaming API you just need to define a source and a sink as an extra step. In this example, we can use the MemoryStream and the console sink to print out the results.
// Create a streaming dataset using the in-memory data (memory source)
val productSource = MemoryStream[Product]
productSource.addData(inMemoryRecords)
val streamingData: Dataset[Product] = productSource.toDS()
// Processing...
val result: Dataset[Product] = streamingData.filter(_.weight > 200)
// Print results by using the console sink.
val query: StreamingQuery = result.writeStream.format("console").start()
// Stop streaming
query.awaitTermination(timeoutMs=5000)
query.stop()
Note that staticData and streamingData have the exact same type (i.e., Dataset[Product]). This allows us to apply the same processing steps regardless of using the batch or streaming API. You can also think of implementing a generic method def processing[In, Out](inputData: Dataset[In]): Dataset[Out] = ??? to avoid repeating yourself in both approaches (one possible shape is sketched below).
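One possible shape for that generic helper, sketched here under the assumption that the concrete transformation is passed in as a function (an illustration, not part of the original code):
// the transformation carries the actual logic, so batch and streaming share one code path
def processing[In, Out](inputData: Dataset[In])(transform: Dataset[In] => Dataset[Out]): Dataset[Out] =
  transform(inputData)

// usable on either the static or the streaming Dataset
val heavyProducts: Dataset[Product] = processing(staticData)(_.filter(_.weight > 200))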
Complete code example:
import org.apache.spark.sql.{Dataset, SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.streaming.StreamingQuery

object ExMemoryStream extends App {

  // Boilerplate code...
  val spark: SparkSession = SparkSession.builder
    .appName("ExMemoryStreaming")
    .master("local[*]")
    .getOrCreate()

  spark.sparkContext.setLogLevel("ERROR")

  import spark.implicits._
  implicit val sqlContext: SQLContext = spark.sqlContext

  // Define your data models
  final case class Product(name: String, catalogNumber: String, cas: String, formula: String, weight: Double, mld: String)

  // Create some in-memory instances
  val inMemoryRecords: List[Product] = List(
    Product("Cyclohexanecarboxylic acid", " D19706", "1148027-03-5", "C(11)H(13)Cl(2)NO(5)", 310.131, "MFCD11226417"),
    Product("Tacrolimus", "G51159", "104987-11-3", "C(44)H(69)NO(12)", 804.018, "MFCD00869853"),
    Product("Methanol", "T57494", "173310-45-7", "C(8)H(8)Cl(2)O", 191.055, "MFCD27756662")
  )

  // Define the processing step
  def processing(inputData: Dataset[Product]): Dataset[Product] =
    inputData.filter(_.weight > 200)

  // STATIC DATASET
  val datasetStatic: Dataset[Product] = spark.createDataset(inMemoryRecords)
  println("This is the static dataset:")
  processing(datasetStatic).show()

  // STREAMING DATASET
  val productSource = MemoryStream[Product]
  productSource.addData(inMemoryRecords)
  val datasetStreaming: Dataset[Product] = productSource.toDS()
  println("This is the streaming dataset:")
  val query: StreamingQuery = processing(datasetStreaming).writeStream.format("console").start()
  query.awaitTermination(timeoutMs = 5000)

  // Stop the query and close Spark
  query.stop()
  spark.close()
}

Related

How can I introspect and pre-load all collections from MongoDB into the Spark SQL catalog?

When learning Spark SQL, I've been using the following approach to register a collection into the Spark SQL catalog and query it.
val persons: Seq[MongoPerson] = Seq(MongoPerson("John", "Doe"))
sqlContext.createDataset(persons)
.write
.format("com.mongodb.spark.sql.DefaultSource")
.option("collection", "peeps")
.mode("append")
.save()
sqlContext.read
.format("com.mongodb.spark.sql.DefaultSource")
.option("collection", "peeps")
.load()
.as[Peeps]
.show()
However, when querying it, it seems that I need to register it as a temporary view in order to access it using SparkSQL.
val readConfig = ReadConfig(Map("uri" -> "mongodb://localhost:37017/test", "collection" -> "morepeeps"), Some(ReadConfig(spark)))
val people: DataFrame = MongoSpark.load[Peeps](spark, readConfig)
people.show()
people.createOrReplaceTempView("peeps")
spark.catalog.listDatabases().show()
spark.catalog.listTables().show()
sqlContext.sql("SELECT * FROM peeps")
.as[Peeps]
.show()
For a database with quite a few collections, is there a way to hydrate the Spark SQL schema catalog so that this op isn't so verbose?
So there are a couple of things going on. First of all, simply loading the Dataset using sqlContext.read will not register it with the Spark SQL catalog. The end of the function chain you have in your first code sample returns a Dataset at .as[Peeps]. You need to tell Spark that you want to use it as a view.
Depending on what you're doing with it, I might recommend leaning on the Scala Dataset API rather than SparkSQL. However, if SparkSQL is absolutely essential, you can likely speed things up programmatically.
In my experience, you'll need to run that boilerplate for each table you want to import. Fortunately, Scala is a proper programming language, so we can cut down on code duplication substantially by wrapping the boilerplate in a function and calling it as shown below:
val MongoDbUri: String = "mongodb://localhost:37017/test" // store this as a constant somewhere
// T must be passed in as some case class
// Note, you can also add a second parameter to change the view name if so desired
// requires scala.reflect.runtime.universe.TypeTag plus the MongoDB Spark connector imports (MongoSpark, ReadConfig)
def loadTableAsView[T <: Product : TypeTag](table: String)(implicit spark: SparkSession): Dataset[T] = {
  // the Encoder[T] needed by .as[T] comes from spark.implicits (product encoder for case classes)
  import spark.implicits._
  val configMap = Map(
    "uri" -> MongoDbUri,
    "collection" -> table
  )
  val readConfig = ReadConfig(configMap, Some(ReadConfig(spark)))
  val df: DataFrame = MongoSpark.load[T](spark, readConfig)
  df.createOrReplaceTempView(table)
  df.as[T]
}
And to call it:
// Note: if your SparkSession is declared implicit (e.g. implicit val spark: SparkSession = SparkSession.builder.getOrCreate()), you won't need to pass it explicitly
val peepsDS: Dataset[Peeps] = loadTableAsView[Peeps]("peeps")(spark)
val chocolatesDS: Dataset[Chocolates] = loadTableAsView[Chocolates]("chocolates")(spark)
val candiesDS: Dataset[Candies] = loadTableAsView[Candies]("candies")(spark)
spark.catalog.listDatabases().show()
spark.catalog.listTables().show()
peepsDS.show()
chocolatesDS.show()
candiesDS.show()
This will substantially cut down your boilerplate, and also allow you to more easily write tests for that repeated bit of code. There is also probably a way to create a map of table names to case classes that you can then iterate over, but I don't have an IDE handy to test it out; a rough sketch of that idea follows.
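As a rough, untested sketch of that last idea: if typed Datasets are not required for every collection and the goal is only to populate the catalog with temp views, the case classes can be skipped entirely and the collection names looped over (the names below are placeholders):
val collectionNames = Seq("peeps", "chocolates", "candies") // placeholder collection names

collectionNames.foreach { name =>
  spark.read
    .format("com.mongodb.spark.sql.DefaultSource")
    .option("uri", MongoDbUri)
    .option("collection", name)
    .load()
    .createOrReplaceTempView(name)
}

spark.catalog.listTables().show()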

Using map() and filter() in Spark instead of spark.sql

I have two datasets that I want to INNER JOIN to give me a whole new table with the desired data. I used SQL and managed to get it. But now I want to try it with map() and filter(); is that possible?
This is my code using the SPARK SQL:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object hello {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("quest9")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().appName("quest9").master("local").getOrCreate()

    val zip_codes = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/zip.csv")
    val census = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/census.csv")

    census.createOrReplaceTempView("census")
    zip_codes.createOrReplaceTempView("zip")

    //val query = spark.sql("SELECT * FROM census")
    val query = spark.sql("SELECT DISTINCT census.Total_Males AS male, census.Total_Females AS female FROM census INNER JOIN zip ON census.Zip_Code=zip.Zip_Code WHERE zip.City = 'Inglewood' AND zip.County = 'Los Angeles'")

    query.show()
    query.write.parquet("/home/hdfs/Documents/population/census/IDE/census.parquet")

    sc.stop()
  }
}
The only sensible way, in general, to do this would be to use the join() method of Dataset. I would urge you to question the need to use only map/filter for this, as it is not intuitive and will probably confuse any experienced Spark developer (or, simply put, make them roll their eyes). It may also lead to scalability issues should the dataset grow.
That said, in your use case it is pretty simple to avoid using a join. An alternative is to issue two separate jobs to Spark:
fetch the zip code(s) that interests you
filter on the census data on that (those) zip code(s)
Step 1: collect the zip codes of interest (I'm not sure of the exact syntax as I do not have a Spark shell at hand, but it should be trivial to find the right one).
// needs `import spark.implicits._` in scope for the String encoder used by map/as
val codes: Seq[String] = zip_codes
  // filter on the city
  .filter(row => row.getAs[String]("City").equals("Inglewood"))
  // filter on the county
  .filter(row => row.getAs[String]("County").equals("Los Angeles"))
  // map to the zip code as a String
  .map(row => row.getAs[String]("Zip_Code"))
  .as[String]
  // collect on the driver side
  .collect()
Then again, writing it this way instead of using select/where will look pretty strange to anyone used to Spark.
Still, the reason this will work is that we can be sure the set of zip codes matching a given town and county will be really small, so it is safe to perform a driver-side collection of the result.
Now on to step 2:
census.filter(row => codes.contains(row.getAs[String]("Zip_Code")))
.map( /* whatever to get your data out */ )
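A hedged sketch of how that step could be wired to the columns used in the original SQL query (assuming import spark.implicits._ is in scope for the tuple encoder, and that the CSV columns are plain strings since no schema was inferred):
val malesFemales = census
  .filter(row => codes.contains(row.getAs[String]("Zip_Code")))
  .map(row => (row.getAs[String]("Total_Males"), row.getAs[String]("Total_Females")))
  .toDF("male", "female")
  .distinct()

malesFemales.show()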
What you actually need is a join; your query roughly translates to:
census.as("census")
.join(
broadcast(zip_codes
.where($"City"==="Inglewood")
.where($"County"==="Los Angeles")
.as("zip"))
,Seq("Zip_Code"),
"inner" // "leftsemi" would also be sufficient
)
.select(
$"census.Total_Males".as("male"),
$"census.Total_Females".as("female")
).distinct()

Partitioning the RDF datasets by subject in Spark Scala

I am a newbie to functional programming languages and I am trying to learn Spark with Scala.
The goal is to partition the RDF dataset by subject.
The code is below:
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val sparkConf =
      new SparkConf()
        .setAppName("SimpleApp")
        .setMaster("local[2]")
        .set("spark.executor.memory", "1g")
    val sc = new SparkContext(sparkConf)
    val data = sc.textFile("/home/hduser/Bureau/11.txt")
    val subject = data.map(_.split("\\s+")(0)).distinct.collect
  }
}
So I manage to recover the subjects, but it returns an array of strings. Also, mapPartitions(func) and mapPartitionsWithIndex(func) require func to work on an iterator.
So how do I proceed?
Partitioning your RDD by subject would probably best be done using a HashPartitioner. The HashPartitioner works by taking an RDD of key-value pairs and distributing the data across partitions based on the hash of the key, e.g.
myPairRDD:
("sub1", "desc1")
("sub2", "desc2")
("sub1", "desc3")
("sub2", "desc4")
myPairRDD.partitionBy(new HashPartitioner(2))
becomes:
partition 1:
("sub1", "desc1")
("sub1", "desc3")
partition 2:
("sub2", "desc2")
("sub2", "desc4")
Therefore, your subjects RDD should probably be created more like this (note that it now builds a tuple, i.e. a pair RDD):
val subjectTuples = data.map { line =>
  val tokens = line.split("\\s+")
  (tokens(0), tokens(1))
}
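And, to actually apply the partitioner to that pair RDD, a small sketch (the partition count of 4 is an arbitrary choice):
import org.apache.spark.HashPartitioner

val partitionedBySubject = subjectTuples.partitionBy(new HashPartitioner(4))
// keep the shuffled layout around if several downstream jobs reuse it
partitionedBySubject.persist()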
See the diagrams here for more info: https://blog.knoldus.com/2015/06/19/shufflling-and-repartitioning-of-rdds-in-apache-spark/

Encoder error while trying to map dataframe row to updated row

When I am trying to do the same thing in my code, as mentioned below:
dataframe.map(row => {
  val row1 = row.getAs[String](1)
  val make = if (row1.toLowerCase == "tesla") "S" else row1
  Row(row(0), make, row(2))
})
I have taken the above reference from here:
Scala: How can I replace value in Dataframs using scala
But I am getting an encoder error:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
Note: I am using spark 2.0!
There is nothing unexpected here. You're trying to use code which has been written with Spark 1.x and is no longer supported in Spark 2.0:
in 1.x DataFrame.map is ((Row) ⇒ T)(ClassTag[T]) ⇒ RDD[T]
in 2.x Dataset[Row].map is ((Row) ⇒ T)(Encoder[T]) ⇒ Dataset[T]
To be honest, it didn't make much sense in 1.x either. Independent of the version, you can simply use the DataFrame API:
import org.apache.spark.sql.functions.{when, lower}
val df = Seq(
(2012, "Tesla", "S"), (1997, "Ford", "E350"),
(2015, "Chevy", "Volt")
).toDF("year", "make", "model")
df.withColumn("make", when(lower($"make") === "tesla", "S").otherwise($"make"))
If you really want to use map you should use statically typed Dataset:
import spark.implicits._
case class Record(year: Int, make: String, model: String)
df.as[Record].map {
case tesla if tesla.make.toLowerCase == "tesla" => tesla.copy(make = "S")
case rec => rec
}
or at least return an object which will have implicit encoder:
df.map {
case Row(year: Int, make: String, model: String) =>
(year, if(make.toLowerCase == "tesla") "S" else make, model)
}
Finally if for some completely crazy reason you really want to map over Dataset[Row] you have to provide required encoder:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
// Yup, it would be possible to reuse df.schema here
val schema = StructType(Seq(
StructField("year", IntegerType),
StructField("make", StringType),
StructField("model", StringType)
))
val encoder = RowEncoder(schema)
df.map {
case Row(year, make: String, model) if make.toLowerCase == "tesla" =>
Row(year, "S", model)
case row => row
} (encoder)
For the scenario where the dataframe schema is known in advance, the answer given by @zero323 is the solution,
but for scenarios with a dynamic schema, or when passing multiple dataframes to a generic function:
the following code has worked for us while migrating from 1.6.1 to 2.2.0.
import org.apache.spark.sql.Row
val df = Seq(
(2012, "Tesla", "S"), (1997, "Ford", "E350"),
(2015, "Chevy", "Volt")
).toDF("year", "make", "model")
val data = df.rdd.map(row => {
  val row1 = row.getAs[String](1)
  val make = if (row1.toLowerCase == "tesla") "S" else row1
  Row(row(0), make, row(2))
})
This code executes on both versions of Spark.
Disadvantage: the optimizations provided by Spark on the DataFrame/Dataset API won't be applied.
Just to add a few other important points in order to better understand the other answers (especially the final point of @zero323's answer about map over Dataset[Row]):
First of all, Dataframe.map gives you a Dataset (more specifically, Dataset[T], rather than Dataset[Row])!
And Dataset[T] always requires an encoder, that's what this sentence "Dataset[Row].map is ((Row) ⇒ T)(Encoder[T]) ⇒ Dataset[T]" means.
There are indeed lots of encoders predefined already by Spark (which can be imported by doing import spark.implicits._), but the list still cannot cover many domain-specific types that developers may create, in which case you need to create encoders yourself (see the sketch below).
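As a hedged sketch of one option for such a type (MyDomainType below is made up for illustration), an explicit Kryo-based encoder can be declared:
import org.apache.spark.sql.{Encoder, Encoders}

// hypothetical domain type that spark.implicits._ knows nothing about
class MyDomainType(val id: Int, val label: String) extends Serializable

// an explicit encoder; the resulting Dataset stores the values as opaque binary
implicit val myDomainTypeEncoder: Encoder[MyDomainType] = Encoders.kryo[MyDomainType]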
In the specific example on this page, df.map returns a Dataset of Row, and, hang on a minute, the Row type is not within the list of types that have encoders predefined by Spark, hence you are going to create one on your own.
And I admit that creating an encoder for the Row type is a bit different from the approach described in the above link: you have to use RowEncoder, which takes a StructType as a parameter describing the type of a row, like what @zero323 provides above:
// this describes the internal type of a row
val schema = StructType(Seq(StructField("year", IntegerType), StructField("make", StringType), StructField("model", StringType)))
// and this completes the creation of encoder
// for the type `Row` with internal schema described above
val encoder = RowEncoder(schema)
In my case, with Spark 2.4.4, I had to import the implicits. This is a general answer:
val spark2 = spark
import spark2.implicits._
val data = df.rdd.map(row => my_func(row))
where my_func performs some operation on each row.
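For completeness, a possible body for my_func (hypothetical, mirroring the earlier Tesla example; working on the underlying RDD[Row] is what removes the need for an Encoder):
import org.apache.spark.sql.Row

def my_func(row: Row): Row = {
  // example transformation: normalize the "make" column, keep the other columns as-is
  val make = row.getAs[String]("make")
  Row(row(0), if (make.toLowerCase == "tesla") "S" else make, row(2))
}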

How to convert a case-class-based RDD into a DataFrame?

The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass), but my DataFrame ends up empty. Here's my Scala code:
// sc is the SparkContext, while sqlContext is the SQLContext.
// Define the case class and raw data
case class Dog(name: String)
val data = Array(
Dog("Rex"),
Dog("Fido")
)
// Create an RDD from the raw data
val dogRDD = sc.parallelize(data)
// Print the RDD for debugging (this works, shows 2 dogs)
dogRDD.collect().foreach(println)
// Create a DataFrame from the RDD
val dogDF = sqlContext.createDataFrame(dogRDD, classOf[Dog])
// Print the DataFrame for debugging (this fails, shows 0 dogs)
dogDF.show()
The output I'm seeing is:
Dog(Rex)
Dog(Fido)
++
||
++
||
||
++
What am I missing?
Thanks!
All you need is just
val dogDF = sqlContext.createDataFrame(dogRDD)
The second parameter is part of the Java API and expects your class to follow the Java Beans convention (getters/setters). Your case class doesn't follow this convention, so no properties are detected, which leads to an empty DataFrame with no columns.
You can create a DataFrame directly from a Seq of case class instances using toDF (with import sqlContext.implicits._ in scope) as follows:
val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
The case class approach won't work in cluster mode (it'll give a ClassNotFoundException for the case class you defined).
Convert it to an RDD[Row] and define the schema of your RDD with StructField, then use createDataFrame, like:
// here `data` is assumed to be an RDD of indexed records, e.g. RDD[Array[String]]
val rdd = data.map { attrs => Row(attrs(0), attrs(1)) }
val rddStruct = new StructType(Array(
  StructField("id", StringType, nullable = true),
  StructField("pos", StringType, nullable = true)))
sqlContext.createDataFrame(rdd, rddStruct)
toDF() won't work either.
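For reference, a hedged sketch of that Row + StructType route applied to the Dog example from the question:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val dogRowRDD = sc.parallelize(Seq(Row("Rex"), Row("Fido")))
val dogSchema = new StructType(Array(StructField("name", StringType, nullable = false)))
val dogDFExplicit = sqlContext.createDataFrame(dogRowRDD, dogSchema)
dogDFExplicit.show()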