Unable to store row elements of a Dataset, via mapPartitions(), in variables - Scala

I am trying to create a Spark Dataset and then, using mapPartitions, access each of its elements and store them in variables. I am using the piece of code below:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val df = spark.sql("select col1, col2, col3 from table limit 10")

val schema = StructType(Seq(
  StructField("col1", StringType),
  StructField("col2", StringType),
  StructField("col3", StringType)))

val encoder = RowEncoder(schema)

df.mapPartitions { iterator =>
  val myList = iterator.toList
  myList.map { x =>
    val value1 = x.getString(0)
    val value2 = x.getString(1)
    val value3 = x.getString(2)
  }.iterator
}(encoder)
The error I am getting with this code is:
<console>:39: error: type mismatch;
found : org.apache.spark.sql.catalyst.encoders.ExpressionEncoder[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Encoder[Unit]
val value3 = x.getString(2)}).iterator}} (encoder)
Ultimately, I am aiming to store the row elements in variables and perform some operations with them. I am not sure what I am missing here. Any help would be highly appreciated!

Actually, there are several problems with your code:
Your map statement has no return value, so its result type is Unit.
If you return a tuple of Strings from mapPartitions, you don't need a RowEncoder (you don't return a Row but a Tuple3, which needs no explicit encoder because it is a Product).
You can write your code like this:
df
  .mapPartitions { itr => itr.map(x => (x.getString(0), x.getString(1), x.getString(2))) }
  .toDF("col1", "col2", "col3") // convert the Dataset back to a DataFrame with the desired field names
But you could just use a simple select statement in the DataFrame API; there is no need for mapPartitions here:
df.select($"col1", $"col2", $"col3")
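For completeness, if you do need mapPartitions and want to keep Row as the element type, a minimal sketch (assuming the same df, schema, and encoder as in the question) is to return a Row from the map body, so that the RowEncoder actually applies:
import org.apache.spark.sql.Row

// Sketch: return a Row (not Unit) from the map body so that `encoder` matches the element type
val result = df.mapPartitions { iterator =>
  iterator.map { x =>
    val value1 = x.getString(0)
    val value2 = x.getString(1)
    val value3 = x.getString(2)
    Row(value1, value2, value3)
  }
}(encoder)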

Related

How to extract data from MapType Scala Spark Column as Scala Map?

Well, the question is pretty much that. Let me provide a sample:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, DataFrame, Dataset, Row}

val data = List(
  Row("miley",
    Map(
      "good_songs" -> "wrecking ball",
      "bad_songs" -> "younger now"
    )
  ),
  Row("kesha",
    Map(
      "good_songs" -> "tik tok",
      "bad_songs" -> "rainbow"
    )
  )
)

val schema = List(
  StructField("singer", StringType, true),
  StructField("songs", MapType(StringType, StringType, true))
)

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

// This returns scala.collection.Map[Nothing, Nothing]
someDF.select($"songs").head().getMap(0)

// Therefore, this won't work:
val myHappyMap: Map[String, String] = someDF.select($"songs").head().getMap(0)
I don't understand why I'm getting a Map[Nothing, Nothing] if I properly described my desired schema for the MapType column - not only that: when I do someDF.schema, what I get is
org.apache.spark.sql.types.StructType = StructType(StructField(singer,StringType,true), StructField(songs,MapType(StringType,StringType,true),true)), showing that the DataFrame schema is properly set.
I've read extract or filter MapType of Spark DataFrame, and also How to get keys and values from MapType column in SparkSQL DataFrame. I thought the latter would solve my problem by at least letting me extract the keys and the values separately, but I still get the values as WrappedArray(Nothing), which just adds extra complication for no real gain.
What am I missing here?
.getMap is a typed method and it cannot infer the types of your map, so you have to tell it explicitly:
val myHappyMap: Map[String, String] = someDF.select($"songs").head().getMap[String, String](0).toMap
The toMap at the end just converts it from scala.collection.Map to scala.collection.immutable.Map (they are different types, and when you declare Map you are usually referring to the latter).
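A quick usage sketch (my addition, using the someDF built above; head() returns the first row, so the lookups below refer to the miley row):
val songs: Map[String, String] =
  someDF.select($"songs").head().getMap[String, String](0).toMap

// Ordinary, properly typed Map operations now work
songs.get("good_songs")      // Some("wrecking ball")
songs.contains("bad_songs")  // true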

Cleaning CSV/Dataframe of size ~40GB using Spark and Scala

I am kind of a newbie to the big data world. I have an initial CSV with a data size of ~40 GB, but in some kind of shifted order. I mean, if you look at the initial CSV, Jenny has no age, so the sex column value is shifted into age, and the remaining column values keep shifting until the last element in the row.
I want to clean/process this CSV using a DataFrame with Spark in Scala. I tried quite a few solutions with the withColumn() API, but nothing worked for me.
If anyone can suggest some logic or an available API to solve this in a cleaner way, that would be great. I might not need a full solution; pointers will also do. Help much appreciated!
Initial CSV/Dataframe
Required CSV/Dataframe
EDIT:
This is how I'm reading the data:
val spark = SparkSession.builder
  .appName("SparkSQL")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///C:/temp")
  .getOrCreate()

import spark.implicits._

val df = spark.read.option("header", "true").csv("path/to/csv.csv")
This pretty much looks like flawed data. To handle it, I would suggest reading each line of the CSV file as a single string and then applying a map() function to clean the data:
case class MyClass(name: String, age: Option[Int], sex: String, siblings: Int)

val myNewDf = myDf.map { row =>
  val myRow: String = row.getAs[String]("MY_SINGLE_COLUMN")
  val myRowValues = myRow.split(",")
  if (myRowValues.length == 4) {
    // everything as expected
    MyClass(myRowValues(0), Some(myRowValues(1).toInt), myRowValues(2), myRowValues(3).toInt)
  } else {
    // do foo to guess the missing values; here we assume age is the missing field
    MyClass(myRowValues(0), None, myRowValues(1), myRowValues(2).toInt)
  }
}
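For completeness, a minimal sketch (my addition, not part of the original answer) of how a single-string-column myDf could be obtained in the first place; spark.read.text yields one string column named value, renamed here to match the MY_SINGLE_COLUMN used above:
// Read each CSV line as one raw string per row
val myDf = spark.read.text("path/to/csv.csv").toDF("MY_SINGLE_COLUMN")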
In your case the data is not properly formatted. To handle this, the data first has to be cleansed, i.e. all rows of the CSV should have the same schema, or the same number of delimiters/columns.
A basic approach to do this in Spark could be:
Load data as Text
Apply map operation on loaded DF/DS to clean it
Create Schema manually
Apply Schema on the cleansed DF/DS
Sample Code
//Sample CSV
John,28,M,3
Jenny,M,3
//Sample Code
import org.apache.spark.sql.types._

val schema = StructType(
  List(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true),
    StructField("sex", StringType, nullable = true),
    StructField("sib", IntegerType, nullable = true)
  )
)

import spark.implicits._

// Load the raw data as text; each line ends up in a single column named "value"
val rawdf = spark.read.text("test.csv")
rawdf.show(10)

val cleaned = rawdf.map(row => {
  val raw = row.getAs[String]("value")
  // TODO: real data cleansing has to be done here
  val values = raw.split(",")
  if (values.length != 4) {
    // assume age is the missing field and leave it empty
    s"${values(0)},,${values(1)},${values(2)}"
  } else {
    raw
  }
})

// csv(Dataset[String]) parses the cleaned lines with the schema (available in Spark 2.2+)
val df = spark.read.schema(schema).csv(cleaned)
df.show(10)
You can try to define a case class with an Option field for age and load your CSV with that schema directly into a Dataset.
Something like this:
import org.apache.spark.sql.Encoders
import sparkSession.implicits._

case class Person(name: String, age: Option[Int], sex: String, siblings: Int)

val schema = Encoders.product[Person].schema

val dfInput = sparkSession.read
  .format("csv")
  .schema(schema)
  .option("header", "true")
  .load("path/to/csv.csv")
  .as[Person]
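As a quick usage check (my addition, assuming the dfInput above), the rows whose age could not be parsed, and was therefore left empty, can be inspected without leaving the Dataset API:
// Show the rows where age is missing after the schema-based load
dfInput.filter($"age".isNull).show()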

Calling a Scala method passing each row of a DataFrame as input

I have a DataFrame with two columns in it, created by importing a .txt file.
Sample file content:
Sankar Biswas, Played{"94"}
Puja "Kumari" Jha, Didnot
Man Women, null
null,Gay Gentleman
null,null
I created a DataFrame by importing the above file:
val a = sc.textFile("file:////Users/sankar.biswas/Desktop/hello.txt")
case class Table(contentName: String, VersionDetails: String)
val b = a.map(_.split(",")).map(p => Table(p(0).trim,p(1).trim)).toDF
Now I have a function defined, let's say, like this:
def getFormattedName(contentName: String, VersionDetails: String): Option[String] = {
  Option(contentName + VersionDetails)
}
Now what I need to do is take each row of the DataFrame and call the method getFormattedName, passing the two column values of each row as arguments.
I tried this, and many other things, but it did not work out:
val a = b.map((m,n) => getFormattedContentName(m,n))
Looking forward to any suggestion you have for me.
Thanks in advance.
I think you have a structured schema, and it can be represented by a DataFrame. DataFrames have support for reading CSV input.
import org.apache.spark.sql.types._

val customSchema = StructType(Array(
  StructField("contentName", StringType, true),
  StructField("titleVersionDesc", StringType, true)))

val df = spark.read.schema(customSchema).csv("input.csv")
To call a custom method on a Dataset, you can create a UDF (user-defined function).
def getFormattedName(contentName: String, titleVersionDesc: String): Option[String] = {
  Option(contentName + titleVersionDesc)
}

val get_formatted_name = udf(getFormattedName _)

df.select(get_formatted_name($"contentName", $"titleVersionDesc"))
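If the formatted name should be kept next to the original columns, a variant (the column name formattedName is just illustrative) is to add it with withColumn:
// Add the UDF result as a new column instead of selecting it alone
val withName = df.withColumn("formattedName", get_formatted_name($"contentName", $"titleVersionDesc"))
withName.show(false)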
Try
val a = b.map(row => getFormattedName(row.getString(0), row.getString(1)))
Remember that the rows of a DataFrame have their own type (Row), not a tuple, so you need to use the appropriate accessor methods (such as getString) to refer to their elements.

How to convert a case-class-based RDD into a DataFrame?

The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass), but my DataFrame ends up empty. Here's my Scala code:
// sc is the SparkContext, while sqlContext is the SQLContext.
// Define the case class and raw data
case class Dog(name: String)
val data = Array(
  Dog("Rex"),
  Dog("Fido")
)
// Create an RDD from the raw data
val dogRDD = sc.parallelize(data)
// Print the RDD for debugging (this works, shows 2 dogs)
dogRDD.collect().foreach(println)
// Create a DataFrame from the RDD
val dogDF = sqlContext.createDataFrame(dogRDD, classOf[Dog])
// Print the DataFrame for debugging (this fails, shows 0 dogs)
dogDF.show()
The output I'm seeing is:
Dog(Rex)
Dog(Fido)
++
||
++
||
||
++
What am I missing?
Thanks!
All you need is just
val dogDF = sqlContext.createDataFrame(dogRDD)
The second parameter is part of the Java API and expects your class to follow the Java Beans convention (getters/setters). Your case class doesn't follow this convention, so no properties are detected, which leads to an empty DataFrame with no columns.
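For contrast, a hypothetical Java-bean-style class (shown only to illustrate the convention; it is not needed for the fix above) that the classOf[...] overload could inspect would look like this:
// Hypothetical bean: the classOf-based overload discovers columns via getter/setter pairs
class DogBean {
  private var name: String = _
  def getName: String = name
  def setName(value: String): Unit = { name = value }
}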
You can create a DataFrame directly from a Seq of case class instances using toDF as follows:
val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
The case class approach won't work in cluster mode; it will give a ClassNotFoundException for the case class you defined.
Convert it to an RDD[Row], define the schema of your RDD with StructFields, and then call createDataFrame, like this:
// assuming `data` here is an RDD of indexable records (e.g. arrays of strings)
val rdd = data.map { attrs => Row(attrs(0), attrs(1)) }
val rddStruct = StructType(Array(
  StructField("id", StringType, nullable = true),
  StructField("pos", StringType, nullable = true)))
sqlContext.createDataFrame(rdd, rddStruct)
toDF() won't work either.

toDF() not handling RDD

I have an RDD of Rows called RowRDD. I am simply trying to convert it into a DataFrame. From the examples I have seen on the internet in various places, I gather that I should be trying RowRDD.toDF(). However, I am getting the error:
value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
It doesn't work because Row is not a Product type, and createDataFrame with a single RDD argument is defined only for RDD[A] where A <: Product.
If you want to use RDD[Row] you have to provide a schema as the second argument. If you think about it, this should be obvious: Row is just a container of Any values, and as such it doesn't provide enough information for schema inference.
Assuming this is the same RDD as defined in your previous question, the schema is easy to generate:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

val rowRdd: RDD[Row] = ???

val schema = StructType(
  (1 to rowRdd.first.size).map(i => StructField(s"_$i", StringType, false))
)

val df = sqlContext.createDataFrame(rowRdd, schema)
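For a self-contained check, here is a small sketch with a hypothetical two-column rowRdd plugged into the same pattern:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Hypothetical RDD[Row] standing in for the ??? above
val rowRdd = sc.parallelize(Seq(Row("a", "1"), Row("b", "2")))

val schema = StructType(
  (1 to rowRdd.first.size).map(i => StructField(s"_$i", StringType, false))
)

val df = sqlContext.createDataFrame(rowRdd, schema)
df.show()  // two rows, with columns _1 and _2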