When I retrieve a dataset in Spark 2 using a select statement, the underlying columns inherit the data types of the queried columns.
val ds1 = spark.sql("select 1 as a, 2 as b, 'abd' as c")
ds1.printSchema()
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
|-- c: string (nullable = false)
Now if I convert this to a Dataset of a case class, the values are converted correctly, but the underlying schema is still wrong.
case class abc(a: Double, b: Double, c: String)
val ds2 = ds1.as[abc]
ds2.printSchema()
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
|-- c: string (nullable = false)
ds2.collect
res18: Array[abc] = Array(abc(1.0,2.0,abd))
I "SHOULD" be able to specify the encoder to use when I create the second dataset, but scala seems to ignore this parameter (Is this a BUG?):
val abc_enc = org.apache.spark.sql.Encoders.product[abc]
val ds2 = ds1.as[abc](abc_enc)
ds2.printSchema
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
|-- c: string (nullable = false)
So the only way I can see to do this simply, without very complex mapping, is to use createDataset, but that requires a collect on the underlying object, so it's not ideal.
val ds2 = spark.createDataset(ds1.as[abc].collect)
This is an open issue in Spark API (check this ticket SPARK-17694)
So what you need to do is an extra explicit cast. Something like this should work:
ds1.as[abc].map(x => x : abc)
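As a quick sanity check of the workaround (a sketch; ds3 is an illustrative name, abc and ds1 come from the question), the identity map forces a round trip through the encoder, so the numeric columns come back as double:
val ds3 = ds1.as[abc].map(x => x: abc)
ds3.printSchema()
// a and b are now reported as double, c as string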
You can simply use the cast method on the columns:
import spark.implicits._
import org.apache.spark.sql.types.DoubleType

val ds2 = ds1.select($"a".cast(DoubleType), $"b".cast(DoubleType), $"c")
ds2.printSchema()
you should get
root
|-- a: double (nullable = false)
|-- b: double (nullable = false)
|-- c: string (nullable = false)
You could also cast the columns while selecting with a SQL query, as below:
import spark.implicits._

Seq((1, 2, "abc"), (1, 2, "abc")).toDF("a", "b", "c").createOrReplaceTempView("temp")
val ds1 = spark.sql("select cast(a as double), cast(b as double), c from temp")
ds1.printSchema()
This has the schema:
root
|-- a: double (nullable = false)
|-- b: double (nullable = false)
|-- c: string (nullable = true)
Now you can convert it to a Dataset with the case class:
case class abc(a: Double, b: Double, c: String)
val ds2 = ds1.as[abc]
ds2.printSchema()
Which now has the required schema
root
|-- a: double (nullable = false)
|-- b: double (nullable = false)
|-- c: string (nullable = true)
Hope this helps!
OK, I think I've resolved this in a better way.
Instead of using a collect when we create a new dataset, we can just reference the rdd of the dataset.
So instead of
val ds2 = spark.createDataset(ds1.as[abc].collect)
We use:
val ds2 = spark.createDataset(ds1.as[abc].rdd)
ds2.printSchema
root
|-- a: double (nullable = false)
|-- b: double (nullable = false)
|-- c: string (nullable = true)
This keeps the lazy evaluation intact, but allows the new dataset to use the Encoder for the abc case class, and the subsequent schema will reflect this when we use it to create a new table.
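For example (a sketch; the view name is illustrative), registering the new dataset and reading it back keeps the case-class column types:
ds2.createOrReplaceTempView("abc_view")
spark.table("abc_view").printSchema()   // a and b are double, c is string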
Related
I am trying to move data from GP to HDFS using Scala & Spark.
val execQuery = "select * from schema.tablename"
val yearDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", s"(${execQuery}) as year2016")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("partitionColumn", "header_id")
  .option("lowerBound", 19919927)
  .option("upperBound", 28684058)
  .option("numPartitions", 30)
  .load()
val yearDFSchema = yearDF.schema
The schema for yearDF is:
root
|-- source_system_name: string (nullable = true)
|-- table_refresh_delay_min: decimal(38,30) (nullable = true)
|-- release_number: decimal(38,30) (nullable = true)
|-- change_number: decimal(38,30) (nullable = true)
|-- interface_queue_enabled_flag: string (nullable = true)
|-- rework_enabled_flag: string (nullable = true)
|-- fdm_application_id: decimal(15,0) (nullable = true)
|-- history_enabled_flag: string (nullable = true)
The schema of same table on hive which is given by our project:
val hiveColumns = "source_system_name:String|description:String|creation_date:Timestamp|status:String|status_date:Timestamp|table_refresh_delay_min:Timestamp|release_number:Double|change_number:Double|interface_queue_enabled_flag:String|rework_enabled_flag:String|fdm_application_id:Bigint|history_enabled_flag:String"
So I took hiveColumns and created a new StructType as given below:
def convertDatatype(datatype: String): DataType = {
  // lower-case the incoming Hive type names ("String", "Bigint", ...) so they match
  datatype.toLowerCase match {
    case "string"    => StringType
    case "bigint"    => LongType
    case "int"       => IntegerType
    case "double"    => DoubleType
    case "date"      => TimestampType
    case "boolean"   => BooleanType
    case "timestamp" => TimestampType
  }
}
val schemaList = hiveColumns.split("\\|")
val newSchema = new StructType(schemaList.map(col => col.split(":")).map(e => StructField(e(0), convertDatatype(e(1)), true)))
newSchema.printTreeString()
root
|-- source_system_name: string (nullable = true)
|-- table_refresh_delay_min: double (nullable = true)
|-- release_number: double (nullable = true)
|-- change_number: double (nullable = true)
|-- interface_queue_enabled_flag: string (nullable = true)
|-- rework_enabled_flag: string (nullable = true)
|-- fdm_application_id: long (nullable = true)
|-- history_enabled_flag: string (nullable = true)
When I try to apply my new schema, newSchema, to yearDF, I get this exception:
Caused by: java.lang.RuntimeException: java.math.BigDecimal is not a valid external type for schema of double
The exception occurs due to conversion of decimal to double.
What I don't understand is how I can convert the datatypes of the columns table_refresh_delay_min, release_number, change_number, and fdm_application_id in the StructType newSchema from DoubleType to their corresponding datatypes present in yearDF's schema, i.e.
if a column in yearDFSchema has a decimal datatype with precision greater than zero, in this case decimal(38,30), I need to convert the same column's datatype in newSchema to DecimalType(38,30).
Could anyone let me know how I can achieve this?
Errors like this occur when you try to apply a schema to an RDD[Row] using the developer API functions:
def createDataFrame(rows: List[Row], schema: StructType): DataFrame
def createDataFrame(rowRDD: JavaRDD[Row], schema: StructType): DataFrame
def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
In such cases the stored data types have to match the external (i.e. value type in Scala) data types as listed in the official Spark SQL documentation, and no type casting or coercion is applied.
Therefore it is your responsibility as a user to ensure that the data and schema are compatible.
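For illustration, a minimal single-column reproduction of the mismatch (a sketch; the column name is borrowed from your schema): the row stores a java.math.BigDecimal while the schema declares DoubleType, so evaluating the frame fails with exactly the RuntimeException you quoted.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rows = sc.parallelize(Seq(Row(new java.math.BigDecimal("1.5"))))
val badSchema = StructType(Seq(StructField("release_number", DoubleType)))
val broken = spark.createDataFrame(rows, badSchema)
// broken.show()  // fails: java.math.BigDecimal is not a valid external type for schema of double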
The description of the problem you've provided indicates a rather different scenario, one that calls for a CAST. Let's create a dataset with exactly the same schema as in your example:
val yearDF = spark.createDataFrame(
sc.parallelize(Seq[Row]()),
StructType(Seq(
StructField("source_system_name", StringType),
StructField("table_refresh_delay_min", DecimalType(38, 30)),
StructField("release_number", DecimalType(38, 30)),
StructField("change_number", DecimalType(38, 30)),
StructField("interface_queue_enabled_flag", StringType),
StructField("rework_enabled_flag", StringType),
StructField("fdm_application_id", DecimalType(15, 0)),
StructField("history_enabled_flag", StringType)
)))
yearDF.printSchema
root
|-- source_system_name: string (nullable = true)
|-- table_refresh_delay_min: decimal(38,30) (nullable = true)
|-- release_number: decimal(38,30) (nullable = true)
|-- change_number: decimal(38,30) (nullable = true)
|-- interface_queue_enabled_flag: string (nullable = true)
|-- rework_enabled_flag: string (nullable = true)
|-- fdm_application_id: decimal(15,0) (nullable = true)
|-- history_enabled_flag: string (nullable = true)
and desired types like
val dtypes = Seq(
"source_system_name" -> "string",
"table_refresh_delay_min" -> "double",
"release_number" -> "double",
"change_number" -> "double",
"interface_queue_enabled_flag" -> "string",
"rework_enabled_flag" -> "string",
"fdm_application_id" -> "long",
"history_enabled_flag" -> "string"
)
then you can just map:
val mapping = dtypes.toMap
yearDF.select(yearDF.columns.map { c => col(c).cast(mapping(c)) }: _*).printSchema
root
|-- source_system_name: string (nullable = true)
|-- table_refresh_delay_min: double (nullable = true)
|-- release_number: double (nullable = true)
|-- change_number: double (nullable = true)
|-- interface_queue_enabled_flag: string (nullable = true)
|-- rework_enabled_flag: string (nullable = true)
|-- fdm_application_id: long (nullable = true)
|-- history_enabled_flag: string (nullable = true)
This of course assumes that actual and desired types are compatible, and CAST is allowed.
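A slightly more defensive variant of the same select (a sketch): leave any column that is not listed in the mapping untouched instead of failing with a NoSuchElementException.
val casted = yearDF.select(yearDF.columns.map { c =>
  mapping.get(c).map(t => col(c).cast(t)).getOrElse(col(c))   // cast only the mapped columns
}: _*)
casted.printSchema()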
If you still experience problems due to peculiarities of a specific JDBC driver, you should consider placing the cast directly in the query, either manually (see In Apache Spark 2.0.0, is it possible to fetch a query from an external database (rather than grab the whole table)?):
val externalDtypes = Seq(
"source_system_name" -> "text",
"table_refresh_delay_min" -> "double precision",
"release_number" -> "float8",
"change_number" -> "float8",
"interface_queue_enabled_flag" -> "string",
"rework_enabled_flag" -> "string",
"fdm_application_id" -> "bigint",
"history_enabled_flag" -> "string"
)
val fields = externalDtypes.map {
  case (c, t) => s"CAST(`$c` AS $t)"
}.mkString(", ")

val dbTable = s"""(select $fields from schema.tablename) as tmp"""
or through a custom schema:
spark.read
.format("jdbc")
.option(
"customSchema",
dtypes.map { case (c, t) => s"`$c` $t" } .mkString(", "))
...
.load()
var exprs = dfx.columns.map(max(_))
var df2 = df1.groupBy("x","y","z").agg(exprs.head, exprs.tail: _*)
df2.printSchema()
The output of this creates a DataFrame with the schema:
root
|-- x: string (nullable = true)
|-- y: string (nullable = true)
|-- z: double (nullable = true)
|-- max(a): double (nullable = true)
|-- max(b): double (nullable = true)
|-- max(c): double (nullable = true)
How does one programmatically remove the max() and rename the columns, e.g. "a" instead of max(a)?
Replace
var exprs = dfx.columns.map(max(_))
with (and yeah, don't use var when val works fine):
val exprs = dfx.columns.map(c => max(c).alias(c))
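Putting the aliased aggregations back into the pipeline from the question (a sketch; exprs is the aliased version defined just above, and df1 and the grouping columns are as in the question):
val df2 = df1.groupBy("x", "y", "z").agg(exprs.head, exprs.tail: _*)
df2.printSchema()
// root
//  |-- x: string (nullable = true)
//  |-- y: string (nullable = true)
//  |-- z: double (nullable = true)
//  |-- a: double (nullable = true)
//  |-- b: double (nullable = true)
//  |-- c: double (nullable = true)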
This has me confused. I'm using "spark-testing-base_2.11" % "2.0.0_0.5.0" for the test. Can anyone explain why the map function changes the schema when used on a Dataset, but works if I use the RDD? Any insights are greatly appreciated.
import com.holdenkarau.spark.testing.SharedSparkContext
import org.apache.spark.sql.{ Encoders, SparkSession }
import org.scalatest.{ FunSpec, Matchers }
class TransformSpec extends FunSpec with Matchers with SharedSparkContext {
  describe("data transformation") {
    it("the rdd maintains the schema") {
      val spark = SparkSession.builder.getOrCreate()
      import spark.implicits._

      val personEncoder = Encoders.product[TestPerson]
      val personDS = Seq(TestPerson("JoeBob", 29)).toDS
      personDS.schema shouldEqual personEncoder.schema

      val mappedSet = personDS.rdd.map { p: TestPerson => p.copy(age = p.age + 1) }.toDS
      personEncoder.schema shouldEqual mappedSet.schema
    }

    it("datasets choke on explicit schema") {
      val spark = SparkSession.builder.getOrCreate()
      import spark.implicits._

      val personEncoder = Encoders.product[TestPerson]
      val personDS = Seq(TestPerson("JoeBob", 29)).toDS
      personDS.schema shouldEqual personEncoder.schema

      val mappedSet = personDS.map[TestPerson] { p: TestPerson => p.copy(age = p.age + 1) }
      personEncoder.schema shouldEqual mappedSet.schema
    }
  }
}
case class TestPerson(name: String, age: Int)
A couple of things are conspiring against you here. Spark appears to have special casing for what types it considers nullable.
case class TestTypes(
scalaString: String,
javaString: java.lang.String,
myString: MyString,
scalaInt: Int,
javaInt: java.lang.Integer,
myInt: MyInt
)
Encoders.product[TestTypes].schema.printTreeString results in:
root
|-- scalaString: string (nullable = true)
|-- javaString: string (nullable = true)
|-- myString: struct (nullable = true)
| |-- value: string (nullable = true)
|-- scalaInt: integer (nullable = false)
|-- javaInt: integer (nullable = true)
|-- myInt: struct (nullable = true)
| |-- value: integer (nullable = false)
but if you map the types you will end up with everything nullable
val testTypes: Seq[TestTypes] = Nil
val testDS = testTypes.toDS
testDS.map(foo => foo).schema.printTreeString results in everything being nullable:
root
|-- scalaString: string (nullable = true)
|-- javaString: string (nullable = true)
|-- myString: struct (nullable = true)
| |-- value: string (nullable = true)
|-- scalaInt: integer (nullable = true)
|-- javaInt: integer (nullable = true)
|-- myInt: struct (nullable = true)
| |-- value: integer (nullable = true)
Even if you force the schema to be correct, Spark explicitly ignores nullability comparisons when applying a schema, which is why you lose the few nullability guarantees you had when you convert back to the typed representation.
You could enrich your types to be able to force a nonNull schema:
implicit class StructImprovements(s: StructType) {
def nonNull: StructType = StructType(s.map(_.copy(nullable = false)))
}
implicit class DsImprovements[T: Encoder](ds: Dataset[T]) {
def nonNull: Dataset[T] = {
val nnSchema = ds.schema.nonNull
applySchema(ds.toDF, nnSchema).as[T]
}
}
val mappedSet = personDS.map { p =>
p.copy(age = p.age + 1)
}.nonNull
But you will find that it evaporates when applying any interesting operation: when comparing schemas again, if the shape is the same except for nullability, Spark will pass it through as the same.
This appears to be by design https://github.com/apache/spark/pull/11785
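A compact way to see the same relaxation with the TestPerson case class from the question (a sketch; assumes spark.implicits._ is in scope):
val personDS = Seq(TestPerson("JoeBob", 29)).toDS
personDS.schema("age").nullable                 // false: Int cannot be null
personDS.map(identity).schema("age").nullable   // true: the encoder round trip relaxes it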
A map is a transformation operation on the data. It takes the input and a function, and applies that function to all elements of the input data. The output is the set of return values of this function. So the schema of the output data is dependent on the return type of the function.
The map operation is a fairly standard and heavily used operation in functional programming. Look at https://en.m.wikipedia.org/wiki/Map_(higher-order_function) if you want to read more.
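As a tiny illustration of that point (a sketch reusing the TestPerson Dataset from the question), the element type returned by the function you pass to map becomes the schema of the result:
val ages = personDS.map(_.age)     // Dataset[Int]    -> schema: value: integer
val names = personDS.map(_.name)   // Dataset[String] -> schema: value: string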
I have a DataFrame like this:
root
|-- midx: double (nullable = true)
|-- future: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: long (nullable = false)
| | |-- _2: long (nullable = false)
Using this code, I am trying to transform it into something like this:
val T = withFfutures.where($"midx" === 47.0).select("midx","future").collect().map((row: Row) =>
Row {
row.getAs[Seq[Row]]("future").map { case Row(e: Long, f: Long) =>
(row.getAs[Double]("midx"), e, f)
}
}
).toList
root
|-- id: double (nullable = true)
|-- event: long (nullable = true)
|-- future: long (nullable = true)
So the plan is to transform the array of (event, future) into a DataFrame that has those two fields as columns. I am trying to convert T into a DataFrame like this:
val schema = StructType(Seq(
StructField("id", DoubleType, nullable = true)
, StructField("event", LongType, nullable = true)
, StructField("future", LongType, nullable = true)
))
val df = sqlContext.createDataFrame(context.parallelize(T), schema)
But when I try to look into the df, I get this error:
java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be cast to java.lang.Double
After a while I found what the problem was: first and foremost, the array of structs in the column should be cast to Row. So the final code to build the final data frame should look like this:
val T = withFfutures.select("midx","future").collect().flatMap( (row: Row) =>
row.getAs[Seq[Row]]("future").map { case Row(e: Long, f: Long) =>
(row.getAs[Double]("midx") , e, f)
}.toList
).toList
val all = context.parallelize(T).toDF("id","event","future")
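A hedged alternative that avoids collecting to the driver is to explode the array column instead (column names are taken from the question; _1/_2 are the field names of the tuple struct shown in the schema):
import org.apache.spark.sql.functions.explode
import spark.implicits._

val all = withFfutures
  .select($"midx".as("id"), explode($"future").as("ef"))
  .select($"id", $"ef._1".as("event"), $"ef._2".as("future"))
all.printSchema()   // id: double, event: long, future: long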
How to convert a dataframe with multiple columns
I can get an RDD[org.apache.spark.sql.Row], but I'd need something that I could use for org.apache.spark.mllib.fpm.FPGrowth, i.e. an RDD[Array[String]].
How to convert?
df.head
org.apache.spark.sql.Row = [blabla,128323,23843,11.23,blabla,null,null,..]
df.printSchema
|-- source: string (nullable = true)
|-- b1: string (nullable = true)
|-- b2: string (nullable = true)
|-- b3: long (nullable = true)
|-- amount: decimal(30,2) (nullable = true)
and so on
Thanks
The question is vague, but in general, you can change the RDD from Row to Array by passing through a Seq. The following code will take all columns from the RDD, convert them to strings, and return them as an array.
df.first
res1: org.apache.spark.sql.Row = [blah1,blah2]
df.rdd.map { _.toSeq.map(_.toString).toArray }.first
res2: Array[String] = Array(blah1, blah2)
This however may not be enough to get it to work with MLlib the way you want, since you didn't give enough detail, but it's a start.
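If the goal is MLlib's FPGrowth, a sketch of wiring the conversion in could look like this (the minSupport value is illustrative; String.valueOf avoids NullPointerExceptions on the null columns shown above, and FPGrowth also expects the items within a transaction to be distinct):
import org.apache.spark.mllib.fpm.FPGrowth

val transactions = df.rdd.map(_.toSeq.map(v => String.valueOf(v)).distinct.toArray)
val model = new FPGrowth().setMinSupport(0.2).run(transactions)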