I'm currently trying to extract a database from MongoDB and use Spark to ingest the data into ElasticSearch with geo_points.
The Mongo database has latitude and longitude values, but ElasticSearch requires them to be cast into the geo_point type.
Is there a way in Spark to copy the lat and lon columns to a new column that is an array or struct?
Any help is appreciated!
I assume you start with some kind of flat schema like this:
root
|-- lat: double (nullable = false)
|-- long: double (nullable = false)
|-- key: string (nullable = false)
First, let's create some example data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(
Row(52.23, 21.01, "Warsaw") :: Row(42.30, 9.15, "Corte") :: Nil)
val schema = StructType(
StructField("lat", DoubleType, false) ::
StructField("long", DoubleType, false) ::
StructField("key", StringType, false) ::Nil)
val df = sqlContext.createDataFrame(rdd, schema)
An easy way is to use a UDF and a case class:
case class Location(lat: Double, long: Double)
val makeLocation = udf((lat: Double, long: Double) => Location(lat, long))
val dfRes = df.
withColumn("location", makeLocation(col("lat"), col("long"))).
drop("lat").
drop("long")
dfRes.printSchema
and we get
root
|-- key: string (nullable = false)
|-- location: struct (nullable = true)
| |-- lat: double (nullable = false)
| |-- long: double (nullable = false)
A harder way is to transform your data and apply the schema afterwards:
val rddRes = df.
map{case Row(lat, long, key) => Row(key, Row(lat, long))}
val schemaRes = StructType(
StructField("key", StringType, false) ::
StructField("location", StructType(
StructField("lat", DoubleType, false) ::
StructField("long", DoubleType, false) :: Nil
), true) :: Nil
)
sqlContext.createDataFrame(rddRes, schemaRes).show
and we get the expected output
+------+-------------+
| key| location|
+------+-------------+
|Warsaw|[52.23,21.01]|
| Corte| [42.3,9.15]|
+------+-------------+
Creating a nested schema from scratch can be tedious, so if you can, I would recommend the first approach. It can easily be extended if you need a more sophisticated structure:
case class Pin(location: Location)
val makePin = udf((lat: Double, long: Double) => Pin(Location(lat, long)))
df.
withColumn("pin", makePin(col("lat"), col("long"))).
drop("lat").
drop("long").
printSchema
and we get the expected output:
root
|-- key: string (nullable = false)
|-- pin: struct (nullable = true)
| |-- location: struct (nullable = true)
| | |-- lat: double (nullable = false)
| | |-- long: double (nullable = false)
Unfortunately, you have no control over the nullable field, so if it is important for your project you'll have to specify the schema explicitly.
Finally, you can use the struct function introduced in Spark 1.4:
import org.apache.spark.sql.functions.struct
import sqlContext.implicits._ // brings in the $-notation used below
df.select($"key", struct($"lat", $"long").alias("location"))
Try this:
import org.apache.spark.sql.functions._
df.registerTempTable("dt")
val dfres = sqlContext.sql("select struct(lat, lon) as colName from dt")
I am trying to change the datatype of a column in a dataframe that I am reading from an RDBMS database.
To do that, I got the schema of the dataframe in the below way:
val dataSchema = dataDF.schema
To see the schema of the dataframe, I used the below statement:
println(dataSchema)
Output: StructType(StructField(je_header_id,LongType,true), StructField(je_line_num,LongType,true), StructField(last_update_date,TimestampType,true), StructField(last_updated_by,DecimalType(15,0),true), StructField(creation_date,TimestampType,true), StructField(created_by,DecimalType(15,0),true), StructField(created_by_name,StringType,true), StructField(entered_dr,DecimalType(38,30),true), StructField(entered_cr,DecimalType(38,30),true))
My requirement is to find the DecimalType fields in the above schema and change them to DoubleType.
I can get the column names and datatypes using dataDF.dtypes, but it gives me the datatypes in the format ((columnName1, columnDatatype1), (columnName2, columnDatatype2), ..., (columnNameN, columnDatatypeN)).
I have been trying, without success, to find a way to parse the StructType and change the schema in dataSchema.
Could anyone let me know if there is a way to parse the StructType so that I can change the datatypes as required and get the schema in the below format?
StructType(StructField(je_header_id,LongType,true), StructField(je_line_num,LongType,true), StructField(last_update_date,TimestampType,true), StructField(last_updated_by,DoubleType,true), StructField(creation_date,TimestampType,true), StructField(created_by,DoubleType,true), StructField(created_by_name,StringType,true), StructField(entered_dr,DoubleType,true), StructField(entered_cr,DoubleType,true))
To modify a DataFrame schema for a specific data type, you can pattern-match against StructField's dataType, as shown below:
import org.apache.spark.sql.types._
import spark.implicits._ // for .toDF on a local Seq (already in scope in spark-shell)
val df = Seq(
(1L, BigDecimal(12.34), "a", BigDecimal(10.001)),
(2L, BigDecimal(56.78), "b", BigDecimal(20.002))
).toDF("c1", "c2", "c3", "c4")
val newSchema = df.schema.fields.map{
case StructField(name, _: DecimalType, nullable, _)
=> StructField(name, DoubleType, nullable)
case field => field
}
// newSchema: Array[org.apache.spark.sql.types.StructField] = Array(
// StructField(c1,LongType,false), StructField(c2,DoubleType,true),
// StructField(c3,StringType,true), StructField(c4,DoubleType,true)
// )
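If you need an actual StructType rather than an Array[StructField] (for example to pass to a reader's schema(...) later), you can simply wrap the mapped fields:
val newStructType = StructType(newSchema)
// newStructType: org.apache.spark.sql.types.StructType = StructType(
//   StructField(c1,LongType,false), StructField(c2,DoubleType,true), ...)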
However, assuming your end goal is to transform the dataset with the column types changed, it would be easier to just traverse the columns of the targeted data type and iteratively cast them, like below:
import org.apache.spark.sql.functions._
val df2 = df.dtypes.
collect{ case (dn, dt) if dt.startsWith("DecimalType") => dn }.
foldLeft(df)((accDF, c) => accDF.withColumn(c, col(c).cast("Double")))
df2.printSchema
// root
// |-- c1: long (nullable = false)
// |-- c2: double (nullable = true)
// |-- c3: string (nullable = true)
// |-- c4: double (nullable = true)
[UPDATE]
Per an additional requirement from the comments: if you want to change the schema only for DecimalType columns with a positive scale, just apply a regex pattern-match as the guard condition in the collect:
val pattern = """DecimalType\(\d+,(\d+)\)""".r
val df2 = df.dtypes.
collect{ case (dn, dt) if pattern.findFirstMatchIn(dt).map(_.group(1)).getOrElse("0") != "0" => dn }.
foldLeft(df)((accDF, c) => accDF.withColumn(c, col(c).cast("Double")))
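As a quick sanity check of that guard, here is a small standalone illustration; the type strings below are hypothetical examples of what df.dtypes can return:
val pattern = """DecimalType\(\d+,(\d+)\)""".r

Seq("DecimalType(15,0)", "DecimalType(38,30)", "LongType").foreach { dt =>
  val scale = pattern.findFirstMatchIn(dt).map(_.group(1)).getOrElse("0")
  val positiveScale = scale != "0"
  println(s"$dt -> cast to Double? $positiveScale")
}
// DecimalType(15,0) -> cast to Double? false
// DecimalType(38,30) -> cast to Double? true
// LongType -> cast to Double? false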
Here is another way:
data.show(false)
data.printSchema
+----+------------------------+----+----------------------+
|col1|col2 |col3|col4 |
+----+------------------------+----+----------------------+
|1 |0.003200000000000000 |a |23.320000000000000000 |
|2 |78787.990030000000000000|c |343.320000000000000000|
+----+------------------------+----+----------------------+
root
|-- col1: integer (nullable = false)
|-- col2: decimal(38,18) (nullable = true)
|-- col3: string (nullable = true)
|-- col4: decimal(38,18) (nullable = true)
Create the schema that you want. Example:
import org.apache.spark.sql.types._
val newSchema = StructType(
Seq(
StructField("col1", StringType, true),
StructField("col2", DoubleType, true),
StructField("col3", StringType, true),
StructField("col4", DoubleType, true)
)
)
Cast the columns to the required datatype.
val newDF = data.selectExpr(newSchema.map(
col => s"CAST ( ${col.name} As ${col.dataType.sql}) ${col.name}"
): _*)
newDF.printSchema
root
|-- col1: string (nullable = false)
|-- col2: double (nullable = true)
|-- col3: string (nullable = true)
|-- col4: double (nullable = true)
newDF.show(false)
+----+-----------+----+------+
|col1|col2 |col3|col4 |
+----+-----------+----+------+
|1 |0.0032 |a |23.32 |
|2 |78787.99003|c |343.32|
+----+-----------+----+------+
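For reference, the expressions handed to selectExpr by the map above look like the following; printing them is just an inspection aid (assuming newSchema as defined above) and can help when a cast expression fails to parse:
newSchema.foreach(f => println(s"CAST ( ${f.name} As ${f.dataType.sql}) ${f.name}"))
// CAST ( col1 As STRING) col1
// CAST ( col2 As DOUBLE) col2
// CAST ( col3 As STRING) col3
// CAST ( col4 As DOUBLE) col4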
The accepted solution works, but it is very costly: each withColumn call forces the analyzer to re-analyze the whole DataFrame, so with a large number of columns the overhead becomes significant. I would rather suggest doing this -
import org.apache.spark.sql.types._

// Build the full column list in a single pass: Decimal columns are cast to Double,
// all other columns are passed through unchanged, so nothing is dropped.
val transformedColumns = inputDataDF.dtypes.map {
  case (dn, dt) if dt.startsWith("DecimalType") => inputDataDF(dn).cast(DoubleType).alias(dn)
  case (dn, _) => inputDataDF(dn)
}

val transformedDF = inputDataDF.select(transformedColumns: _*)
For a very small dataset the withColumn approach took over a minute on my machine, while the select approach took about 100 ms.
You can read more about the cost of withColumn here - https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015
I need to create a schema using an existing df field.
Consider this example dataframe
scala> case class prd (a:Int, b:Int)
defined class prd
scala> val df = Seq((Array(prd(10,20),prd(15,30),prd(20,25)))).toDF("items")
df: org.apache.spark.sql.DataFrame = [items: array<struct<a:int,b:int>>]
scala> df.printSchema
root
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = false)
| | |-- b: integer (nullable = false)
I need one more field, "items_day1", similar to "items", for df2. Right now I'm doing it as below, which is a workaround:
scala> val df2=df.select('items,'items.as("item_day1"))
df2: org.apache.spark.sql.DataFrame = [items: array<struct<a:int,b:int>>, item_day1: array<struct<a:int,b:int>>]
scala> df2.printSchema
root
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = false)
| | |-- b: integer (nullable = false)
|-- item_day1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = false)
| | |-- b: integer (nullable = false)
scala>
But how can I get that using the df.schema.add() or df.schema.copy() functions?
EDIT1:
I'm trying the following:
val (a,b) = (df.schema,df.schema) // works
a("items") //works
b.add(a("items").as("items_day1")) //Error..
To add a new field to your DataFrame schema (which is a StructType) with the same structure as an existing field but a different top-level name, you can copy the StructField with a modified name member, as shown below:
import org.apache.spark.sql.types._
case class prd (a:Int, b:Int)
val df = Seq((Array(prd(10,20), prd(15,30), prd(20,25)))).toDF("items")
val schema = df.schema
// schema: org.apache.spark.sql.types.StructType = StructType(
// StructField(items, ArrayType(
// StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,false)
// ), true), true)
// )
val newSchema = schema.find(_.name == "items") match {
case Some(field) => schema.add(field.copy(name = "items_day1"))
case None => schema
}
// newSchema: org.apache.spark.sql.types.StructType = StructType(
// StructField(items, ArrayType(
// StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,false)
// ), true), true),
// StructField(items_day1, ArrayType(
// StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,false)
// ), true), true)
// )
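To double-check that this matches what the workaround from the question produces, you can compare the two schemas directly; this is just a sanity-check sketch and assumes the nullability of the duplicated column lines up (it does here, since copy only changes the name):
val df2 = df.select(df("items"), df("items").as("items_day1"))

// Both comparisons should print true if the copied field mirrors the duplicated column.
println(df2.schema == newSchema)
println(df2.schema("items_day1") == newSchema("items_day1"))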
I am doing some calculations at row level in Scala/Spark. I have a dataframe created using the JSON below -
{"available":false,"createTime":"2016-01-08","dataValue":{"names_source":{"first_names":["abc", "def"],"last_names_id":[123,456]},"another_source_array":[{"first":"1.1","last":"ONE"}],"another_source":"TableSources","location":"GMP", "timestamp":"2018-02-11"},"deleteTime":"2016-01-08"}
You can create a dataframe using this JSON directly. My schema looks like the one below -
root
|-- available: boolean (nullable = true)
|-- createTime: string (nullable = true)
|-- dataValue: struct (nullable = true)
| |-- another_source: string (nullable = true)
| |-- another_source_array: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- first: string (nullable = true)
| | | |-- last: string (nullable = true)
| |-- location: string (nullable = true)
| |-- names_source: struct (nullable = true)
| | |-- first_names: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- last_names_id: array (nullable = true)
| | | |-- element: long (containsNull = true)
| |-- timestamp: string (nullable = true)
|-- deleteTime: string (nullable = true)
I am reading all columns separately with readSchema and writing with writeSchema. Out of the two complex columns, I am able to process one but not the other.
Below is part of my read schema -
.add("names_source", StructType(
StructField("first_names", ArrayType.apply(StringType)) ::
StructField("last_names_id", ArrayType.apply(DoubleType)) ::
Nil
))
.add("another_source_array", ArrayType(StructType(
StructField("first", StringType) ::
StructField("last", StringType) ::
Nil
)))
Here is part of my write schema -
.add("names_source", StructType.apply(Seq(
StructField("first_names", StringType),
StructField("last_names_id", DoubleType))
))
.add("another_source_array", ArrayType(StructType.apply(Seq(
StructField("first", StringType),
StructField("last", StringType))
)))
During processing, I am using a method to index all columns. Below is my code for that function -
def myMapRedFunction(df: DataFrame, spark: SparkSession): DataFrame = {
val columnIndex = dataIndexingSchema.fieldNames.zipWithIndex.toMap
val myRDD = df.rdd
.map(row => {
Row(
row.getAs[Boolean](columnIndex("available")),
parseDate(row.getAs[String](columnIndex("create_time"))),
??I Need help here??
row.getAs[String](columnIndex("another_source")),
anotherSourceArrayFunction(row.getSeq[Row](columnIndex("another_source_array"))),
row.getAs[String](columnIndex("location")),
row.getAs[String](columnIndex("timestamp")),
parseDate(row.getAs[String](columnIndex("delete_time")))
)
}).distinct
spark.createDataFrame(myRDD, dataWriteSchema)
}
The another_source_array column is being processed by the anotherSourceArrayFunction method to make sure we get the schema as per the requirements. I need a similar function for the names_source column. Below is the function that I am using for the another_source_array column.
def anotherSourceArrayFunction(data: Seq[Row]): Seq[Row] = {
if (data == null) {
data
} else {
data.map(r => {
val first = r.getAs[String]("first").toUpperCase
val last = r.getAs[String]("last")
new GenericRowWithSchema(Array(first,last), StructType(
StructField("first", StringType) ::
StructField("last", StringType) ::
Nil
))
})
}
}
In short, I need something like this, where I can get my names_source column structure as a struct:
names_source:struct<first_names:array<string>,last_names_id:array<bigint>>
another_source_array:array<struct<first:string,last:string>>
The above are the column schemas required in the end. I am able to get another_source_array properly and need help with names_source. I think my write schema for this column is correct, but I am not sure. In the end I need names_source:struct<first_names:array<string>,last_names_id:array<bigint>> as the column schema.
Note: I am able to get the another_source_array column without any problem. I kept that function here for better understanding.
From what I see in all the code you've tried, you are trying to flatten the struct dataValue column into separate columns.
If my assumption is correct then you don't have to go through such complexity. You can simply do the following:
val myRDD = df.rdd
.map(row => {
Row(
row.getAs[Boolean]("available"),
parseDate(row.getAs[String]("createTime")),
row.getAs[Row]("dataValue").getAs[Row]("names_source"),
row.getAs[Row]("dataValue").getAs[String]("another_source"),
row.getAs[Row]("dataValue").getAs[Seq[Row]]("another_source_array"),
row.getAs[Row]("dataValue").getAs[String]("location"),
row.getAs[Row]("dataValue").getAs[String]("timestamp"),
parseDate(row.getAs[String]("deleteTime"))
)
}).distinct
import org.apache.spark.sql.types._
val dataWriteSchema = StructType(Seq(
StructField("available", BooleanType, true),
StructField("createTime", DateType, true),
StructField("names_source", StructType(Seq(StructField("first_names", ArrayType(StringType), true), StructField("last_names_id", ArrayType(LongType), true))), true),
StructField("another_source", StringType, true),
StructField("another_source_array", ArrayType(StructType.apply(Seq(StructField("first", StringType),StructField("last", StringType)))), true),
StructField("location", StringType, true),
StructField("timestamp", StringType, true),
StructField("deleteTime", DateType, true)
))
spark.createDataFrame(myRDD, dataWriteSchema).show(false)
Using * to flatten the struct column
You can simply use .* on the struct column for the elements of the struct to become separate columns:
import org.apache.spark.sql.functions._
df.select(col("available"), col("createTime"), col("dataValue.*"), col("deleteTime")).show(false)
With this method you will still have to cast the string date columns to DateType yourself, for example as sketched below.
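A sketch of the same select with the two date strings converted; it assumes they are in the default yyyy-MM-dd format so to_date can parse them without an explicit pattern:
import org.apache.spark.sql.functions._

df.select(
  col("available"),
  to_date(col("createTime")).alias("createTime"),
  col("dataValue.*"),
  to_date(col("deleteTime")).alias("deleteTime")
).printSchema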
In both cases you should get output like this:
+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+
|available|createTime|names_source |another_source|another_source_array|location|timestamp |deleteTime|
+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+
|false |2016-01-08|[WrappedArray(abc, def),WrappedArray(123, 456)]|TableSources |[[1.1,ONE]] |GMP |2018-02-11|2016-01-08|
+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+
I hope the answer is helpful
I want to add a struct column to a dataframe, but the struct has more than 100 fields.
I learned that a case class can be turned into a struct column, but a case class is limited to no more than 22 fields (our production Spark is 1.6.3 with Scala 2.10.4).
Can a normal class do this? Which functions or interfaces do I have to implement?
There is also org.apache.spark.sql.functions.struct, but it seems that it can't set the names of the fields of the struct.
Thanks in advance.
but it seems that it can't set the names of the fields of the struct.
You can. For example:
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $-notation (already in scope in spark-shell)
spark.range(1).withColumn("foo",
struct($"id".alias("x"), lit("foo").alias("y"), struct($"id".alias("bar")))
).printSchema
root
|-- id: long (nullable = false)
|-- foo: struct (nullable = false)
| |-- x: long (nullable = false)
| |-- y: string (nullable = false)
| |-- col3: struct (nullable = false)
| | |-- bar: long (nullable = false)
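Since your struct has more than 100 fields, you probably don't want to spell out every alias by hand. Here is a hedged sketch that builds the struct programmatically from a list of (sourceColumn, fieldName) pairs; the mapping below is made up for illustration:
import org.apache.spark.sql.functions.{col, struct}

// Hypothetical mapping from source column to struct field name; extend to 100+ entries.
val fieldMapping: Seq[(String, String)] = Seq("id" -> "x", "id" -> "y")

val bigStruct = struct(fieldMapping.map { case (src, name) => col(src).alias(name) }: _*)

spark.range(1).withColumn("foo", bigStruct).printSchema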
There is no need to define a case class for this struct; you can create the struct type this way:
import org.apache.spark.sql.types._
val struct =
StructType(
StructField("a", IntegerType, true) ::
StructField("b", LongType, false) ::
StructField("c", BooleanType, false) :: Nil)
This struct can have arbitrary length.
Then you can read the dataframe this way:
val df = sparkSession.read.schema(struct).//your read method
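For example, assuming the data is JSON at a hypothetical path:
// The path and format are placeholders; substitute your actual source.
val df = sparkSession.read.schema(struct).json("/path/to/data.json")
df.printSchema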
I have a DataFrame like this:
root
|-- midx: double (nullable = true)
|-- future: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: long (nullable = false)
| | |-- _2: long (nullable = false)
Using this code, I am trying to transform it into something like this:
val T = withFfutures.where($"midx" === 47.0).select("midx","future").collect().map((row: Row) =>
Row {
row.getAs[Seq[Row]]("future").map { case Row(e: Long, f: Long) =>
(row.getAs[Double]("midx"), e, f)
}
}
).toList
root
|-- id: double (nullable = true)
|-- event: long (nullable = true)
|-- future: long (nullable = true)
So the plan is to turn the array of (event, future) into a dataframe that has those two fields as columns. I am trying to convert T into a DataFrame like this:
val schema = StructType(Seq(
StructField("id", DoubleType, nullable = true)
, StructField("event", LongType, nullable = true)
, StructField("future", LongType, nullable = true)
))
val df = sqlContext.createDataFrame(context.parallelize(T), schema)
But when I try to look into the df I get this error:
java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be cast to java.lang.Double
After a while I found what the problem was: first and foremost, the array of structs in the column should be read as Row objects (a Seq[Row]). So the final code to build the data frame looks like this:
val T = withFfutures.select("midx","future").collect().flatMap( (row: Row) =>
row.getAs[Seq[Row]]("future").map { case Row(e: Long, f: Long) =>
(row.getAs[Double]("midx") , e, f)
}.toList
).toList
val all = context.parallelize(T).toDF("id","event","future")
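As a side note, here is a sketch of an alternative that stays in the DataFrame API and avoids collecting to the driver; it assumes the same withFfutures DataFrame and that exploding the array is acceptable:
import org.apache.spark.sql.functions._

// explode turns each (event, future) element into its own row,
// then the struct fields are pulled out into top-level columns.
val allDF = withFfutures
  .select($"midx".as("id"), explode($"future").as("f"))
  .select($"id", $"f._1".as("event"), $"f._2".as("future"))

allDF.printSchema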