How to add large struct column to dataframe - scala

I want to add a struct column to a DataFrame, but the struct has more than 100 fields.
I learned that a case class can be converted to a struct column, but a case class is limited to 22 fields (the Spark in production is 1.6.3 with Scala 2.10.4).
Can a normal class do this? Which functions or interfaces do I have to implement?
There is also org.apache.spark.sql.functions.struct, but it seems that it can't set the names of the struct's fields.
Thanks ahead.

"but it seems that it can't set the names of the struct's fields."
You can. For example:
import org.apache.spark.sql.functions._
import spark.implicits._   // for the $ column syntax

spark.range(1).withColumn("foo",
  struct($"id".alias("x"), lit("foo").alias("y"), struct($"id".alias("bar")))
).printSchema
root
|-- id: long (nullable = false)
|-- foo: struct (nullable = false)
| |-- x: long (nullable = false)
| |-- y: string (nullable = false)
| |-- col3: struct (nullable = false)
| | |-- bar: long (nullable = false)
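Since struct accepts a varargs sequence of columns, a struct with 100+ named fields can also be built programmatically instead of written out by hand. A minimal sketch (the flat column names f1 ... f100 and the DataFrame df are hypothetical, just to illustrate the pattern):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, struct}

// Hypothetical: df has flat columns f1 ... f100 that should be nested
// under a single struct column named "big".
val fieldNames: Seq[String] = (1 to 100).map(i => s"f$i")

// One aliased Column per field; alias sets the field name inside the struct,
// and struct(cols: _*) takes any number of columns, so the 22-field
// case class limit does not apply here.
val cols: Seq[Column] = fieldNames.map(n => col(n).alias(n))

val withStruct = df.withColumn("big", struct(cols: _*))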

There is no need to define a case class for this struct; you can create the struct type this way:
import org.apache.spark.sql.types._

val struct =
  StructType(
    StructField("a", IntegerType, true) ::
    StructField("b", LongType, false) ::
    StructField("c", BooleanType, false) :: Nil)
This struct can have an arbitrary number of fields.
Then you can read the DataFrame this way:
val df = sparkSession.read.schema(struct).//your read method
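If the struct really has 100+ fields, the StructType itself can also be generated programmatically. A minimal sketch (field names and types here are hypothetical; in practice they might come from a config file or a metadata table):
import org.apache.spark.sql.types._

// Hypothetical list of field names for a very wide struct
val fieldNames = (1 to 120).map(i => s"field_$i")

val wideStruct = StructType(fieldNames.map(name => StructField(name, StringType, nullable = true)))

// Used exactly like the hand-written struct above:
// val df = sparkSession.read.schema(wideStruct).//your read method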

Related

Access names within pyspark columns

I need some help accessing field names within columns. I have, for example, the following schema:
root
|-- id_1: string (nullable = true)
|-- array_1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id_2: string (nullable = true)
| | |-- post: struct (nullable = true)
| | | |-- value: double (nullable = true)
By using
cols = df.columns
I will get a list of all names at root level,
cols = [id_1, array_1,...]
However, I would like to access the names within e.g. 'array_1'. Using
df.array_1.columns
simply returns
Column<b'array_1[columns]'>
and no names. Is there any way to access names within arrays? The same issue arises with structs. This would make it easier to loop over fields and write helper functions. If it is possible to avoid extra modules, that would be beneficial.
Thanks
You can use the DataFrame's schema to look up column names, via the StructType and StructField APIs. Example Scala/Spark code (adapt it to your needs):
import org.apache.spark.sql.types._
import spark.implicits._

case class A(a: Int, b: String)
val df = Seq(("a", Array(A(1, "asd"))), ("b", Array(A(2, "dsa")))).toDF("str_col", "arr_col")
println(df.schema)
> StructType(StructField(str_col,StringType,true), StructField(arr_col,ArrayType(StructType(StructField(a,IntegerType,false), StructField(b,StringType,true)),true),true))

val fields = df.schema.fields

println(fields(0).name)
> str_col

println(fields(1).dataType.asInstanceOf[ArrayType].elementType)
> StructType(StructField(a,IntegerType,false), StructField(b,StringType,true))
.....
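To go one level deeper and get the field names inside the array's struct elements, you can pattern match on the data types. A short sketch building on the df above:
import org.apache.spark.sql.types.{ArrayType, StructType}

// Field names inside the struct elements of arr_col
val nestedNames = df.schema("arr_col").dataType match {
  case ArrayType(st: StructType, _) => st.fieldNames.toList   // array of structs
  case st: StructType               => st.fieldNames.toList   // plain struct column
  case _                            => Nil
}
println(nestedNames)
> List(a, b)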

Rename nested struct columns in a Spark DataFrame [duplicate]

I am trying to change the names of a DataFrame's columns in Scala. I can easily rename the top-level columns, but I'm facing difficulty converting the array/struct columns.
Below is my DataFrame schema.
|-- _VkjLmnVop: string (nullable = true)
|-- _KaTasLop: string (nullable = true)
|-- AbcDef: struct (nullable = true)
| |-- UvwXyz: struct (nullable = true)
| | |-- _MnoPqrstUv: string (nullable = true)
| | |-- _ManDevyIxyz: string (nullable = true)
But I need the schema like below
|-- vkj_lmn_vop: string (nullable = true)
|-- ka_tas_lop: string (nullable = true)
|-- abc_def: struct (nullable = true)
| |-- uvw_xyz: struct (nullable = true)
| | |-- mno_pqrst_uv: string (nullable = true)
| | |-- man_devy_ixyz: string (nullable = true)
For non-struct columns I'm changing the column names as below:
def aliasAllColumns(df: DataFrame): DataFrame = {
  df.select(df.columns.map { c =>
    df.col(c)
      .as(
        c.replaceAll("_", "")
          .replaceAll("([A-Z])", "_$1")
          .toLowerCase
          .replaceFirst("_", ""))
  }: _*)
}
aliasAllColumns(file_data_df).show(1)
How can I change the struct column names dynamically?
You can create a recursive method to traverse the DataFrame schema for renaming the columns:
import org.apache.spark.sql.types._

def renameAllCols(schema: StructType, rename: String => String): StructType = {
  def recurRename(schema: StructType): Seq[StructField] = schema.fields.map {
    case StructField(name, dtype: StructType, nullable, meta) =>
      StructField(rename(name), StructType(recurRename(dtype)), nullable, meta)
    case StructField(name, dtype: ArrayType, nullable, meta) if dtype.elementType.isInstanceOf[StructType] =>
      StructField(rename(name), ArrayType(StructType(recurRename(dtype.elementType.asInstanceOf[StructType])), true), nullable, meta)
    case StructField(name, dtype, nullable, meta) =>
      StructField(rename(name), dtype, nullable, meta)
  }
  StructType(recurRename(schema))
}
Testing it with the following example:
import org.apache.spark.sql.functions._
import spark.implicits._

val renameFcn = (s: String) =>
  s.replace("_", "").replaceAll("([A-Z])", "_$1").toLowerCase.dropWhile(_ == '_')

case class C(A_Bc: Int, D_Ef: Int)

val df = Seq(
  (10, "a", C(1, 2), Seq(C(11, 12), C(13, 14)), Seq(101, 102)),
  (20, "b", C(3, 4), Seq(C(15, 16)), Seq(103))
).toDF("_VkjLmnVop", "_KaTasLop", "AbcDef", "ArrStruct", "ArrInt")

val newDF = spark.createDataFrame(df.rdd, renameAllCols(df.schema, renameFcn))

newDF.printSchema
// root
// |-- vkj_lmn_vop: integer (nullable = false)
// |-- ka_tas_lop: string (nullable = true)
// |-- abc_def: struct (nullable = true)
// | |-- a_bc: integer (nullable = false)
// | |-- d_ef: integer (nullable = false)
// |-- arr_struct: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- a_bc: integer (nullable = false)
// | | |-- d_ef: integer (nullable = false)
// |-- arr_int: array (nullable = true)
// | |-- element: integer (containsNull = false)
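As a quick check (hypothetical usage of the newDF built above), the renamed nested fields are now addressable by their new names:
newDF.select($"abc_def.a_bc", $"arr_struct.d_ef").show(false)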
As far as I know, it's not possible to rename nested fields directly.
On one hand, you could try moving to a flat object.
However, if you need to keep the structure, you can play with spark.sql.functions.struct(*cols):
Creates a new struct column.
Parameters: cols – list of column names (string) or list of Column expressions
You will need to decompose the whole schema, generate the aliases you need, and then compose it again using the struct function, as in the sketch below.
It's not the best solution, but it's something :)
P.S.: I'm quoting the PySpark doc above since it contains a better explanation than the Scala one.
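To make the decompose-and-recompose idea concrete for the schema in this question, here is a rough Scala sketch (assuming the question's DataFrame is called df here; it only handles the nesting level shown and is not a general solution):
import org.apache.spark.sql.functions.{col, struct}

// Rebuild AbcDef by hand with renamed inner fields, then alias the outer columns.
val renamed = df.select(
  col("_VkjLmnVop").as("vkj_lmn_vop"),
  col("_KaTasLop").as("ka_tas_lop"),
  struct(
    struct(
      col("AbcDef.UvwXyz._MnoPqrstUv").as("mno_pqrst_uv"),
      col("AbcDef.UvwXyz._ManDevyIxyz").as("man_devy_ixyz")
    ).as("uvw_xyz")
  ).as("abc_def")
)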

Get element of Struct Type for a row in scala using any function

I am doing some calculations at row level in Scala/Spark. I have a DataFrame created from the JSON below:
{"available":false,"createTime":"2016-01-08","dataValue":{"names_source":{"first_names":["abc", "def"],"last_names_id":[123,456]},"another_source_array":[{"first":"1.1","last":"ONE"}],"another_source":"TableSources","location":"GMP", "timestamp":"2018-02-11"},"deleteTime":"2016-01-08"}
You can create a DataFrame from this JSON directly. My schema looks like this:
root
|-- available: boolean (nullable = true)
|-- createTime: string (nullable = true)
|-- dataValue: struct (nullable = true)
| |-- another_source: string (nullable = true)
| |-- another_source_array: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- first: string (nullable = true)
| | | |-- last: string (nullable = true)
| |-- location: string (nullable = true)
| |-- names_source: struct (nullable = true)
| | |-- first_names: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- last_names_id: array (nullable = true)
| | | |-- element: long (containsNull = true)
| |-- timestamp: string (nullable = true)
|-- deleteTime: string (nullable = true)
I am reading all columns separately with readSchema and writing with writeSchema. Of the two complex columns, I am able to process one but not the other.
Below is a part of my read schema:
.add("names_source", StructType(
StructField("first_names", ArrayType.apply(StringType)) ::
StructField("last_names_id", ArrayType.apply(DoubleType)) ::
Nil
))
.add("another_source_array", ArrayType(StructType(
StructField("first", StringType) ::
StructField("last", StringType) ::
Nil
)))
Here is a part of my write schema-
.add("names_source", StructType.apply(Seq(
StructField("first_names", StringType),
StructField("last_names_id", DoubleType))
))
.add("another_source_array", ArrayType(StructType.apply(Seq(
StructField("first", StringType),
StructField("last", StringType))
)))
In processing, I am using a method that indexes all the columns. Below is the relevant piece of code:
def myMapRedFunction(df: DataFrame, spark: SparkSession): DataFrame = {
  val columnIndex = dataIndexingSchema.fieldNames.zipWithIndex.toMap
  val myRDD = df.rdd
    .map(row => {
      Row(
        row.getAs[Boolean](columnIndex("available")),
        parseDate(row.getAs[String](columnIndex("create_time"))),
        ??I Need help here??
        row.getAs[String](columnIndex("another_source")),
        anotherSourceArrayFunction(row.getSeq[Row](columnIndex("another_source_array"))),
        row.getAs[String](columnIndex("location")),
        row.getAs[String](columnIndex("timestamp")),
        parseDate(row.getAs[String](columnIndex("delete_time")))
      )
    }).distinct
  spark.createDataFrame(myRDD, dataWriteSchema)
}
The another_source_array column is processed by the anotherSourceArrayFunction method to make sure we get the schema per the requirements. I need a similar function for the names_source column. Below is the function I am using for the another_source_array column.
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types._

def anotherSourceArrayFunction(data: Seq[Row]): Seq[Row] = {
  if (data == null) {
    data
  } else {
    data.map(r => {
      val first = r.getAs[String]("first").toUpperCase
      val last = r.getAs[String]("last")
      new GenericRowWithSchema(Array(first, last), StructType(
        StructField("first", StringType) ::
        StructField("last", StringType) ::
        Nil
      ))
    })
  }
}
In short, I need something like this, where I get my names_source column structure as a struct:
names_source:struct<first_names:array<string>,last_names_id:array<bigint>>
another_source_array:array<struct<first:string,last:string>>
Above are the column schemas required in the end. I am able to get another_source_array properly and need help with names_source. I think my write schema for this column is correct, but I am not sure. In the end I need names_source:struct<first_names:array<string>,last_names_id:array<bigint>> as the column schema.
Note: I am able to get the another_source_array column without any problem; I kept that function here for better understanding.
From what I see in all the code you've tried, you are trying to flatten the struct dataValue column into separate columns.
If my assumption is correct then you don't have to go through such complexity. You can simply do the following:
val myRDD = df.rdd
  .map(row => {
    Row(
      row.getAs[Boolean]("available"),
      parseDate(row.getAs[String]("createTime")),
      row.getAs[Row]("dataValue").getAs[Row]("names_source"),
      row.getAs[Row]("dataValue").getAs[String]("another_source"),
      row.getAs[Row]("dataValue").getAs[Seq[Row]]("another_source_array"),
      row.getAs[Row]("dataValue").getAs[String]("location"),
      row.getAs[Row]("dataValue").getAs[String]("timestamp"),
      parseDate(row.getAs[String]("deleteTime"))
    )
  }).distinct
import org.apache.spark.sql.types._

val dataWriteSchema = StructType(Seq(
  StructField("available", BooleanType, true),
  StructField("createTime", DateType, true),
  StructField("names_source", StructType(Seq(StructField("first_names", ArrayType(StringType), true), StructField("last_names_id", ArrayType(LongType), true))), true),
  StructField("another_source", StringType, true),
  StructField("another_source_array", ArrayType(StructType.apply(Seq(StructField("first", StringType), StructField("last", StringType)))), true),
  StructField("location", StringType, true),
  StructField("timestamp", StringType, true),
  StructField("deleteTime", DateType, true)
))
spark.createDataFrame(myRDD, dataWriteSchema).show(false)
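If you still want a helper for names_source analogous to anotherSourceArrayFunction (for example because you need to transform the inner arrays), a minimal sketch could look like the following. The field names follow the schemas above; namesSourceFunction itself is a hypothetical helper, not something Spark requires:
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types._

def namesSourceFunction(data: Row): Row = {
  if (data == null) {
    data
  } else {
    val firstNames = data.getAs[Seq[String]]("first_names")
    val lastNamesId = data.getAs[Seq[Long]]("last_names_id")
    // Rebuild the struct with the same two array fields
    new GenericRowWithSchema(Array(firstNames, lastNamesId), StructType(
      StructField("first_names", ArrayType(StringType)) ::
      StructField("last_names_id", ArrayType(LongType)) ::
      Nil
    ))
  }
}

// It would be called inside the map as:
// namesSourceFunction(row.getAs[Row]("dataValue").getAs[Row]("names_source"))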
Using .* to flatten the struct column
You can simply use .* on a struct column to turn its elements into separate columns:
import org.apache.spark.sql.functions._
df.select(col("available"), col("createTime"), col("dataValue.*"), col("deleteTime")).show(false)
You would still have to convert the string date columns to DateType yourself with this method.
In both cases you should get output like:
+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+
|available|createTime|names_source |another_source|another_source_array|location|timestamp |deleteTime|
+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+
|false |2016-01-08|[WrappedArray(abc, def),WrappedArray(123, 456)]|TableSources |[[1.1,ONE]] |GMP |2018-02-11|2016-01-08|
+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+
I hope the answer is helpful

How to rename elements of an array of structs in Spark DataFrame API

I have a UDF which returns an array of tuples:
import org.apache.spark.sql.functions.udf

val df = spark.range(1).toDF("i")
val myUDF = udf((l: Long) => {
  Seq((1, 2))
})
df.withColumn("udf_result", myUDF($"i"))
  .printSchema
gives
root
|-- i: long (nullable = false)
|-- udf_result: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: integer (nullable = false)
I want to rename the elements of the struct to something meaningful instead of _1 and _2. How can this be achieved? Note that I'm aware that returning a Seq of case classes would let me give proper field names, but using Spark-Notebook (REPL) with Yarn we have many issues with case classes, so I'm looking for a solution without them.
I'm using Spark 2 but with untyped DataFrames; the solution should also be applicable to Spark 1.6.
It is possible to cast the output of the UDF. E.g. to rename the struct fields to x and y, you can do:
Type-safe:
import org.apache.spark.sql.types._

val schema = ArrayType(
  StructType(
    Array(
      StructField("x", IntegerType),
      StructField("y", IntegerType)
    )
  )
)

df.withColumn("udf_result", myUDF($"i").cast(schema))
Or unsafe, but shorter, using a string argument to cast:
df.withColumn("udf_result", myUDF($"i").cast("array<struct<x:int,y:int>>"))
Both will give the schema:
root
|-- i: long (nullable = false)
|-- udf_result: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: integer (nullable = true)
| | |-- y: integer (nullable = true)
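As a quick usage check (hypothetical, reusing the df and myUDF from the question), the renamed fields can then be accessed by name:
import org.apache.spark.sql.functions.explode

df.withColumn("udf_result", myUDF($"i").cast("array<struct<x:int,y:int>>"))
  .select(explode($"udf_result").as("r"))   // one row per struct element
  .select($"r.x", $"r.y")                   // fields addressable by their new names
  .show()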

How to add a new Struct column to a DataFrame

I'm currently trying to extract a database from MongoDB and use Spark to ingest it into Elasticsearch with geo_points.
The Mongo database has latitude and longitude values, but Elasticsearch requires them to be cast into the geo_point type.
Is there a way in Spark to copy the lat and lon columns into a new column that is an array or struct?
Any help is appreciated!
I assume you start with some kind of flat schema like this:
root
|-- lat: double (nullable = false)
|-- long: double (nullable = false)
|-- key: string (nullable = false)
First let's create example data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._

val rdd = sc.parallelize(
  Row(52.23, 21.01, "Warsaw") :: Row(42.30, 9.15, "Corte") :: Nil)

val schema = StructType(
  StructField("lat", DoubleType, false) ::
  StructField("long", DoubleType, false) ::
  StructField("key", StringType, false) :: Nil)

val df = sqlContext.createDataFrame(rdd, schema)
An easy way is to use a UDF and a case class:
case class Location(lat: Double, long: Double)

val makeLocation = udf((lat: Double, long: Double) => Location(lat, long))

val dfRes = df.
  withColumn("location", makeLocation(col("lat"), col("long"))).
  drop("lat").
  drop("long")
dfRes.printSchema
and we get
root
|-- key: string (nullable = false)
|-- location: struct (nullable = true)
| |-- lat: double (nullable = false)
| |-- long: double (nullable = false)
A harder way is to transform your data and apply the schema afterwards:
val rddRes = df.
  map { case Row(lat, long, key) => Row(key, Row(lat, long)) }

val schemaRes = StructType(
  StructField("key", StringType, false) ::
  StructField("location", StructType(
    StructField("lat", DoubleType, false) ::
    StructField("long", DoubleType, false) :: Nil
  ), true) :: Nil
)
sqlContext.createDataFrame(rddRes, schemaRes).show
and we get an expected output
+------+-------------+
| key| location|
+------+-------------+
|Warsaw|[52.23,21.01]|
| Corte| [42.3,9.15]|
+------+-------------+
Creating a nested schema from scratch can be tedious, so if you can, I would recommend the first approach. It can easily be extended if you need a more sophisticated structure:
case class Pin(location: Location)

val makePin = udf((lat: Double, long: Double) => Pin(Location(lat, long)))

df.
  withColumn("pin", makePin(col("lat"), col("long"))).
  drop("lat").
  drop("long").
  printSchema
and we get the expected output:
root
|-- key: string (nullable = false)
|-- pin: struct (nullable = true)
| |-- location: struct (nullable = true)
| | |-- lat: double (nullable = false)
| | |-- long: double (nullable = false)
Unfortunately you have no control over the nullable flag here, so if that matters for your project you'll have to specify the schema explicitly.
Finally, you can use the struct function introduced in 1.4:
import org.apache.spark.sql.functions.struct
import sqlContext.implicits._   // for the $ column syntax

df.select($"key", struct($"lat", $"long").alias("location"))
Try this:
import org.apache.spark.sql.functions._

df.registerTempTable("dt")
val dfres = sqlContext.sql("select struct(lat, lon) as colName from dt")