Structure of the Schema to be created:
|-- col1: boolean (nullable = true)
|-- col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col2_1: boolean (nullable = true)
| | |-- col2_2: string (nullable = true)
Code to create schema:
val prodSchema = StructType(Array(StructField("col1", StringType), StructField("col2",ArrayType(Array(StructField("element",StructType(Array(StructField("col2_1",StringType)))))))))
Error:
found : Array[org.apache.spark.sql.types.StructField]
required: org.apache.spark.sql.types.DataType
StructField("col2",ArrayType(Array(StructField("element",StructType(Array(StructField("col2_1",StringType)))))))
Any suggestions on how to correct this schema error?
I think you can write it like this:
val prodSchema =
  StructType(
    List(
      StructField("col1", BooleanType),
      StructField("col2", ArrayType(
        StructType(
          List(
            StructField("col2_1", BooleanType),
            StructField("col2_2", StringType)
          )
        )
      ))
    )
  )
prodSchema.printTreeString()
root
|-- col1: boolean (nullable = true)
|-- col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col2_1: boolean (nullable = true)
| | |-- col2_2: string (nullable = true)
Try this:
val schema = StructType(Seq(
  StructField("col1", BooleanType, false),
  StructField("col2", ArrayType(StructType(Seq(
    StructField("col2_1", BooleanType, true),
    StructField("col2_2", StringType, true)
  ))))
))
You could use the Schema DSL to create the schema. The $"name" syntax comes from spark.implicits._ (it yields a ColumnName, whose .boolean, .string, and .array methods produce StructFields), so a SparkSession named spark must be in scope:
import spark.implicits._

val col2 = new StructType().add($"col2_1".boolean).add($"col2_2".string)
val schema = new StructType()
  .add($"col1".boolean)
  .add($"col2".array(col2))
schema.printTreeString()
root
|-- col1: boolean (nullable = true)
|-- col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col2_1: boolean (nullable = true)
| | |-- col2_2: string (nullable = true)
Hope it helps.
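Whichever variant you choose, you can apply the schema when reading data. A minimal sketch, assuming a SparkSession named spark and a hypothetical input path:
val df = spark.read.schema(schema).json("/path/to/input.json")
df.printSchema() // should print the tree shown above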
Related
We have a dataframe that looks like:
root
|-- id: string (nullable = true)
|-- key1_suffix1: string (nullable = true)
|-- key2_suffix1: string (nullable = true)
|-- suffix1: string (nullable = true)
|-- key1_suffix2: string (nullable = true)
|-- key2_suffix2: string (nullable = true)
|-- suffix2: string (nullable = true)
How can we convert this into another dataframe like this:
root
|-- id: string (nullable = true)
|-- tags: struct (nullable = true)
| |-- suffix1: struct (nullable = true)
| | |-- key1_suffix1: string (nullable = true)
| | |-- key2_suffix1: string (nullable = true)
| | |-- suffix1: string (nullable = true)
| |-- suffix2: struct (nullable = true)
| | |-- key1_suffix2: string (nullable = true)
| | |-- key2_suffix2: string (nullable = true)
| | |-- suffix2: string (nullable = true)
The input array of suffixes is already given, for example inputSuffix = ["suffix1", "suffix2"].
This is needed in Spark Scala code, with Spark 3.1 and Scala 2.12.
You can use the struct() function to group columns into a single nested column:
// test data
import spark.implicits._
val df = Seq(
("1", "a", "b", "c", "d", "e", "f"),
("2", "aa", "bb", "cc", "dd", "ee", "ff")
).toDF("id", "key1_suffix1", "key2_suffix1", "suffix1", "key1_suffix2", "key2_suffix2", "suffix2")
// Processing
import org.apache.spark.sql.functions.struct

val res = df
  .withColumn("tags", struct(
    struct("key1_suffix1", "key2_suffix1", "suffix1").as("suffix1"),
    struct("key1_suffix2", "key2_suffix2", "suffix2").as("suffix2")
  ))
  .drop("key1_suffix1", "key2_suffix1", "suffix1", "key1_suffix2", "key2_suffix2", "suffix2")
res.printSchema()
root
|-- id: string (nullable = true)
|-- tags: struct (nullable = false)
| |-- suffix1: struct (nullable = false)
| | |-- key1_suffix1: string (nullable = true)
| | |-- key2_suffix1: string (nullable = true)
| | |-- suffix1: string (nullable = true)
| |-- suffix2: struct (nullable = false)
| | |-- key1_suffix2: string (nullable = true)
| | |-- key2_suffix2: string (nullable = true)
| | |-- suffix2: string (nullable = true)
UPDATE
This can be done dynamically using a list of suffixes. If a value in the list doesn't exist among the dataframe's columns, you can filter it out to make sure you don't get errors:
val inputSuffix = Array("suffix1", "suffix2", "suffix3")
val inputSuffixFiltred = inputSuffix.filter(c => df.columns.contains(s"key1_$c") && df.columns.contains(s"key2_$c") && df.columns.contains(c))
val tagsCol = inputSuffixFiltred.map(c => struct(s"key1_$c", s"key2_$c", c).as(c))
val colsToDelete = inputSuffixFiltred.flatMap(c => Seq(s"key1_$c", s"key2_$c", c))
val res = df.withColumn("tags", struct(tagsCol: _*)).drop(colsToDelete: _*)
res.printSchema()
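Note that struct(tagsCol: _*) and drop(colsToDelete: _*) use Scala's varargs expansion to pass the dynamically built sequences to the fixed-arity Spark APIs. With the test data above, suffix3 is filtered out because its columns are missing, so the resulting schema matches the one shown earlier.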
I want to create a predefined schema in Spark/Scala so that I can read JSON files accordingly.
The structure of the schema is as below:
root
|-- arrayCol: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- email: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- id2: string (nullable = true)
| | |-- price: string (nullable = true)
| | |-- qty: long (nullable = true)
| | |-- window: struct (nullable = true)
| | | |-- end: string (nullable = true)
| | | |-- start: string (nullable = true)
|-- primaryKeys: string (nullable = true)
|-- state: string (nullable = true)
I was able to create the schema, but I am stuck at one place where an element has two sub-elements. This is what I have tried:
import org.apache.spark.sql.types._
val testSchema = StructType(
List(
StructField("primaryKeys", StringType, true),
StructField("state", IntegerType, true),
StructField("email",ArrayType(StringType,true),true),
StructField("id",StringType,true),
StructField("name",StringType,true),
StructField("id2",StringType,true),
StructField("price",StringType,true),
StructField("qty",StringType,true),
StructField("window",ArrayType(StringType,true),true)
))
I am not able to figure out how start and end can be included inside that window element.
It is a nested struct, so it should be:
StructField("window", StructType(Seq(
  StructField("end", StringType, true),
  StructField("start", StringType, true)
)), true)
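For completeness, here is a minimal sketch of the full schema with window nested as a struct (field types follow the printed tree, where qty is a long rather than a string):
import org.apache.spark.sql.types._

val testSchema = StructType(List(
  StructField("arrayCol", ArrayType(StructType(List(
    StructField("email", ArrayType(StringType, true), true),
    StructField("id", StringType, true),
    StructField("name", StringType, true),
    StructField("id2", StringType, true),
    StructField("price", StringType, true),
    StructField("qty", LongType, true),
    StructField("window", StructType(List(
      StructField("end", StringType, true),
      StructField("start", StringType, true)
    )), true)
  )), true), true),
  StructField("primaryKeys", StringType, true),
  StructField("state", StringType, true)
))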
By the way, you can derive the schema from case classes as follows:
import org.apache.spark.sql.catalyst.ScalaReflection
case class ArrayColWindow(end:String,start:String)
case class ArrayCol(id:String,email:Seq[String], qty:Long,rqty:Long,pids:Seq[String],
sqty:Long,id1:String,id2:String,window:ArrayColWindow, otherId:String)
case class FullArrayCols(arrayCol:Seq[ArrayCol],primarykey:String,runtime:String)
val schema =ScalaReflection.schemaFor[FullArrayCols].dataType.asInstanceOf[StructType]
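Equivalently, without reaching into the internal ScalaReflection API, the public encoder API exposes the same schema:
import org.apache.spark.sql.Encoders

val schemaFromEncoder = Encoders.product[FullArrayCols].schema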
The way I figured out to make it work is as follows:
val arrayStructureSchema = new StructType()
  .add("primaryKeys", StringType, true)
  .add("runtime", StringType, true)
  .add("email", ArrayType(StringType))
  .add("id", StringType)
  .add("id1", StringType)
  .add("id2", StringType)
  .add("otherId", StringType)
  .add("qty", StringType)
  .add("rqty", StringType)
  .add("sqty", StringType)
  .add("window", new StructType()
    .add("end", StringType)
    .add("start", StringType))
Given a dynamic StructType whose name is not known in advance. The name is variable, so do not pre-assume "MAIN_COL" in the schema.
root
|-- MAIN_COL: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: string (nullable = true)
| |-- c: string (nullable = true)
| |-- d: string (nullable = true)
| |-- f: long (nullable = true)
| |-- g: long (nullable = true)
| |-- h: long (nullable = true)
| |-- j: long (nullable = true)
How can we write dynamic code to rename the fields of the StructType, using the struct's name as a prefix? Desired output:
root
|-- MAIN_COL: struct (nullable = true)
| |-- MAIN_COL_a: string (nullable = true)
| |-- MAIN_COL_b: string (nullable = true)
| |-- MAIN_COL_c: string (nullable = true)
| |-- MAIN_COL_d: string (nullable = true)
| |-- MAIN_COL_f: long (nullable = true)
| |-- MAIN_COL_g: long (nullable = true)
| |-- MAIN_COL_h: long (nullable = true)
| |-- MAIN_COL_j: long (nullable = true)
You can rename the nested fields by casting the struct column to an updated schema; the struct's name can be read from the dataframe schema instead of being hard-coded:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// pick up the struct column's name dynamically instead of assuming "MAIN_COL"
val structCol = df.schema.fields.head
val structName = structCol.name
val schema: StructType = structCol.dataType.asInstanceOf[StructType]
val updatedSchema = StructType(
  schema.fields.map(sf => StructField(s"${structName}_${sf.name}", sf.dataType))
)
val resultDF = df.withColumn(structName, col(structName).cast(updatedSchema))
Updated Schema:
root
|-- MAIN_COL: struct (nullable = false)
| |-- MAIN_COL_a: string (nullable = true)
| |-- MAIN_COL_b: string (nullable = true)
| |-- MAIN_COL_c: string (nullable = true)
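If the dataframe may contain several struct columns, the same cast can be applied to each of them in a loop. A sketch under that assumption:
val resultAll = df.schema.fields
  .filter(_.dataType.isInstanceOf[StructType])
  .foldLeft(df) { (acc, f) =>
    val st = f.dataType.asInstanceOf[StructType]
    val renamed = StructType(st.fields.map(sf => StructField(s"${f.name}_${sf.name}", sf.dataType)))
    acc.withColumn(f.name, col(f.name).cast(renamed))
  }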
I have a DF:
root
|-- str1: struct (nullable = true)
| |-- a1: string (nullable = true)
| |-- a2: string (nullable = true)
| |-- a3: string (nullable = true)
|-- str2: string (nullable = true)
|-- str3: string (nullable = true)
|-- str4: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- b1: string (nullable = true)
| | |-- b2: string (nullable = true)
| | |-- b3: boolean (nullable = true)
| | |-- b4: struct (nullable = true)
| | | |-- c1: integer (nullable = true)
| | | |-- c2: string (nullable = true)
| | | |-- c3: integer (nullable = true)
I am trying to flatten it. To do that, I have used the code below:
def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenSchema(st, colName)
      case at: ArrayType =>
        val st = at.elementType.asInstanceOf[StructType]
        flattenSchema(st, colName)
      case _ => Array(new Column(colName).as(colName))
    }
  })
}
val d1 = df.select(flattenSchema(df.schema):_*)
It's giving me the output below:
|-- str1.a1: string (nullable = true)
|-- str1.a2: string (nullable = true)
|-- str1.a3: string (nullable = true)
|-- str2: string (nullable = true)
|-- str3: string (nullable = true)
|-- str4.b1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- str4.b2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- str4.b3: array (nullable = true)
| |-- element: string (containsNull = true)
|-- str4.b4.c1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- str4.b4.c2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- str4.b4.c3: array (nullable = true)
| |-- element: integer (containsNull = true)
The problem arises when I try to query it:
d1.select("str2").show // no issue here
but when I query any flattened nested column:
d1.select("str1.a1")
Error:
org.apache.spark.sql.AnalysisException: cannot resolve '`str1.a1`' given input columns: ....
What am I doing wrong here? Or is there any other way to achieve the desired result?
Spark interprets a dot (.) in a column name as access to a child field of a struct column. After flattening, d1 has a top-level column literally named str1.a1, so the dotted reference cannot be resolved. You can escape such names with backticks, or use a different separator when flattening. Querying str1.a1 on the original df works, because there str1 really is a struct.
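A minimal illustration of both workarounds (the second assumes you change the separator in flattenSchema from "." to "_"):
// 1) escape the literal dotted name with backticks
d1.select("`str1.a1`").show()

// 2) flatten with "_" instead of ".":
//    val colName = if (prefix == null) f.name else (prefix + "_" + f.name)
d1.select("str1_a1").show()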
I have a dataframe which looks like this:
root
|-- A1: string (nullable = true)
|-- A2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- A3: string (nullable = true)
|-- A4: array (nullable = true)
| |-- element: string (containsNull = true)
I have a schema which looks like this:
StructType(StructField(A1,ArrayType(StringType,true),true), StructField(A2,StringType,true), StructField(A3,IntegerType,true), StructField(A4,ArrayType(StringType,true),true))
I want to convert this dataframe to the schema defined above.
Can someone help me with how I can do this?
Note: the schema and dataframe are loaded at runtime; they are not fixed.
You can use org.apache.spark.sql.expressions.UserDefinedFunction to transform a string to an array and an array to a string, like this:
import org.apache.spark.sql.functions.{col, udf}

val string_to_array_udf = udf((s: String) => Array(s))
val array_to_string_udf = udf((a: Seq[String]) => a.head)
val string_to_int_udf = udf((s: String) => s.toInt)

val newDf = df
  .withColumn("a12", string_to_array_udf(col("a1"))).drop("a1").withColumnRenamed("a12", "a1")
  .withColumn("a32", string_to_int_udf(col("a3"))).drop("a3").withColumnRenamed("a32", "a3")
  .withColumn("a22", array_to_string_udf(col("a2"))).drop("a2").withColumnRenamed("a22", "a2")
newDf.printSchema
root
|-- a4: array (nullable = true)
| |-- element: string (containsNull = true)
|-- a1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- a3: integer (nullable = true)
|-- a2: string (nullable = true)
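As a side note, the same conversions can be done with built-in functions instead of UDFs, which avoids UDF serialization overhead. A sketch, not part of the original answer:
import org.apache.spark.sql.functions.{array, col}

val newDf2 = df
  .withColumn("a1", array(col("a1")))      // string -> array<string>
  .withColumn("a2", col("a2").getItem(0))  // array<string> -> string (first element)
  .withColumn("a3", col("a3").cast("int")) // string -> int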