Ambiguous schema in Spark Scala - scala

Schema:
|-- c0: string (nullable = true)
|-- c1: struct (nullable = true)
| |-- c2: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- orangeID: string (nullable = true)
| | | |-- orangeId: string (nullable = true)
I am trying to flatten the schema above in Spark.
Code:
var df = data.select($"c0",$"c1.*").select($"c0",explode($"c2")).select($"c0",$"col.orangeID", $"col.orangeId")
The flattening code works fine. The problem is in the last part, where the two column names differ only in the case of one letter (orangeID and orangeId). Hence I get this error:
Error:
org.apache.spark.sql.AnalysisException: Ambiguous reference to fields StructField(orangeID,StringType,true), StructField(orangeId,StringType,true);
Any suggestions to avoid this ambiguity would be great.

Turn on the Spark SQL case-sensitivity configuration and try again:
spark.sql("set spark.sql.caseSensitive=true")
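To see why the setting matters, here is a minimal plain-Python model of field-name resolution. It is only a sketch of the behaviour, not Spark's actual resolver, and `resolve_field` is a hypothetical helper name. With case-insensitive matching (Spark's default), both struct fields match the requested name, which is the ambiguity reported above. With case-sensitive matching, exactly one matches.

```python
def resolve_field(field_names, wanted, case_sensitive=False):
    """Return the single field name matching `wanted`, or raise if ambiguous."""
    if case_sensitive:
        matches = [n for n in field_names if n == wanted]
    else:
        # Case-insensitive comparison: orangeID and orangeId look identical.
        matches = [n for n in field_names if n.lower() == wanted.lower()]
    if len(matches) > 1:
        raise ValueError(f"Ambiguous reference to fields {matches}")
    if not matches:
        raise KeyError(wanted)
    return matches[0]


fields = ["orangeID", "orangeId"]

try:
    resolve_field(fields, "orangeId")
except ValueError as err:
    print(err)  # Ambiguous reference to fields ['orangeID', 'orangeId']

print(resolve_field(fields, "orangeId", case_sensitive=True))  # orangeId
```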

PySpark coalesce struct fields inside a struct

I have a struct coming from a data source where the struct fields have multiple possible data types like the following:
|-- priority: struct (nullable = true)
| |-- priority_a: struct (nullable = true)
| | |-- union: boolean (nullable = true)
| | |-- int32: integer (nullable = true)
| | |-- double: double (nullable = true)
| |-- priority_b: integer (nullable = true)
| |-- priority_c: struct (nullable = true)
| | |-- union: boolean (nullable = true)
| | |-- double: double (nullable = true)
| | |-- int32: integer (nullable = true)
| |-- priority_d: struct (nullable = true)
| | |-- union: boolean (nullable = true)
| | |-- double: double (nullable = true)
| | |-- int32: integer (nullable = true)
| |-- priority_e: double (nullable = true)
I want to coalesce the struct fields and cast them to a data type which makes the most sense, for instance:
|-- priority: struct (nullable = true)
| |-- priority_a: integer (nullable = true)
| |-- priority_b: integer (nullable = true)
| |-- priority_c: double (nullable = true)
| |-- priority_d: double (nullable = true)
| |-- priority_e: double (nullable = true)
If a column is not a struct field inside a struct, the following code works perfectly for what I need:
try:
    cols = [f'{c}.{col}' for col in source.select(f'{c}.*').columns]
    if f'{struct_path}.union' in cols:
        cols.remove(f'{struct_path}.union')
    source = source.withColumn(pc, f.coalesce(*cols).cast(t))  # t is the type I want to cast to
except:
    source = source.withColumn(c, f.col(c).cast(t))
I would like to do the same recursively for a struct whose nested struct fields can have multiple data types. Is it possible to do so?
A StructType's fields are accessible through its fields property, so you can loop over the schema and check whether each field is itself a StructType:
from pyspark.sql import types as T
for field in schema.fields:
    if isinstance(field.dataType, T.StructType):
        print(field.dataType.fields)
Or, if you want to read it recursively:
def flatten(schema, prefix=None):
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, T.ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, T.StructType):
            print(dtype)
            fields += flatten(dtype, prefix=name)
        else:
            fields.append((dtype, name))
    return fields
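As a rough sketch of how such a walk could feed the coalesce step from the question, the snippet below works on the plain dict form of a schema (the shape that `df.schema.jsonValue()` returns, trimmed here to the `name` and `type` keys). For each child of the `priority` struct it lists the dotted paths that would be passed to `coalesce`, skipping `union`. The names `coalesce_plan` and `struct` are hypothetical helpers, and only one level of nesting is handled, as in the schema shown in the question.

```python
def struct(*fields):
    """Build a trimmed schema dict from (name, type) pairs."""
    return {"type": "struct", "fields": [{"name": n, "type": t} for n, t in fields]}


def coalesce_plan(schema, prefix):
    """Map each child column to the dotted paths that would be coalesced."""
    plan = {}
    for field in schema["fields"]:
        path = f"{prefix}.{field['name']}"
        ftype = field["type"]
        if isinstance(ftype, dict) and ftype.get("type") == "struct":
            # Nested struct: coalesce every sub-field except the `union` flag.
            plan[path] = [
                f"{path}.{sub['name']}"
                for sub in ftype["fields"]
                if sub["name"] != "union"
            ]
        else:
            # Plain column: nothing to coalesce, only a cast is needed.
            plan[path] = [path]
    return plan


priority = struct(
    ("priority_a", struct(("union", "boolean"), ("int32", "integer"), ("double", "double"))),
    ("priority_b", "integer"),
)

for column, sources in coalesce_plan(priority, "priority").items():
    print(column, "<-", sources)
```

Each entry of the plan could then be turned into something like `f.coalesce(*sources).cast(t)` and the results reassembled with `f.struct(...)`; that part is not shown because it needs a running Spark session.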

Adding new column for DataFrame with complex column (Array<Map<String,String>>)

I am loading a Dataframe from an external source with the following schema:
|-- A: string (nullable = true)
|-- B: timestamp (nullable = true)
|-- C: long (nullable = true)
|-- METADATA: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- M_1: integer (nullable = true)
| | |-- M_2: string (nullable = true)
| | |-- M_3: string (nullable = true)
| | |-- M_4: string (nullable = true)
| | |-- M_5: double (nullable = true)
| | |-- M_6: string (nullable = true)
| | |-- M_7: double (nullable = true)
| | |-- M_8: boolean (nullable = true)
| | |-- M_9: boolean (nullable = true)
|-- E: string (nullable = true)
Now, I need to add a new column, METADATA_PARSED, with column type Array and the following case class:
case class META_DATA_COL(M_1: String, M_2: String, M_3: String, M_10: String)
My approach here, based on examples is to create a UDF and pass in the METADATA column. But since it is of a complex type I am having a lot of trouble parsing it.
On top of that in the UDF, for the "new" variable M_10, I need to do some string manipulation on the method as well. So I need to access each of the elements in the metadata column.
What would be the best way to approach this issue? I attempted to convert the source dataframe (+METADATA) to a case class; but that did not work as it was translated back to spark WrappedArray types upon entering the UDF.
You can use something like this:
import org.apache.spark.sql.functions._
val tempdf = df.select(
explode( col("METADATA")).as("flat")
)
val processedDf = tempdf.select( col("flat.M_1"),col("flat.M_2"),col("flat.M_3"))
Now write a UDF:
def processudf = udf((col1:Int,col2:String,col3:String) => /* do the processing*/)
This should help. I can provide more help if you give more details on the processing.

How can I perform ETL on a Spark Row and return it to a dataframe?

I'm currently using Scala Spark for some ETL and have a base dataframe with the following schema:
|-- round: string (nullable = true)
|-- Id : string (nullable = true)
|-- questions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- bonusQuestions: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- difficulty : string (nullable = true)
| | |-- answerOptions: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- followUpAnswers: array (nullable = true)
| | | |-- element: string (containsNull = true)
|-- school: string (nullable = true)
I only need to perform ETL on rows where the round type is primary (there are 2 types, primary and secondary). However, I need both types of rows in my final table.
I'm stuck on the ETL, which should follow this rule:
If tag is non-bonus, bonusQuestions should be set to null and difficulty should be set to null.
I'm currently able to access most fields of the DF like
val round = tr.getAs[String]("round")
Next, I'm able to get the questions array using
val questionsArray = tr.getAs[Seq[StructType]]("questions")
and can iterate using for (question <- questionsArray) {...}. However, I cannot access struct fields like question.bonusQuestions or question.tag, which returns an error:
error: value tag is not a member of org.apache.spark.sql.types.StructType
Spark represents struct values as GenericRowWithSchema, or more generally as Row (StructType only describes the schema). So instead of Seq[StructType] you have to use Seq[Row]:
val questionsArray = tr.getAs[Seq[Row]]("questions")
In the loop for (question <- questionsArray) {...} you can then read the fields of each Row:
for (question <- questionsArray) {
val tag = question.getAs[String]("tag")
val bonusQuestions = question.getAs[Seq[String]]("bonusQuestions")
val difficulty = question.getAs[String]("difficulty")
val answerOptions = question.getAs[Seq[String]]("answerOptions")
val followUpAnswers = question.getAs[Seq[String]]("followUpAnswers")
}
I hope the answer is helpful
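For the transformation rule itself, here is a minimal plain-Python sketch with rows modelled as dicts rather than Spark Rows. It assumes "non-bonus" means any tag other than the literal string "bonus", which the question does not spell out, and `clean_row` is a hypothetical name. Secondary rows pass through unchanged, so both types end up in the final result.

```python
import copy


def clean_row(row):
    """Null out bonus-only fields on non-bonus questions of primary rows."""
    if row["round"] != "primary":
        return row  # secondary rows are kept as-is
    cleaned = copy.deepcopy(row)  # do not mutate the input row
    for question in cleaned["questions"]:
        if question["tag"] != "bonus":  # assumption: anything else is non-bonus
            question["bonusQuestions"] = None
            question["difficulty"] = None
    return cleaned


rows = [
    {"round": "primary", "questions": [
        {"tag": "regular", "bonusQuestions": ["q1"], "difficulty": "hard"},
        {"tag": "bonus", "bonusQuestions": ["q2"], "difficulty": "easy"},
    ]},
    {"round": "secondary", "questions": [
        {"tag": "regular", "bonusQuestions": ["q3"], "difficulty": "hard"},
    ]},
]

result = [clean_row(r) for r in rows]
print(result[0]["questions"][0])  # {'tag': 'regular', 'bonusQuestions': None, 'difficulty': None}
```

In Spark the same per-row function would typically be applied with a map over the Dataset or inside a UDF on the questions column; the sample rows above are made up for illustration.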

Spark: pruning nested columns/fields

I have a question about the possibility to prune nested fields.
I'm developing a data source for a High Energy Physics data format (ROOT).
Below is the schema for a file read with the DataSource that I'm developing.
root
|-- EventAuxiliary: struct (nullable = true)
| |-- processHistoryID_: struct (nullable = true)
| | |-- hash_: string (nullable = true)
| |-- id_: struct (nullable = true)
| | |-- run_: integer (nullable = true)
| | |-- luminosityBlock_: integer (nullable = true)
| | |-- event_: long (nullable = true)
| |-- processGUID_: string (nullable = true)
| |-- time_: struct (nullable = true)
| | |-- timeLow_: integer (nullable = true)
| | |-- timeHigh_: integer (nullable = true)
| |-- luminosityBlock_: integer (nullable = true)
| |-- isRealData_: boolean (nullable = true)
| |-- experimentType_: integer (nullable = true)
| |-- bunchCrossing_: integer (nullable = true)
| |-- orbitNumber_: integer (nullable = true)
| |-- storeNumber_: integer (nullable = true)
The DataSource is here https://github.com/diana-hep/spark-root/blob/master/src/main/scala/org/dianahep/sparkroot/experimental/package.scala#L62
When building a reader using the buildReader method of the FileFormat:
override def buildReaderWithPartitionValues(
sparkSession: SparkSession,
dataSchema: StructType,
partitionSchema: StructType,
requiredSchema: StructType,
filters: Seq[Filter],
options: Map[String, String],
hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
I see that requiredSchema always contains all of the fields/members of the top-level column being read. That is, when I select a particular nested field with
df.select("EventAuxiliary.id_.run_"), requiredSchema is again the full struct for that top-level column ("EventAuxiliary"). I would expect the schema to be something like this:
root
|-- EventAuxiliary: struct...
| |-- id_: struct ...
| | |-- run_: integer
since this is the only schema that has been required by the select statement.
Basically, I want to know how on the data source level I can prune nested fields. I thought that requiredSchema will be only the fields that are coming from the df.select.
I'm trying to see what avro/parquet are doing and found this: https://github.com/apache/spark/pull/14957/files
If there are suggestions/comments - would be appreciated!
Thanks!
VK
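To make the expected result concrete, here is a small plain-Python sketch that prunes a schema, held as the nested dict form that `schema.jsonValue()` produces (trimmed to `name` and `type`), down to a single dotted path. It only illustrates the pruning being asked for; it says nothing about how Spark decides what to pass as requiredSchema, and `prune` and `struct` are hypothetical helper names.

```python
def struct(*fields):
    """Build a trimmed schema dict from (name, type) pairs."""
    return {"type": "struct", "fields": [{"name": n, "type": t} for n, t in fields]}


def prune(schema, path):
    """Keep only the fields of `schema` that lie on the dotted `path`."""
    head, _, rest = path.partition(".")
    for field in schema["fields"]:
        if field["name"] == head:
            ftype = field["type"]
            if rest:
                # Descend into the nested struct for the remaining path.
                ftype = prune(ftype, rest)
            return struct((head, ftype))
    raise KeyError(head)


full = struct(
    ("EventAuxiliary", struct(
        ("processHistoryID_", struct(("hash_", "string"))),
        ("id_", struct(("run_", "integer"), ("luminosityBlock_", "integer"), ("event_", "long"))),
        ("processGUID_", "string"),
    )),
)

pruned = prune(full, "EventAuxiliary.id_.run_")
print(pruned)
```

The printed dict contains only EventAuxiliary, then id_, then run_, which matches the three-line schema sketched in the question.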

Spark/Scala: join dataframes when id is nested in an array of structs

I'm using Spark's MLlib DataFrame ALS functionality on Spark 2.2.0. I had to run my userId and itemId fields through a StringIndexer to get things going.
The method 'recommendForAllUsers' returns the following schema:
root
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: long (nullable = true)
| | |-- rating: double (nullable = true)
|-- userIdIndex: string (nullable = true)
This is perfect for my needs (I would love not to flatten it), but I need to replace userIdIndex and itemIdIndex with their actual values.
Doing this for userIdIndex was OK (I couldn't simply reverse it with IndexToString, as the ALS fitting seems to erase the link between index and value):
df.join(df2, df2("userIdIndex")===df("userIdIndex"), "left")
.select(df2("userId"), df("recommendations"))
where df2 looks like this:
+------------------+--------------------+----------+-----------+-----------+
| userId| itemId| rating|userIdIndex|itemIdIndex|
+------------------+--------------------+----------+-----------+-----------+
|glorified-consumer| item-22302| 3.0| 15.0| 4.0|
the result is this schema:
root
|-- userId: string (nullable = true)
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: integer (nullable = true)
| | |-- rating: float (nullable = true)
QUESTION: how do I do the same for itemIdIndex, which is inside an array of structs?
You can explode the array so that only the struct remains:
val tempdf2 = df2.withColumn("recommendations", explode('recommendations))
which should leave you with this schema:
root
|-- userId: string (nullable = true)
|-- recommendations: struct (nullable = true)
| |-- itemIdIndex: string (nullable = true)
| |-- rating: string (nullable = true)
Do the same for df (the first dataframe).
After that you can join them:
tempdf1.join(tempdf2, tempdf1("recommendations.itemIdIndex") === tempdf2("recommendations.itemIdIndex"))
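Since the question asks to keep the nested shape, here is a plain-Python sketch of the end result being aimed for: every itemIdIndex inside the array is swapped for its itemId through a lookup built from the index table, and the array of structs stays intact. Rows are modelled as dicts, `restore_item_ids` is a hypothetical name, and the sample values are illustrative.

```python
def restore_item_ids(recommendations, index_to_item):
    """Replace itemIdIndex with itemId in each recommendation struct."""
    return [
        {"itemId": index_to_item[rec["itemIdIndex"]], "rating": rec["rating"]}
        for rec in recommendations
    ]


# Lookup built from the (itemIdIndex, itemId) pairs of the index table.
index_to_item = {4: "item-22302"}

row = {
    "userId": "glorified-consumer",
    "recommendations": [{"itemIdIndex": 4, "rating": 3.0}],
}
row["recommendations"] = restore_item_ids(row["recommendations"], index_to_item)
print(row)
```

In Spark the equivalent is usually the explode-and-join shown above followed by a groupBy on userId with collect_list to rebuild the array.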