How to add missing fields to a dataframe based on a pre-defined schema? - scala

When working with a Spark Streaming application, if all records in a batch from Kafka are missing a common field, then the dataframe schema changes from batch to batch. I need a fixed dataframe schema for further processing and transformation operations.
I have a pre-defined schema like this:
root
|-- PokeId: string (nullable = true)
|-- PokemonName: string (nullable = true)
|-- PokemonWeight: integer (nullable = false)
|-- PokemonType: string (nullable = true)
|-- PokemonEndurance: float (nullable = false)
|-- Attacks: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- AttackName: string (nullable = true)
| | |-- AttackImpact: long (nullable = true)
But for some streaming sessions I don't get all the columns, and the input schema looks like this:
root
|-- PokeId: string (nullable = true)
|-- PokemonName: string (nullable = true)
|-- PokemonWeight: integer (nullable = false)
|-- PokemonEndurance: float (nullable = false)
I am defining my schema like this:
val schema = StructType(Array(
  StructField("PokeId", StringType, true),
  StructField("PokemonName", StringType, true),
  StructField("PokemonWeight", IntegerType, false),
  StructField("PokemonType", StringType, true),
  StructField("PokemonEndurance", FloatType, false),
  StructField("Attacks", ArrayType(StructType(Array(
    StructField("AttackName", StringType),
    StructField("AttackImpact", LongType)
  ))))
))
Now, how do I add the missing columns (with null values) to the input dataframe based on this schema?
I have tried spark-daria for dataframe validation, but it only reports the missing columns inside a descriptive error message. How can I get the list of missing columns from it?
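For reference, one common pattern (not from the original thread; a minimal sketch using a hypothetical conformToSchema helper) is to fold over the pre-defined schema, add every missing top-level field as a null literal cast to its expected type, and then reorder the columns to match the schema:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: add any top-level field from `schema` that `df` lacks,
// filled with nulls of the right type, then reorder to match the schema.
def conformToSchema(df: DataFrame, schema: StructType): DataFrame = {
  val withMissing = schema.fields
    .filterNot(f => df.columns.contains(f.name))
    .foldLeft(df) { (acc, field) =>
      acc.withColumn(field.name, lit(null).cast(field.dataType))
    }
  withMissing.select(schema.fieldNames.map(col): _*)
}

Applied to a streamed batch dataframe (call it inputDf, a name used only for illustration), conformToSchema(inputDf, schema) would add PokemonType and Attacks as null columns, so every micro-batch ends up with the same six top-level fields.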

Related

Reading ORC file in Spark with schema returns null values

I am trying to read ORC files from a Spark job. I have defined the schema below based on the output of df.printSchema():
root
|-- application: struct (nullable = true)
| |-- appserver: string (nullable = true)
| |-- buildversion: string (nullable = true)
| |-- frameworkversion: string (nullable = true)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- schemaversion: string (nullable = true)
| |-- sdkversion: string (nullable = true)
| |-- servertimestamp: string (nullable = true)
| |-- trackingversion: string (nullable = true)
| |-- version: string (nullable = true)
Schema defined:
StructField("application", new StructType()
  .add(StructField("appserver", StringType, false))
  .add(StructField("buildversion", StringType, true))
  .add(StructField("frameworkversion", StringType, true))
  .add(StructField("id", StringType, true))
  .add(StructField("name", StringType, true))
  .add(StructField("schemaversion", StringType, true))
  .add(StructField("sdkversion", StringType, true))
  .add(StructField("servertimestamp", StringType, true))
  .add(StructField("trackingversion", StringType, true))
  .add(StructField("version", StringType, true)),
  false)
)
The dataframe output returns null for the application value.
val data = sparkSession
  .read
  .schema(schema)
  .orc("Data/Input/")
  .select($"*")

data.show(100, false)
When the same data is read without a schema defined, it returns valid values. There are a few other fields in the file that I am not defining in the schema because I am not interested in them. Could this be causing the issue? Can somebody help me understand the problem here?
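This question is left open in the thread. As a workaround sketch only (an assumption, not a confirmed fix, and it does not explain the root cause), you could let Spark infer the full ORC schema and project just the nested fields you care about:

import sparkSession.implicits._

// Workaround sketch (assumption): skip the hand-written schema, let Spark
// infer it from the ORC files, and select only the nested fields needed.
val data = sparkSession
  .read
  .orc("Data/Input/")
  .select($"application.appserver", $"application.id", $"application.version")

data.show(100, false)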

How to select columns in PySpark which do not contain strings

I had the problem of how to remove the string columns in PySpark, keeping only the numerical ones and timestamps.
This is how I did it.
I had this:
full_log.printSchema()
root
|-- ProgramClassID: integer (nullable = true)
|-- CategoryID: integer (nullable = true)
|-- LogServiceID: integer (nullable = true)
|-- LogDate: timestamp (nullable = true)
|-- AudienceTargetAgeID: integer (nullable = true)
|-- AudienceTargetEthnicID: integer (nullable = true)
|-- ClosedCaptionID: integer (nullable = true)
|-- CountryOfOriginID: integer (nullable = true)
|-- DubDramaCreditID: integer (nullable = true)
|-- EthnicProgramID: integer (nullable = true)
|-- ProductionSourceID: integer (nullable = true)
|-- FilmClassificationID: integer (nullable = true)
|-- ExhibitionID: integer (nullable = true)
|-- Duration: string (nullable = true)
|-- EndTime: string (nullable = true)
|-- LogEntryDate: timestamp (nullable = true)
|-- ProductionNO: string (nullable = true)
|-- ProgramTitle: string (nullable = true)
|-- StartTime: string (nullable = true)
This gets the list of column names to keep (the non-string ones):
no_string_columns = [types[0] for types in full_log.dtypes if types[1] != 'string']
Perform the final selection:
full_log_no_strings = full_log.select([*no_string_columns])
You can also use the schema object of the dataframe:
from pyspark.sql.types import *
string_columns = [column.name for column in full_log.schema if column.dataType != StringType()]
You can use the logic below for this use case.
Step 1: Find the columns whose datatype is string.
Step 2: Remove those columns from the dataframe.
Step 3: Apply your logic.
Code snippet:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data2 = [("James", "", "Smith", "36636", "M", 3000),
    ("Michael", "Rose", "", "40288", "M", 4000),
    ("Robert", "", "Williams", "42114", "M", 4000),
    ("Maria", "Anne", "Jones", "39192", "F", 4000),
    ("Jen", "Mary", "Brown", "", "F", -1)
]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()
df.show(truncate=False)

columnList = [item[0] for item in df.dtypes if item[1].startswith('string')]
print(columnList)

df1 = df.drop(*columnList)
display(df1)

Change column names of nested data in bigquery using spark

I'm trying to write some data into BigQuery using Spark Scala. My Spark df looks like this:
root
|-- id: string (nullable = true)
|-- cost: double (nullable = false)
|-- nodes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- settled: string (nullable = true)
| | |-- constant: string (nullable = true)
|-- status: string (nullable = true)
I tried to change the struct of the data frame.
val schema = StructType(Array(
  StructField("id", StringType, true),
  StructField("cost", DoubleType, true),
  StructField("nodes", StructType(Array(
    StructField("settled", StringType),
    StructField("constant", StringType)))),
  StructField("status", StringType, true)))
val actualDf = spark.createDataFrame(results, schema)
But it didn't work. When this is written into BigQuery, the column names look as follows:
id, cost, nodes.list.element.settled, nodes.list.element.constant, status
Is there a way to change these column names to:
id, cost, settled, constant, status
You can explode the nodes array to get a flattened structure of columns, then write the dataframe to BigQuery.
Example:
val jsn_ds = Seq("""{"id":1, "cost": "2.0","nodes":[{"settled":"u","constant":"p"}],"status":"s"}""").toDS

spark.read.json(jsn_ds).printSchema
// root
//  |-- cost: string (nullable = true)
//  |-- id: long (nullable = true)
//  |-- nodes: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- constant: string (nullable = true)
//  |    |    |-- settled: string (nullable = true)
//  |-- status: string (nullable = true)

spark.read.json(jsn_ds)
  .withColumn("expld", explode('nodes))
  .select("*", "expld.*")
  .drop("expld", "nodes")
  .show()
//+----+---+------+--------+-------+
//|cost| id|status|constant|settled|
//+----+---+------+--------+-------+
//| 2.0| 1| s| p| u|
//+----+---+------+--------+-------+

apache spark - Inserting a dataframe as nested struct into other dataframe

I have two dataframes created in Spark:
xml_df:
root
|-- _defaultedgetype: string (nullable = true)
|-- _mode: string (nullable = true)
and nodes_df:
root
|-- nodes: struct (nullable = false)
| |-- _id: string (nullable = true)
| |-- _label: string (nullable = true)
The xml_df will always have just one row, as follows:
+----------------+------+
|_defaultedgetype| _mode|
+----------------+------+
| undirected|static|
+----------------+------+
and the nodes_df data:
+-----+
|nodes|
+-----+
|[1,1]|
|[2,2]|
|[3,3]|
|[4,4]|
|[5,5]|
+-----+
I used the struct function in nodes_df to put _id and _label inside the struct. Based on that, I would like to add a third column to the xml_df dataframe that contains the struct created in the nodes_df dataframe.
I tried to use a join, creating a literal id for each entry in nodes_df, but the resulting column was null.
Any light please?
I found out why my join was not working.
I needed to use aggregation on the nodes column; then I was able to properly join the dataframes.
I created an id for the xml_df:
StructType(List(
  StructField("id", IntegerType, true),
  StructField("_defaultedgetype", StringType, true),
  StructField("_mode", StringType, true)))
and the same for the nodes_df:
val nodes_schema = StructType(List(
  StructField("id", IntegerType, true),
  StructField("_id", StringType, true),
  StructField("_label", StringType, true)
))
I used the id 666 for both of them and used aggregation on the nodes_df:
nodes_df = nodes_df.groupBy("id").agg(collect_set("nodes").as("node"))
and joined with xml_df:
xml_df = xml_df.join(nodes_df, Seq("id"),"right").drop("id")
The result is:
+----------------+------+--------------------+
|_defaultedgetype| _mode| node|
+----------------+------+--------------------+
| undirected|static|[[2,2], [3,3], [5...|
+----------------+------+--------------------+
root
|-- _defaultedgetype: string (nullable = true)
|-- _mode: string (nullable = true)
|-- node: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _id: string (nullable = true)
| | |-- _label: string (nullable = true)
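Putting the pieces of that answer together as a minimal sketch (the literal-id step is assumed, since only its value, 666, is mentioned above):

import org.apache.spark.sql.functions.{collect_set, lit}

// Give both dataframes the same constant id so they can be joined (assumed step).
val xmlWithId = xml_df.withColumn("id", lit(666))
val nodesWithId = nodes_df.withColumn("id", lit(666))

// Aggregate the struct column into an array, then join and drop the helper id.
val nodesAgg = nodesWithId.groupBy("id").agg(collect_set("nodes").as("node"))
val result = xmlWithId.join(nodesAgg, Seq("id"), "right").drop("id")

result.show()
result.printSchema()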

Scala Spark setting schema duplicates columns

I have an issue when specifying the schema of my dataframe. Without setting the schema, printSchema() produces:
root
|-- Store: string (nullable = true)
|-- Date: string (nullable = true)
|-- IsHoliday: string (nullable = true)
|-- Dept: string (nullable = true)
|-- Weekly_Sales: string (nullable = true)
|-- Temperature: string (nullable = true)
|-- Fuel_Price: string (nullable = true)
|-- MarkDown1: string (nullable = true)
|-- MarkDown2: string (nullable = true)
|-- MarkDown3: string (nullable = true)
|-- MarkDown4: string (nullable = true)
|-- MarkDown5: string (nullable = true)
|-- CPI: string (nullable = true)
|-- Unemployment: string (nullable = true)
However, when I specify the schema with .schema(schema):
val dfr = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(schema)
My printSchema() produces:
root
|-- Store: integer (nullable = true)
|-- Date: date (nullable = true)
|-- IsHoliday: boolean (nullable = true)
|-- Dept: integer (nullable = true)
|-- Weekly_Sales: integer (nullable = true)
|-- Temperature: double (nullable = true)
|-- Fuel_Price: double (nullable = true)
|-- MarkDown1: double (nullable = true)
|-- MarkDown2: double (nullable = true)
|-- MarkDown3: double (nullable = true)
|-- MarkDown4: double (nullable = true)
|-- MarkDown5: double (nullable = true)
|-- CPI: double (nullable = true)
|-- Unemployment: double (nullable = true)
|-- Dept: integer (nullable = true)
|-- Weekly_Sales: integer (nullable = true)
|-- Temperature: double (nullable = true)
|-- Fuel_Price: double (nullable = true)
|-- MarkDown1: double (nullable = true)
|-- MarkDown2: double (nullable = true)
|-- MarkDown3: double (nullable = true)
|-- MarkDown4: double (nullable = true)
|-- MarkDown5: double (nullable = true)
|-- CPI: double (nullable = true)
|-- Unemployment: double (nullable = true)
The dataframe itself has all these duplicate columns, and I'm not sure why.
My code:
// Make custom schema
val schema = StructType(Array(
  StructField("Store", IntegerType, true),
  StructField("Date", DateType, true),
  StructField("IsHoliday", BooleanType, true),
  StructField("Dept", IntegerType, true),
  StructField("Weekly_Sales", IntegerType, true),
  StructField("Temperature", DoubleType, true),
  StructField("Fuel_Price", DoubleType, true),
  StructField("MarkDown1", DoubleType, true),
  StructField("MarkDown2", DoubleType, true),
  StructField("MarkDown3", DoubleType, true),
  StructField("MarkDown4", DoubleType, true),
  StructField("MarkDown5", DoubleType, true),
  StructField("CPI", DoubleType, true),
  StructField("Unemployment", DoubleType, true)))

val dfr = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(schema)
val train_df = dfr.load("/FileStore/tables/train.csv")
val features_df = dfr.load("/FileStore/tables/features.csv")

// Combine the train and features
val data = train_df.join(features_df, Seq("Store", "Date", "IsHoliday"), "left")
data.show(5)
data.printSchema()
It's working as expected. Your train_df and features_df each have the same columns as the schema (14 columns) after your load().
In your join condition, Seq("Store", "Date", "IsHoliday") takes these 3 columns from both DFs (3 + 3 = 6 columns), joins on them, and keeps one set of column names (3 columns). The rest of the columns come from both train_df (the remaining 11 columns) and features_df (the remaining 11 columns).
Hence your printSchema shows 25 columns (3 + 11 + 11).
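One way to avoid the duplicated columns, sketched under assumptions (not part of the original answer; it assumes train.csv really carries Dept and Weekly_Sales while features.csv carries the temperature, fuel, markdown, CPI and unemployment columns), is to project each side down to the columns it actually contributes before joining:

// Sketch: select only each file's own columns so the join yields
// 14 distinct columns instead of 25.
// (The column split between train.csv and features.csv is an assumption.)
val trainOnly = train_df.select("Store", "Date", "IsHoliday", "Dept", "Weekly_Sales")
val featuresOnly = features_df.select(
  "Store", "Date", "IsHoliday",
  "Temperature", "Fuel_Price",
  "MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5",
  "CPI", "Unemployment")

val data = trainOnly.join(featuresOnly, Seq("Store", "Date", "IsHoliday"), "left")
data.printSchema()  // no duplicated column names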