I have a DataFrame with the following schema:
|-- user_loans_arr: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- loan_date: string (nullable = true)
|    |    |-- loan_amount: string (nullable = true)
So I want to create a UDF that takes user_loans_arr as input and returns a string after some processing.
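One way to do this is a plain Python UDF. Below is a minimal sketch, assuming the DataFrame is called df and that the "processing" is just a simple summary of the loans (replace the body with your actual logic):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Placeholder summary logic: the real "processing" goes here.
@udf(returnType=StringType())
def summarize_loans(user_loans_arr):
    if user_loans_arr is None:
        return None
    # Each element of the array arrives as a Row, so struct fields can be read by name.
    total = sum(float(loan.loan_amount) for loan in user_loans_arr if loan.loan_amount)
    return "{} loans, total {}".format(len(user_loans_arr), total)

df = df.withColumn("loan_summary", summarize_loans(col("user_loans_arr")))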
I have a dataframe as below
root
|-- tasin: string (nullable = true)
|-- advertiser_id: decimal(38,10) (nullable = true)
|-- predicted_sp_sold_units: decimal(38,10) (nullable = true)
|-- predicted_sp_impressions: decimal(38,10) (nullable = true)
|-- predicted_sp_clicks: decimal(38,10) (nullable = true)
|-- predicted_sdc_sold_units: decimal(38,10) (nullable = true)
|-- predicted_sdc_impressions: decimal(38,10) (nullable = true)
|-- predicted_sdc_clicks: decimal(38,10) (nullable = true)
|-- predicted_sda_sold_units: decimal(38,10) (nullable = true)
|-- predicted_sda_impressions: decimal(38,10) (nullable = true)
|-- predicted_sda_clicks: decimal(38,10) (nullable = true)
|-- region_id: integer (nullable = true)
|-- marketplace_id: integer (nullable = true)
|-- dataset_date: date (nullable = true)
Now I am using the select statement below. I am checking for the presence of a column name: if it is present, I select its value, otherwise I fill with null. The DataFrame is stored in the df variable.
scores_df1 = df.select(
    col('marketplace_id'),
    col('region_id'),
    col('tasin'),
    col('advertiser_id'),
    col('predicted_sp_sold_units'),
    col('predicted_sp_impressions'),
    col('predicted_sp_clicks'),
    col('predicted_sdc_sold_units'),
    col('predicted_sdc_impressions'),
    col('predicted_sdc_clicks'),
    col('predicted_sda_sold_units'),
    col('predicted_sda_impressions'),
    col('predicted_sda_clicks'),
    when('sdcr_score' in df.columns is True, col('sdcr_score')).otherwise(lit(None)).alias('sdcr_score'),
    when('sdar_score' in df.columns is True, col('sdar_score')).otherwise(lit(None)).alias('sdar_score')
)
I am receiving the error <class 'TypeError'>: condition should be a Column.
Please advise what is wrong.
The expression 'sdcr_score' in df.columns is True is evaluated by Python before anything is passed to Spark, and it returns a Python True/False.
So what you are actually passing to Spark is when(True, ...).
when() expects its first argument to be a Column that evaluates to true/false, not a Python bool.
You can wrap the argument with the lit() function, which passes a literal True/False Column to the when clause.
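A sketch of how that might look for the select above (only the two conditional columns are shown; keep the other col(...) entries as they are):

from pyspark.sql.functions import col, lit, when

scores_df1 = df.select(
    col('marketplace_id'),
    # ... the other columns from the original select ...
    when(lit('sdcr_score' in df.columns), col('sdcr_score')).otherwise(lit(None)).alias('sdcr_score'),
    when(lit('sdar_score' in df.columns), col('sdar_score')).otherwise(lit(None)).alias('sdar_score')
)

# Side note (beyond the lit() fix): col('sdcr_score') must still resolve against df,
# so if the column can be genuinely absent it is safer to choose the whole
# expression in Python instead:
#   (col('sdcr_score') if 'sdcr_score' in df.columns else lit(None)).alias('sdcr_score')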
I am trying to read a huge CSV and query it through spark.sql.
I created a dataframe from a CSV, the dataframe seems created correctly.
I read the schema and I can perform select and filter.
I would like to create a temp view so I can run the same queries using SQL, which I am more comfortable with, but the temp view seems to be created on the CSV header only.
Where am I making the mistake?
Thanks
>>> df = spark.read.options(header=True,inferSchema=True,delimiter=";").csv("./elenco_dm_tutti_csv_formato_opendata_UltimaVersione.csv")
>>> df.printSchema()
root
|-- TIPO: integer (nullable = true)
|-- PROGRESSIVO_DM_ASS: integer (nullable = true)
|-- DATA_PRIMA_PUBBLICAZIONE: string (nullable = true)
|-- DM_RIFERIMENTO: integer (nullable = true)
|-- GRUPPO_DM_SIMILI: integer (nullable = true)
|-- ISCRIZIONE_REPERTORIO: string (nullable = true)
|-- INIZIO_VALIDITA: string (nullable = true)
|-- FINE_VALIDITA: string (nullable = true)
|-- FABBRICANTE_ASSEMBLATORE: string (nullable = true)
|-- CODICE_FISCALE: string (nullable = true)
|-- PARTITA_IVA_VATNUMBER: string (nullable = true)
|-- CODICE_CATALOGO_FABBR_ASS: string (nullable = true)
|-- DENOMINAZIONE_COMMERCIALE: string (nullable = true)
|-- CLASSIFICAZIONE_CND: string (nullable = true)
|-- DESCRIZIONE_CND: string (nullable = true)
|-- DATAFINE_COMMERCIO: string (nullable = true)
>>> df.count()
1653697
>>> df.createOrReplaceTempView("mask")
>>> spark.sql("select count(*) from mask")
DataFrame[count(1): bigint]
Spark transformations are lazy: spark.sql() just returns a DataFrame and does not process anything by itself. You need to add an action such as .show() or .collect() to get results.
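For example, with the temp view created above:

spark.sql("select count(*) from mask").show()
# +--------+
# |count(1)|
# +--------+
# | 1653697|
# +--------+

Alternatively, .collect() returns the rows to the driver as a list.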
I have a Databricks notebook that reads the delta data in JSON format every hour. So let's say at 11 AM the schema of the file is as follows:
root
|-- number: string (nullable = true)
|-- company: string (nullable = true)
|-- assignment: struct (nullable = true)
| |-- link: string (nullable = true)
| |-- value: string (nullable = true)
The next hour, at 12 PM, the schema changes to:
root
|-- number: string (nullable = true)
|-- company: struct (nullable = true)
| |-- link: string (nullable = true)
| |-- value: string (nullable = true)
|-- assignment: struct (nullable = true)
| |-- link: string (nullable = true)
| |-- value: string (nullable = true)
Some of the columns change from string to struct and vice versa. So if I select col('company.link') and the incoming schema is of type string, the code fails.
How do I handle schema changes in PySpark when reading the file? My end goal is to flatten the JSON to a CSV format.
from pyspark.sql.functions import col

def get_dtype(df, colname):
    # Return the Spark dtype string for the given column name
    return [dtype for name, dtype in df.dtypes if name == colname][0]

# df has the exploded JSON data
df2 = df.select("result.number",
                "result.company",
                "result.assignment_group")

df23 = df2
for name, _ in df2.dtypes:
    if 'struct' in get_dtype(df2, name):
        try:
            df23 = (df23.withColumn(name + "_link", col(name + ".link"))
                        .withColumn(name + "_value", col(name + ".value"))
                        .drop(name))
        except Exception:
            print("error flattening column " + name)

df23.printSchema()
root
|-- number: string (nullable = true)
|-- company: string (nullable = true)
|-- assignment_group_link: string (nullable = true)
|-- assignment_group_value: string (nullable = true)
So this is what I did:
Created a function that identifies whether a column is of type struct
Read all the columns from the base DataFrame that holds the exploded JSON result
Then looped through the columns and, for each struct column, added new columns with the nested values
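Building on this, since a column such as company can arrive either as a plain string or as a link/value struct, one variation (a sketch only; the column names and the df variable are assumptions based on the schemas above) is to always emit the _link/_value pair so the flattened CSV columns stay identical every hour:

from pyspark.sql.functions import col, lit

# Hypothetical list of columns that may arrive either as a string or as a
# link/value struct, assumed from the 11 AM and 12 PM schemas above.
maybe_struct_cols = ["company", "assignment"]

def flatten_maybe_struct(df, name):
    # Always emit <name>_link and <name>_value, whatever the incoming type.
    dtype = dict(df.dtypes)[name]
    if 'struct' in dtype:
        return (df.withColumn(name + "_link", col(name + ".link"))
                  .withColumn(name + "_value", col(name + ".value"))
                  .drop(name))
    # Plain string: keep the value and fill the link with null.
    return (df.withColumn(name + "_value", col(name))
              .withColumn(name + "_link", lit(None).cast("string"))
              .drop(name))

for c in maybe_struct_cols:
    df = flatten_maybe_struct(df, c)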
Below is my data structure:
root
|-- platform_build_id: string (nullable = true)
|-- pro: struct (nullable = true)
| |-- av: string (nullable = true)
| |-- avc: string (nullable = true)
I tried using the explode function:
val flattened = Data_df.withColumn("pro", explode(array($"pro")))
This would work if the pro column contained an array of elements, but in my case, what should I use to get this data into a flattened format?
Using .select() with the struct column expanded (pro.*) will give the flattened format:
Data_df.select("platform_build_id","pro.*").show()
Example:
val s_ds=Seq("""{"platform_build_id":"1234","pro":{"av":"pro_av","avc":"pro_avc"}}""").toDS
spark.read.json(s_ds).printSchema
//root
// |-- platform_build_id: string (nullable = true)
// |-- pro: struct (nullable = true)
// | |-- av: string (nullable = true)
// | |-- avc: string (nullable = true)
spark.read.json(s_ds).select("platform_build_id","pro.*").show()
Result:
+-----------------+------+-------+
|platform_build_id| av| avc|
+-----------------+------+-------+
| 1234|pro_av|pro_avc|
+-----------------+------+-------+
My question is whether there are any approaches to update the schema of a DataFrame without explicitly calling SparkSession.createDataFrame(dataframe.rdd, newSchema).
Details are as follows.
I have an original Spark DataFrame with schema below:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
I applied Dataset.mapPartitions to the original DataFrame and got a new DataFrame in return.
The reason for using Dataset.mapPartitions rather than Dataset.map is better transformation speed.
In this new DataFrame, every row should have a schema like below:
root
|-- column21: string (nullable = true)
|-- column22: long (nullable = true)
|-- column23: string (nullable = true)
|-- column24: long (nullable = true)
|-- column25: struct (nullable = true)
| |-- column251: string (nullable = true)
| |-- column252: string (nullable = true)
| |-- column253: string (nullable = true)
| |-- column254: string (nullable = true)
| |-- column255: string (nullable = true)
| |-- column256: string (nullable = true)
So the schema of the new DataFrame should be the same as the above.
However, the schema of the new DataFrame is not updated automatically. The output of calling Dataset.printSchema on the new DataFrame is still the original schema:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
So, in order to get the correct (updated) schema, what I'm doing is using SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
My concern here is that falling back to RDD (newDataFrame.rdd) will hurt the transformation speed because Spark Catalyst doesn't handle RDD as well as Dataset/DataFrame.
My question is whether there are any approaches to update the schema of the new DataFrame without explicitly calling SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
Thanks a lot.
You can use RowEncoder to define the schema for newDataFrame.
See the following example.
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val originalDF = spark.sparkContext.parallelize(List(("Tonny", "city1"), ("Rogger", "city2"), ("Michal", "city3"))).toDF("name", "city")
val r = scala.util.Random

// Encoder describing the schema of the rows produced inside mapPartitions
val encoderForNewDF = RowEncoder(StructType(Array(
  StructField("name", StringType),
  StructField("num", IntegerType),
  StructField("city", StringType)
)))

val newDF = originalDF.mapPartitions { partition =>
  partition.map { row =>
    val name = row.getAs[String]("name")
    val city = row.getAs[String]("city")
    val num = r.nextInt
    Row.fromSeq(Array[Any](name, num, city))
  }
}(encoderForNewDF)

newDF.printSchema()
newDF.printSchema()
root
|-- name: string (nullable = true)
|-- num: integer (nullable = true)
|-- city: string (nullable = true)
RowEncoder for Spark: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-RowEncoder.html