Spark 1.6: drop column in DataFrame with escaped column names - scala

I'm trying to drop a column in a DataFrame, but I have column names with dots in them, which I escaped.
Before I escape, my schema looks like this:
root
|-- user_id: long (nullable = true)
|-- hourOfWeek: string (nullable = true)
|-- observed: string (nullable = true)
|-- raw.hourOfDay: long (nullable = true)
|-- raw.minOfDay: long (nullable = true)
|-- raw.dayOfWeek: long (nullable = true)
|-- raw.sensor2: long (nullable = true)
If I try to drop a column, I get:
df = df.drop("hourOfWeek")
org.apache.spark.sql.AnalysisException: cannot resolve 'raw.hourOfDay' given input columns raw.dayOfWeek, raw.sensor2, observed, raw.hourOfDay, hourOfWeek, raw.minOfDay, user_id;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
Note that I'm not even trying to drop any of the columns with dots in their names.
Since I couldn't seem to do much without escaping the column names, I converted the schema to:
root
|-- user_id: long (nullable = true)
|-- hourOfWeek: string (nullable = true)
|-- observed: string (nullable = true)
|-- `raw.hourOfDay`: long (nullable = true)
|-- `raw.minOfDay`: long (nullable = true)
|-- `raw.dayOfWeek`: long (nullable = true)
|-- `raw.sensor2`: long (nullable = true)
but that doesn't seem to help. I still get the same error.
I tried escaping all column names and dropping using the escaped name, but that doesn't work either.
root
|-- `user_id`: long (nullable = true)
|-- `hourOfWeek`: string (nullable = true)
|-- `observed`: string (nullable = true)
|-- `raw.hourOfDay`: long (nullable = true)
|-- `raw.minOfDay`: long (nullable = true)
|-- `raw.dayOfWeek`: long (nullable = true)
|-- `raw.sensor2`: long (nullable = true)
df.drop("`hourOfWeek`")
org.apache.spark.sql.AnalysisException: cannot resolve 'user_id' given input columns `user_id`, `raw.dayOfWeek`, `observed`, `raw.minOfDay`, `raw.hourOfDay`, `raw.sensor2`, `hourOfWeek`;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
Is there another way to drop a column that would not fail on this type of data?

Alright, I seem to have found the solution after all:
df.drop(df.col("raw.hourOfWeek")) seems to work

val data = df.drop("Customers");
will work fine for normal columns
val new = df.drop(df.col("old.column"));

Related

Not able to add data through withColumn using a case statement in PySpark

Code is as below:
#To get deal keys
schema of lt_online:
root
|-- FT/RT: string (nullable = true)
|-- Country: string (nullable = true)
|-- Charge_Type: string (nullable = true)
|-- Tariff_Loc: string (nullable = true)
|-- Charge_No: string (nullable = true)
|-- Status: string (nullable = true)
|-- Validity_from: string (nullable = true)
|-- Validity_to: string (nullable = true)
|-- Range_Basis: string (nullable = true)
|-- Limited_Parties: string (nullable = true)
|-- Charge_Detail: string (nullable = true)
|-- Freetime_Unit: string (nullable = true)
|-- Freetime: string (nullable = true)
|-- Count_Holidays: string (nullable = true)
|-- Majeure: string (nullable = true)
|-- Start_Event: string (nullable = true)
|-- Same/Next_Day: string (nullable = true)
|-- Next_Day_if_AFTER: string (nullable = true)
|-- Availability_Date: string (nullable = true)
|-- Route_Group: string (nullable = true)
|-- Route_Code: string (nullable = true)
|-- Origin: string (nullable = true)
|-- LoadZone: string (nullable = true)
|-- FDischZone: string (nullable = true)
|-- PODZone: string (nullable = true)
|-- FDestZone: string (nullable = true)
|-- Equipment_Group: string (nullable = true)
|-- Equipment_Type: string (nullable = true)
|-- Range_From: string (nullable = true)
|-- Range_To: void (nullable = true)
|-- Cargo_Type: string (nullable = true)
|-- Commodity: string (nullable = true)
|-- SC_Group: string (nullable = true)
|-- SC_Number: string (nullable = true)
|-- IMO: string (nullable = true)
|-- Shipper_Group: string (nullable = true)
|-- Cnee_Group: string (nullable = true)
|-- Direction: string (nullable = true)
|-- Service: string (nullable = true)
|-- Haulage: string (nullable = true)
|-- Transport_Type: string (nullable = true)
|-- Option1: string (nullable = true)
|-- Option2: string (nullable = true)
|-- 1st_of_Route_Group: string (nullable = true)
|-- 1st_of_LoadZone: string (nullable = true)
|-- 1st_of_FDischZone: string (nullable = true)
|-- 1st_of_PODZone: string (nullable = true)
|-- 1st_of_FDestZone: string (nullable = true)
|-- 1st_of_Equipment_Group: string (nullable = true)
|-- 1st_of_SC_Group: string (nullable = true)
|-- 1st_of_Shipper_Group: string (nullable = true)
|-- 1st_of_Cnee_Group: string (nullable = true)
PySpark code as below:
df = lt_online.withColumn("dealkeys", lit('')).withColumn("dealAttributes", lit(''))
start = []
start_dict = {}
dealatt = ["Charge_No", "Status", "Validity_from", "Validity_to"]
dealkeys = ["Charge_Type", "Direction"]
for index, row in lt_online.toPandas().iterrows():
    start = []
    start_dict = {}
    key = row['Charge_No']
    for i in dealatt:
        #final = row[i]
        start_dict[i] = row[i]
    df_deal_att = df.withColumn('dealkeys', when(col('Charge_No') == key, str(start_dict)).otherwise(col('dealkeys')))
    for i in dealkeys:
        #key = row['Charge_No']
        final = {"keyname": i, "value": row[i], "description": ".."}
        start.append(final)
        #final_val = {"value": row['Charge_Type']}
        #start.append(final_val)
    #df3 = lt_online.withColumn("new_column", str(start))
    print(key, start_dict)
    df3 = df_deal_att.withColumn('dealAttributes', when(col('Charge_No') == key, str(start)).otherwise(col('dealAttributes')))
When I run the df3 dataframe, the old data in dealAttributes and dealkeys goes blank and only the latest record gets inserted.
Please see the screenshot
Since the lt_online dataframe is large, I have selected only the required columns from it. The following is the schema of the lt_online dataframe that I have selected.
The problem arises because you are not updating df, but assigning the result to df_deal_att. This updates df_deal_att (and df3) only for the current row of the loop, because df never changes during the entire process. Using df_deal_att.show() inside the loop will help in understanding this.
Use the following code instead to get the desired output:
for index, row in lt_online.toPandas().iterrows():
    start = []
    start_dict = {}
    key = row['Charge_No']
    for i in dealatt:
        start_dict[i] = row[i]
    # ASSIGN TO df INSTEAD OF df_deal_att
    df = df.withColumn('dealkeys', when(col('Charge_No') == key, str(start_dict)).otherwise(col('dealkeys')))
    for i in dealkeys:
        final = {"keyname": i, "value": row[i], "description": ".."}
        start.append(final)
    # USE df AND ASSIGN TO df INSTEAD OF USING df_deal_att AND ASSIGNING TO df3
    df = df.withColumn('dealAttributes', when(col('Charge_No') == key, str(start)).otherwise(col('dealAttributes')))
Assigning the result of adding the column value based on the condition back to df itself (instead of using df_deal_att or df3) solves the issue. The following image reflects the output achieved after using the above code.
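The essential pattern, reduced to a toy sketch (the id and flag column names here are illustrative, not from the original post): each iteration reads from and writes back to the same dataframe, so earlier updates survive later ones.
from pyspark.sql.functions import col, lit, when

# Hypothetical example: df has an "id" column; "flag" starts empty.
df = df.withColumn("flag", lit(""))
for key in ["A", "B", "C"]:
    # Reassigning to df preserves the updates made in previous iterations;
    # assigning to a fresh variable each time would discard them.
    df = df.withColumn(
        "flag",
        when(col("id") == key, lit(key)).otherwise(col("flag")))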

Pyspark create temp view from dataframe

I am trying to read a huge CSV through spark.sql.
I created a dataframe from the CSV, and the dataframe seems to be created correctly.
I read the schema and I can perform select and filter.
I would like to create a temp view to run the same queries using SQL, as I am more comfortable with it, but the temp view seems to be created on the CSV header only.
Where am I making a mistake?
Thanks
>>> df = spark.read.options(header=True,inferSchema=True,delimiter=";").csv("./elenco_dm_tutti_csv_formato_opendata_UltimaVersione.csv")
>>> df.printSchema()
root
|-- TIPO: integer (nullable = true)
|-- PROGRESSIVO_DM_ASS: integer (nullable = true)
|-- DATA_PRIMA_PUBBLICAZIONE: string (nullable = true)
|-- DM_RIFERIMENTO: integer (nullable = true)
|-- GRUPPO_DM_SIMILI: integer (nullable = true)
|-- ISCRIZIONE_REPERTORIO: string (nullable = true)
|-- INIZIO_VALIDITA: string (nullable = true)
|-- FINE_VALIDITA: string (nullable = true)
|-- FABBRICANTE_ASSEMBLATORE: string (nullable = true)
|-- CODICE_FISCALE: string (nullable = true)
|-- PARTITA_IVA_VATNUMBER: string (nullable = true)
|-- CODICE_CATALOGO_FABBR_ASS: string (nullable = true)
|-- DENOMINAZIONE_COMMERCIALE: string (nullable = true)
|-- CLASSIFICAZIONE_CND: string (nullable = true)
|-- DESCRIZIONE_CND: string (nullable = true)
|-- DATAFINE_COMMERCIO: string (nullable = true)
>>> df.count()
1653697
>>> df.createOrReplaceTempView("mask")
>>> spark.sql("select count(*) from mask")
DataFrame[count(1): bigint]
Spark operations like sql() are lazy and do not process anything by default. You need to add .show() or .collect() to get results.
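For example, a minimal sketch using the mask view created above:
result = spark.sql("select count(*) from mask")
result.show()                  # triggers execution and prints the count
n = result.collect()[0][0]     # or bring the value back to the driver
print(n)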

Handling varying JSON schema when creating a dataframe in PySpark

I have a Databricks notebook that reads the delta data in JSON format every hour. So let's say at 11 AM the schema of the file is as follows:
root
|-- number: string (nullable = true)
|-- company: string (nullable = true)
|-- assignment: struct (nullable = true)
| |-- link: string (nullable = true)
| |-- value: string (nullable = true)
The next hour, at 12 PM, the schema changes to:
root
|-- number: string (nullable = true)
|-- company: struct (nullable = true)
| |-- link: string (nullable = true)
| |-- value: string (nullable = true)
|-- assignment: struct (nullable = true)
| |-- link: string (nullable = true)
| |-- value: string (nullable = true)
Some of the columns change from string to struct and vice versa. So if I select col("company.link") and the incoming column is of type string, the code fails.
How do I handle schema changes in PySpark when reading the file, given that my end goal is to flatten the JSON to a CSV format?
def get_dtype(df, colname):
    return [dtype for name, dtype in df.dtypes if name == colname][0]

# df has the exploded JSON data
df2 = df.select("result.number",
                "result.company",
                "result.assignment_group")
df23 = df2
for name, cols in df2.dtypes:
    if 'struct' in get_dtype(df2, name):
        try:
            df23 = df23.withColumn(name + "_link", col(name + ".link")).withColumn(name + "_value", col(name + ".value")).drop(name)
        except:
            print("error")
df23.printSchema()
root
|-- number: string (nullable = true)
|-- company: string (nullable = true)
|-- assignment_group_link: string (nullable = true)
|-- assignment_group_value: string (nullable = true)
So this is what I did:
Created a function that identifies whether a column is of type struct.
Read all the columns from the base dataframe that holds the exploded result from the JSON.
Then looped through the columns and, where a column is of type struct, added new columns with the nested values.
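A condensed sketch of that approach (the flatten_if_struct helper name is mine, and it assumes every struct column carries link and value fields, as in the schemas above):
from pyspark.sql.functions import col

def flatten_if_struct(df, name):
    # Flatten name.link / name.value only when the column actually arrived
    # as a struct; string-typed arrivals are left untouched.
    dtype = dict(df.dtypes).get(name, "")
    if dtype.startswith("struct"):
        df = (df.withColumn(name + "_link", col(name + ".link"))
                .withColumn(name + "_value", col(name + ".value"))
                .drop(name))
    return df

for name in df23.columns:
    df23 = flatten_if_struct(df23, name)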

How to dynamically infer a schema using SparkSession

I have just started learning Spark. I am aware of the fact that if we set the inferSchema option to true, the schema is automatically inferred. I am reading a simple CSV file. How do I dynamically infer a schema without specifying any custom schema in my code? The code should be able to build the schema for any incoming dataset.
Is it possible to do so?
I tried using readStream and specified my format as csv, skipping the inferSchema option altogether, but it seems I need to provide that option in any case.
val ds1: DataFrame = spark
  .readStream
  .format("csv")
  .load("/home/vaibha/Downloads/C2ImportCalEventSample.csv")
println(ds1.show(2))
You can dynamically infer the schema, but it might get a bit tedious for some CSV files. Read more here. Referring to the CSV file in your code sample, and assuming it is the same as the one here, something like the below will give you what you need:
scala> val df = spark.read.
| option("header", "true").
| option("inferSchema", "true").
| option("timestampFormat","MM/dd/yyyy").
| csv("D:\\texts\\C2ImportCalEventSample.csv")
df: org.apache.spark.sql.DataFrame = [Start Date : timestamp, Start Time: string ... 15 more fields]
scala> df.printSchema
root
|-- Start Date : timestamp (nullable = true)
|-- Start Time: string (nullable = true)
|-- End Date: timestamp (nullable = true)
|-- End Time: string (nullable = true)
|-- Event Title : string (nullable = true)
|-- All Day Event: string (nullable = true)
|-- No End Time: string (nullable = true)
|-- Event Description: string (nullable = true)
|-- Contact : string (nullable = true)
|-- Contact Email: string (nullable = true)
|-- Contact Phone: string (nullable = true)
|-- Location: string (nullable = true)
|-- Category: integer (nullable = true)
|-- Mandatory: string (nullable = true)
|-- Registration: string (nullable = true)
|-- Maximum: integer (nullable = true)
|-- Last Date To Register: timestamp (nullable = true)

How to update the schema of a Spark DataFrame (methods like Dataset.withColumn and Dataset.select don't work in my case)

My question is whether there are any approaches to update the schema of a DataFrame without explicitly calling SparkSession.createDataFrame(dataframe.rdd, newSchema).
Details are as follows.
I have an original Spark DataFrame with schema below:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
I applied Dataset.mapPartitions on the original DataFrame and got a new DataFrame (returned by Dataset.mapPartitions).
The reason for using Dataset.mapPartitions rather than Dataset.map is better transformation speed.
In this new DataFrame, every row should have a schema like below:
root
|-- column21: string (nullable = true)
|-- column22: long (nullable = true)
|-- column23: string (nullable = true)
|-- column24: long (nullable = true)
|-- column25: struct (nullable = true)
| |-- column251: string (nullable = true)
| |-- column252: string (nullable = true)
| |-- column253: string (nullable = true)
| |-- column254: string (nullable = true)
| |-- column255: string (nullable = true)
| |-- column256: string (nullable = true)
So the schema of the new DataFrame should be the same as the above.
However, the schema of the new DataFrame is not updated automatically. The output of applying the Dataset.printSchema method on the new DataFrame is still the original:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
So, in order to get the correct (updated) schema, what I'm doing is using SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
My concern here is that falling back to RDD (newDataFrame.rdd) will hurt the transformation speed because Spark Catalyst doesn't handle RDD as well as Dataset/DataFrame.
My question is whether there are any approaches to update the schema of the new DataFrame without explicitly calling SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
Thanks a lot.
You can use RowEncoder to define the schema for newDataFrame.
See the following example.
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import spark.implicits._

val originalDF = spark.sparkContext.parallelize(List(("Tonny", "city1"), ("Rogger", "city2"), ("Michal", "city3"))).toDF("name", "city")
val r = scala.util.Random

// Encoder carrying the schema of the new DataFrame
val encoderForNewDF = RowEncoder(StructType(Array(
  StructField("name", StringType),
  StructField("num", IntegerType),
  StructField("city", StringType)
)))

val newDF = originalDF.mapPartitions { partition =>
  partition.map { row =>
    val name = row.getAs[String]("name")
    val city = row.getAs[String]("city")
    val num = r.nextInt
    Row.fromSeq(Array[Any](name, num, city))
  }
} (encoderForNewDF)

newDF.printSchema()
root
|-- name: string (nullable = true)
|-- num: integer (nullable = true)
|-- city: string (nullable = true)
Row Encoder for spark: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-RowEncoder.html