How to stop pyspark from automatically renaming the duplicate columns - pyspark

I have a csv file with duplicate columns in it. When I read it with spark.read.format("csv").load(), it automatically renames the duplicate columns by appending an index to the end of the column name.
df = spark.read.format('csv').option('header', True).load('desktop/csv/2.csv')
display(df)
Any idea on how to get the column names as year, year_1?

The following link https://dbmstutorials.com/pyspark/spark-dataframe-schema.html shows a solution:
from pyspark.sql.types import StructType # imported StructType
schema_def = StructType() # Created a StructType object
schema_def.add("db_id","integer",True) # Adding column 1 to StructType
schema_def.add("db_name","string",True) # Adding column 2 to StructType
schema_def.add("db_type_cd","string",True) # Adding column 3 to StructType
df_with_schema = spark.read.csv("file:///path_to_files/csv_file_with_duplicates.csv", schema=schema_def, header=True)
df_with_schema.printSchema()
You should create the dataset schema before loading your file; this way you can override the default duplicate names that Spark generates.

Using a custom schema, you can load the csv file with renamed columns:
>>> from pyspark.sql.types import StructType, StringType
>>> schema = StructType() \
... .add("Year",StringType(),True) \
... .add("Year_1",StringType(),True) \
... .add("Industry_code_NZSIOC",StringType(),True) \
... .add("Industry_name_NZSIOC",StringType(),True) \
... .add("Units",StringType(),True) \
... .add("Variable_code",StringType(),True)
>>> df = spark.read.option("header",True).schema(schema).csv("/dir1/dir2/Sample2.csv")
>>> df.show()
+----+-------+--------------------+--------------------+----------------+-------------+
|Year| Year_1|Industry_code_NZSIOC|Industry_name_NZSIOC| Units|Variable_code|
+----+-------+--------------------+--------------------+----------------+-------------+
|2020|Level 1| 99999| All industries|Dollar(millions)| H01|
|2020|Level 1| 99999| All industries|Dollar(millions)| H04|
|2020|Level 1| 99999| All industries|Dollar(millions)| H05|
|2020|Level 1| 99999| All industries|Dollar(millions)| H07|
|2020|Level 1| 99999| All industries|Dollar(millions)| H08|
+----+-------+--------------------+--------------------+----------------+-------------+

Related

StructType add method to add ArrayType

StructType has a method called add. I've seen examples that use it like this:
schema = Structype()
schema.add('testing',string)
schema.add('testing2',string)
How can I add StructType and ArrayType fields to the schema using add()?
You need to use it as below -
from pyspark.sql.types import *
schema = StructType()
schema.add('testing',StringType())
schema.add('testing2',StringType())
Sample example to create a dataframe using this schema -
df = spark.createDataFrame(data=[("1", "2"), ("3", "4")], schema=schema)  # values as strings to match the StringType schema
df.show()
+-------+--------+
|testing|testing2|
+-------+--------+
| 1| 2|
| 3| 4|
+-------+--------+
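The question also asks about array columns. Here is a minimal sketch of adding an ArrayType field with the same add() call, assuming string elements and a hypothetical tags column -
from pyspark.sql.types import StructType, StringType, ArrayType
schema = StructType()
schema.add('testing', StringType())
schema.add('tags', ArrayType(StringType()))  # array-of-strings column via add()
df = spark.createDataFrame(data=[("a", ["x", "y"])], schema=schema)
df.printSchema()
root
 |-- testing: string (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: string (containsNull = true)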

Column Renaming in pyspark dataframe

I have column names with special characters. I renamed the columns and tried to save, but the save failed saying the columns have special characters. I ran printSchema on the dataframe and I see the column names without any special characters. Here is the code I tried.
for c in df_source.columns:
    df_source = df_source.withColumnRenamed(c, c.replace("(", ""))
    df_source = df_source.withColumnRenamed(c, c.replace(")", ""))
    df_source = df_source.withColumnRenamed(c, c.replace(".", ""))
df_source.coalesce(1).write.format("parquet").mode("overwrite").option("header","true").save(stg_location)
and I get the following error:
Caused by: org.apache.spark.sql.AnalysisException: Attribute name "Number_of_data_samples_(input)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
One more thing I noticed: when I do df_source.show() or display(df_source), both show the same error, while printSchema shows there are no special characters.
Can someone help me find a solution for this?
Try it as below.
Input df -
from pyspark.sql.types import *
from pyspark.sql.functions import *
data = [("xyz", 1)]
schema = StructType([StructField("Number_of_data_samples_(input)", StringType(), True), StructField("id", IntegerType())])
df = spark.createDataFrame(data=data, schema=schema)
df.show()
+------------------------------+---+
|Number_of_data_samples_(input)| id|
+------------------------------+---+
| xyz| 1|
+------------------------------+---+
Method 1
Using regular expressions to replace the special characters and then use toDF()
import re
cols = [re.sub(r"\.|\)|\(", "", i) for i in df.columns]
df.toDF(*cols).show()
+----------------------------+---+
|Number_of_data_samples_input| id|
+----------------------------+---+
| xyz| 1|
+----------------------------+---+
Method 2
Using .withColumnRenamed()
for i, j in zip(df.columns, cols):
    df = df.withColumnRenamed(i, j)
df.show()
+----------------------------+---+
|Number_of_data_samples_input| id|
+----------------------------+---+
| xyz| 1|
+----------------------------+---+
Method 3
Using .withColumn to create a new column and drop the existing column
df = df.withColumn("Number_of_data_samples_input", lit(col("Number_of_data_samples_(input)"))).drop(col("Number_of_data_samples_(input)"))
df.show()
+---+----------------------------+
| id|Number_of_data_samples_input|
+---+----------------------------+
| 1| xyz|
+---+----------------------------+
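A note on the original loop: every withColumnRenamed call passes the original name c, so once the "(" has been replaced the later calls refer to a column name that no longer exists and are silently ignored, which is why the special characters survived into the parquet write. A minimal sketch of a single-pass rename followed by the write, reusing the asker's stg_location placeholder -
import re
for c in df_source.columns:
    df_source = df_source.withColumnRenamed(c, re.sub(r"\.|\)|\(", "", c))  # strip . ( ) in one pass
df_source.coalesce(1).write.format("parquet").mode("overwrite").save(stg_location)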

How to read a csv file into a dataframe when the header is delimited with "," and the rest of the rows are separated with "|"

I have a csv file where the header is comma separated but the rest of the rows are separated with another delimiter "|". How do I handle this mixed-delimiter scenario? Please advise.
import org.apache.spark.sql.{DataFrame, SparkSession}
var df1: DataFrame = null
df1=spark.read.option("header", "true").option("delimiter", ",").option("inferSchema", "false")
.option("ignoreLeadingWhiteSpace", "true") .option("ignoreTrailingWhiteSpace", "true")
.csv("/testing.csv")
df1.show(10)
This command parses the header columns correctly, but all the data is displayed in the first column and the remaining columns show up as null.
Read the csv first, then split the first column and create a new dataframe:
df.show
+---------+----+-----+
| Id|Date|Value|
+---------+----+-----+
|1|2020|30|null| null|
|1|2020|40|null| null|
|2|2020|50|null| null|
|2|2020|40|null| null|
+---------+----+-----+
import org.apache.spark.sql.functions.{col, split}
import spark.implicits._

val cols = df.columns
val index = 0 to cols.size - 1
val expr = index.map(i => col("array")(i))
df.withColumn("array", split($"Id", "\\|"))
  .select(expr: _*).toDF(cols: _*).show
+---+----+-----+
| Id|Date|Value|
+---+----+-----+
| 1|2020| 30|
| 1|2020| 40|
| 2|2020| 50|
| 2|2020| 40|
+---+----+-----+
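Since the thread is tagged pyspark, here is a rough PySpark equivalent of the same split-and-rename approach, assuming df was read with the comma delimiter as above so all the data sits in the first column -
from pyspark.sql.functions import col, split

cols = df.columns
# split the pipe-delimited first column, then spread the pieces back over the original column names
arr = df.withColumn("array", split(col("Id"), r"\|"))
arr.select([col("array")[i] for i in range(len(cols))]).toDF(*cols).show()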

Creating Empty DF and adding column is NOT working

I'm trying to create an empty dataframe and append a new column. I tried to do this in two ways. Option A is working but Option B is not. Please help!
Option A:
var initialDF1 = Seq(("test")).toDF("M")
initialDF1 = initialDF1.withColumn(("P"), lit(s"P"))
initialDF1.show
+----+---+
| M| P|
+----+---+
|test| P|
+----+---+
Option B: (Not working)
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
val schema = StructType(List(StructField("N", StringType, true)))
var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)
initialDF = initialDF.withColumn(("P"), lit(s"P"))
initialDF.show
+---+---+
| N| P|
+---+---+
+---+---+
It is working as intended. withColumn only adds the column to the schema and sets a value (lit or some other calculation) on existing records, so it is only applied to existing rows. In your second case you created an empty dataframe; withColumn iterates over it and adds "P" to every existing row, of which there are none.
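To illustrate the same point in pyspark (this thread's tag), a minimal sketch: withColumn on an empty dataframe still yields zero rows, while a dataframe with at least one row picks up the value.
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import lit

schema = StructType([StructField("N", StringType(), True)])
empty_df = spark.createDataFrame([], schema)
empty_df.withColumn("P", lit("P")).show()    # column P exists, but there are no rows to fill

one_row_df = spark.createDataFrame([("test",)], schema)
one_row_df.withColumn("P", lit("P")).show()  # the existing row gets P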

Spark-SQL : How to read a TSV or CSV file into dataframe and apply a custom schema?

I'm using Spark 2.0 while working with tab-separated value (TSV) and comma-separated value (CSV) files. I want to load the data into Spark-SQL dataframes, where I would like to control the schema completely when the files are read. I don't want Spark to guess the schema from the data in the file.
How would I load TSV or CSV files into Spark SQL Dataframes and apply a schema to them?
Below is a complete Spark 2.0 example of loading a tab-separated value (TSV) file and applying a schema.
I'm using the Iris data set in TSV format from UAH.edu as an example. Here are the first few rows from that file:
Type PW PL SW SL
0 2 14 33 50
1 24 56 31 67
1 23 51 31 69
0 2 10 36 46
1 20 52 30 65
To enforce a schema, you can programmatically build it using one of two methods:
A. Create the schema with StructType:
import org.apache.spark.sql.types._
var irisSchema = StructType(Array(
  StructField("Type", IntegerType, true),
  StructField("PetalWidth", IntegerType, true),
  StructField("PetalLength", IntegerType, true),
  StructField("SepalWidth", IntegerType, true),
  StructField("SepalLength", IntegerType, true)
))
B. Alternatively, create the schema with a case class and Encoders (this approach is less verbose):
import org.apache.spark.sql.Encoders
case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int,
SepalWidth: Int, SepalLength: Int)
var irisSchema = Encoders.product[IrisSchema].schema
Once you have created your schema, you can use spark.read to read in the TSV file. Note that you can also read comma-separated value (CSV) files, or any other delimited files, as long as you set the option("delimiter", d) option correctly. Further, if your data file has a header line, be sure to set option("header", "true").
Below is the complete final code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Encoders
val spark = SparkSession.builder().getOrCreate()
case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int,
SepalWidth: Int, SepalLength: Int)
var irisSchema = Encoders.product[IrisSchema].schema
var irisDf = spark.read.format("csv"). // Use "csv" regardless of TSV or CSV.
option("header", "true"). // Does the file have a header line?
option("delimiter", "\t"). // Set delimiter to tab or comma.
schema(irisSchema). // Schema that was built above.
load("iris.tsv")
irisDf.show(5)
And here is the output:
scala> irisDf.show(5)
+----+----------+-----------+----------+-----------+
|Type|PetalWidth|PetalLength|SepalWidth|SepalLength|
+----+----------+-----------+----------+-----------+
| 0| 2| 14| 33| 50|
| 1| 24| 56| 31| 67|
| 1| 23| 51| 31| 69|
| 0| 2| 10| 36| 46|
| 1| 20| 52| 30| 65|
+----+----------+-----------+----------+-----------+
only showing top 5 rows
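For pyspark (the main tag of this thread), a minimal equivalent sketch with an explicit schema; the column names and file name mirror the Scala example above.
from pyspark.sql.types import StructType, StructField, IntegerType

iris_schema = StructType([
    StructField("Type", IntegerType(), True),
    StructField("PetalWidth", IntegerType(), True),
    StructField("PetalLength", IntegerType(), True),
    StructField("SepalWidth", IntegerType(), True),
    StructField("SepalLength", IntegerType(), True),
])

iris_df = (spark.read.format("csv")
           .option("header", "true")
           .option("delimiter", "\t")   # tab for TSV, "," for CSV
           .schema(iris_schema)
           .load("iris.tsv"))
iris_df.show(5)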