Read multiple CSV files with a different number of columns in each file - pyspark

I wanted to read multiple CSV files, each with a different number of columns, using PySpark.
Files=['Data/f1.csv','Data/f2.csv','Data/f3.csv','Data/f4.csv','Data/f5.csv']
The f1 file has 50 columns, f2 has 10 more columns for a total of 60, f3 has another 30 for a total of 80, and so on.
However,
df = spark.read.csv(Files,header=True)
gives only 50 columns, while I am expecting 80. Since f1 has only 50 columns, its remaining 30 columns should be filled with NaN values, and the same applies to the other CSV files. A pandas dataframe gives me all 80 columns perfectly:
import pandas as pd
import glob
df = pd.concat(map(pd.read_csv, ['Data/f1.csv','Data/f2.csv','Data/f3.csv','Data/f4.csv','Data/f5.csv']))
But I can't do the same thing with PySpark. How can I read all columns of the above 5 CSV files into a single Spark dataframe?

You can read each file into its own Spark dataframe; to combine all the dataframes into one, use union.
Fill the missing columns in the dataframes that have fewer columns.
Merge them using union and reduce.
from functools import reduce
from pyspark.sql.functions import lit, col
# read each file into its own dataframe
df_list = [spark.read.csv("f{}.csv".format(i), header=True) for i in range(1, 6)]
# number of columns in the widest dataframe
max_cols = max(len(df.columns) for df in df_list)
# pad the narrower dataframes with null columns so every dataframe has max_cols columns
df_list = [df.select(*[col(c) for c in df.columns]
                     + [lit(None).alias("col_{}".format(i)) for i in range(len(df.columns), max_cols)])
           for df in df_list]
# union is positional, so the padded dataframes line up column by column
df_final = reduce(lambda x, y: x.union(y), df_list)
I reproduced your case on this github.
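A shorter variant, assuming Spark 3.1 or later, is to skip the manual padding and let unionByName fill the missing columns with nulls:
from functools import reduce
# sketch assuming Spark 3.1+, where unionByName accepts allowMissingColumns
df_list = [spark.read.csv("f{}.csv".format(i), header=True) for i in range(1, 6)]
df_final = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), df_list)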

It was a very easy fix. Here is what I did:
Files=['Data/f1.csv','Data/f2.csv','Data/f3.csv','Data/f4.csv','Data/f5.csv']
Files.reverse()
df = spark.read.csv(Files,inferSchema=True, header=True)
The last files had all the columns because columns were added incrementally, so reversing the list solved the issue.

Related

How can I replace selected columns at once using Spark

I am trying to replace values in many columns at a time using PySpark. I am able to do it with the code below, but it iterates over each column name, and when I have hundreds of columns it takes too much time.
from pyspark.sql.functions import col, when
df = sc.parallelize([(1,"foo","val","0","0","can","hello1","buz","oof"),
                     (2,"bar","check","baz","test","0","pet","stu","got"),
                     (3,"try","0","pun","0","you","omg","0","baz")
                    ]).toDF(["col1","col2","col3","col4","col5","col6","col7","col8","col9"])
df.show()
columns_for_replacement = ['col1','col3','col4','col5','col7','col8','col9']
replace_form = "0"
replace_to = "1"
for i in columns_for_replacement:
    df = df.withColumn(i, when((col(i) == replace_form), replace_to).otherwise(col(i)))
df.show()
Can anyone suggest how to replace all the selected columns at once?
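One way to avoid the per-column loop (a sketch, not from the original thread) is to build all the when/otherwise expressions up front and apply them in a single select, so the replacements happen in one projection:
from pyspark.sql.functions import col, when
# replace values only in the selected columns, pass the others through unchanged
replaced = [when(col(c) == replace_form, replace_to).otherwise(col(c)).alias(c)
            if c in columns_for_replacement else col(c)
            for c in df.columns]
df = df.select(replaced)
df.show()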

Read files with different column order

I have a few CSV files with headers, but I found out that some files have different column orders. Is there a way to handle this with Spark where I can define the select order for each file, so that the master DF doesn't have a mismatch where col x might have values from col y?
My current read -
val masterDF = spark.read.option("header", "true").csv(allFiles:_*)
Extract all file names and store them in a list variable.
Then define a schema with all the required columns in it.
Iterate through each file with header set to true, so each file is read separately.
unionAll the new dataframe with the existing dataframe.
Example:
file_lst=['<path1>','<path2>']
from pyspark.sql.functions import *
from pyspark.sql.types import *
#define schema for the required columns
schema = StructType([StructField("column1",StringType(),True),StructField("column2",StringType(),True)])
#create an empty dataframe
df=spark.createDataFrame([],schema)
for i in file_lst:
    tmp_df = spark.read.option("header","true").csv(i).select("column1","column2")
    df = df.unionAll(tmp_df)
#display results
df.show()
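Because every file is read with its header and the columns are selected by name, the column order inside any individual file no longer matters: Spark resolves column1 and column2 from each header before the unionAll.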

Difference in SparkSQL Dataframe columns

How do I locate the difference between the columns of 2 dataframes?
This is causing issues when I join 2 dataframes.
df1_cols = df1.columns
df2_cols = df2.columns
This returns the columns of the 2 dataframes in 2 list variables.
Thanks
df.columns returns a list here, so you can use any tool in Python to compare it with another list, i.e. df2_cols. For example, you can use set to check the common columns in the two DataFrames:
df1_cols = df1.columns
df2_cols = df2.columns
set(df1_cols).intersection(set(df2_cols)) # check common columns
set(df1_cols) - set(df2_cols) # check columns in df1 but not in df2
set(df2_cols) - set(df1_cols) # check columns in df2 but not in df1
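For illustration (these dataframes are made up, not from the question), the set operations behave like this:
df1 = spark.createDataFrame([(1, "a")], ["id", "name"])
df2 = spark.createDataFrame([(1, 2.0)], ["id", "score"])
set(df1.columns).intersection(set(df2.columns))  # {'id'}
set(df1.columns) - set(df2.columns)              # {'name'}
set(df2.columns) - set(df1.columns)              # {'score'}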

Convert datatypes for respective columns as per another dataframe

I have a PySpark dataframe with 100 cols:
df1=[(col1,string),(col2,double),(col3,bigint),..so on]
I have another pyspark dataframe df2 with same col count and col names but different datatypes.
df2=[(col1,bigint),(col2,double),(col3,string),..so on]
How do I make the datatypes of all the cols in df2 the same as the ones present in df1 for their respective cols?
It should happen iteratively, and if the datatypes already match then the column should not be changed.
If, as you said, the column names and column counts match, then you can simply loop over the schema of df1 and cast the columns of df2 to the dataTypes of df1:
from pyspark.sql import functions as F
df2 = df2.select([F.col(c.name).cast(c.dataType) for c in df1.schema])
You can use the cast function:
from pyspark.sql import functions as f
# get the (name, type) pairs for each DF
df1_schema = df1.dtypes
df2_schema = df2.dtypes
# iterate through the columns and cast those whose types differ
for (c1, d1), (c2, d2) in zip(df1_schema, df2_schema):
    # if the datatypes differ, cast the df2 column to the df1 type
    if d1 != d2:
        df2 = df2.withColumn(c2, f.col(c2).cast(d1))

Scala: read a csv file and display the data in a new column

I am new to Scala. I need to read data from a CSV file which has two header columns named Name and Marks. Based on the Marks column, I want to show the result in a 3rd column: pass or fail (<35 fail, >35 pass).
The data looks like this:
Name,Marks
x,10
y,50
z,80
Result should be:
Name  Marks  Result
x     10     Fail
y     50     Pass
z     80     Pass
You can read the csv file with header, then add a column by using when and otherwise to give different values depending on the marks.
import spark.implicits._
val df = spark.read.option("header", true).csv("/path/to/csv") // read csv
val df2 = df.withColumn("Result", when($"Marks" < 35, "Fail").otherwise("Pass"))
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local")
.appName("").config("spark.sql.warehouse.dir", "C:/temp").getOrCreate()
val df = spark.read.option("header",true).csv("file path")
val resul = df.withColumn("Result", when(col("Marks").cast("Int")>=35, "PASS").otherwise("FAIL"))
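Calling resul.show() then displays Name, Marks and the new Result column together.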