PySpark: Problem building logic to read data from multiple format files

I am facing an issue creating an empty dataframe with a defined set of columns from a list. I'll try to explain the issue here.
I don't know how to create the empty data frame, or what the best way is to iterate over files in multiple formats and merge the data into a single data frame.
from pyspark.sql.types import StructType, StructField, StringType

list_of_columns = ["a", "b", "c", "d"]
# build an explicit schema from the column names (all strings here)
schema = StructType([StructField(c, StringType(), True) for c in list_of_columns])
finalDF = spark.createDataFrame([], schema=schema)

for file in list_of_files:
    if file.endswith('.csv'):
        df = spark.read.csv(file)
        finalDF = finalDF.union(df)
    elif file.endswith('.parquet'):
        df = spark.read.parquet(file)
        finalDF = finalDF.union(df)

finalDF.show()

Related

Compare two spark dataframes column wise and fetch the mismatch records

I want to iterate over and compare the columns between two Spark dataframes and store the mismatched records.
I am getting the mismatched records in dataframe format, so I want to store them in a variable, since a dataframe is immutable. Please suggest how to store the dataframe output as rows and columns in a variable or collection.
import scala.collection.mutable.ArrayBuffer

var mismatchValues = new ArrayBuffer[String]()
val columns1 = srcTable_colMismatch.schema.fields.map(_.name.toString)
val selectiveDifference = columns1.map(c => srcTable_colMismatch.select(c, "hash_key", "row_num")
  .exceptAll(tgtTable_colMismatch.select(c, "hash_key", "row_num").as(c)))
selectiveDifference.zipWithIndex.foreach { case (e, i) =>
  if (e.count > 0) mismatchValues += sortedMismatchRecords.select("*").as("SRC")
    .join(e.as("dif"), $"SRC.hash_key" === $"dif.hash_key")
    .select("SRC.*").collect.mkString(",")
}
val convertedDF = mismatchValues.map(a => a.toString).toDF()
convertedDF.show()
Spark can do this for you:
df1.union(df2).subtract(df1.intersect(df2))
I strongly discourage you from using a variable to compare unless you have it on good authority that it will fit in memory.
df1.union(df2)           // create one data set
  .subtract(             // remove items that match this data frame
    df1.intersect(df2)   // all items that are in both dataframes
  )
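For illustration, a small self-contained sketch (toy frames, not the question's data) of what that expression returns:
import spark.implicits._

// two toy frames that agree on one row and disagree on the rest
val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("hash_key", "value")
val df2 = Seq((1, "a"), (2, "x"), (4, "d")).toDF("hash_key", "value")

// rows that appear in only one of the two frames: (2,b), (3,c), (2,x), (4,d)
df1.union(df2).subtract(df1.intersect(df2)).show()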

Spark : Writing data frame to s3 bucket

I am trying to write DF data to an S3 bucket. It is working fine as expected. Now I want to write to the S3 bucket based on a condition.
The data frame has one column named Flag, whose values are T and F. The condition is: if Flag is F, the data should be written to the S3 bucket, otherwise not. Please find the details below.
DF Data :
1015,2017/08,新潟,101,SW,39,1015,2017/08,山形,101,SW,10,29,74.35897435897436,11.0,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,大分,101,SW,14,25,64.1025641025641,15.4,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,山口,101,SW,6,33,84.61538461538461,6.6,T
1015,2017/08,新潟,101,SW,39,1015,2017/08,愛媛,101,SW,5,34,87.17948717948718,5.5,T
1015,2017/08,新潟,101,SW,39,1015,2017/08,神奈川,101,SW,114,75,192.30769230769232,125.4,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,富山,101,SW,12,27,69.23076923076923,13.2,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,高知,101,SW,3,36,92.3076923076923,3.3,T
1015,2017/08,新潟,101,SW,39,1015,2017/08,岩手,101,SW,11,28,71.7948717948718,12.1,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,三重,101,SW,45,6,15.384615384615385,49.5,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,京都,101,SW,23,16,41.02564102564102,25.3,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,静岡,101,SW,32,7,17.94871794871795,35.2,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,鹿児島,101,SW,18,21,53.84615384615385,19.8,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,福島,101,SW,17,22,56.41025641025641,18.7,F
Code :
val df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("s3a://test_system/transcation.csv")
df.createOrReplaceTempView("data")
val res = spark.sql("select count(*) from data")
res.show(10)
res.coalesce(1).write.format("csv").option("header","true").mode("Overwrite")
.save("s3a://test_system/Output/Test_Result")
res.createOrReplaceTempView("res1")
val res2 = spark.sql("select distinct flag from res1 where flag = 'F'")
if (res2 ==='F')
{
// writing the raw data (transcation.csv) to the S3 bucket
df.write.format("csv").option("header","true").mode("Overwrite")
.save("s3a://test_system/Output/Test_Result/rawdata")
}
I am trying this approach but it is not exporting the df data to the S3 bucket.
How can I export/write data to the S3 bucket using a condition?
Many thanks for your help.
I am assuming you want to write the dataframe when an "F" flag is present in the dataframe.
val df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("s3a://test_system/transcation.csv")
df.createOrReplaceTempView("data")
val res = spark.sql("select count(*) from data")
res.show(10)
res.coalesce(1).write.format("csv").option("header","true").mode("Overwrite")
.save("s3a://test_system/Output/Test_Result")
res.createOrReplaceTempView("res1")
Here we query the data table, since the res1 table you created above is just a count table. Also, from the result dataframe we select just the first row using the first() function, and the first column of that row using getAs[String](0):
val res2 = spark.sql("select distinct flag from data where flag = 'F'").first().getAs[String](0)
println("Printing out res2 = " + res2)
Here we compare the string extracted above with the string "F". Remember that "F" is a String while 'F' is a Char in Scala.
if (res2.equals("F"))
{
println("Inside the if loop")
//writing to s3 bucket as raw data .Here transcation.csv file.
df.write.format("csv").option("header","true").mode("Overwrite")
.save("s3a://test_system/Output/Test_Result/rawdata")
}
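One thing to watch out for: first() throws a NoSuchElementException when no row matches, i.e. when the data contains no 'F' flag at all. A slightly more defensive variant of the same check, just as a sketch reusing the data view registered above:
// take(1) returns an empty array instead of throwing when nothing matches
val hasF = spark.sql("select 1 from data where flag = 'F'").take(1).nonEmpty
if (hasF)
{
  // write the raw transcation.csv data to the S3 bucket
  df.write.format("csv").option("header","true").mode("Overwrite")
    .save("s3a://test_system/Output/Test_Result/rawdata")
}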

Scala Spark: Order changes when writing a DataFrame to a CSV file

I have two data frames which I am merging using union. After performing the union, printing the final dataframe with df.show() shows the records in the intended order (the first dataframe's records on top, followed by the second dataframe's records). But when I write this final data frame to a CSV file, the records from the first data frame, which I want at the top of the CSV file, lose their position: the first data frame's records get mixed with the second dataframe's records. Any help would be appreciated.
Below is a code sample:
val intVar = 1
val myList = List(("hello",intVar))
val firstDf = myList.toDF()
val secondDf: DataFrame = testRdd.toDF()
val finalDF = firstDf.union(secondDf)
finalDF.show() // prints the dataframe with firstDf records on the top followed by the secondDf records
val outputFilePath = "/home/out.csv"
finalDF.coalesce(1).write.csv(outputFilePath) // the first Df records are getting mixed with the second Df records
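One common workaround, sketched here under the assumption that the two frames have matching schemas (the src_order column name is made up for illustration): Spark gives no ordering guarantee once rows are spread across partitions, so make the intended order explicit, sort on it, and only then write.
import org.apache.spark.sql.functions.lit

// tag each source so the union can be sorted back into the intended order
val orderedDF = firstDf.withColumn("src_order", lit(1))
  .union(secondDf.withColumn("src_order", lit(2)))
  .orderBy("src_order")
  .drop("src_order")

orderedDF.coalesce(1).write.csv(outputFilePath)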

Spark scala- How to apply transformation logic on a generic set of columns defined in a file

I am using Spark 1.6 with Scala.
I have 2 files: one is a schema file which has hundreds of column names separated by commas, and the other is a .gz file which contains the data.
I am trying to read the data using the schema file and apply different transformation logic to a set of a few columns.
I tried running a sample code, but I have hardcoded the column numbers in the attached pic.
I also want to write a UDF that could read any set of columns and apply a transformation, such as replacing a special character, and return the output.
I would appreciate any suggestions.
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

val rdd1 = sc.textFile("../inp2.txt")
val rdd2 = rdd1.map(line => line.split("\t")(1)).toDF
val replaceUDF = udf { s: String => s.replace(".", "") }
rdd2.withColumn("replace", replaceUDF('_1)).show
You can read the field-name file with plain Scala code and create a list of column names as follows:
// this reads the schema file and creates a list of column names
import scala.io.Source
val line = Source.fromFile("path to file").getLines().toList.head
val columnNames = line.split(",")

// read the text file as an RDD, split each line, and build a DataFrame with that schema
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
val schema = StructType(columnNames.map(StructField(_, StringType, nullable = true)))
val rdd1 = sc.textFile("../inp2.txt")
val df = sqlContext.createDataFrame(rdd1.map(line => Row.fromSeq(line.split("\t"))), schema)
This creates a dataframe with the column names that you have in the separate file.
Hope this helps!
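For the generic-UDF part of the question, one common pattern is to fold the same transformation over whichever columns you choose. A sketch, starting from the df built above; columnsToClean and the replace logic are illustrative placeholders:
import org.apache.spark.sql.functions.udf

// columns to transform; in practice this list could also be read from a file
val columnsToClean = Seq("col1", "col2")
val replaceUDF = udf { s: String => if (s == null) null else s.replace(".", "") }

// apply the same UDF to every listed column, overwriting each column in place
val cleanedDF = columnsToClean.foldLeft(df) { (acc, colName) =>
  acc.withColumn(colName, replaceUDF(acc(colName)))
}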

null pointer exception while converting dataframe to list inside udf

I am reading 2 different .csv files, each of which has only one column, as below:
val dF1 = sqlContext.read.csv("some.csv").select($"ID")
val dF2 = sqlContext.read.csv("other.csv").select($"PID")
I am trying to check whether dF2("PID") exists in dF1("ID"):
val getIdUdf = udf((x:String)=>{dF1.collect().map(_(0)).toList.contains(x)})
val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))
This gives me a null pointer exception.
But if I collect dF1 to a list outside the UDF and use that list inside the UDF, it works:
val dF1 = sqlContext.read.csv("some.csv").select($"ID").collect().map(_(0)).toList
val getIdUdf = udf((x:String)=>{dF1.contains(x)})
val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))
I know I can use a join to get this done, but I want to know the reason for the null pointer exception here.
Thanks.
Please check this question about accessing a dataframe inside the transformation of another dataframe. This is exactly what you are doing with your UDF, and it is not possible in Spark: a dataframe can only be used on the driver, so a dataframe referenced inside an executor-side UDF is effectively null there. The solution is either to use a join, or to collect outside of the transformation and broadcast.
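A minimal sketch of the collect-and-broadcast option, assuming a SparkSession named spark and the same dF1/dF2 as in the question:
import spark.implicits._
import org.apache.spark.sql.functions.udf

// collect the IDs once on the driver and broadcast the resulting set to the executors
val ids = dF1.collect().map(_.getString(0)).toSet
val idsBroadcast = spark.sparkContext.broadcast(ids)

val hasIdUdf = udf((x: String) => idsBroadcast.value.contains(x))
val dfFinal = dF2.withColumn("hasId", hasIdUdf($"PID"))

If dF1 is too large to collect, the join approach is the safer choice, since nothing has to fit on the driver.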