Spark: Writing a data frame to an S3 bucket - Scala

I am trying to write DF data to an S3 bucket. That part is working fine as expected. Now I want to write to the S3 bucket based on a condition.
The data frame has one column named Flag, and its values are T and F. The condition is: if Flag is F then the data should be written to the S3 bucket, otherwise not. Please find the details below.
DF Data :
1015,2017/08,新潟,101,SW,39,1015,2017/08,山形,101,SW,10,29,74.35897435897436,11.0,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,大分,101,SW,14,25,64.1025641025641,15.4,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,山口,101,SW,6,33,84.61538461538461,6.6,T
1015,2017/08,新潟,101,SW,39,1015,2017/08,愛媛,101,SW,5,34,87.17948717948718,5.5,T
1015,2017/08,新潟,101,SW,39,1015,2017/08,神奈川,101,SW,114,75,192.30769230769232,125.4,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,富山,101,SW,12,27,69.23076923076923,13.2,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,高知,101,SW,3,36,92.3076923076923,3.3,T
1015,2017/08,新潟,101,SW,39,1015,2017/08,岩手,101,SW,11,28,71.7948717948718,12.1,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,三重,101,SW,45,6,15.384615384615385,49.5,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,京都,101,SW,23,16,41.02564102564102,25.3,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,静岡,101,SW,32,7,17.94871794871795,35.2,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,鹿児島,101,SW,18,21,53.84615384615385,19.8,F
1015,2017/08,新潟,101,SW,39,1015,2017/08,福島,101,SW,17,22,56.41025641025641,18.7,F
Code :
val df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("s3a://test_system/transcation.csv")
df.createOrReplaceTempView("data")
val res = spark.sql("select count(*) from data")
res.show(10)
res.coalesce(1).write.format("csv").option("header","true").mode("Overwrite")
.save("s3a://test_system/Output/Test_Result")
res.createOrReplaceTempView("res1")
val res2 = spark.sql("select distinct flag from res1 where flag = 'F'")
if (res2 === 'F')
{
  // write the raw data (transcation.csv) to the S3 bucket
  df.write.format("csv").option("header","true").mode("Overwrite")
    .save("s3a://test_system/Output/Test_Result/rawdata")
}
I am trying this approach, but it is not exporting the df data to the S3 bucket.
How can I export/write data to an S3 bucket based on a condition?
Many thanks for your help.

I am assuming you want to write the dataframe when an "F" flag is present in the dataframe.
val df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("s3a://test_system/transcation.csv")
df.createOrReplaceTempView("data")
val res = spark.sql("select count(*) from data")
res.show(10)
res.coalesce(1).write.format("csv").option("header","true").mode("Overwrite")
.save("s3a://test_system/Output/Test_Result")
res.createOrReplaceTempView("res1")
Here we query the data table, since the res1 table you created above is just a count table. From the result dataframe we select only the first row using the first() function, and the first column of that row using getAs[String](0).
val res2 = spark.sql("select distinct flag from data where flag = 'F'").first().getAs[String](0)
println("Printing out res2 = " + res2)
Here we compare the string extracted above with the string "F". Remember that "F" is a String while 'F' is a Char in Scala.
if (res2.equals("F"))
{
println("Inside the if loop")
//writing to s3 bucket as raw data .Here transcation.csv file.
df.write.format("csv").option("header","true").mode("Overwrite")
.save("s3a://test_system/Output/Test_Result/rawdata")
}
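A more defensive variant (just a sketch, not part of the original answer; the column name flag is taken from the question): first() throws a NoSuchElementException when no 'F' row exists, so it can be safer to check for the flag's presence before writing.
import org.apache.spark.sql.functions.col
// True when at least one row carries the 'F' flag; limit(1) keeps the scan cheap.
val hasF = df.filter(col("flag") === "F").limit(1).count() > 0
if (hasF) {
  // write the raw data (transcation.csv) to the S3 bucket
  df.write.format("csv").option("header","true").mode("Overwrite")
    .save("s3a://test_system/Output/Test_Result/rawdata")
}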

Related

PySpark: Problem building logic to read data from multiple format files

I am facing an issue creating an empty dataframe with a defined list of columns. I'll try to explain the issue here.
I don't know how to create the empty data frame, and I'm not sure of the best way to iterate over each file with its format and merge the data into a single data frame.
list_of_columns = ['a', 'b', 'c', 'd']
finalDF = spark.createDataFrame([], schema=list_of_columns)
for file in list_of_files:
    if format == '.csv':
        df1 = spark.read.csv(CSVFile)
        finalDF = finalDF.union(df1)
    elif format == '.parquet':
        df2 = spark.read.parquet(ParquetFile)
        finalDF = finalDF.union(df2)
finalDF.show()

Not able to insert Value using SparkSql

I need to insert some values into my Hive table using Spark SQL. I'm using the code below.
val filepath:String = "/user/usename/filename.csv'"
val fileName : String = filepath
val result = fileName.split("/")
val fn=result(3) //filename
val e=LocalDateTime.now() //timestamp
First I tried using INSERT INTO ... VALUES, but then I found that this feature is not available in Spark SQL.
val ds = sparksession.sql(s"insert into mytable(filepath, filename, Start_Time) values('${filepath}','${fn}','${e}')")
Is there any other way to insert these values using Spark SQL? (mytable is empty, and I need to load this table every day.)
You can use the Spark DataFrame write API directly to insert data into the table.
If you do not have a Spark DataFrame yet, first create one using spark.createDataFrame(), then write the data as follows:
df.write.insertInto("name of hive table")
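As a minimal sketch of that route (the toInsert variable name is an assumption; the table and column names follow the question): build a small DataFrame from local values and append it to the Hive table. insertInto matches columns by position, so the DataFrame's column order must match the table's.
import java.time.LocalDateTime
import sparksession.implicits._
val filepath = "/user/usename/filename.csv"
val fn = filepath.split("/").last      // filename
val e = LocalDateTime.now().toString   // timestamp
// Column order must line up with the Hive table definition.
val toInsert = Seq((filepath, fn, e)).toDF("file_path", "filename", "Start_Time")
toInsert.write.insertInto("dbname.tablename")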
Hi, the code below worked for me. Since I need to use variables in my dataframe, I first created a dataframe from the selected data and then saved it to the Hive table using df.write.insertInto(tablename).
val filepath:String = "/user/usename/filename.csv'"
val fileName : String = filepath
val result = fileName.split("/")
val fn=result(3) //filename
val e=LocalDateTime.now() //timestamp
val df1=sparksession.sql(s" select '${filepath}' as file_path,'${fn}' as filename,'${e}' as Start_Time")
df1.write.insertInto("dbname.tablename")

Should I cache or not my unified dataframes?

I am not familiar with caching in Spark.
I need to do multiple DF unions inside a loop, and each union adds a few million lines. Should I df.cache my result after each union?
var DB_List = List ("Database1", "Database2", "Database3", "Database4", "Database5", "Database6", "Database7", "Database8", "Database9", "Database10")
var df = getDF(spark, DB_List(0)) // this returns a DF.
for (i <- 1 until DB_List.length) {
  df = df.union(getDF(spark, DB_List(i)))
  //df.cache or not?
}
// Here, I use df.repartition(1) to write the resulting DF to a CSV file.
You don't need to cache the intermediate results, only the final one.
Instead of a for loop you can use a fold (reduce):
val dfs = DB_List.map(getDF(spark, _))
val result = dfs.reduce(_ union _)
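As a follow-up sketch (the output path below is made up): caching only pays off if result is reused by more than one action, for example a count followed by the CSV write.
val cached = result.cache()
println(s"total rows: ${cached.count()}")   // first action materialises the cache
cached.repartition(1)
  .write
  .option("header", "true")
  .csv("/tmp/unified_output")               // second action reuses the cached data
cached.unpersist()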

Scala Spark: Order changes when writing a DataFrame to a CSV file

I have two data frames which I am merging using union. After performing the union, printing the final dataframe with df.show() shows that the records are in the intended order (the first dataframe's records on top, followed by the second dataframe's records). But when I write this final data frame to a CSV file, the records from the first data frame, which I want at the top of the CSV file, lose their position: the first data frame's records get mixed with the second dataframe's records. Any help would be appreciated.
Below is the code sample:
val intVar = 1
val myList = List(("hello",intVar))
val firstDf = myList.toDF()
val secondDf: DataFrame = testRdd.toDF()
val finalDF = firstDf.union(secondDf)
finalDF.show() // prints the dataframe with firstDf records on the top followed by the secondDf records
val outputFilePath = "/home/out.csv"
finalDF.coalesce(1).write.csv(outputFilePath) //the first Df records are getting mixed with the second Df records.
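One common workaround (a sketch; the extra column name "src" is made up): tag each source DataFrame before the union, then sort on that tag inside a single partition before writing, so the firstDf records end up at the top of the file.
import org.apache.spark.sql.functions.lit
val tagged = firstDf.withColumn("src", lit(1))
  .union(secondDf.withColumn("src", lit(2)))
tagged.coalesce(1)               // one output file, as in the question
  .sortWithinPartitions("src")   // enforce the order inside that single partition
  .drop("src")
  .write
  .csv(outputFilePath)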

Timestamp issue when loading CSV to dataframe

I am trying to load a CSV file into a distributed dataframe (ddf) while giving a schema. The ddf gets loaded, but the timestamp column shows only null values. I believe this happens because Spark expects the timestamp in a particular format. So I have two questions:
1) How do I give Spark the format, or make it detect the format (like "MM/dd/yyyy' 'HH:mm:ss")?
2) If 1 is not an option, how do I convert the field (assuming I imported it as a String) to a timestamp?
For question 2 I have tried the following:
def convert(row: org.apache.spark.sql.Row): org.apache.spark.sql.Row = {
  import org.apache.spark.sql.Row
  val format = new java.text.SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
  val d1 = getTimestamp(row(3))
  return Row(row(0), row(1), row(2), d1)
}
val rdd1 = df.map(convert)
val df1 = sqlContext.createDataFrame(rdd1,schema1)
The last step doesn't work, as there are null values which don't let it finish. I get errors like:
java.lang.RuntimeException: Failed to check null bit for primitive long value.
sqlContext.load, however, is able to load the CSV without any problems:
val df = sqlContext.load("com.databricks.spark.csv", schema, Map("path" -> "/path/to/file.csv", "header" -> "true"))
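For question 2, a minimal sketch (assuming the timestamp column was read as a String named "ts"; unix_timestamp(Column, String) is available from Spark 1.5 onward, and on Spark 2.x the built-in CSV reader also accepts a timestampFormat option, which answers question 1):
import org.apache.spark.sql.functions.{col, unix_timestamp}
// Parse the string column with the expected pattern and cast the epoch
// seconds to a proper timestamp; unparseable values become null.
val withTs = df.withColumn("ts_parsed",
  unix_timestamp(col("ts"), "MM/dd/yyyy HH:mm:ss").cast("timestamp"))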