I am writing a dataframe to s3 as shown below. Target location: s3://test/folder
val targetDf = spark.read.schema(schema).parquet(targetLocation)
val df1=spark.sql("select * from sourceDf")
val df2=spark.sql("select * from targetDf")
/*
for loop over a date range to dedup and write the data to s3
union dfs and run a dedup logic, have omitted dedup code and for loop
*/
val df3=spark.sql("select * from df1 union all select * from df2")
df3.write.partitionBy("data_id", "schedule_dt").parquet(targetLocation)
Spark is creating an extra partition column on write, as shown below:
Exception in thread "main" java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
Partition column name list #0: data_id, schedule_dt
Partition column name list #1: data_id, schedule_dt, schedule_dt
The EMR optimizer class is enabled while writing, and I am using Spark 2.4.3.
Please let me know what could be causing this error.
Thanks
Abhineet
You should select at least one extra column apart from the partition columns. Can you please try
val df3=df1.union(df2)
instead of
val df3=spark.sql("select data_id,schedule_dt from df1 union all select data_id,schedule_dt from df2")
I'm selecting a column from a data frame. I would like to cast it as a string so that it can be used to build a Cosmos DB dynamic query. Calling collect() on the data frame complains that queries with streaming sources must be executed with writeStream.start();;
val DF = AppointmentDF
.select("*")
.filter($"xyz" === "abc")
DF.createOrReplaceTempView("MyTable")
val column1DF = spark.sql("SELECT column1 FROM MyTable")
// This is not getting resolved
val sql="select c.abc from c where c.column = \"" + String.valueOf(column1DF) + "\""
println(sql)
Error:
org.apache.spark.sql.AnalysisException: cannot resolve '`column1DF`' given input columns: []; line 1 pos 12;
DF.collect().foreach { row =>
println(row.mkString(","))
}
Error:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must
be executed with writeStream.start();;
A dataframe is a distributed data structure, not a structure located on your machine that can be printed. The values DF and column1DF are themselves dataframes. To bring all the data of your queries to the driver node, you can use the dataframe method collect and extract your value from the returned Array of rows.
Collect can be harmful if you are bringing gigabytes of data to the memory of your driver node.
You can use collect and take the head to get the first row of the DataFrame:
val column1DF = spark.sql("SELECT column1 FROM MyTable").collect().head.getAs[String](0)
val sql="select c.abc from c where c.column = \"" + column1DF + "\""
I am trying to convert PySpark code to Spark Scala and I am facing the below error.
PySpark code:
import pyspark.sql.functions as fn
valid_data = (bcd_df.filter(fn.lower(bcd_df.table_name) == tbl_nme)
              .select("valid_data").rdd
              .map(lambda x: x[0])
              .collect()[0])
From the bcd_df dataframe I am getting a column with table_name, matching the value of table_name with the argument tbl_nme that I am passing, and then selecting the valid_data column data.
Here is the code in Spark Scala.
val valid_data = bcd_df.filter(col(table_name)===tbl_nme).select(col("valid_data")).rdd.map(x=> x(0)).collect()(0)
Error as below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`abcd`' given input
columns:
Not sure why it is taking abcd as column.
Any help is appreciated.
Version:
Scala 2.11.8
Spark 2.3
Enclose the table_name column name in quotes (") inside col:
val valid_data =bcd_df.filter(col("table_name")===tbl_nme).select(col("valid_data")).rdd.map(x=> x(0)).collect()(0)
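If you also want to mirror the fn.lower() comparison from the PySpark version, here is a sketch of the equivalent Scala (assuming tbl_nme is a String; lower-casing both sides keeps the match case-insensitive):
import org.apache.spark.sql.functions.{col, lower}

// Sketch: mirrors the PySpark fn.lower(...) filter; assumes tbl_nme is a String.
val valid_data = bcd_df
  .filter(lower(col("table_name")) === tbl_nme.toLowerCase)
  .select(col("valid_data"))
  .rdd
  .map(x => x(0))
  .collect()(0)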
I see code from the book "Spark: The Definitive Guide" that invokes drop on a dataframe with no parameter. When I use show(), I find nothing changed, but what is the meaning of it?
I executed it and nothing changed; dfNoNull.show() is the same as dfWithDate.show().
dfWithDate.createOrReplaceTempView("dfWithDate")
// in Scala
val dfNoNull = dfWithDate.drop()
dfNoNull.createOrReplaceTempView("dfNoNull")
Does it mean it creates a new dataframe?
I know that when a dataframe joins itself using Hive SQL, if I just write
val df1=spark.sql("select id,date from date")
val df2=spark.sql("select id,date from date")
val joinedDf = spark.sql("select dateid1,dateid2 from sales")
.join(df1, df1("id") === col("dateid1")).join(df2, df2("id") === col("dateid2"))
then an error occurs: Cartesian join!
because lazy evaluation will consider df1 and df2 to be the same one
So here, if I write
val df2=df1.drop()
will I prevent that error?
If not, what does the drop method with no parameter mean?
Or does it just mean removing the temp view name and creating a new one?
But when I try the code below, no exception is thrown:
val df= Seq((1,"a")).toDF("id","name")
df.createOrReplaceTempView("df1")
val df2=df.drop()
df2.createOrReplaceTempView("df2")
spark.sql("select * from df1").show()
Or does the book mean the following?
val dfNoNull = dfWithDate.na.drop()
because it says this somewhere below the code:
Grouping sets depend on null values for aggregation levels. If you do not filter out null values, you will get incorrect results. This applies to cubes, rollups, and grouping sets.
The drop function with no parameter behaves the same as drop with a column name that doesn't exist in the DataFrame.
You can follow the code in the Spark source.
Even in the function documentation you can see a hint of this behavior.
/**
* Returns a new Dataset with a column dropped. This is a no-op if schema doesn't contain
* column name.
*
* This method can only be used to drop top level columns. the colName string is treated
* literally without further interpretation.
*
* @group untypedrel
* @since 2.0.0
*/
So when calling the function with no parameter, a no-op occurs and nothing changes in the returned DataFrame.
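A small illustration of the difference, as a sketch on a throwaway dataframe: drop() with no argument leaves the rows untouched, while na.drop() (the variant the question mentions) removes rows containing nulls:
val df = Seq((Some(1), "a"), (None, "b")).toDF("id", "name")

df.drop().count()     // 2 -- drop() with no argument is a no-op
df.na.drop().count()  // 1 -- na.drop() removes the row with the null id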
While writing data into hive partitioned table, I am getting below error.
org.apache.spark.SparkException: Requested partitioning does not match the tablename table:
I have converted my RDD to a DF using a case class, and then I am trying to write the data into the existing hive partitioned table. But I am getting this error, and as per the printed logs, "Requested partitions:" comes up blank. Partition columns are showing up as expected in the hive table.
spark-shell error :-
scala> data1.write.format("hive").partitionBy("category", "state").mode("append").saveAsTable("sampleb.sparkhive6")
org.apache.spark.SparkException: Requested partitioning does not match the sparkhive6 table:
Requested partitions:
Table partitions: category,state
Hive table format :-
hive> describe formatted sparkhive6;
OK
col_name data_type comment
txnno int
txndate string
custno int
amount double
product string
city string
spendby string
Partition Information
col_name data_type comment
category string
state string
Try with insertInto() function instead of saveAsTable().
scala> data1.write.format("hive")
.partitionBy("category", "state")
.mode("append")
.insertInto("sampleb.sparkhive6")
(or)
Register a temp view on top of the dataframe, then use a SQL statement to insert the data into the hive table.
scala> data1.createOrReplaceTempView("temp_vw")
scala> spark.sql("insert into sampleb.sparkhive6 partition(category,state) select txnno,txndate,custno,amount,product,city,spendby,category,state from temp_vw")
I use Scala/Spark to insert data into a Hive parquet table as follows:
for(*lots of current_Period_Id*){//This loop is on a result of another query that returns multiple rows of current_Period_Id
val myDf = hiveContext.sql(s"""SELECT columns FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""")
val count: Int = myDf.count().toInt
if(count>0){
hiveContext.sql(s"""INSERT INTO destinationtable PARTITION(period_id=$current_Period_Id) SELECT columns FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""")
}
}
This approach takes a lot of time to complete because the select statement is being executed twice.
I'm trying to avoid selecting data twice and one way I've thought of is writing the dataframe myDf to the table directly.
This is the gist of the code I'm trying to use for the purpose
val sparkConf = new SparkConf().setAppName("myApp")
.set("spark.yarn.executor.memoryOverhead","4096")
val sc = new SparkContext(sparkConf)
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition","true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
for(*lots of current_Period_Id*){//This loop is on a result of another query
val myDf = hiveContext.sql(s"""SELECT COLUMNS FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""")
val count: Int = myDf.count().toInt
if(count>0){
myDf.write.mode("append").format("parquet").partitionBy("PERIOD_ID").saveAsTable("destinationtable")
}
}
But I get an error in the myDf.write part.
java.util.NoSuchElementException: key not found: period_id
The destination table is partitioned by period_id.
Could someone help me with this?
The spark version I'm using is 1.5.0-cdh5.5.2.
The dataframe schema and the table's description differ from each other: PERIOD_ID != period_id. The column name is in upper case in your DF write (partitionBy("PERIOD_ID")) but in lower case in the table. Try it with the lowercase period_id.
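A sketch of the corrected write from the question, forcing the partition column to the lowercase name the table expects (myDf and destinationtable as in the question; the rename is harmless if the column is already lowercase):
// Sketch: rename the column so partitionBy matches the table's lowercase period_id.
val myDfFixed = myDf.withColumnRenamed("PERIOD_ID", "period_id")

myDfFixed.write
  .mode("append")
  .format("parquet")
  .partitionBy("period_id")
  .saveAsTable("destinationtable")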