I am trying to convert PySpark code to Spark Scala and I am facing the error below.
PySpark code:
import pyspark.sql.functions as fn
valid_data = (bcd_df.filter(fn.lower(bcd_df.table_name) == tbl_nme)
              .select("valid_data").rdd
              .map(lambda x: x[0])
              .collect()[0])
From the bcd_df dataframe I take the table_name column, match its value against the tbl_nme argument I am passing, and then select the data in the valid_data column.
Here is the code in Spark Scala:
val valid_data = bcd_df.filter(col(table_name) === tbl_nme).select(col("valid_data")).rdd.map(x => x(0)).collect()(0)
Error as below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`abcd`' given input columns:
Not sure why it is treating abcd as a column.
Any help is appreciated.
Version: Scala 2.11.8, Spark 2.3
Enclose the table_name column name in quotes (") inside col. Without the quotes, table_name is resolved as a Scala variable, so Spark looks for a column named after its value (abcd):
val valid_data = bcd_df.filter(col("table_name") === tbl_nme).select(col("valid_data")).rdd.map(x => x(0)).collect()(0)
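Note that the PySpark version also lower-cases the column before comparing. A closer Scala equivalent, as a sketch (assuming tbl_nme is passed in lower case, as in the PySpark code), would be:
import org.apache.spark.sql.functions.{col, lower}

// assumes tbl_nme already holds the lower-cased table name argument
val valid_data = bcd_df
  .filter(lower(col("table_name")) === tbl_nme)
  .select(col("valid_data"))
  .rdd
  .map(x => x(0))
  .collect()(0)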
I am trying to cast an array as Decimal(30,0) for use in a select dynamically as:
WHERE array_contains(myArrayUDF(), someTable.someColumn)
However when casting with:
val arrIds = someData.select("id").withColumn("id", col("id")
.cast(DecimalType(30, 0))).collect().map(_.getDecimal(0))
Databricks accepts that, however the resulting signature already looks wrong:
intArrSurrIds: Array[java.math.BigDecimal] = Array(2181890000000,...) // ie, a BigDecimal
Which results in the below error:
Error in SQL statement: AnalysisException: cannot resolve.. due to data type mismatch: Input to function array_contains should have been array followed by a value with same element type, but it's [array<decimal(38,18)>, decimal(30,0)]
How do you correctly cast as decimal(30,0) in Spark Databricks Scala notebook instead of decimal(38,18) ?
Any help appreciated!
You can make arrIds an Array[Decimal] using the code below:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{Decimal, DecimalType}
val arrIds = someData.select("id")
.withColumn("id", col("id").cast(DecimalType(30, 0)))
.collect()
.map(row => Decimal(row.getDecimal(0), 30, 0))
However, it will not solve your problem, because you lose the precision and scale once you create your user defined function, as I explain in this answer.
To solve your problem, you need to cast the column someTable.someColumn to Decimal with the same precision and scale as the UDF return type. So your WHERE clause should be:
WHERE array_contains(myArray(), cast(someTable.someColumn as Decimal(38, 18)))
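If you prefer the DataFrame API over a SQL WHERE clause, a minimal sketch of the same idea (assuming myArrayUDF is registered as a SQL function, as the question's WHERE clause implies) could be:
import org.apache.spark.sql.functions.expr

// myArrayUDF is assumed to be registered via spark.udf.register, as implied by its use in SQL
val filtered = someTable.filter(
  expr("array_contains(myArrayUDF(), cast(someColumn as decimal(38, 18)))")
)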
I am writing a dataframe to s3 as shown below. Target location: s3://test/folder
val targetDf = spark.read.schema(schema).parquet(targetLocation)
val df1 = spark.sql("select * from sourceDf")
val df2 = spark.sql("select * from targetDf")
/*
for loop over a date range to dedup and write the data to s3
union dfs and run a dedup logic, have omitted dedup code and for loop
*/
val df3=spark.sql("select * from df1 union all select * from df2")
df3.write.partitionBy("data_id", "schedule_dt").parquet(targetLocation)
Spark is creating an extra partition column on write, as shown below:
Exception in thread "main" java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
Partition column name list #0: data_id, schedule_dt
Partition column name list #1: data_id, schedule_dt, schedule_dt
The EMR optimizer class is enabled while writing, and I am using Spark 2.4.3.
Please let me know what could be causing this error.
Thanks
Abhineet
You should keep at least one extra column apart from the partition columns. Can you please try
val df3=df1.union(df2)
instead of
val df3=spark.sql("select data_id,schedule_dt from df1 union all select data_id,schedule_dt from df2")
I have this code:
val o = p_value.alias("d1").join(t_d.alias("d2"),
(col("d1.origin_latitude")===col("d2.origin_latitude")&&
col("d1.origin_longitude")===col("d2.origin_longitude")),"left").
filter(col("d2.origin_longitude").isNull)
val c = p_value2.alias("d3").join(o.alias("d4"),
(col("d3.origin_latitude")===col("d4.origin_latitude") &&
col("d3.origin_longitude")===col("d4.origin_longitude")),"left").
filter(col("d3.origin_longitude").isNull)
I get this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 'd4.origin_latitude' is ambiguous, could be: d4.origin_latitude, d4.origin_latitude.;
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:101)
On this line
(col("d3.origin_latitude")===col("d4.origin_latitude") && col("d3.origin_longitude")===col("d4.origin_longitude")),"left").
Any idea?
Thank you.
You are aliasing the DataFrame, not the columns; the alias is only used to access/refer to columns of that DataFrame.
So the first join results in another DataFrame that has the same column names twice (origin_latitude as well as origin_longitude). Once you try to access one of these columns in the resulting DataFrame, you get the ambiguity error.
So you need to make sure that DataFrame contains each column only once.
You can rewrite the first join as below:
p_value
.join(t_d, Seq("origin_latitude", "origin_longitude"), "left")
.filter(t_d.col("origin_longitude").isNull)
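A similar rewrite of the second join, keeping the same filter as in the question, could look like this sketch:
// o is the result of the first (rewritten) join above
val c = p_value2
  .join(o, Seq("origin_latitude", "origin_longitude"), "left")
  .filter(p_value2("origin_longitude").isNull)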
I am trying to pivot a Spark dataframe on multiple columns using the pivot function, but after I add 2 columns it gives an overloaded method error.
This is the error I am getting after adding the third column:
overloaded method value pivot with alternatives:
(pivotColumn: String, values: java.util.List[Any])org.apache.spark.sql.RelationalGroupedDataset
(pivotColumn: String, values: Seq[Any])org.apache.spark.sql.RelationalGroupedDataset
(pivotColumn: String)org.apache.spark.sql.RelationalGroupedDataset
cannot be applied to (String, String, String)
Here's my work:
val df_new=df.join(df1, df("Col1")<=>df1("col1") && df1("col2")<=> df("col2")).groupBy(df("Col6"))
.agg(
sum(df("Col1")).alias("Col1"),
sum(df("Col2")).alias("Col2") ,
sum(df("Col3")).alias("Col3") ,
sum(df("Col4")).alias("Col4") ,
sum(df("Col5")).alias("Col5")
).select(
  'Amount, 'Col1, 'Col2, 'Col3, 'Col4, 'Col5
)
// Pivot
val pivotdf=df_new.groupBy($"Col1").
pivot("Col1","Col2","Col3","col4")
I have to pivot on Col1, Col2, Col3, Col4 and Col5. Please guide me on how I can do that.
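As the error message shows, pivot takes a single pivot column (optionally with a list of values), so there is no overload for several pivot columns. One common workaround, shown as a rough sketch below, is to concatenate the columns into a single key and pivot on that; which columns go into the key and which column is aggregated are assumptions here:
import org.apache.spark.sql.functions.{col, concat_ws, sum}

// combine the pivot columns into one key, then pivot on that key
val pivotdf = df_new
  .withColumn("pivot_key", concat_ws("_", col("Col2"), col("Col3"), col("Col4")))
  .groupBy(col("Col1"))
  .pivot("pivot_key")
  .agg(sum(col("Col5")))   // assumes Col5 is the numeric measure to aggregate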
I am using the following to find the max column value.
val d = sqlContext.sql("select max(date), id from myTable group By id")
How can I do the same query on a DataFrame without registering a temp table?
thanks,
Direct translation to DataFrame Scala API:
df.groupBy("id").agg(max("date"))
The Spark 2.2.0 execution plan is identical for both the OP's SQL and DataFrame scenarios.
Full code for spark-shell:
Seq((1, "2011-1-1"), (2, "2011-1-2")).toDF("id", "date_str").withColumn("date", $"date_str".cast("date")).write.parquet("tmp")
var df = spark.read.parquet("tmp")
df.groupBy("id").agg(max("date")).explain
df.createTempView("myTable")
spark.sql("select max(date), id from myTable group By id").explain
If you would like to translate that SQL to code to be used with a DataFrame, you could do something like:
df.groupBy("id").max("date").show()
For max use
df.describe(Columnname).filter("summary = 'max'").collect()(0).get(1)
And for min use
df.describe(Columnname).filter("summary = 'min'").collect()(0).get(1)
If you have a dataframe with an id and a date column, what you can do in Spark 2.0.1 is
from pyspark.sql.functions import max
mydf.groupBy('date').agg({'id':'max'}).show()
var maxValue = myTable.select("date").rdd.map(_.getDate(0)).max()(Ordering.by((d: java.sql.Date) => d.getTime))