What does it mean when DataFrame invoke a drop() without parameter? - scala

I see a code from book "Spark The Definitive Guide",it invoke a drop on a dataframe with no parameter,when I use show(),I found nothing changed,but what is the meaning of it?
I execute it,nothing changed,dfNoNull.show() is the same as dfWithDate.show()
dfWithDate.createOrReplaceTempView("dfWithDate")
// in Scala
val dfNoNull = dfWithDate.drop()
dfNoNull.createOrReplaceTempView("dfNoNull")
does it mean, it create a new datarframe?
I know when a dataframe join itself when I using Hive sql,if I just
val df1=spark.sql("select id,date from date")
val df2=spark.sql("select id,date from date")
val joinedDf = spark.sql("select dateid1,dateid2 from sales")
.join(df1,df1["id"]===dateid1).join(df2,df2["id"]===dateid2)
Then an error occur:Cartesian join!
because the lazy evalution will consider df1 and df1 as the same one
so here,if I
val df2=df1.drop()
will I prevent that error?
If not,what does the drop method with no parameter mean?
Or it just mean remove the temp view name and create a new one?
but I try the code below,no exception throwed:
val df= Seq((1,"a")).toDF("id","name")
df.createOrReplaceTempView("df1")
val df2=df.drop()
df2.createOrReplaceTempView("df2")
spark.sql("select * from df1").show()
Or does the book mean below?
val dfNoNull = dfWithDate.na.drop()
because it wrote somewhere below the code:
Grouping sets depend on null values for aggregation levels. If you do
not filter-out null values, you will get incorrect results.This
applies to cubes, rollups, and grouping sets.

drop function with no parameter behave the same as drop with column name that doesn't exist in the Dataframe.
You can follow the code in the source of spark.
Even in the function documentation you can see a hint to this behavior.
/**
* Returns a new Dataset with a column dropped. This is a no-op if schema doesn't contain
* column name.
*
* This method can only be used to drop top level columns. the colName string is treated
* literally without further interpretation.
*
* #group untypedrel
* #since 2.0.0
*/
So when calling the function with no parameter no-op occur and nothing changes in the returning DataFrame.

Related

Spark: dropping multiple columns using regex

In a Spark (2.3.0) project using Scala, I would like to drop multiple columns using a regex. I tried using colRegex, but without success:
val df = Seq(("id","a_in","a_out","b_in","b_out"))
.toDF("id","a_in","a_out","b_in","b_out")
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.drop(df.colRegex("`.*_(in|out)`"))
// Hoping to get columns Array(id, a, b)
df_in.columns
// Getting Array(id, a, a_out, b, b_out)
On the other hand, the mechanism seems to work with select:
df.select(df.colRegex("`.*_(in|out)`")).columns
// Getting Array(a_in, a_out, b_in, b_out)
Several things are not clear to me:
what is this backquote syntax in the regex?
colRegex returns a Column: how can it actually represent several columns in the 2nd example?
can I combine drop and colRegex or do I need some workaround?
If you check spark code of colRefex method ... it expects regexs to be passed in the below format
/** the column name pattern in quoted regex without qualifier */
val escapedIdentifier = "`(.+)`".r
/** the column name pattern in quoted regex with qualifier */
val qualifiedEscapedIdentifier = ("(.+)" + """.""" + "`(.+)`").r
backticks(`) are necessary to enclose your regex, otherwise the above patterns will not identify your input pattern.
you can try selecting specific colums which are valid as mentioned below
val df = Seq(("id","a_in","a_out","b_in","b_out"))
.toDF("id","a_in","a_out","b_in","b_out")
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.drop(df.colRegex("`.*_(in|out)`"))
val validColumns = df_in.columns.filter(p => p.matches(".*_(in|out)$")).toSeq //select all junk columns
val final_df_in = df_in.drop(validColumns:_*) // this will drop all columns which are not valid as per your criteria.
In addition to the workaround proposed by Waqar Ahmed and kavetiraviteja (accepted answer), here is another possibility based on select with some negative regex magic. More concise, but harder to read for non-regex-gurus...
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.select(df.colRegex("`^(?!.*_(in|out)_).*$`")) // regex with negative lookahead

Spark - Column expression

I came across the following express, I know what does it mean - department("name"). I am curious to know, what it is resolved to. please share your inputs .
department("name") - it is used to refer the column with the name "name". Hope I am correct ? But , what it is resolved to, it seems like auxiliary constructor
From https://spark.apache.org/docs/2.4.5/api/java/index.html?org/apache/spark/sql/DataFrameWriter.html,
// To create Dataset[Row] using SparkSession
val people = spark.read.parquet("...")
val department = spark.read.parquet("...")
people.filter("age > 30")
.join(department, people("deptId") === department("id"))
.groupBy(department("name"), people("gender"))
.agg(avg(people("salary")), max(people("age")))
department("name") is just syntactic sugar for calling apply function:
department.apply("name") which returns Column
from Spark API, Dataset object:
/**
* Selects column based on the column name and returns it as a [[Column]].
*
* #note The column name can also reference to a nested column like `a.b`.
*
* #group untypedrel
* #since 2.0.0
*/
def apply(colName: String): Column = col(colName)

How to get columns of a spark sql query without execution

I want to extract columns of a spark sql query without executing it. With parsePlan:
val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
logicalPlan.collect{
case p: Project => p.projectList.map(_.name)
}.flatten
I was able to extract the list of columns. However it doesn't work in case of Select *, and throws an exception with the following message : An exception or error caused a run to abort: Invalid call to name on unresolved object, tree: *.
Without any form of execution it is not possible for Spark to determine the columns. For example if a table was loaded from a csv file
spark.read.option("header",true).csv("data.csv").createOrReplaceTempView("csvTable")
then the query
select * from csvTable
would not be able to read the column names without reading (at least the first line) of the csv file.
Extracting a bit of code from Spark's explain command the following lines get as close as possible to an answer to the question:
val logicalPlan: LogicalPlan = spark.sessionState.sqlParser.parsePlan("select * from csvTable")
val queryExecution: QueryExecution = spark.sessionState.executePlan(logicalPlan)
val outputAttr: Seq[Attribute] = queryExecution.sparkPlan.output
val colNames: Seq[String] = outputAttr.map(a => a.name)
println(colNames)
If the file data.csv contains the columns a and b the code prints
List(a, b)
Disclaimer: QueryExecution is not considered to be a public class that might be used by application developers. As of now (Spark version 2.4.5) the code above works, but it is not guaranteed to work in future versions.

how to get the row corresponding to the minimum value of some column in spark scala dataframe

i have the following code. df3 is created using the following code.i want to get the minimum value of distance_n and also the entire row containing that minimum value .
//it give just the min value , but i want entire row containing that min value
for getting the entire row , i converted this df3 to table for performing spark.sql
if i do like this
spark.sql("select latitude,longitude,speed,min(distance_n) from table1").show()
//it throws error
and if
spark.sql("select latitude,longitude,speed,min(distance_nd) from table180").show()
// by replacing the distance_n with distance_nd it throw the error
how to resolve this to get the entire row corresponding to min value
Before using a custom UDF, you have to register it in spark's sql Context.
e.g:
spark.sqlContext.udf.register("strLen", (s: String) => s.length())
After the UDF is registered, you can access it in your spark sql like
spark.sql("select strLen(some_col) from some_table")
Reference: https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html

Spark- Load data frame contents in table in a loop

I use scala/ spark to insert data into a Hive parquet table as follows
for(*lots of current_Period_Id*){//This loop is on a result of another query that returns multiple rows of current_Period_Id
val myDf = hiveContext.sql(s"""SELECT columns FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""")
val count: Int = myDf.count().toInt
if(count>0){
hiveContext.sql(s"""INSERT INTO destinationtable PARTITION(period_id=$current_Period_Id) SELECT columns FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""")
}
}
This approach takes a lot of time to complete because the select statement is being executed twice.
I'm trying to avoid selecting data twice and one way I've thought of is writing the dataframe myDf to the table directly.
This is the gist of the code I'm trying to use for the purpose
val sparkConf = new SparkConf().setAppName("myApp")
.set("spark.yarn.executor.memoryOverhead","4096")
val sc = new SparkContext(sparkConf)
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition","true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
for(*lots of current_Period_Id*){//This loop is on a result of another query
val myDf = hiveContext.sql("SELECT COLUMNS FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id")
val count: Int = myDf.count().toInt
if(count>0){
myDf.write.mode("append").format("parquet").partitionBy("PERIOD_ID").saveAsTable("destinationtable")
}
}
But I get an error in the myDf.write part.
java.util.NoSuchElementException: key not found: period_id
The destination table is partitioned by period_id.
Could someone help me with this?
The spark version I'm using is 1.5.0-cdh5.5.2.
The dataframe schema and table's description differs from each other. The PERIOD_ID != period_id column name is Upper case in your DF but in UPPER case in table. Try in sql with lowercase the period_id