I want to extract the columns of a Spark SQL query without executing it. With parsePlan:
import org.apache.spark.sql.catalyst.plans.logical.Project

val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
logicalPlan.collect {
  case p: Project => p.projectList.map(_.name)
}.flatten
I was able to extract the list of columns. However, it doesn't work in the case of SELECT * and throws an exception with the following message: An exception or error caused a run to abort: Invalid call to name on unresolved object, tree: *.
Without any form of execution it is not possible for Spark to determine the columns. For example, if a table was loaded from a CSV file
spark.read.option("header",true).csv("data.csv").createOrReplaceTempView("csvTable")
then the query
select * from csvTable
would not be able to determine the column names without reading at least the first line of the CSV file.
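To make this concrete, even just asking the DataFrame for its column names already forces Spark to open the file (a small sketch, assuming data.csv exists with the header line a,b):
// With header=true, Spark must read at least the first line of data.csv
// before it can report the column names.
val csvDf = spark.read.option("header", true).csv("data.csv")
println(csvDf.columns.mkString(", ")) // prints: a, b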
Extracting a bit of code from Spark's explain command, the following lines get as close as possible to an answer to the question:
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.QueryExecution

val logicalPlan: LogicalPlan = spark.sessionState.sqlParser.parsePlan("select * from csvTable")
val queryExecution: QueryExecution = spark.sessionState.executePlan(logicalPlan)
val outputAttr: Seq[Attribute] = queryExecution.sparkPlan.output
val colNames: Seq[String] = outputAttr.map(a => a.name)
println(colNames)
If the file data.csv contains the columns a and b, the code prints
List(a, b)
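A closely related sketch (the same caveat about non-public APIs applies) stops at the analyzed logical plan instead of building the physical plan:
// The analyzed plan already carries resolved output attributes,
// so physical planning is not strictly required just to read the column names.
val colNamesFromAnalysis: Seq[String] = queryExecution.analyzed.output.map(_.name)
println(colNamesFromAnalysis) // List(a, b)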
Disclaimer: QueryExecution is not considered a public class intended for use by application developers. As of now (Spark version 2.4.5) the code above works, but it is not guaranteed to keep working in future versions.
In a Spark (2.3.0) project using Scala, I would like to drop multiple columns using a regex. I tried using colRegex, but without success:
val df = Seq(("id","a_in","a_out","b_in","b_out"))
.toDF("id","a_in","a_out","b_in","b_out")
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.drop(df.colRegex("`.*_(in|out)`"))
// Hoping to get columns Array(id, a, b)
df_in.columns
// Getting Array(id, a, a_out, b, b_out)
On the other hand, the mechanism seems to work with select:
df.select(df.colRegex("`.*_(in|out)`")).columns
// Getting Array(a_in, a_out, b_in, b_out)
Several things are not clear to me:
what is this backquote syntax in the regex?
colRegex returns a Column: how can it actually represent several columns in the 2nd example?
can I combine drop and colRegex or do I need some workaround?
If you check the Spark source code of the colRegex method, it expects the regex to be passed in the below format:
/** the column name pattern in quoted regex without qualifier */
val escapedIdentifier = "`(.+)`".r
/** the column name pattern in quoted regex with qualifier */
val qualifiedEscapedIdentifier = ("(.+)" + """.""" + "`(.+)`").r
Backticks (`) are necessary to enclose your regex; otherwise the above patterns will not identify your input pattern.
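As a small illustration (plain Scala, outside of Spark) of how the quoted form is recognized by the first pattern above:
// Minimal sketch: a backtick-quoted string matches escapedIdentifier,
// and the captured group is the regex that colRegex will use.
val escapedIdentifier = "`(.+)`".r

"`.*_(in|out)`" match {
  case escapedIdentifier(pattern) => println(s"treated as a regex: $pattern") // .*_(in|out)
  case other                      => println(s"treated as a literal column name: $other")
}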
You can instead keep only the valid columns by dropping the ones that match your pattern, as shown below:
val df = Seq(("id","a_in","a_out","b_in","b_out"))
.toDF("id","a_in","a_out","b_in","b_out")
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.drop(df.colRegex("`.*_(in|out)`"))
val junkColumns = df_in.columns.filter(_.matches(".*_(in|out)$")).toSeq // columns that still match the _in/_out pattern
val final_df_in = df_in.drop(junkColumns: _*) // drop them, keeping only the columns you want
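With the sample data above, final_df_in.columns should then be Array(id, a, b).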
In addition to the workaround proposed by Waqar Ahmed and kavetiraviteja (accepted answer), here is another possibility based on select with some negative regex magic. More concise, but harder to read for non-regex-gurus...
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.select(df.colRegex("`^(?!.*_(in|out)$).*$`")) // regex with negative lookahead, excluding the remaining _in/_out columns
I'm selecting a column into a DataFrame. I would like to cast it to a string so that it can be used to build a Cosmos DB dynamic query. Calling collect() on the DataFrame complains that queries with streaming sources must be executed with writeStream.start().
val DF = AppointmentDF
.select("*")
.filter($"xyz" === "abc")
DF.createOrReplaceTempView("MyTable")
val column1DF = spark.sql("SELECT column1 FROM MyTable")
// This is not getting resolved
val sql="select c.abc from c where c.column = \"" + String.valueOf(column1DF) + "\""
println(sql)
Error:
org.apache.spark.sql.AnalysisException: cannot resolve '`column1DF`' given input columns: []; line 1 pos 12;
DF.collect().foreach { row =>
println(row.mkString(","))
}
Error:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must
be executed with writeStream.start();;
A DataFrame is a distributed data structure, not a local structure on your machine that can simply be printed. The values DF and column1DF are both DataFrames. To bring all the data of your query to the driver node, you can use the DataFrame method collect and extract your value from the returned Array of Rows.
Collect can be harmful if you are bringing gigabytes of data to the memory of your driver node.
You can use collect and take the head to get the first row of the DataFrame:
val column1DF = spark.sql("SELECT column1 FROM MyTable").collect().head.getAs[String](0)
val sql="select c.abc from c where c.column = \"" + column1DF + "\""
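The same streaming limitation from the error still applies, but a slightly more defensive sketch avoids an extra exception when the query returns no rows at all:
// Use headOption instead of head so an empty result does not blow up.
val column1Opt: Option[String] =
  spark.sql("SELECT column1 FROM MyTable").collect().headOption.map(_.getAs[String](0))

val sql = column1Opt
  .map(v => s"select c.abc from c where c.column = \"$v\"")
  .getOrElse(sys.error("MyTable returned no rows"))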
I see some code from the book "Spark: The Definitive Guide" that invokes drop on a DataFrame with no parameters. When I use show(), I find nothing changed, so what is the meaning of it?
I executed it and nothing changed; dfNoNull.show() is the same as dfWithDate.show():
dfWithDate.createOrReplaceTempView("dfWithDate")
// in Scala
val dfNoNull = dfWithDate.drop()
dfNoNull.createOrReplaceTempView("dfNoNull")
Does it mean it creates a new DataFrame?
I know what happens when a DataFrame joins itself when I'm using Hive SQL: if I just write
val df1 = spark.sql("select id,date from date")
val df2 = spark.sql("select id,date from date")
val joinedDf = spark.sql("select dateid1,dateid2 from sales")
  .join(df1, df1("id") === $"dateid1")
  .join(df2, df2("id") === $"dateid2")
then an error occurs: Cartesian join!
because lazy evaluation will consider df1 and df2 as the same one.
So here, if I write
val df2=df1.drop()
will it prevent that error?
If not, what does the drop method with no parameters mean?
Or does it just mean removing the temp view name and creating a new one?
But I tried the code below and no exception was thrown:
val df= Seq((1,"a")).toDF("id","name")
df.createOrReplaceTempView("df1")
val df2=df.drop()
df2.createOrReplaceTempView("df2")
spark.sql("select * from df1").show()
Or does the book mean the following?
val dfNoNull = dfWithDate.na.drop()
because it says somewhere below the code:
Grouping sets depend on null values for aggregation levels. If you do
not filter-out null values, you will get incorrect results. This
applies to cubes, rollups, and grouping sets.
The drop function with no parameters behaves the same as drop with a column name that doesn't exist in the DataFrame.
You can follow the code in the Spark source.
Even in the function's documentation you can see a hint of this behavior:
/**
* Returns a new Dataset with a column dropped. This is a no-op if schema doesn't contain
* column name.
*
* This method can only be used to drop top level columns. the colName string is treated
* literally without further interpretation.
*
* @group untypedrel
* @since 2.0.0
*/
So when calling the function with no parameters, a no-op occurs and nothing changes in the returned DataFrame.
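To see the difference the question is circling around, here is a tiny sketch comparing drop() with na.drop() (assuming a SparkSession named spark):
import spark.implicits._

val toy = Seq((Some(1), "a"), (None, "b")).toDF("id", "name")

toy.drop().columns    // Array(id, name) -- no-op, schema unchanged
toy.drop().count()    // 2               -- rows unchanged as well
toy.na.drop().count() // 1               -- na.drop removes rows containing nulls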
So, I'm trying to read an existing file and save it into a DataFrame. Once that's done, I make a union between that existing DataFrame and a new one I have already created; both have the same columns and share the same schema.
ALSO I CANNOT GIVE MEANINGFUL NAMES TO VARS NOR GIVE ANY MORE DATA BECAUSE OF RESTRICTIONS
val dfExist = spark.read.format("csv").option("header", "true").option("delimiter", ",").schema(schema).load(filePathAggregated3)
val df5 = df4.union(dfExist)
Once that's done, I get the "start_ts" (a timestamp in epoch format) that is duplicated in the union of the above DataFrames (df4 and dfExist), and I also get rid of some characters I don't want:
val df6 = df5.select($"start_ts").collect()
val df7 = df6.diff(df6.distinct).distinct.mkString.replace("[", "").replace("]", "")
Now I use this duplicate "start_ts" to filter the DataFrame and create two new DataFrames: one selecting the items with this duplicate timestamp, and one with the items that are not like this duplicate timestamp:
val itemsNotDup = df5.filter(!$"start_ts".like(df7)).select($"start_ts",$"avg_value",$"Number_of_val")
val items = df5.filter($"start_ts".like(df7)).select($"start_ts",$"avg_value",$"Number_of_val")
And then I save the avg_value and the Number_of_val in two different lists:
items.map(t => t.getAs[Double]("avg_value")).collect().foreach(saveList => listDataDF += saveList.toString)
items.map(t => t.getAs[Long]("Number_of_val")).collect().foreach(saveList => listDataDF2 += saveList.toString)
Now I do some maths with the values in the lists (THIS IS WHERE I'M GETTING ISSUES):
val newAvg = ((listDataDF(0).toDouble*listDataDF2(0).toDouble) - (listDataDF(1).toDouble*listDataDF2(1).toDouble)) / (listDataDF2(0) + listDataDF2(1)).toInt
val newNumberOfValues = listDataDF2(0).toDouble + listDataDF2(1).toDouble
Then I save the duplicate timestamp (df7), the avg and the number of values into a list as a single item. This list is turned into a DataFrame, which I then transform into a new DataFrame with the columns as they are supposed to be:
listDataDF3 += df7 + ',' + newAvg.toString + ',' + newNumberOfValues.toString + ','
val listDF = listDataDF3.toDF("value")
val listDF2 = listDF.withColumn("_tmp", split($"value", "\\,")).select(
$"_tmp".getItem(0).as("start_ts"),
$"_tmp".getItem(1).as("avg_value"),
$"_tmp".getItem(2).as("Number_of_val")
).drop("_tmp")
Finally, I union the DataFrame without duplicates with the new DataFrame, which has the duplicate timestamp, the average of the duplicate avg values and the sum of the numbers of values:
val finalDF = itemsNotDup.union(listDF2)
finalDF.coalesce(1).write.mode(SaveMode.Overwrite).format("csv").option("header","true").save(filePathAggregated3)
When I run this code in Spark it gives me the error. I supposed it was related to empty lists (since it gives me the error when doing the maths with the values of the lists), but if I delete the line where I write to CSV, the code runs perfectly; I also saved the lists and the values of the math calcs into files and they are not empty.
My supposition is that Spark is deleting the file before reading it (because of how it distributes tasks between workers), and that's why the list is empty and I'm getting this error when trying to do the maths with those values.
I'm trying to be as clear as possible but I cannot give much more details, nor show any of the output.
So, how can I avoid this error? Also, I've only been using Scala/Spark for a month, so any code recommendations would be nice as well.
Thanks in advance.
This error comes from the data: one of your lists does not contain the elements you expect, so when you refer to that index, the List gives you this error.
It was a problem related to reading the files. I added a check (df.rdd.isEmpty) and found that whenever the DF was empty I was getting this error. I turned this into an if/else statement that checks whether the DF is empty, and now it works fine.
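For reference, the guard described above could look roughly like this (a sketch; the branch contents are placeholders for the existing logic):
// df.rdd.isEmpty checks emptiness without collecting the whole DataFrame.
if (dfExist.rdd.isEmpty) {
  // nothing previously aggregated: skip the duplicate-handling maths
} else {
  // safe to build the lists and compute newAvg / newNumberOfValues as above
}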
I have code like the following, with a multiline query:
val hiveInsertIntoTable = spark.read.text(fileQuery).collect()
hiveInsertIntoTable.foreach(println)
val actualQuery = hiveInsertIntoTable(0).mkString
println(actualQuery)
spark.sql(s"truncate table $tableTruncate")
spark.sql(actualQuery)
Whenever I try to execute the actual query I get an error:
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input '<EOF>'(line 1, pos 52)
== SQL ==
insert into wera_tacotv_esd.lac_asset_table_pb_hive
----------------------------------------------------^^^
and the end of the query is .... ; (it terminates in a ;).
The query is actually about 450 lines long.
I tried to wrap the variable in triple quotes but that didn't work either.
Any help is appreciated.
I am using spark 2.1 and scala 2.11
Three problems:
hiveInsertIntoTable is an Array[org.apache.spark.sql.Row] - not a very useful structure.
You take only its first row with hiveInsertIntoTable(0).
Even if you took all the rows, concatenating them with an empty separator (.mkString) wouldn't work well.
Either:
val actualQuery = spark.read.text(path).as[String].collect.mkString("\n")
or
val actualQuery = spark.sparkContext.wholeTextFiles(path).values.first()
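Since the file apparently ends in a ;, it may also be worth stripping that trailing semicolon before handing the string to spark.sql, as the parser can reject it (a sketch combining the first variant with that cleanup; fileQuery is the path from the question):
import spark.implicits._

// Read the whole file as one query string and drop a possible trailing ';'.
val actualQuery = spark.read.text(fileQuery).as[String].collect().mkString("\n")
  .trim
  .stripSuffix(";")

spark.sql(actualQuery)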