This question already has answers here:
How to get a value from the Row object in Spark Dataframe?
(3 answers)
Closed 4 years ago.
I am currently exploring how to call big hql files (contains 100 line of an insert into select statement) via sqlContext.
Another thing is, The hqls files are parameterize, so while calling it from sqlContext, I want to pass the parameters as well.
Have gone through loads of blogs and posts, but not found any answers to this.
Another thing I was trying, to store an output of rdd into a variable.
pyspark
max_date=sqlContext.sql("select max(rec_insert_date) from table")
now want to pass max_date as variable to next rdd
incremetal_data=sqlConext.sql(s"select count(1) from table2 where rec_insert_date > $max_dat")
This is not working , moreover the value for max_date is coming as =
u[row-('20018-05-19 00:00:00')]
now this is not clear how to trim those extra characters.
The sql Context reterns a Dataset[Row]. You can get your value from there with
max_date=sqlContext.sql("select count(rec_insert_date) from table").first()[0]
In Spark 2.0+ using spark Session you can
max_date=spark.sql("select count(rec_insert_date) from table").rdd.first()[0]
to get the underlying rdd from the returned dataframe
Shouldn't you use max(rec_insert_date) instead of count(rec_insert_date)?
You have two options on passing values returned from one query to another:
Use collect, which will trigger computations and assign returned value to a variable
max_date = sqlContext.sql("select max(rec_insert_date) from table").collect()[0][0] # max_date has actual date assigned to it
incremetal_data = sqlConext.sql(s"select count(1) from table2 where rec_insert_date > '{}'".format(max_date))
Another (and better) option is to use Dataframe API
from pyspark.sql.functions import col, lit
incremental_data = sqlContext.table("table2").filter(col("rec_insert_date") > lit(max_date))
Use cross join - it should be avoided if you have more than 1 result from the first query. The advantage is that you don't break the graph of processing, so everything can be optimized by Spark.
max_date_df = sqlContext.sql("select max(rec_insert_date) as max_date from table") # max_date_df is a dataframe with just one row
incremental_data = sqlContext.table("table2").join(max_date_df).filter(col("rec_insert_date") > col("max_date"))
As for you first question how to call large hql files from Spark:
If you're using Spark 1.6 then you need to create a HiveContext https://spark.apache.org/docs/1.6.1/sql-programming-guide.html#hive-tables
If you're using Spark 2.x then while creating SparkSession you need to enable Hive Support https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
You can start by inserting im in a sqlContext.sql(...) method, from my experience this usually works and is a nice starting point to rewrite the logic to DataFrames/Datasets API. There may be some issues while running it in your cluster because your queries will be executed by Spark's SQL engine (Catalyst) and won't be passed to Hive.
Related
Long story short, I'm tasked with converting files from SparkSQL to PySpark as my first task at my new job.
However, I'm unable to see many differences outside of syntax. Is SparkSQL an earlier version of PySpark or a component of it or something different altogether?
And yes, it's my first time using these tools. But, I have experience with both Python & SQL, so it's not seeming to be that difficult of a task. Just want a better understanding.
Example of the syntax difference I'm referring to:
spark.read.table("db.table1").alias("a")
.filter(F.col("a.field1") == 11)
.join(
other = spark.read.table("db.table2").alias("b"),
on = 'field2',
how = 'left'
Versus
df = spark.sql(
"""
SELECT b.field1,
CASE WHEN ...
THEN ...
ELSE ...
end field2
FROM db.table1 a
LEFT JOIN db.table2 b
on a.field1= b.field1
WHERE a.field1= {}
""".format(field1)
)
From the documentation: PySpark is an interface within which you have the components of spark viz. Spark core, SparkSQL, Spark Streaming and Spark MLlib.
Coming to the task you have been assigned, it looks like you've been tasked with translating SQL-heavy code into a more PySpark-friendly format.
I have a scala dataframe with two columns:
id: String
updated: Timestamp
From this dataframe I just want to get out the latest date, for which I use the following code at the moment:
df.agg(max("updated")).head()
// returns a row
I've just read about the collect() function, which I'm told to be
safer to use for such a problem - when it runs as a job, it appears it is not aggregating the max on the whole dataset, it looks perfectly fine when it is running in a notebook -, but I don't understand how it should
be used.
I found an implementation like the following, but I could not figure how it should be used...
df1.agg({"x": "max"}).collect()[0]
I tried it like the following:
df.agg(max("updated")).collect()(0)
Without (0) it returns an Array, which actually looks good. So idea is, we should apply the aggregation on the whole dataset loaded in the drive, not just the partitioned version, otherwise it seems to not retrieve all the timestamps. My question now is, how is collect() actually supposed to work in such a situation?
Thanks a lot in advance!
I'm assuming that you are talking about a spark dataframe (not scala).
If you just want the latest date (only that column) you can do:
df.select(max("updated"))
You can see what's inside the dataframe with df.show(). Since df are immutable you need to assign the result of the select to another variable or add the show after the select().
This will return a dataframe with just one row with the max value in "updated" column.
To answer to your question:
So idea is, we should apply the aggregation on the whole dataset loaded in the drive, not just the partitioned version, otherwise it seems to not retrieve all the timestamp
When you select on a dataframe, spark will select data from the whole dataset, there is not a partitioned version and a driver version. Spark will shard your data across your cluster and all the operations that you define will be done on the entire dataset.
My question now is, how is collect() actually supposed to work in such a situation?
The collect operation is converting from a spark dataframe into an array (which is not distributed) and the array will be in the driver node, bear in mind that if your dataframe size exceed the memory available in the driver you will have an outOfMemoryError.
In this case if you do:
df.select(max("Timestamp")).collect().head
You DF (that contains only one row with one column which is your date), will be converted to a scala array. In this case is safe because the select(max()) will return just one row.
Take some time to read more about spark dataframe/rdd and the difference between transformation and action.
It sounds weird. First of all you donĀ“t need to collect the dataframe to get the last element of a sorted dataframe. There are many answers to this topics:
How to get the last row from DataFrame?
This question already has an answer here:
Spark Implicit $ for DataFrame
(1 answer)
Closed 3 years ago.
Let us say I've a DF created as follows
val posts = spark.read
.option("rowTag","row")
.option("attributePrefix","")
.schema(Schemas.postSchema)
.xml("src/main/resources/Posts.xml")
What is the advantage of converting it to a Column using posts.select("Id") over posts.select($"Id")
df.select operates on the column directly while $"col" creates a Column instance. You can also create Column instances using col function. Now the Columns can be composed to form complex expressions which then can be passed to any of the df functions.
You can also find examples and more usages on the Scaladoc of Column class.
Ref - https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Column
There is no particular advantage, it's an automatic conversion anyway. But not all methods in SparkSQL perform this conversion, so sometimes you have to put the Column object with the $.
There is not much difference but some functionalities can be used only using $ with the column name.
Example : When we want to sort the value in this column, without using $ prior to column name, it will not work.
Window.orderBy("Id".desc)
But if you use $ before column name, it works.
Window.orderBy($"Id".desc)
Iam trying to do some transformations on the dataset with spark using scala currently using spark sql but want to shift the code to native scala code. i want to know whether to use filter or map, doing some operations like matching the values in column and get a single column after the transformation into a different dataset.
SELECT * FROM TABLE WHERE COLUMN = ''
Used to write something like this earlier in spark sql can someone tell me an alternative way to write the same using map or filter on the dataset, and even which one is much faster when compared.
You can read documentation from Apache Spark website. This is the link to API documentation at https://spark.apache.org/docs/2.3.1/api/scala/index.html#package.
Here is a little example -
val df = sc.parallelize(Seq((1,"ABC"), (2,"DEF"), (3,"GHI"))).toDF("col1","col2")
val df1 = df.filter("col1 > 1")
df1.show()
val df2 = df1.map(x => x.getInt(0) + 3)
df2.show()
If I understand you question correctly, you need to rewrite your SQL query to DataFrame API. Your query reads all columns from table TABLE and filter rows where COLUMN is empty. You can do this with DF in the following way:
spark.read.table("TABLE")
.where($"COLUMN".eqNullSafe(""))
.show(10)
Performance will be the same as in your SQL. Use dataFrame.explain(true) method to understand what Spark will do.
I am using Apache spark in Scala to run aggregations on multiple columns in a dataframe for example
select column1, sum(1) as count from df group by column1
select column2, sum(1) as count from df group by column2
The actual aggregation is more complicated than just the sum(1) but it's besides the point.
Query strings such as the examples above are compiled for each variable that I would like to aggregate, and I execute each string through a Spark sql context to create a corresponding dataframe that represents the aggregation in question
The nature of my problem is that I would have to do this for thousands of variables.
My understanding is that Spark will have to "read" the main dataframe each time it executes an aggregation.
Is there maybe an alternative way to do this more efficiently?
Thanks for reading my question, and thanks in advance for any help.
Go ahead and cache the data frame after you build the DataFrame with your source data. Also, to avoid writing all the queries in the code, go ahead and put them in a file and pass the file at run time. Have something in your code that can read your file and then you can run your queries. The best part about this approach is you can change your queries by updating the file and not the applications. Just make sure you find a way to give the output unique names.
In PySpark, it would look something like this.
dataframe = sqlContext.read.parquet("/path/to/file.parquet")
// do your manipulations/filters
dataframe.cache()
queries = //how ever you want to read/parse the query file
for query in queries:
output = dataframe.sql(query)
output.write.parquet("/path/to/output.parquet")