I am currently using :
+---+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |sen |attributes |
+---+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |Stanford is good college.|[[Stanford,ORGANIZATION,NNP], [is,O,VBZ], [good,O,JJ], [college,O,NN], [.,O,.], [Stanford,ORGANIZATION,NNP], [is,O,VBZ], [good,O,JJ], [college,O,NN], [.,O,.]]|
+---+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
I want to get above df from :
+----------+--------+--------------------+
|article_id| sen| attribute|
+----------+--------+--------------------+
| 1|example1|[Standford,Organi...|
| 1|example1| [is,O,VP]|
| 1|example1| [good,LOCATION,ADP]|
+----------+--------+--------------------+
using :
df3.registerTempTable("d1")
val df4 = sqlContext.sql("select article_id,sen,collect(attribute) as attributes from d1 group by article_id,sen")
Is there any way that I don't have to register temp table, as while saving dataframe, it is giving lot of garbage!! Something lige df3.Select""??
The only way Spark currently has to run SQL against a dataframe is via a temporary table. However, you can add implicit methods to DataFrame to automate this, as we have done at Swoop. I can't share all the code as it uses a number of our internal utilities & implicits but the core is in the following gist. The importance of using unique temporary tables is that (at least until Spark 2.0) temporary tables are cluster global.
We use this approach regularly in our work, especially since there are many situations in which SQL is much simpler/easier to write and understand than the Scala DSL.
Hope this helps!
Related
Oversimplified Scenario:
A process which generates monthly data in a s3 file. The number of fields could be different in each monthly run. Based on this data in s3,we load the data to a table and we manually (as number of fields could change in each run with addition or deletion of few columns) run a SQL for few metrics.There are more calculations/transforms on this data,but to have starter Im presenting the simpler version of the usecase.
Approach:
Considering the schema-less nature, as the number of fields in the s3 file could differ in each run with addition/deletion of few fields,which requires manual changes every-time in the SQL, Im planning to explore Spark/Scala, so that we can directly read from s3 and dynamically generate SQL based on the fields.
Query:
How I can achieve this in scala/spark-SQL/dataframe? s3 file contains only the required fields from each run.Hence there is no issue reading the dynamic fields from s3 as it is taken care by dataframe.The issue is how can we generate SQL dataframe-API/spark-SQL code to handle.
I can read s3 file via dataframe and register the dataframe as createOrReplaceTempView to write SQL, but I dont think it helps manually changing the spark-SQL, during addition of a new field in s3 during next run. what is the best way to dynamically generate the sql/any better ways to handle the issue?
Usecase-1:
First-run
dataframe: customer,1st_month_count (here dataframe directly points to s3, which has only required attributes)
--sample code
SELECT customer,sum(month_1_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count").show()
Second-Run - One additional column was added
dataframe: customer,month_1_count,month_2_count) (here dataframe directly points to s3, which has only required attributes)
--Sample SQL
SELECT customer,sum(month_1_count),sum(month_2_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count","month_2_count").show()
Im new to Spark/Scala, would be helpful if you can provide the direction so that I can explore further.
It sounds like you want to perform the same operation over and over again on new columns as they appear in the dataframe schema? This works:
from pyspark.sql import functions
#search for column names you want to sum, I put in "month"
column_search = lambda col_names: 'month' in col_names
#get column names of temp dataframe w/ only the columns you want to sum
relevant_columns = original_df.select(*filter(column_search, original_df.columns)).columns
#create dictionary with relevant column names to be passed to the agg function
columns = {col_names: "sum" for col_names in relevant_columns}
#apply agg function with your groupBy, passing in columns dictionary
grouped_df = original_df.groupBy("customer").agg(columns)
#show result
grouped_df.show()
Some important concepts can help you to learn:
DataFrames have data attributes stored in a list: dataframe.columns
Functions can be applied to lists to create new lists as in "column_search"
Agg function accepts multiple expressions in a dictionary as explained here which is what I pass into "columns"
Spark is lazy so it doesn't change data state or perform operations until you perform an action like show(). This means writing out temporary dataframes to use one element of the dataframe like column as I do is not costly even though it may seem inefficient if you're used to SQL.
I have a use case where I intend to group by key(s) while aggregating over column(s). I am using Dataset and tried to achieve these operations by using groupBy and agg. For example take the following scenario
case class Result(deptId:String,locations:Seq[String])
case class Department(deptId:String,location:String)
// using spark 2.0.2
// I have a Dataset `ds` of type Department
+-------+--------------------+
|deptId | location |
+-------+--------------------+
| d1|delhi |
| d1|mumbai |
| dp2|calcutta |
| dp2|hyderabad |
+-------+--------------------+
I intended to convert it to
// Dataset `result` of type Result
+-------+--------------------+
|deptId | locations |
+-------+--------------------+
| d1|[delhi,mumbai] |
| dp2|[calcutta,hyderabad]|
+-------+--------------------+
For this I searched on stack and found the following:
val flatten = udf(
(xs: Seq[Seq[String]]) => xs.flatten)
val result = ds.groupBy("deptId").
agg(flatten(collect_list("location")).as("locations")
The above seemed pretty neat for me.
But before searching for the above, I first searched if Dataset had an inbuilt reduceByKey like a RDD does. But couldn't find, so opted for above. But I read this article grouByKey vs reduceByKey and came to know reduceByKey has less shuffles and is more efficient. Which is my first reason to ask the question, should I opt for RDD in my scenario ?
The reason I initially went for Dataset was solely enforcement of type,ie. each row being of type Department. But as my result has an entirely different schema should I bother with type safety ? So I tried doing result.as[Result] but that doesn't seem to do any compile time type check. Another reason I chose Dataset was, I'll pass the result Dataset to some other function, having a structure makes code easy to maintain. Also the case class can be highly nested, I cannot imagine maintaining that nesting in pairRDD while writing reduce/map operations.
Another thing I'm unsure of is about using udf. I came across post, where people said they would prefer changing Dataset to RDD, rather than using udf for complex aggregations/grouby.
I also googled around a bit and saw posts/articles where people said Dataset has overhead of type checking, but in higher spark version is better performance wise compared to RDD. Again not sure should I switch back to RDD ?
PS: please forgive, if I used some terms incorrectly.
To answer some of you questions:
groupBy + agg is not groupByKey - DataFrame / Dataset groupBy behaviour/optimization - in general case. There are specific cases where it might behave like one, this includes collect_list.
reduceByKey is not better than RDD-style groupByKey when groupByKey-like logic is required - Be Smart About groupByKey - and in fact it is almost always worse.
There is a important trade-off between static type checking and performance in Spark's Dataset - Spark 2.0 Dataset vs DataFrame
The linked post specifically advises against using UserDefinedAggregateFunction (not UserDefinedFunction) because of excessive copying of data - Spark UDAF with ArrayType as bufferSchema performance issues
You don't even need UserDefinedFunction as flattening is not required in your case:
val df = Seq[Department]().toDF
df.groupBy("deptId").agg(collect_list("location").as("locations"))
And this is what you should go for.
A statically typed equivalent would be
val ds = Seq[Department]().toDS
ds
.groupByKey(_.deptId)
.mapGroups { case (deptId, xs) => Result(deptId, xs.map(_.location).toSeq) }
considerably more expensive than the DataFrame option.
Iam trying to do some transformations on the dataset with spark using scala currently using spark sql but want to shift the code to native scala code. i want to know whether to use filter or map, doing some operations like matching the values in column and get a single column after the transformation into a different dataset.
SELECT * FROM TABLE WHERE COLUMN = ''
Used to write something like this earlier in spark sql can someone tell me an alternative way to write the same using map or filter on the dataset, and even which one is much faster when compared.
You can read documentation from Apache Spark website. This is the link to API documentation at https://spark.apache.org/docs/2.3.1/api/scala/index.html#package.
Here is a little example -
val df = sc.parallelize(Seq((1,"ABC"), (2,"DEF"), (3,"GHI"))).toDF("col1","col2")
val df1 = df.filter("col1 > 1")
df1.show()
val df2 = df1.map(x => x.getInt(0) + 3)
df2.show()
If I understand you question correctly, you need to rewrite your SQL query to DataFrame API. Your query reads all columns from table TABLE and filter rows where COLUMN is empty. You can do this with DF in the following way:
spark.read.table("TABLE")
.where($"COLUMN".eqNullSafe(""))
.show(10)
Performance will be the same as in your SQL. Use dataFrame.explain(true) method to understand what Spark will do.
This question already has answers here:
How to get a value from the Row object in Spark Dataframe?
(3 answers)
Closed 4 years ago.
I am currently exploring how to call big hql files (contains 100 line of an insert into select statement) via sqlContext.
Another thing is, The hqls files are parameterize, so while calling it from sqlContext, I want to pass the parameters as well.
Have gone through loads of blogs and posts, but not found any answers to this.
Another thing I was trying, to store an output of rdd into a variable.
pyspark
max_date=sqlContext.sql("select max(rec_insert_date) from table")
now want to pass max_date as variable to next rdd
incremetal_data=sqlConext.sql(s"select count(1) from table2 where rec_insert_date > $max_dat")
This is not working , moreover the value for max_date is coming as =
u[row-('20018-05-19 00:00:00')]
now this is not clear how to trim those extra characters.
The sql Context reterns a Dataset[Row]. You can get your value from there with
max_date=sqlContext.sql("select count(rec_insert_date) from table").first()[0]
In Spark 2.0+ using spark Session you can
max_date=spark.sql("select count(rec_insert_date) from table").rdd.first()[0]
to get the underlying rdd from the returned dataframe
Shouldn't you use max(rec_insert_date) instead of count(rec_insert_date)?
You have two options on passing values returned from one query to another:
Use collect, which will trigger computations and assign returned value to a variable
max_date = sqlContext.sql("select max(rec_insert_date) from table").collect()[0][0] # max_date has actual date assigned to it
incremetal_data = sqlConext.sql(s"select count(1) from table2 where rec_insert_date > '{}'".format(max_date))
Another (and better) option is to use Dataframe API
from pyspark.sql.functions import col, lit
incremental_data = sqlContext.table("table2").filter(col("rec_insert_date") > lit(max_date))
Use cross join - it should be avoided if you have more than 1 result from the first query. The advantage is that you don't break the graph of processing, so everything can be optimized by Spark.
max_date_df = sqlContext.sql("select max(rec_insert_date) as max_date from table") # max_date_df is a dataframe with just one row
incremental_data = sqlContext.table("table2").join(max_date_df).filter(col("rec_insert_date") > col("max_date"))
As for you first question how to call large hql files from Spark:
If you're using Spark 1.6 then you need to create a HiveContext https://spark.apache.org/docs/1.6.1/sql-programming-guide.html#hive-tables
If you're using Spark 2.x then while creating SparkSession you need to enable Hive Support https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
You can start by inserting im in a sqlContext.sql(...) method, from my experience this usually works and is a nice starting point to rewrite the logic to DataFrames/Datasets API. There may be some issues while running it in your cluster because your queries will be executed by Spark's SQL engine (Catalyst) and won't be passed to Hive.
I have a difficulty when working with data frames in spark with Scala. If I have a data frame that I want to extract a column of unique entries, when I use groupBy I don't get a data frame back.
For example, I have a DataFrame called logs that has the following form:
machine_id | event | other_stuff
34131231 | thing | stuff
83423984 | notathing | notstuff
34131231 | thing | morestuff
and I would like the unique machine ids where event is thing stored in a new DataFrame to allow me to do some filtering of some kind. Using
val machineId = logs
.where($"event" === "thing")
.select("machine_id")
.groupBy("machine_id")
I get a val of Grouped Data back which is a pain in the butt to use (or I don't know how to use this kind of object properly). Having got this list of unique machine id's, I then want to use this in filtering another DataFrame to extract all events for individual machine ids.
I can see I'll want to do this kind of thing fairly regularly and the basic workflow is:
Extract unique id's from a log table.
Use unique ids to extract all events for a particular id.
Use some kind of analysis on this data that has been extracted.
It's the first two steps I would appreciate some guidance with here.
I appreciate this example is kind of contrived but hopefully it explains what my issue is. It may be I don't know enough about GroupedData objects or (as I'm hoping) I'm missing something in data frames that makes this easy. I'm using spark 1.5 built on Scala 2.10.4.
Thanks
Just use distinct not groupBy:
val machineId = logs.where($"event"==="thing").select("machine_id").distinct
Which will be equivalent to SQL:
SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
GroupedData is not intended to be used directly. It provides a number of methods, where agg is the most general, which can be used to apply different aggregate functions and convert it back to DataFrame. In terms of SQL what you have after where and groupBy is equivalent to something like this
SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id
where ... has to be provided by agg or equivalent method.
A group by in spark followed by aggregation and then a select statement will return a data frame. For your example it should be something like:
val machineId = logs
.groupBy("machine_id", "event")
.agg(max("other_stuff") )
.select($"machine_id").where($"event" === "thing")