We are using Structured Streaming to perform aggregations on real-time data. I'm creating a configurable Spark job that is given a configuration and uses it to group rows across tumbling windows and perform aggregations. I know how to do this with the functional interface.
Here is a code fragment using the functional interface
var valStream = sparkSession.sql(config.aggSelect) //<- 1
.withWatermark("eventTime", "15 minutes") //<- 2
.groupBy(window($"eventTime", "1 minute"), $"aggCol1", $"aggCol2") //<- 3
.agg(count($"aggCol2").as("myAgg2Count"))
Line 1 executes a SQL string that comes from the configuration. I would like to move lines 2 & 3 into the SQL syntax so that the grouping and aggregations are specified in the configuration.
Does anyone out there know how to specify this in Spark SQL?
withWatermark does not have a corresponding SQL syntax. You have to use the dataframe API.
For aggregation, you can do something like
select count(aggCol2) as myAgg2Count
from xxx
group by window(eventTime, '1 minute'), aggCol1, aggCol2
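That said, you can still keep the grouping and aggregation in the configuration: set the watermark through the DataFrame API, register the watermarked stream as a temporary view, and run the configured SQL against that view. A minimal sketch, assuming the view name "events" and that your Spark version allows registering a streaming DataFrame as a temp view:
val watermarked = sparkSession.sql(config.aggSelect)
  .withWatermark("eventTime", "15 minutes")    // watermark still set via the DataFrame API

watermarked.createOrReplaceTempView("events")  // expose the watermarked stream to SQL

// The grouping/aggregation can now come from the configuration as a SQL string:
val valStream = sparkSession.sql(
  """SELECT window(eventTime, '1 minute') AS win, aggCol1, aggCol2,
    |       count(aggCol2) AS myAgg2Count
    |FROM events
    |GROUP BY window(eventTime, '1 minute'), aggCol1, aggCol2""".stripMargin)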
Related question:
Long story short, I'm tasked with converting files from SparkSQL to PySpark as my first task at my new job.
However, I'm unable to see many differences outside of syntax. Is SparkSQL an earlier version of PySpark or a component of it or something different altogether?
And yes, it's my first time using these tools. But I have experience with both Python and SQL, so it doesn't seem to be that difficult a task. I just want a better understanding.
Example of the syntax difference I'm referring to:
from pyspark.sql import functions as F

(spark.read.table("db.table1").alias("a")
    .filter(F.col("a.field1") == 11)
    .join(
        other=spark.read.table("db.table2").alias("b"),
        on='field2',
        how='left'
    ))
Versus
df = spark.sql(
"""
SELECT b.field1,
       CASE WHEN ...
            THEN ...
            ELSE ...
       END AS field2
FROM db.table1 a
LEFT JOIN db.table2 b
  ON a.field1 = b.field1
WHERE a.field1 = {}
""".format(field1)
)
From the documentation: PySpark is the Python interface to Spark, through which you have access to Spark's components, namely Spark Core, Spark SQL, Spark Streaming, and Spark MLlib. So Spark SQL is not an earlier version of PySpark; it is a component of Spark that PySpark exposes, both through raw SQL strings and through the DataFrame API.
As for the task you have been assigned, it looks like you are expected to translate SQL-heavy code into the more PySpark-friendly DataFrame style.
One can convert a raw SQL string into a DataFrame. But is it also possible the other way around, i.e., get the SQL representation for the query logic of a (derived) Spark DataFrame?
// Source data
val a = Seq(7, 8, 9, 7, 8, 7).toDF("foo")
// Query using DataFrame functions
val b = a.groupBy($"foo").agg(count("*") as "occurrences").orderBy($"occurrences")
b.show()
// Convert a SQL string into a DataFrame
val sqlString = "SELECT foo, count(*) as occurrences FROM a GROUP BY foo ORDER BY occurrences"
a.createOrReplaceTempView("a")
val c = spark.sql(sqlString)
c.show()
// "Convert" a DataFrame into a SQL string
b.toSQLString() // Error: This function does not exist.
It is not possible to "convert" a DataFrame into an SQL string because Spark does not know how to write SQL queries and it does not need to.
I find it useful to recall how DataFrame code or an SQL query gets handled by Spark. This is done by Spark's Catalyst optimizer, and it goes through four transformation phases: analysis, logical optimization, physical planning, and code generation.
In the first phase (analysis), the Spark SQL engine generates an abstract syntax tree (AST) for the SQL or DataFrame query. This tree is the main data type in Catalyst (see section 4.1 in the paper Spark SQL: Relational Data Processing in Spark), and it is used to create the logical plan and eventually the physical plan. You get a representation of those plans if you use the explain API that Spark offers.
Although it is clear what you mean by "one can convert a raw SQL string into a DataFrame", it helps to be more precise. We are not converting an SQL string into a DataFrame (hence the quotation marks); rather, you applied your SQL knowledge, because SQL is a syntax Spark can parse to understand your intentions. You also cannot type in just any SQL query, as it could still fail in the analysis phase when it is checked against the catalog. So the SQL string is just an agreed-upon way for you to give Spark instructions. The query gets parsed, transformed into an AST (as described above) and, after going through the other three phases, ends up as RDD-based code. The result of executing SQL through the sql API is a DataFrame, which you can easily turn into an RDD with df.rdd.
Overall, there is no need for Spark to render any code, and in particular any DataFrame code, into SQL syntax that you could then get back out of Spark. The AST is the internal abstraction: Spark converts DataFrame code directly into an AST, so there is no intermediate SQL query that could be extracted.
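For completeness, here is a small sketch of the explain API mentioned above, applied to the DataFrame b from the question; the plans it prints are the closest thing to a textual representation of the query logic that Spark exposes (they are plans, not SQL):
// Print the parsed, analyzed, optimized and physical plans for b
b.explain(true)

// The same information is available programmatically via QueryExecution
println(b.queryExecution.analyzed)      // analyzed logical plan
println(b.queryExecution.executedPlan)  // physical plan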
No. There is no method that can get the SQL query from a dataframe.
You will have to create the query yourself by looking at all the filters and selects you used to create the dataframe.
Oversimplified Scenario:
A process generates monthly data in an S3 file. The number of fields can be different in each monthly run. Based on this data in S3, we load the data into a table, and we manually run a SQL query for a few metrics (manually, because the number of fields can change in each run with the addition or deletion of a few columns). There are more calculations/transforms on this data, but as a starter I'm presenting the simpler version of the use case.
Approach:
Considering the schema-less nature of the data (the number of fields in the S3 file can differ in each run as columns are added or removed, which currently requires manual changes to the SQL every time), I'm planning to explore Spark/Scala so that we can read directly from S3 and dynamically generate the SQL based on the fields.
Query:
How can I achieve this in Scala/Spark SQL/DataFrames? The S3 file contains only the required fields from each run, so reading the dynamic fields from S3 is not an issue; the DataFrame takes care of that. The issue is how to generate the Spark SQL/DataFrame API code to handle them.
I can read the S3 file into a DataFrame and register it with createOrReplaceTempView to write SQL against it, but that still leaves me manually changing the Spark SQL whenever a new field is added in the next run. What is the best way to dynamically generate the SQL, or is there a better way to handle the issue?
Usecase-1:
First-run
dataframe: customer, month_1_count (here the dataframe points directly at S3, which has only the required attributes)
--sample code
SELECT customer,sum(month_1_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count").show()
Second-Run - One additional column was added
dataframe: customer, month_1_count, month_2_count (here the dataframe points directly at S3, which has only the required attributes)
--Sample SQL
SELECT customer,sum(month_1_count),sum(month_2_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count","month_2_count").show()
I'm new to Spark/Scala; it would be helpful if you could provide some direction so that I can explore further.
It sounds like you want to perform the same operation over and over again on new columns as they appear in the dataframe schema? This works:
from pyspark.sql import functions
#search for column names you want to sum, I put in "month"
column_search = lambda col_names: 'month' in col_names
#get column names of temp dataframe w/ only the columns you want to sum
relevant_columns = original_df.select(*filter(column_search, original_df.columns)).columns
#create dictionary with relevant column names to be passed to the agg function
columns = {col_names: "sum" for col_names in relevant_columns}
#apply agg function with your groupBy, passing in columns dictionary
grouped_df = original_df.groupBy("customer").agg(columns)
#show result
grouped_df.show()
Some important concepts can help you to learn:
DataFrames expose their column names as a list: dataframe.columns
Functions can be applied to lists to create new lists, as in column_search
The agg function accepts multiple expressions in a dictionary, which is what I pass in as columns
Spark is lazy, so it doesn't change data state or perform operations until you perform an action like show(). This means that creating a temporary dataframe just to use one piece of it (the column names, as I do above) is not costly, even though it may seem inefficient if you're used to SQL.
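Since the question mentions Spark/Scala, here is a rough Scala equivalent of the same idea; the "month" substring and the originalDf/customer names are assumptions carried over from the example above:
import org.apache.spark.sql.functions.sum

// Pick the columns to aggregate from the DataFrame's schema at runtime
val relevantColumns = originalDf.columns.filter(_.contains("month"))

// Build one sum(...) expression per matching column
val aggExprs = relevantColumns.map(c => sum(c).as(s"sum_$c"))

// agg takes a first expression plus varargs, so split head/tail
// (assumes at least one matching column exists)
val groupedDf = originalDf.groupBy("customer").agg(aggExprs.head, aggExprs.tail: _*)

groupedDf.show()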
I have a rather peculiar problem. In a DSE spark analytics engine I produce frequent stats that I store to cassandra in a small table. Since I keep the table trimmed and it is supposed to serve a web interface with consolidated information, I simply want to query the whole table in spark and send the results over an API. I have tried two methods for this:
// Method 1: read the table through the Cassandra connector (needs scala.util.Try in scope)
val a = Try(sc.cassandraTable[Data](keyspace, table).collect()).toOption

// Method 2: query the table through Spark SQL
val query = "SELECT * FROM keyspace.table"
val df = spark.sqlContext.sql(query)
val list = df.collect()
I am doing this in a Scala program. When I use method 1, the Spark job mysteriously gets stuck, showing stage 10 of 12 forever (verified in the logs and on the Spark jobs page). When I use the second method it simply tells me that no such table exists:
Unknown exception: org.apache.spark.sql.AnalysisException: Table or view not found: keyspace1.table1; line 1 pos 15;
'Project [*]
+- 'UnresolvedRelation keyspace1.table1
Interestingly, I tested both methods in spark shell on the cluster and they work just fine. My program has plenty of other queries done using method 1 and they all work fine, the key difference being that in each of them the main partition key always has a condition on it unlike in this query (holds true for this particular table too).
Here is the table structure:
CREATE TABLE keyspace1.table1 (
userid text,
stat_type text,
event_time bigint,
stat_value double,
PRIMARY KEY (userid, stat_type))
WITH CLUSTERING ORDER BY (stat_type ASC)
Any solid diagnosis of the problem, or a workaround, would be much appreciated.
When you do a select * without a where clause in Cassandra, you're actually performing a full range query. This is not an intended use case in Cassandra (aside from peeking at the data, perhaps). Just for the fun of it, try replacing it with select * from keyspace.table limit 10 and see if it works; it might...
Anyway, my gut feeling says your problem isn't with Spark, but with Cassandra. If you have visibility into Cassandra metrics, look at the range query latencies.
Now, if your code above is complete, the reason that method 1 freezes while method 2 doesn't is that method 1 contains an action (collect), while method 2 doesn't involve any Spark action, just schema inference. Should you add df.collect() to method 2, you will face the same issue with Cassandra.
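To make the suggested test concrete, here is a quick sketch of the limit check (it assumes keyspace1.table1 resolves in Spark SQL, which the AnalysisException above suggests is a separate problem to sort out first):
// Bound the scan so Cassandra only has to return a handful of rows
val sample = spark.sql("SELECT * FROM keyspace1.table1 LIMIT 10")
sample.show()

// If the bounded query returns quickly while the unbounded SELECT * hangs,
// the bottleneck is the full-range read on the Cassandra side, not Spark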
I am using Apache Spark in Scala to run aggregations on multiple columns in a dataframe, for example:
select column1, sum(1) as count from df group by column1
select column2, sum(1) as count from df group by column2
The actual aggregation is more complicated than just the sum(1), but that's beside the point.
Query strings such as the examples above are compiled for each variable that I would like to aggregate, and I execute each string through a Spark SQL context to create a corresponding dataframe that represents the aggregation in question.
The nature of my problem is that I would have to do this for thousands of variables.
My understanding is that Spark will have to "read" the main dataframe each time it executes an aggregation.
Is there maybe an alternative way to do this more efficiently?
Thanks for reading my question, and thanks in advance for any help.
Cache the data frame after you build the DataFrame with your source data. Also, to avoid writing all the queries in the code, put them in a file and pass the file at run time. Have something in your code that can read the file, and then you can run your queries. The best part about this approach is that you can change your queries by updating the file rather than the application. Just make sure you find a way to give the outputs unique names.
In PySpark, it would look something like this:
dataframe = sqlContext.read.parquet("/path/to/file.parquet")
# do your manipulations/filters
dataframe.cache()
dataframe.createOrReplaceTempView("df")  # so the query strings can refer to the data as "df"

queries = ...  # however you want to read/parse the query file

for i, query in enumerate(queries):
    output = sqlContext.sql(query)
    # give each output a unique name so results don't overwrite each other
    output.write.parquet("/path/to/output_{}.parquet".format(i))
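Since the question itself is in Scala, here is a rough Scala equivalent of the sketch above; the paths, the view name "df", and the one-query-per-line file format are placeholders:
val dataframe = spark.read.parquet("/path/to/file.parquet")
// do your manipulations/filters, then keep the result in memory across queries
dataframe.cache()
dataframe.createOrReplaceTempView("df")   // so the SQL strings can refer to the data as "df"

// However you want to read/parse the query file; one query per line here
val queries = scala.io.Source.fromFile("/path/to/queries.sql").getLines().toSeq

queries.zipWithIndex.foreach { case (query, i) =>
  val output = spark.sql(query)
  output.write.parquet(s"/path/to/output_$i.parquet")   // unique output name per query
}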