Using CQL syntax query with DataFrame in PySpark - pyspark

We're trying to use PySpark with the Cassandra connector to migrate Cassandra data between clusters.
We need to be able to use CQL-specific syntax in our queries to limit the data we migrate. Specifying the full partition key (which consists of two fields) still selects too much data, so we need a more restrictive filtering condition.
The selection criteria work fine in cqlsh, where we can use a CQL statement like:
select * from ns.table where pk1 = 'val1' and pk2 = 'val2' and timestamp > maxTimeuuid("abc") and timestamp < minTimeuuid("xyz")
// pk1, pk2 - partition key fields, timestamp - next field in primary key
but the problem is that minTimeuuid() and maxTimeuuid() are CQL-specific functions, and it seems that a PySpark DataFrame does not allow filtering based on Cassandra-specific functions. Filtering works only when the condition resembles regular SQL syntax (field = value comparisons).
We tried filtering the DataFrame with a .filter() condition and with the SQLContext sql() function; either one gives errors about unknown functions (minTimeuuid() or maxTimeuuid()).
Similarly, we tried the token() function, and Spark flags it as an unknown function as well.
Are there any examples of how CQL-specific functions can be used as filtering conditions with PySpark DataFrames (or with RDDs, if not with DataFrames)?

Related

Pyspark window functions (lag and row_number) generate inconsistent results

I have been fighting an issue with window functions in pyspark for a few weeks now.
I have the following query to detect changes for a given key field:
rowcount = sqlContext.sql(f"""
with temp as (
select key, timestamp, originalcol, lag(originalcol,1) over (partition by key order by timestamp) as lg
from snapshots
where originalcol is not null
)
select count(1) from (
select *
from temp
where lg is not null
and lg != originalcol
)
""")
Data types are as follows:
key: string (not null)
timestamp: timestamp (unique, not null)
originalcol: timestamp
The snapshots table contains over a million records. This query is producing different row counts after each execution: 27952, 27930, etc. while the expected count is 27942. I can say it is only approximately correct, with a deviation of around 10 records, however this is not acceptable as running the same function twice with the same inputs should produce the same results.
I have a similar problem with row_number() over the same window, then filtering for row_number = 1, but I guess the issue should be related.
I tried the query in an AWS Glue job, both as PySpark and as Athena SQL, and the inconsistencies are similar.
Any clue about what I am doing wrong here?
Spark is pretty picky about some silly things...
The condition lg != originalcol does not match NULL values, so the first row of each window partition is always filtered out (the first value returned by LAG is always NULL).
The same thing happens when you put NULL in an IN list.
Another example where NULL rows are filtered out:
where test in (Null, 1)
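To see the NULL behaviour in isolation, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Spark SQL. That substitution is an assumption made only so the example runs without a cluster; both engines follow the standard SQL three-valued logic for these comparisons. The table name t is made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def evaluate(expr):
    """Evaluate a scalar SQL expression; SQL NULL comes back as Python None."""
    return conn.execute(f"SELECT {expr}").fetchone()[0]

null_vs_value = evaluate("NULL != 1")          # NULL, i.e. neither true nor false
in_with_match = evaluate("1 IN (NULL, 1)")     # true, because 1 matches a list element
in_without_match = evaluate("2 IN (NULL, 1)")  # NULL, because of the NULL in the list

# WHERE keeps a row only when the predicate is true, so NULL results drop the row.
conn.execute("CREATE TABLE t (test INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(None,), (1,), (2,)])
kept = [row[0] for row in conn.execute("SELECT test FROM t WHERE test IN (NULL, 1)")]
print(null_vs_value, in_with_match, in_without_match, kept)
```

Only the row with test = 1 survives the filter; the NULL row is not matched by the NULL in the list.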
After a bit of research, I discovered that the timestamp column is not unique. SQL Server happens to produce the same execution plan and results every time, but PySpark and Presto order tied rows arbitrarily in the window function's ORDER BY clause and so produce different results on each execution. If anything can be learned from this experience, it is to double-check the partition and order-by keys in a window function.
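The effect of tied ORDER BY keys can be reproduced without Spark. The sketch below is plain Python that mimics lag() over a window: it is not the Spark implementation, and the seq column is a hypothetical tiebreaker that does not exist in the original table. Python's sort is stable, so the arrival order of the rows decides how ties are ordered, much as the partition layout does in a distributed engine.

```python
from itertools import groupby

def count_changes(rows, order_key):
    """Count rows whose value differs from the previous row in the same key partition."""
    changes = 0
    ordered = sorted(rows, key=lambda r: (r["key"], order_key(r)))
    for _, partition in groupby(ordered, key=lambda r: r["key"]):
        previous = None  # plays the role of lag(); None for the first row
        for row in partition:
            if previous is not None and previous != row["originalcol"]:
                changes += 1
            previous = row["originalcol"]
    return changes

rows = [
    {"key": "a", "timestamp": 1, "seq": 1, "originalcol": "x"},
    {"key": "a", "timestamp": 2, "seq": 2, "originalcol": "x"},  # tied timestamp
    {"key": "a", "timestamp": 2, "seq": 3, "originalcol": "y"},  # tied timestamp
    {"key": "a", "timestamp": 3, "seq": 4, "originalcol": "y"},
]
arrival_a = rows
arrival_b = [rows[0], rows[2], rows[1], rows[3]]  # same data, tied rows swapped

by_timestamp = lambda r: r["timestamp"]
by_timestamp_and_seq = lambda r: (r["timestamp"], r["seq"])

print(count_changes(arrival_a, by_timestamp), count_changes(arrival_b, by_timestamp))
print(count_changes(arrival_a, by_timestamp_and_seq),
      count_changes(arrival_b, by_timestamp_and_seq))
```

With timestamp alone the two arrival orders give different counts; adding a unique tiebreaker to the ordering makes the count independent of arrival order.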

How to split a list into multiple partitions and send them to executors

When we use Spark to read data from CSV or a DB as follows, it automatically splits the data into multiple partitions and sends them to the executors:
spark
.read
.option("delimiter", ",")
.option("header", "true")
.option("mergeSchema", "true")
.option("codec", properties.getProperty("sparkCodeC"))
.format(properties.getProperty("fileFormat"))
.load(inputFile)
Currently, I have an id list such as:
[1,2,3,4,5,6,7,8,9,...1000]
What I want to do is split this list into multiple partitions and send them to the executors, and in each executor run the SQL as
ids.foreach(id => {
select * from table where id = id
})
When we load data from Cassandra, the connector generates the query as:
select columns from table where Token(k) >= ? and Token(k) <= ?
This means the connector scans the whole table. I don't need to scan the whole table; I just want to get the rows where k (the partition key) is in the id list.
The table schema is:
CREATE TABLE IF NOT EXISTS tab.events (
k int,
o text,
event text,
PRIMARY KEY (k,o)
);
Or how can I use Spark to load data from Cassandra using a predefined SQL statement without scanning the whole table?
You simply need to use the joinWithCassandraTable function to select only the data required for your operation. Be aware that this function is only available via the RDD API.
Something like this:
val joinWithRDD = your_df.rdd.joinWithCassandraTable("tab","events")
You need to make sure that the column name in your DataFrame matches the partition key name in Cassandra; see the documentation for more information.
The DataFrame implementation is only available in the DSE version of the Spark Cassandra Connector, as described in the following blog post.
Update, September 2020: support for join with Cassandra was added in Spark Cassandra Connector 2.5.0.
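For the first half of the question (splitting a local id list into partitions), the idea can be sketched in plain Python. In Spark itself, sc.parallelize(ids, numSlices) distributes a local list across partitions; the helpers below (split_into_partitions and build_query are hypothetical names, not connector APIs) only illustrate what each partition would then query, using the tab.events schema from the question.

```python
def split_into_partitions(ids, num_partitions):
    """Split ids into contiguous slices whose sizes differ by at most one."""
    base, extra = divmod(len(ids), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)
        partitions.append(ids[start:start + size])
        start += size
    return partitions

def build_query(partition):
    """Build one CQL statement restricted to the partition keys in this slice."""
    key_list = ", ".join(str(k) for k in partition)
    return f"SELECT k, o, event FROM tab.events WHERE k IN ({key_list})"

ids = list(range(1, 11))
# Skip empty slices, which occur when there are more partitions than ids.
queries = [build_query(part) for part in split_into_partitions(ids, 3) if part]
for query in queries:
    print(query)
```

Each statement restricts the read to specific partition keys instead of token ranges, which is the same effect joinWithCassandraTable achieves for you.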

Error executing Apache BEAM sql query - Use a Window.into or Window.triggering transform prior to GroupByKey

How do I include a Window.into or Window.triggering transform prior to GroupByKey in Beam SQL?
I have the following 2 tables:
1st table
CREATE TABLE table1(
field1 varchar
,field2 varchar
)
2nd table
CREATE TABLE table2(
field1 varchar
,field3 varchar
)
And I am writing the result to a 3rd table
CREATE TABLE table3(
field1 varchar
,field3 varchar
)
The first 2 tables read data from a Kafka stream; I join them and insert the result into the third table using the following query. The first 2 tables are unbounded.
INSERT INTO table3
(field1,
field3)
SELECT a.field1,
b.field3
FROM table1 a
JOIN table2 b
ON a.field1 = b.field1
I am getting the following error:
Caused by: java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey.
    at org.apache.beam.sdk.transforms.GroupByKey.applicableTo(GroupByKey.java:173)
    at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:204)
    at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:120)
    at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
    at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
    at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
    at org.apache.beam.sdk.transforms.join.CoGroupByKey.expand(CoGroupByKey.java:126)
    at org.apache.beam.sdk.transforms.join.CoGroupByKey.expand(CoGroupByKey.java:74)
    at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
    at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
    at org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple.apply(KeyedPCollectionTuple.java:107)
    at org.apache.beam.sdk.extensions.joinlibrary.Join.innerJoin(Join.java:59)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamJoinRel.standardJoin(BeamJoinRel.java:217)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamJoinRel.buildBeamPipeline(BeamJoinRel.java:161)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamProjectRel.buildBeamPipeline(BeamProjectRel.java:68)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamAggregationRel.buildBeamPipeline(BeamAggregationRel.java:80)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamIOSinkRel.buildBeamPipeline(BeamIOSinkRel.java:64)
    at org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.compileBeamPipeline(BeamQueryPlanner.java:127)
    at com.dss.tss.v2.client.BeamSqlCli.compilePipeline(BeamSqlCli.java:95)
    at com.dss.test.v2.client.SQLCli.main(SQLCli.java:100)
This is a limitation of the current Beam SQL implementation. You need to define windows and then join the inputs per window.
A couple of examples of how to do joins and windowing in Beam SQL:
a complex SQL query with a HOP window and joins;
a test that defines a window in Java outside of SQL and then applies a query with a join;
examples of other window functions can be found here.
Background
The problem is that it is hard to define a join operation for unbounded data streams in general; this is not limited to Beam SQL.
Imagine, for example, a data processing system that receives inputs from 2 sources and has to match records between them. From a high-level perspective, such a system has to keep all the data it has seen so far, and for each new record it has to go over all records from the other input source to see if there is a match. This works fine when the data sources are finite and small. In the simple case you could just load everything into memory, match the data from the sources, and produce the output.
With streaming data you cannot keep caching it forever. What if the data never stops coming? It is also unclear when you want to emit the data. If you have an outer join, when do you decide that there is no matching record in the other input?
For example, see the explanation of unbounded PCollections in the GroupByKey section of the Beam guide. Joins in Beam are usually implemented on top of it using CoGroupByKey (Beam SQL joins as well).
All of these questions can probably be answered for a specific pipeline, but they are hard to solve in the general case. The current approach in the Beam SDK and Beam SQL is to delegate them to the user to solve for the concrete business case. Beam lets users decide what data to aggregate together into a window, how long to wait for late data, and when to emit the results. There are also things like state cells and timers for more granular control. This allows a programmer writing a pipeline to explicitly define the behavior and work around these problems somewhat, with (a lot of) extra complexity.
Beam SQL is implemented on top of regular Beam SDK concepts and is bound by the same limitations. But it has more limitations of its own. For example, there is no SQL syntax to define triggers, state, or custom windows, and you cannot write a custom ParDo that could keep state in an external service.
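To make "join the inputs per window" concrete, here is a small conceptual simulation in plain Python. It does not use the Beam API; the 60-second window size, the event timestamps, and the function names are assumptions for the illustration. Each event is assigned to a fixed window by its timestamp, and only events that share both the window and field1 are joined, so the state kept per window is bounded.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed fixed window size

def window_start(ts):
    """Return the start of the fixed window that contains timestamp ts."""
    return ts - ts % WINDOW_SECONDS

def windowed_join(left, right):
    """Inner-join (ts, field1, value) events sharing field1 within the same window."""
    buckets = defaultdict(lambda: ([], []))
    for ts, key, value in left:
        buckets[(window_start(ts), key)][0].append(value)
    for ts, key, value in right:
        buckets[(window_start(ts), key)][1].append(value)
    results = []
    for (start, key), (left_values, right_values) in sorted(buckets.items()):
        for left_value in left_values:
            for right_value in right_values:
                results.append((start, key, left_value, right_value))
    return results

# (timestamp in seconds, field1, field2) and (timestamp in seconds, field1, field3)
table1 = [(5, "k1", "a"), (70, "k1", "b"), (65, "k2", "c")]
table2 = [(30, "k1", "x"), (130, "k1", "y"), (100, "k2", "z")]
joined = windowed_join(table1, table2)
print(joined)
```

Here the k1 events at 70 and 130 seconds never meet because they fall into different windows; that loss is the trade-off for being able to emit results and discard state once a window closes.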

Cassandra group by and filter results

I'm trying to mimic something like this:
Given a table test:
CREATE TABLE myspace.test (
item_id text,
sub_id text,
quantity bigint,
status text,
PRIMARY KEY (item_id, sub_id)
);
In SQL, we could do:
select * from (select item_id, sum(quantity) as quan
from test where status <> 'somevalue'
group by item_id) sub
where sub.quan >= 10;
i.e. group by item_id and then filter out the results with a total of less than 10.
Cassandra is not designed for this kind of thing, though I could mimic group by using a user-defined aggregate function:
CREATE FUNCTION group_sum_state
(state map<text, bigint>, item_id text, val bigint)
CALLED ON NULL INPUT
RETURNS map<text, bigint>
LANGUAGE java
AS $$Long current = (Long)state.get(item_id);
if(current == null) current = 0l;
state.put(item_id, current + val); return state;$$;
CREATE AGGREGATE group_sum(text, bigint)
SFUNC group_sum_state
STYPE map<text, bigint>
INITCOND { }
And use it as group by (probably this is going to have very bad performance, but still):
cqlsh:myspace> select group_sum(item_id, quantity) from test;
mysales_data.group_sum(item_id, quantity)
-------------------------------------------
{'123': 33, '456': 14, '789': 15}
But it seems to be impossible to filter by map values, either with a final function for the aggregate or with a separate function. I could define a function like this:
CREATE FUNCTION myspace.filter_group_sum
(group map<text, bigint>, vallimit bigint)
CALLED ON NULL INPUT
RETURNS map<text, bigint>
LANGUAGE java
AS $$
java.util.Iterator<java.util.Map.Entry<String, Long>> entries =
group.entrySet().iterator();
while(entries.hasNext()) {
Long val = entries.next().getValue();
if (val < vallimit)
entries.remove();
};
return group;$$;
But there is no way to call it and pass a constant:
select filter_group_sum(group_sum(item_id, quantity), 15) from test;
SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query]
message="line 1:54 no viable alternative at input '15'
(...(group_sum(item_id, quantity), [15]...)">
It complains about the constant 15.
Sorry for the long post, I need to provide all the details to explain what I need.
So my questions are:
Is there a way to pass a constant to a user-defined function in Cassandra? Or what alternatives do I have to implement a filtered group by?
More general question: what is the proper data design for Cassandra to cover such a use case for a real-time query-serving application? Say I have a web app that takes the limit from the UI and needs to return all the items whose total quantity is larger than the given limit. The tables are going to be quite large, around 10 billion records.
Vanilla Cassandra is a poor choice for ad hoc queries. DataStax Enterprise has added some of this functionality via integrations with Spark and Solr. The Spark integration is also open source, but you wouldn't want to do this for low-latency queries. If you need real-time queries, you're going to have to aggregate outside of Cassandra (in Spark or Storm, for example), then write back the aggregates to be consumed by your app. You can also look at Stratio's Lucene integration, which might help you for some of your queries.
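As a rough illustration of aggregating outside Cassandra, the sketch below does the group-by and the threshold filter in plain Python over rows shaped like the test table (item_id, sub_id, quantity, status). The sample rows and function names are invented for the example; in practice the rows would come from a Spark or Storm job and the totals would be written back to a table the app reads.

```python
from collections import defaultdict

def totals_by_item(rows, excluded_status):
    """Sum quantity per item_id, skipping rows with the excluded status."""
    totals = defaultdict(int)
    for item_id, sub_id, quantity, status in rows:
        if status != excluded_status:
            totals[item_id] += quantity
    return dict(totals)

def items_at_least(totals, limit):
    """Keep only the items whose total reaches the limit (the HAVING step)."""
    return {item: total for item, total in totals.items() if total >= limit}

rows = [
    ("123", "a", 20, "ok"),
    ("123", "b", 13, "ok"),
    ("456", "a", 14, "ok"),
    ("789", "a", 15, "ok"),
    ("789", "b", 7, "somevalue"),  # excluded by the status filter
]
totals = totals_by_item(rows, excluded_status="somevalue")
print(totals)
print(items_at_least(totals, 15))
```

Because the limit is applied after the totals exist, the app can pass any limit from the UI without recomputing the aggregation.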
I ran across your question when looking for information on passing a constant to a user defined function.
The closest I can get to passing a constant is to pass a static column for the parameter for which you want to pass a constant. So if you update the static column before using the UDF, then you can pass that column. This will only work if you have a single client running such a query at a time, since the static column is visible to all clients. See this answer for an example:
Passing a constant to a UDF

MAX(), DISTINCT and group by in Cassandra

I am trying to remodel a SQL database in Cassandra so that I can find Cassandra equivalents for the SQL queries. I use CQL 3 and Cassandra v1.2. I modeled the database design in Cassandra so that it supports the ORDER BY clauses, and denormalized the tables to support the join operations. However, I am at sea when it comes to equivalents for DISTINCT, SUM() and GROUP BY:
SELECT a1,MAX(b1) FROM demo1 group by a1
SELECT DISTINCT (a2) FROM demo2 where b2='sea'
SELECT sum(a3), sum(b3) from demo3 where c3='water' and d3='ocean'
This has been a showstopper for my work for the past couple of days. Is there a way to model the database schema in Cassandra to support queries of this kind? I can't think of any. How can such queries be implemented using Cassandra?
I read that a Hive layer over Cassandra can possibly make these queries work. I am just wondering if that is the only way such queries can be supported in Cassandra. Please advise on any other possible methods.
With Cassandra you solve these kinds of problems by doing more work when you insert your data -- which sounds like it would be slow, but Cassandra is designed for fast writes, and you're probably going to read the data many more times than you write it so it makes sense when you consider the whole system.
I can't tell you exactly how to create your tables to model your problem because it will depend a lot on the details. You need to figure a schema that lets you get the data without performing any on-the-fly aggregations. Think about how you would create views for the queries in an RDBMS, and then try to think how you would insert data directly into those views, not into the underlying tables. That's kind of how you model things in Cassandra.
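A minimal sketch of that "insert into the views" idea, with plain Python dictionaries standing in for Cassandra tables. The class and attribute names are hypothetical; each dictionary corresponds to one table keyed by what the query filters or groups on, and the aggregation happens at write time so each read is a single lookup.

```python
class QueryTables:
    """One pre-aggregated 'table' per query, updated on every insert."""

    def __init__(self):
        self.max_b1_by_a1 = {}       # serves: SELECT a1, MAX(b1) ... GROUP BY a1
        self.distinct_a2_by_b2 = {}  # serves: SELECT DISTINCT a2 ... WHERE b2 = ?
        self.sums_by_c3_d3 = {}      # serves: SELECT sum(a3), sum(b3) ... WHERE c3 = ? AND d3 = ?

    def insert_demo1(self, a1, b1):
        current = self.max_b1_by_a1.get(a1)
        self.max_b1_by_a1[a1] = b1 if current is None else max(current, b1)

    def insert_demo2(self, a2, b2):
        self.distinct_a2_by_b2.setdefault(b2, set()).add(a2)

    def insert_demo3(self, a3, b3, c3, d3):
        sum_a3, sum_b3 = self.sums_by_c3_d3.get((c3, d3), (0, 0))
        self.sums_by_c3_d3[(c3, d3)] = (sum_a3 + a3, sum_b3 + b3)

tables = QueryTables()
tables.insert_demo1("k", 4)
tables.insert_demo1("k", 9)
tables.insert_demo2("boat", "sea")
tables.insert_demo2("boat", "sea")
tables.insert_demo2("fish", "sea")
tables.insert_demo3(1, 10, "water", "ocean")
tables.insert_demo3(2, 20, "water", "ocean")
print(tables.max_b1_by_a1, tables.distinct_a2_by_b2, tables.sums_by_c3_d3)
```

In a real cluster the read-then-write in insert_demo1 and insert_demo3 would not be atomic across concurrent clients; this sketch ignores that and only shows the shape of the schema.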
Although this is an old question, it appears in Google search results pretty high. So I wanted to give an update.
Cassandra 2.2+ supports user-defined functions and user-defined aggregates. WARNING: this does not mean that you don't have to do data modeling anymore (as was pointed out by @Theo); rather, it just allows you to slightly preprocess your data upon retrieval.
SELECT DISTINCT (a2) FROM demo2 where b2='sea'
To implement DISTINCT, you should define a function and an aggregate. I'll call both the function and the aggregate uniq rather than distinct to emphasize the fact that it is user defined.
CREATE OR REPLACE FUNCTION uniq(state set<text>, val text)
CALLED ON NULL INPUT RETURNS set<text> LANGUAGE java
AS 'state.add(val); return state;';
CREATE OR REPLACE AGGREGATE uniq(text)
SFUNC uniq STYPE set<text> INITCOND {};
Then you use it as follows:
SELECT uniq(a2) FROM demo2 where b2='sea';
SELECT sum(a3), sum(b3) from demo3 where c3='water' and d3='ocean'
SUM is provided out of the box and works as you would expect. See system.sum.
SELECT a1,MAX(b1) FROM demo1 group by a1
GROUP BY is a tricky one. Actually, there is no way to group result rows by some column. But what you can do is to create a map<text, int> and to group them manually in the map. Based on an example from Christopher Batey's blog, group-by and max:
CREATE OR REPLACE FUNCTION state_group_and_max(state map<text, int>, type text, amount int)
CALLED ON NULL INPUT
RETURNS map<text, int>
LANGUAGE java AS '
Integer val = (Integer) state.get(type);
if (val == null) val = amount; else val = Math.max(val, amount);
state.put(type, val);
return state;
' ;
CREATE OR REPLACE AGGREGATE state_group_and_max(text, int)
SFUNC state_group_and_max
STYPE map<text, int>
INITCOND {};
Then you use it as follows:
SELECT state_group_and_max(a1, b1) FROM demo1;
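If it helps to see what the aggregate does with each row, the following plain-Python sketch mirrors the fold: the empty dict plays the role of INITCOND, the function plays the role of SFUNC, and the dict it returns is the STYPE state carried to the next row. This is only an analogy, not how Cassandra executes the aggregate, and the sample rows are made up.

```python
from functools import reduce

def state_group_and_max(state, type_, amount):
    """State function: keep the largest amount seen so far for each type."""
    current = state.get(type_)
    state[type_] = amount if current is None else max(current, amount)
    return state

rows = [("x", 3), ("y", 7), ("x", 9), ("y", 2)]  # (a1, b1) pairs
# Start from the empty map and apply the state function once per row.
result = reduce(lambda state, row: state_group_and_max(state, *row), rows, {})
print(result)
```

The whole result is a single map built on the coordinator, which is why this approach only suits result sets with a modest number of groups.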
Notes
As mentioned above, you still have to invest some time in data modeling; don't overuse these features.
You have to set enable_user_defined_functions: true in your cassandra.yaml to enable these features.
You can overload the functions to support grouping by columns of different types.
References:
Great UDF and UDA examples by Christopher Batey and few more
Datastax docs on UDF and UDA
User Defined Functions in Cassandra 3.0 (Planet Cassandra Blog)
Cassandra 3.10 now supports GROUP BY on partition key and clustering key columns. You can refer to this link for more detail.
Cassandra doesn't support operations like this. You can use something like Hive on top or there's a (non-free) product from Acunu that may do what you need.
The other solution is to do the work yourself. For example, you can sum things by reading in all the data from certain rows and summing. Or maintain a Cassandra counter to increment on the fly.