Pyspark window functions (lag and row_number) generate inconsistent results

Pyspark window functions (lag and row_number) generate inconsistent results - pyspark

I have been fighting an issue with window functions in pyspark for a few weeks now.
I have the following query to detect changes for a given key field:
rowcount = sqlContext.sql(f"""
with temp as (
select key, timestamp, originalcol, lag(originalcol,1) over (partition by key order by timestamp) as lg
from snapshots
where originalcol is not null
)
select count(1) from (
select *
from temp
where lg is not null
and lg != originalcol
)
""")
Data types are as follows:
key: string (not null)
timestamp: timestamp (unique, not null)
originalcol: timestamp
The snapshots table contains over a million records. This query is producing different row counts after each execution: 27952, 27930, etc. while the expected count is 27942. I can say it is only approximately correct, with a deviation of around 10 records, however this is not acceptable as running the same function twice with the same inputs should produce the same results.
I have a similar problem with row_number() over the same window, then filtering for row_number = 1, but I guess the issue should be related.
I tried the query in an AWS Glue job as both pyspark and athena SQL, and the inconsistencies are similar.
Any clue about what I am doing wrong here?

Spark is pretty picky about some silly things...
and lg != originalcol doesn't detect Null values and thus the first value from the window partition will always be filtered out (since the first value from LAG will always be Null).
The same thing happens when you try using Null using In statment
Another example where Null will filter-out:
where test in (Null, 1)

After a bit of research, I discovered that column timestamp is not unique. Even though SQL Server manages to produce the same execution plan and results, pyspark and presto get confused with the order by clause in the window function and produce different results after each execution. If anything can be learned from this experience, it would be to double-check the partition and order by keys in a window function.

Related

Pyspark: correlated column is not allowed in predicate

I have a table with three columns EVENT, TIME, and `PRICE. For all events I would like to aggregate on previous events, for simplicity we'll assume it is mean.
What I would like to do is the following,
SELECT (
SELECT COUNT(*), MEAN(ti.PRICE)
    FROM table_1 ti
WHERE ti.EVENT = to.EVENT AND ti.TIME < to.TIME
), EVENT
FROM table_1
though if I run this in a pyspark environment or pyspark.sql(query) I get the error correlated column is not allowed in predicate.
Now, I wonder how I can change either the query to run without errors, or, how I can use native pyspark functions (F.filter....) to achieve the same result.
read other stackoverflow, that did not help

Why is counting by i not working but selecting from hdb works just fine?

I have a partitioned hdb and the following query works fine :
select from tableName where date within(.z.d-8;.z.d)
but the following query breaks :
select count i by date from tableName where date within(.z.d-8;.z.d)
with the following error :
"./2017.10.14/:./2017.10.15/tableName. OS reports: No such file or directory"
Any idea why this might happen ?

As the error indicates, there's no table called tableName is in a partition for 2017.10.15. For partitioned databases kdb caches table counts; it happens when it runs the first query with the following properties:
the "select" part of the query is either count i or the partition field itself (in your example that would be date)
the where clause is either empty or constrains the partition field only.
(.Q.ps -- partitioned select -- is where all this magic happens, see the definition of it if you need all the details.)
You have several options to avoid the error you're getting.
Amend the query to avoid having either count i on its own or the empty where.
Any of the following will work; the first is the simplest while the others are useful if you're writing a query for the general case and don't know field names in advance.
select count sym by date where date within (.z.d-8;.z.d) / any field except i
select count i by date where date within (.z.d-8;.z.d),i>=0
select `dummy, count i by date where date within (.z.d-8;.z.d)
select {count x}i by date where date within (.z.d-8;.z.d)
Use .Q.view to define a sub-view to exclude partitions with missing tables; kdb won't cache or otherwise access them.
The previous solutions will not work if the date range in your select includes partitions with missing tables. In this case you can either
Run .Q.chk to create empty tables where they are mising; or
Run .Q.bv to construct the dictionary of table schemas for tables with missing partitions.

You probably need to create the missing tables. I believe when doing a 'count i' on a partitioned table as you have done, it counts every single partition (not just the ones in your query) and caches these counts in .Q.pn
If you run .Q.chk[HDB root location], it should create the missing tables and your query should work
https://code.kx.com/q/ref/dotq/#qchk-fill-hdb

'count i' will scan each partition regardless of what is specified in the where clause. So it's likely those two partitions are incomplete.
Better to pick an actual column for things like that or else something like
select count i>0 by date from tableName where date within(.z.d-8;.z.d)
will prevent the scanning of all partitions.
Jason

Convert T-SQL Cross Apply to Redshift

I am converting the following T-SQL statement to Redshift. The purpose of the query is to convert a column in the table with a value containing a comma delimited string with up to 60 values into multiple rows with 1 value per row.
SELECT
id_1
, id_2
, value
into dbo.myResultsTable
FROM myTable
CROSS APPLY STRING_SPLIT([comma_delimited_string], ',')
WHERE [comma_delimited_string] is not null;
In SQL this processes 10 million records in just under 1 hour which is fine for my purposes. Obviously a direct conversation to Redshift isn't possible due to Redshift not having a Cross Apply or String Split functionality so I built a solution using the process detailed here (Redshift. Convert comma delimited values into rows) which utilizes split_part() to split the comma delimited string into multiple columns. Then another query that unions everything to get the final output into a single column. But the typical run takes over 6 hours to process the same amount of data.
I wasn't expecting to run into this issue just knowing the power difference between the machines. The SQL Server I was using for the comparison test was a simple server with 12 processors and 32 GB of RAM while the Redshift server is based on the dc1.8xlarge nodes (I don't know the total count). The instance is shared with other teams but when I look at the performance information there are plenty of available resources.
I'm relatively new to Redshift so I'm still assuming I'm not understanding something. But I have no idea what am I missing. Are there things I need to check to make sure the data is loaded in an optimal way (I'm not an adim so my ability to check this is limited)? Are there other Redshift query options that are better than the example I found? I've searched for other methods and optimizations but unless I start looking into Cross Joins, something I'd like to avoid (Plus when I tried to talk to the DBA's running the Redshift cluster about this option their response was a flat "No, can't do that.") I'm not even sure where to go at this point so any help would be much appreciated!
Thanks!

I've found a solution that works for me.
You need to do a JOIN on a number table, for which you can take any table as long as it has more rows that the numbers you need. You need to make sure the numbers are int by forcing the type. Using the funcion regexp_count on the column to be split for the ON statement to count the number of fields (delimiter +1), will generate a table with a row per repetition.
Then you use the split_part function on the column, and use the number.num column to extract for each of the rows a different part of the text.
SELECT comma_delimited_string, numbers.num, REGEXP_COUNT(comma_delimited_string , ',')+1 AS nfields, SPLIT_PART(comma_delimited_string, ',', numbers.num) AS field
FROM mytable
JOIN
(
select
(row_number() over (order by 1))::int as num
from
mytable
limit 15 --max num of fields
) as numbers
ON numbers.num <= regexp_count(comma_delimited_string , ',') + 1

Will Postgres' DISTINCT function always return null as the first element?

I'm selecting distinct values from tables thru Java's JDBC connector and it seems that NULL value (if there's any) is always the first row in the ResultSet.
I need to remove this NULL from the List where I load this ResultSet. The logic looks only at the first element and if it's null then ignores it.
I'm not using any ORDER BY in the query, can I still trust that logic? I can't find any reference in Postgres' documentation about this.

You can add a check for NOT NULL. Simply like
select distinct columnName
from Tablename
where columnName IS NOT NULL
Also if you are not providing the ORDER BY clause then then order in which you are going to get the result is not guaranteed, hence you can not rely on it. So it is better and recommended to provide the ORDER BY clause if you want your result output in a particular output(i.e., ascending or descending)
If you are looking for a reference Postgresql document then it says:
If ORDER BY is not given, the rows are returned in whatever order the
system finds fastest to produce.

If it is not stated in the manual, I wouldn't trust it. However, just for fun and try to figure out what logic is being used, running the following query does bring the NULL (for no apparent reason) to the top, while all other values are in an apparent random order:
with t(n) as (values (1),(2),(1),(3),(null),(8),(0))
select distinct * from t
However, cross joining the table with a modified version of itself brings two NULLs to the top, but random NULLs dispersed througout the resultset. So it doesn't seem to have a clear-cut logic clumping all NULL values at the top.
with t(n) as (values (1),(2),(1),(3),(null),(8),(0))
select distinct * from t
cross join (select n+3 from t) t2

strange result when use Where filter in CQL cassandra

i have a column family use counter as create table command below: (KEY i use bigin to filter when query ).
CREATE TABLE BannerCount (
KEY bigint PRIMARY KEY
) WITH
comment='' AND
comparator=text AND
read_repair_chance=0.100000 AND
gc_grace_seconds=864000 AND
default_validation=counter AND
min_compaction_threshold=4 AND
max_compaction_threshold=32 AND
replicate_on_write='true' AND
compaction_strategy_class='SizeTieredCompactionStrategy' AND
compression_parameters:sstable_compression='SnappyCompressor';
But when i insert data to this column family , and select using Where command to filter data
results i retrived very strange :( like that:
use Query:
select count(1) From BannerCount where KEY > -1
count
-------
71
use Query:
select count(1) From BannerCount where KEY > 0;
count
-------
3
use Query:
select count(1) From BannerCount ;
count
-------
122
What happen with my query , who any tell me why i get that :( :(

To understand the reason for this, you should understand Cassandra's data model. You're probably using RandomPartitioner here, so each of these KEY values in your table are being hashed to token values, so they get stored in a distributed way around your ring.
So finding all rows whose key has a higher value than X isn't the sort of query Cassandra is optimized for. You should probably be keying your rows on some other value, and then using either wide rows for your bigint values (since columns are sorted) or put them in a second column, and create an index on it.
To explain in a little more detail why your results seem strange: CQL 2 implicitly turns "KEY >= X" into "token(KEY) >= token(X)", so that a querier can iterate through all the rows in a somewhat-efficient way. So really, you're finding all the rows whose hash is greater than the hash of X. See CASSANDRA-3771 for how that confusion is being resolved in CQL 3. That said, the proper fix for you is to structure your data according to the queries you expect to be running on it.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse