PySpark: correlated column is not allowed in predicate

I have a table with three columns: EVENT, TIME, and PRICE. For each row, I would like to aggregate over the earlier rows with the same EVENT; for simplicity, assume the aggregate is the mean.
What I would like to do is the following:
SELECT (
SELECT COUNT(*), MEAN(ti.PRICE)
    FROM table_1 ti
WHERE ti.EVENT = to.EVENT AND ti.TIME < to.TIME
), EVENT
FROM table_1 to
However, if I run this in a PySpark environment with pyspark.sql(query), I get the error "correlated column is not allowed in predicate".
How can I either change the query so that it runs without errors, or use native PySpark functions (F.filter....) to achieve the same result?
I have read other Stack Overflow questions, but they did not help.
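One possible way around the correlated subquery is a window function whose frame covers only the earlier rows of the same EVENT. The sketch below runs the idea against SQLite from Python rather than Spark, so it only illustrates the shape of the query. The sample rows are made up, and it assumes TIME is numeric so that a RANGE frame ending at 1 PRECEDING corresponds to ti.TIME < to.TIME. Whether the same SQL text runs unchanged through Spark is not verified here.

```python
import sqlite3

# Hypothetical sample data; the column names follow the question.
rows = [
    ("a", 1, 10.0),
    ("a", 2, 20.0),
    ("a", 3, 60.0),
    ("b", 1, 5.0),
    ("b", 4, 15.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_1 (EVENT TEXT, TIME INTEGER, PRICE REAL)")
conn.executemany("INSERT INTO table_1 VALUES (?, ?, ?)", rows)

# The frame holds every row of the same EVENT with a strictly smaller TIME
# (assumes integer TIME values, so "up to TIME - 1" means "before TIME").
query = """
SELECT EVENT,
       TIME,
       COUNT(*)   OVER w AS prev_count,
       AVG(PRICE) OVER w AS prev_mean
FROM table_1
WINDOW w AS (
    PARTITION BY EVENT
    ORDER BY TIME
    RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
ORDER BY EVENT, TIME
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Rows with no earlier event get a count of 0 and a NULL mean, which matches what the correlated subquery would return for them.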

Related

Pivot function without manually typing values in `for in`?

Documentation provides an example of using the pivot() function.
SELECT *
FROM (SELECT partname, price FROM part) PIVOT (
AVG(price) FOR partname IN ('prop', 'rudder', 'wing')
);
I would like to use pivot() without having to manually specify each value of partname. I want all parts. I tried:
SELECT *
FROM (SELECT partname, price FROM part) PIVOT (
AVG(price) FOR partname);
That gave an error. Then I tried:
SELECT *
FROM (SELECT partname, price FROM part) PIVOT (
AVG(price) FOR partname IN (select distinct partname from part)
);
That also threw an error.
How can I tell Redshift to include all values of partname in the pivot?
I don't think this can be done in a simple single query. This would mean that the query compiler would need to work without knowing how many output columns will be produced. I don't think it can do that.
You can do this in multiple queries: use one query to build the list of partnames, then use that list to generate a second query with the IN list populated. Something needs to issue the first query and generate the second. This can be code external to Redshift (there are many options) or a stored procedure in Redshift. Wherever it lives, this code should account for Redshift's maximum of 1,600 columns.
The Redshift docs are fairly good on the topic of dynamic SQL for stored procedures. The EXECUTE statement will be used to fire off the second query in a stored procedure. See: https://docs.aws.amazon.com/redshift/latest/dg/c_PLpgSQL-statements.html
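As a rough sketch of the "external code" option, the Python below generates the second query from a list of part names. The function names (quote_literal, build_pivot_query) are hypothetical helpers, not part of any Redshift client library, and the list of names is hard-coded here where real code would fetch it with SELECT DISTINCT partname FROM part.

```python
from typing import Iterable

MAX_COLUMNS = 1600  # the column limit mentioned above


def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_pivot_query(partnames: Iterable[str]) -> str:
    """Build the PIVOT query with an explicit IN list (hypothetical helper)."""
    names = sorted(set(partnames))  # de-duplicate and fix the column order
    if not names:
        raise ValueError("no part names to pivot on")
    if len(names) > MAX_COLUMNS:
        raise ValueError(f"{len(names)} pivot columns exceed the limit of {MAX_COLUMNS}")
    in_list = ", ".join(quote_literal(name) for name in names)
    return (
        "SELECT *\n"
        "FROM (SELECT partname, price FROM part) PIVOT (\n"
        f"AVG(price) FOR partname IN ({in_list})\n"
        ");"
    )


# In practice these values would come from the first query.
pivot_sql = build_pivot_query(["wing", "prop", "rudder", "prop"])
print(pivot_sql)
```

The generated string would then be sent to Redshift as an ordinary query; a stored procedure using EXECUTE would follow the same two-step pattern.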

Will Postgres' DISTINCT function always return null as the first element?

I'm selecting distinct values from tables through Java's JDBC connector, and it seems that the NULL value (if there is one) is always the first row in the ResultSet.
I need to remove this NULL from the List where I load this ResultSet. The logic looks only at the first element and if it's null then ignores it.
I'm not using any ORDER BY in the query, can I still trust that logic? I can't find any reference in Postgres' documentation about this.
You can add an IS NOT NULL check, like this:
select distinct columnName
from Tablename
where columnName IS NOT NULL
Also, if you do not provide an ORDER BY clause, the order in which rows are returned is not guaranteed, so you cannot rely on it. It is better, and recommended, to provide an ORDER BY clause if you want the results in a particular order (i.e., ascending or descending).
If you are looking for a reference, the PostgreSQL documentation says:
If ORDER BY is not given, the rows are returned in whatever order the
system finds fastest to produce.
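A minimal sketch of both safeguards, using Python's built-in sqlite3 module as a stand-in for PostgreSQL and JDBC (the table t and its values are made up for illustration): the NULL is filtered in SQL, the order is requested explicitly, and a client-side filter shows how to drop NULLs without assuming they come first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [(1,), (2,), (1,), (3,), (None,), (8,), (0,)],
)

# Filter NULLs in the query and ask for an explicit order.
values = [
    row[0]
    for row in conn.execute("SELECT DISTINCT n FROM t WHERE n IS NOT NULL ORDER BY n")
]
print(values)

# If the query cannot be changed, drop NULLs wherever they appear
# instead of inspecting only the first element.
unfiltered = [row[0] for row in conn.execute("SELECT DISTINCT n FROM t")]
cleaned = [v for v in unfiltered if v is not None]
print(sorted(cleaned))
```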
If it is not stated in the manual, I wouldn't trust it. However, just for fun, and to try to figure out what logic is being used, I ran the following query. It does bring the NULL to the top (for no apparent reason), while all other values appear in random order:
with t(n) as (values (1),(2),(1),(3),(null),(8),(0))
select distinct * from t
However, cross joining the table with a modified version of itself brings two NULLs to the top but leaves other NULLs dispersed randomly throughout the result set. So there doesn't seem to be clear-cut logic that clumps all NULL values at the top.
with t(n) as (values (1),(2),(1),(3),(null),(8),(0))
select distinct * from t
cross join (select n+3 from t) t2

Db2 sql for partition by range select

I am trying to get my head around db2 partition stuff.
Select a.*, max(a.bloo)
over (
partition by range (a.bloo) (starting '2014-4-20' ending '2015-1-1')
)
as maxmax from (
select * from someTable
) a
I get SQLCODE -104 for this, and I cannot decipher the docs.
You are mixing up two different things: table partitioning, which is a physical characteristic of a table, and OLAP (window) functions, which provide logical grouping of records in a query.
I guess what you wanted was something like
Select
a.*,
max(a.bloo) over ( partition by a.bloo ) as maxmax
from someTable a
where
a.bloo between '2014-4-20' and '2015-1-1'
However, without knowing what you wanted to achieve in the first place it's impossible to give you a definitive answer. You may want to publish some sample data and the desired output.
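To illustrate the distinction, here is a small sketch run against SQLite from Python rather than Db2. The region column and the sample rows are hypothetical, added only so that the window partition does something visible. The date range is applied with an ordinary WHERE clause, and PARTITION BY only controls how rows are grouped logically for MAX.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# `region` is a hypothetical grouping column added for illustration.
conn.execute("CREATE TABLE someTable (region TEXT, bloo TEXT)")
conn.executemany(
    "INSERT INTO someTable VALUES (?, ?)",
    [
        ("north", "2014-03-01"),  # before the date range, filtered out
        ("north", "2014-05-10"),
        ("north", "2014-11-30"),
        ("south", "2014-06-15"),
        ("south", "2015-02-01"),  # after the date range, filtered out
    ],
)

# WHERE restricts the rows; PARTITION BY groups the remaining rows for MAX.
# Dates are zero-padded ISO strings so that text comparison orders them correctly.
windowed = conn.execute(
    """
    SELECT a.region,
           a.bloo,
           MAX(a.bloo) OVER (PARTITION BY a.region) AS maxmax
    FROM someTable a
    WHERE a.bloo BETWEEN '2014-04-20' AND '2015-01-01'
    ORDER BY a.region, a.bloo
    """
).fetchall()
for row in windowed:
    print(row)
```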

Order by custom named rows

I’d like to sort my postgres results by some fancy ranking function, but for sake of simplicity, let’s say that I’d like to add two custom rows and sort by them.
SELECT my_table.*,
extract(epoch from (age(current_date, '2012-09-12 10:43:40'::date)))/3600 AS age_in_hours,
Fancy_function_counting_distance() AS distance
FROM my_table
ORDER BY distance + age_in_hours;
However, it doesn’t work; I get the error: ERROR: column "distance" does not exist.
Is it possible to order my results by those custom-named rows?
I’m running postgres 9.1.x
In PostgreSQL, an alias from the SELECT list can be used in ORDER BY only when it stands alone; it cannot be used as part of an expression such as distance + age_in_hours.
You can use a column-position specification (e.g. ORDER BY 1,2), but that doesn't accept an expression either; you cannot ORDER BY 1+2, for example. So you need to use a subquery to generate the result set, then sort it in an outer query:
SELECT *
FROM (
SELECT my_table.*,
extract(epoch from (age(current_date, '2012-09-12 10:43:40'::date)))/3600 AS age_in_hours,
Fancy_function_counting_distance() AS distance
FROM my_table
) x
ORDER BY distance + age_in_hours;
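The same subquery-then-sort pattern can be tried out end to end with a small sketch. SQLite stands in for PostgreSQL here, so it shows only the shape of the query, not PostgreSQL's error. The columns hours_old and raw_distance, and the expression raw_distance * 2 that replaces Fancy_function_counting_distance(), are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical columns standing in for the real age and distance inputs.
conn.execute("CREATE TABLE my_table (name TEXT, hours_old REAL, raw_distance REAL)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [("x", 10.0, 1.0), ("y", 1.0, 3.0), ("z", 4.0, 0.5)],
)

# The inner query names the computed columns; the outer query sorts by
# an expression built from those names.
ordered = conn.execute(
    """
    SELECT *
    FROM (
        SELECT name,
               hours_old AS age_in_hours,
               raw_distance * 2 AS distance
        FROM my_table
    ) x
    ORDER BY distance + age_in_hours
    """
).fetchall()
ordered_names = [row[0] for row in ordered]
print(ordered_names)
```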

Calling a function on distinct values from table

I've got a SQL Server 2005 database. I need to get distinct values in addition to calling a function on those distinct values. I'm not sure how the distinct works when there is a function call involved. For example, I have this query:
SELECT DISTINCT a, b, c, fcn_DoSomething(a, b, c) AS z FROM users
I'm guessing that the function (fcn_DoSomething) is being called for all of the values in the table, not the distinct values. Am I correct? If so, how can I write the query to call the function only on distinct values of a,b,c? I know one option is to use a temporary table, but if anyone has better ideas that would be great.
Thanks
This got me curious, so I did a bit of basic testing. I created a small table with some distinct and some repeating values, a function that just does string concatenation, and then looked at the execution plans for:
Go
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
select distinct cola, colb, dbo.sillyfunc(cola, colb)
from distincttest
--Clear the cache
Go
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
select cola, colb, dbo.sillyfunc(cola, colb)
from (select distinct cola, colb from distincttest) as t
In this case, the execution plans showed clearly that the first one ran the concatenation function for every single row, but the second did the sort for distinct values first, then ran the function. But for a small number of rows, they had the same execution time, and when run together they showed each one using 50% of the total query resources.
So, I added a few hundred thousand repeating rows and tried again. This changed the query plan so that it used a hash match to get distinctness rather than the former sort, and now the second version, which forced it to select for distinctness first, executed more than ten times faster.
Finally, I thought there was a chance that this might just be because SQL Server had my sillyfunc marked as nondeterministic (select OBJECTPROPERTYEX(object_id('dbo.sillyfunc'), 'isdeterministic') returned 0), so I switched to patindex, which is a built-in function and considered deterministic. This gave me the same results, with the function being called for every row in the first version and just for the few distinct ones in the second version.
So, it's possible that further testing would find situations that would coax the optimizer to do something more sophisticated, but it appears that if you want to apply the distinct before the function is called, then you need to use something like a subquery, CTE, or temp table to limit what the function has access to.
This would ensure that the function only got called on distinct values.
select *, fcn_DoSomething(a, b, c)
from
(select distinct a,b,c FROM users) v
However, I believe that the function call will be optimised, so it may not make a difference. Give it a try.
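One way to try it is to count how often the function actually runs under each form of the query. The sketch below does this with SQLite and a Python-defined function, since that makes the call count easy to observe. SQLite's planner is not SQL Server's, so the numbers only show how such an experiment could be set up; do_something is a hypothetical stand-in for fcn_DoSomething, and the users rows are made up.

```python
import sqlite3

calls = {"n": 0}


def do_something(a, b):
    """Stand-in for fcn_DoSomething that counts its invocations."""
    calls["n"] += 1
    return f"{a}-{b}"


conn = sqlite3.connect(":memory:")
conn.create_function("do_something", 2, do_something)
conn.execute("CREATE TABLE users (a TEXT, b TEXT)")
# Six rows, but only two distinct (a, b) pairs.
conn.executemany("INSERT INTO users VALUES (?, ?)", [("x", "1")] * 3 + [("y", "2")] * 3)


def count_calls(sql):
    """Run a query and return its sorted rows plus the number of function calls."""
    calls["n"] = 0
    rows = sorted(conn.execute(sql).fetchall())
    return rows, calls["n"]


direct_rows, direct_calls = count_calls(
    "SELECT DISTINCT a, b, do_something(a, b) FROM users"
)
sub_rows, sub_calls = count_calls(
    "SELECT a, b, do_something(a, b) FROM (SELECT DISTINCT a, b FROM users) AS v"
)
print("direct:", direct_calls, "subquery:", sub_calls)
```

Both forms return the same rows; comparing the two printed counts shows whether the engine in use applies the function before or after removing duplicates.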