Ecto JOIN complications for append-only table query - append

I'm trying to query an Ecto table with append-only semantics, so I'd like the most recent version of a complete row for a given ID. The technique is described here, but in short: I want to JOIN a table on itself with a subquery that fetches the most recent time for an ID. In SQL this would look like:
SELECT r.*
FROM rules AS r
JOIN (
SELECT id, MAX(inserted_at) AS inserted_at FROM rules GROUP BY id
) AS recent_rules
ON (
recent_rules.id = r.id
AND recent_rules.inserted_at = r.inserted_at)
I'm having trouble expessing this in Ecto. I tried something like this:
maxes =
from(m in Rule,
select: {m.id, max(m.inserted_at)},
group_by: m.id)
from(r in Rule,
join: m in ^maxes, on: r.id == m.id and r.inserted_at == m.inserted_at)
But trying to run this, I hit a restriction:
queries in joins can only have where conditions in query
suggesting maxes must just be a SELECT _ FROM _ WHERE form.
If I try switching maxes and Rule in the JOIN:
maxes =
from(m in Rule,
select: {m.id, max(m.inserted_at)},
group_by: m.id)
from(m in maxes,
join: r in Rule, on: r.id == m.id and r.inserted_at == m.inserted_at)
then I'm not able to SELECT the whole row, just id and MAX(inserted_at).
Does anyone know how to do this JOIN? Or a better way to query append-only in Ecto? Thanks 🙂

Doing m in ^maxes is not running a subquery but either query composition (if in a from) or converting the query to a join (in a join). In both cases, you are changing the same query. Given your initial query, I believe you want subqueries.
Also note that a subquery requires the select to return a map, so we can refer to the fields later on. Something along these lines should work:
maxes =
from(m in Rule,
select: %{id: m.id, inserted_at: max(m.inserted_at)},
group_by: m.id)
from(r in Rule,
join: m in ^subquery(maxes), on: r.id == m.id and r.inserted_at == m.inserted_at)
PS: I have pushed a commit to Ecto that clarifies the error message in cases like yours.
invalid query was interpolated in a join.
If you want to pass a query to a join, you must either:
1. Make sure the query only has `where` conditions (which will be converted to ON clauses)
2. Or wrap the query in a subquery by calling subquery(query)

Related

Subquery with `WHERE` on function calls with outer query grouped by the function calls gives "subquery uses ungrouped column from outer query"

Consider this situation where age_group(.) is a function that returns an age bracket for an age (0-17: 'minor', 18-64: 'adult' etc.)
SELECT
date_of_data,
age_group(age),
count(1),
(SELECT avg(salary)
FROM tbl2
WHERE age_group(tbl2.age) = age_group(tbl1.age)
AND tbl2.date_of_file = tbl1.date_of_file
AND type = 'junior') AS average_salary_as_junior,
(SELECT avg(salary)
FROM tbl2
WHERE age_group(tbl2.age) = age_group(tbl1.age)
AND tbl2.date_of_file = tbl1.date_of_file
AND type = 'senior') AS average_salary_as_senior,
(SELECT avg(salary)
FROM tbl2
WHERE age_group(tbl2.age) = age_group(tbl1.age)
AND tbl2.date_of_file = tbl1.date_of_file
AND type = 'principal') AS average_salary_as_principal,
-- 15 more types to go
FROM tbl1
GROUP BY
date_of_data, age_group(age);
This will not work unless the outer query is grouped by age in contrast to age_group(age), because the subquery uses age as an argument to a function, despite being the same function:
subquery uses ungrouped column "tbl1.age" from outer query..
If I group by age instead of age_group(age), there will be redundant identical records in the output.
Conditional aggregation might be a solution, and so is using DISTINCT on the whole output, albeit inefficient. Not sure if there are more techniques to achieve the same, but I am wondering whether there's a way to make Postgres realise that the same function call exists in the GROUP BY clause, and permit such a query to execute.

Joining one table twice in postgresql

I have two columns in the same table that I want to join in Postgresql but for some reason I’m getting this error. Don’t know how to figure it out. Please help.
[42P01] ERROR: relation "a" does not exist
Position: 10
X table contains two pools(ABC,XYZ), ids, numbers and description. If an ID exists in one pool but not in the other, it should update description column to “ADD”. Pools need to be joined on number.
UPDATE A
SET A.Description = 'ADD'
FROM X AS A
LEFT JOIN X AS B ON B.number = A.number
AND B.id = 'ABC'
WHERE A.id = 'XYZ'
AND B.number IS NULL
AND A.Description IS NULL;
With standard SQL you can't do a join as part of an update, but what you can do is include a subquery to select the id's to update. The subquery can contain a join. I'm not entirely clear on what you're actually trying to accomplish, but you could do something like this:
UPDATE x SET description='ADD' WHERE number IN (
SELECT a.number FROM x AS a
LEFT OUTER JOIN x AS b ON a.number=b.number AND a.id='XYZ' AND b.id='ABC'
WHERE b.number IS NULL
);
This will join the table x with itself and will select (and update) any numbers's that don't have a matching number in the 'ABC' and 'XYZ' zone.
PostgreSQL does have a UPDATE FROM syntax that does let you update with complex subqueries. It's more flexible but it's non-standard. You can find an example of this type of query here.

convert SQL join query to pyspark syntax

I'm working to convert a known working SQL query to work in pyspark, given two dataframes, using methods such as: .join, .where, filter, etc.
Here are examples of SQL queries that work (only selecting r.id where I will normally select more columns):
# "invalid" records, where there is a matching `record_id` for rv_df
SELECT DISTINCT(r.id) FROM core_record AS r LEFT OUTER JOIN core_recordvalidation rv ON r.id = rv.record_id WHERE r.job_id = 41 AND rv.record_id is not null;
# "valid" records, where there is no matching `record_id` for rv_df
SELECT DISTINCT(r.id) FROM core_record AS r LEFT OUTER JOIN core_recordvalidation rv ON r.id = rv.record_id WHERE r.job_id = 41 AND rv.record_id is not null;
I'm 80/20 close, but having trouble wrapping my head around the the last few steps, and/or how to do this most efficiently.
I've got a Dataframe r_df with column id that I'd like to join with Dataframe rv_df on column record_id. As output, I'd like only distinct r.id, and only columns from r_df, none from rv_df. Finally, I'd like two different calls where there is a match (what will be "invalid" records for me), and where there is not a match (what I consider "valid" records).
I have pyspark queries that get close, but not terribly clear on how to ensure that r_df.id is distinct, and select only columns from r_df, none from rv_df.
Any help would be much appreciated!
Just had to walk away for a couple hours. Found a solution that works for my use case.
First, selecting only distinct record_id from rv_df:
rv_df = rv_df.select('record_id').distinct()
Then use that for intersection and disjoints:
# Intersection:
j_df = r_df.join(rv_df, r_df.id == rv_df.record_id, 'leftsemi').select(r_df['*'])
# Disjoint:
j_df = r_df.join(rv_df, r_df.id == rv_df.record_id, 'leftanti').select(r_df['*'])

EFCore returning too many columns for a simple LEFT OUTER join

I am currently using EFCore 1.1 (preview release) with SQL Server.
I am doing what I thought was a simple OUTER JOIN between an Order and OrderItem table.
var orders = from order in ctx.Order
join orderItem in ctx.OrderItem
on order.OrderId equals orderItem.OrderId into tmp
from oi in tmp.DefaultIfEmpty()
select new
{
order.OrderDt,
Sku = (oi == null) ? null : oi.Sku,
Qty = (oi == null) ? (int?) null : oi.Qty
};
The actual data returned is correct (I know earlier versions had issues with OUTER JOINS not working at all). However the SQL is horrible and includes every column in Order and OrderItem which is problematic considering one of them is a large XML Blob.
SELECT [order].[OrderId], [order].[OrderStatusTypeId],
[order].[OrderSummary], [order].[OrderTotal], [order].[OrderTypeId],
[order].[ParentFSPId], [order].[ParentOrderId],
[order].[PayPalECToken], [order].[PaymentFailureTypeId] ....
...[orderItem].[OrderId], [orderItem].[OrderItemType], [orderItem].[Qty],
[orderItem].[SKU] FROM [Order] AS [order] LEFT JOIN [OrderItem] AS
[orderItem] ON [order].[OrderId] = [orderItem].[OrderId] ORDER BY
[order].[OrderId]
(There are many more columns not shown here.)
On the other hand - if I make it an INNER JOIN then the SQL is as expected with only the columns in my select clause:
SELECT [order].[OrderDt], [orderItem].[SKU], [orderItem].[Qty] FROM
[Order] AS [order] INNER JOIN [OrderItem] AS [orderItem] ON
[order].[OrderId] = [orderItem].[OrderId]
I tried reverting to EFCore 1.01, but got some horrible nuget package errors and gave up with that.
Not clear whether this is an actual regression issue or an incomplete feature in EFCore. But couldn't find any further information about this elsewhere.
Edit: EFCore 2.1 has addressed a lot of issues with grouping and also N+1 type issues where a separate query is made for every child entity. Very impressed with the performance in fact.
3/14/18 - 2.1 Preview 1 of EFCore isn't recommended because the GROUP BY SQL has some issues when using OrderBy() but it's fixed in nightly builds and Preview 2.
The following applies to EF Core 1.1.0 (release).
Although shouldn't be doing such things, tried several alternative syntax queries (using navigation property instead of manual join, joining subqueries containing anonymous type projection, using let / intermediate Select, using Concat / Union to emulate left join, alternative left join syntax etc.) The result - either the same as in the post, and/or executing more than one query, and/or invalid SQL queries, and/or strange runtime exceptions like IndexOutOfRange, InvalidArgument etc.
What I can say based on tests is that most likely the problem is related to bug(s) (regression, incomplete implementation - does it really matter) in GroupJoin translation. For instance, #7003: Wrong SQL generated for query with group join on a subquery that is not present in the final projection or #6647 - Left Join (GroupJoin) always materializes elements resulting in unnecessary data pulling etc.
Until it get fixed (when?), as a (far from perfect) workaround I could suggest using the alternative left outer join syntax (from a in A from b in B.Where(b = b.Key == a.Key).DefaultIfEmpty()):
var orders = from o in ctx.Order
from oi in ctx.OrderItem.Where(oi => oi.OrderId == o.OrderId).DefaultIfEmpty()
select new
{
OrderDt = o.OrderDt,
Sku = oi.Sku,
Qty = (int?)oi.Qty
};
which produces the following SQL:
SELECT [o].[OrderDt], [t1].[Sku], [t1].[Qty]
FROM [Order] AS [o]
CROSS APPLY (
SELECT [t0].*
FROM (
SELECT NULL AS [empty]
) AS [empty0]
LEFT JOIN (
SELECT [oi0].*
FROM [OrderItem] AS [oi0]
WHERE [oi0].[OrderId] = [o].[OrderId]
) AS [t0] ON 1 = 1
) AS [t1]
As you can see, the projection is ok, but instead of LEFT JOIN it uses strange CROSS APPLY which might introduce another performance issue.
Also note that you have to use casts for value types and nothing for strings when accessing the right joined table as shown above. If you use null checks as in the original query, you'll get ArgumentNullException at runtime (yet another bug).
Using "into" will create a temporary identifier to store the results.
Reference : MDSN: into (C# Reference)
So removing the "into tmp from oi in tmp.DefaultIfEmpty()" will result in the clean sql with the three columns.
var orders = from order in ctx.Order
join orderItem in ctx.OrderItem
on order.OrderId equals orderItem.OrderId
select new
{
order.OrderDt,
Sku = (oi == null) ? null : oi.Sku,
Qty = (oi == null) ? (int?) null : oi.Qty
};

How to use GROUP BY with Firebird?

I'm trying create a SELECT with GROUP BY in Firebird but I can't have any success. How could I do this ?
Exception
Can't format message 13:896 -- message file C:\firebird.msg not found.
Dynamic SQL Error.
SQL error code = -104.
Invalid expression in the select list (not contained in either an aggregate function or the GROUP BY clause).
(49,765 sec)
trying
SELECT FA_DATA, FA_CODALUNO, FA_MATERIA, FA_TURMA, FA_QTDFALTA,
ALU_CODIGO, ALU_NOME,
M_CODIGO, M_DESCRICAO,
FT_CODIGO, FT_ANOLETIVO, FT_TURMA
FROM FALTAS Falta
INNER JOIN ALUNOS Aluno ON (Falta.FA_CODALUNO = Aluno.ALU_CODIGO)
INNER JOIN MATERIAS Materia ON (Falta.FA_MATERIA = Materia.M_CODIGO)
INNER JOIN FORMACAOTURMAS Turma ON (Falta.FA_TURMA = Turma.FT_CODIGO)
WHERE (Falta.FA_CODALUNO = 238) AND (Turma.FT_ANOLETIVO = 2015)
GROUP BY Materia.M_CODIGO
Simple use of group by in firebird,group by all columns
select * from T1 t
where t.id in
(SELECT t.id FROM T1 t
INNER JOIN T2 j ON j.id = t.jid
WHERE t.id = 1
GROUP BY t.id)
Using GROUP BY doesn't make sense in your example code. It is only useful when using aggregate functions (+ some other minor uses). In any case, Firebird requires you to specify all columns from the SELECT column list except those with aggregate functions in the GROUP BY clause.
Note that this is more restrictive than the SQL standard, which allows you to leave out functionally dependent columns (ie if you specify a primary key or unique key, you don't need to specify the other columns of that table).
You don't specify why you want to group (because it doesn't make much sense to do it with this query). Maybe instead you want to ORDER BY, or you want the first row for each M_CODIGO.