Why does select not give all the columns? - pyspark - select

Can you help me understand why I am getting:
"cannot resolve 'actual_date' given input columns: [general_id, payments_803_all]"?
actual_date is in MAIN_TABLE and it was included in the select.
df = df_MAIN_TABLE.select(concat(lpad(col('id_1'), 3, '0'),
                                 lpad(col('id_2'), 3, '0'),
                                 lpad(col('id_3'), 7, '0')).alias('general_id'),
                          'payment', 'actual_date')\
    .where((col('payment_code') == 803) & (col('actual_date').between(date_from1, end_of_month)))\
    .groupBy('general_id').agg(sum('payment').alias('payments_803_all'))\
    .agg(max('actual_date').alias('last_action_date'))\
    .withColumn('validity_date', lit(end_of_month))

After the first .agg() call, the DataFrame keeps only the grouping column and the aggregate: general_id and payments_803_all. actual_date has been aggregated away, so the chained .agg(max('actual_date')) has no column to resolve, which is exactly what the error message is telling you.
Example:
Try computing both aggregates in a single .agg() call on the same groupBy; actual_date is still in scope at that point:
.groupBy('general_id').agg(sum('payment').alias('payments_803_all'), max('actual_date').alias('last_action_date'))\

Related

Using min/max values from a CTE in a later query, instead of using a subquery in Postgres

I've got a remedial question about pulling results out of a CTE in a later part of the query. For the example code, below are the relevant, stripped-down tables:
CREATE TABLE print_job (
    created_dts timestamp not null default now(),
    status text not null
);

CREATE TABLE calendar_day (
    date_actual date not null
);
In the current setup, there are gaps in the dates in the print_job data, and we would like to have a gapless result. For example, there are 87 days from the first to last date in the table, and only 77 days in there have data. We've already got a calendar_day dimension table to join with to get the 87 rows for the 87-day range. It's easy enough to figure out the min and max dates in the data with a subquery or in a CTE, but I don't know how to use those values from a CTE. I've got a full query below, but here are the relevant fragments with comments:
-- Get the date range from the data.
date_range AS (
    select min(created_dts::date) AS start_date,
           max(created_dts::date) AS end_date
    from print_job),

-- This CTE does not work because it doesn't know what date_range is.
complete_date_series_using_cte AS (
    select date_actual
    from calendar_day
    where date_actual >= date_range.start_date
      and date_actual <= date_range.end_date
),

-- Subqueries are fine, because the FROM is specified in the subquery condition directly.
complete_date_series_using_subquery AS (
    select date_actual
    from calendar_day
    where date_actual >= (select min(created_dts::date) from print_job)
      and date_actual <= (select max(created_dts::date) from print_job)
)
I run into this regularly, and finally figured I'd ask. I've hunted around already for an answer, but I'm not clear how to summarize it well. And while there's nothing wrong with the subqueries in this case, I've got other situations where a CTE is nicer/more readable.
If it helps, I've listed the complete query below.
-- Get some counts and give them names.
WITH
daily_status AS (
    select created_dts::date as created_date,
           count(*) AS daily_total,
           count(*) FILTER (where status = 'Error') AS status_error,
           count(*) FILTER (where status = 'Processing') AS status_processing,
           count(*) FILTER (where status = 'Aborted') AS status_aborted,
           count(*) FILTER (where status = 'Done') AS status_done
    from print_job
    group by created_dts::date
),

-- Get the date range from the data.
date_range AS (
    select min(created_dts::date) AS start_date,
           max(created_dts::date) AS end_date
    from print_job),

-- There are gaps in the data, and we want a row for dates with no results.
-- Could use generate_series on a timestamp & convert that to dates. But,
-- in our case, we've already got dimension tables for days. All that's needed
-- here is the actual date.

-- This CTE does not work because it doesn't know what date_range is.
-- complete_date_series_using_cte AS (
--     select date_actual
--     from calendar_day
--     where date_actual >= date_range.start_date
--       and date_actual <= date_range.end_date
-- ),

complete_date_series_using_subquery AS (
    select date_actual
    from calendar_day
    where date_actual >= (select min(created_dts::date) from print_job)
      and date_actual <= (select max(created_dts::date) from print_job)
)

-- The final query joins the complete date series with the daily summaries
-- from the print_job table.
select date_actual,
       coalesce(daily_total, 0)       AS total,
       coalesce(status_error, 0)      AS errors,
       coalesce(status_processing, 0) AS processing,
       coalesce(status_aborted, 0)    AS aborted,
       coalesce(status_done, 0)       AS done
from complete_date_series_using_subquery
left join daily_status
       on daily_status.created_date = complete_date_series_using_subquery.date_actual
order by date_actual;
I said it was a remedial question... I remembered where I'd seen this done before:
https://tapoueh.org/manual-post/2014/02/postgresql-histogram/
In my example, I need to list the CTE in the table list. That's obvious in retrospect, and I realize that I automatically don't think to do that as I'm habitually avoiding CROSS JOIN. The fragment below shows the slight change needed:
WITH
date_range AS (
    select min(created_dts)::date as start_date,
           max(created_dts)::date as end_date
    from print_job
),
complete_date_series AS (
    select date_actual
    from calendar_day, date_range
    where date_actual >= date_range.start_date
      and date_actual <= date_range.end_date
),
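For what it's worth, the comma join above is an implicit CROSS JOIN, and because date_range always produces exactly one row, the cross join is harmless here. A minimal sketch of the same CTE with the explicit keyword, assuming the tables from the question:
complete_date_series AS (
    select date_actual
    from calendar_day
    cross join date_range
    where date_actual >= date_range.start_date
      and date_actual <= date_range.end_date
),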

Apply a join, sort on the date column, and select the first row where one of the column values is not null

I have two tables (Table A and Table B) in a Postgres DB.
Both have an "id" column in common. Table A has one column called "id" and Table B has three columns: "id, date, value($)".
For each "id" in Table A there exist multiple rows in Table B in the following format: (id, date, value).
For instance, for Table A with "id" 1, the following rows exist in Table B:
(1, 2018-06-21, null)
(1, 2018-06-20, null)
(1, 2018-06-19, 202)
(1, 2018-06-18, 200)
I would like to extract the most recently dated non-null value. For example, for id 1, the result should be 202. Please share your thoughts or let me know in case more info is required.
Here is the solution I went ahead with:
with mapping as (
    select distinct table1.id,
           table2.value,
           table2.date,
           row_number() over (partition by table1.id order by table2.date desc nulls last) as row_number
    from table1
    left join table2 on table2.id = table1.id and table2.value is not null
)
select * from mapping where row_number = 1
Let me know if there is scope for improvement.
You may very well want an inner join, not an outer join. If you have an id in table1 that does not exist in table2, or one that has only null values, you will get NULL for both date and value. This is due to how an outer join works: if nothing in the right-side table matches the ON condition, it returns NULL for each column in that table. So:
with mapping as
(select distinct table1.id
, table2.value
, table2.date
, row_number() over (partition by table1.id order by table2.date desc nulls last) as row_number
from table1
join table2 on table2.id=table1.id and table2.value is not null
)
select *
from mapping
where row_number = 1;
Your query worked because all your test data satisfied the first condition of the ON clause. You really need test data that fails it to see what your query does.
Caution: DATE and VALUE are very poor choices for column names. Both are SQL-standard reserved words, although not reserved in Postgres specifically. Further, DATE is a Postgres data type. Having columns whose names match a data type leads to confusion.
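As an aside, not part of the answer above: Postgres's DISTINCT ON can express the same "first row per group" logic without a window function. A minimal sketch, assuming the table1/table2 schema from the question:
select distinct on (table1.id)
       table1.id, table2.value, table2.date
from table1
join table2 on table2.id = table1.id and table2.value is not null
order by table1.id, table2.date desc;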

DB2: substring a number

In my dataset I have a numeric variable, called year_month, which is year+month, with values such as 201702, 201703, etc.
Normally my code looks like this:
select year_month
      ,variable2
      ,variable3
from dataset
I wish to extract the month and the year from the year_month variable, but I'm not sure how to do this when year_month is numeric.
edit: not a duplicate, different problem, I do not care about dates.
To extract the date parts from the integer:
SELECT year_month/100, MOD(year_month, 100)
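For example, with year_month = 201702, integer division gives 201702/100 = 2017 and MOD(201702, 100) = 2. Applied to the question's table, the query might look like this (variable2 and variable3 as in the question):
select year_month/100 as year_part
      ,mod(year_month, 100) as month_part
      ,variable2
      ,variable3
from dataset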
To fully convert the integer to a date:
SELECT TO_DATE(CHAR(year_month), 'YYYYMM')
It is also possible with these methods:
select left(cast(year_month as varchar(6)), 4) as YYYY,
       right(cast(year_month as varchar(6)), 2) as MM
from yourtable
You can have a timestamp like this:
select TIMESTAMP_FORMAT(cast(year_month as varchar(6)), 'YYYYMM') as YourTimestamp
from yourtable
Or a date too:
select Date(TIMESTAMP_FORMAT(cast(year_month as varchar(6)), 'YYYYMM')) as YourDate
from yourtable

How to sort this data inside the field based on the recent Date

I have data in a field like " Date: 03-21-13 12/13/14/15 Date: 04-21-13 39/12/34/14 Date: 04-19-13 19/45/65/12 ". How can I sort this data inside the field based on the most recent date?
It should look like:
Date:04-21-13 39/12/34/14
Date:04-19-13 19/45/65/12
Date: 03-21-13 12/13/14/15
Because you are storing it as text, you cannot correctly sort directly on the column (as you appear to have discovered). You will need to split the column, and then sort on that. Something like:
Declare @tvTable Table (
    TextColumn varchar(max)
)

Insert @tvTable
Select '04-19-13 19/45/65/12' Union All
Select '04-21-13 39/12/34/14' Union All
Select '03-21-13 12/13/14/15' Union All
Select '03-25-13 17/18/19/20' Union All
Select '05-01-13 99/88/77/66' Union All
Select '02-01-13 11/22/33/44'

Select t.TextColumn
From @tvTable t
Cross Apply dbo.fncDelimitedSplit8k(TextColumn, ' ') split
Where split.ItemNumber = 1
Order By Cast(split.Item As DateTime) Desc
The split function is taken from Jeff Moden's "Tally OH!" article.
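As an aside, not from the answer above: if the date is always the first eight characters of the value, the split function may be unnecessary. A sketch assuming that fixed layout and a date format setting that parses MM-DD-YY:
Select t.TextColumn
From @tvTable t
Order By Cast(Left(t.TextColumn, 8) As DateTime) Desc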

Crystal Reports - Removing duplicate rows based on fields

Struggling on Crystal Reports XI R2
I am trying to capture only one row per ID, based on the earliest timestamp.
Example:
ID Time
1 7:00
1 9:00
2 11:00
2 11:30
Would return
ID Time
1 7:00
2 11:00
I have tried to suppress duplicates, but since it is looking at multiple fields, that will not work. I wonder if I can group on ID, and then sort on time, removing the later entries?
Update: I think I may have figured this out by going to the Section Expert, selecting Details, then Suppress, and adding the formula: {LOG.id} = Next({LOG.id}).
I would love to hear any other opinions on this before I mark it as answered. Thanks.
If you have access to the query/stored procedure being used to return the data, you could do your grouping there, e.g.:
SELECT ID, MIN(Time) as Time
FROM Table
GROUP BY ID
Depending on your data you may get better results if you filter at the source of your data, but without knowing what you're reporting against it's impossible to say.
If you are filtering what's displayed at the report, you may be dragging lots of data across the network only to suppress it in the report... why not just filter at the source?
In SQL Server you could write a top-N query, something like this (test data included):
create table t3 (
    id int,
    supplierId int,
    description varchar(max),
    value decimal(5,2),
    created datetime default getdate()
)

insert into t3 values
(1, 1, 'test', 180.0, '20101001'),
(1, 1, 'test', 181.0, '20101003'),
(1, 1, 'test', 182.0, '20101002'),
(1, 2, 'test', 183.0, '20101005'),
(1, 2, 'test', 184.0, '20101002'),
(1, 2, 'test', 185.0, '20101001')

;with cte as (
    select t.id
         , t.supplierId
         , t.description
         , t.value
         , t.created
         , rank() over (partition by t.supplierId order by t.created desc) as Position
    from t3 t
)
select * from cte where Position = 1
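One caveat, not in the answer above: rank() returns ties when two rows share the same created value, and the example orders descending (latest first), while the question asks for the earliest time per ID. A sketch of the same pattern adjusted for that, using row_number() to guarantee a single row per group:
;with cte as (
    select t.id
         , t.supplierId
         , t.created
         , row_number() over (partition by t.supplierId order by t.created asc) as Position
    from t3 t
)
select * from cte where Position = 1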
I was able to group by ID, then sort by the time. I had tried to add a MIN to my SQL query, but it was a beast of a query (by my standards); the part I included here was a small section of it. Thanks for all the tips.