I have two issues:
I am trying to create a pivot table using the tablefunc crosstab function, but all of the queries I've tried are returning NULL for all values.
I have two grouping variables (airport code and date) that are in one row of data that need to be separate columns in the pivot table, but I can only seem to get one to work.
I have gotten the pivot table to partially work by ignoring the date value for the moment. When I leave 'yyyymm' out of my query, the setup of my output table is okay, but the values don't calculate properly.
The data: I have rows with various airport codes, aircraft user and engine codes, flight identifiers, and year/month values. Each row counts for one flight. A simplified example looks like this:
ident             primary_fa  user_engn  yyyymm
----------------  ----------  ---------  ------
20191122-AFR-23   MKE         O_O        201911
20191210-ASH-61   N90         T_R        201912
20200120-EDV-2    MKE         C_J        202001
20200811-FLC-148  A90         O_O        202008
I need my output table to count the number of arrivals for each user/engine combo, grouped by airport code and yyyymm. So the rows would be each airport code (primary_fa) and yyyymm, and the columns would be user_engn codes (O_O, T_R, C_J, etc.) with counts of the number of flights per user_engn.
My goal output would look something like this:
primary_fa  yyyymm  C_J  T_R  O_O
----------  ------  ---  ---  ---
MKE         201911  1    0    1
N90         201912  0    1    0
A90         202008  0    0    1
But I am getting this (because I have to ignore the date portion to even get this far):
primary_fa  C_J   T_R   O_O
----------  ----  ----  ----
MKE         NULL  NULL  NULL
N90         NULL  NULL  NULL
A90         NULL  NULL  NULL
I've tried a lot of different versions of the crosstabs query and the closest I have gotten to correct is this:
SELECT *
FROM crosstab(
  'SELECT primary_fa as locid,
          yyyymm,
          count(*)
   FROM fy20_keeps_emdf
   GROUP BY primary_fa, yyyymm
   ORDER BY 1,2',
  'VALUES (''C_J''),(''O_O''),(''T_R'')')
AS (primary_fa varchar,
    C_J bigint,
    O_O bigint,
    T_R bigint);
Am I missing something obvious or do I need to do more data manipulation to get this to work?
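For reference, a minimal sketch of one way the call could be restructured (not from the original post): crosstab()'s source query must return row name, category and value in that order, the category column (user_engn here) has to match the VALUES list, and a concatenated key can carry both grouping columns. The '|' delimiter, the split_part() unpacking and the COALESCE defaults are illustrative assumptions:
-- sketch only: table and column names are taken from the question,
-- the concatenated row key is an assumption
SELECT split_part(rowid, '|', 1) AS primary_fa,
       split_part(rowid, '|', 2) AS yyyymm,
       COALESCE(c_j, 0) AS c_j,
       COALESCE(t_r, 0) AS t_r,
       COALESCE(o_o, 0) AS o_o
FROM crosstab(
  $$SELECT primary_fa || '|' || yyyymm AS rowid,
           user_engn,
           count(*)
    FROM fy20_keeps_emdf
    GROUP BY 1, 2
    ORDER BY 1, 2$$,
  $$VALUES ('C_J'), ('O_O'), ('T_R')$$)
AS ct (rowid text, c_j bigint, o_o bigint, t_r bigint);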
Source data
I am working on an ELT project to load data from CSV files into PostgreSQL where I will transform it. The CSV files have many columns that are consistent across files, but also contain activity columns that are inconsistent with names like Date (05/19/2020), Type (05/19/2020), etc.
In the loading script I am merging all of the columns with dates in the column name into one jsonb column so I don't have to constantly add new columns to the raw data table.
The resulting jsonb column in the raw data table looks like this:
id        activity
--------  -----------------------------------------------------------------------------------------------------------------
12345678  {"Date (05/19/2020)": null, "Type (05/19/2020)": null, "Date (06/03/2020)": "06/01/2020", "Type (06/03/2020)": "E"}
98765432  {"Date (05/19/2020)": "05/18/2020", "Type (05/19/2020)": "B", "Date (10/23/2020)": "10/26/2020", "Type (10/23/2020)": "T"}
JSON to columns
Using the amazing create_jsonb_flat_view function from this post I can convert the jsonb to columns like this:
id        Date (05/19/2020)  Type (05/19/2020)  Date (06/03/2020)  Type (06/03/2020)  Date (10/23/2020)  Type (10/23/2020)
--------  -----------------  -----------------  -----------------  -----------------  -----------------  -----------------
12345678  null               null               06/01/2020         E                  null               null
98765432  05/18/2020         B                  null               null               10/26/2020         T
Need to move part of column name to row
Now, this is where I'm stuck. I need to remove the portion of the column name that is the Activity Date (e.g. (05/19/2020)) and create a row for each id and ActivityDate with additional columns for Date and Type like this:
id        ActivityDate  Date        Type
--------  ------------  ----------  ----
12345678  05/19/2020    null        null
12345678  06/03/2020    06/01/2020  E
98765432  05/19/2020    05/18/2020  B
98765432  10/23/2020    10/26/2020  T
I followed your link to the create_jsonb_flat_view article yesterday and then forgot this question. While I thank you for pointing me there, I think that mentioning it worked against you.
A more conventional approach using regexp_replace() works here. I left the date values as strings, but you can convert them with to_date() if needed:
with parse as (
    select id, e.k, e.v,
           -- strip the trailing " (mm/dd/yyyy)" part: 'Date (05/19/2020)' -> 'Date'
           regexp_replace(e.k, '\s+\([0-9/]{10}\)', '') as k_no_date,
           -- keep only the date between the parentheses: 'Date (05/19/2020)' -> '05/19/2020'
           regexp_replace(e.k, '^.+([0-9/]{10}).+', '\1') as k_date_only
    from rawinput
    cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
       k_date_only as activity_date,
       min(v) filter (where k_no_date = 'Date') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;
db<>fiddle here
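If real date values are wanted rather than strings, the to_date() conversion mentioned above could be applied to the activity_date and date columns; a tiny sketch, assuming the MM/DD/YYYY format shown in the sample data:
-- the format string is an assumption based on the sample values
select to_date('05/19/2020', 'MM/DD/YYYY');  -- 2020-05-19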
@Mike-Organek's answer works beautifully!
However, I was curious whether the regexp_replace() calls might be slowing the query down, and it seemed I could get the same results with a simpler function.
Since Mike gave me a great example to start with, I modified it to split on the space between Date and (05/19/2020).
For 20,000 rows, it went from taking an average of 7 seconds on my local machine to an average of 0.9 seconds.
Here is the resulting query:
with parse as (
    select id, e.k, e.v,
           split_part(e.k, ' ', 1) as k_no_date,
           trim(split_part(e.k, ' ', 2), '()') as k_date_only
    from rawinput
    cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
       k_date_only as activity_date,
       min(v) filter (where k_no_date = 'Date') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;
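As a quick sanity check of the key parsing (not part of the original answer), here is how the two expressions behave on a key shaped like the sample data:
-- split on the space, then trim the parentheses from the second part
select split_part('Date (05/19/2020)', ' ', 1)              as k_no_date,   -- 'Date'
       trim(split_part('Date (05/19/2020)', ' ', 2), '()')  as k_date_only; -- '05/19/2020'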
Suppose I have a lot of NULL values (missing values) in a column named 'score'. I want to replace them not with the average of all values in 'score', but with a group-specific average, where the groups are built from a cross-category formed by concatenating two categories:
This kind of query works for getting averages by groups:
SELECT
category1 || ' > ' || category2 AS crosscategory,
ROUND(CAST(AVG(score) AS FLOAT), 2) AS score_avg
FROM DatabaseName.TableName
GROUP BY crosscategory
ORDER BY score_avg;
This one works to replace NULL values by a constant:
SELECT
NVL(score, 0) AS score_without_missing_values
FROM DatabaseName.TableName
The problem I cannot solve is how to combine the two: replacing the NULL values not with a constant but with the group averages computed with AVG and GROUP BY.
Thank you very much for your help!
Seems you want a Group Average:
SELECT
t.*,
coalesce(score, AVG(score) OVER (PARTITION BY category1, category2)) AS score_avg
FROM DatabaseName.TableName AS t
I removed the ROUND/CAST, because AVG returns FLOAT by default and ROUND is probably not needed (if you do need it, you might be better off casting to a DECIMAL).
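If you then need to filter or aggregate on the filled-in score, the window expression has to be wrapped in a derived table, since window results cannot be referenced directly in WHERE or GROUP BY. A sketch using the same table and column names (the alias t and the threshold 50 are illustrative assumptions):
-- wrap the window expression in a derived table so the filled-in score
-- can be used like an ordinary column
SELECT crosscategory, score_filled
FROM (
    SELECT category1 || ' > ' || category2 AS crosscategory,
           COALESCE(score, AVG(score) OVER (PARTITION BY category1, category2)) AS score_filled
    FROM DatabaseName.TableName
) AS t
WHERE score_filled >= 50;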
Edited to provide sample data
I am trying to create a calculated column that sums two other columns, which should be easy. However, some values in both of the columns are null, so I want to use a case expression to replace null values in both columns with 0s and then add up the resulting values. The other complicating factor is that the second column contains text values with commas that need to be converted to numeric before I can add them. What I am currently trying to do is:
SELECT (case when pm."PS" is null then 0 else pm."PS" end)
     + (case when pm."PS-PREV1" is null then 0
             else replace(pm."PS-PREV1", ',', '')::numeric end) AS "Sales"
FROM pm
Sample data:
PS     PS-PREV1
-----  --------
20000  null
30000  20,000
null   null
null   30,000
Desired output:
output
------
20000
50000
0
30000
This is just returning the value of the 1st column without adding in the second column. Where am I going wrong? Am I overthinking this?
Your code should work; however, you can write it a little more cleanly:
SELECT COALESCE(pm."PS",0)
+ COALESCE(replace(pm."PS-PREV1", ',', '')::numeric,0) AS "Sales"
FROM pm
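As a quick self-contained check of that expression against the sample data (the inline column names ps and ps_prev1 are shortened stand-ins; treating "PS" as numeric is an assumption, while "PS-PREV1" as text follows the question):
-- inline sample rows; expected output: 20000, 50000, 0, 30000
SELECT COALESCE(ps, 0)
     + COALESCE(replace(ps_prev1, ',', '')::numeric, 0) AS sales
FROM (VALUES (20000, NULL),
             (30000, '20,000'),
             (NULL,  NULL),
             (NULL,  '30,000')) AS pm(ps, ps_prev1);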
I'm trying to solve this problem:
I have a query/view that joins ~10 tables to extract some fields for a report (if any). The query doesn't use any grouping function, only joins, and it cuts off some unneeded data.
I have to take this one big view, group it by the first column, take the max of a date in the second column, and take all of the other fields from the record that holds that max value.
I have not been able to do this in Postgres.
As a pseudo code I can give this:
select 1
, max(2)
, 3 referred to the record from max(2)
, 4 referred to the record from max(2)
, ...
, 20 referred to the record from max(2)
from (ViewWithAllJoins) a
group by 1
For privacy and business reasons I had to obfuscate some information; 1/2/3/4... are the names of the columns from the view "ViewWithAllJoins". I hope the problem is still understandable and solvable!
I've tried the WINDOW clause as described in Convert keep dense_rank from Oracle query into postgres, but I was not able to combine it with the GROUP BY that I need. I have also tried dense_rank as shown in Dense_rank first Oracle to Postgresql convert, but I cannot make any assumptions about the order of the data in any of the fields except 1 and 2, so I can't use aggregate functions on them.
Any ideas? Preferably without adding too many subqueries.
Thank you!
EDIT:
As suggested I'll add some synthetic data to better understand the problem and what I want.
Start:
ID DATE COLUMN1 COLUMN2 COLUMN3
=====================================================================
88888888;"2016-04-02 09:00:00";"aaaaaaaaaaa";"TEXT89" ; 999999999
88888888;"2018-08-21 09:00:00";"a" ;"TEXT1" ; 988888888
88888888;"2017-11-09 09:00:00";"zzzz" ;"TEXT80000" ; 850580582
75858585;"2017-01-31 09:00:00";"~~~~~~~~~~~";"TEXT10" ; 101010101
75858585;"2018-04-02 09:00:00";"eeeeeeeeeee";"TEXT1000" ; 111111111
99999999;"2016-04-02 09:00:00";"8d2ecafd866";"TEXT808911"; 777777777
What I want:
ID DATE COLUMN1 COLUMN2 COLUMN3
===================================================================
88888888;"2018-08-21 09:00:00";"a" ;"TEXT1" ; 988888888
75858585;"2018-04-02 09:00:00";"eeeeeeeeeee";"TEXT1000" ; 111111111
99999999;"2016-04-02 09:00:00";"8d2ecafd866";"TEXT808911"; 777777777
So the group by id, the max of the date and the other fields related to the row of the max date.
So you have duplicate records per ID, and for every ID you want to select the record with the most recent date?
Use NOT EXISTS:
SELECT id, zdate, column1, column2, column3 -- , ...
FROM queryview t
WHERE NOT EXISTS (
    SELECT *
    FROM queryview x
    WHERE x.id = t.id
      AND x.zdate > t.zdate
);
Or, use row_number() over a window, and pick only the row with the final date:
SELECT id, zdate, column1, column2, column3 -- , ...
FROM ( SELECT *
            , row_number() OVER (PARTITION BY id ORDER BY zdate DESC) AS rn
       FROM queryview
     ) q
WHERE q.rn = 1
;
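Not part of the original answer, but since this is Postgres, DISTINCT ON is another common way to keep one row per id; a short sketch using the same column names:
-- keeps the first row per id according to the ORDER BY, i.e. the latest zdate
SELECT DISTINCT ON (id)
       id, zdate, column1, column2, column3
FROM queryview
ORDER BY id, zdate DESC;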
What MDX query logic could I implement for this example to get two rows in the result set for HRID = 1: one row with 1/1/16 as the min date (Start), where someattribute shows up in the column with the value 'A',
and a second row with 1/15/16 as the min date (Start), where someattribute has the value 'B', with Measures.Whatevers aggregated over whatever data corresponds to that dimension row?
I'm trying to look at January 2016 only.
With everything I've tried, I either get min date values of 1/1/1900, or both rows get the value 1/1/2016, or I get errors, since I can't figure it out.
Here's my MDX sample:
WITH MEMBER [Measures].[Start] as
(
-- min date that the combination of someattribute and hrid have certain
-- value withing the range of the where clause restriction of january 2016
SELECT {
[Measures].[Start]
, [Measures].[Whatevers]
} ON COLUMNS
, NON EMPTY {
[Agent].[HRID].children
* [Agent].[someAtribute].Members
} ON ROWS
FROM [RADM_REPORTING]
WHERE (
[Date].[Date View].[Month].&[201601]
)
This works, but it feels like a hack, or at least not robust; I am not familiar enough with MDX to make that call.
WITH MEMBER [Measures].[Start] as
filter([Date].[Date View].[Month].&[201601].children,
[Measures].[Whatevers]).item(0).membervalue
Here is a potential direction that is more general:
WITH
MEMBER [Measures].[Start] AS
Min
(
(EXISTING
[Date].[Date].[Date].MEMBERS)
,IIF
(
[Measures].[Internet Sales Amount] = 0
,NULL
,[Date].[Date].CurrentMember.MemberValue
)
)
SELECT
NON EMPTY
{
[Measures].[Start]
,[Measures].[Internet Sales Amount]
} ON COLUMNS
,NON EMPTY
[Product].[Product Categories].[Product] ON ROWS
FROM [Adventure Works]
WHERE
[Date].[Calendar].[Calendar Year].&[2005];
It gives the following: