query count of rows where id is less than a series of values in Redshift - amazon-redshift

I have a table etl_control which stores the latest_id of the x_data table every day. Now I need to get the number of rows for each day.
My idea is to run a query that counts the rows satisfying x_data.id <= etl_control.latest_id for each day.
The table structures are as follows.
etl_control:
record_date | latest_id |
---------------------------------
2016-11-01 | 55 |
2016-11-02 | 125 |
2016-11-03 | 154 |
2016-11-04 | 190 |
2016-11-05 | 201 |
2016-11-06 | 225 |
2016-11-07 | 287 |
x_data:
id | value |
---------------------------------
10 | xyz |
11 | xyz |
21 | xyz |
55 | xyz |
101 | xyz |
108 | xyz |
125 | xyz |
142 | xyz |
154 | xyz |
160 | xyz |
166 | xyz |
178 | xyz |
190 | xyz |
191 | xyz |
The end result should have the number of rows in x_data for each day. I tried a number of variations using JOIN, WITH and COUNT(*) OVER. But the biggest hurdle is to iteratively compare x_data.id with etl_control.latest_id.

Really sorry folks. Got the answer myself after posting the question.
The query is really simple.
WITH data AS (
SELECT e.latest_id
FROM x_data AS x, etl_control AS e
WHERE x.id <= e.latest_id)
SELECT latest_id, count(*) FROM data GROUP BY latest_id;
This basically builds an intermediate result (a CTE rather than a physical temp table) in which each latest_id is repeated once for every x_data row whose id is less than or equal to it.
A simple GROUP BY on that intermediate result then gives the expected counts.
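To check the query against the sample data from the question, here is a minimal sketch that loads both tables into an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for Redshift here (an assumption for illustration), and the ORDER BY is added just to make the output order deterministic. The counts are cumulative: each row says how many x_data rows existed up to that day's latest_id.

```python
import sqlite3

# In-memory SQLite database standing in for Redshift (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE etl_control (record_date TEXT, latest_id INTEGER);
CREATE TABLE x_data (id INTEGER, value TEXT);
""")

# Sample rows copied from the question.
conn.executemany(
    "INSERT INTO etl_control VALUES (?, ?)",
    [
        ("2016-11-01", 55), ("2016-11-02", 125), ("2016-11-03", 154),
        ("2016-11-04", 190), ("2016-11-05", 201), ("2016-11-06", 225),
        ("2016-11-07", 287),
    ],
)
x_ids = [10, 11, 21, 55, 101, 108, 125, 142, 154, 160, 166, 178, 190, 191]
conn.executemany("INSERT INTO x_data VALUES (?, ?)", [(i, "xyz") for i in x_ids])

# Same query as in the answer, plus ORDER BY for a stable row order.
cumulative_counts = conn.execute("""
WITH data AS (
    SELECT e.latest_id
    FROM x_data AS x, etl_control AS e
    WHERE x.id <= e.latest_id)
SELECT latest_id, count(*) FROM data GROUP BY latest_id ORDER BY latest_id
""").fetchall()

for latest_id, row_count in cumulative_counts:
    print(latest_id, row_count)
```

Subtracting the previous day's count from each row would give the number of rows added on that day.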

Filtering out hierarchical data

I need help with a problem I am facing processing hierarchical data.
Schema of the tables that maintain hierarchical data:
Category table:
| ID | Label |
Mapping table:
| ID | QualifierID | ItemID | ParentID |
Step 1: Wrote a simple self-join query to transform the above mappings:
WITH category_masterlist AS (
SELECT id,
label
FROM Category
)
select id, id as itemid, label, NULL as parentId from [Category] where categoryLevel = 1
UNION
select itemid as id, itemId, (select label from category_masterlist where id = cm.itemid) Label, parentId
from [CategoryMapping] cm
Step 2: Wrote a self-join query using common table expression to return mapping data as follows:
WITH CategoryCTE(ParentID, ID, Label, CategoryLevel) AS
(
SELECT ParentID, ItemID, Label, 0 AS CategoryLevel
FROM [view_TreeviewCategoryMapping]
WHERE ParentID IS NULL
UNION ALL
SELECT e.ParentID, e.ItemID, e.Label, CategoryLevel + 1
FROM [view_TreeviewCategoryMapping] AS e
INNER JOIN CategoryCTE AS d
ON e.ParentID = d.ID
)
SELECT distinct ParentID, ID, Label, CategoryLevel
FROM CategoryCTE
| ID | Label | ParentID | CategoryLevel |
--------------------------------------------------------------------------------
| 90 | Satellite | NULL | 0 |
| 91 | Concrete | NULL | 0 |
| 92 | ETC | NULL | 0 |
| 93 | Chisel | NULL | 0 |
| 94 | Steel | NULL | 0 |
| 96 | Wood | NULL | 0 |
| 97 | MIC Systems | 90 | 1 |
| 97 | MIC Systems | 91 | 1 |
| 99 | Foundations | 91 | 1 |
| 100 | Down Systems | 91 | 1 |
| 101 | Side Systems | 91 | 1 |
| 102 | Systems | 91 | 1 |
| 98 | DWG | 92 | 1 |
| 97 | MIC Systems | 93 | 1 |
| 97 | MIC Systems | 94 | 1 |
| 99 | Foundations | 94 | 1 |
| 100 | Down Systems | 94 | 1 |
| 101 | Side Systems | 94 | 1 |
| 102 | Systems | 94 | 1 |
| 97 | MIC Systems | 95 | 1 |
| 98 | DWG | 95 | 1 |
| 102 | Systems | 95 | 1 |
| 103 | Project Management| 95 | 1 |
| 104 | Software | 95 | 1 |
| 99 | Foundations | 96 | 1 |
| 119 | Fronts | 97 | 2 |
| 121 | Technology | 98 | 2 |
| 112 | Root Systems | 98 | 2 |
| 112 | Root Systems | 99 | 2 |
| 137 | Closed Systems | 112 | 3 |
| 203 | Support | 121 | 3 |
Step 3: I would like to filter above results so that only categories that are mapped completely are returned. Completed mapping is a mapping that has children at level=3. For example, below is what I am looking for based on above resultset:
| ID | Label | ParentID | CategoryLevel |
--------------------------------------------------------------------------------
| 96 | Wood | NULL | 0 |
| 92 | ETC | NULL | 0 |
| 98 | DWG | 92 | 1 |
| 99 | Foundations | 96 | 1 |
| 121 | Technology | 98 | 2 |
| 112 | Root Systems | 98 | 2 |
| 112 | Root Systems | 99 | 2 |
| 137 | Closed Systems | 112 | 3 |
| 203 | Support | 121 | 3 |
Step 4: Ultimately, end user should be presented with a tree view control as follows:
Root
|
|---Wood
| |---Foundations
| |---Root Systems
| |---Closed Systems
|
|---ETC
| |---DWG
| |---Technology
| |---Support
| |---Root Systems
| |---Closed Systems
Please note, a category can have multiple parents. For example, Root Systems has two parents - DWG and Foundations. Did I get the schema correct for category and mapping table especially for the case when a category can have multiple parents?
How can I filter out categories that are not mapped completely from Step 2 to Step 3? That is the hurdle I am unable to cross. Any pointers? I can filter them out at the application level but would really love to filter them out at database level.
I am open to suggestions and recommendations that will help me achieve my goal. I also want a confirmation that the schema I am using is the most efficient one.
Thank you!
Here is a working option that uses the hierarchyid data type.
The nesting is optional and really just for illustration.
Example
Declare @Top int = null --<< Sets top of Hier Try 94
;with cteP as (
Select ID
,ParentID
,Label
,HierID = convert(hierarchyid,concat('/',ID,'/'))
From YourTable
Where IsNull(@Top,-1) = case when @Top is null then isnull(ParentID ,-1) else ID end
Union All
Select ID = r.ID
,Pt = r.ParentID
,Label = r.Label
,HierID = convert(hierarchyid,concat(p.HierID.ToString(),r.ID,'/'))
From YourTable r
Join cteP p on r.ParentID = p.ID)
Select Lvl = HierID.GetLevel()
,ID
,ParentID
,Label = replicate('|----',HierID.GetLevel()-1) + Label -- Nesting Optional ... For Presentation
,HierID_String = HierID.ToString()
From cteP A
Order By A.HierID
Results
Now if @Top was set to 94
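For engines without a hierarchyid type, the same path-building idea can be sketched with a recursive CTE that concatenates ids into a text path. This is a substitute technique, not the hierarchyid approach itself. The sketch below runs on SQLite through Python's sqlite3 module; the table name your_table, the lowercase column names, and the choice of rows (the nine rows from the question's Step 3 result) are assumptions for illustration. Sorting by a text path only approximates hierarchyid ordering, because text compares character by character rather than numerically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (id INTEGER, parent_id INTEGER, label TEXT)")

# Rows taken from the Step 3 result in the question; Root Systems (112) has two parents.
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?, ?)",
    [
        (92, None, "ETC"),
        (96, None, "Wood"),
        (98, 92, "DWG"),
        (99, 96, "Foundations"),
        (121, 98, "Technology"),
        (112, 98, "Root Systems"),
        (112, 99, "Root Systems"),
        (137, 112, "Closed Systems"),
        (203, 121, "Support"),
    ],
)

# Anchor on the roots, then append each child's id to its parent's path.
tree_rows = conn.execute("""
WITH RECURSIVE cteP(id, parent_id, label, lvl, path) AS (
    SELECT id, parent_id, label, 1, '/' || id || '/'
    FROM your_table
    WHERE parent_id IS NULL
    UNION ALL
    SELECT r.id, r.parent_id, r.label, p.lvl + 1, p.path || r.id || '/'
    FROM your_table AS r
    JOIN cteP AS p ON r.parent_id = p.id
)
SELECT lvl, id, label, path FROM cteP ORDER BY path
""").fetchall()

# Indent each label by its depth, mirroring the optional nesting in the answer.
for lvl, _id, label, path in tree_rows:
    print("|----" * (lvl - 1) + label, path)
```

Because a category with two parents is reached through both of them, it appears once per path, which matches the tree view the question asks for.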

PostgreSQL: selecting two different values as two columns from one column

I need to select two ids from stockcurrent as two different columns (id1, id2): the first where points.id = '244' and the second where points.id = '191'. But the result follows only the final WHERE clause, so only one column is filled.
I think I've hit a problem similar to this one: Two SELECT statements as two columns
The only difference is that in that case the final WHERE clause covers a range, while mine does not. I suspect that is why my statement is not working:
select
(case when po.id='244' then st.id end) id1,
(case when po.id='191' then st.id end) id2
from stockcurrent st
inner join points po on po.id = st.point
where po.id ='244';
My result:
Expected result:
So I need a way to fill both columns with ids, not just the one column that matches '244' as in the query above. Thanks in advance.
Example of stockcurrent table:
+-------+-------+
| id | point |
+-------+-------+
| 23414 | 191 |
| 12493 | 191 |
| 16121 | 170 |
| 24325 | 191 |
| 51232 | 244 |
| 11255 | 244 |
| 56572 | 244 |
| 16123 | 170 |
+-------+-------+
Example of points table:
+-----+------+------+
| id | comp | type |
+-----+------+------+
| 191 | 96 | 2 |
| 307 | 96 | 1 |
| 244 | 97 | 0 |
| 311 | 98 | 0 |
| 170 | 109 | 0 |
+-----+------+------+
Change the query to:
select
(case when po.id='244' then st.id end) id1,
(case when po.id='191' then st.id end) id2
from stockcurrent st
inner join points po on po.id = st.point
where po.id in ('244', '191');
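To see what the changed query returns on the sample tables, here is a minimal sketch using an in-memory SQLite database through Python's sqlite3 module. SQLite stands in for PostgreSQL, the ids are compared as integers rather than quoted strings, and the ORDER BY is added only for a stable row order; all three are assumptions for illustration. Each result row fills exactly one of the two columns and leaves the other NULL (None in Python).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stockcurrent (id INTEGER, point INTEGER);
CREATE TABLE points (id INTEGER, comp INTEGER, type INTEGER);
""")

# Sample rows copied from the question.
conn.executemany(
    "INSERT INTO stockcurrent VALUES (?, ?)",
    [
        (23414, 191), (12493, 191), (16121, 170), (24325, 191),
        (51232, 244), (11255, 244), (56572, 244), (16123, 170),
    ],
)
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?)",
    [(191, 96, 2), (307, 96, 1), (244, 97, 0), (311, 98, 0), (170, 109, 0)],
)

# The WHERE clause now lets both point ids through, so both CASE branches can match.
two_column_rows = conn.execute("""
SELECT
    (CASE WHEN po.id = 244 THEN st.id END) AS id1,
    (CASE WHEN po.id = 191 THEN st.id END) AS id2
FROM stockcurrent st
INNER JOIN points po ON po.id = st.point
WHERE po.id IN (244, 191)
ORDER BY st.id
""").fetchall()

for id1, id2 in two_column_rows:
    print(id1, id2)
```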

Aggregate at either of two levels

In Tableau, I am joining two tables where a header can have multiple details
Work Order Header
Work Order Details
The joined data looks like this:
Header.ID | Header.ManualTotal | Details.ID | Details.LineTotal
A | 1000 | 1 | 550
A | 1000 | 2 | 35
A | 1000 | 3 | 100
B | 335 | 1 | 250
B | 335 | 2 | 300
C | null | 1 | 50
C | null | 2 | 25
C | null | 3 | 5
C | null | 4 | 5
Where there is a manual total, use that; if there is none, use the sum of the line totals:
ID | Total
A | 1000
B | 335
C | 85
I tried something like this:
ifnull( sum({fixed [Header ID] : [Manual Total] }), sum([Line Total]) )
Basically, I need to use IFNULL: use the manual total if it exists, or the sum of the line totals if it doesn't.
Please advise on how to use LODs or some other solution to get the correct answer.
Here is a solution that does not require a level-of-detail calculation.
Just try this:
1. Use an inner join on the ID of the two tables.
2. Create this calculation: ifnull(median([Manual Total]),sum([Line Total]))
3. Insert agg(your_calculation) into your sheet.
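The logic behind that calculation can be checked outside Tableau. The sketch below is plain SQL run on SQLite through Python's sqlite3 module, not Tableau syntax; the table name work_orders and its column names are hypothetical. MAX is used in place of median only because SQLite has no built-in median; since the manual total repeats on every detail row of a header, either aggregate returns the same value, and both return NULL when every value is NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE work_orders (
    header_id TEXT, manual_total INTEGER, detail_id INTEGER, line_total INTEGER
)
""")

# Joined header/detail rows copied from the question; None stands for a null manual total.
conn.executemany(
    "INSERT INTO work_orders VALUES (?, ?, ?, ?)",
    [
        ("A", 1000, 1, 550), ("A", 1000, 2, 35), ("A", 1000, 3, 100),
        ("B", 335, 1, 250), ("B", 335, 2, 300),
        ("C", None, 1, 50), ("C", None, 2, 25), ("C", None, 3, 5), ("C", None, 4, 5),
    ],
)

# Use the manual total when present, otherwise fall back to the sum of line totals.
totals = conn.execute("""
SELECT header_id, COALESCE(MAX(manual_total), SUM(line_total)) AS total
FROM work_orders
GROUP BY header_id
ORDER BY header_id
""").fetchall()

for header_id, total in totals:
    print(header_id, total)
```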

How to Join data from a dataframe

I have one table with many types of data, and some of the rows carry one piece of information that is really important for analysing the rest of the data.
This is the table that I have
name |player_id|data_ms|coins|progress |
progress | 1223 | 10 | | 128 |
complete | 1223 | 11 | 154| |
win | 1223 | 9 | 111| |
progress | 1223 | 11 | | 129 |
played | 1111 | 19 | 141| |
progress | 1111 | 25 | | 225 |
This is the table that I want
name |player_id|data_ms|coins|progress |
progress | 1223 | 10 | | 128 |
complete | 1223 | 11 | 154| 128 |
win | 1223 | 9 | 111| 129 |
progress | 1223 | 11 | | 129 |
played | 1111 | 19 | 141| 225 |
progress | 1111 | 25 | | 225 |
I need to find the progress of the player, with the condition that it has to be the first progress emitted after the data_ms (epoch Unix timestamp) of this event.
My table has 4 billion rows of data, and it's partitioned by data.
I tried to create a UDF that reads and filters the table, but that's not an option since you can't use the Spark session inside a UDF.
Any idea how I should do this?
It seems like you want to fill gaps in column progress. I didn't really understand the condition but if it's based on data_ms then your hive query should look like this:
dataFrame.createOrReplaceTempView("your_table")
val progressDf = sparkSession.sql(
"""
SELECT name, player_id, data_ms, coins,
COALESCE(progress, LAST_VALUE(progress, TRUE) over (PARTITION BY player_id ORDER BY data_ms ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS progress
FROM your_table
"""
)

Split postgres records into groups based on time fields

I have a table with records that look like this:
| id | coord-x | coord-y | time |
---------------------------------
| 1 | 0 | 0 | 123 |
| 1 | 0 | 1 | 124 |
| 1 | 0 | 3 | 125 |
The time column represents a time in milliseconds. What I want to do is find all coord-x, coord-y as a set of points for a given timeframe for a given id. For any given id there is a unique coord-x, coord-y, and time.
What I need to do however is group these points as long as they're n milliseconds apart. So if I have this:
| id | coord-x | coord-y | time |
---------------------------------
| 1 | 0 | 0 | 123 |
| 1 | 0 | 1 | 124 |
| 1 | 0 | 3 | 125 |
| 1 | 0 | 6 | 140 |
| 1 | 0 | 7 | 141 |
I would want a result similar to this:
| id | points | start-time | end-time |
| 1 | (0,0), (0,1), (0,3) | 123 | 125 |
| 1 | (0,6), (0,7) | 140 | 141 |
I do have PostGIS installed on my database. The times I posted above are not representative; I kept them small just as a sample. The time is just a millisecond timestamp.
The tricky part is picking the expression inside your GROUP BY. If n = 5, you can do something like time / 5. To match the example exactly, the query below uses (time - 3) / 5. Once you group it, you can aggregate them into an array with array_agg.
SELECT
array_agg(("coord-x", "coord-y")) as points,
min(time) AS time_start,
max(time) AS time_end
FROM "<your_table>"
WHERE id = 1
GROUP BY (time - 3) / 5
Here is the output
+---------------------------+--------------+------------+
| points | time_start | time_end |
|---------------------------+--------------+------------|
| {"(0,0)","(0,1)","(0,3)"} | 123 | 125 |
| {"(0,6)","(0,7)"} | 140 | 141 |
+---------------------------+--------------+------------+
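A fixed-width bucket such as (time - 3) / 5 can separate two points that are close together but fall on opposite sides of a bucket boundary. A different approach, which is not what the query above does, is to start a new group whenever the gap to the previous point exceeds n. The sketch below shows that idea in plain Python; the function name group_by_gap and the tuple layout are hypothetical, chosen only for illustration.

```python
def group_by_gap(rows, n):
    """Group (x, y, time) rows so that consecutive points in a group
    are at most n milliseconds apart.

    Returns a list of (points, start_time, end_time) tuples.
    """
    groups = []
    for x, y, t in sorted(rows, key=lambda r: r[2]):
        if groups and t - groups[-1][2] <= n:
            # Close enough to the previous point: extend the current group.
            points, start, _ = groups[-1]
            groups[-1] = (points + [(x, y)], start, t)
        else:
            # First point, or the gap is larger than n: start a new group.
            groups.append(([(x, y)], t, t))
    return groups


# Sample rows for id = 1 from the question: (coord-x, coord-y, time).
rows = [(0, 0, 123), (0, 1, 124), (0, 3, 125), (0, 6, 140), (0, 7, 141)]
groups = group_by_gap(rows, n=5)

for points, start, end in groups:
    print(points, start, end)
```

In SQL the same gap-based grouping is usually written with a window function that compares each row's time to the previous row's time, then a running sum over the "new group" flags.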