Stream Analytics: INNER JOIN of two subqueries - tsql

I am using two subqueries in Stream Analytics so that I can run two AzureML functions.
WITH subquery as (
SELECT
id as id,
username as username,
try_cast(startTime as datetime) as startTime,
try_cast(endTime as datetime) as endTime,
AC as AC, FM as FM, UC as UC,
DL as DL, DS as DS, DP as DP,
LB as LB, ASTV as ASTV, MSTV as MSTV,
ALTV as ALTV, MLTV as MLTV, Width as Width,
Min as Min, Max as Max, Nmax as Nmax,
Nzeros as Nzeros, Mode as Mode, Mean as Mean,
Median as Median, Variance as Variance, Tendency as Tendency,
rms,fmed,fpeak,sample_entropy,
EventProcessedUtcTime as EventProcessedUtcTime,
Distress(AC,FM,UC,DL,DS,DP,1,LB,ASTV,MSTV,ALTV,MLTV,
Width,Min,Max,Nmax,Nzeros,Mode,Mean,Median,Variance,
Tendency,1,1,1,1,1,1,1,1,1,1,1,1) as resultFHR
FROM
iot
),
subquery2 as (
SELECT
id as id,
try_cast(startTime as datetime) as startTime,
try_cast(endTime as datetime) as endTime,
AC as AC, FM as FM, UC as UC,
DL as DL, DS as DS, DP as DP,
LB as LB, ASTV as ASTV, MSTV as MSTV,
ALTV as ALTV, MLTV as MLTV, Width as Width,
Min as Min, Max as Max, Nmax as Nmax,
Nzeros as Nzeros, Mode as Mode, Mean as Mean,
Median as Median, Variance as Variance, Tendency as Tendency,
rms,fmed,fpeak,sample_entropy,
EventProcessedUtcTime as EventProcessedUtcTime,
Labour("",1,1,1,"",rms,fmed,fpeak,sample_entropy,"","") as resultUC
FROM
iot
)
SELECT
id as id,
username as username,
startTime as startTime,
endTime as endTime,
AC as AC, FM as FM, UC as UC,
DL as DL, DS as DS, DP as DP,
LB as LB, ASTV as ASTV, MSTV as MSTV,
ALTV as ALTV, MLTV as MLTV, Width as Width,
Min as Min, Max as Max, Nmax as Nmax,
Nzeros as Nzeros, Mode as Mode, Mean as Mean,
Median as Median, Variance as Variance, Tendency as Tendency,
EventProcessedUtcTime as EventProcessedUtcTime,
resultFHR.[classes] as distress,
resultFHR.[probabilities] as distressProbability,
resultUC.[classes] as labour,
resultUC.[probabilities] as labourProbability
INTO
sql
FROM
subquery INNER JOIN subquery2 ON subquery.id = subquery2.id
AND DATEDIFF(second, subquery, subquery2) BETWEEN 0 AND 20
SELECT
*
INTO
c2d
FROM
subquery INNER JOIN subquery2 ON subquery.id = subquery2.id
AND DATEDIFF(second, subquery, subquery2) BETWEEN 0 AND 20
I tried to use INNER JOIN to join the two subqueries, but it works for the second query and not for the first. When I use the INNER JOIN in the first query, it shows an error:
Invalid column name: 'id'. Column with such name does not exist.
Any solution for that?

Since you are joining two sources, subquery and subquery2, you need to qualify columns with the name of the source, as you did in the ON clause (subquery.id = subquery2.id).
Unqualified names are only allowed when you have a single source, as in the subquery step, for example.
Change the column references to fully qualify them, like this:
SELECT
subquery.id as id,
subquery.username as username,
subquery.startTime as startTime,
subquery.endTime as endTime,
...
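The same rule can be reproduced locally. The block below is a minimal sketch that uses SQLite (through Python's sqlite3 module) as a stand-in for Stream Analytics, so the error text differs from the one in the question; the two-column iot table and its single row are made up for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for Stream Analytics; the table layout and
# the sample row are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iot (id INTEGER, username TEXT)")
conn.execute("INSERT INTO iot VALUES (1, 'alice')")

ctes = """
WITH subquery  AS (SELECT id, username FROM iot),
     subquery2 AS (SELECT id, username FROM iot)
"""
join = " FROM subquery JOIN subquery2 ON subquery.id = subquery2.id"

# Unqualified: both joined sources expose a column named id,
# so the engine cannot tell which one is meant and rejects the query.
try:
    conn.execute(ctes + "SELECT id" + join)
    unqualified_error = None
except sqlite3.OperationalError as exc:
    unqualified_error = str(exc)

# Qualified: each column names the source it comes from.
qualified_rows = conn.execute(
    ctes + "SELECT subquery.id AS id, subquery.username AS username" + join
).fetchall()

print(unqualified_error)
print(qualified_rows)
```

Only the qualified form runs; the unqualified one is rejected before any row is read.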

Open, high, low, close aggregation in BigQuery

Based on the BigQuery best practice of using ARRAY_AGG() to obtain the latest record, here's how I'm getting the first, last, minimum, and maximum values of a field for a day. The data is reported approximately hourly.
WITH t AS (
SELECT TIMESTAMP('2021-01-01 01:00:00') as l, 10 as v
UNION ALL
SELECT TIMESTAMP('2021-01-01 02:00:00') as l, 12 as v
UNION ALL
SELECT TIMESTAMP('2021-01-01 03:00:00') as l, 15 as v
UNION ALL
SELECT TIMESTAMP('2021-01-01 04:00:00') as l, 2 as v
UNION ALL
SELECT TIMESTAMP('2021-01-02 01:00:00') as l, 600 as v
UNION ALL
SELECT TIMESTAMP('2021-01-02 02:00:00') as l, 120 as v
UNION ALL
SELECT TIMESTAMP('2021-01-02 03:00:00') as l, 150 as v
UNION ALL
SELECT TIMESTAMP('2021-01-03 04:00:00') as l, 0 as v)
SELECT EXTRACT(DATE FROM l) d,
ARRAY_AGG(t.v ORDER BY t.l ASC LIMIT 1)[OFFSET(0)] first_value,
ARRAY_AGG(t.v ORDER BY t.l DESC LIMIT 1)[OFFSET(0)] last_value,
ARRAY_AGG(t.v ORDER BY t.v DESC LIMIT 1)[OFFSET(0)] max_value,
ARRAY_AGG(t.v ORDER BY t.v ASC LIMIT 1)[OFFSET(0)] min_value,
FROM
t
GROUP BY
d
Output:
Row  d           max_value  min_value  last_value  first_value
1    2021-01-01  15         2          2           10
2    2021-01-02  600        120        150         600
3    2021-01-03  0          0          0           0
Since there are only six BigQuery questions on Code Review, I thought I'd ask here on the main Stack Overflow. Is this the fastest method? Do I have anything extraneous in my query? (I'm not too sure that [OFFSET(0)] is doing anything.)
I've seen this question asked on Stack Overflow for Oracle, T-SQL and Postgres but I haven't seen anything specific for BigQuery. Thanks!
An obvious improvement is to use plain MIN and MAX for min_value and max_value:
select date(l) d,
array_agg(v order by l asc limit 1)[offset(0)] first_value,
array_agg(v order by l desc limit 1)[offset(0)] last_value,
max(v) max_value,
min(v) min_value
from t
group by d
Other than that, using array_agg is good practice here, and [offset(0)] is important: without it, your outputs will be one-element arrays, whereas you most likely want the element itself.
One more option, depending on the volume of your data: you can try the approach below, which uses analytic (window) functions instead of plain aggregate functions:
select distinct * from (
select date(l) d,
first_value(v) over(partition by date(l) order by l asc) first_value,
first_value(v) over(partition by date(l) order by l desc) last_value,
max(v) over(partition by date(l)) max_value,
min(v) over(partition by date(l)) min_value
from t
)
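To check the window-function variant against the sample data without a BigQuery project, here is a rough sketch that runs an equivalent query in SQLite through Python's sqlite3 module. It assumes SQLite 3.25 or newer (for window functions), stores the timestamps as ISO text, and renames the output columns to first_v, last_v, max_v and min_v; none of that comes from the original answer.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for BigQuery; timestamps are ISO text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (l TEXT, v INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("2021-01-01 01:00:00", 10),
        ("2021-01-01 02:00:00", 12),
        ("2021-01-01 03:00:00", 15),
        ("2021-01-01 04:00:00", 2),
        ("2021-01-02 01:00:00", 600),
        ("2021-01-02 02:00:00", 120),
        ("2021-01-02 03:00:00", 150),
        ("2021-01-03 04:00:00", 0),
    ],
)

query = """
SELECT DISTINCT * FROM (
  SELECT date(l) AS d,
         first_value(v) OVER (PARTITION BY date(l) ORDER BY l ASC)  AS first_v,
         first_value(v) OVER (PARTITION BY date(l) ORDER BY l DESC) AS last_v,
         max(v) OVER (PARTITION BY date(l)) AS max_v,
         min(v) OVER (PARTITION BY date(l)) AS min_v
  FROM t
)
ORDER BY d
"""

# Every input row carries the same per-day values, so DISTINCT
# collapses them to one row per day.
ohlc = conn.execute(query).fetchall()
for row in ohlc:
    print(row)
```

The three rows agree with the output table in the question, with the columns in the order d, first, last, max, min.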
Another option to consider is using approximate aggregate functions, as in the example below:
select extract(date from l) d,
approx_top_sum(v, 1 / unix_seconds(l), 1)[offset(0)].value first_value,
approx_top_sum(v, unix_seconds(l), 1)[offset(0)].value last_value,
max(v) max_value,
min(v) min_value,
from t
group by d

How to count values from temp column in PostGIS?

I need to count the total number of calculated values between -10 and 10.
What I have tried to do is:
WITH routes as (
SELECT
reg,
heading-lag(heading) over (PARTITION BY reg order by time) AS direction
FROM my_table
)
SELECT direction, reg, Count(direction) AS total_count
FROM routes WHERE direction between -10 AND 10
GROUP BY reg, direction;
This counts how many of each value between -10 and 10 each route has. But how can I count just one value per route in the given range?
Try the following query:
WITH routes as (
SELECT
reg,
heading-lag(heading) over (PARTITION BY reg order by time) AS direction
FROM my_table
)
SELECT COUNT(DISTINCT reg) AS total
FROM routes WHERE direction between -10 AND 10;
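The difference between the two counts is easier to see on a few rows. This is a minimal sketch using SQLite (3.25+) through Python's sqlite3 module in place of PostgreSQL/PostGIS; the sample routes are invented, and the ordering column is called ts here instead of time purely for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; data and the ts column
# name are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (reg TEXT, ts INTEGER, heading REAL)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        ("A", 1, 10), ("A", 2, 15), ("A", 3, 100),   # changes: NULL, 5, 85
        ("B", 1, 0),  ("B", 2, 50), ("B", 3, 120),   # changes: NULL, 50, 70
        ("C", 1, 90), ("C", 2, 92), ("C", 3, 95),    # changes: NULL, 2, 3
    ],
)

routes_cte = """
WITH routes AS (
  SELECT reg,
         heading - lag(heading) OVER (PARTITION BY reg ORDER BY ts) AS direction
  FROM my_table
)
"""
in_range = " FROM routes WHERE direction BETWEEN -10 AND 10"

# One number overall: how many routes have at least one change in range.
total = conn.execute(
    routes_cte + "SELECT COUNT(DISTINCT reg)" + in_range
).fetchone()[0]

# One number per route: how many in-range changes each route has.
per_route = conn.execute(
    routes_cte + "SELECT reg, COUNT(*)" + in_range + " GROUP BY reg ORDER BY reg"
).fetchall()

print(total)
print(per_route)
```

Route B drops out of both results because none of its heading changes fall in the range; the first row of each route has a NULL change and is filtered out by BETWEEN.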

How to correctly use the GROUP BY function in Postgresql

The code below works well, except for the threshold step. Instead of calculating the count for each raster created in the polygon_dump step, it counts all the rasters together. I have been trying to use GROUP BY with limited success.
WITH
-- Select Features
feat AS
(SELECT toid AS building_id,
wkb_geometry AS geom
FROM buildings
),
polygon_dump AS
(SELECT (ST_DumpAsPolygons(ST_Clip(a.st_roughness,1,b.geom,-9999,true))).val AS polygon_vals,building_id AS building_id2
FROM grosvenor_raster_roughness a, feat b
),
threshold AS
(SELECT Count(*) AS thres_val
FROM polygon_dump
WHERE polygon_vals >= 0 AND polygon_vals < 0.5
GROUP BY building_id2
),
b_stats AS
(SELECT building_id, (stats).*
FROM (SELECT building_id, ST_SummaryStats(ST_Clip(a.st_roughness,1,b.geom,-9999,true)) AS stats
FROM grosvenor_raster_roughness a
INNER JOIN feat b
ON ST_Intersects(b.geom,a.st_roughness)
) AS foo
)
-- Summarise statistics
SELECT count As pixel_count,
thres_val AS threshold_val,
cast(thres_val as real)/cast(count as real)*100 AS percent_value,
min AS min_pixel_val,
max AS max_pixel_val,
mean AS avg_pixel_val,
stddev AS pixel_stddev
FROM b_stats, threshold
WHERE count > 0;
I get the following results:
The two columns in red are the correct results, what do I need to do to only get those results?
You are doing a CROSS JOIN. You need to add a building_id column to your threshold CTE so that you can JOIN it with b_stats. I'm not sure whether it should be a LEFT or an INNER JOIN, so I'm going to use INNER.
WITH
-- Select Features
feat AS
(SELECT toid AS building_id,
wkb_geometry AS geom
FROM buildings
),
polygon_dump AS
(SELECT (ST_DumpAsPolygons(ST_Clip(a.st_roughness,1,b.geom,-9999,true))).val AS polygon_vals,building_id AS building_id2
FROM grosvenor_raster_roughness a, feat b
),
threshold AS
(SELECT building_id2 AS building_id, Count(*) AS thres_val
FROM polygon_dump
WHERE polygon_vals >= 0 AND polygon_vals < 0.5
GROUP BY building_id2
),
b_stats AS
(SELECT building_id, (stats).*
FROM (SELECT building_id, ST_SummaryStats(ST_Clip(a.st_roughness,1,b.geom,-9999,true)) AS stats
FROM grosvenor_raster_roughness a
INNER JOIN feat b
ON ST_Intersects(b.geom,a.st_roughness)
) AS foo
)
-- Summarise statistics
SELECT count As pixel_count,
thres_val AS threshold_val,
cast(thres_val as real)/cast(count as real)*100 AS percent_value,
min AS min_pixel_val,
max AS max_pixel_val,
mean AS avg_pixel_val,
stddev AS pixel_stddev
FROM b_stats
JOIN threshold USING(building_id)
WHERE count > 0;
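The effect of the comma join versus the keyed join can be seen on a toy version of the two CTEs. The sketch below uses SQLite through Python's sqlite3 module instead of PostGIS; the two small tables and their values are invented, and the pixel_count column name is only for illustration.

```python
import sqlite3

# Assumption: plain tables with made-up numbers stand in for the
# b_stats and threshold CTEs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE b_stats   (building_id INTEGER, pixel_count INTEGER);
CREATE TABLE threshold (building_id INTEGER, thres_val INTEGER);
INSERT INTO b_stats   VALUES (1, 100), (2, 200), (3, 50);
INSERT INTO threshold VALUES (1, 40),  (2, 10),  (3, 25);
""")

# Comma join with no join condition: every b_stats row is paired
# with every threshold row.
cross = conn.execute(
    "SELECT b.building_id, t.thres_val FROM b_stats b, threshold t"
).fetchall()

# Keyed join: each building is paired only with its own threshold count.
joined = conn.execute("""
    SELECT building_id, pixel_count, thres_val,
           100.0 * thres_val / pixel_count AS percent_value
    FROM b_stats
    JOIN threshold USING (building_id)
    ORDER BY building_id
""").fetchall()

print(len(cross))
for row in joined:
    print(row)
```

With three buildings the comma join yields nine rows, while the join on building_id yields three, one per building.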

Stream Analytics: Source 'subquery' can only be used in temporal predicate using 'datediff' function

I am querying data in Stream Analytics with two subqueries defined using WITH. I want to combine the data from the two subqueries and write it to SQL, so I am using a JOIN.
WITH subquery as (
SELECT
id as id,
deviceId as deviceId,
username as username,
try_cast(localtime as datetime) as localtime,
AC as AC, FM as FM, UC as UC,
DL as DL, DS as DS, DP as DP,
LB as LB, ASTV as ASTV, MSTV as MSTV,
ALTV as ALTV, MLTV as MLTV, Width as Width,
Min as Min, Max as Max, Nmax as Nmax,
Nzeros as Nzeros, Mode as Mode, Mean as Mean,
Median as Median, Variance as Variance, Tendency as Tendency,
EventProcessedUtcTime as EventProcessedUtcTime,
Distress(AC,FM,UC,DL,DS,DP,1,LB,ASTV,MSTV,ALTV,MLTV,
Width,Min,Max,Nmax,Nzeros,Mode,Mean,Median,Variance,
Tendency,1,1,1,1,1,1,1,1,1,1,1,1) as resultFHR
FROM
iot
),
subquery2 as (
SELECT
id as id,
deviceId as deviceId,
username as username,
try_cast(localtime as datetime) as localtime,
rms,fmed,fpeak,sample_entropy,
EventProcessedUtcTime as EventProcessedUtcTime,
Labour("",1,1,1,"",rms,fmed,fpeak,sample_entropy,"","") as resultUC
FROM
iot
)
SELECT
id as id,
deviceId as deviceId,
username as username,
localtime as localtime,
AC as AC, FM as FM, UC as UC,
DL as DL, DS as DS, DP as DP,
LB as LB, ASTV as ASTV, MSTV as MSTV,
ALTV as ALTV, MLTV as MLTV, Width as Width,
Min as Min, Max as Max, Nmax as Nmax,
Nzeros as Nzeros, Mode as Mode, Mean as Mean,
Median as Median, Variance as Variance, Tendency as Tendency,
EventProcessedUtcTime as EventProcessedUtcTime,
resultFHR.[classes] as distress,
resultFHR.[probabilities] as distressProbability,
resultUC.[classes] as labour,
resultUC.[probabilities] as labourProbability
INTO
sql
FROM
subquery
INNER JOIN
subquery2 ON subquery.id = subquery2. id
AND DATEDIFF(second, subquery, subquery2) BETWEEN 0 AND 20
It throws an error at the last line:
Source 'subquery' can only be used in temporal predicate using 'datediff' function.
Example:
SELECT input1.a, input2.b
FROM input1
JOIN input2 ON DATEDIFF(minute, input1, input2) BETWEEN 0 AND 10.
Please make sure there is no 'or' in temporal predicates.
I have followed the example given, but it still produces the error. How can I combine the two subqueries?
The syntax error might be caused by the typo in your last two lines. Try replacing "subquerry" with "subquery", as below:
FROM
subquery
INNER JOIN
subquery2 ON subquery.id = subquery2.id
AND DATEDIFF(second, subquery, subquery2) BETWEEN 0 AND 20

Top N values in window frame

I have a table t with 3 fields of interest:
d (date), pid (int), and score (numeric)
I am trying to calculate a 4th field that is an average of each player's top N (3 or 5) scores for the days before the current row.
I tried the following join on a subquery but it is not producing the results I'm looking for:
SELECT t.d, t.pid, t.score, sq.highscores
FROM t, (SELECT *, avg(score) as highscores FROM
(SELECT *, row_number() OVER w AS rnum
FROM t AS t2
WINDOW w AS (PARTITION BY pid ORDER BY score DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) isq
WHERE rnum <= 3) sq
WHERE t.d = sq.d AND t.pid = sq.pid
Any suggestions would be greatly appreciated! I'm a hobbyist programmer and this is more complex of a query than I'm used to.
You can't select * and avg(score) in the same (inner) query: which non-aggregated values should be selected for each average? PostgreSQL won't decide this for you.
Because you PARTITION BY pid in the innermost query, you should GROUP BY pid in the aggregating subquery. That way, you can SELECT pid, avg(score) as highscores:
SELECT pid, avg(score) as highscores
FROM (SELECT *, row_number() OVER w AS rnum
FROM t AS t2
WINDOW w AS (PARTITION BY pid ORDER BY score DESC)) isq
WHERE rnum <= 3
GROUP BY pid
Note: ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING makes no difference for row_number().
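As a quick check of the row_number() approach, the following sketch runs the same query shape in SQLite (3.25+) through Python's sqlite3 module rather than PostgreSQL. The table contents are invented sample data.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (d TEXT, pid INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("2021-01-01", 1, 10.0),
        ("2021-01-02", 1, 20.0),
        ("2021-01-03", 1, 30.0),
        ("2021-01-04", 1, 40.0),
        ("2021-01-01", 2, 5.0),
        ("2021-01-02", 2, 7.0),
    ],
)

# Rank each player's scores from highest to lowest, keep the top three,
# then average what is left per player.
query = """
SELECT pid, avg(score) AS highscores
FROM (SELECT *, row_number() OVER (PARTITION BY pid ORDER BY score DESC) AS rnum
      FROM t) isq
WHERE rnum <= 3
GROUP BY pid
ORDER BY pid
"""

top3_avg = conn.execute(query).fetchall()
print(top3_avg)
```

Player 1 averages 40, 30 and 20; player 2 has only two scores, so both are used.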
But if the top N is fixed (and N is small in your real-world use case too), you can solve this with fewer subqueries by using the nth_value() window function:
SELECT d, pid, score,
(coalesce(nth_value(score, 1) OVER w, 0) +
coalesce(nth_value(score, 2) OVER w, 0) +
coalesce(nth_value(score, 3) OVER w, 0)) /
((nth_value(score, 1) OVER w IS NOT NULL)::int +
(nth_value(score, 2) OVER w IS NOT NULL)::int +
(nth_value(score, 3) OVER w IS NOT NULL)::int) highscores
FROM t
WINDOW w AS (PARTITION BY pid ORDER BY score DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
http://rextester.com/GUUPO5148
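For readers without access to the linked demo, here is a rough local sketch of the nth_value() variant, run in SQLite (3.25+) through Python's sqlite3 module instead of PostgreSQL. The sample rows are invented, and the ::int casts are dropped because SQLite already evaluates IS NOT NULL to 0 or 1; that substitution is an assumption of this sketch, not part of the answer.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (d TEXT, pid INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("2021-01-01", 1, 10.0),
        ("2021-01-02", 1, 20.0),
        ("2021-01-03", 1, 30.0),
        ("2021-01-04", 1, 40.0),
        ("2021-01-01", 2, 5.0),
        ("2021-01-02", 2, 7.0),
    ],
)

# Sum the 1st, 2nd and 3rd highest score of each player (missing ones
# count as 0) and divide by how many of them actually exist.
query = """
SELECT d, pid, score,
       (coalesce(nth_value(score, 1) OVER w, 0) +
        coalesce(nth_value(score, 2) OVER w, 0) +
        coalesce(nth_value(score, 3) OVER w, 0)) /
       ((nth_value(score, 1) OVER w IS NOT NULL) +
        (nth_value(score, 2) OVER w IS NOT NULL) +
        (nth_value(score, 3) OVER w IS NOT NULL)) AS highscores
FROM t
WINDOW w AS (PARTITION BY pid ORDER BY score DESC
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY pid, d
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Every row keeps its own date and score, and carries the average of that player's top three scores over the whole partition.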