I have a table that contains transaction dates and the balances of those transactions. Example below:
select id, transaction_bal1, transaction_bal2, transaction_date
from transactions
Which results in
ID | transaction_bal1 | transaction_bal2 | transaction_date
1 | -10000 | 1000 | 2017.01.02
2 | 4000 | 1000 | 2017.02.02
3 | 4000 | 1000 | 2017.03.02
etc...
What I want to do is generate a series with '1 day'::interval so that I select all days between the transaction dates, with the rows in the table above falling under the right day. Something like this:
Gen_series | ID | transaction 1 | transaction 2 | transaction_date
2017.01.01 |null| 0 | 0 | null
2017.01.02 |1 | -10000 | 1000 | 2017.01.02
2017.01.03 |null| 0 | 0 | null
...
2017.02.01 |null| 0 | 0 | null
2017.02.02 |2 | 4000 | 1000 | 2017.02.02
2017.02.03 |null| 0 | 0 | null
etc...
I use PostgreSQL (I don't know which version), and I use pgAdmin 4 3.2 (if that is of any help).
Feel free to ask questions if I need to flesh anything out.
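Roughly, I imagine something along these lines, though I don't know if this is the right approach (an untested sketch using the transactions table above, assuming transaction_date is a date column; the lower bound subtracts a day only so that a day before the first transaction also appears, as in my sample output):

SELECT d.day::date AS gen_series,
       t.id,
       COALESCE(t.transaction_bal1, 0) AS transaction_bal1,
       COALESCE(t.transaction_bal2, 0) AS transaction_bal2,
       t.transaction_date
-- one row per day between (just before) the first and the last transaction
FROM generate_series((SELECT min(transaction_date) FROM transactions) - interval '1 day',
                     (SELECT max(transaction_date) FROM transactions)::timestamp,
                     '1 day'::interval) AS d(day)
-- attach each transaction to its day; days with no transaction keep NULL id/date and 0 balances
LEFT JOIN transactions t ON t.transaction_date = d.day::date
ORDER BY d.day;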
Related
I have a Spark DataFrame with the following entries:
| order id | time | amt |
| 1 | 2017-10-01 12:00 | 100 |
| 2 | 2017-10-01 15:00 | 100 |
| 3 | 2017-10-01 17:00 | 100 |
| 4 | 2017-10-02 16:00 | 100 |
| 5 | 2017-10-02 23:00 | 100 |
I want to add a column amount_prev_24h that has, for each order id, the sum of amt for all orders in the last 24 hours.
| order id | time | amt | amt_24h
| 1 | 2017-10-01 12:00 | 100 | 0
| 2 | 2017-10-01 15:00 | 100 | 100
| 3 | 2017-10-01 17:00 | 100 | 200
| 4 | 2017-10-02 16:00 | 100 | 100
| 5 | 2017-10-02 23:00 | 100 | 100
How would I go about doing it?
This is PySpark code; the Scala API is similar.
from pyspark.sql.functions import unix_timestamp, sum as sum_  # sum_ avoids shadowing Python's built-in sum
from pyspark.sql.window import Window
df = df.withColumn('time_uts', unix_timestamp('time', format='yyyy-MM-dd HH:mm'))
df = df.withColumn('amt_24h', sum_('amt').over(Window.orderBy('time_uts').rangeBetween(-24 * 3600, -1))).fillna(0, subset='amt_24h')
I hope this helps.
From a table of "time entries" I'm trying to create a report of weekly totals for each user.
Sample of the table:
+-----+---------+-------------------------+--------------+
| id | user_id | start_time | hours_worked |
+-----+---------+-------------------------+--------------+
| 997 | 6 | 2018-01-01 03:05:00 UTC | 1.0 |
| 996 | 6 | 2017-12-01 05:05:00 UTC | 1.0 |
| 998 | 6 | 2017-12-01 05:05:00 UTC | 1.5 |
| 999 | 20 | 2017-11-15 19:00:00 UTC | 1.0 |
| 995 | 6 | 2017-11-11 20:47:42 UTC | 0.04 |
+-----+---------+-------------------------+--------------+
Right now I can run the following and basically get what I need:
SELECT COALESCE(SUM(time_entries.hours_worked),0) AS total,
time_entries.user_id,
week::date
--Using generate_series here to account for weeks with no time entries when
--doing the join
FROM generate_series( (DATE_TRUNC('week', '2017-11-01 00:00:00'::date)),
(DATE_TRUNC('week', '2017-12-31 23:59:59.999999'::date)),
interval '7 day') as week LEFT JOIN time_entries
ON DATE_TRUNC('week', time_entries.start_time) = week
GROUP BY week, time_entries.user_id
ORDER BY week
This will return
+-------+---------+------------+
| total | user_id | week |
+-------+---------+------------+
| 14.08 | 5 | 2017-10-30 |
| 21.92 | 6 | 2017-10-30 |
| 10.92 | 7 | 2017-10-30 |
| 14.26 | 8 | 2017-10-30 |
| 14.78 | 10 | 2017-10-30 |
| 14.08 | 13 | 2017-10-30 |
| 15.83 | 15 | 2017-10-30 |
| 8.75 | 5 | 2017-11-06 |
| 10.53 | 6 | 2017-11-06 |
| 13.73 | 7 | 2017-11-06 |
| 14.26 | 8 | 2017-11-06 |
| 19.45 | 10 | 2017-11-06 |
| 15.95 | 13 | 2017-11-06 |
| 14.16 | 15 | 2017-11-06 |
| 1.00 | 20 | 2017-11-13 |
| 0 | | 2017-11-20 |
| 2.50 | 6 | 2017-11-27 |
| 0 | | 2017-12-04 |
| 0 | | 2017-12-11 |
| 0 | | 2017-12-18 |
| 0 | | 2017-12-25 |
+-------+---------+------------+
However, this is difficult to parse, particularly when there's no data for a week. What I would like is a pivot or crosstab table where the weeks are the columns and the rows are the users, including nulls/zeros for every combination (for instance when a user had no entries in a given week, or a week had no entries from any user).
Something like this
+---------+---------------+--------------+--------------+
| user_id | 2017-10-30 | 2017-11-06 | 2017-11-13 |
+---------+---------------+--------------+--------------+
| 6 | 4.0 | 1.0 | 0 |
| 7 | 4.0 | 1.0 | 0 |
| 8 | 4.0 | 0 | 0 |
| 9 | 0 | 1.0 | 0 |
| 10 | 4.0 | 0.04 | 0 |
+---------+---------------+--------------+--------------+
I've been looking around online and it seems that "dynamically" generating a list of columns for crosstab is difficult. I'd rather not hard-code them, which seems odd to do anyway for dates, or use something like a CASE expression keyed on the week number.
Should I look for another solution besides crosstab? If I could get the series of weeks for each user, including all the nulls, I think that would be good enough. It just seems that my current join strategy isn't returning that.
Personally, I would use a Date Dimension table and use that table as the basis for the query. I find it far easier to use tabular data for these types of calculations, as it leads to SQL that's easier to read and maintain. There's a great article on creating a Date Dimension table in PostgreSQL at https://medium.com/@duffn/creating-a-date-dimension-table-in-postgresql-af3f8e2941ac, though you could get away with a much simpler version of that table.
Ultimately, you would use the Date table as the base of the SELECT cols FROM table section and then join against it, probably with Common Table Expressions, to build the calculations.
I'll write up a solution demonstrating how you could create such a query if you would like.
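In the meantime, here is a rough, untested sketch of the "series of weeks for each user" idea from the question, using generate_series in place of a full date dimension table and cross-joining it against the distinct users so that every user/week combination appears even when there are no entries (table and column names are taken from the question):

WITH weeks AS (
    -- one row per week in the reporting window
    SELECT week::date AS week
    FROM generate_series(DATE_TRUNC('week', DATE '2017-11-01'),
                         DATE_TRUNC('week', DATE '2017-12-31'),
                         interval '7 days') AS week
),
users AS (
    SELECT DISTINCT user_id FROM time_entries
)
SELECT u.user_id,
       w.week,
       COALESCE(SUM(te.hours_worked), 0) AS total
FROM weeks w
CROSS JOIN users u
LEFT JOIN time_entries te
       ON te.user_id = u.user_id
      AND DATE_TRUNC('week', te.start_time)::date = w.week
GROUP BY u.user_id, w.week
ORDER BY u.user_id, w.week;

From there, the weeks-as-columns pivot could be done in the application layer, or this result could be fed into crosstab or conditional aggregation if columns per week are still needed.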
I asked a question at the link below and one of the members helped me solve most of it (calculating column t and column pre_score), but I need to calculate one more column. I explained the details at the link below.
Previous question
In summary, how can I calculate the intellectual-capital column using column t and column pre_score? The intellectual-capital column considers the pre_score from every previous competition and multiplies each pre_score by POWER(e, -days/500), where days is the number of days that have passed since that competition. In this example each user has at most 2 previous competitions, but in my dataset there may be more than 200, so I need a query that considers all previous scores and the time that has passed since each competition.
(The value of e is approximately 2.71828.)
| competitionId | UserId | t   | pre_score | intellectual-capital                            |
|---------------|--------|-----|-----------|-------------------------------------------------|
| 1             | 100    |     |           |                                                 |
| 2             | 100    | -4  | 3000      | 3000*POWER(e, -4/500)                           |
| 3             | 100    | -5  | 4000      | 3000*POWER(e, -9/500) + 4000*POWER(e, -5/500)   |
| 1             | 200    |     |           |                                                 |
| 4             | 200    | -19 | 3000      | 3000*POWER(e, -19/500)                          |
| 1             | 300    |     |           |                                                 |
| 3             | 300    | -9  | 3000      | 3000*POWER(e, -9/500)                           |
| 4             | 300    | -10 | 1200      | 3000*POWER(e, -19/500) + 1200*POWER(e, -10/500) |
| 1             | 400    |     |           |                                                 |
| 2             | 400    | -4  | 3000      | 3000*POWER(e, -4/500)                           |
| 3             | 400    | -5  | 4000      | 3000*POWER(e, -9/500) + 4000*POWER(e, -5/500)   |
This is the result:
| prev_score | intellectual_capital | competitionsId | UserId | date | score | day_diff | t | prev_score |
|------------|----------------------|----------------|--------|----------------------|-------|----------|--------|------------|
| (null) | (null) | 1 | 100 | 2015-01-01T00:00:00Z | 3000 | -4 | (null) | (null) |
| 3000 | 2976.09 | 2 | 100 | 2015-01-05T00:00:00Z | 4000 | -5 | -4 | 3000 |
| 4000 | 6936.29 | 3 | 100 | 2015-01-10T00:00:00Z | 1200 | (null) | -5 | 4000 |
| (null) | (null) | 1 | 200 | 2015-01-01T00:00:00Z | 3000 | -19 | (null) | (null) |
| 3000 | 2888.13 | 4 | 200 | 2015-01-20T00:00:00Z | 1000 | (null) | -19 | 3000 |
| (null) | (null) | 1 | 300 | 2015-01-01T00:00:00Z | 3000 | -9 | (null) | (null) |
| 3000 | 2946.48 | 3 | 300 | 2015-01-10T00:00:00Z | 1200 | -10 | -9 | 3000 |
| 1200 | 4122.72 | 4 | 300 | 2015-01-20T00:00:00Z | 1000 | (null) | -10 | 1200 |
| (null) | (null) | 1 | 400 | 2015-01-01T00:00:00Z | 3000 | -4 | (null) | (null) |
| 3000 | 2976.09 | 2 | 400 | 2015-01-05T00:00:00Z | 4000 | -5 | -4 | 3000 |
| 4000 | 6936.29 | 3 | 400 | 2015-01-10T00:00:00Z | 1200 | (null) | -5 | 4000 |
It is produced by this query, which now includes e:
with Primo as (
select
*
, datediff(day,lead([date],1) over(partition by userid order by [date]),[date]) day_diff
from Table1
)
, Secondo as (
select
*
, lag(day_diff,1) over(partition by userid order by [date]) t
, lag(score,1) over(partition by userid order by [date]) prev_score
from primo
)
select
prev_score
, sum(prev_score*power(2.71828,t/500.0)) over(partition by userid order by [date]) intellectual_capital
, competitionsId,UserId,date,score,day_diff,t,prev_score
from secondo
Demo
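For what it's worth, here is an untested sketch of the "decay every previous score by the days elapsed up to the current competition" calculation, written with a correlated subquery instead of the running SUM above (same Table1 and column names as in the query; a sketch of the described formula, not a verified answer):

select
    t1.competitionsId,
    t1.UserId,
    t1.[date],
    t1.score,
    -- sum over every earlier competition of the same user,
    -- decaying each score by e^(days elapsed / 500)
    (select sum(t2.score * power(2.71828, datediff(day, t1.[date], t2.[date]) / 500.0))
       from Table1 t2
      where t2.UserId = t1.UserId
        and t2.[date] < t1.[date]) as intellectual_capital
from Table1 t1
order by t1.UserId, t1.[date]

Because t2.[date] is earlier than t1.[date], datediff(day, t1.[date], t2.[date]) is negative, so each prior score is multiplied by e raised to minus the elapsed days over 500, matching the 3000*POWER(e,-9/500) + 4000*POWER(e,-5/500) pattern in the example.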
I have a table with records that look like this:
| id | coord-x | coord-y | time |
---------------------------------
| 1 | 0 | 0 | 123 |
| 1 | 0 | 1 | 124 |
| 1 | 0 | 3 | 125 |
The time column represents a time in milliseconds. What I want to do is find all coord-x, coord-y as a set of points for a given timeframe for a given id. For any given id there is a unique coord-x, coord-y, and time.
What I need to do, however, is group these points as long as they're within n milliseconds of each other. So if I have this:
| id | coord-x | coord-y | time |
---------------------------------
| 1 | 0 | 0 | 123 |
| 1 | 0 | 1 | 124 |
| 1 | 0 | 3 | 125 |
| 1 | 0 | 6 | 140 |
| 1 | 0 | 7 | 141 |
I would want a result similar to this:
| id | points | start-time | end-time |
| 1 | (0,0), (0,1), (0,3) | 123 | 125 |
| 1 | (0,6), (0,7) | 140 | 141 |
I do have PostGIS installed on my database. The times I posted above are not representative; I kept them small just as a sample. The time is just a millisecond timestamp.
The tricky part is picking the expression inside your GROUP BY. If n = 5, you can do something like time / 5. To match the example exactly, the query below uses (time - 3) / 5. Once grouped, you can aggregate the points into an array with array_agg.
SELECT
array_agg(("coord-x", "coord-y")) as points,
min(time) AS time_start,
max(time) AS time_end
FROM "<your_table>"
WHERE id = 1
GROUP BY (time - 3) / 5
Here is the output
+---------------------------+--------------+------------+
| points | time_start | time_end |
|---------------------------+--------------+------------|
| {"(0,0)","(0,1)","(0,3)"} | 123 | 125 |
| {"(0,6)","(0,7)"} | 140 | 141 |
+---------------------------+--------------+------------+
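If the grouping really needs to follow "points at most n milliseconds apart" rather than fixed buckets, a gaps-and-islands approach is an alternative. The following is an untested sketch assuming n = 5 and a hypothetical table name points with the columns shown above:

WITH flagged AS (
    SELECT id, "coord-x", "coord-y", time,
           -- start a new group whenever the gap to the previous point exceeds n
           CASE WHEN time - lag(time) OVER (PARTITION BY id ORDER BY time) > 5
                THEN 1 ELSE 0 END AS new_group
    FROM points  -- hypothetical table name
),
grouped AS (
    SELECT *,
           -- running count of group starts = group number
           SUM(new_group) OVER (PARTITION BY id ORDER BY time) AS grp
    FROM flagged
)
SELECT id,
       array_agg(("coord-x", "coord-y") ORDER BY time) AS points,
       MIN(time) AS time_start,
       MAX(time) AS time_end
FROM grouped
GROUP BY id, grp
ORDER BY id, time_start;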
I have two tables. The first generates the condition for counting records in the second. The two tables are linked 1:1 by timestamp.
The problem is that the second table has many columns, and we need a count for each column that matches the condition from the first table.
Example:
Tables met and pot
CREATE TABLE met (
  tstamp timestamp without time zone NOT NULL,
  h1_rad double precision,
  CONSTRAINT met_pkey PRIMARY KEY (tstamp)
);
CREATE TABLE pot (
  tstamp timestamp without time zone NOT NULL,
  c1 double precision,
  c2 double precision,
  c3 double precision,
  CONSTRAINT pot_pkey PRIMARY KEY (tstamp)
);
In reality, pot has 108 columns, from c1 to c108.
Tables values:
+ Table met + + Table pot +
+----------------+--------+--+----------------+------+------+------+
| tstamp | h1_rad | | tstamp | c1 | c2 | c3 |
+----------------+--------+--+----------------+------+------+------+
| 20150101 00:00 | 0 | | 20150101 00:00 | 5,5 | 3,3 | 15,6 |
| 20150101 00:05 | 1,8 | | 20150101 00:05 | 12,8 | 15,8 | 1,5 |
| 20150101 00:10 | 15,4 | | 20150101 00:10 | 25,4 | 4,5 | 1,4 |
| 20150101 00:15 | 28,4 | | 20150101 00:15 | 18,3 | 63,5 | 12,5 |
| 20150101 00:20 | 29,4 | | 20150101 00:20 | 24,5 | 78 | 17,5 |
| 20150101 00:25 | 13,5 | | 20150101 00:25 | 12,8 | 5,4 | 18,4 |
| 20150102 00:00 | 19,5 | | 20150102 00:00 | 11,1 | 25,6 | 6,5 |
| 20150102 00:05 | 2,5 | | 20150102 00:05 | 36,5 | 21,4 | 45,2 |
| 20150102 00:10 | 18,4 | | 20150102 00:10 | 1,4 | 35,5 | 63,5 |
| 20150102 00:15 | 20,4 | | 20150102 00:15 | 18,4 | 23,4 | 8,4 |
| 20150102 00:20 | 6,8 | | 20150102 00:20 | 16,8 | 12,5 | 18,4 |
| 20150102 00:25 | 17,4 | | 20150102 00:25 | 25,8 | 23,5 | 9,5 |
+----------------+--------+--+----------------+------+------+------+
What I need is, for each column of pot, the number of rows where the value is higher than 15 and the value of h1_rad in met is also higher than 15 for the same timestamp, grouped by day.
With the data supplied we need something like:
+----------+----+----+----+
| day | c1 | c2 | c3 |
+----------+----+----+----+
| 20150101 | 3 | 2 | 1 |
| 20150102 | 2 | 4 | 1 |
+----------+----+----+----+
How can I get this?
Is it possible with a single query, even with subqueries?
Actually, the raw data is stored every minute in other tables. The met and pot tables are summarized and filtered tables, kept for performance.
If necessary, I can create tables with data summarized by day if that simplifies the solution.
Thanks
P.S. Sorry for my English
You can solve this with some CASE expressions: test for both conditions and, if both are true, return 1. Then SUM() the results, grouping by the timestamp converted to a date, to get your totals:
SELECT
date(met.tstamp) AS day,
SUM(CASE WHEN met.h1_rad > 15 AND pot.c1 > 15 THEN 1 END) as C1,
SUM(CASE WHEN met.h1_rad > 15 AND pot.c2 > 15 THEN 1 END) as C2,
SUM(CASE WHEN met.h1_rad > 15 AND pot.c3 > 15 THEN 1 END) as C3
FROM
met INNER JOIN pot ON met.tstamp = pot.tstamp
GROUP BY date(met.tstamp)
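As a side note, on PostgreSQL 9.4 or later the same counts can also be written with the FILTER clause, which may be easier to read once this is repeated across 108 columns (an equivalent sketch of the query above; the only minor difference is that COUNT returns 0 instead of NULL when nothing matches):

SELECT
  date(met.tstamp) AS day,
  COUNT(*) FILTER (WHERE met.h1_rad > 15 AND pot.c1 > 15) AS c1,
  COUNT(*) FILTER (WHERE met.h1_rad > 15 AND pot.c2 > 15) AS c2,
  COUNT(*) FILTER (WHERE met.h1_rad > 15 AND pot.c3 > 15) AS c3
FROM
  met INNER JOIN pot ON met.tstamp = pot.tstamp
GROUP BY date(met.tstamp)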