I have a query that pulls part of the data that I need for tracking. What I need to add is either a column that includes the date or the ability to query the table for a date range. I would prefer the column if possible. I am using psql 8.3.3.
Here is the current query:
select count(mailing.mailing_id) as mailing_count, department.name as org
from mailing, department
where mailing.department_id = department.department_id
group by department.name;
This returns the following information:
mailing_count | org
---------------+-----------------------------------------
2 | org1 name
8 | org2 name
22 | org3 name
21 | org4 name
39 | org5 name
The table that I am querying has three timestamp columns: target_launch_date, created_time and modified_time.
When I try to add the date range to the query I get an error:
Query:
select count(mailing.mailing_id) as mailing_count, department.name as org
from mailing, department
where mailing.department_id = department.department_id
group by department.name,
WHERE (target_launch_date)>= 2016-09-01 AND < 2016-09-05;
Error:
ERROR: syntax error at or near "WHERE" LINE 1:
...department.department_id group by department.name,WHERE(targ...
I've tried moving the location of the date range in the string and a variety of other changes, but cannot achieve what I am looking for.
Any insight would be greatly appreciated!
Here's a query that would do what you need:
SELECT
count(m.mailing_id) as mailing_count,
d.name as org
FROM mailing m
JOIN department d USING( department_id )
WHERE
m.target_launch_date BETWEEN '2016-09-01' AND '2016-09-05'
GROUP BY 2
Since your target_launch_date is of type timestamp, the literal '2016-09-05' is converted to 2016-09-05 00:00:00.00000. So <= '2016-09-05' matches every timestamp before the start of that day, plus exactly 2016-09-05 00:00:00.00000.
A couple of additional notes:
Use aliases for table names to shorten the code, eg. mailing m
Use explicit JOIN syntax to connect data from related tables
Apply your WHERE clause before GROUP BY to exclude rows that don't match it
Use BETWEEN operator to handle date >= X AND date <= Y case
You can use USING instead of ON in JOIN syntax when joined column names are the same
You can use column numbers in GROUP BY which point to position of a column in your select
For more insight into how a SELECT statement is processed step by step, see the documentation.
Edit
The BETWEEN approach includes 2016-09-05 00:00:00.00000 in the result set. If that timestamp should be excluded, change BETWEEN x AND y to either of these two:
(...) BETWEEN x AND y::timestamp - INTERVAL '1 microsecond'
(...) >= x AND (...) < y
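To make the difference between the inclusive and the half-open range concrete, here is a minimal Python sketch of the two predicates. The sample timestamps are hypothetical and not taken from the mailing table; the comparison logic mirrors what the SQL conditions do.

```python
from datetime import datetime

# Hypothetical sample values, not taken from the real mailing table.
launches = [
    datetime(2016, 9, 1, 0, 0, 0),
    datetime(2016, 9, 4, 23, 59, 59),
    datetime(2016, 9, 5, 0, 0, 0),
    datetime(2016, 9, 5, 0, 0, 1),
]

start = datetime(2016, 9, 1)
end = datetime(2016, 9, 5)

# BETWEEN start AND end: both bounds are inclusive.
between = [t for t in launches if start <= t <= end]

# >= start AND < end: the upper bound is excluded.
half_open = [t for t in launches if start <= t < end]

print(len(between))    # 3
print(len(half_open))  # 2
```

The only row that differs is the one at exactly midnight on 2016-09-05, which BETWEEN keeps and the half-open range drops.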
You were close. You need to supply the column name in the second part of the condition too, and there should be a single WHERE clause, placed before GROUP BY:
select count(mailing.mailing_id) as mailing_count, department.name as org
from mailing
inner join department on mailing.department_id = department.department_id
where target_launch_date >= '2016-09-01 00:00:00'
AND target_launch_date < '2016-09-05 00:00:00'
group by department.name;
EDIT: This part is just for Kamil G. showing clearly that between should NOT be used:
create table sample (id int, d timestamp);
insert into sample (id, d)
values
(1, '2016/09/01'),
(2, '2016/09/02'),
(3, '2016/09/03'),
(4, '2016/09/04'),
(5, '2016/09/05'),
(6, '2016/09/05 00:00:00'),
(7, '2016/09/05 00:00:01'),
(8, '2016/09/06');
select * from sample where d between '2016-09-01' and '2016-09-05';
Result:
1;"2016-09-01 00:00:00"
2;"2016-09-02 00:00:00"
3;"2016-09-03 00:00:00"
4;"2016-09-04 00:00:00"
5;"2016-09-05 00:00:00"
6;"2016-09-05 00:00:00"
BTW, if you don't believe it without seeing the EXPLAIN output, here it is:
Filter: ((d >= '2016-09-01 00:00:00'::timestamp without time zone) AND
(d <= '2016-09-05 00:00:00'::timestamp without time zone))
Source data
I am working on an ELT project to load data from CSV files into PostgreSQL where I will transform it. The CSV files have many columns that are consistent across files, but also contain activity columns that are inconsistent with names like Date (05/19/2020), Type (05/19/2020), etc.
In the loading script I am merging all of the columns with dates in the column name into one jsonb column so I don't have to constantly add new columns to the raw data table.
The resulting jsonb column in the raw data table looks like this:
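id       | activity
---------+----------------------------------------------------------------------------------------------------------------------------
12345678 | {"Date (05/19/2020)": null, "Type (05/19/2020)": null, "Date (06/03/2020)": "06/01/2020", "Type (06/03/2020)": "E"}
98765432 | {"Date (05/19/2020)": "05/18/2020", "Type (05/19/2020)": "B", "Date (10/23/2020)": "10/26/2020", "Type (10/23/2020)": "T"}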
JSON to columns
Using the amazing create_jsonb_flat_view function from this post I can convert the jsonb to columns like this:
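id       | Date (05/19/2020) | Type (05/19/2020) | Date (06/03/2020) | Type (06/03/2020) | Date (10/23/2020) | Type (10/23/2020)
---------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------
10629465 | null              | null              | 06/01/2020        | E                 |                   |
98765432 | 05/18/2020        | B                 |                   |                   | 10/26/2020        | T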
Need to move part of column name to row
Now, this is where I'm stuck. I need to remove the portion of the column name that is the Activity Date (e.g. (05/19/2020)) and create a row for each id and ActivityDate with additional columns for Date and Type like this:
id       | ActivityDate | Date       | Type
---------+--------------+------------+------
12345678 | 05/19/2020   | null       | null
12345678 | 06/03/2020   | 06/01/2020 | E
98765432 | 05/19/2020   | 05/18/2020 | B
98765432 | 10/23/2020   | 10/26/2020 | T
I followed your link to the create_jsonb_flat_view article yesterday and then forgot this question. While I thank you for pointing me there, I think that mentioning it worked against you.
A more conventional approach using regexp_replace() works here. I left the date values as strings, but you can convert them with to_date() if needed:
with parse as (
select id, e.k, e.v,
regexp_replace(e.k, '\s+\([0-9/]{10}\)', '') as k_no_date,
regexp_replace(e.k, '^.+([0-9/]{10}).+', '\1') as k_date_only
from rawinput
cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
k_date_only as activity_date,
min(v) filter (where k_no_date = 'Date') as date,
min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;
db<>fiddle here
@Mike-Organek's answer works beautifully!
However, I was curious if the regexp_replace() calls might be slowing the query down a bit and it seemed I could get the same results using a simpler function.
Since Mike gave me a great example to start with I modified it to split on the space between Date and (05/19/2020).
For 20,000 rows, it went from taking an avg of 7 sec on my local machine to an avg of .9 sec.
Here is the resulting query:
with parse as (
select id, e.k, e.v,
split_part(e.k, ' ', 1) as k_no_date,
trim(split_part(e.k, ' ', 2),'()') as k_date_only
from rawinput
cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
k_date_only as activity_date,
min(v) filter (where k_no_date = 'Date') as date,
min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;
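For anyone who wants to check the key-splitting logic outside the database, here is a rough Python sketch of the same idea. It assumes every key has the form "Name (MM/DD/YYYY)" with a single space before the parenthesis; the unpivot function name is made up for this example, and the sample value is one of the activity objects from the question.

```python
import json
from collections import defaultdict

# One raw activity value copied from the question's sample data.
activity = json.loads(
    '{"Date (05/19/2020)": "05/18/2020", "Type (05/19/2020)": "B", '
    '"Date (10/23/2020)": "10/26/2020", "Type (10/23/2020)": "T"}'
)


def unpivot(activity):
    """Group values by the activity date embedded in each key (hypothetical helper)."""
    rows = defaultdict(dict)
    for key, value in activity.items():
        # Split "Date (05/19/2020)" on the first space, like split_part().
        name, _, rest = key.partition(" ")
        # Strip the surrounding parentheses, like trim(..., '()').
        activity_date = rest.strip("()")
        rows[activity_date][name] = value
    return dict(rows)


result = unpivot(activity)
print(result["05/19/2020"])  # {'Date': '05/18/2020', 'Type': 'B'}
```

Each activity date ends up with its own Date and Type entry, which is the shape the final SELECT produces per id.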
SELECT src_user, CAST(start_time as timestamp) as start_time_ts, start_time, dest_ip, src_ip, count(*) as `count`
FROM mytable
WHERE
start_time like '2022-06%'
AND src_ip = '2.3.4.5'
AND rule_name like '%XASDF%'
group by 1, 2, 3, 4, 5 order by 2 desc)
I'm getting an error with pyspark:
[Cloudera][JDBC](10140) Error converting value to Timestamp.
Now, if I don't order by in the query I can get a good result set with timestamps properly converted, so there's a row somewhere that starts with 2022-06 that is not parsing correctly.
How can I setup my error handling so that it will show me the actual value that is causing the error, rather than telling me what the error is?
Here's the code:
df = spark.read.jdbc(url= 'jdbc:impala://asdf.com/dasafsda', table = select_sql, properties = properties)
df.printSchema()
df.show(truncate=False)
I don't know the Impala-specific functions, but suppose a normal value is in the format yyyy-mm-dd. If you suspect that some value starting with 2022-06 is causing the problem, you can check the length of start_time: cast it when the length is as expected, and otherwise substitute a default date as an error marker. Keep the raw string in a separate column so you can see its real value,
like
SELECT src_user, CASE WHEN LENGTH(start_time) = 10 THEN CAST(start_time as timestamp) ELSE CAST('1900-01-01' as timestamp) END as start_time_ts, start_time as start_time_raw, dest_ip, src_ip, count(*) as `count`
FROM mytable
WHERE
start_time like '2022-06%'
AND src_ip = '2.3.4.5'
AND rule_name like '%XASDF%'
group by 1, 2, 3, 4, 5 order by 2 desc)
After running this query, filter on 1900-01-01 to see the values in start_time_raw that could not be cast.
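The same check can also be sketched on the Python side. This assumes the start_time strings have been read without the CAST and collected into a plain list; the find_bad_timestamps helper, the format string and the sample values are all assumptions made for illustration, not something from the original job.

```python
from datetime import datetime


def find_bad_timestamps(values, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the raw values that do not parse with the assumed format."""
    bad = []
    for raw in values:
        try:
            datetime.strptime(raw, fmt)
        except (ValueError, TypeError):
            # ValueError: wrong shape or impossible date; TypeError: not a string.
            bad.append(raw)
    return bad


# Hypothetical raw values; the real ones would come from start_time.
sample = ["2022-06-01 10:15:00", "2022-06-31 08:00:00", "2022-06", None]
print(find_bad_timestamps(sample))
```

This prints the offending values themselves, which is usually more useful than the driver's generic conversion error.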
I have a table where my date/time is of the form: 2020-03-10 22:54:08
This is a timestamped object. I tried the following query, but it didn't return any rows:
select ts from table1
where cast(ts as timestamp) = '2020-03-10 22:54:08'
returns nothing.
How do I query based on date and time in PostgreSQL?
A timestamp has microsecond resolution, so you have to use the same techniques as when testing floating-point numbers: round it, or use only < and > for comparison.
To retrieve data from a database, you need to refer to SQL SELECT syntax. In your situation, the ts column is already a timestamp, so there is no need to use cast(). Bear in mind, however, that a timestamp type contains fractions of a second (i.e., 2020-03-10 22:54:03.xxx), so you would be better off using a comparison operator (>,<,>=,or <=)
You can retrieve all columns by using the * syntax:
select *
from my_table
where ts >= '2020-03-10 22:54:08';
The timestamp type by default also contains microseconds, so now(), which for example is 2020-03-11 01:56:27.593985 here, is obviously not equal to 2020-03-11 01:56:27. If you do not want microsecond precision in your data, declare your field like ts timestamp(0) NOT NULL DEFAULT now(), which means "0 fractional digits for seconds":
select
current_timestamp::timestamp as ts,
current_timestamp::timestamp(2) as ts2,
current_timestamp::timestamp(0) as ts0;
ts | ts2 | ts0
---------------------------+------------------------+---------------------
2020-03-11 02:02:52.98298 | 2020-03-11 02:02:52.98 | 2020-03-11 02:02:53
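The effect of the fractional seconds on an equality test can be sketched in Python, whose datetime also carries microseconds. The stored value below is hypothetical; the comparisons correspond to plain equality, a half-open one-second window, and truncation to whole seconds (similar in spirit to date_trunc('second', ts)).

```python
from datetime import datetime, timedelta

# Hypothetical stored value with a fractional-second part.
stored = datetime(2020, 3, 10, 22, 54, 8, 593985)
wanted = datetime(2020, 3, 10, 22, 54, 8)

# Plain equality fails because of the microseconds.
print(stored == wanted)  # False

# A half-open one-second window matches the whole second.
print(wanted <= stored < wanted + timedelta(seconds=1))  # True

# Truncating to whole seconds also works.
print(stored.replace(microsecond=0) == wanted)  # True
```

Note that truncation and timestamp(0) are not identical: as the output above shows, timestamp(0) rounds to the nearest second.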
Actually this worked.
select distinct(ts) from my_table where ts >= '2020-03-10 22:54:08' and ts <= '2020-03-10 22:54:09'
But this doesn't give me the whole rows.
But then I tried this, and it worked:
select ts from table1
where to_char(ts,'YYYY-MM-DD HH24:MI:SS') = '2020-03-10 22:54:08'
I need to include the EXTRACT() function within a WHERE clause as follows:
SELECT * FROM my_table WHERE EXTRACT(YEAR FROM date) = '2014';
I get a message like this:
pg_catalog.date_part(unknown, text) doesn't exist
SQL State 42883
Here is my_table content (gid INTEGER, date DATE):
gid | date
-------+-------------
1 | 2014-12-12
2 | 2014-12-08
3 | 2013-17-15
I have to do it this way because the query is sent from a form on a website that includes a 'Year' field where users enter the year on a 4-digits basis.
The problem is that your column is of data type text, while EXTRACT() only works for date / time types.
You should convert your column to the appropriate data type.
ALTER TABLE my_table ALTER COLUMN date TYPE date;
That's smaller (4 bytes instead of 11 for the text), faster and cleaner (disallows illegal dates and most typos).
If you have a non-standard format, add a USING clause with a conversion expression. Example:
Alter character field to date
Also, for your queries to be fast with a plain index on date you should rather use sargable predicates. Like:
SELECT * FROM my_table
WHERE date >= '2014-01-01'
AND date < '2015-01-01';
Or, to go with your 4-digit input for the year:
SELECT * FROM my_table
WHERE date >= to_date('2014', 'YYYY')
AND date < to_date('2015', 'YYYY');
You could also be more explicit:
to_date('2014' || '0101', 'YYYYMMDD')
Both produce the same date '2014-01-01'.
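If the year arrives from the web form as a string, the two bounds can also be computed in application code and passed to the query as parameters. A minimal sketch, where year_bounds is a hypothetical helper name and no particular database driver is assumed:

```python
from datetime import date


def year_bounds(year_text):
    """Turn a 4-digit year string into a half-open [start, end) date range."""
    year = int(year_text)  # raises ValueError for non-numeric input
    return date(year, 1, 1), date(year + 1, 1, 1)


start, end = year_bounds("2014")
print(start, end)  # 2014-01-01 2015-01-01
```

The two values then fill the date >= ... AND date < ... predicate, so a plain index on the column stays usable.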
Aside: date is a reserved word in standard SQL and a basic type name in Postgres. Don't use it as an identifier.
This happens because the column has a text or varchar type, as opposed to date or timestamp. This is easily reproducible:
SELECT 1 WHERE extract(year from '2014-01-01'::text)='2014';
yields this error:
ERROR: function pg_catalog.date_part(unknown, text) does not exist
LINE 1: SELECT 1 WHERE extract(year from '2014-01-01'::text)='2014';
^ HINT: No function matches the given name and argument types. You might need to add explicit type casts.
extract, or its underlying function date_part, does not exist for text-like datatypes, but they're not needed anyway. Extracting the year from this date format is equivalent to getting the first 4 characters, so your query would be:
SELECT * FROM my_table WHERE left(date,4)='2014';
I have to solve a problem and don't know how to do it. I'm using SQL Server 2012.
My data follows this schema:
-----------------------------------------------------------------------------------
DriverId | BeginDate | EndDate | NextBegin | Rest in | Drive Time | Drive
| | | Date | Hours | in Minutes | KM
-----------------------------------------------------------------------------------
integer datetime datetime datetime integer integer decimal(10,3)
Rest in hours = NextBeginDate - EndDate
Drive Time in Minutes = EndDate - BeginDate
I have to find the first rest >= 36 hours, then:
Do
Compute the number of days
SUM(DriveTime)
SUM(TotalKM)
until the next rest >= 36 hours
IF no more rests EXIT DO
Loop
Everything from the beginning to the first rest is discarded.
Everything from the last rest to the end is discarded.
I have the data in an Excel sheet that you can download from here: Download Excel with data example
I'm sorry for my English. I hope you can understand and help me. Thank you in advance.
There are several parts to the query. The first part pulls out the rows where Rest is >= 36 and assigns a row number. The result is stored in a CTE called BigRest.
with BigRest(RowNumber, DriverId, BeginDate, EndDate)
as
(
select ROW_NUMBER() over(partition by d.DriverId order by d.DriverId, d.BeginDate) RowNumber,d.DriverId, d.BeginDate, d.EndDate
from Drive d
where d.Rest >= 36
)
Then I assign the row number from BigRest to each row in Drive (which is what I'm calling the table that has all the data in it) based on the BeginDate. So the data is effectively segmented by the days where Rest >= 36. Each segment gets a number called DriveGroup.
, Grouped(DriverId, BeginDate, EndDate, DriveTime, DriveKM, DriveGroup)
as
(
select d.DriverId, d.BeginDate, d.EndDate, d.Drivetime, d.DriveKM, (select Top 1 RowNumber from BigRest b where b.DriverId = d.DriverId and b.BeginDate >= d.BeginDate order by b.BeginDate)
from Drive d
)
Finally, I select the data from Grouped, cross applying it with some aggregate data from itself. We can filter out the rows where the DriveGroup is 1 or null because those represent the beginning and end rows that don't matter (the "do nothing" rows).
select distinct DriverId, MinBeginDate BeginDate, MaxEndDate EndDate, DATEDIFF(D, MinBeginDate, MaxEndDate)+1 Days, DriveTimeSum Drive, DriveKMSum KM
from
(
select g.DriverId, g.BeginDate, g.EndDate, g.DriveGroup, g.DriveTime, c.DriveTimeSum, c.DriveKMSum, c.MinBeginDate, c.MaxEndDate
from Grouped g
cross apply(select SUM(g2.DriveTime) DriveTimeSum,
SUM(g2.DriveKM) DriveKMSum,
MIN(g2.BeginDate) MinBeginDate,
MAX(g2.EndDate) MaxEndDate
from Grouped g2
where g2.DriverId = g.DriverId
and g2.DriveGroup = g.DriveGroup) as c
where g.DriveGroup is not null
and g.DriveGroup > 1
) x
Here's a SQL Fiddle
I'd encourage you to look at the results at each step of the query to see what's actually going on.
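To follow the grouping logic without a database, the steps can be sketched in plain Python. This is only an illustration for a single driver, with hypothetical rows and a made-up summarize function; it mirrors the idea of numbering the long rests, assigning each row the number of the first long rest at or after it, and dropping group 1 and the ungrouped tail.

```python
from datetime import datetime

# Hypothetical rows for one driver: (begin, end, rest_hours, drive_minutes, km).
rows = [
    (datetime(2024, 1, 1, 8), datetime(2024, 1, 1, 16), 40, 480, 300.0),
    (datetime(2024, 1, 3, 8), datetime(2024, 1, 3, 16), 16, 480, 310.0),
    (datetime(2024, 1, 4, 8), datetime(2024, 1, 4, 16), 40, 480, 290.0),
    (datetime(2024, 1, 6, 8), datetime(2024, 1, 6, 12), 0, 240, 150.0),
]


def summarize(rows, min_rest=36):
    """Aggregate the rows that lie between two long rests."""
    rows = sorted(rows)  # tuples sort by begin date first
    # Begin dates of the long-rest rows; their 1-based position is the RowNumber.
    big_rest_begins = [r[0] for r in rows if r[2] >= min_rest]
    groups = {}
    for begin, end, rest, minutes, km in rows:
        # Number of the first long-rest row at or after this row, if any.
        number = next(
            (i for i, b in enumerate(big_rest_begins, 1) if b >= begin), None
        )
        if number is None or number == 1:
            continue  # before the first long rest, or after the last one
        g = groups.setdefault(
            number, {"begin": begin, "end": end, "minutes": 0, "km": 0.0}
        )
        g["begin"] = min(g["begin"], begin)
        g["end"] = max(g["end"], end)
        g["minutes"] += minutes
        g["km"] += km
    for g in groups.values():
        # Same idea as DATEDIFF(D, MinBeginDate, MaxEndDate) + 1.
        g["days"] = (g["end"].date() - g["begin"].date()).days + 1
    return groups


result = summarize(rows)
print(result[2]["minutes"], result[2]["km"], result[2]["days"])  # 960 600.0 2
```

With these sample rows only the second and third rows fall between two long rests, so they form the single reported segment.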