Impala - handle CAST exception for column - pyspark

SELECT src_user, CAST(start_time AS timestamp) AS start_time_ts, start_time, dest_ip, src_ip, count(*) AS `count`
FROM mytable
WHERE start_time LIKE '2022-06%'
  AND src_ip = '2.3.4.5'
  AND rule_name LIKE '%XASDF%'
GROUP BY 1, 2, 3, 4, 5
ORDER BY 2 DESC
I'm getting an error with pyspark:
[Cloudera][JDBC](10140) Error converting value to Timestamp.
Now, if I leave the ORDER BY out of the query, I get a good result set with the timestamps properly converted, so there's a row somewhere starting with 2022-06 that is not parsing correctly.
How can I set up my error handling so that it shows me the actual value that is causing the error, rather than just telling me what the error is?
Here's the code:
df = spark.read.jdbc(url='jdbc:impala://asdf.com/dasafsda', table=select_sql, properties=properties)
df.printSchema()
df.show(truncate=False)
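As an aside, the table argument of spark.read.jdbc accepts either a plain table name or a parenthesized subquery with an alias, so a full SELECT like the one above has to be wrapped before being passed in. A minimal sketch of the wrapping (table and columns taken from the question; the subq alias is invented):

(SELECT src_user, start_time, dest_ip, src_ip
 FROM mytable
 WHERE start_time LIKE '2022-06%') AS subq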

I don't know an Impala-specific function for this, but let's say your normal date is in the format yyyy-mm-dd. You can do a length check on it if you suspect 2022-06 is causing the problem, set a default date for the bad value, and keep the raw value in a separate column so you can see what it really was, like:
SELECT src_user,
       CASE WHEN LENGTH(start_time) = 10 THEN CAST(start_time AS timestamp)
            ELSE CAST('1900-01-01' AS timestamp)
       END AS start_time_ts,
       start_time AS start_time_raw,
       dest_ip, src_ip, count(*) AS `count`
FROM mytable
WHERE start_time LIKE '2022-06%'
  AND src_ip = '2.3.4.5'
  AND rule_name LIKE '%XASDF%'
GROUP BY 1, 2, 3, 4, 5
ORDER BY 2 DESC
After this query, just filter for 1900-01-01 in start_time_ts to see the values in start_time_raw that failed to cast.
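For example, a follow-up along these lines (a sketch; casted is a hypothetical name for the query above saved as a view or inline subquery) surfaces the raw values that failed to cast:

SELECT start_time_raw
FROM casted
WHERE start_time_ts = CAST('1900-01-01' AS timestamp)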


Create rows from part of column names

Source data
I am working on an ELT project to load data from CSV files into PostgreSQL where I will transform it. The CSV files have many columns that are consistent across files, but also contain activity columns that are inconsistent with names like Date (05/19/2020), Type (05/19/2020), etc.
In the loading script I am merging all of the columns with dates in the column name into one jsonb column so I don't have to constantly add new columns to the raw data table.
The resulting jsonb column in the raw data table looks like this:
id       | activity
---------+--------------------------------------------------------------------------------------------------------------------------
12345678 | {"Date (05/19/2020)": null, "Type (05/19/2020)": null, "Date (06/03/2020)": "06/01/2020", "Type (06/03/2020)": "E"}
98765432 | {"Date (05/19/2020)": "05/18/2020", "Type (05/19/2020)": "B", "Date (10/23/2020)": "10/26/2020", "Type (10/23/2020)": "T"}
JSON to columns
Using the amazing create_jsonb_flat_view function from this post I can convert the jsonb to columns like this:
id       | Date (05/19/2020) | Type (05/19/2020) | Date (06/03/2020) | Type (06/03/2020) | Date (10/23/2020) | Type (10/23/2020)
---------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------
10629465 | null              | null              | 06/01/2020        | E                 | null              | null
98765432 | 05/18/2020        | B                 | null              | null              | 10/26/2020        | T
Need to move part of column name to row
Now, this is where I'm stuck. I need to remove the portion of the column name that is the Activity Date (e.g. (05/19/2020)) and create a row for each id and ActivityDate with additional columns for Date and Type like this:
id       | ActivityDate | Date       | Type
---------+--------------+------------+-----
12345678 | 05/19/2020   | null       | null
12345678 | 06/03/2020   | 06/01/2020 | E
98765432 | 05/19/2020   | 05/18/2020 | B
98765432 | 10/23/2020   | 10/26/2020 | T
I followed your link to the create_jsonb_flat_view article yesterday and then forgot this question. While I thank you for pointing me there, I think that mentioning it worked against you.
A more conventional approach using regexp_replace() works here. I left the date values as strings, but you can convert them with to_date() if needed:
with parse as (
    select id, e.k, e.v,
           regexp_replace(e.k, '\s+\([0-9/]{10}\)', '') as k_no_date,
           regexp_replace(e.k, '^.+([0-9/]{10}).+', '\1') as k_date_only
    from rawinput
    cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
       k_date_only as activity_date,
       min(v) filter (where k_no_date = 'Date') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;
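The dates are left as strings here; if you need real date values, to_date() with a matching format string does the conversion, e.g.:

select to_date('05/19/2020', 'MM/DD/YYYY');  -- returns 2020-05-19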
@Mike-Organek's answer works beautifully!
However, I was curious whether the regexp_replace() calls might be slowing the query down a bit, and it seemed I could get the same results with a simpler function.
Since Mike gave me a great example to start with, I modified it to split on the space between Date and (05/19/2020).
For 20,000 rows, the average runtime on my local machine went from 7 seconds to 0.9 seconds.
Here is the resulting query:
with parse as (
    select id, e.k, e.v,
           split_part(e.k, ' ', 1) as k_no_date,
           trim(split_part(e.k, ' ', 2), '()') as k_date_only
    from rawinput
    cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
       k_date_only as activity_date,
       min(v) filter (where k_no_date = 'Date') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;
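To see what the two functions produce for a sample key, here is a quick standalone check (not part of the query itself):

select split_part('Date (05/19/2020)', ' ', 1) as k_no_date,
       trim(split_part('Date (05/19/2020)', ' ', 2), '()') as k_date_only;
-- k_no_date = 'Date', k_date_only = '05/19/2020'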

Filter data from BigQuery by using "last day of previous week"

For my BigQuery request I want to define two dates in the BETWEEN function as strings, but I get the following error message: 'Cannot read field 'date' of type INT64 as STRING'.
My calculation of LastDayofPreviousWeek works in the SELECT clause and gives correct results, but I can't use it in the WHERE clause? Any ideas?
SELECT
  FORMAT_DATE('%Y%m%d', DATE_SUB(DATE_TRUNC(PARSE_DATE('%Y%m%d', date), WEEK(MONDAY)), INTERVAL 1 DAY)) as LastDayofPreviousWeek,
  sum(totals.bounces) as Bounces
FROM `xxx.ga_sessions_*` t
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 71 DAY))
  AND FORMAT_DATE('%Y%m%d', DATE_SUB(DATE_TRUNC(PARSE_DATE('%Y%m%d', date), WEEK(MONDAY)), INTERVAL 1 DAY))
GROUP BY 1
Sample input data:
date, Bounces
20201118, 18695
20201119, 18694
20201120, 18693
The query below allows your LastDayofPreviousWeek expression to execute: you need to cast date to a string before passing it to PARSE_DATE.
However, there are still issues with your WHERE clause that you will need to address; you still have to choose an operator and a value for the comparison.
with temp as (
  SELECT 20201118 date, 18695 Bounces UNION ALL
  SELECT 20201119, 18694 UNION ALL
  SELECT 20201120, 18693
)
SELECT
  FORMAT_DATE('%Y%m%d', DATE_SUB(DATE_TRUNC(PARSE_DATE('%Y%m%d', cast(date as string)), WEEK(MONDAY)), INTERVAL 1 DAY)) as LastDayofPreviousWeek,
  sum(t.bounces) as Bounces
from temp t
group by 1
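One way to make the "last day of previous week" cutoff usable in the WHERE clause, not shown in the answer above, is to derive it from CURRENT_DATE() instead of from the per-row date field, which avoids referencing a row column inside the _TABLE_SUFFIX filter. A sketch, reusing the table pattern from the question:

SELECT
  sum(totals.bounces) as Bounces
FROM `xxx.ga_sessions_*` t
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 71 DAY))
  AND FORMAT_DATE('%Y%m%d', DATE_SUB(DATE_TRUNC(CURRENT_DATE(), WEEK(MONDAY)), INTERVAL 1 DAY))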

Postgres query for report

I'm trying to solve this problem:
I have a query/view that joins ~10 tables to extract some fields for a report (if any). The query doesn't use any grouping function, only joins, and trims away some unneeded data.
I have to take this one big view, group by the first column, take the max of a date in the second column, and return all the other fields from the row that holds that max value.
I haven't been able to do this in Postgres.
As pseudo-code I can give this:
select 1
, max(2)
, 3 referred to the record from max(2)
, 4 referred to the record from max(2)
, ...
, 20 referred to the record from max(2)
from (ViewWithAllJoins) a
group by 1
For privacy and business reasons I had to obfuscate some information; 1/2/3/4... are the names of the columns from the view ViewWithAllJoins. I hope the problem is still understandable and solvable!
I've tried the WINDOW approach reported in Convert keep dense_rank from Oracle query into postgres, but I wasn't able to use the GROUP BY that I need. Other attempts involved dense_rank as shown in Dense_rank first Oracle to Postgresql convert, but I can't make any assumption about the ordering of the data in any field except 1 and 2, so I can't use any aggregate function on the others.
Any ideas? Preferably without adding too many subqueries.
Thank you!
EDIT:
As suggested, I'll add some synthetic data to make the problem and the desired output clearer.
Start:
ID DATE COLUMN1 COLUMN2 COLUMN3
=====================================================================
88888888;"2016-04-02 09:00:00";"aaaaaaaaaaa";"TEXT89" ; 999999999
88888888;"2018-08-21 09:00:00";"a" ;"TEXT1" ; 988888888
88888888;"2017-11-09 09:00:00";"zzzz" ;"TEXT80000" ; 850580582
75858585;"2017-01-31 09:00:00";"~~~~~~~~~~~";"TEXT10" ; 101010101
75858585;"2018-04-02 09:00:00";"eeeeeeeeeee";"TEXT1000" ; 111111111
99999999;"2016-04-02 09:00:00";"8d2ecafd866";"TEXT808911"; 777777777
What I want:
ID DATE COLUMN1 COLUMN2 COLUMN3
===================================================================
88888888;"2018-08-21 09:00:00";"a" ;"TEXT1" ; 988888888
75858585;"2018-04-02 09:00:00";"eeeeeeeeeee";"TEXT1000" ; 111111111
99999999;"2016-04-02 09:00:00";"8d2ecafd866";"TEXT808911"; 777777777
So: group by id, take the max of the date, and return the other fields from the row with that max date.
So you have duplicate records per ID, and for every ID you want to select the record with the most recent date?
Use NOT EXISTS:
SELECT id, zdate, column1, column2, column3 -- , ...
FROM queryview t
WHERE NOT EXISTS (
    SELECT *
    FROM queryview x
    WHERE x.id = t.id
      AND x.zdate > t.zdate
);
Or, use row_number() over a window and pick only the row with the latest date:
SELECT id, zdate, column1, column2, column3 -- , ...
FROM ( SELECT *
            , row_number() OVER (PARTITION BY id ORDER BY zdate DESC) AS rn
       FROM queryview
     ) q
WHERE q.rn = 1;
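A third option, not in the answer above but idiomatic Postgres, is DISTINCT ON, which keeps the first row per id under the given sort order (a sketch using the same hypothetical queryview and column names):

SELECT DISTINCT ON (id)
       id, zdate, column1, column2, column3
FROM queryview
ORDER BY id, zdate DESC;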

How to query from the result of a changed column of a table in postgresql

So I have a string time column in a table, and I want to change that column to a datetime type and then query data for selected dates.
Is there a direct way to do so? One way I could think of is:
1) add a new column
2) insert values into it with the converted date
3) query using the new column
I am stuck on the 2nd step, the INSERT, so I need help with that:
ALTER TABLE "nds".”unacast_sample_august_2018"
ADD COLUMN new_date timestamp
-- Need correction in select statement that I don't understand
INSERT INTO "nds".”unacast_sample_august_2018” (new_date)
(SELECT new_date from_iso8601_date(substr(timestamp,1,10))
Could someone help me correct this, and if possible suggest a better way of doing it?
I tried another way to do it in a single step, but it gives the error 'column new_date does not exist':
SELECT *
FROM (SELECT from_iso8601_date(substr(timestamp,1,10)) FROM "db_name"."table_name") AS new_date
WHERE new_date > from_iso8601('2018-08-26') limit 10;
and
SELECT new_date = (SELECT from_iso8601_date(substr(timestamp,1,10)))
FROM "db_name"."table_name"
WHERE new_date > from_iso8601('2018-08-26') limit 10;
Could someone correct these queries?
You don't need those steps; just use a USING CAST clause in your ALTER TABLE:
CREATE TABLE foobar (my_timestamp) AS
VALUES ('2018-09-20 00:00:00');
ALTER TABLE foobar
ALTER COLUMN my_timestamp TYPE timestamp USING CAST(my_timestamp AS TIMESTAMP);
If your string timestamps are in a correct format this should be enough.
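Once the column is a real timestamp, the "query data for selected dates" part works directly; a minimal sketch against the foobar example above, using a half-open range to avoid boundary surprises:

SELECT *
FROM foobar
WHERE my_timestamp >= TIMESTAMP '2018-08-26'
  AND my_timestamp < TIMESTAMP '2018-08-27';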
Solved as follows:
select *
from (
    SELECT from_iso8601_date(substr(timestamp,1,10)) as day, *
    FROM "db"."table"
)
WHERE day > date_parse('2018-08-26', '%Y-%m-%d')
limit 10

Postgres query including time range

I have a query that pulls part of the data that I need for tracking. What I need to add is either a column that includes the date or the ability to query the table for a date range; I would prefer the column if possible. I am using PostgreSQL 8.3.3.
Here is the current query:
select count(mailing.mailing_id) as mailing_count, department.name as org
from mailing, department
where mailing.department_id = department.department_id
group by department.name;
This returns the following information:
mailing_count | org
---------------+-----------------------------------------
2 | org1 name
8 | org2 name
22 | org3 name
21 | org4 name
39 | org5 name
The table I am querying has 3 timestamp columns: target_launch_date, created_time and modified_time.
When I try to add the date range to the query I get an error:
Query:
select count(mailing.mailing_id) as mailing_count, department.name as org
from mailing, department
where mailing.department_id = department.department_id
group by department.name,
WHERE (target_launch_date)>= 2016-09-01 AND < 2016-09-05;
Error:
ERROR: syntax error at or near "WHERE" LINE 1:
...department.department_id group by department.name,WHERE(targ...
I've tried moving the location of the date range in the string and a variety of other changes, but cannot achieve what I am looking for.
Any insight would be greatly appreciated!
Here's a query that would do what you need:
SELECT
    count(m.mailing_id) as mailing_count,
    d.name as org
FROM mailing m
JOIN department d USING (department_id)
WHERE m.target_launch_date BETWEEN '2016-09-01' AND '2016-09-05'
GROUP BY 2
Since your target_launch_date is of type timestamp, you can safely write <= '2016-09-05': the string converts to 2016-09-05 00:00:00.00000, so you get all rows before the start of that day plus any row at exactly 2016-09-05 00:00:00.00000.
A couple of additional notes:
Use aliases for table names to shorten the code, eg. mailing m
Use explicit JOIN syntax to connect data from related tables
Apply your WHERE clause before GROUP BY to exclude rows that don't match it
Use BETWEEN operator to handle date >= X AND date <= Y case
You can use USING instead of ON in JOIN syntax when joined column names are the same
You can use column numbers in GROUP BY which point to position of a column in your select
For more insight into how a SELECT statement is processed step by step, see the documentation.
Edit
The BETWEEN approach would include 2016-09-05 00:00:00.00000 in the result set. If this timestamp should be discarded, change BETWEEN x AND y to either of these two:
(...) BETWEEN x AND y::timestamp - INTERVAL '1 microsecond'
(...) >= x AND (...) < y
You were close: you need to repeat the column name in the second part of the WHERE condition too, and then you have a single WHERE clause:
select count(mailing.mailing_id) as mailing_count, department.name as org
from mailing
inner join department on mailing.department_id = department.department_id
where target_launch_date >= '2016-09-01 00:00:00'
AND target_launch_date < '2016-09-05 00:00:00'
group by department.name;
EDIT: This part is just for Kamil G., showing clearly that BETWEEN should NOT be used:
create table sample (id int, d timestamp);
insert into sample (id, d)
values
(1, '2016/09/01'),
(2, '2016/09/02'),
(3, '2016/09/03'),
(4, '2016/09/04'),
(5, '2016/09/05'),
(6, '2016/09/05 00:00:00'),
(7, '2016/09/05 00:00:01'),
(8, '2016/09/06');
select * from sample where d between '2016-09-01' and '2016-09-05';
Result:
1;"2016-09-01 00:00:00"
2;"2016-09-02 00:00:00"
3;"2016-09-03 00:00:00"
4;"2016-09-04 00:00:00"
5;"2016-09-05 00:00:00"
6;"2016-09-05 00:00:00"
BTW, if you won't believe it without seeing EXPLAIN, here it is:
Filter: ((d >= '2016-09-01 00:00:00'::timestamp without time zone) AND
(d <= '2016-09-05 00:00:00'::timestamp without time zone))