RedShift: some troubles with nested json - amazon-redshift

I have next JSON:
{"promptnum":4,"corpuscode":"B0014","prompttype":"video","skipped":false,"transcription":"1","deviceinfo":{"DEVICE_ID":"exynos980","DEVICE_MANUFACTURER":"samsung","DEVICE_SERIAL":"unknown","DEVICE_DESIGN":"a51x","DEVICE_MODEL":"SM-A5160","DEVICE_OS":"android","DEVICE_OS_VERSION":"10","DEVICE_CARRIER":"","DEVICE_BATTERY_LEVEL":"70.00%","DEVICE_BATTERY_STATE":"unplugged","Current App Version":"1.1.0","Current App Build":"6"}}
I want to get values from 1-st level and 2-nd level.
1-st level: "promptnum":4,"corpuscode":"B0014","prompttype":"video","skipped":false,"transcription":"1","deviceinfo":...
2-nd level:
"deviceinfo":{"DEVICE_ID":"exynos980","DEVICE_MANUFACTURER":"samsung","DEVICE_SERIAL":"unknown","DEVICE_DESIGN":"a51x","DEVICE_MODEL":"SM-A5160","DEVICE_OS":"android","DEVICE_OS_VERSION":"10","DEVICE_CARRIER":"","DEVICE_BATTERY_LEVEL":"70.00%","DEVICE_BATTERY_STATE":"unplugged","Current App Version":"1.1.0","Current App Build":"6"}
When I parse 1-st level with
SELECT d.*
FROM (
SELECT c.json_parse, c.json_parse.deviceinfo AS device_info
FROM (
SELECT JSON_PARSE(file_attr)
FROM public.dc_ac_files
) AS c) AS d
it's work well.
But when I try to get values from 2-nd level with
SELECT d.*, l.DEVICE_ID
FROM (
SELECT c.json_parse, c.json_parse.deviceinfo AS device_info
FROM (
SELECT JSON_PARSE(file_attr)
FROM public.dc_ac_files
) AS c) AS d, d.device_info AS l
it doesn't work - no errors and no data.
If I know, it's right way to parse nested json, but it doesn't work for me.
Can you help me?

Viktor you have a couple of issues. First the notation "AS d, d.device_info AS l" is used to unnest arrays in your super data. You don't have any arrays to unnest so this is returning zero rows.
Second Redshift defaults to lower case for all column names so DEVICE_ID is being seen as device_id. You can enable case sensitive column names by setting the enable_case_sensitive_identifier connection variable to true and quoting all column names that require upper characters. "SET enable_case_sensitive_identifier TO true;" and changing l.DEVICE_ID to l."DEVICE_ID".
You also have unneeded layers in your query.
Putting all these together you can run:
SELECT l, l.deviceinfo, l.deviceinfo."DEVICE_ID"
FROM (
SELECT JSON_PARSE(file_attr) AS l
FROM public.dc_ac_files
) AS c
You also don't need SUPER data type to perform this. This can be done with json string parsing functions.
SELECT file_attr, json_extract_path_text(file_attr, 'deviceinfo') as deviceinfo, json_extract_path_text(file_attr, 'deviceinfo','DEVICE_ID') as device_id
FROM public.dc_ac_files

Related

Casting rows to arrays in PostgreSQL

I need to query a table as in
SELECT *
FROM table_schema.table_name
only each row needs to be a TEXT[] with array values corresponding to column values casted to TEXT coming in the same order as in SELECT * so assuming the table has columns a, b and c I need the result to look like
SELECT ARRAY[a::TEXT, b::TEXT, c::TEXT]
FROM table_schema.table_name
only it shouldn't explicitly list columns by name. Ideally it should look like
SELECT as_text_array(a)
FROM table_schema.table_name AS a
The best I came up with looks ugly and relies on "hstore" extension
WITH columnz AS ( -- get ordered column name array
SELECT array_agg(attname::TEXT ORDER BY attnum) AS column_name_array
FROM pg_attribute
WHERE attrelid = 'table_schema.table_name'::regclass AND attnum > 0 AND NOT attisdropped
)
SELECT hstore(a)->(SELECT column_name_array FROM columnz)
FROM table_schema.table_name AS a
I am having a feeling there must be a simpler way to achieve that
UPDATE 1
Another query that achieves the same result but arguably as ugly and inefficient as the first one is inspired by the answer by #bspates. It may be even less efficient but doesn't rely on extensions
SELECT r.text_array
FROM table_schema.table_name AS a
INNER JOIN LATERAL ( -- parse ROW::TEXT presentation of a row
SELECT array_agg(COALESCE(replace(val[1], '""', '"'), NULLIF(val[2], ''))) AS text_array
FROM regexp_matches(a::text, -- parse double-quoted and simple values separated by commas
'(?<=\A\(|,) (?: "( (?:[^"]|"")* )" | ([^,"]*) ) (?=,|\)\Z)', 'xg') AS t(val)
) AS r ON TRUE
It is still far from ideal
UPDATE 2
I tested all 3 options existing at the moment
Using JSON. It doesn't rely on any extensions, it is short to write, easy to understand and the speed is ok.
Using hstore. This alternative is the fastest (>10 times faster than JSON approach on a 100K dataset) but requires an extension. hstore in general is very handy extension to have through.
Using regex to parse TEXT presentation of a ROW. This option is really slow.
A somewhat ugly hack is to convert the row to a JSON value, then unnest the values and aggregate it back to an array:
select array(select (json_each_text(to_json(t))).value) as row_value
from some_table t
Which is to some extent the same as your hstore hack.
If the order of the columns is important, then using json and with ordinality can be used to keep that:
select array(select val
from json_each_text(to_json(t)) with ordinality as t(k,val,idx)
order by idx)
from the_table t
The easiest (read hacky-est) way I can think of is convert to a string first then parse that string into an array. Like so:
SELECT string_to_array(table_name::text, ',') FROM table_name
BUT depending on the size and type of the data in the table, this could perform very badly.

SQL with table as becomes ambiguous

Perhaps I'm approaching this all wrong, in which case feel free to point out a better way to solve the overall question, which "How do I use an intermediate table for future queries?"
Let's say I've got tables foo and bar, which join on some baz_id, and I want to use combine this into an intermediate table to be fed into upcoming queries. I know of the WITH .. AS (...) statement, but am running into problems as such:
WITH foobar AS (
SELECT *
FROM foo
INNER JOIN bar ON bar.baz_id = foo.baz_id
)
SELECT
baz_id
-- some other things as well
FROM
foobar
The issue is that (Postgres 9.4) tells me baz_id is ambiguous. I understand this happens because SELECT * includes all the columns in both tables, so baz_id shows up twice; but I'm not sure how to get around it. I was hoping to avoid copying the column names out individually, like
SELECT
foo.var1, foo.var2, foo.var3, ...
bar.other1, bar.other2, bar.other3, ...
FROM foo INNER JOIN bar ...
because there are hundreds of columns in these tables.
Is there some way around this I'm missing, or some altogether different way to approach the question at hand?
WITH foobar AS (
SELECT *
FROM foo
INNER JOIN bar USING(baz_id)
)
SELECT
baz_id
-- some other things as well
FROM
foobar
It leaves only one instance of the baz_id column in the select list.
From the documentation:
The USING clause is a shorthand that allows you to take advantage of the specific situation where both sides of the join use the same name for the joining column(s). It takes a comma-separated list of the shared column names and forms a join condition that includes an equality comparison for each one. For example, joining T1 and T2 with USING (a, b) produces the join condition ON T1.a = T2.a AND T1.b = T2.b.
Furthermore, the output of JOIN USING suppresses redundant columns: there is no need to print both of the matched columns, since they must have equal values. While JOIN ON produces all columns from T1 followed by all columns from T2, JOIN USING produces one output column for each of the listed column pairs (in the listed order), followed by any remaining columns from T1, followed by any remaining columns from T2.

Will Postgres' DISTINCT function always return null as the first element?

I'm selecting distinct values from tables thru Java's JDBC connector and it seems that NULL value (if there's any) is always the first row in the ResultSet.
I need to remove this NULL from the List where I load this ResultSet. The logic looks only at the first element and if it's null then ignores it.
I'm not using any ORDER BY in the query, can I still trust that logic? I can't find any reference in Postgres' documentation about this.
You can add a check for NOT NULL. Simply like
select distinct columnName
from Tablename
where columnName IS NOT NULL
Also if you are not providing the ORDER BY clause then then order in which you are going to get the result is not guaranteed, hence you can not rely on it. So it is better and recommended to provide the ORDER BY clause if you want your result output in a particular output(i.e., ascending or descending)
If you are looking for a reference Postgresql document then it says:
If ORDER BY is not given, the rows are returned in whatever order the
system finds fastest to produce.
If it is not stated in the manual, I wouldn't trust it. However, just for fun and try to figure out what logic is being used, running the following query does bring the NULL (for no apparent reason) to the top, while all other values are in an apparent random order:
with t(n) as (values (1),(2),(1),(3),(null),(8),(0))
select distinct * from t
However, cross joining the table with a modified version of itself brings two NULLs to the top, but random NULLs dispersed througout the resultset. So it doesn't seem to have a clear-cut logic clumping all NULL values at the top.
with t(n) as (values (1),(2),(1),(3),(null),(8),(0))
select distinct * from t
cross join (select n+3 from t) t2

PostgreSQL Full Text search

I need to use Full Text Search with Postgresql but I don't find the way to look for a list of words from a table (using ts_query) against an indexed text field (ts_vector data type). Is ts_query just able to process a few words or can process also multiple values that come from a table?
Thanks in advance for your help.
Let me try to formulate an answer according to the comments given on the question (if I understand your request correctly).
Problem
You are trying to do a full text search on the table tableA, column indexed_text_field (a tsvector type) based on words that are stored as text in another table tableB in a column called words.
Solution
First, if you wish to feed PostgreSQL multiple tokens (individual words) during a full text search you have two functions at your disposal:
to_tsquery()
plainto_tsquery()
In the first function you need to split each given token with an ampersand (&). The second function can be fed any string of text and it will chop it into tokens for you. More info here.
Your challenge is that you wish to select matches based on words present in another table. This can be done in different ways, for example via a simple (INNER) JOIN:
SELECT a.* FROM tableA a, tableB b WHERE a.indexed_text_field ## to_tsquery(b.words);
Or if you have multiple words in the words column you should most likely be using the plainto_tsquery() function to keep things simple:
SELECT a.* FROM tableA a, tableB b WHERE a.indexed_text_field ## plainto_tsquery(b.words);
Yet, if you must use the more low-level to_tsquery() version:
SELECT a.* FROM tableA a, tableB b WHERE a.indexed_text_field ## to_tsquery(replace(b.words, ' ', '&'));
In the latter you replace all spaces between the words with and ampersand, thus making them separate tokens. Mind the index usage on the last one though, as you might need to create an expression index on the usage of the replace() function.

Comparing data in a column of one table with the same column in another table

I have two tables temp and md respectively. There is a field called uri_stem which has certain details that I want to omit from temp but not from md. I need to make a comparison that is able to compare certain patterns and remove them from temp if there are similar patterns in md.
Right now I am using this code to remove data similar to the patterns I want to omit, but I want some method that is able to compare the patterns from the md table rather than me hardcording each one. Hope the explanation is clear enough.
FROM
spfmtr01.tbl_1c_apps_log_temp
where
uri_stem not like '%.js' and
uri_stem not like '%.css' and
uri_stem not like '%.gif'
and uri_stem not like '%.png'
and uri_stem not like '%.html'
and uri_stem not like '%.jpg'
and uri_stem not like '%.jpeg'
and uri_stem not like '%.ico'
and uri_stem not like '%.htm'
and uri_stem not like '%.pdf'
and uri_stem not like '%.Png'
and uri_stem not like '%.PNG'
This example is based on answer I mentioned in my comment.
SQLFiddle
Sample data:
drop table if exists a, b;
create table a (testedstr varchar);
create table b (condstr varchar);
insert into a values
('aa.aa.jpg'),
('aa.aa.bjpg'), -- no match
('aa.aa.jxpg'), -- no match
('aa.aa.jPg'),
('aa.aa.aico'), -- no match
('aa.aa.ico'),
('bb.cc.dd.icox'), -- no match
('bb.cc.dd.cco'); -- no match
insert into b values ('jpg'), ('ico');
Explanation:
in table a we have strings we would like to test (stored in column testedstr)
in table b we have strings we would to like to use as testing expresions (stored in column condstr)
SQL:
with cte as (select '\.(' || string_agg(condstr,'|') || ')$' condstr from b)
select * from a, cte where testedstr !~* condstr;
Explanation:
in the first line we will aggregate all patterns we would like to test into one string; as a result we will get jpg|ico string (aggregated into single row).
in the second line we crossjoin tested table with our testing expression (from the first line) and use regular expression to perform the test.
the regular expression at the end looks like \.(jpg|ico)$
For older versions, you should use answer provided by #Bohemian. For my sample data it would look like (adjusted for multiple possible dots) this (SQLFiddle:
select
*
from
a
where
lower(reverse(split_part(reverse(testedstr),'.',1)))
not in (select lower(condstr) from b)
Without reverse function (SQLFiddle):
select
*,
lower(split_part(testedstr,'.',length(testedstr)- length(replace(testedstr,'.','')) + 1)) as extension
from
a
where
lower(split_part(testedstr,'.',length(testedstr)- length(replace(testedstr,'.','')) + 1)) not in (select lower(condstr) from b)
First let's refactor the many conditions into just one:
where lower(substring(uri_stem from '[^.]+$')) not in ('js', 'css', 'gif', 'png', 'html', 'jpg', 'jpeg', 'ico', 'htm', 'pdf')
In this form, it's easy to see how the list of values can be selected instead of coded:
where lower(substring(uri_stem from '[^.]+$')) not in (
select lower(somecolumn) from sometable)
Note the use of lower() to avoid problems of dealing with variants of case.
You could also code it as a join:
select t1.*
from mytable t1
left join sometable t2
on lower(somecolumn) = lower(split_part(uri_stem, '.', 2))
where t2.somecolumn is null -- filter out matches