How can I explode a Hive map into a denormalized ("long") format?

I want to take a bunch of user features (say, country, language, signup date, ...) and produce them in Hive in a denormalized 'long' format, i.e. rows of the form (userid, feature name, feature value), where feature name is something like "country" and feature value is something like "US".
I am using Hive 0.13.
The examples below all have one feature (country) for simplicity, but if I can get one to work, I'll add more.
Query #1:
select explode(map('country', get_json(json, 'country')))
from users
This works, with two columns of results (key, value) where the results look like
country US
country CA
...
Query #2:
select id, explode(map('country', get_json(json, 'country')))
from users
This fails with
FAILED: SemanticException [Error 10081]: UDTF's are not supported
outside the SELECT clause, nor nested in expressions
Query #3:
select key, value
from users
lateral view explode(map('country', get_json(json, 'country')))
This fails with
FAILED: ParseException line 3:63 cannot recognize input
near '' '' '' in table alias
Query #4:
select key, value
from users
lateral view explode(map('country', get_json(json, 'country'))) as (key, value)
This fails with
FAILED: ParseException line 3:67 missing EOF at '(' near 'as'
Is there a version of this that works?

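Got this to work. The lateral view needs a table alias (feature_row here), and the columns produced by exploding a map are named key and value by default: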
select id, key, value
from users
lateral view explode(map('country', get_json(json, 'country'))) feature_row
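For intuition, here is a rough sketch of what the lateral view does to each row, written outside Hive. It is only an illustration: the function name explode_features and the sample data are hypothetical, and it assumes the json column holds a flat JSON object.

```python
import json

def explode_features(rows, feature_names):
    """Yield one (id, key, value) row per requested feature,
    mimicking LATERAL VIEW explode(map(...)) on a single table."""
    for user_id, raw_json in rows:
        doc = json.loads(raw_json)
        for name in feature_names:
            # A missing feature becomes None, like a NULL map value in Hive.
            yield (user_id, name, doc.get(name))

# Hypothetical sample data standing in for the users table.
users = [
    (1, '{"country": "US", "language": "en"}'),
    (2, '{"country": "CA", "language": "fr"}'),
]
long_rows = list(explode_features(users, ["country", "language"]))
for row in long_rows:
    print(row)
```

Each input row fans out into one output row per feature, which is the "long" shape the question asks for.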

Equivalent of FIRST in Postgresql

Edit: The answer is to use MIN. It works on both strings and numbers. Credit to @cadet down below.
Original question:
I've been reading through similar questions around this for the last half an hour and cannot understand the responses so let me try to get a simple easy to follow answer.
What is the PostgreSQL equivalent of this code, which I would write if I were using SQL Server, to bring back the first value in field2 when aggregating:
Select field1, first(field2) from table group by field1?
I have read that DISTINCT ON is the right thing to use? In that case would it be:
Select field1, DISTINCT ON(field2) from table group by field1? because that gives me a syntax error
Edit:
Here is the error stating that the FIRST function does not exist in PostgreSQL:
ERROR: function first(asset32type) does not exist
LINE 1: Select policy, first (name) from multi_asset group by policy...
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
SQL state: 42883
Character: 16
And in case it isn't already clear when I say that in SQL Server the first() function brings back the first value in field2 when aggregating, I mean if you had data like this:
field1  field2
Tom     32
Tom     53
Then select field1, first(field2) group by field1 would give you back:
Tom, 32 - i.e. it picks the first value from field2
Maybe this one, using DISTINCT ON():
SELECT DISTINCT ON (field1)
field1
, field2
FROM table
ORDER BY
field1
, field2;
But without any data or any example, it's just a wild guess.
If "first" depends on a specific order:
select distinct field1,
first_value(field2)
over (partition by field1 order by field2) from
(
values (1,10),(1,11),(1,12),(2,23),(2,24)
) as a(field1,field2)
If "first" is just the minimum or maximum:
select field1,
min(field2)
from
(
values (1,10),(1,11),(1,12),(2,23),(2,24)
) as a(field1,field2)
group by field1
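The min-per-group variant can be tried without a PostgreSQL server. The sketch below is an assumption-laden stand-in: it uses Python's built-in sqlite3 module rather than PostgreSQL, and the table name a and the extra 'Ann' row are made up for illustration. Note that min() returns the smallest value in each group, which matches the "first" row only when the data happens to be in ascending order.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("create table a (field1 text, field2 integer)")
conn.executemany(
    "insert into a values (?, ?)",
    [("Tom", 32), ("Tom", 53), ("Ann", 7)],
)

# One row per field1, carrying the smallest field2 of that group.
rows = conn.execute(
    "select field1, min(field2) from a group by field1 order by field1"
).fetchall()
print(rows)
```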

syntax error using redshift listagg function

select id, listagg(timestamp,',')
within group (order by timestamp) as timestamp
from activity group by contact_id order by contact_id limit 1;
This is the error I am getting:
syntax error at or near ","
LINE 1: select eloqua_contact_id, listagg(timestamp,',') within grou...
Is anything wrong with this query? When I remove the delimiter option I do not get an error and everything returns fine. How do I add commas to separate the values in the listagg column?
I suspect the issue is the column name "timestamp" as that is a data type and reserved word. If you enclose the column name in double quotes it will keep it from being interpreted as a datatype. (best guess)
select id, listagg("timestamp",',')
within group (order by "timestamp") as "timestamp"
from activity group by contact_id order by contact_id limit 1;
It is generally not a good idea to give your columns the same names as data types.

How do I insert two related records into two different tables with a single query in PostgreSQL?

I have two tables with a relation by id. And I want to insert two related records. The problem is that id is not known until I make the first insert. Is there a way to write a kind of embedded query that makes both inserts correctly? I would like to have one exact query and to avoid variables, if it is possible. What I tried is:
insert into "b" ("value", "b_id")
select 'val2', (select insert into "a" ("value") values ('val1') returning id);
I get the error:
ERROR: syntax error at or near "("
You'll need to use a CTE to do that; INSERT statements cannot be arbitrarily nested (unlike SELECT):
WITH a_results AS (
INSERT INTO a (value)
VALUES ('val1')
RETURNING id
)
INSERT INTO b (value, b_id)
SELECT 'val2', id
FROM a_results;

Syntax error when trying to populate column with count of unique values in another column

I'm trying to count the number of unique pool operators for every permit # in a table but am having trouble putting this value in a new column dedicated to that count.
So I have 2 tables: doh_analysis; doh_pools.
Both of these tables have a "permit" column (TEXT), but doh_analysis has about 1000 rows with duplicates in the permit column but occasional unique values in the operator column (TEXT).
I'm trying to fill a column "operator_count" in the table "doh_pools" with a count of unique values in "pooloperator" for each permit #.
So I tried the following code but am getting a syntax error at or near "(":
update doh_pools
set operator_count = select count(distinct doh_analysis.pooloperator)
from doh_analysis
where doh_analysis.permit ilike doh_pools.permit;
When I remove the "select" from before the "count" I get "SQL Error [42803]: ERROR: aggregate functions are not allowed in UPDATE".
I can successfully query a list of distinct permit-pooloperator pairs using:
select distinct permit, pooloperator
from doh_analysis;
And I can query the # of unique pooloperators per permit 1 at a time using:
select count(distinct pooloperator)
from doh_analysis
where permit ilike '52-60-03054';
But I'm struggling to insert a count of unique pairs for each permit # in the operator_count column.
Is there a way to do this?
There is certainly a better way of doing this, but I accomplished my goal by creating 2 intermediate tables and then updating the target table with values from the 2nd intermediate table, like so:
select distinct permit, pooloperator
into doh_pairs
from doh_analysis;
select permit, count(distinct pooloperator)
into doh_temp
from doh_pairs
group by permit;
select count(distinct permit)
from doh_temp;
update doh_pools
set operator_count = doh_temp.count
from doh_temp
where doh_pools.permit ilike doh_temp.permit
and doh_pools.permit is not NULL
returning count;
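A possibly simpler route, without intermediate tables: the syntax error in the original attempt comes from the subquery not being wrapped in parentheses, so a correlated scalar subquery in the SET clause may be enough. The sketch below is hedged in two ways: it runs against Python's built-in sqlite3 rather than PostgreSQL, and it uses like where the question uses ilike (SQLite has no ilike; its like is case-insensitive for ASCII). The sample rows are invented.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table doh_analysis (permit text, pooloperator text);
create table doh_pools (permit text, operator_count integer);
insert into doh_analysis values
  ('52-60-03054', 'Alice'), ('52-60-03054', 'Bob'), ('52-60-03054', 'Alice'),
  ('52-60-99999', 'Carol');
insert into doh_pools (permit) values
  ('52-60-03054'), ('52-60-99999'), ('52-60-00000');
""")

# The parentheses turn the select into a scalar subquery, evaluated once
# per doh_pools row; permits with no match get a count of 0.
conn.execute("""
update doh_pools
set operator_count = (
    select count(distinct doh_analysis.pooloperator)
    from doh_analysis
    where doh_analysis.permit like doh_pools.permit
)
""")

counts = dict(conn.execute("select permit, operator_count from doh_pools"))
print(counts)
```

In PostgreSQL the same statement shape should work with ilike restored in the where clause.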

PostgreSQL use function result in ORDER BY

Is there a way to use the results of a function call in the order by clause?
My current attempt (I've also tried some slight variations).
SELECT it.item_type_id, it.asset_tag, split_part(it.asset_tag, 'ASSET', 2)::INT as tag_num
FROM serials.item_types it
WHERE it.asset_tag LIKE 'ASSET%'
ORDER BY split_part(it.asset_tag, 'ASSET', 2)::INT;
While my general assumption is that this can't be done, I wanted to know if there was a way to accomplish this that I wasn't thinking of.
EDIT: The query above gives the following error [22P02] ERROR: invalid input syntax for integer: "******"
Your query is generally OK, the problem is that for some row the result of split_part(it.asset_tag, 'ASSET', 2) is the string ******. And that string cannot be cast to an integer.
You may want to remove the order by and the cast in the select list and add a where split_part(it.asset_tag, 'ASSET', 2) = '******', for instance, to narrow down that data issue.
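If the tags can be exported, the offending rows can also be located outside the database. This is a minimal sketch, not the answerer's method: bad_asset_tags and the sample tags are hypothetical, and the regular expression only approximates what PostgreSQL accepts for an integer cast (optional sign, digits, surrounding whitespace; overflow is not checked).

```python
import re

# Approximation of text that PostgreSQL will accept for ::INT (assumption).
INT_LIKE = re.compile(r"^\s*[+-]?[0-9]+\s*$")

def bad_asset_tags(tags, prefix="ASSET"):
    """Return tags that pass LIKE 'ASSET%' but whose remainder is not an integer."""
    bad = []
    for tag in tags:
        if not tag.startswith(prefix):
            continue  # mirrors WHERE asset_tag LIKE 'ASSET%'
        suffix = tag.split(prefix)[1]  # mirrors split_part(tag, 'ASSET', 2)
        if not INT_LIKE.match(suffix):
            bad.append(tag)
    return bad

# Hypothetical sample tags.
sample = ["ASSET001", "ASSET******", "ASSET42", "OTHER7"]
bad = bad_asset_tags(sample)
print(bad)
```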
Once that is resolved, having such a function in the order by list is perfectly fine. The section of the documentation quoted in the comments on the question refers to applying an order by clause to the results of UNION, INTERSECT, etc. queries. In other words, the order by found in this query:
(select column1 as result_column1 from table1
union
select column2 from table2)
order by result_column1
can only refer to the accumulated result columns, not to expressions on individual rows.