Is there any way to load a comma-separated string into a single column in Hive? - hiveql

When trying to insert a comma-separated string into a single column in Hive, I get the error 'TOK_STRINGLITERALSEQUENCE not supported in insert/values'. My statement looks like this, where /t represents a tab delimiter:
insert into table table_name values('llu'/t'ghf'/t'a,b,c,d'/t'gh,edf,ghu,kjhl'/t'1')
Expected results
col1 | col2 | col3    | col4            | col5
-----+------+---------+-----------------+-----
llu  | ghf  | a,b,c,d | gh,edf,ghu,kjhl | 1

I'm not sure why you're using tab delimiters in the INSERT statement. This worked for me in Hive version 1.2.1:
create table test (col1 STRING, col2 STRING, col3 STRING, col4 STRING, col5 STRING);
insert into table test values('llu','ghf','a,b,c,d','gh,edf,ghu,kjhl','1');
select * from test;
+------------+------------+------------+------------------+------------+--+
| test.col1  | test.col2  | test.col3  | test.col4        | test.col5  |
+------------+------------+------------+------------------+------------+--+
| llu        | ghf        | a,b,c,d    | gh,edf,ghu,kjhl  | 1          |
+------------+------------+------------+------------------+------------+--+
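If the tab-separated data actually lives in a file, the usual Hive route is to declare the table tab-delimited and load the file. A sketch (the table name test_tsv and the path /tmp/data.tsv are made-up examples):
-- commas inside a field are ordinary characters and stay in one column
create table test_tsv (col1 STRING, col2 STRING, col3 STRING, col4 STRING, col5 STRING)
row format delimited fields terminated by '\t';
load data local inpath '/tmp/data.tsv' into table test_tsv;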

Related

aggregate column of type row

I want to filter a column of row type and aggregate the rows when they carry complementary information.
My data looks like this:
|col1|rowcol                          |
|----|--------------------------------|
|1   |{col1=2, col2=null, col3=4}     |
|1   |{col1=null, col2=3, col3=null}  |
|2   |{col1=7, col2=8, col3=null}     |
|2   |{col1=null, col2=null, col3=56} |
|3   |{col1=1, col2=3, col3=7}        |
Here is some code you can use to get a working example:
select col1, cast(rowcol as row(col1 integer, col2 integer, col3 integer))
from (
    values
        (1, row(2,null,4)),
        (1, row(null,3,null)),
        (2, row(7,8,null)),
        (2, row(null,null,56)),
        (3, row(1,3,7))
) AS x (col1, rowcol)
I am expecting the following result:
|col1|rowcol                    |
|----|--------------------------|
|1   |{col1=2, col2=3, col3=4}  |
|2   |{col1=7, col2=8, col3=56} |
|3   |{col1=1, col2=3, col3=7}  |
Maybe someone can help me...
Thanks in advance
You need to group them by col1 and merge the non-null values, for example using max (aggregate functions skip NULLs):
-- sample data
WITH dataset (col1, rowcol) AS (
    VALUES
        (1, row(2,null,4)),
        (1, row(null,3,null)),
        (2, row(7,8,null)),
        (2, row(null,null,56)),
        (3, row(1,3,7))
)
-- query
select col1,
       cast(row(max(r.col1), max(r.col2), max(r.col3)) as row(col1 integer, col2 integer, col3 integer)) rowcol
from (
    select col1,
           cast(rowcol as row(col1 integer, col2 integer, col3 integer)) r
    from dataset
)
group by col1
order by col1 -- for ordered output
Output:
col1 | rowcol
-----+---------------------------
1    | {col1=2, col2=3, col3=4}
2    | {col1=7, col2=8, col3=56}
3    | {col1=1, col2=3, col3=7}
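Any null-skipping aggregate would do for the merge, since SQL aggregate functions ignore NULLs by default. As a sketch, the same query with Trino's arbitrary() instead of max() (assuming at most one non-null value per group and field, as in the sample data):
select col1,
       -- arbitrary() returns some non-null value per group, which is
       -- exactly the merge we want when values never conflict
       cast(row(arbitrary(r.col1), arbitrary(r.col2), arbitrary(r.col3)) as row(col1 integer, col2 integer, col3 integer)) rowcol
from (
    select col1,
           cast(rowcol as row(col1 integer, col2 integer, col3 integer)) r
    from dataset
)
group by col1
order by col1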

UPDATE from temp table picking the "last" row per group

Suppose there is a table with data:
+----+-------+
| id | value |
+----+-------+
| 1 | 0 |
| 2 | 0 |
+----+-------+
I need to do a bulk update, and I use COPY FROM STDIN for fast insertion into a temp table without constraints, so it can contain duplicate values in the id column.
Temp table to update from:
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
+----+-------+
If I simply run a query like:
UPDATE test target SET value = source.value FROM tmp_test source WHERE target.id = source.id;
I get the wrong results:
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 2 | 1 |
+----+-------+
I need the target table to contain the values that appeared last in the temporary table.
What is the most effective way to do this, given that the target table may contain millions of records, and the temporary table may contain tens of thousands?
Assuming you want to take the value from the row that was inserted last into the temp table, physically, you can (ab-)use the system column ctid, signifying the physical location:
UPDATE test AS target
SET    value = source.value
FROM  (
   SELECT DISTINCT ON (id)
          id, value
   FROM   tmp_test
   ORDER  BY id, ctid DESC
   ) source
WHERE  target.id = source.id
AND    target.value <> source.value; -- skip empty updates
About DISTINCT ON:
Select first row in each GROUP BY group?
This builds on an implementation detail and is not backed by the SQL standard. If some insert method does not write rows in sequence (like a future "parallel" INSERT), it breaks. Currently, it works. About ctid:
How do I decompose ctid into page and row numbers?
If you want a safe way, you need to add a user column to signify the order of rows, like a serial column. But do you really care? Your tiebreaker seems rather arbitrary. See:
Temporary sequence within a SELECT
AND target.value <> source.value
skips empty updates - assuming both columns are NOT NULL. Else, use:
AND target.value IS DISTINCT FROM source.value
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
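If you do want a well-defined order, here is a sketch of the serial-column variant mentioned above (the column name seq is made up):
-- temp table with an explicit insertion-order column
CREATE TEMP TABLE tmp_test (
  id    int,
  value int,
  seq   bigserial   -- generated automatically, records insertion order
);
-- COPY tmp_test (id, value) FROM STDIN;  -- seq fills itself in

UPDATE test AS target
SET    value = source.value
FROM  (
   SELECT DISTINCT ON (id)
          id, value
   FROM   tmp_test
   ORDER  BY id, seq DESC  -- deterministic "last row per id"
   ) source
WHERE  target.id = source.id
AND    target.value IS DISTINCT FROM source.value;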

How to find duplicated columns with all values in spark dataframe?

I'm preprocessing my data (2,000K+ rows) and want to count the duplicated columns in a Spark dataframe, for example:
 id | col1 | col2 | col3 | col4
----+------+------+------+------
  1 |    3 |  999 |    4 |  999
  2 |    2 |  888 |    5 |  888
  3 |    1 |  777 |    6 |  777
In this case, col2 and col4 hold the same values, which is what I'm interested in, so the count should go up by 1.
I tried toPandas(), transposing, and then dropDuplicates() in PySpark, but it's too slow.
Is there a function that can solve this?
Any idea will be appreciated, thank you.
So you want to count the number of rows where the values of col2 and col4 match? The snippet below should do the trick.
val dfWithDupCount = df.withColumn("isDup", when($"col2" === $"col4", 1).otherwise(0))
This creates a new dataframe with a flag column isDup that is 1 when col2 equals col4 and 0 otherwise.
To find the total number of matching rows, group by isDup and aggregate.
import org.apache.spark.sql.functions._
val grouped = dfWithDupCount.groupBy("isDup").agg(sum("isDup"))
display(grouped)
Apologies if I misunderstood you. You could probably use the same solution if you were trying to match any of the columns together, but that would require nested when statements.
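For reference, the same row-wise comparison can also be written in Spark SQL against a temporary view (a sketch; the view name t is made up and would come from df.createOrReplaceTempView("t")):
-- rows where col2 equals col4, next to the total row count
SELECT SUM(CASE WHEN col2 = col4 THEN 1 ELSE 0 END) AS dup_rows,
       COUNT(*)                                     AS total_rows
FROM t;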

select data from Hive where column value in list

I would like to select data from Hive where a column's value is in a given list.
Example data in the Hive table:
Col1   | Col2 | Col3
-------+------+--------
Joe    | 32   | Place-1
Nancy  | 28   | Place-2
Shalyn | 35   | Place-1
Andy   | 20   | Place-3
I am querying the Hive table as:
val name = List("Sherley","Joe","Shalyan","Dan")
var dataFromHive = sqlCon.sql("select Col1,Col2,Col3 from default.NameInfo where Col1 in (${name})")
I know that my query is wrong, as it's throwing an error, but I am not able to find the proper replacement for where Col1 in (${name}).
A better idea is to convert name to a DataFrame and join it with the Hive data; the inner join keeps only the intersecting rows, which is the same as filtering.
import sqlCon.implicits._ // enables .toDF on a local List
val nameDf = List("Sherley","Joe","Shalyan","Dan").toDF("Col1")
val dataFromHive = sqlCon.table("default.NameInfo").join(nameDf, "Col1").select("Col1", "Col2", "Col3")
Try to use the DataFrame API. It will make the code easy to read.
Convert your List to a String (with the proper format for use in a Hive query):
val name = List("Sherley","Joe","Shalyan","Dan")
val name_string = name.mkString("('","','", "')")
//name_string: String = ('Sherley','Joe','Shalyan','Dan')
var dataFromHive = sqlCon.sql("select Col1,Col2,Col3 from default.NameInfo where Col1 in " + name_string )

Oracle: How to group records by certain columns before fetching results

I have a table in Redshift that looks like this:
col1 | col2 | col3 | col4 | col5 | col6
=====+======+======+======+======+=====
123  | AB   | SSSS | TTTT | PQR  | XYZ
123  | AB   | SSTT | TSTS | PQR  | XYZ
123  | AB   | PQRS | WXYZ | PQR  | XYZ
123  | CD   | SSTT | TSTS | PQR  | XYZ
123  | CD   | PQRS | WXYZ | PQR  | XYZ
456  | AB   | GGGG | RRRR | OPQ  | RST
456  | AB   | SSTT | TSTS | PQR  | XYZ
456  | AB   | PQRS | WXYZ | PQR  | XYZ
I have another table that also has a similar structure and data.
From these tables, I need to select values that don't have 'SSSS' in col3 and 'TTTT' in col4 in either of the tables. I'd also like to group my results by the values in col1 and col2.
Here, I'd like my query to return:
123,CD
456,AB
I don't want 123, AB in my results, since one of the rows corresponding to 123, AB has SSSS in col3 and TTTT in col4. That is, I want to omit items that have SSSS and TTTT in col3 and col4 in either of the two tables I'm looking at.
I am very new to writing queries to extract information from a database, so please bear with my ignorance. I was told to explore GROUP BY and ORDER BY, but I am not sure I understand their usage well enough yet.
The query I have looks like:
SELECT * from table1 join table2 on
table1.col1 = table2.col1 AND
table1.col2 = table2.col2
WHERE
col3 NOT LIKE 'SSSS' AND
col4 NOT LIKE 'TTTT'
GROUP BY col1,col2
However, this query throws an error: col5 must appear in the GROUP BY clause or be used in an aggregate function;
I'm not sure how to proceed. I'd appreciate any help. Thank you!
It seems you also want DISTINCT results. In this case a solution with MINUS is probably as efficient as any other (and, remember, MINUS automatically also means DISTINCT):
select col1, col2 from table_name -- enter your column and table names here
minus
select col1, col2 from table_name where col3 = 'SSSS' and col4 = 'TTTT'
;
No need to group by anything!
With that said, here is a solution using GROUP BY. Note that the HAVING condition uses a non-trivial aggregate: a COUNT() over a CASE expression that encodes the requirement. Note also that the aggregate function in the HAVING clause does not need to appear in the SELECT list.
select col1, col2
from table_name
group by col1, col2
having count(case when col3 = 'SSSS' and col4 = 'TTTT' then 1 else null end) = 0
;
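Since the question mentions a second table with a similar structure, the same HAVING idea extends to both tables by unioning them first; a sketch (assuming table2 has matching column names):
select col1, col2
from (
  select col1, col2, col3, col4 from table1
  union all
  select col1, col2, col3, col4 from table2
) t
group by col1, col2
having count(case when col3 = 'SSSS' and col4 = 'TTTT' then 1 else null end) = 0
;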
You should use the EXCEPT operator.
EXCEPT and MINUS are two different versions of the same operator.
Here is what your query should look like:
SELECT col1, col2 FROM table1
EXCEPT
SELECT col1, col2 FROM table1 WHERE col3 = 'SSSS' AND col4 = 'TTTT';
One important consideration is whether your desired answer requires the AND or the OR operator: do you want to see records where col3 = 'SSSS' but col4 has a value other than 'TTTT'?
If the answer is no, you should use the version below:
SELECT col1, col2 FROM table1
EXCEPT
SELECT col1, col2 FROM table1 WHERE col3 = 'SSSS' OR col4 = 'TTTT';
You can learn more about the MINUS and EXCEPT operators in the Amazon Redshift documentation.