Load data from one table to another in hive - pyspark

I have a table in hive named a.table1 which has columns id, name, class and it is fully loaded with data.
id name class
1 a 1
11 b 14
I want to create a new table b.table2 from a.table1 which has the fields id, name, class, status.
When id is less than 10, status should get the same value as class; otherwise status should be 0.
id name class status
1 a 1 1
11 b 14 0
What I am doing is creating the table:
CREATE TABLE IF NOT EXISTS b.table2(
id BIGINT,
name string,
class int,
status int
)
How can I load the contents of the table? Or is there a better way to do it?
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sc = spark.sparkContext
# no separate sqlContext is needed; spark.sql("<statement>") runs Hive SQL directly

Just do a select and insert the results into table2:
insert into b.table2
select id, name, class,
       case when id < 10 then class else 0 end as status
from a.table1;

CTAS will create and load the table in a single statement:
CREATE TABLE b.table2 AS
select id, name, class,
       case when id < 10 then class else 0 end as status
from a.table1;

Related

Postgresql group into predefined groups where group names come from a database table

I have a database table with data similar to this.
create table DataTable (
  name text,
  value numeric
);
insert into DataTable values
('A', 1),('A', 2),('B', 3),('Other', 5),('C', 1);
And I have another table:
create table "group" (
  name text,
  "default" boolean
);
insert into "group" values
('A', false),('B', false),('Other', true);
I want to group the data in the first table based on the defined groups in the second table.
Expected output
Name | sum
A | 3
B | 3
Other | 6
Right now I'm using this query:
select coalesce(g.name, (select name from "group" where "default" = true)) as name,
       sum(dt.value)
from DataTable dt
left join "group" g on dt.name = g.name
group by 1;
This works but can cause performance issues in some situations. Any better way to do this?
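One alternative (a sketch against the tables above; the cross join assumes exactly one row in "group" has "default" = true) fetches the default group name once instead of re-running a correlated subquery per group:
select coalesce(g.name, d.name) as name,
       sum(dt.value) as sum
from DataTable dt
left join "group" g on dt.name = g.name
cross join (select name from "group" where "default") d
group by 1;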

How to update null values with sequence number?

I have a table like this
id val
1 null
2 10
3 null
4 null
5 7
I want to get
id val
1 1
2 10
3 2
4 3
5 7
I tried to do it like this
CREATE SEQUENCE new_val START 1
UPDATE tb1
SET val = new_val
WHERE val is null
But I get an error that new_val doesn't exist
You need to use nextval() and provide the sequence name as a string:
CREATE SEQUENCE new_val START 1;
UPDATE tb1
SET val = nextval('new_val')
WHERE val is null;
Another option is to use row_number() (the subquery numbers only the rows where val is null, which matches the expected output):
UPDATE tb1
SET val = t.rn
FROM (
  select id, row_number() over (order by id) as rn
  from tb1
  where val is null
) t
WHERE tb1.val is null
and tb1.id = t.id;
id is assumed to be the primary key of the table.
You get the next value of a sequence with the nextval function. The function takes regclass as argument type, for which you can supply the name of the sequence (as single quoted string) or the object identifier:
nextval('new_val')
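Both call forms are therefore equivalent:
select nextval('new_val');            -- string literal, implicitly cast to regclass
select nextval('new_val'::regclass);  -- explicit cast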

Update Variable based on Group

I need to perform an update to a field in a table with a variable, but I need the variable to change when the group changes. It is just an INT. In the example below I want to update the Texas records with 1 and the Florida records with the next number, 2:
UPDATE table
set StateNum = #Count
FROM table
where xxxxx
GROUP BY state
Group   Update Variable
Texas   1
Texas   1
Florida 2
Florida 2
Florida 2
I think you should use a lookup table with the state and its number (StateNum), and then store this number instead of the name in your table.
You might use DENSE_RANK within an updateable CTE:
--mockup data
DECLARE @tbl TABLE([state] VARCHAR(100), StateNum INT);
INSERT INTO @tbl([state]) VALUES
('Texas'),('Florida'),('Texas'),('Nevada');
--your update-statement
WITH updateableCTE AS
(
    SELECT StateNum
          ,DENSE_RANK() OVER(ORDER BY [state]) AS NewValue
    FROM @tbl
)
UPDATE updateableCTE SET StateNum=NewValue;
--check the result
SELECT * FROM @tbl;
And then you can use this to get the data for your lookup table:
SELECT StateNum,[state] FROM @tbl GROUP BY StateNum,[state];
Then drop the state column from your original table and make StateNum a foreign key to the lookup table.
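A minimal sketch of that final structure (StateLookup and dbo.MyTable are placeholder names standing in for your real tables):
CREATE TABLE StateLookup (StateNum INT PRIMARY KEY, [state] VARCHAR(100));
INSERT INTO StateLookup (StateNum, [state])
SELECT StateNum, [state] FROM dbo.MyTable GROUP BY StateNum, [state];
ALTER TABLE dbo.MyTable
    ADD CONSTRAINT FK_MyTable_StateLookup
    FOREIGN KEY (StateNum) REFERENCES StateLookup (StateNum);
-- once verified, the redundant name column can go:
ALTER TABLE dbo.MyTable DROP COLUMN [state];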

Consume The Changes / Deltas using Postgresql

Following is my scenario:
I have two landing tables, source_table and destination_table.
I need a query (or queries) which will update the destination table with the new rows as well as the updated rows from the source table.
Sample Data would be:
source table:
id name salary
1 P1 10000
2 P2 20000
target table:
id name salary
1 P1 8000
And the expected output should be:
target table:
id name salary
1 P1 10000 (salary updated)
2 P2 20000 (new row inserted)
This doesn't seem to work:
select * from user_source
except
select * from user_target as s
INSERT INTO user_target (id, name, salary)
VALUES (s.id, s.name, s.salary) WHERE id !=s.id
UPDATE user_target
SET name=s.name, salary=s.salary,
WHERE id = s.id
Seems like a simple insert ... on conflict to me:
insert into target_table (id, name, salary)
select id, name, salary
from source_table
on conflict (id) do update
set name = excluded.name,
salary = excluded.salary;
This assumes that the id column is the primary (or unique) key. Looking at the sample data (id, name) might also be unique. In that case you need to change the on conflict() clause and obviously remove the update of the name column as well.
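For that case the statement would become (a sketch, assuming a unique constraint over (id, name)):
insert into target_table (id, name, salary)
select id, name, salary
from source_table
on conflict (id, name) do update
set salary = excluded.salary;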

Copy data with new IDs

Is there any way to COPY some rows into same table with new IDs?
My table is like this:
ID | data
1 | SOMETHING
2 | SOMETHING
3 | SOMETHING
I have old IDs: '{1,513,3,4,5}', and new ones: '{1338,7,512,9,10}'. I need to insert a row with id 1338 that copies the data of row 1, a row with id 7 that copies the data of row 513, and so on: new[i] gets the data of old[i].
Currently I am using a loop:
SELECT old_ids INTO oIds FROM vars_table WHERE sid = id;
FOR i IN 1..array_length(new_ids, 1) LOOP
    INSERT INTO ids(ID, data)
    SELECT new_ids[i], data
    FROM ids
    WHERE id = oIds[i]
    AND NOT EXISTS(SELECT 1 FROM ids WHERE id = new_ids[i]);
END LOOP;
Is there better way to do this? Maybe in 1 query?
There is no need for a loop:
insert into the_table (id, data)
select id + 5, data
from the_table;
However the above requires you to know how many rows there are in the table. To take the current number of rows into account you can do:
insert into the_table (id, data)
select id + (select max(id) from the_table), data
from the_table;
Attention: the above is NOT safe in a multi-user environment. It should only be used if you are the only one doing this.
The best way to deal with this kind of data duplication is to define the ID column as serial and let Postgres deal with creating new values:
create table the_table (id serial not null, data text);
The initial data would then be inserted like this:
insert into the_table (data)
values ('foo'), ('bar'), ('foobar');
Duplicating the data is then as easy as:
insert into the_table (data)
select data
from the_table;
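If the exact old-to-new mapping from the question has to be kept, the loop can also collapse into one statement (a sketch against the ids table above; multi-argument unnest needs PostgreSQL 9.4+):
insert into ids (id, data)
select m.new_id, i.data
from unnest(array[1,513,3,4,5], array[1338,7,512,9,10]) as m(old_id, new_id)
join ids i on i.id = m.old_id
where not exists (select 1 from ids x where x.id = m.new_id);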