HBase to Hive incremental data load using PySpark

I have a table in HBase and I want to perform an incremental load into a Hive table using PySpark. I am fairly new to this.
In my project, a MySQL table is used to generate the JSON schema that loads the HBase table into a Spark DataFrame.
Now my question is: how do I perform an incremental load when there are changes in the HBase tables?
Here is the MySQL table structure:
+----------------+-------------+---------------+----------------+--------------+-----------+------------+-----------+---------------------+---------------------+
| tgt_table_name | col_name | col_data_type | map_table_name | map_col_name | weightage | action_set | is_active | created_datetime | modified_datetime |
+----------------+-------------+---------------+----------------+--------------+-----------+------------+-----------+---------------------+---------------------+
| employee | id | STRING | emp | id | 0 | COPY | y | 2023-02-15 00:00:00 | 2023-02-15 00:00:00 |
| employee | name | STRING | emp | name | 0 | COPY | y | 2023-02-15 00:00:00 | 2023-02-15 00:00:00 |
| employee | city | STRING | emp | city | 0 | COPY | y | 2023-02-15 00:00:00 | 2023-02-15 00:00:00 |
| employee | designation | STRING | emp | designation | 0 | COPY | y | 2023-02-15 00:00:00 | 2023-02-15 00:00:00 |
| employee | salary | STRING | emp | salary | 0 | COPY | y | 2023-02-15 00:00:00 | 2023-02-15 00:00:00 |
+----------------+-------------+---------------+----------------+--------------+-----------+------------+-----------+---------------------+---------------------+
tgt_table_name - Hive table name
col_name - Hive column name
map_table_name - HBase table name
map_col_name - HBase column name
Here is the sample DataFrame I created from HBase (image: "Spark DF from HBase").
I have inserted this DataFrame into a Hive external table.
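One hedged way to make the load incremental, sketched below rather than prescribed: keep a watermark of the last successful load, read from HBase only the cells written after it (the SHC-style connector exposes scan time-range options, commonly named minStamp/maxStamp, so verify against the connector your project actually uses), and merge that delta into the Hive table so the newest version of each row key wins. The helpers build_catalog_from_mysql, get_last_watermark and save_watermark are hypothetical placeholders standing in for your MySQL-driven metadata.

# Hedged sketch of an incremental HBase -> Hive load; the catalog/watermark
# helpers are hypothetical and the minStamp/maxStamp option names are from the
# SHC connector, so adjust to whatever connector your project uses.
import time

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = (SparkSession.builder
         .appName("hbase_to_hive_incremental")
         .enableHiveSupport()
         .getOrCreate())

catalog = build_catalog_from_mysql("employee")   # hypothetical: JSON catalog built from the MySQL metadata
last_load_ms = get_last_watermark("employee")    # hypothetical: epoch millis of the previous successful run
current_run_ms = int(time.time() * 1000)

# 1. Read only the HBase cells written after the previous load (scan time-range).
delta_df = (spark.read
            .format("org.apache.spark.sql.execution.datasources.hbase")
            .options(catalog=catalog,
                     minStamp=str(last_load_ms + 1),
                     maxStamp=str(current_run_ms))
            .load())

# 2. Merge: the newest row per HBase row key wins. Spark refuses to overwrite a
#    table it is reading in the same statement, so stage the merged result first.
current_df = spark.table("employee")
w = Window.partitionBy("id").orderBy(F.col("src").desc())
merged = (delta_df.withColumn("src", F.lit(1))
          .unionByName(current_df.withColumn("src", F.lit(0)))
          .withColumn("rn", F.row_number().over(w))
          .filter("rn = 1")
          .drop("src", "rn"))

merged.write.mode("overwrite").saveAsTable("employee_stage")
spark.sql("INSERT OVERWRITE TABLE employee SELECT * FROM employee_stage")

# 3. Persist the new watermark for the next run.
save_watermark("employee", current_run_ms)       # hypothetical helper

If the HBase cell timestamps are not trustworthy in your cluster (bulk loads can set arbitrary timestamps), drop the time-range options, read the full table, and let the same merge step reconcile it against the current Hive snapshot, at the cost of a full scan per run.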

Related

Fetch row count for all databases in PostgreSQL

I'm able to fetch row counts for a single database in PostgreSQL using the query below:
select schemaname, relname, n_live_tup
from pg_stat_user_tables
order by n_live_tup desc;
+------------+---------+------------+
| schemaname | relname | n_live_tup |
+------------+---------+------------+
| 2022       | AA      |   13960236 |
| 2022       | BB      |    7176815 |
| 2022       | CC      |    4669837 |
| 2022       | DD      |    3782882 |
| 2022       | EE      |    3315106 |
| 2022       | FF      |    3060672 |
+------------+---------+------------+
How could I get the row counts for all databases? I will be using Python to query PostgreSQL.
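Not a definitive answer, just a sketch under two assumptions: the columns shown (schemaname, relname, n_live_tup) come from pg_stat_user_tables, which is a per-database view, so the Python script has to connect to each database in turn; and the estimated n_live_tup counts are acceptable, since exact counts would need a much slower SELECT count(*) per table. Connection parameters are placeholders.

# Hedged sketch: loop over every connectable database and print its estimated
# per-table row counts. DSN/credentials are placeholders.
import psycopg2

BASE_DSN = "host=localhost user=postgres password=secret"  # placeholder

# 1. List every connectable, non-template database on the server.
conn = psycopg2.connect(BASE_DSN + " dbname=postgres")
with conn, conn.cursor() as cur:
    cur.execute("SELECT datname FROM pg_database "
                "WHERE datistemplate = false AND datallowconn = true")
    databases = [row[0] for row in cur.fetchall()]
conn.close()

# 2. pg_stat_user_tables only covers the current database, so connect to each one.
for db in databases:
    conn = psycopg2.connect(f"{BASE_DSN} dbname={db}")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT schemaname, relname, n_live_tup "
                    "FROM pg_stat_user_tables ORDER BY n_live_tup DESC")
        for schemaname, relname, n_live_tup in cur.fetchall():
            print(db, schemaname, relname, n_live_tup)
    conn.close()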

PostgreSQL insert performance - why would it be so slow?

I've got a PostgreSQL database running inside a docker container on an AWS Linux instance. I've got some telemetry running, uploading records in batches of ten. A Python server inserts these records into the database. The table looks like this:
postgres=# \d raw_journey_data ;
              Table "public.raw_journey_data"
 Column |            Type             | Collation | Nullable | Default
--------+-----------------------------+-----------+----------+---------
 email  | character varying           |           |          |
 t      | timestamp without time zone |           |          |
 lat    | numeric(20,18)              |           |          |
 lng    | numeric(21,18)              |           |          |
 speed  | numeric(21,18)              |           |          |
There aren't that many rows in the table; about 36,000 presently. But committing the transactions that insert the data is taking about a minute each time:
postgres=# SELECT pid, age(clock_timestamp(), query_start), usename, query
FROM pg_stat_activity
WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%'
ORDER BY query_start desc;
 pid |       age       | usename  | query
-----+-----------------+----------+--------
  30 |                 |          |
  32 |                 | postgres |
  28 |                 |          |
  27 |                 |          |
  29 |                 |          |
  37 | 00:00:11.439313 | postgres | COMMIT
  36 | 00:00:11.439565 | postgres | COMMIT
  39 | 00:00:36.454011 | postgres | COMMIT
  56 | 00:00:36.457828 | postgres | COMMIT
  61 | 00:00:56.474446 | postgres | COMMIT
  35 | 00:00:56.474647 | postgres | COMMIT
(11 rows)
The load average on the system's CPUs is zero and about half of the 4GB system RAM is available (as shown by free). So what causes the super-slow commits here?
The insertion is being done with SQLAlchemy:
db.session.execute(import_table.insert([
    {
        "email": current_user.email,
        "t": row.t.ToDatetime(),
        "lat": row.lat,
        "lng": row.lng,
        "speed": row.speed,
    }
    for row in data.data
]))
Edit: updated with the state column:
postgres=# SELECT pid, age(clock_timestamp(), query_start), usename, state, query
FROM pg_stat_activity
WHERE query NOT ILIKE '%pg_stat_activity%'
ORDER BY query_start desc;
 pid |       age       | usename  | state | query
-----+-----------------+----------+-------+--------
  32 |                 | postgres |       |
  30 |                 |          |       |
  28 |                 |          |       |
  27 |                 |          |       |
  29 |                 |          |       |
  46 | 00:00:08.390177 | postgres | idle  | COMMIT
  49 | 00:00:08.390348 | postgres | idle  | COMMIT
  45 | 00:00:23.35249  | postgres | idle  | COMMIT
(8 rows)
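A COMMIT in PostgreSQL waits for the transaction's WAL record to be flushed to disk (unless synchronous_commit is off), so near-zero CPU plus multi-second commits usually points at slow fsync on the storage backing the Docker volume rather than at the INSERT itself; pg_test_fsync on the host is the standard tool to measure that. As a rough client-side check, the sketch below (table name and DSN are placeholders, not taken from the question) times a trivial insert separately from its commit:

# Rough probe: does the time go into the INSERT or into the COMMIT (WAL flush)?
# Table name and DSN are placeholders.
import time
import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres")  # adjust DSN
with conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS commit_probe (v int)")
conn.commit()

with conn.cursor() as cur:
    t0 = time.monotonic()
    cur.execute("INSERT INTO commit_probe VALUES (1)")
    t1 = time.monotonic()
    conn.commit()                      # blocks until the WAL record reaches disk
    t2 = time.monotonic()

print(f"insert took {t1 - t0:.3f}s, commit took {t2 - t1:.3f}s")
conn.close()

If the commit side dominates here too, the remedy is usually at the storage or configuration level (a faster disk for the Postgres volume, or accepting the durability trade-off of synchronous_commit = off), not in the SQLAlchemy code.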

How to split a big table into multiple related tables?

I am new to PostgreSQL.
I have a big table that can be split into multiple tables:
ID_Student | Name_Student | Departement_Student | is_Student_works | job_title | Work_Departement | Location
=============================================================================================================
1          | Rolf         | Software Eng        | Yes              | intern SE | data Studio      | london
2          | Silvya       | Accounting          | Yes              | Accounter | TORnivo          | New York
I want to split it into 3 tables (student, departement, work):
STUDENT TABLE
ID_Student | Name_Student | is_Student_works | Location
========================================================
1          | Rolf         | yes              | london
2          | Silvya       | Yes              | New York
DEPARTEMENT TABLE
ID_DEPARTEMENT | Name_DEPARTEMENT |
===================================
1 | Software Eng |
2 | Accounting |
WORK TABLE
ID_WORK | Name_WORK |
===================================
1 | intern SE |
2 | Accounter |
I need only the queries that split the table into the multiple tables; THE CREATION OF THE TABLES IS NOT NEEDED.
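A sketch under assumptions taken from the question: the three target tables already exist, their id columns are serial/identity (so only the name columns need inserting), and the source table is called big_table; adjust the names to your schema. The INSERT ... SELECT DISTINCT statements are the substance, and the Python wrapper is just one way to run them.

# Hedged sketch: populate the three tables from the big one.
# Table and column names are guesses based on the question.
import psycopg2

queries = [
    # distinct values feed the lookup tables (ids assumed serial/identity)
    """INSERT INTO departement (name_departement)
       SELECT DISTINCT departement_student FROM big_table""",
    """INSERT INTO work (name_work)
       SELECT DISTINCT job_title FROM big_table""",
    # students keep their own attributes
    """INSERT INTO student (id_student, name_student, is_student_works, location)
       SELECT id_student, name_student, is_student_works, location FROM big_table""",
]

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder DSN
with conn, conn.cursor() as cur:
    for q in queries:
        cur.execute(q)
# exiting the with-block commits all three inserts in one transaction
conn.close()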

How to convert rows into columns in PostgreSQL for the below table

I am trying to convert the Trace table into the Result table in PostgreSQL. I have huge data in the table.
I have a table named Trace:
entity_id                       | ts            | key         | bool_v | dbl_v | str_v   | long_v
--------------------------------+---------------+-------------+--------+-------+---------+--------
1ea815c48c5ac30bca403a1010b09f1 | 1593934026155 | temperature |        |       |         | 45
1ea815c48c5ac30bca403a1010b09f1 | 1593934026155 | operation   |        |       | Normal  |
1ea815c48c5ac30bca403a1010b09f1 | 1593934026155 | period      |        |       |         | 6968
1ea815c48c5ac30bca403a1010b09f1 | 1593933202984 | temperature |        |       |         | 44
1ea815c48c5ac30bca403a1010b09f1 | 1593933202984 | operation   |        |       | Reverse |
1ea815c48c5ac30bca403a1010b09f1 | 1593933202984 | period      |        |       |         | 3535
I want to convert the above Trace table into the following table in PostgreSQL.
Output table: Result
entity_id                       | ts            | temperature | operation | period
--------------------------------+---------------+-------------+-----------+--------
1ea815c48c5ac30bca403a1010b09f1 | 1593934026155 | 45          | Normal    | 6968
1ea815c48c5ac30bca403a1010b09f1 | 1593933202984 | 44          | Reverse   | 3535
Have you tried this yet?
select entity_id, ts,
       max(long_v) filter (where key = 'temperature') as temperature,
       max(str_v)  filter (where key = 'operation')   as operation,
       max(long_v) filter (where key = 'period')      as period
from trace
group by entity_id, ts;

Sumif in PostgreSQL between two tables

These are my two sample tables.
table "outage" (column formats are text, timestamp, timestamp)
+-------------------+----------------+----------------+
| outage_request_id | actual_start | actual_end |
+-------------------+----------------+----------------+
| 1-07697685 | 4/8/2015 4:48 | 4/8/2015 9:02 |
| 1-07223444 | 7/17/2015 4:24 | 8/01/2015 9:23 |
| 1-07223450 | 2/13/2015 4:24 | 4/29/2015 1:03 |
| 1-07223669 | 4/28/2017 9:20 | 4/30/2017 6:58 |
| 1-08985319 | 8/24/2015 3:18 | 8/24/2015 8:27 |
+-------------------+----------------+----------------+
and a second table "prices" (column format is numeric, timestamp)
+-------+---------------+
| price | stamp |
+-------+---------------+
| -2.31 | 2/1/2018 3:00 |
| -2.35 | 2/1/2018 4:00 |
| -1.77 | 2/1/2018 5:00 |
| -2.96 | 2/1/2018 6:00 |
| -5.14 | 2/1/2018 7:00 |
+-------+---------------+
My Goal: To sum the prices in between the start and stop times of each outage_request_id.
I have no idea how to go about properly joining the tables and getting a sum of prices in those outage timestamp ranges.
I can't promise this is efficient (in fact for very large tables I feel pretty confident it's not), but this should notionally get you what you want:
select
    o.outage_request_id, o.actual_start, o.actual_end,
    sum(p.price) as total_price
from outage o
left join prices p on
    p.stamp between o.actual_start and o.actual_end
group by
    o.outage_request_id, o.actual_start, o.actual_end;