When creating a Hive managed bucketed table, it is converted to an external table - pyspark

I have created a Hive bucketed table using the command below:
CREATE TABLE sales_bucketed_notrsc_3 (id INT, fname STRING)
CLUSTERED BY (id) INTO 10 BUCKETS
TBLPROPERTIES ('transactional'='false');
But after running the above command, the table is created as an external table, and I don't know why.
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `sales_bucketed_notrsc_3`( |
| `id` int, |
| `fname` string) |
| CLUSTERED BY ( |
| id) |
| INTO 10 BUCKETS |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| '/warehouse/tablespace/external/hive/sales_bucketed_notrsc_3' |
| TBLPROPERTIES ( |
| 'TRANSLATED_TO_EXTERNAL'='TRUE', |
| 'bucketing_version'='2', |
| 'external.table.purge'='TRUE', |
| 'transient_lastDdlTime'='1631200197', |
| 'translated_to_external'='false') |
+----------------------------------------------------+
20 rows selected (0.188 seconds)
If we create the table without any TBLPROPERTIES, a managed table is created, but the expected number of bucket files is not.
CREATE TABLE sales_bucketed_1 (id INT, fname STRING) CLUSTERED BY (id) INTO 10 BUCKETS;
[root]# hdfs dfs -ls /warehouse/tablespace/managed/hive/sales_bucketed_1
Found 2 items
drwxr-xr-x - hive hive 0 2021-09-09 20:59 /warehouse/tablespace/managed/hive/sales_bucketed_1/delta_0000001_0000001_0000
drwxr-xr-x - hive hive 0 2021-09-09 21:01 /warehouse/tablespace/managed/hive/sales_bucketed_1/delta_0000002_0000002_0000
[root]#
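For reference, the 'TRANSLATED_TO_EXTERNAL'='TRUE' property in the SHOW CREATE TABLE output above is what Hive 3 records when the metastore translates a managed, non-ACID table into an external one: on clusters where managed tables are required to be transactional, a CREATE TABLE with 'transactional'='false' cannot stay managed. Below is a minimal sketch of the variant that should remain managed, assuming such a Hive 3 setup (the table name is just an example; full transactional tables must be stored as ORC):
-- Sketch: keep the table managed by making it fully transactional
CREATE TABLE sales_bucketed_trx (id INT, fname STRING)
CLUSTERED BY (id) INTO 10 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
With an ACID table like this, the per-bucket files (bucket_00000 ... bucket_00009) are typically written inside the delta_* directories that each insert creates, so they show up under those subdirectories rather than directly in the table location.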

Related

HBase to Hive incremental data load using Pyspark

I have a table in HBase and I want to perform an incremental load into a Hive table using PySpark. I am fairly new to this.
In my project, a MySQL table is used to generate the JSON that serves as the schema for loading HBase data into a Spark DataFrame.
Now my question is: how do I perform an incremental load when there are changes in the HBase tables?
Here is the MySQL table structure:
+----------------+-------------+---------------+----------------+--------------+-----------+------------+-----------+---------------------+---------------------+
| tgt_table_name | col_name | col_data_type | map_table_name | map_col_name | weightage | action_set | is_active | created_datetime | modified_datetime |
+----------------+-------------+---------------+----------------+--------------+-----------+------------+-----------+---------------------+---------------------+
| employee | id | STRING | emp | id | 0 | COPY | y | 2023-02-15 00:00:00 | 2023-02-15 00:00:00 |
| employee | name | STRING | emp | name | 0 | COPY | y | 2023-02-15 00:00:00 | 2023-02-15 00:00:00 |
| employee | city | STRING | emp | city | 0 | COPY | y | 2023-02-15 00:00:00 | 2023-02-15 00:00:00 |
| employee | designation | STRING | emp | designation | 0 | COPY | y | 2023-02-15 00:00:00 | 2023-02-15 00:00:00 |
| employee | salary | STRING | emp | salary | 0 | COPY | y | 2023-02-15 00:00:00 | 2023-02-15 00:00:00 |
+----------------+-------------+---------------+----------------+--------------+-----------+------------+-----------+---------------------+---------------------+
tgt_table_name - Hive table name
col_name - Hive column name
map_table_name - HBase table name
map_col_name - HBase column name
Here is the sample data frame which I have created from HBase (screenshot: "Spark DF from HBase").
I have inserted this data frame into a Hive external table.
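For what it's worth, one common pattern for the incremental part (independent of how the DataFrame is built) is to land the changed HBase rows in a staging table and MERGE them into an ACID Hive target. A rough sketch, assuming a transactional target table employee and a staging table employee_stage with the same columns (both names are placeholders here, and MERGE requires a full ACID managed table, not an external one):
MERGE INTO employee AS t
USING employee_stage AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET name = s.name, city = s.city, designation = s.designation, salary = s.salary
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name, s.city, s.designation, s.salary);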

Any way to find and delete almost similar records with SQL?

I have a table in a Postgres DB that has a lot of almost identical rows. For example:
1. 00Zicky_-_San_Pedro_Danilo_Vigorito_Remix
2. 00Zicky_-_San_Pedro__Danilo_Vigorito_Remix__
3. 0101_-_Try_To_Say__Strictlyjaz_Unit_Future_Rmx__
4. 0101_-_Try_To_Say__Strictlyjaz_Unit_Future_Rmx_
5. 01_-_Digital_Excitation_-_Brothers_Gonna_Work_it_Out__Piano_Mix__
6. 01_-_Digital_Excitation_-_Brothers_Gonna_Work_it_Out__Piano_Mix__
I'm thinking about writing a little Golang script to remove the duplicates, but maybe SQL can do it?
Table definition:
\d+ songs
Table "public.songs"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
---------------+-----------------------------+-----------+----------+----------------------------------------+----------+-------------+--------------+-------------
song_id | integer | | not null | nextval('songs_song_id_seq'::regclass) | plain | | |
song_name | character varying(250) | | not null | | extended | | |
fingerprinted | smallint | | | 0 | plain | | |
file_sha1 | bytea | | | | extended | | |
total_hashes | integer | | not null | 0 | plain | | |
date_created | timestamp without time zone | | not null | now() | plain | | |
date_modified | timestamp without time zone | | not null | now() | plain | | |
Indexes:
"pk_songs_song_id" PRIMARY KEY, btree (song_id)
Referenced by:
TABLE "fingerprints" CONSTRAINT "fk_fingerprints_song_id" FOREIGN KEY (song_id) REFERENCES songs(song_id) ON DELETE CASCADE
Access method: heap
I tried several methods to find duplicates, but those methods search only for exact matches.
There is no operator that essentially tests A almost = B. (Well, there is full-text search, but that seems a little excessive here.) If the only difference is the number of - and _ characters, then just get rid of them and compare the resulting strings. If they are equal, then one is a duplicate. You can use the replace() function to remove them. So something like:
delete
from songs s2
where exists ( select null
               from songs s1
               where s1.song_id < s2.song_id
                 and replace(replace(s1.song_name, '_',''),'-','') =
                     replace(replace(s2.song_name, '_',''),'-','')
             );
If your table is large this will not be fast, but a functional index may help:
create index song_name_idx on songs
(replace(replace(song_name, '_',''),'-',''));
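If you want to preview which rows would be treated as near-duplicates before deleting anything, the same normalization works in a plain SELECT; a small sketch using the same replace() expression:
select s1.song_id   as kept_id,
       s1.song_name as kept_name,
       s2.song_id   as duplicate_id,
       s2.song_name as duplicate_name
from songs s1
join songs s2
  on s1.song_id < s2.song_id
 and replace(replace(s1.song_name, '_',''),'-','') =
     replace(replace(s2.song_name, '_',''),'-','');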

Why aren't columns with the citext datatype processed by Presto?

I'm running pgsql queries on the SQL console provided by presto-client, connected to a presto-server running on top of Postgres. The result set of the queries contains only the columns that aren't of citext type.
DataDetails Table Description:
Table "public.datadetails"
Column | Type | Modifiers | Storage | Stats target | Description
------------------+----------+------------------------------+----------+--------------+-------------
data_sequence_id | bigint | not null | plain | |
key | citext | not null | extended | |
uploaded_by | bigint | not null | plain | |
uploaded_time | bigint | not null | plain | |
modified_by | bigint | | plain | |
modified_time | bigint | | plain | |
retrieved_by | bigint | | plain | |
retrieved_time | bigint | | plain | |
file_name | citext | not null | extended | |
file_type | citext | not null | extended | |
file_size | bigint | not null default 0::bigint | plain | |
Indexes:
"datadetails_pk1" PRIMARY KEY, btree (data_sequence_id)
"datadetails_uk0" UNIQUE CONSTRAINT, btree (key)
Check constraints:
"datadetails_file_name_c" CHECK (length(file_name::text) <= 32)
"datadetails_file_type_c" CHECK (length(file_type::text) <= 2048)
"datadetails_key_c" CHECK (length(key::text) <= 64)
Query Result in Presto-Client:
presto:public> select * from datadetails;
data_sequence_id | uploaded_by | uploaded_time | modified_by | modified_time | retrieved_by | retrieved_time | file_size |
------------------+-------------+---------------+-------------+---------------+--------------+----------------+-----------+
2000000000007 | 15062270 | 1586416286363 | 0 | 0 | 0 | 0 | 61 |
2000000000011 | 15062270 | 1586416299159 | 0 | 0 | 15062270 | 1586417517045 | 36 |
(2 rows)
Query 20200410_130419_00017_gmjgh, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [2 rows, 0B] [10 rows/s, 0B/s]
In the above result set, the columns with citext type are missing.
Does Presto support the citext datatype, or is there any configuration to process citext columns with Presto?
Postgres: PostgreSQL 9.4.0-relocatable (Red Hat 4.4.7-11), 64-bit
Presto-Server: presto-server-0.230
Presto-Client: presto-cli-332
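The PostgreSQL connector in Presto only maps column types it recognizes, and citext comes from an extension, so those columns are typically skipped, which matches what the result set above shows. One possible workaround (an assumption about what fits this setup, not a connector feature) is to create a view on the Postgres side that casts the citext columns to text, which Presto then sees as varchar:
CREATE VIEW datadetails_presto AS        -- view name is just an example
SELECT data_sequence_id,
       key::text       AS key,
       uploaded_by, uploaded_time,
       modified_by, modified_time,
       retrieved_by, retrieved_time,
       file_name::text AS file_name,
       file_type::text AS file_type,
       file_size
FROM datadetails;
Then query datadetails_presto from the Presto client instead of the base table.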

Does the size of a row or column affect aggregation queries in PostgreSQL?

Consider the following table definition:
Column | Type | Collation | Nullable | Default
-----------------+--------------------------+-----------+----------+-------------
id | uuid | | not null |
reference_id | text | | |
data | jsonb | | |
tag | character varying(255) | | |
created_at | timestamp with time zone | | |
updated_at | timestamp with time zone | | |
is_active | boolean | | not null | true
status | integer | | | 0
message | text | | |
batch_id | uuid | | not null |
config | jsonb | | |
The overall table size is over 500M, and the data column in every row holds a JSON document of over 50MB.
Questions -
Does the size of the data column affect an aggregation such as count?
Assume we are running the below query -
select count(*)
from table
where batch_id = '88f30539-32d7-445c-8d34-f1da5899175c';
Does the size of the data column affect an aggregation such as sum?
Assume we are running the below queries -
Query 1 -
select sum((data->>'count')::int)
from table
where batch_id = '88f30539-32d7-445c-8d34-f1da5899175c';
Query 2 -
select sum(jsonb_array_length(data->'some_array'))
from table
where batch_id = '88f30539-32d7-445c-8d34-f1da5899175c';
The best way to know is to measure.
Once the data is large enough to always be TOASTed, its size will no longer affect the performance of queries that do not need to access the TOASTed contents, like your first one. Your last two queries do need to access the contents, so their performance will depend on the size.
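A sketch of how you could measure it on your own data (my_table stands in for your table name; the batch_id is the one from the question):
-- Compare timings of the two kinds of query directly:
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM my_table
WHERE batch_id = '88f30539-32d7-445c-8d34-f1da5899175c';

EXPLAIN (ANALYZE, BUFFERS)
SELECT sum((data->>'count')::int) FROM my_table
WHERE batch_id = '88f30539-32d7-445c-8d34-f1da5899175c';

-- Rough idea of how much of the table lives outside the main heap
-- (pg_total_relation_size includes TOAST and indexes):
SELECT pg_size_pretty(pg_relation_size('my_table'))        AS heap_size,
       pg_size_pretty(pg_total_relation_size('my_table')
                      - pg_relation_size('my_table'))      AS toast_plus_indexes;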

How to maintain a maximum number of records in a table (Postgres)

I have the following problem: I have a table with a foreign key and a creation-date column.
When a new record is inserted, I need to check how many records exist with the same foreign key; if there are more than N, I have to delete the oldest record, so that there are always only N records per foreign key.
I could do it from the application (I'm building a REST API), but I don't think that's optimal, and I'd rather do it with a trigger directly in SQL.
For example
+----+----------+------------+
| TABLE: example |
+----+----------+------------+
| FK | DATE.... | MORE_FIELD |
+----+----------+------------+
| 01 | 17/12/92 | |
| 01 | 17/12/93 | |
| 02 | 17/12/92 | |
| 01 | 17/12/94 | |
| 02 | 17/12/93 | |
| 03 | 17/12/92 | |
+----+----------+------------+
INSERT INTO example(FK, DATE, MORE_FIELD) VALUES (01, now(), '...');
After insert on example (pseudocode):
IF (SELECT count(*) FROM example WHERE example.FK = NEW.FK) > N
THEN delete the oldest record for that FK;
If NEW.FK = 1 and N = 3, then the record (01 | 17/12/92) should be removed after the insert.
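A rough sketch of such a trigger, assuming the table really is called example and has columns fk and date_created (all names here are illustrative, and N is hard-coded as 3):
CREATE OR REPLACE FUNCTION trim_example_rows() RETURNS trigger AS $$
BEGIN
    -- Keep only the 3 newest rows for this foreign key, delete the rest.
    DELETE FROM example e
    WHERE e.fk = NEW.fk
      AND e.ctid NOT IN (SELECT e2.ctid
                         FROM example e2
                         WHERE e2.fk = NEW.fk
                         ORDER BY e2.date_created DESC
                         LIMIT 3);
    RETURN NULL;   -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trim_example_rows_trg
AFTER INSERT ON example
FOR EACH ROW
EXECUTE FUNCTION trim_example_rows();   -- on PostgreSQL 10 or older, use EXECUTE PROCEDURE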