I'm using Hive 3.1.2 and tried to create a bucketed table with bucketing version = 2.
When I created the buckets and checked the bucket files using hdfs dfs -cat, I could see that the hashing results were different between the two engines.
Are the hash algorithms of Tez and MR different? Shouldn't they be the same if bucketing version = 2?
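For reference, one way to confirm that the table really picked up bucketing_version=2 is to check the table parameters (the exact output layout varies by Hive version):
DESCRIBE FORMATTED bucket_test;
-- "Table Parameters" should list bucketing_version=2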
Here's the test method and its results.
1. Create Bucket table & Data table
CREATE EXTERNAL TABLE `bucket_test`(
`id` int COMMENT ' ',
`name` string COMMENT ' ',
`age` int COMMENT ' ',
`phone` string COMMENT ' ')
CLUSTERED BY (id, name, age) SORTED BY(phone) INTO 2 Buckets
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
TBLPROPERTIES (
'bucketing_version'='2',
'orc.compress'='ZLIB');
CREATE TABLE data_table (id int, name string, age int, phone string)
row format delimited fields terminated by ',';
2. Insert data into DATA_TABLE
INSERT INTO TABLE data_table
select stack
( 20
,1, 'a', 11, '111'
,1,'a',11,'222'
,3,'b',14,'333'
,3,'b',13,'444'
,5,'c',18,'555'
,5,'c',18,'666'
,5,'c',21,'777'
,8,'d',23,'888'
,9,'d',24,'999'
,10,'d',26,'1110'
,11,'d',27,'1112'
,12,'e',28,'1113'
,13,'f',28,'1114'
,14,'g',30,'1115'
,15,'q',31,'1116'
,16,'w',32,'1117'
,17,'e',33,'1118'
,18,'r',34,'1119'
,19,'t',36,'1120'
,20,'y',36,'1130')
3. Create Bucket with MR
set hive.enforce.bucketing = true;
set hive.execution.engine = mr;
set mapreduce.job.queuename=root.test;
Insert overwrite table bucket_test
select * from data_table ;
4. Check Bucket contents
# bucket0 : 6 rows
[root@test ~]# hdfs dfs -cat /user/hive/warehouse/bucket_test/000000_0
10d261110
11d271112
18r341119
3b13444
5c18555
5c18666
# bucket1 : 14 rows
[root@test ~]# hdfs dfs -cat /user/hive/warehouse/bucket_test/000001_0
1a11111
12e281113
13f281114
14g301115
15q311116
16w321117
17e331118
19t361120
20y361130
1a11222
3b14333
5c21777
8d23888
9d24999
5. Create Bucket with Tez
set hive.enforce.bucketing = true;
set hive.execution.engine = tez;
set tez.queue.name=root.test;
Insert overwrite table bucket_test
select * from data_table ;
6. Check Bucket contents
# bucket0 : 11 rows
[root@test ~]# hdfs dfs -cat /user/hive/warehouse/bucket_test/000000_0
1a11111
10d261110
11d271112
13f281114
16w321117
17e331118
18r341119
20y361130
1a11222
5c18555
5c18666
# bucket1 : 9 rows
[root@test ~]# hdfs dfs -cat /user/hive/warehouse/bucket_test/000001_0
12e281113
14g301115
15q311116
19t361120
3b14333
3b13444
5c21777
8d23888
9d24999
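As a cross-check (not part of the test output above), the per-bucket row counts can also be queried from Hive itself through the INPUT__FILE__NAME virtual column; a minimal sketch:
SELECT INPUT__FILE__NAME AS bucket_file, count(*) AS row_cnt
FROM bucket_test
GROUP BY INPUT__FILE__NAME;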
Related
I am working on a project using Flink and Iceberg to write data from Kafka to an Iceberg Hive table, or to HDFS using the Hadoop catalog. When I publish a message to Kafka I can see the message in the Kafka table, but no file is added in HDFS and no row is added in the Hive table.
Files are added only if I cancel the job; then I can see the files in HDFS. What is the reason? Why can I not see one file per Kafka message in HDFS? Also, I cannot see data in the sink table (flink_iceberg_tbl3).
The code I used is below:
CREATE CATALOG hadoop_iceberg WITH (
'type'='iceberg',
'catalog-type'='hadoop',
'warehouse'='hdfs://localhost:9000/flink_iceberg_warehouse'
);
create table hadoop_iceberg.iceberg_db.flink_iceberg_tbl3
(id int
,name string
,age int
,loc string
) partitioned by(loc);
create table kafka_input_table(
id int,
name varchar,
age int,
loc varchar
) with (
'connector' = 'kafka',
'topic' = 'test_topic',
'properties.bootstrap.servers'='localhost:9092',
'scan.startup.mode'='latest-offset',
'properties.group.id' = 'my-group-id',
'format' = 'csv'
);
ALTER TABLE kafka_input_table SET ('scan.startup.mode'='earliest-offset');
set table.dynamic-table-options.enabled = true;
insert into hadoop_iceberg.iceberg_db.flink_iceberg_tbl3 select id,name,age,loc from kafka_input_table;
select * from hadoop_iceberg.iceberg_db.flink_iceberg_tbl3 /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s')*/;
I tried following many blogs, but I have the same issue.
I have to build a Glue job for updating and deleting old rows in an Athena table.
When I run my job for deleting, it returns an error:
AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;'
My Glue Job:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table", transformation_ctx = "datasource0")
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table_output", transformation_ctx = "datasource1")
datasource0.toDF().createOrReplaceTempView("view_dyf")
datasource1.toDF().createOrReplaceTempView("view_dyf_output")
ds = spark.sql("SELECT * FROM view_dyf_output where id in (select id from view_dyf where op like 'D')")
hudi_delete_options = {
'hoodie.table.name': 'test_table_output',
'hoodie.datasource.write.recordkey.field': 'id',
'hoodie.datasource.write.table.name': 'test_table_output',
'hoodie.datasource.write.operation': 'delete',
'hoodie.datasource.write.precombine.field': 'name',
'hoodie.upsert.shuffle.parallelism': 1,
'hoodie.insert.shuffle.parallelism': 1
}
from pyspark.sql.functions import lit
deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
df = spark.sparkContext.parallelize(deletes).toDF(['id']).withColumn('name', lit(0.0))
df.write.format("hudi"). \
options(**hudi_delete_options). \
mode("append"). \
save('s3://data/test-output/')
roAfterDeleteViewDF = spark. \
read. \
format("hudi"). \
load("s3://data/test-output/")
roAfterDeleteViewDF.registerTempTable("test_table_output")
spark.sql("SELECT * FROM view_dyf_output where id in (select distinct id from view_dyf where op like 'D')").count()
I have 2 data sources: the first is the old Athena table where data has to be updated or deleted, and the second is the table in which the new updated or deleted data arrives.
In ds I have selected all rows that have to be deleted from the old table.
op is the operation: 'D' for delete, 'U' for update.
Does anyone know what I am missing here?
The value for hoodie.datasource.write.operation is invalid in your code; the supported write operations are UPSERT/INSERT/BULK_INSERT. Check the Hudi docs.
Also, what is your intention for deleting records: hard delete or soft delete?
For a hard delete, you have to provide
{'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload'}
I have Azure Data Factory, which reads a CSV via an HTTP connection and stores the data to Azure Storage Gen2. The file encoding is UTF-8. It seems like the file gets somehow corrupted because of the polygon definitions.
The file content is the following:
Shape123|"MULTIPOLYGON (((496000 6908000, 495000 6908000, 495000 6909000, 496000 6909000, 496000 6908000)))"|"Red"|"Long"|"208336"|"5"|"-1"
Problem 1:
Polybase complains about the encoding and cannot read the file.
Problem 2:
A Databricks data frame cannot handle this; it cuts the row and reads only "Shape123|"MULTIPOLYGON (((496000 6908000,"
Quick solution:
Open the CSV file with Notepad++ and re-save it with the encoding confirmed as UTF-8. Then Polybase is able to handle it.
Questions:
What is an automatic way to fix the CSV file?
How can the data frame be made to handle the entire row if the CSV file cannot be fixed?
Polybase can cope perfectly well with UTF8 files and various delimiters. Did you create an external file format with pipe delimiter, double-quote as string delimiter, something like this?
CREATE EXTERNAL FILE FORMAT ff_pipeFileFormatSHAPE
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = '|',
STRING_DELIMITER = '"',
ENCODING = 'UTF8'
)
);
GO
CREATE EXTERNAL TABLE shape_data (
col1 VARCHAR(20),
col2 VARCHAR(8000),
col3 VARCHAR(20),
col4 VARCHAR(20),
col5 VARCHAR(20),
col6 VARCHAR(20),
col7 VARCHAR(20)
)
WITH (
LOCATION = 'yourPath/shape/shape working.txt',
DATA_SOURCE = ds_azureDataLakeStore,
FILE_FORMAT = ff_pipeFileFormatSHAPE,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
My results:
We have a table with 533 columns, with a lot of LOB columns, that has to be moved to Snowflake. Since our source transformation system has an issue managing 533 columns in one job, we have split the columns into 2 jobs. The first job will insert 283 columns and the second job needs to update the remaining columns.
We are using one COPY command and one upsert (MERGE) command respectively for these two jobs.
copy command
copy into "ADIUATPERF"."APLHSTRO"."FNMA1004_APPR_DEMO" (283 columns) from #"ADIUATPERF"."APLHSTRO".fnma_talend_poc/jos/outformatted.csv
--file_format = (format_name = '"ADIUATPERF"."APLHSTRO".CSV_DQ_HDR0_NO_ESC_CARET');
FILE_FORMAT = (DATE_FORMAT='dd-mm-yyyy', TIMESTAMP_FORMAT='dd-mm-yyyy',TYPE=CSV, ESCAPE_UNENCLOSED_FIELD = NONE,
SKIP_HEADER=1, field_delimiter ='|', RECORD_DELIMITER = '\\n', FIELD_OPTIONALLY_ENCLOSED_BY = '"',
NULL_IF = ('')) PATTERN='' on_error = 'CONTINUE',FORCE=true;
Upsert command
MERGE INTO db.schema._table as target
USING
(SELECT t.$1
from #"ADIUATPERF"."APLHSTRO".fnma_talend_poc/jos/fnma1004_appr.csv
--file_format = (format_name = '"ADIUATPERF"."APLHSTRO".CSV_DQ_HDR0_NO_ESC_CARET');
(FILE_FORMAT =>'CSV', ESCAPE_UNENCLOSED_FIELD => NONE,
SKIP_HEADER=>1, field_delimiter =>'|', RECORD_DELIMITER => '\\n', FIELD_OPTIONALLY_ENCLOSED_BY => '"',
NULL_IF => (''))
) source ON target.document_id = source.document_id
WHEN MATCHED THEN
--update lst_updated
UPDATE SET <columns>=<values>;
I would like to know if we have any other options?
I would recommend that you run the COPY INTO for both of your split files into temp/transient tables first, and then execute a single CTAS statement joining those 2 tables on document_id. Don't MERGE from a flat file. You could, optionally, MERGE the 2nd temp table into the first table (not temp) if you wished, but I think a straight CTAS from the 2 "half" tables might be faster for you.
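A rough sketch of that approach, with placeholder temp table and column names (the real column lists and the file formats should match the COPY above):
CREATE TEMPORARY TABLE appr_part1 (document_id VARCHAR, col_a VARCHAR /* ...rest of the 283 columns... */);
CREATE TEMPORARY TABLE appr_part2 (document_id VARCHAR, col_b VARCHAR /* ...the remaining columns... */);

COPY INTO appr_part1
FROM @"ADIUATPERF"."APLHSTRO".fnma_talend_poc/jos/outformatted.csv
FILE_FORMAT = (TYPE = CSV, SKIP_HEADER = 1, FIELD_DELIMITER = '|', FIELD_OPTIONALLY_ENCLOSED_BY = '"')
ON_ERROR = 'CONTINUE' FORCE = TRUE;

COPY INTO appr_part2
FROM @"ADIUATPERF"."APLHSTRO".fnma_talend_poc/jos/fnma1004_appr.csv
FILE_FORMAT = (TYPE = CSV, SKIP_HEADER = 1, FIELD_DELIMITER = '|', FIELD_OPTIONALLY_ENCLOSED_BY = '"')
ON_ERROR = 'CONTINUE' FORCE = TRUE;

-- One CTAS joining the two halves; list part2's columns explicitly so document_id is not duplicated.
CREATE OR REPLACE TABLE "ADIUATPERF"."APLHSTRO"."FNMA1004_APPR_DEMO" AS
SELECT p1.*, p2.col_b /* , ...the rest of part2's columns... */
FROM appr_part1 p1
JOIN appr_part2 p2
  ON p1.document_id = p2.document_id;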
I have the following heap of text:
"BundleSize,155648,DynamicSize,204800,Identifier,com.URLConnectionSample,Name,
URLConnectionSample,ShortVersion,1.0,Version,1.0,BundleSize,155648,DynamicSize,
16384,Identifier,com.IdentifierForVendor3,Name,IdentifierForVendor3,ShortVersion,
1.0,Version,1.0,".
What I'd like to do is extract data from this in the following manner:
BundleSize:155648
DynamicSize:204800
Identifier:com.URLConnectionSample
Name:URLConnectionSample
ShortVersion:1.0
Version:1.0
BundleSize:155648
DynamicSize:16384
Identifier:com.IdentifierForVendor3
Name:IdentifierForVendor3
ShortVersion:1.0
Version:1.0
All tips and suggestions are welcome.
It isn't quite clear what you need to do with this data. If you really need to process it entirely in the database (it looks like a task for your favorite scripting language instead), one option is to use hstore.
Converting records one by one is easy:
Assuming
%s =
BundleSize,155648,DynamicSize,204800,Identifier,com.URLConnectionSample,Name,URLConnectionSample,ShortVersion,1.0,Version,1.0
SELECT * FROM each(hstore(string_to_array(%s, ',')));
Output:
key | value
--------------+-------------------------
Name | URLConnectionSample
Version | 1.0
BundleSize | 155648
Identifier | com.URLConnectionSample
DynamicSize | 204800
ShortVersion | 1.0
If you have a table with columns exactly matching the field names (note the quotes; populate_record is case-sensitive to key names):
CREATE TABLE data (
"BundleSize" integer, "DynamicSize" integer, "Identifier" text,
"Name" text, "ShortVersion" text, "Version" text);
You can insert hstore records into it like this:
INSERT INTO data SELECT * FROM
populate_record(NULL::data, hstore(string_to_array(%s, ',')));
Things get more complicated if you have comma-separated values for more than one record.
%s = BundleSize,155648,DynamicSize,204800,Identifier,com.URLConnectionSample,Name,URLConnectionSample,ShortVersion,1.0,Version,1.0,BundleSize,155648,DynamicSize,16384,Identifier,com.IdentifierForVendor3,Name,IdentifierForVendor3,ShortVersion,1.0,Version,1.0,
You need to break up an array into chunks of number_of_fields * 2 = 12 elements first.
SELECT hstore(row) FROM (
SELECT array_agg(str) AS row FROM (
SELECT str, row_number() OVER () AS i FROM
unnest(string_to_array(%s, ',')) AS str
) AS str_sub
GROUP BY (i - 1) / 12) AS row_sub
WHERE array_length(row, 1) = 12;
Output:
"Name"=>"URLConnectionSample", "Version"=>"1.0", "BundleSize"=>"155648", "Identifier"=>"com.URLConnectionSample", "DynamicSize"=>"204800", "ShortVersion"=>"1.0"
"Name"=>"IdentifierForVendor3", "Version"=>"1.0", "BundleSize"=>"155648", "Identifier"=>"com.IdentifierForVendor3", "DynamicSize"=>"16384", "ShortVersion"=>"1.0"
And inserting this into the aforementioned table:
INSERT INTO data SELECT (populate_record(NULL::data, hstore(row))).* FROM ...
The rest of the query is the same as in the previous example.
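Put together, the full statement would look roughly like this (the ORDER BY i inside array_agg is an addition here to make the element order within each chunk explicit):
INSERT INTO data
SELECT (populate_record(NULL::data, hstore(row))).* FROM (
    SELECT array_agg(str ORDER BY i) AS row FROM (
        SELECT str, row_number() OVER () AS i
        FROM unnest(string_to_array(%s, ',')) AS str
    ) AS str_sub
    GROUP BY (i - 1) / 12
) AS row_sub
WHERE array_length(row, 1) = 12;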