I need to understand the algorithm used by Hive to hash partition data. For example, Spark uses Murmur Hashing. Any ideas or resources?
Partitions in Hive are folders: one folder per partition key value (the key can be composite), not hashed. Hive does not support other partitioning types such as hash or range.
But you can calculate a hash in SQL and use dynamic partitioning when writing the data. For example, using reflect() you can call a static Java method:
-- requires: SET hive.exec.dynamic.partition.mode = nonstrict;
-- target_table is a placeholder for the dynamically partitioned table
INSERT INTO TABLE target_table PARTITION (mycolumn)
SELECT ...,   -- the other columns, in the target table's column order
       reflect('org.apache.commons.codec.digest.DigestUtils', 'sha256Hex', mycolumn)
FROM mytable;
Hive also has the native functions hash(a1[, a2...]) returning int, sha2(string/binary, int), and mask_hash(string|char|varchar).
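For instance, a minimal sketch of deriving a partition value with those functions (table and column names are placeholders):

SELECT hash(col1, col2)           AS int_hash,
       sha2(col1, 256)            AS sha256_hex,
       mask_hash(col1)            AS masked,
       pmod(hash(col1, col2), 16) AS partition_bucket  -- fold the hash into 16 buckets
FROM mytable;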
Hive uses hashing for bucketing; buckets are files. See this question about hashing in buckets.
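For comparison, a bucketed table declares the hashing at DDL time; a minimal sketch, with all names as placeholders:

CREATE TABLE mytable_bucketed (
  id      BIGINT,
  payload STRING
)
CLUSTERED BY (id) INTO 32 BUCKETS  -- Hive hashes id to choose one of 32 bucket files
STORED AS ORC;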
Is it possible to alter an existing table in DB2 to add a hash partition? Something like...
ALTER TABLE EXAMPLE.TEST_TABLE
PARTITION BY HASH(UNIQUE_ID)
Thanks!
If you run a Db2-LUW local server on zLinux, the following syntax may be available:
ALTER TABLE .. ADD DISTRIBUTE BY HASH (...)
This syntax is not available if zLinux is only acting as a client of Db2-for-z/OS rather than running a Db2-LUW server.
For this syntax to be meaningful, there are various prerequisites. Refer to the documentation for details of partitioned instances, database partition groups, distribution key rules, default behaviours, etc.
The intention of distributed tables (spread over multiple physical and/or logical partitions of a partitioned database in a partitioned Db2-instance) is to exploit hardware capabilities. So if your Db2-instance, database, and tablespaces are not appropriately configured, this syntax has limited value.
Depending on your true motivations, partition by range may offer functionality that is useful. Note that partition by range can be combined with distribute by hash if the configuration is appropriate.
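A rough sketch of what that combination looks like at table-creation time on a suitably configured partitioned instance (columns and range values are placeholders, not your schema):

CREATE TABLE example.test_table (
  unique_id  BIGINT       NOT NULL,
  created_dt DATE         NOT NULL,
  payload    VARCHAR(200)
)
DISTRIBUTE BY HASH (unique_id)      -- spreads rows across database partitions
PARTITION BY RANGE (created_dt)
  (STARTING '2023-01-01' ENDING '2023-12-31' EVERY 1 MONTH);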
I am trying to find the best solution to build a database relation. I need to create a table that will contain data split across other tables from different databases. All the tables have exactly the same structure (same column count, names, and types).
In a single database, I would create a parent table with partitions. However, the volume of data is too big for a single database, which is why I am trying to split it. From the Postgres documentation, what I think I am trying to do is "Multiple-Server Parallel Query Execution".
At the moment, the only solution I can think of is to build an API that knows each database's address and uses it to pull data across the network into the main parent database when needed. I also found a Postgres extension called Citus that might do the job, but I don't know how to implement a unique key across multiple databases (or shards, as Citus calls them).
Is there any better way to do it?
Citus would most likely solve your problem. It lets you enforce unique keys across shards if the key is the distribution column, or if it is a composite key that contains the distribution column.
You can also use a distributed, partitioned table in Citus: a table partitioned on one column (a timestamp?) and hash-distributed on another column (like what you use in your existing approach). Citus handles query parallelization and data collection for you.
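A minimal sketch of that combination (table, column, and partition names are just assumptions):

-- parent table: range-partitioned by time, with the distribution column in the key
CREATE TABLE events (
  tenant_id  BIGINT      NOT NULL,
  event_time TIMESTAMPTZ NOT NULL,
  payload    JSONB,
  PRIMARY KEY (tenant_id, event_time)
) PARTITION BY RANGE (event_time);

CREATE TABLE events_2024_01 PARTITION OF events
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- hash-distribute the whole partitioned table across shards on tenant_id
SELECT create_distributed_table('events', 'tenant_id');

The primary key works across shards here because it contains the distribution column (tenant_id), as noted above.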
I'm looking for suggestions to approach this problem:
parallel queries using JDBC driver
big (in rows) Postgres table
there is no numeric column to be used as partitionColumn
I would like to read this big table using multiple parallel queries, but there is no obvious numeric column to partition the table on. I thought about using the physical location of the data via CTID, but I'm not sure whether I should follow that path.
The spark-postgres library provides several functions to read and load Postgres data. It uses the COPY statement under the hood, so it can handle large Postgres tables.
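If you stay with the stock JDBC data source, one common workaround (a sketch, assuming a text column named id_col; all names are placeholders) is to expose a derived numeric column in the query you hand to the reader and use it as partitionColumn with lowerBound 0, upperBound 16, and numPartitions 16:

-- hashtext() is Postgres' internal text hash; abs/mod fold it into the range 0..15
SELECT t.*,
       mod(abs(hashtext(t.id_col)), 16) AS part_id
FROM   big_table t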
The Spark 1.4.0 and higher source code seems to indicate that the subject of this post is not possible (except in a Spark-specific format).
/**
 * When the DataFrame is created from a non-partitioned [[HadoopFsRelation]] with a single input
 * path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC
 * and Parquet), the table is persisted in a Hive compatible format, which means other systems
 * like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL
 * specific format.
 */
def saveAsTable(tableName: String): Unit = {
I was wondering if there are systematic workarounds for this. Any Hive table worth its salt will be partitioned, for scalability and performance reasons, so this is the normal use case, not a corner case.
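One workaround I have seen is to let Spark write plain partitioned Parquet (via partitionBy) and then register that directory in Hive yourself; a sketch, with all paths and names as placeholders:

-- matches a layout like /warehouse/events/dt=2024-01-01/part-*.parquet
CREATE EXTERNAL TABLE events_hive (
  user_id BIGINT,
  payload STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION '/warehouse/events';

-- pick up the partition directories Spark created
MSCK REPAIR TABLE events_hive;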
I have a table with one or more columns of type CLOB.
This table contains duplicate rows.
Normal mechanisms like distinct and group by don't work for CLOBs in DB2.
How can I remove the duplicates on such tables?
One way of approaching this, especially if it is something you will need to do regularly, is to compare CLOB digests (hashes) instead of the CLOBs themselves.
DB2 does not have a built-in hash function available to you, so you'll need to jump through some hoops to accomplish that. For example, you could export CLOBs as files and calculate their hashes using an OS utility.
Alternatively, you could create a simple user-defined function written in Java (which has built-in MD5 and various SHA algorithm support). One such solution is described in detail here.
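Once such a UDF exists, the de-duplication itself is ordinary SQL; a sketch, assuming a Java UDF CLOB_MD5(CLOB) returning VARCHAR(32), a table MYTABLE with an ID primary key, and a CLOB column DOC (all names are assumptions):

-- keep the lowest ID per digest, delete the rest
DELETE FROM mytable
WHERE id NOT IN (
  SELECT MIN(id)
  FROM (SELECT id, clob_md5(doc) AS doc_hash FROM mytable) t
  GROUP BY doc_hash
);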
You could try the DBMS_LOB.COMPARE function to compare the contents of CLOB fields. DBMS_LOB is a built-in module. The supported CLOB size is up to 10 MB.
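For example, a pairwise check could look roughly like this (table and column names are placeholders; verify that DBMS_LOB.COMPARE is callable in a query in your DB2 release):

-- DBMS_LOB.COMPARE returns 0 when the two LOBs are identical
SELECT a.id, b.id
FROM   mytable a
JOIN   mytable b ON a.id < b.id
WHERE  DBMS_LOB.COMPARE(a.doc, b.doc) = 0;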