Can we have partitions within partition in a Hive table? - hiveql

Can we do partitions within partitions in a Hive table?
I mean, can we partition a partitioned table? Or is bucketing the only option in Hive tables?

Hive supports multiple levels of partitioning. But keep in mind that having more than a single level of partitioning in Hive is almost never a good idea. HDFS is really optimized for manipulating large files, ~100MB and larger. Each partition of a Hive table is an HDFS directory, and there are normally multiple files in each of these directories. You should really be closing in on a petabyte of data before multiple levels of partitioning in a Hive table make sense.
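For illustration, a two-level partitioned table can be declared like this (a minimal sketch run through Spark SQL with Hive support; the table and column names are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Two partition levels: year first, then month. Each (year, month) pair
// becomes its own HDFS directory, e.g. .../sales/year=2023/month=7/
spark.sql("""
  CREATE TABLE sales (id BIGINT, amount DOUBLE)
  PARTITIONED BY (year INT, month INT)
  STORED AS PARQUET
""")

// Static-partition insert into one (year, month) directory
spark.sql("INSERT INTO sales PARTITION (year = 2023, month = 7) VALUES (1, 9.99)")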
What problem are you trying to solve? I'm sure we can find a sensible solution for it.

Related

Design stream pipeline using spark structured streaming and databricks delta to handle multiple tables

I am designing a streaming pipeline where I need to consume events from a Kafka topic.
A single Kafka topic can carry data from around 1000 tables, with the data arriving as JSON records. Now I have the problems below to solve.
Reroute messages to a separate folder based on their table: this is done using Spark Structured Streaming's partitionBy on the table name (a rough sketch of this step follows below).
Second, I want to parse each JSON record, attach the appropriate table schema to it, and create/append/update the corresponding Delta table. I am not able to find the best solution for inferring the JSON schema dynamically and writing to the Delta table dynamically, given that the records arrive as JSON strings. How can this be done?
As I have to process so many tables, do I need to write that many streaming queries? How can this be solved?
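For context, the rerouting in the first step looks roughly like this (a minimal sketch; the broker address, topic name, JSON field holding the table name, and all paths are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()

// Read the raw JSON strings from the single multi-table topic
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "all_tables_topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")

// Route each record into a per-table folder by partitioning on the table name
raw
  .withColumn("table_name", get_json_object(col("json"), "$.table"))
  .writeStream
  .format("delta")
  .partitionBy("table_name")
  .option("checkpointLocation", "/checkpoints/reroute")
  .start("/data/raw_by_table")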
Thanks

Flink Table and Hive Catalog storage

I have a Kafka topic and a Hive Metastore. I want to join the incoming events from the Kafka topic with records from the metastore. I saw that Flink offers the possibility to use a catalog to query the Hive Metastore.
So I see two ways to handle this:
using the DataStream API to consume the Kafka topic and query the Hive catalog one way or another in a processFunction or something similar
using the Table API, where I would create a table from the Kafka topic and join it with the Hive catalog
My biggest concerns are storage-related.
In both cases, what is stored in memory and what is not? Does the Hive catalog store anything on the Flink cluster's side?
In the second case, how is the table handled? Does Flink create a copy?
Which solution seems best? (Maybe both or neither are good choices.)
Different methods suit different scenarios, depending in part on whether your Hive table is a static table or a dynamic table.
If your Hive table is only a dimension table, you can try this chapter:
joins-in-continuous-queries
It will automatically pick up the latest Hive partition, and it is suitable for scenarios where dimension data is updated slowly.
But note that this feature is not supported by the legacy planner.
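A minimal sketch of that approach with the Table API, assuming a Hive catalog named "myhive", a Kafka-backed table "orders" with a processing-time attribute proc_time, and a Hive dimension table dim_customers (all names are placeholders; the Hive table also needs the lookup/temporal-join options described in the linked chapter, and exact API calls and option names depend on your Flink version):

import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}
import org.apache.flink.table.catalog.hive.HiveCatalog

val tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode())

// Register the Hive Metastore as a catalog so its tables can be referenced in SQL
tEnv.registerCatalog("myhive", new HiveCatalog("myhive", "default", "/etc/hive/conf"))

// Temporal join of the Kafka stream against the (latest partition of the) Hive dimension table
tEnv.executeSql("""
  SELECT o.order_id, o.amount, d.customer_name
  FROM orders AS o
  JOIN myhive.`default`.dim_customers FOR SYSTEM_TIME AS OF o.proc_time AS d
  ON o.customer_id = d.customer_id
""")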

Does Spark maintain parquet partitioning on read?

I am having a lot of trouble finding the answer to this question. Let's say I write a dataframe to parquet, and I use repartition combined with partitionBy to get a nicely partitioned parquet file. See below:
df.repartition(col("DATE")).write.partitionBy("DATE").parquet("/path/to/parquet/file")
Now later on I would like to read the parquet file so I do something like this:
val df = spark.read.parquet("/path/to/parquet/file")
Is the dataframe partitioned by "DATE"? In other words, if a parquet file is partitioned, does Spark maintain that partitioning when reading it into a Spark dataframe, or is it randomly partitioned?
An explanation of why (or why not) would be helpful as well.
The number of partitions acquired when reading data stored as parquet follows many of the same rules as reading partitioned text:
1. If SparkContext.minPartitions >= partitions count in data, SparkContext.minPartitions will be returned.
2. If partitions count in data >= SparkContext.parallelism, SparkContext.parallelism will be returned, though in some very small partition cases, #3 may be true instead.
3. Finally, if the partitions count in data is somewhere between SparkContext.minPartitions and SparkContext.parallelism, generally you'll see the partitions reflected in the dataset partitioning.
Note that it's rare for a partitioned parquet file to have full data locality for a partition, meaning that, even when the partitions count in data matches the read partition count, there is a strong likelihood that the dataset should be repartitioned in memory if you're trying to achieve partition data locality for performance.
Given your use case above, I'd recommend immediately repartitioning on the "DATE" column if you're planning to leverage partition-local operations on that basis. The above caveats regarding minPartitions and parallelism settings apply here as well.
val df = spark.read.parquet("/path/to/parquet/file")
val repartitioned = df.repartition(col("DATE"))
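If you want to verify what you actually got, the in-memory partition count can be inspected directly:

// Number of in-memory partitions before and after the explicit repartition
println(df.rdd.getNumPartitions)
println(repartitioned.rdd.getNumPartitions)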
You would get a number of partitions based on the Spark config spark.sql.files.maxPartitionBytes, which defaults to 128MB, and the data would not be partitioned in memory according to the partition column that was used while writing.
Reference https://spark.apache.org/docs/latest/sql-performance-tuning.html
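For example, the setting can be checked or overridden per session before the read (the 256MB value below is just an example):

// Default is 134217728 bytes (128MB) per input split/partition
println(spark.conf.get("spark.sql.files.maxPartitionBytes"))
spark.conf.set("spark.sql.files.maxPartitionBytes", "268435456") // raise to 256MB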
In your question, there are two ways we could say the data are being "partitioned", which are:
via repartition, which uses a hash partitioner to distribute the data into a specific number of partitions. If, as in your question, you don't specify a number, the value in spark.sql.shuffle.partitions is used, which has default value 200. A call to .repartition will usually trigger a shuffle, which means the partitions are now spread across your pool of executors.
via partitionBy, which is a method specific to a DataFrameWriter that tells it to partition the data on disk according to a key. This means the data written are split across subdirectories named according to your partition column, e.g. /path/to/parquet/file/DATE=<individual DATE value>. In this example, only rows with a particular DATE value are stored in each DATE= subdirectory.
Given these two uses of the term "partitioning," there are subtle aspects in answering your question. Since you used partitionBy and asked if Spark "maintains the partitioning", I suspect what you're really curious about is whether Spark will do partition pruning, which is a technique used to drastically improve the performance of queries that have filters on a partition column. If Spark knows the values you seek cannot be in specific subdirectories, it won't waste any time reading those files, and hence your query completes much quicker.
If the way you're reading the data isn't partition-aware, you'll get a number of partitions something like what's in bsplosion's answer. Spark won't employ partition pruning, and hence you won't get the benefit of Spark automatically skipping the read of certain files to speed things up¹.
Fortunately, reading parquet files in Spark that were written with partitionBy is a partition-aware read. Even without a metastore like Hive that tells Spark the files are partitioned on disk, Spark will discover the partitioning automatically. Please see partition discovery in Spark for how this works in parquet.
I recommend testing reading your dataset in spark-shell so that you can easily see the output of .explain, which will let you verify that Spark correctly finds the partitions and can prune out the ones that don't contain data of interest in your query. A nice writeup on this can be found here. In short, if you see PartitionFilters: [], it means that Spark isn't doing any partition pruning. But if you see something like PartitionFilters: [isnotnull(date#3), (date#3 = 2021-01-01)], Spark is only reading in a specific set of DATE partitions, and hence the query execution is usually a lot faster.
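For example, something along these lines in spark-shell (the DATE literal is just an example value):

import org.apache.spark.sql.functions.col

val df = spark.read.parquet("/path/to/parquet/file")
// Look at the FileScan node in the output: a non-empty PartitionFilters list
// means Spark is pruning DATE subdirectories instead of scanning all of them.
df.filter(col("DATE") === "2021-01-01").explain()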
¹ A separate detail is that Parquet stores statistics about the data in its columns inside of the files themselves. If these statistics can be used to eliminate chunks of data that can't match whatever filtering you're doing, e.g. on DATE, then you'll see some speedup even if the way you read the data isn't partition-aware. This is called predicate pushdown. It works because the files on disk will still contain only specific values of DATE when using .partitionBy. More info can be found here.

SnappyData table definitions using partition keys

I was reading through the documentation (http://snappydatainc.github.io/snappydata/streamingWithSQL/) and had a question about this item:
"Reduced shuffling through co-partitioning: With SnappyData, the partitioning key used by the input queue (e.g., for Kafka sources), the stream processor and the underlying store can all be the same. This dramatically reduces the need to shuffle records."
If we are using Kafka and partition our data in a topic using a key (a single value), is it possible to map this single key from Kafka to the multiple partition keys identified in the Snappy table?
Is there a hash of some sort to turn multiple keys into a single key?
The benefit of reduced shuffling seems significant and trying to understand the best practice here.
thanks!
With a DirectKafka stream, each partition pulls the data from its own designated topic. If no partitioning is specified for the storage table, then each DirectKafka partition will put only to local storage buckets, and everything will line up well without requiring anything extra. The only thing to take care of is having enough topics (and thus partitions) for better concurrency -- ideally at least as many as the total number of processor cores in the cluster, so all cores stay busy.
When partitioning storage tables explicitly, SnappyData's store has been adjusted to use the same hashing as Spark's HashPartitioning (for the "PARTITION_BY" option of both column and row tables), since that is the one used at the Catalyst SQL execution layer. So execution and storage are always collocated.
However, aligning that with ingestion from DirectKafka partitions will require some manual work (align the Kafka topic partitioning with HashPartitioning, then have the preferred locations for each DirectKafka partition match the storage). This will be simplified in coming releases.
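As a rough illustration of the PARTITION_BY option mentioned above (table, column, and session names are placeholders; check the SnappyData docs for the exact OPTIONS your release supports):

import org.apache.spark.sql.SnappySession

val snappy = new SnappySession(spark.sparkContext)

// Column table explicitly partitioned on event_key; the storage buckets then use
// the same hashing as Spark's HashPartitioning, so execution and storage stay collocated.
snappy.sql("""
  CREATE TABLE events (
    event_key VARCHAR(64),
    payload   VARCHAR(1024)
  ) USING column
  OPTIONS (PARTITION_BY 'event_key')
""")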

Hbase data on Slaves

If I use an HBase cluster, does every slave have the same data, or is it partitioned?
What are the best practices?
HBase rows are ordered by the row key and automatically split into ranges called regions; each region server handles some of the regions, so the data is partitioned across the slaves rather than replicated in full on every one (see this question for more details).
You can let HBase control the splitting, or pre-split the table yourself to control the load on the cluster.
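If you do want to pre-split, here is a sketch with the HBase client API (table name, column family, and split keys are made up):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}
import org.apache.hadoop.hbase.util.Bytes

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val admin = connection.getAdmin

val table = TableDescriptorBuilder
  .newBuilder(TableName.valueOf("events"))
  .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
  .build()

// Pre-split the table into four regions at these row-key boundaries,
// so the load is spread across region servers from the start.
val splitKeys = Array(Bytes.toBytes("2020"), Bytes.toBytes("2021"), Bytes.toBytes("2022"))
admin.createTable(table, splitKeys)

admin.close()
connection.close()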