Can anyone explain the point of partitioning a Hive table?
If I create a table and partition it by date, it shows up in HDFS as a file name, or a sub-folder. What does this mean?
Can anyone please explain the concept?
You have loaded one partition, namely the "age equals 22" partition with your entire data set. Therefore, all rows in your table have an age of 22.
If you specify a partition in your statement, it will write into that partition. You might want dynamic partitions, where the partition values are pulled from the select statement itself.
In general, the purpose of partitioning in Hive is to improve performance and to structure the table to mirror known access patterns and usage - e.g. I always query my table by age.
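As a rough illustration (the table and column names below are invented, not from the question), a date-partitioned table and the two ways of loading it might look like this in HiveQL:

-- each distinct value of event_date becomes its own subdirectory under the table's HDFS location
CREATE TABLE events (id BIGINT, payload STRING)
PARTITIONED BY (event_date STRING);

-- static partitioning: the partition is named explicitly and all inserted rows land in it
INSERT OVERWRITE TABLE events PARTITION (event_date = '2016-01-01')
SELECT id, payload FROM staging_events WHERE dt = '2016-01-01';

-- dynamic partitioning: the partition value is taken from the last column of the SELECT
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE events PARTITION (event_date)
SELECT id, payload, dt FROM staging_events;

In HDFS you would then see directories like events/event_date=2016-01-01/ holding the data files for each partition - that is the "sub file" you noticed. A query that filters on event_date only has to read the matching directories instead of scanning the whole table, which is why the partitioning scheme should mirror your access patterns.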
Related
I'm using pg_partman to partition three of my tables and ended up with a large number of child tables.
Some users find it difficult to navigate with their database tool (DBeaver or SQuirreL) with this increasing number of tables showing up.
Is there a way to "hide" these tables from a user without changing their access rights to them?
You cannot hide the partitions, but you could put them in a different schema than the partitioned table. Then they are “hidden” if you only look at the schema with the partitioned table.
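A minimal sketch of that idea, with invented names (whether pg_partman can be pointed at a separate schema for its children depends on your version, so check its configuration):

-- the partitioned ("parent") table stays in the schema users browse
CREATE TABLE public.measurements (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

-- new child tables are created in a schema of their own
CREATE SCHEMA parts;
CREATE TABLE parts.measurements_2024_01
    PARTITION OF public.measurements
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- existing child tables can be moved out of the way as well
CREATE TABLE public.measurements_2024_02
    PARTITION OF public.measurements
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
ALTER TABLE public.measurements_2024_02 SET SCHEMA parts;

Queries against public.measurements behave exactly as before; only the catalog view of the parts schema gets cluttered.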
I bumped into the same question and didn't find a perfect solution. My workaround here is to filter out tables whose names match %prt% in the database tool.
I use count to calculate the number of rows in the RDD and got 13673153, but after I transferred the RDD to a DataFrame, inserted it into Hive, and counted again, I got 13673182. Why?
rdd.count
spark.sql("select count(*) from ...").show()
hive sql: select count(*) from ...
This could be caused by a mismatch between data in the underlying files and the metadata registered in hive for that table. Try running:
MSCK REPAIR TABLE tablename;
in Hive, and see if the issue is fixed. The command updates the partition information of the table. You can find more info in the Hive documentation.
During a Spark Action, as part of the SparkContext, Spark records which files are in scope for processing. So, if the DAG needs to recover and reprocess that Action, the same results are produced. By design.
Hive QL has no such considerations.
UPDATE
As you noted, the other answer did not help in this use case.
So, when Spark processes Hive tables it looks at the list of files that it will use for the Action.
In the case of a failure (node failure, etc.) it will recompute data from the generated DAG. If it needs to go back and re-compute as far as the start of reading from Hive itself, it will know which files to use - i.e. the same files - so that the same results are produced instead of non-deterministic outcomes. E.g. think of the partitioning aspects: handy that the same results can be recomputed!
It's that simple. It's by design. Hope this helps.
I have a table that stores information about weather for specific events and for specific timestamps. I do insert, update and select (more often than delete) on this table. All of my queries query on timestamp and event_id. Since this table is blowing up, I was considering doing table partitioning in postgres.
1) I could also think of having multiple tables and naming them "table_< event_id >_< timestamp >" to store specific timestamp information, instead of using Postgres declarative/inheritance partitioning. But I noticed that no one on the internet has done or written about any approach like this. Is there something I am missing?
2) I see that in Postgres partitioning, the data is kept both in the master and in the child tables. Why keep it in both places? It seems less efficient for inserts and updates to me.
3) Is there a generic limit on the number of tables at which Postgres will start to choke?
Thank you!
re 1) Don't do it. Why re-invent the wheel if the Postgres devs have already done it for you by providing declarative partitioning?
re 2) You are mistaken. The data is only kept in the partition to which it belongs. It just looks as if it is stored in the "master".
re 3) There is no built-in limit, but anything beyond a "few thousand" partitions is probably too much. It will still work, but query planning in particular will be slower. And sometimes query execution might also suffer, because runtime partition pruning is not as efficient any more.
Given your description, you probably want to do hash partitioning on the event ID and then create range sub-partitions on the timestamp value (so each hash partition for an event is again partitioned on timestamp ranges).
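A rough sketch of that two-level layout (all names and the modulus are invented):

-- level 1: hash partitions on the event id
CREATE TABLE weather (
    event_id bigint      NOT NULL,
    ts       timestamptz NOT NULL,
    reading  jsonb
) PARTITION BY HASH (event_id);

-- level 2: each hash partition is itself range-partitioned on the timestamp
CREATE TABLE weather_h0 PARTITION OF weather
    FOR VALUES WITH (MODULUS 4, REMAINDER 0)
    PARTITION BY RANGE (ts);

CREATE TABLE weather_h0_2024_01 PARTITION OF weather_h0
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- ...and likewise for remainders 1-3 and further time ranges

A query that filters on both event_id and ts can then be pruned down to a single leaf partition.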
I am trying to find the best solution to build a database relation. I need something to create a table that will contain data split across other tables from different databases. All the tables have exactly the same structure (same number of columns, names and types).
In a single database, I would create a parent table with partitions. However, the volume of the data is too big to keep it in a single database, which is why I am trying to split it. From the Postgres documentation, what I think I am trying to do is "Multiple-Server Parallel Query Execution".
At the moment the only solution I can think of is to build an API over the database addresses and use it to fetch data across the network into the main parent database when needed. I also found a Postgres extension called Citus that might do the job, but I don't know how to implement a unique key across multiple databases (or shards, as Citus calls them).
Is there any better way to do it?
Citus would most likely solve your problem. It lets you use unique keys across shards if the key is the distribution column, or if it is a composite key that contains the distribution column.
You can also use a distributed partitioned table in Citus. That is a table partitioned on some column (timestamp?) and hash-distributed on some other column (like the one you use in your existing approach). Query parallelization and data collection would be handled by Citus for you.
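A hedged sketch of such a distributed, partitioned table (names are invented; create_distributed_table is the standard Citus call, but check your Citus version's docs for any partition-management helpers you want on top):

-- range-partitioned on the timestamp, hash-distributed on device_id
CREATE TABLE readings (
    device_id bigint           NOT NULL,
    ts        timestamptz      NOT NULL,
    value     double precision,
    PRIMARY KEY (device_id, ts)  -- the unique key contains the distribution column (and the partition column)
) PARTITION BY RANGE (ts);

CREATE TABLE readings_2024_01 PARTITION OF readings
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- shard the whole partition hierarchy across the worker nodes
SELECT create_distributed_table('readings', 'device_id');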
To reduce processing time I partitioned my data by date so that I use only the required dates' data (not the complete table). So now in HDFS my tables are stored like below:
src_tbl        // main dir
  2016-01-01   // sub dir
  2016-01-02
  2016-01-03

trg_tbl        // main dir
  2015-12-30   // sub dir
  2015-12-31
  2016-01-01
  2016-01-03
Now I want to select min(date) from src_tbl, which will be 2016-01-01,
and from trg_tbl I want to use the data in the directories >= 2016-01-01 (the src_tbl min(date)), which will be the 2016-01-01 and 2016-01-03 data.
How can I select the required partitions or date folders from HDFS using Spark-Scala? After completing the process I need to overwrite the same date directories too.
Details about the process:
I want to choose the correct window of data (as all other date data is not required) from the source and target tables, then I want to do join -> lead/lag -> union -> write.
Spark SQL (including the DataFrame/Dataset APIs) is kind of funny in the way it handles partitioned tables with respect to retaining the existing partitioning info from one transformation/stage to the next.
For the initial loading, Spark SQL tends to do a good job of understanding how to retain the underlying partitioning information - if that information is available in the form of the Hive metastore metadata for the table.
So... are these Hive tables?
If so - so far so good - you should see the data loaded partition by partition according to the Hive partitions.
Will the DataFrame/Dataset remember this nice partitioning already set up?
Now things get a bit more tricky. The answer depends on whether a shuffle is required or not.
In your case - a simple filter operation - there should not be any need. So once again you should see the original partitioning preserved, and thus good performance. Please verify that the partitioning was indeed retained.
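One way to do that check (the table and column names below are placeholders, not taken from the question) is to look at the physical plan and confirm that the date predicate appears as a partition filter rather than as a plain post-scan filter:

// assumes a SparkSession built with Hive support
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

val src = spark.table("src_tbl")                      // Hive table partitioned by `dt` (placeholder name)
val windowed = src.filter(src("dt") >= "2016-01-01")  // filter only on the partition column

// If pruning is happening you should see something like "PartitionFilters: [(dt >= 2016-01-01)]"
// (or a pruned HiveTableScan) instead of a scan over every date directory.
windowed.explain(true)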
I will mention that if any aggregate functions are invoked then you can be assured your partitioning will be lost: Spark SQL will in that case use a HashPartitioner, inducing a full shuffle.
Update: The OP provided more details here: there is lead/lag and a join involved. Then he is well advised - from a strictly performance perspective - to avoid Spark SQL and do the operations manually.
To the OP: the only thing I can suggest at this point is to check that
preservesPartitioning=true
is set in your RDD operations. But I am not even sure that capability is exposed by Spark for the lag/lead: please check.
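For reference, at the RDD level that flag is an argument to transformations such as mapPartitions; a minimal, hypothetical sketch (someRdd, extractKey and process are placeholders for your own data and logic) would be:

import org.apache.spark.HashPartitioner

// preservesPartitioning only matters once the RDD is a key-value RDD with a partitioner
val keyed = someRdd.map(r => (extractKey(r), r))
                   .partitionBy(new HashPartitioner(8))

val result = keyed.mapPartitions(
  it => it.map { case (k, v) => (k, process(v)) },  // must not change the keys
  preservesPartitioning = true                      // tells Spark the existing partitioner still applies
)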