spark hive get external partition file location - scala

I'm trying to get the file location of an external table's partition, calculated at run time. Simply dropping the partition with alter table wouldn't work since the table is external. The closest I can get is:
spark.sql(s"describe $tableName partition ($partitionBy=$partitionValue)")
But this fails when partitionValue is a timestamp that has been converted to a string with toString before calling the above function. Is there a way to reuse the same logic that spark.write.saveAsTable() uses to build the file path? Or is there a way to get the location of the data for a partition at runtime?

Try something like:
show table extended FROM $tableSchema like $tableName partition ($partitionBy=$partitionValue)
See more here:
https://spark.apache.org/docs/3.0.0/sql-ref-syntax-aux-show-table.html
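For reference, a rough Scala sketch of reading that location back at runtime: the partition path appears as a "Location:" line inside the information column of the statement's output. The quoting of the table pattern and of the partition value below is an assumption; adjust it to your partition type.
// Minimal sketch, not a drop-in solution: run SHOW TABLE EXTENDED for the
// partition and pull the "Location:" line out of the `information` column.
val info = spark
  .sql(s"show table extended from $tableSchema like '$tableName' " +
       s"partition ($partitionBy='$partitionValue')")
  .select("information")
  .head()                      // throws if the partition does not exist
  .getString(0)

// `information` is a newline-separated block, e.g.
// "Partition Values: ...\nLocation: hdfs://.../d=2021-01-01\n..."
val partitionLocation: Option[String] = info
  .split("\n")
  .find(_.startsWith("Location:"))
  .map(_.stripPrefix("Location:").trim)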

Related

Change Databricks delta table type to external

I have a MANAGED table in delta format in Databricks and I want to change it to EXTERNAL to make sure dropping the table would not affect the data. However, the following code did not change the table TYPE and just added a new table property. How can I correctly convert my managed table to an external table?
%sql
alter table db_delta.table1 SET TBLPROPERTIES('EXTERNAL'='TRUE')
Describe Table:
# Detailed Table Information
Name                db_delta.table1
Location            dbfs:/user/hive/warehouse/db_delta.db/table1
Provider            delta
Type                MANAGED
Table Properties    [EXTERNAL=TRUE, overwriteSchema=true]
I found the following workaround for the above scenario.
1. Copy the managed table's data to the external location:
dbutils.fs.cp('dbfs:/user/hive/warehouse/amazon_data_agg','abfss://data@amazondata.dfs.core.windows.net/amzon_aggred/',True)
2. Now drop the managed table:
drop table amazon_data_agg;
3. Now create the external table using the schema of the already-created table; if there is a schema mismatch you will get an error.
4. Now you can append and do all other operations:
df_agg.write.format('delta').mode('append').option('path','abfss://data@amazondata.dfs.core.windows.net/amzon_aggred/').saveAsTable('amazon_data_agg')
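If you prefer to make step 3 explicit instead of letting saveAsTable create the table, one way is to register the copied location directly. This is only a sketch and assumes the files copied above form a valid Delta table at that path.
// Register the copied Delta directory as an external (unmanaged) table.
spark.sql("""
  CREATE TABLE IF NOT EXISTS amazon_data_agg
  USING DELTA
  LOCATION 'abfss://data@amazondata.dfs.core.windows.net/amzon_aggred/'
""")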

Load data with default values into Redshift from a parquet file

I need to load data with a default value column into Redshift, as outlined in the AWS docs.
Unfortunately the COPY command doesn't allow loading data with default values from a parquet file, so I need to find a different way to do that.
My table requires a column that defaults to Redshift's GETDATE() function:
LOAD_DT TIMESTAMP DEFAULT GETDATE()
If I use the COPY command and add the column names as arguments I get the error:
Column mapping option argument is not supported for PARQUET based COPY
What is a workaround for this?
Can you post a reference for Redshift not supporting default values for a Parquet COPY? I haven't heard of this restriction.
As for workarounds, I can think of two:
1. Copy the file to a temp table and then insert from this temp table into your table with the default value.
2. Define an external table that uses the parquet file as its source and insert from this table into the table with the default value.

Hive create partitioned table based on Spark temporary table

I have a Spark temporary view spark_tmp_view with a DATE_KEY column. I am trying to create a partitioned Hive table from it (without first writing the temp view out to a parquet location). What I have tried to run is spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.result AS SELECT * FROM spark_tmp_view PARTITIONED BY(DATE_KEY DATE)")
The error I got is mismatched input 'BY' expecting <EOF>. I tried to search but still haven't been able to figure out how to do it from a Spark app, and how to insert data afterwards. Could someone please help? Many thanks.
PARTITIONED BY is part of the definition of the table being created, so it should precede ...AS SELECT...; see the Spark SQL syntax.
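For example, a rough Spark SQL sketch; the LOCATION path is hypothetical, and with a data-source (USING) table the PARTITIONED BY clause lists only column names.
// Minimal sketch, assuming mydb exists and spark_tmp_view exposes DATE_KEY.
// PARTITIONED BY has to come before AS SELECT; the LOCATION (hypothetical
// path) makes the table external.
spark.sql("""
  CREATE TABLE IF NOT EXISTS mydb.result
  USING PARQUET
  PARTITIONED BY (DATE_KEY)
  LOCATION '/warehouse/external/mydb/result'
  AS SELECT * FROM spark_tmp_view
""")
Later loads can then use INSERT INTO mydb.result SELECT ..., listing the columns explicitly so they line up with the created table's schema (Spark may move the partition column to the end).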

KSQL table get old and new value

Is it possible in KSQL to stream out the old and new values from a table? We'd like to use a table as a store of values and, when one changes, stream out a "reversal" value (the previous value, tagged in some way) together with the new value, so that we can handle just the delta in downstream systems.
Kafka tables are generally used for storing the latest values. So, for example, if a record with key '123' exists in the table and a new record with the same key '123' but a different column value arrives on the topic, it will override (upsert) the existing value in the table.
So it's probably not a great idea to do this on a table.
Your use case is not entirely clear to me, but my suggestion would be to have some mechanism, either in the source of the stream or based on a timestamp, to deal with the delta feed.
Yes, it's possible. It does require some juggling.
Create a table to keep the last state:
create table v1_mux_connection_ping_ta
as
select
assetid,
LATEST_BY_OFFSET(pingable) pingable
from v1_mux_connection_ping_st_parse
group by assetid;
The problem is that it also emits no-change updates. A solution is to translate the table to a stream:
CREATE STREAM v1_mux_connection_ping_ta_s
(assetId VARCHAR KEY, pingable VARCHAR)
WITH (kafka_topic='V1_MUX_CONNECTION_PING_TA', value_format='JSON');
To arrive at only changed values
create table d_opt_details as
select
s.assetId,
LATEST_BY_OFFSET(s.pingable) new,
LATEST_BY_OFFSET(s.pingable, 2)[1] old
from v1_mux_connection_ping_ta_s s
group by
s.assetId;
create table opt_details as
select
s.assetId, s.new as pingable
from d_opt_details s
where new != old;

Hive: dynamic partition adding to external table

I am running Hive 0.71, processing existing data which has the following directory layout:
-TableName
- d= (e.g. 2011-08-01)
- d=2011-08-02
- d=2011-08-03
... etc
Under each date directory I have the data files for that date.
Now, to load the data, I'm using:
CREATE EXTERNAL TABLE table_name (i int)
PARTITIONED BY (date String)
LOCATION '${hiveconf:basepath}/TableName';
I would like my Hive script to be able to load the relevant partitions according to some input date and number of days. So if I pass date='2011-08-03' and days='7',
the script should load the following partitions:
- d=2011-08-03
- d=2011-08-04
- d=2011-08-05
- d=2011-08-06
- d=2011-08-07
- d=2011-08-08
- d=2011-08-09
I haven't found any decent way to do it except explicitly running:
ALTER TABLE table_name ADD PARTITION (d='2011-08-03');
ALTER TABLE table_name ADD PARTITION (d='2011-08-04');
ALTER TABLE table_name ADD PARTITION (d='2011-08-05');
ALTER TABLE table_name ADD PARTITION (d='2011-08-06');
ALTER TABLE table_name ADD PARTITION (d='2011-08-07');
ALTER TABLE table_name ADD PARTITION (d='2011-08-08');
ALTER TABLE table_name ADD PARTITION (d='2011-08-09');
and then running my query
select count(1) from table_name;
However, this is of course not automated according to the date and days input.
Is there any way I can define the external table so that it loads partitions according to a date range, or date arithmetic?
I have a very similar issue where, after a migration, I have to recreate a table for which I have the data, but not the metadata. The solution seems to be, after recreating the table:
MSCK REPAIR TABLE table_name;
Explained here
This also mentions the "alter table X recover partitions" that OP commented on his own post. MSCK REPAIR TABLE table_name; works on non-Amazon-EMR implementations (Cloudera in my case).
The partitions are a physical segmenting of the data: the partition is maintained by the directory system, and the queries use the metadata to determine where the partition is located. So if you can make the directory structure match the query, it should find the data you want. For example:
select count(*) from table_name where (d >= '2011-08-03') and (d <= '2011-08-09');
But I do not know of any date-range operations otherwise; you'll have to do the math to create the query pattern first.
You can also create external tables, and add partitions to them that define the location.
This allows you to shred the data as you like, and still use the partition scheme to optimize the queries.
I do not believe there is any built-in functionality for this in Hive. You may be able to write a plugin; see Creating custom UDFs.
Probably do not need to mention this, but have you considered a simple bash script that would take your parameters and pipe the commands to hive?
Oozie workflows would be another option; however, that might be overkill. Oozie Hive Extension - after some thinking I don't think Oozie would work for this.
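For illustration, the same script idea sketched in Scala rather than bash: generate the ALTER TABLE statements for the requested date range and feed the output to the hive CLI. Table and column names are taken from the question; the argument handling is only a sketch.
import java.time.LocalDate

object AddPartitions {
  def main(args: Array[String]): Unit = {
    // e.g. args(0) = "2011-08-03", args(1) = "7"
    val start = LocalDate.parse(args(0))
    val days  = args(1).toInt

    // One ALTER TABLE ... ADD PARTITION statement per day in the range.
    val statements = (0 until days).map { offset =>
      val d = start.plusDays(offset.toLong)
      s"ALTER TABLE table_name ADD PARTITION (d='$d');"
    }

    // Print the statements; the output can be written to a file and run
    // with `hive -f`, or passed to `hive -e`.
    println(statements.mkString("\n"))
  }
}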
I have explained a similar scenario in my blog post:
1) You need to set properties:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
2) Create an external staging table and load the input files' data into it.
3) Create a main production external table "production_order" with the date field as one of the partition columns.
4) Load the production table from the staging table so that the data is distributed into partitions automatically (a sketch follows the link below).
I explained the similar concept in the blog post below, if you want to see the code:
http://exploredatascience.blogspot.in/2014/06/dynamic-partitioning-with-hive.html
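For illustration, a rough sketch of those four steps with hypothetical table and column names (staging_order, production_order, order_id, amount, order_date). The statements are wrapped in spark.sql here, but the same SQL can be run directly in the Hive CLI.
// 1) Enable dynamic partitioning.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

// 2) External staging table pointing at the raw input files.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS staging_order (
    order_id STRING, amount DOUBLE, order_date STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/staging/orders'
""")

// 3) Production table partitioned by the date field.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS production_order (
    order_id STRING, amount DOUBLE)
  PARTITIONED BY (order_date STRING)
  STORED AS PARQUET
  LOCATION '/data/production/orders'
""")

// 4) Dynamic-partition insert: the last selected column feeds the partition.
spark.sql("""
  INSERT OVERWRITE TABLE production_order PARTITION (order_date)
  SELECT order_id, amount, order_date FROM staging_order
""")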