Creating a Hive table with multiple parquets - Scala

I have parquet folders whose names are "yearquarter" values, ranging from 2007q1 to 2020q3. The Hive table I am creating should pull data only from 2014q1 through 2020q2. How do I achieve this?

You'll have to change the parquet folder names and add a prefix like yearquarter=2007q1 (for example), which indicates which column stores these values, so that each folder sits in a hierarchy under a top-level folder (named table_name below):
table_name
|
- yearquarter=2007q1
- yearquarter=2007q2
.
.
- yearquarter=2020q3
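The move itself is just a filesystem rename. A minimal sketch of the one-time restructuring, assuming the data sits on HDFS and the hdfs CLI is available (the source path is a placeholder, not from the question), could look like the following; the question is tagged Scala, but the same hdfs dfs -mv commands can be issued from any driver:
import subprocess

src_root = "/location/to/original"         # where the flat 2007q1 ... 2020q3 folders currently live (assumed path)
dst_root = "/location/to/your/table_name"  # the top-level table folder used below

# Build the list 2007q1 .. 2020q3 (generate through 2020q4, then drop the last entry).
quarters = [f"{y}q{q}" for y in range(2007, 2021) for q in range(1, 5)][:-1]

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", dst_root], check=True)
for yq in quarters:
    subprocess.run(
        ["hdfs", "dfs", "-mv", f"{src_root}/{yq}", f"{dst_root}/yearquarter={yq}"],
        check=True,
    )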
Hive-based solution:
You would then create an external Hive table whose location is that top-level folder. You choose EXTERNAL so that you can set the location. The table schema should correspond to the columns in the files.
CREATE EXTERNAL TABLE TABLE_NAME (
col_name1 HIVE_TYPE,
...,
col_nameN HIVE_TYPE)
PARTITIONED BY (yearquarter STRING)
STORED AS PARQUET
LOCATION '/location/to/your/table_name';
After you have a Hive table over your folder hierarchy, partitioned by the folders (run MSCK REPAIR TABLE table_name; once so that Hive registers the existing partition directories), you create a Hive view which uses a WHERE clause to SELECT a subset.
CREATE VIEW view_name
AS SELECT *
FROM table_name
WHERE yearquarter >= "2014q1" AND yearquarter <= "2020q2";
Performing a SELECT from this view will then provide the required range.
Spark-based solution:
You create a DataFrame which reads the top-level location. Because the hierarchy is stored as yearquarter=2007q1, these values are automatically read into a column labeled yearquarter.
// Needed for the col() function used below.
import org.apache.spark.sql.functions.col

// Read the parquet hierarchy. The file schema and the yearquarter partition column are detected automatically.
val df = spark.read.parquet("/location/to/your/table_name")
// Filter condition for the required range.
val filterCondition = col("yearquarter") >= "2014q1" && col("yearquarter") <= "2020q2"
// Filter according to the condition.
val filtered = df.filter(filterCondition)

Related

How to create many tables programmatically?

I have a table in my database called products which has productId, ProductName, BrandId and BrandName. I need to create delta tables for each brand by passing the brand id as a parameter, and the table name should be the corresponding .delta table. Every time new data is inserted into products (the master table), the data in the brand tables needs to be truncated and reloaded into the brand.delta tables. Could you please let me know if this is possible within Databricks using Spark or dynamic SQL?
It's easy to do; really, there are a few variants:
In Spark: read data from the source table, filter it, etc., and use .saveAsTable in overwrite mode:
df = spark.read.table("products")
# ... transform df (filter, etc.)
brand_table_name = "brand1"
df.write.mode("overwrite").saveAsTable(brand_table_name)
In SQL, by using CREATE OR REPLACE TABLE (you can use spark.sql to substitute variables into this text):
CREATE OR REPLACE TABLE brand1
USING delta
AS SELECT * FROM products where .... filter condition
For a list of brands you just need to use spark.sql with a loop:
for brand in brands:
    spark.sql(f"""CREATE OR REPLACE TABLE {brand}
                  USING delta
                  AS SELECT * FROM products where .... filter condition""")
P.S. Really, I think that you just need to define views (doc) over the products table with the corresponding condition; that way you avoid data duplication and don't incur the compute cost of those writes.
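For completeness, a minimal sketch of that view-based suggestion (again, the brands mapping, the view names and the BrandId condition are assumptions based on the question's schema):
# Hypothetical brand-name -> BrandId mapping.
brands = {"brand1": 1, "brand2": 2}

for brand_name, brand_id in brands.items():
    spark.sql(f"""
        CREATE OR REPLACE VIEW {brand_name}_view
        AS SELECT * FROM products WHERE BrandId = {brand_id}
    """)
Because a view is evaluated at query time, it always reflects the current contents of products, so there is nothing to truncate or reload.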

External Table in Hive showing 0 records, although the location the table points to contains text files (.dat and .txt fixed width) with data

I have fixed-width files stored in an S3 location and need to create an external Hive table on top of them. Below are the options I tried:
Option 1: create a table with a single column; I can then use SQL to substring it into multiple columns based on length and index.
CREATE EXTERNAL TABLE `tbl`(
line string)
ROW FORMAT delimited
fields terminated by '/n'
stored as textfile
LOCATION 's3://bucket/folder/';
Option 2: use RegexSerDe to segregate the data into different columns:
CREATE EXTERNAL TABLE `tbl`(
col1 string ,
col2 string ,
col3 string ,
col4_date string ,
col5 string )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.{10})(.{10})(.{16})(.{19})(.*)")
LOCATION 's3://bucket/folder/';
Neither of the above options returns any records.
select * from tbl;
OK
Time taken: 0.086 seconds

column values change between loading two partitioned tables in KDB (q)

I have two partitioned kdb tables on disk (one called trades, one called books). I created the data by using
.Q.dpft[`:I:/check/trades/;2020.01.01;`symTrade;`trades]
and
.Q.dpft[`:I:/check/books/;2020.01.01;`sym;`books]
for each day. If I select data from the trades table and then load the books table (without selecting data), the values in the symTrade column of my result change to new values. I assume it has got something to do with the partitioning in the books table getting applied to the result from the trades table (also, the trades table is no longer accessible after loading the books table).
How do I:
keep the trades table accessible after loading the books table?
avoid having my symTrade column overwritten by the sym values in the books table?
Here is an example:
system "l I:/check/trades/";
test: 10 sublist select from trades where date=2020.01.01;
show cols test;
// gives `date`symTrade`time`Price`Qty`Volume
select distinct symTrade from test;
// gives TICKER1
// now loading another table
system "l I:/check/books";
select distinct symTrade from test;
// now gives a different value e.g. TICKER200
I think the problem is that you are saving these tables to two different databases.
The first argument in .Q.dpft is the path to the root of the database, and the fourth argument is the name of the table you want to store. So when you do
.Q.dpft[`:I:/check/trades/;2020.01.01;`symTrade;`trades]
You are storing the trades table in a database in I:/check/trades and when you do
.Q.dpft[`:I:/check/books/;2020.01.01;`sym;`books]
you are storing the books table in a database in I:/check/books. I think q can only load one database at a time, so that might be the problem.
Try doing this
.Q.dpft[`:I:/check/;2020.01.01;`symTrade;`trades]
.Q.dpft[`:I:/check/;2020.01.01;`sym;`books]
system "l I:/check/";
Let us know if that works!

Query execution Error via PySpark - GC Error

I have a requirement to extract the row count of each table in a Hive database (which has multiple schemas). I wrote a PySpark job which extracts the count of each table; it works fine when I try some of the schemas, but it fails with a GC overhead error when I try all schemas. I tried creating a UNION ALL of all the table queries across the database, and also a UNION ALL of all tables within a schema; both failed with the GC error.
Can you please advise how to avoid this error? Below is my script:
from pyspark.sql.functions import col, concat, lit  # functions used below; tables_df, prep_query, table_count, etc. are helpers defined elsewhere in the job

# For loop over schemas starts here
for schema in schemas_list:
    # Dataframes with all table names available in the given schema for level1 and level2
    tables_1_df = tables_df(schema, 1)
    tables_1_list = formatted_list(tables_1_df, 1)
    tables_2_df = tables_df(schema, 2)
    tables_2_list = formatted_list(tables_2_df, 2)
    tables_list = list(set(tables_1_list) & set(tables_2_list))  # intersection of level1 and level2 tables per schema name
    # For loop over tables starts here
    for table in tables_list:
        # Creating dataframes with the row count of the given table for level1 and level2
        level_1_query = prep_query(schema, table, 1)
        level_2_query = prep_query(schema, table, 2)
        level_1_count_df = level_1_count_df.union(table_count(level_1_query))
        level_1_count_df.persist()
        level_2_count_df = level_2_count_df.union(table_count(level_2_query))
        level_2_count_df.persist()

# Validate whether level1 and level2 reconcile; if not, write the row into a dataframe which will in turn be written to a file in the S3 location
level_1_2_join_df = level_1_count_df.alias("one").join(
    level_2_count_df.alias("two"),
    (level_1_count_df.schema_name == level_2_count_df.schema_name) &
    (level_1_count_df.table_name == level_2_count_df.table_name),
    'inner'
).select(col("one.schema_name"), col("two.table_name"), col("level_1_count"), col("level_2_count"))
main_df = header_df.union(level_1_2_join_df)
if extracttype == 'DELTA':
    main_df = main_df.filter(main_df.level_1_count != main_df.level_2_count)
main_df = main_df.select(concat(col("schema_name"), lit(","), col("table_name"), lit(","), col("level_1_count"), lit(","), col("level_2_count")))
# Creates a file in the temp location
file_output(main_df, tempfolder)  # writes to a txt file in hadoop

Redshift Copy and auto-increment does not work

I am using the Redshift COPY command to copy JSON data from S3.
The table definition is as follows:
CREATE TABLE my_raw
(
id BIGINT IDENTITY(1,1),
...
...
) diststyle even;
The COPY command I am using is as follows:
COPY my_raw FROM 's3://dev-usage/my/2015-01-22/my-usage-1421928858909-15499f6cc977435b96e610298919db26' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXX' json 's3://bemole-usage/schemas/json_schema' ;
I am expecting that any newly inserted id will always be > select max(id) from my_raw. In fact that's clearly not the case.
If I issue the above copy command twice, the first time the ids run from 1 to N even though the file only creates 114 records (that's a known issue with Redshift when it has multiple shards). The second time the ids are also between 1 and N, but it uses free numbers that were not used in the first copy.
See below for a demo:
usagedb=# COPY my_raw FROM 's3://bemole-usage/my/2015-01-22/my-usage-1421930213881-b8afbe07ab34401592841af5f7ddb31c' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX' json 's3://bemole-usage/schemas/json_schema' COMPUPDATE OFF;
INFO: Load into table 'my_raw' completed, 114 record(s) loaded successfully.
COPY
usagedb=#
usagedb=# select max(id) from my_raw;
max
------
4556
(1 row)
usagedb=# COPY my_raw FROM 's3://bemole-usage/my/2015-01-22/my-usage-1421930213881-b8afbe07ab34401592841af5f7ddb31c' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX' json 's3://bemole-usage/schemas/my_json_schema' COMPUPDATE OFF;
INFO: Load into table 'my_raw' completed, 114 record(s) loaded successfully.
COPY
usagedb=# select max(id) from my_raw;
max
------
4556
(1 row)
Thx in advance
The only solution I found to make sure I have sequential ids based on the insertion is to maintain a pair of tables. The first one is the stage table, into which the items are inserted by the COPY command. The stage table will actually not have an id column.
Then I have another table that is an exact replica of the stage table, except that it has an additional column for the ids. A job then takes care of filling the master table from the stage table using the ROW_NUMBER() function.
In practice, this means executing the following statement after each Redshift COPY is performed:
insert into master
(id,result_code,ct_timestamp,...)
select
#{startIncrement}+row_number() over(order by ct_timestamp) as id,
result_code,...
from stage;
Then the ids are guaranteed to be sequential/consecutive in the master table.
I can't reproduce your problem; however, it is interesting how to get identity columns set correctly in conjunction with COPY. Here is a small summary:
Be aware that you can specify the columns (and their order) for a copy command.
COPY my_table (col1, col2, col3) FROM 's3://...'
So if:
the EXPLICIT_IDS flag is NOT set,
no columns are listed as shown above,
and your csv does not contain data for the IDENTITY column,
then the identity values in the table will be set automatically and monotonically, as we all want.
doc:
If an IDENTITY column is included in the column list, then EXPLICIT_IDS must also be specified; if an IDENTITY column is omitted, then EXPLICIT_IDS cannot be specified. If no column list is specified, the command behaves as if a complete, in-order column list was specified, with IDENTITY columns omitted if EXPLICIT_IDS was also not specified.