External table in Hive showing 0 records, although the location the table points to contains text files (.dat and .txt, fixed width) with data - hiveql

I have fixed-width files stored in an S3 location and need to create an external Hive table on top of them. Below are the options I tried:
Option 1: Create the table with a single column, then use SQL substring to split it into multiple columns based on length and index.
CREATE EXTERNAL TABLE `tbl`(
line string)
ROW FORMAT delimited
fields terminated by '/n'
stored as textfile
LOCATION 's3://bucket/folder/';
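For illustration, the follow-up query implied by this option might look like the sketch below (the column widths are assumptions, mirroring the fixed widths used in option 2):
SELECT substr(line, 1, 10)  AS col1,
       substr(line, 11, 10) AS col2,
       substr(line, 21, 16) AS col3,
       substr(line, 37, 19) AS col4_date,
       substr(line, 56)     AS col5
FROM tbl;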
Option 2: Use RegexSerDe to split the data into different columns:
CREATE EXTERNAL TABLE `tbl`(
col1 string ,
col2 string ,
col3 string ,
col4_date string ,
col5 string )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.{10})(.{10})(.{16})(.{19})(.*)")
LOCATION 's3://bucket/folder/';
Neither of the above options returns any records:
select * from tbl;
OK
Time taken: 0.086 seconds

Related

Creating hive with multiple parquets

I have parquet folders named by "yearquarter", ranging from 2007q1 to 2020q3. The Hive table I am creating should pull data from only 2014q1 through 2020q2. How do I achieve this?
You'll have to rename the parquet folders, adding a prefix like yearquarter=2007q1 (for example) which indicates what column stores these values, so each folder sits in a hierarchy under a top-level folder (named table_name below).
table_name
|
- yearquarter=2007q1
- yearquarter=2007q2
.
.
- yearquarter=2020q3
Hive-based solution:
You would then create an external Hive table located at the top-level folder. You choose external so you can set the location. The table schema should correspond to the columns in the files.
CREATE EXTERNAL TABLE TABLE_NAME (
col_name1 HIVE_TYPE,
...,
col_nameN HIVE_TYPE)
PARTITIONED BY (yearquarter STRING)
STORED AS PARQUET
LOCATION '/location/to/your/table_name';
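Note that Hive does not automatically discover the existing folders as partitions of an external table; you will most likely need to register them first, for example:
MSCK REPAIR TABLE table_name;  -- scans the table location and adds the yearquarter=... partitions to the metastore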
After you have a Hive table over your folder hierarchy, partitioned by those folders, you create a Hive view which uses a WHERE clause to SELECT the subset you need.
CREATE VIEW view_name
AS SELECT *
FROM table_name
WHERE yearquarter >= "2014q1" AND yearquarter <= "2020q2";
Performing a SELECT from this view will then provide the required range.
Spark-based solution:
You create a DataFrame which reads the top-level location. Because you stored the hierarchy like yearquarter=2007q1, these values are automatically read into a column named yearquarter.
// col() lives in the functions object.
import org.apache.spark.sql.functions.col

// Read the parquet hierarchy. The schema (if present) is detected automatically.
val df = spark.read.parquet("/location/to/your/table_name")
// Filter condition covering the required range.
val filterCondition = col("yearquarter") >= "2014q1" && col("yearquarter") <= "2020q2"
// Filter according to the condition.
val filtered = df.filter(filterCondition)

Hive - create table

I need to create a table in Hive to insert data like the example below:
Column 1 -- account id String(11 characters)
Column 2 -- Age int
Column 3 -- duplicate account_id
The data is stored in a text file delimited by spaces, but the last column will have multiple values; hence, when querying, I will need to eliminate a row if a value is present in that column.
Example text file:
Thomsxx3125 25 Davidxx3125 Raghuxx3125 Vijayxx3125 Gracexx3125
Appreciate your help on this please.
You can't create duplicate column names.
Here is a query that will work:
create table if not exists name_of_table
(
account_id string comment '11 characters',
age int,
account_id2 string
)
row format delimited
fields terminated by ' '
stored as textfile;
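For the second part of the question (dropping rows where that extra column is populated), a minimal sketch against this table could be:
select * from name_of_table where account_id2 is null or account_id2 = '';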
You can also refer to the official documentation for Hive:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable

Inserting a substring of column in Redshift

Hello, I am using Redshift, where I have a staging table and a base table. One of the columns (city) in my base table has data type varchar with length 100. When inserting the column value from the staging table into the base table, I want the value truncated to the first (leftmost) 100 characters. Is this possible in Redshift?
INSERT into base_table(org_city) select substring(city,0,100) from staging_table;
I tried using the above query but it failed. Any solutions please ?
Try this! Your base table column is varchar(100), so you need to substring characters 0-99, which is 100 characters; you are trying to substring 101 characters.
INSERT into base_table(org_city) select substring(city,0,99) from staging_table;

T-SQL - Insert Statement Ignoring a Column

I have a table with a dozen or so columns. I only know the name of the first column and the last 4 columns (in theory, I might know the name of only one column and not its position).
How can I write a statement which ignores this column? At the moment I do various column counts in ASP and construct a statement that way, but I was wondering if there was an easier way.
UPDATE
INSERT INTO tblName VALUES ("Value for col2", "Value for col3")
but the table has a col4 and potentially more, which I'd be ignoring.
I basically have a CSV file. This CSV file has no headers. It has 'X' fewer columns than the table I'm inserting into. I would like to insert the data from the CSV into the table.
There are many tables of different structures and many CSV files. I have created an ASP page to take any CSV and upload it to the corresponding table (based on a parameter within the CSV file).
It works fine; I was just wondering whether, when building the INSERT statement, I could ignore certain columns and cut down on my code.
So let's say the CSV has data as follows
123 | 456 | 789
234 | 567 | 873
The table has a structure of
ID | Col1 | Col2 | Col3 | Col4 | Col5
I currently construct an insert statement that says
INSERT INTO tblName VALUES ("123", "456", "789", "", "")
However, I was wondering if there was a way I could omit the empty values by somehow "ignoring" those columns. As mentioned, the column names are not known, apart from the ones I have no data for.
There is no SQL shortcut for
Select * (except column col1) from ...
You have to construct your SQL from database metadata, as you already did, if I understood you correctly.
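A hedged sketch of pulling that metadata in T-SQL (the table name is a placeholder):
SELECT COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'tblName'
ORDER BY ORDINAL_POSITION;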
You can specify the columns that you want to insert.
So instead of...
INSERT INTO tblName VALUES ("Value for col2", "Value for col3")
You could specify column names...
INSERT INTO tblName (ColumnName1, ColumnName2) VALUES ("Value for col2", "Value for col3")
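Applied to the CSV example above, skipping the columns you have no data for (the column names here are taken from the example table structure and are assumptions):
INSERT INTO tblName (Col1, Col2, Col3) VALUES ('123', '456', '789');
INSERT INTO tblName (Col1, Col2, Col3) VALUES ('234', '567', '873');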

Copy selected query fields name in Mysql Workbench

I am using MySQL Workbench (SQL Editor). I need to copy the list of columns in each query, as was possible in MySQL Query Browser.
For example
Select * From tb
I want to have the list of fields, like:
id,title,keyno,......
You mean you want to be able to get one or more columns for a specified table?
1st way
Do SHOW COLUMNS FROM your_table_name and, from there, depending on what you want, add some basic filtering by specifying that you only want columns whose data type is int, whose default value is null, etc., e.g. SHOW COLUMNS FROM your_table_name WHERE Type='mediumint(8)' AND `Null`='yes'
2nd way
This way is a bit more flexible and powerful, as you can combine many tables and other properties kept in MySQL's INFORMATION_SCHEMA internal database, which has records of all db columns, tables, etc. Use the query below as it is, setting TABLE_NAME to the table you want to find the columns for:
SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME='your_table_name';
To limit the matched columns to a specific database, add AND TABLE_SCHEMA='your_db_name' at the end of the query.
Also, to have the column names appear not as multiple rows but as a single comma-separated list, you can use GROUP_CONCAT(COLUMN_NAME SEPARATOR ',') instead of only COLUMN_NAME.
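Putting those pieces together, a sketch of the full query (table and schema names are placeholders):
SELECT GROUP_CONCAT(COLUMN_NAME SEPARATOR ',')
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME='your_table_name' AND TABLE_SCHEMA='your_db_name';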
To select all columns in a select statement, go to the SCHEMAS panel, right-click on the table whose column names you want, then select "Copy to Clipboard > Select All statement".
The accepted solution is fine, but it is limited to field names in tables. One way to handle arbitrary queries is to standardize your select clause so that a regex can strip out only the column aliases. I format my select clause as "1 row per element", so
Select 1 + 1 as Col1, 1 + 2 Col2 From Table
becomes
Select 1 + 1 as Col1
, 1 + 2 Col2
From Table
Then I use a simple regex on the "1 row per select element" version to replace "^.* " (excluding quotes) with nothing. The regex finds everything before the final space in the line, so it assumes your column aliases don't contain spaces (so replace spaces with underscores). Or, if you don't like "1 row per element", always use the "as" keyword so the regex has a handle to grasp.