How do I load data correctly in Hive using spark? - scala

I want to load data that looks like this:
"58;""management"";""married"";""tertiary"";""no"";2143;""yes"";""no"";""unknown"";5;""may"";261;1;-1;0;""unknown"";""no"""
"44;""technician"";""single"";""secondary"";""no"";29;""yes"";""no"";""unknown"";5;""may"";151;1;-1;0;""unknown"";""no"""
"33;""entrepreneur"";""married"";""secondary"";""no"";2;""yes"";""yes"";""unknown"";5;""may"";76;1;-1;0;""unknown"";""no"""
"47;""blue-collar"";""married"";""unknown"";""no"";1506;""yes"";""no"";""unknown"";5;""may"";92;1;-1;0;""unknown"";""no"""
My create table statement is:
sqlContext.sql("create table dummy11(age int, job string, marital string, education string, default string, housing string, loan string, contact string, month string, day_of_week string, duration int, campaign int, pday int, previous int, poutcome string, emp_var_rate int, cons_price_idx int, cons_conf_idx int, euribor3m int, nr_employed int, y string)row format delimited fields terminated by ';'")
When I run the statement:
sqlContext.sql("from dummy11 select age").show()
OR
sqlContext.sql("from dummy11 select y").show()
it returns NULL instead of the correct values, though other columns are visible.
So how do I correct this?

Since you are using Hive QL syntax, you need to validate the input data before processing.
In your data, some records have fewer fields than the number of columns defined in the DDL.
For those records, the remaining columns (counting from the end) are set to NULL, because the row does not have enough values.
That is why the last column, y, has NULL values.
Also, in the DDL the first field's data type is INT, but in the records the first field values are:
"58
"44
"33
Because of the leading ", these values cannot be cast to INT, so the field is set to NULL.
Given the DDL and the data you provided, the values are assigned as follows:
age "58
job ""management""
marital ""married""
education ""tertiary""
default ""no""
housing 2143
loan ""yes""
contact ""no""
month ""unknown""
day_of_week 5
duration ""may""
campaign 261
pday 1
previous -1
poutcome 0
emp_var_rate ""unknown""
cons_price_idx ""no""
cons_price_idx NULL
cons_conf_idx NULL
euribor3m int NULL
nr_employed NULL
y NULL
Check the NULL values in the last columns; the shifted INT columns (duration, emp_var_rate, cons_price_idx) also end up NULL, because values like ""may"" and ""no"" cannot be cast to INT.
So, if that is not expected, you need to validate the data before proceeding.
And for the column age, if you need it as INT, cleanse the data to remove the unwanted " character.
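For example, a minimal cleansing sketch (assuming the table is already read into a DataFrame named df; the column name age_clean is purely illustrative):
import org.apache.spark.sql.functions.regexp_replace

// Strip the stray " characters from the raw string column, then cast to INT
val cleaned = df.withColumn("age_clean", regexp_replace(df("age"), "\"", "").cast("int"))
cleaned.select("age_clean").show()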
WORKAROUND
As a workaround, you can define age as STRING in the DDL, then use a Spark transformation to parse the first field and convert it to INT:
import org.apache.spark.sql.functions._

// UDF that drops the leading " and parses the remainder as an integer
val ageInINT = udf { (age: String) =>
  Integer.parseInt(age.substring(1))
}

df.withColumn("ageInINT", ageInINT(df("age"))).show()
Here df is your DataFrame created from the Hive DDL with the column age as STRING.
Now you can perform operations on the new INTEGER column ageInINT rather than on the column age.
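For instance, a hedged usage sketch (the filter threshold of 40 is purely illustrative):
df.withColumn("ageInINT", ageInINT(df("age")))
  .filter(col("ageInINT") > 40)
  .show()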

Since your data contains " just before the age, the value is treated as a string. In your DDL you defined the column as int, so the SQL parser tries to read an integer value, and that is why you get NULL records. Change age int to age string and you will be able to see the result.
Please see the working example below, using the Spark HiveContext.
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

// conf was undefined in the original snippet; the app name here is illustrative
val conf = new SparkConf().setAppName("HiveLoadExample")
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
sqlContext.sql("create external table dummy11(age string, job string, marital string, education string, default string, housing string, loan string, contact string, month string, day_of_week string, duration int, campaign int, pday int, previous int, poutcome string, emp_var_rate int, cons_price_idx int, cons_conf_idx int, euribor3m int, nr_employed int, y string)row format delimited fields terminated by ';' location '/user/skumar143/stack/'")
sqlContext.sql("select age, job from dummy11").show()
Its output:
+---+----------------+
|age| job|
+---+----------------+
|"58| ""management""|
|"44| ""technician""|
|"33|""entrepreneur""|
|"47| ""blue-collar""|
+---+----------------+

Related

pyspark + hive: difference between first row in dataframe and table

I created a table in Hive using a csv file containing a header:
CREATE TABLE resultado(
data_jogo date,
mandante string,
visitante string,
gols_mandante int,
gols_visitante int,
torneio string,
cidade string,
pais string,
campo_neutro boolean
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
TBLPROPERTIES('skip.header.line.count'='1');
LOAD DATA INPATH '/user/hive/projeto/results.csv' OVERWRITE INTO TABLE resultado;
And it works fine when I try a select:
SELECT * FROM resultado LIMIT 5;
Then I went to pyspark to see the same data:
from pyspark.sql import HiveContext
h = HiveContext(sc)
df = h.table('resultado')
df.show(5)
But it returns a DataFrame that still contains the header row from the file loaded into the table.
Please, can someone tell me what I'm doing wrong? As you can see, I'm really new to this xD

Not able to create Hive table with TIMESTAMP datatype in Azure Databricks

org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.UnsupportedOperationException: Parquet does not support
timestamp. See HIVE-6384;
Getting the above error while executing the following code in Azure Databricks.
spark_session.sql("""
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time TIMESTAMP
)
PARTITIONED BY (
Date DATE)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION "/mnt/data_analysis/pre-processed/"
""")
As per the HIVE-6384 Jira, starting from Hive 1.2 you can use timestamp and date types in Parquet tables.
Workarounds for Hive versions < 1.2:
1. Using String type:
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time STRING
)
PARTITIONED BY (
Date STRING)
Stored as parquet
Location '/mnt/data_analysis/pre-processed/';
Then, while processing, you can cast arrival_time and Date to the timestamp and date types.
You can also use a view that casts the columns, but views are slow.
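For example, a minimal casting sketch (assuming the string-typed table above; spark_session is the same session as in the question, and the alias names are illustrative):
val processed = spark_session.sql("""
  SELECT campaign,
         status,
         file_name,
         CAST(arrival_time AS timestamp) AS arrival_ts,
         CAST(`Date` AS date) AS event_date
  FROM dev_db.processing_table
""")
processed.show()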
2. Using ORC format:
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time Timestamp
)
PARTITIONED BY (
Date date)
Stored as orc
Location '/mnt/data_analysis/pre-processed/';
ORC supports both the timestamp and date types.

Specified partition columns do not match the partition columns of the table. Please use () as the partition columns

Here I'm trying to persist the DataFrame into a partitioned Hive table and am getting this silly exception. I have looked into it many times but am not able to find the fault.
org.apache.spark.sql.AnalysisException: Specified partition columns
(timestamp value) do not match the partition columns of the table.
Please use () as the partition columns.;
Here is the script with which the external table was created:
CREATE EXTERNAL TABLE IF NOT EXISTS events2 (
action string
,device_os_ver string
,device_type string
,event_name string
,item_name string
,lat DOUBLE
,lon DOUBLE
,memberid BIGINT
,productupccd BIGINT
,tenantid BIGINT
) partitioned BY (timestamp_val DATE)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
stored AS inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 'maprfs:///location/of/events2'
tblproperties ('serialization.null.format' = '');
Here is the result of describe formatted for the table "events2":
hive> describe formatted events2;
OK
# col_name data_type comment
action string
device_os_ver string
device_type string
event_name string
item_name string
lat double
lon double
memberid bigint
productupccd bigint
tenantid bigint
# Partition Information
# col_name data_type comment
timestamp_val date
# Detailed Table Information
Database: default
CreateTime: Wed Jan 11 16:58:55 IST 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: maprfs:/location/of/events2
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
serialization.null.format
transient_lastDdlTime 1484134135
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.078 seconds, Fetched: 42 row(s)
Here is the line of code where the data is partitioned and stored into the table:
val tablepath = Map("path" -> "maprfs:///location/of/events2")
AppendDF.write.format("parquet").partitionBy("Timestamp_val").options(tablepath).mode(org.apache.spark.sql.SaveMode.Append).saveAsTable("events2")
While running the application, I'm getting the below error:
Specified partition columns (timestamp_val) do not match the partition
columns of the table. Please use () as the partition columns.
I might be committing an obvious error, any help is much appreciated with an upvote :)
Please print the schema of the DataFrame:
AppendDF.printSchema()
and make sure there is no type mismatch.
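Also worth checking, as a hedged guess from the snippets above: partitionBy("Timestamp_val") capitalizes the column differently from the table's partition column timestamp_val. If the DataFrame column really is named Timestamp_val, renaming it before the write may resolve the mismatch:
// Hypothetical fix, assuming the problem is only the column-name casing
val fixedDF = AppendDF.withColumnRenamed("Timestamp_val", "timestamp_val")
fixedDF.write.format("parquet")
  .partitionBy("timestamp_val")
  .options(tablepath)
  .mode(org.apache.spark.sql.SaveMode.Append)
  .saveAsTable("events2")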

variable not binding value in Spark

I am passing a variable, but it is not passing the value.
I populate the variable's value here:
val Temp = sqlContext.read.parquet("Tabl1.parquet")
Temp.registerTempTable("temp")
val year = sqlContext.sql("""select value from Temp where name="YEAR"""")
year.show()
Here year.show() shows the proper value.
I am passing the parameter in the code below:
val data = sqlContext.sql("""select count(*) from Table where Year='$year' limit 10""")
data.show()
The value year is a DataFrame, not a specific value (Int or Long). So when you use it inside a string interpolation, you get the result of DataFrame.toString, which isn't something you can compare values to (the toString returns a string representation of the DataFrame's schema).
If you can assume the year DataFrame has a single Row with a single column of type Int, and you want to get the value of that column, you can use first().getAs[Int](0) to get that value and then use it to construct your next query:
val year: DataFrame = sqlContext.sql("""select value from Temp where name="YEAR"""")
// get the first column of the first row:
val actualYear: Int = year.first().getAs[Int](0)
val data = sqlContext.sql(s"select count(*) from Table where Year='$actualYear' limit 10")
If the value column in the Temp table has a different type (String, Long), just replace Int with that type.
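For example, if value were stored as a String (an assumption), the same pattern becomes:
// Identical extraction, just with a String-typed column
val actualYear: String = year.first().getAs[String](0)
val data = sqlContext.sql(s"select count(*) from Table where Year='$actualYear' limit 10")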

How to preserve order of columns in cassandra

I have two tables in Cassandra:
CREATE TABLE table1 (
name text PRIMARY KEY,
grade text,
labid list<int>);
CREATE TABLE table2 (
name text PRIMARY KEY,
deptid list<int>,
grade text);
for example:
val result: RDD[(String, String, List[Int])] = myFunction()
result.saveToCassandra(keyspace, table1)
It is working fine.
but when using the line below:
result.saveToCassandra(keyspace, table2)
I'm getting this type of error: com.datastax.spark.connector.types.TypeConversionException: Cannot convert object test_data of type class java.lang.String to List[AnyRef]
Is there any solution using SomeColumns which satisfies both tables (we don't know which table will be used)? e.g.:
result.saveToCassandra(keyspace, table, SomeColumns(....))?
By default the dataframe schema only cares about position, not column name, so if your c* table has a different column order, you will get incorrect writes. The solution is, as you said, to use SomeColumns.
import com.datastax.spark.connector._

// Build a ColumnRef for every dataframe column, in schema order
val columns = dataFrame.schema.map(_.name: ColumnRef)
dataFrame.rdd.saveToCassandra(keyspaceName, tableName, SomeColumns(columns: _*))
Now the dataframe columns will be written to c* using their name, not position.
Your arguments should be in a different order because the tables have different column orders:
val result: RDD[(String, String, List[Int])] = myFunction()
// Reorder the tuple fields to match table2's column order (name, deptid, grade)
val reorder: RDD[(String, List[Int], String)] = result.map(r => (r._1, r._3, r._2))
reorder.saveToCassandra(keyspace, table2)