Sqoop export from Hive to Netezza when a column has an array of values - hiveql

I was trying to run a Sqoop export to load Hive table rows into a Netezza table. The problem is that a few of the columns contain arrays of values; I created the DDL in Netezza with varchar(200) as the data type for those columns and ran the Sqoop job, but I am getting an error saying the count of bad rows reached its limit.
Below is my Sqoop job:
sqoop export --options-file
--direct --connect jdbc:netezza://10.90.21.140:5480/analytics --username sat144 --P --table analytics_stage --export-dir /home/dir1/analytics/data --fields-terminated-by '~' --input-null-string '\N' --input-null-non-string '\N' -m 1 -max-errors #0
My Netezza DDL is below:
CREATE TABLE analytics_stage
(
id varchar(30),
name varchar(60),
dept nvarchar(99),
dept_id nvarchar(200) );
My Hive table column values are below:
Row 1: 20134 (id), sat (name), Data_Group (dept), [121,103,201,212,310] (dept_id)
Can anyone help me with this? If a column has negative values or an array of values in the Hive table, what are the suggested data types in Netezza?
The Sqoop error log is below:
16/05/09 15:46:49 INFO mapreduce.Job: map 50% reduce 0%
16/05/09 15:46:55 INFO mapreduce.Job: Task Id : attempt_1460986388847_0849_m_000000_1, Status : FAILED
Error: java.io.IOException: org.netezza.error.NzSQLException: ERROR: External Table : count of bad input rows reached maxerrors limit
at org.apache.sqoop.mapreduce.db.netezza.NetezzaExternalTableExportMapper.run(NetezzaExternalTableExportMapper.java:255)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: org.netezza.error.NzSQLException: ERROR: External Table : count of bad input rows reached maxerrors limit

Import/export functionality is available between an RDBMS and HDFS in both directions. But when working with Hive, HBase, and HCatalog we only have the option of importing; so far we cannot export data directly from Hive, HBase, or HCatalog.
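A workaround that is often used for array columns (not confirmed in this thread) is to flatten the array into a single delimited string in a staging table and export that instead, since Netezza has no array type; negative numeric values, on the other hand, fit its ordinary INTEGER/NUMERIC types. Below is a minimal sketch using the Spark DataFrame API against the Hive warehouse; the table names analytics_source and analytics_export_stage are hypothetical, and the flattened dept_id then fits the varchar(200)/nvarchar(200) column from the DDL above.
import org.apache.spark.sql.functions.{col, concat_ws}

// Sketch: flatten the array column into one comma-separated string so the
// row layout matches the Netezza VARCHAR columns before running sqoop export.
// Assumes an existing Hive-enabled SparkSession named `spark`.
val src = spark.table("analytics_source")                          // hypothetical Hive source table
val flat = src.withColumn("dept_id",
  concat_ws(",", col("dept_id").cast("array<string>")))            // [121,103,...] -> "121,103,..."
flat.write.mode("overwrite").saveAsTable("analytics_export_stage") // hypothetical staging table to export from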

Related

Spark SQL unable to find the database and table it earlier wrote to

There is a Spark component that creates a SQL table out of transformed data. It successfully saves the data into spark-warehouse under the <database_name>.db folder. The component also tries to read from the existing table so that it does not blindly overwrite it. While reading, Spark is unable to find any database other than default.
Spark version: 2.4
val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .config("spark.debug.maxToStringFields", 100)
  .config("spark.sql.warehouse.dir", "D:/Demo/spark-warehouse/")
  .getOrCreate()

def saveInitialTable(df: DataFrame) {
  df.createOrReplaceTempView(Constants.tempTable)
  spark.sql("create database " + databaseName)
  spark.sql(
    s"""create table if not exists $databaseName.$tableName
       |using parquet partitioned by (${Constants.partitions.mkString(",")})
       |as select * from ${Constants.tempTable}""".stripMargin)
}

def deduplication(dataFrame: DataFrame): DataFrame = {
  if (Try(spark.sql("show tables from " + databaseName)).isFailure) {
    //something
  }
}
The saveInitialTable function completes successfully. On the second run, however, the deduplication function is still not able to pick up <database_name>.
I am not using Hive explicitly anywhere, just Spark DataFrames and the SQL API.
When I run the REPL in the same directory as spark-warehouse, it too only shows the default database.
scala> spark.sql("show databases").show()
2021-10-07 18:45:57 WARN ObjectStore:6666 - Version information not found in metastore.
hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
2021-10-07 18:45:57 WARN ObjectStore:568 - Failed to get database default, returning NoSuchObjectException
+------------+
|databaseName|
+------------+
| default|
+------------+
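For context rather than a confirmed answer: if the session that creates the database is not backed by a persistent metastore, the catalog lives only in memory (or in a Derby metastore_db created in whatever directory the process was started from), so a later run cannot see the database. A minimal sketch of a builder that keeps a persistent catalog is below; it assumes Hive support is on the classpath and reuses the warehouse path from the question.
// Sketch, assuming the missing piece is a persistent (Hive/Derby-backed) metastore:
// enableHiveSupport() makes databases created via spark.sql(...) visible across runs
// instead of living in the default in-memory catalog.
val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "D:/Demo/spark-warehouse/")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()  // should now list databases created in earlier runs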

Unable to load Hive table from Spark DataFrame with more than 25 columns in HDP 3

We were trying to populate a Hive table from the Spark shell. A DataFrame with 25 columns was successfully added to the Hive table using the Hive Warehouse Connector, but for anything beyond that limit we got the error below:
Caused by: java.lang.IllegalArgumentException: Missing required char ':' at 'struct<_c0:string,_c1:string,_c2:string,_c3:string,_c4:string,_c5:string,_c6:string,_c7:string,_c8:string,_c9:string,_c10:string,_c11:string,_c12:string,_c13:string,_c14:string,_c15:string,_c16:string,_c17:string,_c18:string,_c19:string,_c20:string,_c21:string,_c22:string,_c23:string,...^ 2 more fields>'
at org.apache.orc.TypeDescription.requireChar(TypeDescription.java:293)
Below is sample data from the input file (the input file is a CSV).
|col1 |col2 |col3 |col4 |col5 |col6 |col7 |col8 |col9 |col10 |col11 |col12 |col13 |col14 |col15 |col16 |col17|col18 |col19 |col20 |col21 |col22 |col23 |col24 |col25|col26 |
|--------------------|-----|-----|-------------------|--------|---------------|-----------|--------|--------|--------|--------|--------|--------|--------|--------|------|-----|---------------------------------------------|--------|-------|---------|---------|---------|------------------------------------|-----|----------|
|11111100000000000000|CID81|DID72|2015-08-31 00:17:00|null_val|919122222222222|1627298243 |null_val|null_val|null_val|null_val|null_val|null_val|Download|null_val|Mobile|NA |x-nid:xyz<-ch-nid->N4444.245881.ABC-119490111|12452524|1586949|sometext |sometext |sometext1|8b8d94af-5407-42fa-9c4f-baaa618377c8|Click|2015-08-31|
|22222200000000000000|CID82|DID73|2015-08-31 00:57:00|null_val|919122222222222|73171145211|null_val|null_val|null_val|null_val|null_val|null_val|Download|null_val|Tablet|NA |x-nid:xyz<-ch-nid->N4444.245881.ABC-119490111|12452530|1586956|88200211 |88200211 |sometext2|9b04580d-1669-4eb3-a5b0-4d9cec422f93|Click|2015-08-31|
|33333300000000000000|CID83|DID74|2015-08-31 00:17:00|null_val|919122222222222|73171145211|null_val|null_val|null_val|null_val|null_val|null_val|Download|null_val|Laptop|NA |x-nid:xyz<-ch-nid->N4444.245881.ABC-119490111|12452533|1586952|sometext2|sometext2|sometext3|3ab8511d-6f85-4e1f-8b11-a1d9b159f22f|Click|2015-08-31|
The Spark shell was started using the command below:
spark-shell --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar --conf spark.hadoop.metastore.catalog.default=hive --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;user=raj_ops"
The HDP version is 3.0.1.
The Hive table was created using the command below:
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()
hive.createTable("tablename").ifNotExists().column()...create()
Data was saved using the command below:
df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").option("table", "tablename").mode("append").save()
Kindly help us on this.
Thank you in advance.
I faced this problem. After thoroughly examining the source code of the following classes:
org.apache.orc.TypeDescription
org.apache.spark.sql.types.StructType
org.apache.spark.util.Utils
I found out that the culprit was the variable DEFAULT_MAX_TO_STRING_FIELDS inside the class org.apache.spark.util.Utils:
/* The performance overhead of creating and logging strings for wide schemas can be large. To limit the impact, we bound the number of fields to include by default. This can be overridden by setting the 'spark.debug.maxToStringFields' conf in SparkEnv. */
val DEFAULT_MAX_TO_STRING_FIELDS = 25
So, after setting this property in my application, for example conf.set("spark.debug.maxToStringFields", "128"), the issue was gone.
I hope it can help others.
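For completeness, here is a minimal sketch of the two usual ways to apply that setting; the property name comes from the answer above, while the application name and the value 128 are only illustrative.
// Option 1: set it when building the session (the value 128 is just an example).
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("hwc-wide-table")                        // hypothetical application name
  .config("spark.debug.maxToStringFields", "128")
  .getOrCreate()

// Option 2: pass it on the command line when starting spark-shell, alongside the HWC jar:
//   spark-shell --conf spark.debug.maxToStringFields=128 --jars <hive-warehouse-connector-assembly jar> ...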

AWS Redshift Parquet COPY has an incompatible Parquet schema

I am writing a DataFrame to Redshift using a temporary S3 bucket and Parquet as the temporary format. Spark has successfully written the data to the S3 temp bucket, but Redshift failed while trying to COPY the data into the warehouse, with the following error:
error: S3 Query Exception (Fetch)
code: 15001
context: Task failed due to an internal error. File 'https://s3.amazonaws.com/...../part-00001-de882e65-a5fa-4e52-95fd-7340f40dea82-c000.parquet has an incompatible Parquet schema for column 's3://bucket-dev-e
query: 17567
location: dory_util.cpp:872
process: query0_127_17567 [pid=13630]
What am I doing wrong, and how can I fix it?
UPDATED 1
This is the detailed error:
S3 Query Exception (Fetch). Task failed due to an internal error.
File 'https://....d5e6c7a/part-00000-9ca1b72b-c5f5-4d8e-93ce-436cd9c3a7f1-c000.parquet has an incompatible Parquet schema for column 's3://.....a-45f6-bd9c-d3d70d5e6c7a/manifest.json.patient_dob'.
Column type: TIMESTAMP, Parquet schema:\noptional byte_array patient_dob [i:26 d:1 r:0]\n (s3://.......-45f6-bd9c-d3d70d5e6c7a/
Apache Spark version 2.3.1
I also tried setting the following properties, with no luck:
writer
.option("compression", "none")
.option("spark.sql.parquet.int96TimestampConversion", "true")
.option("spark.sql.parquet.int96AsTimestamp", "true")
.option("spark.sql.parquet.writeLegacyFormat", "true")
Where might the issue be?
UPDATED 2
The DataFrame patient_dob column type is DateType.
The Redshift patient_dob field type is date.
S3 Select shows the following for the patient_dob Parquet field: "patient_dob": "1960-05-28"
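No accepted fix appears in this thread, but since the COPY error reports the column as TIMESTAMP while the DataFrame writes it as DateType (and the Parquet footer shows a byte_array), one hedged thing to try is pinning the column to an explicit type before the write so that the Parquet schema and the Redshift column line up. A sketch follows; the column name comes from the question, df stands for the DataFrame being written, and whether this actually resolves the COPY error is not confirmed here.
import org.apache.spark.sql.functions.col

// Sketch only: make the Spark-side type explicit before the S3/Parquet write,
// so the Parquet type matches what the Redshift COPY expects. Cast to timestamp
// if the target behaves as TIMESTAMP, or leave it as a date/string and let COPY
// parse it -- neither option is confirmed by the thread.
val fixedDf = df.withColumn("patient_dob", col("patient_dob").cast("timestamp"))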

Sqoop incremental import from postgres to HDFS is giving org.postgresql.util.PSQLException

I am trying to import incremental data from PostgreSQL into an HDFS directory using a Sqoop job, as follows:
sqoop job --create incident_import -- import --connect jdbc:postgresql://IP ADDRESS:5432/Analyst_Bangalore --username postgres --password track#123 --map-column-java the_geom=String --table incident -m 1 --warehouse-dir /user/hive/warehouse/analyst_bangalore.db --incremental lastmodified --check-column incident_start_time --last-value "2016-08-03 14:33:48.087" --driver org.postgresql.Driver -- --schema analyst
When I execute this job, it gives the following error:
sqoop job -exec incident_import
org.postgresql.util.PSQLException: ERROR: relation "incident" does not exist
Position: 17
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2157)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1886)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:555)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:417)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:302)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:758)
2016-11-14 18:35:47,255 ERROR tool.ImportTool (ImportTool.java:run(613)) - Encountered IOException running import job: java.io.IOException: No columns to generate for ClassWriter
at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1651)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:107)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:478)
If anybody has any idea, please share it with me; it is urgently needed.

Sqoop import from DB2 to DataStax: schema not in proper format

When I try a Sqoop import (DataStax 3.2.0) from a DB2 database using the command below:
./dse sqoop import --connect jdbc:db2://172.29.252.40:4922/DSNN --username tst -P --table tstschema."dsn_filter_table" --cassandra-keyspace SqoopTest --cassandra-column-family actest2 --cassandra-row-key PREDNO --cassandra-thrift-host 10.247.31.42 --cassandra-create-schema --split-by PREDNO
[ DB2 Select query: select * from SchemaName.TableName with ur; ]
Why am I not getting the schema in the proper format, as it is in DB2?
Issue faced: why are the column names of the DB2 table ending up as rows in Cassandra?
Please help me resolve this issue.
I'm not familiar with the DataStax version of Sqoop, but generally the --table parameter can't also be used to specify a schema name. You can specify the schema in the JDBC URL using the currentSchema property (DB2 JDBC URL properties are terminated with a semicolon, and quoting the URL keeps the shell from interpreting it). For example:
sqoop import --connect "jdbc:db2://host/db:currentSchema=tstschema;"