Hive crashing on where clause - mongodb

I am trying to get a hive-hadoop-mongo setup to work. I have imported the data into mongodb from a json file, then I created both internal and external tables in hive that connect to mongo:
CREATE EXTERNAL TABLE reviews(
user_id STRING,
review_id STRING,
stars INT,
date1 STRING,
text STRING,
type STRING,
business_id STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"date1":"date"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/test.reviews');
This part works: a select-all query (select * from reviews) returns everything as it should. But when I run a query with a where clause (select * from reviews where stars=4, for example), hive crashes.
I have the following jars being added when I start up hive:
add jar mongo-hadoop.jar;
add jar mongo-java-driver-3.3.0.jar;
add jar mongo-hadoop-hive-2.0.1.jar;
And if it is relevant in any sense, I am using Amazon's EMR cluster for this, and I'm connected through ssh.
Thanks for all the help
Here is the error hive throws out:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.exec.Utilities.deserializeExpression(Ljava/lang/String;)Lorg/apache/hadoop/hive/ql/plan/ExprNodeGenericFuncDesc;
at com.mongodb.hadoop.hive.input.HiveMongoInputFormat.getFilter(HiveMongoInputFormat.java:134)
at com.mongodb.hadoop.hive.input.HiveMongoInputFormat.getRecordReader(HiveMongoInputFormat.java:103)
at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:691)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:329)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:455)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:424)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:144)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1885)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:252)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Create the table as shown below and check.
CREATE [EXTERNAL] TABLE <tablename>
(<schema>)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
[WITH SERDEPROPERTIES('mongo.columns.mapping'='<JSON mapping>')]
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
[LOCATION '<path to existing directory>'];
Instead of using a StorageHandler to read, serialize, deserialize, and output the data from Hive objects to BSON objects, the components are listed individually. This is because using a StorageHandler has too many negative effects when dealing with the native HDFS filesystem.

I see
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"date1":"date"}')
and you are querying the column stars which is not mapped.

I ran into this problem on our cluster.
The cluster's Hive version is higher than the version mongo-hive is built against (which is 1.2.1).
The old method org.apache.hadoop.hive.ql.exec.Utilities.deserializeExpression has been moved to org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpression, which is why the call fails with NoSuchMethodError.
You need to rebuild the jar yourself against the Hive version on your cluster.

Delta Lake Data Load Datatype mismatch

I am loading data from SQL Server into Delta Lake tables. Recently I had to repoint the source to another table (same columns), but the data types are different in the new table. This causes an error while loading data into the Delta table. I get the following error:
Failed to merge fields 'COLUMN1' and 'COLUMN1'. Failed to merge incompatible data types LongType and DecimalType(32,0)
The command I use to write data to the Delta table:
DF.write.mode("overwrite").format("delta").option("mergeSchema", "true").save("s3 path")
The only option I can think of right now is to set overwriteSchema to true.
But this will rewrite my target schema completely. I am concerned that any sudden change in the source schema would replace the existing target schema without any notification or alert.
Also, I can't explicitly convert these columns because the Databricks notebook I am using is a parametrized one used to load data from source to target (we read a CSV file that contains all the details about the target table, source table, partition key, etc.).
Is there any better way to tackle this issue?
Any help is much appreciated!
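One way to get the notification asked about above is to compare column types before writing and stop (or alert) when they differ. The following is only a minimal sketch in plain Python, not a Delta Lake feature: the function name find_type_conflicts and the two example schemas are hypothetical, and it assumes the schemas are available as column-to-type-string mappings (with PySpark, dict(df.dtypes) on the source and target DataFrames could supply them).

```python
def find_type_conflicts(source_types, target_types):
    """Return {column: (source_type, target_type)} for columns that exist in
    both schemas but whose type strings differ."""
    conflicts = {}
    for column, source_type in source_types.items():
        target_type = target_types.get(column)
        # Columns missing from the target are new columns, not type conflicts.
        if target_type is not None and target_type != source_type:
            conflicts[column] = (source_type, target_type)
    return conflicts


# Hypothetical schemas mirroring the error in the question.
source_types = {"COLUMN1": "decimal(32,0)", "COLUMN2": "string"}
target_types = {"COLUMN1": "bigint", "COLUMN2": "string"}

conflicts = find_type_conflicts(source_types, target_types)
if conflicts:
    # Surface the drift instead of silently overwriting the target schema.
    print("Schema drift detected:", conflicts)
```

In a parametrized notebook, a check like this could run for every source/target pair listed in the CSV, so that a schema overwrite is only attempted after someone has seen the reported conflicts.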

Even after setting the "orc.force.positional.evolution" to false hive is still picking up based on position

I have an external table to which I have added a few new columns. I want the ORC data written from a Spark DataFrame to the Hive external table to be matched by column name rather than by position, so I have set "orc.force.positional.evolution"="false" in TBLPROPERTIES. The data is still written by position, which is incorrect.
Please suggest what I am missing here. I have used below question as a reference:
Hive external table with ORC format- how to map the column names in the orc file to the hive table columns?
I have a workaround of using select on the Spark DataFrame, but I am looking for better options that need no code changes.
Hive version I am using is 3.1

Cassandra Alter Column type from Timestamp to Date

Is there any way to alter a Cassandra column from timestamp to date without data loss? For example, '2021-02-25 20:30:00+0000' to '2021-02-25'.
If not, what is the easiest way to migrate this column(timestamp) to the new column(date)?
It's impossible to change the type of an existing column, so you need to add a new column with the correct data type and perform a migration. The migration could be done via Spark + Spark Cassandra Connector; it could be the most flexible solution, and could even be done on a single machine with Spark running in local master mode (the default). The code could look something like this (try on test data first):
import pyspark.sql.functions as F
# "tbl" and "ks" are placeholders; `spark` is the active SparkSession (predefined in the pyspark shell)
options = {"table": "tbl", "keyspace": "ks"}
# Read the primary key columns plus the timestamp, cast it to date under the new column's name,
# and write back to the same table; append mode is needed because the table already exists
spark.read.format("org.apache.spark.sql.cassandra").options(**options).load()\
    .select("pk_col1", "pk_col2", F.col("timestamp_col").cast("date").alias("new_name"))\
    .write.format("org.apache.spark.sql.cassandra").options(**options).mode("append").save()
P.S. you can use DSBulk, for example, but you need to have enough space to offload the data (although you need only primary key column + your timestamp)
To add to Alex Ott's answer, there are validations in Cassandra that prevent changing the data type of a column. The reason is that SSTables (Cassandra data files) are immutable -- once they are written to disk, they are never modified/edited/updated. They can only be compacted to new SSTables.
Some try to get around it by dropping the column from the table and then adding it back with a new data type. Unlike in a traditional RDBMS, the existing data in the SSTables doesn't get updated, so if you tried to read the old data, you'd get a CorruptSSTableException because the CQL type of the data on disk wouldn't match that of the schema.
For this reason, it is no longer possible to drop/recreate columns with the same name (CASSANDRA-14948). If you're interested, I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/8018/. Cheers!
You can use ToDate to change it. For example: Table Email has column Date with format: 2001-08-29 13:03:35.000000+0000.
Select Date, ToDate(Date) as Convert from keyspace.Email:
 date                            | convert
---------------------------------+------------
 2001-08-29 13:03:35.000000+0000 | 2001-08-29
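Whichever route is used, it can help to check on a sample value which calendar date a given timestamp maps to, since that depends on the time zone applied during the conversion. The snippet below is a plain-Python sketch using the value from the question; the helper name timestamp_to_date is hypothetical and it does not talk to Cassandra or Spark.

```python
from datetime import datetime, timedelta, timezone


def timestamp_to_date(text, tz=timezone.utc):
    """Parse a 'YYYY-MM-DD HH:MM:SS+ZZZZ' timestamp and return its ISO date in tz."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S%z")
    # Shift to the requested zone before dropping the time-of-day part.
    return parsed.astimezone(tz).date().isoformat()


sample = "2021-02-25 20:30:00+0000"
print(timestamp_to_date(sample))                                # 2021-02-25
print(timestamp_to_date(sample, timezone(timedelta(hours=9))))  # 2021-02-26
```

The second call shows the same instant landing on the next day in a UTC+9 zone, so it is worth confirming which time zone the migration tool applies before converting the whole table.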

Int to Date conversion error without any changes happening in data model in Talend

I had the ETL working until two days ago when I started receiving:
Exception in component tDBInput_1 (test4) java.sql.SQLDataException:
Unsupported conversion from TIMESTAMP to java.lang.Integer
Clueless what happened here.
It looks like the schema defined inside your tDBInput_1 component does not match the column types of your database table. For example, product_line_id might be a varchar in your table but an Integer in your schema.
Try to redefine the schema in your tDBInput component.
You can do it like this:
Define your DB connection inside the metadata (under Db Connection).
Retrieve the schema from your database in order to import it automatically into Talend (see screenshot).
Use the retrieved schema in your component and propagate it to the whole job.

Returning count of updated rows when UPserting to a Postgres table using jOOQ

I am upserting some data to a Postgres table using jOOQ's insertInto() and onDuplicateKeyUpdate() methods. I want to know later how many duplicates were in my data and hence need to return if a row was inserted or updated.
From my postgres specific research so far, I found RETURNING (not MY_TABLE.xmax = 0) AS updated to be a valid option. However, the auto-generated Java table classes from jOOQ don't seem to give me access to the system columns of postgres like xmax.
Here is my query so far:
dsl.insertInto(MY_TABLE)
.columns(
// pkey columns
MY_TABLE.SHIFT,
MY_TABLE.DATE_UTC,
MY_TABLE.TIME_UTC,
MY_TABLE.DURATION,
)
.values(
shiftId,
utcDateId,
utcTime,
duration
)
.onDuplicateKeyUpdate()
.set(MY_TABLE.DURATION, newDuration)
.returning((MY_TABLE.xmax = 0).`as`("inserted"))
.execute()
This causes the following compile time error:
Error: Kotlin: Unresolved reference: XMAX
I have rechecked my Maven jOOQ table generation configuration and I am not excluding any columns. I have also read through everything I could find on jOOQ's own website but found no useful information for this specific use-case.
Any tips on what I could do here?
In this case, you should use jOOQ's plain SQL templating. Specifically, look at the DSL.field() method. Something like this: field("my_table.xmax", int.class).eq(0). That is Java syntax; from Kotlin, the class argument would be written Int::class.java.