SparkR-Mongo connector query to subdocument - MongoDB

I am using the Mongo-Spark connector. All of the examples in the documentation (https://docs.mongodb.com/spark-connector/sparkR/) work fine, but when I test a query on a document which has subdocuments it fails; apparently SQL is not ready for this kind of query:
result <- sql(sqlContext, "SELECT DOCUMENT.SUBDOCUMENT FROM TABLE")
ERROR:
com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast INT32 into a ConflictType (value: BsonInt32{value=171609012})
at com.mongodb.spark.sql.MapFunctions$.com$mongodb$spark$sql$MapFunctions$$convertToDataType(MapFunctions.scala:79)
at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:38)
at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:36)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at com.mongodb.spark.sql.MapFunctions$.documentToRow(MapFunctions.scala:36)
at com.mongodb.spark.sql.MapFunctions$.castToStructType(MapFunctions.scala:108)
at com.mongodb.spark.sql.MapFunctions$.com$mongodb$spark$sql$MapFunctions$$convertToDataType(MapFunctions.scala:74)
Previously I registered the table as follows:
registerTempTable(schema, "TABLE")
I guess that the key problem is how to register a Mongo subdocument as a table.
Does anyone have a solution?

Solution: every field must have a consistent type across documents. I had values stored as String in some documents and as Double in others; for this reason the table could be registered but could not be processed.
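For reference, a quick way to spot that kind of mismatch is to inspect the schema the connector infers before registering the temp table. This is only a sketch following the pattern from the linked documentation; it assumes the connection settings are already in the Spark config, and "TABLE", "document" and "subdocument" are placeholder names:
# Sketch only: load the collection, inspect the inferred schema, then query it.
# Connection settings (spark.mongodb.input.uri, etc.) are assumed to be set in the Spark config.
df <- read.df(sqlContext, source = "com.mongodb.spark.sql.DefaultSource")
printSchema(df)   # fields whose BSON type varies across documents are the ones to fix
registerTempTable(df, "TABLE")
result <- sql(sqlContext, "SELECT document.subdocument FROM TABLE")
head(result)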

Related

Gorm Jsonb type stored as bytea

I'm using a locally hosted Postgres DB to test queries to a Postgres DB in production. The production database has an info field of type jsonb, and I'm trying to mimic this schema locally when using gorm's AutoMigrate. The model I've defined is below:
import "github.com/jinzhu/gorm/dialects/postgres"
type Event struct {
...
Info postgres.Jsonb
...
}
But when I query JSON attributes, e.g. stmt.Where("info->>'attr' = value"), I get the following error:
...
Message:"operator does not exist: bytea ->> unknown", Detail:"", Hint:"No operator matches the given name and argument type(s). You might need to add explicit type casts.",
...
This query works, however, in the production environment. It seems that the Info field is being stored as bytea instead of jsonb. I'm aware that I can do stmt.Where("encode(info, 'escape')::jsonb->>'attr' = value"), but I'd prefer to mimic the production environment more closely (if possible) rather than change the query to support these unit tests.
I've tried using type tags in the model (e.g. gorm:"type=jsonb") as well as defining my own JSON type implementing the valuer, scanner, and GormDataTypeInterface as suggested here. None of these approaches have automigrated the type as jsonb.
Is there any way to ensure AutoMigrate creates a table with type jsonb? Thanks!
I was facing the same problem: the Jsonb type is automigrated to bytea. I solved it by adding the tag gorm:"type:jsonb". A type tag is also mentioned in your question, but you're using gorm:"type=jsonb", which is not correct.
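For illustration, a minimal sketch of the corrected model and migration could look like the one below. The connection string, the gorm.Model embed (standing in for the elided fields) and the final query are placeholders, not part of the original question:
package main

import (
	"github.com/jinzhu/gorm"
	"github.com/jinzhu/gorm/dialects/postgres" // pulls in the Postgres driver and provides postgres.Jsonb
)

// Event mirrors the model from the question; note the tag uses a colon ("type:jsonb"), not "type=jsonb".
type Event struct {
	gorm.Model                // placeholder for the elided fields
	Info postgres.Jsonb `gorm:"type:jsonb"`
}

func main() {
	// placeholder connection string for the local test database
	db, err := gorm.Open("postgres", "host=localhost user=test dbname=test sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// with the corrected tag, AutoMigrate creates the info column as jsonb instead of bytea
	db.AutoMigrate(&Event{})

	// the production-style query now works against the local schema as well
	var events []Event
	db.Where("info->>'attr' = ?", "value").Find(&events)
}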

How to filter a Dataframe with information from other Dataframe using command filter

I have a big DataFrame with a lot of information from different devices, identified by their IDs. What I would like is to filter this DataFrame using the IDs that are in a second DataFrame. I know that I can easily do it with a join, but I would like to try it with filter.
Also, I'm trying it because I've read that filter is more efficient than join; could someone shed some light on that?
Thank you
I've tried this:
val DfFiltered = DF1.filter(col("Id").isin(DF2.rdd.map(r => r(0)).collect()))
But I get the following error:
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Unsupported component type class java.lang.Object in arrays;
=== Streaming Query ===
Identifier: [id = 0d89d684-d794-407d-a03c-feb3ad6a78c2, runId = b7b774c0-ce83-461e-ac26-7535d6d2b0ac]
Current Committed Offsets: {KafkaV2[Subscribe[MeterEEM]]: {"MeterEEM":{"0":270902}}}
Current Available Offsets: {KafkaV2[Subscribe[MeterEEM]]: {"MeterEEM":{"0":271296}}}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
Project [value2#21.meterid AS meterid#23]
+- Project [jsontostructs(StructField(meterid,StringType,true), cast(value#8 as string), Some(Europe/Paris)) AS value2#21]
+- StreamingExecutionRelation KafkaV2[Subscribe[MeterEEM]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: org.apache.spark.sql.AnalysisException: Unsupported component type class java.lang.Object in arrays;
at org.apache.spark.sql.catalyst.expressions.Literal$.componentTypeToDataType(literals.scala:117)
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:70)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:163)
at org.apache.spark.sql.functions$.typedLit(functions.scala:127)
at org.apache.spark.sql.functions$.lit(functions.scala:110)
at org.apache.spark.sql.Column$$anonfun$isin$1.apply(Column.scala:796)
at org.apache.spark.sql.Column$$anonfun$isin$1.apply(Column.scala:796)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.Column.isin(Column.scala:796)
at StreamingProcedure.MetersEEM$.meterEemCalcs(MetersEEM.scala:28)
at LaunchFunctions.LaunchMeterEEM$$anonfun$1.apply(LaunchMeterEEM.scala:23)
at LaunchFunctions.LaunchMeterEEM$$anonfun$1.apply(LaunchMeterEEM.scala:15)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:534)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:532)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:531)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
... 1 more
I've made the assumption that the data in the Id column is of an Integer datatype.
// assumes spark.implicits._ is in scope; collect the IDs to the driver, then pass them to isin as varargs
val list = DF2.select("Id").as[Int].collect()
val DfFiltered = DF1.filter($"Id".isin(list: _*))
DfFiltered.collect()
The book High Performance Spark explains that:
Joining data is an important part of many of our pipelines, and both Spark Core and SQL support the same fundamental types of joins. While joins are very common and powerful, they warrant special performance consideration as they may require large network transfers or even create datasets beyond our capability to handle.1 In core Spark it can be more important to think about the ordering of operations, since the DAG optimizer, unlike the SQL optimizer, isn’t able to re-order or push down filters.
So, choosing filter instead of join seems a good choice when the list of IDs is small enough to be collected to the driver.
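For comparison, the join mentioned in the question can be written as a left-semi join, which keeps only the rows of DF1 whose Id appears in DF2 without adding any of DF2's columns (names taken from the question; this is just a sketch of the alternative, and unlike the isin approach it does not require collecting the IDs to the driver):
// the join-based alternative: keep rows of DF1 whose Id exists in DF2
val DfJoined = DF1.join(DF2.select("Id"), Seq("Id"), "left_semi")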
You can simply add : _* to your code and it will work fine.
scala> val DfFiltered = DF1.filter(col("Id").isin(DF2.rdd.map(r => r(0)).collect(): _*))
scala> DfFiltered.show()

Int to Date conversion error without any changes happening in data model in Talend

I had the ETL working until two days ago when I started receiving:
Exception in component tDBInput_1 (test4) java.sql.SQLDataException:
Unsupported conversion from TIMESTAMP to java.lang.Integer
I'm clueless as to what happened here.
It looks like the schema in your tDBInput_1 component does not match the column types of your database table. For example, product_line_id is a varchar in your table but an Integer in your schema.
Try redefining the schema in your tDBInput component.
You can do it like this:
Define your DB connection in the metadata (under Db Connections).
Retrieve the schema from your database so that it is imported automatically into Talend (see screenshot).
Use the retrieved schema in your component and propagate it to the whole job.

Hive crashing on where clause

I am trying to get a Hive-Hadoop-Mongo setup to work. I have imported the data into MongoDB from a JSON file, and then I created both internal and external tables in Hive that connect to Mongo:
CREATE EXTERNAL TABLE reviews(
user_id STRING,
review_id STRING,
stars INT,
date1 STRING,
text STRING,
type STRING,
business_id STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"date1":"date"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/test.reviews');
This part works fine, because a select-all query (select * from reviews) outputs everything as it should. But when I run one with a where clause (select * from reviews where stars=4, for example), Hive crashes.
I add the following jars when I start up Hive:
add jar mongo-hadoop.jar;
add jar mongo-java-driver-3.3.0.jar;
add jar mongo-hadoop-hive-2.0.1.jar;
And if it is relevant in any sense, I am using Amazon's EMR cluster for this, and I'm connected through ssh.
Thanks for all the help
Here is the error hive throws out:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.exec.Utilities.deserializeExpression(Ljava/lang/String;)Lorg/apache/hadoop/hive/ql/plan/ExprNodeGenericFuncDesc;
at com.mongodb.hadoop.hive.input.HiveMongoInputFormat.getFilter(HiveMongoInputFormat.java:134)
at com.mongodb.hadoop.hive.input.HiveMongoInputFormat.getRecordReader(HiveMongoInputFormat.java:103)
at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:691)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:329)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:455)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:424)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:144)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1885)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:252)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Create the table like below and check.
CREATE [EXTERNAL] TABLE <tablename>
(<schema>)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
[WITH SERDEPROPERTIES('mongo.columns.mapping'='<JSON mapping>')]
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
[LOCATION '<path to existing directory>'];
Instead of using a StorageHandler to read, serialize, deserialize, and output the data from Hive objects to BSON objects, the individual components are specified separately. This is because using a StorageHandler has too many negative side effects when dealing with the native HDFS filesystem.
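Filled in for the reviews table from the question, that template would look roughly like this (the LOCATION path is a placeholder, and this approach assumes the collection has been dumped to BSON files on HDFS or S3 rather than read from a live mongod):
CREATE EXTERNAL TABLE reviews_bson(
user_id STRING,
review_id STRING,
stars INT,
date1 STRING,
text STRING,
type STRING,
business_id STRING
)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"date1":"date"}')
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
-- the LOCATION below is a placeholder path to the BSON dump
LOCATION '/path/to/reviews-bson-dump';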
I see
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"date1":"date"}')
and you are querying the column stars, which is not mapped.
I ran into this problem on our cluster.
The cluster's Hive version is higher than the Hive version mongo-hadoop-hive is built against (1.2.1).
The old method org.apache.hadoop.hive.ql.exec.Utilities.deserializeExpression has been renamed to org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpression.
You need to rebuild the jar yourself.

How to update a field which is indexed?

I want to update a field in Cassandra which is indexed, using the phantom Scala SDK, like this:
this.update.where(_.id eqs folderId)
.and(_.owner eqs owner)
.modify(_.parent setTo parentId)
The parent field is an indexed field in the table, but the operation is not allowed when compiling the code; there is a compile exception like:
[error] C:\User\src\main\scala\com\autodesk\platform\columbus\cassandra\DataItem.scala:161: could not find implicit value for evidence parameter of type com.websudos.phantom.column.ModifiableColumn[T]
The error is caused by updating a field which is indexed.
My workaround is to delete the record and insert a new one in order to "update" it.
Is there a better way to handle this situation?
You are not allowed to update a field that is part of the primary key, because if you do so you render Cassandra unable to ever re-compose the hash of the row you are updating.
Read here for details on the topic. In essence, if you had a HashMap[K, V] what you are trying to do is update the K, but in doing so you will never be able to retrieve the same V again.
So in Cassandra, just like in the HashMap, an update to an index is done with a DELETE and then a new INSERT. That's why phantom intentionally prevents your query from compiling; I wrote those compile-time restrictions in for the specific purpose of preventing invalid CQL.
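For what it's worth, a sketch of that delete-then-insert workaround in phantom's DSL could look like the following. Field names are taken from the question, every remaining column of the row would have to be re-supplied in the insert, and an implicit session/ExecutionContext is assumed to be in scope:
// Sketch only: "update" an indexed column by replacing the whole row.
for {
  _ <- this.delete
    .where(_.id eqs folderId)
    .and(_.owner eqs owner)
    .future()
  _ <- this.insert
    .value(_.id, folderId)
    .value(_.owner, owner)
    .value(_.parent, parentId) // the new parent value
    // ...plus .value(...) calls for every other column of the table
    .future()
} yield ()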