Spark MongoDB Connector unable to df.join - Unspecialised MongoConfig - mongodb

Using the latest MongoDB connector for Spark (v10) and trying to join two dataframes yields the following unhelpful error.
Py4JJavaError: An error occurred while calling o64.showString.
: java.lang.UnsupportedOperationException: Unspecialised MongoConfig. Use `mongoConfig.toReadConfig()` or `mongoConfig.toWriteConfig()` to specialize
at com.mongodb.spark.sql.connector.config.MongoConfig.getDatabaseName(MongoConfig.java:201)
at com.mongodb.spark.sql.connector.config.MongoConfig.getNamespace(MongoConfig.java:196)
at com.mongodb.spark.sql.connector.MongoTable.name(MongoTable.java:99)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation.name(DataSourceV2Relation.scala:66)
at org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown$$anonfun$pushDownFilters$1.$anonfun$applyOrElse$2(V2ScanRelationPushDown.scala:65)
The PySpark code simply pulls in two collections and runs a join:
dfa = spark.read.format("mongodb").option("uri", "mongodb://127.0.0.1/people.contacts").load()
dfb = spark.read.format("mongodb").option("uri", "mongodb://127.0.0.1/people.accounts").load()
dfa.join(dfb, 'PKey').count()
SQL gives the same error:
dfa.createOrReplaceTempView("usr")
dfb.createOrReplaceTempView("ast")
spark.sql("SELECT count(*) FROM ast JOIN usr on usr._id = ast._id").show()
Document structures are flat.

Have you tried using the latest version (10.0.2) of mongo-spark-connector? You can find it here.
I had a similar problem and solved it by replacing 10.0.1 with 10.0.2.
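For reference, a minimal PySpark sketch of pinning the connector at session-build time (the Maven coordinate is an assumption; check Maven Central for the exact artifact name and Scala-version suffix that match your Spark build):
from pyspark.sql import SparkSession

# spark.jars.packages must be set before the JVM/session is started in this process;
# adjust the coordinate below to the artifact that matches your Spark/Scala build.
spark = (
    SparkSession.builder
    .appName("mongo-join")
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector:10.0.2")
    .getOrCreate()
)

dfa = spark.read.format("mongodb").option("uri", "mongodb://127.0.0.1/people.contacts").load()
dfb = spark.read.format("mongodb").option("uri", "mongodb://127.0.0.1/people.accounts").load()
dfa.join(dfb, "PKey").count()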

Related

Import MongoDB data into Hive Error: Splitter implementation is incompatible

I'm trying to import MongoDB data into Hive.
The JAR versions that I have used are:
ADD JAR /root/HDL/mongo-java-driver-3.4.2.jar;
ADD JAR /root/HDL/mongo-hadoop-hive-2.0.2.jar;
ADD JAR /root/HDL/mongo-hadoop-core-2.0.2.jar;
And my cluster versions are
Ambari - Version 2.6.0.0, HDFS 2.7.3, Hive 1.2.1000, HBase 1.1.2, Tez 0.7.0
MongoDB Server version:- 3.6.5
Hive Script:-
CREATE TABLE sampletable
( ID STRING,
EmpID STRING,
BeginDate DATE,
EndDate DATE,
Time TIMESTAMP,
Type STRING,
Location STRING,
Terminal STRING)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"ID":"_id","EmpID":"emp_id","BeginDate":"begin_date","EndDate":"end_date","Time":"time","Type":"time_event_type","Location":"location","Terminal":"terminal"}')
TBLPROPERTIES('mongo.uri'='mongodb://username:password@10.10.170.43:27017/testdb.testtable');
Output:-
hive> select * from sampletable;
OK
Failed with exception java.io.IOException:java.io.IOException: Failed to aggregate sample documents. Note that this Splitter implementation is incompatible with MongoDB versions prior to 3.2.
Please suggest how I can solve this.
Thanks,
Mohan V
set mongo.input.split_size=50;
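For example, mongo.input.split_size is a mongo-hadoop input property, so it can be set in the same Hive session before re-running the query from the question:
set mongo.input.split_size=50;
select * from sampletable;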

transform_geom: couldn't parse proj4 output string: projection not named

I recently upgraded my Amazon PostgreSQL RDS instance to version 10.3, but while fetching the projections I am getting this error:
ERROR: transform_geom: couldn't parse proj4 output string: '3857': projection not named
CONTEXT: SQL function "st_transform" statement 1
I was able to fetch the same records on the previous version (9.5.x).
My PostGIS version is 2.4.2, which is compatible with the RDS instance.
I faced what is perhaps the same problem after upgrading from PostGIS 2.2 to 2.3: some of my queries no longer worked.
Old query:
SELECT ST_X(ST_TRANSFORM(ST_SETSRID(ST_MAKEPOINT($1,$2),$3),$4));
query-params $1...$4:
602628,6663367,3857,3857
error message:
"transform_geom: couldn't parse proj4 output string: '3857': projection not named"
Reason:
ST_TRANSFORM comes in multiple flavours, two of them:
public.st_transform(geometry, integer)
public.st_transform(geometry, text)
The latter, which I assume is new in PostGIS 2.3, caused my problem, because $4 (3857) was treated as a (proj4) string rather than an (SRID) integer.
Workaround in my case: a type hint for param $4:
SELECT ST_X(ST_TRANSFORM(ST_SETSRID(ST_MAKEPOINT($1,$2),$3),$4::int));
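If you want to confirm which overloads are installed in your database, a quick check against the standard PostgreSQL catalog (not PostGIS-specific) is:
SELECT oid::regprocedure AS signature
FROM pg_proc
WHERE proname = 'st_transform';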

Compile error when using Spark to save data to MongoDB

I'm trying to read data from HBase and write it to MongoDB; my code in Scala is as follows:
mongoConfig.set("mongo.output.uri", "mongodb://node1:57017/sampledb.sample")
mongoConfig.set("mongo.output.format","com.mongodb.hadoop.MongoOutputFormat")
val documents:RDD[Map[Object, BasicBSONObject]] = newRDD1.map(f => convert2BSON(f))
documents.saveAsNewAPIHadoopFile("file:///test",classOf[Object],classOf[BSONObject],
classOf[BSONFileOutputFormat[Object, BSONObject]],sparkConf)
But I get the following compile error:
error: value saveAsNewAPIHadoopFile is not a member of
org.apache.spark.rdd.RDD[Map[Object,org.bson.BasicBSONObject]]
I'm using Spark 1.5.2, mongo-hadoop-core 1.5.2, and mongo-java-driver 3.2.0.
My MongoDB version is 3.2.5.

SAP HANA Vora failing to see the table contents in Scala

Failing to see the data in Scala using Vora.
VORA: 1.2
Spark: 1.5.2 / Spark controller: 1.5.8
The HDFS file content shows up fine:
hdfs dfs -cat /user/vora/XXXXXXXX/part-00000
AB05,560
CD06,340
EF07,590
GH08,230
The table shows up fine in the SHOW DATASOURCETABLES command:
scala> vc.sql(s"""SHOW DATASOURCETABLES USING com.sap.spark.vora""".stripMargin ).show
Output
Selecting from the table fails in Scala:
scala> vc.sql("select * from VVCSV").show
java.lang.RuntimeException: Table Not Found: VVCSV
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:139)
at org.apache.spark.sql.extension.ExtendableSQLContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(ExtendableSQLContext.scala:52)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
at scala.Option.getOrElse(Option.scala:120)
The SHOW DATASOURCETABLES command is deprecated in Vora 1.2 and has been replaced by SHOW TABLES USING com.sap.spark.vora. However, that command only shows what is persisted in the Vora catalog. To load the tables into the current Spark context (e.g. after restarting the spark-shell) you need to run the register tables command:
vc.sql("register all tables using com.sap.spark.vora").show
To check what is in the current Spark context you can use the show tables command (without the 'using' clause). For more details you can check the Vora Developer Guide and the Spark documentation.
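For example, in the same spark-shell session (reusing the vc context and table name from the question):
vc.sql("register all tables using com.sap.spark.vora").show
vc.sql("show tables").show
vc.sql("select * from VVCSV").show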

HiveContext setting in scala+spark project to access existing HDFS

I am trying to access my existing Hadoop setup from my Spark + Scala project.
Spark Version 1.4.1
Hadoop 2.6
Hive 1.2.1
From the Hive console I am able to create a table and access it without any issue; I can also see the same table from the Hadoop URL.
The problem is that when I try to create a table from the project, the system shows this error:
ERROR Driver: FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:file:/user/hive/warehouse/src is not a directory
or unable to create one)
Following is the code I wrote:
Imports:
import org.apache.spark._
import org.apache.spark.sql.hive._
Code:
val sparkContext = new SparkContext("local[2]", "HiveTable")
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
hiveContext.setConf("hive.metastore.warehouse.dir", "hdfs://localhost:54310/user/hive/warehouse")
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
Edit:
Instead of CREATE TABLE, what if I have to execute an INSERT statement like:
hiveContext.sql("INSERT INTO TABLE default.src SELECT 'username','password' FROM foo;")
Any help to resolve this issue would be highly appreciated.