Why does MongoDB Spark Connector fail with AbstractMethodError?

I am trying to insert a Spark SQL DataFrame into a remote MongoDB collection.
Previously I wrote a Java program with MongoClient to check whether the remote collection is accessible, and I was able to do so successfully.
My current Spark code is below:
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext#1a8b22b5
scala> val depts = sqlContext.sql("select * from test.user_details")
depts: org.apache.spark.sql.DataFrame = [user_id: string, profile_name: string ... 7 more fields]
scala> depts.write.options(scala.collection.Map("uri" -> "mongodb://<username>:<pwd>@<hostname>:27017/<dbname>.<collection>")).mode(SaveMode.Overwrite).format("com.mongodb.spark.sql").save()
This gives the following error:
java.lang.AbstractMethodError: com.mongodb.spark.sql.DefaultSource.createRelation(Lorg/apache/spark/sql/SQLContext;Lorg/apache/spark/sql/SaveMode;Lscala/collection/immutable/Map;Lorg/apache/spark/sql/Dataset;)Lorg/apache/spark/sql/sources/BaseRelation;
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:429)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
... 84 elided
I also tried the following, which throws the error below:
scala> depts.write.options(scala.collection.Map("uri" -> "mongodb://<username>:<pwd>@<host>:27017/<database>.<collection>")).mode(SaveMode.Overwrite).save()
java.lang.IllegalArgumentException: 'path' is not specified
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$17.apply(DataSource.scala:438)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$17.apply(DataSource.scala:438)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.getOrElse(ddl.scala:117)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:437)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
... 58 elided
I have imported the following packages:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.mongodb.casbah.{WriteConcern => MongodbWriteConcern}
import com.mongodb.spark.config._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql._
depts.show() works as expected, i.e. the DataFrame is created successfully.
Can someone please give me any advice or suggestions on this?
Thanks

Assuming that you are using MongoDB Spark Connector v1.0, you can save a DataFrame from Spark SQL as below:
// DataFrames SQL example
import com.mongodb.spark._
df.registerTempTable("temporary")
val depts = sqlContext.sql("select * from temporary")
depts.show()
// Save out the filtered DataFrame result
MongoSpark.save(depts.write.option("uri", "mongodb://hostname:27017/database.collection").mode("overwrite"))
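To sanity-check the write, the same data source can be used to read the collection back. This is only a sketch; the uri is the same placeholder as above:
// Hedged sketch: read the saved collection back to verify the write.
val readBack = sqlContext.read
  .format("com.mongodb.spark.sql")
  .option("uri", "mongodb://hostname:27017/database.collection")
  .load()
readBack.show(5)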
For more information see MongoDB Spark Connector: Spark SQL
For a simple demo of MongoDB and Spark using docker see MongoDB Spark Docker: examples.scala - dataframes

Have a look at the error and think about what could cause it: it is due to a Spark version mismatch between the Spark Connector for MongoDB and the Spark version you use.
java.lang.AbstractMethodError: com.mongodb.spark.sql.DefaultSource.createRelation(Lorg/apache/spark/sql/SQLContext;Lorg/apache/spark/sql/SaveMode;Lscala/collection/immutable/Map;Lorg/apache/spark/sql/Dataset;)Lorg/apache/spark/sql/sources/BaseRelation;
Quoting the javadoc of java.lang.AbstractMethodError:
Thrown when an application tries to call an abstract method. Normally, this error is caught by the compiler; this error can only occur at run time if the definition of some class has incompatibly changed since the currently executing method was last compiled.
That pretty much explains what you are experiencing (note the part that starts with "this error can only occur at run time").
My guess is that the Lorg/apache/spark/sql/Dataset part of the DefaultSource.createRelation signature in the stack trace is exactly the culprit.
In other words, the connector on your classpath was compiled against a Spark version whose createRelation takes data: DataFrame, while the Spark version you are running declares that parameter as Dataset. DataFrame is simply a Scala type alias for Dataset[Row], but an arbitrary Dataset is not a DataFrame, so the compiled signatures no longer match and you get the runtime error.
override def createRelation(sqlContext: SQLContext, mode: SaveMode, parameters: Map[String, String], data: DataFrame): BaseRelation
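The practical fix is to put a connector build on the classpath that matches the Spark version you run (for example via spark-shell --packages with an artifact version built for your Spark/Scala release) and retry the write. A minimal sketch, reusing the placeholders from the question:
// Sketch only: assumes a mongo-spark-connector version matching your Spark release is on the classpath.
import org.apache.spark.sql.SaveMode

depts.write
  .format("com.mongodb.spark.sql")
  .option("uri", "mongodb://<username>:<pwd>@<hostname>:27017/<dbname>.<collection>")
  .mode(SaveMode.Overwrite)
  .save()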

Related

How to read 1M records from Elasticsearch into PySpark?

I have a problem reading data from Elasticsearch into a Spark cluster (I'm using the Zeppelin environment, so all connection settings are configured in the Zeppelin interpreter settings).
First, I have tried to read it with PySpark:
%pyspark
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
df = spark.read.format("org.elasticsearch.spark.sql").load("index")
df = df.limit(100).drop('tags').drop('a.b')
# If the 'tags' field is not dropped, PySpark cannot map the Scala field and throws an exception.
# If the limit is not set, PySpark will probably try to fetch the whole index at once.
# If "a.b" is not dropped, the dot in the field name causes a mapping error: https://github.com/elastic/elasticsearch-hadoop/issues/853
df = df.cache()
z.show(df)
Unfortunately, in this case I run into many mapping issues. Because I have a lot of fields containing dots in the dataset, I decided to try Scala to read the data (in order to process it in PySpark later):
%spark
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark
import org.elasticsearch.spark.sql
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Encoder
val conf = new SparkConf()
conf.set("spark.es.mapping.date.rich", "false");
conf.set("spark.serializer", classOf[KryoSerializer].getName)
val EsReadRDD = sc.esRDD("index")
However, even with Scala I can only retrieve small numbers of records, for example:
EsReadRDD.take(10).foreach(println)
For some reason, collect() does not work:
val esdf = EsReadRDD.collect() // does not work, probably because the data is too large
The error is:
Job aborted due to stage failure: Task 0 in stage 833.0 failed 4 times, most recent failure: Lost task 0.3 in stage 833.0 (TID 479, 10.10.11.37, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
I have also tried conversion to DF, but get an error:
val esdf = EsReadRDD.toDF()
java.lang.UnsupportedOperationException: No Encoder found for scala.AnyRef
- map value class: "java.lang.Object"
- field (class: "scala.collection.Map", name: "_2")
- root class: "scala.Tuple2"
Do you have any idea how to deal with this?
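One approach worth trying, sketched under the assumption that your elasticsearch-hadoop version supports the es.read.field.exclude option: exclude the problematic fields at read time through the DataFrame reader, and keep the processing distributed instead of calling collect() on the driver.
// Hedged sketch: exclude the fields that break the mapping instead of dropping them after the load.
// "tags" and "a.b" are the field names from the question; "index" is the index name used above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.read.field.exclude", "tags,a.b")
  .option("es.mapping.date.rich", "false")
  .load("index")

df.limit(100).show()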

Unable to create dataframe using SQLContext object in spark2.2

I am using Spark 2.2 on Microsoft Windows 7. I want to load a CSV file into a variable to perform SQL-related actions on it later, but I have been unable to do so. I referred to the accepted answer from this link, but it was of no use. I followed the steps below to create the SparkContext and SQLContext objects:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val sc=SparkContext.getOrCreate() // Creating spark context object
val sqlContext = new org.apache.spark.sql.SQLContext(sc) // Creating SQL object for query related tasks
The objects are created successfully, but when I execute the code below it throws an error which I can't post here.
val df = sqlContext.read.format("csv").option("header", "true").load("D://ResourceData.csv")
And when I try something like df.show(2) it says that df was not found. I tried the Databricks solution for loading CSV from the attached link; it downloads the packages but doesn't load the CSV file. How can I rectify my problem? Thanks in advance :)
I solved my problem of loading a local file into a dataframe using Spark 1.6 in the Cloudera VM with the help of the code below:
1) sudo spark-shell --jars /usr/lib/spark/lib/spark-csv_2.10-1.5.0.jar,/usr/lib/spark/lib/commons-csv-1.5.jar,/usr/lib/spark/lib/univocity-parsers-1.5.1.jar
2) val df1 = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("treatEmptyValuesAsNulls", "true" ).option("parserLib", "univocity").load("file:///home/cloudera/Desktop/ResourceData.csv")
NOTE: sc and sqlContext variables are automatically created
But there are many improvements in the latest version, i.e. 2.2.1, which I am unable to use because metastore_db doesn't get created on Windows 7. I'll post a new question regarding that.
With reference to your comment that you are able to access the SparkSession variable, follow the steps below to process your csv file using Spark SQL.
Spark SQL is a Spark module for structured data processing.
There are two main abstractions, Dataset and DataFrame:
A Dataset is a distributed collection of data.
A DataFrame is a Dataset organized into named columns.
In the Scala API, DataFrame is simply a type alias of Dataset[Row].
With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
You have a csv file and you can simply create a dataframe by doing one of the following:
From your spark-shell using the SparkSession variable spark:
val df = spark.read
.format("csv")
.option("header", "true")
.load("sample.csv")
After reading the file into a DataFrame, you can register it as a temporary view:
df.createOrReplaceTempView("foo")
SQL statements can then be run using the sql method provided by Spark:
val fooDF = spark.sql("SELECT name, age FROM foo WHERE age BETWEEN 13 AND 19")
You can also query that file directly with SQL:
val df = spark.sql("SELECT * FROM csv.`file:///path to the file/`")
Make sure you run Spark in local mode when loading data from a local file, or else you will get an error. The error occurs when the HADOOP_CONF_DIR environment variable is set, in which case Spark expects "hdfs://..." paths rather than "file://..." paths.
Set your spark.sql.warehouse.dir (default: ${system:user.dir}/spark-warehouse).
.config("spark.sql.warehouse.dir", "file:///C:/path/to/my/")
It is the default location of the Hive warehouse directory (using Derby) for managed databases and tables. Once you set the warehouse directory, Spark will be able to locate your files, and you can load the csv.
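Putting those pieces together, here is a minimal sketch of a local-mode session with an explicit warehouse directory; the Windows paths are placeholders for your environment:
import org.apache.spark.sql.SparkSession

// Minimal sketch; adjust the placeholder paths before running.
val spark = SparkSession.builder()
  .appName("csv-example")
  .master("local[*]")  // local mode, so file:/// paths resolve correctly
  .config("spark.sql.warehouse.dir", "file:///C:/path/to/my/spark-warehouse")
  .getOrCreate()

val df = spark.read
  .format("csv")
  .option("header", "true")
  .load("file:///C:/path/to/ResourceData.csv")

df.createOrReplaceTempView("resource_data")
spark.sql("SELECT * FROM resource_data").show(2)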
Reference : Spark SQL Programming Guide
Spark 2.2.0 has built-in support for CSV.
In your spark-shell, run the following code:
val df= spark.read
.option("header","true")
.csv("D:/abc.csv")
df: org.apache.spark.sql.DataFrame = [Team_Id: string, Team_Name: string ... 1 more field]

HiveContext - unable to access hbase table mapped in hive as external table

I am trying to access an HBase table mapped in Hive using HiveContext in Spark, but I am getting a ClassNotFoundException. Below is my code.
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("select * from dbn.hvehbasetable")
I am getting the error below:
17/06/22 07:17:30 ERROR log: error in initSerDe:
java.lang.ClassNotFoundException Class
org.apache.hadoop.hive.hbase.HBaseSerDe not found
java.lang.ClassNotFoundException: Class
org.apache.hadoop.hive.hbase.HBaseSerDe not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2120)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:385)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$3.apply(ClientWrapper.scala:342)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$3.apply(ClientWrapper.scala:337)
at scala.Option.map(Option.scala:145)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:337)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:332)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:290)
at org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:237)
Can anyone tell me which class I need to import to read the HBase tables?
I think you need to add the hive-hbase-handler jar to the classpath/auxpath if you haven't done that already.
Get your version from here.
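A hedged sketch of what that looks like in practice; the jar paths and versions are placeholders that depend on your distribution:
// Assumes spark-shell was started with the handler and HBase client jars on the classpath, e.g.:
//   spark-shell --jars /path/to/hive-hbase-handler-<version>.jar,/path/to/hbase-client-<version>.jar
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("select * from dbn.hvehbasetable")
df.show(10)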
Let me know if this helps. Cheers.

Not able to load hive table into Spark

I am trying to load data from a Hive table using Spark SQL; however, it doesn't return anything. I tried executing the same query in Hive and it prints out the result. Below is the code I am trying to execute in Scala.
sc.setLogLevel("ERROR")
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")
val data = sqlContext.sql("select `websitename` from db1.table1 limit 10").toDF
Kindly let me know what the possible reason could be.
Spark version: 1.6.2
Scala version: 2.10
It depends on how the table was created in the first place. If it was created by an external application and you have Hive running as a separate service, make sure the settings in SPARK_HOME/conf/hive-site.xml are correct.
If it's an internal Spark SQL table, Spark sets up the metastore in a folder on the master node, which in your case might have been deleted or moved.
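As a quick sanity check (a hedged sketch, using the database and table names from the question), you can first ask the HiveContext what it actually sees before querying the data:
// Hedged sketch: verify that this HiveContext is talking to the metastore you expect.
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
sqlContext.sql("show databases").show(100, false)
sqlContext.sql("show tables in db1").show(100, false)
sqlContext.sql("describe formatted db1.table1").collect().foreach(println)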

Apache Kudu with Apache Spark NoSuchMethodError: exportAuthenticationCredentials

I have this function with Spark and Scala:
import org.apache.kudu.client.CreateTableOptions
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import org.apache.kudu.spark.kudu._
def save(df: DataFrame): Unit = {
  val kuduContext: KuduContext = new KuduContext("quickstart.cloudera:7051")
  kuduContext.createTable(
    "test_table", df.schema, Seq("anotheKey", "id", "date"),
    new CreateTableOptions()
      .setNumReplicas(1))
  kuduContext.upsertRows(df, "test_table")
}
But trying to create the KuduContext raises an exception:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.kudu.client.KuduClient.exportAuthenticationCredentials()[B
at org.apache.kudu.spark.kudu.KuduContext.<init>(KuduContext.scala:63)
at com.mypackge.myObject$.save(myObject.scala:24)
at com.mypackge.myObject$$anonfun$main$1.apply$mcV$sp(myObject.scala:59)
at com.mypackge.myObject$$anonfun$main$1.apply(myObject.scala:57)
at com.mypackge.myObject$$anonfun$main$1.apply(myObject.scala:57)
at com.mypackge.myObject$.time(myObject.scala:17)
at com.mypackge.myObject$.main(myObject.scala:57)
at com.mypackge.myObject.main(myObject.scala)
Spark works without any problem. I have installed the Kudu VM as described in the official docs and I have logged in from bash to the Impala instance without a problem.
Does anyone have any idea what I am doing wrong?
The problem was that a dependency of the project pulled in an old version of kudu-client (1.2.0), while I was using kudu-spark 1.3.0 (which includes kudu-client 1.3.0). Excluding the old kudu-client in pom.xml was the solution.
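For reference, a rough sbt equivalent of that Maven exclusion; the exact kudu-spark artifact name depends on your Spark/Scala version, and "com.example" stands in for whichever dependency was dragging in the old kudu-client:
// Hedged sketch (sbt): keep only the kudu-client that kudu-spark itself brings in.
libraryDependencies ++= Seq(
  "org.apache.kudu" % "kudu-spark2_2.11" % "1.3.0",
  "com.example" % "legacy-lib" % "1.0.0" exclude("org.apache.kudu", "kudu-client")
)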