Apache Kudu with Apache Spark NoSuchMethodError: exportAuthenticationCredentials - scala

I have this function with Spark and Scala:
import org.apache.kudu.client.CreateTableOptions
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import org.apache.kudu.spark.kudu._
def save(df: DataFrame): Unit = {
  val kuduContext: KuduContext = new KuduContext("quickstart.cloudera:7051")
  kuduContext.createTable(
    "test_table", df.schema, Seq("anotheKey", "id", "date"),
    new CreateTableOptions().setNumReplicas(1))
  kuduContext.upsertRows(df, "test_table")
}
But creating the KuduContext raises an exception:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.kudu.client.KuduClient.exportAuthenticationCredentials()[B
at org.apache.kudu.spark.kudu.KuduContext.<init>(KuduContext.scala:63)
at com.mypackge.myObject$.save(myObject.scala:24)
at com.mypackge.myObject$$anonfun$main$1.apply$mcV$sp(myObject.scala:59)
at com.mypackge.myObject$$anonfun$main$1.apply(myObject.scala:57)
at com.mypackge.myObject$$anonfun$main$1.apply(myObject.scala:57)
at com.mypackge.myObject$.time(myObject.scala:17)
at com.mypackge.myObject$.main(myObject.scala:57)
at com.mypackge.myObject.main(myObject.scala)
Spark works without any problem. I installed the Kudu VM as described in the official docs, and I can log in to the Impala instance from bash without a problem.
Does anyone have an idea of what I am doing wrong?

The problem was a project dependency pulling in an old version of kudu-client (1.2.0) while I was using kudu-spark 1.3.0 (which includes kudu-client 1.3.0). Excluding the old kudu-client in pom.xml was the solution.
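For reference, the exclusion in pom.xml looks roughly like this; the groupId/artifactId of the offending dependency are hypothetical placeholders for whichever artifact was dragging in kudu-client 1.2.0:
<dependency>
  <!-- hypothetical placeholder: the dependency that pulled in kudu-client 1.2.0 -->
  <groupId>some.group</groupId>
  <artifactId>offending-artifact</artifactId>
  <version>x.y.z</version>
  <exclusions>
    <exclusion>
      <!-- drop the stale transitive client so kudu-spark's 1.3.0 client is used -->
      <groupId>org.apache.kudu</groupId>
      <artifactId>kudu-client</artifactId>
    </exclusion>
  </exclusions>
</dependency>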

Related

NoSuchMethodError in google dataproc cluster for excel files

While consuming an Excel file in a Dataproc cluster, I am getting java.lang.NoSuchMethodError.
Note: the schema gets printed, but not the actual data.
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling o74.showString. : java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
at com.crealytics.spark.excel.ExcelRelation.buildScan(ExcelRelation.scala:74)
Code:
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
from google.cloud import storage
from google.cloud import bigquery
import pyspark

client = storage.Client()
bucket_name = "test_bucket"
path = f"gs://{bucket_name}/test_file.xlsx"

def make_spark_session(app_name, jars=[]):
    configuration = (SparkConf()
                     .set("spark.jars", ','.join(jars)))
    spark = SparkSession.builder.appName(app_name) \
        .config(conf=configuration).getOrCreate()
    return spark

app_name = 'test_app'
jars = ['gs://bucket/spark-excel_2.11_uber-0.12.0.jar']
spark = make_spark_session(app_name, jars)

df = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .load(path)
df.show(1)
This appears to be a Scala version mismatch between your job jars and the cluster. Both Dataproc 1.5 and 2.0 come with Scala 2.12. The gs://bucket/spark-excel_2.11_uber-0.12.0.jar in your code appears to be built for Scala 2.11; you should use spark-excel_2.12_... instead. In addition, make sure your Spark application itself is also built with Scala 2.12.
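If in doubt, you can confirm the versions the cluster ships with a quick check from spark-shell on the cluster (a sketch); the Scala suffix in the connector jar name (_2.11 above) has to match:
// Scala version of the running spark-shell, i.e. the one Spark was built with
// (prints e.g. "version 2.12.x" on Dataproc 1.5/2.0).
println(scala.util.Properties.versionString)
// Spark version of the cluster.
println(org.apache.spark.SPARK_VERSION)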

Unable to connect with S3 and Spark Locally

I am trying to access S3 files from Spark locally, but I am getting this error:
Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: s3n://bucketname/folder
I am also passing the jars hadoop-aws-2.7.3.jar, aws-java-sdk-1.7.4.jar and hadoop-auth-2.7.1.jar when submitting the Spark job from the command line. Below is my code:
package org.test.snow

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.log4j._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.Utils
import org.apache.spark.sql._
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

object SnowS3 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("IDV4")
    val sc = new SparkContext(conf)
    val spark = new org.apache.spark.sql.SQLContext(sc)
    import spark.implicits._
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", "A*******************A")
    sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", "A********************A")
    val cus_1 = spark.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("s3a://tb-us-east/working/customer.csv")
    cus_1.show()
  }
}
Any help would be appreciated.
FYI: I am using spark 2.1
You shouldn't set that fs.s3a.impl option; it's a superstition that seems to persist in Spark examples.
Instead, use the S3A connector simply by using the s3a:// prefix, with consistent versions of the hadoop-* jars (yes, hadoop-aws-2.7.3 needs hadoop-common-2.7.3) and the S3A-specific authentication options, fs.s3a.access.key and fs.s3a.secret.key.
If that doesn't work, look at the S3A troubleshooting docs.
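A minimal sketch of that setup, reusing the code from the question (bucket, path and placeholder credentials are the asker's; only the configuration keys change):
package org.test.snow

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SnowS3 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("IDV4")
    val sc = new SparkContext(conf)
    val spark = new SQLContext(sc)
    // No fs.s3a.impl needed: the s3a:// scheme already selects the S3A connector.
    // Use the S3A-specific credential keys instead of fs.s3a.awsAccessKeyId/awsSecretAccessKey.
    sc.hadoopConfiguration.set("fs.s3a.access.key", "<your-access-key>")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "<your-secret-key>")
    val cus_1 = spark.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("s3a://tb-us-east/working/customer.csv")
    cus_1.show()
  }
}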

Spark-Scala with Cassandra

I am a beginner with Spark, Scala and Cassandra, working on ETL programming. My current ETL POC requires Spark, Scala and Cassandra. I configured Cassandra on my Ubuntu system under /usr/local/Cassandra/* and then installed Spark and Scala. I am now using the Scala editor to start my work: I can load a file into a landing location, but when I try to connect to Cassandra from Scala I cannot find any help on how to connect and process the data into the destination database.
Is this the correct way, or am I going wrong somewhere? Please help me understand how to achieve this with the above combination.
Thanks in advance!
Add spark-cassandra-connector to your pom or sbt by following its installation instructions, then work this way.
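For example, in build.sbt the dependency looks roughly like this (the version is an assumption; pick the release that matches your Spark and Scala versions):
// Cross-built for the project's Scala version via %%; the 2.0.x line targets Spark 2.x.
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.7"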
Import this in your file
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.cassandra._
Spark Scala file:
object SparkCassandraConnector {
  def main(args: Array[String]) {
    val conf = new SparkConf(true)
      .setAppName("UpdateCassandra")
      .setMaster("spark://spark:7077") // spark server
      .set("spark.cassandra.input.split.size_in_mb", "64") // split size is given in MB
      .set("spark.cassandra.connection.host", "192.168.3.167") // cassandra host
      .set("spark.cassandra.auth.username", "cassandra")
      .set("spark.cassandra.auth.password", "cassandra")
    // connecting with cassandra for spark and sql query
    val spark = SparkSession.builder()
      .config(conf)
      .getOrCreate()
    // Load data from the source table
    val df = spark
      .read
      .cassandraFormat("table_name", "keyspace_name")
      .load()
  }
}
This will work for Spark 2.2 and Cassandra 2; you can do this easily with spark-cassandra-connector.
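To write processed data back to Cassandra (the destination side of the ETL), a minimal sketch with the same connector, assuming the target table (a placeholder name below) already exists in the keyspace:
    // Write the loaded/transformed DataFrame to an existing Cassandra table.
    df.write
      .cassandraFormat("target_table_name", "keyspace_name")
      .mode("append")
      .save()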

Not able to load hive table into Spark

I am trying to load data from a Hive table using spark-sql, but it doesn't return anything. I tried executing the same query in Hive and it prints the result. Below is the code I am trying to execute in Scala.
sc.setLogLevel("ERROR")
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}
import org.apache.spark.sql.hive.HiveContext
import sqlContext.implicits._
val sqlContext = new HiveContext(sc)
sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")
val data = sqlContext.sql("select `websitename` from db1.table1 limit 10").toDF
Kindly let me know what the possible reason could be.
Spark version: 1.6.2
Scala version: 2.10
It depends on how the table was created in the first place. If it was created by an external application and you have Hive running as a separate service, make sure that the settings in SPARK_HOME/conf/hive-site.xml are correct.
If it's an internal spark-sql table, Spark sets up the metastore in a folder on the master node, which in your case might have been deleted or moved.
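A quick sanity check (a sketch) is to list what the HiveContext can actually see; if db1 or table1 does not show up here, Spark is not talking to the same metastore as your Hive CLI:
// List the databases and tables visible to this HiveContext.
sqlContext.sql("show databases").show()
sqlContext.sql("show tables in db1").show()
// Inspect the table location and properties as Spark resolves them.
sqlContext.sql("describe formatted db1.table1").show(100, false)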

Why does MongoDB Spark Connector fail with AbstractMethodError?

I am trying to insert a Spark SQL DataFrame into a remote MongoDB collection.
Previously I wrote a Java program with MongoClient to check whether the remote collection is accessible, and I was able to do so successfully.
My present Spark code is as below -
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@1a8b22b5
scala> val depts = sqlContext.sql("select * from test.user_details")
depts: org.apache.spark.sql.DataFrame = [user_id: string, profile_name: string ... 7 more fields]
scala> depts.write.options(scala.collection.Map("uri" -> "mongodb://<username>:<pwd>@<hostname>:27017/<dbname>.<collection>")).mode(SaveMode.Overwrite).format("com.mongodb.spark.sql").save()
This is giving the following error -
java.lang.AbstractMethodError: com.mongodb.spark.sql.DefaultSource.createRelation(Lorg/apache/spark/sql/SQLContext;Lorg/apache/spark/sql/SaveMode;Lscala/collection/immutable/Map;Lorg/apache/spark/sql/Dataset;)Lorg/apache/spark/sql/sources/BaseRelation;
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:429)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
... 84 elided
I also tried the following, which throws the error below:
scala> depts.write.options(scala.collection.Map("uri" -> "mongodb://<username>:<pwd>@<host>:27017/<database>.<collection>")).mode(SaveMode.Overwrite).save()
java.lang.IllegalArgumentException: 'path' is not specified
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$17.apply(DataSource.scala:438)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$17.apply(DataSource.scala:438)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.getOrElse(ddl.scala:117)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:437)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
... 58 elided
I have imported the following packages -
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.mongodb.casbah.{WriteConcern => MongodbWriteConcern}
import com.mongodb.spark.config._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql._
depts.show() works as expected, i.e. the DataFrame is created successfully.
Can someone please give me any advice or suggestions on this?
Thanks
Assuming that you are using MongoDB Spark Connector v1.0, you can save DataFrames from SQL as below:
// DataFrames SQL example
import com.mongodb.spark.MongoSpark
df.registerTempTable("temporary")
val depts = sqlContext.sql("select * from test.user_details")
depts.show()
// Save out the filtered DataFrame result
MongoSpark.save(depts.write.option("uri", "mongodb://hostname:27017/database.collection").mode("overwrite"))
For more information see MongoDB Spark Connector: Spark SQL
For a simple demo of MongoDB and Spark using docker see MongoDB Spark Docker: examples.scala - dataframes
Have a look at this error and think about where it could come from: it is due to a Spark version mismatch between the Spark Connector for MongoDB and the Spark version you use.
java.lang.AbstractMethodError: com.mongodb.spark.sql.DefaultSource.createRelation(Lorg/apache/spark/sql/SQLContext;Lorg/apache/spark/sql/SaveMode;Lscala/collection/immutable/Map;Lorg/apache/spark/sql/Dataset;)Lorg/apache/spark/sql/sources/BaseRelation;
Quoting the javadoc of java.lang.AbstractMethodError:
Thrown when an application tries to call an abstract method. Normally, this error is caught by the compiler; this error can only occur at run time if the definition of some class has incompatibly changed since the currently executing method was last compiled.
That pretty much explains what you experience (note the part that starts with "this error can only occur at run time").
My guess is that the part Lorg/apache/spark/sql/Dataset in the DefaultSource.createRelation method in the stack trace is exactly the culprit.
In other words, that line uses data: DataFrame, not Dataset, and the two are incompatible in this direction: DataFrame is simply a Scala type alias of Dataset[Row], but an arbitrary Dataset is not a DataFrame, hence the runtime error.
override def createRelation(sqlContext: SQLContext, mode: SaveMode, parameters: Map[String, String], data: DataFrame): BaseRelation
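For reference, this is roughly how Spark 2.x declares the alias (abridged from the org.apache.spark.sql package object); the practical fix is to use a MongoDB Spark Connector build compiled against the same Spark and Scala versions as your cluster:
package org.apache.spark

package object sql {
  // In Spark 2.x, DataFrame is only a type alias for Dataset[Row]; in Spark 1.x it
  // was a separate class, so a connector compiled against the old API exposes a
  // createRelation with a different JVM signature and breaks at run time.
  type DataFrame = Dataset[Row]
}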