Getting an exception while reading a multiline JSON file in Spark 2.0
val data = spark.read
.option("multiline",true)
.json("C:\\user\\Spark\\DataSets\\employees_multiLine.json")
Exception in thread "main" java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.()V from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:262)
at org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:51)
at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:51)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
Updating Hadoop to 2.7.2 or higher will resolve this.
This issue is explained in detail here: https://stackoverflow.com/a/36443787
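If you manage dependencies with sbt, a minimal sketch of pinning a newer hadoop-client next to Spark (the version numbers here are illustrative; match them to your cluster):
// build.sbt -- versions are illustrative, align them with your cluster
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.2.0",
  // Hadoop 2.7.2+ removes the Guava Stopwatch conflict hit by FileInputFormat.listStatus
  "org.apache.hadoop" % "hadoop-client" % "2.7.2"
)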
Related
I am trying to read a JSON file containing a list of objects and store it in a DataFrame; below is the code snippet:
val df = spark.read.option("multiLine",true).json(path_variable)
df.show()
Error:
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.refArrayOps(\[Ljava/lang/Object;)Ljava/lang/Object;
at scala.runtime.LazyVals$.\<clinit\>(LazyVals.scala:8)
I couldn't find the root cause of this issue.
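For what it's worth, scala.runtime.LazyVals belongs to the Scala 3 runtime, so this NoSuchMethodError typically means code compiled against Scala 3 (or 2.13) is running on a Spark build for a different Scala binary version. A build.sbt sketch that keeps the versions aligned (the version numbers are placeholders):
// build.sbt -- the Scala binary version must match your Spark distribution
scalaVersion := "2.12.17"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.2" % "provided"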
I get this error when using a Spark Dataset to write to S3:
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter not found
You need to get the matching spark-hadoop-cloud jar from your Spark release onto the Spark classpath; that's where the class lives.
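With sbt, for example, the dependency can be declared like this (the version must match your Spark release exactly; 3.3.2 is only a placeholder):
// build.sbt -- spark-hadoop-cloud is where BindingParquetOutputCommitter lives
libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % "3.3.2"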
I am invoking spark-shell like this
spark-shell --jars kafka-clients-0.10.2.1.jar,spark-sql-kafka-0-10_2.11-2.3.0.cloudera1.jar,spark-streaming-kafka-0-10_2.11-2.3.0.jar,spark-avro_2.11-2.4.0.jar,avro-1.9.1.jar
After that, I read from a Kafka topic using readStream():
val df = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "kafka-1.x:9093,kafka-2.x:9093,kafka-0.x:9093")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.ssl.protocol", "TLSv1.2")
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"token\" password=\"XXXXXXXXXX\";")
  .option("subscribe", "test-topic")
  .option("startingOffsets", "latest")
  .load()
Then I read the Avro schema file:
val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("/root/avro_schema.json")))
Then I build the DataFrame that matches the Avro schema:
val DataLineageDF = df.select(from_avro(col("value"),jsonFormatSchema).as("DataLineage")).select("DataLineage.*")
This throws an error:
java.lang.NoSuchMethodError: org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
I was able to fix this problem by replacing the jar spark-avro_2.11-2.4.0.jar with spark-avro_2.11-2.4.0-palantir.31.jar.
Issue:
DataLineageDF.writeStream.format("console").outputMode("append").trigger(Trigger.ProcessingTime("10 seconds")).start
This fails with the following error:
Exception in thread "stream execution thread for [id = ad836d19-0f29-499a-adea-57c6d9c630b2, runId = 489b1123-a2b2-48ea-9d24-e6744e0959b0]" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.boxedType(Lorg/apache/spark/sql/types/DataType;)Ljava/lang/String;
This seems to be related to incompatible jars. If anyone has any idea what's going wrong, please comment.
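When incompatible jars are the suspicion, one quick diagnostic you can paste into spark-shell (a sketch) is to print which jar each conflicting class was actually loaded from:
// Prints the jar that provided org.apache.avro.Schema
println(classOf[org.apache.avro.Schema].getProtectionDomain.getCodeSource.getLocation)
// Same check for the Spark class named in the stack trace
println(Class.forName("org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator")
  .getProtectionDomain.getCodeSource.getLocation)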
I'm trying to read CSV files from an S3 bucket located in the Mumbai region, using the DataStax DSE spark-submit.
I tried changing the hadoop-aws version to various other versions; currently the hadoop-aws version is 2.7.3.
spark.sparkContext.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", accessKeyId)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secretAccessKey)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val df = spark.read.csv("s3a://bucket_path/csv_name.csv")
Upon executing, the following is the error I get:
Exception in thread "main"
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400,
AWS Service: Amazon S3, AWS Request ID: 8C7D34A38E359FCE, AWS Error
Code: null, AWS Error Message: Bad Request at
com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at
com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at
com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at
com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at
com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92) at
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371) at
org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) at
org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:616)
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392) at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355) at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349)
at
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
at
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
Your signature V4 option is not being applied. See this.
Add the Java options when you run spark-submit or spark-shell:
spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
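For example, a sketch of the invocation (adapt it to your own job):
spark-submit \
  --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
  --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
  ...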
Or, set the system property such as:
System.setProperty("com.amazonaws.services.s3.enableV4", "true");
Thank you for all the help. I figured out from Lamanus's answer that the signature V4 option was not applied even after adding it via
spark.sparkContext.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
So I added the following lines, and now the code works perfectly fine.
import com.amazonaws.SDKGlobalConfiguration
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")
I have Spark code which connects to Netezza and reads a table.
conf = SparkConf().setAppName("app").setMaster("yarn-client")
sc = SparkContext(conf=conf)
hc = HiveContext(sc)
nz_df = hc.load(source="jdbc", url="<address>;dbname=<dbname>;username=<user>;password=<password>", dbtable="<table>")
I run the code with spark-submit in the following way:
spark-submit --jars nzjdbc.jar filename.py
And I get the following exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o55.load.
: java.sql.SQLException: No suitable driver
Am I doing anything wrong here? Is the jar not suitable, or is Spark not able to recognize it? Please let me know the correct way if this is not it, and can anyone provide a link to get the jar for connecting to Netezza from Spark?
I am using Spark version 1.6.0.
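A common cause of "No suitable driver" is that --jars ships the JDBC jar to the executors but does not put it on the driver's classpath, where DriverManager resolves the connection. A sketch of an invocation that covers both (org.netezza.Driver is the class name usually shipped in nzjdbc.jar, but verify it against your jar):
spark-submit --jars nzjdbc.jar --driver-class-path nzjdbc.jar filename.py
Passing the driver class explicitly in the JDBC options (a "driver" option alongside url and dbtable) can also help when DriverManager cannot locate it on its own.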