I'm trying to read multiple Snappy-compressed Parquet files from an Aliyun OSS server, but I get an error when I use
spark.read.parquet("oss://folder")
This is the error I get:
Py4JJavaError: An error occurred while calling o783.text.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.aliyun.emr.fs.oss.JindoOssFileSystem not found
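The ClassNotFoundException suggests that the jar providing com.aliyun.emr.fs.oss.JindoOssFileSystem (the JindoFS/Jindo OSS SDK) is not on the Spark classpath. Below is a minimal PySpark sketch of one way to wire it up; the jar path is a placeholder, and the fs.oss.impl setting simply follows the standard Hadoop fs.<scheme>.impl convention, mapping the oss:// scheme to the class named in the error. OSS endpoint and credential settings are omitted here.

from pyspark.sql import SparkSession

# "/path/to/jindofs-sdk.jar" is a placeholder for the actual JindoFS SDK jar.
spark = SparkSession \
    .builder \
    .appName("OssParquetRead") \
    .config("spark.jars", "/path/to/jindofs-sdk.jar") \
    .config("spark.hadoop.fs.oss.impl", "com.aliyun.emr.fs.oss.JindoOssFileSystem") \
    .getOrCreate()

# Once the class is resolvable, the original read should work.
df = spark.read.parquet("oss://folder")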
I am trying to read a list of objects from a JSON file and store it in a DataFrame; below is the code snippet.
val df = spark.read.option("multiLine",true).json(path_variable)
df.show()
Error:
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Ljava/lang/Object;
at scala.runtime.LazyVals$.<clinit>(LazyVals.scala:8)
I couldn't find the root cause of this issue.
I'm trying to connect to the Phoenix driver from Spark Structured Streaming, and I'm getting the following exception when I try to load the HBase table data via the Phoenix driver. Please help with this.
Jars:
spark.version: 2.4.0
scala.version: 2.12
phoenix.version: 4.11.0-HBase-1.1
hbase.version: 1.4.4
confluent.version: 5.3.0
spark-sql-kafka-0-10_2.12
Code
val tableDF = sqlContext.phoenixTableAsDataFrame("DATA_TABLE", Array("ID","DEPARTMENT"), conf = configuration)
ERROR
Exception in thread "main" java.lang.UnsupportedOperationException: Data source org.apache.phoenix.spark does not support streamed writing
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:298)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:322)
at com.spark.streaming.process.StreamProcess.processDataPackets(StreamProcess.scala:81)
at com.spark.streaming.main.$anonfun$start$1(IAlertJob.scala:55)
at com.spark.streaming.main.$anonfun$start$1$adapted(IAlertJob.scala:27)
at com.spark.streaming.main.SparkStreamingApplication.withSparkStreamingContext(SparkStreamingApplication.scala:38)
at com.spark.streaming.main.SparkStreamingApplication.withSparkStreamingContext$(SparkStreamingApplication.scala:23)
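The exception itself says that org.apache.phoenix.spark does not support streamed writing, so a streaming query cannot use it directly as a sink. A common workaround in Structured Streaming (available from Spark 2.4.0) is foreachBatch, which hands each micro-batch to an ordinary batch writer. Below is a minimal PySpark sketch of that idea (the original code is Scala, but the pattern is the same); the streaming DataFrame, table name and zkUrl are placeholders, and it assumes the phoenix-spark connector's batch write path (format "org.apache.phoenix.spark" with "table" and "zkUrl" options) is available on the classpath.

# streaming_df is assumed to be an existing streaming DataFrame.
def write_batch_to_phoenix(batch_df, batch_id):
    # Each micro-batch is written with the regular batch API,
    # which the Phoenix connector does support.
    batch_df.write \
        .format("org.apache.phoenix.spark") \
        .mode("overwrite") \
        .option("table", "DATA_TABLE") \
        .option("zkUrl", "zookeeper-host:2181") \
        .save()

query = streaming_df.writeStream \
    .foreachBatch(write_batch_to_phoenix) \
    .start()

query.awaitTermination()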
I am unable to read Redis data into my Spark DataFrame. I am executing this on Databricks, and I have installed the required libraries on my cluster.
Below are my code and the error:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('MyApp') \
    .config('spark.redis.host', '') \
    .config('spark.redis.port', '16897') \
    .config('spark.redis.auth', '') \
    .getOrCreate()

df = spark.read.format("org.apache.spark.sql.redis") \
    .option("table", "school") \
    .option("key.column", "id") \
    .load()
Py4JJavaError: An error occurred while calling o631.load.
: java.lang.ClassNotFoundException:
Failed to find data source: org.apache.spark.sql.redis. Please find packages at
http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.errors.QueryExecutionErrors$.failedToFindDataSourceError(QueryExecutionErrors.scala:557)
Can someone help here, please?
Thanks
Narayana
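A "Failed to find data source: org.apache.spark.sql.redis" error usually means the spark-redis connector is not actually on the cluster's classpath. Below is a minimal sketch of one way to pull it in via Maven coordinates; the coordinates, version and connection details are assumptions (check the spark-redis releases for the build matching your Spark/Scala version), and on Databricks the more usual route is to install the library on the cluster from Maven before the session starts, since spark.jars.packages set in a running notebook may be ignored.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('MyApp') \
    .config('spark.jars.packages', 'com.redislabs:spark-redis_2.12:3.1.0') \
    .config('spark.redis.host', 'my-redis-host') \
    .config('spark.redis.port', '16897') \
    .config('spark.redis.auth', 'my-redis-password') \
    .getOrCreate()

# With the connector on the classpath, the original read should resolve.
df = spark.read.format("org.apache.spark.sql.redis") \
    .option("table", "school") \
    .option("key.column", "id") \
    .load()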
We are currently working on a POC that uses a 'oneof' to put multiple events into the same topic. However, we seem to be getting a serialization exception when publishing to the union Kafka topic.
We are creating a union protobuf schema that wraps the other event schemas with the oneof. These event schemas use imports from Google (such as google/type/date.proto) that can't be added as references while evolving schemas in the registry.
We are currently using Schema Registry version 6.1.1 and are not sure whether this is the cause or whether this is just how it works. Below is the error we are facing, for reference. We are not sure if any additional setup or configuration is needed in such a scenario. We would appreciate some advice on this!
org.apache.kafka.common.errors.SerializationException: Error
serializing Protobuf message at
io.confluent.kafka.serializers.protobuf.AbstractKafkaProtobufSerializer.serializeImpl(AbstractKafkaProtobufSerializer.java:106)
~[kafka-protobuf-serializer-6.1.1.jar:na] Caused by:
java.io.IOException: Incompatible schema syntax = "proto3"; ERROR
25580 --- [nio-8090-exec-1] o.a.c.c.C.[.[.[.[dispatcherServlet] :
Servlet.service() for servlet [dispatcherServlet] in context with path
[/kafka_producer_ri] threw exception [Request processing failed;
nested exception is org.apache.camel.CamelExecutionException:
Exception occurred during execution on the exchange:
Exchange[3497788AEF9624A-0000000000000000]] with root cause
java.io.IOException: Incompatible schema syntax = "proto3";
Thanks
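One pattern sometimes used for imported types like google/type/date.proto is to register them in the Schema Registry under their own subjects first, and then register the union schema with a "references" entry pointing at those subjects. The sketch below shows that flow against the Schema Registry REST API; the registry URL, file names and subject names are all placeholders, and this is offered as an idea to try rather than a confirmed fix for the "Incompatible schema syntax" error.

import json
import requests

REGISTRY = "http://localhost:8081"  # placeholder Schema Registry URL
HEADERS = {"Content-Type": "application/vnd.schemaregistry.v1+json"}

# 1. Register the imported proto under its own subject.
with open("google/type/date.proto") as f:
    date_proto = f.read()

resp = requests.post(
    REGISTRY + "/subjects/google%2Ftype%2Fdate.proto/versions",
    headers=HEADERS,
    data=json.dumps({"schemaType": "PROTOBUF", "schema": date_proto}),
)
resp.raise_for_status()

# 2. Register the union schema with a reference to the subject above.
with open("union_events.proto") as f:  # placeholder union schema file
    union_proto = f.read()

resp = requests.post(
    REGISTRY + "/subjects/union-topic-value/versions",  # placeholder subject
    headers=HEADERS,
    data=json.dumps({
        "schemaType": "PROTOBUF",
        "schema": union_proto,
        "references": [
            {
                "name": "google/type/date.proto",
                "subject": "google/type/date.proto",
                "version": 1,
            }
        ],
    }),
)
resp.raise_for_status()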
I have followed this guide to read data stored in a Google bucket: https://cloud.google.com/dataproc/docs/connectors/install-storage-connector
It worked fine. The following command
hadoop fs -ls gs://the-bucket-you-want-to-list
gave me the expected results. But when I tried reading data with PySpark using
rdd = sc.textFile("gs://crawl_tld_bucket/"),
it throws the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o20.partitions.
: java.io.IOException: No FileSystem for scheme: gs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
How can I get this done?
To access Google Cloud Storage you have to include the Cloud Storage connector:
spark-submit --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar your-pyspark-script.py
or
pyspark --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar
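If the jar is on the classpath but the gs scheme is still not picked up, another option (a sketch using the same jar path as above; the fs.gs.impl key is just the standard Hadoop fs.<scheme>.impl convention) is to declare the connector's filesystem class in the Spark configuration:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("GcsRead") \
    .config("spark.jars", "/path/to/gcs/gcs-connector-latest-hadoop2.jar") \
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .getOrCreate()

# The original read, now that the gs:// scheme is registered.
rdd = spark.sparkContext.textFile("gs://crawl_tld_bucket/")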