PySpark: read multiple snappy parquet files from Aliyun OSS

I'm trying to read multiple snappy parquet files from an Aliyun OSS bucket, but I get an error when I use
spark.read.parquet("oss://folder")
This is the error I get:
Py4JJavaError: An error occurred while calling o783.text.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.aliyun.emr.fs.oss.JindoOssFileSystem not found
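The ClassNotFoundException means the OSS filesystem connector is not on the driver/executor classpath. The class in the error appears to come from Aliyun's JindoFS SDK, so either that SDK jar has to be added to the cluster, or the Apache hadoop-aliyun connector can be used instead. A minimal sketch of the latter (endpoint, credentials and bucket path are placeholders, and it assumes the hadoop-aliyun jar plus the Aliyun OSS Java SDK are already on the classpath):

from pyspark.sql import SparkSession

# Sketch only: endpoint, credentials and bucket path are placeholders,
# and the hadoop-aliyun connector is assumed to be on the classpath.
spark = (
    SparkSession.builder
    .appName("read-oss-parquet")
    .config("spark.hadoop.fs.oss.impl",
            "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem")
    .config("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou.aliyuncs.com")
    .config("spark.hadoop.fs.oss.accessKeyId", "<ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.oss.accessKeySecret", "<ACCESS_KEY_SECRET>")
    .getOrCreate()
)

# Spark reads every parquet file under the prefix; the snappy codec is
# detected from the file footers, so no extra option is needed on read.
df = spark.read.parquet("oss://my-bucket/folder/")
df.show()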

Related

NoSuchMethodError while executing Scala-Spark code

I am trying to read a JSON file and store it in a DataFrame. Below is the code snippet; I am reading a list of objects from the JSON file:
val df = spark.read.option("multiLine",true).json(path_variable)
df.show()
Error:
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Ljava/lang/Object;
at scala.runtime.LazyVals$.<clinit>(LazyVals.scala:8)
I couldn't find the root cause of this issue.

"Data source org.apache.phoenix.spark does not support streamed writing" in Structured Streaming

I'm trying to connect to Phoenix using Spark Structured Streaming, and I'm getting the following exception when I try to load the HBase table data via the Phoenix driver. Please help with this.
Jars:
spark.version: 2.4.0
scala.version: 2.12
phoenix.version: 4.11.0-HBase-1.1
hbase.version: 1.4.4
confluent.version: 5.3.0
spark-sql-kafka-0-10_2.12
Code
val tableDF = sqlContext.phoenixTableAsDataFrame("DATA_TABLE", Array("ID","DEPARTMENT"), conf = configuration)
ERROR
Exception in thread "main" java.lang.UnsupportedOperationException: Data source org.apache.phoenix.spark does not support streamed writing
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:298)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:322)
at com.spark.streaming.process.StreamProcess.processDataPackets(StreamProcess.scala:81)
at com.spark.streaming.main.$anonfun$start$1(IAlertJob.scala:55)
at com.spark.streaming.main.$anonfun$start$1$adapted(IAlertJob.scala:27)
at com.spark.streaming.main.SparkStreamingApplication.withSparkStreamingContext(SparkStreamingApplication.scala:38)
at com.spark.streaming.main.SparkStreamingApplication.withSparkStreamingContext$(SparkStreamingApplication.scala:23)
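The "does not support streamed writing" message indicates that the phoenix-spark data source only provides a batch sink. A common workaround in Structured Streaming is to write each micro-batch through the batch API with foreachBatch. A rough sketch in PySpark (not the poster's Scala job; the dummy rate source, column mapping, table name and zkUrl are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName("phoenix-foreachbatch-sketch").getOrCreate()

# Dummy streaming source to keep the sketch self-contained; in the
# question the stream would come from Kafka instead.
stream_df = (
    spark.readStream.format("rate").option("rowsPerSecond", 1).load()
    .select(col("value").alias("ID"), lit("SALES").alias("DEPARTMENT"))
)

def write_to_phoenix(batch_df, batch_id):
    # Each micro-batch is written with the batch writer, which the
    # phoenix-spark source does support. Table name and zkUrl are placeholders.
    (batch_df.write
        .format("org.apache.phoenix.spark")
        .mode("overwrite")
        .option("table", "DATA_TABLE")
        .option("zkUrl", "zookeeper-host:2181")
        .save())

query = stream_df.writeStream.foreachBatch(write_to_phoenix).start()
query.awaitTermination()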

Pyspark dataframe to read redis data

I am unable to read Redis data into my Spark DataFrame. I am executing this on Databricks.
I have installed the required libraries on my cluster.
Below is my code, followed by the error:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName('MyApp') \
    .config('spark.redis.host', '') \
    .config('spark.redis.port', '16897') \
    .config('spark.redis.auth', '') \
    .getOrCreate()

df = spark.read.format("org.apache.spark.sql.redis") \
    .option("table", "school") \
    .option("key.column", "id") \
    .load()
Py4JJavaError: An error occurred while calling o631.load.
: java.lang.ClassNotFoundException:
Failed to find data source: org.apache.spark.sql.redis. Please find packages at
http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.errors.QueryExecutionErrors$.failedToFindDataSourceError(QueryExecutionErrors.scala:557)
Can someone help here please?
Thanks,
Narayana
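The ClassNotFoundException here means the spark-redis data source is not on the cluster classpath, so attaching the connector is the first thing to check. A minimal sketch, assuming the com.redislabs:spark-redis_2.12 connector is pulled in via spark.jars.packages (the version, host and password below are placeholders; on Databricks the cleaner route is usually to install the Maven coordinate as a cluster library instead):

from pyspark.sql import SparkSession

# Sketch only: host, port, password and the connector version are placeholders.
spark = (
    SparkSession.builder
    .appName('MyApp')
    .config('spark.jars.packages', 'com.redislabs:spark-redis_2.12:3.1.0')
    .config('spark.redis.host', '<redis-host>')
    .config('spark.redis.port', '16897')
    .config('spark.redis.auth', '<redis-password>')
    .getOrCreate()
)

# Same read as in the question, once the data source is resolvable.
df = (
    spark.read.format("org.apache.spark.sql.redis")
    .option("table", "school")
    .option("key.column", "id")
    .load()
)
df.show()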

Schema Registry Issue - Protobuf Unions

We are currently working on a POC that uses a 'oneof' to publish multiple event types to the same topic. However, we are getting a serialization exception when publishing to the union Kafka topics.
We are creating a union protobuf schema that references the other event schemas through the oneof. These event schemas use Google imports such as google/type/date.proto, which can't be added as references while evolving schemas in the registry.
We are currently using Schema Registry version 6.1.1 and are not sure whether this is the cause or whether this is how it is supposed to work. Below is the error we are facing, for reference. We are not sure if any additional setup or configuration is needed in this scenario; we would appreciate some advice on this.
org.apache.kafka.common.errors.SerializationException: Error serializing Protobuf message
at io.confluent.kafka.serializers.protobuf.AbstractKafkaProtobufSerializer.serializeImpl(AbstractKafkaProtobufSerializer.java:106) ~[kafka-protobuf-serializer-6.1.1.jar:na]
Caused by: java.io.IOException: Incompatible schema syntax = "proto3";

ERROR 25580 --- [nio-8090-exec-1] o.a.c.c.C.[.[.[.[dispatcherServlet] : Servlet.service() for servlet [dispatcherServlet] in context with path [/kafka_producer_ri] threw exception [Request processing failed; nested exception is org.apache.camel.CamelExecutionException: Exception occurred during execution on the exchange: Exchange[3497788AEF9624A-0000000000000000]] with root cause
java.io.IOException: Incompatible schema syntax = "proto3";
Thanks

Reading Google bucket data in Spark

I have followed this guide to read data stored in a Google Cloud Storage bucket: https://cloud.google.com/dataproc/docs/connectors/install-storage-connector
It worked fine. The following command
hadoop fs -ls gs://the-bucket-you-want-to-list
gave me the expected results. But when I tried reading the data with PySpark using
rdd = sc.textFile("gs://crawl_tld_bucket/"),
it throws the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o20.partitions.
: java.io.IOException: No FileSystem for scheme: gs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
How can I get this done?
To access Google Cloud Storage you have to include the Cloud Storage connector jar:
spark-submit --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar your-pyspark-script.py
or
pyspark --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar
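If the jar is on the classpath but the gs:// scheme still isn't resolved, the connector's filesystem classes can also be declared explicitly in the Hadoop configuration. A rough sketch (the bucket name is the one from the question; sc._jsc is an internal handle commonly used for this):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("read-gcs")
sc = SparkContext(conf=conf)

# Map the gs:// scheme to the GCS connector's FileSystem implementations;
# the connector jar itself still has to be on the classpath (e.g. via --jars).
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoop_conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

# Bucket name taken from the question.
rdd = sc.textFile("gs://crawl_tld_bucket/")
print(rdd.count())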