In Spark 3, major changes were introduced to the DataSourceV2 API. I want to know whether there is a proper migration document that explains how to use TableProvider, with read and write support, in place of DataSourceV2.
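For reference, here is the rough shape I understand the new API to take. This is only a sketch against the org.apache.spark.sql.connector interfaces as I read them; the class names MySource and MyTable are placeholders of mine, and the scan/write builders are left unimplemented.

    import java.util
    import org.apache.spark.sql.connector.catalog.{SupportsRead, SupportsWrite, Table, TableCapability, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.connector.read.ScanBuilder
    import org.apache.spark.sql.connector.write.{LogicalWriteInfo, WriteBuilder}
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // Spark 3 entry point: TableProvider replaces the old DataSourceV2 marker interface.
    class MySource extends TableProvider {
      override def inferSchema(options: CaseInsensitiveStringMap): StructType =
        new StructType().add("value", "string")

      override def getTable(schema: StructType,
                            partitioning: Array[Transform],
                            properties: util.Map[String, String]): Table =
        new MyTable(schema)
    }

    // Read/write support now hangs off the Table via SupportsRead/SupportsWrite
    // instead of the old ReadSupport/WriteSupport mix-ins on the source itself.
    class MyTable(tableSchema: StructType) extends Table with SupportsRead with SupportsWrite {
      override def name(): String = "my_source"
      override def schema(): StructType = tableSchema
      override def capabilities(): util.Set[TableCapability] =
        util.EnumSet.of(TableCapability.BATCH_READ, TableCapability.BATCH_WRITE)

      override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = ??? // build the Scan here
      override def newWriteBuilder(info: LogicalWriteInfo): WriteBuilder = ???          // build the Write here
    }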
I have a Scala-based application and I need to connect it to Cassandra.
I found the DataStax Enterprise (DSE) drivers very useful in this regard; they have a lot of nice features, like built-in load balancing for Cassandra, which is really important for me.
Unfortunately, there is no native DSE driver for Scala. I know we can use the DSE Java driver, but in that case we lose a lot of Scala's nice features.
I also found the spark-cassandra-connector, which is built by DataStax as well, but the built-in load balancing is really important to me and I don't know whether spark-cassandra-connector supports it or not.
In Java-based applications using the DSE Java driver, I configure the built-in load balancer in a configuration file like this:
datastax-java-driver.basic.load-balancing-policy {
  class = DefaultLoadBalancingPolicy
}
I don't know the equivalent way to do this in Scala with spark-cassandra-connector, and I'm not even sure whether it is possible.
Any help would be appreciated. Thanks.
In Scala you can just use the Java driver. Out of the box there is no support for base Scala types, but you can solve this by adding java-driver-scala-extras to your project (as source code); it works at least with driver 3.x. Another gap is support for Option, but that can be handled via Java's Optional, for which the Java driver has an extra codec.
Regarding customization of the driver: that part should work from Scala without change. Regarding the default policy in Spark: the Spark Cassandra Connector has a separate policy for a specific reason - it is close to the Java driver's default policy, but with Spark-specific behavior.
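To make the first point concrete, here is a minimal sketch of calling the Java driver 3.x directly from Scala and handling possibly-missing values with Option on the Scala side; the contact point, keyspace, table and column names are hypothetical.

    import com.datastax.driver.core.Cluster

    object CassandraFromScala {
      def main(args: Array[String]): Unit = {
        // Java driver 3.x API; adjust the contact point and keyspace for your cluster.
        val cluster = Cluster.builder()
          .addContactPoint("127.0.0.1")
          .build()
        val session = cluster.connect("my_keyspace")

        // one() returns null when there is no row, and a column value may be null too,
        // so wrap both in Option on the Scala side.
        val row = session.execute("SELECT name FROM users LIMIT 1").one()
        val name: Option[String] = Option(row).flatMap(r => Option(r.getString("name")))
        println(name)

        session.close()
        cluster.close()
      }
    }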
Spark json4s [java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/Js]
Getting the above error while parsing complex JSON when running a Spark Scala Structured Streaming application on AWS EMR.
It looks like a binary compatibility error... Could you please check the dependency tree for incompatible versions of json4s artifacts?
If you are not able to align them on the same version, then you may be able to solve the problem by shading some of them with the sbt-assembly plugin, using rules like these.
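As a hedged sketch of such a rule (the shaded.json4s prefix is just a name I picked), something like this in build.sbt renames the json4s classes bundled into the fat jar so they cannot clash with the copy shipped by Spark/EMR:

    // build.sbt, with the sbt-assembly plugin enabled
    assemblyShadeRules in assembly := Seq(
      // Relocate the json4s packages inside the fat jar under a private prefix.
      ShadeRule.rename("org.json4s.**" -> "shaded.json4s.@1").inAll
    )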
In any case, I would recommend using a safer and more efficient parser such as jsoniter-scala (or dijon, which is based on it).
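A minimal jsoniter-scala sketch, assuming a hypothetical Event payload; the codec is derived at compile time, so there is no reflection and no json4s on the classpath at all:

    import com.github.plokhotnyuk.jsoniter_scala.core._
    import com.github.plokhotnyuk.jsoniter_scala.macros._

    // Hypothetical message shape; replace with your actual schema.
    case class Event(id: Long, payload: String)

    object EventParser {
      implicit val eventCodec: JsonValueCodec[Event] = JsonCodecMaker.make

      // Parse raw bytes coming from the stream into a typed case class.
      def parse(bytes: Array[Byte]): Event = readFromArray[Event](bytes)
    }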
I have a Kafka broker upgraded from 0.8 to 0.11, and now I am trying to upgrade the Spark Streaming job code to be compatible with the new Kafka (I am using Spark 1.6.2).
I searched a lot for steps to follow for this upgrade, but I didn't find any article, official or unofficial.
The only article I found useful is this one; however, it covers Spark 2.2 and Kafka 0.10, and it contains a line saying:
However, because the newer integration uses the new Kafka consumer API instead of the simple API, there are notable differences in usage. This version of the integration is marked as experimental, so the API is potentially subject to change
Has anyone tried to integrate Spark Streaming 1.6 with Kafka 0.11, or is it better to upgrade Spark to 2.x first, given the lack of information about and support for this mix of Spark Streaming and Kafka versions?
After a lot of investigation, I found no way to make this move, as Spark Streaming only supports Kafka versions up to 0.10 (which has major differences from Kafka 0.11 and 1.0.x).
That's why I decided to move from Spark Streaming to the new Kafka Streams API, and it was simply awesome: easy to use, very flexible, and the big advantage is that IT IS A LIBRARY - you can simply add it to your project - not a framework that wraps your code.
The Kafka Streams API supports almost all the functionality provided by Spark (aggregation, windowing, filtering, map/reduce).
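To give an idea of how little code is involved, here is a minimal Kafka Streams sketch in Scala (assuming the kafka-streams 1.x API; the topic names, application id and the filter itself are hypothetical):

    import java.util.Properties
    import org.apache.kafka.common.serialization.Serdes
    import org.apache.kafka.streams.kstream.{KStream, Predicate}
    import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

    object FilterApp {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "filter-app")
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

        // Build a simple topology: read, drop empty values, write back to another topic.
        val builder = new StreamsBuilder()
        val input: KStream[String, String] = builder.stream[String, String]("input-topic")
        input
          .filter(new Predicate[String, String] {
            override def test(key: String, value: String): Boolean =
              value != null && value.nonEmpty
          })
          .to("output-topic")

        val streams = new KafkaStreams(builder.build(), props)
        streams.start()
        sys.addShutdownHook(streams.close())
      }
    }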
I'm currently working on a recommender system using PySpark and an IPython notebook. I want to get recommendations from data stored in BigQuery. There are two options: the Spark BQ connector and the Python BQ library.
What are the pros and cons of these two tools?
The Python BQ library is a standard way to interact with BQ from Python, and so it includes the full API capabilities of BigQuery. The Spark BQ connector you mention is the Hadoop Connector - a Java Hadoop library that allows you to read from and write to BigQuery using abstracted Hadoop classes. This more closely resembles how you interact with native Hadoop inputs and outputs.
You can find example usage of the Hadoop Connector here.
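The question is about PySpark, but since the connector is a Java Hadoop library, its shape is easiest to show in Scala; here is a minimal sketch assuming hypothetical project, bucket and table names (the same newAPIHadoopRDD call is available from PySpark with the class names passed as strings):

    import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
    import com.google.gson.JsonObject
    import org.apache.hadoop.io.LongWritable
    import org.apache.spark.{SparkConf, SparkContext}

    object BigQueryRead {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("bq-read"))
        val conf = sc.hadoopConfiguration

        // Hypothetical names; the connector also needs a GCS bucket as a staging area for exports.
        conf.set(BigQueryConfiguration.PROJECT_ID_KEY, "my-project")
        conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, "my-temp-bucket")
        BigQueryConfiguration.configureBigQueryInput(conf, "my-project:my_dataset.ratings")

        // Each record arrives as a (LongWritable, JsonObject) pair.
        val rows = sc.newAPIHadoopRDD(
          conf,
          classOf[GsonBigQueryInputFormat],
          classOf[LongWritable],
          classOf[JsonObject])

        println(rows.count())
        sc.stop()
      }
    }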
I am planning to use OrientDB in production via the JDBC driver, so I need to confirm some points:
Can the JDBC driver provide all the OrientDB features (transactions, links, etc.), or is using the Java API the better choice?
I noticed that there is a Spring Data implementation in the OrientDB GitHub; is it ready to use in production?
At this link there is a discussion of the issue you raised.
In general, the JDBC driver supports only a subset of OrientDB - only the part you can use with commands.
If you're a Java developer, I suggest you use the Java Graph API: http://orientdb.com/docs/last/Graph-Database-Tinkerpop.html
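As a rough illustration, here is a minimal sketch of the Graph API used from the JVM (shown in Scala here; the connection URL, credentials and the Person/Knows names are hypothetical), covering transactions and edges, i.e. the features the question asks about:

    import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory

    object OrientGraphExample {
      def main(args: Array[String]): Unit = {
        val factory = new OrientGraphFactory("remote:localhost/mydb", "admin", "admin")
        val graph = factory.getTx() // transactional graph instance
        try {
          val alice = graph.addVertex("class:Person", "name", "Alice")
          val bob   = graph.addVertex("class:Person", "name", "Bob")
          graph.addEdge(null, alice, bob, "Knows") // edges are stored as links between the vertices
          graph.commit()                           // transaction handling comes with getTx()
        } finally {
          graph.shutdown()
          factory.close()
        }
      }
    }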