I'm trying to write a DataFrame to AWS Keyspaces, but I'm getting the messages below:
Stack:
dfExploded.write.cassandraFormat(table = "table", keyspace = "hub").mode(SaveMode.Append).save()
21/08/18 21:45:18 WARN DefaultTokenFactoryRegistry: [s0] Unsupported partitioner 'com.amazonaws.cassandra.DefaultPartitioner', token map will be empty.
java.lang.AssertionError: assertion failed: There are no contact points in the given set of hosts
at scala.Predef$.assert(Predef.scala:223)
at com.datastax.spark.connector.cql.LocalNodeFirstLoadBalancingPolicy$.determineDataCenter(LocalNodeFirstLoadBalancingPolicy.scala:195)
at com.datastax.spark.connector.cql.CassandraConnector$.$anonfun$dataCenterNodes$1(CassandraConnector.scala:192)
at scala.Option.getOrElse(Option.scala:189)
at com.datastax.spark.connector.cql.CassandraConnector$.dataCenterNodes(CassandraConnector.scala:192)
at com.datastax.spark.connector.cql.CassandraConnector$.alternativeConnectionConfigs(CassandraConnector.scala:207)
at com.datastax.spark.connector.cql.CassandraConnector$.$anonfun$sessionCache$3(CassandraConnector.scala:169)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:34)
at com.datastax.spark.connector.cql.RefCountedCache.syncAcquire(RefCountedCache.scala:69)
at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:57)
at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:89)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
at com.datastax.spark.connector.datasource.CassandraCatalog$.com$datastax$spark$connector$datasource$CassandraCatalog$$getMetadata(CassandraCatalog.scala:455)
at com.datastax.spark.connector.datasource.CassandraCatalog$.getTableMetaData(CassandraCatalog.scala:421)
at org.apache.spark.sql.cassandra.DefaultSource.getTable(DefaultSource.scala:68)
at org.apache.spark.sql.cassandra.DefaultSource.inferSchema(DefaultSource.scala:72)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
at org.apache.spark.sql.DataFrameWriter.getTable$1(DataFrameWriter.scala:339)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:301)
SparkSubmit:
spark-submit --deploy-mode cluster --master yarn \
--conf=spark.cassandra.connection.port="9142" \
--conf=spark.cassandra.connection.host="cassandra.sa-east-1.amazonaws.com" \
--conf=spark.cassandra.auth.username="BUU" \
--conf=spark.cassandra.auth.password="123456789" \
--conf=spark.cassandra.connection.ssl.enabled="true" \
--conf=spark.cassandra.connection.ssl.trustStore.path="cassandra_truststore.jks" \
--conf=spark.cassandra.connection.ssl.trustStore.password="123456"
Connecting with cqlsh works fine, but in Spark I get this error.
To read and write data between Keyspaces and Apache Spark using the open-source Spark Cassandra Connector, all you have to do is update the partitioner for your Keyspaces account.
Docs: https://docs.aws.amazon.com/keyspaces/latest/devguide/spark-integrating.html
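Per the linked AWS docs, Keyspaces lets you change the partitioner reported in the `system.local` table so the connector can build its token map. A rough sketch of the CQL (run it through cqlsh or your driver against your Keyspaces endpoint):

```sql
-- Check which partitioner your Keyspaces account currently reports
SELECT partitioner FROM system.local;

-- Switch to the Murmur3 partitioner, which the Spark Cassandra Connector supports
UPDATE system.local
SET partitioner = 'org.apache.cassandra.dht.Murmur3Partitioner'
WHERE key = 'local';
```

After the update, new sessions should see the supported partitioner and the "token map will be empty" warning should go away.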
The issue, as the error states, is that AWS Keyspaces uses a partitioner (com.amazonaws.cassandra.DefaultPartitioner) that isn't supported by the Spark Cassandra Connector.
There isn't a lot of public documentation on what the underlying database for AWS Keyspaces is, so I've long suspected that there's a CQL API engine sitting in front of Keyspaces so it "looks" like Cassandra but is probably backed by something else like DynamoDB. I'm more than happy to be corrected by someone here from AWS just so I can put that to bed. 🙂
The default Cassandra partitioner is Murmur3Partitioner and is the only recommended partitioner. The older partitioners such as RandomPartitioner and ByteOrderedPartitioner are supported only for backward compatibility but should never be used for new clusters.
Finally, we don't test the Spark connector against AWS Keyspaces so be prepared for a lot of surprises there. Cheers!
Related
We are developing an ETL tool using Apache PySpark and Apache Airflow.
Apache Airflow will be used for workflow management.
Can Apache PySpark handle huge volumes of data?
Can I get extract/transform counts from Apache Airflow?
Yes, Apache (Py)Spark is built for dealing with big data
There is no magic out-of-the-box solution for getting metrics from PySpark into Airflow
Some solutions for #2 are:
Writing metrics from PySpark to another system (e.g. database, blob storage, ...) and reading those in a 2nd task in Airflow
Returning the values from the PySpark jobs and pushing them into Airflow XCom
My 2c: don't process large data in Airflow itself as it's built for orchestration and not data processing. If the intermediate data becomes big, use a dedicated storage system for that (database, blob storage, etc...). XComs are stored in the Airflow metastore itself (although custom XCom backends to store data in other systems are supported since Airflow 2.0 https://www.astronomer.io/guides/custom-xcom-backends), so make sure the data isn't too big if you're storing it in the Airflow metastore.
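The first solution above (write metrics from the PySpark job to shared storage, then read them in a downstream Airflow task) can be sketched with two small helpers. The path and metric names here are hypothetical; in a real DAG the writer runs at the end of the Spark job and the reader inside a PythonOperator callable, whose return value Airflow pushes to XCom automatically:

```python
import json
from pathlib import Path


def write_metrics(path: str, extract_count: int, transform_count: int) -> None:
    """Called at the end of the PySpark job: persist the counts
    somewhere the Airflow worker can read (here a shared file;
    a database or blob store works the same way)."""
    Path(path).write_text(json.dumps({
        "extract_count": extract_count,
        "transform_count": transform_count,
    }))


def read_metrics(path: str) -> dict:
    """Called from a downstream Airflow task; returning the dict
    from a PythonOperator callable pushes it to XCom."""
    return json.loads(Path(path).read_text())


if __name__ == "__main__":
    write_metrics("/tmp/etl_metrics.json", extract_count=1000, transform_count=950)
    print(read_metrics("/tmp/etl_metrics.json"))
```

Since only the small counts dict crosses the task boundary, this stays well within what the Airflow metastore can comfortably hold as an XCom.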
I have my MongoDB and Spark running on Zeppelin, both sharing the same HDFS. The MongoDB produces a .wt database stored in the same HDFS.
I want to load the database collection produced by that MongoDB from the HDFS into a Spark DataFrame.
Is it possible to load the database directly from HDFS into spark as a DataFrame? Or do I need to use a MongoDB Spark connector?
I would not recommend reading or modifying the internal WiredTiger storage engine's *.wt files. Firstly, these internal files are subject to change without notification (they are not a public-facing API); also, any unintended modification to these files may leave the database in an invalid/corrupt state.
You can use the MongoDB Spark Connector to load data from MongoDB into Spark. This connector is designed, developed, and optimised for reading and writing data between MongoDB and Apache Spark. For example, by accessing data through the database, the client can utilise the database indexes to optimise read operations.
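A minimal sketch of loading a collection through the connector rather than touching the .wt files. The URI, database, and collection names are placeholders; the SQLContext style matches the Spark 1.6.x era of mongo-spark-connector (on Spark 2.x+ you would use SparkSession instead):

```python
# Sketch: load a MongoDB collection into a Spark DataFrame via the
# MongoDB Spark Connector (host/db/collection are placeholders).
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("mongo-read")
        .set("spark.mongodb.input.uri", "mongodb://host:27017/mydb.mycollection"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# The connector resolves the collection from spark.mongodb.input.uri
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
```

Submit with the matching connector package, e.g. `--packages org.mongodb.spark:mongo-spark-connector_2.11:1.1.0` for Scala 2.11 builds.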
See also:
GitHub demo: Docker for MongoDB, Apache Spark and Apache Zeppelin
GitHub demo: Docker for MongoDB and Apache Spark
I have a running project using the akka-persistence-jdbc plugin with PostgreSQL as the backend.
Now I want to migrate to akka-persistence-cassandra.
But how can I convert the existing events (more than 4 GB in Postgres) to Cassandra?
Should I write a manual migration program, reading from Postgres and writing in the right format to Cassandra?
This is a classic migration problem. There are multiple solutions for this.
Spark SQL and the Spark Cassandra Connector: the Spark JDBC API (also known as the Spark DataFrame / Spark SQL API) allows you to read from any JDBC source. Read it in chunks by segmenting it, otherwise you will run out of memory; segmentation also makes the migration parallel. Then write the data into Cassandra with the Spark Cassandra Connector. This is by far the simplest and most efficient way I have used in my tasks.
Java agents: a Java agent can be written based on plain JDBC or other libraries and then write to Cassandra with the DataStax driver. A Spark program runs on multiple machines in a multi-threaded way and recovers automatically if something goes wrong, but if you write an agent like this manually, your agent only runs on a single machine and the multi-threading also needs to be coded.
Kafka connectors: Kafka is a message broker and can be used indirectly for migration. Kafka has connectors that can read from and write to different databases. You can use the JDBC source connector to read from Postgres and the Cassandra sink connector to write to Cassandra. It's not that easy to set up, but it has the advantage that no coding is involved.
ETL systems: some ETL systems have support for Cassandra, but I haven't personally tried any.
I saw some advantages in using the Spark Cassandra Connector and Spark SQL for migration, some of them being:
Code was concise: it was hardly 40 lines.
Multi-machine (and multi-threaded on each machine).
Job progress and statistics in the Spark master UI.
Fault tolerance: if a Spark node goes down or a thread/worker fails, the job is automatically restarted on another node, which is good for very long-running jobs.
If you don't know Spark, then writing a manual agent is okay for 4 GB of data.
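The Spark SQL plus Spark Cassandra Connector approach above can be sketched as follows. All connection settings, table/keyspace names, and the partition column are placeholders; in practice you would query the source table for the real upper bound before segmenting:

```python
# Sketch: segmented JDBC read from Postgres, parallel write to Cassandra.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg-to-cassandra").getOrCreate()

events = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://pg-host:5432/akka")
          .option("dbtable", "journal")
          .option("user", "user")
          .option("password", "secret")
          # Segment the read so each executor pulls a slice of the table
          # instead of one connection pulling all 4 GB into memory
          .option("partitionColumn", "ordering")
          .option("lowerBound", "0")
          .option("upperBound", "10000000")
          .option("numPartitions", "32")
          .load())

(events.write
 .format("org.apache.spark.sql.cassandra")
 .options(table="messages", keyspace="akka")
 .mode("append")
 .save())
```

Submit it with the Postgres JDBC driver and the Spark Cassandra Connector on the classpath (e.g. via `--packages`), and the 32 partitions migrate in parallel across the cluster.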
I am trying to connect Apache Spark to MongoDB using Mesos. Here is my architecture:
MongoDB: a MongoDB cluster of 2 shards, 1 config server, and 1 query server.
Mesos: 1 Mesos master, 4 Mesos slaves.
Now I have installed Spark on just 1 node. There is not much information available on this out there. I just wanted to pose a few questions:
As I understand it, I can connect Spark to MongoDB via Mesos; in other words, I end up using MongoDB as the storage layer. Do I really need Hadoop? Is it mandatory to pull all the data into Hadoop just for Spark to read it?
Here is the reason I am asking this: the Spark install expects the HADOOP_HOME variable to be set. This seems like very tight coupling!! Most of the posts on the net talk about the MongoDB-Hadoop connector. It doesn't make sense if you're forcing me to move everything to Hadoop.
Does anyone have an answer?
Regards
Mario
Spark itself takes a dependency on Hadoop, and data in HDFS can be used as a data source.
However, if you use the Mongo Spark Connector you can use MongoDB as a datasource for Spark without going via Hadoop at all.
The Spark-Mongo connector is a good idea. Moreover, if you are executing Spark in a Hadoop cluster, you need to set HADOOP_HOME.
Check your requirements and test it (tutorial):
Basic working knowledge of MongoDB and Apache Spark. Refer to the MongoDB documentation and Spark documentation.
Running MongoDB instance (version 2.6 or later).
Spark 1.6.x.
Scala 2.10.x if using the mongo-spark-connector_2.10 package
Scala 2.11.x if using the mongo-spark-connector_2.11 package
The new MongoDB Connector for Apache Spark provides higher performance, greater ease of use, and access to more advanced Spark functionality than the MongoDB Connector for Hadoop. The following table compares the capabilities of both connectors.
Then you need to configure Spark with Mesos:
Connecting Spark to Mesos
To use Mesos from Spark, you need a Spark binary package available in a place accessible by Mesos, and a Spark driver program configured to connect to Mesos.
Alternatively, you can also install Spark in the same location in all the Mesos slaves, and configure spark.mesos.executor.home (defaults to SPARK_HOME) to point to that location.
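Putting the two options above together, a hedged sketch of the submit side (the Mesos master address, HDFS path, library location, and job file are all placeholders for your environment):

```shell
# Tell the driver where the Mesos native library lives (path varies by install)
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so

# Point the driver at the Mesos master, and ship a Spark binary package
# the slaves can fetch (spark.executor.uri), so Spark need not be
# pre-installed on every slave
spark-submit \
  --master mesos://mesos-master:5050 \
  --conf spark.executor.uri=hdfs://namenode/spark/spark-1.6.3-bin-hadoop2.6.tgz \
  my_job.py
```

If you instead install Spark at the same location on every slave, drop `spark.executor.uri` and set `spark.mesos.executor.home` to that location.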
I am trying to stream data using Kafka Connect with the HDFS sink connector. Both standalone and distributed modes run fine, but the connector writes into HDFS only once (based on flush size) and does not keep streaming afterwards. Please help if I'm missing something.
Confluent 2.0.0 & Kafka 0.9.0
I faced this issue a while back. Just check whether the parameters below are missing.
Connect-hdfs-sink properties
"logs.dir":"/hdfs_directory/data/log"
"request.timeout.ms":"310000"
"offset.flush.interval.ms":"5000"
"heartbeat.interval.ms":"60000"
"session.timeout.ms":"300000"
"max.poll.records":"100"
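For the "writes only once" symptom specifically, it is worth adding a time-based rotation alongside `flush.size`, so files are committed even when fewer records than the flush size arrive. A sketch of a standalone-mode properties file (connector name, topic, and paths are placeholders):

```properties
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
topics=my_topic
hdfs.url=hdfs://namenode:8020
topics.dir=/hdfs_directory/data/topics
logs.dir=/hdfs_directory/data/log
# Commit a file once 100 records accumulate...
flush.size=100
# ...or after 60 seconds, whichever comes first, so data keeps streaming
rotate.interval.ms=60000
```

With only `flush.size` set, a partition that receives fewer than 100 new records simply never triggers another commit, which looks exactly like the connector writing once and then stalling.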