How to load data from Cassandra table - scala

I am working with Spark 2.0.1 and Cassandra 3.9. I want to read data from a Cassandra table, which I used to do with CassandraSQLContext. However, Spark 2.0 changed this and now uses SparkSession. I tried SparkSession and it seems to work; my code is below.
Could you please review it and give me your advice?
def main(args: Array[String], date_filter: String): Unit = {
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)
val sparkSession = SparkSession.builder
.master("local")
.appName("my-spark-app")
.config(conf)
.getOrCreate()
import sparkSession.implicits._
import org.apache.spark.sql._
val rdd = sparkSession
.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "users", "keyspace" -> "monita"))
.load()
println("count: " +rdd.count())
}

Your code looks OK. You don't need to create a SparkContext yourself; SparkSession creates one for you. You can set the Cassandra connection properties in the builder's config, like below.
val sparkSession = SparkSession
.builder
.master("local")
.appName("my-spark-app")
.config("spark.cassandra.connection.host", "127.0.0.1")
.config("spark.cassandra.connection.port", "9042")
.getOrCreate()
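Putting the question and the answer together, a minimal sketch of the whole read might look like the following. It assumes the spark-cassandra-connector is on the classpath and a Cassandra node is reachable at 127.0.0.1:9042; the object name CassandraReadSketch and the helpers cassandraOptions and readTable are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CassandraReadSketch {
  // Hypothetical helper: the two options the Cassandra data source needs.
  def cassandraOptions(keyspace: String, table: String): Map[String, String] =
    Map("keyspace" -> keyspace, "table" -> table)

  // Reads a whole table as a DataFrame through the connector's data source.
  def readTable(spark: SparkSession, keyspace: String, table: String): DataFrame =
    spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(cassandraOptions(keyspace, table))
      .load()

  def main(args: Array[String]): Unit = {
    // No separate SparkContext: the session creates it from these settings.
    val spark = SparkSession.builder
      .master("local")
      .appName("my-spark-app")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .config("spark.cassandra.connection.port", "9042")
      .getOrCreate()

    val users = readTable(spark, "monita", "users")
    println("count: " + users.count())
    spark.stop()
  }
}
```

Note that load() returns a DataFrame rather than an RDD, so a name such as users or df is less misleading than rdd.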

Related

Adding Mongo config to active spark session

I am trying to add configuration settings to an active Spark session. Below is my code:
val spark = SparkSession.getActiveSession.get
spark.conf.set("spark.mongodb.input.uri",
"mongodb://hello_admin:hello123@localhost:27017/testdb.products?authSource=admin")
spark.conf.set("spark.mongodb.input.partitioner" ,"MongoPaginateBySizePartitioner")
import com.mongodb.spark._
val customRdd = MongoSpark.load(sc)
println(customRdd.count())
println(customRdd.first.toJson)
println(customRdd.collect().foreach(println))
But I am getting an error:
java.lang.IllegalArgumentException: Missing database name. Set via the
'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property
However, when I write the code like this:
val spark = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("spark.mongodb.input.uri", "mongodb://hello_admin:hello123@localhost:27017/testdb.products?authSource=admin")
// .config("spark.mongodb.output.uri", "mongodb://hello_admin:hello123@localhost:27017/testdb.products?authSource=admin")
.config("spark.mongodb.input.partitioner" ,"MongoPaginateBySizePartitioner")
.getOrCreate()
val sc = spark.sparkContext
val customRdd = MongoSpark.load(sc)
println(customRdd.count())
println(customRdd.first.toJson)
println(customRdd.collect().foreach(println))
it executes fine.
Kindly let me know what changes I need to make in the first snippet.
You can define the SparkSession from a SparkConf like this (I don't know whether this helps you):
def sparkSession(conf: SparkConf): SparkSession = SparkSession
.builder()
.config(conf)
.getOrCreate()
val sparkConf = new SparkConf()
sparkConf.set("prop","value")
val ss = sparkSession(sparkConf)
Or you can try SparkEnv (I use SparkEnv a lot to change properties):
SparkEnv.get.conf.set("prop", "value")
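One possible explanation, which is an assumption and not confirmed in this thread, is that spark.conf.set only changes the session's runtime configuration, while MongoSpark.load(sc) builds its ReadConfig from the SparkContext's SparkConf, which was fixed when the context started. Passing the options explicitly avoids depending on either. The sketch below assumes the MongoDB Spark connector 2.x; the URI without credentials and the names MongoActiveSessionSketch and readOptions are placeholders.

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession

object MongoActiveSessionSketch {
  // Hypothetical helper. Keys are given without the "spark.mongodb.input." prefix.
  def readOptions(uri: String): Map[String, String] =
    Map("uri" -> uri, "partitioner" -> "MongoPaginateBySizePartitioner")

  def main(args: Array[String]): Unit = {
    // Reuses the active session if one exists, otherwise starts a local one.
    val spark = SparkSession.builder
      .master("local")
      .appName("mongo-sketch")
      .getOrCreate()

    // Explicit ReadConfig instead of relying on the context's SparkConf.
    val readConfig = ReadConfig(readOptions("mongodb://localhost:27017/testdb.products"))
    val customRdd = MongoSpark.load(spark.sparkContext, readConfig)

    println(customRdd.count())
    spark.stop()
  }
}
```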

SparkSession.Builder Fails with "A master URL must be set in your configuration": "spark.master" is set to "local"

I have:
val sparkBuilder: SparkSession.Builder = SparkSession
.builder
.appName("CreateModelDataPreparation")
.config("spark.master", "local")
implicit val spark: SparkSession = sparkBuilder.getOrCreate()
However, when I run my program I still get:
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:379)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
The SparkSession is set in the Main method as suggested in other posts. Those did not seem to solve the problem.
This differs from the suggested duplicate as I have tried both:
def main(argv: Array[String]): Unit = {
import DeweyConfigs.implicits.da3wConfig
val commandlineArgs: DeweyReaderArgs = processCommandLineArgs(argv)
val sparkBuilder: SparkSession.Builder = SparkSession
.builder
.appName("CreateModelDataPreparation")
.master("local")
implicit val spark: SparkSession = sparkBuilder.config("spark.master", "local").getOrCreate()
import spark.implicits._
...
and
def main(argv: Array[String]): Unit = {
import DeweyConfigs.implicits.da3wConfig
val commandlineArgs: DeweyReaderArgs = processCommandLineArgs(argv)
val sparkBuilder: SparkSession.Builder = SparkSession
.builder
.appName("CreateModelDataPreparation")
.config("master", "local")
implicit val spark: SparkSession = sparkBuilder.config("spark.master", "local").getOrCreate()
import spark.implicits._
...
Try adding .master("local") on the builder as opposed to the config parameter you provided.
I would have thought they did the same thing, but I'm pretty sure the latter works.
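As far as I know, master(...) on the builder is shorthand for config("spark.master", ...), so both forms should end up setting the same property. A quick way to check what the session actually received is to read the master back from the context. The sketch below only assumes spark-sql on the classpath; MasterCheckSketch and effectiveMaster are made-up names.

```scala
import org.apache.spark.sql.SparkSession

object MasterCheckSketch {
  // Builds (or reuses) a session and returns the master it actually runs with.
  def effectiveMaster(): String = {
    val spark = SparkSession.builder
      .appName("CreateModelDataPreparation")
      .master("local[1]")
      .getOrCreate()
    try spark.sparkContext.master
    finally spark.stop()
  }

  def main(args: Array[String]): Unit =
    println(effectiveMaster())
}
```

If the exception persists even though the master is set in main, it may be worth checking whether a SparkSession or SparkContext is also being created somewhere else, for example in a field of an object that is initialised outside main.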

Spark Mongodb Connector Scala - Missing database name

I'm stuck with a weird issue. I'm trying to locally connect Spark to MongoDB using mongodb spark connector.
Apart from setting up spark I'm using the following code:
val readConfig = ReadConfig(Map("uri" -> "mongodb://localhost:27017/movie_db.movie_ratings", "readPreference.name" -> "secondaryPreferred"), Some(ReadConfig(sc)))
val writeConfig = WriteConfig(Map("uri" -> "mongodb://127.0.0.1/movie_db.movie_ratings"))
// Load the movie rating data from Mongo DB
val movieRatings = MongoSpark.load(sc, readConfig).toDF()
movieRatings.show(100)
However, I get the following error:
java.lang.IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property.
It is raised on the line where I set up readConfig. I don't understand why it complains that the URI is not set when I clearly have a uri property in the Map.
I might be missing something.
You can set the configuration on the SparkSession:
val spark = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("spark.mongodb.input.uri", "mongodb://localhost:27017/movie_db.movie_ratings")
.config("spark.mongodb.input.readPreference.name", "secondaryPreferred")
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/movie_db.movie_ratings")
.getOrCreate()
Create a DataFrame using the config:
val readConfig = ReadConfig(Map("uri" -> "mongodb://localhost:27017/movie_db.movie_ratings", "readPreference.name" -> "secondaryPreferred"))
val df = MongoSpark.load(spark, readConfig)
Write the DataFrame to MongoDB:
MongoSpark.save(
df.write
.option("spark.mongodb.output.uri", "mongodb://127.0.0.1/movie_db.movie_ratings")
.mode("overwrite"))
In your code, the prefixes are missing from the config keys:
val readConfig = ReadConfig(Map(
"spark.mongodb.input.uri" -> "mongodb://localhost:27017/movie_db.movie_ratings",
"spark.mongodb.input.readPreference.name" -> "secondaryPreferred"),
Some(ReadConfig(sc)))
val writeConfig = WriteConfig(Map(
"spark.mongodb.output.uri" -> "mongodb://127.0.0.1/movie_db.movie_ratings"))
For Java, you can either set the configs while creating the Spark session, or create the session first and then set them as runtime configs.
1.
SparkSession sparkSession = SparkSession.builder()
.master("local")
.appName("MongoSparkConnector")
.config("spark.mongodb.input.uri","mongodb://localhost:27017/movie_db.movie_ratings")
.config("spark.mongodb.input.readPreference.name", "secondaryPreferred")
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/movie_db.movie_ratings")
.getOrCreate();
OR
2.
SparkSession sparkSession = SparkSession.builder()
.master("local")
.appName("MongoSparkConnector")
.getOrCreate();
Then,
String mongoUrl = "mongodb://localhost:27017/movie_db.movie_ratings";
sparkSession.sparkContext().conf().set("spark.mongodb.input.uri", mongoUrl);
sparkSession.sparkContext().conf().set("spark.mongodb.output.uri", mongoUrl);
Map<String, String> readOverrides = new HashMap<String, String>();
readOverrides.put("collection", sourceCollection);
readOverrides.put("readPreference.name", "secondaryPreferred");
ReadConfig readConfig = ReadConfig.create(sparkSession).withOptions(readOverrides);
Dataset<Row> df = MongoSpark.loadAndInferSchema(sparkSession,readConfig);

Spark Scala Cassandra CSV insert into cassandra

Here is the code below:
Scala Version: 2.11.
Spark Version: 2.0.2.6
Cassandra Version: cqlsh 5.0.1 | Cassandra 3.11.0.1855 | DSE 5.1.3 | CQL spec 3.4.4 | Native protocol v4
I am trying to read from a CSV file and write to a Cassandra table. I am new to Scala and Spark. Please correct me where I am going wrong.
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}
import com.datastax
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
import org.apache.spark.sql._
import com.datastax.spark.connector.UDTValue
import com.datastax.spark.connector.mapper.DefaultColumnMapper
object dataframeset {
def main(args: Array[String]): Unit = {
// Cassandra Part
val conf = new SparkConf().setAppName("Sample1").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
val rdd1 = sc.cassandraTable("tdata", "map")
rdd1.collect().foreach(println)
// Scala Read CSV Part
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)
val spark1 = org.apache.spark.sql.SparkSession
.builder()
.master("local")
.appName("Spark SQL basic example")
.getOrCreate()
val df = spark1.read.format("csv")
.option("header","true")
.option("inferschema", "true")
.load("/Users/tom/Desktop/del2.csv")
import spark1.implicits._
df.printSchema()
val dfprev = df.select(col = "Year","Measure").filter("Category = 'Prevention'" )
// dfprev.collect().foreach(println)
val a = dfprev.select("YEAR")
val b = dfprev.select("Measure")
val collection = sc.parallelize(Seq(a,b))
collection.saveToCassandra("tdata", "map", SomeColumns("sno", "name"))
spark1.stop()
}
}
Error:
Exception in thread "main" java.lang.IllegalArgumentException: Multiple constructors with the same number of parameters not allowed.
Cassandra Table
cqlsh:tdata> desc map
CREATE TABLE tdata.map (
sno int PRIMARY KEY,
name text
);
I know I am missing something, especially in trying to write the entire DataFrame into Cassandra in one shot, but I don't know what needs to be done.
Thanks
tom
You can write a DataFrame (Dataset[Row] in Spark 2.x) directly to Cassandra.
To connect to Cassandra you will have to define the Cassandra host in the Spark conf, plus the username and password if authentication is enabled, using something like:
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "CASSANDRA_HOST")
.set("spark.cassandra.auth.username", "CASSANDRA_USERNAME")
.set("spark.cassandra.auth.password", "CASSANDRA_PASSWORD")
OR
val spark1 = org.apache.spark.sql.SparkSession
.builder()
.master("local")
.config("spark.cassandra.connection.host", "CASSANDRA_HOST")
.config("spark.cassandra.auth.username", "CASSANDRA_USERNAME")
.config("spark.cassandra.auth.password", "CASSANDRA_PASSWORD")
.appName("Spark SQL basic example")
.getOrCreate()
import org.apache.spark.sql.functions.col
val dfprev = df.filter("Category = 'Prevention'").select(col("Year").as("yearAdded"), col("Measure").as("Recording"))
dfprev.write
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "map", "keyspace" -> "tdata"))
.save()
For details, see the DataFrame documentation of the spark-cassandra-connector.
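One thing to watch, assuming the connector's DataFrame writer matches DataFrame columns to table columns by name: the table in the question has columns sno and name, so the selected columns would need those names rather than yearAdded and Recording. The sketch below shows the renaming step on made-up in-memory rows standing in for the CSV, so it runs without Cassandra; the write itself is left as a comment. CsvToCassandraSketch and prepare are hypothetical names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object CsvToCassandraSketch {
  // Keeps the 'Prevention' rows and renames the columns to match tdata.map (sno, name).
  def prepare(df: DataFrame): DataFrame =
    df.filter(col("Category") === "Prevention")
      .select(col("Year").as("sno"), col("Measure").as("name"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[1]")
      .appName("csv-to-cassandra-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows standing in for the CSV file.
    val df = Seq(
      (2015, "Vaccination", "Prevention"),
      (2016, "Surgery", "Treatment")
    ).toDF("Year", "Measure", "Category")

    val out = prepare(df)
    out.printSchema()

    // With a reachable Cassandra node and the connector configured, the write would be:
    // out.write
    //   .format("org.apache.spark.sql.cassandra")
    //   .options(Map("table" -> "map", "keyspace" -> "tdata"))
    //   .mode("append")
    //   .save()

    spark.stop()
  }
}
```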

How to add the "--deploy-mode cluster" option to my scala code

Hello
I want to add the option "--deploy-mode cluster" in my Scala code:
val sparkConf = new SparkConf().setMaster("spark://192.168.60.80:7077")
without using the shell (the .\spark-submit command).
I want to use "spark.submit.deployMode" in Scala.
With SparkConf:
//set up the spark configuration and create contexts
val sparkConf = new SparkConf().setAppName("SparkApp").setMaster("spark://192.168.60.80:7077")
val sc = new SparkContext(sparkConf).set("spark.submit.deployMode", "cluster")
with SparkSession:
val spark = SparkSession
.builder()
.appName("SparkApp")
.master("spark://192.168.60.80:7077")
.config("spark.submit.deployMode","cluster")
.enableHiveSupport()
.getOrCreate()
You can use
val sparkConf = new SparkConf().setMaster("spark://192.168.60.80:7077").set("spark.submit.deployMode", "cluster")
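A caveat, stated as my understanding and not something confirmed in this thread: the deploy mode decides where the driver is started, and by the time this SparkConf is built the driver is already running, so setting the property here may not move it to the cluster. If the goal is to submit in cluster mode from code without the spark-submit shell command, one option is the launcher API that ships with Spark. In the sketch below the jar path and main class are hypothetical, and SPARK_HOME (or setSparkHome) must point at a Spark installation for the launch to work.

```scala
import org.apache.spark.launcher.SparkLauncher

object ClusterSubmitSketch {
  // Configures a launcher; nothing is started until startApplication() is called.
  def buildLauncher(appJar: String, mainClass: String): SparkLauncher =
    new SparkLauncher()
      .setMaster("spark://192.168.60.80:7077")
      .setDeployMode("cluster")
      .setAppResource(appJar)
      .setMainClass(mainClass)
      .setAppName("SparkApp")

  def main(args: Array[String]): Unit = {
    // Hypothetical jar path and main class, for illustration only.
    val handle = buildLauncher("/path/to/spark-app.jar", "com.example.SparkApp")
      .startApplication()
    println(handle.getState)
  }
}
```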