Spark Scala: insert CSV data into Cassandra

Scala version: 2.11
Spark version: 2.0.2.6
Cassandra version: cqlsh 5.0.1 | Cassandra 3.11.0.1855 | DSE 5.1.3 | CQL spec 3.4.4 | Native protocol v4
I am trying to read from a CSV file and write to a Cassandra table. I am new to Scala and Spark, so please correct me where I am going wrong. Here is the code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import com.datastax.spark.connector._
object dataframeset {
  def main(args: Array[String]): Unit = {
    // Cassandra part
    val conf = new SparkConf().setAppName("Sample1").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val rdd1 = sc.cassandraTable("tdata", "map")
    rdd1.collect().foreach(println)

    // CSV-reading part
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)
    val spark1 = org.apache.spark.sql.SparkSession
      .builder()
      .master("local")
      .appName("Spark SQL basic example")
      .getOrCreate()
    val df = spark1.read.format("csv")
      .option("header", "true")
      .option("inferschema", "true")
      .load("/Users/tom/Desktop/del2.csv")
    import spark1.implicits._
    df.printSchema()

    val dfprev = df.select(col = "Year", "Measure").filter("Category = 'Prevention'")
    // dfprev.collect().foreach(println)
    val a = dfprev.select("YEAR")
    val b = dfprev.select("Measure")
    val collection = sc.parallelize(Seq(a, b))
    collection.saveToCassandra("tdata", "map", SomeColumns("sno", "name"))
    spark1.stop()
  }
}
Error:
Exception in thread "main" java.lang.IllegalArgumentException: Multiple constructors with the same number of parameters not allowed.
Cassandra table:
cqlsh:tdata> desc map

CREATE TABLE tdata.map (
    sno int PRIMARY KEY,
    name text
);
I know I am missing something, especially when trying to write an entire DataFrame into Cassandra in one shot, but I don't know what needs to be done.
Thanks,
tom

You can write a DataFrame (a Dataset[Row] in Spark 2.x) directly to Cassandra.
You will have to set the Cassandra host, and the username and password if authentication is enabled, in the Spark configuration so that Spark can connect to Cassandra, using something like:
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "CASSANDRA_HOST")
  .set("spark.cassandra.auth.username", "CASSANDRA_USERNAME")
  .set("spark.cassandra.auth.password", "CASSANDRA_PASSWORD")
OR
val spark1 = org.apache.spark.sql.SparkSession
  .builder()
  .master("local")
  .config("spark.cassandra.connection.host", "CASSANDRA_HOST")
  .config("spark.cassandra.auth.username", "CASSANDRA_USERNAME")
  .config("spark.cassandra.auth.password", "CASSANDRA_PASSWORD")
  .appName("Spark SQL basic example")
  .getOrCreate()

val dfprev = df.filter("Category = 'Prevention'")
  .select(col("Year").as("yearAdded"), col("Measure").as("Recording"))

dfprev.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "map", "keyspace" -> "tdata"))
  .save()
See also: DataFrames in the spark-cassandra-connector documentation.
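Putting the pieces together, a minimal end-to-end sketch under the same assumptions as the question (CSV columns Year, Measure and Category; target table tdata.map with sno int and name text; spark-cassandra-connector on the classpath) might look like the following. Treat it as a starting point rather than a verified solution:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DataframeToCassandra {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("CSV to Cassandra")
      .config("spark.cassandra.connection.host", "CASSANDRA_HOST")
      .getOrCreate()

    // Read the CSV with a header row and inferred column types
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/Users/tom/Desktop/del2.csv")

    // Keep only the Prevention rows and rename/cast the columns so they
    // line up with the Cassandra columns sno (int) and name (text)
    val dfprev = df.filter(col("Category") === "Prevention")
      .select(col("Year").cast("int").as("sno"), col("Measure").as("name"))

    // Write the whole DataFrame to tdata.map in one shot
    dfprev.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "map", "keyspace" -> "tdata"))
      .mode("append")
      .save()

    spark.stop()
  }
}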

Related

Unable to create multiple files using foreachBatch in spark (This Code Works Now)

I want to save files to multiple destinations using foreachBatch. The code runs without errors, but foreachBatch isn't behaving the way I want.
Kindly help me with this if you have any clue.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.streaming._
import org.apache.spark.storage.StorageLevel

object multiDestination {
  val spark = SparkSession.builder()
    .master("local")
    .appName("Writing data to multiple destinations")
    .getOrCreate()

  def main(args: Array[String]): Unit = {
    val mySchema = StructType(Array(
      StructField("Id", IntegerType),
      StructField("Name", StringType)
    ))
    val askDF = spark
      .readStream
      .format("csv")
      .option("header", "true")
      .schema(mySchema)
      .load("/home/amulya/Desktop/csv/")
    //println(askDF.show())
    println(askDF.isStreaming)

    askDF.writeStream.foreachBatch { (askDF: DataFrame, batchId: Long) =>
      askDF.persist()
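The snippet is cut off above, but for reference the usual shape of a fan-out foreachBatch, persisting the micro-batch once and writing it to each sink, looks roughly like this (the output and checkpoint paths are placeholders, not taken from the question):

askDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Cache the micro-batch so it is not recomputed for every sink
    batchDF.persist()
    // Write the same batch to two different destinations
    batchDF.write.mode("append").format("parquet").save("/tmp/out/parquet")
    batchDF.write.mode("append").option("header", "true").format("csv").save("/tmp/out/csv")
    batchDF.unpersist()
  }
  .option("checkpointLocation", "/tmp/out/checkpoint")
  .start()
  .awaitTermination()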

AWS EMR spark job step won't execute

I am trying to run a Spark program on AWS EMR. It simply reads a CSV file and prints it with dataframe.show(). I have been waiting for the step to execute for the past 15-20 minutes with no progress. The CSV file in the S3 bucket is very small, only 10 rows of 2 columns.
Here is my program:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.LogManager
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.types._

object TriangleCountMain {
  // Edge object
  case class Edge(from: Int, to: Int)

  def main(args: Array[String]) {
    val logger: org.apache.log4j.Logger = LogManager.getRootLogger
    if (args.length != 2) {
      logger.error("Usage:\nTwitterDataSet_Spark.TriangleCountMain <input dir> <output dir>")
      System.exit(1)
    }

    // Spark session
    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .getOrCreate()
    import spark.implicits._

    // Dataframe structure
    val dfSchema = StructType(Array(
      StructField("from", IntegerType, true),
      StructField("to", IntegerType, true)))

    // Dataset of edges
    val nonFilteredEdge: Dataset[Edge] = spark.read
      .option("header", "false")
      .option("inferSchema", "true")
      .schema(dfSchema)
      .csv(args(0))
      .as[Edge]

    val edge = nonFilteredEdge
    edge.show
    spark.stop
  }
}
This program runs successfully in local mode.
Thank you.

How to load files in Spark SQL from remote Hive storage (S3, ORC) using Spark/Scala + code + configuration

IntelliJ (Spark) ---> Hive (remote) ---> storage on S3 (ORC format)
I am not able to read a remote Hive table through Spark/Scala. I was able to read the table schema, but not the table data.
Error:
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql.types.StructType

object mainclas {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("hivetable")
      .config("hive.metastore.uris", "thrift://10.20.30.40:9083")
      .config("access-key", "PQHFFDEGGDDVDVV")
      .config("secret-key", "FFGSGHhjhhhdjhJHJHHJGJHGjHH")
      .config("format", "orc")
      .enableHiveSupport()
      .getOrCreate()

    val res = spark.sqlContext.sql("show tables").show()
    val res1 = spark.sql("select * from ace.visit limit 5").show()
  }
}
Try this:
val spark = SparkSession.builder
  .master("local[*]")
  .appName("hivetable")
  .config("hive.metastore.uris", "thrift://10.20.30.40:9083")
  .config("fs.s3n.awsAccessKeyId", "PQHFFDEGGDDVDVV")
  .config("fs.s3n.awsSecretAccessKey", "FFGSGHhjhhhdjhJHJHHJGJHGjHH")
  .config("format", "orc")
  .enableHiveSupport()
  .getOrCreate()
Note that you need to prefix all the fs.* options with spark.hadoop if you are setting them in the Spark config. And, as noted, use s3a instead of s3n if you can.
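For example, a minimal sketch of the same builder with the spark.hadoop prefix and the s3a credential properties (the key values are just the placeholders from the question) would be:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("hivetable")
  .config("hive.metastore.uris", "thrift://10.20.30.40:9083")
  // Hadoop filesystem options need the spark.hadoop prefix when set in the Spark config
  .config("spark.hadoop.fs.s3a.access.key", "PQHFFDEGGDDVDVV")
  .config("spark.hadoop.fs.s3a.secret.key", "FFGSGHhjhhhdjhJHJHHJGJHGjHH")
  .enableHiveSupport()
  .getOrCreate()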

How to fix 22: error: not found: value SparkSession in Scala?

I am new to Spark and I would like to read a CSV file into a DataFrame.
Spark 1.3.0 / Scala 2.3.0
This is what I have so far:
# Start the Scala shell with the CSV package module
spark-shell --packages com.databricks:spark-csv_2.10:1.3.0

// Import Spark classes
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import sqlCtx._

// Create SparkConf
val conf = new SparkConf().setAppName("local").setMaster("master")
val sc = new SparkContext(conf)

// Create SQLContext
val sqlCtx = new SQLContext(sc)

// Create SparkSession and use it for all purposes:
val session = SparkSession.builder().appName("local").master("master").getOrCreate()

// Read the CSV file and turn it into a DataFrame
val df_fc = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("/home/Desktop/test.csv")
However, at SparkSession.builder() it gives the following error:
22: error: not found: value SparkSession
^
How can I fix this error?
SparkSession is only available from Spark 2 onwards. In Spark 2 there is no need to create a SparkContext, because SparkSession itself provides the gateway to everything.
Since you are using version 1.x, try the following instead:
val df_fc = sqlCtx.read.format("com.databricks.spark.csv").option("header", "true").load("/home/Desktop/test.csv")
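If you do later move to Spark 2.x, where SparkSession exists, the equivalent read would look roughly like this (CSV support is built in there, so the spark-csv package is no longer needed):

import org.apache.spark.sql.SparkSession

// Build the session and read the CSV with its header row
val spark = SparkSession.builder().appName("local").master("local[*]").getOrCreate()
val df_fc = spark.read.option("header", "true").csv("/home/Desktop/test.csv")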

How to load data from Cassandra table

I am working with Spark 2.0.1 and Cassandra 3.9. I want to read data from a table in Cassandra via CassandraSQLContext. However, Spark 2.0 changed this and now uses SparkSession. I am trying to use SparkSession; the following is my code.
Could you please review it and give your advice?
def main(args: Array[String], date_filter: String): Unit = {
  val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
  val sc = new SparkContext(conf)
  val sparkSession = SparkSession.builder
    .master("local")
    .appName("my-spark-app")
    .config(conf)
    .getOrCreate()
  import sparkSession.implicits._
  import org.apache.spark.sql._

  val rdd = sparkSession
    .read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("table" -> "users", "keyspace" -> "monita"))
    .load()
  println("count: " + rdd.count())
}
Your code looks OK. You don't need to create a SparkContext; you can set the Cassandra connection properties directly in the SparkSession config, like below.
val sparkSession = SparkSession
  .builder
  .master("local")
  .appName("my-spark-app")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .config("spark.cassandra.connection.port", "9042")
  .getOrCreate()
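With that session, the read from the question then works unchanged, for example:

// Load the Cassandra table through the Spark data source API
val users = sparkSession.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "users", "keyspace" -> "monita"))
  .load()
println("count: " + users.count())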