spark-submit 'Unable to coerce 'startDate' to a formatted date (long)' - scala

I am getting the error Unable to coerce 'startDate' to a formatted date (long) when I run spark-submit as below:
dse -u cassandra -p cassandra spark-submit --class com.abc.rm.Total_count \
--master dse://x.x.x.x:9042 TotalCount.jar \
"2024-06-11 00:00:00.000+0000" "2027-11-15 00:00:00.000+0000" \
10-118-16-132.bbc.ds.com pramod history
Below is my code:
package com.abc.rm
import com.datastax.spark.connector._
import org.apache.spark.SparkContext
object Total_count {
def main(args: Array[String]):Unit = {
var startDate = args(0)
var endDate = args(1)
val master = args(2)
var ks = args(3)
var table_name = args(4)
println("startDate-->"+startDate)
println("endDate-->"+endDate)
println("master-->"+master)
val conf = new org.apache.spark.SparkConf().setAppName("Total_count")
.set("spark.cassandra.connection.host", master)
.set("spark.cassandra.auth.username","cassandra")
.set("spark.cassandra.auth.password","cassandra")
var sc = new SparkContext(conf)
val rdd = sc.cassandraTable("pramod", "history")
.where("sent_date>='startDate' and sent_date <='endDate'")
.cassandraCount()
println("count--> "+rdd)
sc.stop()
System.exit(1)
}}
How can I pass/convert the arguments?

You aren't passing the arguments; instead you're passing the literal strings startDate and endDate. To make it work, you need to write it as:
.where(s"sent_date>='$startDate' and sent_date <='$endDate'")

Related

How to pass dateformat value to spark job jar by CLI

I have a Spark job, built into a jar with sbt, that I run with spark-submit.
I want to pass a parameter that contains a space, like yyyy-MM-dd HH:mm:ss. It is one parameter, but the CLI treats it as two. How do I fix this?
spark-submit --class <className> --master local <jar path> <agr0:file path> yyyy-MM-dd dd-MM-yyyy string
Here is my code:
val logFile = args(0)
val data = spark.sparkContext.textFile(logFile)
...
val formatInputType = args(1)
val requiredOutputFormat = args(2)
val formatOutputType = args(3)
val ds3 = ds2
.withColumn("formatInputType", lit(formatInputType)) // "yyyy-MM-dd HH:mm:ss" ??
.withColumn("requiredOutputFormat", lit(requiredOutputFormat)) // "dd-MM HH" ??
.withColumn("formatOutputType", lit(formatOutputType)) // "epoch/string"

How to write the dataframe to S3 after filter

I am trying to write the DataFrame, after filtering, to S3 in CSV format with the Scala code below in the script editor.
Current status:
It does not show any error after the run, but it just doesn't write to S3.
The logs print Start, but I cannot see the End print.
There is no particular error message indicating the problem.
It stops at temp.count.
Environment: I have admin rights to all of S3.
import com.amazonaws.services.glue.GlueContext
import <others>
object GlueApp {
def main(sysArgs: Array[String]) {
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
// #params: [JOB_NAME]
val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
Job.init(args("JOB_NAME"), glueContext, args.asJava)
val datasource0 = glueContext.getCatalogSource(database = "db", tableName = "table", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
val appymapping1 = datasource0.appyMapping(mapping=........)
val temp=appymapping1.toDF.filter(some filtering rules)
print("start")
if (temp.count() <= 0) {
temp.write.format("csv").option("sep", ",").save("s3://directory/error.csv")
}
print("End")
You're writing the DataFrame to S3 inside an if condition (the condition is meant to check whether the DataFrame has one or more rows), but your if condition is inverted: it is only true if the DataFrame has 0 (or fewer) rows. Change that.
Also: Spark always saves files with "part-" names, so change the S3 path to a directory like s3://directory/ and add .mode("overwrite").
So your write query should be:
temp.write.format("csv").option("sep", ",").mode("overwrite").save("s3://directory")
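Putting both fixes together, the write block could look roughly like this (the S3 path is a placeholder):
if (temp.count() > 0) {
  // Spark writes part-* files under this prefix
  temp.write.format("csv").option("sep", ",").mode("overwrite").save("s3://directory/")
}
print("End")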

Json argument in Spark submit

My spark-submit command :
spark-submit --deploy-mode cluster --class spark_package.import_jar s3://test-system/test.jar "{\"localparameter\" : {\"mail\": \"\", \"clusterid\": \"test\", \"clientCd\": \"1000\", \"processid\": \"1234\"} }"
Here I want to pass clientCd as a parameter to my Scala code.
My scala code :
package Spark_package
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object SampleFile {
def main(args: Array[String]) {
val spark = SparkSession.builder.master("local[*]").appName("SampleFile").getOrCreate()
val sc = spark.sparkContext
val conf = new SparkConf().setAppName("SampleFile")
val sqlContext = spark.sqlContext
val df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("s3a://test-system/data/*.gz")
df.createOrReplaceTempView("data")
val res = spark.sql("select count(*) from data where client_cd = $clientCd")
res.coalesce(1).write.format("csv").option("header","true").mode("Overwrite").save("s3a://dev-system/bkup/")
spark.stop()
}
}
My question is: how do I pass clientCd as a parameter to my code?
val res = spark.sql("select count(*) from data where client_cd = $clientCd")
Kindly help me on this.
Append all program arguments at the end of spark-submit; they will be available in args in main.
e.g. spark-submit --class xxx --deploy-mode cluster xxx.jar arg1 arg2
Then you can parse arg1 with a JSON unmarshaller.
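For the parsing step, a minimal sketch using json4s (which ships with Spark; the field names follow the JSON in the question):
import org.json4s._
import org.json4s.jackson.JsonMethods._

implicit val formats: Formats = DefaultFormats
// args(0) holds the JSON string passed after the jar on the command line
val clientCd = (parse(args(0)) \ "localparameter" \ "clientCd").extract[String]
val res = spark.sql(s"select count(*) from data where client_cd = '$clientCd'")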

Executing Spark scala program after compilation

I have compiled a Spark Scala program on the command line, and now I want to execute it. I don't want to use Maven or sbt.
I have used the following command to execute the program:
scala -cp ".:sparkDIrector/jars/*" wordcount
But I am getting this error
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
import org.apache.spark._
import org.apache.spark.SparkConf
/** Create a RDD of lines from a text file, and keep count of
* how often each word appears.
*/
object wordcount1 {
def main(args: Array[String]) {
// Set up a SparkContext named WordCount that runs locally using
// all available cores.
println("before conf")
val conf = new SparkConf().setAppName("WordCount")
conf.setMaster("local[*]")
val sc = new SparkContext(conf)
println("after the textfile")
// Create a RDD of lines of text in our book
val input = sc.textFile("book.txt")
println("after the textfile")
// Use flatMap to convert this into an rdd of each word in each line
val words = input.flatMap(line => line.split(' '))
// Convert these words to lowercase
val lowerCaseWords = words.map(word => word.toLowerCase())
// Count up the occurence of each unique word
println("before text file")
val wordCounts = lowerCaseWords.countByValue()
// Print the first 20 results
val sample = wordCounts.take(20)
for ((word, count) <- sample) {
println(word + " " + count)
}
sc.stop()
}
}
It shows that the error is at this line:
val conf = new SparkConf().setAppName("WordCount")
Any help?
Starting from Spark 2.0 the entry point is the SparkSession:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder
.appName("App Name")
.getOrCreate()
Then you can access the SparkContext and read the file with:
spark.sparkContext.textFile(yourFileOrURL)
Remember to stop your session at the end:
spark.stop()
I suggest you have a look at these examples: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples
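Applied to the word count above, a minimal sketch could look like:
import org.apache.spark.sql.SparkSession

object wordcount1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()
    // Read the book, split into lowercase words and count each word
    val input = spark.sparkContext.textFile("book.txt")
    val wordCounts = input.flatMap(_.split(' ')).map(_.toLowerCase).countByValue()
    wordCounts.take(20).foreach { case (word, count) => println(word + " " + count) }
    spark.stop()
  }
}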
Then, to launch your application, you have to use spark-submit:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
In your case, it will be something like:
./bin/spark-submit \
--class wordcount1 \
--master local \
/path/to/your.jar

Spark MLLib unable to write out to S3 : path already exists

I have data in a S3 bucket in directory /data/vw/. Each line is of the form:
| abc:2 def:1 ghi:3 ...
I want to convert it to the following format:
abc abc def ghi ghi ghi
The new converted lines should go to S3 in directory /data/spark
Basically, repeat each string the number of times that follows the colon. I am trying to convert a VW LDA input file to a corresponding file for consumption by Spark's LDA library.
The code:
import org.apache.spark.{SparkConf, SparkContext}
object Vw2SparkLdaFormatConverter {
def repeater(s: String): String = {
val ssplit = s.split(':')
(ssplit(0) + ' ') * ssplit(1).toInt
}
def main(args: Array[String]) {
val inputPath = args(0)
val outputPath = args(1)
val conf = new SparkConf().setAppName("FormatConverter")
val sc = new SparkContext(conf)
val vwdata = sc.textFile(inputPath)
val sparkdata = vwdata.map(s => s.trim().split(' ').map(repeater).mkString)
val coalescedSparkData = sparkdata.coalesce(100)
coalescedSparkData.saveAsTextFile(outputPath)
sc.stop()
}
}
When I run this (as a Spark EMR job in AWS), the step fails with exception:
18/01/20 00:16:28 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at ...
The code is run as:
spark-submit --class Vw2SparkLdaFormatConverter --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=true --executor-memory 4g s3a://mybucket/scripts/myscalajar.jar s3a://mybucket/data/vw s3a://mybucket/data/spark
I have tried specifying new output paths (/data/spark1 etc.), ensuring that they do not exist before the step is run. Even then it does not work.
What am I doing wrong? I am new to Scala and Spark so I might be overlooking something here.
You could convert the RDD to a DataFrame and then save with overwrite mode enabled:
coalescedSparkData.toDF.write.mode("overwrite").csv(outputPath)
Or, if you insist on using RDD methods, you can do it as described already in this answer.
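For the first option, toDF on an RDD needs the implicits of a SparkSession in scope; a minimal sketch, assuming the job builds a SparkSession instead of a bare SparkContext:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("FormatConverter").getOrCreate()
import spark.implicits._

val vwdata = spark.sparkContext.textFile(inputPath)
val sparkdata = vwdata.map(s => s.trim().split(' ').map(repeater).mkString)
// mode("overwrite") replaces the existing output directory instead of failing
sparkdata.coalesce(100).toDF.write.mode("overwrite").csv(outputPath)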