How to pass a date format value to a Spark job jar via the CLI - Scala

I have a Spark job, built into a jar with sbt, that I run with spark-submit.
I want to pass a parameter that contains a space, like yyyy-MM-dd HH:mm:ss. It is one parameter, but the CLI treats it as two. How do I fix it?
spark-submit --class <className> --master local <jar path> <arg0: file path> yyyy-MM-dd dd-MM-yyyy string
Here is my code:
val logFile = args(0)
val data = spark.sparkContext.textFile(logFile)
...
val formatInputType = args(1)
val requiredOutputFormat = args(2)
val formatOutputType = args(3)
val ds3 = ds2
  .withColumn("formatInputType", lit(formatInputType))           // "yyyy-MM-dd HH:mm:ss" ??
  .withColumn("requiredOutputFormat", lit(requiredOutputFormat)) // "dd-MM HH" ??
  .withColumn("formatOutputType", lit(formatOutputType))         // "epoch/string"

Related

Processing data in directories with specific date range using Spark Scala

I am trying to load incremental data from an HDFS folder using Spark Scala code.
So suppose I have the following folders:
/hadoop/user/src/2021-01-22
/hadoop/user/src/2021-01-23
/hadoop/user/src/2021-01-24
/hadoop/user/src/2021-01-25
/hadoop/user/src/2021-01-26
/hadoop/user/src/2021-01-27
/hadoop/user/src/2021-01-28
/hadoop/user/src/2021-01-29
I am passing the path /hadoop/user/src from the spark-submit command and then writing the code below:
val Temp_path: String = args(1) // hadoop/user/src
val incre_path = ZonedDateTime.now(ZoneId.of("UTC")).minusDays(1)
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val incre_path_day = formatter format incre_path
val new_path = Temp_path.concat("/")
val path = new_path.concat(incre_path_day)
So it processes the (sysdate-1) folder, i.e. if today's date is 2021-01-29 it will process the 2021-01-28 directory's data.
Is there any way to modify the code so I can give a path like hadoop/user/src/2021-01-22 and the code will process data up to 2021-01-28 (i.e. 2021-01-23, 2021-01-24, 2021-01-25, 2021-01-26, 2021-01-27, 2021-01-28)?
Kindly suggest how I should modify my code.
You can use listStatus from the Hadoop FileSystem to list all the folders in the input folder and filter on the date part:
import org.apache.hadoop.fs.Path
import java.time.{ZonedDateTime, ZoneId}
import java.time.format.DateTimeFormatter
val inputPath = "hadoop/user/src/2021-01-22"
val startDate = inputPath.substring(inputPath.lastIndexOf("/") + 1)
val endDate = DateTimeFormatter.ofPattern("yyyy-MM-dd").format(ZonedDateTime.now(ZoneId.of("UTC")).minusDays(1))
val baseFolder = new Path(inputPath.substring(0, inputPath.lastIndexOf("/") + 1))
val files = baseFolder.getFileSystem(sc.hadoopConfiguration).listStatus(baseFolder).map(_.getPath.toString)
val filteredFiles = files.filter(path => path.split("/").last > startDate && path.split("/").last <= endDate) // end date (sysdate-1) inclusive
// finally load only the folders you want
val df = spark.read.csv(filteredFiles: _*)
You could also pass a PathFilter to listStatus to filter the paths while scanning the base folder.
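A minimal sketch of that variant, reusing startDate, endDate, baseFolder and sc from above:
import org.apache.hadoop.fs.{Path, PathFilter}

// Filter directories while listing, instead of after the fact
val dateFilter = new PathFilter {
  override def accept(p: Path): Boolean = {
    val day = p.getName              // e.g. "2021-01-25"
    day > startDate && day <= endDate
  }
}

val filteredFiles = baseFolder.getFileSystem(sc.hadoopConfiguration)
  .listStatus(baseFolder, dateFilter)
  .map(_.getPath.toString)

val df = spark.read.csv(filteredFiles: _*)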

Spark submit ,how to read user input parameters?

I need to pass additional input parameters to a Spark job for validation. I know that after uber.jar we can pass all required parameters separated by spaces. Is there an option to read parameters like the ones below using Scala?
spark-submit --jar uber.jar -Dtable.name=emp -Dfiltercondition=age,name
The -D format is mostly for Java system properties, not CLI arguments.
Spark passes arguments to your app's main method like any other Java/Scala program.
object App {
  def main(args: Array[String]): Unit = {
    val cmd: CommandLine = parseArg(args)                  // <-- here
    val master = cmd.getOptionValue("master", "local[*]")  // parse args
    val spark = SparkSession.builder()
      .appName(App.getClass.getName)
      .master(master)
      .getOrCreate()
    ...
  }

  // Using Apache Commons CLI
  private def parseArg(args: Array[String]): CommandLine = {
    import org.apache.commons.cli._
    val options = new Options
    ...
  }
}
Then: spark-submit --class my.app.App app.jar --master 'local[*]' (anything after the application JAR is passed to your main method as args).
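As an illustration, parseArg could look something like the sketch below with Apache Commons CLI; the option names are only examples loosely based on the question, not a fixed API:
import org.apache.commons.cli.{CommandLine, DefaultParser, Options}

private def parseArg(args: Array[String]): CommandLine = {
  val options = new Options
  // addOption(shortName, longName, hasArg, description)
  options.addOption("m", "master", true, "Spark master URL")
  options.addOption("t", "table.name", true, "table to validate")
  options.addOption("f", "filtercondition", true, "comma-separated filter columns")
  new DefaultParser().parse(options, args)
}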

How to pass set of input files(not directory) to spark job and create dataframe on top of those files

I would like to pass a set of Avro files as input to a Spark job and create a dataframe on top of those files. (I don't want to place the files in a directory and pass the directory as input.)
In the Spark shell, I'm able to create the dataframe successfully like below:
val DF = hiveContext.read.format("com.databricks.spark.avro").load("/data/year=2019/month=09/day=28/hour=01/data_1.1569650402704.avro","/data/year=2019/month=09/day=28/hour=01/data_2.1569650402353.avro")
But the same fails when I try to run it through the spark-submit command.
To pass the Avro files independently to the Spark job, I'm trying to place the Avro file paths in a text file and pass this file as an input argument to the driver class.
textFile:
/data/year=2019/month=09/day=28/hour=01/data_1.1569650402704.avro
/data/year=2019/month=09/day=28/hour=01/data_2.1569650402353.avro
spark-submit --class Spark_submit_test --master yarn Spark_submit_test.jar textFile
val filename = args(0)
val files = Source.fromFile(filename).getLines
val fileList = files.mkString(",")
println("fileList : "+fileList)
This prints:
fileList : /data/ASDS/PNR/archive/year=2019/month=09/day=28/hour=01/data_1.1569650402704.avro,/data/ASDS/PNR/archive/year=2019/month=09/day=28/hour=01/data_2.1569650402353.avro
val DF = hiveContext.read.format("com.databricks.spark.avro").load(fileList)
I get the exception below:
Exception in thread "main" java.io.FileNotFoundException: File hdfs://bdaolc01-ns/data/ASDS/PNR/archive/year=2019/month=09/day=28/hour=01/data_1.1569650402704.avro,/data/ASDS/PNR/archive/year=2019/month=09/day=28/hour=01/data_2.1569650402353.avro does not exist.
I'm not sure how to avoid "hdfs://bdaolc01-ns" being prepended to the whole comma-joined string.
Please correct me if I'm doing it wrong, or suggest a better approach for doing the same.
Note: I tried enclosing the file names in double quotes, but it did not help.
Expected result:
The dataframe should be created successfully, and df.printSchema should list the proper schema of the Avro files.
You want the splat operator!
myList: _*
scala> val paths = List("/tmp/example-parquet/part-00000-38cd8823-bff7-46f0-82a0-13d1d00ecce5-c000.snappy.parquet")
paths: List[String] = List(/tmp/example-parquet/part-00000-38cd8823-bff7-46f0-82a0-13d1d00ecce5-c000.snappy.parquet)
scala> val data = spark.read.parquet(paths: _*)
data: org.apache.spark.sql.DataFrame = [id: bigint, a: int ... 1 more field]
scala> data.count
res0: Long = 12500000
Pass the input file path to the spark-submit command with the --files option.
Also pass the input file name as a command-line argument.
This way I'll be able to read the file in the driver class.
val avrofiles = Source.fromFile(inputFileName).getLines.toArray
And create a dataframe
val dF = hiveContext.read.format("com.databricks.spark.avro").load(avrofiles:_*)
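Putting the pieces together, a minimal end-to-end sketch (it assumes the list file is readable from the driver and keeps the question's Avro format string):
import scala.io.Source

val inputFileName = args(0)
// one Avro path per line in the list file
val avroFiles = Source.fromFile(inputFileName).getLines().toSeq

val df = hiveContext.read
  .format("com.databricks.spark.avro")
  .load(avroFiles: _*)   // varargs, not a single comma-joined string

df.printSchema()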

spark-submit 'Unable to coerce 'startDate' to a formatted date (long)'

I get the error Unable to coerce 'startDate' to a formatted date (long) when I run spark-submit as below:
dse -u cassandra -p cassandra spark-submit --class com.abc.rm.Total_count \
--master dse://x.x.x.x:9042 TotalCount.jar \
"2024-06-11 00:00:00.000+0000" "2027-11-15 00:00:00.000+0000" \
10-118-16-132.bbc.ds.com pramod history
Below is my code:
package com.abc.rm
import com.datastax.spark.connector._
import org.apache.spark.SparkContext
object Total_count {
  def main(args: Array[String]): Unit = {
    var startDate = args(0)
    var endDate = args(1)
    val master = args(2)
    var ks = args(3)
    var table_name = args(4)

    println("startDate-->" + startDate)
    println("endDate-->" + endDate)
    println("master-->" + master)

    val conf = new org.apache.spark.SparkConf().setAppName("Total_count")
      .set("spark.cassandra.connection.host", master)
      .set("spark.cassandra.auth.username", "cassandra")
      .set("spark.cassandra.auth.password", "cassandra")
    var sc = new SparkContext(conf)

    val rdd = sc.cassandraTable("pramod", "history")
      .where("sent_date>='startDate' and sent_date <='endDate'")
      .cassandraCount()

    println("count--> " + rdd)
    sc.stop()
    System.exit(1)
  }
}
How can I pass/convert the arguments?
You aren't passing the arguments; instead you are passing the strings startDate and endDate literally. To make it work, you need to write it as:
.where(s"sent_date>='$startDate' and sent_date <='$endDate'")

How to use properties in a Spark Scala Maven project

I want to include a properties file explicitly and reference it in the Spark code, instead of hardcoding all the credentials directly in the Spark code.
I am trying the following approach but am not able to make it work; AppContext cannot be resolved.
Please guide me on how to achieve this.
Spark_env.properties (under src/main/resources in a Maven project for Spark with Scala):
CASSANDRA_HOST1=127.0.0.133
CASSANDRA_PORT1=9042
CASSANDRA_USER1=usr1
CASSANDRA_PASS1=pas2
DataMigration.cassandra.keyspace1=demo2
DataMigration.cassandra.table1= data1
CASSANDRA_HOST2=
CASSANDRA_PORT2=9042
CASSANDRA_USER2=usr2
CASSANDRA_PASS2=pas2
D.cassandra.keyspace2=kesp2
D.cassandra.table2= data2
DataMigration.DifferencedRecords.output.path1=C:/spark_windows_proj/File1.csv
DataMigration.DifferencedRecords.output.path2=C:/spark_windows_proj/File1.parquet
----------------------------------------------------------------------------------
DM.scala
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.mapreduce.v2.app.AppContext
object Data_Migration {
  def main(args: Array[String]) {
    val host1: String = AppContext.getProperties().getProperty("CASSANDRA_HOST1")
    val port1 = AppContext.getProperties().getProperty("CASSANDRA_PORT1").toInt
    val keySpace1: String = AppContext.getProperties().getProperty("DataMigration.cassandra.keyspace1")
    val DataMigrationTableName1: String = AppContext.getProperties().getProperty("DataMigration.cassandra.table1")
    val username1: String = AppContext.getProperties().getProperty("CASSANDRA_USER1")
    val pass1: String = AppContext.getProperties().getProperty("CASSANDRA_PASS1")
    val host2: String = AppContext.getProperties().getProperty("CASSANDRA_HOST2")
    val port2 = AppContext.getProperties().getProperty("CASSANDRA_PORT2").toInt
    val keySpace2: String = AppContext.getProperties().getProperty("DataMigration.cassandra.keyspace2")
    val DataMigrationTableName2: String = AppContext.getProperties().getProperty("DataMigration.cassandra.table2")
    val username2: String = AppContext.getProperties().getProperty("CASSANDRA_USER2")
    val pass2: String = AppContext.getProperties().getProperty("CASSANDRA_PASS2")
    val Result_csv: String = AppContext.getProperties().getProperty("DataMigration.DifferencedRecords.output.path1")
    val Result_parquet: String = AppContext.getProperties().getProperty("DataMigration.DifferencedRecords.output.path2")

    val sc = AppContext.getSparkContext()
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("ABC")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    val df_read1 = spark.read
      .format("org.apache.spark.sql.cassandra")
      .option("spark.cassandra.connection.host", host1)
      .option("spark.cassandra.connection.port", port1)
      .option("spark.cassandra.auth.username", username1)
      .option("spark.cassandra.auth.password", pass1)
      .option("keyspace", keySpace1)
      .option("table", DataMigrationTableName1)
      .load()
I would rather pass the properties explicitly via the --properties-file option of spark-submit when submitting the job.
The AppContext won't necessarily work for all submission types, while passing a config file should work everywhere.
Edit: For local usage without spark-submit, you can simply use the standard Properties class, load it from the resources, and access the properties from it. You only need to put the property file into src/main/resources rather than src/test/resources, which is on the classpath only for tests. The code is something like:
import java.util.Properties

val props = new Properties
props.load(getClass.getClassLoader.getResourceAsStream("file.props"))
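Applied to this question, a minimal sketch that loads Spark_env.properties from src/main/resources and reads a few of the keys listed above:
import java.util.Properties

// Spark_env.properties must be on the classpath (src/main/resources in a Maven project)
val props = new Properties()
props.load(getClass.getClassLoader.getResourceAsStream("Spark_env.properties"))

val host1     = props.getProperty("CASSANDRA_HOST1")
val port1     = props.getProperty("CASSANDRA_PORT1").toInt
val keySpace1 = props.getProperty("DataMigration.cassandra.keyspace1")
val table1    = props.getProperty("DataMigration.cassandra.table1")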