I have a Spark job, built with sbt into a jar, that I run with spark-submit.
I want to pass a parameter that contains a space, like yyyy-MM-dd HH:mm:ss. It is one parameter, but the CLI treats it as two. How do I fix this?
spark-submit --class <className> --master local <jar path> <arg0: file path> yyyy-MM-dd dd-MM-yyyy string
Here is my code:
val logFile = args(0)
val data = spark.sparkContext.textFile(logFile)
...
val formatInputType = args(1)
val requiredOutputFormat = args(2)
val formatOutputType = args(3)
val ds3 = ds2
  .withColumn("formatInputType", lit(formatInputType))           // "yyyy-MM-dd HH:mm:ss" ??
  .withColumn("requiredOutputFormat", lit(requiredOutputFormat)) // "dd-MM HH" ??
  .withColumn("formatOutputType", lit(formatOutputType))         // "epoch/string"
I am trying to load incremental data from an HDFS folder using Spark Scala code.
So suppose I have the following folders:
/hadoop/user/src/2021-01-22
/hadoop/user/src/2021-01-23
/hadoop/user/src/2021-01-24
/hadoop/user/src/2021-01-25
/hadoop/user/src/2021-01-26
/hadoop/user/src/2021-01-27
/hadoop/user/src/2021-01-28
/hadoop/user/src/2021-01-29
I am passing the path /hadoop/user/src from the spark-submit command and then writing the code below:
val Temp_path: String = args(1) // hadoop/user/src
val incre_path = ZonedDateTime.now(ZoneId.of("UTC")).minusDays(1)
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val incre_path_day = formatter format incre_path
val new_path = Temp_path.concat("/")
val path = new_path.concat(incre_path_day)
So it processes the (sysdate-1) folder, i.e. if today's date is 2021-01-29 it will process the 2021-01-28 directory's data.
Is there any way to modify the code so I can give a path like hadoop/user/src/2021-01-22 and the code will process the data up to 2021-01-28 (i.e. 2021-01-23, 2021-01-24, 2021-01-25, 2021-01-26, 2021-01-27, 2021-01-28)?
Kindly suggest how I should modify my code.
You can use listStatus from the Hadoop FileSystem to list all the folders under the input folder and filter on the date part:
import org.apache.hadoop.fs.Path
import java.time.{ZonedDateTime, ZoneId}
import java.time.format.DateTimeFormatter
val inputPath = "hadoop/user/src/2021-01-22"
val startDate = inputPath.substring(inputPath.lastIndexOf("/") + 1)
val endDate = DateTimeFormatter.ofPattern("yyyy-MM-dd").format(ZonedDateTime.now(ZoneId.of("UTC")).minusDays(1))
val baseFolder = new Path(inputPath.substring(0, inputPath.lastIndexOf("/") + 1))
val files = baseFolder.getFileSystem(sc.hadoopConfiguration).listStatus(baseFolder).map(_.getPath.toString)
val filteredFiles = files.filter(path => path.split("/").last > startDate && path.split("/").last <= endDate)
// finally load only the folders you want
val df = spark.read.csv(filteredFiles: _*)
You could also pass a PathFilter to listStatus to filter the paths while scanning the base folder.
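A minimal sketch of that PathFilter variant, assuming the same startDate, endDate and baseFolder values as above:

import org.apache.hadoop.fs.{Path, PathFilter}

val dateFilter = new PathFilter {
  // keep only folders whose name is after startDate and up to endDate
  override def accept(p: Path): Boolean = p.getName > startDate && p.getName <= endDate
}
val filteredPaths = baseFolder.getFileSystem(sc.hadoopConfiguration)
  .listStatus(baseFolder, dateFilter)
  .map(_.getPath.toString)
val df = spark.read.csv(filteredPaths: _*)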
I need to pass additional input parameters to a Spark job for validation. I know that after uber.jar we can pass all required parameters separated by spaces. Is there an option to read parameters like the ones below using Scala?
spark-submit --jar uber.jar -Dtable.name=emp -Dfiltercondition=age,name
The -D format is mostly for Java system properties, not CLI arguments.
Spark accepts arguments through your app main method like any other Java/Scala program.
object App {
  def main(args: Array[String]): Unit = {
    val cmd: CommandLine = parseArg(args)                  // <-- here
    val master = cmd.getOptionValue("master", "local[*]")  // parse args
    val spark = SparkSession.builder()
      .appName(App.getClass.getName)
      .master(master)
      .getOrCreate()
    ...
  }

  // Using Apache Commons CLI
  private def parseArg(args: Array[String]): CommandLine = {
    import org.apache.commons.cli._
    val options = new Options
    ...
  }
}
Then run it with spark-submit --class my.app.App app.jar --master 'local[*]' (everything after the application jar is passed through to main as args).
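To flesh out the parseArg stub above, a minimal sketch using Apache Commons CLI could look like the following; the table.name and filtercondition options just mirror the -D parameters from the question:

import org.apache.commons.cli._

private def parseArg(args: Array[String]): CommandLine = {
  val options = new Options
  options.addOption("m", "master", true, "Spark master URL")
  options.addOption("t", "table.name", true, "table to validate")
  options.addOption("f", "filtercondition", true, "comma-separated filter columns")
  new DefaultParser().parse(options, args) // throws ParseException on malformed input
}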
I would like to pass a set of Avro files as input to a Spark job and create a dataframe on top of those files. (I don't want to place the files in a directory and pass the directory as input.)
In the Spark shell, I'm able to create the dataframe successfully like below:
val DF = hiveContext.read.format("com.databricks.spark.avro").load("/data/year=2019/month=09/day=28/hour=01/data_1.1569650402704.avro","/data/year=2019/month=09/day=28/hour=01/data_2.1569650402353.avro")
But the same is failing when I try to run it through the spark-submit command.
To pass the Avro files independently to the Spark job, I'm trying to place the Avro file paths in a text file and pass that file as an input argument to the driver class.
textFile:
/data/year=2019/month=09/day=28/hour=01/data_1.1569650402704.avro
/data/year=2019/month=09/day=28/hour=01/data_2.1569650402353.avro
spark-submit --class Spark_submit_test --master yarn Spark_submit_test.jar textFile
import scala.io.Source

val filename = args(0)
val files = Source.fromFile(filename).getLines
val fileList = files.mkString(",")
println("fileList : " + fileList)
This prints:
fileList : /data/ASDS/PNR/archive/year=2019/month=09/day=28/hour=01/data_1.1569650402704.avro,/data/ASDS/PNR/archive/year=2019/month=09/day=28/hour=01/data_2.1569650402353.avro
val DF = hiveContext.read.format("com.databricks.spark.avro").load(fileList)
I'm getting the below exception:
Exception in thread "main" java.io.FileNotFoundException: File hdfs://bdaolc01-ns/data/ASDS/PNR/archive/year=2019/month=09/day=28/hour=01/data_1.1569650402704.avro,/data/ASDS/PNR/archive/year=2019/month=09/day=28/hour=01/data_2.1569650402353.avro does not exist.
I'm not sure how I can avoid "hdfs://bdaolc01-ns" being prepended at the beginning.
Please correct me if I'm doing it wrong, or suggest a better approach for doing the same.
Note: I tried enclosing the file names in double quotes, but it was of no use.
Expected Result :
The dataframe should be created successfully and df.printSchema should list the proper schema of the Avro files.
You want the splat operator!
myList: _*
scala> val paths = List("/tmp/example-parquet/part-00000-38cd8823-bff7-46f0-82a0-13d1d00ecce5-c000.snappy.parquet")
paths: List[String] = List(/tmp/example-parquet/part-00000-38cd8823-bff7-46f0-82a0-13d1d00ecce5-c000.snappy.parquet)
scala> val data = spark.read.parquet(paths: _*)
data: org.apache.spark.sql.DataFrame = [id: bigint, a: int ... 1 more field]
scala> data.count
res0: Long = 12500000
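Applied to the Avro question above, that means passing the paths as varargs instead of joining them into one comma-separated string; a sketch, assuming filename and hiveContext as in the question:

val paths = Source.fromFile(filename).getLines.toSeq
val DF = hiveContext.read.format("com.databricks.spark.avro").load(paths: _*)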
Pass the input file path to the spark-submit command with the --files option.
Also pass the input file name as a command line argument.
This way I'll be able to read the file in the driver class.
val avrofiles = Source.fromFile(inputFileName).getLines.toArray
And create a dataframe
val dF = hiveContext.read.format("com.databricks.spark.avro").load(avrofiles:_*)
I'm getting the error Unable to coerce 'startDate' to a formatted date (long) when I run spark-submit as below:
dse -u cassandra -p cassandra spark-submit --class com.abc.rm.Total_count \
--master dse://x.x.x.x:9042 TotalCount.jar \
"2024-06-11 00:00:00.000+0000" "2027-11-15 00:00:00.000+0000" \
10-118-16-132.bbc.ds.com pramod history
Below is my code:
package com.abc.rm

import com.datastax.spark.connector._
import org.apache.spark.SparkContext

object Total_count {
  def main(args: Array[String]): Unit = {
    var startDate = args(0)
    var endDate = args(1)
    val master = args(2)
    var ks = args(3)
    var table_name = args(4)

    println("startDate-->" + startDate)
    println("endDate-->" + endDate)
    println("master-->" + master)

    val conf = new org.apache.spark.SparkConf().setAppName("Total_count")
      .set("spark.cassandra.connection.host", master)
      .set("spark.cassandra.auth.username", "cassandra")
      .set("spark.cassandra.auth.password", "cassandra")
    var sc = new SparkContext(conf)

    val rdd = sc.cassandraTable("pramod", "history")
      .where("sent_date>='startDate' and sent_date <='endDate'")
      .cassandraCount()

    println("count--> " + rdd)
    sc.stop()
    System.exit(1)
  }
}
How can I pass/convert the arguments?
You aren't passing the arguments; instead you are passing the strings startDate and endDate literally. To make it work you need to write it as
.where(s"sent_date>='$startDate' and sent_date <='$endDate'")
I want to include a properties file explicitly and read it in my Spark code, instead of hardcoding all the credentials directly in the Spark code. I am trying the following approach but am not able to get it working; AppContext cannot be resolved.
Please guide me on how to achieve this.
Spark_env.properties (under src/main/resources in a Maven project for Spark with Scala)
CASSANDRA_HOST1=127.0.0.133
CASSANDRA_PORT1=9042
CASSANDRA_USER1=usr1
CASSANDRA_PASS1=pas2
DataMigration.cassandra.keyspace1=demo2
DataMigration.cassandra.table1= data1
CASSANDRA_HOST2=
CASSANDRA_PORT2=9042
CASSANDRA_USER2=usr2
CASSANDRA_PASS2=pas2
D.cassandra.keyspace2=kesp2
D.cassandra.table2= data2
DataMigration.DifferencedRecords.output.path1=C:/spark_windows_proj/File1.csv
DataMigration.DifferencedRecords.output.path2=C:/spark_windows_proj/File1.parquet
----------------------------------------------------------------------------------
DM.scala
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.mapreduce.v2.app.AppContext

object Data_Migration {
  def main(args: Array[String]) {
    val host1: String = AppContext.getProperties().getProperty("CASSANDRA_HOST1")
    val port1 = AppContext.getProperties().getProperty("CASSANDRA_PORT1").toInt
    val keySpace1: String = AppContext.getProperties().getProperty("DataMigration.cassandra.keyspace1")
    val DataMigrationTableName1: String = AppContext.getProperties().getProperty("DataMigration.cassandra.table1")
    val username1: String = AppContext.getProperties().getProperty("CASSANDRA_USER1")
    val pass1: String = AppContext.getProperties().getProperty("CASSANDRA_PASS1")

    val host2: String = AppContext.getProperties().getProperty("CASSANDRA_HOST2")
    val port2 = AppContext.getProperties().getProperty("CASSANDRA_PORT2").toInt
    val keySpace2: String = AppContext.getProperties().getProperty("DataMigration.cassandra.keyspace2")
    val DataMigrationTableName2: String = AppContext.getProperties().getProperty("DataMigration.cassandra.table2")
    val username2: String = AppContext.getProperties().getProperty("CASSANDRA_USER2")
    val pass2: String = AppContext.getProperties().getProperty("CASSANDRA_PASS2")

    val Result_csv: String = AppContext.getProperties().getProperty("DataMigration.DifferencedRecords.output.path1")
    val Result_parquet: String = AppContext.getProperties().getProperty("DataMigration.DifferencedRecords.output.path2")

    val sc = AppContext.getSparkContext()
    val spark = SparkSession
      .builder().master("local")
      .appName("ABC")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    val df_read1 = spark.read
      .format("org.apache.spark.sql.cassandra")
      .option("spark.cassandra.connection.host", host1)
      .option("spark.cassandra.connection.port", port1)
      .option("spark.cassandra.auth.username", username1)
      .option("spark.cassandra.auth.password", pass1)
      .option("keyspace", keySpace1)
      .option("table", DataMigrationTableName1)
      .load()
I would rather pass the properties explicitly by passing the --properties-file option to spark-submit when submitting the job.
The AppContext won't necessarily work for all submission types, while passing a config file should work everywhere.
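For example (a sketch, not part of the original answer; the renamed keys and jar name are assumptions): spark-submit keeps only properties whose names start with spark., so the entries would be renamed along the lines of spark.DataMigration.cassandra.keyspace1=demo2, the job submitted with

spark-submit --class Data_Migration --master local[*] --properties-file Spark_env.properties data-migration.jar

and the values read back from the SparkConf inside the job:

val keySpace1 = spark.sparkContext.getConf.get("spark.DataMigration.cassandra.keyspace1")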
Edit: For local usage without spark-submit, you can simply use the standard Properties class, loading it from the resources to get access to the properties. You only need to put the property file into src/main/resources instead of src/test/resources, which is included in the classpath only for tests. The code is something like:
import java.util.Properties

val props = new Properties
props.load(getClass.getClassLoader.getResourceAsStream("file.props"))
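Using the props loaded above (with the Spark_env.properties file from the question in place of file.props), the AppContext lookups become plain getProperty calls, e.g.:

val host1     = props.getProperty("CASSANDRA_HOST1")
val port1     = props.getProperty("CASSANDRA_PORT1").toInt
val keySpace1 = props.getProperty("DataMigration.cassandra.keyspace1")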