Process only a few files in one round - scala

I have a working solution, but I'm looking for a safer and better way of doing this.
Every time the job starts, it looks up a custom checkpoint that indicates the date from which processing should start. From a source dataframe I create one that contains only the rows from that start date onward. The current solution limits the number of rows to be processed:
val readFormat = "delta"
val sparkRead = spark.read.format(readFormat)
val fileFormat = if (readFormat == "delta") "" else "." + readFormat
val testData = sparkRead
  .load(basePath + "/testData/table_name" + fileFormat)
  .where(!(col("size") < 1))
  .where($"modified" >= start)
  .limit(5000)
For each identifier I download files from Azure Storage, and save the content in a new column of the dataframe:
val tryDownload = testData
  .withColumn("fileStringPreview", downloadUDF($"id"))
  .withColumn(
    "status",
    when(
      $"fileStringPreview".startsWith("failed:") ||
        $"fileStringPreview".startsWith("emptyUrl"),
      lit("failed")
    ).otherwise(lit("succeeded")))
When this is done, the checkpoint is updated with the latest modified date among the elements processed in this iteration:
def saveLatest(saved_df: DataFrame, timeSeriesColName: String): Unit = {
  val latestTime = saved_df.agg(max(timeSeriesColName)).collect()(0)
  try {
    val timespanEnd = latestTime.getTimestamp(0).toInstant().toEpochMilli()
    saveTimestamp(timespanEnd) // this function actually stores the data
  } catch {
    case e: java.lang.NullPointerException =>
      // getTimestamp(0) returns null when the dataframe is empty
      LoggingWrapper.log("timespanEnd is null")
  }
}
saveLatest(tryDownload, "modified")
I'm worried about this limit(5000) solution. Is there a better way that still performs well when downloading the specified number of files in each iteration?
Thank you for the suggestions in advance! :)
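One thing worth noting about limit(5000): without an explicit ordering, Spark is free to return any 5000 matching rows, so the maximum modified value of the batch can jump past rows that were never processed. A minimal sketch of the idea, written with plain Scala collections rather than Spark, and assuming a hypothetical FileRecord type and helper names (nextBatch, newCheckpoint) that are not part of the original code: sort by modified before taking the batch, so that everything older than the new checkpoint has been handled.

```scala
import java.time.Instant

object CheckpointBatch {
  // Hypothetical stand-in for a row of the source dataframe
  final case class FileRecord(id: String, modified: Instant)

  // Oldest records first, ties broken by id, at most batchSize records
  def nextBatch(records: Seq[FileRecord], start: Instant, batchSize: Int): Seq[FileRecord] =
    records
      .filter(r => !r.modified.isBefore(start)) // modified >= start
      .sortBy(r => (r.modified.toEpochMilli, r.id))
      .take(batchSize)

  // Latest modified time in the batch, or the previous checkpoint if the batch is empty
  def newCheckpoint(batch: Seq[FileRecord], previous: Instant): Instant =
    batch.map(_.modified).foldLeft(previous)((a, b) => if (b.isAfter(a)) b else a)
}
```

In the Spark job this would roughly correspond to adding .orderBy($"modified", $"id") before .limit(5000), at the cost of a sort. Because the filter uses >=, rows whose modified value equals the checkpoint are picked up again in the next round, so the download step should tolerate reprocessing.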

Related

Spark : How to get the latest file from s3 in the last 10 days

I am trying to get the latest file from S3 within the last 10 days when no file exists for the input date. The issue is that the path contains the date.
My path looks like this:
val path = "s3://bucket-info/folder1/folder2"
val date = "2019/04/12" // YYYY/MM/DD
This is what I am doing:
val update_path = path + "/" + date // this becomes s3://bucket-info/folder1/folder2/2019/04/12
def fileExist(path: String, sc: SparkContext): Boolean =
  FileSystem.get(getS3OrFileUri(path), sc.hadoopConfiguration).exists(new Path(path + "/_SUCCESS"))
if (fileExist(update_path, sc)) {
  // read and process the file
} else {
  log("File not exist")
  // I need to get the latest file from the last five days and use it, so I would check
  // "s3://bucket-info/folder1/folder2/2019/04/11", "s3://bucket-info/folder1/folder2/2019/04/10" and so on.
  // If there is no file in the last five days, throw an error.
}
My issue is how to handle the end of the month. I can do it in a for loop, but is there an optimized and elegant way to do this in Spark?
This is not very optimal, but if you want to use Spark, the DataFrame reader can take multiple paths and input_file_name gives you the path of each file:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{desc, input_file_name}

val path = "s3://bucket-info/folder1/folder2"
val date = "2019/04/12"
val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd")
val end = LocalDate.parse(date, fmt)
// one prefix per day, going back 10 days from the given date
val prefixes = (0 until 10).map(end.minusDays(_)).map(d => s"$path/${fmt.format(d)}")
val prefix = spark.read
  .textFile(prefixes: _*)
  .select(input_file_name() as "file")
  .distinct()
  .orderBy(desc("file"))
  .limit(1)
  .collect()
  .collectFirst { case Row(prefix: String) => prefix }
prefix.fold {
  // log error
} { path =>
  // read and process the file
}
This is quite inefficient, and there is no clear way around it with Spark, because the S3 Hadoop file system implementation does not handle recursive directory structures efficiently. If you are willing to use the S3 API directly, you could set s"$path/${fmt.format(end.minusDays(10))}" as the start-after parameter and list the keys from there. This works because S3 always returns key listings in lexicographic order and your date keys are zero-padded.
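On the end-of-month concern: java.time.LocalDate.minusDays already rolls over month and year boundaries, so no special casing is needed. A minimal sketch in plain Scala, assuming a hypothetical exists predicate (for instance the fileExist function from the question) and hypothetical helper names candidatePaths and latestExisting:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object DatePaths {
  private val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd")

  // Paths for `date` and the days before it, newest first, `days` entries in total
  def candidatePaths(base: String, date: String, days: Int): Seq[String] = {
    val end = LocalDate.parse(date, fmt)
    (0 until days).map(i => s"$base/${end.minusDays(i.toLong).format(fmt)}")
  }

  // First (that is, most recent) candidate path for which `exists` returns true
  def latestExisting(base: String, date: String, days: Int)(exists: String => Boolean): Option[String] =
    candidatePaths(base, date, days).find(exists)
}
```

With at most five or ten candidates, checking them one by one on the driver is only a handful of existence calls, and a None result is the place to throw the error.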

Spark : how to parallelize subsequent specific work on each dataframe partitions

My Spark application works as follows:
1) execute large query with Spark SQL into the dataframe "dataDF"
2) foreach partition involved in "dataDF" :
2.1) get the associated "filtered" dataframe, in order to have only the partition associated data
2.2) do specific work with that "filtered" dataframe and write output
The code is as follows:
val dataSQL = spark.sql("SELECT ...")
val dataDF = dataSQL.repartition($"partition")
for {
  row <- dataDF.dropDuplicates("partition").collect
} yield {
  val partition_str: String = row.getAs[String](0)
  val filtered = dataDF.filter($"partition".equalTo(lit(partition_str)))
  // ... on each partition, do work depending on the partition, and write result on HDFS
  // Example :
  if (partition_str == "category_A") {
    // do group by, do pivot, do mean, ...
    val x = filtered
      .groupBy("column1", "column2")
      ...
    // write final DF
    x.write.parquet("some/path")
  } else if (partition_str == "category_B") {
    // select specific field and apply calculation on it
    val y = filtered.select(...)
    // write final DF
    y.write.parquet("some/path")
  } else if ( ... ) {
    // other kind of calculation
    // write results
  } else {
    // other kind of calculation
    // write results
  }
}
This algorithm works. The Spark SQL query is fully distributed. However, the specific work on each resulting partition is done sequentially, which is inefficient, especially because the writes for the individual partitions happen one after another.
In this case, what are the ways to replace the "for yield" with something parallel or asynchronous?
Thanks
You could use foreachPartition if you are writing to data stores outside the Hadoop ecosystem and need logic specific to that environment.
Otherwise map, etc.
.par parallel collections (Scala) are another option, but use them with caution: they are fine for reading and pre-processing files, but are otherwise possibly considered risky.
Threads.
You need to check what you are doing and whether the operations can be referenced and used within a foreachPartition block. You have to try it out, because some things can only be written for the driver, which then distributes the work to the executors. For example, you cannot call spark.sql on a worker, as the first snippet below attempts to do.
Likewise, df.write and df.read cannot be used there either. What you can do is issue individual execute/mutate statements against, say, Oracle or MySQL, as in the second snippet.
Hope this helps.
// Does NOT work: spark.sql cannot be called on the executors
rdd.foreachPartition(iter => {
  while (iter.hasNext) {
    val item = iter.next()
    // do something
    spark.sql("INSERT INTO tableX VALUES(2,7, 'CORN', 100, item)")
    // do some other stuff
  }
})
or
RDD.foreachPartition(records => {
  val JDBCDriver = "com.mysql.jdbc.Driver" ...
  ...
  connectionProperties.put("user", s"${jdbcUsername}")
  connectionProperties.put("password", s"${jdbcPassword}")
  val connection = DriverManager.getConnection(ConnectionURL, jdbcUsername, jdbcPassword)
  ...
  val mutateStatement = connection.createStatement()
  val queryStatement = connection.createStatement()
  ...
  records.foreach(record => {
    val val1 = record._1
    val val2 = record._2
    ...
    mutateStatement.execute(s"insert into sample (k,v) values(${val1}, ${nIterVal})")
  })
})
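For the "Threads" option, one possible sketch is to submit the independent per-category jobs from the driver through Scala Futures on a bounded thread pool; Spark can run jobs submitted from different threads of the same application concurrently. The helper name runAll and its parallelism parameter are assumptions made for illustration, and the work function stands in for the filter-and-write logic of the question.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelJobs {
  // Runs `work` for every key on a fixed-size thread pool and blocks until all are done.
  // Results are returned in the same order as `keys`.
  def runAll[K, R](keys: Seq[K], parallelism: Int)(work: K => R): Seq[(K, R)] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures = keys.map(k => Future(k -> work(k)))
      Await.result(Future.sequence(futures), Duration.Inf)
    } finally {
      pool.shutdown() // release the threads even if one of the jobs failed
    }
  }
}
```

In the question's setting the keys would be the collected partition values and work would contain the if/else branches that end in write.parquet. Each call still runs on the driver, so spark.sql, df.read and df.write remain usable inside it.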

not able to store result in hdfs when code runs for second iteration

I am new to Spark and Scala and have been trying to implement data cleaning in Spark. The code below checks for missing values in one column, stores them in outputrdd, and loops to calculate each missing value. The code works well when there is only one missing value in the file. Since HDFS does not allow writing again to the same location, it fails if there is more than one missing value. Can you please help me write finalrdd to a particular location once the missing values for all occurrences have been calculated?
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("app").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val files = sc.wholeTextFiles("/input/raw_files/")
val file = files.map { case (filename, content) => filename }
file.collect.foreach(filename => {
cleaningData(filename)
})
def cleaningData(file: String) = {
//headers has column headers of the files
var hdr = headers.toString()
var vl = hdr.split("\t")
sqlContext.clearCache()
if (hdr.contains("COLUMN_HEADER")) {
//Checks for missing values in dataframe and stores missing values' in outputrdd
if (!outputrdd.isEmpty()) {
logger.info("value is zero then performing further operation")
val outputdatetimedf = sqlContext.sql("select date,'/t',time from cpc where kwh = 0")
val outputdatetimerdd = outputdatetimedf.rdd
val strings = outputdatetimerdd.map(row => row.mkString).collect()
for (i <- strings) {
if (Coddition check) {
//Calculates missing value and stores in finalrdd
finalrdd.map { x => x.mkString("\t") }.saveAsTextFile("/output")
logger.info("file is written in file")
}
}
}
}
}
}
It is not clear how (Coddition check) works in your example.
In any case, .saveAsTextFile("/output") should be called only once, because it fails when the output directory already exists.
So I would rewrite your example like this:
val strings = outputdatetimerdd
  .map(row => row.mkString) // no '.collect()' here: the result must stay an RDD for saveAsTextFile
val finalrdd = strings
  .filter(str => Coddition check str) // don't know how this Coddition works
// each element is already a String, so no further mkString is needed
// this part is called only once, not in a loop
finalrdd.saveAsTextFile("/output")
logger.info("file is written in file")

NullPointerException applying a function to spark RDD that works on non-RDD

I have a function that I want to apply to every row of a .csv file:
def convert(inString: Array[String]): String = {
  val country = inString(0)
  val sellerId = inString(1)
  val itemID = inString(2)
  try {
    val minidf = sqlContext.read.json(sc.makeRDD(inString(3) :: Nil))
      .withColumn("country", lit(country))
      .withColumn("seller_id", lit(sellerId))
      .withColumn("item_id", lit(itemID))
    val finalString = minidf.toJSON.collect().mkString(",")
    finalString
  } catch {
    case e: Exception =>
      println("AN EXCEPTION " + inString.mkString(","))
      "this is an exception " + e + " " + inString.mkString(",")
  }
}
This function transforms an entry of the sort:
CA 112578240 132080411845 [{"id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
Where I have 4 columns, the 4th being a json blob, into
[{"country":"CA", "seller":112578240, "product":132080411845, "id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
which is the json object where the first 3 columns have been inserted into the fourth.
Now, this works:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).collect().map(x => convert(x))
or this:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).take(10).map(x => convert(x))
but this does not
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).map(x => convert(x))
The last one throws a java.lang.NullPointerException.
I included a try/catch clause to see where exactly this is failing, and it fails for every single row.
What am I doing wrong here?
You cannot use sqlContext or sparkContext inside a Spark map, since those objects exist only on the driver node, where they are in charge of distributing your tasks. They are not available inside the closure that runs on the executors, hence the NullPointerException.
You could rewrite the JSON parsing part in pure Scala using one of these libraries: https://manuel.bernhardt.io/2015/11/06/a-quick-tour-of-json-libraries-in-scala/

spark job freeze when started in ParArray

I want to convert a set of time-series data from multiple CSV files to LabeledPoint and save it to Parquet files. The CSV files are small, usually < 10 MiB.
When I start it with ParArray, it submits 4 jobs at a time and freezes. The code is here:
val idx = Another_DataFrame
ListFiles(new File("data/stock data"))
.filter(_.getName.contains(".csv")).zipWithIndex
.par //comment this line and code runs smoothly
.foreach{
f=>
val stk = spark_csv(f._1.getPath) //doing good
ColMerge(stk,idx,RESULT_PATH(f)) //freeze here
stk.unpersist()
}
and the part that freezes:
def ColMerge(ori:DataFrame,index:DataFrame,PATH:String) = {
val df = ori.join(index,ori("date")===index("index_date")).drop("index_date").orderBy("date").cache
val head = df.head
val col = df.columns.filter(e=>e!="code"&&e!="date"&&e!="name")
val toMap = col.filter{
e=>head.get(head.fieldIndex(e)).isInstanceOf[String]
}.sorted
val toCast = col.diff(toMap).filterNot(_=="data")
val res: Array[((String, String, Array[Double]), Long)] = df.sort("date").map{
row=>
val res1= toCast.map{
col=>
row.getDouble(row.fieldIndex(col))
}
val res2= toMap.flatMap{
col=>
val mapping = new Array[Double](GlobalConfig.ColumnMapping(col).size)
row.getString(row.fieldIndex(col)).split(";").par.foreach{
word=>
mapping(GlobalConfig.ColumnMapping(col)(word)) = 1
}
mapping
}
(
row.getString(row.fieldIndex("code")),
row.getString(row.fieldIndex("date")),
res1++res2++row.getAs[Seq[Double]]("data")
)
}.zipWithIndex.collect
df.unpersist
val dataset = GlobalConfig.sctx.makeRDD(res.map{
day=>
(day._1._1,
day._1._2,
try{
new LabeledPoint(GetHighPrice(res(day._2.toInt+2)._1._3.slice(0,4))/GetLowPrice(res(day._2.toInt)._1._3.slice(0,4))*1.03,Vectors.dense(day._1._3))
}
catch {
case ex:ArrayIndexOutOfBoundsException=>
new LabeledPoint(-1,Vectors.dense(day._1._3))
}
)
}).filter(_._3.label != -1).toDF("code","date","labeledpoint")
dataset.write.mode(SaveMode.Overwrite).parquet(PATH)
}
The exact job that freezes is the DataFrame.sort() or the zipWithIndex when generating res in ColMerge.
Since most of the work is done after collect, I really want to use ParArray to speed up ColMerge, but this weird freeze stops me from doing so. Do I need to create a new thread pool to do this?