How to write Spark Streaming output to HDFS without overwriting - apache-kafka

After some processing I have a DStream[String, ArrayList[String]]. When I write it to HDFS using saveAsTextFile, every batch overwrites the previous data. How can I write the new results by appending to the previous results?
output.foreachRDD(r => {
  r.saveAsTextFile(path)
})
Edit: If anyone could also help me convert the output to Avro format and then write it to HDFS with append, that would be appreciated.

saveAsTextFile does not support append. If called with a fixed filename, it will overwrite it every time.
We could do saveAsTextFile(path + timestamp) to save to a new file every time. That's essentially the built-in behaviour of DStream.saveAsTextFiles(path).
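For reference, a minimal sketch of that built-in behaviour (the base path and suffix below are hypothetical): saveAsTextFiles writes each batch to a new directory named "<prefix>-<batch time in ms>[.<suffix>]".

// assuming `output` is the DStream from the question; path and suffix are placeholders
output.saveAsTextFiles("hdfs:///data/stream/out", "txt")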
An easily accessible format that supports append is Parquet. We first transform our data RDD to a DataFrame or Dataset and then we can benefit from the write support offered on top of that abstraction.
case class DataStructure(field1,..., fieldn)

... streaming setup, dstream declaration, ...

val structuredOutput = outputDStream.map(record => mapFunctionRecordToDataStructure)

structuredOutput.foreachRDD { rdd =>
  import sparkSession.implicits._
  val df = rdd.toDF()
  df.write.format("parquet").mode("append").save(s"$workDir/$targetFile")
}
Note that appending to Parquet files gets more expensive over time, so rotating the target file from time to time is still a requirement.
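One simple rotation scheme, as a sketch rather than something the answer prescribes, is to fold the current date into the target path so that each day appends to its own Parquet directory (the "events-" naming is a hypothetical choice):

import java.time.LocalDate

// inside the foreachRDD above
val targetFile = s"events-${LocalDate.now}.parquet"
df.write.format("parquet").mode("append").save(s"$workDir/$targetFile")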

If you want to keep appending to the same location on the file system, store the data as a Parquet file. You can do it like this:
kafkaData.foreachRDD { rdd =>
  if (rdd.count() > 0) {
    val df = rdd.toDF()
    df.write.mode(SaveMode.Append).save("/path")
  }
}

Storing the streaming output to HDFS will always create new files, even when you use append with Parquet, which leads to a small-files problem on the NameNode. I would recommend writing your output to sequence files, where you can keep appending to the same file.
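A minimal sketch of that idea, assuming a recent Hadoop version where SequenceFile.Writer.appendIfExists is available and append is enabled on the cluster; dstream and the target path are placeholders, and collecting is only reasonable for small batches:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{NullWritable, SequenceFile, Text}

dstream.foreachRDD { rdd =>
  // collect to the driver only if each batch is small; otherwise write per partition
  val lines = rdd.map(_.toString).collect()
  if (lines.nonEmpty) {
    val writer = SequenceFile.createWriter(
      new Configuration(),
      SequenceFile.Writer.file(new Path("/data/stream.seq")),   // hypothetical target file
      SequenceFile.Writer.keyClass(classOf[NullWritable]),
      SequenceFile.Writer.valueClass(classOf[Text]),
      SequenceFile.Writer.appendIfExists(true))                 // reopen and append to the same file
    try lines.foreach(l => writer.append(NullWritable.get(), new Text(l)))
    finally writer.close()
  }
}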

Here is how I solved the issue without a DataFrame.
import java.time.format.DateTimeFormatter
import java.time.LocalDateTime
messages.foreachRDD { rdd =>
  // repartition returns a new RDD, so chain it rather than discarding the result
  val eachRdd = rdd.repartition(1).map(record => record.value)
  if (!eachRdd.isEmpty()) {
    eachRdd.saveAsTextFile(hdfs_storage + DateTimeFormatter.ofPattern("yyyyMMddHHmmss").format(LocalDateTime.now) + "/")
  }
}

Related

Save stream as parquet given by-element path

I would like to save (as Parquet) each value of a stream to a specific directory whose path is given by the key. There should be fewer than five different keys, hence as many different directories.
As I did not find any example of what I want to do, I tried the following approach: filter() the stream by each key found inside it, with the following code:
stream
  .foreachRDD { (rdd, time) =>
    import spark.implicits._
    if (rdd.take(1).length != 0) {
      val directories = rdd.map(_._1).distinct().collect()
      directories.foreach { directory =>
        rdd
          .filter(_._1 == directory).map(_._2)
          .toDF()
          .write.format("parquet").mode("append").save(directory)
      }
    }
  }
But the barbaric collect() takes a heavy toll on the computation and leads to some scheduling delay...
Would anyone have a better way to implement this or improve performance?
EDIT: I do not have access to Structured streaming.
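A possible alternative, as a sketch rather than a definitive answer: convert the whole batch to a DataFrame once and let partitionBy split it by key, avoiding the per-key filter and the collect(). Note this changes the directory layout to basePath/key=<value>/, assumes the value type is a case class, and basePath is a hypothetical base directory.

stream.foreachRDD { (rdd, time) =>
  import spark.implicits._
  if (!rdd.isEmpty()) {
    rdd.toDF("key", "value")   // the value ends up as a nested struct column
      .write
      .partitionBy("key")      // one subdirectory per key
      .mode("append")
      .parquet(basePath)       // hypothetical base directory
  }
}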

Spark: Empty JSON files when reading from a directory

I'm reading from a path, say /json//myfiles_.json.
I'm then flattening the JSON using explode. This causes an error since I have some empty files. How do I tell it to ignore empty files or somehow filter them out?
I can detect individual files by checking if the head is empty, but I need to do this on the collection of files iterated over in the DataFrame via the wildcard path.
So the answer seems to be that I need to provide a schema explicitly, because it can't infer one from an empty file - as you would expect!
e.g.
val schemadf = sqlContext.read.json(schemapath) // infer schema from a file with data, or define it manually
val schema = schemadf.schema
val raw = sqlContext.read.schema(schema).json(monthfile)
val prep = raw.withColumn("MyArray", explode($"MyArray"))
  .select($"ID", $"name", $"CreatedAt")
display(prep)
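For the "define it manually" part of the comment, a hedged sketch of building the schema by hand; the field types below are assumptions based only on the columns used above, not the actual data.

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("ID", StringType),
  StructField("name", StringType),
  StructField("CreatedAt", StringType),
  StructField("MyArray", ArrayType(StringType))   // hypothetical element type
))
val raw = sqlContext.read.schema(schema).json(monthfile)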

Write Header only CSV record from Spark Scala DataFrame

My requirement is to write only a header CSV record using a Spark Scala DataFrame. Can anyone help me with this?
val OHead1 = "/xxxxx/xxxx/xxxx/xxx/OHead1/"
val sc = sparkFile.sparkContext
val outDF = csvDF.select("col_01", "col_02", "col_03").schema
sc.parallelize(Seq(outDF.fieldNames.mkString("\t"))).coalesce(1).saveAsTextFile(s"$OHead1")
The above works and creates the header in the CSV with a tab delimiter. Since I am using a SparkSession, I create the SparkContext in the second line. csvDF is my DataFrame, created before these statements.
Two things are outstanding; can one of you help me?
1. The above working code is not overwriting the files, so every time I need to delete the files manually. I could not find an overwrite option; can you help me?
2. Since I am doing a select statement and taking the schema, will it be considered an action and start another lineage for this statement? If it is true, then this would degrade the performance.
If you need to output only the header, you can use this code:
df.schema.fieldNames.reduce(_ + "," + _)
It will create a CSV line with the column names.
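For the first outstanding point (overwriting instead of deleting manually), one possible sketch, assuming a SparkSession named spark and the same csvDF and OHead1 as above: write the header line through the DataFrame writer, which supports save modes.

import spark.implicits._

val header = csvDF.select("col_01", "col_02", "col_03").schema.fieldNames.mkString("\t")
Seq(header).toDS()
  .coalesce(1)
  .write
  .mode("overwrite")   // replaces the previous output directory on each run
  .text(OHead1)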
I tested it, and the solution below did not affect performance:
val OHead1 = "/xxxxx/xxxx/xxxx/xxx/OHead1/"
val sc = sparkFile.sparkContext
val outDF = csvDF.select("col_01", "col_02", "col_03").schema
sc.parallelize(Seq(outDF.fieldNames.mkString("\t"))).coalesce(1).saveAsTextFile(s"$OHead1")
I found a solution to handle this situation: define the columns in the configuration file and write those columns to a file. Here is the snippet.
val Header = prop.getProperty("OUT_HEADER_COLUMNS").replaceAll("\"","").replaceAll(",","\t")
scala.tools.nsc.io.File(s"$HeadOPath").writeAll(s"$Header")
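If that header needs to land on HDFS rather than the local file system (an assumption about the target), the same idea works through the Hadoop FileSystem API, which also supports overwriting; the file name below is hypothetical.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val out = fs.create(new Path(s"$HeadOPath/header.tsv"), true)   // overwrite = true
try out.write((Header + "\n").getBytes("UTF-8")) finally out.close()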

Fast file writing in Scala?

So I have a Scala program that iterates through a graph and writes out data line by line to a text file. It is essentially an edge-list file for use with GraphX.
The biggest slowdown is actually creating this text file; we're talking maybe a million records written to it. Is there a way I can parallelize this task or make it faster in any way, perhaps by somehow storing it in memory first?
More info:
I am using a Hadoop cluster to iterate through the graph, and here is the code snippet for the text-file creation I'm doing now to write to HDFS:
val fileName = dbPropertiesFile + "-edgelist-" + System.currentTimeMillis()
val path = new Path("/home/user/graph/" + fileName + ".txt")
val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://host001:8020")
val fs = FileSystem.newInstance(conf)
val os = fs.create(path)
while (edges.hasNext) {
  val current = edges.next()
  os.write(current.inVertex().id().toString.getBytes())
  os.write(" ".getBytes())
  os.write(current.outVertex().id().toString.getBytes())
  os.write("\n".getBytes())
}
os.close() // flush and close the stream before closing the file system
fs.close()
Writing files to HDFS is never fast. Your tags seem to suggest that you are already using Spark anyway, so you might as well take advantage of it.
import spark.implicits._ // needed for .toDF, assuming a SparkSession named spark is in scope

sparkContext
  .makeRDD(edges.toStream, 20)  // makeRDD takes (seq, numSlices)
  .map(e => e.inVertex.id -> e.outVertex.id)
  .toDF
  .write
  .option("delimiter", " ")     // DataFrameWriter has no delimiter(); set it as an option
  .csv(path)
This splits your input into 20 partitions (you can control that number with the numeric argument to makeRDD above) and writes them in parallel as 20 different chunks in HDFS that together represent your resulting file.
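An equivalent sketch that stays with plain text output instead of CSV (an assumption, not the answer's exact code): format each edge as a line on the driver, then let saveAsTextFile write the partitions to HDFS in parallel. The output directory reuses the names from the question and is otherwise hypothetical.

val lines = edges.map(e => s"${e.inVertex().id()} ${e.outVertex().id()}").toSeq
sparkContext
  .makeRDD(lines, 20)
  .saveAsTextFile("hdfs://host001:8020/home/user/graph/" + fileName)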

Bypass last line of each file in Spark (Scala)

This question is related to this.
I am processing an S3 folder containing csv.gz files in Spark. Each csv.gz file has a header that contains column names. This has been solved by the above SO link and the solution looks like this:
val rdd = sc.textFile("s3://.../my-s3-path").mapPartitions(_.drop(1))
The problem now is that it looks like some of the files have a newline ('\n') at the end (we assume so; we are not sure which files). So when converting the RDD to a DataFrame, I get an error. The question now is:
How do I get rid of the last line of each file if it is '\n'?
Why not a simple filter:
val rdd = sc.textFile("s3...").filter(line => !line.equalsIgnoreCase("\n")).mapPartition...
Or filter any empty line:
val rdd = sc.textFile("s3...").filter(line => !line.trim().isEmpty)...
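Combining both steps from the thread into one pipeline, as a small usage sketch:

val rdd = sc.textFile("s3://.../my-s3-path")
  .mapPartitions(_.drop(1))             // drop the header of each file
  .filter(line => line.trim.nonEmpty)   // drop blank trailing lines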