write an RDD into HDFS in a spark-streaming context - scala

I have a Spark Streaming environment with Spark 1.2.0 where I retrieve data from a local folder, and every time a new file is added to the folder I perform some transformations.
val ssc = new StreamingContext(sc, Seconds(10))
val data = ssc.textFileStream(directory)
In order to perform my analysis on DStream data I have to transform it into an Array
var arr = new ArrayBuffer[String]();
data.foreachRDD {
arr ++= _.collect()
}
Then I use data obtained to extract the information I want and to save them on HDFS.
val myRDD = sc.parallelize(arr)
myRDD.saveAsTextFile("hdfs directory....")
Since I really need to manipulate the data as an Array, I can't save it to HDFS with DStream.saveAsTextFiles("...") (which would work fine), so I have to save the RDD instead. With this procedure, however, I end up with empty output files named part-00000 etc.
With arr.foreach(println) I am able to see the correct results of the transformations.
My suspicion is that Spark tries at every batch to write data to the same files, deleting what was previously written. I tried saving to a dynamically named folder, like myRDD.saveAsTextFile("folder" + System.currentTimeMillis().toString()), but only one folder is ever created and the output files are still empty.
How can I write an RDD into HDFS in a spark-streaming context?

You are using Spark Streaming in a way it wasn't designed for. I'd recommend either dropping Spark for your use case or adapting your code so it works the Spark way. Collecting the array to the driver defeats the purpose of using a distributed engine and makes your app effectively single-machine (using two machines this way also causes more overhead than just processing the data on a single machine).
Everything you can do with an array, you can do with Spark. So just run your computations inside the stream, distributed on the workers, and write your output using DStream.saveAsTextFiles(). You can use foreachRDD + saveAsParquet(path, overwrite = true) to write to a single file.
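A minimal sketch of that advice, assuming the analysis can be expressed per line: transform here is a hypothetical stand-in for the array-based logic, and batchDir and the two program arguments are names made up for this example. Code placed outside foreachRDD runs only once while the job is being set up, before any batch has arrived, which would explain the single folder with empty part files; the save therefore has to happen inside foreachRDD.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

object StreamToHdfs {
  // Hypothetical per-line transformation; stands in for the array-based analysis.
  def transform(line: String): String = line.trim.toUpperCase

  // One output directory per batch, so a later batch never collides with an earlier one.
  def batchDir(outputBase: String, batchTimeMs: Long): String =
    s"$outputBase/batch-$batchTimeMs"

  def main(args: Array[String]): Unit = {
    val inputDir = args(0)   // folder watched for new files
    val outputBase = args(1) // e.g. an hdfs:// directory
    val conf = new SparkConf().setAppName("StreamToHdfs") // master comes from spark-submit
    val ssc = new StreamingContext(conf, Seconds(10))

    // The transformation runs on the workers, inside the stream.
    val results = ssc.textFileStream(inputDir).map(transform)

    results.foreachRDD { (rdd: RDD[String], time: Time) =>
      // take(1) instead of isEmpty(), which only exists from Spark 1.3 on
      if (rdd.take(1).nonEmpty) {
        rdd.saveAsTextFile(batchDir(outputBase, time.milliseconds))
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each non-empty batch ends up in its own batch-<timestamp> directory under the output base, which is close to what DStream.saveAsTextFiles() does with its prefix and suffix arguments.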

@vzamboni: The Spark 1.5+ DataFrame API has this feature:
dataframe.write().mode(SaveMode.Append).format(FILE_FORMAT).partitionBy("parameter1", "parameter2").save(path);

Related

Stitch Part Files to one with custom name

A Data Fusion pipeline gives us one or more part files as output when the sink is a GCS bucket. My question is: how can we combine those part files into one and also give it a meaningful name?
The Data Fusion transformations run in Dataproc clusters executing either Spark or MapReduce jobs. Your final output is split in many files because the jobs partition your data based on the HDFS partitions (this is the default behavior for Spark/Hadoop).
When writing a Spark script you are able to manipulate this default behavior and produce different outputs. However, Data Fusion was built to abstract the code layer and provide you the experience of using a fully managed data integrator. Using split files should not be a problem but if you really need to merge them I suggest that you use the following approach:
At the top of your Pipeline Studio click on Hub -> Plugins, search for Dynamic Spark Plugin, click on Deploy and then on Finish (you can ignore the JAR file).
Back to your pipeline, select Spark in the sink section.
Replace your GCS plugin with the Spark plugin
In your Spark plugin, set Compile at Deployment Time as false and replace the code with some Spark code that does what you want. The code below for example is hardcoded but works:
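def sink(df: DataFrame) : Unit = {
  // Collapse to a single partition so that only one part file is written
  val new_df = df.coalesce(1)
  // The output path is hardcoded; replace it with your own bucket and folder
  new_df.write.format("csv").save("gs://your/path/")
}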
This function receives the data from your pipeline as a Dataframe. The coalesce function reduces the number of partitions to 1 and the last line writes it to GCS.
Deploy your pipeline and it will be ready to run

Change spark _temporary directory path to avoid deletion of parquets

When two or more Spark jobs have the same output directory, mutual deletion of files will be inevitable.
I'm writing a DataFrame in append mode with Spark 2.4.4, and I want to add a timestamp to Spark's temporary directory to avoid this deletion.
Example:
My Spark job writes to hdfs:/outputFile/0/tmp/file1.parquet.
The same Spark job, called with other data, writes to hdfs:/outputFile/0/tmp/file2.parquet.
I want the first job to write to hdfs:/outputFile/0/tmp+(timeStamp)/file1.parquet
and the other job to write to hdfs:/outputFile/0/tmp+(timeStamp)/file2.parquet, and then move the Parquet files to hdfs:/outputFile/.
df
.write
.option("mapreduce.fileoutputcommitter.algorithm.version", "2")
.partitionBy("XXXXXXXX")
.mode(SaveMode.Append)
.format(fileFormat)
.save(path)
When Spark appends data to an existing dataset, Spark uses FileOutputCommitter to manage staging output files and final output files. The behavior of FileOutputCommitter has direct impact on the performance of jobs that write data.
A FileOutputCommitter has two methods, commitTask and commitJob. Apache Spark 2.0 and higher versions use Apache Hadoop 2, which uses the value of mapreduce.fileoutputcommitter.algorithm.version to control how commitTask and commitJob work. In Hadoop 2, the default value of mapreduce.fileoutputcommitter.algorithm.version is 1. For this version, commitTask moves data generated by a task from the task temporary directory to the job temporary directory, and when all tasks complete, commitJob moves the data from the job temporary directory to the final destination.
Because the driver is doing the work of commitJob, for cloud storage, this operation can take a long time. You may often think that your cell is “hanging”. However, when the value of mapreduce.fileoutputcommitter.algorithm.version is 2, commitTask moves data generated by a task directly to the final destination and commitJob is basically a no-op.
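As a rough illustration of how that setting is usually applied, here is a sketch that sets it once for the whole session through the spark.hadoop. prefix instead of per write. The object name, the local master, the sample rows and the fallback output path are assumptions for the example. Two caveats: with version 2, task output becomes visible in the destination before the job has finished, and as far as I know the setting does not give each job its own temporary directory, so it speeds up the commit but is not a direct answer to the timestamped-directory idea.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object CommitterV2Append {
  val CommitterKey = "mapreduce.fileoutputcommitter.algorithm.version"

  // Keys prefixed with spark.hadoop. are copied into the Hadoop Configuration of the job.
  def buildSession(): SparkSession =
    SparkSession.builder()
      .appName("CommitterV2Append")
      .master("local[*]") // assumption: local run, for illustration only
      .config(s"spark.hadoop.$CommitterKey", "2")
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val outputPath = args.headOption.getOrElse("/tmp/committer-demo") // hypothetical path
    val spark = buildSession()
    import spark.implicits._

    // Made-up sample data, partitioned by the "key" column on write.
    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
    df.write.partitionBy("key").mode(SaveMode.Append).parquet(outputPath)
    spark.stop()
  }
}
```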

Pyspark & HDFS: Add new dataframe column to existing parquet files in hdfs

Let me first start with my scenario:
I have a huge DataFrame stored in HDFS. I load the DataFrame in a Spark session and create a new column without changing any of the existing content. After this, I want to store the DataFrame in the original directory in HDFS.
Now, I know I can do this with Spark's write operation, along the lines of df.write.parquet("my_df_path", mode="overwrite"). Since the data is immense, I'm investigating whether there is a column-wise append mode or method that does not write the complete DataFrame back, only the difference from the stored data. The final goal is to save both memory and computational effort on the HDFS system.

Write and append Spark streaming data to a text file in HDFS

I am writing Spark code in Scala that reads a continuous stream from an MQTT server.
I am running my job in YARN cluster mode. I want to save and append this stream to a single text file in HDFS.
I will receive data every second, so I need it to be appended to a single text file in HDFS.
Can anyone help?
Use a DataFrame and write it with SaveMode.Append.
This adds the data to the output every time new records arrive.
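import org.apache.spark.sql.SaveMode
val sqlContext = new org.apache.spark.sql.SQLContext(context)
import sqlContext.implicits._
stream.map(_.value).foreachRDD(rdd => {
  rdd.foreach(println) // debugging only; on a cluster this prints on the executors
  if (!rdd.isEmpty()) {
    // No format is given, so the default data source is used (Parquet unless configured otherwise)
    rdd.toDF("value").coalesce(1).write.mode(SaveMode.Append).save("C:/data/spark/")
    // rdd.saveAsTextFile("C:/data/spark/")
  }
})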
@Amrutha J Raj
rdd.toDF("value").coalesce(1).write.mode(SaveMode.Append).json("C:/data/spark/")
This converts the RDD to a DataFrame. coalesce(1) reduces it to one partition, so each write produces a single file; without it, Spark may generate multiple files per batch. The write mode is Append, so each batch is added to the existing output directory as a new part file instead of overwriting it, here in JSON format.
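In practice SaveMode.Append adds new part files to the output directory and does not extend an existing file. The sketch below shows that on a local temporary directory, assuming Spark 2.x or later (SparkSession) and the text format; the object name, the appendBatch helper and the sample lines are made up for the example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

object AppendAddsFiles {
  // Appends one batch as a single part file and returns how many
  // part files the directory holds afterwards.
  def appendBatch(spark: SparkSession, lines: Seq[String], dir: String): Int = {
    import spark.implicits._
    lines.toDF("value").coalesce(1).write.mode(SaveMode.Append).text(dir)
    new java.io.File(dir).listFiles().count(_.getName.startsWith("part-"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AppendAddsFiles")
      .master("local[2]") // assumption: local run, for illustration only
      .getOrCreate()
    val dir = Files.createTempDirectory("append-demo").resolve("out").toString

    val afterFirst = appendBatch(spark, Seq("batch 1, line 1", "batch 1, line 2"), dir)
    val afterSecond = appendBatch(spark, Seq("batch 2, line 1"), dir)
    println(s"part files after first batch: $afterFirst, after second batch: $afterSecond")

    // The directory as a whole still reads back as one dataset.
    println(s"total lines: ${spark.read.text(dir).count()}")
    spark.stop()
  }
}
```

If a single physical file is really required, the part files can be merged afterwards, for example with hdfs dfs -getmerge, instead of trying to make Spark append to one file.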

Spark saveAsTextFile to Azure Blob creates a blob instead of a text file

I am trying to save an RDD to a text file. My instance of Spark is running on Linux and connected to Azure Blob
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
//find the rows which have only one digit in the 7th column in the CSV
val rdd1 = rdd.filter(s => s.split(",")(6).length() == 1)
rdd1.saveAsTextFile("wasb:///HVACOut")
When I look at the output, it is not as a single text file but as a series of application/octet-stream files in a folder called HVACOut.
How can I output it as a single text file instead?
Well, I am not sure you can get just one file without a directory. If you do
rdd1.coalesce(1).saveAsTextFile("wasb:///HVACOut")
you will get one file inside a directory called "HVACOut"; the file should look something like part-00000. This is because your RDD is distributed across your cluster in what are called partitions. When you call save (any of the save functions), Spark writes one file per partition. So by calling coalesce(1) you're saying you want one partition.
Hope this helps.
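If a plain file outside the directory is needed, one possible follow-up is to move the single part file with the Hadoop FileSystem API after the save. This is only a sketch: promoteSinglePart, the sample rows and the target file name are hypothetical, and on blob stores a rename is typically a copy followed by a delete, so it is not atomic.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

object SingleNamedFile {
  // Moves the only part file in `dir` to `target`, then removes the leftover directory.
  def promoteSinglePart(fs: FileSystem, dir: Path, target: Path): Boolean = {
    val parts = fs.globStatus(new Path(dir, "part-*"))
    require(parts.length == 1, s"expected exactly one part file, found ${parts.length}")
    val moved = fs.rename(parts.head.getPath, target)
    if (moved) fs.delete(dir, true) // recursive: also drops _SUCCESS and checksum files
    moved
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("SingleNamedFile").setMaster("local[2]")) // local run for illustration
    val outDir = new Path(args(0)) // e.g. wasb:///HVACOut
    val target = new Path(args(1)) // e.g. wasb:///HVAC-filtered.txt (hypothetical name)

    // Made-up sample rows; coalesce(1) guarantees a single part file.
    sc.parallelize(Seq("a,1", "b,2")).coalesce(1).saveAsTextFile(outDir.toString)

    val fs = outDir.getFileSystem(sc.hadoopConfiguration)
    promoteSinglePart(fs, outDir, target)
    sc.stop()
  }
}
```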
After you have finished provisioning an Apache Spark cluster on Azure HDInsight, you can go to the built-in Jupyter notebook for your cluster at: https://YOURCLUSTERNAME.azurehdinsight.net/jupyter.
There you will find sample notebooks with examples of how to do this.
Specifically, for Scala, you can go to the notebook named "02 - Read and write data from Azure Storage Blobs (WASB) (Scala)".
Copying some of the code and comments here:
Note:
Because CSV is not natively supported by Spark, there is no built-in way to write an RDD to a CSV file. However, you can work around this if you want to save your data as CSV.
Code:
csvFile.map((line) => line.mkString(",")).saveAsTextFile("wasb:///example/data/HVAC2sc.csv")
Hope this helps!