I am writing a Spark Scala job in which I read a continuous stream from an MQTT server.
I am running my job in YARN cluster mode, and I want to save and append this stream to a single text file in HDFS.
I receive a batch of data every second, so I need this data to be appended to a single text file in HDFS.
Can anyone help?
Use a DataFrame and write with mode Append.
This will append the data every time a new record arrives.
import org.apache.spark.sql.{SQLContext, SaveMode}

val sqlContext = new SQLContext(context)
import sqlContext.implicits._

stream.map(_.value).foreachRDD { rdd =>
  rdd.foreach(println)
  if (!rdd.isEmpty()) {
    // coalesce(1) produces a single part file per batch; Append keeps adding to the target directory
    rdd.toDF("value").coalesce(1).write.mode(SaveMode.Append).save("C:/data/spark/")
    // rdd.saveAsTextFile("C:/data/spark/")
  }
}
@Amrutha J Raj
rdd.toDF("value").coalesce(1).write.mode(SaveMode.Append).json("C:/data/spark/")
This converts the RDD to a DataFrame. coalesce(1) restricts the output to a single part file per batch (without it Spark may generate multiple files), and the write mode is Append, so each new batch is appended to the existing output, in JSON format.
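For the HDFS requirement in the original question, a minimal sketch could look like the following; the path hdfs:///data/stream is a placeholder, and stream is the MQTT DStream from the question:

import org.apache.spark.sql.{SQLContext, SaveMode}

val sqlContext = new SQLContext(context)
import sqlContext.implicits._

stream.map(_.value).foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.toDF("value")
      .coalesce(1)                 // one part file per micro-batch
      .write
      .mode(SaveMode.Append)       // keep adding part files to the same directory
      .json("hdfs:///data/stream") // placeholder HDFS path
  }
}

Note that SaveMode.Append adds new part files to the target directory on every batch rather than appending to one physical file; a true single-file append on HDFS would need the lower-level FileSystem append API.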
Related
Is there any approach to read HDFS data into a Spark DataFrame without explicitly specifying the file type?
spark.read.format("auto_detect").option("header", "true").load(inputPath)
We can achieve the above requirement by using scala.sys.process._ or Python's subprocess(cmd) and splitting off the extension of a part file. But can we achieve this without using any subprocess or sys.process?
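There is no "auto_detect" format built into Spark, but one possible approach (a sketch, not a built-in feature) is to inspect the part-file extension with the Hadoop FileSystem API instead of a subprocess and dispatch to the matching reader:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: pick a reader based on the extension of the first data file under inputPath.
def readAutoDetect(spark: SparkSession, inputPath: String): DataFrame = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val firstDataFile = fs.listStatus(new Path(inputPath))
    .map(_.getPath.getName)
    .find(name => !name.startsWith("_"))   // skip _SUCCESS and other metadata files
    .getOrElse(sys.error(s"No data files found under $inputPath"))

  firstDataFile.split('.').lastOption.map(_.toLowerCase) match {
    case Some("parquet") => spark.read.parquet(inputPath)
    case Some("orc")     => spark.read.orc(inputPath)
    case Some("json")    => spark.read.json(inputPath)
    case Some("csv")     => spark.read.option("header", "true").csv(inputPath)
    case other           => sys.error(s"Unrecognised file extension: $other")
  }
}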
Let me first start with my scenario:
I have a huge dataframe stored in HDFS. I load the dataframe in a Spark session
and create a new column without changing any of the existing content. After this, I want to store the dataframe back to the original directory in HDFS.
Now, I know I can practically do this with Spark's write operation, in the fashion df.write.mode("overwrite").parquet("my_df_path"). Since the data is immense, I'm investigating whether there is, so to speak, a column-wise append mode or method that does not write the complete dataframe back, only the difference to the stored data. The final target is to save both memory and computational effort for the HDFS system.
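As far as I know there is no column-wise append mode in Spark's DataFrame writer, so the full rewrite described above is the usual route. A minimal sketch, where the paths, the derived column and the staging step are only illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("add-column").getOrCreate()

val df = spark.read.parquet("hdfs:///my_df_path")
val withNewCol = df.withColumn("new_col", col("existing_col") * 2) // hypothetical derived column

// Write to a staging directory first: overwriting the directory that is still
// being read from can fail or lose data.
withNewCol.write.mode("overwrite").parquet("hdfs:///my_df_path_staging")
// Then swap the staging directory for the original, e.g. with FileSystem.rename.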
We can read an Avro file using the code below:
val df = spark.read.format("com.databricks.spark.avro").load(path)
Is it possible to read PDF files using Spark dataframes?
You cannot read a PDF and store it in a DataFrame, because Spark cannot interpret it as columns of a dataframe (a PDF has no standard schema). If you want to get data out of a PDF, first convert it to CSV or Parquet, then read from that file and create a dataframe, since it now has a defined schema.
Visit this gitbook to understand more about the available read formats you can use to get the data as a DataFrame:
DataFrameReader — Loading Data From External Data Sources
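As an illustration, once the PDF content has been exported to CSV by an external tool, reading it back is straightforward (the path is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pdf-as-csv").getOrCreate()

val df = spark.read
  .option("header", "true")      // first line holds the column names
  .option("inferSchema", "true") // let Spark guess the column types
  .csv("hdfs:///converted/report.csv") // placeholder path for the converted PDF

df.printSchema()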
I am trying to save an RDD to a text file. My instance of Spark is running on Linux and connected to Azure Blob storage.
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
//find the rows which have only one digit in the 7th column in the CSV
val rdd1 = rdd.filter(s => s.split(",")(6).length() == 1)
rdd1.saveAsTextFile("wasb:///HVACOut")
When I look at the output, it is not as a single text file but as a series of application/octet-stream files in a folder called HVACOut.
How can I output it as a single text file instead?
Well, I am not sure you can get just one file without a directory. If you do
rdd1.coalesce(1).saveAsTextFile("wasb:///HVACOut")
you will get one file inside a directory called "HVACOut"; the file should look something like part-00000. This is because your RDD is distributed across your cluster in what are called partitions, and every save function writes one file per partition. So by calling coalesce(1) you are telling Spark you want a single partition.
Hope this helps.
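If you really need one named file rather than a part file inside a directory, one option is to merge the output afterwards with the Hadoop FileSystem API. A sketch, assuming Hadoop 2.x (FileUtil.copyMerge was removed in Hadoop 3) and that the cluster's default file system is the WASB container, as on HDInsight:

import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

rdd1.coalesce(1).saveAsTextFile("wasb:///HVACOutTmp")

val conf = sc.hadoopConfiguration
val fs = FileSystem.get(conf)
// Concatenate the part files into a single output file and delete the temp directory.
FileUtil.copyMerge(fs, new Path("wasb:///HVACOutTmp"),
                   fs, new Path("wasb:///HVACOut.txt"),
                   true, conf, null)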
After you finish provisioning an Apache Spark cluster on Azure HDInsight, you can go to the built-in Jupyter notebooks for your cluster at: https://YOURCLUSTERNAME.azurehdinsight.net/jupyter.
There you will find sample notebooks with examples of how to do this.
Specifically, for Scala, you can go to the notebook named "02 - Read and write data from Azure Storage Blobs (WASB) (Scala)".
Copying some of the code and comments here:
Note:
Because CSV is not natively supported by Spark, there is no built-in way to write an RDD to a CSV file. However, you can work around this if you want to save your data as CSV.
Code:
csvFile.map((line) => line.mkString(",")).saveAsTextFile("wasb:///example/data/HVAC2sc.csv")
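Note that this snippet assumes csvFile is an RDD of already-split rows (or SQL Rows), so mkString(",") joins the fields of each row back into one CSV line. A hypothetical setup would be:

// Hypothetical: build an RDD of field arrays, then write it back out as CSV text.
val csvFile = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
  .map(_.split(","))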
Hope this helps!
I have a Spark Streaming environment with Spark 1.2.0 where I retrieve data from a local folder, and every time I find a new file added to the folder I perform some transformations.
val ssc = new StreamingContext(sc, Seconds(10))
val data = ssc.textFileStream(directory)
In order to perform my analysis on the DStream data, I have to transform it into an Array:
import scala.collection.mutable.ArrayBuffer

var arr = new ArrayBuffer[String]()
data.foreachRDD { rdd =>
  arr ++= rdd.collect()
}
Then I use the data obtained to extract the information I want and save it to HDFS.
val myRDD = sc.parallelize(arr)
myRDD.saveAsTextFile("hdfs directory....")
Since I really need to manipulate the data with an Array, it's impossible to save the data to HDFS with DStream.saveAsTextFiles("...") (which would work fine), and I have to save the RDD; but with this procedure I end up with empty output files named part-00000, etc.
With arr.foreach(println) I am able to see the correct results of the transformations.
My suspicion is that Spark tries at every batch to write data to the same files, deleting what was previously written. I tried to save into a dynamically named folder like myRDD.saveAsTextFile("folder" + System.currentTimeMillis().toString()), but only one folder is ever created and the output files are still empty.
How can I write an RDD into HDFS in a spark-streaming context?
You are using Spark Streaming in a way it wasn't designed for. I'd recommend either dropping Spark for your use case, or adapting your code so it works the Spark way. Collecting the array to the driver defeats the purpose of using a distributed engine and makes your app effectively single-machine (two machines will also cause more overhead than just processing the data on a single machine).
Everything you can do with an array, you can do with Spark. So just run your computations inside the stream, distributed on the workers, and write your output using DStream.saveAsTextFiles(). You can use foreachRDD + saveAsParquet(path, overwrite = true) to write to a single file.
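A sketch of that "Spark way" for the scenario above, keeping the per-record logic on the DStream instead of collecting to the driver (the paths and the filter are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val data = ssc.textFileStream("hdfs:///incoming") // placeholder input directory

val extracted = data
  .map(_.split(","))
  .filter(_.length > 6)   // stand-in for the extraction previously done on the array
  .map(_.mkString(","))

// Writes one output directory per batch, named hdfs:///output-<batch time>.txt
extracted.saveAsTextFiles("hdfs:///output", "txt")

ssc.start()
ssc.awaitTermination()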
@vzamboni: The Spark 1.5+ DataFrames API has this feature:
dataframe.write.mode(SaveMode.Append).format(FILE_FORMAT).partitionBy("parameter1", "parameter2").save(path)