How to store an Array[String] to an output file [duplicate] - scala

This question already has answers here:
How to write to a file in Scala?
(19 answers)
Closed 5 years ago.
I'm having an Array[String] called samparr with some values in it, I want it to get stored as an output file.
var samparr: Array[String] = new Array[String](4)
samparr +:= print1 + " BEST_MATCH " + print2
just like,
val output = samparr.saveAsTextFile(outputpath)
but isn't a RDD its an Array[String]

You can use SparkContext.parallelize to "distribute" your Array onto the Spark cluster (in other words, to turn it into an RDD), and then call saveAsTextFile:
sc.parallelize(samparr).saveAsTextFile(outputpath)
This action will partition the data and send each partition to one of the executors, then each partition will be saved into a separate "file-part".
Alternatively, since the array is very small and doesn't really "justify" using Spark, you can try any non-Spark method of saving data to file, e.g. the one linked by #avihoo-mamka: How to write to a file in Scala?

Related

Spark - Write DataFrame with custom file name [duplicate]

This question already has answers here:
Write single CSV file using spark-csv
(16 answers)
Closed 1 year ago.
I have a Spark (2.4) DataFrame that I want to write as a Pipe separated file. It should be pretty straightforward like so
val myDF = spark.table("mySchema.myTable")
myDF.coalesce(1).write.format("csv").options("header", "true").options("delimiter", "|").save("/tmp/myDF")
I get a part-*.csv file in /tmp/myDF.
So far, so good. But I actually want the file name to be something specific, e.g. /tmp/myDF.csv
But giving this String in save will just create a dir called myDF.csv and create the part*.csv file in there.
Is there a way to write the DataFrame with a specific name?
You can't do that with Spark
You can rename the file later accessing the fileSystem
val directory = new File(/tmp/myDF)
if (directory.exists && directory.isDirectory) {
val file = directory.listFiles.filter(_.getName.endsWith(".csv")).head
file.renameTo("myDF.csv")
}

How to read data from hdfs using scala language [duplicate]

This question already has answers here:
Spark - load CSV file as DataFrame?
(14 answers)
Closed 4 years ago.
How do I read the data from hdfs data sets using scala language? data is any "CSV" file with limited records.
You tagged the question with Spark, so I'm assuming you are trying to use that. I would recommend you start by reading through the Spark documentation here to get an idea of how to use Spark to interact with your data.
https://spark.apache.org/docs/latest/quick-start.html
https://spark.apache.org/docs/latest/sql-programming-guide.html
But, to answer your specific question, in Spark you would read in the CSV file using code like this:
val csvDf = spark.read.format("csv")
.option("sep", ",")
.option("header", "true")
.load("hdfs://some/path/to/data.csv/")
The path your provide will be to a CSV file on HDFS, or a folder containing multiple CSV files. Also, Spark will accept other types of file systems. For example you could also use "file://" to access the local file system, or "s3://" to use S3. Once you have loaded the data you will have a Spark DataFrame object with SQL like methods available to interact with it.
Note, I provided an option for separator just to show you how to do it, but it defaults to "," anyways, so it is not required. Also, if your CSV files do not include a header, you will need to specify the Schema yourself and set header to false instead.
You can read data from HDFS by following this approach :-
val hdfs = FileSystem.get(new URI("hdfs://hdfsUrl:port/"), new Configuration())
val path = new Path("/pathOfTheFileInHDFS/")
val stream = hdfs.open(path)
def readLines = Stream.cons(stream.readLine, Stream.continually( stream.readLine))
//This example checks line for null and prints every existing line consequentally
readLines.takeWhile(_ != null).foreach(line => println(line))
Also please have a look at this article https://blog.matthewrathbone.com/2013/12/28/reading-data-from-hdfs-even-if-it-is-compressed
Please let me know if this answers your question.

Fast file writing in scala?

So I have a scala program that iterates through a graph and writes out data line by line to a text file. It is essentially an edge list file for use with graphx.
The biggest slow down is actually creating this text file, were talking maybe million records it writes to this text file. Is there a way I can somehow parallel this task or making faster in any way by somehow storing it in memory or anything?
More info:
I am using a hadoop cluster to iterate through a graph and here is my code snippet for my text file creation im doing now to write to HDFS:
val fileName = dbPropertiesFile + "-edgelist-" + System.currentTimeMillis()
val path = new Path("/home/user/graph/" + fileName + ".txt")
val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://host001:8020")
val fs = FileSystem.newInstance(conf)
val os = fs.create(path)
while (edges.hasNext) {
val current = edges.next()
os.write(current.inVertex().id().toString.getBytes())
os.write(" ".getBytes())
os.write(current.outVertex().id().toString.getBytes())
os.write("\n".toString.getBytes())
}
fs.close()
Writing files to HDFS is never fast. Your tags seem to suggest that you are already using spark anyway, so you could as well, take advantage of it.
sparkContext
.makeRDD(20, edges.toStream)
.map(e => e.inVertex.id -> e.outVertex.id)
.toDF
.write
.delimiter(" ")
.csv(path)
This splits your input into 20 partitions (you can control that number with the numeric parameter to makeRDD above), and writes them in parallel to 20 different chunks in hdfs, that represent your resulting file.

Specifying the filename when saving a DataFrame as a CSV [duplicate]

This question already has an answer here:
Spark dataframe save in single file on hdfs location [duplicate]
(1 answer)
Closed 5 years ago.
Say I have a Spark DF that I want to save to disk a CSV file. In Spark 2.0.0+, one can convert DataFrame(DataSet[Rows]) as a DataFrameWriter and use the .csv method to write the file.
The function is defined as
def csv(path: String): Unit
path : the location/folder name and not the file name.
Spark stores the csv file at the location specified by creating CSV files with name - part-*.csv.
Is there a way to save the CSV with specified filename instead of part-*.csv ? Or possible to specify prefix to instead of part-r ?
Code :
df.coalesce(1).write.csv("sample_path")
Current Output :
sample_path
|
+-- part-r-00000.csv
Desired Output :
sample_path
|
+-- my_file.csv
Note : The coalesce function is used to output a single file and the executor has enough memory to collect the DF without memory error.
It's not possible to do it directly in Spark's save
Spark uses Hadoop File Format, which requires data to be partitioned - that's why you have part- files. You can easily change filename after processing just like in this question
In Scala it will look like:
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
val file = fs.globStatus(new Path("path/file.csv/part*"))(0).getPath().getName()
fs.rename(new Path("csvDirectory/" + file), new Path("mydata.csv"))
fs.delete(new Path("mydata.csv-temp"), true)
or just:
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.rename(new Path("csvDirectory/data.csv/part-0000"), new Path("csvDirectory/newData.csv"))
Edit: As mentioned in comments, you can also write your own OutputFormat, please see documents for information about this approach to set file name

Apache Spark streaming mapping object and printing attribute

I'm reading from a text file, parsing each line to JSON and am attempting to print one of the attributes:
val msgData = ssc.textFileStream(dataDir)
val msgs = msgData.map(MessageParser.parse)
msgs.foreach(msg => println(msg.my_attribute))
However, I get the following error on compilation:
value my_attribute is not a member of org.apache.spark.rdd.RDD[com.imgzine.analytics.messages.Message]
What am I missing?
Thanks
Spark Streaming discretizes a stream of data by creating micro-batch containers. Those are called 'DStreams' and contain a collection of RDD's.
Translated to your case, you need to operate on the content of the RDD, not the DStream:
msgs.foreach(rdd => rdd.foreach(elem => println(elem.my_attribute))
DStreams offer a help method to print the first elements (10 I think) of each RDD:
dstream.print()
Of course, that will just invoke .toString on the objects contained in the RDD and print the result. Maybe not what you want with my_attribute as stated in the question.