Saving a string to HDFS creates line feeds for each character - scala

I have a plain text file on my local system that I am uploading to HDFS. My Spark/Scala code reads the file in, converts it to a single string, and then uses the saveAsTextFile function to write it to the HDFS path where I want the file saved. Note that I am using the coalesce function because I want a single output file rather than a split one.
import scala.io.Source
val fields = Source.fromFile("MyFile.txt").getLines
val lines = fields.mkString
sc.makeRDD(lines).coalesce(1, true).saveAsTextFile("hdfs://myhdfs:8520/user/alan/mysavedfile")
The code saves my text to HDFS successfully; unfortunately, for some reason each character in my string is followed by a line feed character.
Is there a way around this?

I wasn't able to get this working exactly as I wanted, but I did come up with a workaround: I saved the file locally and then called a shell command through Scala to upload the completed file to HDFS. Pretty straightforward.
Would still appreciate if anyone could tell me how to copy a string directly to a file in HDFS though.
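The per-character line feeds come from how the string is handed to Spark: makeRDD expects a Seq[T], and a bare String converts to a Seq[Char], so every character becomes its own record and saveAsTextFile writes one record per line. Wrapping the string in a one-element Seq avoids that. For copying a string straight to HDFS without Spark, here is a minimal sketch using the Hadoop FileSystem API (the URI is taken from the question; the .txt path and lack of error handling are my assumptions):
// One record containing the whole string, instead of one record per character:
sc.makeRDD(Seq(lines)).coalesce(1, true).saveAsTextFile("hdfs://myhdfs:8520/user/alan/mysavedfile")

// Or bypass Spark entirely and write through the Hadoop FileSystem API:
import java.io.PrintWriter
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new URI("hdfs://myhdfs:8520"), new Configuration())
val out = new PrintWriter(fs.create(new Path("/user/alan/mysavedfile.txt")))
try out.write(lines) finally out.close()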

Related

Save a comma delimited string line by line as a named .txt file into Google Storage Bucket

I have a comma-delimited string in Scala. I want to split the string on commas, so that each word is on its own line, and save the result to a Google Storage Bucket. At the moment I can only save the whole string as-is (not broken onto separate lines by comma), and I cannot rename the file in the Google Storage Bucket. Can you help? Here is my code.
val x = "a,b,c,d"
sc.parallelize(sc.parallelize(List(x)).collect()).coalesce(1).saveAsTextFile("gs://myReport/Output_x")
This piece of script saves the string into a flat file automatically named part-00000, with no suffix, and all the contents end up on the same line. So I have it like this:
a,b,c,d
What I want is a file called table1.txt saved in the same place of the Google Bucket. And the contents should be like this:
a
b
c
d
Can Scala do it?
As mentioned by @Jwvh, you can use x.replaceAll(",","\n") to replace the commas in your string with line breaks.
For the second part of your question, however, you can't save a file directly the way you would in a local directory. You have to either save it locally or build the file in memory, and only then upload it to Cloud Storage with the Cloud Storage SDK; for that you can use this GCS library for Scala.
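A minimal sketch of the split itself, turning each comma-separated value into its own RDD record so saveAsTextFile writes one value per line (the output path is the one from the question; the part-00000 file name cannot be changed from within Spark):
val x = "a,b,c,d"
// Each array element becomes its own record, so the output has one value per line.
sc.parallelize(x.split(",").toSeq).coalesce(1).saveAsTextFile("gs://myReport/Output_x")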

Write RDD in txt file

I have the following type of data:
org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[((String, String),Int)]] = MapPartitionsRDD[29] at map at <console>:38
I'd like to write those data in a txt file to have something like
((like,chicken),2) ((like,dog),3) etc.
I store the data in a variable called res
For the moment I have tried this:
res.coalesce(1).saveAsTextFile("newfile.txt")
But it doesn't seem to work...
If my assumption is correct, you expect the output to be a single .txt file because it was coalesced down to one worker. That is not how Spark is built: it is meant for distributed work and shouldn't be shoe-horned into producing a single, non-distributed output file. You should use a more generic command-line tool for that.
All that said, you should see a folder named newfile.txt which contains data files with your expected output.
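If a single file really is required and the data fits in driver memory, one hedged workaround is to collect the RDD and write the file yourself (the file name reuses the one from the question; one tuple per line is my assumption about the desired layout):
import java.io.PrintWriter

// Only safe for small results: collect() pulls everything to the driver.
val out = new PrintWriter("newfile.txt")
try res.collect().foreach(t => out.println(t)) finally out.close()
Alternatively, if the output directory is on HDFS, hdfs dfs -getmerge newfile.txt merged.txt will concatenate the part files inside the newfile.txt directory into a single local file.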

Write ArrayBuffer to file in scala

I want to write multiple ArrayBuffers to a file, one after the other, in append mode in Scala. I should then be able to read the last ArrayBuffer from the file, delete it, and save the file with the remaining ArrayBuffers.
I can't think of a good solution for this. How should I do it?
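One hedged sketch, assuming the buffers hold Ints and that a one-buffer-per-line text encoding is acceptable (the helper names and encoding are illustrative, not a standard API):
import java.io.{FileWriter, PrintWriter}
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

// Append one buffer per line, values comma-separated.
def appendBuffer(path: String, buf: ArrayBuffer[Int]): Unit = {
  val out = new PrintWriter(new FileWriter(path, true)) // true = append mode
  try out.println(buf.mkString(",")) finally out.close()
}

// Read the last buffer back, rewrite the file without it, and return it.
def popLastBuffer(path: String): ArrayBuffer[Int] = {
  val src = Source.fromFile(path)
  val lines = try src.getLines().toVector finally src.close()
  val out = new PrintWriter(path)
  try lines.dropRight(1).foreach(line => out.println(line)) finally out.close()
  ArrayBuffer(lines.last.split(",").map(_.toInt): _*)
}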

Scalding: Output schema from pipe operation

I am reading files on HDFS via Scalding, aggregating on some fields, and writing to a tab-delimited file via Tsv. How can I write out a file that contains the schema of my output file? For example,
UnpackedAvroSource(args("input"))
.project('key, 'var1)
.groupBy('key) { _.sum[Long]('var1 -> 'var1sum) }
.write(Tsv(args("output")))
I want to write an output text file that contains "Key, var1sum" so that someone who picks up my output file later knows what the columns are. I'm assuming Scalding doesn't embed this in the file somewhere?
Thanks.
Just found the option writeHeader = true, which writes the column names to the output file, so there is no need to write out a separate schema file.
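Applied to the example above, that would look like this (a minimal sketch, assuming your Scalding version exposes the writeHeader flag on the Tsv sink):
UnpackedAvroSource(args("input"))
  .project('key, 'var1)
  .groupBy('key) { _.sum[Long]('var1 -> 'var1sum) }
  // writeHeader = true emits the field names ('key, 'var1sum) as the first row.
  .write(Tsv(args("output"), writeHeader = true))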

Extracting file names from an online data server in Matlab

I am trying to write a script that will allow me to download numerous (1000s of) data files from a data server (e.g., http://hydro1.sci.gsfc.nasa.gov/thredds/catalog/GLDAS_NOAH10SUBP_3H/2011/345/). Unfortunately, the names of the files in each directory are not formatted in a similar way (the time at which they were created was appended to the end of the file name). I need to be able to specify the file name to subset the data (I have a special tool for these data types) and download it. I cannot find a function in MATLAB that will extract the file names.
I have looked at URLREAD, but it downloads everything, including the HTML code.
Thanks for your help!
You can easily parse the page:
x = urlread(url);
links = regexp(x, '<a href=''([^>]+)''>', 'tokens');
This reads every link; you then have to filter out the ones you don't want. For example, this gets all .grb files (note the escaped dot in the pattern):
a = regexp(x, '<a href=''([^>]+\.grb)''>', 'tokens');