Write RDD in txt file - scala

I have the following type of data:
`org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[((String, String),Int)]] = MapPartitionsRDD[29] at map at <console>:38`
I'd like to write this data to a txt file so that I have something like
((like,chicken),2) ((like,dog),3) etc.
I store the data in a variable called res
For the moment I have tried this:
res.coalesce(1).saveAsTextFile("newfile.txt")
But it doesn't seem to work...

If my assumption is correct, you expect the output to be a single .txt file because the RDD was coalesced down to one partition. That is not how Spark is built: it is meant for distributed work, and saveAsTextFile always writes its output as a directory of part files, so it should not be shoe-horned into producing a single non-distributed file. If you really need one file, merge the part files afterwards with a more generic command-line tool.
All that said, you should see a folder named newfile.txt which contains data files with your expected output.
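For reference, a minimal sketch of that save, assuming res is a plain RDD[((String, String), Int)] rather than the nested RDD type shown above; the map is optional, since the tuple's default toString already prints in the ((like,chicken),2) form:
res
  .map { case ((a, b), n) => s"(($a,$b),$n)" } // format each record explicitly
  .coalesce(1)
  .saveAsTextFile("newfile.txt") // writes newfile.txt/part-00000 (plus _SUCCESS)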

Related

Message passing between two perl files

I have 2 Perl files which cannot be merged and have to be run separately. My first file does certain initialization of parameters which are used by my second file, which performs some testing. Now I want to use the parameters initialized in the first file in the second file; how can I do that?
I am writing a Perl script for software testing. I need to write two files: one is an initialization file which does all the initialization, and the second contains the test sequence to execute, which uses the initialized parameters. I need to run both files separately; execution-wise, my first file runs first and then my second file runs.
I am thinking of using an XML file, where the first script logs the parameters to the file and the second script reads the parameters from it. Is there a better way to do this?
If your initialization produces only plain key-value pairs, then any way of serialising data will suffice. Otherwise XML is probably the worst option for your case: you might need to put in a lot of effort to get the same data structure back in your second script. This happens because, by default, XML modules do not know what should be an attribute, a child node or an array of nodes. For example, passing a one-element array of hashes to XML from the first script might turn into just a single hash in your second script. The results will depend heavily on the XML modules, the options you pass to them and the data itself.
JSON shouldn't have such issues. It might have unnecessary type conversions, but you shouldn't really notice them.
Storable guarantees that you get the same data in your second script.
You might find Data::Dumper to be an easier solution. But it has some security issues since you need to execute its output in your second script.
None of the above are meant to be used with data containing self-references, or with anything but scalars, arrayrefs and hashrefs.

how to append to a file using scala/breeze library

I wish to write the output row of a result matrix (produced in each iteration) to a file, so that I can support checkpointing.
I figured I can use the csvwrite command to write the entire matrix to a file, but how can I append to a file?
I am looking for something like below:
breeze.linalg.csvwrite(new File("small.txt"),myMatrix(currRow,::).t.asDenseMatrix)
However the above command overwrites the file each time the command is executed.
There's nothing built-in to Breeze. (Contributions welcome!)
You can use breeze.io.CSVWriter.write directly if you would like.
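Alternatively, a workaround sketch that bypasses csvwrite and opens the file in append mode via plain java.io; appendRow is a hypothetical helper name, and it assumes a DenseMatrix[Double] as in the question:
import java.io.{File, FileWriter, PrintWriter}
import breeze.linalg.DenseMatrix

// Hypothetical checkpoint helper: appends one matrix row to a CSV file.
// FileWriter's second argument (true) opens the file in append mode.
def appendRow(file: File, m: DenseMatrix[Double], currRow: Int): Unit = {
  val out = new PrintWriter(new FileWriter(file, true))
  try out.println(m(currRow, ::).t.toArray.mkString(","))
  finally out.close()
}

// usage inside the iteration loop:
// appendRow(new File("small.txt"), myMatrix, currRow)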

Scalding: Output schema from pipe operation

I am reading files on HDFS via Scalding, aggregating on some fields, and writing to a tab-delimited file via Tsv. How can I write out a file that contains the schema of my output file? For example,
UnpackedAvroSource(args("input"))
.project('key, 'var1)
.groupBy('key){ _.sum[Long]('var1 -> 'var1sum) }
.write(Tsv(args("output")))
I want to write an output text file that contains "Key, var1sum" so that someone who picks up my output file later knows what the columns are. I'm assuming Scalding doesn't embed this in the file somewhere?
Thanks.
Just found the option writeHeader = true, which will write the column names to the output file, negating the need for writing the schema out to a separate file.
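For completeness, the same pipe with the header enabled (assuming the surrounding Scalding job and the UnpackedAvroSource from the question); Tsv's writeHeader flag emits the field names as the first line of the output:
UnpackedAvroSource(args("input"))
  .project('key, 'var1)
  .groupBy('key) { _.sum[Long]('var1 -> 'var1sum) }
  .write(Tsv(args("output"), writeHeader = true)) // first line: key<TAB>var1sum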

Saving a string to HDFS creates line feeds for each character

I have a plain text file on my local system that I am uploading to HDFS. I have Spark/Scala code that reads the file in, converts it to a string, and then uses the saveAsTextFile function with the HDFS path where I want the file saved. Note I am using the coalesce function because I want one file saved, rather than the file getting split.
import scala.io.Source
val fields = Source.fromFile("MyFile.txt").getLines
val lines = fields.mkString
sc.makeRDD(lines).coalesce(1, true).saveAsTextFile("hdfs://myhdfs:8520/user/alan/mysavedfile")
The code I have saves my text successfully to HDFS, unfortunately though, for some reason each character in my string has a line feed character after it.
Is there a way around this?
I wasn't able to get this working exactly as I wanted, but I did come up with a workaround. I saved the file locally and then called a shell command through Scala to upload the completed file to HDFS. Pretty straightforward.
Would still appreciate if anyone could tell me how to copy a string directly to a file in HDFS though.
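For what it's worth, the per-character line feeds come from makeRDD treating the String as a Seq[Char] (via the implicit wrapString conversion), so every character becomes its own record. A sketch of the direct route, keeping each input line as one record and using the same paths as above:
import scala.io.Source

val fileLines = Source.fromFile("MyFile.txt").getLines.toSeq
sc.makeRDD(fileLines)                 // one record per line, not per character
  .coalesce(1, shuffle = true)
  .saveAsTextFile("hdfs://myhdfs:8520/user/alan/mysavedfile")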

Writing a script for reading many .csv files with similar filenames

I have several .csv files with similar filenames except for a numeric month prefix (i.e. 03_data.csv, 04_data.csv, 05_data.csv, etc.) that I'd like to read into R.
I have two questions:
1. Is there a function in R similar to MATLAB's varname and assignin that will let me create/declare a variable name within a function or loop, so that I can read the respective .csv file, i.e. 03_data.csv into a 03_data data.frame, etc.? I want to write a quick loop to do this because the filenames are similar.
2. As an alternative, is it better to create one data.frame with the first file and then append the rest using a for loop? How would I do that?
You could look at this related question. You can create the file names easily with a paste command:
file.names <- paste(sprintf("%02d",1:10), "_data.csv", sep="")
Once you have your file names (whether by creating them or by reading them from the directory as in the other question), you can import them quickly with an lapply:
import.list <- lapply(file.names, read.csv)
Lastly, to combine the list into one data frame, the easiest approach is to use merge_recurse from the reshape package:
library(reshape)
data <- merge_recurse(import.list)
It is also very easy to read the contents of a directory, including the use of regular expressions to focus on certain file names only, e.g.
filestoread <- list.files(someDir, pattern="\\.csv$", full.names=TRUE)
returns all files (fully qualified, including the full path) in the given directory someDir that end in ".csv". You can get fancier with better regular expressions, which are documented in many places.
Once you have your list of files, it is straightforward to read them all using apply or lapply or a loop.