Scalding: Output schema from pipe operation - scala

I am reading files on HDFS via Scalding, aggregating on some fields, and writing to a tab-delimited file via Tsv. How can I also write out a file that contains the schema of my output file? For example,
UnpackedAvroSource(args("input"))
.project('key, 'var1)
.groupBy('key){ _.sum[Long]('var1 -> 'var1sum) }
.write(Tsv(args("output")))
I want to write an output text file that contains "Key, var1sum" so that someone who picks up my output file later knows what the columns are. I'm assuming Scalding doesn't embed this in the file somewhere?
Thanks.

Just found the option writeHeader = true, which will write the column names to the output file, removing the need to write out a separate schema file.
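For reference, a minimal sketch of what the write could look like with the header enabled (same source, fields and paths as in the question above):
UnpackedAvroSource(args("input"))
  .project('key, 'var1)
  .groupBy('key) { _.sum[Long]('var1 -> 'var1sum) }
  // writeHeader = true makes Tsv emit the field names (key, var1sum) as the first line
  .write(Tsv(args("output"), writeHeader = true))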

Extract Doxygen functions along with namespace into a .csv file

While working with my Doxygen output documentation, I have a requirement to extract all the functions into a spreadsheet. Additionally, each function has a requirement mapped to it using ALIASES defined in the configuration file. A sample function is shown below:
#requirement{req-id}
void Myfunc()
I am able to see all the requirements documented on a separate page in my HTML output. But I need to fetch the list of functions with their respective requirement IDs into a .csv file for further processing. Could anyone please help me out?
Thanks, Badri
Doxygen has no direct CSV output.
You would need the XML output (GENERATE_XML=YES) and then process the resulting file into whatever format you want, or process the XML directly without the need for an intermediate CSV file.
When you have an ALIASES like
ALIASES += req{3}="\xrefitem req \"Requirement\" \"SW Requirements\" ID: \1 Requirement: \2 Verification Criteria: \3"
you will get a file req.xml that you can process further.
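A minimal sketch of turning that XML into CSV, written in Scala since that is the language used elsewhere on this page; the element names ("term", "listitem") are assumptions about how the xrefitem page is laid out, so inspect the generated req.xml and adjust the selectors:
import scala.xml.XML

object ReqToCsv {
  def main(args: Array[String]): Unit = {
    // Path depends on the XML_OUTPUT setting; "xml/req.xml" is just a guess.
    val doc = XML.loadFile("xml/req.xml")
    // Assumed layout: each requirement entry pairs a <term> (the function)
    // with a <listitem> (the requirement text).
    val funcs = (doc \\ "term").map(_.text.trim)
    val reqs  = (doc \\ "listitem").map(_.text.trim)
    println("function,requirement")
    funcs.zip(reqs).foreach { case (f, r) => println(s"$f,$r") }
  }
}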

How to process CSV with different columns in CDAP (Datafusion)?

I have a case where I receive multiple CSVs from third parties (it is a little hard to make them change the format), and those CSVs should have the same columns, but sometimes one or more columns are missing. If I use the CDAP File source (reading as text) followed by a Wrangler with the following directives to process the CSV:
parse-as-csv :body '\\t' true
cleanse-column-names
It will assume that all files read have the same column layout as the first file and will mangle the data of any files that have fewer or more columns.
So far I have tried using the File source to read the files as blobs, with the output as bytes, and a Wrangler configured with these directives:
set-type :body string
parse-as-csv :body '\t' true
cleanse-column-names
But now I do not even get any output (or error), so I am clueless about how to parse those non-uniform files. Is CDAP able to handle this case? If yes, how?
You can use the set-column directive to add new columns to the files that don't have all the needed columns. By and large, I would recommend you look through the full directives documentation to preprocess your files.
I hope that helps.

Write RDD to a txt file

I have the following type of data:
org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[((String, String),Int)]] = MapPartitionsRDD[29] at map at <console>:38
I'd like to write those data in a txt file to have something like
((like,chicken),2) ((like,dog),3) etc.
I store the data in a variable called res. For the moment I have tried this:
res.coalesce(1).saveAsTextFile("newfile.txt")
But it doesn't seem to work...
If my assumption is correct, you expect the output to be a single .txt file because it was coalesced down to one worker. This is not how Spark is built. It is meant for distributed work and should not be shoe-horned into a form where the output is not distributed. You should use a more generic command-line tool for that.
All that said, you should see a folder named newfile.txt which contains data files with your expected output.
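To make the folder behaviour concrete, here is a minimal runnable sketch; it assumes res is a flat RDD[((String, String), Int)], since the nested RDD[RDD[...]] type shown above cannot be saved directly:
import org.apache.spark.{SparkConf, SparkContext}

object SaveRddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-rdd").setMaster("local[*]"))
    val res = sc.parallelize(Seq((("like", "chicken"), 2), (("like", "dog"), 3)))
    // saveAsTextFile creates a directory named "newfile.txt" holding part files;
    // coalesce(1) only reduces it to a single part-00000 inside that directory.
    res.coalesce(1)
      .map { case ((a, b), n) => s"(($a,$b),$n)" }
      .saveAsTextFile("newfile.txt")
    sc.stop()
  }
}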

Saving a string to HDFS creates line feeds for each character

I have a plain text file on my local system that I am uploading to HDFS. I have Spark/Scala code that reads the file in, converts it to a string, and then uses the saveAsTextFile function with the HDFS path where I want the file to be saved. Note that I am using the coalesce function because I want one file saved, rather than having the file split.
import scala.io.Source
val fields = Source.fromFile("MyFile.txt").getLines
val lines = fields.mkString
sc.makeRDD(lines).coalesce(1, true).saveAsTextFile("hdfs://myhdfs:8520/user/alan/mysavedfile")
The code saves my text to HDFS successfully; unfortunately, for some reason each character in my string has a line feed character after it.
Is there a way around this?
I wasn't able to get this working exactly as I wanted, but I did come up with a workaround. I saved the file locally and then called a shell command through Scala to upload the completed file to HDFS. Pretty straightforward.
Would still appreciate if anyone could tell me how to copy a string directly to a file in HDFS though.
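For what it's worth, the per-character line feeds appear because a String is a collection of Chars, so sc.makeRDD(lines) creates one record per character. A minimal sketch of one possible fix (not the workaround described above) is to wrap the text in a one-element Seq so the whole string stays a single record:
import scala.io.Source

// A one-element Seq gives an RDD with a single String record instead of one record per character.
val text = Source.fromFile("MyFile.txt").getLines().mkString("\n")
sc.makeRDD(Seq(text))
  .coalesce(1, shuffle = true)
  .saveAsTextFile("hdfs://myhdfs:8520/user/alan/mysavedfile")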

Using the second row of a delimited text file as the header row when importing into Access 2010

Is it possible to use the values of the second row of a delimited text file (e.g. a csv file) as the header row when importing into Access 2010?
No - the headers have to be in the first line of the imported file. You need to delete the empty first line of data.
If there are too many files for this to be practical, as you imply, you have a couple of options.
Presuming the headers are the same on all of your files to be imported, you could combine all of the text files into one file and import that.
If the headers are different, you could write some code to batch delete the first line from all your files, as is suggested here.
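If you go the scripted route, here is a minimal sketch in Scala (matching the language used elsewhere on this page); the folder path and the .csv extension filter are placeholders, and any scripting language would do just as well:
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.io.Source

object DropFirstLine {
  def main(args: Array[String]): Unit = {
    val dir = new File("C:/imports")  // hypothetical folder containing the files to import
    for (f <- dir.listFiles() if f.getName.toLowerCase.endsWith(".csv")) {
      val src = Source.fromFile(f)
      val lines = try src.getLines().toList finally src.close()
      // Drop the unwanted first line so the real headers end up on line one.
      Files.write(Paths.get(f.getPath),
        lines.drop(1).mkString("\r\n").getBytes(StandardCharsets.UTF_8))
    }
  }
}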