Write either CSV output or Parquet output, controlled via a configuration setting - Scala

I would like my program to write output files in either CSV or Parquet format, and the decision on which format to use should be controlled via a configuration setting.
I could use something like the snippet below.
// I would probably read opType via a JSON or XML config.
val opType = "csv"
// Write output based on the appropriate opType
opType match {
  case "csv" =>
    df.write.csv("/some/output/location")
  case "parquet" =>
    df.write.parquet("/some/output/location")
  case _ =>
    df.write.csv("/some/output/location/")
}
Question: Is there a better way to handle this scenario? Is there any way I could use the string value of opType to call the appropriate writer, whether Parquet or CSV?
Any help or pointers are appreciated.

Create an enum of the possible file types, and make sure the enum values follow Spark's built-in data source keywords (i.e. csv, parquet, orc, json, text, etc.).
Then you can simply do this:
df.write.format(opType).save(opPath)
Note: the enum is used only for validation, to make sure the input is not some incorrect or garbled value.
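A minimal sketch of that approach, assuming the config is read with Typesafe Config (the key output.format, the opPath value, and the OutputFormat enumeration are illustrative assumptions; csv, parquet, orc, json and text are the real Spark format keywords):

import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.DataFrame

// Allowed output formats; the names deliberately match Spark's data source keywords.
object OutputFormat extends Enumeration {
  val csv, parquet, orc, json, text = Value
}

// Hypothetical config key "output.format", read with Typesafe Config.
val opType: String = ConfigFactory.load().getString("output.format")
val opPath: String = "/some/output/location"

// withName throws if the configured value is not one of the known formats.
OutputFormat.withName(opType)

def writeOutput(df: DataFrame): Unit =
  df.write.format(opType).save(opPath)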

Related

Spark write text file without ignoring escape (backslash)

I'm trying to write a Dataset into a text file.
Example:
datasets
  .write
  .text(path)
What I intend to write is "some\text" (the String the dataset contains).
For Scala to interpret this String, we have to declare the value with an escaped backslash, like this:
val text: String = "some\\text"
Of course, when testing in plain Scala, it prints out the correct value ("some\text").
But when I write this dataset with spark.write, it appears in the output file as "some\\text".
Reading the internal code, I found an escape option only for CSV writing.
Is there any way to solve this problem?
Thanks

Dataset Empty parameter value

I have an XML dataset, and I want to parametrize the compression type so that .xml and .xml.gz files can be handled by the same pipeline.
When I put the value 'gzip' in the compression type, it reads the .xml.gz file. I want to know what value I should put to read an uncompressed .xml file, because the parameter does not accept an empty value. It is only able to read the plain .xml file when I delete the compression_type parameter.
You should pass "None" and it should work.
I feel "None" is more of a workaround in this particular case. "None" is still a string value, not an empty one.
In my scenario right now, I have an Excel dataset. I want to make every parameter as generic as possible, including the file path/name, the sheet name, and the range. The "Range" value under the Connection tab allows an empty value. However, if I specify it as #dataset().DataRange and leave my DataRange parameter empty, I cannot preview the data or submit the pipeline because it complains that the value cannot be empty.

How to read a CSV file in Scala

I have a CSV file, and I want to read that file and store it in a case class. As I know, a CSV is a comma-separated values file. But in my CSV file some of the data already contains commas, and splitting on every comma creates a new column. So the problem is how to split the data correctly.
1st data
04/20/2021 16:20(1st column) Here a bunch of basic techniques that suit most businesses, and easy-to-follow steps that can help you create a strategy for your social media marketing goals.(2nd column)
2nd data
11-07-2021 12:15(1st column) Focus on attracting real followers who are genuinely interested in your content, and make the most of your social media marketing efforts.(2nd column)
import scala.io.Source

var i = 0
var length = 0
val data = Source.fromFile(file)
for (line <- data.getLines) {
  val cols = line.split(",").map(_.trim)
  length = cols.length
  while (i < length) {
    // println(cols(i))
    i = i + 1
  }
  i = 0
}
If you are reading a complex CSV file then the ideal solution is to use an existing library. Here is a link to the ScalaDex search results for CSV.
ScalaDex CSV Search
However, based on the comments, it appears that you might actually be wanting to read data stored in a Google Sheet. If that is the case, you can utilize the fact that you have some flexibility to save the data in a text file yourself. When I want to read data from a Google Sheet in Scala, the approach I use first is to save the file in a format that isn't hard to read. If the fields have embedded commas but no tabs, which is common, then I will save the file as a TSV and parse that with split("\t").
A simple bit of code that only uses the standard library might look like the following:
val source = scala.io.Source.fromFile("data.tsv")
val data = source.getLines.map(_.split("\t")).toArray
source.close
After this, data will be an Array[Array[String]] with your data in it that you can process as you desire.
Of course, if your data includes both tabs and commas then you'll really want to use one of those more robust external libraries.
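Building on the TSV approach, a minimal sketch of loading the rows into a case class might look like the following (the Record class, its field names, and the data.tsv file name are assumptions based on the sample data above, which has a timestamp column and a text column):

// Hypothetical case class matching the two columns in the sample data.
case class Record(timestamp: String, message: String)

val source = scala.io.Source.fromFile("data.tsv")
val records: Array[Record] =
  source.getLines()
    .map(_.split("\t", -1))  // -1 keeps trailing empty fields
    .collect { case Array(ts, msg) => Record(ts.trim, msg.trim) }  // rows without exactly two fields are skipped
    .toArray
source.close()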
You could use the univocity CSV parser for faster parsing.
You can also use it for writing CSV as well.
Univocity parsers
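A minimal sketch of parsing a file with univocity-parsers (assuming the univocity-parsers dependency is on the classpath; the data.csv file name is a placeholder):

import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
import java.io.FileReader

val settings = new CsvParserSettings()
settings.setLineSeparatorDetectionEnabled(true)  // handle \n and \r\n transparently
val parser = new CsvParser(settings)

// parseAll reads the whole file into a java.util.List[Array[String]]
val rows = parser.parseAll(new FileReader("data.csv"))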

How to perform row and column filter operations on a CSV file in Scala

I am writing Scala scripts. I need to perform row filter operations, such as greater-than and less-than comparisons, on the CSV file. I have tried using the filter option in the script but am unable to get the results. Please let me know how to perform a filter operation on the CSV file. The sample data has been attached here for reference. Thanks in advance.
for (line <- bufferedSource.getLines) {
  cols += line.split(",").filter(csv => csv(1).toInt > 10000)
}
Instead of resorting to a for loop, use map. This code snippet should work:
bufferedSource.getLines.map(row => row.split(",")).filter(cols => cols(1).toInt > 10000).toList
Also, it's a better approach to use a case class for the CSV you are filtering to make your code more readable.
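As a sketch of that case class suggestion (the Transaction class, its field names, the amount threshold, and the data.csv file name are assumptions for illustration):

import scala.io.Source

// Hypothetical case class describing one CSV row: a name and an amount.
case class Transaction(name: String, amount: Int)

val bufferedSource = Source.fromFile("data.csv")
val filtered: List[Transaction] =
  bufferedSource.getLines()
    .map(_.split(",").map(_.trim))
    .collect { case Array(name, amount) => Transaction(name, amount.toInt) }
    .filter(_.amount > 10000)
    .toList
bufferedSource.close()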

Spark/Scala read hadoop file

In a pig script I saved a table using PigStorage('|').
In the corresponding Hadoop folder I have files like
part-r-00000
etc.
What is the best way to load it in Spark/Scala? In this table I have 3 fields: Int, String, Float.
I tried:
val text = sc.hadoopFile("file", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)
But then I would somehow need to split each line. Is there a better way to do it?
If I were coding in Python, I would create a DataFrame indexed by the first field, whose columns are the values found in the string field and whose coefficients are the float values. But I need to use Scala to use the PCA module, and the Scala DataFrames don't seem that close to Python's.
Thanks for the insight
PigStorage creates a text file without schema information, so you need to do that work yourself, something like:
val csv = sc.textFile("file") // or the directory where the part files are
val data = csv.map(line => {
  val vals = line.split('|') // split on the Char '|'; split("|") would treat | as a regex metacharacter
  (vals(0).toInt, vals(1), vals(2).toDouble)
})
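If you then want a DataFrame for the PCA step, a minimal sketch might be the following (the column names are assumptions, and toDF requires the implicits import from a SparkSession):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pig-output").getOrCreate()
import spark.implicits._

// Convert the RDD of (Int, String, Double) tuples into a DataFrame
// with hypothetical column names.
val df = data.toDF("id", "label", "value")
df.printSchema()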