Spark write text file without ignoring escape(backslash) - scala

I'm trying to write a Dataset to a text file.
Example
datasets
  .write
  .text(path)
What I intend to write is "some\text" (a String the Dataset contains).
For Scala to interpret this String, we have to define the value like this:
val text: String = "some\\text"
Of course, when testing in plain Scala it prints the correct value ("some\text").
But when I write this Dataset with spark.write, it comes out as "some\\text".
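For reference, a minimal sketch of the setup (the session and output path are just examples):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// "some\\text" in source code is the runtime string some\text
val datasets = Seq("some\\text").toDS()

datasets.show(false)              // prints some\text as expected
datasets.write.text("/tmp/out")   // but the written file ends up containing some\\text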
Reading the internal code, I only found an escape option for CSV writing.
Is there any way to solve this problem?
Thanks

Related

Change the format of file path which is partitioned by java.sql.Timestamp

We are using Spark as our data processing platform and Scala as the programming language. When we write data to a storage account (ADLS Gen2), we partition the data by a datetime column of type java.sql.Timestamp, using the DataFrame.write operation.
By default, it creates the following path on the storage account and writes Parquet files into it:
Path - __datetime=a/b/c/yyyy-MM-dd HH%3Amm%3Ass
The problem is that the colon is encoded but the space is not, and because the URL is not fully encoded it creates problems for us. Is there a fix for this?
Can I change the format of the column (of type java.sql.Timestamp) so that the output file path looks like one of the following, without any encoding?
__datetime=a/b/c/yyyy-MM-dd-HH-mm-ss
or
__datetime=a/b/c/yyyy_MM_dd_HH_mm_ss
Is it possible to do this within the java.sql.Timestamp object, without converting it to a string?
Thanks
You can change the name / type of a DataFrame column with a simple select + alias.
The encoding is necessary, though, because file paths cannot contain : characters (but they can contain spaces)... It's unclear why you need full URL encoding.
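For example, a sketch of that select/alias idea, formatting the timestamp as a plain string before partitioning (the column name, DataFrame name and output path here are assumptions):

import org.apache.spark.sql.functions.{col, date_format}

// replace the timestamp column with a formatted string so the partition
// value needs no URL encoding
val out = df.withColumn("__datetime",
  date_format(col("__datetime"), "yyyy-MM-dd-HH-mm-ss"))

out.write
  .partitionBy("__datetime")
  .parquet("abfss://container@account.dfs.core.windows.net/output")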

Read oneline file into dataframe

I have the task of reading a one-line JSON file into Spark. I've thought about either modifying the input file so that it fits spark.read.json(path), or reading the whole file and modifying it in memory so that it fits, as shown below:
import spark.implicits._
val file = sc.textFile(path).collect()(0)
val data = file.split("},").map(json => s"$json}")
val ds = data.toSeq.toDF()
Is there a way of directly reading the json or read the one line file into multiple rows?
Edit:
Sorry, I didn't clearly explain the JSON format; all the JSON objects are on the same line:
{"key":"value"},{"key":"value2"},{"key":"value2"}
If imported with spark.read.json(path), it would only take the first value.
Welcome to SO HugoDife! I believe a single-line load is exactly what spark.read.json() does, and you are perhaps looking for this answer. If not, maybe you want to adjust your question with a data example.
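If the file really is one line of concatenated objects, a hedged alternative to the split-based approach is to wrap that line in brackets so it becomes a JSON array and let Spark parse it; Spark should then produce one row per array element (path is the same variable as in the question):

import spark.implicits._

val line = spark.read.textFile(path).first()        // the whole file is a single line
val json = spark.read.json(Seq(s"[$line]").toDS())  // {"k":..},{"k":..} -> [{"k":..},{"k":..}]
json.show()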

Write either csv output OR parquet output, controlled via a configuration setting

I would like my program to write output files in either CSV or Parquet format, and the decision between the two formats should be controlled via a configuration setting.
I could use something like the code below.
// I would probably read opType via a JSON or XML config.
val opType = "csv"
// Write output based on the appropriate opType
opType match {
  case "csv" =>
    df.write.csv("/some/output/location")
  case "parquet" =>
    df.write.parquet("/some/output/location")
  case _ =>
    df.write.csv("/some/output/location/")
}
Question: Is there a better way to handle this scenario? Is there any way I could use the string value of opType to call the appropriate writer, whether parquet or csv?
Any help or pointers are appreciated.
Create an enum of the possible file types and make sure the enum names follow the Spark source file-type keywords (i.e. csv, parquet, orc, json, text, etc.).
Then you can simply do this:
df.write.format(opType).save(opPath)
Note: the enum is used only for validation, to make sure the input is not an incorrect or garbled value.
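A sketch of that idea with a plain scala.Enumeration used purely for validation (the variable names and path are just examples):

object OutputType extends Enumeration {
  val csv, parquet, orc, json, text = Value
}

val opType = "csv"                      // read from JSON/XML config
val opPath = "/some/output/location"

// fail fast on an incorrect or garbled value
require(OutputType.values.exists(_.toString == opType),
  s"Unsupported output type: $opType")

df.write.format(opType).save(opPath)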

how to read CSV file in scala

I have a CSV file and I want to read it and store it in a case class. As I know, a CSV is a comma-separated values file. But in my CSV file some of the data already contains commas itself, and that creates a new column for every comma. So the problem is how to split the data correctly.
1st data
04/20/2021 16:20(1st column) Here a bunch of basic techniques that suit most businesses, and easy-to-follow steps that can help you create a strategy for your social media marketing goals.(2nd column)
2nd data
11-07-2021 12:15(1st column) Focus on attracting real followers who are genuinely interested in your content, and make the most of your social media marketing efforts.(2nd column)
import scala.io.Source

var i = 0
var length = 0
val data = Source.fromFile(file)
for (line <- data.getLines) {
  val cols = line.split(",").map(_.trim)
  length = cols.length
  while (i < length) {
    //println(cols(i))
    i = i + 1
  }
  i = 0
}
If you are reading a complex CSV file then the ideal solution is to use an existing library. Here is a link to the ScalaDex search results for CSV.
ScalaDex CSV Search
However, based on the comments, it appears that you might actually want to read data stored in a Google Sheet. If that is the case, you can take advantage of the fact that you have some flexibility in how you save the data to a text file yourself. When I want to read data from a Google Sheet in Scala, my first approach is to save the file in a format that isn't hard to read. If the fields have embedded commas but no tabs, which is common, then I save the file as a TSV and parse it with split("\t").
A simple bit of code that only uses the standard library might look like the following:
val source = scala.io.Source.fromFile("data.tsv")
val data = source.getLines.map(_.split("\t")).toArray
source.close
After this, data will be an Array[Array[String]] with your data in it that you can process as you desire.
Of course, if your data includes both tabs and commas then you'll really want to use one of those more robust external libraries.
You could use the univocity CSV parser for faster parsing.
You can also use it to write CSV as well.
Univocity parsers
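For completeness, a minimal univocity-parsers sketch (assuming the com.univocity:univocity-parsers dependency is on the classpath; the file name is just an example):

import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
import java.io.FileReader

val settings = new CsvParserSettings()
settings.getFormat.setDelimiter(',')   // quoted fields with embedded commas are handled for you

val parser = new CsvParser(settings)
val rows = parser.parseAll(new FileReader("data.csv"))  // java.util.List[Array[String]]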

Spark/Scala read hadoop file

In a Pig script I saved a table using PigStorage('|').
In the corresponding Hadoop folder I have files like
part-r-00000
etc.
What is the best way to load it in Spark/Scala? In this table I have 3 fields: Int, String, Float.
I tried:
val text = sc.hadoopFile("file", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)
But then I would somehow need to split each line. Is there a better way to do it?
If I were coding in Python, I would create a DataFrame indexed by the first field, whose columns are the values found in the string field and whose coefficients are the float values. But I need to use Scala to use the PCA module, and Spark DataFrames don't seem that close to Python's.
Thanks for the insight
PigStorage creates a text file without schema information, so you need to do that work yourself, something like this:
val csv = sc.textFile("file") // or the directory where the part files are
val data = csv.map { line =>
  val vals = line.split("\\|") // escape | because split takes a regex
  (vals(0).toInt, vals(1), vals(2).toDouble)
}
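If a DataFrame is needed afterwards (e.g. for the ML PCA API), the tuples can be converted directly, assuming a SparkSession named spark and column names of your choosing:

import spark.implicits._

val df = data.toDF("id", "label", "value")
df.printSchema()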