Spark Read/Write (csv) ISO-8859-1 - scala

I need to read an ISO-8859-1 encoded file, do some operations, then save it (with ISO-8859-1 encoding). To test this, I'm loosely mimicking a test case I found in the Databricks CSV package:
https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/CsvSuite.scala
-- specifically: test("DSL test for iso-8859-1 encoded file")
val fileDF = spark.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("charset", "iso-8859-1")
.option("delimiter", "~") // bogus - hopefully something not in the file, just want 1 record per line
.load("s3://.../cars_iso-8859-1.csv")
fileDF.collect // I see the non-ascii characters correctly
val selectedData = fileDF.select("_c0") // just to show an operation
selectedData.write
.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter", "~")
.option("charset", "iso-8859-1")
.save("s3://.../carOutput8859")
This code runs without error, but it doesn't seem to honor the iso-8859-1 option on output. At a Linux prompt (after copying from S3 to local Linux):
file -i cars_iso-8859-1.csv
cars_iso-8859-1.csv: text/plain; charset=iso-8859-1
file -i carOutput8859.csv
carOutput8859.csv: text/plain; charset=utf-8
I'm just looking for some good examples of reading and writing non-UTF-8 files. At this point, I have plenty of flexibility in the approach (it doesn't have to be a CSV reader). Any recommendations/examples?
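A minimal sketch of one approach, assuming Spark 2.4 or later, where the built-in CSV source accepts an encoding option on both read and write (write-side support was added in SPARK-19018); the S3 paths are the placeholders from above:
val fileDF = spark.read
  .option("header", "false")
  .option("encoding", "ISO-8859-1") // "charset" works as an alias on read
  .option("delimiter", "~")
  .csv("s3://.../cars_iso-8859-1.csv")

fileDF.select("_c0").write
  .option("header", "false")
  .option("delimiter", "~")
  .option("encoding", "ISO-8859-1") // honored on write only in Spark 2.4+
  .csv("s3://.../carOutput8859")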

Related

How to include default namespace in xml file while writing in scala spark

How can we add a default namespace to an XML file while writing through XmlDataFrameWriter?
I want to add the highlighted text to my XML file:
<sitemapindex
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
I am writing it through:
list.coalesce(1)
.write
.mode("overwrite")
.option("rootTag", "sitemapindex")
.option("rowTag", "sitemap")
.xml("/Users/user1/Downloads/xml/main/")

Getting error in spark-sftp, no such file

On a Databricks cluster (Spark 2.4.5, Scala 2.11), I am trying to read a file into a Spark DataFrame using the following code.
val df = spark.read
.format("com.springml.spark.sftp")
.option("host", "*")
.option("username", "*")
.option("password", "*")
.option("fileType", "csv")
.option("delimiter", ";")
.option("inferSchema", "true")
.load("/my_file.csv")
However, I get the following error
org.apache.spark.sql.AnalysisException: Path does not exist: dbfs:/local_disk0/tmp/my_file.csv;
I think I need to specify an option to save that file temporarily, but I can't find a way to do so. How can I solve that?
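A hedged sketch: spark-sftp stages the remote file in a temporary location and then reads it from there, so this error usually means the staging path isn't visible to Spark. The library's README lists tempLocation and hdfsTempLocation options for relocating that staging area; the /dbfs paths below are assumptions for a Databricks cluster:
val df = spark.read
  .format("com.springml.spark.sftp")
  .option("host", "*")
  .option("username", "*")
  .option("password", "*")
  .option("fileType", "csv")
  .option("delimiter", ";")
  .option("inferSchema", "true")
  .option("tempLocation", "/dbfs/tmp")     // assumed: DBFS-backed local path for the download
  .option("hdfsTempLocation", "dbfs:/tmp") // assumed: staging path Spark can read back
  .load("/my_file.csv")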

How to read a .dat file with delimiter \u0001, where each record is on a new line, in Spark with Scala

I have a file with a .dat extension and no header:
1. Fields are separated by '\u0001'.
2. Each record is on a new line.
How can I read this file in Spark with Scala and convert it to a DataFrame?
Try the code below; I assume you are using Spark 2.x or later:
val df = spark
  .read
  .option("header", "false")       // the file has no header row
  .option("inferSchema", "true")
  .option("delimiter", "\u0001")   // SOH (Start of Heading) control character
  .csv("<CSV_FILE_PATH_GOES_HERE>")

NUL Character is getting written to start and end of a file in CSV

I am trying to write some contents of a text file to a CSV file using the Spark Databricks write package.
However, I am getting a NUL character added to the start and end of each line.
Output : NUL"Transactions","Exit","Core1.0","Trade"2018-12-10T10:47:42Z"NUL
Expected output :"Transactions","Exit","Core1.0","Trade"2018-12-10T10:47:42Z"
Code below:
df.write
.mode("overwrite")
.option("header","false")
.option("delimiter",",")
.option("quote","\u00000")
.csv(path)
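A likely explanation: in Scala source, "\u00000" is the NUL character followed by the digit 0, and the CSV writer uses the first character of the quote option as its quote character, so NUL lands wherever a quote would. If the goal is to emit each row verbatim with no quoting at all, one hedged workaround is to build the line yourself and write it as plain text; concat_ws over df.columns is an assumption about the DataFrame's shape:
import org.apache.spark.sql.functions.{col, concat_ws}

// Join all columns into one string per row, then write as text so the
// CSV quoting machinery never runs.
df.select(concat_ws(",", df.columns.map(col): _*).as("value"))
  .write
  .mode("overwrite")
  .text(path)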

How to read utf-8 encoding file in Spark Scala

I am trying to read a UTF-8 encoded file into Spark with Scala. I am doing this:
val nodes = sparkContext.textFile("nodes.csv")
where the given CSV file is in UTF-8, but Spark converts non-English characters to ?. How do I get it to read the actual values? I tried it in PySpark and it works fine, because PySpark's textFile() function has an encoding option and supports UTF-8 by default (it seems).
I am sure the file is in UTF-8 encoding. I did this to confirm:
➜ workspace git:(f/playground) ✗ file -I nodes.csv
nodes.csv: text/plain; charset=utf-8
Using this post, we can read the file first and then feed it to the SparkContext:
import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}

val decoder = Codec.UTF8.decoder.onMalformedInput(CodingErrorAction.IGNORE)
val rdd = sc.parallelize(Source.fromFile(filename)(decoder).getLines().toList)
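A hedged alternative that stays distributed instead of reading the whole file on the driver: pull raw bytes through the Hadoop API and decode them explicitly, which also generalizes to non-UTF-8 charsets; the path is the one from the question:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Text exposes the raw bytes of each line, so the charset is our choice
// rather than textFile's default decoding.
val nodes = sparkContext
  .hadoopFile[LongWritable, Text, TextInputFormat]("nodes.csv")
  .map { case (_, line) => new String(line.getBytes, 0, line.getLength, "UTF-8") }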