How to save CSV with all fields quoted? - scala

The code below does not add double quotes, which I understood to be the default. I also tried setting the quote option to # and to a single quote, with no success. I also used quoteMode with the ALL and NON_NUMERIC options; still no change in the output.
s2d.coalesce(64).write
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .save(fname)
Are there any other options I can try? I am using spark-csv 2.11 over spark 2.1.
Output it produces:
d4c354ef,2017-03-14 16:31:33,2017-03-14 16:31:46,104617772177,340618697
Output I am looking for:
"d4c354ef","2017-03-14 16:31:33","2017-03-14 16:31:46",104617772177,340618697

tl;dr Enable quoteAll option.
scala> Seq(("hello", 5)).toDF.write.option("quoteAll", true).csv("hello5.csv")
The above gives the following output:
$ cat hello5.csv/part-00000-a0ecb4c2-76a9-4e08-9c54-6a7922376fe6-c000.csv
"hello","5"
That assumes the quote is " (see CSVOptions)
That however won't give you "Double quotes around all non-numeric characters." Sorry.
You can see all the options in CSVOptions, which serves as the source of the options for the CSV reader and writer.
p.s. com.databricks.spark.csv is currently a mere alias for the csv format. You can use both interchangeably, but the shorter csv is preferred.
p.s. Use option("header", false) (false as a boolean, not a String); that will make your code slightly more type-safe.
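Putting both suggestions together, here is a minimal sketch applied to the write from the question (s2d and fname are the asker's own names; quoteAll quotes every field, numeric or not):

s2d.coalesce(64).write
  .option("header", false)  // boolean, not a String
  .option("quoteAll", true) // quote every field, including numeric ones
  .csv(fname)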

In Spark 2.1, where the old CSV library has been inlined, I do not see any option for what you want in the csv method of DataFrameWriter.
So I guess you have to map over your data "manually" to determine which of the Row components are non-numeric and quote them accordingly. You could use a straightforward isNumeric helper function like this:
def isNumeric(s: String) = s.nonEmpty && s.forall(Character.isDigit)
As you map over your Dataset, quote the values where isNumeric is false, for example:
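Here is a minimal sketch of that manual approach, assuming a SparkSession named spark, the asker's s2d DataFrame and fname output path, and using the isNumeric helper defined above; it also assumes the values contain no embedded quotes, commas, or newlines:

import spark.implicits._

// Quote every non-numeric value ourselves, then write the rows as plain text
// so the CSV writer does not re-quote or re-escape anything.
val quoted = s2d.map { row =>
  row.toSeq
    .map(v => Option(v).map(_.toString).getOrElse(""))
    .map(s => if (isNumeric(s)) s else "\"" + s + "\"")
    .mkString(",")
}

quoted.coalesce(64).write.text(fname)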

Related

SalesForce Spark Delimiter issue

I have a Glue job in which I am reading a table from SF using SOQL:
df = (
    spark.read.format("com.springml.spark.salesforce")
    .option("soql", sql)
    .option("queryAll", "true")
    .option("sfObject", sf_table)
    .option("bulk", bulk)
    .option("pkChunking", pkChunking)
    .option("version", "51.0")
    .option("timeout", "99999999")
    .option("username", login)
    .option("password", password)
    .load()
)
and whenever there is a combination of double-quotes and commas in the string it messes up my table schema, like so:
in source:

Column A | Column B           | Column C
000AB    | "text with, comma" | 123XX

read from SF in df:

Column A | Column B    | Column C
000AB    | "text with  | comma"
Is there any option to avoid such cases, where a comma inside a value is treated as a delimiter? I tried various options but nothing worked. And SOQL doesn't accept REPLACE or SUBSTRING functions; its text manipulation functions are, well, basically nonexistent.
All the information I'm giving needs to be tested. I do not have the same environment, so it is difficult for me to try anything, but here is what I found.
When you check the official doc, you find that there is a field metadataConfig. The documentation of this field can be found here: https://resources.docs.salesforce.com/sfdc/pdf/bi_dev_guide_ext_data_format.pdf
On page 2, in the CSV format section, it says:
If a field value contains a control character or a new line the field value must be contained within double quotes (or your
fieldsEscapedBy value). The default control characters (fieldsDelimitedBy, fieldsEnclosedBy,
fieldsEscapedBy, or linesTerminatedBy) are comma and double quote. For example, "Director of
Operations, Western Region".
which kinda sounds like your current problem.
By default, the values are comma and double quote, so I do not understand why it is failing. But apparently your output keeps the double quotes, so maybe it only considers single quotes.
You should try to enforce the format and add this to your code:
.option("metadataConfig", '{"fieldsEnclosedBy": "\"", "fieldsDelimitedBy": ","}')
# Or something similar - I couldn't test it, so you need to try it yourself

Use of '\' in reading dataframe

# File location and type
file_location = "/FileStore/tables/FileName.csv"
file_type = "csv"
#CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
# The applied options are for CSV files. For other files types, these will be ignored.
df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)
display(df)
This is generic code to read data from a CSV file. In this code, what is the use of .option("inferSchema", infer_schema), and what does the "\" at the end of each line do?
A backslash at the end of a line is treated as a line continuation, which means what follows the backslash is considered part of the same line as what precedes it. In your case, those 5 lines are treated as one line.
As for why you need the double quotes: whatever you put in quotes is treated as a string, and for these functions "header", "inferSchema", and the others are part of the syntax, so you need to keep them as they are.
This answer https://stackoverflow.com/a/56933052/6633728 might help you more.
Backslash '\' is used at the end of a line to denote that the code after the backslash is considered to be on the same line. This is mostly done in long code that spans more than a single line.
inferSchema is used to infer the data types of the columns in the dataframe. If inferSchema is set to true, Spark reads the data while loading it in order to infer the column data types.
The double quotes ("") are used with the .option function to pass parameters as strings. Many parameters can be added using the option function, such as header, inferSchema, sep, schema, etc.
pyspark.sql.DataFrameReader.csv
You can refer to the above link for further help.
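For comparison, here is roughly the same read written in Scala (the language used elsewhere in this thread), assuming a SparkSession named spark. Scala's leading-dot method chaining continues an expression across lines, so no backslash continuation character is needed:

val df = spark.read
  .format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("sep", ",")
  .load("/FileStore/tables/FileName.csv")

df.show()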

How to parse a file with newline character, escaped with \ and not quoted

I am facing an issue when reading and parsing a CSV file. Some records have a newline symbol, "escaped" by a \, and those records are not quoted. The file might look like this:
Line1field1;Line1field2.1 \
Line1field2.2;Line1field3;
Line2FIeld1;Line2field2;Line2field3;
I've tried to read it using sc.textFile("file.csv") and using sqlContext.read.format("..databricks..").option("escape/delimiter/...").load("file.csv")
However, no matter how I read it, a record/line/row is created when "\ \n" is reached. So, instead of getting 2 records from the file above, I am getting three:
[Line1field1,Line1field2.1,null] (3 fields)
[Line1field.2,Line1field3,null] (3 fields)
[Line2FIeld1,Line2field2,Line2field3;] (3 fields)
The expected result is:
[Line1field1,Line1field2.1 Line1field.2,Line1field3] (3 fields)
[Line2FIeld1,Line2field2,Line2field3] (3 fields)
(How the newline symbol is saved in the record is not that important, main issue is having the correct set of records/lines)
Any ideas of how to do that, without modifying the original file and preferably without any post/re-processing? (For example, reading the file, filtering out any lines with fewer fields than expected and then concatenating them could be a solution, but it is not at all optimal.)
My hope was to use Databricks' CSV parser to set the escape character to \ (which is supposed to be the default), but that didn't work; I got an error saying java.io.IOException: EOF whilst processing escape sequence.
Should I somehow extend the parser and edit something, creating my own parser? Which would be the best solution?
Thanks!
EDIT: Forgot to mention, I'm using Spark 1.6.
The wholeTextFiles API should come to the rescue in your case. It reads files as key/value pairs: the key is the path of the file and the value is the whole text of the file. You will have to do some replacements and splitting to get the desired output, though.
val rdd = sparkSession.sparkContext.wholeTextFiles("path to the file")
  .flatMap(x => x._2.replace("\\\n", "").replace(";\n", "\n").split("\n"))
  .map(x => x.split(";"))
the rdd output is
[Line1field1,Line1field2.1 Line1field2.2,Line1field3]
[Line2FIeld1,Line2field2,Line2field3]

Spark CSV package not able to handle \n within fields

I have a CSV file which I am trying to load using the Spark CSV package, and it does not load the data properly because a few of the fields have \n within them, e.g. the following two rows:
"XYZ", "Test Data", "TestNew\nline", "OtherData"
"XYZ", "Test Data", "blablablabla
\nblablablablablalbal", "OtherData"
I am using the following straightforward code. I set parserLib to univocity because I read on the internet that it solves the multiple-newline problem, but that does not seem to be the case for me.
SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .option("parserLib", "univocity")
    .load("data.csv");
How do I replace newlines within fields that start with quotes? Is there an easier way?
According to SPARK-14194 (resolved as a duplicate), fields with newline characters are not supported and will never be:
I proposed to solve this via wholeFile option and it seems merged. I am resolving this as a duplicate of that as that one has a PR.
That is, however, Spark 2.0, and you are using the spark-csv module.
In the referenced SPARK-19610 it was fixed with the pull request:
hmm, I understand the motivation for this, though my understanding with csv generally either avoid having newline in field or some implementation would require quotes around field value with newline
In other words, use wholeFile option in Spark 2.x (as you can see in CSVDataSource).
As to spark-csv, this comment might be of some help (highlighting mine):
However, that there are a quite bit of similar JIRAs complaining about this and the original CSV datasource tried to support this although that was incorrectly implemented. This tries to match it with JSON one at least and it might be better to provide a way to process such CSV files. Actually, current implementation requires quotes :). (It was told R supports this case too actually).
In spark-csv's Features you can find the following:
The package also supports saving simple (non-nested) DataFrame. When writing files the API accepts several options:
quote: by default the quote character is ", but can be set to any character. This is written according to quoteMode.
quoteMode: when to quote fields (ALL, MINIMAL (default), NON_NUMERIC, NONE), see Quote Modes
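Based on the spark-csv features quoted above, here is a hedged sketch of what such a write could look like with the spark-csv 1.x package (assuming a DataFrame named df and an illustrative output path; note that quoteMode is a spark-csv option and is not honored by the built-in csv writer in Spark 2.x):

df.write
  .format("com.databricks.spark.csv")
  .option("quote", "\"")              // the default quote character
  .option("quoteMode", "NON_NUMERIC") // quote only the non-numeric fields
  .save("output_quoted.csv")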
There is an option available to users of Spark 2.2 to account for line breaks in CSV files. It was originally discussed under the name wholeFile, but prior to release it was renamed multiLine.
Here is an example of loading a CSV into a dataframe with that option:
var webtrends_data = (sparkSession.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", true)
  .option("delimiter", ",")
  .format("csv")
  .load("hdfs://hadoop-master:9000/datasource/myfile.csv"))
Upgrade to Spark 2.x. A newline is actually CR/LF, represented by ASCII 13 and 10, whereas a backslash followed by 'n' is a different pair of ASCII characters that is interpreted and written programmatically. Spark 2.x will read it correctly; I tried it, see below.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("HelloSpark").setMaster("local[2]")
val spark = SparkSession.builder().config(conf).getOrCreate()
val df = spark.read.csv("src/main/resources/data.csv")
df.foreach(row => println(row.mkString(", ")))
If you can't upgrade, then do a cleanup of the \n on the RDD with a regex. This won't remove the end of line, since that is $ in a regex. See below.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("HelloSpark").setMaster("local")
val sc = new SparkContext(conf)
val rdd1 = sc.textFile("src/main/resources/data.csv")
// Remove the literal backslash-n sequences; the real line terminators are untouched.
val rdd2 = rdd1.map(row => row.replace("\\n", ""))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = rdd2.toDF()
df.foreach(row => println(row.mkString(", ")))

How to parse a csv that uses ^A (i.e. \001) as the delimiter with spark-csv?

Terribly new to spark and hive and big data and scala and all. I'm trying to write a simple function that takes an sqlContext, loads a csv file from s3 and returns a DataFrame. The problem is that this particular csv uses the ^A (i.e. \001) character as the delimiter and the dataset is huge so I can't just do a "s/\001/,/g" on it. Besides, the fields might contain commas or other characters I might use as a delimiter.
I know that the spark-csv package that I'm using has a delimiter option, but I don't know how to set it so that it will read \001 as one character and not something like an escaped 0, 0 and 1. Perhaps I should use hiveContext or something?
If you check the GitHub page, there is a delimiter parameter for spark-csv (as you also noted).
Use it like this:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .option("delimiter", "\u0001")
  .load("cars.csv")
With Spark 2.x and the CSV API, use the sep option:
val df = spark.read
  .option("sep", "\u0001")
  .csv("path_to_csv_files")