I need to write a CSV file in Spark with lines ending in \r (carriage return). By default, lines end with \n (newline). Is there a way to change this?
Use the lineSep option:
df.write.format("csv").option("lineSep", "\r").save(path)
I need to export various tables in CSV for AWS Glue Catalog and I just noticed a major showstopper:
The COPY command does not escape newlines inside column values; it only quotes them.
What confuses me even more is that I can switch to TEXT and get the format right (the special characters are escaped), but I cannot have HEADER in that format!
COPY (%s) TO STDOUT DELIMITER ',' NULL ''
Is there a way to get both HEADER and escaped newlines from the COPY command?
I'm hoping this is an oversight on my part, as the code is obviously there.
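The text format does not produce CSV, which is why you cannot get headers with it (HEADER was only accepted in CSV format until PostgreSQL 15). It is the “internal” tab-separated format of PostgreSQL; DELIMITER changes the separator, but the output is still not CSV.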
There are no provisions to replace newlines with \n in a CSV file, and indeed that would produce invalid CSV (according to what most people think; there is no standard).
You'll have to post-process the file.
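One possible post-processing step, sketched in Python with the standard csv module. The function name `escape_newlines` and the sample data are hypothetical, and the sketch assumes the export uses standard double-quote CSV quoting. Note that it does not preserve the distinction COPY makes between NULL and an empty string, and that literal backslashes already present in the data become ambiguous after the rewrite.

```python
import csv
import io


def escape_newlines(src, dst):
    """Copy CSV rows from src to dst, replacing embedded line breaks with a literal \\n."""
    reader = csv.reader(src)                       # understands quoted multi-line fields
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        writer.writerow(
            [
                field.replace("\r\n", "\\n").replace("\n", "\\n").replace("\r", "\\n")
                for field in row
            ]
        )


# Hypothetical sample: the second record has a line break inside a quoted field.
raw = 'id,note\n1,"first line\nsecond line"\n2,plain\n'
out = io.StringIO()
escape_newlines(io.StringIO(raw, newline=""), out)
escaped = out.getvalue()
print(escaped)
```

For real files, open both with `newline=""` so that Python does not translate line endings before the csv module sees them.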
How can I prevent special characters such as ^# from being written to the file when writing the DataFrame to S3?
Using df.write.option("quote", "") when saving the file handled the ASCII null character.
I am using PySpark to generate a CSV file from a DataFrame with the code below:
df.repartition(1).write.format('com.databricks.spark.csv').option("header","true").mode("overwrite").save("/user/test")
But when I open the file in Notepad++, the line terminator is the default "\n". I have tried different options, such as setting the textinputformat record delimiter, but with no luck. Is there a way to customize the EOL when exporting a DataFrame to CSV in Spark? I need the EOL to be CRLF ("\r\n"). I'd appreciate any help. Thanks.
You can use the lineSep option to set a single character as line separator.
(
df.repartition(1).write.format('com.databricks.spark.csv')
.option("header", "true")
.mode("overwrite")
.option("lineSep", "^")
.save("/user/test")
)
Docs source: https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
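Since lineSep accepts only a single character for CSV, a two-character CRLF terminator needs another route. One option is to rewrite the part file after Spark has finished. Below is a minimal Python sketch; `lf_to_crlf` and its parameters are hypothetical names, and it assumes no field contains an embedded newline, because every \n in the file is converted.

```python
def lf_to_crlf(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, turning every line ending into CRLF."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        carry = b""  # a trailing "\r" held back until the next chunk is known
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                dst.write(carry)
                break
            data = carry + chunk
            if data.endswith(b"\r"):
                # The "\r" may be the first half of a "\r\n" split across chunks.
                data, carry = data[:-1], b"\r"
            else:
                carry = b""
            # Normalize existing CRLF to LF first so it is not doubled to "\r\r\n".
            dst.write(data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n"))
```

The file is processed in chunks, so it does not need to fit in memory.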
Running this from the terminal prompt:
$ wc data.csv
195727 15924341 201584826 data.csv
So, 195727 lines. What about Scala?
val raw_rows: Iterator[String] = scala.io.Source.fromFile("data.csv").getLines()
println(raw_rows.length)
Result: 200945
What am I facing here? I wish for it to be the same. In fact, if I use mighty csv (opencsv wrapper lib) it also reads 195727 lines.
It might be a newline issue. From the documentation of getLines:
Returns an iterator who returns lines (NOT including newline character(s)). It will treat any of \r\n, \r, or \n as a line separator (longest match) - if you need more refined behavior you can subclass Source#LineIterator directly
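The mismatch can be reproduced on a small scale. The sketch below is a Python analogue, not the Scala code from the question: counting \n bytes mimics what wc reports, while `bytes.splitlines()` treats \r\n, \r, and \n as separators, much like the getLines behavior quoted above. The sample bytes are made up for illustration.

```python
# Hypothetical sample: four logical pieces, but only three "\n" bytes.
data = b"a\nb\rc\r\nd\n"

# wc counts newline ("\n") bytes only.
wc_style = data.count(b"\n")

# splitlines() on bytes splits on "\n", "\r", and "\r\n",
# so the bare "\r" after "b" produces an extra line.
universal_style = len(data.splitlines())

print(wc_style, universal_style)
```

If the two counts differ on the real file, it contains bare \r characters that one tool treats as line breaks and the other does not.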
I have a script that reads a large file line by line. The record separator ($/) that I would like to use is \n. The only problem is that the data on each line can contain CRLF sequences (\r\n), which the program should not treat as the end of a line.
For example, here is a sample data file (with the newlines and CRLFs written out):
line1contents\n
line2contents\n
line3\r\ncontents\n
line4contents\n
If I set $/ = "\n", then it splits the third line into two lines. Ideally, I could just set $/ to a regex that matches \n and not \r\n, but I don't think that's possible. Another possibility is to read in the whole file, then use the split function to split on said regex. The only problem is that the file is too large to load into memory.
Any suggestions?
For this particular task, it sounds pretty straightforward to check your line ending and append the next line as necessary:
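$/ = "\n";
...
while (<$input>) {
    # A line ending in "\r\n" is an embedded CRLF, not a real record end,
    # so keep appending the next physical line. The eof check avoids
    # looping forever if the file itself ends in "\r\n".
    while ( substr($_, -2) eq "\r\n" && !eof($input) ) {
        $_ .= <$input>;
    }
    ...
}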
This is the same logic used to support line continuation in a number of different programming contexts.
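For comparison, the same continuation logic can be sketched in Python. This is an illustrative analogue rather than a translation of the original script; `logical_lines` is a hypothetical name, and the sample bytes mirror the data file from the question.

```python
import io


def logical_lines(stream):
    """Yield records ending in a bare b"\\n", joining physical lines that end in b"\\r\\n"."""
    pending = b""
    for line in stream:              # binary streams split on b"\n" only
        pending += line
        if pending.endswith(b"\r\n"):
            continue                 # embedded CRLF: keep reading the same record
        yield pending
        pending = b""
    if pending:                      # file ended in the middle of a record
        yield pending


sample = io.BytesIO(
    b"line1contents\nline2contents\nline3\r\ncontents\nline4contents\n"
)
records = list(logical_lines(sample))
print(len(records))
```

Only one record is held in memory at a time, so the approach also works for files too large to load whole.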
You are right that you can't set $/ to a regular expression.
dos2unix would replace each "\r\n" with a UNIX newline character, and so wouldn't really solve the problem. I would use a regex that replaces all instances of "\r\n" with a space or tab character and save the results to a different file (since you don't want to split the line at those points). Then I would run your script on the new file.
Try using dos2unix on the file first, and then read in as normal.