I am using PySpark to generate a CSV file from a dataframe with the code below:
df.repartition(1).write.format('com.databricks.spark.csv').option("header","true").mode("overwrite").save("/user/test")
But when I open the file in Notepad++, the line terminator is the default "\n". I have tried different options, such as setting textinputformat.record.delimiter, but no luck. Is there a way to customize this EOL when exporting a dataframe to CSV in Spark? I actually need the EOL to be CRLF ("\r\n"). Appreciate any help. Thanks.
You can use the lineSep option to set the line separator. Note that for writing, the CSV source accepts at most a single character, so a two-character sequence like "\r\n" cannot be set directly:
(
    df.repartition(1).write.format('com.databricks.spark.csv')
    .option("header", "true")
    .mode("overwrite")
    .option("lineSep", "^")
    .save("/user/test")
)
Docs source: https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
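Because the writer's lineSep is capped at one character, a CRLF ending can't be set directly; a hedged workaround is to post-process the single part file after the write (the file names below are hypothetical; Spark writes part-* files under the target directory):

# rewrite the part file produced by repartition(1) with CRLF line endings;
# note this also converts any newlines embedded in quoted fields
with open("part-00000.csv", newline="") as src, \
        open("test_crlf.csv", "w", newline="") as dst:
    for line in src:
        dst.write(line.rstrip("\r\n") + "\r\n")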
Related
I need to export various tables as CSV for the AWS Glue Catalog and I just noticed a major showstopper:
The COPY command does not escape newlines in column values; it only quotes them.
What confuses me even more is that I can switch to the TEXT format and get the escaping right, but then I cannot have a HEADER in that format!
COPY (%s) TO STDOUT DELIMITER ',' NULL ''
Is there a way to get both a HEADER and escaped newlines through the COPY command?
I'm hoping I'm just overlooking something, as the code for both is obviously there.
The text format does not produce CSV; that is why you cannot get headers or change the delimiter. It is PostgreSQL's "internal" tab-separated format.
There are no provisions to replace newlines with \n in a CSV file, and indeed that would produce invalid CSV (according to what most people think; there is no standard).
You'll have to post-process the file.
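As a sketch of that post-processing, assuming the export was done WITH (FORMAT csv, HEADER) so the file is valid quoted CSV, and that the consumer wants literal \n sequences instead of embedded newlines (file names are hypothetical):

import csv

# read the quoted CSV that COPY produced and replace real newlines inside
# field values with the two-character sequence \n (nonstandard, per above)
with open("export.csv", newline="") as src, \
        open("export_escaped.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow([v.replace("\r\n", "\\n").replace("\n", "\\n") for v in row])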
I need to write a CSV file in Spark with lines ending in \r (carriage return). By default, lines end with \n (newline). Any idea how to change this?
Use the lineSep option; since \r is a single character, it works for writing:
df.write.format("csv").option("lineSep", "\r").save(path)
I need to read a CSV delimited by "|": each column value is a string enclosed in double quotes.
I use this code to read the file:
val df = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .option("escape", "_")
  .option("inferSchema", "false")
  .csv("maprfs:///iper/file.txt")
Is there a way to ignore the escape character and not use one at all?
Otherwise, how can I delete a special character (for example "\" or "_") from the CSV file and reload it as a dataframe?
How can I prevent special characters (i.e. ^#) from being written to the file when writing the dataframe to S3?
Using .option("quote", "") on the DataFrame writer while saving the file handled the ASCII NUL character.
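A minimal sketch of that write, assuming a dataframe df and a hypothetical S3 path:

(
    df.write.format("csv")
    # per the answer above, the empty quote option stopped the stray
    # NUL characters from appearing in the output
    .option("quote", "")
    .mode("overwrite")
    .save("s3://my-bucket/out")
)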
I am trying to import a CSV file that uses a comma as its delimiter into Postgres. I do:
\COPY products(title, department) from 'toyd.csv' with (DELIMITER ',');
All super cool.
However, title and department are both strings, and some values in these columns contain commas that I don't want interpreted as delimiters. So I wrapped the strings in quotes, but this doesn't work: Postgres still treats the commas as delimiters. What am I missing?
Here is a snippet from the CSV that causes the problem:
"Light","Reading, Writing & Spelling"
Any ideas?
You aren't using CSV format there, just a comma-delimited one.
Tell it you want FORMAT csv and it will default to quoted text; you could also change the quoting character with the QUOTE option if necessary.
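For illustration, a hedged sketch of the same load with CSV format, issued from Python via psycopg2's copy_expert (the connection string and file path are assumptions):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
with conn, conn.cursor() as cur, open("toyd.csv") as f:
    # FORMAT csv makes COPY honour double-quoted fields, so the embedded
    # comma in "Reading, Writing & Spelling" stays inside one column
    cur.copy_expert(
        "COPY products(title, department) FROM STDIN WITH (FORMAT csv)",
        f,
    )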