How can we use a multi-character line separator (line delimiter) in Spark?
Spark 3 allows a multi-character column delimiter, but the line separator is limited to a single character.
For example, how can ###\n be used as the line separator?
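One possible workaround, sketched here in plain Python rather than Spark, is to split the file into records yourself before handing the data on. The helper name iter_records and the chunk size are assumptions made for illustration. Separately, I believe the lineSep option of Spark's plain text reader (spark.read.text) is not restricted to one character the way the CSV reader's is, but that is worth checking against your Spark version.

```python
import io
from typing import Iterator, TextIO


def iter_records(stream: TextIO, sep: str = "###\n", chunk_size: int = 8192) -> Iterator[str]:
    """Yield records from a text stream split on a multi-character separator.

    Hypothetical helper: reads in chunks so a separator that straddles
    two chunks is still recognised.
    """
    if not sep:
        raise ValueError("sep must be non-empty")
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last separator is complete;
        # the final piece may be a partial record, so keep it buffered.
        *complete, buffer = buffer.split(sep)
        yield from complete
    if buffer:
        # Last record had no trailing separator.
        yield buffer


sample = io.StringIO("a,b\nc###\nd\te###\nf")
records = list(iter_records(sample, chunk_size=3))
```

Newlines inside a record are preserved, because only the full ###\n sequence ends a record.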
Related
I am trying to import a text file into a single-column table, i.e. I don't want a line of the source file to be split into columns. The file contains many different characters (tabs, commas, spaces) that could be recognized as delimiters. Since the bell character (CHR(7)) doesn't occur in the data file, I chose it as the delimiter:
COPY data_table(single_column) FROM '/tmp/data' WITH ENCODING 'LATIN1' DELIMITER CHR(7);
Unfortunately, this results in an error:
ERROR: syntax error at or near "chr"
What would be the correct syntax?
You can't use a function there. Use the escape notation.
DELIMITER E'\007'
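To make the relation between CHR(7) and E'\007' explicit: the escape notation takes the character code in octal. Below is a minimal sketch that builds such a literal for any single ASCII character; copy_delimiter_literal is a hypothetical helper name, and the statement is the one from the question with only the delimiter swapped.

```python
def copy_delimiter_literal(ch: str) -> str:
    """Return an escape-string literal such as E'\\007' for one ASCII character.

    Hypothetical helper, not part of any PostgreSQL client library.
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    code = ord(ch)
    if not 1 <= code <= 127:
        raise ValueError("expected a non-NUL ASCII character")
    # Three-digit octal escape: chr(7) -> \007
    return "E'\\{:03o}'".format(code)


statement = (
    "COPY data_table(single_column) FROM '/tmp/data' "
    "WITH ENCODING 'LATIN1' DELIMITER " + copy_delimiter_literal(chr(7))
)
```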
I need to write a CSV file in Spark with lines ending in \r (carriage return). By default, lines end with \n (newline). Is there a way to change this?
Use the lineSep option.
df.write.format("csv").option("lineSep", "\r").save(path)
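To confirm which terminator actually ended up in the output, you can read a part file in binary mode and count the terminators. This is a small plain-Python sketch; count_terminators is a hypothetical helper, and the sample bytes stand in for the contents of a real part file.

```python
from typing import Dict


def count_terminators(data: bytes) -> Dict[str, int]:
    """Count CRLF, bare CR and bare LF line terminators in raw bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        # Subtract the CRs and LFs that belong to a CRLF pair.
        "CR": data.count(b"\r") - crlf,
        "LF": data.count(b"\n") - crlf,
    }


# Stand-in for: open(part_file, "rb").read()
sample = b"id,name\r1,foo\r2,bar\r"
counts = count_terminators(sample)
```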
I need to read a CSV file delimited by "|": each column value is a string enclosed in double quotes.
I use this code to read the file:
val df = spark.read.option("header", "true").option("delimiter","|").option("escape", "_").option("inferSchema","false").csv("maprfs:///iper/file.txt")
Is there a way to disable the escape character entirely?
Otherwise, how can I delete a special character from the CSV file (for example "\" or "_") and reload it as a dataframe?
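For the second part (deleting a character and reloading), one option is to clean the text before parsing it. Below is a plain-Python sketch of that idea, using the standard csv module as a stand-in for the Spark reader; strip_characters and read_pipe_delimited are hypothetical helper names, and the sample data is made up.

```python
import csv
import io
from typing import List


def strip_characters(text: str, unwanted: str) -> str:
    """Remove every occurrence of each character in `unwanted`."""
    return text.translate({ord(ch): None for ch in unwanted})


def read_pipe_delimited(text: str) -> List[List[str]]:
    """Parse |-delimited text whose values are enclosed in double quotes."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="|", quotechar='"')
    return list(reader)


raw = '"id"|"name"\n"1"|"foo_bar\\baz"\n'
rows = read_pipe_delimited(strip_characters(raw, "_\\"))
```

Note that this removes the character everywhere, including places where it is legitimate data, so it only fits if the character carries no meaning in the file.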
I am using the following PySpark code to generate a CSV file from a dataframe:
df.repartition(1).write.format('com.databricks.spark.csv').option("header","true").mode("overwrite").save("/user/test")
But when I open the file in Notepad++ and check the line terminator, it is the default "\n". I have tried different options, such as setting the textinputformat record delimiter, but with no luck. Is there a way to customize the EOL when exporting a dataframe to CSV in Spark? Specifically, I need CRLF ("\r\n"). I would appreciate any help. Thanks.
You can use the lineSep option to set a single character as line separator.
(
df.repartition(1).write.format('com.databricks.spark.csv')
.option("header", "true")
.mode("overwrite")
.option("lineSep", "^")
.save("/user/test")
)
Docs source: https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
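Since lineSep takes only one character here, CRLF has to come from somewhere else. One possible workaround is to write with the default "\n" and post-process the part file. A minimal plain-Python sketch, assuming the file fits in memory; lf_to_crlf and convert_file are hypothetical helper names.

```python
import re

# Match a line feed that is not already preceded by a carriage return.
_BARE_LF = re.compile(rb"(?<!\r)\n")


def lf_to_crlf(data: bytes) -> bytes:
    """Replace bare LF terminators with CRLF, leaving existing CRLF alone."""
    return _BARE_LF.sub(b"\r\n", data)


def convert_file(src: str, dst: str) -> None:
    """Rewrite the file at `src` into `dst` with CRLF line endings."""
    with open(src, "rb") as fin:
        data = fin.read()
    with open(dst, "wb") as fout:
        fout.write(lf_to_crlf(data))


converted = lf_to_crlf(b"a,b\n1,2\r\n3,4\n")
```

Be aware that newlines inside quoted field values are converted as well, which may or may not be what you want.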
I have a CSV file that uses the SOH character as its delimiter. Does the neo4j import tool support this character? With LOAD CSV in the browser I succeeded using FIELDTERMINATOR '.'.
Yes, it is possible. You need to use the escape sequence for the SOH character:
LOAD CSV FROM "file:///soh.csv" as row FIELDTERMINATOR "\u0001"
RETURN row
For command line:
String expression can be normal characters as well as for example:
'\t', '\123', and "TAB".
../bin/neo4j-import --into ./db/ --nodes soh.csv --delimiter "\0001"
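If you want a small SOH-delimited file to test the import with, the sketch below produces and re-reads one using Python's standard csv module. It only illustrates that "\u0001" is the single character with code 1; write_soh and read_soh are hypothetical helper names and the rows are made-up sample data.

```python
import csv
import io
from typing import List

SOH = "\u0001"  # Start of Heading, character code 1


def write_soh(rows: List[List[str]]) -> str:
    """Serialise rows as SOH-delimited text with LF line endings."""
    buf = io.StringIO(newline="")
    csv.writer(buf, delimiter=SOH, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def read_soh(text: str) -> List[List[str]]:
    """Parse SOH-delimited text back into rows."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=SOH))


rows = [["id", "name"], ["1", "a,b"]]
text = write_soh(rows)
```

Commas inside values need no quoting here, because the comma is no longer the delimiter.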