How to remove special characters ^# from a dataframe in pyspark

How can I prevent the special characters (i.e. ^#) from being written to the file while writing the dataframe to S3?

Using df.write.option("quote", "") while saving to the file handled the ASCII null character.
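For reference, a minimal PySpark sketch of that write (the DataFrame contents and the S3 path are placeholders, not from the original question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quote-option-example").getOrCreate()
df = spark.createDataFrame([("a", "1"), ("b", "2")], ["col1", "col2"])  # hypothetical data

# Per the answer above, an empty quote string keeps the writer's quote
# character (which otherwise shows up as the ASCII null, ^@) out of the file.
df.write.option("quote", "").csv("s3://my-bucket/output/")  # hypothetical path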

Related

Escape \n when exporting CSV using COPY command

I need to export various tables in CSV for AWS Glue Catalog and I just noticed a major showstopper:
the COPY command does not escape newlines in column values, it only quotes them.
What confuses me even more is that I can switch to TEXT and get the format right (it escapes the special characters), but I cannot have a HEADER in that format!
COPY (%s) TO STDOUT DELIMITER ',' NULL ''
Is there a way to get both a HEADER and escaped newlines through the COPY command?
I'm hoping this is an oversight on my part, as the code is obviously there.
The text format does not produce CSV; that is why you cannot get a header line with it. It is the "internal" tab-separated format of PostgreSQL.
There is no provision for replacing newlines with \n in a CSV file, and indeed that would produce invalid CSV (at least by most people's reading; there is no strict standard).
You'll have to post-process the file.
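A minimal post-processing sketch in Python, assuming the dump was produced with COPY ... CSV HEADER (the file names are placeholders): the csv module parses quoted embedded newlines correctly, so each one can be rewritten as a literal \n.

import csv

# The csv reader handles quoted embedded newlines, so each field's real
# newlines can be rewritten as the two-character sequence \n.
with open("dump.csv", newline="") as src, open("escaped.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow([field.replace("\n", "\\n") for field in row])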

Write csv file with \r in Scala spark

I need to write a CSV file in Spark with lines ending in \r (carriage return). By default, lines end with \n (newline). Any idea how to change this?
Use the lineSep option:
df.write.format("csv").option("lineSep", "\r").save(path)

Spark read delimited csv ignoring escape

I need to read a CSV delimited by "|": each column value is a string enclosed in double quotes ("").
I use this code to read the file:
val df = spark.read.option("header", "true").option("delimiter","|").option("escape", "_").option("inferSchema","false").csv("maprfs:///iper/file.txt")
Is there a way to ignore the escape character and not use one at all?
Otherwise, how can I delete a special character from the CSV file (for example "\" or "_") and reload it as a dataframe?
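A sketch of one common workaround (an assumption here, not from the thread): set the escape character equal to the quote character, so that lone backslashes and underscores inside quoted fields are read as literal data; unwanted characters can also be stripped after loading with regexp_replace.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Make the quote character double as the escape character, so "\" and "_"
# inside quoted fields are treated as plain data, not escape sequences.
df = (spark.read
      .option("header", "true")
      .option("delimiter", "|")
      .option("quote", '"')
      .option("escape", '"')
      .option("inferSchema", "false")
      .csv("maprfs:///iper/file.txt"))

# Alternatively, strip the offending characters after the load.
cleaned = df.select([F.regexp_replace(F.col(c), r"[\\_]", "").alias(c)
                     for c in df.columns])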

Postgres COPY multibyte delimiter

I have a large body of text and other data that I need to import into Postgres. This text contains all the possible single-byte characters. This means I can't choose ",", ";", "-" or any other single-byte character as a delimiter in my CSV file because it would be confused by the text that contains it.
Is there any way to choose a multibyte character as a delimiter, use multiple characters as a delimiter, or use the COPY command in some other way to solve this?
Command I'm using:
COPY site_articles(id,url,title,content) FROM '/home/sites/site_articles.csv' DELIMITER '^' CSV;
CSV has an escaping mechanism: use it. Quote strings that contain the delimiter character, and if the quoted string contains the quote character, double the quote character.
E.g., to represent the two values Fred "Wiggle" Smith and one, two, you'd write:
"Fred ""Wiggle"" Smith","one, two"
At the time of writing (9.5), COPY does not support multi-byte characters as delimiters. You can use third-party ETL tools like Pentaho Kettle, though.
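To see that quoting rule in action, here is a small Python sketch using the standard csv module, whose default dialect applies exactly this quote-and-double-the-quotes mechanism:

import csv
import sys

# The default dialect quotes a field only when it contains the delimiter,
# the quote character, or a newline, and doubles any embedded quotes.
csv.writer(sys.stdout).writerow(['Fred "Wiggle" Smith', 'one, two'])
# Output: "Fred ""Wiggle"" Smith","one, two"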

Postgresql COPY with text value containing \0 (backslash 0)

Setup: Postgresql Server 9.3 - OS: CentOS 6.6
I'm attempting to bulk insert 250 million records into a PostgreSQL 9.3 server using the COPY command. The data is in delimited format using a pipe '|' as the delimiter.
Almost all columns in the table that I'm copying to are TEXT datatypes. Unfortunately, out of the 250 million records, there are about 2 million that have legitimate textual values containing a "\0" in the text.
Example entry:
245150963|DATASOURCE|736778|XYZNR-1B5.1|10-DEC-1984 00:00:00|||XYZNR-1B5.1\1984-12-10\0.5\1\ASDF1|pH|Physical|Water|XYZNR|Estuary
As you can see, the 8th column has a legitimate \0 in its value.
XYZNR-1B5.1\1984-12-10\0.5\1\ASDF1
No matter how I escape this, the COPY command either converts the \0 into an actual null byte or fails with "ERROR: invalid byte sequence for encoding "UTF8": 0x00".
I have tried replacing the \0 with "sed -i" with:
\\0
\\\0
'\0'
\'\'0
\\\\\0
... and many others I can't remember and none of them work.
What would be the correct escaping of these types of strings?
Thanks!
Per the Postgres doc on COPY:
"Backslash characters (\) can be used in the COPY data to quote data characters that might otherwise be taken as row or column delimiters. In particular, the following characters must be preceded by a backslash if they appear as part of a column value: backslash itself, newline, carriage return, and the current delimiter character."
Try converting all of the backslash characters in that field to \\, not just the one in \0. (Note that in COPY's text format \b denotes a backspace character, not a backslash, so it won't help here.) The escaped value should look like:
XYZNR-1B5.1\\1984-12-10\\0.5\\1\\ASDF1
The one you needed was the one example you didn't give:
sed -e 's/\\/\\\\/g'
You want this for all occurrences of \, not just for \0.
From the perspective of the file and Postgres, we're trying to convert \ to \\.
In sed, \ is a special character which we need to self-escape, so \ becomes \\, and \\ becomes \\\\, hence the expression above.
Have you confirmed that your sed command is actually giving you \\0?
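For completeness, the same preprocessing can be done without sed; a minimal Python sketch, with hypothetical file names:

# Double every backslash so COPY's text format reads them as literal
# characters; this mirrors sed -e 's/\\/\\\\/g'.
with open("site_data.txt") as src, open("site_data_escaped.txt", "w") as dst:
    for line in src:
        dst.write(line.replace("\\", "\\\\"))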